Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

AD699:  Data Mining for Business Analytics

Fall 2018

27NOV2018

Quiz 3

QUIZ #3:  Question Bank

For the next five questions, please refer to the following set of 15 transactions.  Each line represents one transaction.   Each  transaction represents a customer’s set of downloads during a virtual “trip” to the App Store.  Within each transaction, the apps are listed in the order in which they were downloaded (the order has no meaning).

1.    Bitmoji, Snapchat, Messenger, SoundCloud, Waze

2.   Bitmoji, Snapchat

3.   SoundCloud, Waze, Netflix

4.   Snapchat, YouTube, Wish

5.   Snapchat, Facebook, Google Chrome, Waze

6.   Bitmoji, Messenger, SoundCloud, Waze

7.   Bitmoji, WhatsApp, Spotify, Uber

8.   WhatsApp, Messenger, SoundCloud, Waze, Google Maps, Lyft

9.   Bitmoji

10. Bitmoji, Snapchat, Netflix

11.  Bitmoji, Snapchat, Twitter, SoundCloud, Facebook, Waze

12.  Messenger, Waze

13. Netflix, Amazon

14.  Gmail, Uber, Spotify

15.  Snapchat, Netflix, Spotify, Waze, Google Maps, Lyft

1.  What is the support for {Bitmoji, Snapchat}?

2. What is the confidence for IF {Snapchat, Facebook} THEN {Waze}?

3.  What is the support for {Messenger, SoundCloud, Waze}?

4.  What is the lift ratio for IF {Bitmoji}, THEN {SoundCloud, Waze}?

5.   What is the lift ratio for IF {Snapchat} THEN {Netflix}?

6.    Which  ONE  of the  following  statements  can  be  properly  concluded  based  on  the screenshot  of  the  association  rule  shown  below?     (The  left-hand  side  of  the  rule  is separated from the right-hand side of the rule by the => symbol).   Reading from left to right, the first value shown is the support for the rule, the second value is the confidence, and the third is the lift ratio.

a.   Someone who buys root vegetables and yogurt is three times more likely than a randomly-selected shopper from this dataset to buy tropical fruit.

b.   If  someone  buys  root  vegetables  or  yogurt,  there  is an approximately 31 percent chance that he or she will also buy tropical fruit.

c.   If someone buys root vegetables but does not also buy yogurt, there is only a .8 percent chance that he or she will buy tropical fruit.

d.   8.13 percent of all consumers in the dataset bought root vegetables, yogurt, and tropical fruit.

7.    The table below shows categorical values -- a 1 in a particular cell indicates that the store carries in the item in stock, whereas a 0 indicates that the item is not stocked by that store.  Given the information contained in the table below, what is the Jacquard coefficient between Boxborough and Dracut?

8.     Which of the following is a difference between k-means clustering and hierarchical clustering?

a.   In  k-means  clustering,  the  software  selects  the  numbers  of  clusters;  in hierarchical clustering, however, the number must be specified by the user.

b.   In  k-means  clustering,  the  records  must  first  be  labeled  with  cluster assignments; with hierarchical clustering, however, the algorithm makes the assignments.

c.   Hierarchical clustering involves an initial step in which records are randomly assigned to a particular clustering, whereas k-means clustering tends to be more orderly.

d.   With  hierarchical  clustering,  clusters  are  nested  within  one  another;  in k-means, this is not the case.

9.    Nothing  is  ever  ideal  in  data  mining,  especially  when  real-world  data  is  involved. However, which of the following statements describes an ideal clustering solution (if such a thing could exist)?

a.   A  solution  in  which  records  assigned  to any particular cluster were very different from other records in that same cluster.

b.   A solution in which the records were very similar to other records assigned to the same cluster, and also very different from records assigned to other clusters.

c.   A solution in which each record could remain in its own cluster, due to its independence and non-hierarchical nature.

d.  A solution that assigned similar records to different clusters, to  ensure the right degree of randomness in the outcome.

10.  For the dendrogram shown immediately above, which of the following would be true with a distance cutoff of 2.5?

a.   At that distance, Southern is still by itself (it has not yet joined any clusters).

b.   At that distance, the only cluster to have formed is New England and United.

c.   At that distance, there would be a total of two separate clusters formed.

d.  At that distance, Idaho will have formed a cluster only with Puget.

11.    Suppose  we  used  the  screenshot  below,  which  depicts Henry’s nearest neighbors, measured by Euclidean distance, based on answers to a personality survey, in order to build   a   recommendation   system.     Imagine   that   we   instruct  our  model  like  this: “Recommend  to  Henry  any  content  that  either  Ya,  Andrey,  or  Charlotte  have  liked, regardless of what the content actually is.”

What would we be doing here?

a.   User-based collaborative filtering.

b.   Item-based collaborative filtering.

c.   Content-based filtering.

d.  A/B testing.

12. Suppose you are engaging in a word analysis exercise.  You are given a list of five great novels,  and  then  you  are  asked  to  generate,  and  then  compare,  a  list  of  the  20 most-frequently used words in each of the books.  You try this exercise without removing stopwords,  and  then  a  second  time  with  the  stopwords  removed.    How  will  the  lists change?

a.   When you remove the stopwords, the meaning of the novels will become less clear.

b.   When you remove the stopwords, the total word count in each of the novels will increase.

c.   The  stopword  removal  will  be  meaningless  unless  it  is  accompanied  by stemming.

d.   When you leave the stopwords in, you should expect to see more overlap among the most commonly-used words in each novel.

13.     A  common  file  extension  in  data  mining  is  the   .csv,  whose  letters  stand  for “comma-separated values.”  If commas are used to create a boundary between values, then they are a:

a.   token.

b.   stopword.

c.   corpus.

d.   delimiter.

14.    Person A yells out “No trees allowed here, bro!” at another passenger on the Green Line between Arlington and Copley.  Wondering if a fight might break out, you take off your headphones and begin to listen.

Person B responds: “First, why were you looking at my monitor?  Focus on your own self. But anyway, why don’t you think I should use a tree here? I was just about to call the library function on rpart,”  replies the second passenger.

Person A comes back with, “Look at these outliers!  Classification trees are great -- please don’t  get  me  wrong,  I  love  them  --  but  when  data  includes  outliers,  the  results  are unreliable.”

Person  B  responds,  “You’re  talking  that  crazy  talk  again!   Trees  separate  records into different groups.  A huge outlier in either direction isn’t going to change that.  I’m going to use a classification tree in my model, and you’re not going to change that.”

Who is right here?

a.   Person A is correct.   Before  considering whether to use a tree model, the analyst should first check for outliers.   If outliers are present, they should either  be  removed,  or  another  classification  algorithm  should  be  chosen instead.

b.   Person B is correct.   Splits in tree models are built to separate records into different groups.  An outlier in either direction won’t change the reliability of the model.

c.   This all depends on the package used in R.  For someone using rpart, outliers, won’t be a problem, but for someone using the tree package, outliers will make the result unreliable.

d.   Technically, they’re both wrong.   Outliers will not impact a tree model, but missing values will render the model useless.   If a dataset contains missing values, it should not be used with a tree model.

15.     A  classification tree built with binary decision nodes has four leaves.   How  many decision nodes does it have?