AD699:  Data Mining for Business Analytics

Summer II

07AUG2018

Quiz 3

QUIZ #3:  Question Bank

1.  Which of the following statements about k-means clustering is true?

a.   When performing a k-means clustering, you must specify the number of clusters that you want the model to create.

b.   With k-means clustering, the number of clusters will be automatically determined by the algorithm behind the kmeans() function.

c.   K-means clustering can only be used with categorical data.

d.   K-means clustering is a supervised learning task.
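The role of the cluster count in question 1 can be illustrated with a minimal sketch of the k-means procedure (Lloyd's algorithm). This is not the kmeans() function mentioned above; the function name, the sample points, and the choice to seed the centroids with the first k records are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Toy k-means; the caller has to supply k, the number of clusters."""
    # Assumption for this sketch: seed the centroids with the first k records.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each record joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical numeric records (not taken from the course data).
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Note that no class labels are passed in: the procedure works only from the numeric distances between records.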

2.   What distance measure is used to determine cluster-to-cluster distance with single-linkage clustering?

a.   With single-linkage clustering, the average of all the records in the two clusters is calculated first; then, a randomly generated value determines the next cluster pairing.

b.   With single-linkage clustering, the distance measure used is the minimum distance between the nearest pair of records in the two clusters.

c.   Single-linkage clustering uses an algorithm that randomly generates distances, and then measures the clusters by homogeneity.

d.  A single-linkage measuring criterion must be specified by the user each time.

3.   In a hierarchical agglomerative clustering process involving six total records, how many clusters would the model use at first?

a.   The model would start with one cluster containing all six records, and then slowly divide the records into a greater number of clusters.

b.   Such a model would start with three clusters, and then either contract or expand, depending on the characteristics of the data.

c.   The model would start with six clusters, and then begin to join records together into clusters.  At each step, the number of clusters would be reduced.

d.   The model would only begin to form once the user specified the desired number of clusters in advance.
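The mechanics behind questions 2 and 3 can be sketched as follows. This is a minimal illustration, assuming one-dimensional records and made-up values; the function names are hypothetical and the code is not a substitute for a library implementation of hierarchical clustering.

```python
def single_linkage(cluster_a, cluster_b):
    # Cluster-to-cluster distance: the closest pair of records,
    # one taken from each cluster.
    return min(abs(x - y) for x in cluster_a for y in cluster_b)

def agglomerate(records):
    """Merge clusters bottom-up and return the cluster count after each step."""
    clusters = [[r] for r in records]  # every record starts as its own cluster
    history = [len(clusters)]
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        history.append(len(clusters))
    return history

# Six hypothetical records.
counts = agglomerate([1.0, 1.5, 4.0, 4.2, 9.0, 9.1])
print(counts)  # [6, 5, 4, 3, 2, 1]
```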

4.   In the dendrograms that we saw in class and in the textbook, the cutoff values can be seen on the ______.

a.   x-axis

b.   centroid convergence

c.   y-axis

d.   ordinary least squares

5.    For an undirected network with three nodes, what is the maximum possible number of edges?

a.   3

b.   3.5

c.   7

d.   6

6.   For a directed network with 7 nodes, how many total edges are possible?

a.   42

b.   21

c.   13

d.   7 total edges (but of varying strength).
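The counting behind questions 5 and 6 can be written as a short function. This sketch assumes a simple network (no self-loops and at most one edge per pair of nodes in each direction); the function name is hypothetical.

```python
def max_edges(n, directed=False):
    """Maximum number of edges in a simple network with n nodes."""
    ordered_pairs = n * (n - 1)  # each node can point to every other node
    # In an undirected network, A-B and B-A are the same edge.
    return ordered_pairs if directed else ordered_pairs // 2

print(max_edges(3))                 # 3
print(max_edges(7, directed=True))  # 42
```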

7.   Which of the following describes a bidirectional, or undirected, connection?

a.   When Mary begins to follow Tim on Twitter, Tim gains one follower.  He does not necessarily have to follow Mary back in return.

b.   Tim can indicate through LinkedIn that Mary is a thought leader whose posts he would like to see.  Mary can see that Tim has indicated this.

c.   Tim can indicate through LinkedIn that Mary is a thought leader whose posts he would like to see.  Mary cannot see that Tim has made this selection unless he chooses to share it publicly.

d.   When Mary sends a LinkedIn connection request to Tim, and Tim accepts, each of them gains a connection.   The impact to the network would be the same if Tim had initiated the connection request.

8.    In social network analysis, what is a singleton?

a.   A singleton is a user who is not connected to any other node.

b.   A singleton is a network in which each node is directly connected by an edge to exactly one other node.

c.   A singleton is used to help determine the path length between two otherwise-unconnected nodes.

d.   In social network analysis, a singleton is essentially the same thing as an unweighted edge.

9.   The table below shows categorical values -- a 1 in a particular cell indicates that the store carries the item in stock, whereas a 0 indicates that the item is not stocked by that store.  Given the information contained in the table below, what is the Jaccard coefficient between Boxborough and Chelmsford?

a.   .33

b.   .67

c.   1.33

d.   .25
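The Jaccard coefficient for two binary records can be sketched as below. The store vectors here are invented purely for illustration and are not the values from the quiz table; the function name and the choice to return 0.0 when neither record has any 1s are assumptions of this sketch.

```python
def jaccard(x, y):
    """Jaccard coefficient for two equal-length 0/1 vectors."""
    # Items stocked by both stores.
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    # Items stocked by at least one store; 0-0 matches are ignored.
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: report 0.0 when neither store stocks anything.
    return both / either if either else 0.0

# Hypothetical stock indicators for two stores across five items.
store_a = [1, 1, 0, 1, 0]
store_b = [1, 0, 0, 1, 1]
print(jaccard(store_a, store_b))  # 0.5
```

Because shared absences are excluded from both the numerator and the denominator, the coefficient always falls between 0 and 1.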

10.   In the dendrogram shown below, how many clusters will have formed at 1.0 units of distance?

a.   At 1.0 units of distance, two clusters have formed (New England-United, and Madison-Northern).

b.   At 1.0 units of distance, there are four clusters (New England-United, Madison-Northern, Oklahoma-Texas, and Arizona-Southern).

c.   At 1.0 units of distance, there is one total cluster.

d.  At 1.0 units of distance, no clusters will have formed yet.

11.  For the dendrogram shown immediately above, which of the following would be true about the number of clusters at a distance of 4.0?

a.   At a cutoff distance of 4.0, NY would still stand by itself, but all of the other records would be part of one cluster (two clusters total).

b.   At a cutoff distance of 4.0, there would be 18 separate clusters in this model.

c.   At a cutoff distance of 4.0, Nevada, Puget, and Central would all be in separate clusters.

d.  At a cutoff distance of 4.0, all the records would have been formed into one large cluster.

12. Which of the following statements about the network shown below is true?

a.   This network is a clique, but not a connected network.

b.   This network is a clique and a connected network.

c.   This network is a connected network, but not a clique.

d.   This network is neither a clique nor a connected network.
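The two properties named in question 12 can be checked programmatically for an undirected network. The sketch below uses a made-up three-node example rather than the network from the quiz; the function names are hypothetical.

```python
from collections import deque

def is_connected(nodes, edges):
    """True if every node can be reached from every other node."""
    if not nodes:
        return True
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Breadth-first search from an arbitrary starting node.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for nb in neighbors[current] - seen:
            seen.add(nb)
            queue.append(nb)
    return len(seen) == len(nodes)

def is_clique(nodes, edges):
    """True if every pair of distinct nodes is joined directly by an edge."""
    present = {frozenset(e) for e in edges}
    return all(
        frozenset((u, v)) in present
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
    )

nodes = ["A", "B", "C"]
path = [("A", "B"), ("B", "C")]                  # A-B-C chain
triangle = [("A", "B"), ("B", "C"), ("A", "C")]  # every pair linked
print(is_connected(nodes, path), is_clique(nodes, path))          # True False
print(is_connected(nodes, triangle), is_clique(nodes, triangle))  # True True
```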

13.   Suppose a telephone company is wondering what happened to a particular customer named John Doe.  This customer stopped paying his phone bill, and stopped responding to any correspondence from the phone company.  The phone company suspects that he may have resumed phone service under a different name and address.  How can the company use entity resolution to see if the mystery customer and a new customer are really the same person?

a.   They could look at the calling and text network of John Doe (whose identity they know).   This would tell them who John called, who called him, who he texted, who texted him, etc.   They could then compare that to the call and text networks of new customers to look for a match.

b.   They could build a diagram that shows each person John Doe had ever called or texted.   Then, they could check to see whether any of those people had recently canceled their service with the company.

c.   They could look to see whether any new subscribers had made suspicious inquiries with the telephone company.  Entity resolution would then identify those suspicious individuals, and the company could look to make comparisons from there.

d.   The company could use entity resolution by identifying the call records of known criminals that John Doe had spoken with or texted in the past.  Then, they could adjust their model based on these patterns in order to find people that might know more about John Doe’s whereabouts.

14.  As part of the preprocessing for the creation of a text-mining model, an analyst decides to use stemming.  Which of the following might be accomplished in this step?

a.   Common English-language words such as theirs, this, you’d, she, and what would all be removed from the document, in order to reduce it to the most essential terms.

b.   This will have made the text ready for Latent Semantic Indexing (LSI).

c.   The words ‘train’, ‘training’, ‘trainer’, and ‘trained’ would all be reduced to a single term, and would be treated the same by the model.

d.   Most of the terms -- except those that had multiple punctuation marks -- would become de-tokenized.
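The effect of stemming on a family of related words can be shown with a deliberately naive suffix stripper. This is only a toy: it is not the Porter stemmer or any library's stemmer, and the suffix list and minimum stem length are assumptions chosen for this example.

```python
# Hypothetical suffix list; real stemmers apply far more careful rules.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word):
    """Strip one common suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {naive_stem(w) for w in ["train", "training", "trainer", "trained"]}
print(stems)  # {'train'}
```

Stopword removal and tokenization are separate preprocessing steps and are not performed here.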

15.  What is a potential flaw associated with a social network graph that ignores edge weight?

a.   Without a depiction of eigenvector centrality, the reader would not know whether this was an egocentric network.

b.   A person reading such a graph would not be able to see any meaningful information about the relative importance of the various connections in the network.

c.   Such a graph might lead a reader to believe that the network was directed, when it was actually undirected.

d.   When using a graph that does not show edge weight, time components that might have value to the network will be misrepresented.