
Mathematics and Statistics

EXAMINATION

End-of-year Examinations, 2020

STAT318-20S2 (C) / STAT462-20S2 (C) Data Mining

1.   (a) Suppose that we have the following 100 market basket transactions.

Transaction                          Frequency
{apple}                                      9
{apple, carrot}                             10
{apple, banana, carrot}                     27
{apple, banana, carrot, orange}             21
{banana, orange}                             3
{apple, banana, orange}                     11
{carrot, orange}                             5
{banana, carrot, orange}                    14

For example, there are 10 transactions of the form {apple, carrot}.

i.  Compute the support of {orange}, {banana, carrot}, and {banana, carrot, orange}.

ii.  Compute the confidence of the association rules {banana, carrot} → {orange} and {orange} → {banana, carrot}.

Is confidence a symmetric measure? Justify your answer.

iii.  Find the 3-itemset(s) with the largest support.

iv.  If minsup = 0.2, is {banana, carrot, orange} a maximal frequent itemset? Justify your answer.

v.  Lift is defined as

Lift(X → Y) = s(X ∪ Y) / (s(X) s(Y)),

where s(·) denotes support. What does it mean if Lift(X → Y) < 1?
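As a check, the quantities in part (a) can be computed directly from the transaction table (a Python sketch; the counts come from the question, while the helper names are our own):

```python
# Itemset counts from the transaction table in part (a); 100 transactions total.
counts = {
    frozenset({"apple"}): 9,
    frozenset({"apple", "carrot"}): 10,
    frozenset({"apple", "banana", "carrot"}): 27,
    frozenset({"apple", "banana", "carrot", "orange"}): 21,
    frozenset({"banana", "orange"}): 3,
    frozenset({"apple", "banana", "orange"}): 11,
    frozenset({"carrot", "orange"}): 5,
    frozenset({"banana", "carrot", "orange"}): 14,
}
N = sum(counts.values())  # 100

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(c for t, c in counts.items() if itemset <= t) / N

def confidence(X, Y):
    """conf(X -> Y) = s(X u Y) / s(X): not symmetric in X and Y."""
    return support(set(X) | set(Y)) / support(X)

def lift(X, Y):
    """lift(X -> Y) = s(X u Y) / (s(X) s(Y)): symmetric in X and Y."""
    return support(set(X) | set(Y)) / (support(X) * support(Y))

print(support({"orange"}))            # 0.54
print(support({"banana", "carrot"}))  # 0.62
print(confidence({"banana", "carrot"}, {"orange"}))  # 35/62 ≈ 0.565
print(confidence({"orange"}, {"banana", "carrot"}))  # 35/54 ≈ 0.648
```

The two confidence values differ, whereas lift(X → Y) = lift(Y → X) by construction.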

(b) This question examines linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) for a 2-class classification problem.

i.  Under what conditions are the Bayes classifier and LDA equivalent classifiers?

ii.  Explain the difference between LDA and QDA.

iii.  In general, as the sample size increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
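The LDA/QDA distinction in parts ii–iii can be made concrete with a small sketch on simulated data (not from the exam): both classifiers apply Gaussian discriminant functions, but LDA pools a single covariance matrix across classes while QDA estimates one per class, so QDA has more parameters and a quadratic boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-class training data (illustrative only).
X1 = rng.normal([0, 0], [1.0, 1.0], size=(200, 2))
X2 = rng.normal([2, 2], [1.0, 2.0], size=(200, 2))

def gaussian_discriminant(x, mu, Sigma, prior):
    """log prior + log N(x; mu, Sigma), dropping the shared 2*pi constant."""
    diff = x - mu
    return (np.log(prior) - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)
S2 = np.cov(X2, rowvar=False)
S_pooled = 0.5 * (S1 + S2)   # equal class sizes, so a simple average

x0 = np.array([0.2, 0.2])

# LDA: both classes share the pooled covariance -> linear boundary.
lda_class = 1 if (gaussian_discriminant(x0, mu1, S_pooled, 0.5)
                  > gaussian_discriminant(x0, mu2, S_pooled, 0.5)) else 2
# QDA: each class keeps its own covariance -> quadratic boundary.
qda_class = 1 if (gaussian_discriminant(x0, mu1, S1, 0.5)
                  > gaussian_discriminant(x0, mu2, S2, 0.5)) else 2
print(lda_class, qda_class)
```

The extra per-class covariance parameters are why QDA tends to need a larger sample than LDA before its lower bias pays off.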

2.   (a)  Describe two potential advantages of regression trees over other statistical learning methods.

(b) This question uses the CART partition given below. The points in each box denote

the training data and the numbers next to them are their response values.

[Figure: CART partition of the predictor space, with X1 on the horizontal axis (roughly 2 to 10) and X2 on the vertical axis (roughly −1 to 3); the training points are plotted in the rectangular regions with response values including 1, 2, 3, 4, 10, 12, 14, and 16.]

i. Sketch the decision tree corresponding to the CART partition.

ii.  Predict the response values for the following observations using your tree, where x = (x1, x2):

a = (5, 1);   b = (3.5, 2);   and   c = (11, −2).

iii. What is the training MSE for your tree?
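A fitted CART partition reads as nested if/else statements on the split variables, with each leaf returning the mean response of its rectangle. A sketch with a hypothetical two-variable partition (the splits and leaf values below are illustrative, not those of the exam's figure):

```python
def predict(x1, x2):
    """Hypothetical regression tree; splits and leaf means are made up
    for illustration and do not come from the exam's CART partition."""
    if x1 < 5:
        if x2 < 1:
            return 3.5   # mean of the hypothetical responses in this leaf
        return 1.5
    if x2 < 0:
        return 12.0
    return 14.0

print(predict(5, 1))      # 14.0  (x1 >= 5 and x2 >= 0)
print(predict(3.5, 2))    # 1.5
print(predict(11, -2))    # 12.0
```

The training MSE of such a tree is the average squared difference between each training response and its leaf mean.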

(c) The predictive performance of a single regression tree can be substantially improved by aggregating many decision trees.

i.  Briefly explain the method of bagging regression trees.

ii.  Explain the difference between bagging and random forest.

iii.  Briefly explain two differences between boosted regression trees and random forest.
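The bagging idea in part (c) can be sketched in a few lines: fit a tree to each bootstrap resample and average the predictions. The sketch below uses one-split "stump" trees on hypothetical 1-D data (our own construction, not from the exam); a random forest would additionally restrict each split to a random subset of predictors (vacuous in 1-D), while boosting would instead fit trees sequentially to the residuals of the current ensemble.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D regression data.
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.3, size=200)

def fit_stump(x, y):
    """One-split regression tree: choose the split point minimising SSE."""
    best = None
    for s in np.unique(x):
        left, right = y[x < s], y[x >= s]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, lmean, rmean = best
    return lambda t: np.where(t < s, lmean, rmean)

def bagged_predict(x, y, t, B=50):
    """Bagging: fit a tree to each bootstrap sample, then average them."""
    preds = []
    for _ in range(B):
        idx = rng.integers(0, len(x), size=len(x))  # bootstrap resample
        preds.append(fit_stump(x[idx], y[idx])(t))
    return np.mean(preds, axis=0)

t = np.linspace(0, 10, 50)
single = fit_stump(x, y)(t)       # one high-variance step function
bagged = bagged_predict(x, y, t)  # averaging smooths it out
```

Averaging over bootstrap trees reduces variance without changing the bias much, which is the core of the answers to i and ii.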

3.   (a)  Using one or two sentences, explain the main difference between regression and classification problems.

(b) The expected test MSE, for a given x0, can be decomposed into the sum of three fundamental quantities:

E[(y0 − f̂(x0))²] = V(f̂(x0)) + [Bias(f̂(x0))]² + V(ε).

Briefly explain each of these three quantities.
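The decomposition can be verified by simulation. In the sketch below all numerical values (true f(x0), bias, and the two standard deviations) are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

f_x0 = 2.0      # true f(x0)
bias = 0.5      # systematic error of a hypothetical estimator
sd_fhat = 0.8   # sd of fhat(x0) across training sets
sd_eps = 1.0    # sd of the irreducible noise

n = 200_000
fhat = f_x0 + bias + rng.normal(0, sd_fhat, n)  # fhat(x0) over many training sets
y0 = f_x0 + rng.normal(0, sd_eps, n)            # fresh test responses at x0

mse = np.mean((y0 - fhat) ** 2)
theory = sd_fhat**2 + bias**2 + sd_eps**2       # V(fhat) + Bias^2 + V(eps) = 1.89
print(mse, theory)
```

The simulated MSE matches the sum of the variance, squared bias, and irreducible error terms.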

(c)  Briefly explain what is meant by over-fitting and under-fitting the training data.

(d)  Provide a sketch of typical training error, testing error, and irreducible error curves, on a single plot, against the flexibility of a statistical learning method. The x-axis should represent the flexibility and the y-axis should represent the error. Make sure the plot is clearly labelled. Explain why each of the three curves has the shape displayed in your plot.

(e)  Indicate on your plot in part (d) where we would generally expect to find a regression tree with five splits, a random forest, and k-nearest neighbour regression with k = 3. Explain why you have placed these methods where you have.

(f) Would we generally expect the training MSE of an un-pruned regression tree to be better or worse than a pruned regression tree? Why?

4.   (a) Suppose that we have five points, x1, . . . , x5, with the following distance matrix:

[Table: 5 × 5 distance matrix with rows and columns x1, . . . , x5; the entries are not recoverable here.]

For example, the distance between x1  and x2  is 6 and the distance between x3  and x5  is 8.

i.  Briefly explain the single linkage and complete linkage hierarchical clustering algorithms.

ii.  Using the distance matrix above, sketch the dendrogram that results from hierarchically clustering these  points  using complete  linkage.   Clearly  label your dendrogram and include all merging distances.

iii. Suppose we want a clustering with two clusters.  Which points are in each cluster for complete linkage?

iv.  Briefly explain one strength and one weakness of hierarchical clustering.
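The linkage algorithms in part (a) can be sketched directly on a distance matrix. Since the exam's matrix entries are not reproduced above, the matrix below is hypothetical, chosen only to be consistent with the two stated distances d(x1, x2) = 6 and d(x3, x5) = 8 (0-based indices in code):

```python
import numpy as np

# Hypothetical symmetric distance matrix for x1..x5 (indices 0..4).
D = np.array([
    [0, 6, 9, 7, 9],
    [6, 0, 5, 8, 4],
    [9, 5, 0, 3, 8],
    [7, 8, 3, 0, 8],
    [9, 4, 8, 8, 0],
], dtype=float)

def agglomerate(D, linkage=max):
    """Repeatedly merge the two closest clusters, recording each merge.
    `linkage=max` gives complete linkage; `linkage=min` gives single linkage."""
    clusters = [frozenset([i]) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] \
            + [clusters[a] | clusters[b]]
    return merges

for left, right, d in agglomerate(D):
    print(left, right, d)   # each merge and its merging distance
```

The recorded merges and merging distances are exactly what a dendrogram displays; cutting it just below the final merge gives the two-cluster solution.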

(b)  Consider applying the k-means clustering algorithm to two-dimensional points using

Euclidean distance with k = 2.

i.  Briefly explain the k-means clustering algorithm.

ii. After two passes of the main loop of the k-means algorithm, the following cluster assignment was obtained:

Cluster 1 = {(1, 2), (2, 3), (5, 2), (2, 1)}

Cluster 2 = {(6, 2), (5, 3), (7, 4)}.

Has k-means converged to a stable clustering? Justify your answer.

iii. Would we generally expect to get the same clustering if we run the k-means algorithm several times? Explain.

iv.  Briefly explain one strength and one weakness of the k-means clustering algo- rithm.
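The stability question in part (b) ii can be checked numerically from the given points: compute each cluster's centroid, then run one more assignment pass and see whether any point would switch clusters (a Python sketch; the function name is our own).

```python
import numpy as np

# Cluster assignment from part (b) ii.
c1 = np.array([(1, 2), (2, 3), (5, 2), (2, 1)], dtype=float)
c2 = np.array([(6, 2), (5, 3), (7, 4)], dtype=float)

m1, m2 = c1.mean(axis=0), c2.mean(axis=0)   # centroids: (2.5, 2) and (6, 3)

def nearest(p):
    """Index of the closer centroid under Euclidean distance."""
    return 1 if np.linalg.norm(p - m1) < np.linalg.norm(p - m2) else 2

# Points that would be reassigned on the next pass.
moved = [(float(p[0]), float(p[1])) for p in c1 if nearest(p) != 1] \
      + [(float(p[0]), float(p[1])) for p in c2 if nearest(p) != 2]
print(moved)   # [(5.0, 2.0)]
```

The point (5, 2) is closer to the second centroid than to its own, so at least one reassignment remains and the algorithm has not converged. Rerunning k-means with different random initial centroids can land in different local optima, which bears on part iii.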