
Mathematics and Statistics

EXAMINATION

End-of-year Examinations, 2019

STAT318 / STAT462 -19S2 (C) Data Mining

1.   (a) Suppose that we have the following 100 market basket transactions.

Transaction                          Frequency
{apple}                                  10
{apple, carrot}                          10
{apple, banana, carrot}                  21
{apple, banana, grape}                   27
{apple, banana, carrot, orange}          11
{banana, grape}                           3
{carrot, orange}                         11
{apple, grape, orange}                    7

For example, there are 10 transactions of the form {apple, carrot}.

i.  Compute the support of {orange}, {apple, banana}, and {apple, banana, orange}.

ii.  Compute the confidence of the association rules:

{apple, banana} → {orange};  and

{orange} → {apple, banana}.

Is confidence a symmetric measure? Justify your answer.

iii.  Find the 3-itemset(s) with the largest support.

iv.  If minsup = 0.1, is {carrot, orange} a maximal frequent itemset? Justify your answer.

v.  Lift is defined as

Lift(X → Y) = s(X ∪ Y) / (s(X) s(Y)),

where s(·) denotes support. What does it mean if Lift(X → Y) = 1?
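
Parts i, ii and v can be checked numerically. Below is a minimal Python sketch (the helper names are my own) that computes support, confidence and lift directly from the frequency table above:

```python
# Transaction frequencies from the table above (100 transactions in total).
baskets = {
    frozenset({"apple"}): 10,
    frozenset({"apple", "carrot"}): 10,
    frozenset({"apple", "banana", "carrot"}): 21,
    frozenset({"apple", "banana", "grape"}): 27,
    frozenset({"apple", "banana", "carrot", "orange"}): 11,
    frozenset({"banana", "grape"}): 3,
    frozenset({"carrot", "orange"}): 11,
    frozenset({"apple", "grape", "orange"}): 7,
}
N = sum(baskets.values())

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    items = frozenset(itemset)
    return sum(freq for t, freq in baskets.items() if items <= t) / N

def confidence(lhs, rhs):
    """Confidence of lhs -> rhs: s(lhs ∪ rhs) / s(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

def lift(lhs, rhs):
    """Lift of lhs -> rhs: s(lhs ∪ rhs) / (s(lhs) s(rhs))."""
    return support(set(lhs) | set(rhs)) / (support(lhs) * support(rhs))

print(support({"orange"}))                            # 0.29
print(support({"apple", "banana"}))                   # 0.59
print(support({"apple", "banana", "orange"}))         # 0.11
print(confidence({"apple", "banana"}, {"orange"}))    # ≈ 0.186
print(confidence({"orange"}, {"apple", "banana"}))    # ≈ 0.379
print(lift({"apple", "banana"}, {"orange"}))          # ≈ 0.643
```

The two confidences differ, so confidence is not a symmetric measure; a lift below 1 means {apple, banana} and {orange} occur together less often than expected if they were independent.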

(b) This question examines linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) for a 3-class classification problem.

i.  Explain the difference between LDA and QDA.

ii.  Briefly describe the Bayes classifier and the Bayes error rate.

iii.  Under what conditions does the testing error rate for QDA equal the Bayes error rate?

2.   (a)  Describe two potential advantages of regression trees over other statistical learning methods.

(b) When growing a regression tree using CART, two types of splits are considered. Describe these splits and provide an example for each.

(c) A regression tree has three types of nodes:  the root node,  internal nodes and terminal nodes. Describe each node and explain how predictions are made using a regression tree.

(d)  Large bushy regression trees tend to over-fit the training data. Briefly explain what is meant by over-fitting and under-fitting the training data using regression trees.

(e) The predictive performance of a single regression tree can be substantially improved by aggregating many decision trees.

i.  Briefly explain the method of bagging regression trees.

ii.  Explain the difference between bagging and random forests.

iii.  Briefly explain two differences between boosted regression trees and random forests.
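
For part i, the bootstrap-and-average idea can be sketched in a few lines. The example below is an invented toy illustration (the data, the depth-1 "stump" base learner, and all names are my own), not full CART:

```python
import random

random.seed(0)

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: choose the split point s minimising
    the residual sum of squares when each side is predicted by its mean."""
    best = None
    for s in sorted(set(xs))[1:]:                     # candidate split points
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or rss < best[0]:
            best = (rss, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x < s else mr

def bagged_predict(stumps, x):
    """Bagging: average the predictions of all bootstrapped base learners."""
    return sum(f(x) for f in stumps) / len(stumps)

# Toy training data: y = x plus noise (invented for this illustration).
xs = [i / 10 for i in range(50)]
ys = [x + random.gauss(0, 0.2) for x in xs]

# Fit one stump to each of 100 bootstrap samples of the training data.
stumps = []
for _ in range(100):
    idx = [random.randrange(len(xs)) for _ in range(len(xs))]
    stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

print(bagged_predict(stumps, 1.0))   # low-x prediction
print(bagged_predict(stumps, 4.0))   # high-x prediction
```

A random forest would additionally restrict each split to a random subset of the predictors (decorrelating the trees), while boosting would instead grow the trees sequentially, each one fit to the residuals of the current ensemble.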

3.   (a)  Using one or two sentences, explain the main difference between regression and classification problems.

(b) The expected test MSE, for a given x0, can be decomposed into the sum of three fundamental quantities:

E[(y0 − f̂(x0))²] = V(f̂(x0)) + [Bias(f̂(x0))]² + V(ε).

Briefly explain each of these three quantities.
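
The decomposition can be checked by simulation. The sketch below is an invented toy example: a deliberately biased constant-fit estimator of f(x0), with the Monte Carlo test MSE compared against variance + squared bias + noise variance:

```python
import random
import statistics

random.seed(1)

# Invented example: true function, fixed design points and noise level.
f = lambda x: x ** 2
xs = [i / 10 for i in range(21)]         # fixed training inputs on [0, 2]
x0, sigma = 1.0, 0.5                     # test point and noise sd
R = 20000                                # Monte Carlo repetitions

fhats, sq_errors = [], []
for _ in range(R):
    ys = [f(x) + random.gauss(0, sigma) for x in xs]
    fhat = statistics.mean(ys)           # crude constant fit: biased at x0
    y0 = f(x0) + random.gauss(0, sigma)  # fresh test response at x0
    fhats.append(fhat)
    sq_errors.append((y0 - fhat) ** 2)

mse = statistics.mean(sq_errors)                  # estimates E[(y0 - fhat(x0))^2]
var = statistics.pvariance(fhats)                 # estimates V(fhat(x0))
bias2 = (statistics.mean(fhats) - f(x0)) ** 2     # estimates [Bias(fhat(x0))]^2
print(mse, var + bias2 + sigma ** 2)              # the two sides agree closely
```

Here V(ε) = sigma² is irreducible: even a perfect estimator (zero bias, zero variance) cannot predict the fresh noise in y0.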

(c)  Provide a sketch typical of training error, testing error, and the irreducible error, on a single plot, against the flexibility of a statistical learning method. The x-axis should represent the flexibility and the y-axis should represent the error. Make sure the plot is clearly labelled.  Explain why each of the three curves has the shape displayed in your plot.

(d)  Describe two situations where we would generally expect the testing MSE of an inflexible statistical learning method to be better than a flexible method.

(e) Would we generally expect the training MSE of a flexible statistical learning method to be better or worse than an inflexible method? Why?

4.   (a)  Using one or two sentences, explain the difference between supervised learning and unsupervised learning.

(b) Suppose that we have five points, x1 , . . . , x5 , with the following dissimilarity matrix:

[The full dissimilarity matrix did not survive extraction: only the entries 0.45, 0.53, 0.56, 0 and 0.24 remain, without their row and column positions.]

For example, the dissimilarity between x1 and x2  is 0.9 and the dissimilarity between x3  and x5  is 0.15.

i.  Briefly explain the agglomerative hierarchical clustering algorithm.

ii.  Using the dissimilarity matrix above, sketch the dendrogram that results from hierarchically clustering these points using single linkage.  Clearly label your dendrogram and include all merging dissimilarities.

iii. Suppose we want a clustering with two clusters.  Which points are in each cluster for single linkage?

iv.  Repeat parts ii. and iii. using complete linkage.

v.  Describe one disadvantage of agglomerative hierarchical clustering.
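
For parts ii–iv, the merge heights can be computed mechanically. Since the matrix above did not survive intact, the sketch below invents the missing entries (keeping the two values quoted in the question, d(x1, x2) = 0.9 and d(x3, x5) = 0.15) purely to illustrate single versus complete linkage:

```python
# Invented dissimilarity matrix over points 1..5 (only d(1,2) and d(3,5)
# come from the question text; the rest are made up for this sketch).
d = {
    frozenset({1, 2}): 0.9,  frozenset({1, 3}): 0.45,
    frozenset({1, 4}): 0.53, frozenset({1, 5}): 0.56,
    frozenset({2, 3}): 0.7,  frozenset({2, 4}): 0.8,
    frozenset({2, 5}): 0.75, frozenset({3, 4}): 0.24,
    frozenset({3, 5}): 0.15, frozenset({4, 5}): 0.3,
}

def cluster(points, dist, linkage):
    """Agglomerative clustering: repeatedly merge the two closest clusters.
    `linkage` is min (single) or max (complete) over between-cluster pairs."""
    clusters = [frozenset({p}) for p in points]
    merges = []                                  # (height, merged cluster)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                h = linkage(dist[frozenset({a, b})]
                            for a in clusters[i] for b in clusters[j])
                if best is None or h < best[0]:
                    best = (h, i, j)
        h, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((h, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

single = cluster([1, 2, 3, 4, 5], d, min)
complete = cluster([1, 2, 3, 4, 5], d, max)
print([h for h, _ in single])      # [0.15, 0.24, 0.45, 0.7]
print([h for h, _ in complete])    # [0.15, 0.3, 0.56, 0.9]
```

The recorded heights are exactly the merging dissimilarities to label on the dendrogram; cutting either dendrogram below its last merge gives the two-cluster solution.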

(c)  Describe one advantage and one disadvantage of the k-means clustering algorithm.
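
A minimal Lloyd's-algorithm sketch (with invented 1-D data) illustrates both sides of part (c): each iteration is simple and cheap (an advantage), but the result depends on the random initial centroids and k must be chosen in advance (disadvantages):

```python
import random

random.seed(2)

def kmeans(points, k, iters=20):
    """Lloyd's algorithm on 1-D data: alternate assignment and mean update."""
    centroids = random.sample(points, k)         # random initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assignment step
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        for i, c in enumerate(clusters):         # update step
            if c:
                centroids[i] = sum(c) / len(c)
    return sorted(centroids)

# Two well-separated invented groups, around 0 and around 10.
pts = ([random.gauss(0, 0.5) for _ in range(30)]
       + [random.gauss(10, 0.5) for _ in range(30)])
print(kmeans(pts, 2))    # roughly the two group means
```

With a different seed the initial centroids change, and on less well-separated data the algorithm can converge to a different (locally optimal) clustering.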