
CSCI 567, Summer 2022

Theory Assignment 1

Problem 1  Understanding Entropy                                                                         (36 points)

The goal of this problem is to gain an intuitive understanding of entropy and why it is a good criterion for growing decision trees. Recall from lecture that the entropy of a discrete distribution P over C classes is defined as:

H(P) = − Σ_{k=1}^{C} P(Y = k) log P(Y = k)
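As a quick numerical illustration (not something you need to submit), the sketch below evaluates H(P) with NumPy; the helper name entropy and the use of the natural log are our own choices, since the base of the log only rescales H.

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_k P(Y = k) log P(Y = k), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                 # drop zero-probability classes (0 log 0 = 0)
    return -np.sum(nz * np.log(nz))

print(entropy([1.0, 0.0, 0.0]))            # one-hot distribution: ≈ 0 (minimum)
print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over C = 4: log 4 ≈ 1.386
```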

1.1  Consider a node where all samples belong to the same class i. We can write the probability distribution as follows (adopting the standard convention 0 log 0 = 0):

P(Y = k) = 1 if k = i, and 0 otherwise.

We will prove that P is a minimum entropy distribution.

(a) Prove that for any distribution P, the minimum possible entropy is 0. In other words, H(P) ≥ 0.    (6 points)

(b) Show that a node where all samples belong to the same class achieves minimum entropy.    (6 points)

(c) Explain why this is desirable for a decision tree leaf.                                                              (4 points)

1.2  Next consider a node with an equal number of samples from all classes. Here the probability of each label is the same:

P(Y = k) = 1/C for all k = 1, . . . , C.

We’ll prove that P is a maximum entropy distribution for the 2-class case (C = 2). Clarification: For the remainder of the problem, consider an arbitrary 2-class distribution P.

(a) Define P1 = P(Y = 1) and P2 = P(Y = 2). Rewrite H(P) as a function of just P1.                  (6 points)

(b) We want to find P such that H(P) is maximized. In this case we can just find a local maximum of H(P). Calculate the values of P1 and P2 that maximize entropy; a numerical sanity check appears after part (c).    (10 points)

(c) Explain why nodes with high entropy are not desirable for decision trees.                            (4 points)
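As an optional numerical sanity check of part (b) (not part of the required derivation), the sketch below evaluates the binary entropy on a grid; the grid resolution is an arbitrary choice.

```python
import numpy as np

# Binary entropy H(P1) = -P1 log P1 - (1 - P1) log(1 - P1) on a fine grid.
p1 = np.linspace(0.001, 0.999, 999)
H = -p1 * np.log(p1) - (1 - p1) * np.log(1 - p1)
print(p1[np.argmax(H)])   # 0.5: the uniform distribution maximizes H
print(H.max())            # log 2 ≈ 0.693
```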

Problem 2  Decision Tree Pruning                                                                            (32 points)

Consider the following decision tree with features x1, x2, x3, x4 taking on values {0, 1} (labelled in red) and predictions y at each leaf node (labelled in blue). There are 13 validation examples. The number of validation examples within each class is noted in the table below each leaf (labelled in green).

[Figure: decision tree with internal nodes splitting on x1, x2, x3, x4; leaves predict y = 0 or y = 1, each with a table of per-class validation counts.]
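Because the tree structure and per-leaf counts live in the figure, here is an optional sketch of the mechanics behind 2.1–2.5 on a made-up two-leaf subtree; the predictions and counts below are hypothetical, NOT the values from the figure. Pruning is modeled as replacing an internal node by a majority-vote leaf, the usual reduced-error pruning convention.

```python
# Hypothetical two-leaf subtree (NOT the tree in the figure). Each leaf stores
# its prediction and the validation counts [n_class0, n_class1] reaching it.
leaves = [
    {"pred": 0, "counts": [4, 0]},   # predicts 0; 0 of 4 examples misclassified
    {"pred": 1, "counts": [3, 5]},   # predicts 1; 3 of 8 examples misclassified
]

def validation_error(leaves):
    """Classification error = misclassified examples / total examples."""
    wrong = sum(leaf["counts"][1 - leaf["pred"]] for leaf in leaves)
    total = sum(sum(leaf["counts"]) for leaf in leaves)
    return wrong / total

print(validation_error(leaves))      # 3/12 = 0.25

# Pruning the parent of these leaves merges them into a single leaf that
# predicts the majority class of the combined counts.
merged = [sum(c) for c in zip(*(l["counts"] for l in leaves))]   # [7, 5]
pruned = [{"pred": int(merged[1] > merged[0]), "counts": merged}]
print(validation_error(pruned))      # 5/12 ≈ 0.417 -> pruning hurts here
```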

2.1  Calculate the classification error over the validation set. Show your work.                             (6 points)

2.2  Calculate the classification error if you were to prune the x2 node. Show your work.             (7 points)

2.3  Calculate the classification error if you were to prune the x3 node. Show your work.             (7 points)

2.4  Calculate the classification error if you were to prune the x4 node. Show your work.             (7 points)

2.5  Based on classification error, should this tree be pruned? If so, which node should be pruned?    (5 points)

Problem 3  Nearest Neighbor Classification                                                      (20 points)

3.1  We mentioned that the Euclidean/L2 distance is often used as the default distance for nearest neighbor classification. It is defined as

E(x, x′) = ∥x − x′∥₂² = Σ_d (x_d − x′_d)²

In some applications such as information retrieval, the cosine distance is widely used too. It is defined as

C(x, x′) = 1 − cos(x, x′) = 1 − xᵀx′ / (∥x∥₂ ∥x′∥₂)

Show that, if data is normalized with unit L2 norm, that is, ∥x∥ = 1 for all x in the training and test sets, changing the distance function from the Euclidean distance to the cosine distance will NOT affect the nearest neighbor classification results.                                                                                                 (10 points)
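As an optional numerical sanity check (not a substitute for the proof), the sketch below shows that with unit-norm rows both distances select the same nearest neighbor; the data sizes and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize rows to unit L2 norm
q = rng.normal(size=5)
q /= np.linalg.norm(q)                          # unit-norm test point

euclid = np.linalg.norm(X - q, axis=1) ** 2     # E(x, q) = ||x - q||_2^2
cosine = 1 - X @ q                              # C(x, q) = 1 - x^T q (unit norms)
print(np.argmin(euclid) == np.argmin(cosine))   # True: same nearest neighbor
print(np.allclose(euclid, 2 * cosine))          # True: ||x - q||^2 = 2 C(x, q)
```

The second check hints at the identity behind the result: for unit-norm vectors, the squared Euclidean distance is exactly twice the cosine distance.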

3.2  Assume we have a dataset where each data point x is a 100-dimensional binary vector, i.e. x ∈ {0, 1}^100, and each x is assigned a label y ∈ {0, 1}. You can assume that all data points are distinct, i.e. for all xi, xj in the dataset with i ≠ j, xi ≠ xj.

(a) Can we construct a decision tree that classifies the dataset with zero classification error w.r.t. their labels? Explain your answer (an optional empirical check appears after part (b)).    (5 points)

(b) Can we specify a 1-NN classifier over the dataset that results in exactly the same classification as our decision tree? Explain your answer.    (5 points)
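As an optional empirical check of the intuition in part (a), the sketch below fits an unrestricted scikit-learn decision tree to distinct random binary vectors with arbitrary labels; the sample size, seed, and use of scikit-learn are our own choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.unique(rng.integers(0, 2, size=(200, 100)), axis=0)  # distinct binary rows
y = rng.integers(0, 2, size=len(X))                         # arbitrary 0/1 labels

# With distinct points and no depth limit, the tree can keep splitting
# until every leaf is pure, so it classifies the training set perfectly.
tree = DecisionTreeClassifier().fit(X, y)
print(tree.score(X, y))   # 1.0
```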

Problem 4   Multicollinearity in Ridge Regression                                           (12 points)

Ridge Regression: L2 regularization is often used to prevent overfitting. We discussed in lecture that the mean squared error (MSE) loss is used to solve regression problems. L2 regularization is added to the MSE loss to control the magnitude of the coefficients learned by regression.

Multicollinearity happens when independent variables in the regression model are highly correlated with each other. We will investigate how it affects our regression coefficients.

(a) We are given a dataset X that has only 1 feature: X = [X1], i.e. X has dimension n × p with p = 1. Write the ridge regression loss function (MSE + regularization) for this problem. What is the value of β1 in terms of X and Y?    (1 point)

(b) We expand the feature set by adding k − 1 copies of this feature to our dataset as new columns, to get a new dataset X = [X1 X2 X3 . . . Xk], i.e. X has dimension n × p with p = k now. We fit ridge regression again. Find the new coefficients β2, where β2 is a k × 1 vector. Write the loss function and show your work. Summarize how the magnitude compares with β1 in (a); a numerical illustration appears at the end of this problem.    (4 points)

(c) We will try to carry over our understanding to higher-dimensional vectors. Consider a general dataset X of dimension n × p where p > 1. Write the loss function and the value of β using the matrix notation as discussed in class.    (1 point)

(d) We augment the features in X with a copy of all features in the dataset: X′ = [X X], so that X′ has dimension n × 2p. How will the value of β change? Write the new loss function and calculate the new coefficients. Show your work.    (4 points)

(e) Lasso and Multicollinearity: Consider a scenario where we use LASSO instead of ridge regression. Can you predict how the new β will differ from the old β when multicollinearity is introduced? Give a brief summary. No proof or calculation is required.    (2 points)
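To build intuition for parts (a), (b), and (d), here is an optional numerical sketch using the closed-form ridge solution β = (XᵀX + λI)⁻¹XᵀY from class; the choices λ = 1, n = 100, and k = 4 copies are arbitrary.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: beta = (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=100)

b1 = ridge(x, y, lam=1.0)                   # part (a): single feature
bk = ridge(np.hstack([x] * 4), y, lam=1.0)  # part (b): k = 4 identical copies
print(b1)   # one coefficient close to 3
print(bk)   # four equal coefficients, each roughly b1 / 4
```

By symmetry the duplicated coefficients come out equal, and each shrinks roughly by a factor of k relative to the single-feature coefficient, which is the pattern part (b) asks you to derive.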