CS M146, Spring 2022
Midterm Exam
1. [30 points total] General Conceptual Questions
For each of the statements below, state True or False. Explain your answer in 1-2 sentences. Each part below is worth 1 point for the correct assertion and 1 point for the correct explanation.
(a) [2 points] The likelihood function for linear regression with a squared error loss assumes that
the labels are distributed as per a Gaussian distribution with zero mean.
Answer: False, the labels are distributed as a Gaussian with mean θ^T x, not zero; the zero-mean assumption applies to the noise.
(b) [2 points] The logistic regression loss scales linearly with the probability of misclassification.
Answer: False, it scales logarithmically, as −log p(x), where p(x) is the probability assigned to the correct label.
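A quick numeric check (a minimal Python sketch) of the non-linear growth of the loss −log(1 − q) in the misclassification probability q:

```python
import math

# Logistic loss in terms of the misclassification probability q:
# loss = -log(p_correct) = -log(1 - q), which is clearly not linear in q.
for q in (0.1, 0.5, 0.9, 0.99):
    print(f"q = {q}: loss = {-math.log(1 - q):.3f}")
```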
(c) [2 points] The derivative of sigmoid(z) at z = 0 is 1/4.
Answer: True, the derivative of sigmoid(z) is sigmoid(z)(1 − sigmoid(z)), which equals 1/4 at z = 0.
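This is easy to verify numerically; a minimal sketch comparing the analytic derivative with a finite-difference estimate at z = 0:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Analytic derivative: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
analytic = sigmoid(0.0) * (1 - sigmoid(0.0))   # 0.25

# Central finite-difference check at z = 0
h = 1e-6
numeric = (sigmoid(h) - sigmoid(-h)) / (2 * h)

print(analytic, numeric)  # both ~0.25
```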
(d) [2 points] Consider a multi-class classification problem consisting of d input features and c classes. Then, the number of unknown parameters for logistic regression using a softmax is d + c.
Answer: False, the total number of unknown parameters for softmax regression is cd (one d-dimensional weight vector per class).
(e) [2 points] For optimizing a convex function, stochastic gradient descent requires more gradient
updates than mini-batch gradient descent with batch size > 1 to achieve convergence. Assume a common learning rate that guarantees convergence for both algorithms.
Answer: True, stochastic gradient descent uses noisier gradient estimates and hence requires more gradient updates for convergence.
(f) [2 points] ℓ2 regularization can improve generalization of a ML model by reducing its bias.
Answer: False, ℓ2 regularization curbs overfitting by reducing variance for high-variance models; it typically increases bias.
(g) [2 points] If we have m values for a hyperparameter of a classifier and we use k-fold cross-validation for hyperparameter tuning, then we train the classifier mk times from scratch.
Answer: True, each hyperparameter value requires training the classifier k times, once per fold.
(h) [2 points] If the gap between learning curves for the training and validation datasets is large, we
can reduce it by decreasing the complexity of our model class.
Answer: True, a large gap indicates overfitting, which a less complex model class reduces.
(i) [2 points] Consider a training dataset for binary classification with n1 positive examples and n2 negative examples, where n1 < n2. Further, the test set consists of t1 positive examples and t2 negative examples with t1 > t2. The test accuracy for a kNN classifier with k = n1 + n2 is t2/(t1 + t2).
Answer: True, with k equal to the full training set size, the majority vote is always negative, so all examples in the test set are classified as negative and only the t2 negative ones are correct.
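A minimal sketch of this degenerate case (the class counts below are made-up values satisfying n1 < n2 and t1 > t2):

```python
from collections import Counter

# With k = n1 + n2 (the entire training set), every test point is classified
# by a majority vote over ALL training labels, i.e., always the majority class.
n1, n2 = 3, 7        # hypothetical counts with n1 < n2
t1, t2 = 6, 4        # hypothetical counts with t1 > t2

train_labels = [+1] * n1 + [-1] * n2
majority = Counter(train_labels).most_common(1)[0][0]
print(majority)                # -1: every test example is predicted negative

accuracy = t2 / (t1 + t2)      # only the t2 negative test examples are correct
print(accuracy)                # 0.4 for these counts
```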
(j) [2 points] Both kNN and perceptrons can be used to learn decision boundaries that are non-linear
w.r.t. the input attributes.
Answer: True, for perceptrons, this requires using a non-linear basis function/feature map/kernelization.
(k) [2 points] A kNN classifier for k = 10 requires more memory at test time than k = 2.
Answer: False, both procedures require storing the entire training dataset and consume the same amount of memory.
(l) [2 points] The ID3 algorithm for constructing decision trees chooses to split on attributes with the highest entropy.
Answer: False, ID3 splits on the attribute with the highest information gain, not the highest entropy.
(m) [2 points] The depth of a decision tree learned using ID3 algorithm can be larger than the number of training examples. Assume that the depth of the root node is 0.
Answer: False, each split along a branch reduces the number of training examples at that node by at least one, so the depth cannot exceed the number of training examples.
(n) [2 points] If k(u, v) is a valid kernel function for some u, v ∈ Rd , then ck(u, v) is also a valid
kernel function where c is any real-valued scalar.
Answer: False, we need c ≥ 0: then ck(u, v) can be factorized as (√c ϕ(u))^T (√c ϕ(v)) and is valid, whereas c < 0 would give ck(u, u) < 0, which no valid kernel allows.
(o) [2 points] The cost of making predictions with a kernelized linear model is O(n), where n is the size of the training dataset. Assume that the cost of evaluating a kernel function between any pair of points is constant.
Answer: True, the predictions from a kernelized linear model are given as hθ(x) = g(Σ_{i=1}^n βi k(x^(i), x)), which requires one kernel evaluation per training example.
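A minimal sketch of such a prediction routine (the RBF kernel and the sign link g are illustrative choices, not specified by the question):

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    # k(u, v) = exp(-gamma * ||u - v||^2); any valid kernel works here
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def predict(x, train_X, beta, kernel=rbf_kernel):
    # h(x) = g(sum_i beta_i * k(x^(i), x)): one kernel evaluation per
    # training point, so the cost is O(n) for n training examples.
    score = sum(b * kernel(xi, x) for b, xi in zip(beta, train_X))
    return 1 if score >= 0 else -1  # g = sign for binary classification

# Example with two training points in R^2 and hypothetical coefficients beta
train_X = [(0.0, 1.0), (1.0, 0.0)]
beta = [0.5, -0.25]
print(predict((0.2, 0.8), train_X, beta))
```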
2. [10 points total] Linear Models
Consider the setting of binary classification with +1/−1 labels. We have access to a training dataset of n examples, {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))}. To learn a binary classifier, we will consider a linear hypothesis class consisting of the following hypothesis:
hθ(x) = sign(θ^T x)    (1)
Here, x ∈ R^(d+1) and θ ∈ R^(d+1). We set x0 = 1 so that θ0 is the bias parameter and θ_{1:d} are the weight parameters. In this problem, we consider the following empirical risk:
R(θ) = (1/n) Σ_{i=1}^n log(1 + exp(−y^(i) θ^T x^(i)))    (2)
To learn this classifier, we propose to further regularize the above risk and minimize the following regularized loss:
J(θ) = (1/n) Σ_{i=1}^n log(1 + exp(−y^(i) θ^T x^(i))) + λ ∥θ_{1:d}∥_2^2    (3)
where λ > 0 is the regularization coefficient.
(a) [3 points] Derive a gradient update rule for minimizing J(θ) w.r.t. the weight parameters. Assume the learning rate is α.
Answer: Gradient update rule: for all j = 1, 2, ..., d:
θj ← θj − α ∂J(θ)/∂θj
where:
∂J(θ)/∂θj = −(1/n) Σ_{i=1}^n y^(i) x_j^(i) exp(−y^(i) θ^T x^(i)) / (1 + exp(−y^(i) θ^T x^(i))) + 2λθj
Alternative correct answers will use vectorized notation,
θ_{1:d} ← θ_{1:d} − α ∇_{θ_{1:d}} J(θ)
where:
∇_{θ_{1:d}} J(θ) = −(1/n) Σ_{i=1}^n y^(i) x^(i)_{1:d} exp(−y^(i) θ^T x^(i)) / (1 + exp(−y^(i) θ^T x^(i))) + 2λ θ_{1:d}
(b) [3 points] Derive a gradient update rule for minimizing J(θ) w.r.t. the bias parameter. Assume the learning rate is α.
Answer:
Gradient update rule:
θ0 ← θ0 − α ∂J(θ)/∂θ0
where (recall x0^(i) = 1 and the bias is not regularized):
∂J(θ)/∂θ0 = −(1/n) Σ_{i=1}^n y^(i) exp(−y^(i) θ^T x^(i)) / (1 + exp(−y^(i) θ^T x^(i)))
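A minimal sketch implementing the gradients from parts (a) and (b) in vectorized form (NumPy, with X carrying a leading all-ones bias column as assumed above):

```python
import numpy as np

def grad_J(theta, X, y, lam):
    """Gradient of J(theta) from Eq. (3); X has shape (n, d+1) with
    X[:, 0] = 1 (bias column), and y holds +1/-1 labels."""
    n = X.shape[0]
    margins = y * (X @ theta)            # m_i = y_i * theta^T x_i
    # exp(-m) / (1 + exp(-m)) == 1 / (1 + exp(m)), an equivalent form
    coef = -y / (1.0 + np.exp(margins))
    grad = (X.T @ coef) / n              # (1/n) * sum_i coef_i * x^(i)
    reg = 2.0 * lam * theta
    reg[0] = 0.0                         # the bias theta_0 is not regularized
    return grad + reg

def gd_step(theta, X, y, lam, alpha):
    # Single update covering parts (a) and (b): theta <- theta - alpha * grad
    return theta - alpha * grad_J(theta, X, y, lam)
```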
(c) [2 points] Assume we fix θ0 = 0 and choose a very large λ → ∞. Let θ̂ denote the optimal solution obtained by minimizing J(θ) under these assumptions. What is the corresponding empirical risk R(θ̂)? Explain your answer.
Answer: As λ → ∞, θ̂ → 0, so every term becomes log(1 + exp(0)) = log 2 and hence R(θ̂) = log 2.
(d) [2 points] Recall, in class, we studied the perceptron model with the same hypothesis as Eq. (1). However, we defined the empirical risk for the perceptron differently, as:
Rperceptron(θ) = (1/n) Σ_{i=1}^n max(0, −y^(i) θ^T x^(i))    (4)
Consider a hypothesis with parameters θ* that achieves full accuracy on the training dataset, i.e., hθ*(x^(i)) = y^(i) for all i = 1, 2, ..., n. Is Rperceptron(θ*) > R(θ*)? Explain your answer.
Answer: No.
Rperceptron(θ*) = 0, since y^(i) θ*^T x^(i) ≥ 0 for all i.
R(θ) > 0 for all values of θ (including θ*), since each term log(1 + exp(·)) is strictly positive. Hence R(θ*) > Rperceptron(θ*).
3. [10 points total] Decision Trees
Suppose you have the following data set of 10 instances to determine whether Alice rates a dish as a favorite.
 #  | Size | Taste | Temperature | Favorite
----|------|-------|-------------|---------
 1  |  S   |  Sa   |      H      |   No
 2  |  L   |  Sw   |      C      |   No
 3  |  L   |  Sw   |      C      |   No
 4  |  S   |  So   |      C      |   Yes
 5  |  S   |  So   |      H      |   Yes
 6  |  L   |  Sa   |      H      |   No
 7  |  L   |  So   |      H      |   Yes
 8  |  S   |  Sw   |      C      |   Yes
 9  |  S   |  Sw   |      C      |   Yes
10  |  L   |  Sa   |      H      |   No
Table 1: Abbreviations: S: Small, L: Large, Sa: Salty, Sw: Sweet, So: Sour, H: Hot, C: Cold
The data set consists of 3 features: Size, Taste, Temperature, and a corresponding label Favorite. For any log you use in this question, please use log2.
[Hint: For correct calculations, you should not require any non-trivial log2 computations.]
(a) [3 points] Compute the information gain for the Favorite label and the Taste attribute.
Answer:
H(Favorite) = −(5/10) log2(5/10) − (5/10) log2(5/10) = 1
H(Favorite|Taste) = (3/10) × 0 + (4/10) × 1 + (3/10) × 0 = 0.4
(Sa: 3 examples, all No; So: 3 examples, all Yes; Sw: 4 examples, 2 Yes and 2 No, with entropy −0.5 log2 0.5 − 0.5 log2 0.5 = 1.)
I(Favorite, Taste) = H(Favorite) − H(Favorite|Taste) = 1 − 0.4 = 0.6
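These quantities can be checked with a short script over Table 1 (a minimal sketch; attribute index 1 corresponds to Taste):

```python
from collections import Counter
from math import log2

data = [  # (Size, Taste, Temperature, Favorite) rows of Table 1
    ("S","Sa","H","No"), ("L","Sw","C","No"), ("L","Sw","C","No"),
    ("S","So","C","Yes"), ("S","So","H","Yes"), ("L","Sa","H","No"),
    ("L","So","H","Yes"), ("S","Sw","C","Yes"), ("S","Sw","C","Yes"),
    ("L","Sa","H","No"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_idx):
    labels = [row[-1] for row in data]
    cond = 0.0
    for value in set(row[attr_idx] for row in data):
        subset = [row[-1] for row in data if row[attr_idx] == value]
        cond += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - cond

print(info_gain(1))  # Taste -> 1 - 0.4 = 0.6
```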
(b) [4 points] Suppose we picked Taste as the root of the tree. Construct the full decision tree using
the ID3 algorithm.
Answer:
We need to pick between Size and Temperature as the next splitting attribute for the branch Taste = Sw. All other branches have perfect classification accuracy and directly lead to leaf nodes.
H(Favorite|Size, Taste = Sw) = (2/4) × 0 + (2/4) × 0 = 0
H(Favorite|Temperature, Taste = Sw) = 1 × (−0.5 log2 0.5 − 0.5 log2 0.5) = 1 (all four Sw examples have Temperature = C, so this split gives no information)
Hence, the next attribute to split on is Size. The two resulting branches have perfect classification accuracy and hence we can stop. The resulting tree:
Taste
├─ Sa → No
├─ Sw → Size
│        ├─ S → Yes
│        └─ L → No
└─ So → Yes
(c) [3 points] Write the predicted labels using the constructed tree for the following test set. Does the constructed tree give zero test error?
(Size,Taste,Temperature) = (L, So, H), with true label Favorite = Yes
(Size,Taste,Temperature) = (S, Sw, C), with true label Favorite = Yes
Answer: The predicted labels are Yes and Yes, respectively, matching the true labels. Yes, the test error is zero.
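A minimal sketch encoding the constructed tree as nested conditionals and predicting the two test points:

```python
def predict(size, taste, temperature):
    # The tree from part (b): split on Taste at the root,
    # then on Size within the Taste = Sw branch.
    if taste == "Sa":
        return "No"
    if taste == "So":
        return "Yes"
    # taste == "Sw"
    return "Yes" if size == "S" else "No"

print(predict("L", "So", "H"))  # Yes (true label: Yes)
print(predict("S", "Sw", "C"))  # Yes (true label: Yes)
```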
4. [10 points total] Kernels
(a) [5 points] Let k(u, v) be a valid kernel function for u, v ∈ Rd and f : Rd → R be any real-valued
function. Prove or disprove that f(u)k(u, v)f(v) is a valid kernel function.
Answer: Yes, this is a valid kernel. Since k(u, v) is a valid kernel, it can be expressed as k(u, v) = ϕ(u)^T ϕ(v). Now consider the feature map ϕ′(u) = f(u)ϕ(u): then ϕ′(u)^T ϕ′(v) = f(u)ϕ(u)^T ϕ(v)f(v) = f(u)k(u, v)f(v), which by the definition of a valid kernel proves the claim.
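A quick numeric sanity check (a minimal sketch using the linear kernel and arbitrary f values): the Gram matrix of f(u)k(u, v)f(v) equals diag(f) K diag(f), which remains symmetric p.s.d.:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 random points in R^3
f = rng.normal(size=20)                       # arbitrary values f(x_i)

K = X @ X.T                                   # base linear kernel k(u, v) = u^T v
K_new = np.outer(f, f) * K                    # Gram matrix of f(u) k(u, v) f(v)

# A valid kernel's Gram matrix is symmetric p.s.d. (up to numerical tolerance)
print(np.min(np.linalg.eigvalsh(K_new)) >= -1e-10)  # True
```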
(b) [5 points] Given a symmetric p.s.d. matrix A ∈ R^(d×d), prove or disprove that k(u, v) = u^T Av is a valid kernel function for u, v ∈ R^d.
[Hint: Any symmetric p.s.d. matrix can be factorized as A = LL^T, where L is a lower triangular matrix (the Cholesky decomposition).]
Answer: Yes, this is a valid kernel. Use the feature map ϕ(u) = L^T u: then ϕ(u)^T ϕ(v) = u^T LL^T v = u^T Av, which by the definition of a valid kernel proves the claim.
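A quick numeric sanity check of this construction (a minimal sketch; the random A = BB^T is symmetric p.s.d. by construction):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))
A = B @ B.T                      # symmetric p.s.d. matrix
L = np.linalg.cholesky(A)        # A = L L^T

u, v = rng.normal(size=3), rng.normal(size=3)
phi_u, phi_v = L.T @ u, L.T @ v  # feature map phi(x) = L^T x

print(np.isclose(u @ A @ v, phi_u @ phi_v))  # True: u^T A v = phi(u)^T phi(v)
```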