CS M146, Spring 2022
Midterm Exam
1. [30 points total] General Conceptual Questions
For each of the statements below, state True or False. Explain your answer in 1-2 sentences. Each part below is worth 1 point for the correct assertion and 1 point for the correct explanation.
(a) [2 points] The likelihood function for linear regression with a squared error loss assumes that
the labels are distributed as per a Gaussian distribution with zero mean.
Answer: False, the labels are distributed as a Gaussian with mean θ^T x, not zero; the zero-mean assumption applies to the noise.
(b) [2 points] The logistic regression loss scales linearly with the probability of misclassification.
Answer: False, it scales logarithmically, as −log p(x), where p(x) is the probability assigned to the correct label.
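A quick numeric check (a minimal Python sketch) of the non-linear growth of the loss −log(1 − q) in the misclassification probability q:

```python
import math

# Logistic loss in terms of the misclassification probability q:
# loss = -log(p_correct) = -log(1 - q), which is clearly not linear in q.
for q in (0.1, 0.5, 0.9, 0.99):
    print(f"q = {q}: loss = {-math.log(1 - q):.3f}")
```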
(c) [2 points] The derivative of sigmoid(z) at z = 0 is 1/4.
Answer: True, the derivative of sigmoid(z) is sigmoid(z)(1 − sigmoid(z)), which equals 1/4 at z = 0.
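This is easy to verify numerically; a minimal sketch comparing the analytic derivative with a finite-difference estimate at z = 0:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Analytic derivative: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
analytic = sigmoid(0.0) * (1 - sigmoid(0.0))   # 0.25

# Central finite-difference check at z = 0
h = 1e-6
numeric = (sigmoid(h) - sigmoid(-h)) / (2 * h)

print(analytic, numeric)  # both ~0.25
```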
(d) [2 points] Consider a multi-class classification problem consisting of d input features and c classes. Then, the number of unknown parameters for logistic regression using a softmax is d + c.
Answer: False, the total number of unknown parameters for softmax regression is cd (one d-dimensional weight vector per class).
(e) [2 points] For optimizing a convex function, stochastic gradient descent requires more gradient
updates than mini-batch gradient descent with batch size > 1 to achieve convergence. Assume a common learning rate that guarantees convergence for both algorithms.
Answer: True, stochastic gradient descent uses noisier gradient estimates and hence requires more gradient updates for convergence.
(f) [2 points] ℓ2 regularization can improve generalization of a ML model by reducing its bias.
Answer: False, ℓ2 regularization curbs overfitting by reducing variance for high-variance models; it typically increases bias.
(g) [2 points] If we have m values for a hyperparameter of a classifier and we use k-fold cross-validation for hyperparameter tuning, then we train the classifier mk times from scratch.
Answer: True, each hyperparameter value requires training the classifier k times, once per fold.
(h) [2 points] If the gap between learning curves for the training and validation datasets is large, we
can reduce it by decreasing the complexity of our model class.
Answer: True, a large gap indicates overfitting, which a less complex model class reduces.
(i) [2 points] Consider a training dataset for binary classification with n1 positive examples and n2 negative examples, where n1 < n2. Further, the test set consists of t1 positive examples and t2 negative examples with t1 > t2. The test accuracy for a kNN classifier with k = n1 + n2 is t2/(t1 + t2).
Answer: True, with k equal to the full training set size, the majority vote is always negative, so all examples in the test set are classified as negative and only the t2 negative ones are correct.
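A minimal sketch of this degenerate case (the class counts below are made-up values satisfying n1 < n2 and t1 > t2):

```python
from collections import Counter

# With k = n1 + n2 (the entire training set), every test point is classified
# by a majority vote over ALL training labels, i.e., always the majority class.
n1, n2 = 3, 7        # hypothetical counts with n1 < n2
t1, t2 = 6, 4        # hypothetical counts with t1 > t2

train_labels = [+1] * n1 + [-1] * n2
majority = Counter(train_labels).most_common(1)[0][0]
print(majority)                # -1: every test example is predicted negative

accuracy = t2 / (t1 + t2)      # only the t2 negative test examples are correct
print(accuracy)                # 0.4 for these counts
```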
(j) [2 points] Both kNN and perceptrons can be used to learn decision boundaries that are non-linear
w.r.t. the input attributes.
Answer: True, for perceptrons, this requires using a non-linear basis function/feature map/kernelization.
(k) [2 points] A kNN classifier for k = 10 requires more memory at test time than k = 2.
Answer: False, both procedures require storing the entire training dataset and consume the same amount of memory.
(l) [2 points] The ID3 algorithm for constructing decision trees chooses to split on attributes with the highest entropy.
Answer: False, ID3 splits on the attribute with the highest information gain, not the highest entropy.
(m) [2 points] The depth of a decision tree learned using ID3 algorithm can be larger than the number of training examples. Assume that the depth of the root node is 0.
Answer: False, each split along a branch reduces the number of training examples at that node by at least one, so the depth cannot exceed the number of training examples.
(n) [2 points] If k(u, v) is a valid kernel function for some u, v ∈ Rd , then ck(u, v) is also a valid
kernel function where c is any real-valued scalar.
Answer: False, we need c ≥ 0: then ck(u, v) can be factorized as (√c ϕ(u))^T (√c ϕ(v)) and is valid, whereas c < 0 would give ck(u, u) < 0, which no valid kernel allows.
(o) [2 points] The cost of making predictions with a kernelized linear model is O(n), where n is the size of the training dataset. Assume that the cost of evaluating a kernel function between any pair of points is constant.
Answer: True, the predictions from a kernelized linear model are given as hθ(x) = g(Σ_{i=1}^n βi k(x^(i), x)), which requires one kernel evaluation per training example.
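A minimal sketch of such a prediction routine (the RBF kernel and the sign link g are illustrative choices, not specified by the question):

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    # k(u, v) = exp(-gamma * ||u - v||^2); any valid kernel works here
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def predict(x, train_X, beta, kernel=rbf_kernel):
    # h(x) = g(sum_i beta_i * k(x^(i), x)): one kernel evaluation per
    # training point, so the cost is O(n) for n training examples.
    score = sum(b * kernel(xi, x) for b, xi in zip(beta, train_X))
    return 1 if score >= 0 else -1  # g = sign for binary classification

# Example with two training points in R^2 and hypothetical coefficients beta
train_X = [(0.0, 1.0), (1.0, 0.0)]
beta = [0.5, -0.25]
print(predict((0.2, 0.8), train_X, beta))
```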
2. [10 points total] Linear Models
Consider the setting of binary classification with +1/−1 labels. We have access to a training dataset of n examples, {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))}. To learn a binary classifier, we will consider a linear hypothesis class consisting of the following hypothesis:
hθ(x) = sign(θ^T x)    (1)
Here, x ∈ R^(d+1) and θ ∈ R^(d+1). We set x0 = 1 so that θ0 is the bias parameter and θ_{1:d} are the weight parameters. In this problem, we consider the following empirical risk:
R(θ) = (1/n) Σ_{i=1}^n log(1 + exp(−y^(i) θ^T x^(i)))    (2)
To learn this classifier, we propose to further regularize the above risk and minimize the following regularized loss:
J(θ) = (1/n) Σ_{i=1}^n log(1 + exp(−y^(i) θ^T x^(i))) + λ ∥θ_{1:d}∥_2^2    (3)
where λ > 0 is the regularization coefficient.
(a) [3 points] Derive a gradient update rule for minimizing J(θ) w.r.t. the weight parameters. Assume the learning rate is α.
Answer: Gradient update rule: for all j = 1, 2, ..., d:
θj ← θj − α ∂J(θ)/∂θj
where:
∂J(θ)/∂θj = −(1/n) Σ_{i=1}^n y^(i) x_j^(i) exp(−y^(i) θ^T x^(i)) / (1 + exp(−y^(i) θ^T x^(i))) + 2λθj
Alternative correct answers will use vectorized notation,
θ_{1:d} ← θ_{1:d} − α ∇_{θ_{1:d}} J(θ)
where:
∇_{θ_{1:d}} J(θ) = −(1/n) Σ_{i=1}^n y^(i) x^(i)_{1:d} exp(−y^(i) θ^T x^(i)) / (1 + exp(−y^(i) θ^T x^(i))) + 2λ θ_{1:d}
(b) [3 points] Derive a gradient update rule for minimizing J(θ) w.r.t. the bias parameter. Assume the learning rate is α.
Answer:
Gradient update rule:
θ0 ← θ0 − α ∂J(θ)/∂θ0
where (recall x0^(i) = 1 and the bias is not regularized):
∂J(θ)/∂θ0 = −(1/n) Σ_{i=1}^n y^(i) exp(−y^(i) θ^T x^(i)) / (1 + exp(−y^(i) θ^T x^(i)))
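A minimal sketch implementing the gradients from parts (a) and (b) in vectorized form (NumPy, with X carrying a leading all-ones bias column as assumed above):

```python
import numpy as np

def grad_J(theta, X, y, lam):
    """Gradient of J(theta) from Eq. (3); X has shape (n, d+1) with
    X[:, 0] = 1 (bias column), and y holds +1/-1 labels."""
    n = X.shape[0]
    margins = y * (X @ theta)            # m_i = y_i * theta^T x_i
    # exp(-m) / (1 + exp(-m)) == 1 / (1 + exp(m)), an equivalent form
    coef = -y / (1.0 + np.exp(margins))
    grad = (X.T @ coef) / n              # (1/n) * sum_i coef_i * x^(i)
    reg = 2.0 * lam * theta
    reg[0] = 0.0                         # the bias theta_0 is not regularized
    return grad + reg

def gd_step(theta, X, y, lam, alpha):
    # Single update covering parts (a) and (b): theta <- theta - alpha * grad
    return theta - alpha * grad_J(theta, X, y, lam)
```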
(c) [2 points] Assume we fix θ0 = 0 and choose a very large λ → ∞. Let θ̂ denote the optimal solution obtained by minimizing J(θ) under these assumptions. What is the corresponding empirical risk R(θ̂)? Explain your answer.
Answer: As λ → ∞, θ̂ → 0, so every term becomes log(1 + exp(0)) = log 2 and hence R(θ̂) = log 2.
(d) [2 points] Recall, in class, we studied the perceptron model with the same hypothesis as Eq. (1). However, we defined the empirical risk for the perceptron differently, as:
Rperceptron(θ) = (1/n) Σ_{i=1}^n max(0, −y^(i) θ^T x^(i))    (4)
Consider a hypothesis with parameters θ* that achieves full accuracy on the training dataset, i.e., hθ*(x^(i)) = y^(i) for all i = 1, 2, ..., n. Is Rperceptron(θ*) > R(θ*)? Explain your answer.
Answer: No.
Rperceptron(θ*) = 0, since y^(i) θ*^T x^(i) ≥ 0 for all i.
R(θ) > 0 for all values of θ (including θ*), since each term log(1 + exp(·)) is strictly positive. Hence R(θ*) > Rperceptron(θ*).
3. [10 points total] Decision Trees
Suppose you have the following data set of 10 instances to determine whether Alice rates a dish as a favorite.
 #  | Size | Taste | Temperature | Favorite
----|------|-------|-------------|---------
 1  |  S   |  Sa   |      H      |   No
 2  |  L   |  Sw   |      C      |   No
 3  |  L   |  Sw   |      C      |   No
 4  |  S   |  So   |      C      |   Yes
 5  |  S   |  So   |      H      |   Yes
 6  |  L   |  Sa   |      H      |   No
 7  |  L   |  So   |      H      |   Yes
 8  |  S   |  Sw   |      C      |   Yes
 9  |  S   |  Sw   |      C      |   Yes
10  |  L   |  Sa   |      H      |   No
Table 1: Abbreviations: S: Small, L: Large, Sa: Salty, Sw: Sweet, So: Sour, H: Hot, C: Cold
The data set consists of 3 features: Size, Taste, Temperature, and a corresponding label Favorite. For any log you use in this question, please use log2.
[Hint: For correct calculations, you should not require any non-trivial log2 computations.]
(a) [3 points] Compute the information gain for the Favorite label and the Taste attribute.
Answer:
H(Favorite) = −(5/10) log2(5/10) − (5/10) log2(5/10) = 1
H(Favorite|Taste) = (3/10) × 0 + (4/10) × 1 + (3/10) × 0 = 0.4
(Sa: 3 examples, all No; So: 3 examples, all Yes; Sw: 4 examples, 2 Yes and 2 No, with entropy −0.5 log2 0.5 − 0.5 log2 0.5 = 1.)
I(Favorite, Taste) = H(Favorite) − H(Favorite|Taste) = 1 − 0.4 = 0.6
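These quantities can be checked with a short script over Table 1 (a minimal sketch; attribute index 1 corresponds to Taste):

```python
from collections import Counter
from math import log2

data = [  # (Size, Taste, Temperature, Favorite) rows of Table 1
    ("S","Sa","H","No"), ("L","Sw","C","No"), ("L","Sw","C","No"),
    ("S","So","C","Yes"), ("S","So","H","Yes"), ("L","Sa","H","No"),
    ("L","So","H","Yes"), ("S","Sw","C","Yes"), ("S","Sw","C","Yes"),
    ("L","Sa","H","No"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_idx):
    labels = [row[-1] for row in data]
    cond = 0.0
    for value in set(row[attr_idx] for row in data):
        subset = [row[-1] for row in data if row[attr_idx] == value]
        cond += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - cond

print(info_gain(1))  # Taste -> 1 - 0.4 = 0.6
```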
(b) [4 points] Suppose we picked Taste as the root of the tree. Construct the full decision tree using
the ID3 algorithm.
Answer:
We need to pick between Size and Temperature as the next splitting attribute for the branch Taste = Sw. All other branches have perfect classification accuracy and directly lead to leaf nodes.
H(Favorite|Size, Taste = Sw) = (2/4) × 0 + (2/4) × 0 = 0
H(Favorite|Temperature, Taste = Sw) = 1 × (−0.5 log2 0.5 − 0.5 log2 0.5) = 1 (all four Sw examples have Temperature = C, so this split gives no information)
Hence, the next attribute to split on is Size. The two resulting branches have perfect classification accuracy and hence we can stop. The resulting tree:
Taste
├─ Sa → No
├─ Sw → Size
│        ├─ S → Yes
│        └─ L → No
└─ So → Yes
(c) [3 points] Write the predicted labels using the constructed tree for the following test set. Does the constructed tree give zero test error?
(Size,Taste,Temperature) = (L, So, H), with true label Favorite = Yes
(Size,Taste,Temperature) = (S, Sw, C), with true label Favorite = Yes
Answer: The predicted labels are Yes and Yes, respectively, matching the true labels. Yes, the test error is zero.
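A minimal sketch encoding the constructed tree as nested conditionals and predicting the two test points:

```python
def predict(size, taste, temperature):
    # The tree from part (b): split on Taste at the root,
    # then on Size within the Taste = Sw branch.
    if taste == "Sa":
        return "No"
    if taste == "So":
        return "Yes"
    # taste == "Sw"
    return "Yes" if size == "S" else "No"

print(predict("L", "So", "H"))  # Yes (true label: Yes)
print(predict("S", "Sw", "C"))  # Yes (true label: Yes)
```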
4. [10 points total] Kernels
(a) [5 points] Let k(u, v) be a valid kernel function for u, v ∈ Rd and f : Rd → R be any real-valued
function. Prove or disprove that f(u)k(u, v)f(v) is a valid kernel function.
Answer: Yes, this is a valid kernel. Since k(u, v) is a valid kernel, it can be expressed as k(u, v) = ϕ(u)^T ϕ(v). Now consider the feature map ϕ′(u) = f(u)ϕ(u): then ϕ′(u)^T ϕ′(v) = f(u)ϕ(u)^T ϕ(v)f(v) = f(u)k(u, v)f(v), which by the definition of a valid kernel proves the claim.
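A quick numeric sanity check (a minimal sketch using the linear kernel and arbitrary f values): the Gram matrix of f(u)k(u, v)f(v) equals diag(f) K diag(f), which remains symmetric p.s.d.:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 random points in R^3
f = rng.normal(size=20)                       # arbitrary values f(x_i)

K = X @ X.T                                   # base linear kernel k(u, v) = u^T v
K_new = np.outer(f, f) * K                    # Gram matrix of f(u) k(u, v) f(v)

# A valid kernel's Gram matrix is symmetric p.s.d. (up to numerical tolerance)
print(np.min(np.linalg.eigvalsh(K_new)) >= -1e-10)  # True
```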
(b) [5 points] Given a symmetric p.s.d. matrix A ∈ R^(d×d), prove or disprove that k(u, v) = u^T Av is a valid kernel function for u, v ∈ R^d.
[Hint: Any symmetric p.s.d. matrix can be factorized as A = LL^T, where L is a lower triangular matrix (the Cholesky decomposition).]
Answer: Yes, this is a valid kernel. Use the feature map ϕ(u) = L^T u: then ϕ(u)^T ϕ(v) = u^T LL^T v = u^T Av, which by the definition of a valid kernel proves the claim.
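A quick numeric sanity check of this construction (a minimal sketch; the random A = BB^T is symmetric p.s.d. by construction):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))
A = B @ B.T                      # symmetric p.s.d. matrix
L = np.linalg.cholesky(A)        # A = L L^T

u, v = rng.normal(size=3), rng.normal(size=3)
phi_u, phi_v = L.T @ u, L.T @ v  # feature map phi(x) = L^T x

print(np.isclose(u @ A @ v, phi_u @ phi_v))  # True: u^T A v = phi(u)^T phi(v)
```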