CSCI-UA 9473 - Introduction to Machine Learning Final III 2022
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
CSCI-UA 9473 - Introduction to Machine Learning Final III
2022
Question 1 (Supervised Learning 25pts)
1. Indicate whether the following statements are true or false (5pts)
True / False True / False
True / False
True / False True / False True / False
True / False
True / False
True / False
LASSO is a parametric method
Suppose that we have a regularized linear regression model argmin|t _ Xβ|2(2) + λ|β|1 .-
Increasing λ will increase the variance and decrease the bias
Suppose that we have a regularized linear regression model argmin|t _ Xβ|2(2) + λ|β|1 .-
Increasing λ will increase the bias and decrease the variance
Ridge regression minimizes the squared /2 norm as a penalty on the regression weights Ridge regression minimizes the /2 norm as a penalty on the regression weights
At the ith iteration of online perceptron learning, you have a model h- and you receive a new instance x(i+1) . You find out that your current model misclassifies the instance as h- (i) (x(i+1)) = +1 when the actual label is t(i+1) = _1 . You update the model using the perceptron algorithm and get a new classifier hβ (i+1) . hβ (i+1) is guaranteed to classify x(i+1) correctly as _1
lal2(2) You are using a Maximum Margin classifier with a Gaussian kernel defined as e − o 2
for a classification problem. You find that the training accuracy is 0.97 but the
test accuracy is .65 . The test accuracy can be improved by increasing the kernel width σ
lal2(2) You are using a Maximum Margin classifier with a Gaussian kernel defined as e − o 2
for a classification problem. You find that the training accuracy is 0.97 but the
test accuracy is .65 . The test accuracy can be improved by decreasing the kernel width σ Consider the dataset shown in Fig. 1 . This dataset can be made separable by
a linear SVM using only two support vectors .
True / False
The LASSO formulation is more likely to to set some of the regression weights to zero than is Ridge
2. [6pts] You have a single hidden- layer neural network for a binary classification task. The input is x e RD , output y e R and the true label t e R . The forward propagation equations are
a[1] = W [1] x + b
z [1] = σ(a[1] )
y(x(i)) = z [1]
m
/ = _ t(i) log(y(x(i))) + (1 _ t(i)) log(1 _ y(x(i)))
i=1
(1)
(2)
(3)
(4)
a) Explain why minimizing the loss (4) corresponds to looking for the maximum likelihood estimator of the network.
b) Write the expression for as a matrix product of two terms .
c) Explain how to extend your result to more hidden layers .
3. [2pts] Why is it important to place non- linearities between the layers of neural networks?
4. [5pts] You want to perform a classification task. You are hesitant between two choices: Approach 1 and Approach 2. The only difference between these two approaches is the loss function that is minimized. Assume that x(i) e R and t(i) e {+1, _1}, i = 1, . . . , m are the ith example and output label in the dataset, respectively. f (x(i)) denotes the output of the classifier for the ith example . Recall that for a given loss /, you minimize the cost
J = /(f (x(i)), t(i))
(5)
As we mentioned, the only difference between approach 1 and approach 2 is the choice of the loss function:
/1 (f (x(i)), t(i)) = max {0, 1 _ t(i)f (x(i)) }
/2 (f (x(i)), t(i)) = log2 (1 + exp(_t(i)f (x(i))))
(6)
(7)
(a) Rewrite /2 in terms of the sigmoid function.
(b) You are given an example with t(i) = _1 . What value of f (x(i)) will minimize /2 ?
(c) Assume that an outlier (very far from the decision boundary but in the right class) is added to the dataset. How will that affect classifier (6) ? Why?
(d) You are given an example with t(i) = _1 . What is the greatest value of f (x(i)) that will minimize /1 ?
(e) You would like a classifier whose output can be interpreted as a probability. Which loss function is
better and why?
5. [5pts] Give the pseudo code for the one vs rest classifier.
6. [2pts] We consider the following kernel
k(x, y) = xy + x2 y2 +^xy + cos(x _ y)
is this a valid kernel? Why?
Question 2 (Unsupervised 20pts)
|
1 1.5 2 2.5 3 3.5
x (input)
Figure 1: Classification dataset used in Question 1
Figure 2: Unsupervised dataset used in Question 2.
1. Indicate whether the following statements are true or false (5pts)
True / False |
Both PCA and linear regression can be thought of as algorithms for minimizing a sum of squared errors |
True / False |
Implementing K-medoid is more expensive (in terms of computation) than implementing K-means |
True / False |
The A priori algorithm works by discarding item sets whose support is smaller than a given threshold |
True / False |
When merging subclusters, single linkage clustering favors subclusters whose combination will have the smallest diameter |
True / False |
In Factor Analysis, if we write the decomposition into latent factors as x = Wz , the factor loading matrix W is defined up to a rotation. |
True / False |
Applying Expectation Maximization on a Gaussian Mixture model can be interpreted as a probabilistic version of K-means |
2. Principal component analysis is a dimensionality reduction method that projects a dataset onto onto its most variable components . You are give the datasets shown in Fig. 2. Draw the first and second compo- nents on each plot (if you work with pen and paper, just label the subplots as a, b and c and sketch the corresponding directions on your piece of paper) . Explain how one can compute those components . [5pts]
3. Give the pseudo- code for K-means [5pts]. What happens when there is an empty cluster?
4. Explain how to merge the clusters in agglomerative clustering [3pts].
5. Explain the difference between Principal Component Analysis and Factor Analysis [2pts]
2022-07-01