闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CSCI-UA 9473 - Introduction to Machine Learning Final III

2022

Question 1 (Supervised Learning 25pts)

1. Indicate whether the following statements are true or false (5pts)

True / False True / False

True / False

True / False True / False True / False

True / False

LASSO is a parametric method

Suppose that we have a regularized linear regression model argmin|t _ Xβ|2(2) + λ|β|1 .-

Increasing λ will increase the variance and decrease the bias

Suppose that we have a regularized linear regression model argmin|t _ Xβ|2(2) + λ|β|1 .-

Increasing λ will increase the bias and decrease the variance

Ridge regression minimizes the squared /2 norm as a penalty on the regression weights Ridge regression minimizes the /2 norm as a penalty on the regression weights

At the ith iteration of online perceptron learning, you have a model h- and you receive a new instance x(i+1) . You ﬁnd out that your current model misclassiﬁes the instance as h- (i) (x(i+1)) = +1 when the actual label is t(i+1) = _1 . You update the model using the perceptron algorithm and get a new classiﬁer hβ (i+1) . hβ (i+1) is guaranteed to classify x(i+1) correctly as _1

lal2(2) You are using a Maximum Margin classiﬁer with a Gaussian kernel deﬁned as e − o 2

for a classiﬁcation problem. You ﬁnd that the training accuracy is 0.97 but the

test accuracy is .65 . The test accuracy can be improved by increasing the kernel width σ

lal2(2) You are using a Maximum Margin classiﬁer with a Gaussian kernel deﬁned as e − o 2

for a classiﬁcation problem. You ﬁnd that the training accuracy is 0.97 but the

test accuracy is .65 . The test accuracy can be improved by decreasing the kernel width σ Consider the dataset shown in Fig. 1 . This dataset can be made separable by

a linear SVM using only two support vectors .

True / False

The LASSO formulation is more likely to to set some of the regression weights to zero than is Ridge

2. [6pts] You have a single hidden- layer neural network for a binary classiﬁcation task. The input is x e RD , output y e R and the true label t e R . The forward propagation equations are

a[1] = W [1] x + b

z [1] = σ(a[1] )

y(x(i)) = z [1]

/ = _ t(i) log(y(x(i))) + (1 _ t(i)) log(1 _ y(x(i)))

i=1

(1)

(2)

(3)

(4)

a) Explain why minimizing the loss (4) corresponds to looking for the maximum likelihood estimator of the network.

b) Write the expression for as a matrix product of two terms .

c) Explain how to extend your result to more hidden layers .

3. [2pts] Why is it important to place non- linearities between the layers of neural networks?

4. [5pts] You want to perform a classiﬁcation task. You are hesitant between two choices: Approach 1 and Approach 2. The only diﬀerence between these two approaches is the loss function that is minimized. Assume that x(i) e R and t(i) e {+1, _1}, i = 1, . . . , m are the ith example and output label in the dataset, respectively. f (x(i)) denotes the output of the classiﬁer for the ith example . Recall that for a given loss /, you minimize the cost

J = /(f (x(i)), t(i))

(5)

As we mentioned, the only diﬀerence between approach 1 and approach 2 is the choice of the loss function:

/1 (f (x(i)), t(i)) = max {0, 1 _ t(i)f (x(i)) }

/2 (f (x(i)), t(i)) = log2 (1 + exp(_t(i)f (x(i))))

(6)

(7)

(a) Rewrite /2 in terms of the sigmoid function.

(b) You are given an example with t(i) = _1 . What value of f (x(i)) will minimize /2 ?

(c) Assume that an outlier (very far from the decision boundary but in the right class) is added to the dataset. How will that aﬀect classiﬁer (6) ? Why?

(d) You are given an example with t(i) = _1 . What is the greatest value of f (x(i)) that will minimize /1 ?

(e) You would like a classiﬁer whose output can be interpreted as a probability. Which loss function is

better and why?

5. [5pts] Give the pseudo code for the one vs rest classiﬁer.

6. [2pts] We consider the following kernel

k(x, y) = xy + x2 y2 +^xy + cos(x _ y)

is this a valid kernel? Why?

Question 2 (Unsupervised 20pts)

1 1.5 2 2.5 3 3.5

x (input)

Figure 1: Classiﬁcation dataset used in Question 1

Figure 2: Unsupervised dataset used in Question 2.

1. Indicate whether the following statements are true or false (5pts)

True / False	Both PCA and linear regression can be thought of as algorithms for minimizing a sum of squared errors
True / False	Implementing K-medoid is more expensive (in terms of computation) than implementing K-means
True / False	The A priori algorithm works by discarding item sets whose support is smaller than a given threshold
True / False	When merging subclusters, single linkage clustering favors subclusters whose combination will have the smallest diameter
True / False	In Factor Analysis, if we write the decomposition into latent factors as x = Wz , the factor loading matrix W is deﬁned up to a rotation.
True / False	Applying Expectation Maximization on a Gaussian Mixture model can be interpreted as a probabilistic version of K-means

2. Principal component analysis is a dimensionality reduction method that projects a dataset onto onto its most variable components . You are give the datasets shown in Fig. 2. Draw the ﬁrst and second compo- nents on each plot (if you work with pen and paper, just label the subplots as a, b and c and sketch the corresponding directions on your piece of paper) . Explain how one can compute those components . [5pts]

3. Give the pseudo- code for K-means [5pts]. What happens when there is an empty cluster?

4. Explain how to merge the clusters in agglomerative clustering [3pts].

5. Explain the diﬀerence between Principal Component Analysis and Factor Analysis [2pts]