关键词 > COMP3670/6670

COMP3670/6670 Introduction to Machine Learning Semester 2, 2020 Final Exam

发布时间：2022-11-06

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP3670/6670 Introduction to Machine Learning

Semester 2, 2020

Final Exam

● Section 1. Linear Algebra and Matrix Decomposition (13 points)

1. (6 points) Let {v1 , v2 } be linearly independent vectors in Rn . Let v3 be a vector in Rn that does not lie in the span of v1 , v2 . Prove that {v1 , v2 , v3 } is linearly independent.

Solution. Assume for a contradiction that {v1 , v2 , v3 } are not linearly independent. Then there exists constants c1 , c2 , c3 , not all zero, such that

c1v1 + c2v2 + c3v3 = 0

Now, if c3 = 0, then the above equation becomes

c1v1 + c2v2 = 0

and therefore {v1 , v2 } are linearly dependent, a contradiction. So, we have that c3 0. Hence

c1v1 + c2v2 + c3v3 = 0

c1v1 + c2v2 = _c3v3

and hence v3 lies in the span of v1 , v2 , a contradiction.

2. (7 points) Consider the matrix

┐

Find its eigenvalues. What does this matrix geometrically do when applied to a vector? Explain how this relates to the set of eigenvalues for this matrix.

Solution. We form the characteristic equation det(A _ λI)

det(A _ λI) = det ┐ = λ2 _ (_1)(1) = λ2 + 1

Setting it equal to zero and solving,

λ2 = 1 ÷ λ = 土^_1

We ﬁnd no eigenvalues, as we cannot square root a negative number (at least not in the real numbers). This matrix can be seen to perform a rotation of 90 degrees about the origin,

┐ ┌ ┐y(x) = ┌ _xy ┐

which means that no vector in R2 will be transformed into a scalar multiple of itself. This is why we see no real eigenvalues.

● Section 2. Analytic Geometry and Vector Calculus (12 points)

1. (6 points) Find all matrices T e R2←2 such that for any u e R2 ,

T (u) . u = 0

Solution. Let

T = ┌c(a) d(b)┐ , v = ┌ ┐y(x) .

Then,

T (v) . v = ┌c(a) d(b)┐ ┌ ┐y(x) . ┌ ┐y(x) = 0

┐ . ┌ ┐y(x) = 0

(ax + by)x + (cx + dy)y = 0

Choose x = 0, then

(a0 + by)0 + (c0 + dy)y = 0 ÷ dy2 = 0

Since dy2 = 0 needs to hold for all y, this is only possible when d = 0. Now, choose y = 0, then

(ax + b0)x + (cx + d0)0 = 0 ÷ ax2 = 0

Since ax2 = 0 needs to hold for all x, this is only possible when a = 0. Substituting this into the above,

(0x + by)x + (cx + 0y)y = 0

bxy + cxy = 0

Since this needs to hold for all x and for all y, it must be the case that b = _c. Hence,

{T e R2←2 : Av e R2 , (Tv) . v = 0} = ┐ : α e R}

2. (6 points) Let z, a e Rn←1, and deﬁne f : Rn←1 → R as f(z) = zT aaT z

Compute V∈f(z).

Solution. There are a few ways to do this. Students could take the derivative with respect

to a component of x, or could use product rule, or could recognize that zT aaT z = zT azT a = (zT a)2

and use chain rule. I will use the latter approach. Let h(x) = xT a and g(x) = x2 . Then f(x) = g(h(x)), and so

df dg dh

dx dh dx

Then, = 2h and = aT , by identities provided in lecture slides. Hence,

as required.

● Section 3. Probability (15 points)

Consider the following scenario. I ﬂip a fair coin.

If the coin comes up heads, I roll a fair 4 sided die (with sides {1, 2, 3, 4}), and then I tell you the result of rolling the die.

If the coin comes up tails, I roll a fair 6 sided die (with sides {1, 2, 3, 4, 5, 6}), and then I tell you the result of rolling the die.

Let X denote the number I tell you.

1. (3 points) What is the set of all possible outcomes x for X?

Solution. We either roll a 4-sided die, and get a number from 1 to 4, or roll a 6-sided die, and get a number from 1 to 6. Hence x = {1, 2, 3, 4, 5, 6}.

2. (4 points) Compute P (X = x) for all x e x .

Solution. First consider the case where x e {1, 2, 3, 4}. Let C denote the outcome of the coin, with c = {heads, tails} denoting the set of outcomes for the coin.

P (X = x) = P (X = x | C = c)P (C = c)

c=C

= P (X = x | C = heads)P (C = heads) + P (X = x | C = tails)P (C = tails) = 1/4 . 1/2 + 1/6 . 1/2 = 5/24

Then consider the case where x e {5, 6}, which will be the same as before, but now P (X = x | C = heads) = 0, as the four-sided die can’t roll 5’s or 6’s.

P (X = x) = P (X = x | C = c)P (C = c)

c=C

= P (X = x | C = heads)P (C = heads) + P (X = x | C = tails)P (C = tails) = 0 . 1/2 + 1/6 . 1/2 = 1/12

Hence,

P (X = x) =

x e {1, 2, 3, 4}

x e {5, 6}

3. (4 points) I tell you that X = 1. How likely is it that the coin ﬂipped heads?

Solution. We apply Bayes rule.

P (C = heads | X = 1)

P (X = 1 | C = heads)P (C = heads)

P (X = 1)

= = 3/5

4. (4 points) We repeat the above experiment, but this time whatever die is selected, is rolled twice. I inform you that the outcome for both rolls was a 1. How likely is it that the coin ﬂipped was heads?

Solution. Each die roll is i.i.d.

P (rolled one twice | C = heads)P (C = heads)

P (rolled one twice)

Now, P (rolled one twice | C = heads) = P (X = 1 | C = heads)2 = 1/16, and P (rolled one twice | C = tails) = P (X = 1 | C = tails)2 = 1/36

P (rolled one twice)

= P (rolled one twice | C = heads)P (C = heads) + P (rolled one twice | C = tails)P (C = tails = P (X = 1 | C = heads)2 (1/2) + P (X = 1 | C = tails)2 (1/2)

= 1/16 . 1/2 + 1/36 . 1/2 = 13/288

Hence,

P (C = heads | rolled one twice) = = 9/13

● Section 4. Clustering and Gaussian Mixture Model (GMM) (15 points)

Both Kmeans and GMM can be viewed as aiming to ﬁnd 9 to optimise p(疋 |9). Here, 疋 is the dataset, and 9 is related to the model. Answer the following questions.

1. (2 points) In kmeans, use no more than 2 sentences to describe what 9 contains.

Solution. 9 contains the coordinates of centroids, as well as the cluster ID that each sample is assigned to. Deduct 0.5 mark if answering the number of clusters.

2. (3 points) In kmeans, use no more than 2 sentences to describe the probabilistic meaning of p(疋 |9).

Solution. Given model parameters 9, the probability that dataset 疋 is generated by this model.

3. (3 points) Assume that samples in 疋 are from 3 classes. After training a GMM with 3 components on 疋 , we use this GMM as a classiﬁer to predict which class a new sample z belongs to. In no more than 3 sentences, describe the prediction process. (Use math symbols where relevant; you do not have to explain the symbols if they are same with lecture slides, e↓g↓ | u).

Solution. We obtain three Gaussian distributions from training. We compute p(z|9k), the probability that the test sample z is generated by the kth Gaussian distribution. We pick the Gaussian distribution that gives the largest probability as the label of this test sample.

4. (3 points) Is it correct to say that the kmeans method enables us to ﬁnd 9 that minimises p(疋 |9)? Explain your answer in 2 sentences.

Solution. Incorrect. First, we aim to maximise instead of minimise p(疋 |9). Second, kmeans enables us to ﬁnd a local maximum instead of a global one.

5. (4 points) Suppose we have the following 10 data points. They are partitioned into two classes, red and blue. Which model could generate this partition, kmeans only, GMM only, both, or neither? Explain your answer in three sentences. (Note: where GMM is relevant, the classiﬁcation principal is similar to Question 3 above, except that this question has two classes.)

Figure 1: 10 data points divided into two classes, blue and red.

Solution. GMM only. It cannot be generated by kmeans because of the right most point is closer to the center of the red class, so it will be classiﬁed as red. GMM can generate this ﬁgure because the blue class has a larger variance under GMM.

● Section 5. Linear Regression (13 points)

You are doing a machine learning internship at a bank, analysing user age and their daily expense. You collected seven samples and plotted them in Fig. 2(a).

90 90

Point A

80 80

70 70

60 60

50 50

40 40

3020 25 30 35 40 45 50 20 25 30 35 40 45 50

age age

(a) Regression with normal data (b) Regression with normal and abnormal data

Figure 2: Linear regression with normal and abnormal data.

1. (3 points) From your observation of Fig. 2(a), use 1 sentence to describe the relationship between user age and expense.

Solution. Expense increases linearly as age increases.

After obtaining Fig. 2(a), you collected a new data point A. Together with the previous seven points, you do linear regression for a second time and obtain the dashed line in Fig. 2(b).

2. (4 points) Generally when adding new samples to the dataset, it is expected that the ﬁtted line will be diﬀerent. In our example, there is quite a large diﬀerence between the new model (dashed line) and the old model (solid line). In two sentences, explain why the change is large. (Hint: you don’t have to explain why there is a “change” . Focus on “large” .)

Solution. 1) Point A deviates a lot from the other points, thus rendering a large gradi- ent/loss asking the model to be closer to it. 2) There are limited number of samples in this dataset, so the big gradient from A causes the model to move a lot.

3. (6 points) You originally used the squared error, i↓e↓ | for the nth training sample (北n, yn), l(北n, yn , 9) = (yn _ 9T 北n)2 ,

where 北n is the feature of the sample, yn is the label, and 9 contains the model parameters. Your supervisor tells you that Point A is an outlier and that it is best to exclude its impact on your model. Write down an amended loss function that can achieve this goal. Explain how it excludes the impact of outliers on your linear model. Note: you will get partial marks if your loss function can merely alleviate the impact of A.

Solution.

l(北n, yn , 9) = min((yn _ 9T 北n)2 _ m, 0).

m is a hyperparameter. From Fig. 1, its value should be approximately between 100 and 400 (it’s not necessary for a student to make this estimation). There could be other equivalent ones.

Explanation. For a proper value of m, the outlier point A will have a large (yn _ 9T 北n)2 , thus causing term (yn _ 9T 北n)2 _ m to be greater than 0. In this case, there will be 0 loss value. For other points, (yn _ 9T 北n)2 _ m will be less than 0, thus preserving the value after the min operator.

● Section 6. Principal Component Analysis (PCA) and Linear Regression (20 points)

We are given a centered dataset, (北1, y1 ), (北2, y2 ), ..., (北N, yN ), where 北n e R, yn e R, and N is the number of samples. “Centered” means n(N)=1 北n = 0, and n(N)=1 yn = 0. Now for this dataset, we apply linear regression and PCA. For linear regression, our model is y = θ北 +θ0 = 9z, where we treat yn as labels and 北n as the feature. For PCA, we obtain the ﬁrst and second principal components: pc1 and pc2 . An example is shown in Fig. 5.

-1

-2

-3

-4

-5

Figure 3: Linear regression with normal and abnormal data.

1. (3 points) Are pc1 and pc2 orthogonal? Explain your answer in two sentences.