COMP3670/6670 Introduction to Machine Learning
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
COMP3670/6670 Introduction to Machine Learning
Semester 2, 2020
Final Exam
• Write your name and UID on the first page (you will be fine if you forget to write them).
• This is an open book exam. You may bring in any materials including electronic and paper-
based ones. Any calculators (programmable included) are allowed. No communication devices
are permitted during the exam.
• Reading time: 30 minutes
• Writing time: 180 minutes
• For all the questions, write your answer CLEARLY on papers prepared by yourself.
• There are totally 8 pages (including the cover page)
• Points possible: 100
• This is not a hurdle.
• When you are asked to provide a justification to your answer, if your justification is incorrect,
you will get 0.
• Section 1. Linear Algebra and Matrix Decomposition (13 points)
1. (6 points) Let {v1,v2} be linearly independent vectors in Rn. Let v3 be a vector in Rn that
does not lie in the span of v1,v2. Prove that {v1,v2,v3} is linearly independent.
Solution. Assume for a contradiction that {v1,v2,v3} are not linearly independent. Then
there exists constants c1, c2, c3, not all zero, such that
c1v1 + c2v2 + c3v3 = 0
Now, if c3 = 0, then the above equation becomes
c1v1 + c2v2 = 0
and therefore {v1,v2} are linearly dependent, a contradiction. So, we have that c3 6= 0.
Hence
c1v1 + c2v2 + c3v3 = 0
c1v1 + c2v2 = −c3v3
−c1
c3
v1 +
−c2
c3
v2 = v3
and hence v3 lies in the span of v1,v2, a contradiction.
2. (7 points) Consider the matrix [
0 −1
1 0
]
Find its eigenvalues. What does this matrix geometrically do when applied to a vector?
Explain how this relates to the set of eigenvalues for this matrix.
Solution. We form the characteristic equation det(A− λI)
det(A− λI) = det
[−λ −1
1 λ
]
= λ2 − (−1)(1) = λ2 + 1
Setting it equal to zero and solving,
λ2 = 1⇒ λ = ±√−1
We find no eigenvalues, as we cannot square root a negative number (at least not in the real
numbers). This matrix can be seen to perform a rotation of 90 degrees about the origin,[
0 −1
1 0
] [
x
y
]
=
[−y
x
]
which means that no vector in R2 will be transformed into a scalar multiple of itself. This
is why we see no real eigenvalues.
• Section 2. Analytic Geometry and Vector Calculus (12 points)
1. (6 points) Find all matrices T ∈ R2×2 such that for any v ∈ R2,
T (v) · v = 0
Solution. Let
T =
[
a b
c d
]
,v =
[
x
y
]
.
Then,
T (v) · v =
[
a b
c d
] [
x
y
]
·
[
x
y
]
= 0[
ax+ by
cx+ dy
]
·
[
x
y
]
= 0
(ax+ by)x+ (cx+ dy)y = 0
Choose x = 0, then
(a0 + by)0 + (c0 + dy)y = 0⇒ dy2 = 0
Since dy2 = 0 needs to hold for all y, this is only possible when d = 0. Now, choose y = 0,
then
(ax+ b0)x+ (cx+ d0)0 = 0⇒ ax2 = 0
Since ax2 = 0 needs to hold for all x, this is only possible when a = 0. Substituting this
into the above,
(0x+ by)x+ (cx+ 0y)y = 0
bxy + cxy = 0
Since this needs to hold for all x and for all y, it must be the case that b = −c. Hence,
{T ∈ R2×2 : ∀v ∈ R2, (Tv) · v = 0} =
{[
0 α
−α 0
]
: α ∈ R
}
2. (6 points) Let x,a ∈ Rn×1, and define f : Rn×1 → R as
f(x) = xTaaTx
Compute ∇xf(x).
Solution. There are a few ways to do this. Students could take the derivative with respect
to a component of x, or could use product rule, or could recognize that
xTaaTx = xTaxTa = (xTa)2
and use chain rule. I will use the latter approach. Let h(x) = xTa and g(x) = x2. Then
f(x) = g(h(x)), and so
df
dx
=
dg
dh
dh
dx
Then, dgdh = 2h and
dh
dx = a
T , by identities provided in lecture slides. Hence,
df
dx
= 2haT = 2xTaaT .
as required.
• Section 3. Probability (15 points)
Consider the following scenario. I flip a fair coin.
If the coin comes up heads, I roll a fair 4 sided die (with sides {1, 2, 3, 4}), and then I tell you
the result of rolling the die.
If the coin comes up tails, I roll a fair 6 sided die (with sides {1, 2, 3, 4, 5, 6}), and then I tell you
the result of rolling the die.
Let X denote the number I tell you.
1. (3 points) What is the set of all possible outcomes X for X?
Solution. We either roll a 4-sided die, and get a number from 1 to 4, or roll a 6-sided die,
and get a number from 1 to 6. Hence X = {1, 2, 3, 4, 5, 6}.
2. (4 points) Compute P (X = x) for all x ∈ X .
Solution. First consider the case where x ∈ {1, 2, 3, 4}. Let C denote the outcome of the
coin, with C = {heads, tails} denoting the set of outcomes for the coin.
P (X = x) =
∑
c∈C
P (X = x | C = c)P (C = c)
= P (X = x | C = heads)P (C = heads) + P (X = x | C = tails)P (C = tails)
= 1/4 · 1/2 + 1/6 · 1/2 = 5/24
Then consider the case where x ∈ {5, 6}, which will be the same as before, but now P (X =
x | C = heads) = 0, as the four-sided die can’t roll 5’s or 6’s.
P (X = x) =
∑
c∈C
P (X = x | C = c)P (C = c)
= P (X = x | C = heads)P (C = heads) + P (X = x | C = tails)P (C = tails)
= 0 · 1/2 + 1/6 · 1/2 = 1/12
Hence,
P (X = x) =
{
5/24 x ∈ {1, 2, 3, 4}
1/12 x ∈ {5, 6}
3. (4 points) I tell you that X = 1. How likely is it that the coin flipped heads?
Solution. We apply Bayes rule.
P (C = heads | X = 1)
=
P (X = 1 | C = heads)P (C = heads)
P (X = 1)
=
1/4 · 1/2
5/24
= 3/5
4. (4 points) We repeat the above experiment, but this time whatever die is selected, is rolled
twice. I inform you that the outcome for both rolls was a 1. How likely is it that the coin
flipped was heads?
Solution. Each die roll is i.i.d.
P (C = heads | rolled one twice) = P (rolled one twice | C = heads)P (C = heads)
P (rolled one twice)
Now, P (rolled one twice | C = heads) = P (X = 1 | C = heads)2 = 1/16, and
P (rolled one twice | C = tails) = P (X = 1 | C = tails)2 = 1/36
P (rolled one twice)
= P (rolled one twice | C = heads)P (C = heads) + P (rolled one twice | C = tails)P (C = tails
= P (X = 1 | C = heads)2(1/2) + P (X = 1 | C = tails)2(1/2)
= 1/16 · 1/2 + 1/36 · 1/2 = 13/288
Hence,
P (C = heads | rolled one twice) = 1/16 · 1/2
13/288
= 9/13
• Section 4. Clustering and Gaussian Mixture Model (GMM) (15 points)
Both Kmeans and GMM can be viewed as aiming to find θ to optimise p(X |θ). Here, X is the
dataset, and θ is related to the model. Answer the following questions.
1. (2 points) In kmeans, use no more than 2 sentences to describe what θ contains.
Solution. θ contains the coordinates of centroids, as well as the cluster ID that each
sample is assigned to. Deduct 0.5 mark if answering the number of clusters.
2. (3 points) In kmeans, use no more than 2 sentences to describe the probabilistic meaning
of p(X |θ).
Solution. Given model parameters θ, the probability that dataset X is generated by this
model.
3. (3 points) Assume that samples in X are from 3 classes. After training a GMM with 3
components on X , we use this GMM as a classifier to predict which class a new sample x
belongs to. In no more than 3 sentences, describe the prediction process. (Use math symbols
where relevant; you do not have to explain the symbols if they are same with lecture slides,
e.g., µ).
Solution. We obtain three Gaussian distributions from training. We compute p(x|θk), the
probability that the test sample x is generated by the kth Gaussian distribution. We pick
the Gaussian distribution that gives the largest probability as the label of this test sample.
4. (3 points) Is it correct to say that the kmeans method enables us to find θ that minimises
p(X |θ)? Explain your answer in 2 sentences.
Solution. Incorrect. First, we aim to maximise instead of minimise p(X |θ). Second, kmeans
enables us to find a local maximum instead of a global one.
5. (4 points) Suppose we have the following 10 data points. They are partitioned into two
classes, red and blue. Which model could generate this partition, kmeans only, GMM only,
both, or neither? Explain your answer in three sentences. (Note: where GMM is relevant,
the classification principal is similar to Question 3 above, except that this question has two
classes.)
Figure 1: 10 data points divided into two classes, blue and red.
Solution. GMM only. It cannot be generated by kmeans because of the right most point
is closer to the center of the red class, so it will be classified as red. GMM can generate this
figure because the blue class has a larger variance under GMM.
• Section 5. Linear Regression (13 points)
You are doing a machine learning internship at a bank, analysing user age and their daily
expense. You collected seven samples and plotted them in Fig. 2(a).
20 25 30 35 40 45 50
age
30
40
50
60
70
80
90
ex
pe
ns
e
20 25 30 35 40 45 50
age
30
40
50
60
70
80
90
ex
pe
ns
e
Point !
(a) Regression with normal data (b) Regression with normal and abnormal data
Figure 2: Linear regression with normal and abnormal data.
1. (3 points) From your observation of Fig. 2(a), use 1 sentence to describe the relationship
between user age and expense.
Solution. Expense increases linearly as age increases.
After obtaining Fig. 2(a), you collected a new data point A. Together with the previous seven
points, you do linear regression for a second time and obtain the dashed line in Fig. 2(b).
2. (4 points) Generally when adding new samples to the dataset, it is expected that the fitted
line will be different. In our example, there is quite a large difference between the new model
(dashed line) and the old model (solid line). In two sentences, explain why the change is
large. (Hint: you don’t have to explain why there is a “change”. Focus on “large”.)
Solution. 1) Point A deviates a lot from the other points, thus rendering a large gradi-
ent/loss asking the model to be closer to it. 2) There are limited number of samples in this
dataset, so the big gradient from A causes the model to move a lot.
3. (6 points) You originally used the squared error, i.e., for the nth training sample (xn, yn),
l(xn, yn,θ) = (yn − θTxn)2,
where xn is the feature of the sample, yn is the label, and θ contains the model parameters.
Your supervisor tells you that Point A is an outlier and that it is best to exclude its impact
on your model. Write down an amended loss function that can achieve this goal. Explain
how it excludes the impact of outliers on your linear model. Note: you will get partial marks
if your loss function can merely alleviate the impact of A.
2026-01-09