COMP90051 Statistical Machine Learning 2022 Semester 2 – Final Exam
Total marks: 120
Students must attempt all questions
Section A: Short Answer Questions [40 marks]
Answer each of the questions in this section as briefly as possible. Expect to answer each question in 1-3 lines.
Question 1: [40 marks]
(a) How does the value of the learning rate η in stochastic gradient descent affect the progress of training? [5 marks]
(b) Why are vanishing gradients a more pressing concern for recurrent neural networks (RNNs) than for other neural architectures? [5 marks]
(c) Explain how convolutional neural networks (CNNs) can produce translation invariant representations when applied to vector or matrix inputs. [5 marks]
(d) Can the joint probability distribution described by any undirected probabilistic graphical model be expressed as a directed probabilistic graphical model? Explain why or why not. [5 marks]
(e) If a training problem allows a closed form solution, would there be any advantage to using an iterative gradient based optimisation method? Explain. [5 marks]
(f) Explain how the expectation maximisation algorithm relates to the maximum likelihood estimate. [5 marks]
(g) With respect to training machine learning models, how does the maximum a posteriori estimate for model weights relate to the Bayesian posterior distribution over the weights? [5 marks]
(h) Probably approximately correct (PAC) learning theory aims to upper bound the true risk R[f] of a learned model f ∈ F by the empirical risk R̂[f] of that model plus an error term that might involve the sample size m, the VC-dimension VC(F), a confidence parameter δ, etc. Why do PAC bounds hold only “with high probability 1 − δ”, and not deterministically? [5 marks]
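The learning-rate behaviour asked about in part (a) can be sketched numerically. In this toy illustration (the quadratic objective f(w) = w², the step sizes, and the function name are assumptions for illustration only), a small η makes slow but steady progress, a moderate η converges quickly, and too large an η makes the iterates diverge, since the update w ← w − η·2w has contraction factor |1 − 2η|:

```python
# Gradient descent on the toy objective f(w) = w**2, whose gradient is 2w.
# The update w <- w - eta * 2w shrinks |w| when |1 - 2*eta| < 1, i.e. when
# 0 < eta < 1, and blows up otherwise.

def gd_trajectory(eta, w0=1.0, steps=20):
    """Run plain gradient descent on f(w) = w**2 and return the final |w|."""
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w   # gradient step
    return abs(w)

for eta in (0.01, 0.4, 1.1):
    print(f"eta={eta}: final |w| = {gd_trajectory(eta):.3g}")
```

The same qualitative behaviour carries over to stochastic gradient descent, where too large a step size additionally amplifies gradient noise.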
Section B: Method & Calculation Questions [60 marks]
In this section you are asked to demonstrate your conceptual understanding of methods that we have studied in this subject, and your ability to perform numeric and mathematical calculations.
Question 2: Binary classification [14 marks]
Consider a dataset comprising four training points x1 = [1, 1], x2 = [2, 1], x3 = [3, 3] and x4 = [4, 2], with labels y1 = -1, y2 = +1, y3 = -1 and y4 = +1. We decide to train a linear logistic regression binary classifier on this dataset.
(a) Draw a diagram illustrating the training data, with the decision boundary shown for two training methods: the maximum likelihood estimate (MLE), and the maximum a posteriori (MAP) estimate where a Gaussian prior (an L2 regularisation term) is applied to the weight vector. Ensure the graph is labelled clearly. [4 marks]
(b) Draw a second diagram showing the values of P (y = +1|x) along the horizontal line segment x = [α, 1] where 0 < α < 4, for both the MLE and MAP trained models. Your graph should have α on the horizontal axis and P (y = +1|x) on the vertical axis. Ensure you label the two methods clearly. [4 marks]
(c) Next we try our classifier with a non-linear basis function, namely a radial basis function (RBF). Explain how the Bayesian marginal likelihood of the training data can be used to select how many RBF centres to use, and justify why this method should lead to good generalization accuracy. [6 marks]
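The MLE-versus-MAP contrast behind parts (a) and (b) can be sketched numerically (the gradient-descent fit, step count, and penalty strength λ = 1 below are illustrative assumptions): because the four points are linearly separable, the unregularised MLE weights keep growing in norm, while the Gaussian-prior MAP weights stay bounded, which is what keeps the MAP curve of P(y = +1|x) away from 0 and 1.

```python
import numpy as np

# The four training points, with labels -1/+1 recoded as 0/1 for the
# standard logistic-regression negative log-likelihood.
X = np.array([[1., 1.], [2., 1.], [3., 3.], [4., 2.]])
y = np.array([0., 1., 0., 1.])
Xb = np.hstack([X, np.ones((4, 1))])    # append a bias feature

def fit(lam, steps=20000, eta=0.1):
    """Gradient descent on NLL(w) + 0.5*lam*||w||^2 (lam=0 gives the MLE)."""
    w = np.zeros(3)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))   # sigmoid predictions
        grad = Xb.T @ (p - y) + lam * w     # NLL gradient plus L2 term
        w -= eta * grad
    return w

w_mle = fit(lam=0.0)
w_map = fit(lam=1.0)
print("MLE norm:", np.linalg.norm(w_mle), " MAP norm:", np.linalg.norm(w_map))
```

Plotting the two sigmoids along x = [α, 1] would show the MLE curve transitioning much more sharply than the MAP curve, reflecting the larger weight norm.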
Question 3: Kernel machines [16 marks]
(a) The soft-margin support vector machine imposes box constraints on the dual parameters in training, namely 0 ≤ λi ≤ C. Explain how reducing the value of C has the effect of increasing the margin of separation of the training data. You should assume a linear kernel, and recall the margin can be expressed as 2/‖w‖, where w are the primal weights. [8 marks]
(b) We wish to use the function k(u, v) = u⊤Bv as a kernel, where B is a d × d matrix and u, v are both d-dimensional vectors. Is this a valid kernel? Do any conditions need to be placed on B for this to be the case? Present a mathematical argument to support your answer. [8 marks]
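The condition asked about in part (b) can be checked numerically. In this sketch (the random inputs and the specific matrices are assumptions for illustration), a symmetric positive semidefinite B = A⊤A yields a positive semidefinite Gram matrix, because then k(u, v) = (Au)·(Av) is an inner product of feature maps; a symmetric but indefinite B produces a Gram matrix with a negative eigenvalue, violating Mercer's condition:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 3))            # six random 3-dimensional inputs

def gram(B):
    """Gram matrix K[i, j] = u_i^T B u_j over the rows of U."""
    return U @ B @ U.T

A = rng.normal(size=(3, 3))
B_psd = A.T @ A                        # symmetric PSD by construction
B_bad = np.diag([1.0, -1.0, 1.0])      # symmetric but indefinite

print(np.linalg.eigvalsh(gram(B_psd)).min())   # >= 0 up to rounding error
print(np.linalg.eigvalsh(gram(B_bad)).min())   # strictly negative
```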
Question 4: Probabilistic graphical models [14 marks]
Given the following directed probabilistic graphical model over binary random variables:
(a) For the alphabetical elimination order A, B, C, . . . , F show the resulting graph after running the elimination algorithm. [5 marks]
(b) What is the time complexity of the elimination algorithm for the graph with alphabetical elimination order, as given above? Explain your answer. [5 marks]
(c) What is the Markov blanket of B? Recall that the Markov blanket relates to the conditional distribution of B given the rest of the variables, and is defined as the minimal set of other variables
needed to evaluate the conditional distribution. Explain your answer. [4 marks]
Question 5: Vapnik-Chervonenkis Dimension [16 marks]
The two parts of this question are related to the VC dimension but are otherwise unrelated and can be answered separately.
(a) Consider any family F of binary classifiers that output labels in {0, 1}. For any classifier f ∈ F define its complementary classifier as f̄ = 1 − f. That is, for any input x,
f̄(x) = 1 − f(x) = 1 if f(x) = 0, and 0 if f(x) = 1.
Then define a new family F̄ as being made up of all the complementary classifiers of family F. That is, F̄ = {f̄ : f ∈ F}. How are VC(F) and VC(F̄) related, and why? [8 marks]
(b) Consider any finite input domain X = {x1, . . . , xn} and a family of binary classifiers Fn,k containing all classifiers that output +1 on at most k < n inputs from X. That is, this family contains: the classifier outputting all −1; the n classifiers outputting +1 on only one input and otherwise −1; and so on, up to classifiers outputting +1 on k inputs and −1 on n − k inputs. What is the VC dimension of this family for general n and k < n? Why? [8 marks]
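The answer to part (b) can be sanity-checked by brute force for small n (this enumeration script and its function names are illustrative assumptions, not a required answer): identify each classifier in Fn,k with the set of inputs it labels +1, note that a labelling of a subset S is realisable exactly when it has at most k positives, and search for the largest shattered subset.

```python
from itertools import combinations, product

def shatters(n, k, S):
    """True if F_{n,k} realises every +1/-1 labelling of the tuple S."""
    for labelling in product([False, True], repeat=len(S)):
        pos = {x for x, lab in zip(S, labelling) if lab}
        # A labelling is realisable iff |pos| <= k: take the classifier
        # that outputs +1 exactly on pos (and -1 everywhere else).
        if len(pos) > k:
            return False
    return True

def vc_dim(n, k):
    """Largest d such that some d-subset of the n inputs is shattered."""
    d = 0
    for size in range(1, n + 1):
        if any(shatters(n, k, S) for S in combinations(range(n), size)):
            d = size
    return d

for n, k in [(4, 1), (5, 2), (6, 3)]:
    print(f"n={n}, k={k}: VC dimension = {vc_dim(n, k)}")
```

The enumeration confirms the pattern that every size-k subset is shattered, while any size-(k+1) subset fails on the all-+1 labelling.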
Section C: Design and Application Questions [20 marks]
In this section you are asked to demonstrate that you have gained a high-level understanding of the methods and algorithms covered in this subject, and can apply that understanding. Your answer should be about one page in length.
Question 6: Surveillance in a Retail Store [20 marks]
Your task is to design a system to optimise the placement of a fake security camera in a small retail store. The fake camera is large and easily seen by nearby customers. Depending on where the fake camera is placed, shoppers will be discouraged or deterred from stealing different items from the store. There are 10 locations where the fake camera may be placed, and the store doesn’t know which is best. The store would like you to design a system that decides to which of the 10 locations to move the fake camera each day, so as to minimise total cumulative thefts in the store. You can assume the store performs a full inventory daily to accurately measure total theft.
(a) Considering the 10 locations as 10 possible actions per day, discuss what you could use as rewards or losses per day, that could be suitable either for a bandit or learning with experts solution. [4 marks]
(b) Suppose the fake camera is left in the same position over multiple days. To what extent are the daily rewards independent? [4 marks]
(c) Assuming independent rewards, which setting is more appropriate for this problem: multi-armed bandits or learning with experts? (Locations are either arms/actions or experts.) In your answer, make reference to whether, in one day, rewards are observed only for the used fake camera location or for all locations. [6 marks]
(d) The store now wants to place a real camera in one of 10 locations. You may decide where to place the camera each day. Assume that the real camera is very small and invisible to customers, and that it can be used to accurately identify thieves near the chosen location. To maximise the number of cumulative thieves identified, should you use bandit learning or learning with experts? Why? What rewards/losses would you use? [6 marks]
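A minimal simulation connecting parts (a)–(c) (the ε-greedy strategy, the per-location theft rates, and every name below are invented for illustration, not the expected answer): treating the daily inventory count as a loss observed only for the chosen location puts the fake-camera problem in the bandit setting, and even simple ε-greedy steers cumulative theft toward the best location's rate.

```python
import random

random.seed(0)
THEFT_RATE = [5, 9, 3, 8, 7, 6, 4, 10, 2, 8]   # assumed mean thefts per day

def daily_thefts(location):
    """Simulated inventory count: noisy thefts around the location's rate."""
    return max(0, THEFT_RATE[location] + random.randint(-2, 2))

def run_bandit(days=2000, eps=0.1):
    """Epsilon-greedy over 10 locations, minimising observed daily thefts."""
    n = [0] * 10          # times each location has been used
    mean = [0.0] * 10     # running mean thefts per location
    total = 0
    for _ in range(days):
        if random.random() < eps:
            a = random.randrange(10)                   # explore
        else:
            a = min(range(10), key=lambda i: mean[i])  # exploit least theft
        loss = daily_thefts(a)    # only the chosen location's loss is seen
        total += loss
        n[a] += 1
        mean[a] += (loss - mean[a]) / n[a]             # incremental mean
    return total, min(range(10), key=lambda i: mean[i])

total, best = run_bandit()
print(f"total thefts over 2000 days: {total}; best location found: {best}")
```

A learning-with-experts variant would instead require observing the thefts that would have occurred at all 10 locations each day, which the daily inventory cannot provide.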
2023-06-19