Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP90051 Statistical Machine Learning

Final Exam

Semester 2, 2021

Total marks:  120

Students must attempt all questions

Section A: Short Answer Questions   [40 marks]

Answer each of the questions in this section as briefly as possible. Expect to answer each question in 1-3 lines.

Question 1:     [40 marks]

(a) Suppose you have trained a soft-margin support vector machine (SVM) with a RBF kernel, and the

performance is very good on the training set while very poor on the validation set.  How will you change the hyperparameters of the SVM to improve the performance of the validation set? List two strategies.   [5 marks]

(b) Compared with the Root Mean Square Propagation  (RMSProp) algorithm, what’s the main draw-

back of the Adaptive  Gradient  (AdaGrad) algorithm?   [5 marks]

(c) What strategy makes the  Variational  Autoencoder  (VAE)  capable of applying gradient descent through the samples of latent representation z to the encoder? [5 marks]

(d) For the recurrrent neural network  (RNN) that takes a sequence as the input, explain why we need to use backpropogation through time to update weights.   [5 marks]

(e) Let E be the set of all extremum estimators, L be the set of all maximum- likelihood estimators, and

M be the set of all M - estimators.  Fill in the blanks in your answers with E,L,M to make the following expression correct:               .   [5 marks]

(f) We’ve seen that the square- loss risk of a parameter estimate θˆ is Eθ[(θ− θˆ)2] = [B(θˆ)]2 +Var(θˆ) while the square- loss risk of a supervised regression predictor is EX,Y[(Y − fˆ(X))2] = (E[Y] − [fˆ(X)])2 + Var(fˆ(X)) + Var[Y] where the last term is known as the irreducible error. Note that in these risks both θˆ and fˆ implicitly depend on a random training set. Why is there no irreducible error term in the first square-loss risk, even though it appears in the second risk?   [5 marks]

(g) What objective function is optimised during training of the Gaussian mixture model (GMM)? Give

your answer as a mathematical expression.   [5 marks]

(h) Explain why only one iteration of Newton-Raphson is required to train linear regression.   [5 marks]

Section B: Method & Calculation Questions   [60 marks]

In this section you are asked to demonstrate your conceptual understanding of methods that we have studied in this subject, and your ability to perform numeric and mathematical calculations.

Question 2: Probabilistic Graphical Models   [14 marks]

Consider the following undirected probabilistic graphical model (U-PGM) on five Boolean-valued variables, where the clique potentials f(A,B,C) and G(C,D,E) are shown below.

 

(a) Using the given clique potential tables, calculate the normalising constant  (aka partition function)

for the joint distribution on A,B,C,D,E .   [8 marks]

(b) Calculate Pr(A = F,B = F,C = F). You may leave your answer as a fraction. (If you were unable

to answer the previous part, leave the constant as Z in your workings here.)   [6 marks]

Question 3: Frequentist Parameter Estimation   [12 marks]

Consider an i.i.d. sequence of random variables X1,X2 , . . . coming from some distribution with mean θ . Consider a simple estimator θˆn = θˆ(X1 , . . . ,Xn) = X1  of the mean. That is, use the first observation X1 as the estimate and ignore the rest.

(a) Why is θˆn  an unbiased estimator of θ in general?   [3 marks]

(b) Is the estimator θˆn  consistent?   [3 marks]

(c) Explain why your answer to the previous part is correct.   [6 marks]

Question 4: VC Dimension   [14 marks]

In class we saw that the family of linear classifiers in R2  (the 2D plane) have Vapnik Chervonenkis (VC) dimension 3. This question asks you to consider  VC dimension of an unrelated family of classifiers also in  the plane.  Consider the family F of rectangle  classifiers  parametrised by reals a,b,c,d where a < b and c < d: for any instance x ∈ R2  we have the classifier,

fabcd (x)   =   

if a ≤ x1  ≤ b and c ≤ x2  ≤ d

otherwise

or in words, a rectangle classifier predicts +1 if the instance lies within the rectangle, and −1 if the instance falls outside the rectangle.

(a) What is the VC dimension of F. You do not need to justify your answer in this part—you will do

so in the remaining parts.   [2 marks]

(b) If you answered some number d to Part (a), prove that VC(F) ≥ d in this part by drawing d points that can be shattered by F. Demonstrate how each of the 2d  labellings is possible for these points: provide a new drawing for each labelling consisting of the points and a rectangle overlapping some of the points.   [10 marks]

(c) If you answered some number d to Part (a), prove that VC(F) < d + 1 in this part by arguing why there can be no set of d + 1 points that can be labelled in all ways by classifiers in F.   [2 marks]

Question 5: Support Vector Machines   [12 marks]

Assume that the following figure is an illustration of the decision boundary learned by a linear soft-margin SVM on a two-class data set, with points from one class labelled with blue squares and points from the other class labelled with red triangles.

 

The Lagrangian function of the soft-margin SVM is:

L(w,b,λ,β,ξ) =  + C ξi+对 λigi(w,b,ξ) +对 βi(ξi)

gi(w,b,ξ) = 1 − ξi− y(i)(wT x(i) + b)

where f(x)  =  wT x + b,  λi , βi   are Lagrange  multipliers  and ξi  is the  slack  variable  of the ith  data point, C is the slack penalty.

Given the seven points A,B C, D, E, F, and G as shown in the figure,

(a) Identify the data point(s) with λi  = C  (by index A-G. Note:  wrong points in will be deducted

marks).   [4 marks]

(b) Identify the data point(s) with ξi   = 0 (by index A-G. Note:  wrong points in will be deducted

marks).   [4 marks]

(c) If Point A is removed from the data set, will the decision boundary change? Explain why.   [4 marks]

Question 6: Convolutional Neural Network   [8 marks]

Consider the following simple Convolutional Neural Network.  The input is a RGB  image  (3  channels) with width and height equal to 224. The setting of the layers are shown in the figure. There are 32 filters in the convolutional layer and each filter has a size of 11 × 11.

 

(a) Compute the number of parameters in the convolutional layer (show the formula and give the final

result)   [3 marks]

(b) In order to reduce the number of parameters of the convolutional layer in the figure, while keeping

the receptive field the same, one possible method is to stack one 3 × 3 convolutional layer and one 9 × 9 convolutional layer.  Describe five different possible stackings, i.e., different combinations of multiple (can be two or more than two) convolutional layers.  (Note:  changing the order of the multiple convolutional layers is counted as the same stack, e.g.,  “Stacking a 9 × 9 and a 3 × 3 convolutional layer” is counted as the same stack as “Stacking a 3 × 3 and a 9 × 9 convolutional layer”.)   [5 marks]

Section C: Design and Application Questions   [20 marks]

In this section you are asked to demonstrate that you have gained a high-level understanding of the methods and algorithms covered in this subject, and can apply that understanding. Answers should be about 1 page in length for the question.

Question 7: Food Delivery Recommendations  [20 marks]

Your task is to design an automatic food  delivery  recommendation  system for UberEats.  (You do not need to be familiar with UberEats to answer this question: UberEats is an app that shows users nearby restaurants, users can choose one shown or search for one, then select a meal to be cooked and delivered to their home.)  Because UberEats makes money (a ‘commission’) whenever a user orders food on its platform, it wants to learn which restaurants to display to the user when the UberEats app is opened. The choice of restaurants displayed on the app front page may make it more likely for a user to order food on the platform, and ultimately how much money UberEats makes—its revenue.  UberEats wants to maximise its revenue using multi-armed bandits.

(a) Formulate the restaurant recommendation task as a contextual multi- armed bandit problem. Explain

what your arms are, what rounds correspond to, what are your rewards, and what context vectors you would use based on what data UberEats might possess.   [5 marks]

(b) Describe what data UberEats should collect, so that you could evaluate the effectiveness of multiple

bandit learners you might try.   [5 marks]

(c) It is common for users to browse for a restaurant on the app, but then ultimately not order their food from UberEats. Discuss whether and how MLinUCB (LinUCB with missing rewards) from project 2 could be applied in this setting. For example, should all such “browse but no order” events be used? What might be limitations of this approach?   [5 marks]

(d) UberEats has another recommendation to make: a budget of coupons (a free meal worth up to $25) every day.  UberEats wants to use bandits to learn which users  to give  coupons  to .  Compare this application to the above one—are there any key differences?   [5 marks]