Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP90051 Statistical Machine Learning

Final Exam

Semester 1, 2022

Total marks:  120

Students must attempt all questions

Section A: Short Answer Questions   [40 marks]

Answer each of the questions in this section as briefly as possible. Expect to answer each question in 1-3 lines.

Question 1:     [40 marks]

(a) The Adagrad  optimiser is a variant of stochastic  gradient  descent  with a dynamic  learning  rate .

Explain how the dynamic learning rate operates, and why this is important during optimisation of machine learning models.   [5 marks]

(b) Both convolutional networks  (CNNs)  and recurrent networks  (RNNs)  can be applied to sequence

inputs. Explain the key benefit that RNNs have over CNNs for this type of input.   [5 marks]

(c) Explain in words how Attention can be used to allow for neural models to process dynamic sized inputs.   [5 marks]

(d) Can the independence assumptions of all directed probabilistic graphical models  be represented by undirected probabilistic graphical models? Why or why not?   [5 marks]

(e)  Gradient descent is typically preferred over coordinate descent . Given this, why is coordinate ascent

use in the Expectation Maximisation algorithm?   [5 marks]

(f)  Weak duality  guarantees that a primal optimum  always upper bounds a dual optimum .  How can

we tell that the support vector machine’s primal and dual optima are always equal?   [5 marks]

(g) Why are both maximum- likelihood estimators  and maximum  a posteriori  estimators  both asymp-

totically efficient?   [5 marks]

(h) A model family F’s growth function  SF(m) could potentially be 2m, growing exponentially larger

with increasing sample size m. Why would this be bad news for the PAC bound with growth function for F?   [5 marks]

Section B: Method & Calculation Questions   [60 marks]

In this section you are asked to demonstrate your conceptual understanding of methods that we have studied in this subject, and your ability to perform numeric and mathematical calculations.

Question 2: The Elimination Algorithm   [14 marks]

Consider the following directed probabilistic graphical model (PGM) with Boolean-valued random variables

(a) What is the reconstructed graph (aka moral graph) of this PGM, for alphabetical elimination ordering

leaving just H . I.e., A,B,C,D,E,F,G?   [8 marks]

(b) What is the maximum clique size in the reconstructed graph?   [3 marks]

(c) Write the runtime complexity of the elimination algorithm , eliminating the nodes in the given order, in terms of the maximum clique size.   [3 marks]

Question 3: Vapnik-Chervonenkis Dimension   [16 marks]

The two parts of this question are related to the  VC dimension but are otherwise unrelated and can be answered separately.

(a) Consider an input  domain  of five points x1 , . . . ,x5 , and binary  classifier family  F defined by the

table of dichotomies  given below.   Calculate VC(F) and show a corresponding shattered set of points.   [8 marks]

1

2

3

4

5

0

0

0

0

1

1

0

1

0

1

0

1

0

1

0

0

1

1

0

1

0

0

0

1

0

1

0

1

1

1

1

1

0

0

1

0

1

1

1

1

(b) For any t ∈ [0, 1], define the function It  on x ∈ [0, 1] to be

It (x) =

What is the Vapnik- Chervonenkis dimension of the family F = {It  : t ∈ [0, 1]} and why?   [4 marks]

(c) Referring to the functions defined in the previous part, what is the Vapnik- Chervonenkis dimension of family F\  = F ∪ {1 − It  : t ∈ [0, 1]} and why?  (Hint: function 1 − It  outputs  0 on inputs x ≥ t, and 1  otherwise .)   [4 marks]

Question 4: Support Vector Machines   [16 marks]

Assume that the following figure is an illustration of the decision boundary learned by a linear soft-margin SVM on a two-class data set, with points from one class labelled with blue circles and points from the other class labelled with red squares.

 

The Lagrangian function of the soft-margin SVM is:

L(w,b,λ,β,ξ) =  + C 工(n)ξi +工(n) λigi(w,b,ξ) +工(n)βi(−ξi)

i=1            i=1                                  i=1

where

gi(w,b,ξ) = 1 − ξi− y(i)(wT x(i) + b)

and f(x) = wT x + b.  Here λi , βi  are Lagrange  multipliers  and ξi  is the slack  variable  of the ith  data point, C is the slack penalty.

Given the seven points A, B, C, D, E, F, and G as shown in the figure,

(a) Identify the training point(s) with λi = 0 (by index AG).   [4 marks]

(b) Identify the training point(s) with ξ 0 (by index A–G).   [6 marks]

(c) For each training point identified in the previous part, is ξi  < 0, 0 < ξi  < 1, 1 < ξi  < 2, or is ξi  > 2? [6 marks]

Question 5: Convolutional Neural Networks   [14 marks]

Consider a simple Convolutional Neural Network for processing 3 × 3 images, with a single filter of size 2 × 2, applied with no padding and stride of 1 × 1. The next step is global max-pooling, which takes the filter output (size 2 × 2) to produce a single output.

1. What  mathematical  function  is  defined  by  the  network,  relating  its  output  y  to  its  input x1,1 ,x1,2 , . . . ,x3,3 ? You will need to introduce new variables for the model’s parameters.   [6 marks]

2. Explain how a loss gradient  at the output is back-propagated through the network, and show the mathematical form of the loss gradient with respect to the model parameters.   [8 marks]

Section C: Design and Application Questions   [20 marks]

In this section you are asked to demonstrate that you have gained a high-level understanding of the methods and algorithms covered in this subject, and can apply that understanding. Answers should be about 1 page in length for the question.

Question 6: Online Content Testing   [20 marks]

Your task is to design an automatic content recommendation system for a major newspaper “TechNews” to use to select which headlines to use for articles on its online front page, and which pictures to display  for these front page article previews.  TechNews has very many users, and serves very many different  articles  across different categories (like local news, international news, opinion pieces, investment, etc.).  TechNews wants to encourage user engagement with their platform, so that users keep returning to the  news service (driving more subscription and ad revenue). TechNews management therefore will judge the  performance of its content recommendation system by cumulative clicks . Complicating matters, different  kinds of users typically click on different kinds of articles according to their tastes in article categories,  types of headlines, and picture previews, etc. Your goal is to recommend headlines and pictures to display  to maximise  the total number of clicks, where each article comes with a number of different available  headlines and picture previews to choose from.

(a) Formulate the content recommendation task  as a contextual multi- armed bandit problem . Explain,

what your arms are, what rounds correspond to, and what are your rewards .   [6 marks]

(b) What context vectors will you use based on what data TechNews might possess?   [6 marks]

(c) Comment on whether the typical bandit assumption of independent and identically distributed re- wards per arm is appropriate. Does context make this assumption more or less realistic?   [4 marks]

(d) Describe what data TechNews should collect, so that you could efficiently evaluate the effectiveness of a huge number of bandit learners.   [4 marks]