SPCE0038: Machine Learning with Big-Data
Alternative Assessment 2020
Question 1
(a) Draw a diagram of the basic logistic unit that is used as the core building block of artificial neural networks. For simplicity you can ignore the inclusion of a bias term. Describe the components of your diagram in words. [3 marks]
(b) Specify the equations that define the output a of the logistic unit given the inputs xj and weights θj.
Again, you may ignore a bias term for simplicity. [4 marks]
(c) Using your logistic unit as a base building block, draw a diagram of a fully connected, feed-forward artificial neural network with three layers (one input, one hidden and one output layer), three input units, three hidden units, and one output node. Again, you may ignore a bias term for simplicity. [4 marks]
(d) Specify the equations defining the full artificial neural network of part (c), extending your equations given for a single logistic unit that you specified above in part (b). Again, you may ignore a bias term for simplicity. [6 marks]
(e) What typical cost functions are used to train neural networks for regression and classification problems? Specify the corresponding cost function equations for targets y_j^(i) and predictions p̂_j^(i), where i denotes the training instance and j the output node. [6 marks]
(f) Explain what it means for a network to be deep. [1 mark]
(g) Why do deep networks provide a powerful representation framework? Include a discussion of the universal approximation theorem. [6 marks]
Question 2
Gradient descent algorithms take a step of size η in the direction of the negative gradient, where the update of parameter θ is given by a form similar to
θ ← θ − η∇θC(θ),
where C denotes the cost function and ∇θC the gradient of the cost function with respect to θ. The variable η is often called the learning rate. Gradient descent based algorithms are often used to train deep learning models.
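As a minimal illustration (not part of the exam), the update rule above can be sketched in NumPy for a least-squares cost; the data matrix, learning rate, and iteration count below are all hypothetical example values.

```python
import numpy as np

# Gradient descent on the least-squares cost C(theta) = ||X theta - y||^2 / m.
# X and y are hypothetical; the exact solution is theta = [1, 2]^T.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([[5.0], [7.0], [9.0]])
m = X.shape[0]
eta = 0.05               # learning rate (step size)
theta = np.zeros((2, 1)) # initial parameters

for _ in range(1000):
    gradient = (2.0 / m) * X.T @ (X @ theta - y)  # grad_theta C(theta)
    theta = theta - eta * gradient                # theta <- theta - eta * grad

# theta converges towards the exact least-squares solution [1, 2]^T
```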
(a) Briefly describe batch gradient descent and stochastic gradient descent at a conceptual level. [4 marks]
(b) Although stochastic gradient descent is often very effective, why are alternative optimisation algorithms typically considered for training? [2 marks]
(c) Describe the momentum optimisation algorithm, including the update equations. [3 marks]
(d) Describe the Nesterov variant of the momentum algorithm, including the update equations. [3 marks]
(e) Explain the concept behind the AdaGrad algorithm and how this can help with training (no need to include update equations). [4 marks]
(f) Explain the concept behind the RMSProp algorithm and how this can help with training (no need to include update equations). [4 marks]
(g) Adam is the standard go-to algorithm for training deep networks. Explain the components of the algorithms considered so far that are included in the Adam algorithm. [3 marks]
(h) Deep networks have very large numbers of parameters and so can be prone to overfitting. Explain the dropout regularisation technique to avoid overfitting. Support your explanation with a diagram. [7 marks]
Question 3
(a) Describe the knowledge based approach to artificial intelligence. [4 marks]
(b) Describe the machine learning approach to artificial intelligence. [2 marks]
(c) Describe the traditional machine learning approach of feature engineering. [4 marks]
(d) Briefly describe supervised, unsupervised and reinforcement learning. [3 marks]
(e) For supervised learning, briefly describe the difference between regression and classification problems. [2 marks]
(f) Consider logistic regression for K classes, where the predicted probabilities for each class k are given by
p̂_k = exp(s_k(x)) / Σ_{j=1}^{K} exp(s_j(x)), with s_k(x) = (θ^(k))^T x,
for input x and parameters θ^(k) (recall each θ^(k) includes n features).
Consider the generalised cost function for logistic regression given by the cross entropy
C(θ) = −(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} y_k^(i) log(p̂_k^(i)),
where i denotes training instance and m the total number of training instances. The target value of instance i for class k is denoted y_k^(i).
Show that the derivative of the cost function is given by
∇_{θ^(k)} C(θ) = (1/m) Σ_{i=1}^{m} (p̂_k^(i) − y_k^(i)) x^(i).
Hint: For the term ∂p̂_k / ∂s_{k′} it may be convenient to consider the cases k = k′ and k ≠ k′ separately and then combine. Note also that Σ_k y_k^(i) = 1. [15 marks]
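As an illustrative aside (not part of the question), the softmax cross-entropy gradient of this form can be checked numerically with a central finite difference on a small random example; the sizes, seed, and data below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 5, 3, 4                      # training instances, features, classes
X = rng.normal(size=(m, n))            # hypothetical inputs x^(i)
Y = np.eye(K)[rng.integers(0, K, m)]   # one-hot targets y_k^(i)
Theta = rng.normal(size=(K, n))        # parameters theta^(k), one row per class

def cost(Theta):
    S = X @ Theta.T                                        # scores s_k(x^(i))
    P = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # softmax p_k^(i)
    return -np.mean(np.sum(Y * np.log(P), axis=1))         # cross entropy

# Analytic gradient from the stated result: (1/m) sum_i (p_k^(i) - y_k^(i)) x^(i)
S = X @ Theta.T
P = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
grad = (P - Y).T @ X / m

# Central finite-difference check of one parameter component
E = np.zeros_like(Theta)
E[1, 2] = 1.0
eps = 1e-5
numeric = (cost(Theta + eps * E) - cost(Theta - eps * E)) / (2 * eps)
# numeric agrees with grad[1, 2] to finite-difference accuracy
```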
Question 4
(a) Explain the computational model of TensorFlow in terms of computational graph construction and execution. [3 marks]
(b) Explain the difference between TensorFlow Variable and Constant types. [3 marks]
(c) Explain what a TensorFlow Placeholder variable is and why it may be useful. [4 marks]
(d) Explain autodiff and its advantages. [4 marks]
(e) Consider the following TensorFlow code to set up a computational graph and execute it. Assume scaled_housing_data_plus_bias is an m × (n + 1) feature matrix and housing_data_target is an m × 1 target vector, where m denotes the number of training instances and n the number of features (n + 1 is the number of features when including a bias).
(i) Set up computational graph:
import tensorflow as tf
reset_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32,
                name="X")
y = tf.constant(housing_data_target, dtype=tf.float32, name="y")

theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0),
                    name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
training_op = optimizer.minimize(mse)
(ii) Execute:
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)

    best_theta = theta.eval()
What machine learning problem does this TensorFlow code solve? What optimisation algorithm is used? [4 marks]
(f) Write code to solve the problem given in part (e) using mini-batch gradient descent. You may find it helpful to base your answer on the code given in part (e) and then revise it where necessary. Assume you have available a function fetch_batch to fetch each mini-batch, with the signature specified below:
def fetch_batch(epoch, batch_index, batch_size):
    ...
    return X_batch, y_batch

[12 marks]
Question 5
(a) Describe what Principal Component Analysis (PCA) is. [3 marks]
(b) Define the explained variance ratio. [2 marks]
(c) Explain what Kernel PCA is. [5 marks]
(d) Define the process of Local Linear Embedding (LLE). [4 marks]
(e) In the first step of LLE, for a set of training instances xi, with k nearest neighbours LLE will first reconstruct the xi as a linear function of these neighbours. Write down an equation that would describe this process, and any normalisation that is applied. [8 marks]
(f) The second step of LLE is to map the training instances into a d-dimensional space while preserving local relationships as much as possible. If zi is the d-space equivalent of xi then describe the condition that must be met. [8 marks]