
SPCE0038:  Machine Learning with Big-Data

Exam 2021

Question 1

(a)  Explain how linear regression may be used to fit a polynomial model that is non-linear in the data features. [3 marks]
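As an illustrative sketch of this idea (the data, the quadratic truth, and all names here are hypothetical): expanding each feature into its powers turns a polynomial fit into an ordinary linear regression in the expanded features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
z = rng.uniform(-3, 3, size=(100, 1))                      # single feature
y = 0.5 * z[:, 0]**2 + z[:, 0] + rng.normal(0, 0.1, 100)   # quadratic truth + noise

# Expand z into [z, z^2]; the model is non-linear in z but
# remains linear in its parameters, so ordinary least squares applies.
poly = PolynomialFeatures(degree=2, include_bias=False)
Z = poly.fit_transform(z)

model = LinearRegression().fit(Z, y)
```

The key point is that `LinearRegression` never sees anything non-linear: the non-linearity lives entirely in the feature map.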

(b)  Is a high-dimensional polynomial model likely to be a good model to use for a machine learning regression

problem?  Explain your reasoning. [3 marks]

(c)  How would you compute a "clean" model prediction on each data instance provided ("clean" in the sense that, when evaluating the model on a data instance, that instance has not been used in fitting the model)?  Illustrate your explanation with a diagram. [6 marks]
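One common way to obtain such predictions is k-fold cross-validation, which can be sketched as follows (the data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 60)

# Each prediction is made by a model fitted on the folds that exclude
# that instance, so no instance contributes to its own prediction.
y_clean = cross_val_predict(LinearRegression(), X, y, cv=5)
```

With `cv=5`, the data is split into five folds; each fold is predicted by a model trained on the other four.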

Consider the underlying (true) model

y = f(z) + e,

where f is the true model, to be approximated by h, z is an object feature vector, and e is noise with zero mean and variance σ².

(d)  Explain the three contributions to the mean square error. [3 marks]

Show that

E[(y − h(z))²] = Bias²[h(z)] + Var[h(z)] + σ².                        [5 marks]
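A sketch of the standard decomposition, using E[e] = 0, E[e²] = σ², and the independence of the noise e from h(z):

```latex
\begin{align}
\mathbb{E}\big[(y - h(z))^2\big]
  &= \mathbb{E}\big[(f(z) + e - h(z))^2\big] \\
  &= \mathbb{E}\big[(f(z) - h(z))^2\big]
     + 2\,\mathbb{E}\big[e\,(f(z) - h(z))\big]
     + \mathbb{E}\big[e^2\big] \\
  &= \mathbb{E}\big[(f(z) - h(z))^2\big] + \sigma^2
     \qquad (\mathbb{E}[e] = 0,\ e \text{ independent of } h(z)) \\
  &= \big(f(z) - \mathbb{E}[h(z)]\big)^2
     + \mathbb{E}\Big[\big(h(z) - \mathbb{E}[h(z)]\big)^2\Big] + \sigma^2 \\
  &= \mathrm{Bias}^2[h(z)] + \mathrm{Var}[h(z)] + \sigma^2 .
\end{align}
```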


Question 2

(a)  For a two-class supervised classification problem, explain conceptually (without any equations) how support vector machines (SVMs) classify data instances.  Include a discussion of both hard and soft margin classification.  Illustrate your explanation with diagrams. [10 marks] 

(b) What are the characteristics of machine learning problems for which SVMs are well-suited?    [2 marks]

(c)  Consider a trained linear SVM for two-dimensional data.

The decision boundary is given by wᵀz + b = 0 and the margins by wᵀz + b = ±1, where w is the weight vector, z is the data instance vector and b is the bias.

Alternatively, these expressions may be expanded for the two-dimensional setting to give w0x0 + w1x1 + b = 0 and w0x0 + w1x1 + b = ±1, respectively.

Derive an expression for the size of the margin (defined by the shortest distance between the lines defining the two edges of the margin, i.e. between the lines wᵀz + b = 1 and wᵀz + b = −1). [8 marks]
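A sketch of the usual derivation: take a point z₁ on the lower margin and move a distance λ along the unit normal w/‖w‖ to reach a point z₂ on the upper margin.

```latex
\begin{align}
w^\top z_1 + b &= -1, \qquad
w^\top z_2 + b = +1, \qquad
z_2 = z_1 + \lambda \frac{w}{\|w\|} , \\
w^\top z_2 - w^\top z_1 &= 2
  \;\Rightarrow\; \lambda \frac{w^\top w}{\|w\|} = 2
  \;\Rightarrow\; \lambda = \frac{2}{\|w\|} .
\end{align}
```

In the expanded two-dimensional notation this margin size is 2/√(w0² + w1²).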


Question 3

(a)  In a convolutional neural network, describe what a Max Pooling layer does, and why one may want to include such a layer. [4 marks]
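The operation can be sketched in plain NumPy (this helper and its names are illustrative, not a Keras API):

```python
import numpy as np

def max_pool2d(img, size, stride):
    """Max pooling over a 2-D array with no padding: each output
    element is the maximum of a size x size window of the input."""
    h, w = img.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = img[i * stride : i * stride + size,
                            j * stride : j * stride + size].max()
    return out

a = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [1, 1, 4, 4]])
# 2x2 pooling with stride 2 keeps the maximum of each 2x2 block.
pooled = max_pool2d(a, 2, 2)
```

Each window is reduced to a single number, shrinking the spatial dimensions while keeping the strongest activations.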

(b)  Describe what the stride is, with respect to a max pooling layer. [2 marks]

(c)  Describe the two different types of padding – "valid" and "same" – that one may use in a convolutional neural network. [4 marks]

(d)  Consider the following image, with pixel values shown as integers, as an input layer.  In all cases we consider the lower left corner of the image as being the point where any operation on the image begins. [10 marks]

1  2  3  4  5  6
3  5  6  7  8  9
4  4  4  4  6  6
2  6  7  8  9  0
1  6  8  9  0  8
1  3  7  3  5  8

(i)  If one uses a max pooling layer with a filter size of 6x6 pixels, no padding, and a stride length of 6 pixels, draw the resulting receptor layer.

(ii)  If one uses a max pooling layer with a filter size of 3x3 pixels and a stride length of 3 pixels, using no padding, draw the resulting receptor layer.

(iii)  If one uses a max pooling layer with a filter size of 2x2 pixels and a stride length of 2 pixels, using no padding, draw the resulting receptor layer.

(iv)  If one uses a max pooling layer with a filter size of 3x3 pixels and a stride length of 1 pixel, using no padding, draw the resulting receptor layer.

(v)  Consider the following code snippet, where image is the array under consideration in this question, and draw the resulting output array.

max_pool = keras.layers.MaxPool2D(pool_size=4, padding="valid")
output = max_pool(image)

(vi)  Consider the following code snippet, where image is the array under consideration in this question, and draw the resulting output array.

max_pool = keras.layers.MaxPool2D(pool_size=4, padding="same")
output = max_pool(image)
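For reference, a runnable version of both snippets, assuming TensorFlow's Keras; note that Keras pooling layers expect a 4-D (batch, height, width, channels) tensor, so the 6x6 image from the question must be reshaped first (the reshape is an assumption, not part of the question's code):

```python
import numpy as np
import tensorflow as tf

# The 6x6 image from the question, reshaped to the 4-D form
# (batch, height, width, channels) that Keras layers expect.
image = np.array([[1, 2, 3, 4, 5, 6],
                  [3, 5, 6, 7, 8, 9],
                  [4, 4, 4, 4, 6, 6],
                  [2, 6, 7, 8, 9, 0],
                  [1, 6, 8, 9, 0, 8],
                  [1, 3, 7, 3, 5, 8]], dtype=np.float32).reshape(1, 6, 6, 1)

# Stride defaults to pool_size, so both layers move in steps of 4.
valid_pool = tf.keras.layers.MaxPool2D(pool_size=4, padding="valid")
same_pool = tf.keras.layers.MaxPool2D(pool_size=4, padding="same")

out_valid = valid_pool(image)  # "valid": floor((6-4)/4)+1 = 1 window per axis
out_same = same_pool(image)    # "same": ceil(6/4) = 2 windows per axis
```

The output shapes (1x1 for "valid", 2x2 for "same") do not depend on whether one scans from the top-left or, as in the question, the lower-left corner.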

Question 4

You want to build a prediction model trained on a large volume of experimental measurements. The data will be provided by an international organisation, which has been coordinating the experiments and the reporting of their results according to a well-defined protocol and standard.  The protocol defines, for example, how many and what measurements an experiment may provide, or which values may be missing and under what conditions. A lot of care has been taken to ensure that the data is consistent and adheres to this standard. All experiments are now complete and the collection is made available as a relational database.

(a)  From the above description, why is a relational database a good choice for this dataset? What would

you gain by using a NoSQL database instead? [3 marks]

Once you connect to the database, you intend to extract and preprocess some of its contents into a CSV file. You will then fit an appropriate classification model on that data, and finally make the model available to others. You are not sure what preprocessing scheme or classifier you will use, and would like to try different options. Additionally, the dataset is very large, and the training takes a very long time on a single personal computer.

(b)  For each of the following tasks, suggest one appropriate technology (such as a tool or library) and briefly explain how it helps with that task:

● Switching between different preprocessing methods and classifiers.

● Training the classifier efficiently.

● Sharing the trained model.

(For example, the answer for a data storage task could be: "SQL/relational database: allows users to programmatically store, access and query data".) [6 marks]
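For the first of these tasks, one commonly cited option is scikit-learn's Pipeline, where swapping a preprocessing step or classifier is a one-line change (the data and parameter grid below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Named steps make each stage of the pipeline interchangeable.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

# Step parameters (or whole steps) can be varied programmatically.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
```

Replacing `LogisticRegression()` with any other estimator, or `StandardScaler()` with another transformer, requires no change to the surrounding code.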

(c)  For the pipeline mentioned above (preprocessing and training the classifier), give the contents of a file in the YAML format that describes this pipeline, such that it can be reproduced automatically using DVC. Assume that the connection to the database and the preprocessing is done in a file called preprocess.py, which produces the file data.csv, and that the model is fitted and saved in classify.py. The Python files can be run as e.g. python preprocess.py. The fitting can be configured through two parameters: "degree" and "bias" (you can assume that these are specified in another suitable file, but do not need to provide that in your answer). [8 marks]
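A sketch of the kind of dvc.yaml such a pipeline might use (the output name model.pkl is an assumption; the parameters are assumed to live in DVC's default params.yaml):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
    outs:
      - data.csv
  classify:
    cmd: python classify.py
    deps:
      - classify.py
      - data.csv
    params:
      - degree
      - bias
    outs:
      - model.pkl
```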

(d) Which DVC command can be used to run the whole analysis this file describes? Assume that you run the analysis this way, make a modification to the classifier parameters, and run that command again. Will the whole pipeline be rerun, and why (not)? [3 marks]