
COMP90049 Introduction to Machine Learning

Final Exam

Semester 2, 2021

Section A: Short answer Questions [27 marks]

Answer each of the questions in this section as briefly as possible. Expect to answer each question in 1-3 lines, with longer responses expected for the questions with higher marks.

Question 1: [27 marks]

(a) You are developing a model to detect an extremely contagious disease. Your data consists of 4000 patients, out of which 100 are diagnosed with this disease. You achieve 96% classification accuracy. (1) Can you trust the outcome of your model? Explain why. (2) What type of error is most important in this task? (3) Name at least one appropriate evaluation metric that you would choose to evaluate your model. [2 marks]
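For reference, the class distribution in this data implies a simple Zero-R baseline accuracy (a worked figure derived from the numbers above, not part of the original question) of

\[ \text{Accuracy}_{\text{Zero-R}} = \frac{4000 - 100}{4000} = 0.975 = 97.5\%. \]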

(b) (1) Contrast wrapper and filter approaches in terms of how they define the best feature(s). (1-2 sentences) (2) Provide an example scenario where you would prefer using the wrapper strategy over feature filtering. (1-2 sentences) [3 marks]

(c) You are given a dataset with three Boolean features, X = (X1, X2, X3), and a Boolean label, Y. You have trained (1) Logistic Regression, (2) Naive Bayes, and (3) Perceptron to learn the mapping from X to Y. For each classifier, state whether the learned model parameters can be used to compute P(X1, X2, X3, Y), and justify your answer. [4 marks] (N.B. If your answer is yes, write down the formula for calculating P(X1, X2, X3, Y); otherwise, state why you cannot compute this probability.)

(d) Suppose you are using gradient descent to train a logistic regression model:

θ(t+1) ← θ(t) − η ∇θ L(θ(t))     (1)

If the loss function decreases but only very slowly, (1) what could be the reason and what should you do? (1-2 sentences) (2) Describe one method for deciding when to terminate learning. (1-2 sentences) [3 marks]
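As a point of reference for the update rule in Equation (1), a minimal sketch of one gradient-descent step for logistic regression could look as follows (the data X, y, the learning rate eta, and the helper sigmoid are illustrative assumptions, not part of the question):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_step(theta, X, y, eta):
        """One update theta <- theta - eta * dL/dtheta for logistic regression
        with cross-entropy loss (illustrative sketch only)."""
        preds = sigmoid(X @ theta)           # predicted probabilities
        grad = X.T @ (preds - y) / len(y)    # gradient of the average loss w.r.t. theta
        return theta - eta * grad            # the update in Equation (1)

    # Illustrative usage with made-up data
    X = np.array([[1.0, 0.5], [1.0, -0.3], [1.0, 0.8]])
    y = np.array([1, 0, 1])
    theta = gradient_step(np.zeros(2), X, y, eta=0.1)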

(e) Connect the machine learning algorithms on the left with all concepts on the right that apply. [4 marks] (N.B. You may copy the answers onto your answer sheet. You do not need to justify your answer.)

a. Logistic Regression

b. 5-Nearest Neighbor

c. Categorical Naive Bayes

d. Perceptron

e. Decision stump

f. Decision tree (depth: 10)

1. Generative model

2. Non-parametric model

3. Probabilistic model

4. Instance-based model

5. Linear decision boundary

6. Non-linear decision boundary

7. Parametric model

(f) Suppose you have trained a multilayer perceptron that contains three hidden layers for classification on a given dataset. The training accuracy of the model is very high but the validation accuracy is very low. (1) What problem does the model suffer from? (2) Describe one possible reason for this problem. (3) How can you change the number of layers and the number of units in the hidden layers to address the problem? [4 marks]

(g) Briefly explain why the Random Forest manipulates both instances and features for ensemble learning. [2 marks]

(h) In AdaBoost, if a sample is incorrectly classified by a base model, will the weight of the sample definitely be increased? Justify your answer. [2 marks]

(i) Suppose you want to detect anomalies in a dataset with varying densities. You compute the pairwise distance between every two data points to identify, for every data point, the number of neighbouring points within a distance D. You identify a point as an anomaly if its number of neighbours is smaller than a certain threshold p. This method may not be able to identify all possible anomalies. Describe two possible reasons. [3 marks]

Section B: Method Questions [71 marks]

In this section you are asked to demonstrate your conceptual understanding of the methods that we have studied in this subject.

Question 2: Naive Bayes [7 marks]

You want to build a Naive Bayes classifier with two categorical features, each with three possible values, X1 ∈ {r, g, b} and X2 ∈ {l, m, h}, and a Boolean label Y.

(a) What is the minimum number of parameters that you have to estimate to train your NB model? [3 marks] (N.B. Write down the parameters that you have to estimate and their total number.)

(b) Explain the conditional independence assumption in Naive Bayes. [1 mark]

(c) What is the minimum number of parameters that you have to estimate if we do not assume conditional independence? [3 marks] (N.B. You do not have to enumerate all the parameters; simply write down the total number and explain in 2 sentences how you arrived at this number.)
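For reference, the standard parameter-counting identities for categorical models are as follows (general formulas only; the specific counts are left to the question):

\[ \#\text{params}_{\text{NB}} = (|Y| - 1) + |Y|\sum_{j=1}^{m}(k_j - 1), \qquad \#\text{params}_{\text{full joint}} = |Y|\prod_{j=1}^{m} k_j - 1, \]

where |Y| is the number of label values and k_j is the number of values of feature X_j.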

Question 3: Optimization [7 marks]

Consider the two plots of objective functions for a given model M:

(a) For each plot, name a model and a loss function that could result in this shape. Label the axes of both plots accordingly. [4 marks]

(b) What strategy would you choose to optimize each objective function? [1 mark]

(c) Discuss one requirement/characteristic of gradient descent in the context of these two plots. [2 marks]

Question 4: Evaluation [9 marks]

Consider a binary classification task where we aim to learn a function that maps a 2-dimensional input to classes {1, −1}.  Training instances belonging to class 1 and −1 are denoted by blue circles and red crosses respectively.

(a) For each of the following learning algorithms, draw the decision boundary on the given training dataset and justify your solution. [6 marks] (N.B. Justify each decision boundary in 1-2 sentences. You can copy the image and draw the boundaries in your Word/PDF document. Word has a draw option, or use applications such as Preview and Markup (Mac users). You may also copy the plots (approximately) onto your answer sheet, rather than annotating the exercise sheet directly, if that is easier.)

(i) Logistic Regression

(ii) Zero-R

(iii) 1-NN

(b) Which algorithm (i-iii) results in the highest bias and which in the highest variance? Justify your answer. [3 marks]

Question 5: Neural Network [15 marks]

In the following two-class dataset, X1  and X2  are the input features of the data, and Y is the output class.

X1      X2      Y
0.5     0.2     1
0.3    -0.4     0
-0.4    0.2     0
-0.3   -0.5     1

(a) Can we train a perceptron to perfectly classify the data? Explain why. [2 marks]

(b) Assume that you have built the following multilayer Perceptron (MLP) to classify the data. X0 in the input layer is the bias node, which is set to 1 (for simplicity, no bias node is added to the hidden layer). a1(1) and a2(1) are the two units in the hidden layer, and a(2) is the output unit. The activation function g of both the hidden layer and the output layer is the Sigmoid function, i.e., g(x) = 1 / (1 + e^(−x)). All the parameters of the MLP are initialized to 1. What is the prediction of this MLP on the data point [X1 = 0.3, X2 = −0.4, Y = 0]? [3 marks] (N.B. Show your working and provide the values of the hidden units used to obtain the output value. Round your calculations to two decimal digits.)
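A minimal sketch of the forward pass described above, under the stated setup (sigmoid activations, bias node only at the input layer, all weights initialised to 1); the array names W1 and W2 are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Input with bias node X0 = 1, followed by X1 = 0.3 and X2 = -0.4
    x = np.array([1.0, 0.3, -0.4])

    W1 = np.ones((2, 3))    # weights from (X0, X1, X2) to the two hidden units
    W2 = np.ones(2)         # weights from the hidden units to the output unit

    a1 = sigmoid(W1 @ x)    # hidden activations a1(1), a2(1)
    a2 = sigmoid(W2 @ a1)   # output activation a(2)
    print(np.round(a1, 2), round(float(a2), 2))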

(c) Based on the MLP in Question (b) (no bias node is added to the hidden layer), assume that you want to use the data point [X1 = 0.3, X2 = −0.4, Y = 0] to update the parameters of the MLP with the backpropagation algorithm. The loss function is L = (Y − a(2))². The learning rate is set to 1. After one epoch of training on the selected data point, what are the new parameters of the network? (Show your working and provide the error of each node used to update the parameters.) [10 marks] (N.B. Round your calculations to two decimal digits.) (Hint: the derivative of the activation function is g′(x) = g(x)(1 − g(x)).)
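For reference, with the squared loss L = (Y − a(2))² and sigmoid activations, the chain rule gives the following generic error terms and updates (the notation w_j^{(2)} for output-layer weights and w_{ji}^{(1)} for hidden-layer weights is introduced here purely for illustration):

\[ \delta^{(2)} = -2\,(Y - a^{(2)})\, a^{(2)}\bigl(1 - a^{(2)}\bigr), \qquad \delta_j^{(1)} = \delta^{(2)}\, w_j^{(2)}\, a_j^{(1)}\bigl(1 - a_j^{(1)}\bigr), \]
\[ w_j^{(2)} \leftarrow w_j^{(2)} - \eta\, \delta^{(2)}\, a_j^{(1)}, \qquad w_{ji}^{(1)} \leftarrow w_{ji}^{(1)} - \eta\, \delta_j^{(1)}\, x_i. \]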

Question 6: Decision Trees [15 marks]

In the following table, we have 8 instances with 3 attributes (Suburb, Area, New) and a Class label. Each row shows one instance.

(N.B. Calculations up to two decimal points.)

     Suburb   Area     New   Class
1    S1       Large    N     1
2    S2       Large    N     1
3    S3       Large    Y     1
4    S4       Large    Y     2
5    S5       Medium   Y     2
6    S6       Large    Y     3
7    S4       Large    Y     3
8    S7       Small    N     3

(a) Calculate the information gain and gain ratio of the “New” feature on the dataset. [7 marks] (N.B. Use log2 to compute the results of each step to get full marks.)
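For reference, the standard definitions used in this calculation are:

\[ H(D) = -\sum_{c} p(c)\log_2 p(c), \qquad IG(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v), \]
\[ SI(D, A) = -\sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\log_2 \frac{|D_v|}{|D|}, \qquad GR(D, A) = \frac{IG(D, A)}{SI(D, A)}. \]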

(b) Does a decision tree exist which can perfectly classify the given instances? If yes, draw that decision tree; otherwise, explain why not by referring to the data. [2 marks]

(c) If we use “Area” to build a decision stump, what is the predicted label of the decision stump for each of the 8 instances in the data set? [4 marks]

(d) If we use “Suburb” to build a decision stump, what would you expect to see for the accuracy of the decision stump given an evaluation dataset that you have not seen before? Explain why the stump has good/bad accuracy. [2 marks]

Question 7: K-means [10 marks]

Consider the following two clustering results for a dataset that contains 1D data points (each point is shown as a blue dot and the value of each point is shown under the point).

(a) Use the Manhattan distance between the data point and the cluster centroid to assign a new data point X = 4 to one of the clusters. Which cluster will the point be assigned to based on clustering result 1? Which cluster will the point be assigned to based on clustering result 2? [7 marks] (N.B. Show your mathematical working.)
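The assignment step asked for here can be sketched as follows; because the clustering figures are not reproduced in this text, the centroid values below are hypothetical placeholders rather than the actual clusters from the question:

    import numpy as np

    def assign_to_cluster(x, centroids):
        """Assign a 1D point x to the cluster whose centroid is closest
        under Manhattan (absolute) distance."""
        distances = [abs(x - c) for c in centroids]
        return int(np.argmin(distances)), distances

    # Hypothetical centroids for illustration only; the real values come from the figure
    centroids_result_1 = [2.0, 9.0]
    cluster, dists = assign_to_cluster(4.0, centroids_result_1)
    print(f"Assigned to cluster {cluster + 1}, distances = {dists}")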

(b) Which clustering result do you think is better? Select a criterion to justify your answer. [3 marks]

Question 8: Bias and Fairness [8 marks]

Consider the following data set consisting of 8 training instances, where each instance corresponds to an article written by an author. Each article has four features: number of citations, quality of publication venue (denoted by A*, A, and B, where A* denotes the highest quality and B the lowest), number of downloads since publication, and gender of the author. For the purpose of this question, we consider the gender feature a protected attribute. Each instance has a true binary label y which indicates whether the article is deemed groundbreaking (1) or not (-1). We also have access to predicted labels from a Multi-layer Perceptron classifier, ŷ_full, which was trained to automatically predict the label from all available features.

ID   citations   quality   downloads   gender   y    ŷ_full
1    15          A         140         Male      1    1
2    22          A*        175         Female    1   -1
3    44          A         13          Female   -1    1
4    33          B         46          Female    1    1
5    50          A*        63          Male      1    1
6    14          B         1           Male     -1    1
7    10          B         26          Male     -1   -1
8    4           A         11          Female   -1   -1

(a) Define in your own words the fairness criterion of Predictive Parity in the context of the above scenario. [2 marks]

(b) Is the full model (column ŷ_full) fair with respect to the concept of predictive parity? [3 marks]

(N.B. Show your mathematical working.)

(c) Propose a strategy to improve the fairness of the Multi-layer Perceptron model in the context of the dataset given. [3 marks]


Section C: Design and Application Questions [22 marks]

In this section you are asked to demonstrate that you have gained a high-level understanding of the methods and algorithms covered in this subject, and can apply that understanding. Expect your answer to each question to be from one third of a page to one full page in length. These questions will require significantly more thought than those in Sections A–B, and should be attempted only after having completed the earlier sections.

Question 9: [22 marks]

Imagine that, after graduating from the University of Melbourne, you are hired by a job search engine company. The company has asked you to develop a tool that can assign a category to a CV (curriculum vitae, or resume).  You receive thousands of CVs that are uploaded by applicants for you to consider. For each submission, after processing the CV, you have access to the applicant’s structured profile that contains the following list of features:

• Overall GPA

• Degree Title

• Degree Major

• Date of degree completion

• Name of the applicant

• Duration of past employment

• Home address of the applicant

• Gender of the applicant

All submitted CVs belong to three primary categories of interest: “Technology and Engineering”, “Advertising, Arts, and Media”, and “Retail and Consumer Products”. You want to build a machine learning model that assigns incoming CVs to a job category. You do not have access to any labelled data to begin with.

(a) (1) State a specific machine learning algorithm that is appropriate for this task in the beginning and justify your choice. (2) Explain each step of this algorithm in the context of this task. [5 marks]

(b) Assume now you have access to an additional small set of CVs which are labelled with their correct job category. You train a multi-layer perceptron classifier to assign incoming CVs to a category. (1) Choose an appropriate evaluation strategy and justify your choice. (2) Describe each of the steps you would follow in evaluating your model under this strategy in the context of the given task and data set. [6 marks]