
COMP90049 Introduction to Machine Learning

Final Exam

Semester 1, 2021

Section A: Short answer Questions [25 marks]

Answer each of the questions in this section as briefly as possible. Expect to answer each question in 1-3 lines, with longer responses expected for the questions with higher marks.

Question 1: [25 marks]

(a) Name one classifier which can achieve perfect test performance on any linearly separable data set; and one classifier which cannot. [2 marks]

(b) You trained a random forest for the task of geolocation classification. Your classifier achieves very high performance on the training set, but low performance on the test set. (i) What is the problem? (ii) Name two possible reasons for this behavior. [3 marks]

(c) For each of the three feature selection methods Wrapper, Filter and Embedded Methods: (i) describe in your own words how it measures the "usefulness" of a feature; (ii) describe in your own words a scenario where it would be more appropriate than the other two methods. [6 marks]

(d) Consider a multi-class classification problem over K classes, where for each instance we observe a label y as a K-dimensional 1-hot vector. We also assume a classifier which predicts yˆ: a K-dimensional distribution over the same set of labels. ymax is the true label of the instance and yˆmax is the predicted label (i.e., the class with the highest probability assigned by the classifier).

Consider the following loss functions La, Lb and Lc, defined for a single input instance:

La = - Σ_{k=1}^{K} yk log yˆk

Lb = Σ_{k=1}^{K} (yk - yˆk)²

Lc = 0 if yˆmax == ymax, 1 otherwise

(i) In your own words, describe how each of the loss function measures the quality of a model prediction. In other words: describe the intuition behind each loss function. [3 marks]

(ii)  Can all three loss functions be used to optimise a Multi-layer perceptron? Why (not)? [1 mark]

(iii) Which of the three loss functions is the most appropriate for a classification task?   Why?

[1 mark]
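For intuition, the behavior of the three losses on a single toy prediction can be sketched as follows (an illustrative sketch, assuming the natural logarithm for La; the base only rescales the value):

```python
import math

def losses(y, y_hat):
    """La (cross-entropy), Lb (squared error), Lc (0/1 loss) for one instance.
    y is a 1-hot list, y_hat a predicted probability distribution."""
    la = -sum(yk * math.log(pk) for yk, pk in zip(y, y_hat) if yk > 0)
    lb = sum((yk - pk) ** 2 for yk, pk in zip(y, y_hat))
    lc = 0 if y_hat.index(max(y_hat)) == y.index(max(y)) else 1
    return la, lb, lc

# toy example: true class is the 2nd of 3; the model puts most mass on it
la, lb, lc = losses([0, 1, 0], [0.2, 0.7, 0.1])
```

Note that La only looks at the probability assigned to the true class, Lb penalises the whole predicted distribution, and Lc only checks whether the argmax is correct.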

(e) You are applying leave-one-out cross validation to evaluate your latest machine learning model. Your boss is concerned that your approach leads to high evaluation variance because of the very small test sets (just one instance). Is your boss right? Justify your answer. [2 marks]

(f) Connect the machine learning algorithms on the left with all concepts on the right that apply. (You may copy the answers onto your answer sheet. You do not need to justify your answer.) [3 marks]

1-Nearest Neighbor

3-Nearest Neighbor

Naive Bayes

Multi-layer perceptron

Decision stump

Decision tree (depth: 3)

Parametric model

Non-parametric model

Probabilistic model

Instance-based model

Linear decision boundary

Non-linear decision boundary

Generative model

(g) You are developing a model for diagnosing a highly contagious disease from a blood sample. Which of the following metrics is the most important to optimize: (a) precision; (b) recall; (c) accuracy; (d) F-1 measure; (e) None of them. Justify your choice. [1 mark]

(h) (i) Explain in your own words the problem of constrained optimization. (ii) Explain in your own words how this concept relates to evaluating classifiers for fairness, naming both the target and the constraint(s). (N.B. no formulas or calculations are necessary; providing the intuitions is sufficient.) [3 marks]

Section B: Method Questions [75 marks]

In this section you are asked to demonstrate your conceptual understanding of the methods that we have studied in this subject.

Question 2: Feature Selection [17 marks]

You want to explore a data set of Nutrition information, where each instance is a fruit or vegetable characterized by three features: shape, color and sweetness. The target class is VITAMIN-C level.

ID shape color sweetness VITAMIN-C

1       oval        green      5.2                 HIGH

2       round     red         5.0                 HIGH

3       pointy    orange    1.0                 LOW

4       round     red         4.8                 HIGH

5       round     purple    4.3                 LOW

6       oval        brown    0.3                 LOW

7       square    blue        0.2                 LOW

8       oval        green      0.8                 HIGH

9       round     red         2.1                 HIGH

Your favorite classifier accepts discrete features only.  You want to compare three methods of feature discretization, and ultimately select the best one.

(N.B. Show your mathematical working for each sub-question.)

(a)  Discretize the Sweetness feature into three equal-width bins [2 marks]

(b)  Discretize the Sweetness feature into three equal-frequency bins [2 marks]
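The binning in parts (a) and (b) can be sketched as follows (an illustrative sketch, using the usual conventions: equal-width splits the observed range into intervals of identical width, equal-frequency sorts the values and cuts them into groups of equal size):

```python
# Sweetness values for IDs 1..9, in table order
sweetness = [5.2, 5.0, 1.0, 4.8, 4.3, 0.3, 0.2, 0.8, 2.1]

# equal-width: split [min, max] into 3 intervals of identical width
lo, hi = min(sweetness), max(sweetness)
width = (hi - lo) / 3
eq_width = [min(int((v - lo) / width), 2) for v in sweetness]  # bin index 0..2

# equal-frequency: sort and cut into 3 groups of 3 values each
# (works here because all values are distinct)
order = sorted(sweetness)
eq_freq = [order.index(v) // 3 for v in sweetness]
```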

(c) Discretize the Sweetness feature using K-means clustering, with K=3 and L1 (Manhattan) distance. Your initial centroids are c0 = 0.5, c1 = 1.0, c2 = 2.0, where ci refers to cluster i. Compute two rounds of updates. [6 marks]
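The assign-then-update loop in part (c) can be sketched as follows (an illustrative sketch, assuming centroids are updated with the cluster mean, the usual lecture convention, while assignments use L1 distance):

```python
# 1-D K-means on the Sweetness values with the initial centroids from part (c)
sweetness = [5.2, 5.0, 1.0, 4.8, 4.3, 0.3, 0.2, 0.8, 2.1]
centroids = [0.5, 1.0, 2.0]

for _ in range(2):  # two rounds of assign-then-update
    clusters = [[] for _ in centroids]
    for v in sweetness:
        # assign each value to the nearest centroid under L1 (absolute) distance
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    # update each centroid to the mean of its cluster (unchanged if empty)
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]
```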

(d) (i) Compute the Mutual Information (MI) of the Sweetness feature after discretization by K-means (part (c)) with the class label. (N.B. as defined in the lectures, logarithms should be base 2.) [6 marks]

(ii) The MI of Sweetness after equal-width discretization (part (a)) with the class label is 1.11, and the MI of Sweetness after equal-frequency discretization (part (b)) is 0.85. Which of the three discretization methods would you choose based on MI? [1 mark]
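The MI computation in part (d) follows the standard formula MI(X;Y) = Σ_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]; a generic sketch over two discrete variables:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information between two discrete variables, in bits (log base 2).
    Probabilities are estimated as empirical relative frequencies."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

Perfectly dependent variables give 1 bit for uniform binary data, and independent variables give 0.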

Question 3: Classification with Missing Features [9 marks]

Real world data sets very often have missing features, i.e., some instances do not have a value for one or more features. More formally, assume that for each data instance i, we observe a label yi, a set of features xi consisting of observed features oi = {oi(0), . . . , oi(m)}, and missing features mi = {mi(0), . . . , mi(k)} with no associated value.

Assume a trained, probabilistic classifier which predicts labels yi from features xi: P(yi | xi)

(a) Using the statistical concept of marginalisation, and the notation introduced above, derive mathematically (that is: write equations) a classifier which predicts yi using both observed features oi and missing features mi. (N.B. Show your mathematical working.) [6 marks]
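As a reminder of the general identity the question asks you to apply (stated here for a generic label y, observed variables o and an unobserved variable m, not in the question's notation):

```latex
P(y \mid o) \;=\; \sum_{m} P(y, m \mid o) \;=\; \sum_{m} P(y \mid o, m)\, P(m \mid o)
```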

(b) Is your classifier discriminative, generative, neither or both? Justify your answer by referring to your derivation in the first part of the question. [3 marks]

Question 4: Evaluation [11 marks]

Consider the following two sets of plots. Plots (1)–(3) depict three decision boundaries (green), learnt by three different models over the same data set. The data set consists of instances, each described by two features (x0, x1) and a class label (x or o):

[Plots (1)–(3) not reproduced here: scatter plots of the data over features x0 and x1, each showing a green decision boundary.]

Plots (i)–(iii) depict learning curves.

[Plots (i)–(iii) not reproduced here: each shows two curves, one red and one blue.]

(a) Provide a plausible label for the x-axis, y-axis, and the two lines (red and blue) in plots (i)–(iii). [3 marks]

(b) Find the most plausible 1:1 alignment of plots (1)–(3) with plots (i)–(iii). Justify your choice, referring to the concepts of bias and variance, and model complexity. [6 marks]

(c)  Out of models (1)–(3), which one would you choose? Justify your choice. [2 marks]

Question 5: Fair Classification [16 marks]

Consider the following data set consisting of 8 training instances, where each instance corresponds to an applicant for a job. Each instance has four features: work experience (in years), education (in years), LinkedIn page views, and gender encoded as binary female (value = 1 if female, 0 if male). For the purpose of this question, we consider the female feature as a protected attribute. Each training instance has a true binary label y which denotes whether the applicant received a high (1) or low (-1) suitability score. We also have access to predicted labels from some classifier, yˆfull, which was trained to automatically predict the label from all available features.

ID   work   edu   views   female   y    yˆfull

1    15     6     1300    0         1    1

2    22     10    1700    1         1   -1

3    44     6     150     1        -1    1

4    33     0     470     1         1    1

5    50     7     700     0         1    1

6    14     3     6       0        -1    1

7    10     4     300     0        -1   -1

8    4      5     130     1        -1   -1

(a) Define in your own words the fairness criterion of equal opportunity in the context of the above scenario. [2 marks]

(b) Is the full model (column yˆfull) fair with respect to the concept of equal opportunity? (N.B. Show your mathematical working.) [3 marks]
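The check in part (b) amounts to comparing true-positive rates across the protected groups; a sketch (the tuples below are the (female, y, yˆfull) columns copied from the table above):

```python
# Each row: (female, y, yhat_full) for applicants 1..8
rows = [(0, 1, 1), (1, 1, -1), (1, -1, 1), (1, 1, 1),
        (0, 1, 1), (0, -1, 1), (0, -1, -1), (1, -1, -1)]

def tpr(group):
    """True-positive rate P(yhat = 1 | y = 1) within one protected group."""
    pos = [(y, yhat) for f, y, yhat in rows if f == group and y == 1]
    return sum(1 for y, yhat in pos if yhat == 1) / len(pos)

tpr_female, tpr_male = tpr(1), tpr(0)
```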

(c) (i) Define in your own words the concept of fairness through unawareness in the context of the above scenario. (ii) Would the resulting model be a truly fair classifier? Justify your answer. [2 marks]

(d) Train a Perceptron implementing fairness by unawareness, using the data set given above as training examples. Perform two training steps, i.e., process only the first two instances (ID 1 and 2) in the data set. Assume the following:

learning rate η = 0.1

bias=1

all parameters in θ are initialized as 0;

step function: f(z) = 1 if z > 0, -1 otherwise

(N.B. Show your mathematical working.) [9 marks]
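The two updates can be sanity-checked with the following sketch (an illustrative sketch assuming the mistake-driven update w ← w + η·y·x, with the bias treated as a constant input of value 1 and the protected female feature dropped; your lecture's exact update convention may differ):

```python
# Perceptron with fairness by unawareness: the 'female' feature is dropped,
# leaving inputs [bias, work, edu, views].
eta = 0.1
w = [0.0, 0.0, 0.0, 0.0]              # all parameters initialised to 0

instances = [([1, 15, 6, 1300], 1),   # ID 1, label y = 1
             ([1, 22, 10, 1700], 1)]  # ID 2, label y = 1

def f(z):                             # step function from the question
    return 1 if z > 0 else -1

for x, y in instances:
    z = sum(wi * xi for wi, xi in zip(w, x))
    if f(z) != y:                     # update only on a mistake
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
```

With this convention, instance 1 triggers an update (f(0) = -1 ≠ 1), after which instance 2 is already classified correctly.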

Question 6: Probability [6 marks]

You developed a classifier which predicts whether a German movie will be successful in Australia, or not. "Successful" here is defined as > 10,000 viewers in Australia within the first 4 weeks after release. From historical data it is known that 0.5% of all German movies turn out successful in Australia. After some development and evaluation, you find that your classifier has a false positive rate of 4%, and a false negative rate of 1%.

(N.B. Show your mathematical working for each sub-question.)

(a) What are the odds that a random novel German movie will be unsuccessful in Australia? [2 marks]

(b) Your classifier predicts "successful" for a new German movie. What is the probability that the movie will indeed be successful? [4 marks]
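The arithmetic in this question is a direct application of Bayes' rule; a sketch using the stated rates:

```python
# Prior and error rates stated in the question
p_success = 0.005   # P(successful)
fpr = 0.04          # P(predict "successful" | not successful)
fnr = 0.01          # P(predict "unsuccessful" | successful)

# (a) odds of being unsuccessful = P(not successful) / P(successful)
odds_unsuccessful = (1 - p_success) / p_success     # 199, i.e. 199:1

# (b) Bayes' rule: P(successful | predicted "successful")
tpr = 1 - fnr       # P(predict "successful" | successful)
p_pred_pos = tpr * p_success + fpr * (1 - p_success)
p_success_given_pos = tpr * p_success / p_pred_pos  # roughly 0.11
```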

Question 7: Ensembling [16 marks]

The following graph shows the error of three classification models and a random baseline on 15 individual test instances (x1, . . . , x15). The error for each test instance is a continuous number between 0 and 1, where 0 is best.


(a) Explain the general concept of ensembling in the context of the above scenario. (N.B. you do not need to provide formulas or perform computations.) [2 marks]

(b) Would you expect ensembling to improve performance? Why (not)? [3 marks]

(c) Name an appropriate ensembling technique for the above scenario. Justify your choice. [1 mark]

For the remainder of the question, consider a slightly different scenario:

A team of data scientists has access to a labelled training data set and found that several different classifiers lead to severe {overfitting, underfitting}. They consequently decided to apply Boosting using the AdaBoost algorithm with Decision Stumps as their base classifiers. In doing so, they expect to obtain more {complex decision boundaries, stable predictions}.

(d)  Select the most appropriate term from the {options} (underfitting/overfitting and complex decision  boundaries/stable  predictions) in the text above. [1 mark]