DEPARTMENT OF AUTOMATIC CONTROL & SYSTEMS ENGINEERING
Autumn Semester 2020‒21
ACS6427 DATA MODELLING AND MACHINE INTELLIGENCE
1. a) Show that the least squares estimate for $a_0$ in the simple linear regression problem
$$y = a_0 + \varepsilon$$
from n observations of the response y, i.e., $y_1, \ldots, y_n$, is the average of these observations.
Here $\varepsilon$ is a zero-mean modelling error.
[5 marks]
b) Consider a linear regression problem where only one predictor $x_1$ is involved and the
relationship between the predictor and response y is
$$y = c x_1 + \xi$$
in which $\xi$ is a NONZERO mean modelling error.
A set of 5 observations of y and $x_1$ is given in the table below.
i          1     2     3     4     5
$y_i$      3.08  4.09  5.01  6.09  7.06
$x_{1i}$   2     3     4     5     6
i) Find the least squares estimate of parameter c from the observed predictor and
response data.
[8 marks]
ii) Determine the Total Sum of Squares (TSS), Residual Sum of Squares (RSS),
Explained Sum of Squares (ESS) of the estimated linear regression model,
respectively.
[3 marks]
iii) Find the $R^2$ statistic of the estimated linear regression model and assess the
model accuracy using the $R^2$ statistic.
[4 marks]
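The quantities asked for in Question 1(b) can be checked numerically. The following is a minimal Python sketch, not a model answer. It assumes that the nonzero mean of the modelling error is handled by adding an intercept term (named `b` here, a name not used in the question), so that the fitted model is y = b + c x_1.

```python
# Observations from the table in Question 1(b).
x = [2, 3, 4, 5, 6]
y = [3.08, 4.09, 5.01, 6.09, 7.06]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Assumption: the nonzero mean of the error is modelled as an intercept b,
# so the fitted model is y_hat = b + c * x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
c = s_xy / s_xx          # least squares slope
b = y_bar - c * x_bar    # least squares intercept

y_hat = [b + c * xi for xi in x]
tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
ess = tss - rss   # this identity holds for least squares with an intercept
r2 = 1 - rss / tss

print(f"c = {c:.4f}, b = {b:.4f}")
print(f"TSS = {tss:.4f}, RSS = {rss:.4f}, ESS = {ess:.4f}, R^2 = {r2:.4f}")
```

If the model were instead fitted through the origin (no intercept), the estimate of c and the decomposition TSS = ESS + RSS would both change, so the modelling assumption should be stated explicitly in any answer.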
2. a) A logistic function-based two-class classifier has been determined as
$$\hat{y} = \frac{1}{1 + e^{-(0.1 + 0.2x)}}$$
i) Find the probability of the classification result "y = 1" estimated by this logistic
classifier when x = ‒3, ‒2, 0, 7, and 10, respectively.
[2 marks]
ii) Assume the true response y is as shown in the following table when x = ‒3, ‒2, 0,
7, and 10, respectively, and that the threshold T for the logistic classifier is T = 0.5.
x ‒3 ‒2 0 7 10
y 0 0 1 0 1
Find the sensitivity, specificity, false negative rate, and false positive rate of the
classifier, respectively.
[4 marks]
iii) Show a sketch of the ROC curve of the classifier using the sensitivity and specificity
when T is chosen as T = 1, T = 0.5, and T = 0, respectively, and explain why the
AUC of the ROC curve is often needed to assess the performance of a classifier.
[4 marks]
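The calculations in Question 2(a) can be sketched in Python as follows. This is an illustrative check under two assumptions: the classifier has the form 1 / (1 + exp(-(0.1 + 0.2x))), and an observation is classified as "1" when the estimated probability is greater than or equal to the threshold T. The function and variable names are chosen for illustration only.

```python
import math

def predict_proba(x):
    """Estimated probability of y = 1, assuming y_hat = 1 / (1 + exp(-(0.1 + 0.2 x)))."""
    return 1.0 / (1.0 + math.exp(-(0.1 + 0.2 * x)))

xs = [-3, -2, 0, 7, 10]
y_true = [0, 0, 1, 0, 1]   # true responses from the table in part (ii)
threshold = 0.5

probs = [predict_proba(x) for x in xs]
# Assumption: classify as 1 when the probability is at or above the threshold.
y_pred = [1 if p >= threshold else 0 for p in probs]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
false_negative_rate = fn / (tp + fn)  # 1 - sensitivity
false_positive_rate = fp / (tn + fp)  # 1 - specificity

for x, p, label in zip(xs, probs, y_pred):
    print(f"x = {x:>3}: P(y = 1) = {p:.3f}, predicted class = {label}")
print(sensitivity, specificity, false_negative_rate, false_positive_rate)
```

Re-running the sketch with `threshold` set to 0 and to 1 gives the remaining operating points needed for the ROC sketch in part (iii).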
b) A set of 10 observations of predictors $x_i = (x_{i1}, x_{i2})$, i = 1, …, 10, are collected and shown
in the following table. The same observations are plotted in Figure 2.1 (overleaf).
$x_i$      x1   x2   x3   x4   x5   x6   x7   x8   x9   x10
$x_{i1}$   1    1.5  2    2.5  2.5  2.7  3    4    5    6
$x_{i2}$   2    0.8  2.5  1    4    2    3.5  2.4  3.6  2
Figure 2.1
In order to apply K-means clustering with K = 2 to these observations, at the first step, two
centroids are randomly selected as (2, 1.5) and (4, 3.5), respectively.
(i) Apply the principle in the second step of K-means clustering to cluster the 10
observations into two subgroups C1 and C2, and show which of the 10 observations
are in C1 and which are in C2.
[4 marks]
(ii) Find the centroids of C1 and C2 determined in part (i).
[2 marks]
(iii) Find the updated subgroups C1* and C2* using the centroids determined in part (ii)
and show which of the 10 observations are in C1* and which in C2*.
[4 marks]
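The assignment and centroid-update steps in Question 2(b) can be sketched in Python as below. This is a minimal illustration that assumes Euclidean distance and assumes no cluster becomes empty. The helper names `assign`, `update` and `members` are hypothetical and are not part of the question.

```python
# Observations x1..x10 from the table in Question 2(b), as (x_i1, x_i2) pairs.
points = [(1, 2), (1.5, 0.8), (2, 2.5), (2.5, 1), (2.5, 4),
          (2.7, 2), (3, 3.5), (4, 2.4), (5, 3.6), (6, 2)]
initial_centroids = [(2, 1.5), (4, 3.5)]

def assign(points, centroids):
    """Return, for each point, the index of the nearest centroid.

    Squared Euclidean distance is used; ties go to the lower-numbered centroid.
    """
    labels = []
    for px, py in points:
        d2 = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
        labels.append(d2.index(min(d2)))
    return labels

def update(points, labels, k):
    """Return the mean of the points in each cluster (assumes none is empty)."""
    centroids = []
    for j in range(k):
        cluster = [p for p, lab in zip(points, labels) if lab == j]
        centroids.append((sum(p[0] for p in cluster) / len(cluster),
                          sum(p[1] for p in cluster) / len(cluster)))
    return centroids

def members(labels, j):
    """1-based observation numbers assigned to cluster j."""
    return [i + 1 for i, lab in enumerate(labels) if lab == j]

labels = assign(points, initial_centroids)   # part (i)
new_centroids = update(points, labels, 2)    # part (ii)
labels_star = assign(points, new_centroids)  # part (iii)

print("C1:", members(labels, 0), "C2:", members(labels, 1))
print("Updated centroids:", new_centroids)
print("C1*:", members(labels_star, 0), "C2*:", members(labels_star, 1))
```

Repeating `update` and `assign` until the labels stop changing gives the usual K-means iteration.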
3. a) As part of a project, you are performing a text mining experiment, looking for the term
“Karhunen–Loève”. This has involved investigating 5 different data sources, {,,,,}, of which the term appears only in document {}. Based on all documents
the TFIDF is 2.796. What is the term frequency for Document {}?
[2 marks]
b) In black-box modelling we rely on the data to help produce the models we will build.
What elements of this data must we consider as we begin to build our model? In this
context explain the requirement for cross-validation. Provide pseudo-code to outline the
process of k-fold cross validation, explaining each step and showing how the process
would change for different values of k.
[6 marks]
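The k-fold procedure referred to in Question 3(b) can be outlined in runnable form. The sketch below is one possible layout, not a prescribed answer. The functions `k_fold_indices`, `cross_validate`, `fit_mean` and `squared_error`, and the toy data, are hypothetical and used only for illustration. The data are split into contiguous folds without shuffling, which assumes the observations are already in random order.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, k, fit, loss):
    """Return the average held-out loss over k folds."""
    n = len(xs)
    fold_scores = []
    for held_out in k_fold_indices(n, k):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        # Train on the k-1 remaining folds only.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Evaluate on the fold that was held out.
        errors = [loss(model(xs[i]), ys[i]) for i in held_out]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / k

def fit_mean(train_x, train_y):
    """Toy model (illustration only): always predict the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean

def squared_error(prediction, target):
    return (prediction - target) ** 2

# Hypothetical toy data, not taken from the exam paper.
xs = [0, 1, 2, 3]
ys = [1, 2, 3, 4]
loo_score = cross_validate(xs, ys, k=4, fit=fit_mean, loss=squared_error)
two_fold_score = cross_validate(xs, ys, k=2, fit=fit_mean, loss=squared_error)
print(loo_score, two_fold_score)
```

Setting k equal to the number of observations gives leave-one-out cross-validation, which trains k models on nearly all the data. A small k trains fewer models, each on a smaller share of the data.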
c) Data modelling and machine learning algorithms are often deployed in complex and
challenging scenarios. The trained algorithms will often perform poorly, even though this
may not have been intended at the design stage. As part of a newly developed team
working for a small technology-focussed company, you have been asked to develop an
algorithm to identify and predict good candidates for interview from their submitted
Curriculum Vitae (CV). Discuss how you might tackle this problem in order to ensure that
the system you develop is robust to presentation of a variety of candidates.
[12 marks]
4. a) A dataset has been provided for you to analyse. You are concerned that the dataset may
not have been presented optimally and wish to investigate this further. The dataset has a
Covariance Matrix,
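$$\begin{pmatrix} 11 & -5 & 2 \\ -5 & 9 & -3 \\ 2 & -3 & 8 \end{pmatrix}$$
which produces an eigenvector matrix,
$$\begin{pmatrix} -0.5384 & 0.6934 & 0.4789 \\ -0.0912 & -0.6129 & 0.7849 \\ 0.8378 & 0.3789 & 0.3932 \end{pmatrix}$$
and set of eigenvalues,
$$\begin{pmatrix} 7.0412 \\ 16.5121 \\ 4.4468 \end{pmatrix}$$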
(i) Discuss the application of Principal Component Analysis (PCA) to this dataset, and
explain what the application of PCA would achieve. Explain the geometrical
relationship between the principal components.
[3 marks]
(ii) Which direction vector listed for this dataset gives the first principal component of the
data? Discuss why this is the case. How much variance is contained within each
principal component?
[5 marks]
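The link between eigenvalues and variance in Question 4(a) can be checked numerically. The sketch below is illustrative only and assumes that the eigenvectors are stored as the columns of the eigenvector matrix, in the same order as the listed eigenvalues. The variable names are not taken from the question.

```python
# Covariance matrix, eigenvector matrix and eigenvalues from Question 4(a).
cov = [[11, -5, 2],
       [-5, 9, -3],
       [2, -3, 8]]
eigvecs = [[-0.5384, 0.6934, 0.4789],
           [-0.0912, -0.6129, 0.7849],
           [0.8378, 0.3789, 0.3932]]
eigvals = [7.0412, 16.5121, 4.4468]

# The first principal component pairs with the largest eigenvalue.
first = eigvals.index(max(eigvals))
# Assumption: eigenvectors are columns, ordered like the eigenvalues.
pc1 = [row[first] for row in eigvecs]

# Check the eigen-relation cov @ v = lambda * v for this column.
cov_v = [sum(c * v for c, v in zip(row, pc1)) for row in cov]
residual = max(abs(a - eigvals[first] * b) for a, b in zip(cov_v, pc1))

# Share of the total variance carried by each principal component.
total = sum(eigvals)
explained = [lam / total for lam in eigvals]

print("First principal component direction:", pc1)
print("Largest eigen-relation residual:", residual)
print("Variance shares:", [round(e, 4) for e in explained])
```

A small residual supports the column-ordering assumption. The eigenvalues sum to approximately the trace of the covariance matrix, as expected.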
b) As part of training a two input ({1, 2}) logistic classifier, you believe that the performance
is not fit for purpose.
(i) What steps would you take to incorporate non-linearity into the decision boundary
for the model? Show how a cubic function might be implemented.
[3 marks]
(ii) Describe the issues that may arise through the implementation of an
arbitrarily shaped decision boundary.
[2 marks]
c) A set of data relating pressure and flow rate in a mains water system has been provided
in Table 4-1. From basic theoretical considerations you have determined that the two
variables are linked by a non-linear model of the form = 12.
Table 4-1: Data for Q4(c)
Flow Rate, F (m³/s)   0.436   0.586    0.614    0.659    0.764    0.9467   0.9854   1.07
Pressure, P (Pa)      75842   117211   137895   172369   275790   298705   356210   379212
(i) The model being considered for the problem is non-linear. Show that a linearisation
can take place, and provide the new variables for the linear model.
[2 marks]
(ii) Solve for optimal values of weights within this linearised model, and thus provide
values for model parameters that are to be estimated: 1 and 2.
[5 marks]