Machine Learning for Engineers 2023 Coursework 2
Coursework 2
Machine Learning for Engineers 2023
November 24, 2023
1 Guidelines
Setting up the coursework
To start, download the file “cw2.zip” from the module’s Keats website. Once this is done:
1. Unzip the file “cw2.zip” in a folder of your choice. We will refer to this folder as “
2. Change the name of your unzipped folder to your k-number. For instance, if your k-number is “k12345678”, the file “run_coursework.m” should be located at “
3. Open your MATLAB editor, and make sure that the file explorer (upper left section of your editor) is located at “
Instructions
You should now find two MATLAB scripts (“run_coursework.m” and “check_dimensions.m”), a series of MATLAB function files, three “.mat” files, and the current PDF file (“coursewor_1.pdf”). In this coursework, you will complete a series of functions that will be called in the script “run_coursework.m”. By running the file “run_coursework.m”, you will be able to inspect the results of the functions you modified. You should modify ONLY the code inside the functions mentioned in each question. Most importantly, you should NOT edit the provided functions! The functions to edit will be indicated at the end of each question. We will also specify the format of the inputs and the required format of the outputs.
For instance, for a function named “example”, which sums the entries of a vector, the question will specify the following:
Function file: “example.m”
Input format: N × 1 vector “v”
Output format: scalar value “sum_v”
Function signature: function sum_v = example(v)
The file “example.m” will then contain the following
function out = example(v)
    % Write your code here
    out = rand(1); % Placeholder (to delete and replace upon completion of the question)
end
By default, each function will have placeholder output (“out = rand(1)” in this example) that ensures that the code runs, even though the result will be purely random. If you skip a question, you can leave the placeholder to ensure the file “run_coursework.m” still runs. Upon completion of a question, you will have to remove the placeholder and replace it with your own answer.
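As a minimal sketch of what a completed file could look like for this toy “example” function (it is not one of the graded questions), the placeholder is replaced by a call to the built-in “sum”. The function header is kept exactly as it appears in the provided file, and the trailing “;” prevents the result from being displayed.

```matlab
function out = example(v)
% EXAMPLE Sum the entries of an N x 1 vector v (illustrative sketch only).
    % The placeholder "out = rand(1);" has been deleted and replaced.
    out = sum(v);   % uses only the input v, no workspace variables
end
```

Calling example([1; 2; 3]) from the command window should then return the scalar 6.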
You are ONLY allowed to modify the code INSIDE the function. You must NOT modify the name of the function, and you must NOT modify the signature of the function (i.e., the names of the inputs/outputs)! Additionally, your function should only rely on the given inputs to generate an output; it must not use variables in the workspace other than the ones given as input and the ones created within the function itself.
Once a function has been coded, you can run the script “run_coursework.m” to make sure the function behaves as intended. Note that the file “run_coursework.m” will run all of the coursework at once, so you might want to run only the lines of “run_coursework.m” up to your current question. Comments in “run_coursework.m” indicate which lines concern which questions. Please avoid submitting functions that display text (remember that assigning a value to a variable will display it unless a “;” is put at the end of the line). It is recommended to use the MATLAB debugger instead of the function “disp” to inspect the behaviour of your code. Finally, you are not allowed to use MATLAB toolboxes: the coded functions should only contain built-in MATLAB functions such as the ones seen in lectures or tutorials (e.g., “mean”, “sum”, “*”, “.*”, “binornd”, ...).
Submitting your coursework
Before submitting your work, you should clear your workspace (right click on the “Workspace” section of your editor > “Clear Workspace”) and verify that the script “run_coursework.m” runs without errors and gives the intended results. You should also run the file “check_dimensions.m” to make sure that the outputs of each function have the right dimensions. No points will be awarded for functions that do not output the right dimensions or for functions that raise an error.
To submit your work, compress the folder containing the MATLAB files into a ZIP file with your k-number as its name. For instance, if your k-number is “k12345678”, the ZIP file should be “k12345678.zip”. Please verify that the ZIP file directly contains your MATLAB files, and not an intermediary folder. Additionally, please check that the compression was done in the requested ZIP format (and not in another archive format produced by software such as, e.g., “WinRAR”). Finally, submit your ZIP file over KEATS.
You must ensure that the ZIP file can be readily accessed for marking. A file that cannot be opened (for instance, a corrupted or encrypted file) will automatically result in a mark of ZERO.
2 Introduction
The objective of this coursework is to train a binary classifier capable of predicting whether a given word sequence comes from a sonnet of the world-renowned playwright and poet William Shakespeare, or from the song lyrics of the globally-acclaimed singer and songwriter Taylor Swift. We will first recap some notions of text representation introduced in Coursework 1, before presenting the available training and test data.
Text representation
A written sentence can be decomposed as a sequence of words. In this coursework, we will use the notation $s^K$ to denote a sequence of $K$ words. For instance, the sentence “My heart has been borrowed and yours has been blue.” is given by the 10-word sequence (disregarding punctuation and letter case):
$s^{10}$ = [“my”, “heart”, “has”, “been”, “borrowed”, “and”, “yours”, “has”, “been”, “blue”].
Note that two different orderings of the same words represent two distinct sequences. For instance, [“my”, “heart”, “has”, “been”] and [“has”, “my”, “heart”, “been”] are two different sequences.
The words composing a sentence are taken from a discrete vocabulary set $\mathcal{V} = [w_m]_{m=1}^{M}$ of $M$ different words $w_m$ for $m \in \{1, \dots, M\}$. For instance, with the vocabulary set $\mathcal{V} = [w_m]_{m=1}^{8}$ = [“my”, “heart”, “has”, “been”, “borrowed”, “and”, “yours”, “blue”], the above sequence $s^{10}$ can be equivalently written as
$s^{10} = [w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_3, w_4, w_8]$,
where the words $w_3$ = “has” and $w_4$ = “been” have been used twice.
In order to represent the $M$ words of the vocabulary $\mathcal{V} = [w_m]_{m=1}^{M}$ as numbers, we define the discrete set $\mathcal{X} = \{1, \dots, M\}$, where $x \in \mathcal{X}$ represents the word $w_x \in \mathcal{V}$. For instance, for the same vocabulary set $\mathcal{V} = [w_m]_{m=1}^{8}$ = [“my”, “heart”, “has”, “been”, “borrowed”, “and”, “yours”, “blue”] as above, the sequence $s^{10}$ can be expressed as the 10 × 1 vector
$x^{10} = [1, 2, 3, 4, 5, 6, 7, 3, 4, 8]^\top$,
where we use $x^K$ to denote the integer representation of a sequence $s^K$ of $K$ words.
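The mapping from words to integers can be sketched in a few lines of MATLAB. This is only an illustration of the notation, assuming the vocabulary is stored as a cell array (the variable names V, s10 and x10 are chosen here and are not part of the provided files); it relies on the built-in “ismember”, which returns the position of each word in the vocabulary.

```matlab
% Vocabulary V as a cell array: position m holds the word w_m.
V = {'my', 'heart', 'has', 'been', 'borrowed', 'and', 'yours', 'blue'};

% The sequence s^10 from the text, lower-cased and without punctuation.
s10 = {'my', 'heart', 'has', 'been', 'borrowed', ...
       'and', 'yours', 'has', 'been', 'blue'};

% For each word of s10, ismember returns its index in V (0 if absent).
[found, idx] = ismember(s10, V);
assert(all(found));      % every word must belong to the vocabulary

x10 = idx(:);            % 10 x 1 integer representation x^10
s10_back = V(x10);       % map the integers back to words
```

Here x10 should equal [1, 2, 3, 4, 5, 6, 7, 3, 4, 8]⊤, matching the vector given above.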
From a probabilistic perspective, a sequence $s^K$ of $K$ words is modelled as a discrete $K \times 1$ random vector $x^K = [x_1, \dots, x_K]^\top$, where the random variable $x_k$ represents the $k$-th word in $x^K$. For $k \in \{1, \dots, K\}$, the random variable $x_k$ takes values in the set $\mathcal{X}$, and a realization $x_k$ represents the word $w_{x_k}$ in the vocabulary $\mathcal{V} = [w_m]_{m=1}^{M}$. For instance, $x^{10} = [1, 2, 3, 4, 5, 6, 7, 3, 4, 8]^\top$ is a realization of the random vector $x^{10}$ representing the sequence $s^{10} = [w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_3, w_4, w_8]$ = [“my”, “heart”, “has”, “been”, “borrowed”, “and”, “yours”, “has”, “been”, “blue”] under the vocabulary $\mathcal{V} = [w_m]_{m=1}^{8}$ = [“my”, “heart”, “has”, “been”, “borrowed”, “and”, “yours”, “blue”].
Dataset
The available dataset $\mathcal{D} = \{(x_i^K, t_i)\}_{i=1}^{N}$ contains $N$ pairs, where each covariate (or input) $x_i^K$ is a $K \times 1$ integer representation of a word sequence of length $K = 10$, and $t_i \in \{0, 1\}$ is its corresponding target, for $i \in \{1, \dots, N\}$. A target value $t_i = 0$ indicates that the corresponding input $x_i^K$ comes from a poem written by William Shakespeare, while the value $t_i = 1$ indicates that $x_i^K$ comes from the lyrics of a song written by Taylor Swift. Furthermore, we will write $x_i^K = [x_1^i, \dots, x_K^i]^\top$, where $x_k^i \in \mathcal{X}$ denotes both the $k$-th element of the vector $x_i^K$ for $k \in \{1, \dots, K\}$ and, equivalently, the integer representation of the $k$-th word of the sequence $s_i^K$ represented by $x_i^K$.
The dataset $\mathcal{D}$ is made available through the file “dataset.mat”, and comes already split as $N_{\mathrm{tr}} = 3109$ training data points inside the variables “X_tr” and “t_tr”, and $N_{\mathrm{te}} = N - N_{\mathrm{tr}} = 345$ test data points inside the variables “X_te” and “t_te”. More precisely, the training inputs “X_tr” take the form of the $N_{\mathrm{tr}} \times K$ matrix
$$\mathrm{X\_tr} = \begin{bmatrix} (x_1^K)^\top \\ \vdots \\ (x_{N_{\mathrm{tr}}}^K)^\top \end{bmatrix},$$
and the test inputs “X_te” take the form of the $N_{\mathrm{te}} \times K$ matrix
$$\mathrm{X\_te} = \begin{bmatrix} (x_{N_{\mathrm{tr}}+1}^K)^\top \\ \vdots \\ (x_N^K)^\top \end{bmatrix},$$
while the corresponding training and test targets take the form of the $N_{\mathrm{tr}} \times 1$ binary vector t_tr $= [t_1, \dots, t_{N_{\mathrm{tr}}}]^\top$ and the $N_{\mathrm{te}} \times 1$ binary vector t_te $= [t_{N_{\mathrm{tr}}+1}, \dots, t_N]^\top$, respectively.
Throughout this coursework, we will use the notation $(x_i^K, t_i)$ to denote any pair of $K \times 1$ covariate vector $x_i^K$ and binary target $t_i \in \{0, 1\}$ in the available dataset $\mathcal{D}$, with $i \in \{1, \dots, N\}$. Similarly, $X^K = [(x_{i_1}^K)^\top, \dots, (x_{i_n}^K)^\top]^\top$ will denote any set of $n$ covariates represented as an $n \times K$ input matrix, and $t = [t_{i_1}, \dots, t_{i_n}]^\top$ will denote its corresponding targets as an $n \times 1$ binary vector, with $\{i_1, \dots, i_n\} \subset \{1, \dots, N\}$.
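The row-stacking convention used for these matrices can be illustrated on toy values. The sketch below uses made-up sequences with K = 4 rather than the real data, and the variable names x_a, x_b, X and t are chosen for illustration only; it shows that each row of the input matrix is the transpose of one K × 1 covariate vector.

```matlab
% Two toy covariate vectors (K = 4), NOT taken from dataset.mat.
x_a = [1; 2; 3; 4];
x_b = [3; 1; 2; 4];

% Stack the transposed vectors as rows: n = 2 rows, K = 4 columns.
X = [x_a.'; x_b.'];

% Matching n x 1 binary target vector (toy labels).
t = [0; 1];

% Row i of X, transposed, recovers the i-th K x 1 covariate vector.
x_second = X(2, :).';
```

With this layout, size(X, 1) gives the number of data points and size(X, 2) gives the sequence length.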
Feature representation
Instead of directly working with the $K \times 1$ integer representations $x_i^K$, the models built in this coursework will use a $D \times 1$ feature representation $u(x_i^K)$ of each sentence $x_i^K$. We will consider here a one-dimensional word-embedding, which maps each integer $x_k^i$ in $x_i^K$ to a real number $\mathrm{emb}(x_k^i) \in \mathbb{R}$. Accordingly, the feature representation $u(x_i^K) = [\mathrm{emb}(x_1^i), \dots, \mathrm{emb}(x_K^i)]^\top$ is a $D \times 1$ vector of real numbers with $D = K$.
This feature representation is made available in MATLAB through the function “toFeatures”, which takes as input any n × k integer matrix
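$$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,k} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,k} \end{bmatrix},$$
for $n \in \{1, \dots, N\}$ and $k \in \{1, \dots, K\}$, where $x_{i,j} \in \mathcal{X}$, and outputs the $n \times k$ real matrix
$$\mathrm{toFeatures}(X) = \begin{bmatrix} \mathrm{emb}(x_{1,1}) & \cdots & \mathrm{emb}(x_{1,k}) \\ \vdots & \ddots & \vdots \\ \mathrm{emb}(x_{n,1}) & \cdots & \mathrm{emb}(x_{n,k}) \end{bmatrix},$$
of embeddings applied element-wise. For instance, any feature vector can be readily obtained as $u(x_i^K) = \mathrm{toFeatures}(x_i^K)$, and the feature vectors of all training inputs can be computed in one line as
$$\mathrm{toFeatures}(\mathrm{X\_tr}) = \begin{bmatrix} u(x_1^K)^\top \\ \vdots \\ u(x_{N_{\mathrm{tr}}}^K)^\top \end{bmatrix}.$$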
The features of the test inputs can be similarly computed as toFeatures(X_te).
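To make the element-wise behaviour concrete, here is a rough sketch of how such a mapping could work. The embedding values below are hypothetical (an evenly spaced lookup table for a toy vocabulary of M = 8 words) and the name toFeaturesSketch is invented for this illustration; the provided “toFeatures” uses its own embedding and should be called as-is in the coursework.

```matlab
M = 8;                               % toy vocabulary size (assumption)
emb_table = linspace(-1, 1, M).';    % hypothetical emb(1), ..., emb(M)

% Look up the embedding of every integer in X, keeping the shape of X.
toFeaturesSketch = @(X) reshape(emb_table(X), size(X));

X = [1 2 3;                          % n = 2 sequences of k = 3 words each
     8 3 1];
U = toFeaturesSketch(X);             % 2 x 3 real matrix of embeddings
```

Each entry U(i, j) is the embedding of X(i, j), so repeated integers such as X(1, 3) and X(2, 2) receive the same value.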
Question 0 [no points]
This question is optional and does not have any impact on your grade.
Before implementing machine learning models to detect the author of a given sentence, let’s first take a look at the data. By running the section corresponding to “Question 0” in the given “run_coursework.m” file, a series of 10-word sentences corresponding to the first 10 test covariates $\{x_i^K\}_{i=N_{\mathrm{tr}}+1}^{N_{\mathrm{tr}}+10}$ will be displayed. Without looking at the corresponding test targets, try to guess who the author of each sentence is, and report your answers in the function “question0.m”. In what follows, we will try to build two different machine learning models which (hopefully) can surpass human performance.
3 Part I: logistic regression
In this first section, we will implement a logistic regression classifier. Accordingly, we will denote as $\theta$ its $K \times 1$ parameter vector, and a soft prediction will be given as $p(t_i = 1 | x_i^K, \theta) = \sigma(\theta^\top u(x_i^K))$, where
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
represents the sigmoid function applied to a real number $x \in \mathbb{R}$. The MATLAB implementation of the sigmoid function is given in the file “sigmoid.m”, where the function sigmoid(X) applies $\sigma(\cdot)$ element-wise to any matrix, vector or scalar X.
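For reference, an element-wise sigmoid can be sketched as a one-line anonymous function. This is not the content of the provided “sigmoid.m” (which should be used in the coursework); the name sigmoidSketch is chosen here only to show what “element-wise” means in practice.

```matlab
% Element-wise sigmoid: "./" divides entry by entry, exp acts entry by entry.
sigmoidSketch = @(X) 1 ./ (1 + exp(-X));

p_scalar = sigmoidSketch(0);             % scalar input gives a scalar output
p_matrix = sigmoidSketch([-2 0; 0 2]);   % matrix input gives a matrix of the same size
```

The output always has the same size as the input, and each entry lies strictly between 0 and 1.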
Question 1 [10 points]
Complete the function “logisticSoftPrediction”, which takes as input a data matrix $X^K = [(x_{i_1}^K)^\top, \dots, (x_{i_n}^K)^\top]^\top$ and a parameter vector $\theta$, and returns the vector of element-wise soft predictions $\hat{p} = [p(t_{i_1} = 1 | x_{i_1}^K, \theta), \dots, p(t_{i_n} = 1 | x_{i_n}^K, \theta)]^\top$.
Function file: “logisticSoftPrediction.m”
Input format:
- n × K data matrix: X = $X^K$
- K × 1 parameter vector: theta = $\theta$
Output format: n × 1 vector of element-wise soft predictions: soft_predictions = $\hat{p}$
Function signature: function soft_predictions = logisticSoftPrediction(X, theta)
Question 2 [10 points]
Complete the function “crossEntropyLoss”, which takes as input the soft predictions $\hat{p} = [p(t_{i_1} = 1 | x_{i_1}^K, \theta), \dots, p(t_{i_n} = 1 | x_{i_n}^K, \theta)]^\top$ and the corresponding binary target vector $t = [t_{i_1}, \dots, t_{i_n}]^\top$, and returns the cross-entropy loss
$$L(\theta) = \sum_{i \in \{i_1, \dots, i_n\}} \ell_i(\theta),$$
where
$$\ell_i(\theta) = -\log\left(p(t_i = t_i | x_i^K, \theta)\right) = -t_i \log\left(p(t_i = 1 | x_i^K, \theta)\right) - (1 - t_i) \log\left(1 - p(t_i = 1 | x_i^K, \theta)\right)$$
is the log-loss of the $i$-th datapoint, for $i \in \{1, \dots, N\}$.
Function file: “crossEntropyLoss.m”
Input format:
- n × 1 vector of element-wise soft predictions: soft_predictions = $\hat{p}$
- n × 1 binary target vector: target = $t$
Output format: scalar cross-entropy loss: loss = $L(\theta)$
Function signature: function loss = crossEntropyLoss(soft_predictions, target)
Question 3 [10 points]
Complete the function “logisticGradient”, which takes as input a data matrix $X^K = [(x_{i_1}^K)^\top, \dots, (x_{i_n}^K)^\top]^\top$, its corresponding binary target vector $t = [t_{i_1}, \dots, t_{i_n}]^\top$, and the soft predictions $\hat{p} = [p(t_{i_1} = 1 | x_{i_1}^K, \theta), \dots, p(t_{i_n} = 1 | x_{i_n}^K, \theta)]^\top$, and returns the gradient
$$\nabla L(\theta) = \sum_{i \in \{i_1, \dots, i_n\}} \nabla \ell_i(\theta),$$
where $\nabla L(\theta)$ denotes the $K \times 1$ gradient of $L(\theta)$ with respect to the parameter vector $\theta$. The gradient must be computed using symbolic differentiation (and not using numerical differentiation). The gradient is assumed to be evaluated at the same parameter vector $\theta$ used to generate the soft predictions $\hat{p}$.
Function file: “logisticGradient.m”
Input format:
- n × K data matrix: X = $X^K$
- n × 1 binary target vector: t = $t$
- n × 1 vector of element-wise soft predictions: soft_predictions = $\hat{p}$
Output format: K × 1 gradient vector: gradient = $\nabla L(\theta)$
Function signature: function gradient = logisticGradient(X, t, soft_predictions)
Question 4 [10 points]
Complete the function “gradientDescentStep”, which takes as input the gradient $\nabla L(\theta^{(l)})$ evaluated at the current parameter iterate $\theta^{(l)}$ (also given as input) and a learning rate $\gamma > 0$, and outputs the next parameter iterate $\theta^{(l+1)}$ by applying the gradient descent step
$$\theta^{(l+1)} = \theta^{(l)} - \gamma \nabla L(\theta^{(l)}).$$
Function file: “gradientDescentStep.m”
Input format:
- scalar learning rate: gamma = $\gamma$
- K × 1 current parameter vector iterate: theta = $\theta^{(l)}$
- K × 1 gradient vector: gradient = $\nabla L(\theta^{(l)})$
Output format: K × 1 next parameter vector iterate: next_theta = $\theta^{(l+1)}$
Function signature: function next_theta = gradientDescentStep(gamma, theta, gradient)
Question 5 [10 points]
Complete the function “hardPrediction”, which takes as input the soft predictions $\hat{p} = [p(t_{i_1} = 1 | x_{i_1}^K, \theta), \dots, p(t_{i_n} = 1 | x_{i_n}^K, \theta)]^\top$ and returns the hard predictions $\hat{t} = [\hat{t}(x_{i_1}^K, \theta), \dots, \hat{t}(x_{i_n}^K, \theta)]^\top$, where
$$\hat{t}(x_i^K, \theta) = \begin{cases} 1 & \text{if } p(t_i = 1 | x_i^K, \theta) > 0.5 \\ 0 & \text{otherwise} \end{cases},$$
for $i \in \{i_1, \dots, i_n\}$.
Function file: “hardPrediction.m”
Input format: n × 1 vector of element-wise soft predictions: soft_predictions = $\hat{p}$
Output format: n × 1 vector of element-wise hard predictions: hard_predictions = $\hat{t}$
Function signature: function hard_predictions = hardPrediction(soft_predictions)
Before moving to the next section, you can test your implementation of logistic regression training by running the code provided after Question 5 in “run_coursework.m”. The code should output the training and test cross-entropy loss of the trained model, as well as its accuracy (ratio of correct predictions over the total number of data points).
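The accuracy mentioned here is the fraction of hard predictions that match the targets. As a small sketch on toy vectors (the values are made up for illustration and are not taken from the dataset, and the variable names are not those used in “run_coursework.m”):

```matlab
t_true = [1; 0; 0; 1];     % toy binary targets
t_hat  = [1; 0; 1; 1];     % toy hard predictions

% Element-wise comparison gives a logical vector; its mean is the
% ratio of correct predictions over the total number of data points.
accuracy = mean(t_hat == t_true);
```

In this toy case three of the four predictions are correct, so the accuracy is 0.75.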
We will come back to the implemented logistic regression model in the last (optional) code section of the file “run_coursework.m”, where the predictions of a trained logistic regression model will be compared against your proposed answers from Question 0.
4 Part II: neural network
In this second section, we will implement a neural network with a single feature extraction layer of $L = 5$ neurons using a leaky ReLU activation function
$$h(a) = \max(\alpha a, a) = \max(a, 0) + \alpha \min(a, 0),$$
for $\alpha \in [0, 1]$. Accordingly, the parameters to be learned are $\theta = \{W^{\mathrm{feat}}, W^{\mathrm{class}}\}$, where $W^{\mathrm{feat}}$ denotes the $L \times K$ matrix of weights for the feature extraction layer, and $W^{\mathrm{class}}$ denotes the $1 \times L$ row-vector of weights for the classification layer. For a given $K \times 1$ covariate vector $x_i^K$, the neural network takes its feature vector $u(x_i^K)$ as input, and computes a soft prediction as follows:
$$a^{\mathrm{feat}} = W^{\mathrm{feat}} u(x_i^K)$$
$$h^{\mathrm{feat}} = h(a^{\mathrm{feat}})$$
$$a^{\mathrm{class}} = W^{\mathrm{class}} h^{\mathrm{feat}}$$
$$p(t_i = 1 | x_i^K, \theta) = \sigma(a^{\mathrm{class}}),$$
where $a^{\mathrm{feat}}$ is the $L \times 1$ vector of pre-activations for the feature extraction layer, $h^{\mathrm{feat}}$ denotes the $L \times 1$ output of the feature extraction layer, $a^{\mathrm{class}}$ is the scalar pre-activation of the classification layer, and $p(t_i = 1 | x_i^K, \theta)$ is the output of the neural network, where $\sigma(\cdot)$ denotes the sigmoid function defined at the beginning of Part I. Note that the activation function $h(\cdot)$ is applied element-wise to the vector $a^{\mathrm{feat}}$ in the equations above.
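The two expressions given for the leaky ReLU agree whenever $\alpha \in [0, 1]$. A short case analysis on the sign of a scalar pre-activation $a$, included here as a sanity check rather than as part of the original statement, makes this explicit:

```latex
\[
\max(\alpha a, a) =
\begin{cases}
a, & a \ge 0 \quad (\text{since } \alpha a \le a),\\
\alpha a, & a < 0 \quad (\text{since } \alpha a \ge a),
\end{cases}
\qquad
\max(a, 0) + \alpha \min(a, 0) =
\begin{cases}
a + \alpha \cdot 0 = a, & a \ge 0,\\
0 + \alpha a = \alpha a, & a < 0.
\end{cases}
\]
```

Both forms therefore leave non-negative pre-activations unchanged and scale negative ones by $\alpha$.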
Question 6: Forward Pass [15 points in total]
In the following questions, we will implement the forward pass of the neural network step by step.
Question 6.1 [5 points]
Complete the function “neuralNetworkPreActivation”, which takes as input the layer input $h$ and the weights $W$ of either of the two defined layers, i.e., $(h, W) \in \{(u(x_i^K), W^{\mathrm{feat}}), (h^{\mathrm{feat}}, W^{\mathrm{class}})\}$, and returns the corresponding pre-activation vector $a = Wh$, where $a = a^{\mathrm{feat}}$ if $(h, W) = (u(x_i^K), W^{\mathrm{feat}})$, and $a = a^{\mathrm{class}}$ if $(h, W) = (h^{\mathrm{feat}}, W^{\mathrm{class}})$.
Function file: “neuralNetworkPreActivation.m”
Input format:
- $L_{\mathrm{in}} \times 1$ vector representing the layer input ($L_{\mathrm{in}} \in \{K, L\}$): layer_input = $h$
- $L_{\mathrm{out}} \times L_{\mathrm{in}}$ matrix representing the layer weights ($L_{\mathrm{out}} \in \{L, 1\}$): layer_weights = $W$
Output format: $L_{\mathrm{out}} \times 1$ vector of pre-activations: pre_activations = $a$
Function signature: function pre_activations = neuralNetworkPreActivation(layer_input, layer_weights)
Question 6.2 [5 points]
Complete the function “neuralNetworkActivation”, which takes as input the pre-activation vector $a^{\mathrm{feat}}$ and the scalar $\alpha \in [0, 1]$, and returns the output of the feature extraction layer $h^{\mathrm{feat}} = h(a^{\mathrm{feat}})$, where $h(\cdot)$ is applied element-wise to the vector $a^{\mathrm{feat}}$.
Function file: “neuralNetworkActivation.m”
Input format:
- Leaky ReLU scalar coefficient: alpha = $\alpha$
- L × 1 vector of pre-activations: pre_activations = $a^{\mathrm{feat}}$
Output format: L × 1 vector of element-wise activations (i.e., layer output): layer_output = $h^{\mathrm{feat}}$
Function signature: function layer_output = neuralNetworkActivation(alpha, pre_activations)
Question 6.3 [5 points]
Complete the function “neuralNetworkForwardPass”, which takes as input a covariate embedding feature vector $u(x_i^K)$, the layer weights $W^{\mathrm{feat}}$ and $W^{\mathrm{class}}$, and the leaky ReLU coefficient $\alpha \in [0, 1]$, and returns the neural network soft prediction $p(t_i = 1 | x_i^K, \theta)$ for $\theta = \{W^{\mathrm{feat}}, W^{\mathrm{class}}\}$.
Function file: “neuralNetworkForwardPass.m”
Input format:
- Leaky ReLU scalar coefficient: alpha = $\alpha$
- K × 1 input vector for the feature extraction layer: embedded_input = $u(x_i^K)$
- L × K weights matrix of the feature extraction layer: weights_feat = $W^{\mathrm{feat}}$
- 1 × L weights row-vector of the classification layer: weights_class = $W^{\mathrm{class}}$
Output format: soft prediction scalar value: soft_prediction = $p(t_i = 1 | x_i^K, \theta)$
Function signature: function soft_prediction = neuralNetworkForwardPass(alpha, embedded_input, weights_feat, weights_class)
Question 7: Backward Pass [25 points in total]
In the following questions, we will implement the backward pass of the neural network step by step.
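The objective is to compute the derivatives of the log-loss $\ell_i(\theta) = -\log(p(t_i = t_i | x_i^K, \theta))$ with respect to each element of $\theta = \{W^{\mathrm{feat}}, W^{\mathrm{class}}\}$ for a given datapoint $(x_i^K, t_i) \in \mathcal{D}$. Accordingly, we will write the derivatives of $\ell_i(\theta)$ with respect to each weight of the matrix $W^{\mathrm{feat}} = [w_{l,k}^{\mathrm{feat}}]_{(k,l) \in \{1, \dots, K\} \times \{1, \dots, L\}}$ as the $L \times K$ matrix
$$\nabla_{W^{\mathrm{feat}}} \ell_i(\theta) = \begin{bmatrix} \frac{\partial \ell_i(\theta)}{\partial w_{1,1}^{\mathrm{feat}}} & \cdots & \frac{\partial \ell_i(\theta)}{\partial w_{1,K}^{\mathrm{feat}}} \\ \vdots & \ddots & \vdots \\ \frac{\partial \ell_i(\theta)}{\partial w_{L,1}^{\mathrm{feat}}} & \cdots & \frac{\partial \ell_i(\theta)}{\partial w_{L,K}^{\mathrm{feat}}} \end{bmatrix}.$$
Similarly, we define the $1 \times L$ row vector $\nabla_{W^{\mathrm{class}}} \ell_i(\theta)$ of the derivatives of $\ell_i(\theta)$ with respect to the elements of $W^{\mathrm{class}}$. Note that $\ell_i(\theta)$ is sometimes (equivalently) expressed as $\ell_i(\theta) = -\log(\sigma(t_i^{\pm} a^{\mathrm{class}}))$, with the signed target $t_i^{\pm} = 2t_i - 1$, in the lecture slides.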
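We denote as
$$\delta^{\mathrm{class}} = \frac{\partial \ell_i(\theta)}{\partial a^{\mathrm{class}}}$$
the scalar backpropagation error of the classification layer, and as
$$\delta^{\mathrm{feat}} = \left[\frac{\partial \ell_i(\theta)}{\partial a_1^{\mathrm{feat}}}, \dots, \frac{\partial \ell_i(\theta)}{\partial a_L^{\mathrm{feat}}}\right]^\top$$
the $L \times 1$ backpropagation error vector of the feature extraction layer, for $a^{\mathrm{feat}} = [a_1^{\mathrm{feat}}, \dots, a_L^{\mathrm{feat}}]^\top$.
Following the backward version of the neural network computational graph, these two quantities can be related as
$$\delta^{\mathrm{feat}} = h'(a^{\mathrm{feat}}) \odot \left((W^{\mathrm{class}})^\top \delta^{\mathrm{class}}\right),$$
where $h'(a^{\mathrm{feat}}) = dh(a^{\mathrm{feat}})/da^{\mathrm{feat}}$ is the derivative of the activation function $h(\cdot)$ evaluated element-wise at $a^{\mathrm{feat}}$, and $\odot$ denotes the element-wise multiplication of two vectors.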
2023-12-19