Coursework 1

Published: 2023-10-28

Guidelines

Setting up the coursework

To start, download the file "cw1.zip" from the module’s KEATS website. Once this is done:

1. Unzip the file "cw1.zip" in a folder of your choice. We will refer to this folder as ""

2. Change the name of your unzipped folder to your k-number. For instance, if your k-number is "k12345678", the file "run_coursework.m" should be located at "/k12345678/run_coursework.m".

3. Open your MATLAB editor, and make sure that the file explorer (upper left section of your editor) is located at "/k12345678/" (otherwise, the code will not run).

Instructions

You should now find two MATLAB scripts ("run_coursework.m" and "check_dimensions.m"), a series of MATLAB function files, two ".mat" files, and this file ("coursewor_1.pdf").

In this coursework, you will complete a series of functions that will be called in the script "run_coursework.m". By running the file "run_coursework.m", you will be able to inspect the results of the functions you modified.

You should modify ONLY the code inside the mentioned functions at each question. Most importantly, you should NOT edit the scripts "run_coursework.m" or "check_dimensions.m"!

The functions to edit will be indicated at the end of each question. We will also specify the format of the inputs and the required format of the outputs.

For instance, for a function named "example", which sums the entries of a vector v, the question will specify the following:

----

Function file: "example.m"

Input format: N×1 vector v

Output format: scalar value sum_v

Function signature (name of the inputs and outputs in the code):

function sum_v = example(v)

----

The file "example.m" will then contain the following:

function sum_v = example(v)
% Write your code here
sum_v = rand(1); % Placeholder (to delete and replace upon completion of the question)
end

By default, each function will have a placeholder output ("sum_v = rand(1)" in this example) that ensures that the code runs, even if the result is purely random. If you skip a question, you should leave the placeholder to ensure the file "run_coursework.m" still runs. If you complete a question, you will have to remove the placeholder and replace it with your answer.
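As an illustration only, a completed version of "example.m" might look like the sketch below. It assumes that the built-in function "sum" is an acceptable way to add the entries of v, and it keeps the function name and the input/output names given in the signature.

```matlab
function sum_v = example(v)
% EXAMPLE Sum the entries of an N-by-1 vector v (illustrative completion).
% The placeholder "rand(1)" has been removed and replaced by the answer.
sum_v = sum(v);
end
```

Calling example([1; 2; 3]) with this sketch returns the scalar 6.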

You ONLY need to modify the code INSIDE the function. You must NOT modify the name of the function, and you must NOT modify the signature of the function (i.e., the name of the inputs/outputs)!

Once a function has been coded, you should run the script "run_coursework.m" to make sure the function does the intended operation. Note that the file "run_coursework.m" will run all of the coursework at once, so you might want to run only the lines of "run_coursework.m" up to your current question. Comments in "run_coursework.m" indicate which lines concern which questions.

You are not allowed to use MATLAB toolboxes: the coded functions should only contain built-in MATLAB functions such as the ones seen in lectures or tutorials (e.g., "mean", "sum", "*", ".*", "binornd", ...).

Please avoid submitting functions that display text. It is recommended to use the MATLAB debugger instead of the function "disp" to inspect the behaviour of your code.

Submitting your coursework

Before submitting your work, you should clear your workspace (right click on the "Workspace" section of your editor > "Clear Workspace") and verify that the script "run_coursework.m" runs well and gives the intended results.

You should also run the file "check_dimensions.m" to make sure that the outputs of each function have the right dimensions.

No points will be awarded for functions that do not output the right dimensions or for functions that raise an error.

To submit your work, compress the folder containing the MATLAB files into a ZIP file with your k-number as its name. For instance, if your k-number is "k12345678", the ZIP file should be "k12345678.zip". Please verify that the ZIP file directly contains your MATLAB files, and not an intermediary folder.

Finally, submit your ZIP file over KEATS.

Introduction

This coursework will explore how machine learning can be used to analyze text data. It is divided into two parts:

• Part I: prediction of the next word given the previous word in a sentence based on a given model;

• Part II: training of a classifier that can predict the next word based on the k ≥ 1 previous words.

A written sentence can be viewed as a sequence of words. We will use the notation $s_K$ to denote a sequence of K words. For instance, the sentence "Hello, my name is Sam." is given by the 5-word sequence (disregarding the punctuation and upper/lower cases):

$s_5$ = ["hello", "my", "name", "is", "sam"].

Note that two different orderings of the same words represent two distinct sequences.

For instance, ["my", "name", "is", "sam"] and ["sam", "is", "my", "name"] are two different sentences.

The words composing a sentence are taken from a discrete vocabulary set $V = \{w_m\}_{m=1}^{M}$ of M different words $w_m$ for $m \in \{1, \dots, M\}$.

For instance, with the vocabulary set V = ["sam", "i", "hello", "is", "my", "name", "am"], we can create the sequences:

$s_5 = [w_3, w_5, w_6, w_4, w_1]$ = ["hello", "my", "name", "is", "sam"]

and

$s_3 = [w_2, w_7, w_1]$ = ["i", "am", "sam"].

In order to represent the M words of the vocabulary as numbers, we define the discrete set $\mathcal{X} = \{1, \dots, M\}$, where $x \in \mathcal{X}$ represents the word $w_x$ in V.

For instance, if the vocabulary is V = ["my", "name", "is", "sam", "i", "am"], the sequence $s_4 = [w_1, w_2, w_3, w_4]$ = ["my", "name", "is", "sam"] can be expressed as the vector x = [1, 2, 3, 4], and the sequence $s_3 = [w_5, w_6, w_4]$ = ["i", "am", "sam"] can be represented by the vector x = [5, 6, 4].
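The encoding of words as integers can be sketched in MATLAB as follows. This is only an illustration: the vocabulary ordering is the one implied by the indices in the example above, and the variable names (s4, s3, x4, x3) are hypothetical.

```matlab
% Toy vocabulary of the example (1-by-6 string row-vector).
V = ["my", "name", "is", "sam", "i", "am"];

% Two sequences of words, written as string row-vectors.
s4 = ["my", "name", "is", "sam"];
s3 = ["i", "am", "sam"];

% The second output of ismember gives the position of each word in V
% (a 0 would indicate a word that is not in the vocabulary).
[~, x4] = ismember(s4, V);
[~, x3] = ismember(s3, V);
```

With this vocabulary, x4 is [1 2 3 4] and x3 is [5 6 4], matching the vectors given in the text.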

From a probabilistic perspective, a sequence of K words is modelled as a discrete K×1 random vector $\mathbf{x}_K = [x_1, \dots, x_K]$, where the random variable $x_k$ represents the k-th word in the sequence. For $k \in \{1, \dots, K\}$, the random variable $x_k$ takes values in the set $\mathcal{X} = \{1, \dots, M\}$, and a realization $\{x_k = x\}$ represents the word $w_x$ in the vocabulary V.

For instance, a realization $\{\mathbf{x}_3 = [5, 6, 4]\}$ represents the sequence $s_3 = [w_5, w_6, w_4]$ = ["i", "am", "sam"], if the vocabulary is V = ["my", "name", "is", "sam", "i", "am"].

Throughout this coursework, we will work with a vocabulary V of M = 10 words given as

V = [
    "it", "is", "a", "the", "nice", ...
    "good", "day", "evening", "not", "or" ...
];

Part I

In the first part of this coursework, we will analyze sequences of two words taken from the vocabulary set V.

Accordingly, each 2-word sequence will be represented by a random vector $\mathbf{x}_2 = [x_1, x_2]$, where $x_1$ and $x_2$ take values in the set $\mathcal{X} = \{1, \dots, 10\}$.

Throughout this part, some functions will take as input the M×M matrix P_joint, such that the element at the i-th row and j-th column of the matrix represents the joint probability $P(x_1 = i, x_2 = j)$ for $i \in \mathcal{X}$ and $j \in \mathcal{X}$.
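To make the row/column convention of the joint probability matrix concrete, here is a small sketch with a made-up 2×2 matrix. The values and the names P_toy, p_21 and is_valid are hypothetical and are not part of the coursework files.

```matlab
% Hypothetical joint distribution for a toy vocabulary of M = 2 words.
% Rows index the first word x1, columns index the second word x2.
P_toy = [0.10, 0.30;
         0.40, 0.20];

% P(x1 = 2, x2 = 1) is the element at row 2, column 1.
p_21 = P_toy(2, 1);

% A valid joint distribution has non-negative entries that sum to 1
% over all (i, j) pairs; a small tolerance absorbs rounding errors.
is_valid = all(P_toy(:) >= 0) && abs(sum(P_toy(:)) - 1) < 1e-12;
```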

Question 1 [10 points]

Complete the function marginalx1(P_joint) that takes as input the matrix P_joint defined above, and returns the marginal probability distribution $P(x_1) = [P(x_1 = 1), \dots, P(x_1 = 10)]$.

----

Function file: "marginalx1.m"

Input format: M×M matrix P_joint

Output format: M×1 vector $P(x_1)$ (denoted as px1 in the code)

Function signature:

function px1 = marginalx1(P_joint)

----

Question 2 [10 points]

Complete the function probNextWord(P_joint, px1) that takes as input the matrix P_joint defined at the beginning of Part I and the marginal probability distribution $P(x_1)$ defined in Question 1, and returns the conditional probability distribution $P(x_2 \mid x_1)$ as the M×M matrix P_cond.

----

Function file: "probNextWord.m"

Input format:

• M×M matrix P_joint

• M×1 vector $P(x_1)$ (denoted as px1 in the code)

Output format: M×M matrix P_cond

Function signature:

function P_cond = probNextWord(P_joint, px1)

----

Question 3 [10 points]

Complete the function sampleNextWord(P_cond, x1) that takes as inputs the matrix P_cond defined in Question 2 and a realization $x$ of the first word $x_1$, and returns a realization of the next word $x_2$ given $\{x_1 = x\}$, i.e., a sample $x_2 \sim P(x_2 \mid x_1 = x)$.

Note that $x_2$ takes values in $\{1, \dots, M\}$ only.
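As background, one common way to draw a sample from a discrete distribution with built-in functions only is to compare a uniform random number with the cumulative sums of the probabilities. The sketch below applies this to a made-up distribution over three values; the vector p and the names u and x are hypothetical, and this is one possible technique rather than the required one.

```matlab
% Hypothetical probability vector over the values {1, 2, 3}.
p = [0.2, 0.5, 0.3];

% Uniform random number in the open interval (0, 1).
u = rand(1);

% cumsum(p) is [0.2, 0.7, 1.0]; the sample is the first index whose
% cumulative probability reaches u, so value m is drawn with probability p(m).
x = find(u <= cumsum(p), 1);
```

The result is always an integer between 1 and the length of p, as long as the entries of p sum to 1.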

----

Function file: "sampleNextWord.m"

Input format:

• M×M matrix P_cond,

• scalar value x1;

Output format: scalar value x2

Function signature:

function x2 = sampleNextWord(P_cond, x1)

----

Question 4 [5 points]

Complete the function sampleSequence(P_cond, x1, K) that takes as inputs:

• the matrix P_cond defined in Question 2,

• a realization $x$ of the first word $x_1$ of the sequence,

• the number K ≥ 2 of words in the sequence;

and returns a K×1 vector corresponding to a realization of the random vector $\mathbf{x}_K = [x_1, \dots, x_K]$ given $\{x_1 = x\}$.

We assume here that the distribution of each word $x_k$ only depends on its previous word $x_{k-1}$ for k ≥ 2, i.e., $P(x_k \mid x_1, \dots, x_{k-1}) = P(x_k \mid x_{k-1})$.
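As a consequence of this assumption, and by the chain rule of probability, the distribution of the remaining words given the first one factorizes into a product of one-step conditional distributions. The notation below follows the rest of this document:

```latex
P(x_2, \dots, x_K \mid x_1)
  = \prod_{k=2}^{K} P(x_k \mid x_1, \dots, x_{k-1})
  = \prod_{k=2}^{K} P(x_k \mid x_{k-1})
```

In other words, a sequence can be generated one word at a time, each word being drawn given only the word before it.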

----

Function file: "sampleSequence.m"

Input format:

• M×M matrix P_cond,

• scalar value x1,

• integer K ≥ 2;

Output format: K×1 vector x_K

Function signature:

function x_K = sampleSequence(P_cond, x1, K)

----

Question 5 [5 points]

Complete the function sequenceToWords(V, x_K) that takes as inputs:

• the 1×M vocabulary row-vector V,

• a K×1 vector $\mathbf{x}_K = [x_1, \dots, x_K]$ with $x_k \in \{1, \dots, M\}$;

and returns the 1×K sequence of words $s_K = [w_{x_1}, \dots, w_{x_K}]$ represented by the vector $\mathbf{x}_K$.

You can use the MATLAB function strings(1,K) to initialize a 1×K row-vector with empty strings.

----

Function file: "sequenceToWords.m"

Input format:

• 1×M row-vector V,

• K×1 vector x_K;

Output format: 1×K row-vector of text values s_K

Function signature:

function s_K = sequenceToWords(V, x_K)

----

Part II

In this second part, we are given a dataset of N sentences of length K, where $x_{n,k}$ is the k-th word in the n-th sentence, for $k \in \{1, \dots, K\}$ and $n \in \{1, \dots, N\}$. As explained in the introduction, the integer $x_{n,k}$ represents the word $w_{x_{n,k}}$ in the given vocabulary $V = \{w_m\}_{m=1}^{M}$ of M = 10 words.

The objective of this part will be to train a hard predictor with k×1 parameter vector $\theta_k$ capable of predicting the next word based on the k previous words, represented by a k×1 vector, with $k \in \{1, \dots, K-1\}$. For this, we will use a fraction of the available dataset to train the hard predictor that minimizes the mean squared error (MSE). The predictor relies on the function round(a), which returns the nearest integer to a real number $a \in \mathbb{R}$. The remaining fraction of the dataset will be used to assess the performance of the trained predictor.

Accordingly, for a given $k \in \{1, \dots, K-1\}$, we will regroup all the predictor inputs available in the dataset into an N×k input matrix $X_k$, whose n-th row contains the k first words $[x_{n,1}, \dots, x_{n,k}]$ of the n-th sentence, for $n \in \{1, \dots, N\}$. The corresponding predictor targets (i.e., the next word after the k first words) are grouped into an N×1 target vector $t_k = [x_{1,k+1}, \dots, x_{N,k+1}]^T$, where $x_{n,k+1}$ represents the (k+1)-th word of the n-th sentence. We will also use the notation $X_K$ to refer to the N×K data matrix containing all of the dataset.

We provide a function leastSquaresSolver(X_k, t_k) which takes as input an n×k input data matrix $X_k$ and its corresponding targets as an n×1 vector $t_k$, for any number of rows n; and outputs the optimal k×1 parameter vector $\theta_k$ of the predictor with respect to the MSE. This function can be found in the file "leastSquaresSolver.m" and its function signature is:

function theta_k = leastSquaresSolver(X_k, t_k)

This function can be called at any point in the code to obtain the optimal parameter vector $\theta_k$.
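The sketch below illustrates what a least-squares fit of this kind computes, using MATLAB's backslash operator on made-up data. It is an assumption that the provided "leastSquaresSolver.m" returns the same minimizer; its actual implementation is not shown in this document, and the data and the name theta_ls are hypothetical.

```matlab
% Hypothetical toy data: n = 4 sentences, k = 2 previous words each.
X_k = [1 2; 2 3; 3 4; 4 6];
t_k = [3; 4; 5; 8];

% Assumption: the provided solver returns the vector theta minimizing
% sum((X_k * theta - t_k).^2). For a full-column-rank X_k, the backslash
% operator computes that minimizer.
theta_ls = X_k \ t_k;

% With the coursework files on the MATLAB path, the corresponding call is:
% theta_k = leastSquaresSolver(X_k, t_k);
```

In this toy case the targets satisfy t_k = X_k * [-1; 2] exactly, so theta_ls is [-1; 2] up to rounding errors.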

Question 6 [10 points]

Complete the function splitDatasetTrainTest(X, r) that takes as inputs:

• the N×K data matrix $X_K$ corresponding to the entire dataset defined at the beginning of Part II,

• a scalar value $r \in [0, 1]$ representing the train/test split ratio;

and outputs the training set as the $N_{tr} \times K$ training data matrix $X_{tr}$, where $N_{tr} = \mathrm{round}(rN)$, and the test set as the $N_{te} \times K$ test data matrix $X_{te}$, where $N_{te} = N - N_{tr}$. Note that the training set will contain the first $N_{tr}$ rows of the matrix $X_K$, i.e., the rows ranging from 1 to $N_{tr}$ (included), while the test set will contain the remaining $N_{te}$ rows of $X_K$, i.e., the rows ranging from $N_{tr} + 1$ to N.

This partition must not involve any randomness or re-ordering of the rows, and it must use the function "round" available in MATLAB.
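As a worked example of the split sizes only, the sketch below uses made-up values of N and r (both hypothetical) to show how "round" determines the number of training and test rows.

```matlab
N = 10;      % hypothetical number of sentences in the dataset
r = 0.75;    % hypothetical train/test split ratio

% round(7.5) gives 8, since MATLAB rounds ties away from zero.
N_tr = round(r * N);
N_te = N - N_tr;            % the remaining 2 rows

train_rows = 1:N_tr;        % rows 1 to 8
test_rows  = (N_tr + 1):N;  % rows 9 to 10
```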

----

Function file: "splitDatasetTrainTest.m"

Input format:

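• N×K data matrix $X_K$ (denoted as X in the code),

• scalar r;

Output format (in order):

• $N_{tr} \times K$ training data matrix $X_{tr}$ (denoted as X_tr in the code),

• $N_{te} \times K$ test data matrix $X_{te}$ (denoted as X_te in the code);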

Function signature:

function [X_tr, X_te] = splitDatasetTrainTest(X, r)

----

Question 7 [10 points]

Complete the function splitInputTarget(X, k) which takes as input:

• an n×K data matrix $X_K$, for any number of rows n,

• an integer $k \in \{1, \dots, K-1\}$ corresponding to the number of words to select in each sentence;

and outputs the n×k input matrix $X_k$ corresponding to the first k columns of the n×K data matrix $X_K$ (i.e., the columns of $X_K$ ranging from 1 to k included), and the n×1 target vector $t_k$ corresponding to the (k+1)-th column of $X_K$.

----

Function file: "splitInputTarget.m"

Input format:

• n×K data matrix $X_K$ (denoted as X in the code),

• scalar value $k \in \{1, \dots, K-1\}$

Output format:

• n×k input matrix $X_k$ corresponding to the first k words of each row in $X_K$,

• n×1 target vector $t_k$ corresponding to the (k+1)-th column of $X_K$;

Function signature:

function [X_k, t_k] = splitInputTarget(X, k)

----

Question 8 [10 points]

Complete the function rowWiseInnerProduct(X_k, theta_k) which takes as input:

• an n×k matrix $X_k$ composed of the first k columns of $X_K$,

• a k×1 parameter vector $\theta_k$;

and outputs the n×1 vector $o_k$ corresponding to the inner product of each row in $X_k$ with $\theta_k$, i.e., the i-th element of $o_k$ is the inner product of the i-th row of $X_k$ with $\theta_k$, for $i \in \{1, \dots, n\}$.

----

Function file: "rowWiseInnerProduct.m"

Input format:

• n×k matrix $X_k$

• a k×1 parameter vector $\theta_k$;

Output format: n×1 vector $o_k$

Function signature:

function o_k = rowWiseInnerProduct(X_k, theta_k)

----

Question 9 [10 points]

Complete the function predictNextWord(M, o_k) which takes as input the number M of words in the vocabulary V and an n×1 vector $o_k$ corresponding to the inner products of a k×1 parameter vector $\theta_k$ with n sentences, where the i-th sentence is represented as a k×1 vector, for $i \in \{1, \dots, n\}$; and outputs the n×1 vector $\hat{t}_k$ representing the outputs of the predictor for each input sentence.

----

Function file: "predictNextWord.m"

Input format:

• scalar M,

• n×1 vector $o_k$;

Output format: n×1 vector $\hat{t}_k$ (denoted as t_hat_k in the code)

Function signature:

function t_hat_k = predictNextWord(M, o_k)

----

Question 10 [10 points]

Complete the function which takes as input an n×1 vector of predicted targets and an n×1 vector of true targets t, and outputs the scalar value corresponding to the mean squared error (MSE).
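For reference, the standard definition of the mean squared error between n predicted targets and n true targets is given below. The symbols $\hat{t}_i$ and $t_i$, denoting the i-th predicted and true target, are notation assumed here for illustration.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{t}_i - t_i \right)^2
```

The MSE is zero only when every prediction equals its target, and it grows with the squared size of each error.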