
MATH6168W1

SEMESTER 2 EXAMINATION 2023/24

Math6168 Machine Learning: Coursework 1

Due Date: 11pm on Thu. 21 March in Week 8.

Worth: 100 marks (20% of the final result).

(a) Coursework must be handed in online on the Math6168 Blackboard module by the deadline specified above.

(b) Standard university guidelines will be followed for late coursework.

(c) All coursework must be carried out and written up independently. Standard School of Mathematics guidelines will be used to detect excessive collaboration and plagiarism, and appropriate penalties will be issued if required. All suspicious cases will be referred to the academic integrity officer!

(d) The page limit (specified in the question(s)) is strict and is easily sufficient to receive full credit. All materials related to a question, including plots and any appendices, must fall within these limits. You may, but do not have to, submit the computer code which you used for the analysis (unless it is requested in the question), but you should explain clearly what analysis has been done, with justification. Marks will be deducted for work which exceeds the page limits. If you have too much material then you need to decide what is important and what can be left out.

(e) The questions involve the modelling of real data. There is not necessarily a single ‘correct’ answer. Submissions which demonstrate a good appreciation of statistical modelling principles, together with correct application of appropriate methods, will receive high marks.

1. [Total 100 marks, maximum 5 sides of A4]

The data in the file WeeklyPart.txt, available on Blackboard, provide weekly percentage returns for the S&P 500 stock index between 1990 and 2010. The data frame has 1089 observations on 7 variables: Lag1 (percentage return for the previous week), Lag2 (percentage return for 2 weeks previous), Lag3 (percentage return for 3 weeks previous), Lag4 (percentage return for 4 weeks previous), Lag5 (percentage return for 5 weeks previous), Volume (volume of shares traded: average number of daily shares traded, in billions) and Today (percentage return for this week).

(A) Describe how you develop a linear model to identify whether Today (percentage return for this week) depends on the other variables, showing which of these other variables are significant. Explain and build your optimal linear model using backward selection. Give careful justification and reasoning. [25 marks]
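As a hedged illustration of the backward-selection idea named in (A), the R sketch below starts from the full linear model and removes terms one at a time. It runs on a synthetic stand-in data frame (the object `sim` and its made-up coefficients are assumptions, not the WeeklyPart data), and it uses AIC-based elimination via `step()`, which is only one common variant; elimination by p-value with `drop1()` is another. It is not a model answer.

```r
# Synthetic stand-in with the same column names as WeeklyPart.txt.
# ASSUMPTION: these values are simulated and say nothing about the real data.
set.seed(1)
n <- 200
sim <- data.frame(
  Lag1 = rnorm(n), Lag2 = rnorm(n), Lag3 = rnorm(n),
  Lag4 = rnorm(n), Lag5 = rnorm(n), Volume = runif(n, 0.1, 9)
)
sim$Today <- 0.1 - 0.3 * sim$Lag1 + rnorm(n)  # arbitrary made-up relationship

# Full model containing every candidate predictor.
full <- lm(Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = sim)
summary(full)$coefficients  # estimate, standard error, t value, p-value per term

# Effect of deleting each single term from the full model (F-tests).
drop1(full, test = "F")

# Backward elimination: repeatedly drop the term whose removal lowers AIC most,
# stopping when no single deletion lowers AIC further.
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)

# Partial F-test of the reduced model against the full model.
anova(reduced, full)
```

With the real data, the same calls would be applied to the data frame read from WeeklyPart.txt, and the choice of stopping rule (AIC, p-value threshold, or adjusted R-squared) would need its own justification.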

(B) In the following questions you will develop a model to predict whether Today (percentage return for this week) is positive or negative, based on the WeeklyPart data set.

(a) Define a binary variable, Today01, that takes the value 1 if the value of Today is above zero, and 0 otherwise, and write R code to create this variable. Consider splitting the data into a training set and a test set in the proportions 75% and 25%, respectively. Describe how you do the splitting in R. [10 marks]

(b) Describe how you investigate the association between Today01 and the other features in WeeklyPart by exploring the training data graphically. Explain, with justification, which of these other features seem most likely to be useful in predicting Today01. Give your reasoning in your answer. [10 marks]

(c) Describe, with explanation or justification, how you perform logistic regression on the training data in order to predict Today01 using the variables that seemed most associated with Today01 in (b). What is the test error (misclassification) of the model obtained? [15 marks]

(d) Explain the assumptions for LDA, and describe, with explanation or justification, how you perform LDA on the training data in order to predict Today01 using the variables that seemed most associated with Today01 in (b). What is the test error (misclassification) of the model obtained? [15 marks]

(e) Explain the assumptions for QDA, and describe, with explanation or justification, how you perform QDA on the training data in order to predict Today01 using the variables that seemed most associated with Today01 in (b). What is the test error (misclassification) of the model obtained? [15 marks]

(f) Compare the findings in terms of error rates from the three models in (c), (d) and (e), and explain the differences between, and respective advantages of, LDA, QDA and logistic regression for classification. [10 marks]
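The workflow described in parts (a) to (e) of (B) can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions: the data frame `sim` is simulated (it is not WeeklyPart), the predictors Lag1 and Lag2 are chosen arbitrarily rather than by the graphical exploration asked for in (b), the 0.5 probability cut-off is a conventional default, and `test_error` is a hypothetical helper name. `lda()` and `qda()` come from the MASS package.

```r
library(MASS)  # provides lda() and qda()

# ASSUMPTION: synthetic stand-in data; results here say nothing about WeeklyPart.
set.seed(1)
n <- 400
sim <- data.frame(Lag1 = rnorm(n), Lag2 = rnorm(n))
sim$Today <- 0.2 + 0.5 * sim$Lag2 + rnorm(n)  # arbitrary made-up relationship

# (a) Binary response: 1 if Today is above zero, 0 otherwise.
sim$Today01 <- as.integer(sim$Today > 0)

# (a) Random 75% / 25% split into training and test sets.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# (b) One possible graphical check: distribution of a feature within each class.
boxplot(Lag2 ~ Today01, data = train)

# Hypothetical helper: share of test rows whose predicted class is wrong.
test_error <- function(pred) {
  mean(as.character(pred) != as.character(test$Today01))
}

# (c) Logistic regression; predict class 1 when the fitted probability exceeds 0.5.
fit_glm  <- glm(Today01 ~ Lag1 + Lag2, data = train, family = binomial)
prob_glm <- predict(fit_glm, newdata = test, type = "response")
err_glm  <- test_error(as.integer(prob_glm > 0.5))

# (d) LDA: Gaussian classes sharing one covariance matrix.
fit_lda <- lda(Today01 ~ Lag1 + Lag2, data = train)
err_lda <- test_error(predict(fit_lda, newdata = test)$class)

# (e) QDA: Gaussian classes, each with its own covariance matrix.
fit_qda <- qda(Today01 ~ Lag1 + Lag2, data = train)
err_qda <- test_error(predict(fit_qda, newdata = test)$class)

# Test misclassification rates side by side.
c(logistic = err_glm, LDA = err_lda, QDA = err_qda)
```

Fixing the seed before `sample()` makes the split reproducible, and computing all three error rates on the same test set keeps the comparison in (f) like for like.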