STAT0030 Assessment 2 — Instructions

For this assessment you should submit online, on the course Moodle page, using the link “ICA2: Click here to submit your assignment”. Make sure none of the files contains your surname, as the marking must be anonymous. You must submit two files:

● An electronic copy of your StudentNumber.rmd file, containing your R markdown code. For example, if your student number is 18239004, your R markdown script should be saved in the file 18239004.rmd.

● A single PDF file named StudentNumber.pdf containing the knitted output of the R markdown file. This should correspond exactly to what is produced when knitting the submitted .rmd file.

Any output within your pdf should be clearly presented and structured according to the question parts. Your report (including the graphics but excluding the hidden code) should not exceed 5 pages.

STAT0030 Assessment 2 – Marking guidelines

The assessment is marked out of 40. The marks are roughly subdivided into the following components.

1. Exploratory analysis (5 marks): investigation of, and commentary on, initial statistical properties, relationships, and anything of note which helps justify your choice of graphs and modelling strategy.

2. Graphical presentation (5 marks): appropriate choice of graphs and formatting.

3. Modelling strategy (10 marks): marks here will be based on a structured, justified, well-principled approach with clear and concise discussion.

4. Interpretation of final model (10 marks): comparison of the two final models and commentary on their quality.

5. Quality of the code (10 marks): your code should be clean, readable (with sufficient commenting for the user) and efficient.

STAT0030 Assessment 2 — Questions

1 Introduction to Ridge Regression

In Lab 4 we found out how to fit linear models in R. Recall that linear models take the form

$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i,$$

for i = 1, . . . , n. Here,

● yi is the value of the response (or dependent) variable for the ith case in the dataset.

● xij is the value of the jth explanatory variable or covariate for that case.

● β0 , . . . , βp are parameters.

● ε1, . . . , εn are independent error terms with zero mean, assumed to have constant variance and to be normally distributed (unless otherwise stated).

The coefficients β0, β1, . . . , βp are usually estimated by minimising the residual sum of squares,

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip})^2.$$

The resulting estimator is called the Least Squares Estimator, which also happens to be the maximum likelihood estimator in the case when the errors are assumed to be normally distributed with variance σ².
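As a minimal sketch of the least squares idea, the R code below simulates a small dataset, writes the RSS as a function of the coefficient vector, and checks that the solution of the normal equations agrees with lm(). The simulated data and all object names (x1, x2, rss, beta_ls, beta_lm) are illustrative assumptions and are not part of the assessment data.

```r
# Simulated data: names and values here are illustrative assumptions
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Residual sum of squares as a function of the coefficient vector
rss <- function(beta) sum((y - X %*% beta)^2)

# Least squares estimate from the normal equations X'X beta = X'y
beta_ls <- unname(drop(solve(crossprod(X), crossprod(X, y))))

# lm() minimises the same criterion, so its coefficients should match
beta_lm <- unname(coef(lm(y ~ x1 + x2)))

print(rbind(normal_equations = beta_ls, lm = beta_lm))
```

Perturbing beta_ls in any direction should increase rss(), which is a quick way to convince yourself that the estimate is a minimiser.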

However, the least squares estimator can overfit, especially when a large number of covariates are used. This means that the linear fit will pick up random noise in the observed data and will not be a good predictor of future observations. The simplest way of dealing with this issue is best subset selection, where a linear model is fitted to every possible subset of the covariates and the fits are compared using some model selection criterion (for example, the Akaike Information Criterion, or predictive power assessed through cross-validation). Unfortunately, best subset selection is computationally prohibitive for large numbers of covariates. Instead, stepwise regression is often used, where covariates are iteratively added or removed from the model according to their p-value. However, stepwise regression is sensitive to the order in which covariates are added or removed and is not guaranteed to result in the best overall subset of covariates.
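To illustrate why best subset selection becomes expensive, the sketch below enumerates every subset of p = 4 simulated covariates and compares the fits by AIC. With p covariates there are 2^p candidate models, so the loop grows exponentially in p. The data, the choice of AIC as the criterion, and the names subsets, aic and best are assumptions made for this example only.

```r
# Simulated data with p = 4 covariates; only x1 and x3 truly matter
set.seed(2)
n <- 60
p <- 4
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
dat <- data.frame(y = 1 + 2 * X[, 1] - X[, 3] + rnorm(n), X)

# Each row of 'subsets' flags which covariates are included: 2^p rows in total
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), p))

# Fit one linear model per subset and record its AIC
aic <- apply(subsets, 1, function(keep) {
  vars <- colnames(X)[keep]
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  AIC(lm(as.formula(paste("y ~", rhs)), data = dat))
})

# Covariates in the subset with the smallest AIC
best <- colnames(X)[unlist(subsets[which.min(aic), ])]
cat("Number of models fitted:", nrow(subsets), "\n")
cat("Best subset by AIC:", paste(best, collapse = ", "), "\n")
```

Here 16 models are fitted; with 30 covariates the same loop would need more than a billion fits.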

An alternative approach to this problem is penalised regression. In penalised regression, the objective function adds a penalty term to the residual sum of squares, which represents a “cost” for large values of the regression coefficients. The simplest form of penalisation is the squared L2 norm of the coefficient vector, also called the Ridge penalty. The loss function in Ridge Regression is then given by

$$L_{\mathrm{ridge}} = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$

which is minimised with respect to the coefficient vector β to obtain penalised parameter estimates. Here, λ is a tuning parameter which represents the level of regularisation: a value of λ = 0 represents no regularisation (resulting in the standard least squares estimators), whereas letting λ → ∞ corresponds to total regularisation, i.e., all coefficients except β0 are forced to 0. The optimal value of λ is typically chosen by setting up a grid of s values λ1, . . . , λs and computing cross-validated performance (for example, in terms of mean squared error) for the model fit resulting from each value of λ. The corresponding fit will be sensitive to the scale of each covariate, thus scaling of the covariates prior to model fitting is often applied. One key advantage of the ridge penalty is that the optimisation of the loss function is still convex and thus computationally simple. One disadvantage is that ridge regression will always include all covariates in the model, so it does not perform any variable selection.
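The effect of λ can be seen in a small numerical sketch. The code below minimises the ridge loss with nlm on simulated, scaled covariates for a short grid of λ values, and compares the result with the closed-form ridge solution that applies when the covariates are centred and the intercept is left unpenalised. The simulated data, the λ grid and the names ridge_loss, fits and closed_form are assumptions for illustration and are not a prescribed solution to the assessment.

```r
# Simulated data: three scaled covariates, the third has no true effect
set.seed(3)
n <- 100
p <- 3
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(2 + X %*% c(1.5, -1, 0) + rnorm(n))

# Ridge loss: RSS plus lambda times the sum of squared slopes.
# beta[1] is the intercept and is not penalised.
ridge_loss <- function(beta, y, X, lambda) {
  resid <- y - beta[1] - X %*% beta[-1]
  sum(resid^2) + lambda * sum(beta[-1]^2)
}

# Illustrative grid, from heavy regularisation down to none
lambdas <- c(100, 10, 1, 0)

# One row of estimates (intercept, then slopes) per lambda value
fits <- t(sapply(lambdas, function(lam) {
  nlm(ridge_loss, p = rep(0, p + 1), y = y, X = X, lambda = lam)$estimate
}))

# Closed form for centred covariates: intercept is mean(y),
# slopes solve (X'X + lambda I) b = X'y
closed_form <- function(lam) {
  c(mean(y), solve(crossprod(X) + lam * diag(p), crossprod(X, y)))
}

print(cbind(lambda = lambdas, round(fits, 3)))
```

Reading down the printed table, the slopes move away from zero as λ decreases, and the last row (λ = 0) should agree with the ordinary least squares fit.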

Your task for this assignment will be to write R code to compute ridge regression coefficient estimates for a given dataset, and to apply it to a dataset of Covid-19 case numbers from the UK’s first pandemic wave (see details below).

2 Covid-19 data overview

When the Covid-19 pandemic was first recognised in early 2020, it quickly became apparent that age was the main risk factor for becoming seriously ill or dying from the disease. Researchers have also identified other risk factors including gender, social deprivation, pre-existing health conditions and ethnicity.[0] Understanding these risk factors can potentially help to develop strategies for reducing deaths, for example by targeting appropriate healthcare resources in areas that need them the most. In the UK, the Office for National Statistics (ONS) publishes a variety of information on Covid. An ONS report from August 2020[1] produced a simple analysis of Covid death rates across England and Wales, between March and July 2020. In this assessment we will examine more closely the data used in that report and try to understand why some areas have more deaths than others, by linking to UK Census data on the socio-economic characteristics of the different areas.

We will use data consisting of the total numbers of reported deaths in the period March–July 2020, where Covid-19 was given as the cause of death, for each “Middle Layer Super Output Area” (MSOA) in England and Wales. According to the ONS report cited above, Super Output Areas are “small-area statistical geographies covering England and Wales”, each of which has a similarly sized population and remains stable over time. These data are from the ONS web site.[2] They have been combined with demographic and socioeconomic data from the most recent UK Census in 2011, obtained by querying datasets at the Nomis Labour Market Statistics service; and also with some geographic information from the UK’s Open Geography Portal.

[0] See, for example, Williamson et al. (2020): “Factors associated with COVID-19-related death using OpenSAFELY” (Nature 584, pp. 430–436).

[1] ONS Statistical Bulletin “Deaths involving COVID-19 by local area and socioeconomic deprivation: deaths occurring between 1 March and 31 July 2020”, published August 2020.

[2] Here and elsewhere, clicking on the blue text will take you to the relevant web site.

The data are provided in the file UKCovid1STAT0030.csv, available from the ‘In-course assessment 2’ section of the STAT0030 Moodle page. This contains a subsampled and anonymised version of the original data. Full details can be found in the Appendix to these instructions.

Your task in this assessment is to use the data in these 5,401 records to build a statistical model that will help you understand the social, demographic and economic factors associated with variation between MSOAs in numbers of Covid deaths during the period March–July 2020.

3 Instructions for the assessment

Your report should be structured according to the following 6 parts:

1. Write an R function called RidgeRegression with inputs a vector y of length n, a matrix X of size n × p, and a vector lambda of length s. Your function should iterate through each value of lambda; for each element of lambda, you should compute the ridge regression coefficient estimates by minimising the ridge loss (you can use nlm for the optimisation but you cannot use any other in-built R ridge regression functions). Your function should output beta, a matrix of size s × (p + 1) containing the estimated coefficients β̂0, β̂1, . . . , β̂p corresponding to the ridge regression fit using each element of lambda.

2. Load the Covid-19 data. Obtain summary statistics and make useful plots of the data, i.e., plots that are relevant to the objectives of the study. Such plots might include, but are not necessarily restricted to, pairwise scatter plots for quantitative variables with different plotting symbols or colours. Put plots together in a single figure where appropriate and consider the possibility of using log scales.

3. Use your data exploration above to remove covariates according to your judgement, carefully justifying your choices. After scaling your final set of covariates, use your RidgeRegression function to obtain penalised parameter estimates for a linear model predicting MSOA death rates, using the following set of λ values: (10, 1, 0.1, 0.01, 0.001). Although there will certainly be scope for model improvement by applying covariate transformations, you are advised against this for this assessment, but you may wish to comment on it in your final section.

4. Find an advanced regression model (for example, using gradient boosting or random forests, see Lab 7) to predict MSOA death rates using the available covariates. You are encouraged to consider a variety of models, but ultimately you are required to recommend a single model from this family. You may use a variety of criteria to decide on your model, including cross-validated predictive performance (or out-of-bag evaluation in the case of random forests). Clearly explain your reasoning and choices.

5. Perform 10-fold cross validation to compute the cross-validated Root Mean Square Error (RMSE) of each of your models in parts 3 and 4. For each of your folds, fit your six models (one for each λ in part 3 and one from part 4) on the data outside that fold and compute the “held-out” RMSE of each model on the fold itself, so that you obtain 10 sets of RMSE values. Perform a paired t-test to assess whether your advanced regression model results in better out-of-sample RMSE than the “best” ridge regression model.

6. Discuss the advantages and disadvantages of each of the models.
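The cross-validation and paired t-test in part 5 follow a general pattern that can be sketched on simulated data. In the code below, two ordinary lm fits stand in for the models being compared; these stand-ins, the simulated data frame dat, and the names fold, rmse, cv_rmse and tt are assumptions for illustration only, and the one-sided alternative is one possible reading of “better”.

```r
# Simulated data with a nonlinear effect of x2
set.seed(4)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + dat$x1 + 0.5 * dat$x2^2 + rnorm(n)

# Assign each row to one of k folds at random, with equal fold sizes
k <- 10
fold <- sample(rep(1:k, length.out = n))

# Root mean square error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# For each fold: fit on the other folds, evaluate on the held-out fold
cv_rmse <- t(sapply(1:k, function(f) {
  train <- dat[fold != f, ]
  test  <- dat[fold == f, ]
  fit_a <- lm(y ~ x1 + x2, data = train)             # simpler stand-in model
  fit_b <- lm(y ~ x1 + x2 + I(x2^2), data = train)   # richer stand-in model
  c(a = rmse(test$y, predict(fit_a, test)),
    b = rmse(test$y, predict(fit_b, test)))
}))

# One-sided paired t-test: is model b's held-out RMSE lower than model a's?
tt <- t.test(cv_rmse[, "b"], cv_rmse[, "a"], paired = TRUE, alternative = "less")
print(colMeans(cv_rmse))
print(tt$p.value)
```

The test is paired because both models are evaluated on the same ten held-out folds, so the comparison is made fold by fold rather than between two unrelated samples.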

Your .rmd file should include all your code but you should use the option echo = FALSE so that your code does not appear in the knitted report. You do not need to include all your output and graphics. Instead, include whatever details and output you think are important to your model building and conclusions. You can control whether any output from a code chunk is included or excluded from the knitted report using eval = TRUE and eval = FALSE in the R chunk options. Your report (including the graphics but excluding the hidden code) should not exceed 5 pages. Your report should be at a level that can be understood easily by somebody with an MSc in Statistics.

You are not allowed to use any packages for this assignment other than those included with ‘R-base’, or those included in the list of ‘R-recommended’ packages (https://cran.r-project.org/src/contrib/4.2.0/Recommended/), or those used in any of the workshops: if you load a package, please note in a comment which one of these sources you used for the package. You will not receive marks for sections of your answer that use R packages not from one of those three categories.

STAT0030 Assessment 2 — General hints

1. In general, there is not a single ‘right’ answer to each question. To obtain a good mark you should approach the questions sensibly and justify what you’re doing. Credit will be given for code that is clear and readable, while code that is inadequately commented will be penalised. You might like to use scripts cosapprox.r (Lab 1) and tablet.r (Lab 3) as models.

2. The assessment is designed to test your ability to use the computer to learn about a real data set. This will be assessed not only on your computing skills, but also on your ability to carry out a sensible and informed statistical analysis: material from your other courses will be relevant here. To earn high marks for this question, you need to take a structured and critical approach to the analysis and to demonstrate appropriate judgement in your choice of material to present.

3. Marks will be deducted if your .pdf file does not correspond exactly to the results we obtain when we knit the .rmd. You should assume that the input file is available at the same location as your .rmd file.

4. More credit will usually be given for code that is more generally applicable, rather than tailored to a particular situation or set of data. For example, if you were asked to print out the mean age of a group of people, you could do either of the following:

● Calculate the mean before you write your final script, and then insert a line

cat("Mean age is 25.3\n")

(or whatever the mean happens to be) into your script.

● In your script, create an object (say xbar) that holds the mean age, and then insert the line

cat(paste("Mean age is",xbar,"\n"))

into your script.

The second approach is clearly more general and will earn more credit, since it will work for other similar data also.

5. All graphs should be clearly and appropriately labelled (giving units of quantitative variables), titled and formatted. By ‘appropriately formatted’ we mean, for example, that axis scales should be well chosen.

6. Your program should be well commented. If you have defined functions, these should consist of a header section summarising the logical structure, followed by the main body of the script. The main body should itself contain comments.

7. Refer to the feedback you received on in-course assessment 1.