
Department of Mathematics and Statistics

ST2LM / ST2LMD Linear Models

Assignment 2

This assignment is worth 15% of the overall module mark for ST2LM and 7.5% of the overall module mark for ST2LMD.

You must use SAS for each question, making sure that you include SAS output in your solutions, alongside your interpretation and explanation of what you’re doing, and why.

When you are ready to submit, go to the Blackboard course for this module, and from the menu select Assessment. Click on Assignment 2. Click Browse My Computer and find the file you want to submit. Make sure this is the correct file! Once your file has been selected, click Submit (note: you must click Submit, because if you only click Save Draft your work will not be submitted).

More information on submitting work to Blackboard can be found at https://sites.reading.ac.uk/tel-support-for-students/2018/08/31/blackboard-learn-a-students-guide-to-submitting-work/.

The deadline for handing in your solutions to this assignment is 12.00 (noon) UK time on Monday 4th December 2023. This deadline will be strictly adhered to, unless you are granted an extension for reasons of exceptional circumstances by the School Director of Academic Tutoring. Late work submitted up to one week after the deadline will be subject to the University penalty of a deduction of 10% of the available marks per working day late. Work submitted later than five working days after the deadline will not be looked at and will receive a mark of 0.

You will receive back your marked work, with feedback, within 15 working days of the original deadline (or within 15 working days of submission if you hand in later than the original deadline). You will receive an email to inform you when your marked work is available in Blackboard.

You will be assessed on the correct choice of methods and your implementation of them in SAS in order to do a full and appropriate analysis of the data, and on the extent and accuracy of your interpretation of results.

You must include your SAS code at the end of your document (clearly marked as to which question it relates to).

Question 1 (you can attempt parts (a) to (d) now; part (e) will need Lecture 16)

The file cars.csv can be downloaded from Blackboard. This file contains data for a measure of engine (fuel) economy for 32 different American car models, together with engine volume (in³, x1) and rear axle ratio (number of turns of rear axle per single revolution of wheels, x2).

We would like to build a model to describe how engine economy varies according to features of the car, which we can then use to predict economy for other cars.

(a)     By considering appropriate plots, and other information, comment on the relationship between the variables. [4 marks]

(b)     Consider simple linear regressions of economy on a single explanatory variable. Which variable (x1 or x2) explains most of the variation in economy? Is this what you would expect, given the plots you produced above? [6 marks]

(c)     Fit the full model (regressing economy on both engine volume and rear axle ratio) and fully interpret the output: include a discussion of the results of the hypothesis tests for the coefficients of the model. Does this seem a more reasonable model to explain economy than the simple linear regression models you fitted above? Which model would you choose to be your final one? [12 marks]

(d)     Use your preferred regression model from (c) to predict the economy of a car with an engine volume of 300 in³ and rear axle ratio 2.9. What is the 95% prediction interval for the estimated economy value? [6 marks]

(e)     Produce appropriate plots of the standardised residuals from the model that you have chosen to be the best in part (c) above. Summarise what each plot tells you about the suitability of the model you have fitted. [12 marks]

Question 2 (you will need material up to and including Lecture 16 to attempt this question)

Towards the end of this module we will cover the topic of sequential variable selection. This is a set of methods used to build a model that contains only those explanatory variables, out of a larger group, that are necessary for a good model, and excludes those that are non-significant. One of the methods is called stepwise selection, which can be applied in SAS using the following code:

PROC REG DATA = yourdataname;
    MODEL y = list of all numerical explanatory variables / SELECTION = STEPWISE
        SLENTRY = 0.1 SLSTAY = 0.1;
RUN;

You can see an example of this code in action in Lecture 18, but you don’t need to wait until then to tackle this question.

The final model produced by the method is the one containing the variables that are listed in the Variables Entered column of the output table and do not also appear in the Variables Removed column. For this assignment, use this code to obtain a list of the variables in your chosen model, and then use PROC GLM or PROC REG in the usual way to fit a model containing those variables.
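As a minimal sketch of this two-step workflow, suppose (purely hypothetically) that the Variables Entered column lists x1, x2 and x3, and that x2 also appears in the Variables Removed column. The final model would then contain x1 and x3 only, and could be refitted along the following lines. The names yourdataname, y, x1 and x3 are placeholders for illustration, not the columns of any data set used in this assignment.

```sas
/* Hypothetical example: stepwise selection entered x1, x2 and x3,  */
/* then removed x2, so the final model keeps x1 and x3 only.        */
/* yourdataname, y, x1 and x3 are placeholder names.                */
PROC REG DATA = yourdataname;
    MODEL y = x1 x3;   /* refit using only the retained variables */
RUN;
QUIT;                  /* end the interactive PROC REG session */
```

Refitting in this way gives the usual parameter estimates, tests and diagnostics for the selected model, rather than relying on the stepwise summary table alone.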

Genetic variation among all species is fundamentally due to gene mutations: a mutation event changes a gene from one type to a different type within an individual. Because of the importance of mutation, it is of interest to geneticists to be able to estimate the rate at which mutations occur, or more specifically, the scaled mutation rate parameter θ. This can be done by comparing genetic profiles of individuals within family trees (pedigrees), but this is very expensive. It is therefore preferable to estimate mutation rates based on samples from a population, and your task is to investigate whether a suitable multiple regression model can be used for this purpose.

The file mutation.csv, downloadable from Blackboard, contains 78 pedigree-based observations of ln(θ) together with values for some population genetics summary statistics (S1, S2 and S3) thought to be related to the mutation rate. The 78 observations relate to 39 populations of a butterfly species, with θ having been calculated for males and females separately (accordingly, the file contains a column specifying sex, and an indicator variable version of this column called sex01).

Perform a full analysis of these data, taking ln(θ) as the response, explaining what you are doing and why, and interpreting all results fully. (Note that your ultimate goal is to find a suitable model for predicting ln(θ) based on only the most important explanatory variables. Remember to check that your final model is reasonable.) [60 marks]