MTH6991/MTH791U/MTH791P

Computational Statistics with R

Exercise Sheet 9

Spring 2023

The questions in the first section are to be handed in for assessment, along with those in the first section of the previous exercise sheet. The link for submission will be in the week 11 section on QMPlus. The deadline is 13:00 on Thursday 20th April; late submissions will receive zero marks.

Problems for handing in

Questions 1 and 2 use a dataset on QMPlus. For each student, there should be a file called “exercise9 XYZ.txt”, where XYZ is your ID number (you need to be logged in to QMPlus). If you cannot see a file, please send me an email.

1.  (30 marks)

Hand in: an R script with all the R code used, plus a separate file with your solutions - write the solutions (briefly) in your own words rather than copying and pasting any console output from R.

The first five columns in the dataset are from patients following a procedure to treat a chronic condition. The outcome variable is success, which is 1 for success and 0 otherwise. We want to model the probability of success as a function of covariates. The four possible covariates are:

• sex: this is 1 for male, 0 for female;

• age, measured in years;

• smoke: this is 1 for smoker, 0 for non-smoker;

• proc: this is 0 for an established procedure, and 1 for a new procedure.

You are asked to do the following:

(a) Fit three logistic regression models in turn, each with only the intercept and one covariate: sex, age, and smoke (i.e., the first model has the intercept and covariate sex, and so on). For each model, use leave-one-out cross-validation and calculate a likelihood-based measure of predictive ability. Which model appears to have the best predictive ability?

(b) Take the best predictive model from part (a), and add proc as another covariate. Does adding this covariate seem to improve the predictive ability? According to this model, which procedure has the higher chance of success?
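
The leave-one-out procedure in Question 1 can be sketched as below. This is only an illustration under stated assumptions: the data are simulated stand-ins (not the QMPlus file), the helper name loo_loglik is hypothetical, and the sum of held-out log-likelihood contributions is one possible likelihood-based measure, which may differ from the one used in the lectures.

```r
# Simulated stand-in data; the real columns come from your own QMPlus file.
set.seed(1)
n <- 80
dat <- data.frame(
  sex   = rbinom(n, 1, 0.5),
  age   = round(runif(n, 30, 80)),
  smoke = rbinom(n, 1, 0.3),
  proc  = rbinom(n, 1, 0.5)
)
dat$success <- rbinom(n, 1, plogis(1.5 - 0.04 * dat$age + 0.8 * dat$proc))

# Hypothetical helper: leave-one-out log predictive likelihood.
# Each observation is left out in turn, the model is refitted on the rest,
# and the log-probability of the held-out outcome is recorded.
loo_loglik <- function(formula, data) {
  y <- data[[all.vars(formula)[1]]]
  ll <- numeric(nrow(data))
  for (i in seq_len(nrow(data))) {
    fit <- glm(formula, family = binomial, data = data[-i, ])
    p <- predict(fit, newdata = data[i, , drop = FALSE], type = "response")
    ll[i] <- dbinom(y[i], size = 1, prob = p, log = TRUE)
  }
  sum(ll)
}

# One-covariate models, as in part (a); larger (less negative) is better.
cv <- sapply(c("sex", "age", "smoke"), function(v) {
  loo_loglik(reformulate(v, response = "success"), dat)
})
print(cv)

# Part (b): add proc to the best one-covariate model and compare.
best <- names(which.max(cv))
cv_proc <- loo_loglik(reformulate(c(best, "proc"), response = "success"), dat)
print(c(best = max(cv), best_plus_proc = cv_proc))
```

The sign of the fitted proc coefficient in the final model (from summary() of a fit to all the data) indicates which procedure the model favours.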

2.  (35 marks) The dataset has two more variables, called “y” and “x”. We want to fit a flexible spline regression model of y against x using the sreg function.

• Find the optimal smoothing parameter (i.e., λ, in the notation of lecture 10 and practical 10).  Do this using cross-validation as implemented in practical 10 (or your own version of this), not by using the optimal λ that sreg can find.

NOTE: the optimal λ does not need to be found to high accuracy: within 10% is enough (i.e., if the true optimal is λ* and your solution is within the range (0.90 × λ*, 1.10 × λ*), it is fine).

What is the optimal λ? What is the PRESS statistic at this optimal λ?

• Plot the data points, plus the model prediction that results from your optimal λ, on the same graph.
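
The grid search over λ in Question 2 can be sketched as follows. Note the assumptions: the data are simulated, the helper name press is hypothetical, and base R's smooth.spline is swapped in for sreg so that the sketch runs without extra packages. The two functions scale λ differently, so the numerical values here do not carry over to sreg; only the structure of the search does.

```r
# Simulated stand-in data; replace with the y and x columns of your own file.
set.seed(2)
n <- 50
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)

# Hypothetical helper: PRESS statistic for a fixed lambda.
# Each point is left out in turn, the spline is refitted, and the squared
# prediction error at the held-out point is accumulated.
press <- function(lambda, x, y) {
  res <- numeric(length(x))
  for (i in seq_along(x)) {
    fit <- smooth.spline(x[-i], y[-i], lambda = lambda)  # stand-in for sreg
    res[i] <- y[i] - predict(fit, x = x[i])$y
  }
  sum(res^2)
}

# Coarse search on a log scale; refine around the minimum as needed.
lambdas <- 10^seq(-6, 0, by = 0.25)
press_vals <- sapply(lambdas, press, x = x, y = y)
best <- lambdas[which.min(press_vals)]
print(c(lambda = best, PRESS = min(press_vals)))

# Data points plus the fitted curve at the selected lambda.
fit_best <- smooth.spline(x, y, lambda = best)
grid <- seq(min(x), max(x), length.out = 200)
plot(x, y, main = "Spline fit at the selected lambda")
lines(grid, predict(fit_best, x = grid)$y)
```

A grid step of 0.25 on the log10 scale is too coarse for the 10% tolerance; a second, finer grid centred on the coarse minimum (for example, successive values differing by a factor of about 1.05) would be one way to reach it.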

Additional problems

3.  For a spline model of y against one covariate x, this question looks at the scale of the data and λ. By considering the penalty term P in the penalized sum of squares S_P, and also considering the sum of squares S, how would the optimal λ change if either:

(a) all values of the outcome y are multiplied by 10?

(b) all values of x are multiplied by 10?

Check what happens in both these cases using the diabetes data from practical 10.
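
As a hint for Question 3, the block below sets out how each term rescales. It assumes the usual cubic smoothing spline form, in which the penalty is the integrated squared second derivative; the exact notation in lecture 10 may differ (for example by a constant factor), and the constant c is introduced here only for illustration (c = 10 in the question).

```latex
% Assumed form of the criterion (notation may differ from lecture 10):
S_P = S + \lambda P, \qquad
S = \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2, \qquad
P = \int f''(x)^2 \, dx .

% Rescaling the outcome: \tilde{y}_i = c\, y_i, with fitted curve \tilde{f} = c f.
\tilde{S} = c^2 S, \qquad \tilde{P} = c^2 P .

% Rescaling the covariate: u = c x, with g(u) = f(u / c), so g''(u) = c^{-2} f''(u / c).
\tilde{S} = S, \qquad
\tilde{P} = \int g''(u)^2 \, du
          = c^{-4} \int f''(u / c)^2 \, du
          = c^{-3} P .
```

Comparing how S and P change in each case shows what must happen to λ for the penalized criterion to pick out the same fitted curve.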