Final Exam Term 2 2021

MATH5806

Applied Regression Analysis

1. [7 marks] General Quiz.

For each of the following questions, which is (are) the correct answer(s)?

i) [1 mark] A model with p explanatory variables was estimated, giving a coefficient of determination $R^2$ denoted by $R^2_p$. A new explanatory variable is now added to this model and, after fitting the new model with p + 1 variables, $R^2_{p+1}$ is calculated.

A. $R^2_{p+1}$ is always greater than (or equal to) $R^2_p$.

B. $R^2_{p+1}$ is sometimes smaller, sometimes larger, depending on the added variable.

C. $R^2_{p+1}$ is always smaller than (or equal to) $R^2_p$.

ii) [1 mark] Lasso regression can be seen as a regression that uses the Least Squares estimation criterion, and a norm constraint on

A. the design matrix X.

B. the coefficients.

C. none of the above.

iii) [1 mark] Which of the following statement(s) is (are) true?

A. The score statistic U can be interpreted as a random variable.

B. The expected value of the score statistic U is zero.

C. None of the above.

iv) [1 mark] Which of the following statement(s) is (are) true about regression splines?

A. They are less flexible than polynomials and step functions.

B. The range of the random variable X is divided into K distinct regions, and a polynomial function is fitted to the data in each region.

C. The polynomials in regression splines join smoothly at the knots, in contrast to step functions.

v) [1 mark] The bs() function from the splines library is used to generate the entire matrix of basis functions for splines. Which of the following function calls generates a basis matrix of cubic spline functions with 5 interior knots at the respective percentiles of x?

A. bs(x, df = 9)

B. bs(x, df = 8)

C. bs(x, df = 7)

D. None of the above.
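As a hedged illustration of how the `df` argument of `bs()` relates to the number of knots, the sketch below uses a simulated predictor (the vector `x` is made up, not exam data) and a `df` value that is deliberately not one of the options above. With the default `degree = 3` and no intercept column, `bs()` places `df - degree` interior knots at quantiles of `x`.

```r
library(splines)

set.seed(1)
x <- runif(200)              # hypothetical predictor, for illustration only

B <- bs(x, df = 6)           # cubic B-spline basis (degree = 3 by default)
n_basis <- ncol(B)           # one column per basis function
knots   <- attr(B, "knots")  # interior knots chosen by bs()
n_knots <- length(knots)

print(n_basis)
print(n_knots)
print(knots)
```

Inspecting `attr(B, "knots")` in this way is a quick check on any `df` value before committing to it.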

vi) [1 mark] As seen in the course, model assessment refers to:

A. estimating the performance of different models in order to choose the best one.

B. having chosen a final model, estimating its training error on the dataset used for fitting the model.

C. having chosen a final model, estimating its test error on new data.

vii) [1 mark] Which of the following statement(s) is (are) true?

A. $R^2$ is used for model selection among models with differing numbers of variables: a higher $R^2$ indicates a better-fitting model.

B. Mallows' $C_p$ gives an estimate of the in-sample error $\mathrm{Err}_{\mathrm{in}}$.

C. For likelihood-based models, Mallows' $C_p$ generalises to the Akaike Information Criterion.

D. AIC penalises complexity more strongly than BIC if N > 8.

2. [10 marks] Model selection.

A study was conducted into the effect of car weight, accident type and ejection of the driver on the severity of car accidents. Three categorical variables were recorded for a number of accidents: car weight (weight, taking values "small" or "standard"), ejection of the driver (ejected, values "yes" or "no") and accident type (type, values "collision" or "rollover"). The eight possible combinations of values for the three (two-level) factors divide the accidents in the study into eight groups: for instance, one group consists of collisions which involved a small car in which the driver was ejected.

For each of the eight groups, a variable was recorded specifying the total number of accidents belonging to the group (total) and how many of these accidents were severe (severe). The variable y is defined for each group as the fraction of severe accidents (severe/total). A binomial logistic regression model was fitted in R with y as the response and weight, ejected and type as predictors. Each of the two-level factors will be coded by a single binary dummy variable in R:

x1 =

x2 =

x3 =

i) [1 mark] Complete the following R code to fit the model whose summary output is shown below.

accident.glm <- glm(..., family = "...", weights = ...) # Complete

# Coefficients:
#                     Value Std. Error   t value
# (Intercept)    0.53234250 0.06240868  8.529944
# weight         0.27182997 0.06368775  4.268168
# ejected        0.49248241 0.06308822  7.806250
# type           0.85773710 0.06152721 13.940777
# weight:ejected 0.10757942 0.06289578  1.710439
# weight:type    0.05367262 0.04845839  1.107602
# ejected:type   0.09642300 0.05709306  1.688874
#
# (Dispersion Parameter for Binomial family taken to be 1)
# Null Deviance: 737.8936 on 7 degrees of freedom
# Residual Deviance: 0.6689337 on 1 degrees of freedom

Hint: the weights argument is used to reflect the number of observations in each group.
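A minimal sketch of what this hint means, using a small made-up grouped data set (the data frame `grp` and its columns `dose`, `n` and `events` are hypothetical and unrelated to the accident data): when the response is a proportion, passing the group sizes through `weights` should give the same fit as the two-column `cbind(successes, failures)` form.

```r
# Hypothetical grouped data, for illustration only
grp <- data.frame(
  dose   = c(0, 1, 2, 3),
  n      = c(50, 40, 60, 45),   # group sizes
  events = c(5, 10, 30, 33)     # number of "successes" per group
)
grp$prop <- grp$events / grp$n  # observed proportion per group

# Proportion response with group sizes supplied as prior weights
fit_w <- glm(prop ~ dose, family = binomial, data = grp, weights = n)

# Equivalent two-column (successes, failures) response
fit_c <- glm(cbind(events, n - events) ~ dose, family = binomial, data = grp)

print(coef(fit_w))
print(coef(fit_c))
```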

ii) [2 marks] What quantity is modelled? Write down the regression model.

iii) [1 mark] What is the log-odds of a severe incident for a small car involved in a collision, where the driver was not ejected?

iv) [2 marks] What is the change in log-odds of a severe incident between a standard and a small car, given that both were involved in a collision, where the driver was not ejected?

v) [4 marks] An additive model (no interactions) is also fitted and its summary output is given below.

# Coefficients:
#                 Value Std. Error   t value
# (Intercept) 0.5627543 0.05826571  9.658413
# weight      0.1683349 0.04305978  3.909331
# ejected     0.5151809 0.04945671 10.416805
# type        0.8192973 0.04140514 19.787331
#
# (Dispersion Parameter for Binomial family taken to be 1)
# Null Deviance: 737.8936 on 7 degrees of freedom
# Residual Deviance: 7.309043 on 4 degrees of freedom

At the α = 0.05 level of significance, test the null hypothesis that the interaction terms can be omitted. State the null and alternative hypotheses, provide the value of the observed test statistic and the critical value, and draw your conclusions.
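The general mechanics of comparing two nested binomial models through their deviances can be sketched as follows. The grouped data frame `grp` and the factors `a` and `b` are hypothetical and are not the accident data; the sketch only assumes the usual result that, under the null hypothesis, the difference in residual deviances is approximately chi-squared with degrees of freedom equal to the difference in residual degrees of freedom.

```r
# Hypothetical grouped data with two binary factors, for illustration only
grp <- data.frame(
  a      = c(0, 0, 1, 1, 0, 0, 1, 1),
  b      = c(0, 1, 0, 1, 0, 1, 0, 1),
  n      = c(40, 50, 45, 55, 60, 35, 50, 48),
  events = c(8, 15, 14, 30, 13, 11, 16, 25)
)

fit_add <- glm(cbind(events, n - events) ~ a + b, family = binomial, data = grp)
fit_int <- glm(cbind(events, n - events) ~ a * b, family = binomial, data = grp)

# Deviance difference and the matching difference in degrees of freedom
stat    <- deviance(fit_add) - deviance(fit_int)
df_diff <- df.residual(fit_add) - df.residual(fit_int)

crit <- qchisq(0.95, df = df_diff)                       # critical value at alpha = 0.05
pval <- pchisq(stat, df = df_diff, lower.tail = FALSE)   # upper-tail p-value

print(c(statistic = stat, df = df_diff, critical = crit, p.value = pval))
```

The same comparison is available through `anova(fit_add, fit_int, test = "Chisq")`; computing it by hand makes the roles of the two residual deviances explicit.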

3. [10 marks]

Consider the class of functions

$$f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - B(\theta)}{A(\phi)} + C(y, \phi) \right\} \tag{1}$$

where θ, φ ∈ ℝ and A(φ), B(θ) and C(y, φ) are some functions. In what follows, θ is called the canonical parameter and φ the scaling parameter.

i) [1 mark] Show that functions of this class are members of the exponential family, as defined in the course. Provide the expressions for the functions a(.), b(.), c(.) and d(.).

ii) [3 marks] The Inverse Gaussian distribution is defined as

$$f(y; \mu, \gamma) = \sqrt{\frac{\gamma}{2\pi y^3}} \exp\left\{ -\frac{\gamma (y - \mu)^2}{2 \mu^2 y} \right\},$$

where µ > 0 and γ > 0. Show that f(y; µ, γ) can be written in the form in (1) by providing the expressions for A(φ), B(θ) and C(y, φ), using θ = -1/µ^2 and φ = 2/γ.

iii) [6 marks] Consider $Y_i \sim IG(\mu_i, \gamma)$ for i ∈ {1, 2, ..., N}. Derive the expression for the deviance by comparing the maximal model, with a different value of $\mu_i$ for each $Y_i$, and the (minimal) model with $\mu_i = \mu$ for all i. Specify the sampling distribution of the deviance statistic.

4. [18 marks]

Provide your R code for each of the following questions. Marks will automatically be deducted if no code is provided.

The data in this exercise come from a study that examined the correlation between the level of prostate-specific antigen (lpsa) and a number of clinical measures in 97 men who were about to receive a radical prostatectomy. The measurements include log cancer volume (lcavol), log prostate weight (lweight), age (age), log of benign prostatic hyperplasia (lpbh), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason) and percent of Gleason scores 4 or 5 (pgg45). The data are collated in the attached prostate.txt file.

i) [2 marks] Load the dataset by importing the prostate.txt file. Split it into a training set of size 67 and a test set of size 30 using the sample() function (for reproducibility, use set.seed(123)). For both the training and the test data, plot lpsa as a function of lweight.
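A minimal sketch of such a split, under stated assumptions: the data frame `dat` below is simulated as a stand-in with 97 rows, since the layout of prostate.txt (header row, separator) is not shown here. With the real file, `dat` would instead come from a call along the lines of `read.table("prostate.txt", header = TRUE)`, which is an assumption about the file format.

```r
# Simulated stand-in for the 97-row data set, for illustration only
set.seed(2021)
dat <- data.frame(lweight = rnorm(97, mean = 3.6, sd = 0.4))
dat$lpsa <- 1 + 0.4 * dat$lweight + rnorm(97)

set.seed(123)                                # seed requested for reproducibility
train_idx <- sample(nrow(dat), size = 67)    # 67 row indices, drawn without replacement
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]                   # the remaining 30 rows

# Side-by-side scatter plots of lpsa against lweight
op <- par(mfrow = c(1, 2))
plot(lpsa ~ lweight, data = train, main = "Training set")
plot(lpsa ~ lweight, data = test,  main = "Test set")
par(op)
```

Indexing with `-train_idx` guarantees that the two sets are disjoint and together cover every row.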

ii) [2 marks] Fit a linear model using lpsa as the response and the other variables as covariates. What is the estimated regression equation? Discuss the adequacy of the estimated coefficients at the α = 0.05 level of significance.

iii) [4 marks] Using R, produce three separate plots displaying: 1) the standardised residuals against the lcavol predictor, 2) the standardised residuals against the fitted values and 3) the fitted values against the response variable lpsa. Based on your plots, comment on whether there is any deviation from the model assumptions.

iv) [3 marks] Perform backward stepwise selection using the AIC to determine which predictors are associated with the response. Justify your decision at each step. What is the estimated regression equation?

v) [3 marks] Fit the selected model from the previous question to the test set. What is the estimated regression equation? Discuss the adequacy of the estimated coefficients. Does the model seem convincing for the test data, and why?

vi) [1 mark] Calculate an estimate of the test MSE.
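One common way to estimate a test MSE is to predict the held-out responses from a model fitted on the training set only and to average the squared prediction errors. The sketch below follows that reading on simulated data; the variables `x` and `y` and the model `y ~ x` are hypothetical stand-ins, not the selected prostate model.

```r
# Simulated data, for illustration only
set.seed(1)
dat <- data.frame(x = rnorm(97))
dat$y <- 2 + 0.5 * dat$x + rnorm(97)

idx   <- sample(nrow(dat), 67)
train <- dat[idx, ]
test  <- dat[-idx, ]

fit  <- lm(y ~ x, data = train)          # model fitted on the training set only
pred <- predict(fit, newdata = test)     # predictions for the held-out rows
test_mse <- mean((test$y - pred)^2)      # average squared prediction error

print(test_mse)
```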

vii) [3 marks] We now compare the model obtained on the training data using backward selection (question (iv)) with the model including ONLY the significant variables when fitted to the test data (question (v)). On the full dataset (training and test sets combined), use leave-one-out cross-validation to select the best model.
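A hedged sketch of leave-one-out cross-validation for comparing two linear models is given below. It uses simulated data; the helper `loocv_mse()` and the predictors `x1` and `x2` are hypothetical names introduced for illustration, and the two formulas are placeholders for whichever candidate models are being compared.

```r
# Simulated data, for illustration only
set.seed(1)
n <- 97
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.8 * dat$x1 + rnorm(n)

# LOOCV mean squared error for a linear model, by explicit refitting:
# each observation is left out once, the model is refitted on the rest,
# and the left-out response is predicted.
loocv_mse <- function(formula, data) {
  errs <- vapply(seq_len(nrow(data)), function(i) {
    held_out <- data[i, , drop = FALSE]
    fit <- lm(formula, data = data[-i, ])
    obs <- model.response(model.frame(formula, held_out))
    obs - predict(fit, newdata = held_out)
  }, numeric(1))
  mean(errs^2)
}

cv_small <- loocv_mse(y ~ x1, dat)
cv_large <- loocv_mse(y ~ x1 + x2, dat)
print(c(small = cv_small, large = cv_large))
```

The model with the smaller LOOCV error would be preferred. For ordinary least squares the same quantity can be obtained without refitting, as the mean of `(residuals(fit) / (1 - hatvalues(fit)))^2` for a single fit on all the data.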