STATS 100C, Winter 2022 Final Exam
Notation: I_n is the n × n identity matrix. There are 6 problems.
Problem 1 (15 points) 5+5+5 = 15
Consider the random vector
z = (z1, z2, z3)^T ~ N( (-1, 1, 2)^T , I_3 ).
(a) Is z1 independent of z2 - z3? Justify your answer.
(b) For what value(s) of α ∈ R does (z1 + 2)^2 + α(z2 - z3)^2 have a chi-square distribution?
(c) Let P ∈ R^{3×3} be the projection matrix onto the subspace V = span{ (-1, 1, -2)^T }.
Find the distribution of ||P^⊥ z||^2, where P^⊥ = I_3 - P.
Hint: Write z = μ + w, where μ is a fixed vector and w ~ N(0, I_3).
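A numerical sketch of the structure behind part (c): for any nonzero v ∈ R^3, the projection P onto span{v} has rank 1, so P^⊥ = I_3 - P is a rank-2 projection, and ||P^⊥ w||^2 ~ χ²_2 when w ~ N(0, I_3). The vector v below is an illustrative stand-in, not the exam's:

```python
import numpy as np

# Projection onto span{v}; v is illustrative, not the exam's vector.
v = np.array([1.0, 2.0, -2.0])
P = np.outer(v, v) / (v @ v)        # rank-1 projection onto span{v}
P_perp = np.eye(3) - P              # projection onto the orthogonal complement

assert np.allclose(P_perp @ P_perp, P_perp)   # P_perp is idempotent
rank = int(round(np.trace(P_perp)))           # rank of a projection = its trace
print(rank)  # 2: the degrees of freedom of the chi-square for ||P_perp w||^2
```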
Problem 2 (10 points)
Consider a simple linear regression model
yi = β0 + β1 xi + εi,  i = 1, 2, 3,
with xi = i/3 for i = 1, 2, 3. Assume that
ε = (ε1, ε2, ε3)^T ~ N(0, Σ)
for a given 3 × 3 covariance matrix Σ with second row (-1, 2, 0).
What is the smallest variance for an unbiased estimate of β1?
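One way to frame the question: among linear unbiased estimators of β1 under correlated errors, the GLS estimator attains the smallest variance (Gauss–Markov/Aitken), and under normality it is best among all unbiased estimators. A minimal sketch of the computation; the matrix Σ below is an assumption for illustration, not the exam's matrix:

```python
import numpy as np

# GLS variance of beta_1 for the design x_i = i/3, i = 1, 2, 3.
# Sigma is illustrative only (positive definite, second row (-1, 2, 0)).
x = np.array([1 / 3, 2 / 3, 1.0])
X = np.column_stack([np.ones(3), x])          # columns: intercept, slope
Sigma = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, 0.0],
                  [0.0, 0.0, 1.0]])
# Smallest variance of an unbiased estimate: [(X' Sigma^-1 X)^-1]_{22}
var_beta1 = np.linalg.inv(X.T @ np.linalg.inv(Sigma) @ X)[1, 1]
print(var_beta1)
```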
Problem 3 (20 points) 8+8+4 = 20
Consider a multiple linear regression model with the intercept, with n = 50 samples and p = 4 covariates, under the standard assumptions.
We fit the model and obtain R^2 = 0.6. Call this model M1.
(a) Perform a significance test of the linear relation in M1 at level α = 0.05.
Assume that we drop three of the four variables and the R^2 drops to 0.5. Call this new model M2.
(b) Use an F-test at level 0.05 to choose between M1 and M2.
(c) Complete the following sentence (with justification):
…
(Hints: the intercept remains in the null model for part (a). You do not need to know the value of SST in this problem.)
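The hint about SST can be made concrete: both F statistics can be written in terms of R^2 alone. A sketch of that arithmetic using the values given above (the critical values come from an F table and are approximate):

```python
# Problem 3, parts (a) and (b): F statistics expressed through R^2 only.
n, p = 50, 4
R2_full, R2_red = 0.6, 0.5
df_err = n - p - 1            # 45 error degrees of freedom in M1
q = 3                         # number of covariates dropped to obtain M2

# (a) overall F test of M1 against the intercept-only null
F_overall = (R2_full / p) / ((1 - R2_full) / df_err)
# (b) partial F test comparing M1 with the reduced model M2
F_partial = ((R2_full - R2_red) / q) / ((1 - R2_full) / df_err)

# From an F table: F_{0.95}(4, 45) is about 2.58 and F_{0.95}(3, 45) about 2.81.
print(round(F_overall, 3), round(F_partial, 3))  # 16.875 3.75
```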
Problem 4 (10 points)
Consider a linear regression problem with the intercept (included in the model). The model is fit to the data, with sample size n = 12, and the following information is available about the leverage score and the residual of each data point:
i     1     2     3     4     5     6     7     8     9     10    11    12
hii   0.28  0.50  0.38  0.28  0.22  0.28  0.50  0.22  0.38  0.22  0.52  0.22
ei    0.67  1.10  -1.10 0.77  1.13  -0.03 -1.10 -0.67 -0.30 -0.87 -1.40 1.83
For your convenience, sum_{i=1}^{n} ei^2 = 12.554, and the values of ei^2/(1 - hii) and ei/(1 - hii) for each i [numerical table not recovered].
Which data point has the most influence on the regression line? Justify your answer. You can use any measure of influence as long as it is appropriate for the task.
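One appropriate measure is Cook's distance, D_i = ei^2 hii / (p σ̂^2 (1 - hii)^2), where p is the number of fitted parameters and σ̂^2 the error-variance estimate. Since the constant p σ̂^2 is the same for every point, the ranking only needs ei^2 hii / (1 - hii)^2. A sketch using the table above:

```python
import numpy as np

# Ranking by the point-dependent part of Cook's distance:
# e_i^2 * h_ii / (1 - h_ii)^2  (the constant p * sigma_hat^2 cancels).
h = np.array([0.28, 0.50, 0.38, 0.28, 0.22, 0.28,
              0.50, 0.22, 0.38, 0.22, 0.52, 0.22])
e = np.array([0.67, 1.10, -1.10, 0.77, 1.13, -0.03,
              -1.10, -0.67, -0.30, -0.87, -1.40, 1.83])

score = e**2 * h / (1 - h)**2
most = int(np.argmax(score)) + 1      # 1-based index of the data point
print(most)  # 11: the largest leverage (0.52) combined with a large residual
```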
Problem 5 (20 points) 5+5+5+5 = 20
Consider a multiple linear regression model y = sum_{j=0}^{p} βj aj + ε ∈ R^n, where a0 is the intercept column. Assume that p = 4 and n = 100 and that the design matrix is full-rank.
(a) What is the effect on the least-squares estimate of the coefficients if we scale some of the covariates, that is, multiply each column aj of the design matrix X by some number αj ∈ R? Justify your answer by providing some derivation.
Hint: The effect of the above scaling on the design matrix X is to replace it with XD for some diagonal matrix D .
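A quick numerical check of the hint, on simulated data: if X becomes XD with D diagonal (entries αj, assumed nonzero so D is invertible), the fit is unchanged and the coefficient estimate becomes D^{-1} β̂, since (D X'X D)^{-1} D X'y = D^{-1} (X'X)^{-1} X'y.

```python
import numpy as np

# Scaling columns of X by alpha_j rescales each estimated coefficient by 1/alpha_j.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # full-rank illustrative design
y = rng.normal(size=100)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
alphas = np.array([1.0, 2.0, -0.5, 3.0, 1.5])  # illustrative nonzero scalings
beta_scaled, *_ = np.linalg.lstsq(X @ np.diag(alphas), y, rcond=None)

assert np.allclose(beta_scaled, beta / alphas)  # coefficients scale by 1/alpha_j
```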
(b) What is the effect on the SSE if we linearly combine covariates a1 and a2, as well as covariates a3 and a4, as follows:
New design matrix = X1 := [ a0  a1 + a2  a3 - a4 ] ∈ R^{100×3}?
In other words, does the SSE of the new model (using the same response variable) go up, go down, remain the same, or is the information not sufficient to determine what happens? You can also say, for example, that the SSE of the new model will be "less than or equal to" that of the original model, and so on. Justify your answer.
(c) Can we think of the regression model in part (b) as a nested model relative to the original model? Justify your answer.
(d) Repeat part (b) for the following design matrix:
New design matrix = X2 := [ a0  a1 + a2  a1 - a2  a3  a4 ] ∈ R^{100×5}?
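For parts (b) and (d), the key fact is that the SSE depends on the design matrix only through its column space. A small simulated check (random illustrative data, not the exam's): the part (d) columns span the same space as the original columns, while the part (b) matrix spans a smaller subspace.

```python
import numpy as np

# SSE depends on the design matrix only through its column space.
rng = np.random.default_rng(1)
a = [np.ones(100)] + [rng.normal(size=100) for _ in range(4)]  # a0, a1, ..., a4
y = rng.normal(size=100)

def sse(cols):
    X = np.column_stack(cols)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

sse_orig = sse(a)                                          # full model, 5 columns
sse_b = sse([a[0], a[1] + a[2], a[3] - a[4]])              # part (b): 3 columns
sse_d = sse([a[0], a[1] + a[2], a[1] - a[2], a[3], a[4]])  # part (d): 5 columns

# Part (d) spans the same space (a1 = ((a1+a2) + (a1-a2))/2, and so on),
# so its SSE matches; part (b) projects onto a smaller subspace.
assert np.isclose(sse_d, sse_orig)
assert sse_b >= sse_orig - 1e-9
```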
Problem 6 (25 points) 5+5+5+5+5 = 25
The following sample correlation matrix C among 5 variables is given:
C
##              lpsa    lcavol   lweight          lcp         lbph
## lpsa    1.0000000 0.7344603 0.4333194  0.548813175  0.179809404
## lcavol  0.7344603 1.0000000 0.2805214  0.675310484  0.027349703
## lweight 0.4333194 0.2805214 1.0000000  0.164537142  0.442264399
## lcp     0.5488132 0.6753105 0.1645371  1.000000000 -0.006999431
## lbph    0.1798094 0.0273497 0.4422644 -0.006999431  1.000000000
We take the variable "lpsa" as the response y and consider the rest of the variables as potential predictors (i.e., covariates).
(a) Among all the 2-variable linear models for "lpsa", which model will have the smallest variance inflation factor (VIF) for the estimated coefficients of the predictors? Which model will have the largest VIF?
For example, one 2-variable model regresses lpsa on {lcavol, lcp}, another one regresses the same response on {lweight, lcp}, and so on. There are (4 choose 2) = 6 such models. Each model also includes an intercept.
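For a model with exactly two predictors, both estimated slopes share the same VIF, 1/(1 - r^2), where r is the sample correlation between the two predictors, so the question reduces to scanning the off-diagonal entries of C. A sketch using the correlations above:

```python
# Two-predictor models: both slopes have VIF = 1 / (1 - r^2),
# where r is the correlation between the two predictors (from C above).
r = {("lcavol", "lweight"): 0.2805214,
     ("lcavol", "lcp"):     0.6753105,
     ("lcavol", "lbph"):    0.0273497,
     ("lweight", "lcp"):    0.1645371,
     ("lweight", "lbph"):   0.4422644,
     ("lcp", "lbph"):      -0.006999431}

vif = {pair: 1.0 / (1.0 - rho**2) for pair, rho in r.items()}
smallest = min(vif, key=vif.get)   # the pair with |r| closest to 0
largest = max(vif, key=vif.get)    # the pair with the largest |r|
print(smallest, largest)
```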
(b) Suppose that we fit the regression model
lpsa ~ β0 + β1 lcavol
and the estimated coefficients are
## (Intercept)      lcavol
##   1.5072975   0.7193204
We then form the residual vector from this model, call it e^(1), and fit the regression model
e^(1) ~ γ0 + γ1 lcavol.
What will be the estimated coefficients in this model? Justify your answer.
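The fact underlying part (b): OLS residuals are orthogonal to every column of the design matrix used in the fit, so regressing e^(1) on the same intercept and lcavol must return all-zero coefficients. A sketch with simulated stand-in data (only the orthogonality argument matters, not the particular values):

```python
import numpy as np

# Residuals are orthogonal to the design columns, so re-fitting the
# residuals on the same columns yields zero coefficients.
rng = np.random.default_rng(2)
lcavol = rng.normal(size=100)                       # simulated stand-in data
lpsa = 1.5 + 0.7 * lcavol + rng.normal(size=100)

X = np.column_stack([np.ones(100), lcavol])
e1 = lpsa - X @ np.linalg.lstsq(X, lpsa, rcond=None)[0]   # residual vector

gamma, *_ = np.linalg.lstsq(X, e1, rcond=None)
assert np.allclose(gamma, 0.0)    # gamma_0 = gamma_1 = 0 up to rounding
```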
(c) With e^(1) as in the previous part, we now fit the following model (here the βi are different from before)
e^(1) ~ β0 + β1 lcavol + β2 lweight + β3 lcp
The resulting estimated coefficient vector is
## (Intercept) lcavol lweight lcp
## -2.23534704 -0.14037140 0.67262619 0.08960526
Can you explain why the coefficient of lcavol is nonzero in this fitted model? Under what conditions would that coefficient have been zero?
(d) With e^(1) as in part (b), we now fit the following model (here the βi are different from before)
e^(1) ~ β0 + β1 lcavol + β2 lbph
and the resulting estimated coefficients are
## (Intercept) lcavol lbph
## -0.006982896 -0.004281508 0.127177470
Can you explain why the coefficient of lcavol is much smaller this time relative to part (c)?
(e) Now consider fitting the regression model (here the βi are different from before)
lpsa ~ β0 + β1 lcavol + β2 lweight + β3 lcp
Given the information so far, determine the estimated regression coefficients for this model.
2022-03-15