
Assignment/Quiz 2 Questions

1. Assume a Gaussian linear model:

$$Y \sim N(X\beta, \sigma^2 I_n),$$

where $X \in \mathbb{R}^{n\times p}$, $\beta \in \mathbb{R}^p$, and $\sigma^2$ are a fixed/given matrix, vector, and scalar, respectively.

(a) Write down the joint pdf $g(y \mid \beta, \sigma^2, X)$.

(b) Compute the maximum likelihood estimate of $\beta$:

$$\hat\beta = \operatorname*{argmax}_{\beta} \, \ln g(y \mid \beta, \sigma^2, X).$$

(c) Compute the maximum likelihood estimate of $\sigma^2$:

$$\hat\sigma^2 = \operatorname*{argmax}_{\sigma^2} \, \ln g(y \mid \hat\beta, \sigma^2, X).$$
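For reference, a sketch of the standard answers (assuming, for the closed forms, that $X$ has full column rank so that $X^\top X$ is invertible):

$$g(y \mid \beta, \sigma^2, X) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{\|y - X\beta\|^2}{2\sigma^2}\right), \qquad \hat\beta = (X^\top X)^{-1} X^\top y, \qquad \hat\sigma^2 = \frac{\|y - X\hat\beta\|^2}{n}.$$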

2. Suppose that $Y \sim N(X\beta, \sigma^2 I_n)$, where $X \in \mathbb{R}^{n\times p}$, $\beta \in \mathbb{R}^p$, and $\sigma^2$ are a fixed/given matrix, vector, and scalar, respectively. Let

$$\hat\beta = X^+ Y, \qquad \hat\sigma^2 = \|Y - X\hat\beta\|^2/(n - r),$$

where $r = \operatorname{rank}(X)$. Show that:

(a)

$$E\hat\beta = \beta,$$

when $\operatorname{rank}(X) = r = p$.

solution: Since $X$ has full column rank, we have

$$X^+ = (X^\top X)^+ X^\top = (X^\top X)^{-1} X^\top.$$

Hence, $E\hat\beta = X^+ X\beta = \beta$.

(b)

$$E\hat\sigma^2 = \sigma^2,$$

using the identity $E[X^\top A X] = \mu^\top A \mu + \operatorname{tr}(A \operatorname{Var}(X))$, where $\mu = EX$.

solution: For this part, using the projection property $(I - XX^+)^2 = I - XX^+$:

$$\begin{aligned}
(n - r)\,E\hat\sigma^2 &= E\|Y - X\hat\beta\|^2 \\
&= E\|Y - XX^+ Y\|^2 \\
&= E\|(I - XX^+)Y\|^2 \\
&= E\big[Y^\top (I - XX^+)^\top (I - XX^+) Y\big] \\
&= E\big[Y^\top (I - XX^+) Y\big] \\
&= E[Y]^\top (I - XX^+) E[Y] + \operatorname{tr}\big((I - XX^+)\operatorname{Var}(Y)\big) \\
&= E[Y]^\top (I - XX^+) X\beta + \sigma^2 \operatorname{tr}(I_n - XX^+).
\end{aligned}$$

Therefore, using

$$(I - XX^+)X = X - XX^+X = X - X = 0,$$

we obtain

$$(n - r)\,E\hat\sigma^2 = \sigma^2 \operatorname{tr}(I_n - XX^+) = \sigma^2 (n - r),$$

because, using the SVD $X = USV^\top$, we can write

$$X^+ X = (USV^\top)^+ USV^\top = (VS^+U^\top)USV^\top = VS^+SV^\top = V \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} V^\top,$$

and hence

$$\operatorname{tr}(XX^+) = \operatorname{tr}(X^+X) = \operatorname{tr}\left( V \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} V^\top \right) = r.$$

If $X$ has full column rank, then this is simpler to write, because

$$\operatorname{tr}(XX^+) = \operatorname{tr}(X^+X) = \operatorname{tr}\big((X^\top X)^{-1} X^\top X\big) = \operatorname{tr}(I_p) = p = r.$$

In summary, an unbiased estimator of $\sigma^2$ is

$$\hat\sigma^2 = \frac{\|Y - XX^+Y\|^2}{n - p}.$$

Recall that in first year you are taught that $s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2$ is unbiased for $\sigma^2$; this is the special case in which $X = \mathbf{1}$ is a single constant column, so that $p = 1$.
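As a numerical sanity check (a minimal sketch in Python with a simulated, arbitrary design matrix; not part of the original solution), we can verify that $\operatorname{tr}(XX^+) = r$ and that $\hat\sigma^2$ averages to $\sigma^2$ over many replications:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 50, 3, 4.0
X = rng.normal(size=(n, p))           # a generic full-rank design, so r = p
beta = rng.normal(size=p)
r = np.linalg.matrix_rank(X)

P = X @ np.linalg.pinv(X)             # projection XX^+; its trace equals r
print(np.trace(P), r)                 # ~3.0 and 3

# Average sigma^2-hat = ||(I - XX^+)Y||^2 / (n - r) over many replications.
est = []
for _ in range(10_000):
    Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    est.append(np.sum(((np.eye(n) - P) @ Y) ** 2) / (n - r))
print(np.mean(est))                   # close to sigma2 = 4.0
```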

(c)

$$X^+ (X^+)^\top = (X^\top X)^+.$$

solution: We prove this by simply substituting and using $X^+ = (X^\top X)^+ X^\top$:

$$\begin{aligned}
X^+ (X^+)^\top &= (X^\top X)^+ X^\top \big((X^\top X)^+ X^\top\big)^\top \\
&= (X^\top X)^+ X^\top X \big[(X^\top X)^+\big]^\top \\
&= (X^\top X)^+ X^\top X (X^\top X)^+ \\
&= (X^\top X)^+,
\end{aligned}$$

where the last step uses the Moore-Penrose property $A^+ A A^+ = A^+$ with $A = X^\top X$ (note that $(X^\top X)^+$ is symmetric).

(d)

$$\begin{bmatrix} \hat\beta \\ Y - X\hat\beta \end{bmatrix} \sim N\left( \begin{bmatrix} X^+X\beta \\ 0 \end{bmatrix}, \; \sigma^2 \begin{bmatrix} (X^\top X)^+ & 0 \\ 0 & I_n - XX^+ \end{bmatrix} \right).$$

Hence, deduce that $\hat\beta$ is independent of $\|Y - X\hat\beta\|^2$.

solution: We know that $Y$ is multivariate Gaussian and any linear transformation of a Gaussian variable yields another multivariate Gaussian. Thus, from

$$\begin{bmatrix} \hat\beta \\ Y - X\hat\beta \end{bmatrix} = \underbrace{\begin{bmatrix} X^+ \\ I_n - XX^+ \end{bmatrix}}_{A} Y,$$

we can conclude that $\begin{bmatrix} \hat\beta \\ Y - X\hat\beta \end{bmatrix}$ is multivariate Gaussian with mean

$$A\,E[Y] = \begin{bmatrix} X^+ \\ I_n - XX^+ \end{bmatrix} X\beta = \begin{bmatrix} X^+X\beta \\ 0 \end{bmatrix}.$$

The covariance is:

$$\operatorname{Var}(AY) = \sigma^2 AA^\top = \sigma^2 \begin{bmatrix} X^+ \\ I_n - XX^+ \end{bmatrix} \begin{bmatrix} (X^+)^\top & I_n - XX^+ \end{bmatrix} = \sigma^2 \begin{bmatrix} (X^\top X)^+ & 0 \\ 0 & I_n - XX^+ \end{bmatrix},$$

where the top-left block uses part (c), the bottom-right block uses the projection property $(I_n - XX^+)^2 = I_n - XX^+$, and the off-diagonal blocks vanish because $X^+(I_n - XX^+) = X^+ - X^+XX^+ = 0$.

Therefore,

$$\operatorname{Var}(\hat\beta) = \sigma^2 (X^\top X)^+, \qquad \operatorname{Var}(Y - X\hat\beta) = \sigma^2 (I_n - XX^+), \qquad \operatorname{Cov}(\hat\beta, \, Y - X\hat\beta) = 0.$$

Since $\hat\beta$ and $Y - X\hat\beta$ are jointly normal with zero covariance, they are independent. In other words, $\hat\beta$ and $\hat\sigma^2$ are independent. This is going to be used in the next part.

(e) If $r = p$, then

$$\hat\sigma^2 \sim \frac{\sigma^2}{n - r}\,\chi^2_{n-r}.$$

solution: We know that

$$Z^\top \Sigma^+ Z \sim \chi^2_r,$$

where $Z \sim N(0, \Sigma)$ and $r = \operatorname{rank}(\Sigma)$.

In particular, we have the quadratic form

$$Y^\top (I_n - XX^+) Y/\sigma^2 = Y^\top (I_n - XX^+)(I_n - XX^+)^+(I_n - XX^+) Y/\sigma^2 = Z^\top (I_n - XX^+)^+ Z,$$

where (using $Y \sim N(X\beta, \sigma^2 I_n)$)

$$Z = (I_n - XX^+)Y/\sigma \sim N(0, \, I_n - XX^+).$$

Indeed,

$$E[Z] = (I_n - XX^+)E[Y]/\sigma = (I_n - XX^+)X\beta/\sigma = 0$$

and

$$\operatorname{Var}(Z) = \frac{I_n - XX^+}{\sigma}\operatorname{Var}(Y)\frac{I_n - XX^+}{\sigma} = (I_n - XX^+)\,\sigma^2 I_n\,(I_n - XX^+)/\sigma^2 = I_n - XX^+.$$

Therefore,

$$Z^\top (I_n - XX^+)^+ Z \sim \chi^2_{n-r},$$

because $\operatorname{rank}(I_n - XX^+) = \operatorname{tr}(I_n - XX^+) = n - r$.

(f) If $r = p$, then

$$\frac{\hat\beta_j - E\hat\beta_j}{\hat\sigma\,\|e_j^\top X^+\|} \sim t_{n-r}.$$


solution: From part (e) we know that

$$(n - r)\,\hat\sigma^2/\sigma^2 \sim \chi^2_{n-r}.$$

From part (d), we know that $\hat\sigma^2$ is independent of $\hat\beta \sim N(E\hat\beta, \, \sigma^2 (X^\top X)^+)$, so that

$$\frac{\hat\beta_j - E[\hat\beta_j]}{\sigma\sqrt{[(X^\top X)^+]_{jj}}} \sim N(0, 1).$$

From the Quiz 1 sheet (the ratio of a standard normal to an independent $\sqrt{\chi^2_{n-p}/(n-p)}$ has a Student-t distribution), we have

$$\frac{\hat\beta_j - E[\hat\beta_j]}{\hat\sigma\sqrt{[(X^\top X)^+]_{jj}}} \sim t_{n-p}.$$

From part (c), we know that

$$[(X^\top X)^+]_{jj} = e_j^\top (X^\top X)^+ e_j = e_j^\top X^+ (X^+)^\top e_j = \|e_j^\top X^+\|^2.$$

3. For the simple linear regression $Y = \beta_0 + \beta_1 x + \epsilon$, show that $R^2$ is the same as the squared sample correlation between the response and the explanatory variable:

$$R^2 = \frac{\left(\sum_i (y_i - \bar y)(x_i - \bar x)\right)^2}{\sum_i (y_i - \bar y)^2 \, \sum_i (x_i - \bar x)^2}.$$

solution: First, from the definition in the notes on page 39, Section 2.3.7, we know that

$$R^2 = \frac{\|\hat y - \bar y\,\mathbf{1}\|^2}{\|y - \bar y\,\mathbf{1}\|^2} = \frac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2}.$$

For a simple linear regression, we know that

$$b_0 = \bar y - b_1 \bar x, \qquad b_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}.$$

Substituting these gives:

$$\begin{aligned}
R^2 = \frac{\sum_i (b_1 x_i + \bar y - b_1 \bar x - \bar y)^2}{\sum_i (y_i - \bar y)^2}
&= \frac{\sum_i \big(b_1 [x_i - \bar x]\big)^2}{\sum_i (y_i - \bar y)^2} \\
&= \frac{b_1^2 \sum_i [x_i - \bar x]^2}{\sum_i (y_i - \bar y)^2} \\
&= \frac{\left(\sum_i (x_i - \bar x)(y_i - \bar y)\right)^2}{\sum_i (y_i - \bar y)^2 \, \sum_i (x_i - \bar x)^2},
\end{aligned}$$

which completes the proof.

4. Show that $R^2_{\text{adjusted}} \leq R^2$.

solution: We recall that

$$R^2_{\text{adjusted}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p}.$$

When $n > p \geq 1$ we have that

$$\frac{n - 1}{n - p} \geq 1.$$

Therefore,

$$\frac{1 - R^2_{\text{adjusted}}}{1 - R^2} = \frac{n - 1}{n - p} \geq 1.$$

We conclude,

$$(1 - R^2) \leq 1 - R^2_{\text{adjusted}}.$$

Hence,

$$R^2_{\text{adjusted}} \leq R^2,$$

which makes sense, because $R^2_{\text{adjusted}}$ is less optimistic about the model than $R^2$ (higher $R^2$ means better fit to the data, possibly overfitting).

5. For the diabetes dataset, compute the 2-fold cross-validation loss as an estimate of the expected generalization risk of the linear learner. Report the numerical value.

solution: Without reordering the data we get: 3250.9
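A minimal sketch of the computation in Python, assuming the diabetes data are the version shipped with scikit-learn, a constant column is prepended, the data are not reordered, and the loss is the average squared prediction error:

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
X = np.column_stack([np.ones(len(y)), X])   # prepend the constant feature
n = len(y)

# 2-fold CV: train on one half, test on the other, then swap.
half = n // 2
idx = np.arange(n)
sq_errors = []
for train, test in [(idx[:half], idx[half:]), (idx[half:], idx[:half])]:
    beta = np.linalg.pinv(X[train]) @ y[train]    # least-squares fit on the fold
    sq_errors.append((y[test] - X[test] @ beta) ** 2)
print(np.mean(np.concatenate(sq_errors)))         # 2-fold CV loss
```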

6. For the diabetes dataset, compute the leave-one-out cross-validation loss (the PRESS statistic divided by n) as an estimate of the expected generalization risk of the linear learner, and report the numerical value.

Perform the computation of the leave-one-out cross-validation in two different ways: 1) using the fast PRESS statistic formula; 2) using a brute-force retraining of the linear learner.

solution: The value for n-fold CV is: 3147
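Both routes, sketched in Python under the same assumptions as in Question 5. The fast route uses the PRESS identity $e_i^{\text{loo}} = e_i/(1 - h_{ii})$, where the leverages $h_{ii}$ are the diagonal entries of the hat matrix $H = XX^+$:

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
X = np.column_stack([np.ones(len(y)), X])
n = len(y)

# 1) Fast PRESS formula: scale each residual by 1/(1 - leverage).
H = X @ np.linalg.pinv(X)                    # hat (projection) matrix
h = np.diag(H)
resid = y - H @ y
print(np.sum((resid / (1 - h)) ** 2) / n)    # PRESS / n

# 2) Brute force: refit n times, each time leaving one observation out.
loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.pinv(X[mask]) @ y[mask]
    loo += (y[i] - X[i] @ beta) ** 2
print(loo / n)                               # same value, much slower
```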

7. For the diabetes dataset, use the estimate

$$\frac{\|(I_n - XX^+)Y\|^2}{n} + \frac{2\sigma^2 p}{n},$$

where $\sigma^2 \approx 3000$, of the in-sample risk to decide if the following predictors should be jointly included/excluded in the linear model: age, glu, tch, ldl.

After making your decision about which features to include in the model matrix $X$, then estimating the corresponding coefficients $\hat\beta$, create a QQ-plot of the residuals $y - X\hat\beta$.

solution: The in-sample risk estimate using all of the predictors is 3137.9 (here p includes the constant feature); the in-sample risk estimate after removing the predictors is 3088.5. Thus, we prefer dropping these predictors.

The coefficients estimated after dropping the predictors are:

152.43, 233.31, 576.45, 287.26, 171.16, 197.03, 620.38

The residuals and the corresponding QQ-plot look like this:

[Figures: residual plot and QQ-plot of the residuals.]
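A sketch of the computation, assuming scikit-learn's copy of the data, in which the course's named columns are taken to map to age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu in that order:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
names = ['age', 'sex', 'bmi', 'map', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']
sigma2 = 3000                                 # given noise-level estimate

def in_sample_risk(X, y, sigma2):
    # ||(I - XX^+)Y||^2 / n + 2*sigma^2*p / n
    n, p = X.shape
    r = y - X @ (np.linalg.pinv(X) @ y)
    return r @ r / n + 2 * sigma2 * p / n

X_full = np.column_stack([np.ones(len(y)), X])
keep = [i for i, nm in enumerate(names) if nm not in {'age', 'glu', 'tch', 'ldl'}]
X_red = np.column_stack([np.ones(len(y)), X[:, keep]])
print(in_sample_risk(X_full, y, sigma2), in_sample_risk(X_red, y, sigma2))

# QQ-plot of the residuals of the preferred (reduced) model.
resid = y - X_red @ (np.linalg.pinv(X_red) @ y)
stats.probplot(resid, dist="norm", plot=plt)
plt.show()
```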


8. For the diabetes dataset, compute a 95% numerical confidence interval for $\beta_j$ that corresponds to the predictor "age".

answer: approximately $[-381, -86]$
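A sketch of the interval, combining parts 2(b), 2(c), and 2(f): $\hat\beta_j \pm t_{n-p,\,0.975}\,\hat\sigma\sqrt{[(X^\top X)^+]_{jj}}$ (again assuming scikit-learn's copy of the data, with the constant in column 0 and age in column 1):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
X = np.column_stack([np.ones(len(y)), X])    # column 0 = constant, column 1 = age
n, p = X.shape
j = 1

beta = np.linalg.pinv(X) @ y
resid = y - X @ beta
s2 = resid @ resid / (n - p)                 # unbiased estimate of sigma^2, part 2(b)
se = np.sqrt(s2 * np.linalg.pinv(X.T @ X)[j, j])
t = stats.t.ppf(0.975, n - p)                # 97.5% quantile of t_{n-p}, part 2(f)
print(beta[j] - t * se, beta[j] + t * se)
```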

9. Download the file risk.csv. The goal is to predict risk from the other variables. Do an F-test to check if the explanatory variables are all jointly relevant, and report the $R^2$.
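A sketch using statsmodels, assuming the response column in risk.csv is named risk; the fitted model reports the overall F-statistic (testing $H_0$: all slope coefficients are zero), its p-value, and $R^2$:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('risk.csv')
rhs = ' + '.join(c for c in df.columns if c != 'risk')
fit = smf.ols('risk ~ ' + rhs, data=df).fit()

print(fit.fvalue, fit.f_pvalue)   # joint F-test; a small p-value => jointly relevant
print(fit.rsquared)               # R^2
```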

10. Here we use the fish.csv dataset. The variables weight (in grams) and length (in millimetres) in this data set are the weights and lengths of 23 different catfish captured in the Kanawha River in Charleston, West Virginia. It was desired to estimate the angler harvest of channel catfish, and for live fish length is much easier to measure than weight. Hence it was of interest to study the length/weight relationship for channel catfish.

(a) Train a simple linear regression model with weight as response and length as predictor.

(b) It is conjectured that the weight of a fish varies with length by the following relationship:

$$\log_{10}(y) = \beta_0 + \beta_1 \log_{10}(x) + \epsilon,$$

where $y$ is the weight and $x$ is the length.

Train a simple linear regression model with log-weight as response and log-length as predictor.

(c) Plot a scatterplot and estimate the generalization risk for both models. Explain which model is preferable based on the scatterplot and generalization risk. Assume interest is in prediction of $y$ in natural units (not in log units).

answer: The coefficients on the raw data are: $b = [-884.31, \ 3.8444]$.
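A sketch of both fits, assuming fish.csv has columns named length and weight; since interest is in prediction in natural units, the log-log predictions are mapped back to grams before the squared-error risk is computed (in-sample error shown here for simplicity; a cross-validation estimate as in Questions 5-6 would be the fairer comparison):

```python
import numpy as np
import pandas as pd

df = pd.read_csv('fish.csv')
x, y = df['length'].to_numpy(), df['weight'].to_numpy()

# (a) weight on length, natural units
X1 = np.column_stack([np.ones(len(x)), x])
b = np.linalg.pinv(X1) @ y
pred1 = X1 @ b

# (b) log10(weight) on log10(length), predictions mapped back to grams
X2 = np.column_stack([np.ones(len(x)), np.log10(x)])
b_log = np.linalg.pinv(X2) @ np.log10(y)
pred2 = 10 ** (X2 @ b_log)                   # 10^(b0 + b1*log10(x))

# Compare squared-error risk in natural units (grams^2).
print(np.mean((y - pred1) ** 2), np.mean((y - pred2) ** 2))
```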

The following gives the visual diagnostics. The left plots correspond to the linear fit

$$b_0 + x_i b_1,$$

and the right plots correspond to the fit

$$10^{b_0 + b_1 \log_{10}(x_i)}.$$

From top to bottom we have the lines of best fit, the residuals, and QQ-plots of the residuals.



















[Figures: lines of best fit (weight in grams versus length, 200-500 mm), residual plots, and QQ-plots of the sample residuals versus standard normal quantiles, for the raw-scale model (left column) and the log-log model (right column).]
