Homework 1

E6690: Statistical Learning for Bio & Info Systems

P1.  Let

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\text{and}\quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 .$$

Show that:

(a)  (2pt) $\sum_{i=1}^{n} X_i^2 = (n-1)S^2 + n\bar{X}^2$

(b)  (2pt) If $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.) with variance $\sigma^2$, then $S^2$ is an unbiased estimator of $\sigma^2$, i.e., $E S^2 = \sigma^2$.

In the following, in addition to the above, assume that the $X_i$ have a normal (Gaussian) distribution $N(\mu, \sigma^2)$.

(c)  (3pt) Show (prove) that $\bar{X}$ is independent of $X_i - \bar{X}$, $i = 1, 2, \ldots, n$. (Hint: Both $\bar{X}$ and $X_i - \bar{X}$ are normal.)

(d)  (3pt) Show (prove) that the sample mean, $\bar{X}$, is independent of the sample variance, $S^2$.
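As a quick numerical sanity check on identity (a) (not a proof), the following R sketch compares both sides on simulated data. The sample size, mean and standard deviation are arbitrary choices made for illustration; var() in R uses the n - 1 denominator, which matches the definition of the sample variance used here.

```r
# Numerical check of: sum(X_i^2) = (n - 1) * S^2 + n * Xbar^2
# The data below are an arbitrary illustration, not part of the assignment.
set.seed(1)
x <- rnorm(10, mean = 2, sd = 3)
n <- length(x)

lhs <- sum(x^2)                          # left-hand side of (a)
rhs <- (n - 1) * var(x) + n * mean(x)^2  # var() divides by n - 1

# TRUE when both sides agree up to floating-point tolerance
identity_holds <- isTRUE(all.equal(lhs, rhs))
print(identity_holds)
```

A check like this only guards against algebra slips; parts (a)-(d) still require a written argument.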

 

P2. (10pt) Show that in the case of simple linear regression between Y and X, the $R^2$ statistic is equal to the square of the correlation coefficient between X and Y ($r^2$).  For simplicity, you may assume that $\bar{y} = \bar{x} = 0$.

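Recall that

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad\text{and}\quad r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} .$$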
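The claim in P2 can be checked numerically before proving it. The R sketch below fits a simple linear regression to simulated data and compares the reported R-squared with the squared sample correlation; the data-generating coefficients are arbitrary assumptions chosen only for illustration.

```r
# Compare R^2 from lm() with the squared correlation coefficient.
# The simulated data are an arbitrary illustration.
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared  # R^2 of the simple linear regression
r <- cor(x, y)                       # sample correlation between x and y

# TRUE when R^2 and r^2 agree up to floating-point tolerance
print(isTRUE(all.equal(r_squared, r^2)))
```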

P3.  (20pt; each bullet 2pt) Create some simulated data and fit simple linear regression models to it.  Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

(a)  Using the rnorm() function, create a vector, x, containing 100 observations drawn from a $N(0, 1)$ distribution. This represents a feature, X.

(b)  Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a $N(0, 0.25)$ distribution.

(c)  Using x and eps, generate a vector y according to the model $Y = -1 + 0.5X + \epsilon$.

What is the length of the vector y? What are the values of $\beta_0$ and $\beta_1$ in this linear model?

(d)  Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

(e)  Fit a least squares linear model to predict y using x.  Comment on the model obtained.  How do $\hat{\beta}_0$ and $\hat{\beta}_1$ compare to $\beta_0$ and $\beta_1$?

(f)  Display the least squares line on the scatterplot obtained in (d).  Draw the population regression line on the plot, in a different color.  Use the legend() command to create an appropriate legend.

(g)  Now fit a polynomial regression model that predicts y using x and $x^2$. Is there evidence that the quadratic term improves the model fit?  Explain your answer.

(h)  Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model in (c) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term $\epsilon$ in (b).  Describe your results.

(i)  Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data.  The model in (c) should remain the same.  You can do this by increasing the variance of the normal distribution used to generate the error term $\epsilon$ in (b).  Describe your results.

(j) What are the confidence intervals for $\beta_0$ and $\beta_1$ based on the original data set, the noisier data set, and the less noisy data set? Comment on your results. (You could use the confint() function.)
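The workflow in parts (a)-(f) and (j) can be sketched in R as follows. This is a minimal outline, not a full solution: it reads N(0, 0.25) as variance 0.25, consistent with the $N(\mu, \sigma^2)$ notation of P1, and rnorm() expects a standard deviation, so the sketch passes sqrt(0.25). The colors, legend labels and legend position are arbitrary choices.

```r
set.seed(1)
x <- rnorm(100)                     # (a) feature drawn from N(0, 1)
eps <- rnorm(100, sd = sqrt(0.25))  # (b) assumption: 0.25 is the variance, so sd = 0.5
y <- -1 + 0.5 * x + eps             # (c) population model

fit <- lm(y ~ x)                    # (e) least squares fit
print(coef(fit))                    # estimated intercept and slope
print(confint(fit))                 # (j) confidence intervals, 95% by default

plot(x, y)                             # (d) scatterplot
abline(fit, col = "red")               # (f) least squares line
abline(a = -1, b = 0.5, col = "blue")  # (f) population regression line
legend("topleft", legend = c("least squares", "population"),
       col = c("red", "blue"), lty = 1)
```

For parts (h) and (i), only the sd argument in step (b) needs to change; the rest of the sketch can be rerun unchanged.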

P4.  (10pt) Using R and the Advertising data set, find 92% confidence intervals for $\beta_0$ and $\beta_1$ for three single-feature linear regressions of Sales versus Newspaper, TV and Radio, respectively.  Then, create a scatterplot for each of them with the 92% confidence interval lines, i.e., draw the lines that correspond to the ends of confidence intervals for $(\beta_0, \beta_1)$. The answer should include the R code and graphs.
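One possible reading of the "confidence interval lines" is sketched below: a line through the lower ends of both intervals and a line through the upper ends. The helper plot_ci_lines() is hypothetical (it is not part of R or of the assignment), and the stand-in data only exist so that the sketch runs without the Advertising file; with the real data, x would be one budget column and y the sales column.

```r
# Hypothetical helper: fits y ~ x, draws the fit and the two lines built
# from the ends of the confidence intervals for (beta_0, beta_1).
plot_ci_lines <- function(x, y, level = 0.92, ...) {
  fit <- lm(y ~ x)
  ci <- confint(fit, level = level)  # rows: intercept, slope; columns: lower, upper
  plot(x, y, ...)
  abline(fit)                                  # least squares line
  abline(a = ci[1, 1], b = ci[2, 1], lty = 2)  # lower ends of both intervals
  abline(a = ci[1, 2], b = ci[2, 2], lty = 2)  # upper ends of both intervals
  invisible(ci)
}

# Stand-in data (an assumption, not the Advertising data set).
set.seed(1)
x_demo <- runif(50, 0, 100)
y_demo <- 5 + 0.05 * x_demo + rnorm(50)
ci_demo <- plot_ci_lines(x_demo, y_demo, xlab = "budget", ylab = "sales")
print(ci_demo)
```

The level argument of confint() sets the coverage, so level = 0.92 gives the 92% intervals requested here.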

P5. Consider the Auto data set:

(a)  (5pt) Produce a scatterplot matrix which includes all of the pairs of variables in the data set.

(b)  (5pt) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

(c)  (5pt) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors.  Use the summary() function to print the results.  Comment on the output.  For instance:

i.  Is there a relationship between the predictors and the response?

ii. Which predictors appear to have a statistically significant relationship to the response?

iii. What does the coefficient for the year variable suggest?

(d)  (5pt) Try a few different transformations of the variables, such as $\log(X)$, $\sqrt{X}$, $X^2$.  Comment on your findings.

P6. (10pt) A data set has n = 20,

$$\sum_{i=1}^{20} x_i = 8.552, \quad \sum_{i=1}^{20} y_i = 398.2, \quad \sum_{i=1}^{20} x_i^2 = 5.196, \quad \sum_{i=1}^{20} y_i^2 = 9356, \quad\text{and}\quad \sum_{i=1}^{20} x_i y_i = 216.6 .$$

Calculate $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\sigma}^2$. What is the fitted value when x = 0.5? Compute $R^2$.
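One standard way to organize this computation from summary statistics is sketched below. The symbols $S_{xx}$, $S_{xy}$ and $S_{yy}$ are shorthand introduced here, and $\hat{\sigma}^2$ is taken to be the usual residual variance estimate with $n - 2$ degrees of freedom, which is an assumption about the intended definition.

```latex
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2, &
S_{xy} &= \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)\Big(\sum_{i=1}^{n} y_i\Big), &
S_{yy} &= \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} y_i\Big)^2, \\
\hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}}, &
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x}, &
\hat{y}(x) &= \hat{\beta}_0 + \hat{\beta}_1 x, \\
\mathrm{RSS} &= S_{yy} - \hat{\beta}_1 S_{xy}, &
\hat{\sigma}^2 &= \frac{\mathrm{RSS}}{n - 2}, &
R^2 &= 1 - \frac{\mathrm{RSS}}{S_{yy}} .
\end{aligned}
```

Here $S_{yy}$ plays the role of the total sum of squares, so the last expression is the same $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$ used in P2.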

P7. (10pt) The multiple linear regression model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6$$

is fitted to a data set of n = 45 observations. The total sum of squares is TSS = 11.62, and the residual sum of squares is RSS = 8.95. What is the p-value for the null hypothesis

$$H_0: \quad \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0 \;?$$
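A hypothesis of this form is usually tested with the overall F-statistic, $F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$, which under the null hypothesis and normal errors follows an F distribution with $p$ and $n - p - 1$ degrees of freedom. The R sketch below evaluates it for the values stated in the problem; the variable names are illustrative, and the use of this particular test is an assumption about the intended method.

```r
# Overall F-test for H0: all p slope coefficients are zero.
n <- 45      # observations
p <- 6       # predictors x1, ..., x6
tss <- 11.62 # total sum of squares
rss <- 8.95  # residual sum of squares

f_stat <- ((tss - rss) / p) / (rss / (n - p - 1))

# Upper-tail probability of the F(p, n - p - 1) distribution
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
print(c(F = f_stat, p_value = p_value))
```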


Extra Credit

Under normal assumptions we can compute the distributions of a lot of quantities explicitly.

E1. (5pt) Chi-squared distribution. Let $X_1, X_2, \ldots, X_n$ be independent standard normal random variables and recall that the Chi-squared random variable with $n$ degrees of freedom is defined as $\chi_n^2 = X_1^2 + X_2^2 + \cdots + X_n^2$. Show that its density is

$$g_n(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2}, \quad x > 0,$$

where $\Gamma(x)$ is the gamma function. (Hint:  Prove first for n = 1, 2, and then use mathematical induction.)

E2. (5pt) Let X1, X2 , . . . , Xn  be independent normal random variables N(µ, σ2 ).  Prove that

$$\frac{(n-1)S^2}{\sigma^2} \stackrel{d}{=} \chi_{n-1}^2 ,$$

where $\stackrel{d}{=}$ stands for equality in distribution.

(Hint:  Derive the moment generating function of $\chi_n^2$ and use problem P1.(a) and (d).)
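One way to set up the hint, given here as a sketch rather than a complete proof, is to apply P1.(a) to the standardized variables $(X_i - \mu)/\sigma$ and then compare moment generating functions. The symbol $M(t)$ for the moment generating function of $(n-1)S^2/\sigma^2$ is notation introduced here.

```latex
\begin{aligned}
\sum_{i=1}^{n} \Big(\frac{X_i - \mu}{\sigma}\Big)^2
  &= \frac{(n-1)S^2}{\sigma^2} + \Big(\frac{\sqrt{n}\,(\bar{X} - \mu)}{\sigma}\Big)^2, \\
(1 - 2t)^{-n/2} &= M(t)\,(1 - 2t)^{-1/2}, \qquad t < \tfrac{1}{2}, \\
M(t) &= (1 - 2t)^{-(n-1)/2} .
\end{aligned}
```

The left-hand side of the first line is $\chi_n^2$ and the last term is $\chi_1^2$; the factorization in the second line relies on the independence from P1.(d), and it still has to be justified that the moment generating function of $\chi_k^2$ is $(1 - 2t)^{-k/2}$.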

E3. (5pt) Student's t distribution. Let $t_n$ be Student's t variable, defined as

$$t_n = \frac{Z}{\sqrt{\chi_n^2 / n}} ,$$

where $Z \sim N(0, 1)$.  Prove that $t_n$ has the density

$$f_n(t) = \frac{\Gamma((n+1)/2)}{\sqrt{\pi n}\,\Gamma(n/2)} \cdot \frac{1}{(1 + t^2/n)^{(n+1)/2}} ,$$

where $\Gamma(x)$ is the gamma function.  Show that for large values of $n$, $f_n(t)$ is approximately normal, $f_n(t) \approx e^{-t^2/2}/\sqrt{2\pi}$. (Hint: First show that the conditional density (distribution) of $t_n$ given $\chi_n^2 = x$ is normal with mean 0 and variance $n/x$. Then, use problem E1. to integrate this conditional density.)

E4. (5pt) F (Fisher) distribution. Let U and V be two independent Chi-squared random variables with degrees of freedom $n_1$ and $n_2$, and define the random variable, $F \sim F(n_1, n_2)$, as

$$F = \frac{U/n_1}{V/n_2} .$$

Show that the density of F is given by

$$f(w) = \frac{(n_1/n_2)^{n_1/2}\,\Gamma[(n_1 + n_2)/2]\, w^{(n_1/2) - 1}}{\Gamma[n_1/2]\,\Gamma[n_2/2]\,[1 + (n_1 w/n_2)]^{(n_1 + n_2)/2}} .$$

(Hint: Compute first the distribution of F given V.)