STAC67: Regression Analysis Assignment 3
(Total: 100 points)
Please submit your R Markdown file for Q. 4 and Q. 5 along with your assignment submission.
Q.1 (24 points) Prove the following statements.
(a) (4 pts) SSR (the regression sum of squares) in matrix notation is: $SSR = Y'\left(H - \frac{1}{n}J\right)Y$
(b) (4 pts) Show that $\frac{1}{n}J$, $H - \frac{1}{n}J$, and $I - H$ are idempotent and pairwise orthogonal (i.e. the product of each pair is the zero matrix).
(c) (4 pts) Show that $SSR/\sigma^2$ is distributed as a non-central chi-square with $p' - 1$ degrees of freedom.
(d) (4 pts) Show that $SSE/\sigma^2$ is distributed as $\chi^2_{n-p'}$, a chi-square with $n - p'$ degrees of freedom.
(e) (4 pts) Show that $SSR/\sigma^2$ and $SSE/\sigma^2$ are independent.
(f) (4 pts) We consider the general linear hypothesis test:
$H_0: K'\beta = m$ vs. $H_a: K'\beta \neq m$
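A quick numerical sanity check can make part (b) concrete before attempting the proof. The sketch below is not a proof: it builds the three matrices in (b), read here as (1/n)J, H - (1/n)J and I - H, for a small simple-regression design with an intercept, and checks idempotency and pairwise orthogonality up to rounding. The x values and the helper names (`matmul`, `close`, and so on) are made up for illustration.

```python
# Numerical sanity check (not a proof) of Q.1(b) on a toy simple-regression design.
# The x values and helper functions below are hypothetical, for illustration only.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def close(A, B, tol=1e-9):
    return all(abs(a - b) < tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

x = [1.0, 2.0, 4.0, 7.0, 9.0]            # made-up predictor values
n = len(x)
X = [[1.0, xi] for xi in x]              # design matrix with an intercept column

# (X'X)^{-1} via the closed-form inverse of a 2x2 matrix
XtX = matmul(transpose(X), X)
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
XtX_inv = [[XtX[1][1] / det, -XtX[0][1] / det],
           [-XtX[1][0] / det, XtX[0][0] / det]]

H = matmul(matmul(X, XtX_inv), transpose(X))    # hat matrix X (X'X)^{-1} X'
I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
Jn = [[1.0 / n] * n for _ in range(n)]          # (1/n) J, J being the n x n matrix of ones
zero = [[0.0] * n for _ in range(n)]

A1, A2, A3 = Jn, sub(H, Jn), sub(I, H)

idempotent = all(close(matmul(A, A), A) for A in (A1, A2, A3))
orthogonal = all(close(matmul(A, B), zero) for A, B in ((A1, A2), (A1, A3), (A2, A3)))
print(idempotent, orthogonal)
```

The check relies on the intercept column: because the vector of ones lies in the column space of X, HJ = J, which is the identity the algebraic proof also turns on.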
Q. 2 (10 points) A researcher fits a multiple linear regression model, relating yield (Y) of a chemical process to temperature (x1), and the amounts of 2 additives (x2 and x3, respectively). She fits the following model:
$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
She wishes to test the following three hypotheses simultaneously:
● The mean response when x1 = 70, x2 = 10, x3 = 10 is 80
● The average yield increases by 4 units when temperature increases by 1, controlling for x2 and x3
● The partial effect of increasing each additive is the same (controlling for all other factors)
(a) Specify the matrix $K'$ and the vectors $\beta$ and $m$ in the hypothesis that she is testing (this is her null hypothesis):
$H_0: K'\beta = m$
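One possible encoding of the three hypotheses is sketched below. It assumes the coefficient vector (written β here) is ordered as intercept, temperature, additive 1, additive 2, and that "the same partial effect" is expressed as the difference of the two additive coefficients being zero; other equivalent choices of rows exist.

```latex
\[
K' = \begin{pmatrix} 1 & 70 & 10 & 10 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix},
\qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix},
\qquad
m = \begin{pmatrix} 80 \\ 4 \\ 0 \end{pmatrix}
\]
```

Each row of $K'$ corresponds to one bullet above, so under this encoding $K'$ has three linearly independent rows.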
(b) She obtains the following results from fitting the regression based on n = 24 measurements while conducting the experiment:
$(K'\hat{\beta} - m)'\,\bigl(K'(X'X)^{-1}K\bigr)^{-1}\,(K'\hat{\beta} - m) = 1800$; $Y'(I - H)Y = 7800$
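The two quantities quoted in (b) are the ingredients of the usual general linear test statistic, F = (Q/q) / (SSE/(n - p')), where Q is the quadratic form and q is the number of independent constraints. The arithmetic can be sketched as follows, assuming q = 3 (one constraint per hypothesis) and p' = 4 parameters; the variable names are illustrative.

```python
# General linear test statistic from the summary quantities quoted in Q.2(b).
quad_form = 1800.0   # (K'b - m)' [K'(X'X)^{-1} K]^{-1} (K'b - m)
sse = 7800.0         # Y'(I - H)Y
n = 24               # number of measurements
p_prime = 4          # parameters in the model: intercept plus three slopes
q = 3                # ASSUMPTION: three linearly independent constraints in K'

mse = sse / (n - p_prime)          # 7800 / 20 = 390
f_stat = (quad_form / q) / mse     # 600 / 390
print(round(f_stat, 4))            # 1.5385
```

The statistic would then be compared with the F critical value on (q, n - p') = (3, 20) degrees of freedom at whatever significance level the test uses.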
Q. 3 (20 points) Suppose that X is a categorical variable with 3 levels (A, B, C)
and we define the indicator variables I1 and I2 as:
$I_1 = 1$ if $X = A$ and $0$ otherwise; $I_2 = 1$ if $X = B$ and $0$ otherwise.
For a continuous response variable Y, consider fitting the linear model $Y = \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \varepsilon$.
We take a total sample of n individuals. Let $n_A, n_B, n_C$ be the number of individuals in each category of X and let $\bar{y}_A, \bar{y}_B, \bar{y}_C$ be the sample means of Y for individuals in each category of X.
(a) (5 pts) Find $X'X$ and $X'Y$.
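(b) (10 pts) Show that the least squares estimates are
$\hat{\beta}_0 = \bar{y}_C$, $\hat{\beta}_1 = \bar{y}_A - \bar{y}_C$, $\hat{\beta}_2 = \bar{y}_B - \bar{y}_C$,
using both options (5 points each):
(option 1) $\hat{\beta} = (X'X)^{-1}X'Y$.
(option 2) Minimize, over the parameter values $\beta_0, \beta_1, \beta_2$, the sum of squared errors
$S(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 I_{1i} - \beta_2 I_{2i})^2$.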
(c) (5 pts) Let $s_A^2, s_B^2, s_C^2$ be the usual sample variances of Y for individuals in each category of X. Show that the error sum of squares can be written as
$SSE = (n_A - 1)s_A^2 + (n_B - 1)s_B^2 + (n_C - 1)s_C^2$
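The claims in this question can be checked numerically on a toy data set before proving them in general. The sketch below assumes the coding I1 = 1 for level A and I2 = 1 for level B (so that C is the baseline), plugs in the group-mean estimates stated in the question, verifies that the residuals satisfy the normal equations X'e = 0, and compares the directly computed SSE with the pooled-variance expression. The responses are made up for illustration.

```python
from statistics import mean, variance
from math import isclose

# Made-up responses for each level of X (illustration only).
groups = {"A": [4.0, 6.0, 5.0], "B": [7.0, 9.0], "C": [1.0, 2.0, 3.0, 2.0]}

# Candidate least squares estimates: baseline mean and differences from it.
b0 = mean(groups["C"])
b1 = mean(groups["A"]) - b0
b2 = mean(groups["B"]) - b0

# Rows (1, I1, I2, y) under the ASSUMED coding I1 = 1{X = A}, I2 = 1{X = B}.
rows = [(1.0, float(g == "A"), float(g == "B"), y)
        for g, ys in groups.items() for y in ys]

# Residuals e_i = y_i - b0 - b1*I1i - b2*I2i.
resid = [y - b0 * c0 - b1 * i1 - b2 * i2 for c0, i1, i2, y in rows]

# Normal equations X'e = 0: residuals are orthogonal to every column of X.
normal_eqs = [sum(r[j] * e for r, e in zip(rows, resid)) for j in range(3)]

sse_direct = sum(e * e for e in resid)
sse_formula = sum((len(ys) - 1) * variance(ys) for ys in groups.values())
print(normal_eqs, sse_direct, sse_formula)
```

Because the sum of squared errors is a convex function of the coefficients, any coefficient vector that satisfies the normal equations is a minimizer, which is why checking X'e = 0 is enough here.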
Q. 4 (20 points) The public health department wished to study the relation between the average estimated probability of acquiring an infection in the hospital (Infections, in percent; higher is worse) and the average length of stay of all patients in the hospital (StayLength, in days, X1), the average age of patients (Age, in years, X2), and the average number of beds in the hospital during the study period (Beds, X3). The data file, "Infections.csv", can be found on Quercus. Please ignore the other three variables (MedSchool, Region, and Nurses) for this question.
(a) (4 pts) Obtain the scatter plot matrix and the correlation matrix. Interpret these and state your principal findings. Is there any concern about multicollinearity?
(b) (4 pts) Fit a regression model with the three predictor variables to the data and state the estimated regression function. How is $\hat{\beta}_2$ interpreted here?
(c) (4 pts) Test whether there is a regression relation; use $\alpha = 0.05$. State the alternatives, decision rule, and conclusion. What does your test imply about $\beta_1$, $\beta_2$, and $\beta_3$? What is the P-value of the test?
(d) (4 pts) Calculate the coefficient of determination and the adjusted coefficient of determination. What do they indicate here?
(e) (4 pts) Obtain a 90% prediction interval for the infection rate of a new hospital when StayLength = 10, Age = 45, and Beds = 150. Interpret your prediction interval.
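For reference, the standard textbook expressions behind parts (c) to (e) are collected below. This is a sketch under the usual normal-error model with $p' = 4$ parameters (intercept plus three slopes); the notation $SSTO$, $MSE$, and $x_h$ is introduced here for illustration and is not taken from the assignment.

```latex
\[
F^* = \frac{SSR/(p'-1)}{SSE/(n-p')}, \qquad
\text{reject } H_0 \text{ if } F^* > F(1-\alpha;\; p'-1,\; n-p')
\]
\[
R^2 = 1 - \frac{SSE}{SSTO}, \qquad
R^2_a = 1 - \frac{n-1}{n-p'} \cdot \frac{SSE}{SSTO}
\]
\[
\hat{Y}_h \pm t(1-\alpha/2;\; n-p')\,\sqrt{MSE\,\bigl(1 + x_h'(X'X)^{-1}x_h\bigr)},
\qquad x_h = (1,\, 10,\, 45,\, 150)'
\]
```

For the 90% prediction interval in (e), the level corresponds to $\alpha = 0.10$ in the last expression, which is separate from the $\alpha = 0.05$ used for the test in (c).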
Q. 5 (26 pts) We will use the same dataset as in Question 4, “Infections.csv”, for this question. The variables to be used are described below:
● Infections (Y): the average estimated probability of acquiring an infection in the hospital, in percent; higher is worse
● Beds: the average number of beds in hospital during study period
● Region: geographic region (NE = Northeast, NC = North Central, S = South, W = West)
(a) (8 pts) Write down the full model with the interaction terms. Fit the full model in R. Compute the estimated regression function for each geographic region and plot them.
(b) (4 pts) Test whether the slopes relating the average number of beds to infections are the same for each geographic region at the $\alpha = 0.05$ significance level.
(c) (2 pts) What model would you choose for this data? Justify your answer.
(d) (6 pts) For the model you chose in (c), check and comment on the standard assumptions of the regression model.
(e) (6 pts) Look for a suitable transformation of Y and/or X (= Beds). Fit the regression with the transformed variable(s), without interaction, and comment on whether this model fits better.
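As a point of reference for parts (a) and (b), one common way to write the interaction model and the equal-slopes test is sketched below. It assumes NE is taken as the reference region, with indicator variables $I_{NC}$, $I_S$, $I_W$ for the other three regions and $X$ = Beds; the indicator names and the choice of baseline are illustrative, and R may pick a different baseline depending on how the factor levels are ordered.

```latex
\[
Y = \beta_0 + \beta_1 X + \beta_2 I_{NC} + \beta_3 I_{S} + \beta_4 I_{W}
  + \beta_5 X I_{NC} + \beta_6 X I_{S} + \beta_7 X I_{W} + \varepsilon
\]
\[
H_0: \beta_5 = \beta_6 = \beta_7 = 0
\quad \text{vs.} \quad
H_a: \text{not all of } \beta_5, \beta_6, \beta_7 \text{ equal } 0
\]
\[
F^* = \frac{(SSE_R - SSE_F)/3}{SSE_F/(n-8)}
\]
```

Here $SSE_F$ is the error sum of squares of the full model (8 parameters) and $SSE_R$ that of the reduced model without the three interaction terms; under this coding the slope for region NE is $\beta_1$ and the slope for region NC is $\beta_1 + \beta_5$, and similarly for S and W.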
2022-11-11