MAT3375: Midterm Examination
Date: Thursday October 20, 2022
1. Consider a simple linear regression model: $y_i = \beta_0 + \beta_1 x_i + e_i$. (10 points)
a. List the Gauss-Markov conditions. (3 points)
b. Define residuals, and provide explanations/descriptions of how we expect the residuals to behave, when the Gauss-Markov conditions are satisfied. (2 points)
c. Suppose the Gauss-Markov conditions are satisfied. Show that $\hat{\beta}_0$ and $\hat{\beta}_1$ (formulas given below) are unbiased estimators of the true parameters $\beta_0$ and $\beta_1$. (5 points)
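$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$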
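Unbiasedness can also be illustrated numerically, which is a useful sanity check but not a substitute for the proof requested in part c. The sketch below assumes the usual least-squares formulas (slope equal to the sum of cross-products over the sum of squared deviations of x, intercept equal to $\bar{y}$ minus slope times $\bar{x}$). The true parameters, the design points, the error distribution and the function names `ols_simple` and `simulate` are all hypothetical choices made for this example.

```python
import random
import statistics


def ols_simple(x, y):
    """Return (b0, b1), the least-squares intercept and slope."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def simulate(beta0=2.0, beta1=0.5, sigma=1.0, n=30, reps=5000, seed=0):
    """Average the OLS estimates over many simulated samples.

    All settings are hypothetical. Errors are drawn as independent
    N(0, sigma^2), which satisfies the Gauss-Markov conditions.
    """
    rng = random.Random(seed)
    x = [i / 2 for i in range(n)]  # fixed design, reused in every replication
    b0_draws, b1_draws = [], []
    for _ in range(reps):
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        b0, b1 = ols_simple(x, y)
        b0_draws.append(b0)
        b1_draws.append(b1)
    return statistics.fmean(b0_draws), statistics.fmean(b1_draws)


mean_b0, mean_b1 = simulate()
print(f"average intercept estimate: {mean_b0:.3f} (true value 2.0)")
print(f"average slope estimate:     {mean_b1:.3f} (true value 0.5)")
```

With enough replications the two averages should settle close to the true values, which is what unbiasedness predicts. The simulation holds x fixed across replications because the expectation in part c is taken over the errors only.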
2. A data sample from the Framingham Heart Study was analyzed in R to produce the following results. A multiple regression model was fit, where the dependent variable is systolic blood pressure (sbp1) and the independent variables are age (in years), sex (male=1, female=2) and diabetes status (positive=1, negative=0). (15 points)
Call:
lm(formula = data1$sbp1 ~ data1$age + data1$sex + data1$diabetes)
Residuals:
Min 1Q Median 3Q Max
-25.4684 -7.8459 0.1738 6.5514 31.1936
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                           13.558  < 2e-16 ***
data1$age                              3.955 0.000274 ***
data1$sex2                            -0.309 0.759090
data1$diabetes1                        2.621 0.012003 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.53 on 44 degrees of freedom
Multiple R-squared: 0.4188, Adjusted R-squared: 0.3792
F-statistic: 10.57 on 3 and 44 DF, p-value: 2.34e-05
a. How many individuals were used in the above analysis? (2 points)
b. $R^2 = 0.4188$ represents the proportion of the total variation in blood pressure (the dependent variable) that is explained by the regression model. What is the value of the total variation? The total variation (corrected sum of squares) of blood pressure ($y$) is given by $\sum_{i=1}^{n} (y_i - \bar{y})^2$. (4 points)
c. How do you interpret the estimates of the regression coefficients for age, sex and diabetes status? (3 points)
d. Write down the hypotheses for testing the overall significance of the regression model, and provide a complete description of the test statistic used, together with its distribution. Based on the results from our analysis, do we reject the null hypothesis? Justify your answer. (2 points)
e. Consider the p-value of 0.000274 corresponding to age. Describe the steps used to calculate this p-value. (2 points)
f. How is the last p-value (p-value: 2.34e-05) calculated? Show your steps. (2 points)
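The summary quantities in the R output are tied together by a few identities, and a short script makes those links concrete. The sketch below is one possible way to cross-check the printed numbers, not a model answer. It assumes the residual standard error equals $\sqrt{SSE/(n-k-1)}$ and that $R^2 = 1 - SSE/SST$. The two-sided p-value for a t statistic is approximated by integrating the t density with Simpson's rule, so only the Python standard library is needed. The helper names `t_pdf` and `two_sided_t_pvalue` are made up for this example.

```python
import math

# Values read from the R output above.
df_resid = 44        # residual degrees of freedom
n_coef = 4           # intercept, age, sex2, diabetes1
rse = 11.53          # residual standard error
r_squared = 0.4188   # multiple R-squared
t_age = 3.955        # t value reported for age

n_obs = df_resid + n_coef               # observations used in the fit
sse = rse ** 2 * df_resid               # residual sum of squares
sst = sse / (1.0 - r_squared)           # corrected total sum of squares
k = n_coef - 1                          # number of slope coefficients
f_stat = (r_squared / k) / ((1.0 - r_squared) / df_resid)


def t_pdf(t, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))


def two_sided_t_pvalue(t, nu, steps=2000):
    """Approximate P(|T| > |t|) for T ~ t_nu with Simpson's rule on [0, |t|].

    steps must be even for Simpson's rule.
    """
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3                # P(0 < T < |t|)
    return 1.0 - 2.0 * area


p_age = two_sided_t_pvalue(t_age, df_resid)

print("observations:", n_obs)
print("SSE:", round(sse, 2))
print("SST:", round(sst, 2))
print("F statistic recomputed from R-squared:", round(f_stat, 2))
print("two-sided p-value for age:", f"{p_age:.6f}")
```

Because the inputs are rounded values copied from the printout, the recomputed F statistic and p-value should agree with the R output only up to rounding. The overall F-test p-value could be checked the same way by integrating the F density with 3 and 44 degrees of freedom beyond the observed statistic.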
3. Consider multiple linear regression with $k$ independent variables, described in matrix form as $y = X\beta + \epsilon$, where $y$ is a column vector of length $n$; $\beta$ is a column vector of length $k + 1$ consisting of the regression coefficients (including the intercept); $X$ is a matrix of dimension $n$ by $k + 1$ consisting of the measurements on the $k$ independent variables and one additional column of 1's corresponding to the intercept; and $\epsilon$ is a column vector of length $n$ consisting of the error terms. Suppose the error terms are independently and identically distributed according to the normal distribution with mean zero and variance $\sigma^2$, i.e., $\epsilon_i \sim N(0, \sigma^2)$. The maximum likelihood estimator of $\beta$ is given by $\hat{\beta} = (X'X)^{-1}X'y$. (10 points)
a. What is the distribution of $\hat{\beta}$? Justify your answer. (1 point)
b. Calculate the expected value and variance-covariance matrix of $\hat{\beta}$. (4 points)
c. Show that $\hat{y}'\hat{y} = \hat{\beta}'X'y$ and $r'r = y'y - \hat{\beta}'X'y$, where $\hat{y}$ and $r$ are the predicted values and the residuals, respectively. (5 points)
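One possible line of argument for parts b and c is sketched below as a worked derivation. It assumes that $X$ is fixed (non-random) with full column rank, so that $X'X$ is invertible, and it uses the definitions $\hat{y} = X\hat{\beta}$ and $r = y - \hat{y}$. These assumptions are standard but are not stated explicitly in the question.

```latex
\begin{align*}
E[\hat{\beta}] &= (X'X)^{-1}X'\,E[y] = (X'X)^{-1}X'X\beta = \beta, \\
\operatorname{Var}(\hat{\beta})
  &= (X'X)^{-1}X'\,\operatorname{Var}(y)\,X(X'X)^{-1}
   = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1}
   = \sigma^2 (X'X)^{-1}, \\[6pt]
\hat{y}'\hat{y}
  &= (X\hat{\beta})'(X\hat{\beta})
   = \hat{\beta}'X'X\hat{\beta}
   = \hat{\beta}'X'X(X'X)^{-1}X'y
   = \hat{\beta}'X'y, \\
r'r
  &= (y - \hat{y})'(y - \hat{y})
   = y'y - 2\hat{y}'y + \hat{y}'\hat{y}
   = y'y - 2\hat{\beta}'X'y + \hat{\beta}'X'y
   = y'y - \hat{\beta}'X'y.
\end{align*}
```

The step $\hat{y}'y = \hat{\beta}'X'y$ in the last line follows directly from $\hat{y} = X\hat{\beta}$, and the variance line uses $\operatorname{Var}(y) = \sigma^2 I$.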
4. Consider the dataset described in Q#2 and the output from R corresponding to this dataset. (5 points)
a. Provide the dimensions (number of rows and columns) of $C = (X'X)^{-1}$ corresponding to the data. (1 point)
b. Based on the results from R, provide the values of the diagonal elements of the matrix $C$. (4 points)
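The link between the R output and the matrix $C$ is the standard result that the estimated standard error of each coefficient is $\hat{\sigma}\sqrt{C_{jj}}$, where $\hat{\sigma}$ is the residual standard error. Rearranging gives $C_{jj} = (SE_j / \hat{\sigma})^2$. The sketch below shows that arithmetic. The function name `diag_of_c` is made up for this example, and the input numbers are hypothetical placeholders rather than values from the exam output.

```python
def diag_of_c(std_errors, sigma_hat):
    """Return C_jj = (SE_j / sigma_hat)^2 for each coefficient standard error."""
    if sigma_hat <= 0:
        raise ValueError("sigma_hat must be positive")
    return [(se / sigma_hat) ** 2 for se in std_errors]


# Hypothetical standard errors and residual standard error,
# NOT taken from the exam output.
example_se = [2.0, 4.0]
print(diag_of_c(example_se, sigma_hat=2.0))  # -> [1.0, 4.0]
```

For the exam data, the inputs would be the four entries of the Std. Error column and the residual standard error of 11.53.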