Advanced Data Analysis (557), Spring 2021
MIDTERM EXAMINATION
1. Use the following probability distribution to answer the questions:
      | x = -1 | x = 0 | x = 1 |
y = 0 |  0.10  | 0.25  | 0.20  |
y = 1 |  0.20  | 0.10  | 0.15  |
(a) Compute $E[y] \equiv \mu_y$ and $V[x] \equiv \sigma_x^2$.
(b) Compute $C[x, y] \equiv \sigma_{xy}$.
(c) Compute $E[y \mid x]$ and $V[y \mid x]$. Is homoskedasticity a reasonable assumption for this population?
(d) Compute the best linear predictor of $y$ given $x$: $L[y \mid x] \equiv \alpha + \beta x$.
(e) Are $E[y \mid x]$ and $L[y \mid x]$ identical for this population? Why or why not?
(f) Suppose you have a sample $\{(y_i, x_i)\}$ from the above population. Discuss the competing merits of the following estimators for $P[y = 1 \mid x]$:
i. the linear probability model (LPM);
ii. the logit model;
iii. linear discriminant analysis (LDA).
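As a minimal sketch of how the population quantities in parts (a) through (e) can be computed directly from a joint probability table, the R snippet below stores the table as a matrix and evaluates the moments. It assumes the support of x is {-1, 0, 1}; the object names (`pmf`, `xs`, `ys`, `cef`, `blp`) are illustrative and not part of the exam.

```r
# Joint pmf: rows index y in {0, 1}, columns index x in {-1, 0, 1} (assumed support)
xs <- c(-1, 0, 1)
ys <- c(0, 1)
pmf <- matrix(c(0.10, 0.25, 0.20,
                0.20, 0.10, 0.15),
              nrow = 2, byrow = TRUE,
              dimnames = list(y = as.character(ys), x = as.character(xs)))
stopifnot(isTRUE(all.equal(sum(pmf), 1)))  # probabilities must sum to one

px <- colSums(pmf)  # marginal distribution of x
py <- rowSums(pmf)  # marginal distribution of y

mu_x   <- sum(xs * px)
mu_y   <- sum(ys * py)
var_x  <- sum(xs^2 * px) - mu_x^2
cov_xy <- sum(outer(ys, xs) * pmf) - mu_x * mu_y

# Conditional mean and variance of y at each value of x.
# y is binary, so V[y|x] = E[y|x] * (1 - E[y|x]).
cef  <- pmf["1", ] / px
cvar <- cef * (1 - cef)

# Best linear predictor L[y|x] = alpha + beta * x
beta  <- cov_xy / var_x
alpha <- mu_y - beta * mu_x
blp   <- alpha + beta * xs

print(round(c(mu_y = mu_y, var_x = var_x, cov_xy = cov_xy,
              alpha = alpha, beta = beta), 4))
print(round(rbind(cef = cef, cvar = cvar, blp = blp), 4))
```

Comparing the `cef` and `blp` rows of the printed table shows whether the conditional expectation is linear in x, and the `cvar` row shows whether the conditional variance is constant across x.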
2. A demand model (“Model 1”) for residential natural gas consumption is given by
$g^{(G)}_{it} = \beta_0 + \beta_1 p^{(G)}_{it} + \beta_2 p^{(O)}_{it} + \beta_3 p^{(E)}_{it} + \beta_4 hdd_{it} + \beta_5 y_{it} + u_{it}$
where
$g^{(G)}_{it}$ is the log of consumption of natural gas in state $i$ in year $t$ (i.e., the log of the consumption level);
$p^{(G)}_{it}$ is the log price of natural gas in state $i$ in year $t$;
$p^{(O)}_{it}$ is the log price of fuel oil in state $i$ in year $t$;
$p^{(E)}_{it}$ is the log price of electricity in state $i$ in year $t$;
$hdd_{it}$ is the log count of “heating degree days” in state $i$ in year $t$;
$y_{it}$ is log real per capita personal income in state $i$ in year $t$.
The data file natgas.rda (on Canvas) covers six US states (CA, FL, MI, NY, TX, and UT) and 23 years (1967-1989). The following questions refer to this data.
(a) Estimate Model 1 using OLS. What do the estimation results say about the demand relationship
between the three heating fuels (natural gas, fuel oil, electricity)?
(b) Is the demand curve for natural gas downward sloping? Formulate an appropriate test, being specific about the size of the test, the null hypothesis $H_0$ and the alternative hypothesis $H_1$. Perform the test and report the results appropriately.
(c) Consider a second model (“Model 2”) in which demand varies by state according to a state-specific unobserved effect $c_i$, giving
$g^{(G)}_{it} = \beta_0 + c_i + \beta_1 p^{(G)}_{it} + \beta_2 p^{(O)}_{it} + \beta_3 p^{(E)}_{it} + \beta_4 hdd_{it} + \beta_5 y_{it} + u_{it}$.
Estimate Model 2 using OLS and report any differences you observe in the estimated coefficients on $p^{(G)}$, $p^{(O)}$ and $p^{(E)}$ between Model 2 and Model 1.
(d) Are there state-specific demand effects? Formulate an appropriate test, again being specific about the size of the test, the null hypothesis and the alternative hypothesis. Perform the test and report the results appropriately.
(e) Discuss any difference in the regression $R^2$ between Model 1 and Model 2. Is that same difference reflected in the “adjusted $R^2$”?
(f) A commenter on your Model 2 observes that year-specific demand effects may also be present and suggests Model 3, given by
$g^{(G)}_{it} = \beta_0 + c_i + \delta_t + \beta_1 p^{(G)}_{it} + \beta_2 p^{(O)}_{it} + \beta_3 p^{(E)}_{it} + \beta_4 hdd_{it} + \beta_5 y_{it} + u_{it}$
where $\delta_t$ is a year-specific unobserved effect. Estimate Model 3 by OLS and comment on any differences in the regression fit or estimated coefficients on $p^{(G)}$, $p^{(O)}$ and $p^{(E)}$ between Model 3 and Model 2. Are the coefficient estimates plausible? Can the year-specific demand effects be removed from Model 3?
(g) Perform 5-fold cross-validation on Models 1, 2 and 3. Which model has the lowest training error? Which model has the lowest test error? Which model do you prefer (and on what basis)? Note: Because of the large number of indicator variables, some partitions into training and validation sets make prediction for the validation set impossible. If that happens, simply restart CV with a different training/validation partition.
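The workflow in Question 2 can be sketched in base R as follows. Because the column names in natgas.rda are not given here, the snippet runs on a simulated panel with the same shape (6 states by 23 years); every column name (`g`, `pg`, `po`, `pe`, `hdd`, `inc`, `state`, `year`), every coefficient value and the helper `cv_mse()` are hypothetical. State and year effects are estimated by including indicator variables, which is one common way to fit such models by OLS, and the CV helper restarts with a new partition when prediction fails, as the note in part (g) suggests.

```r
set.seed(557)

# Simulated stand-in for natgas.rda (6 states x 23 years).
# All column names and coefficient values below are hypothetical.
states <- c("CA", "FL", "MI", "NY", "TX", "UT")
d <- expand.grid(state = states, year = 1967:1989)
n <- nrow(d)
for (v in c("pg", "po", "pe", "hdd", "inc")) d[[v]] <- rnorm(n)
c_i <- rnorm(length(states))  # state-specific effects
d$g <- 1 + c_i[as.integer(d$state)] - 0.5 * d$pg + 0.2 * d$po +
  0.1 * d$pe + 0.4 * d$hdd + 0.3 * d$inc + rnorm(n, sd = 0.3)

f1 <- g ~ pg + po + pe + hdd + inc           # Model 1
f2 <- update(f1, . ~ . + factor(state))      # Model 2: state indicators
f3 <- update(f2, . ~ . + factor(year))       # Model 3: state and year indicators
m1 <- lm(f1, data = d)
m2 <- lm(f2, data = d)
m3 <- lm(f3, data = d)

# (b) One-sided test: H0: beta_1 >= 0 against H1: beta_1 < 0
t_pg <- coef(summary(m1))["pg", "t value"]
p_one_sided <- pt(t_pg, df = df.residual(m1))  # lower-tail p-value

# (d) F-test of H0: all state effects are zero (Model 1 is nested in Model 2)
f_state <- anova(m1, m2)

# (e) R^2 cannot fall when regressors are added; adjusted R^2 can
r2 <- sapply(list(m1, m2, m3), function(m)
  c(r2 = summary(m)$r.squared, adj_r2 = summary(m)$adj.r.squared))

# (g) k-fold CV; restarts with a fresh partition if predict() fails because
# a state or year appears only in the validation fold
cv_mse <- function(formula, data, k = 5, max_tries = 20) {
  for (try_no in seq_len(max_tries)) {
    fold <- sample(rep_len(seq_len(k), nrow(data)))
    res <- tryCatch(
      sapply(seq_len(k), function(j) {
        fit  <- lm(formula, data = data[fold != j, ])
        test <- data[fold == j, ]
        c(train = mean(residuals(fit)^2),
          test  = mean((test$g - predict(fit, newdata = test))^2))
      }),
      error = function(e) NULL)
    if (!is.null(res)) return(rowMeans(res))
  }
  stop("no valid partition found")
}
cv <- sapply(list(model1 = f1, model2 = f2, model3 = f3), cv_mse, data = d)

print(p_one_sided)
print(f_state)
print(round(r2, 3))
print(round(cv, 4))
```

Each call to `cv_mse()` draws its own partition, so training errors are only roughly comparable across models. Passing one shared `fold` vector to all three models would make the comparison exact.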
3. The data set burninj.rda (on Canvas) contains information collected on victims of burn injury. The variables are as follows:
death whether the burned individual died of their injury;
age the individual’s age;
sex the individual’s sex;
race the individual’s race (coded as “White”/“Non-White”);
area the total burn area;
inh whether the burn involved injury due to inhalation;
fire whether the burn involved open flame.
The following questions refer to this data.
(a) Estimate the following model by logit regression:
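$P(death_i = 1 \mid x) = \Lambda(\beta_0 + \beta_1 age_i + \beta_2 area_i + \beta_3 race_i + \beta_4 inh_i)$, where $\Lambda(z) = 1/(1 + e^{-z})$ is the logistic CDF.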
Do all of the coefficients take expected signs?
(b) Compute the average partial/marginal effect of area across the sample and a 95% confidence
interval for that effect (you may find the margins package useful).
(c) Bootstrap a 95% confidence interval for that same statistic.
(d) Compute the average partial effect of race.
(e) Compute the confusion matrix for the fitted logit model.
(f) The burninj data contains the outcome and $p = 6$ covariates. The following snippet of R code returns a list containing a model formula for each of the $2^p = 64$ subsets of those $p = 6$ covariates. For example, the formula corresponding to the model you estimated above is death ~ 1 + age + area + race + inh.
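load("burninj.rda")
vNames <- names(burninj)[2:7]
# Each row flags which of the 6 covariates enter the model (2^6 = 64 rows)
allBurnSubs <- as.matrix(expand.grid(c(TRUE,FALSE),c(TRUE,FALSE),c(TRUE,FALSE),c(TRUE,FALSE),c(TRUE,FALSE),c(TRUE,FALSE)))
# Build one model formula per subset of covariates
frmList <- lapply(1:nrow(allBurnSubs), function(i) {
  as.formula(paste(c("death ~ 1", vNames[allBurnSubs[i,]]), collapse = " + "))
})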
Using this code snippet, estimate each of the 64 models and extract the AIC corresponding to the model. You may find the function extractAIC() useful. Report the model identified as best according to the AIC criterion.
(g) Now suppose that you start from the largest model and, working stepwise, remove the variable whose removal lowers the AIC the most (or raises it the least) until you arrive at the null model. Does this path through the model space contain the “best” model from the previous part? Suppose instead that you start from the null model and, working stepwise, add the variable which lowers the AIC the most until you arrive at the largest model. Does that path contain the “best” model?
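The main computations in Question 3 can be sketched in base R as follows. The snippet uses simulated data rather than burninj.rda, so the data frame `burn`, its columns, the coefficient values and the helper `ape()` are all hypothetical. The average partial effect is computed by hand rather than with the margins package, using the fact that for a continuous regressor in a logit model the partial effect at observation i is the coefficient times p_i(1 - p_i). The bootstrap interval is a simple percentile interval.

```r
set.seed(557)

# Simulated stand-in for burninj.rda; names and coefficients are hypothetical
n <- 500
burn <- data.frame(age  = runif(n, 1, 90),
                   area = runif(n, 0, 80),
                   inh  = rbinom(n, 1, 0.2))
eta <- -6 + 0.05 * burn$age + 0.06 * burn$area + 1.2 * burn$inh
burn$death <- rbinom(n, 1, plogis(eta))

fit <- glm(death ~ age + area + inh, family = binomial, data = burn)

# Average partial effect of a continuous regressor in a logit model:
# mean over i of beta * p_i * (1 - p_i)
ape <- function(model, var) {
  p <- fitted(model)
  unname(coef(model)[var]) * mean(p * (1 - p))
}
ape_area <- ape(fit, "area")

# Nonparametric bootstrap: resample rows, refit, recompute the APE
B <- 200
boot_ape <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  ape(glm(death ~ age + area + inh, family = binomial, data = burn[idx, ]),
      "area")
})
ci <- quantile(boot_ape, c(0.025, 0.975))  # percentile 95% interval

# Confusion matrix using a 0.5 classification threshold (a choice, not a rule)
pred <- factor(as.integer(fitted(fit) > 0.5), levels = 0:1)
conf <- table(actual = burn$death, predicted = pred)

# AIC for every subset of the covariates, mirroring the exam's formula list
vars <- c("age", "area", "inh")
subs <- as.matrix(expand.grid(rep(list(c(TRUE, FALSE)), length(vars))))
aics <- apply(subs, 1, function(keep) {
  f <- as.formula(paste(c("death ~ 1", vars[keep]), collapse = " + "))
  extractAIC(glm(f, family = binomial, data = burn))[2]  # second element is AIC
})
best <- vars[subs[which.min(aics), ]]

print(ape_area)
print(ci)
print(conf)
print(best)
```

For a binary regressor such as race, the partial effect is more naturally computed as the average difference in predicted probabilities with the indicator switched on and off, rather than with the derivative formula used in `ape()`.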
2022-04-25