Advanced Data Analysis (557), Spring 2021

MIDTERM EXAMINATION

1. Use the following probability distribution to answer the questions:

	x = -1	x = 0	x = 1
y = 0	0.10	0.25	0.20
y = 1	0.20	0.10	0.15

(a) Compute $E[y] \equiv \mu_y$ and $V[x] \equiv \sigma_x^2$.

(b) Compute $C[x, y] \equiv \sigma_{xy}$.

(d) Compute the best linear predictor of y given x: $L[y \mid x] \equiv \alpha + \beta x$.

(e) Are $E[y \mid x]$ and $L[y \mid x]$ identical for this population? Why or why not?

(f) Suppose you have a sample $\{(y_i, x_i)\}$ from the above population. Discuss the competing merits of the following estimators for $P[y = 1 \mid x]$:

i. the linear probability model (LPM);

ii. the logit model;

iii. linear discriminant analysis (LDA).
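
The population moments and the best linear predictor in Question 1 can all be read off the joint probability table. The following is a minimal sketch in R, assuming the table above is entered with rows y = 0, 1 and columns x = -1, 0, 1; the object names (p, yv, xv, cef and so on) are illustrative and not part of the exam.

```r
# Joint pmf from the table in Question 1.
# Rows are y = 0, 1; columns are x = -1, 0, 1.
p <- matrix(c(0.10, 0.25, 0.20,
              0.20, 0.10, 0.15),
            nrow = 2, byrow = TRUE,
            dimnames = list(y = c("0", "1"), x = c("-1", "0", "1")))
yv <- c(0, 1)       # support of y
xv <- c(-1, 0, 1)   # support of x

py <- rowSums(p)    # marginal pmf of y
px <- colSums(p)    # marginal pmf of x

mu_y   <- sum(yv * py)                           # E[y]
mu_x   <- sum(xv * px)                           # E[x]
var_x  <- sum(xv^2 * px) - mu_x^2                # V[x]
cov_xy <- sum(outer(yv, xv) * p) - mu_x * mu_y   # C[x, y]

# Best linear predictor L[y | x] = alpha + beta * x
beta  <- cov_xy / var_x
alpha <- mu_y - beta * mu_x

# Conditional expectation E[y | x] at each support point of x
cef <- colSums(yv * p) / px

# Side-by-side comparison of E[y | x] and L[y | x]
cbind(cef = cef, blp = alpha + beta * xv)
```

Comparing the two columns of the final matrix is one way to approach the question of whether the conditional expectation and the linear predictor coincide.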

2. A demand model (“Model 1”) for residential natural gas consumption is given by

$$g_{it}^{G} = \beta_0 + \beta_1 p_{it}^{G} + \beta_2 p_{it}^{O} + \beta_3 p_{it}^{E} + \beta_4 \mathrm{hdd}_{it} + \beta_5 y_{it} + u_{it}$$

where

$g_{it}^{G}$ is the log of consumption of natural gas in state i in year t (i.e., $g_{it}^{G} = \log(\text{consumption}_{it}^{G})$);

$p_{it}^{G}$ is the log price of natural gas in state i in year t;

$p_{it}^{O}$ is the log price of fuel oil in state i in year t;

$p_{it}^{E}$ is the log price of electricity in state i in year t;

$\mathrm{hdd}_{it}$ is the log count of “heating degree days” in state i in year t;

$y_{it}$ is log real per capita personal income in state i in year t.

The data file natgas.rda (on Canvas) covers six US states (CA, FL, MI, NY, TX, and UT) and 23 years (1967-1989). The following questions refer to this data.

(a) Estimate Model 1 using OLS. What do the estimation results say about the demand relationship between the three heating fuels (natural gas, fuel oil, electricity)?

(b) Is the demand curve for natural gas downward sloping? Formulate an appropriate test, being specific about the size of the test, the null hypothesis $H_0$ and the alternative hypothesis $H_1$. Perform the test and report the results appropriately.
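
A one-sided test on a single regression coefficient, of the kind part (b) calls for, can be sketched as follows. The data here are simulated: the variables logp and logq, the true elasticity of -0.5, and the 5% test size are all assumptions made for illustration, not values taken from natgas.rda.

```r
set.seed(1)

# Simulated log-log demand data (hypothetical, true price elasticity -0.5)
n    <- 200
logp <- rnorm(n)
logq <- 1 - 0.5 * logp + rnorm(n, sd = 0.5)

fit <- lm(logq ~ logp)

# H0: beta >= 0 (demand not downward sloping) vs H1: beta < 0
est   <- coef(summary(fit))["logp", ]
tstat <- unname(est["Estimate"] / est["Std. Error"])

# Lower-tail p-value, because the alternative is beta < 0
pval <- pt(tstat, df = df.residual(fit))

# Decision at an assumed 5% size
reject <- pval < 0.05
c(t = tstat, p = pval)
```

The p-value uses the lower tail of the t distribution because only large negative t statistics count as evidence against the null.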

(c) Consider a second model (“Model 2”) in which demand varies by state according to a state-specific unobserved effect $c_i$, giving

$$g_{it}^{G} = \beta_0 + c_i + \beta_1 p_{it}^{G} + \beta_2 p_{it}^{O} + \beta_3 p_{it}^{E} + \beta_4 \mathrm{hdd}_{it} + \beta_5 y_{it} + u_{it}.$$

Estimate Model 2 using OLS and report any differences you observe in the estimated coefficients on $p^G$, $p^O$ and $p^E$ between Model 2 and Model 1.

(d) Are there state-specific demand effects? Formulate an appropriate test, again being specific about the size of the test, the null hypothesis and the alternative hypothesis. Perform the test and report the results appropriately.

(e) Discuss any difference in the regression $R^2$ between Model 1 and Model 2. Is that same difference reflected in the “adjusted $R^2$”?
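
Testing for group-specific effects amounts to comparing a pooled regression with one that adds a set of group indicators. The sketch below uses a simulated panel: the four groups, their effects, and the names d, pooled and fe are hypothetical, and the joint F test via anova() is one common way to set up such a test rather than the only one.

```r
set.seed(2)

# Simulated panel: 4 hypothetical groups with 25 observations each
state  <- rep(c("A", "B", "C", "D"), each = 25)
effect <- unname(c(A = 0, B = 1, C = -1, D = 2)[state])
x      <- rnorm(100)
y      <- 1 + effect + 0.5 * x + rnorm(100)
d      <- data.frame(y = y, x = x, state = factor(state))

pooled <- lm(y ~ x, data = d)           # no group effects
fe     <- lm(y ~ x + state, data = d)   # group indicators added

# Joint F test of H0: all group effects are zero
ftest <- anova(pooled, fe)
fstat <- ftest[2, "F"]
pval  <- ftest[2, "Pr(>F)"]

# R-squared never falls when regressors are added;
# adjusted R-squared penalises the extra parameters
r2  <- c(pooled = summary(pooled)$r.squared,
         fe     = summary(fe)$r.squared)
ar2 <- c(pooled = summary(pooled)$adj.r.squared,
         fe     = summary(fe)$adj.r.squared)
rbind(r2, ar2)
```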

(f) A commenter on your Model 2 observes that year-specific demand effects may also be present and suggests Model 3, given by

$$g_{it}^{G} = \beta_0 + c_i + \delta_t + \beta_1 p_{it}^{G} + \beta_2 p_{it}^{O} + \beta_3 p_{it}^{E} + \beta_4 \mathrm{hdd}_{it} + \beta_5 y_{it} + u_{it}$$

where $\delta_t$ is a year-specific unobserved effect. Estimate Model 3 by OLS and comment on any differences in the regression fit or estimated coefficients on $p^G$, $p^O$ and $p^E$ between Model 3 and Model 2. Are the coefficient estimates plausible? Can the year-specific demand effects be removed from Model 3?

(g) Perform 5-fold cross validation on Models 1, 2 and 3. Which model has the lowest training error? Which model has the lowest test error? Which model do you prefer (and on what basis)? Note: Due to the large number of indicator variables, some partitions into training and validation sets make prediction from the validation set impossible. If that happens, simply restart CV with a different training/validation partition.
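
K-fold cross validation for linear models can be written in a few lines of base R. The sketch below is a minimal version under stated assumptions: the data are simulated, the helper cv_mse() is a hypothetical name, the response column is assumed to be called y, and mean squared error is assumed as the loss.

```r
set.seed(3)

# Simulated data with one continuous regressor and one 4-level factor
n <- 120
d <- data.frame(x = rnorm(n),
                g = factor(sample(letters[1:4], n, replace = TRUE)))
d$y <- 1 + 0.5 * d$x + as.integer(d$g) + rnorm(n)

# Hypothetical helper: average training and test MSE over k folds.
# Assumes the response column in `data` is named y.
cv_mse <- function(formula, data, k = 5) {
  # Random assignment of rows to folds of near-equal size
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  errs <- sapply(seq_len(k), function(j) {
    train <- data[fold != j, ]
    test  <- data[fold == j, ]
    fit   <- lm(formula, data = train)
    c(train = mean(residuals(fit)^2),
      test  = mean((test$y - predict(fit, newdata = test))^2))
  })
  rowMeans(errs)   # average each error type across the folds
}

res1 <- cv_mse(y ~ x, d)       # smaller model
res2 <- cv_mse(y ~ x + g, d)   # model with group indicators
rbind(res1, res2)
```

With many indicator variables and few observations per level, a training fold can miss a level entirely, which is the situation the note in part (g) describes; drawing a new fold assignment (a new call to sample()) is the restart it suggests.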

3. The data set burninj.rda (on Canvas) contains information collected on victims of burn injury. The variables are as follows:

death whether the burned individual died of their injury;

age the individual’s age;

sex the individual’s sex;

race the individual’s race (coded as “White”/“Non-White”);

area the total burn area;

inh whether the burn involved injury due to inhalation;

fire whether the burn involved open flame.

The following questions refer to this data.

(a) Estimate the following model by logit regression:

$$P(\mathrm{death}_i = 1 \mid x) = \beta_0 + \beta_1 \mathrm{age}_i + \beta_2 \mathrm{area}_i + \beta_3 \mathrm{race}_i + \beta_4 \mathrm{inh}_i$$

Do all of the coefficients take expected signs?

(b) Compute the average partial/marginal effect of area across the sample and a 95% confidence interval for that effect (you may find the margins package useful).

(d) Compute the average partial eﬀect of race.
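
Average partial effects in a logit model can be computed by hand from the fitted probabilities, which makes clear what a package such as margins is averaging. The sketch below uses simulated data with a continuous regressor z and a binary regressor b (both hypothetical names); it gives point estimates only and does not construct the confidence interval asked for in part (b).

```r
set.seed(4)

# Simulated binary-outcome data (hypothetical variables z and b)
n <- 500
d <- data.frame(z = rnorm(n), b = rbinom(n, 1, 0.4))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * d$z + 0.8 * d$b))

fit  <- glm(y ~ z + b, family = binomial, data = d)
phat <- fitted(fit)   # fitted probabilities

# Continuous regressor: the derivative of the logistic cdf is p * (1 - p),
# so the partial effect is beta_z * p * (1 - p), averaged over the sample
ape_z <- mean(coef(fit)["z"] * phat * (1 - phat))

# Binary regressor: average difference in predicted probability
# when b is switched from 0 to 1 for every observation
p1 <- predict(fit, newdata = transform(d, b = 1), type = "response")
p0 <- predict(fit, newdata = transform(d, b = 0), type = "response")
ape_b <- mean(p1 - p0)

c(ape_z = ape_z, ape_b = ape_b)
```

The continuous and binary cases are treated differently because a derivative is not meaningful for an indicator variable; a discrete change in predicted probability is used instead.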

(e) Compute the confusion matrix for the fitted logit model.
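
A confusion matrix cross-tabulates observed outcomes against predicted classes. The sketch below uses simulated data and assumes a classification threshold of 0.5 on the fitted probability; the threshold and all variable names are illustrative choices, not something the exam specifies.

```r
set.seed(6)

# Simulated binary-outcome data
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))

fit <- glm(y ~ x, family = binomial)

# Classify as 1 when the fitted probability exceeds the assumed 0.5 threshold
pred <- as.integer(fitted(fit) > 0.5)

# Fixing the factor levels keeps the table 2 x 2 even if one class is never predicted
conf <- table(actual    = factor(y, levels = 0:1),
              predicted = factor(pred, levels = 0:1))

# Share of observations on the diagonal (correctly classified)
accuracy <- sum(diag(conf)) / sum(conf)
conf
```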

(f) The burninj data contains the outcome and p = 6 covariates. The following snippet of R code returns a list containing a model formula for each of the $2^p = 64$ subsets of those p = 6 covariates. For example, the formula corresponding to the model you estimated above is death ~ 1 + age + area + race + inh.

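load("burninj.rda")

# Covariate names: columns 2 to 7 of burninj
vNames <- names(burninj)[2:7]

# One row per subset of the 6 covariates (TRUE = include), 2^6 = 64 rows
allBurnSubs <- as.matrix(expand.grid(c(T,F),c(T,F),c(T,F),c(T,F),c(T,F),c(T,F)))

# Build the model formula for each subset
frmList <- lapply(1:nrow(allBurnSubs), function(i) {
  as.formula(paste(c("death ~ 1", vNames[allBurnSubs[i,]]), collapse = " + "))
})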

Using this code snippet, estimate each of the 64 models and extract the AIC corresponding to the model. You may find the function extractAIC() useful. Report the model identified as best according to the AIC criterion.
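
The same all-subsets idea can be tried on a small simulated problem before running it on the burn data. The sketch below assumes three hypothetical covariates x1, x2 and x3 (so 8 models instead of 64), of which only x1 and x2 enter the simulated outcome; extractAIC() returns the equivalent degrees of freedom first and the AIC second, so the second element is kept.

```r
set.seed(5)

# Simulated data: x3 is irrelevant by construction
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.3 + 1.2 * d$x1 - 0.8 * d$x2))

vars <- c("x1", "x2", "x3")

# One row per subset of the covariates (TRUE = include), 2^3 = 8 rows
subs <- as.matrix(expand.grid(rep(list(c(TRUE, FALSE)), length(vars))))

# Model formula for each subset
frms <- lapply(seq_len(nrow(subs)), function(i) {
  as.formula(paste(c("y ~ 1", vars[subs[i, ]]), collapse = " + "))
})

# Fit each logit model and keep the AIC (second element of extractAIC())
aics <- sapply(frms, function(f) {
  extractAIC(glm(f, family = binomial, data = d))[2]
})

# Lower AIC is better
best <- frms[[which.min(aics)]]
best
```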

(g) Now suppose that you start from the largest model and, working stepwise, remove the variable which contributes least to the AIC until you arrive at the null model. Does this path through the model space contain the “best” model from the previous part? Suppose instead that you start from the null model and, working stepwise, add the variable which lowers the AIC the most. Does that path contain the “best” model?