Statistics 151A (Linear Models)
Homework 5
Instructions: Please submit with a cover sheet that has your name and student ID. For the data analysis question, please format your report as a document with a brief introduction, instead of a list of numbered answers. Include R code and output only in small portions that directly illustrate points you make in your writing. Please include your full code in the appendix. Make sure to comment your code and label visuals appropriately.
1. Show that the ridge coefficient has two equivalent forms:
$$(X^T X + \lambda I_p)^{-1} X^T Y = X^T (X X^T + \lambda I_n)^{-1} Y.$$
On the computational side, explain when the left-hand side is useful and when the right-hand side is useful. (5 points)
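Before attempting the proof, it can help to confirm the identity numerically. The following is a minimal sketch on simulated data; the dimensions `n` and `p`, the value of `lambda`, and the simulated `X` and `Y` are illustrative assumptions, and a numerical check is not a substitute for the algebraic argument.

```r
# Numerical sanity check of the ridge identity on simulated data.
# n, p, lambda, X and Y are illustrative assumptions, not part of the assignment.
set.seed(1)
n <- 20
p <- 50
lambda <- 0.7
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
Y <- rnorm(n)

# Left-hand side: (X^T X + lambda I_p)^{-1} X^T Y, a p x p linear system.
beta_left <- solve(crossprod(X) + lambda * diag(p), crossprod(X, Y))

# Right-hand side: X^T (X X^T + lambda I_n)^{-1} Y, an n x n linear system.
beta_right <- t(X) %*% solve(tcrossprod(X) + lambda * diag(n), Y)

# The two forms should agree up to floating-point error.
max_abs_diff <- max(abs(beta_left - beta_right))
print(max_abs_diff)
```

Swapping the values of `n` and `p` and timing both lines with `system.time()` is one way to explore the computational part of the question.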
2. (Please answer this question without using R) Consider the frogs dataset that we used in lab. To describe the data briefly, 200 sites of the Snowy Mountain area of New South Wales, Australia were surveyed for the species of the Southern Corroboree frog. The response variable, named pres.abs, takes the value 1 if frogs of this species were found at the site and 0 otherwise. The explanatory variables include altitude, distance, NoOfPools, NoOfSites, avrain, meanmin and meanmax. The dataset contains 200 observations and the response variable equals 1 for 75 observations and equals 0 for the rest. Suppose we fit a logistic regression model to the data via
frogs.glm <- glm(formula = pres.abs ~ log(distance) +
log(NoOfPools) + meanmin,
family = binomial, data = frogs)
summary(frogs.glm)
This gave us the following output:
Call:
glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + meanmin,
family = binomial(link = "logit"), data = frogs)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9379  -0.7512  -0.4699   0.8643   2.3081

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.7936      XXXXX   0.352 0.724577
log(distance)     XXXXX     0.2116  -4.247 2.17e-05 ***
log(NoOfPools)   0.4961     0.2067   2.400 0.016381 *
meanmin          1.0717     0.3187   3.362 0.000773 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 264.63 on XXX degrees of freedom
Residual deviance: XXXXX on XXX degrees of freedom
AIC: 210.9
Number of Fisher Scoring iterations: 5
a) Fill in the five missing values in the above output, giving appropriate reasons and calculations. (3 points)
b) Suppose a new site is found where the values of the explanatory variables are
distance = 265, NoOfPools = 26, meanmin = 3.5.
According to the logistic regression model, what is the predicted probability that Southern Corroboree frogs will be found at this site? (3 points)
c) Suppose we add the variable altitude to the model. Would the residual deviance increase or decrease? Would the null deviance increase or decrease? Explain your reasoning. (2 points)
3. Selection of baseline category in multinomial logistic regression: Suppose that the response variable $Y$ takes any of $m$ categories. Let $\pi_{ij}$ denote the probability that the $i$th observation falls in the $j$th category of the response variable, i.e. $\pi_{ij} \equiv P(Y_i = j)$ for $j = 1, \ldots, m$, and let $X_1, \ldots, X_k$ denote $k$ regressors on which the $\pi_{ij}$ depend. We have learned in class that the multinomial logistic regression can be written as:
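$$\ln \frac{\pi_{ij}}{\pi_{im}} = \gamma_{0j} + \gamma_{1j} X_{i1} + \cdots + \gamma_{kj} X_{ik} \quad \text{for } j = 1, \ldots, m - 1$$
with resulting probabilities:
$$\pi_{ij} = \frac{\exp(X_i^T \gamma_j)}{1 + \sum_{l<m} \exp(X_i^T \gamma_l)}, \quad j < m, \qquad \pi_{im} = \frac{1}{1 + \sum_{l<m} \exp(X_i^T \gamma_l)}.$$
Here $X_i = (1, X_{i1}, \ldots, X_{ik})^T$ and $\gamma_j = (\gamma_{0j}, \gamma_{1j}, \ldots, \gamma_{kj})^T$.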
Show that if we choose a different baseline category j′ instead of m, we obtain the same set of probabilities. (9 points)
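As a sanity check before writing the proof, the claim can be verified numerically. The sketch below is illustrative only: the dimensions, the randomly drawn coefficients, the helper `category_probs`, and the new baseline index `j_new` are all assumptions made for the demonstration, and the re-baselining step (subtracting the new baseline's coefficient vector from every column) is one natural reparametrization rather than a statement of the intended solution.

```r
# Illustrative check that changing the baseline leaves the probabilities unchanged.
# All dimensions, coefficients and names below are assumptions for the demo.
set.seed(2)
n <- 5
k <- 3
m <- 4
X <- cbind(1, matrix(rnorm(n * k), n, k))  # row i is (1, X_i1, ..., X_ik)

# One coefficient column per category; baseline m has gamma_m = 0.
gamma <- cbind(matrix(rnorm((k + 1) * (m - 1)), k + 1, m - 1), 0)

# Category probabilities implied by a coefficient matrix.
category_probs <- function(X, coefs) {
  eta <- X %*% coefs
  exp(eta) / rowSums(exp(eta))
}

probs_m <- category_probs(X, gamma)

# Re-baseline to category j_new: subtract its column from every column,
# so that the new baseline column becomes zero.
j_new <- 2
gamma_new <- gamma - gamma[, j_new]
probs_new <- category_probs(X, gamma_new)

max_prob_diff <- max(abs(probs_m - probs_new))
print(max_prob_diff)
```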
4. Data Analysis: Download train.csv from https://www.kaggle.com/c/titanic/data (this is the Kaggle competition Titanic: Machine Learning from Disaster). Randomly assign 2/3 of the rows of train.csv to a training dataset. The remaining third will be your test data.
a) Using the training data, build a reasonable model based on logistic regression for the survival status based on the explanatory variables (you can start with a basic model and subsequently either expand it using interactions etc. and/or perform model selection to remove some variables). Describe your model. (4 points)
b) Use your model to predict the survival status $\hat{y}_i$ for the $n_t$ subjects in the test data. Here, the $\hat{y}_i$ are estimated probabilities rounded to 0 or 1. Report the accuracy of your predictions in terms of the misclassification rate, which is defined as:
$$\frac{1}{n_t} \sum_{i=1}^{n_t} I(y_i \neq \hat{y}_i).$$
(3 points)
c) Is your final model most suitable for association, prediction or causal inference? (1 point)
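The mechanics of the split and of the misclassification rate can be sketched as follows. This is a minimal illustration on simulated stand-in data, not on the Titanic data: the data frame `sim`, its columns `x` and `y`, the seed, the 0.5 cutoff, and the helper `misclassification_rate` are all assumptions introduced for the example, and the one-variable model is deliberately simplistic.

```r
# Simulated stand-in data (NOT the Titanic data); all names here are assumptions.
set.seed(151)
n <- 300
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * sim$x))

# Random split: 2/3 of the rows for training, the remaining 1/3 for testing.
train_idx <- sample.int(n, size = round(2 * n / 3))
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Logistic regression fitted on the training rows only.
fit <- glm(y ~ x, family = binomial, data = train)

# Misclassification rate: share of cases whose 0/1 prediction differs from the truth.
misclassification_rate <- function(y, prob, cutoff = 0.5) {
  y_hat <- as.integer(prob > cutoff)
  mean(y != y_hat)
}

# Predicted probabilities on the held-out rows, then the test error.
prob_test <- predict(fit, newdata = test, type = "response")
test_error <- misclassification_rate(test$y, prob_test)
print(test_error)
```

For the actual assignment, the same steps would start from something like `read.csv("train.csv")` in place of the simulated data frame, with the model formula chosen as part of the analysis.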
2021-12-05