STAT 425 Exam 2 Study Problem Solutions

Exam problems are generally shorter than homework problems and may involve short-answer conceptual questions, quick calculations, and R code interpretation or debugging. It is not a multiple-choice exam, although some multiple-choice questions are possible.

The sample problems below are meant to help you test yourself and practice solving. Problems on the exam will generally have fewer parts than the ones below. Do not expect the actual exam problems to be exactly like this set in terms of range of coverage or length. Work on these problems as a way to solidify your understanding.

1. Twenty chicks (baby chickens) were randomly assigned to receive one of two diets, A or B, with 10 in each group. Consider the model

$y_{ij} = \mu + \alpha_i + e_{ij}, \quad i = 1, 2; \; j = 1, 2, \ldots, 10.$

Here $y_{ij}$ denotes the 14-day weight gain for the jth chick on Diet i, with i = 1 for Diet A and i = 2 for Diet B. The working model is that the errors are independently normally distributed with mean zero and variance $\sigma^2$.

a) Suppose the sample mean responses for the two diet groups are $\bar{y}_A = 101.2$ and $\bar{y}_B = 123.7$. Using the reference category constraint with Diet A as the reference category, calculate the least squares estimates of $\mu$, $\alpha_1$ and $\alpha_2$.

$\hat{\mu} = \bar{y}_A = 101.2$

$\hat{\alpha}_1 = 0$

$\hat{\alpha}_2 = \bar{y}_B - \bar{y}_A = 123.7 - 101.2 = 22.5$
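As a quick check of the reference-category coding, here is a minimal R sketch. The data are simulated (the seed, group means and standard deviation are hypothetical, not the chick data); the only point is that `lm` with its default treatment contrasts returns the reference group mean as the intercept and the difference in group means as the Diet B coefficient.

```r
# Hypothetical data: 10 chicks per diet, simulated weight gains
set.seed(425)
diet <- factor(rep(c("A", "B"), each = 10))
y <- rnorm(20, mean = ifelse(diet == "A", 100, 120), sd = 2)

# Default treatment contrasts make the first level (A) the reference category
fit <- lm(y ~ diet)
group_means <- tapply(y, diet, mean)

mu_hat     <- coef(fit)[["(Intercept)"]]  # estimate of mu: mean of Diet A
alpha2_hat <- coef(fit)[["dietB"]]        # estimate of alpha_2: mean B minus mean A

print(c(mu_hat = mu_hat, alpha2_hat = alpha2_hat))
print(group_means)
```

The estimate of $\alpha_1$ does not appear in the output because the constraint fixes it at zero.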

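b) Calculate the between group sum of squares $BSS = \sum_{i=1}^{2} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$.

$n_1 = n_2 = 10, \quad \bar{y}_{1\cdot} = 101.2, \quad \bar{y}_{2\cdot} = 123.7$

$\bar{y}_{\cdot\cdot} = \frac{10 \times 101.2 + 10 \times 123.7}{20} = \frac{101.2 + 123.7}{2} = 112.45$

$BSS = 10 \times (101.2 - 112.45)^2 + 10 \times (123.7 - 112.45)^2 = 2 \times 10 \times 11.25^2 = 2531.25$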

c) How many degrees of freedom does BSS have?

2 - 1 = 1

d) Suppose $RSS = \sum_{i=1}^{2} \sum_{j=1}^{10} (y_{ij} - \hat{y}_{ij})^2 = 49.0$. Calculate the value of the F-statistic for testing the null hypothesis $H_0: \mu_1 = \mu_2$, where $\mu_1$ is the mean response for Diet A, and $\mu_2$ is the mean response for Diet B.

$F = \frac{BSS/1}{RSS/(20 - 2)} = \frac{2531.25}{49/18} = 929.85$
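The arithmetic in parts b) to d) can be reproduced in a few lines of R. This is only a sketch built from the summary numbers given in the problem; the variable names are illustrative.

```r
# Summary statistics given in the problem
n    <- c(A = 10, B = 10)
ybar <- c(A = 101.2, B = 123.7)
rss  <- 49.0

# Grand mean and between-group sum of squares
grand_mean <- sum(n * ybar) / sum(n)
bss <- sum(n * (ybar - grand_mean)^2)

# Degrees of freedom: (groups - 1) for BSS, (N - groups) for RSS
df_between <- length(n) - 1
df_within  <- sum(n) - length(n)

f_stat <- (bss / df_between) / (rss / df_within)
print(c(grand_mean = grand_mean, bss = bss, f_stat = f_stat))
```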

2. A study was conducted to compare three drug treatments for a certain disease, with drugs labeled A, B and C. For each subject Pretreatment is a condition score before treatment. The response PostTreatment is the condition score after the treatment regimen. The goal is to determine whether there is a diﬀerence between the drugs in improving post-treatment condition, adjusting for the eﬀect of pre-treatment condition.

An analysis of covariance model was ﬁt including the interactions between Drug and Pretreatment, and the sequential anova table is below.

## Analysis of Variance Table
##
## Response: PostTreatment
##                   Df Sum Sq Mean Sq F value    Pr(>F)
## Pretreatment       1 802.94  802.94 48.4726 3.366e-07 ***
## Drug               2  68.55   34.28  2.0692    0.1482
## Pretreatment:Drug  2  19.64    9.82  0.5930    0.5606
## Residuals         24 397.56   16.56
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a) What is the overall sample size, n?

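n = 5 + 24 + 1 = 30: the model terms use 1 + 2 + 2 = 5 degrees of freedom, the residuals have 24, and the intercept accounts for 1 more.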

b) What hypothesis is being tested by the following row in the table, and what do you conclude from the result?

Pretreatment:Drug 2 19.64 9.82 0.5930 0.5606

Two equivalent ways to state the null hypothesis for this test:

1) H0 : No interaction between Pretreatment and Drug eﬀects

2) H0 : The additive model is adequate: PostTreatment ~ Pretreatment + Drug

We do not reject the null hypothesis (p = 0.56 > 0.05), so there is no signiﬁcant evidence of an interaction.

c) Based on the sequential anova results what is the best model for these data:

PostTreatment ~ 1

PostTreatment ~ Pretreatment

PostTreatment ~ Pretreatment + Drug

PostTreatment ~ Pretreatment + Drug + Pretreatment:Drug

Explain why.

PostTreatment ~ Pretreatment. Reason: stepping backward from the full interaction model, the interaction terms are not significant with the two main effects in the model, and, after dropping the interaction, the main effect for Drug is not significant (p = 0.1482) with Pretreatment in the model.
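The quantities used in parts a) to c) can all be rebuilt from the Df and Sum Sq columns of the sequential table. The sketch below does that in R; the data frame and its column names are illustrative, and small differences from the printed F values come from rounding in the printed sums of squares.

```r
# Df and Sum Sq columns copied from the sequential ANOVA table
aov_tab <- data.frame(
  term   = c("Pretreatment", "Drug", "Pretreatment:Drug", "Residuals"),
  df     = c(1, 2, 2, 24),
  sum_sq = c(802.94, 68.55, 19.64, 397.56)
)

# Mean squares, then each term's F ratio against the residual mean square
aov_tab$mean_sq <- aov_tab$sum_sq / aov_tab$df
is_model <- aov_tab$term != "Residuals"
ms_res <- aov_tab$mean_sq[!is_model]
df_res <- aov_tab$df[!is_model]

aov_tab$f_value <- ifelse(is_model, aov_tab$mean_sq / ms_res, NA)
aov_tab$p_value <- ifelse(
  is_model,
  pf(aov_tab$f_value, aov_tab$df, df_res, lower.tail = FALSE),
  NA
)

# Total degrees of freedom plus one for the intercept gives the sample size
n <- sum(aov_tab$df) + 1

print(aov_tab)
print(n)
```

Reading the p-values from the bottom row upward mirrors the backward argument in part c): the interaction is tested first, then Drug given Pretreatment.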

d) Write out the model formula (in R syntax) for the model that corresponds to three parallel regression lines for PostTreatment versus Pretreatment for the three Drug groups.

PostTreatment ~ Pretreatment + Drug

e) Consider the following notation to express the variables in the data mathematically for the ith subject:

yi is the PostTreatment score,

xi is the Pretreatment score,

zi1 = 1 if Drug A and zi1 = 0 if not Drug A,

zi2 = 1 if Drug B and zi2 = 0 if not Drug B,

zi3 = 1 if Drug C and zi3 = 0 if not Drug C,

and ei is the error term. Using this notation, write out a valid, full rank mathematical form of the model corresponding to the R formula:

PostTreatment ~ Pretreatment + Drug + Pretreatment:Drug

Use expressions like β0 , β1 etc. for the coefficients of the model.

yi = β0 + β1 xi + β2 zi2 + β3 zi3 + β4 xi zi2 + β5 xi zi3 + ei , i = 1, 2, . . . , 30

Note that we only use two of the three indicator variables to distinguish the three Drug categories. We can tell it’s Drug A if zi2 = zi3 = 0. So we make Drug A the reference value, and β2 and β3 are incremental eﬀects of Drugs B and C, respectively, versus Drug A. The interactions are coded as products of the Pretreatment and Drug variables.
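One way to see this coding concretely is to inspect the design matrix R builds for the formula. The sketch below uses a made-up data frame (the Pretreatment values are random and purely hypothetical); only the column structure matters.

```r
# Hypothetical data: 10 subjects per drug, random pretreatment scores
set.seed(2)
dat <- data.frame(
  Pretreatment = round(runif(30, min = 5, max = 20)),
  Drug = factor(rep(c("A", "B", "C"), each = 10))
)

# Design matrix for the interaction model; Drug A is the reference level
X <- model.matrix(~ Pretreatment + Drug + Pretreatment:Drug, data = dat)

print(colnames(X))   # six columns, matching beta_0 through beta_5
print(qr(X)$rank)    # rank equals the number of columns, so X is full rank

# The interaction column is the product of Pretreatment and the Drug B indicator
z2 <- as.numeric(dat$Drug == "B")
print(all(X[, "Pretreatment:DrugB"] == dat$Pretreatment * z2))
```

Including all three indicators together with the intercept would make the columns linearly dependent, which is why one category is dropped.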

3. A cubic polynomial was ﬁt using the crossx variable as the response and energy as the predictor. The sequential ANOVA table for this ﬁtted model is as follows:

## Analysis of Variance Table
##
## Response: crossx
##             Df  Sum Sq Mean Sq   F value    Pr(>F)
## energy       1 272.216 272.216 1265.6028 3.289e-08 ***
## I(energy^2)  1  10.980  10.980   51.0492 0.0003789 ***
## I(energy^3)  1   0.622   0.622    2.8933 0.1398498
## Residuals    6   1.291   0.215
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a) Based on the numerical results above, select one of the following options to describe the most appropriate model for this data set, and explain your choice:

1. A linear trend or simple linear regression of crossx on energy.

2. A quadratic polynomial model.

3. A cubic polynomial model.

4. We do not have enough information to select among the options above.

2. A quadratic polynomial model. Reason: Starting from the bottom of the sequential anova table, the cubic term is not statistically significant (p = 0.14), so we would drop it. The quadratic term I(energy^2) is highly significant, so we keep it and stop.

b) In the sequential ANOVA table above, the F test corresponding to the quadratic term I(energy^2) is calculated as the ratio of two numbers

1. Numerator: A= (use a number with two digits after the decimal)

2. Denominator: B= (use a number with two digits after the decimal)

Give the numerator A, denominator B, and degrees of freedom for this F test.

A = 10.98 (I(energy^2) Mean Sq), B = 0.215 (Residuals Mean Sq)

df = 1 and 6 (numerator and denominator)

c) Consider the following notation to express the variables mathematically for the ith observation: yi is the value for crossx, xi is the value for energy, and ei is the error. Using this notation, write out a valid mathematical form of the model corresponding to the cubic polynomial model. For the coeﬃcients, use expressions like β0 , β1 , etc.

$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + e_i, \quad i = 1, 2, \ldots, 10$
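For reference, here is a minimal sketch of how such a cubic fit and its sequential table are produced in R. The data are simulated (the coefficients, noise level and seed are hypothetical, not the original crossx data); with n = 10 the degrees of freedom match the table in the problem.

```r
# Hypothetical data with 10 observations and a curved trend
set.seed(1)
energy <- 1:10
crossx <- 5 + 2 * energy + 0.3 * energy^2 + rnorm(10, sd = 0.5)

# Cubic polynomial; I() protects ^ so it means arithmetic power in the formula
fit <- lm(crossx ~ energy + I(energy^2) + I(energy^3))
tab <- anova(fit)   # sequential (Type I) sums of squares
print(tab)

# Each F value is that term's mean square over the residual mean square
f_quad <- tab["I(energy^2)", "Mean Sq"] / tab["Residuals", "Mean Sq"]
print(f_quad)
```

The residual degrees of freedom are 10 - 4 = 6 because the cubic model has four coefficients, which is also how n = 10 is recovered from the table.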

4. Each of the following R function calls creates a set of basis functions. For each, give the degrees of freedom and the total number of knots. Note: the problem should have specified that the lm function will include the intercept if it is not already in the basis functions.

a) B-spline:

bs(year, df=6, intercept=TRUE)

df: 6

knots: 6-4=2

b) B-spline:

bs(year, df=8, intercept=FALSE)

df: 9 assuming the intercept will be included in the model, e.g. by lm

knots: 9-4 = 5

c) Natural Cubic Spline:

ns(year, df=8, intercept=TRUE)

df: 8

knots: 8-2=6

d) Natural Cubic Spline:

ns(year, df=10, intercept=FALSE)

df: 10+1=11

knots: 11-2=9
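These counts can be checked directly with the `splines` package that ships with R. In the sketch below, `year` is a hypothetical predictor; the number of basis columns is the spline's own degrees of freedom (add 1 when `intercept = FALSE` and `lm` supplies the intercept), and the `"knots"` attribute holds the interior knots, which is the count the answers above report.

```r
library(splines)

# Hypothetical predictor; any numeric vector with enough distinct values works
year <- seq(1950, 2000, length.out = 60)

bases <- list(
  a = bs(year, df = 6,  intercept = TRUE),
  b = bs(year, df = 8,  intercept = FALSE),
  c = ns(year, df = 8,  intercept = TRUE),
  d = ns(year, df = 10, intercept = FALSE)
)

# Columns in each basis and number of interior knots chosen by the function
n_cols  <- sapply(bases, ncol)
n_knots <- sapply(bases, function(b) length(attr(b, "knots")))

print(rbind(columns = n_cols, interior_knots = n_knots))
```

Both functions also store two boundary knots in the `"Boundary.knots"` attribute; these are not included in the interior counts.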

5. The following output summarizes the options for the ﬁrst step in a backwards stepwise selection process applied to a linear model for data relating mean life expectancy for U.S. states to other demographics. The ﬁrst column is the list of candidate variables for deletion. The RSS column shows the residual sum of squares for the model without the indicated variable, and the AIC column shows the AIC for that model.

## Single term deletions
##
## Model:
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
##     Frost + Area
##            Df Sum of Sq RSS AIC
## <none>
## Population
## Income
## Illiteracy
## Murder
## HS.Grad
## Frost
## Area

a) Which, if any, variable would be removed if we use AIC as the selection criterion for backward stepwise regression? Explain why.

Area: dropping this variable gives the lowest AIC of -24.182

b) What is the AIC value for the model that includes all variables?

-22.185

c) The full model we started with has smaller RSS than any of the candidate models obtained by dropping one variable. Does this mean that the full model is actually the best option? Explain why or why not.

No. RSS only measures ﬁt, not complexity. We can drive down the RSS by adding more variables, but this leads to over-ﬁtting the particular data set. The resulting model would not be a good predictive model for new data. AIC is more reliable because it prevents over-ﬁtting by penalizing complexity.
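A table of this kind is what `drop1` prints for a fitted linear model. The variable names match R's built-in `state.x77` data set, so the sketch below assumes that is the source; the problem itself does not say so, and the numbers should be treated as a check rather than as the original output.

```r
# Assumption: the problem's data are the built-in state.x77 matrix.
# data.frame() converts "Life Exp" and "HS Grad" to Life.Exp and HS.Grad.
statedata <- data.frame(state.x77)

full <- lm(
  Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area,
  data = statedata
)

# One row per candidate deletion, plus <none> for the full model
tab <- drop1(full)
print(tab)

# Backward selection by AIC removes the term whose deletion gives the lowest AIC
best <- rownames(tab)[which.min(tab$AIC)]
print(best)
```

Every deletion row has an RSS at least as large as the `<none>` row, which is the pattern part c) describes: RSS alone always favors the larger model.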

6. Several questions about regularized regression.

a) When we use principal components regression, it is always best to use all principal components of the design matrix X for regression. True or False? Explain.

False. This would be equivalent to using all of the variables in X. We will obtain better predictions by reducing to a smaller set of principal components that account for a large percentage of the variation in X.

b) Consider the following output after computing the principal components of predictor variables:

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.7548 1.2739 1.0025 0.74634 0.58222 0.50886 0.3713
## Proportion of Variance 0.4399 0.2318 0.1436 0.07958 0.04843 0.03699 0.0197
## Cumulative Proportion  0.4399 0.6717 0.8153 0.89488 0.94331 0.98030 1.0000

According to this output, how many principal components are needed to account for at least 90% of the variation in the predictor variables? Give the actual percentage of variation accounted for by this choice.

5 principal components. From the “Cumulative Proportion” row the ﬁrst 5 principal components account for 94.3% of the variation, whereas the ﬁrst 4 account for slightly less than 90%.
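The proportions in that output follow from the standard deviations alone, since each component's variance is its squared standard deviation. A short R sketch of the calculation, using the values printed above (variable names are illustrative):

```r
# Standard deviations of the seven principal components, from the output
sdev <- c(1.7548, 1.2739, 1.0025, 0.74634, 0.58222, 0.50886, 0.3713)

# Proportion of variance is each squared sd over the total variance
prop_var <- sdev^2 / sum(sdev^2)
cum_prop <- cumsum(prop_var)

# Smallest number of components reaching at least 90% of the variation
k <- which(cum_prop >= 0.90)[1]

print(round(cum_prop, 4))
print(k)
```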

c) Lasso regression minimizes the residual sum of squares of a regression model subject to the constraint that $\sum_{j=1}^{p} |\beta_j| \le t$. Which of the following methods could be used to select the value for t:

1. Maximum likelihood

2. Least squares

3. Weighted least squares

4. Cross-validation

5. None of the above

4. Cross-validation. This provides an unbiased way to estimate which constrained model gives the best out-of-sample predictions.

7. The following results are from ﬁtting a one-way anova model to data relating blood coagulation times to diet.

## Call:
## lm(formula = coag ~ diet, data = coagulation)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -5.00  -1.25   0.00   1.25   5.00
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.100e+01  1.183e+00  51.554  < 2e-16 ***
## dietB       5.000e+00  1.528e+00   3.273 0.003803 **
## dietC       7.000e+00  1.528e+00   4.583 0.000181 ***
## dietD       2.991e-15  1.449e+00   0.000 1.000000
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.366 on 20 degrees of freedom
## Multiple R-squared:  0.6706, Adjusted R-squared:  0.6212
## F-statistic: 13.57 on 3 and 20 DF,  p-value: 4.658e-05

a) Based on these results, compute the estimated diﬀerence between the mean coagulation time for Diet B and the mean for Diet A. Also give the standard error for this diﬀerence if possible.

We can see from the results that Diet A is the reference group, so this mean difference is estimated by the dietB coefficient. Estimate = 5.00, Standard Error = 1.53

b) Below is the analysis of variance table for the model above.

## Analysis of Variance Table
##
## Response:
## diet
## Residuals

State the null hypotheses that the F value is testing. Does the test reject the null hypothesis at level 0.05?

H0 : µA = µB = µC = µD (all treatment group means are equal).

The p-value is below 0.0001 (the model summary above reports F = 13.57 on 3 and 20 DF, p = 4.658e-05), so the test rejects the null hypothesis at level 0.05.

c) Based on the results below, determine which pairs of diet group means are signiﬁcantly diﬀerent from each other, controlling the family wise error rate at α = 0.05.

## Tukey multiple comparisons of means

## 95% family-wise confidence level

## Fit: aov(formula = g)

## $diet

##	diff	lwr	upr	p adj
## B-A	5	0.7245544	9.275446	0.0183283
## C-A	7	2.7245544	11.275446	0.0009577
## D-A	0	-4.0560438	4.056044	1.0000000
## C-B	2	-1.8240748	5.824075	0.4766005
## D-B	-5	-8.5770944	-1.422906	0.0044114
## D-C	-7	-10.5770944	-3.422906	0.0001268

The following pairs of mean diﬀerences have adjusted p-values < 0.05 and are therefore statistically signiﬁcant:

B - A (> 0), C - A (> 0), D - B (< 0), D - C (< 0)

Another way to think about this result is that the means are grouped as follows:

{A, D} < {B, C}

with signiﬁcant diﬀerences between the two bracketed groups, and no signiﬁcant diﬀerences within the bracketed groups.
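The Tukey differences are consistent with the coefficient table from the model summary, which can be verified with a short R sketch. The coefficient values and adjusted p-values are copied from the output in the problem (the dietD estimate of 2.991e-15 is treated as 0); the variable names are illustrative.

```r
# Coefficients from the lm summary (Diet A is the reference level)
coefs <- c("(Intercept)" = 61, dietB = 5, dietC = 7, dietD = 0)

# Group means implied by the reference-category coding
means <- c(
  A = coefs[["(Intercept)"]],
  B = coefs[["(Intercept)"]] + coefs[["dietB"]],
  C = coefs[["(Intercept)"]] + coefs[["dietC"]],
  D = coefs[["(Intercept)"]] + coefs[["dietD"]]
)

# All pairwise differences, in the same order as the Tukey table
pairs <- combn(names(means), 2)
diffs <- means[pairs[2, ]] - means[pairs[1, ]]
names(diffs) <- paste(pairs[2, ], pairs[1, ], sep = "-")

# Adjusted p-values copied from the Tukey output
p_adj <- c(0.0183283, 0.0009577, 1.0000000, 0.4766005, 0.0044114, 0.0001268)
significant <- names(diffs)[p_adj < 0.05]

print(diffs)
print(significant)
```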

8. Below is a scatter plot of some data with the ﬁtted line for a quadratic polynomial. There are two replicate observations for each value of x.

Here is some analysis of the data.

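mod0 = lm(y ~ x + I(x^2))   # quadratic polynomial model
mod1 = lm(y ~ factor(x))    # separate mean for each unique value of x
anova(mod0, mod1)           # F test comparing the two nested models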

## Analysis of Variance Table
##
## Model 1: y ~ x + I(x^2)
## Model 2: y ~ factor(x)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     11 9.5614
## 2      7 4.0755  4     5.486 2.3557 0.1521

a) What is the null hypothesis for the F test in the analysis?

H0 : The quadratic model is adequate. In other words, for this F test the null model for the mean is $E(y \mid x) = \beta_0 + \beta_1 x + \beta_2 x^2$.

b) What is the alternative hypothesis for the F test in the analysis?

Ha : The factor model holds, so each unique value for x yields a potentially diﬀerent value for E (y|x). In other words, the alternative model for the mean is E (y|x) = g(x) where the form of the function g has no constraints whatsoever.

c) Let $x_1, x_2, \ldots, x_7$ denote the unique values for x in the data. Let $y_{i1}$ and $y_{i2}$ denote the two response values for $x_i$, $i = 1, 2, \ldots, 7$. For "Model 2" listed in the analysis above, provide simple formulas for the fitted values $\hat{y}_{i1}$ and $\hat{y}_{i2}$.

Because $y_{i1}$ and $y_{i2}$ share the same value for $x_i$, they have the same fitted value. It is simply the sample mean of the replicate values:

$\hat{y}_{i1} = \hat{y}_{i2} = \bar{y}_i = \frac{y_{i1} + y_{i2}}{2}$

d) Based on Part c), provide a simpliﬁed formula for the residual sum of squares for “Model 2” (here simpliﬁed means using summation notation).

$RSS = \sum_{i=1}^{7} \sum_{j=1}^{2} (y_{ij} - \bar{y}_i)^2,$

the within-groups sum of squares.

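Not required, but a further simplification is possible in this case because each group has exactly two observations, each lying half the difference $y_{i1} - y_{i2}$ away from their mean:

$\sum_{j=1}^{2} (y_{ij} - \bar{y}_i)^2 = \frac{1}{2} (y_{i1} - y_{i2})^2.$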

e) The test given in the analysis is an example of a lack-of-ﬁt test. What do you conclude from the results? Use signiﬁcance level 0.05.

The F test for lack of ﬁt fails to reject the null hypothesis that the quadratic model is adequate, because p = 0.1521 > 0.05. So we would prefer the quadratic model to the unconstrained factor model.
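The whole lack-of-fit analysis can be sketched in R on simulated data. The data below are hypothetical (seven x values with two replicates each, generated from a quadratic mean), so the fitted numbers will not match the problem; the sketch checks the structural facts from parts c) and d) and then recomputes the F statistic and p-value from the table in the problem.

```r
# Hypothetical data: 7 unique x values, 2 replicates each
set.seed(8)
x <- rep(1:7, each = 2)
y <- 1 + 0.5 * x - 0.1 * x^2 + rnorm(14, sd = 0.5)

mod0 <- lm(y ~ x + I(x^2))   # quadratic model (null)
mod1 <- lm(y ~ factor(x))    # one free mean per x value (alternative)
tab <- anova(mod0, mod1)
print(tab)

# Part c): fitted values of the factor model are the replicate means
fitted_ok <- isTRUE(all.equal(unname(fitted(mod1)), ave(y, x)))

# Part d): its RSS is the within-group SS, here half the squared differences
rss_pairs <- sum(tapply(y, x, function(v) diff(v)^2 / 2))

# F statistic and p-value recomputed from the table given in the problem
f_table <- (5.486 / 4) / (4.0755 / 7)
p_table <- pf(f_table, df1 = 4, df2 = 7, lower.tail = FALSE)
print(c(f_table = f_table, p_table = p_table))
```

The numerator degrees of freedom are 11 - 7 = 4, the number of extra mean parameters the factor model has beyond the quadratic.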