Unless otherwise specified, assume α = 0.05 and 95% confidence.

The Final Exam Begins Here!

1. (1 pt each unless otherwise noted) True / False and Short Answer. Simply highlight “True” or “False”, or write in the answer where appropriate.

a. True / False We must sample more than 30 subjects from a population in order for the sampling distribution of the sample mean to have a normal distribution.

b. True / False For a two-sided, two-sample pooled confidence interval for the difference in means (µ1 − µ2), if this confidence interval is (-4, -1), this is evidence that µ1 is larger than µ2.

c. True / False The Rank-Sum test is resistant to outliers.

d. True / False We wish to test the equality of 2 group means. If the normality assumption is met, the two sample t-test is robust to the standard deviation assumption as long as the sample sizes of the groups are the same.

e. True / False In a randomized experiment, where the subjects are randomly assigned to the treatment groups, the result can always be generalized to the population the sample is taken from.

f. True / False Adding an interaction term between a categorical variable and a continuous variable in a regression model allows for different slopes at different levels of the categorical variable.

g. True / False We will need 7 indicator variables to code a categorical variable with 7 levels.

h. True / False The model with the lowest AIC will always have the lowest external cross validation ASE (ASE on the test set).

i. True / False In regression, for a fixed value of the explanatory variable, the prediction interval for the response can be narrower than the confidence interval for the mean response.

j. True / False Assume we take a sample and find a 95% confidence interval for the mean to be (5,9). If we conduct a hypothesis test for on the same data, the pvalue will be less than .05.

k. Multiple Choice 2pts If a variable that is independent of the response is included in the model because its pvalue is less than alpha, what type of error has been made?

a. Type I Error b. Type II Error  c. No Error has been made

l. Short Answer 2pts: In regression, the confidence interval and prediction interval for the response are narrower for different values of the explanatory variable.  At what value of the explanatory variable are the confidence interval and prediction interval the narrowest?

Answer: The confidence interval and the prediction interval are both narrowest at the mean of the explanatory variable.

m. (2 pts) Your company is trying out a new website to try and generate more business.  Assume your boss has asked you to compare the mean daily mouse click traffic on the company’s 5 different versions of its new website: Original Version, Ver 1, Ver 2, Ver 3, Ver 4.  You were asked to compare the mean click rate between each new website (Ver 1, Ver 2, Ver 3 and Ver 4) and the original website (Original Website).   You of course want to do the appropriate multiple comparison correction.  Which correction is most appropriate here?

Select one:

A. Dunnett

B. Tukey-Kramer

C. Bonferroni

2. Matching (1 pt each). Letters can be used more than once!

Assume we are testing the claim that the mean heart beats per minute (bpm) of males is 8.

i. Rejecting Ho when the true mean bpm is 83bpm. _a__

ii. The probability of failing to reject Ho when the true mean bpm is . _h__

iii. Failing to Reject Ho when the true mean bpm is . _b__

iv. The probability of rejecting Ho when the true bpm is .  _e__

v. When the effect size goes from 5bpm to 3bpm, what happens to the power?  ___n____

vi. When the significance level (alpha) goes from .05 to .01 what happens to the power? ____n___

a. Type 1 Error

b. Type 2 Error

c. R2

d. degrees of freedom

e.

f. Power

g.

h.

i. Extrapolation

j. pvalue

k. proportion

l. residual

m. increases

n. decreases
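Items v and vi can be checked numerically. The sketch below computes the power of a two-sided one-sample z-test with Python's `statistics.NormalDist`. The z-test, the standard deviation (10 bpm) and the sample size (30) are assumptions chosen only for illustration; none of them come from the exam.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha):
    """Power of a two-sided one-sample z-test when the true mean is shifted by `effect`."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = effect * sqrt(n) / sigma     # standardized distance from the null mean
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(-crit + shift) + z.cdf(-crit - shift)


SIGMA, N = 10.0, 30  # hypothetical values, not from the exam

power_effect_5 = z_test_power(5, SIGMA, N, alpha=0.05)
power_effect_3 = z_test_power(3, SIGMA, N, alpha=0.05)
power_alpha_01 = z_test_power(5, SIGMA, N, alpha=0.01)

print(f"effect 5 bpm, alpha .05: {power_effect_5:.3f}")
print(f"effect 3 bpm, alpha .05: {power_effect_3:.3f}")
print(f"effect 5 bpm, alpha .01: {power_alpha_01:.3f}")
```

Under these assumptions, both a smaller effect size and a smaller alpha give lower power; the direction does not depend on the particular sigma and n chosen.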


3. (1 pt) Who said, “All models are wrong, but some are useful”? ___George E. P. Box___

Just in case … the Answer can be found here.

Questions 4 and 5: Use the Parameter Estimate Tables below to answer questions 4 and 5.  The first column contains 3 simple linear regression models in which the response is fit on each of the variables individually.  The second column contains 3 multiple regression models where the response is fit on all combinations of 2 variables.  The final column is a fit in which all three variables are in the model at the same time.

Main Effect Models (One variable models)

Model      Model Variables   P > |t|
Model 1    X                 .384
Model 2    Y                 .004
Model 3    Z                 .01

Two Variable Models

Model      Model Variables   P > |t|
Model 4    X                 .001
           Y                 .03
Model 5    X                 .082
           Z                 .006
Model 6    Y                 .003
           Z                 .87

Three Variable Models

Model      Model Variables   P > |t|
Model 7    X                 <.0001
           Y                 .167
           Z                 .005


4. Conduct a forward selection using the parameter estimate table above. Assume the alpha to enter the model is .15.  Simply write down the variables in the final model (2 pts):  ______________

5. Next conduct a stepwise selection using the parameter estimate table above. Assume the alpha to enter and leave the model is .15.  Simply write down the variables in the final model (2 pts):  ______________
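The two selection procedures can be traced mechanically from the p-values. The following is a minimal sketch, assuming the parameter estimate table reads as Models 1-3 = {X}, {Y}, {Z}, Models 4-6 = {X, Y}, {X, Z}, {Y, Z} and Model 7 = {X, Y, Z}. The function names are illustrative, and "<.0001" is entered as its upper bound 0.0001.

```python
ALPHA = 0.15

# p-values keyed by the set of variables in each model, as read from the table
# (an assumption about how the table is laid out).
P_VALUES = {
    frozenset("X"): {"X": 0.384},
    frozenset("Y"): {"Y": 0.004},
    frozenset("Z"): {"Z": 0.01},
    frozenset("XY"): {"X": 0.001, "Y": 0.03},
    frozenset("XZ"): {"X": 0.082, "Z": 0.006},
    frozenset("YZ"): {"Y": 0.003, "Z": 0.87},
    frozenset("XYZ"): {"X": 0.0001, "Y": 0.167, "Z": 0.005},
}
ALL_VARS = set("XYZ")


def best_candidate(current):
    """Return (variable, p-value) of the unused variable with the smallest p-value if added."""
    options = [(P_VALUES[frozenset(current | {v})][v], v) for v in ALL_VARS - current]
    if not options:
        return None
    p, v = min(options)
    return v, p


def forward_selection(alpha_enter=ALPHA):
    current = set()
    while True:
        cand = best_candidate(current)
        if cand is None or cand[1] >= alpha_enter:
            return current
        current.add(cand[0])


def stepwise_selection(alpha_enter=ALPHA, alpha_stay=ALPHA):
    current = set()
    while True:
        cand = best_candidate(current)
        if cand is None or cand[1] >= alpha_enter:
            return current
        current.add(cand[0])
        # After each entry, drop variables whose p-value now exceeds alpha_stay.
        while current:
            fit = P_VALUES[frozenset(current)]
            worst = max(current, key=lambda v: fit[v])
            if fit[worst] <= alpha_stay:
                break
            current.remove(worst)


print("forward: ", sorted(forward_selection()))
print("stepwise:", sorted(stepwise_selection()))
```

With the table read this way, forward selection never removes a variable once it has entered, while stepwise selection re-checks every variable after each entry, so the two procedures can end with different models.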

(2pts each) Questions 6 – 12: Match the letter from the plot to the description below. Letters may be used more than once:

_C__ 6. Least Cook’s D of the three.

__B, C_ 7.  Low Leverage and Low Residual

___ 8.  High Leverage and High Residual

__A_ 9.  High Leverage and Low Residual

(2pts each) The plots below correspond to the scatterplot above.  Fill in the blank next to each point with the appropriate letter from above.

Questions(13-14): Use the data, code and associated output to answer questions 13 and 14 below the output.

Consider the model:

The corresponding code and output below:


13. Find a 95% confidence interval for the LX2 slope and show the calculations. (3pt)

proc glm data = final plots = all alpha = 0.05;

model LY = LX1 LX2 / solution clparm;

run;

95% confidence interval (0.135, 0.227).

14. Interpret the slope of LX2 including the use of the confidence interval.  (3pt)

With 95% confidence, holding LX1 fixed, a 1 unit increase in LX2 is associated with an increase in the mean of LY of between 0.135 and 0.227.

(Use for Questions 15-16) A study was done on children and young adults to study the relationship between the presence of stress hormone (Cortisol) and a person’s age and weight.

15. Assume the following model was fit:

Which parameter estimate table is most consistent with the relationships displayed in the scatterplots below? (2 pts)

Answer: B

16. Given the parameter estimate table you chose above, fill in the ‘t’ and Pvalue column for the Weight row you selected. (2 pts)

tstat = -9/3 = -3

pvalue = 2 * pt(-3,12) = 2 * .0055 = .011
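The arithmetic above can be reproduced without statistical tables. This is a minimal sketch in Python (standard library only), assuming the estimate of -9, the standard error of 3 and the 12 degrees of freedom used in the answer. `t_pdf` and `t_cdf` are illustrative helpers that integrate the t density numerically and stand in for R's `pt`.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), computed as 0.5 plus the integral of the density from 0 to x (Simpson's rule)."""
    h = x / steps  # negative when x is negative, so the integral is signed
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


estimate, std_error, df = -9.0, 3.0, 12  # values used in the answer above

t_stat = estimate / std_error
p_value = 2 * t_cdf(-abs(t_stat), df)    # two-sided p-value

print(f"t = {t_stat:.2f}, two-sided p-value = {p_value:.4f}")
```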

17. Interpret the coefficient (slope) for the Age variable and include a 90% confidence interval. (2pts)

6 ± t.05,12 × 2

For any fixed weight, we are 90% confident that a 1 year increase in age is associated with between a xxx unit and xxx unit increase in mean Cortisol level; our best estimate is a 6 unit increase.
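The interval arithmetic is short enough to write out. This sketch assumes the Age estimate of 6 and standard error of 2 written in the answer, the 12 degrees of freedom used for the Weight row in Question 16, and a critical value of 1.782 (the 0.95 quantile of the t distribution with 12 df, rounded as in a standard t table). The parameter estimate table itself is not reproduced here, so treat all of these inputs as assumptions.

```python
estimate = 6.0    # Age coefficient, as written in the answer
std_error = 2.0   # standard error, as written in the answer
t_crit = 1.782    # 0.95 quantile of t with 12 df (table value; df assumed from Question 16)

margin = t_crit * std_error
ci_90 = (estimate - margin, estimate + margin)

print(f"90% CI for the Age slope: ({ci_90[0]:.3f}, {ci_90[1]:.3f})")  # (2.436, 9.564)
```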

MSDS 6371 ANALYSIS QUESTION: Fall 2022 (50 pts)

In the past 20 years, scientists have discovered that plastic has become an increasing threat to wildlife in the ocean.  It is also thought that with Asia’s rapid growth that there may be more pollution coming from Asia than from the Americas.  In order to test this hypothesis, 53 boats were deployed in the West Pacific Ocean and 54 boats were deployed in the East Pacific in order to measure the level of plastic in each region (in ounces).  Each boat collected all the plastic present on the surface of the water in a square kilometer.  The weight of this plastic was recorded in ounces and can be found in the data file provided: plasticEastWest.csv.

Interestingly, there is a form of bacteria (Ideonella sakaiensis) that is thought to be evolving to actually eat plastic and use it as a source of nutrition.  The level of this bacteria in each square kilometer plot was also recorded by each boat, in bacteria per milliliter (bpml).

In addition to the plastic and Ideonella sakaiensis level in each square kilometer plot, the water temperature of each square kilometer was recorded by each boat as well.  Finally, you may assume that the square kilometer “plots” were far enough away from each other that they can be considered independent observations.

For all problems below assume an alpha = .05 level of significance and 95% confidence intervals.

a.  (13 pts) Conduct a two-sample t-test to test the claim that the mean plastic level (in ounces) is greater in the West Pacific than in the East. You may assume the assumptions are met for this test. For the actual test, please provide the six step test with a confidence interval and scope of inference that addresses causality (if the result is significant) and the generalizability of the result.  In addition, in conducting this test, you are assuming that the standard deviation of the plastic levels in the East and West Pacific are equal.  Provide the estimate of this standard deviation.

Problem: Test the claim that the mean plastic level (in ounces) is greater in the West Pacific than in the East.

Assumptions: Given that assumptions are met for this test

Step 1 - Hypotheses

Step 2 - Identification of Critical Value

Step 3 - Value of Test Statistic :

Step 4 - p-value = .3586

Step 5 - Decision: Fail to reject Ho.

Step 6 – Conclusion: There is not enough evidence at the alpha = .05 level of significance (p-value = .3586) to suggest that the mean plastic level (in ounces) is greater in the West Pacific than in the East. A 95% confidence interval for this increase is .

Standard error = 111.8
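Part (a) asks for the pooled standard deviation, while the value above is labeled a standard error. If 111.8 is the standard error of the difference in means (an assumption, since the software output is not shown), the pooled standard deviation follows from SE = s_p * sqrt(1/n1 + 1/n2) with the 53 and 54 boats described in the problem. A minimal sketch:

```python
from math import sqrt

n_west, n_east = 53, 54   # boats deployed in each region
se_diff = 111.8           # assumed to be the standard error of the difference in means

# Pooled two-sample t-test: SE = s_p * sqrt(1/n1 + 1/n2), so solve for s_p.
pooled_sd = se_diff / sqrt(1 / n_west + 1 / n_east)
df = n_west + n_east - 2  # degrees of freedom of the pooled test

print(f"pooled SD is about {pooled_sd:.1f} ounces on {df} degrees of freedom")
```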



b. (12 pts) Fit a simple linear regression model with the plastic level as the response and the Ideonella sakaiensis (Bacteria) level as the explanatory variable.   Test for a nonzero slope by showing the six-step test. You may assume the assumptions for this test are met. Provide a scatter plot with the regression line superimposed on the data. In addition, interpret the slope of the regression equation and include a confidence interval.  Finally, interpret the intercept and include a confidence interval.

1. Ho: = 0

Ha: ≠ 0

2. CV =

3. TS =

4. Pvalue = <.0001

5. Reject Ho.

6. There is sufficient evidence at the alpha = .05 level of significance (p-value < .0001) to suggest that the y-intercept is nonzero. We will proceed to test the slope.


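Interpret the slope:

Bacteria_slope ± t.025,105*SE

We are 95% confident that for every 1 bpml increase in the bacteria level, the estimated mean plastic level changes by between -10.4801 - t.025,105*0.15482 and -10.4801 + t.025,105*0.15482 ounces (a decrease), where the degrees of freedom are n - 2 = 107 - 2 = 105.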
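The slope interval can be evaluated numerically. This is a minimal sketch, assuming the slope estimate and standard error quoted in the answer, n = 53 + 54 = 107 boats, and a critical value of 1.983 (the 0.975 quantile of the t distribution with 105 df, rounded from a t table). As a consistency check, the squared t statistic of a simple linear regression slope should match the model F value, which the ANOVA table later in this exam reports as 4582.41.

```python
slope = -10.4801      # slope estimate quoted in the answer
std_error = 0.15482   # standard error quoted in the answer
n = 53 + 54           # boats in the West and East Pacific
df = n - 2            # error degrees of freedom in simple linear regression
t_crit = 1.983        # 0.975 quantile of t with 105 df (table value, rounded)

t_stat = slope / std_error
f_from_t = t_stat ** 2   # equals the ANOVA F value up to rounding of the inputs

margin = t_crit * std_error
ci_95 = (slope - margin, slope + margin)

print(f"t = {t_stat:.2f}, t^2 = {f_from_t:.1f}, df = {df}")
print(f"95% CI for the slope: ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

Because the rounded standard error enters the calculation, t^2 agrees with the reported F value only to within a fraction of a unit.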

Interpret the intercept:

Intercept ± t.025,105*SE


When the bacteria level equals 0 bpml, the estimated mean plastic level is 4125.90615 ounces.

We are 95% confident that the intercept is between (Intercept ± t.025,105*SE), where the degrees of freedom are n - 2 = 107 - 2 = 105.



c. (10 pts) Next the researchers wanted to test the claim that the relationship (slope) between the Ideonella sakaiensis level (explanatory variable) and the plastic level (response) was different between the East and the West Pacific.  Add the necessary variable(s) to the model to test for this difference.  Provide a copy of the parameter estimate table along with a short (1 to 3 sentences) conclusion about whether the slopes are significantly different between the East and the West, supported with a pvalue and a confidence interval. You may assume the assumptions for your model are met.


d. (10 pts) Researchers have a special interest in a section of the Pacific Ocean about 20 miles off the coast of California (in the EAST Pacific Ocean).  The researchers know that the level of Ideonella sakaiensis in the area is 170 bpml and would like to predict the plastic level in the East Pacific Ocean where the Ideonella sakaiensis level is 170 bpml.  Write a short (1 to 3 sentence) response that includes an estimate of the plastic level for this individual sample with the appropriate interval (confidence or prediction).


Predicted plastic level in the East Pacific Ocean where the Ideonella sakaiensis level is 170 bpml:

Equation of Regression Line. Pred.plastic =


e. (10 pts) Later, the researchers hypothesized that the time of day may have an effect on their study and they wanted to test this variable as a whole. They realized from their records that they could label each of their observations as either Morning, Afternoon, Evening or Night and they called this variable TimeOfDay.  They decided to test the inclusion of this variable as a whole by comparing the following two models:



Conduct an Extra Sum of squares test by filling out the corresponding ANOVA table below to test:


Source   DF    Sum of Squares   Mean Square    F Value   Pvalue
Model    1     34607530.94      34607530.94    4582.41   <.0001
Error    105   792987.02        7552.26
Total    106   35400517.97

In addition to filling out the table above, you only need to provide the conclusion (step 6).   You do NOT need to do the entire 6 step test.
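The extra sum of squares F statistic compares the error sums of squares of the reduced and full models. The sketch below assumes the Error row shown above (SSE = 792987.02 on 105 df) belongs to the reduced model and that TimeOfDay enters the full model as 3 indicator variables (4 levels). The full-model SSE is not given in this document, so `SSE_FULL` is a hypothetical placeholder used only to show the calculation.

```python
def extra_ss_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic and its (numerator, denominator) df for an extra sum of squares test."""
    df_extra = df_reduced - df_full
    f_stat = ((sse_reduced - sse_full) / df_extra) / (sse_full / df_full)
    return f_stat, df_extra, df_full


SSE_REDUCED, DF_REDUCED = 792987.02, 105  # Error row of the table (assumed reduced model)
SSE_FULL = 750000.00                      # HYPOTHETICAL placeholder, not from the exam
DF_FULL = DF_REDUCED - 3                  # assumes 3 indicators for the 4 TimeOfDay levels

f_stat, df_num, df_den = extra_ss_f(SSE_REDUCED, DF_REDUCED, SSE_FULL, DF_FULL)
print(f"F = {f_stat:.3f} on ({df_num}, {df_den}) degrees of freedom")
```

The p-value is the upper-tail area of the F distribution with those degrees of freedom. With the real full-model SSE substituted for the placeholder, the same function fills in the F Value column.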


BONUS (Only if you have time.):


a. (max 3 points) Researchers wanted to revisit the claim that the mean plastic level is different in the West and East Pacific (similar to what was tested in part (a)), although this time they wanted to use additional information: the level of Ideonella sakaiensis in the water as well. Fit a model that regresses the plastic level (response) against the bacterial level (explanatory variable) and assumes equal slopes for the East and the West to test this claim.  Again, provide a copy of the parameter estimate table of the model used to answer this question as well as a scatter plot with the regression lines superimposed.  Again, provide a short (1 to 3 sentence) response providing guidance as to which mean is greater or if there is no evidence to suggest a difference. Your analysis should be supported by pvalue(s) and confidence interval(s). You may assume the assumptions for your model are met.

b. (max 3 pts) Provide a partial residual plot displaying the relationship of the partial residuals of the plastic level adjusted for the temperature against the Ideonella sakaiensis level. Provide a parameter estimate table for the fit of these partial residuals against the bacteria level and comment on the coefficient (slope).

c. (max 3 pts) Given the parameter estimate table and variance covariance table below, calculate a confidence interval for the difference in intercept between the morning and evening levels.  Show your work.