Research School of Finance, Actuarial Studies and Statistics

EXAMINATION

Semester 1 – Final, 2017

STAT2008 Regression Modelling

Question 1 (15 marks)

The faraway library includes a data frame called cheddar, which contains data from a study of cheddar cheese from the La Trobe Valley in Victoria. The concentration of Lactic acid and the concentrations (on a log scale) of both Acetic acid and H2S (hydrogen sulphide) were measured in 30 samples of cheese, which were then subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters.

(a) A multiple regression model (cheddar.lm) has been fitted to these data and the summary output from this model is given at the top of page 2 of the R output, but the analysis of variance (ANOVA) table is not shown. Fill in the details of the ANOVA table in the spaces shown below:

             Df      Sum Sq      Mean Sq      F value      Pr(>F)

H2S

Lactic

Residuals

[Hint: rounding errors will accumulate as you derive entries in this table from other values shown in the R output, so do NOT round the results of intermediate calculations. DO round all your final answers in the above table to 2 decimal places. You may also have to use the statistical tables to estimate one or more of the p-values, or you can receive the marks for showing appropriate critical values.]

Working:

(3 marks: 1 for each row of the ANOVA table)
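One way to organise the working is sketched below in Python, using made-up inputs (the real figures must be read from page 2 of the R output). The residual row follows from the residual standard error and its degrees of freedom, and the total sum of squares follows from the multiple R-squared. Because Lactic is the last term entered, its sequential F value equals the square of its t value in the summary output. The function name and the numbers in the usage lines are hypothetical, and the p-values are left to the statistical tables.

```python
def sequential_anova(n, rse, r_squared, t_last):
    """Rebuild a sequential ANOVA table for a model with two 1-df terms.

    n         : number of observations
    rse       : residual standard error from the summary output
    r_squared : multiple R-squared from the summary output
    t_last    : t value of the term entered last (here, Lactic)
    """
    df_resid = n - 3                   # intercept plus two slopes
    mse = rse ** 2                     # residual mean square
    sse = mse * df_resid               # residual sum of squares
    sst = sse / (1.0 - r_squared)      # total sum of squares about the mean
    ss_last = t_last ** 2 * mse        # F = t^2 for the last 1-df term
    ss_first = (sst - sse) - ss_last   # remainder of the regression SS
    rows = {}
    for name, ss in (("first", ss_first), ("last", ss_last)):
        rows[name] = {"Df": 1, "Sum Sq": ss, "Mean Sq": ss, "F value": ss / mse}
    rows["Residuals"] = {"Df": df_resid, "Sum Sq": sse, "Mean Sq": mse}
    return rows


# Hypothetical inputs, NOT the values from the exam's R output.
table = sequential_anova(n=30, rse=10.0, r_squared=0.6, t_last=2.0)
for name, row in table.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Rounding happens only when printing, in line with the hint about not rounding intermediate results.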

(b) Residual plots for the model in part (a) are shown on pages 2 and 3 of the R output. Do these plots suggest any problems with the underlying assumptions?

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 2? If so, describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 3? If so, describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 3? If so, describe the problem(s):

What is your overall assessment? (select just ONE of the following options)

Residuals are not independent (obvious pattern)

Residuals do not have constant variance (heteroscedasticity)

Residuals are not normally distributed

There are possible outliers and/or influential observations

More than one of the above problems

No obvious problems

(2 marks: 0.5 for each section)

(c) For each of the following five diagnostic measures shown on page 4 of the R output, calculate the relevant cut-off value suggested in the lecture notes and discuss whether or not this cut-off is appropriate in this instance. Which observations, if any, exceed each of the cut-off values?

The leverage or hat values (h_ii)

The externally studentised residuals (t_i)

DFFITS

COVRATIO

DFBETAS

Given your answers above and considering the residual plots in part (b), are there any observations that are vertical outliers and/or highly influential observations? Should some observations be removed and the model re-fit to the remaining data?

(7 marks: 1 for each of the first 5 sections and 2 for the last summary section)
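As a rough illustration of how such cut-offs are computed, the sketch below uses widely quoted rules of thumb. These are an assumption: the lecture notes may suggest different values. It takes n = 30 observations and p = 3 estimated coefficients (intercept, H2S and Lactic). The function names and the DFFITS values in the usage lines are hypothetical.

```python
import math


def diagnostic_cutoffs(n, p):
    """Common rule-of-thumb cut-offs; p counts the intercept.

    Assumed conventions (lecture notes may differ):
    leverage 2p/n, |t_i| > 2, |DFFITS| > 2*sqrt(p/n),
    |COVRATIO - 1| > 3p/n, |DFBETAS| > 2/sqrt(n).
    """
    return {
        "leverage": 2 * p / n,
        "studentised_residual": 2.0,
        "dffits": 2 * math.sqrt(p / n),
        "covratio": (1 - 3 * p / n, 1 + 3 * p / n),
        "dfbetas": 2 / math.sqrt(n),
    }


def exceeding(values, cutoff):
    """Return 1-based observation numbers whose absolute value exceeds cutoff."""
    return [i for i, v in enumerate(values, start=1) if abs(v) > cutoff]


cutoffs = diagnostic_cutoffs(n=30, p=3)
for name, value in cutoffs.items():
    print(name, value)

# Hypothetical DFFITS values, NOT taken from page 4 of the R output.
hypothetical_dffits = [0.10, -0.80, 0.25, 0.70]
flagged = exceeding(hypothetical_dffits, cutoffs["dffits"])
print("flagged observations:", flagged)
```

Size-adjusted cut-offs of this kind shrink as n grows, which is one reason to discuss whether they suit a sample of only 30 observations.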

(d) Output for a second model (cheddar.lm2) is shown on page 5 of the R output, which includes an additional term added to the initial model described in the earlier parts of this question. Is the term involving Acetic a significant addition to a model which already includes H2S and Lactic? Give full details of an appropriate hypothesis test.

(3 marks)
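A test of this kind can be framed as a partial (nested model) F test of H0: the Acetic coefficient is zero, given H2S and Lactic. The sketch below shows the arithmetic only; the function name and the sums of squares are hypothetical, and it assumes that exactly one term was added, so the residual degrees of freedom drop from 27 to 26. With a single added term, the F statistic equals the square of that term's t value in the summary output.

```python
def partial_f(sse_reduced, sse_full, df_resid_reduced, df_resid_full):
    """F statistic for comparing nested linear models.

    Returns (F, numerator df, denominator df).
    """
    extra_df = df_resid_reduced - df_resid_full
    numerator = (sse_reduced - sse_full) / extra_df   # extra SS per added term
    denominator = sse_full / df_resid_full            # MSE of the full model
    return numerator / denominator, extra_df, df_resid_full


# Hypothetical sums of squares, NOT the values from the R output.
f_stat, df1, df2 = partial_f(
    sse_reduced=2700.0, sse_full=2600.0,
    df_resid_reduced=27, df_resid_full=26,
)
print(f"F = {f_stat:.2f} on {df1} and {df2} degrees of freedom")
```

The statistic is then compared with the upper critical value of the F distribution on the stated degrees of freedom.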

Question 2 (15 marks)

The US Centers for Disease Control and Prevention (CDC) use data from the National Health and Nutrition Examination Survey (NHANES) to develop a series of clinical growth charts for assessing healthy growth ranges in boys and girls. The data frame kid.weights in the UsingR library contains a sample of 250 observations taken from the NHANES data. The data frame contains the age (in months), weight (in pounds) and height (in inches) for 129 girls (gender = F) and 121 boys (gender = M), with age ranging from 3 months to 144 months (12 years).

(a) Page 6 of the R output shows code used to fit a series of models to these data. Residual plots are given on page 7 for growth.lm3, the last of this series of models. Do these plots suggest any problems with the underlying assumptions for model growth.lm3?

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 7? If so, describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 7? If so, describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 7? If so, describe the problem(s):

What is your overall assessment? (select just ONE of the following options)

Residuals are not independent (obvious pattern)

Residuals do not have constant variance (heteroscedasticity)

Residuals are not normally distributed

There are possible outliers and/or influential observations

More than one of the above problems

No obvious problems

(2 marks: 0.5 for each section)

(b) On page 8 of the R output, there is also some summary output for the model growth.lm3, including a few residual diagnostics. Use this summary output and your answers to part (a) to comment on the following issues:

Observations 228, 9 and 158 were highlighted in some of the residual plots. Which of the diagnostics on page 8 could you use to test if these observations are vertical outliers? Are these observations really outliers or do they suggest some other problem with the underlying assumptions?

Is growth.lm3 an appropriate model for the kid.weights data? If not, how might we modify this model?

(2 marks)

(c) In the summary(growth.lm3) output on page 8 of the R handout, most of the summary statistics and the partial regression coefficient for the interaction term boy:height have been removed and replaced by question marks. Calculate all five missing statistics. [Show all necessary formulae and working and round your final answers to no more than 3 significant figures, as rounding errors will accumulate.]

Estimated coefficient for the boy:height term

The residual standard error and the corresponding degrees of freedom

Multiple R-squared

Adjusted R-squared

The F-statistic and the corresponding degrees of freedom

(5 marks)
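The standard relationships among these statistics can be sketched as follows. This is an illustration only: the function names, the sums of squares, the t value, the standard error and the choice of p = 4 coefficients are all hypothetical, since the terms in growth.lm3 and the figures on page 8 are not reproduced here. Only n = 250 comes from the question.

```python
import math


def summary_statistics(sse, sst, n, p):
    """Summary statistics of a linear model from its sums of squares.

    sse : residual sum of squares
    sst : total sum of squares about the mean
    n   : number of observations
    p   : number of estimated coefficients, including the intercept
    """
    df_resid = n - p
    df_model = p - 1
    mse = sse / df_resid
    return {
        "residual_se": math.sqrt(mse),
        "df_resid": df_resid,
        "r_squared": 1.0 - sse / sst,
        "adj_r_squared": 1.0 - mse / (sst / (n - 1)),
        "f_statistic": ((sst - sse) / df_model) / mse,
        "f_df": (df_model, df_resid),
    }


def coefficient_from_t(t_value, std_error):
    """Recover an estimate from its t value and standard error."""
    return t_value * std_error


# Hypothetical inputs, NOT the values from page 8 of the R output.
stats = summary_statistics(sse=2460.0, sst=24600.0, n=250, p=4)
estimate = coefficient_from_t(t_value=2.5, std_error=0.2)
for name, value in stats.items():
    print(name, value)
print("estimate:", estimate)
```

Which of these relationships is needed depends on which entries survive in the printed output; any one of them can be rearranged to solve for a different unknown.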

(d) The indicator variable boy is equal to 1 for each male observation and is 0 otherwise (when the observation is a girl). This indicator variable was created at the end of page 6 of the R output and was included in the model growth.lm3.

The model growth.lm2 is also shown on page 6 of the R output, but has been turned into a comment, so that the output for this model is not shown. What does the model growth.lm2 suggest is the form of the relationship between weight and the explanatory variables included