Department of Statistics

STATS 330: Statistical Modelling

Assignment 1

Semester 1, 2021


Total: 65 marks

Due: 23:59hrs NZT, Tuesday 23 March 2021


Notes:

(i) Write your assignment using R Markdown. Knit your report to either a Word or PDF document.

(ii) Create a section for each question. Include all relevant code and output in the final document.

(iii) Marks may be deducted for poor style. Please keep your code and plots neat. Make sure all plots have informative titles and axis labels.

(iv) Please remember to submit a signed cover sheet with your assignment.

(v) Please remember to upload your R Markdown file to Canvas before the deadline, too. If the markers identify an error in your work, being able to run the code you have written can help determine what you did wrong.


Question 1 - mainly revision from STATS 20x

In 1986, a paper was published in The British Medical Journal reporting on a comparison of the effectiveness of several different methods of removing kidney stones¹. We are interested in the outcomes of 350 study participants who received treatment A, which involved surgical removal of the kidney stones (i.e., an invasive procedure), and 350 study participants who received treatment B, which involved percutaneous nephrolithotomy (i.e., a noninvasive procedure).

1 Charig, C. R., Webb, D. R., Payne, S. R., & Wickham, J. E. (1986). “Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy.” Br Med J (Clin Res Ed), 292 (6524): 879–882.

The data set stones.csv contains the following variables:

treatment: the allocated treatment (either A for surgical removal or B for percutaneous nephrolithotomy)

size: the size of the kidney stone (either small or large)

success: whether the treatment successfully removed the kidney stones (either 0 for no or 1 for yes)

(a) We will first examine the relationship between success and treatment using tables of counts and odds. See STATS 20x notes if you need a refresher.

(i) Create a table of counts of success and treatment.

(ii) Calculate the odds of successful removal of kidney stones for study participants receiving treatment A.

(iii) Calculate the odds of successful removal of kidney stones for study participants receiving treatment B.

(iv) Calculate the ratio of the odds calculated in parts (ii) and (iii) and interpret in words. Which treatment is more successful for the removal of kidney stones?

[5 marks]
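The odds arithmetic described in part (a) can be sketched in R as follows. This is only an illustration: the data frame toy.df and its counts are made up and are not the stones.csv data, and all object names are assumptions.

```r
# Hypothetical toy data (NOT the stones.csv data): 10 participants per group.
toy.df <- data.frame(
  treatment = rep(c("A", "B"), each = 10),
  success   = c(rep(1, 8), rep(0, 2),   # group A: 8 successes, 2 failures
                rep(1, 6), rep(0, 4))   # group B: 6 successes, 4 failures
)

# Table of counts: rows are treatments, columns are outcomes (0 = no, 1 = yes).
counts <- table(toy.df$treatment, toy.df$success)
counts

# Odds of success = number of successes / number of failures.
odds.A <- counts["A", "1"] / counts["A", "0"]
odds.B <- counts["B", "1"] / counts["B", "0"]

# Odds ratio of treatment A relative to treatment B.
odds.ratio <- odds.A / odds.B
odds.ratio
```

An odds ratio above 1 here would mean the odds of success are higher under A than under B in the toy data; it says nothing about the real study.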

(b) We will now use a generalised linear model (glm) to explore the relationship between success and treatment.

(i) Identify the response variable and the explanatory variable.

(ii) Justify and fit an appropriate model for this exploration.

(iii) Interpret the model, being sure to communicate the uncertainty in your interpretation.

(iv) How do the model results compare to your answer in (a), part (iv) above?

[8 marks]
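As a rough sketch of how a glm can reproduce an odds calculation, the code below fits a logistic regression to made-up data. The data frame toy.df, the object toy.fit, and the counts are all hypothetical and are not taken from stones.csv.

```r
# Hypothetical toy data (NOT the stones.csv data).
toy.df <- data.frame(
  treatment = factor(rep(c("A", "B"), each = 10)),
  success   = c(rep(1, 8), rep(0, 2),   # group A: 8 successes, 2 failures
                rep(1, 6), rep(0, 4))   # group B: 6 successes, 4 failures
)

# The response is binary (0/1), so this sketch uses a binomial family (logit link).
toy.fit <- glm(success ~ treatment, family = binomial, data = toy.df)
summary(toy.fit)

# Coefficients are on the log-odds scale; exponentiating gives the baseline
# odds (intercept, treatment A) and the odds ratio of B relative to A.
exp(coef(toy.fit))

# Approximate 95% confidence intervals on the odds scale.
exp(confint(toy.fit))
```

With a single two-level explanatory variable, the exponentiated coefficients should match the hand-calculated odds and odds ratio from the table of counts, which gives a useful check on both calculations.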

The analyses above included data from study participants with both large and small kidney stones. We will now split the study participants into two groups according to size.


(c) Repeat the calculations in (a), and do this separately for small and large kidney stones.

(i) Create two tables of counts of success and treatment, one for small kidney stones and one for large kidney stones.

(ii) Calculate the odds of successful removal of small kidney stones for study participants receiving treatment A.

(iii) Calculate the odds of successful removal of small kidney stones for study participants receiving treatment B.

(iv) Calculate the ratio of the odds calculated in parts (ii) and (iii) and interpret in words. Which treatment is more successful for the removal of small kidney stones?

(v) Calculate the odds of successful removal of large kidney stones for study participants receiving treatment A.

(vi) Calculate the odds of successful removal of large kidney stones for study participants receiving treatment B.

(vii) Calculate the ratio of the odds calculated in parts (v) and (vi) and interpret in words. Which treatment is more successful for the removal of large kidney stones?

[10 marks]
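One possible way to organise the stratified calculations in part (c) is a three-way table. The sketch below uses made-up cell counts (not the stones.csv data); the names cells, toy.df, counts3, and odds.by.size are hypothetical.

```r
# Hypothetical cell counts (NOT the stones.csv data).
cells <- data.frame(
  treatment = c("A", "A", "A", "A", "B", "B", "B", "B"),
  size      = c("small", "small", "large", "large",
                "small", "small", "large", "large"),
  success   = c(1, 0, 1, 0, 1, 0, 1, 0),
  n         = c(3, 1, 4, 2, 5, 1, 2, 2)
)

# Expand to one row per participant.
toy.df <- cells[rep(seq_len(nrow(cells)), times = cells$n),
                c("treatment", "size", "success")]

# Three-way table: one treatment-by-success table per stone size.
counts3 <- xtabs(~ treatment + success + size, data = toy.df)
counts3

# Odds of success for each treatment within one size group.
odds.by.size <- function(sz) {
  tab <- counts3[, , sz]
  tab[, "1"] / tab[, "0"]
}

odds.small <- odds.by.size("small")
odds.large <- odds.by.size("large")

# Odds ratios of A relative to B within each size group.
odds.small["A"] / odds.small["B"]
odds.large["A"] / odds.large["B"]
```

The same result can be reached with subset() and two separate calls to table(); the three-way table simply keeps both strata in one object.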

(d) Comment on what you discovered in (a) part (iv) and (c) parts (iv) and (vii). Are you surprised? If so, why? If not, why not? [5 marks]


Question 2

The Park Grass experiment² was set up in 1856 on ancient grassland close to Rothamsted Manor in Hertfordshire with the objective of determining the combination of nutrients that would contribute to species richness. Species richness refers to the number of plant species observed in a particular grassland plot.

In this experiment 90 grassland plots, each with differing nutrient levels as measured by biomass and soil pH, were examined.

2 Click here for more information

The data set species.csv contains the following variables:

pH: the pH of the plot soil (either low, mid, or high)

Biomass: the biomass of a plot (measured in mass per unit area, m/a)

Species: species richness, i.e., the number of plant species observed in a plot

(a) Create appropriate plots to explore the following relationships, and comment on what the plots tell you:

(i) Biomass and pH.

(ii) Biomass and Species.

(iii) Species and pH.

[6 marks]
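A minimal sketch of pairwise exploratory plots in base R is given below. The data are simulated purely for illustration (they are not the species.csv data), and the simulation settings and the name toy.df are assumptions.

```r
# Simulated toy data (NOT the species.csv data).
set.seed(330)
toy.df <- data.frame(
  pH      = factor(rep(c("low", "mid", "high"), each = 10),
                   levels = c("low", "mid", "high")),
  Biomass = runif(30, min = 0, max = 10)
)
toy.df$Species <- rpois(30, lambda = exp(3 - 0.1 * toy.df$Biomass))

# Numeric variable against a factor: side-by-side boxplots.
boxplot(Biomass ~ pH, data = toy.df,
        main = "Biomass by soil pH (toy data)",
        xlab = "Soil pH", ylab = "Biomass")

# Two numeric variables: scatterplot.
plot(Species ~ Biomass, data = toy.df,
     main = "Species richness against biomass (toy data)",
     xlab = "Biomass", ylab = "Number of species")

# Count variable against a factor: boxplots again.
boxplot(Species ~ pH, data = toy.df,
        main = "Species richness by soil pH (toy data)",
        xlab = "Soil pH", ylab = "Number of species")
```

Setting the factor levels explicitly keeps the pH groups in the order low, mid, high rather than alphabetical order.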

(b) Create a plot that explores the relationship between Biomass, Species and pH. Comment on what the plot tells you.

Note: You may find some of the material in Tutorial 2 helpful.

[5 marks]
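One way to show two numeric variables and a factor on a single plot is to colour the points by factor level. The sketch below does this on simulated data (not the species.csv data); the colour choices and the names toy.df and pH.cols are assumptions.

```r
# Simulated toy data (NOT the species.csv data).
set.seed(330)
toy.df <- data.frame(
  pH      = factor(rep(c("low", "mid", "high"), each = 10),
                   levels = c("low", "mid", "high")),
  Biomass = runif(30, min = 0, max = 10)
)
toy.df$Species <- rpois(30, lambda = exp(3 - 0.1 * toy.df$Biomass))

# One colour per pH level, looked up by level name for each point.
pH.cols <- c(low = "red", mid = "darkgreen", high = "blue")
point.cols <- pH.cols[as.character(toy.df$pH)]

plot(Species ~ Biomass, data = toy.df,
     col = point.cols, pch = 19,
     main = "Species richness against biomass, by soil pH (toy data)",
     xlab = "Biomass", ylab = "Number of species")
legend("topright", legend = names(pH.cols), col = pH.cols,
       pch = 19, title = "Soil pH")
```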

(c) We are interested in exploring the relationship between the condition of the soil as measured by Biomass and pH, and the number of plant species, Species.

(i) Identify the response and explanatory variables for this exploration. [3 marks]

(ii) What model would be appropriate to fit? Choose from linear regression, Poisson regression and logistic regression. Justify your choice. [3 marks]

(iii) Using your answers to parts (i) and (ii), run the following code for two models, making sure you replace the blanks with the appropriate terms. Briefly interpret the models³. [4 marks]

model1 <- glm(Species ~ Biomass * pH, family = " ", data = species.df)
model2 <- glm(Species ~ Biomass + pH, family = " ", data = species.df)
summary(model1)
summary(model2)

(iv) Evaluate the adequacy of the models fitted in part (iii). Comment briefly. [2 marks]

(v) Calculate the change in residual deviance from model1 to model2 and calculate the probability of observing such a change given we have ‘lost’ 2 degrees of freedom (i.e., we have reduced the number of parameters in the model from 6 to 4). [3 marks]
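The deviance comparison described in (v) can be sketched on simulated data as follows. Everything here is hypothetical: the variables y, x, and g, the Poisson family chosen for the simulated counts, and the object names are assumptions made for illustration and are not the assignment's models.

```r
# Simulated toy data (NOT the species.csv data).
set.seed(330)
toy.df <- data.frame(
  x = runif(60, min = 0, max = 10),
  g = factor(rep(c("low", "mid", "high"), each = 20),
             levels = c("low", "mid", "high"))
)
toy.df$y <- rpois(60, lambda = exp(3 - 0.1 * toy.df$x))

# Two nested models: with and without the interaction.
big.fit   <- glm(y ~ x * g, family = poisson, data = toy.df)  # 6 parameters
small.fit <- glm(y ~ x + g, family = poisson, data = toy.df)  # 4 parameters

# Change in residual deviance and in residual degrees of freedom.
dev.change <- deviance(small.fit) - deviance(big.fit)
df.change  <- df.residual(small.fit) - df.residual(big.fit)

# Probability of a change at least this large under a chi-squared
# distribution with df.change degrees of freedom.
p.value <- pchisq(dev.change, df = df.change, lower.tail = FALSE)
p.value

# The same comparison in one call, as a check.
anova(small.fit, big.fit, test = "Chisq")
```

The chi-squared reference distribution is an approximation that assumes the smaller model is adequate; a large p-value suggests the extra interaction terms are not needed for the toy data.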

Choose your preferred model, model1 or model2.

(vi) Construct a plot displaying the estimated effects of your preferred model (i.e., model1 or model2) on the response variable⁴. Clearly communicate what your plot displays. [5 marks]

(vii) Based on your model, calculate and interpret 95% confidence intervals for the mean number of plant species observed in each of three separate plots, all of which have Biomass=5 but each with different levels of pH: low, mid, and high. [6 marks]

Hint: Use the option se.fit=TRUE in the function predict.glm. For information on the function predict.glm, type ?predict.glm.
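As a sketch of how the hint might be used, the code below builds approximate 95% confidence intervals on the link scale and then back-transforms them to the response scale. The data are simulated and the names (toy.df, toy.fit, new.df, y, x, g) and the Poisson family are assumptions for illustration only.

```r
# Simulated toy data (NOT the species.csv data).
set.seed(330)
toy.df <- data.frame(
  x = runif(60, min = 0, max = 10),
  g = factor(rep(c("low", "mid", "high"), each = 20),
             levels = c("low", "mid", "high"))
)
toy.df$y <- rpois(60, lambda = exp(3 - 0.1 * toy.df$x))
toy.fit <- glm(y ~ x + g, family = poisson, data = toy.df)

# New observations: x fixed at 5, one row per level of g.
new.df <- data.frame(
  x = 5,
  g = factor(c("low", "mid", "high"), levels = levels(toy.df$g))
)

# Predictions and standard errors on the link (linear predictor) scale.
pred <- predict(toy.fit, newdata = new.df, type = "link", se.fit = TRUE)

# Interval on the link scale, then back-transform with the inverse link.
z   <- qnorm(0.975)
inv <- family(toy.fit)$linkinv
ci <- data.frame(
  g        = new.df$g,
  estimate = inv(pred$fit),
  lower    = inv(pred$fit - z * pred$se.fit),
  upper    = inv(pred$fit + z * pred$se.fit)
)
ci
```

Building the interval on the link scale and back-transforming keeps the limits inside the valid range of the mean (for example, positive for a log link), which a symmetric interval built directly on the response scale would not guarantee.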

3 Don’t worry about testing for goodness-of-fit or doing model selection - these topics were not covered when this assignment was released.

4 See the plots in Handout 2 for a good example of something similar from the chicken bacteria data analysis. Also see Tutorial 2.