MAST90044  Thinking and Reasoning with Data

Semester 2 2023

Assignment 2

Due: 4pm, Friday 5 May

Instructions

• Assignments are to be saved as a .pdf once complete, and submitted (uploaded) via GradeScope.

• You must sign the plagiarism declaration. The link is available on the subject’s Canvas website.

• You should use the Word document template for Assignment 2 on the subject’s Canvas website. Please label your assignment in the appropriate spots at the top of this document. If you have used an AI tool to assist you in completing this assignment, you should indicate this and name the tool(s) you used.

• Assignments count for 50% of the assessment in this subject.  This one is worth 15%, and covers the work done in Weeks 4 to 6.

• The total number of marks for this assignment is 59.

• Your assignment should show all working and reasoning, as marks will be given for method as well as for correct answers. Please spell check your document.

• Paste any R code and output into the appropriate places so that it can be seen easily along with your other work. Graphics from R can be resized within your document; make them smaller as necessary.

• Tutors will not help you directly with assignment questions.  However, they may give some help with R if you ask, e.g. what does the hist() function do?

• Please note that we may mark only a subset of questions.

• Any extensions need to be approved by Liam. Please email Liam if you need an extension. Late assignments are penalised with a 20% reduction per day. Any assignment submitted more than 3 days (72 hours) after the due date without an extension will receive a score of 0.

• Solutions to the assignment questions will be made available later.

• When constructing a panel of graphs with multiple plots, it is good to use the R command par(mfrow = c(nrows, ncols)), where nrows is the number of rows and ncols the number of columns in the panel. The default is c(1, 1).
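As a minimal sketch of the par(mfrow = ...) command mentioned above, the following draws two plots side by side. The data are simulated purely for illustration, and the names z and old_par are placeholders rather than anything defined in the subject materials.

```r
# Simulated data for illustration only (not from any assignment dataset).
set.seed(1)
z <- rnorm(50)

# par() returns the previous settings, so they can be restored later.
old_par <- par(mfrow = c(1, 2))  # panel with 1 row and 2 columns

hist(z, main = "Histogram of z", xlab = "z")
boxplot(z, main = "Boxplot of z", ylab = "z")

par(old_par)  # restore the previous layout
```

Restoring the saved settings afterwards avoids later plots being squeezed into the same panel by accident.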

Q.1. Return to the obstructive sleep apnea dataset in Q.2. from Assignment 1.

(a) The census projections for the proportion of each age group in Sao Paulo are:

Age      Projections (%)
20–29    25.1
30–39    24.2
40–49    21.2
50–59    15.5
60–69     9.0
70–80     5.0

Are all age groups appropriately represented in the study? Use a hypothesis test, making sure to state and check all relevant assumptions.

(b) Perform a hypothesis test for the claim in (2e) of Assignment 1 (you don’t need to verify any assumptions this time). Be sure to state the null and alternative hypotheses in terms of the model parameter.

Consider the claim you (informally) stated in (2a) of Assignment 1. Our objective is to now formalise this claim and test it against the data.

(c) Perform a hypothesis test to assess whether there is a meaningful relationship between AHI and your variable of interest (you don’t need to show a particular relationship, e.g. increase/decrease, only that there is one). Be sure to state and check all relevant assumptions.

(d) Now we want to measure the association between AHI and your variable of interest. Compute the correlation coefficient and comment on the result. Is this consistent with your original claim? For two ordinal variables, the correlation coefficient is valid and is called the Spearman correlation. Are these variables ordinal, or are they nominal? Why?

[3 + 6 + 8 + 5 = 22 marks]
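For Q.1(a), one common way to compare observed group counts with known population proportions is a chi-squared goodness-of-fit test. The sketch below shows only the mechanics in R. The observed counts are hypothetical placeholders, not values from the sleep apnea dataset, and the object names are illustrative assumptions.

```r
# Census projections from the table in Q.1(a), converted to proportions.
age_group <- c("20-29", "30-39", "40-49", "50-59", "60-69", "70-80")
census_p  <- c(25.1, 24.2, 21.2, 15.5, 9.0, 5.0) / 100

# HYPOTHETICAL counts, for illustration only: replace with the observed
# number of study participants in each age group.
observed <- c(200, 250, 230, 170, 100, 50)
names(observed) <- age_group

# Chi-squared goodness-of-fit test of H0: study proportions equal census_p.
gof <- chisq.test(x = observed, p = census_p)

print(gof$expected)  # expected counts under H0 (commonly checked to be >= 5)
print(gof)           # test statistic, degrees of freedom and p-value
```

The expected counts are printed because the chi-squared approximation relies on them not being too small.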

Q.2. Periodic measurements of salinity and water flow were taken in North Carolina’s Pamlico Sound, resulting in the following data (x = water flow, y = salinity):

 

x    23    24    26    25    30    24    23    22    22    24    25    22    22    22    24
y   7.6   7.7   4.3   5.9   5.0   6.5   8.3   8.2  13.2  12.6  10.4  10.8  13.1  12.3  10.4

 

(a) Read the data into R and produce a suitable graphical summary (with meaningful labels) of the relationship between water flow and salinity.

(b) Write down an appropriate statistical model for examining the relationship, and fit the model in R. Use the regression summary output to determine the correlation coefficient between x and y.

(c) Examine appropriate diagnostic plots, and comment on anything that is noteworthy or that may challenge the assumptions of the model.

(d) Find a 99% confidence interval for the slope of the line. Interpret and comment on the relevance/usefulness or otherwise of the estimated slope and intercept.

(e) Find a 95% prediction interval for the salinity when the water flow is 21. Explain its meaning.

[3 + 5 + 5 + 6 + 3 = 22 marks]
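The R calls that the parts of Q.2 typically rely on can be sketched as follows, assuming a simple linear regression of salinity on water flow is the intended model. The object names (flow, salinity, pamlico, fit and so on) are illustrative choices, and the sketch covers only the mechanics, not the interpretation the question asks for.

```r
# Data from the table in Q.2 (x = water flow, y = salinity).
flow     <- c(23, 24, 26, 25, 30, 24, 23, 22, 22, 24, 25, 22, 22, 22, 24)
salinity <- c(7.6, 7.7, 4.3, 5.9, 5.0, 6.5, 8.3, 8.2,
              13.2, 12.6, 10.4, 10.8, 13.1, 12.3, 10.4)
pamlico  <- data.frame(flow, salinity)   # hypothetical object name

# (a) Scatterplot with labelled axes.
plot(salinity ~ flow, data = pamlico,
     xlab = "Water flow", ylab = "Salinity",
     main = "Salinity against water flow, Pamlico Sound")

# (b) Simple linear regression. In this model |r| = sqrt(R^2), and r takes
#     the sign of the fitted slope.
fit <- lm(salinity ~ flow, data = pamlico)
fit_summary <- summary(fit)
r <- unname(sign(coef(fit)["flow"]) * sqrt(fit_summary$r.squared))

# (c) Standard diagnostic plots in a 2 x 2 panel.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# (d) 99% confidence interval for the slope.
slope_ci <- confint(fit, parm = "flow", level = 0.99)

# (e) 95% prediction interval for salinity at a water flow of 21.
new_obs  <- data.frame(flow = 21)
pred_int <- predict(fit, newdata = new_obs,
                    interval = "prediction", level = 0.95)
```

Note that a water flow of 21 lies just below the smallest observed value (22), so the prediction in (e) is a mild extrapolation.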

Q.3. Varicella-zoster is the common herpes virus responsible for chickenpox (commonly affecting young children and adults) and shingles (commonly affecting older adults). While the initial onset of the virus (chickenpox) is rarely fatal in children, often motivating “chickenpox parties”, the virus lies dormant after recovery and re-emerges later in life as shingles, which has far higher mortality rates in older individuals. While varicella vaccines do exist to prevent the initial onset of chickenpox (thus preventing shingles in older adults), the lifecycle of the virus renders cost-benefit analyses challenging.

To inform local policy, researchers at the University of Saskatchewan, Canada, developed a complex probabilistic agent-based model to predict the prevalence of varicella infections under a variety of vaccination and treatment strategies. Each simulation is expensive, so drawing concrete conclusions from a limited number of runs is vital. This model has recently been adapted to accurately represent an Australian population: the following table presents the average (per year) number of shingles cases in five simulated populations of 100,000 people over a ten-year period under three vaccination strategies.

No vaccination    One dose    Two doses
695.3             631.6       468.6
657.4             495.7       445.6
693.1             613.9       467.5
704.1             642.7       570.7
640.3             483.7       379.5

(a) Perform two two-sample t-tests to compare the population means of the no vaccination group vs. the one dose group, and of the one dose group vs. the two dose group. Report all relevant quantities. What are the underlying assumptions here? Are they reasonable? (Note: there are multiple comparisons here! Your overall Type I error rate should be no more than α = 0.05, i.e. a 5% significance level.)

(b) Suppose that each row of the above table uses common random numbers: the only factor that changes within each row is the strategy; all other factors remain constant. This pairs the simulations within each row. Repeat any test that failed to reject H0 in part (a) as a paired t-test. Do you reach any different conclusions?

[9 + 6 = 15 marks]
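The mechanics of the two-sample and paired t-tests in Q.3 might be set up in R roughly as below. This is only a sketch under stated assumptions: a Bonferroni adjustment is one possible way to handle the two comparisons (the question does not prescribe a method), R's default Welch test is used rather than the pooled-variance version, and all object names are illustrative.

```r
# Simulated yearly shingles cases from the table in Q.3 (one value per row).
none <- c(695.3, 657.4, 693.1, 704.1, 640.3)   # no vaccination
one  <- c(631.6, 495.7, 613.9, 642.7, 483.7)   # one dose
two  <- c(468.6, 445.6, 467.5, 570.7, 379.5)   # two doses

# ASSUMPTION: Bonferroni adjustment, splitting alpha = 0.05 over 2 tests.
alpha_overall <- 0.05
alpha_each    <- alpha_overall / 2

# (a) Two-sample t-tests. R's default is Welch's test (var.equal = FALSE);
#     set var.equal = TRUE for the pooled-variance version.
t_none_one <- t.test(none, one, conf.level = 1 - alpha_each)
t_one_two  <- t.test(one, two, conf.level = 1 - alpha_each)

# Compare each p-value with the per-test level.
reject_a <- c(none_vs_one = t_none_one$p.value < alpha_each,
              one_vs_two  = t_one_two$p.value  < alpha_each)

# (b) Paired versions, which treat each row of the table as a matched set.
p_none_one <- t.test(none, one, paired = TRUE, conf.level = 1 - alpha_each)
p_one_two  <- t.test(one, two, paired = TRUE, conf.level = 1 - alpha_each)
```

A paired t-test is equivalent to a one-sample t-test on the within-row differences, which is why it can behave differently from the two-sample test when rows share common random numbers.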

Total marks = 59