
MAST10010 Data Analysis 1

Semester 2, 2020

Question 1 (2 marks)

In order to assess the effects of exercise on reducing cholesterol, a researcher sampled 50 people from a local gym who exercised regularly and 50 people from the surrounding community who did not exercise regularly. They each reported to a clinic to have their cholesterol measured. The subjects were unaware of the purpose of the study, and the technician measuring the cholesterol was not aware of whether subjects exercised regularly or not. This is an example of:

(A) An observational study.

(B) An experiment, but not a double-blind experiment.

(C) A double-blind experiment.

(D) A matched pairs experiment.

Question 2 (2 marks)

You are designing an experiment to determine the effectiveness of a new treatment. You suspect that age is a possible confounding variable. In designing your study, which one of the following strategies is NOT useful for avoiding this possible confounding effect?

(A) Controlling (specifying) which age categories (e.g. 20-29) get treatment, to counteract this effect.

(B) Blocking/stratification: group subjects into age categories (such as 20-29, 30-39, etc.) and then use age group as a blocking variable.

(C) Statistical adjustment to allow for the effect of age on the response variable.

(D) Randomisation: assigning interventions at random to the subjects, irrespective of age.

Question 3 (2 marks)

Which one of the following is likely to be well described by a Binomial distribution?

(A) The number of years between floods at a certain location.

(B) The number of accidents in a large factory during one 8-hour shift.

(C) The number of tosses of a fair coin until the 10th head is obtained.

(D) The number of apples that are 'infected' in a sample of 40 apples randomly selected from a large consignment of apples.

Question 4 (2 marks)

It has been reported that high levels of dioxin are associated with heavy exposure to Agent Orange, a herbicide sprayed in South Vietnam between 1962 and 1970.

The dioxin concentrations (in parts per trillion) for a sample of army combat personnel who served in Vietnam during 1967 and 1968 were recorded. The distribution of these readings was strongly skewed to the right (positively skewed). Therefore, it is best to describe the distribution by reporting:

(A) The mean and standard deviation.

(B) The mean, median and mode.

(C) The five-number summary.

(D) The correlation.

Question 5 (2 marks)

A researcher wishes to construct a 95% confidence interval for a population mean. She selects a simple random sample of size 8 from the population. The population is normally distributed and σ is unknown. When constructing the confidence interval estimate, the researcher uses the table value from the Normal distribution tables. The (actual) confidence level of her resulting confidence interval estimate will be:

(A) Exactly 95%.

(B) Exactly 90%.

(C) Greater than 95%.

(D) Less than 95% but greater than 90%.

Question 6 (2 marks)

A study was undertaken to compare moist and dry storage conditions for their effect on the moisture content (%) of white pine timber. The report on the findings from the study included the following statement:

“The study showed a significant difference (observed difference = 7.85% - 6.75% = 1.1%; p-value = 0.023) in the moisture content of the pine timber under different storage conditions. Level of Significance (α) for the test was 5%.”

Based on this information, which one of the following statements is necessarily FALSE?

(A) The observed difference between the mean moisture contents (1.1%) is unlikely to be due to chance alone.

(B) The probability that there is no difference between moist and dry storage conditions is 0.023.

(C) If the researchers used a large enough sample size then even a tiny observed difference could result in a statistically significant difference.

(D) Assume there is, in fact, no difference between the storage conditions. If this study were repeated 100 times, then we would expect to (incorrectly) conclude there was a difference in the storage methods for approximately 5 of the 100 studies (that is, 5% of the time we would say there was a difference in the storage methods when, in fact, there was none).

Question 7 (2 marks)

During an angiogram, heart problems can be examined via a small tube (catheter) threaded into the heart from a vein in the patient’s leg.

It is important that the company that manufactures the catheter maintain a diameter of 2.00 mm. Each day, quality control personnel make several measurements to test

H0: µ = 2.00 vs H1: µ ≠ 2.00 at a significance level of 0.05.

If they discover a problem, they will automatically stop the manufacturing process until it is corrected.

Based on the information provided, which one of the following statements is FALSE?

(A) A type 1 error in this scenario occurs when the quality control personnel stop the manufacturing process when, in fact, the mean diameter of the catheters is 2.00 mm.

(B) A type 2 error in this scenario occurs if the quality control personnel do not stop the process when the mean diameter of the catheters being produced made the catheters useless for threading into a heart vein.

(C) The quality control personnel will correctly stop the manufacturing process on approximately 95% of occasions when catheters of incorrect diameter are being produced.

(D) The quality control personnel will incorrectly stop the manufacturing process on approximately 5% of occasions when catheters of the right diameter are being produced.

Question 8 (2 marks)

In a statistics course, a linear regression equation was computed to predict the final exam score (y) from the score on an assignment (x). The equation of the least-squares regression line was

y = 10 + 0.7x

Suppose Jim scored 80 on the assignment and 83 on the final exam. What would be the value of the residual corresponding to those scores?

(A) 17

(B) 37

(C) 3

(D) -3
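The residual in Question 8 is the observed value minus the value predicted by the fitted line. The sketch below illustrates that arithmetic for the stated line y = 10 + 0.7x; the function name `residual` and its parameter names are illustrative and not part of the exam.

```python
def residual(x, y_observed, intercept=10.0, slope=0.7):
    """Return observed minus predicted for the line y = intercept + slope * x."""
    y_predicted = intercept + slope * x
    return y_observed - y_predicted


# Jim: assignment score 80, final exam score 83 (values from the question).
jim_residual = residual(80, 83)
print(round(jim_residual, 2))
```

A positive residual means the observed exam score lies above the regression line; a negative one means it lies below.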


Questions 9-15 refer to the study conducted by Ruan Q. et al. (2020) and published in Intensive Care Med (2020) 46:846-848.

The researchers used the database of Jin Yin-tan Hospital and Tongji Hospital to perform a retrospective study of 68 death cases (68/150, 45%) and 82 discharged cases (82/150, 55%) with laboratory-confirmed infection of Covid-19.

Laboratory results (markers) showed that there were differences between the two groups (dead and discharged). The most interesting markers are: cardiac troponin (cardiac regulatory protein), myoglobin, C-reactive protein (CRP) and interleukin-6 (IL-6, produced in response to infections). The box-plots of patients' results are shown in the graph below.

 

Figure 1: Box-plots of laboratory results of some markers of interest measured on patients with confirmed Covid-19

Question 9 (2 marks)

Which one of the following statements is TRUE?

(A) For those who died from Covid-19, the range of values in the third quartile for cardiac troponin marker is narrower than the range of values for interleukin-6 in the second quartile.

(B) The range of the interleukin-6 marker for discharged people is between 0 (mgL) and 12.5 (mgL).

(C) Four of the discharged people had interleukin-6 values smaller than the Q1 minus 1.5 × IQR.

(D) The medians of the two examined groups for the C-reactive protein marker are farther apart than the medians for the interleukin-6 marker.

Question 10 (2 marks)

Which statistical test should you use if you want to compare the means for the C-reactive protein marker for the patients in the two groups?

(A) A t-test with the assumption that the population standard deviation is the same in both groups. After all, these are all patients with a similar disease.

(B) A paired t-test, everyone has Covid-19.

(C) A t-test where the population standard deviation will be assumed to be different.

(D) Sign test, it is a paired comparison as everyone has Covid-19.

Question 11 (3 marks)

The results of the C-reactive protein marker for patients who died can be approximated by a Normal distribution. Looking at the box-plot, estimate the SD of the C-reactive protein marker results. Briefly explain why your estimate is sensible.

Question 12 (3+3+5 marks)

The proportion of patients who died was 68 out of 150.

(a) Calculate the 95% confidence interval for the proportion of patients who died from Covid-19 in the hospitals.

(b) The reported death rate in other hospitals in the world (including other hospitals in China) is between 10% and 30%. What can you conclude about the sample that the researchers used for their analysis? Provide a brief explanation.

In a different hospital in Italy, similar information was collected. In the Italian hospital, the proportion of patients who died was 81 out of 240. A young doctor who read the Ruan Q. et al. (2020) paper is interested in examining whether the mortality rates recorded in Chinese and Italian hospitals are different.

(c) Conduct a hypothesis test to determine if there is a difference between the mortality rates (proportions) in the Chinese and Italian hospitals.

In your answer you should clearly state/calculate the following:

(+) the hypotheses (in terms of the parameter/s of interest)

(+) the test statistic and its distribution under the null hypothesis

(+) your conclusion using P-value or critical value (5% significance level)

(+) the conclusion of the test in the context of the problem
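The calculations asked for in Question 12 (a) and (c) can be sketched with Python's standard library. This is one common approach, not necessarily the one expected in the course: it assumes the large-sample (Wald) interval for a single proportion and a two-sided z-test with a pooled proportion under the null hypothesis. The function names are illustrative; the counts 68/150 and 81/240 come from the question.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and SD 1


def wald_ci(successes, n, level=0.95):
    """Large-sample (Wald) confidence interval for one proportion."""
    p_hat = successes / n
    z_crit = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 = p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return z, p_value


ci_low, ci_high = wald_ci(68, 150)
z_stat, p_value = two_proportion_z_test(68, 150, 81, 240)
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"z = {z_stat:.2f}, two-sided p-value = {p_value:.3f}")
```

Comparing the printed p-value with 0.05 (or |z| with the Normal critical value) gives the test decision; the interpretation in context is still left to the reader.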

Question 13 (1+3+5+1+1+2+4 marks)

A Data Analysis 1 student wants to compare the cost of living in different continents. For this purpose, he runs an ANOVA analysis on the dataset from the lab test. Recall that the dataset contains information about the cost of living in different cities worldwide. For each city, the dataset also includes the continent to which it belongs.

A partial ANOVA table and some output from Minitab is given below:

Factor Information

Factor      Levels   Values
Continent   6        Africa, Asia, Europe, North America, Oceania, South America

Analysis of Variance

Source      DF      Adj SS     Adj MS    F-value   P-value
Continent   ____    ____       ____      31.49     0.000
Error       ____    131,999    ____
Total       439     179,888

Means

Continent        95% CI
Africa           (29.68, 46.82)
Asia             (36.73, 44.12)
Europe           (54.28, 58.94)
North America    (65.30, 72.83)
Oceania          (63.95, 82.96)
South America    (36.30, 49.74)

Pooled StDev = 17.4398

(a) Give the two hypotheses/models being compared in the ANOVA table.

(b) Give the s_resid for each model and comment on which model appears to be more appropriate for these data.

(c) Complete the ANOVA table above by filling in the blank spaces in the table. You can use the blank space below for working.

(d) Give the distribution of the test statistic under the null hypothesis, including the degrees of freedom.

(e) Based on the ANOVA table, what can you conclude about the difference between the mean cost of living in the six continents (you can assume that the assumptions of the model are satisfied)?

(f) Examine the output for the Fisher's method of calculating confidence intervals. Give an example of two continents with a statistically significant difference in their mean costs of living. Also, give an example of two continents with a non-significant difference in the mean cost of living.

Grouping Information Using the Fisher LSD Method and 95% Confidence

Continent        N     Mean     Grouping
Oceania          13    73.454   A
North America    83    69.07    A
Europe           216   56.61      B
South America    26    43.02        C
Asia             86    40.42        C
Africa           16    38.25        C

Means that do not share a letter are significantly different.

(g) Explain the difference between Tukey's method and Fisher's method for calculating confidence intervals.
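The blank entries of the partial ANOVA table in Question 13 follow from the quantities that are printed. The sketch below works only from figures in the output (total DF 439, 6 continents, error SS 131,999, total SS 179,888). It assumes that s_resid for the single-mean model is the square root of total SS over total DF, and that s_resid for the separate-means model is the square root of the error mean square. Variable names are illustrative.

```python
from math import sqrt

# Quantities read from the Minitab output in the question.
n_levels = 6
df_total = 439
ss_total = 179_888
ss_error = 131_999

# Degrees of freedom: levels - 1 for the factor, the remainder for error.
df_factor = n_levels - 1
df_error = df_total - df_factor

# Sums of squares add up: SS(Total) = SS(Continent) + SS(Error).
ss_factor = ss_total - ss_error

# Mean squares and the F statistic.
ms_factor = ss_factor / df_factor
ms_error = ss_error / df_error
f_value = ms_factor / ms_error

# Residual standard deviation under each model (assumed definitions).
s_resid_null = sqrt(ss_total / df_total)  # one common mean
s_resid_full = sqrt(ms_error)             # one mean per continent

print(f"Continent: DF={df_factor}, SS={ss_factor}, MS={ms_factor:.1f}, F={f_value:.2f}")
print(f"Error:     DF={df_error}, SS={ss_error}, MS={ms_error:.2f}")
print(f"s_resid (single mean) = {s_resid_null:.2f}")
print(f"s_resid (separate means) = {s_resid_full:.4f}")
```

As a consistency check, the computed F statistic and the separate-means s_resid should agree with the F-value and the pooled standard deviation printed in the output.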

Question 14 (2+1+2+2 marks)

The University of Texas Southwestern Medical Centre ran a survey to investigate if there is a relationship between the chance of getting infected with Hepatitis C and whether a person got a tattoo in a commercial parlour, got tattooed in a different venue (i.e. elsewhere), or does not have a tattoo (let's call this variable 'location'). The centre surveyed 626 individuals. The collected data are summarised in the following Minitab output:

Chi-Square Test for Association: Has hepatitis C, Location

Rows: Has hepatitis C   Columns: Location

        Commercial Parlor   Elsewhere   No Tattoo   All
No      35                  53          491         579
        48.10               56.42       474.48
Yes     17                  8           22          47
        3.90                4.58        38.52
All     52                  61          513         626

Cell Contents
   Count
   Expected count

Chi-Square Test

                   Chi-Square   DF   P-Value
Pearson            57.912       2    0.000
Likelihood Ratio   39.025       2    0.000

2 cell(s) with expected counts less than 5.

(a) State the null and alternative hypotheses being tested by this Chi-squared test.

(b) Show how the expected count (3.90) has been calculated for the number of people that reported being infected with Hepatitis C after getting a tattoo in a commercial parlour.

(c) Comment on the association, if any, between the chance of being infected by Hepatitis C and the 'location' variable.

(d) Are there any limitations to the Chi-Squared test in this case? If yes, explain the limitation/s and what should we do in this case.
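The expected counts and the Pearson statistic in the Question 14 output can be reproduced from the observed counts alone. The sketch below assumes the usual rule for a test of association, expected count = (row total × column total) / grand total; the variable names are illustrative, and the observed counts are those in the Minitab table.

```python
# Observed counts: rows are "No" and "Yes" (has hepatitis C);
# columns are Commercial Parlor, Elsewhere, No Tattoo.
observed = [
    [35, 53, 491],
    [17, 8, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Cells with small expected counts, relevant to the test's validity.
small_cells = sum(e < 5 for exp_row in expected for e in exp_row)

print(f"Expected (Yes, Commercial Parlor) = {expected[1][0]:.2f}")
print(f"Chi-square = {chi_square:.3f} on {df} DF")
print(f"Cells with expected count below 5: {small_cells}")
```

The count of cells with expected values below 5 matches the warning line that Minitab prints beneath the test.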

Question 15 (2+2+4 marks)

For the past eight months, many pharmaceutical companies have been trying to develop a safe and effective vaccine for COVID-19. At the same time, many clinical teams have been trying to determine effective treatments for patients. The data in both cases are collected via experiments and observations.

(a) Think of the data collected at hospitals during day-to-day care for COVID-19 patients. Select any two of the common problems with observational studies and explain them in the context of this situation.

(b) Briefly explain why observational studies are problematic for inferring causation.

(c) Choose one principle of good study design that reduces the bias and one principle that increases the precision. Explain how these two principles could be implemented in the context of clinical trials of a new COVID-19 vaccine.