MAST10010 Data Analysis 1

Semester 2, 2021

Question 1 (2 marks)

A study is investigating whether or not vaping (using an electronic cigarette) affects lung function after exercise. The researchers measured the forced expiratory volume (FEV), the participant then completed 20 star jumps, and the FEV was measured again immediately afterwards. This study is

(a) an experiment because the researchers measured the FEV.

(b) an experiment because the researchers made the participants exercise by doing star jumps.

(c) an observational study because they observed the lung function as measured by FEV.

(d) an observational study because the researchers did not decide who did/did not vape.

(e) none of the above.

Question 2 (2 marks)

Two independent random variables A and B have standard deviations 12 and 5 respectively. The standard deviation of  (A − B) is:

(a) 13.

(b) √119.

(c) 7.

(d) 13/2.

(e) √119/2.

(f) 7/2.
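
The rule this question relies on is that variances (not standard deviations) add for independent random variables, whether they are summed or subtracted. A minimal Python sketch of that rule for the difference A − B as printed; the function name is illustrative, and any constant multiplying the difference would simply scale the result by its absolute value.

```python
import math


def sd_of_difference(sd_a: float, sd_b: float) -> float:
    """Standard deviation of A - B when A and B are independent.

    Var(A - B) = Var(A) + Var(B), so the SD is the square root of the
    sum of the squared standard deviations.
    """
    return math.sqrt(sd_a ** 2 + sd_b ** 2)


# Standard deviations given in the question: 12 and 5.
sd_diff = sd_of_difference(12, 5)
print(sd_diff)
```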

Question 3 (2 marks)

Similar to the Balls and Buckets activity from lectures, 1000 samples of the same sample size were taken from a single population. For each sample, a hypothesis test was conducted and a P-value calculated; these are graphed below:

Which of the following statements are TRUE?

(a) the alternative hypothesis is true because a majority of the P-values are lower than 0.05.

(b) the alternative hypothesis is true because more than 5% of the P-values are lower than 0.05.

(c) the alternative hypothesis is true because some hypothesis tests had P-values lower than 0.05.

(d) the null hypothesis is true because almost all of the P-values are greater than 0.05.

(e) it is not possible to decide whether or not the null hypothesis is true.

Question 4 (2 marks)

A random variable T = X1 + X2 + ··· + X80, where the Xi are independent, identically distributed random variables with an unknown distribution, mean µ and standard deviation σ. Which of the following is FALSE?

(a) the standard deviation of T will be 80σ.

(b) the mean of T will be 80µ.

(c) if the Xi are normally distributed, then T will also be normally distributed.

(d) if the Xi are not badly skewed, then T will be approximately normally distributed.

(e) the correlation between X1 and X2 will be zero.
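
For a sum of n independent, identically distributed variables, the means add and the variances add, so the standard deviation grows with the square root of n rather than with n. A minimal Python sketch; the function name and the example values of µ and σ are hypothetical, chosen only for illustration.

```python
import math


def sum_mean_and_sd(n: int, mu: float, sigma: float) -> tuple[float, float]:
    """Mean and SD of T = X1 + ... + Xn for i.i.d. Xi with mean mu, SD sigma.

    E(T) = n * mu and Var(T) = n * sigma**2, so SD(T) = sqrt(n) * sigma.
    """
    return n * mu, math.sqrt(n) * sigma


# Hypothetical values mu = 2 and sigma = 3, with n = 80 as in the question.
mean_t, sd_t = sum_mean_and_sd(80, 2.0, 3.0)
print(mean_t, sd_t)
```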

Question 5 (2 marks)

Which of the following statements is FALSE?

(a) µ is the expectation of  .

(b)  is a random variable.

(c)  is a realisation of µ .

(d)  is an estimate of µ .

(e)  is an estimator of µ .

Question 6 (2 marks)

Which of the following statements is TRUE?

(a) a wider confidence interval means less confidence in the estimate (all other things being equal).

(b) a narrower confidence interval is always better than a wider confidence interval.

(c) for a 95% confidence interval for a population mean, there is a 95% chance that the confidence interval includes the sample mean.

(d) the width of a confidence interval is affected by both the sampling error and the required level of confidence.


Question 7 (2 marks)

A study on whether logging increases sediment (lowers water quality) is being conducted. The null hypothesis is that there is no difference in sediment comparing logged areas to those left intact. Rejecting the null hypothesis will mean that logging within water catchments is stopped. Which of the following statements is FALSE?

(a) the probability that logging in water catchments affects the water quality is α, the significance level.

(b) a Type I error would mean that logging in water catchments is stopped, when this would not benefit the water quality.

(c) a Type II error would mean that logging in water catchments continues, even though the water quality will be lowered.

(d) a true negative would be when logging in water catchments continues, and the water quality is not affected.

(e) a true positive would be when logging in water catchments is stopped, and the water quality would be affected.

(f) more than one of these is false.


Questions 8 and 9 refer to the following information:

There is known racial bias in the diagnosis of pain in the USA, with African American patients less likely to receive pain medication when presenting with the same levels of pain as white patients. A hospital is interested in whether their doctors are similarly biased. A random sample of patients presenting with extreme pain is studied.

The data obtained are:

                  Given medication  Not given medication  Total

African American                12                    17     39

White                           28                    14     42

Question 8 (2 marks)

Consider the proportions receiving pain medication in African American (p1) and white (p2) patients.

For a test of the hypotheses

H0: p1 = p2

H1: p1 ≠ p2

the correct standard error to use for this test is

(a) 0.103.

(b) 0.111.

(c) 0.659.

(d) 4.500.

(e) none of the above.
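
Two standard errors are commonly used for a difference in proportions: a pooled form, usual when testing H0: p1 = p2, and an unpooled form, usual for a confidence interval. The Python sketch below computes both. It assumes the printed row totals (39 and 42) and the "Given medication" counts (12 and 28) are the intended values; the function names are illustrative.

```python
import math


def pooled_se(x1: int, n1: int, x2: int, n2: int) -> float:
    """SE of p1_hat - p2_hat under H0: p1 = p2, using the pooled proportion."""
    p = (x1 + x2) / (n1 + n2)
    return math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


def unpooled_se(x1: int, n1: int, x2: int, n2: int) -> float:
    """SE of p1_hat - p2_hat using each group's own sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Assumed counts: 12 of 39 African American and 28 of 42 white patients
# were given medication (row totals taken at face value).
se_test = pooled_se(12, 39, 28, 42)
se_interval = unpooled_se(12, 39, 28, 42)
print(round(se_test, 4), round(se_interval, 4))
```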

Question 9 (2 marks)

Considering the information above, which of the following is TRUE?

(a) a confidence interval based on these data would have a different estimate of the difference.

(b) a confidence interval based on these data would have the same standard error.

(c) if the alternative hypothesis was H1: p1 < p2, then the p-value would be larger.

(d) a χ² test would give the same P-value.

Question 10 (2 marks)

A company is testing their mobile designs, to determine which is the best at entertaining infants in a crib. They have 6 different designs they are testing, and 35 infants whose parents have consented for them to be in the study. When conducting an ANOVA, the degrees of freedom for the test statistic will be:

(a) 6 and 35.

(b) 5 and 29.

(c) 1 and 33.

(d) 35.

(e) 29.

(f) it is not possible to determine, as the groups cannot be equal.
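
In a one-way ANOVA with k groups and N observations in total, the F statistic has k − 1 numerator and N − k denominator degrees of freedom; group sizes do not need to be equal. A minimal Python sketch, assuming each infant is observed once with a single design (the function name is illustrative):

```python
def one_way_anova_df(num_groups: int, total_n: int) -> tuple[int, int]:
    """Numerator and denominator degrees of freedom for a one-way ANOVA F test."""
    return num_groups - 1, total_n - num_groups


# 6 designs and 35 infants, as in the question.
df_between, df_within = one_way_anova_df(6, 35)
print(df_between, df_within)
```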

Question 11 (9 marks) [3 + 3 + 3 = 9 marks]

(a) “When the distribution of the data is strongly skewed, reporting the mean is uninformative.”

Discuss this statement, and why it is not entirely correct.

(b) Compare and contrast point estimates and interval estimates. Your answer should include at least one commonality, and at least one difference.

(c) Compare and contrast Fisher and Tukey intervals, in the context of Analysis of Variance (ANOVA). Your answer should include at least one commonality, and at least one difference.


Question 12 (5 marks) [2 + 1 + 2 = 5 marks]

Electricity distributors are assessing the risk of power outages caused by falling trees. They have divided the crucial above-ground connections into approximately 5,000 segments of equal length. These segments have been categorised into 3 groups based on the potential impact to the electricity grid (low, moderate and high); the segments are not equally divided between these groups. To assess whether a particular segment is vulnerable, they need to send a trained arborist to examine all trees near/overhanging the power line. This is expensive, so the electricity distributor only wants to assess approximately 100 segments.

(a) What is the research question the electricity distributor is assessing?

(b) Explain how randomisation could be applied in this study.

(c) Explain what is meant by replication, and how it could be applied in this study.



Question 13 (11 marks) [3 + 3 + 4 + 1 = 11 marks]

Smoke from bushfires is a common source of PM2.5 pollution (particles which have a diameter of 2.5 micrometres or smaller). After a small bushfire in the area, the data from many monitoring stations were investigated. The data are measured in micrograms per cubic metre (µg/m³), and an exploratory data analysis produced the following graphs and statistics:

(a) A report on these data was given as:

“Air pollution data (as measured by PM2.5 levels) showed a very slight positive skew, with a median value of 25.9 µg/m³.”

Identify two things which should be added to improve this comment (you should be specific).


(b) Explain why it is reasonable to calculate a confidence interval for µPM2.5, the mean pollution level, even though this calculation requires an assumption of normality and the data are skewed.

(c) Calculate a 90% confidence interval, using the data above and some of the following output. Use at least 3 decimal places in your calculations (where possible), and show your working.

(d) Using only your interval from part (c), what could you conclude about a test for the hypotheses

H0: µPM2.5 = 25

H1: µPM2.5 ≠ 25

using a significance level of α = 0.10?


Question 14 (5 marks) [2 + 3 = 5 marks]

The researchers involved in Question 13 were also interested in how mountain ranges affected the spread of pollution from a bushfire. In Victoria, most weather travels from west to east. A fire near Balmoral led to smoke reaching Gariwerd (also known as the Grampians). A large number of PM2.5 sensors had been set up on both the western and eastern sides of Gariwerd and an average reading for 24 hours for each sensor was recorded.

Some alternative analyses on these data were performed:


(a) Which of the three tests is the most appropriate, and why? Clearly state your reasoning.

(b) For the most appropriate output selected in (a) above, write a conclusion in the style of a research report. Marks will be awarded for being clear and concise.


Question 15 (5 marks) [1 + (1+1+2) = 5 marks]

A PhD student is involved in the early stages of planning an experiment to test the effect of two possible Alzheimer’s treatments. The experiment will induce Alzheimer’s disease into mice, treat them with the two treatments, and then evaluate their problem solving ability (as measured by time to complete a maze).

Due to ethical concerns, the PhD student is only allowed 60 mice, and has decided to allocate these equally (and randomly) to the two treatment groups (30 mice per group).

(a) Why is it preferable to have equal groups in this experiment?

(b) Based on previous research, the PhD student has decided she would like to be able to identify an average difference of 4 seconds, as this would represent a significant impairment or recovery in the mice. She prepared the following graph to help with her sample size analysis:

(i) What significance level has the PhD student decided to use?

(ii) How can you tell, from the graph, that her calculation is based on a two-sided alternative hypothesis?

(iii) How would the power change if she needed to detect a smaller difference (e.g. 2 seconds)?



Question 16 (11 marks) [6 + 2 + 3 = 11 marks]

In Australia, there is a distinction between plastic surgeons and cosmetic surgeons. Plastic surgeons are regulated (they require at least 12 years of training, and membership of a professional body) whereas cosmetic surgeons are not (only requiring a basic 4 year medical degree). Despite the differences, most Australians use the terms “plastic surgery” and “cosmetic surgery” interchangeably.

(a) A small random survey found that 23 of the 25 people surveyed agreed with the statement “A plastic surgeon and a cosmetic surgeon are the same.” Use the Minitab output to conduct a hypothesis test to determine if the proportion differs from 0.5. You should state your hypotheses, calculate an appropriate test statistic, give a range for the P-value, and state your conclusion in the context of the survey.



(b) In part (a) you have used a test based on an approximation. Was this appropriate?

(c) A larger survey is being planned, and the desired margin of error for a 95% confidence interval is 0.05. Assuming that the proportion who agree with this statement is not smaller than 80%, how large a sample is required?
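
The usual planning formula for a proportion sets the margin of error m equal to z·sqrt(p(1 − p)/n) and solves for n, rounding up. Because p(1 − p) decreases as p moves away from 0.5, the bound p = 0.8 is the value in the stated range that demands the largest sample. A minimal Python sketch of that calculation; the function name is illustrative, and it uses the exact normal quantile rather than the rounded 1.96.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin: float, p: float, confidence: float = 0.95) -> int:
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Margin of error 0.05 at 95% confidence, planning value p = 0.8.
n_required = sample_size_for_proportion(0.05, 0.8)
print(n_required)
```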


Question 17 (6 marks)

A study was examining energy consumption for hotels in Lagos, Nigeria (P.O. Oluseyi, O.M. Babatunde and O.A. Batatunde, 2016). Each data point represents a single hotel.

A simple linear regression was performed, predicting Energy Consumption using Floor Area, and the following graphs were produced: 

What are the assumptions for this regression analysis? Comment on whether or not they are satisfied, stating your reasoning clearly.

Question 18 (15 marks) [3 + 3 + 3 + 3 + 1 + 2 = 15 marks]

Another study examined energy consumption for hotels in Singapore (R. Priyadarsini, W. Xuchao and L.S. Eang, 2009). Each data point represents a single hotel.

A simple linear regression was also performed for the Singapore data. Some of the Minitab output produced is included below.

(a) Give the equation for the alternative hypothesis model being applied. Your notation needs to be clear and/or explained.

(b) How well does the model describe the data for Singapore? Your comment needs to include a clear description supported by one statistic (Hint: you need to calculate this from the given information).

(c) Calculate a 95% confidence interval for the slope of the model for hotels in Singapore, using the following output. Interpret your interval in the context of these data.

(d) Minitab identified two points with large standardised residuals, and two points with high leverage. Explain the difference between these two types of unusual points in regression.

(e) Predict, with 95% confidence, the mean energy consumption for hotels in Singapore with a floor area of 40,000 m².

(f) Explain the difference between a confidence interval and a prediction interval. You may want to refer to previous parts of this question and/or Minitab output.


Question 19 (14 marks) [2 + 1 + 2 + 3 + 3 + 3 = 14 marks]

Jamie, a drug and alcohol counselor, conducted a survey to investigate whether or not usage rates of alcohol and other drugs were affected by lockdowns. The survey was conducted in the tenth week of a lockdown, and asked 267 participants “Which of the following have you used in the past week (select all that apply)?”. Given the amount of data collected, Jamie decided to combine all of the “Other drugs” into a single category. The data were compared to historical proportions, prior to any lockdowns. The results of their analysis are below.

(a) Explain why it is a good idea to ask about the last week, rather than the last month.

(b) State the null and alternative hypotheses being tested by this chi-square test.



(c) Show how the expected count (104.13) and contribution (5.4718) have been calculated for the “Alcohol only” group.

(d) Specify the distribution of the test statistic if the null hypothesis is true. Use the output below to determine a range for the P-value.

(e) Based on this analysis, is there evidence that drug and alcohol consumption has changed during the lockdown? Your comments need to be sufficiently detailed to get full marks for this question.

(f) In order to get a large sample, Jamie posted about the survey on Facebook, and also gave information about the survey to clients. Is this a problem? Why/why not?