
Understanding Data and Statistical Design (60117)

Assessment Task 2: Data Analysis Assignment

Spring 2022

Q1 & Q2 DATA

The data for Q1 and Q2 is contained in the file “q1q2data.csv”. The variables in this file are summarised in the table below.

Name      Type                  Description
poison    experimental factor   type of poison (1-3)
therapy   experimental factor   therapy administered to treat poison (1-4)
time      response              survival time of animal (10s of hours)

The data records the survival time (variable time) of animals randomly allocated a type of poison (variable poison) and randomly allocated a medical therapy to treat the poison (variable therapy).

To read the data into R, run the getwd() function and save the CSV file in the location returned. Alternatively, use the setwd() function to point R to the location where the CSV file is saved. Then run the line of code below.

q1q2.data <- read.csv("q1q2data.csv", header=TRUE,
                      colClasses=c("factor","factor","numeric"))

QUESTION 1. Observational experiment [14 marks]

In this question we assess the survival time (variable time) of animals administered a variety of poisons. The statistical model for the analysis is

$\text{time}_n = \mu + \varepsilon_n, \quad n \in \{1, 2, \dots, 48\},$

where

•   $\text{time}_n$ is the survival time of the n-th animal

•   $\mu$ is the population mean time

•   $\varepsilon_n$ is the random effect on time of the n-th animal.

(a) Construct a histogram  of time and superimpose over this a normal density curve fitted to the sample [2 marks]. Citing evidence from the plot, determine if the sample looks to be approximately normally distributed [2 marks].

From the histogram and the fitted normal density curve, the sample does not look normally distributed: the right tail is much longer than the left, so the data are skewed to the right.
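The plot described in part (a) can be produced along the following lines. This is a minimal sketch, not the original solution code: the function name hist_with_normal and the axis label are hypothetical, and the normal curve is assumed to be fitted by plugging in the sample mean and standard deviation.

```r
# Density-scaled histogram of a numeric sample with a normal curve overlaid.
# The curve uses the sample mean and standard deviation as its parameters.
hist_with_normal <- function(x, xlab = "time (10s of hours)") {
  m <- mean(x)
  s <- sd(x)
  h <- hist(x, freq = FALSE, xlab = xlab,
            main = "Histogram with fitted normal density")
  grid <- seq(min(x), max(x), length.out = 200)
  lines(grid, dnorm(grid, mean = m, sd = s), lwd = 2)
  invisible(h)  # return the histogram object without printing it
}
```

With the data loaded as above, the call would be hist_with_normal(q1q2.data$time). Using freq = FALSE puts the histogram on the density scale so that the bars and the curve are comparable.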

 

(b) Using significance level α = 0.05, perform a test to determine if the population mean time of survival is greater than 4.2 hours. Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].

Hypotheses

$H_0: \mu = 0.42$

$H_A: \mu > 0.42$

(time is recorded in tens of hours, so 4.2 hours corresponds to 0.42)

Test Statistics

The test statistic is t = 1.7593 with p-value = 0.04252.

Test Decision

Reject the null hypothesis as p = 0.04252 < 0.05.

Conclusion

There is sufficient evidence that the population mean survival time is greater than 4.2 hours.
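The test in part (b) is an upper-tailed one-sample t-test, which base R provides through t.test(). The wrapper below is only an illustrative sketch; the name mean_greater_test and the shape of its return value are assumptions, not part of the original solution.

```r
# Upper-tailed one-sample t-test of H0: mu = mu0 against HA: mu > mu0.
mean_greater_test <- function(x, mu0, alpha = 0.05) {
  res <- t.test(x, mu = mu0, alternative = "greater",
                conf.level = 1 - alpha)
  list(statistic = unname(res$statistic),  # t = (mean - mu0) / (s / sqrt(n))
       p.value   = res$p.value,
       reject    = res$p.value < alpha)    # TRUE means reject H0
}
```

For this question the call would be mean_greater_test(q1q2.data$time, mu0 = 0.42).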

 

(c) From the R output for part (b) you will have noticed the 95% confidence interval 0.42297 ≤ μ < ∞ .

Verify this is correct by performing your own calculation [2 marks].

One-sided confidence interval for the mean

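$\mu > \bar{X} - t_{\alpha}(n-1)\,\dfrac{s}{\sqrt{n}}$

where $\bar{X}$ is the sample mean, $s$ is the sample standard deviation, $n = 48$ and $t_{\alpha}(n-1)$ is the upper-α critical value of the t distribution with n − 1 degrees of freedom.

Thus, the 95% confidence interval is from 0.42297 to ∞.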
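The hand calculation for part (c) can be checked with a few lines of R. This sketch assumes the usual lower bound for a one-sided t interval (sample mean minus the critical value times the standard error); the function name one_sided_lower_bound is hypothetical.

```r
# Lower bound of the one-sided confidence interval [bound, Inf) for a mean.
one_sided_lower_bound <- function(x, conf = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                 # standard error of the mean
  mean(x) - qt(conf, df = n - 1) * se   # qt(0.95, n - 1) is the upper 5% point
}
```

Calling one_sided_lower_bound(q1q2.data$time) should agree with the lower limit that t.test() reports when alternative = "greater".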

(d) Using significance level α = 0.05, perform a test to determine if the population median time of survival is different to 5.3 hours. Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].

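Hypotheses

$H_0: M = 0.53$

$H_A: M \neq 0.53$

where M denotes the population median survival time (5.3 hours corresponds to 0.53 in tens of hours).

Test Statistics

The test statistic is 416 with p-value = 0.07849.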

Test Decision

Retain the null hypothesis as p = 0.07849 > 0.05.

Conclusion

There is not enough evidence that population median time of survival is different to 5.3 hours.
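The solution does not name the test used in part (d). One common choice for a hypothesis about a median is the Wilcoxon signed-rank test, available in base R as wilcox.test(); the sketch below assumes that choice, and the function name median_two_sided_test is hypothetical.

```r
# Two-sided Wilcoxon signed-rank test of H0: median = m0.
# Assumption: this is the test behind the reported statistic; the source
# does not say which test was run.
median_two_sided_test <- function(x, m0, alpha = 0.05) {
  res <- wilcox.test(x, mu = m0, alternative = "two.sided")
  list(V       = unname(res$statistic),  # sum of ranks of positive differences
       p.value = res$p.value,
       reject  = res$p.value < alpha)
}
```

For this question the call would be median_two_sided_test(q1q2.data$time, m0 = 0.53).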

 

QUESTION 2. Two-factor experiment [16 marks]

In this question we continue the analysis from Q1, but this time also considering the factors poison and therapy.

(a) Write down the statistical model for a 3 × 4 factorial experiment that could give rise to the sample data we are considering, excluding interaction between the factors [2 marks]. Identify the experimental units [2 marks].

In this study, a 3 × 4 factorial completely randomized design (CRD) experiment was used, with 12 treatments repeated four times each, and a total of 48 observations.

The statistical model is described as

$\text{time}_{i,j,n} = \mu + \alpha_i + \beta_j + \varepsilon_{i,j,n}$

where

•    $i \in \{1,2,3\}$, $j \in \{1,2,3,4\}$, $n \in \{1,2,3,4\}$

•    $\text{time}_{i,j,n}$ is the survival time of the animal in the n-th replicate with poison i and therapy j

•    $\mu$ is the global mean time

•    $\alpha_i$ is the treatment effect on time of poison i

•    $\beta_j$ is the treatment effect on time of therapy j

•    $\varepsilon_{i,j,n}$ is the random effect on time in the n-th replicate with poison i and therapy j.

The components of the experiment design:

•    experimental factor A – type of poison (variable poison) which has 3 levels

•    experimental factor B – therapy administered to treat poison (variable therapy) which has 4 levels

•    treatments – the 12 combinations of levels of each factor

•    experimental units – the 48 animals, each randomly allocated one of the 12 treatments (4 animals per treatment)

•    measurement units – the same 48 animals, since survival time is measured once on each animal

•    response variable – survival time of the animal (variable time)

(b) Using significance level α = 0.05, perform two-way ANOVA (without interaction) and document the F-test for the factor poison. Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].

Hypotheses

$H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$

$H_A$: at least one $\alpha_i \neq 0$

Test Statistics

The test statistic is F = 20.86 with p-value = $5.11 \times 10^{-7}$.

Test Decision

Reject the null hypothesis as p < 0.05.

Conclusion

There is strong evidence that at least one poison affects the survival time of animals differently than others.
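The additive two-way ANOVA in part (b) can be fitted with aov(). The following is a sketch under the assumption that the data frame has the columns poison, therapy and time described in the table above; the function name additive_anova is hypothetical.

```r
# Two-way ANOVA without interaction: time ~ poison + therapy.
additive_anova <- function(data) {
  fit <- aov(time ~ poison + therapy, data = data)
  tab <- summary(fit)[[1]]   # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
  list(fit = fit, table = tab)
}
```

With the assignment data, additive_anova(q1q2.data)$table prints one row per factor plus the residual row; the poison row carries the F statistic and p-value used in the test.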

 

(c) Using significance level α = 0.05, document a normality test on the residuals for the analysis in part (b). Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].

Hypotheses

$H_0$: The residuals $\hat{\varepsilon}_{i,j,n}$ are normally distributed

$H_A$: The residuals $\hat{\varepsilon}_{i,j,n}$ are not normally distributed

Test Statistics

The test statistic is W = 0.92202 with p-value = 0.003506.

Test Decision

Reject the null hypothesis as p = 0.003506 < 0.05.

Conclusion

There is strong evidence that the residuals are not from a normal distribution.
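The statistic W reported above is the form produced by the Shapiro-Wilk test, which base R exposes as shapiro.test(). The helper below is a small sketch that applies it to the residuals of any fitted model; the name residual_normality_test is hypothetical.

```r
# Shapiro-Wilk normality test applied to the residuals of a fitted model.
residual_normality_test <- function(fit) {
  res <- shapiro.test(residuals(fit))
  c(W = unname(res$statistic), p.value = res$p.value)
}
```

It would be called on the model from part (b), for example residual_normality_test(aov(time ~ poison + therapy, data = q1q2.data)).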

 

(d) Using significance level α = 0.05, perform Tukey post-hoc analysis on the factor therapy and determine which levels have statistically different means [2 marks].

The mean survival times under therapy 3 and therapy 4 are each statistically different from the mean under therapy 1.

The Tukey-adjusted p-values for the comparisons therapy 1 vs therapy 3 and therapy 1 vs therapy 4 are both less than 0.05, so these differences are statistically significant.
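Tukey's comparisons for therapy can be read off TukeyHSD() applied to the additive model. The sketch below keeps only the pairs whose adjusted p-value falls below the significance level; the function name therapy_tukey is hypothetical, and the column names poison, therapy and time are taken from the data description.

```r
# Tukey HSD for the therapy factor in the additive two-way model,
# returning only the pairs with an adjusted p-value below 1 - conf.
therapy_tukey <- function(data, conf = 0.95) {
  fit <- aov(time ~ poison + therapy, data = data)
  tk  <- TukeyHSD(fit, which = "therapy", conf.level = conf)$therapy
  keep <- tk[, "p adj"] < 1 - conf
  as.data.frame(tk)[keep, , drop = FALSE]   # columns: diff, lwr, upr, p adj
}
```

Row names such as "3-1" identify the pair being compared, and the diff column gives the estimated difference in mean survival time.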

 

(e) Using diagnostic plots of the residuals, assess whether the assumptions of independence and constant variance have been met [2 marks].

Independence.

There are no obvious patterns in the Residuals vs Fitted plot, so no problem with this assumption.

Constant variance.

The spread of the residuals in the Residuals vs Fitted plot appears to increase as the fitted values increase, indicating a potential problem with this assumption.
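The diagnostic plots referred to in part (e) come from the plot() method for fitted linear models, which also works on aov objects. A minimal sketch follows; the function name residual_plots is hypothetical, and the Scale-Location panel is an optional addition that is not mentioned in the answer above.

```r
# Residual diagnostics for a fitted lm or aov object, side by side.
residual_plots <- function(fit) {
  op <- par(mfrow = c(1, 2))
  on.exit(par(op))        # restore the previous plotting layout on exit
  plot(fit, which = 1)    # Residuals vs Fitted: patterns and changing spread
  plot(fit, which = 3)    # Scale-Location: a second view of the spread
  invisible(fit)
}
```

A funnel shape in the first panel, or an upward trend in the second, points to non-constant variance.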

 

Q3 & Q4 DATA

The data for Q3 and Q4 is contained in the file “q3q4data.csv”. The variables in this file are summarised in the table below.

Name      Type                    Description
river     categorical predictor   0 (Lumber), 1 (Waccamaw)
length    continuous predictor    length of fish (cm)
weight    continuous predictor    weight of fish (g)
mercury   continuous response     mercury concentration (ppm)

The data records mercury concentration and attributes of fish caught in two rivers in North Carolina.

To read the data into R, run the getwd() function and save the CSV file in the location returned. Alternatively, use the setwd() function to point R to the location where the CSV file is saved. Then run the line of code below.

q3q4.data <- read.csv("q3q4data.csv", header=TRUE,
                      colClasses=c("factor",rep("numeric",times=3)))

QUESTION 3. Simple linear regression [14 marks]

In this question we build a simple linear regression to model the relationship between mercury and length. We consider the population model

$\text{mercury} = \beta_0 + \beta_l \cdot \text{length} + \varepsilon$

where $\operatorname{var}(\varepsilon) = \sigma^2$.

(a) Fit the model described above, write down the regression equation [1 mark] and calculate the predicted average mercury level of a fish with length equal to the 0.75 quantile of the sample of length [2 marks].

Regression equation

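$\widehat{\text{mercury}}(\text{length}) = -1.1316 + 0.0581 \cdot \text{length}$

$\widehat{\text{mercury}}(\text{length} = 0) = -1.1316 + 0.0581 \cdot 0 = -1.1316$

$\widehat{\text{mercury}}(\text{length} + 1) = -1.1316 + 0.0581 \cdot (\text{length} + 1)$

$= -1.1316 + 0.0581 \cdot \text{length} + 0.0581$

$= \widehat{\text{mercury}}(\text{length}) + 0.0581$

The 0.75 quantile of length is 46.2 cm, so

$0.0581 \cdot 46.2 - 1.1316 = 1.5526$

Therefore, from the regression equation, the predicted average mercury level of a fish whose length equals the 0.75 quantile is 1.5526 ppm.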
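The fit and the prediction at the 0.75 quantile can be obtained directly from lm() and predict(). The sketch below assumes the column names mercury and length from the data description; the function name predict_at_length_quantile is hypothetical.

```r
# Fit mercury ~ length and predict mean mercury at a chosen quantile of length.
predict_at_length_quantile <- function(data, prob = 0.75) {
  fit <- lm(mercury ~ length, data = data)
  q   <- unname(quantile(data$length, probs = prob))
  pred <- predict(fit, newdata = data.frame(length = q))
  c(length = q, mercury = unname(pred))
}
```

With the assignment data the call would be predict_at_length_quantile(q3q4.data). Using predict() avoids rounding error from typing in rounded coefficients by hand.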

 

 

(b) Write down the model's estimate of $\sigma^2$ [2 marks].

According to the regression output, the residual standard error is 0.5805 on 169 degrees of freedom, so the estimate of the error variance is:

$\hat{\sigma}^2 = 0.5805^2 = 0.33698$

(c) Using 0.05 significance level, test whether average mercury level increases by less than 0.065ppm for each additional centimetre of fish length. Write down the null and alternative hypotheses [1 mark], the test statistic [1 mark], the test decision with reason  [1 mark] and a  conclusion using a minimum of mathematical language  [1 mark].

Hypotheses

$H_0: \beta_l \geq 0.065$

$H_A: \beta_l < 0.065$

Test Statistics

$t = \dfrac{0.058127 - 0.065}{0.213615} = -0.0321747$

Test Decision

Retain $H_0$, as this is a lower-tailed test and $t = -0.0322 > -t_{0.95}(169) = -1.654$.

Conclusion

There is not strong evidence that average mercury level increases by less than 0.065ppm for each additional centimetre of fish length.
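A lower-tailed test of a slope against a non-zero reference value is not printed by summary(), but it can be assembled from the coefficient table. This is a sketch under the assumption that the usual t distribution with the residual degrees of freedom applies; the function name slope_less_than_test and its arguments are hypothetical.

```r
# Lower-tailed t-test of H0: beta >= b0 against HA: beta < b0 for one term.
slope_less_than_test <- function(fit, term, b0, alpha = 0.05) {
  est <- coef(summary(fit))[term, ]
  t   <- unname((est["Estimate"] - b0) / est["Std. Error"])
  df  <- df.residual(fit)
  crit <- qt(alpha, df)                 # negative critical value
  list(t = t, critical = crit,
       p.value = pt(t, df),             # lower-tail probability
       reject = t < crit)
}
```

For this question the call would be slope_less_than_test(lm(mercury ~ length, data = q3q4.data), "length", b0 = 0.065).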

(d) Using appropriate diagnostic plots, determine if the modelling assumptions appear to have been satisfied [3 marks].

Normality

The Normal Q-Q plot shows the residuals tracking the line representing normality, indicating compliance with this assumption.

Constant variance

The Residuals vs Fitted plot shows the range of the residuals to be fairly consistent, indicating a compliance with this assumption.

Independence

The Residuals vs Fitted plot shows no obvious patterns in the residuals indicating compliance with this assumption.

 

(e) Is there any statistical evidence of autocorrelation in the residuals [2 marks]?

There is no strong evidence of autocorrelation in the residuals as the DW statistic is between 1 and 3.
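The Durbin-Watson (DW) statistic itself is simple to compute from the residuals in base R. The sketch below does only that; the function name durbin_watson is hypothetical. A p-value needs an add-on package, for example dwtest() in lmtest if it is installed.

```r
# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares. It lies between 0 and 4;
# values near 2 indicate little first-order autocorrelation.
durbin_watson <- function(fit) {
  e <- residuals(fit)
  sum(diff(e)^2) / sum(e^2)
}
```

The statistic depends on the order of the observations, so it is only meaningful when the rows of the data follow a natural sequence.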

 

QUESTION 4. Multiple linear regression [16 marks]

In this question we extend the model from Q3 into a multiple linear regression.

(a) Create a scatterplot of the variables mercury and weight and colour code the plot according to levels of river [2 marks]. Discuss the need for an interaction term between the predictors river and weight [2 marks].

There is some evidence of different slopes according to river, suggesting the need for an interaction term.

 

We now consider the population model

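$\text{mercury} = \beta_0 + \gamma \cdot \text{river1} + \beta_l \cdot \text{length} + \beta_w \cdot \text{weight} + \delta \cdot \text{river1} \cdot \text{weight} + \varepsilon$

where

$\text{river1} = \begin{cases} 1 & \text{if river} = 1 \text{ (Waccamaw)} \\ 0 & \text{if river} = 0 \text{ (Lumber)} \end{cases}$

Note that R will create the dummy variable river1 automatically.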

(b) Fit the model described above, write down the regression that applies for the Lumber River [1 mark] and provide interpretations of the estimated coefficients $\hat{\beta}_0$ and $\hat{\delta}$ [2 marks].

Regression

$\widehat{\text{mercury}} = -0.96452 + 0.05724 \cdot \text{length} - 0.00018 \cdot \text{weight}$

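The coefficient $\hat{\beta}_0 = -0.96452$ is the predicted mercury for a Lumber River fish when length = weight = 0.

The coefficient $\hat{\delta} = 0.00032$ is the estimated difference between the Waccamaw and Lumber rivers in the change in mean mercury per additional gram of weight, with length held fixed. For the Waccamaw River the weight slope is therefore −0.00018 + 0.00032 = 0.00014.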
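The per-river equations follow from the fitted coefficients: the Lumber River uses the baseline terms, while the Waccamaw River adds the river1 and river1:weight terms. The sketch below assumes the column names from the data description and a river factor with levels "0" and "1"; the function names fit_river_model and river_equations are hypothetical.

```r
# Multiple regression with a river-by-weight interaction.
fit_river_model <- function(data) {
  lm(mercury ~ river + length + weight + river:weight, data = data)
}

# Intercept and slopes that apply within each river.
river_equations <- function(fit) {
  b <- coef(fit)
  rbind(
    Lumber   = c(intercept = b[["(Intercept)"]],
                 length    = b[["length"]],
                 weight    = b[["weight"]]),
    Waccamaw = c(intercept = b[["(Intercept)"]] + b[["river1"]],
                 length    = b[["length"]],
                 weight    = b[["weight"]] + b[["river1:weight"]])
  )
}
```

With the assignment data, river_equations(fit_river_model(q3q4.data)) returns one row of coefficients per river.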

 

(c) Using 0.05 significance level, determine if the interaction term is significant. Write down the null and alternative hypotheses [1 mark], the test statistic [1 mark], the test decision with reason [1 mark] and a conclusion using a minimum of mathematical language [1 mark].

Hypotheses

$H_0: \delta = 0$

$H_A: \delta \neq 0$

Test statistic

$t = \dfrac{0.0003207}{0.0001026} = 3.12573$

Test decision

Reject $H_0$, as this is a two-sided test and $|t| = 3.126 > t_{0.975}(166) \approx 1.97$.

Conclusion

At the 0.05 significance level, there is strong evidence that the relationship between weight and mercury differs between the two rivers, so the interaction term is significant.

(d) Calculate the predicted average mercury level for a fish of length 37.9cm and weight 607g caught in the Waccamaw River and the associated 95% two-sided mean confidence interval [2 marks]. You will need to construct a data frame containing this new data point.

For a fish of length 37.9 cm and weight 607 g caught in the Waccamaw River:

Fitted value = 1.4378453 ppm

95% confidence interval for the mean: [1.25944, 1.6162484]
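The fitted value and its mean confidence interval come from predict() with interval = "confidence" applied to a one-row data frame. The sketch below assumes a model fitted with the predictors river, length and weight; the function name mean_mercury_ci is hypothetical.

```r
# Fitted mean and confidence interval for one new fish.
# The river value is converted to a factor with the same levels as in the fit.
mean_mercury_ci <- function(fit, river, length, weight, level = 0.95) {
  new_fish <- data.frame(
    river  = factor(river, levels = fit$xlevels$river),
    length = length,
    weight = weight
  )
  predict(fit, newdata = new_fish, interval = "confidence", level = level)
}
```

For this question the call would be mean_mercury_ci(fit, river = "1", length = 37.9, weight = 607), where fit is the interaction model. Choosing interval = "prediction" instead would give the wider interval for an individual fish rather than for the mean.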

 

Below are diagnostic plots of the residuals for the model fitted above.

 

We see that the modelling assumptions have not been satisfied.

Sometimes transforming the response variable and fitting a model with the transformed response can result in a model that does satisfy the assumptions.

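Here we take the response variable mercury to the power of 1/5 and consider the population model

$\text{mercury}^{1/5} = \beta_0 + \gamma \cdot \text{river1} + \beta_l \cdot \text{length} + \beta_w \cdot \text{weight} + \delta \cdot \text{river1} \cdot \text{weight} + \varepsilon$

where

$\text{river1} = \begin{cases} 1 & \text{if river} = 1 \text{ (Waccamaw)} \\ 0 & \text{if river} = 0 \text{ (Lumber)} \end{cases}$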

(e) Fit the model described just above, write down the fitted regression equation for the Lumber River [1 mark] and produce diagnostic plots of the residuals [1 mark]. Have the modelling assumptions been satisfied for this model [1 mark]?

Regression equation

$\widehat{\text{mercury}^{1/5}} = -0.5359 + 0.01341 \cdot \text{length} - 0.00006249 \cdot \text{weight}$

 

Normality

The Normal Q-Q plot shows the residuals tracking the line representing normality, indicating compliance with this assumption.

Constant variance

The Residuals vs Fitted plot shows the range of the residuals to be fairly consistent, indicating a compliance with this assumption.

Independence

The Residuals vs Fitted plot shows no obvious patterns in the residuals indicating compliance with this assumption.
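The transformed model in part (e) can be fitted by wrapping the response in I() inside the formula, and the same two diagnostic plots can then be redrawn. This is a sketch assuming the column names from the data description; the function name fifth_root_diagnostics is hypothetical.

```r
# Fit the interaction model with mercury^(1/5) as the response and draw
# the Residuals vs Fitted and Normal Q-Q plots.
fifth_root_diagnostics <- function(data) {
  fit <- lm(I(mercury^(1/5)) ~ river + length + weight + river:weight,
            data = data)
  op <- par(mfrow = c(1, 2))
  on.exit(par(op))        # restore the previous plotting layout on exit
  plot(fit, which = 1)    # Residuals vs Fitted
  plot(fit, which = 2)    # Normal Q-Q
  invisible(fit)
}
```

Fitted values from this model are on the fifth-root scale, so they need to be raised to the fifth power before being read as ppm.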