STATS 330 Mid-Semester Test Semester 2, 2020
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
STATS 330
Semester 2, 2020
Statistical Modelling
STATS 330 Mid-Semester Test
Instructions:
1. Time allowed: 75 minutes.
2. Use the Rmarkdown file provided fro answering these questions.
3. Give us your details on the front page of this Rmarkdown file.
4. You may wish to use R to do some calculations.
5. In general, show your working if required.
6. Answers only require 2-3 short, informative, sentences.
7. There are useful equations on the final page of this test.
Biologists are interested in thee autumn migration of the Wood lark (Lullula arborea - pictured) at Hanko bird observatory. The observatory is located in southwestern Finland on a peninsula known to be a bottleneck for terrestrial bird migration in Finland’s autumn. We use migration data from the autumn migration season (1 September–10 November) in 2007–2009. The data consist of daily counts of actively migrating birds during the first four hours of daylight, starting from sunrise. Biologists know from past studies that the number migrating peaks in the middle of this season.
The data set WL.df has the following variables:
number
DOY
day
year
The number recorded for that day,
The day of the year 245=1st September, 315=10th November,
The day of the study 1=1st September, 71=10th November,
The year of study 7=2007, 8=2008, and 9=2009 .
1. Consider the following model and plots of these data. Do these plots support the opinions of past biologists? Comment briefly.
[5 marks]
## gam model smoother on day
gam.fit <- gam(number~ s(day), family = "poisson" ,
data = WL.df)
# plot smoother
plot(gam.fit)
0 10 20 30 40 50 60 70
day
It the numbers migrate peak mid-way through the season form the smoothing used in the gam model. This curve may be approximated by using a quadratic relationship when we wish to fit a parametric glm model.
2. Consider following analyses and models which attempt to capture the features of the gam model fitted above. Which of these models best fit these data? Show all calculations and
explain briefly.
[5 marks]
Hint: pchisq.
## Poisson
WL.poiss=glm(number~day+I(day^2), family=poisson, data=WL.df)
slimSummary(WL.poiss)
## Call:
## glm(formula = number ~ day + I(day^2), family = poisson, data = WL.df) ##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.693470 0.168888 -4.11 4e-05 ***
## day 0.173726 0.010319 16.84 <2e-16 ***
## I(day^2) -0.002631 0.000148 -17.73 <2e-16 ***
## ---
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2464.7 on 212 degrees of freedom
## Residual deviance: 1973.0 on 210 degrees of freedom
# negative binomial
WL.NB=glm.nb(number~day+I(day^2), data=WL.df)
slimSummary(WL.NB)
## Call:
## glm.nb(formula = number ~ day + I(day^2), data = WL.df, init.theta = 0.2679468613 ##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.380974 0.480715 -2.87 0.0041 **
## day 0.212087 0.030490 6.96 3.5e-12 ***
## I(day^2) -0.003060 0.000416 -7.35 1.9e-13 ***
## ---
## (Dispersion parameter for Negative Binomial(0.2679) family taken to be 1) ##
## Null deviance: 226.99 on 212 degrees of freedom
## Residual deviance: 182.56 on 210 degrees of freedom
# Poisson fit
deviance(WL.poiss); df.residual(WL.poiss)
## [1] 1973
## [1] 210
1-pchisq(deviance(WL.poiss), df.residual(WL.poiss))
## [1] 0
# NB fit
deviance(WL.NB); df.residual(WL.NB)
## [1] 182.56
## [1] 210
1-pchisq(deviance(WL.NB), df.residual(WL.NB))
## [1] 0.91455
Clearly the model WL.NB has a deviance that is just due to random variation indicating a good fit to these data. WL.poiss does not.
3. The following are the quantised residual plots for these two models, respectively. Explain what features of the Poisson model’s residuals that the negative binomial model addressed.
Explain briefly.
[5 marks]
# Poisson model
plot(fitted(WL.poiss),qresiduals(WL.poiss))
0 2 4 6 8
fitted(WL.poiss)
# NB model
plot(fitted(WL.NB),qresiduals(WL.NB))
0 2 4 6 8 10
fitted(WL.NB)
The quantised residuals for WL.poiss appear to have increasing variability with increase fit and appear to be right skew (and have greater overall variability). The quantised residuals of WL.NB look approximately Normal(0, 1) - which is ideal.
4. One of these biologists decided to fit a quasi-Poisson model (called WL.qpois) and noticed that it gave the same parameter estimates as the model WL.poiss. However, their standard errors (and subsequent output) differs.
Explain how the standard errors for the quasi-Poisson model WL.qpois differ from the stan- dard errors for the Poison model WL.poiss in terms of the estimated dispersion parameter obtained from WL.qpois
[5 marks]
# quasi Poisson
WL.qpoiss=glm(number~day+I(day^2), family=quasipoisson, data=WL.df) slimSummary(WL.qpoiss)
## Call:
## glm(formula = number ~ day + I(day^2), family = quasipoisson, ##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.693470 0.644359 -1.08 0.28
## day 0.173726 0.039371 4.41 1.6e-05 ***
## I(day^2) -0.002631 0.000566 -4.65 5.9e-06 ***
## ---
## (Dispersion parameter for quasipoisson family taken to be 14.557) ##
## Null deviance: 2464.7 on 212 degrees of freedom
## Residual deviance: 1973.0 on 210 degrees of freedom
Hint: Estimated dispersion parameter=14.557.
The standard errors of the coefficients of WL.poiss are larger by a factor of ,14.557 = 3.8153 than those of WL.poiss. The more appropriate model WL.NB accounts for the greater variability than expected from the model WL.poiss.
summary(WL.qpoiss)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6934703 0.64435877 -1.0762 2.8306e-01
## day 0.1737262 0.03937094 4.4126 1.6325e-05
## I(day^2) -0.0026311 0.00056616 -4.6473 5.9345e-06
summary(WL.poiss)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6934703 0.16888807 -4.1061 4.0241e-05
## day 0.1737262 0.01031922 16.8352 1.3472e-63
## I(day^2) -0.0026311 0.00014839 -17.7309 2.4205e-70
summary(WL.qpoiss)$coef[,2];summary(WL.poiss)$coef[,2]
## (Intercept) day I(day^2)
## 0.64435877 0.03937094 0.00056616
## (Intercept) day I(day^2)
## 0.16888807 0.01031922 0.00014839
summary(WL.qpoiss)$coef[,2]/summary(WL.poiss)$coef[,2]
## (Intercept)
## 3.8153
day
3.8153
I(day^2)
3.8153
sqrt(summary(WL.qpoiss)$dispersion)
## [1] 3.8153
Consider the following information:
The quadratic equation y = a + bx + cx2 for c < 0 has a maximum value xmax , say, at
b
5. Use the above information to determine from the model WL.NB when the time of peak migration occurs —in terms of days starting from the 1st of September.
[5 marks]
We have that log(µi ) = β0 + β1 day + β2 day2
βˆ1 0.21209
2βˆ2 2 × (_0.00306)
6. Calculate the Pearson residual when day=36 in 2009 using the model WL.NB.
[5 marks]
Hints:
WL.df$number[WL.df$day==36&WL.df$year==9]
## [1] 21
predict(WL.NB, newdata=data.frame(day=36))
## 1
## 2.2884
WL.NB$theta
## [1] 0.26795
= exp(2.28844) = 9.85955.
r() = + = 9.85955 + = 372.65829.
and he Pearson residual is
r = = 0.57709.
#check
id= (WL.df$day==36&WL.df$year==9)
residuals.glm(WL.NB, type= "pearson")[id]
## 178
## 0.57709
7. A mathematically inclined STATS 330 student realised that if they take the mid-point value of day (i.e. day=36) they can fit a model with fewer parameters.
They fitted the following model and calculated the log likelihood for this model:
WL.NB2=glm.nb(number~I((day-36)^2),data=WL.df)
logLik(WL.NB2)
## 'log Lik.' -429.23 (df=3)
The log-likelihood statistic for WL.NB is:
logLik(WL.NB)
## 'log Lik.' -428.6 (df=4)
Calculate the AIC for each of these two models and decide which model is best based on the AIC criteria.
[5 marks]
AIC = _2loglik + 2p where p is the number of parameters estimated (including dispersion parameters) , and so
AIC(WL.NB) = 2 × (428.6045) + 2 × 4 = 865.21.
AIC(WL.NB2) = 2 × (429.22605) + 2 × 3 = 864.45. So we choose WL.NB2 as it has a lower AIC (just).
# check via R
AIC(WL.NB)
## [1] 865.21
AIC(WL.NB2)
## [1] 864.45
A biologist wished to ascertain when when the number of migrating birds was zero and decided to fit the following model. It was of particular importance to determine at what time periods there was less than a 50% chance of observing at least one bird.
WL.bin.gam=gam(I(number==0)~s(day),family=binomial, data=WL.df) plot(WL.bin.gam)
2022-01-19