Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit


STATS 330

Semester 2, 2020

Statistical Modelling

STATS 330 Mid-Semester Test


Instructions:

1. Time allowed: 75 minutes.

2. Use the Rmarkdown file provided fro answering these questions.

3.  Give us your details on the front page of this Rmarkdown file.

4. You may wish to use R to do some calculations.

5. In general, show your working if required.

6. Answers only require 2-3 short, informative, sentences.

7. There are useful equations on the final page of this test.

Biologists are interested in thee autumn migration of the Wood lark (Lullula arborea - pictured) at Hanko bird observatory.  The observatory is located in southwestern Finland on a peninsula known to be a bottleneck for terrestrial bird migration in Finland’s autumn.  We use migration data from the autumn migration season (1 September–10 November) in 2007–2009.  The data consist of daily counts of actively migrating birds during the first four hours of daylight, starting from sunrise.  Biologists know from past studies that the number migrating peaks in the middle of this season.

The data set WL.df has the following variables:

number

DOY

day

year

The number recorded for that day,

The day of the year 245=1st  September, 315=10th  November,

The day of the study 1=1st  September, 71=10th  November,

The year of study 7=2007, 8=2008, and 9=2009 .

1.  Consider the following model and plots of these data.  Do these plots support the opinions of past biologists? Comment briefly.

[5 marks]

## gam model  smoother  on  day

gam.fit  <- gam(number~ s(day), family = "poisson" ,

data  = WL.df)

# plot  smoother

plot(gam.fit)

0          10         20         30         40         50         60         70

day

It the numbers migrate peak mid-way through the season form the smoothing used in the gam model.  This curve may be approximated by using a quadratic relationship when we wish to fit a parametric glm model.

2.  Consider following analyses and models which attempt to capture the features of the gam model fitted above.  Which of these models best fit these data?  Show all calculations and

explain briefly.

[5 marks]

Hint: pchisq.

## Poisson

WL.poiss=glm(number~day+I(day^2),  family=poisson,  data=WL.df)

slimSummary(WL.poiss)

##  Call:

##  glm(formula  = number  ~  day  +  I(day^2),  family  = poisson,  data  = WL.df) ##

##  Coefficients:

##                           Estimate  Std.  Error  z  value  Pr(>|z|)

##  (Intercept)  -0.693470     0.168888     -4.11       4e-05  ***

##  day                   0.173726     0.010319      16.84     <2e-16 ***

##  I(day^2)       -0.002631     0.000148    -17.73     <2e-16 ***

##  ---

##  (Dispersion parameter  for poisson  family  taken  to be  1)

##

##         Null  deviance:  2464.7    on  212   degrees  of  freedom

##  Residual  deviance:  1973.0    on  210   degrees  of  freedom

#  negative  binomial

WL.NB=glm.nb(number~day+I(day^2),  data=WL.df)

slimSummary(WL.NB)

##  Call:

##  glm.nb(formula  = number  ~  day  +  I(day^2),  data  = WL.df,  init.theta  =  0.2679468613 ##

##  Coefficients:

##                           Estimate  Std.  Error  z  value  Pr(>|z|)

##  (Intercept)  -1.380974     0.480715     -2.87     0.0041  **

##  day                   0.212087     0.030490       6.96    3.5e-12  ***

##  I(day^2)       -0.003060     0.000416     -7.35    1.9e-13  ***

##  ---

##  (Dispersion parameter  for  Negative  Binomial(0.2679)  family  taken  to be  1) ##

##         Null  deviance:  226.99    on  212   degrees  of  freedom

##  Residual  deviance:  182.56    on  210   degrees  of  freedom

# Poisson  fit

deviance(WL.poiss);  df.residual(WL.poiss)

##  [1]  1973

##  [1]  210

1-pchisq(deviance(WL.poiss),  df.residual(WL.poiss))

##  [1]  0

# NB fit

deviance(WL.NB);  df.residual(WL.NB)

##  [1]  182.56

##  [1]  210

1-pchisq(deviance(WL.NB),  df.residual(WL.NB))

##  [1]  0.91455

Clearly the model WL.NB has a deviance that is just due to random variation indicating a good fit to these data. WL.poiss does not.

3. The following are the quantised residual plots for these two models, respectively.  Explain what features of the Poisson model’s residuals that the negative binomial model addressed.

Explain briefly.

[5 marks]

# Poisson  model

plot(fitted(WL.poiss),qresiduals(WL.poiss))

0                    2                    4                    6                    8


fitted(WL.poiss)

# NB model

plot(fitted(WL.NB),qresiduals(WL.NB))

0                 2                 4                 6                 8                10


fitted(WL.NB)


The quantised residuals for WL.poiss appear to have increasing variability with increase fit and appear to be right skew (and have greater overall variability). The quantised residuals of WL.NB look approximately Normal(0, 1) - which is ideal.


4.  One of these biologists decided to fit a quasi-Poisson model (called WL.qpois) and noticed that it gave the same parameter estimates as the model WL.poiss. However, their standard errors (and subsequent output) differs.

Explain how the standard errors for the quasi-Poisson model WL.qpois differ from the stan- dard errors for the Poison model WL.poiss in terms of the estimated dispersion parameter obtained from WL.qpois

[5 marks]

#  quasi  Poisson

WL.qpoiss=glm(number~day+I(day^2),  family=quasipoisson,  data=WL.df) slimSummary(WL.qpoiss)

##  Call:

##  glm(formula  = number  ~  day  +  I(day^2),  family  =  quasipoisson, ##

##  Coefficients:

##                           Estimate  Std.  Error  t  value  Pr(>|t|)

##  (Intercept)  -0.693470     0.644359     -1.08         0.28

##  day                   0.173726     0.039371       4.41    1.6e-05  ***

##  I(day^2)       -0.002631     0.000566     -4.65    5.9e-06  ***

##  ---

##  (Dispersion parameter  for  quasipoisson  family  taken  to be  14.557) ##

##         Null  deviance:  2464.7    on  212   degrees  of  freedom

##  Residual  deviance:  1973.0    on  210   degrees  of  freedom

Hint: Estimated dispersion parameter=14.557.

The standard errors of the coefficients of WL.poiss are larger by a factor of ,14.557 = 3.8153 than those of WL.poiss. The more appropriate model WL.NB accounts for the greater variability than expected from the model WL.poiss.

summary(WL.qpoiss)$coef

##                            Estimate  Std.  Error  t  value     Pr(>|t|)

##  (Intercept)  -0.6934703  0.64435877  -1.0762  2.8306e-01

##  day                   0.1737262  0.03937094   4.4126  1.6325e-05

##  I(day^2)       -0.0026311  0.00056616  -4.6473  5.9345e-06

summary(WL.poiss)$coef

##                            Estimate  Std.  Error   z  value     Pr(>|z|)

##  (Intercept)  -0.6934703  0.16888807    -4.1061  4.0241e-05

##  day                   0.1737262  0.01031922    16.8352  1.3472e-63

##  I(day^2)       -0.0026311  0.00014839  -17.7309  2.4205e-70

summary(WL.qpoiss)$coef[,2];summary(WL.poiss)$coef[,2]

##  (Intercept)                 day        I(day^2)

##    0.64435877    0.03937094    0.00056616

##  (Intercept)                 day        I(day^2)

##    0.16888807    0.01031922    0.00014839

summary(WL.qpoiss)$coef[,2]/summary(WL.poiss)$coef[,2]

##  (Intercept)

##           3.8153


day

3.8153


I(day^2)

3.8153


sqrt(summary(WL.qpoiss)$dispersion)

##  [1]  3.8153

Consider the following information:

The quadratic equation y  = a + bx + cx2  for c  < 0 has a maximum value xmax , say, at

b

5. Use the above information to determine from the model WL.NB when the time of peak migration occurs —in terms of days starting from the 1st  of September.

[5 marks]

We have that log(µi ) = β0 + β1 day + β2 day2

βˆ1 0.21209

2βˆ2             2 × (_0.00306)

6.  Calculate the Pearson residual when day=36 in 2009 using the model WL.NB.

[5 marks]

Hints:

WL.df$number[WL.df$day==36&WL.df$year==9]

##  [1]  21

predict(WL.NB, newdata=data.frame(day=36))

##           1

##  2.2884

WL.NB$theta

##  [1]  0.26795

= exp(2.28844) = 9.85955.

r() = + = 9.85955 + = 372.65829.

and he Pearson residual is

r = = 0.57709.

#check

id= (WL.df$day==36&WL.df$year==9)

residuals.glm(WL.NB,  type= "pearson")[id]

##          178

##  0.57709

7. A mathematically inclined STATS 330 student realised that if they take the mid-point value of day (i.e. day=36) they can fit a model with fewer parameters.

They fitted the following model and calculated the log likelihood for this model:

WL.NB2=glm.nb(number~I((day-36)^2),data=WL.df)

logLik(WL.NB2)

##  'log  Lik.'  -429.23  (df=3)


The log-likelihood statistic for WL.NB is:

logLik(WL.NB)

##  'log  Lik.'  -428.6  (df=4)

Calculate the AIC for each of these two models and decide which model is best based on the AIC criteria.

[5 marks]

AIC = _2loglik + 2p where p is the number of parameters estimated (including dispersion parameters) , and so

AIC(WL.NB) = 2 × (428.6045) + 2 × 4 = 865.21.

AIC(WL.NB2) = 2 × (429.22605) + 2 × 3 = 864.45. So we choose WL.NB2 as it has a lower AIC (just).

#  check  via R

AIC(WL.NB)

##  [1]  865.21

AIC(WL.NB2)

##  [1]  864.45


A biologist wished to ascertain when when the number of migrating birds was zero and decided to fit the following model.  It was of particular importance to determine at what time periods there was less than a 50% chance of observing at least one bird.

WL.bin.gam=gam(I(number==0)~s(day),family=binomial,  data=WL.df) plot(WL.bin.gam)









<