MAT00042M Advanced Regression Analysis 2020/21
MAT00042M
MMath and MSc Examinations 2020/21
Advanced Regression Analysis
1 (of 3). An epidemic has occurred across two countries, country A and country B. In a study over the three years of the epidemic, each country recorded $n = 36$ independent monthly observations on the number of cases. These are denoted by $y_1, \ldots, y_n$ in country A, and by $y'_1, \ldots, y'_n$ in country B.
In country A, each outcome random variable $Y_i$, $i = 1, \ldots, n$, is assumed to follow a Poisson distribution $\mathrm{Poi}(\lambda_i)$, where $\lambda_1, \ldots, \lambda_n$ are unknown parameters. Similarly, in country B, each outcome random variable $Y'_i$, $i = 1, \ldots, n$, is assumed to follow a Poisson distribution $\mathrm{Poi}(\lambda'_i)$, where $\lambda'_1, \ldots, \lambda'_n$ are unknown parameters.
In each country, the study is investigating the association between the number of cases (defined above) and the number of months that passed since the outbreak, denoted by $x_i = i$, with $i = 1, 2, \ldots, 36$ (see Figure 1).
[Figure 1 plots not reproduced: two panels titled "Country A" and "Country B", each with horizontal axis "Recorded month" (0 to 35).]
Figure 1: Left: Observations for country A (dots) and fitted model modA (solid curve); Right: Observations for country B (dots) and fitted model modB.3 (solid curve). Both models will be introduced in Question 2.
(a) Recall that in general, the Poisson probability mass function associated with $Y \sim \mathrm{Poi}(\lambda)$ is given by $P(Y = y) = \lambda^y e^{-\lambda} / y!$, $y = 0, 1, 2, \ldots$. Show that the probability function above belongs to the exponential family of distributions. Using the formulae for the mean and variance of a random variable following the exponential family of distributions, find the mean and variance for the Poisson distribution. [10]
(b) For country A, write down the generalised linear model (GLM) modA defined using a log-link function and a linear predictor $\eta_i = \beta_0 + \beta_1 i$, for each month $i$. Derive the log-likelihood associated with model modA and write down the equations that lead to the maximum likelihood estimators of the unknown parameters, $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)^T$ (but do not attempt to solve these). How would you obtain the estimated expected number of cases for month $i$ in country A, once you have obtained $\hat{\beta}$ under this model? [9]
(c) Show that the deviance for model modA in part (b) has the form $D_{\mathrm{modA}} = 2 \sum_{i=1}^{n} y_i \log(y_i / \hat{\mu}_i)$, where $\mu_i = E(Y_i)$ and $\hat{\mu}_i$ denotes its estimator under model modA. [10]
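One possible outline of the derivations requested in parts (a) to (c) is sketched below. It assumes the exponential family is written in the common form with canonical parameter $\theta$, cumulant function $b(\theta)$, dispersion $\phi$ and normalising term $c(y, \phi)$; this notation is an assumption and may differ from the one used in the course.

```latex
% (a) Poisson pmf in exponential family form (assumed parametrisation)
P(Y = y) = \exp\{ y \log\lambda - \lambda - \log y! \}
         = \exp\left\{ \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right\},
\qquad \theta = \log\lambda, \quad b(\theta) = e^{\theta}, \quad \phi = 1, \quad c(y, \phi) = -\log y!

E(Y) = b'(\theta) = e^{\theta} = \lambda,
\qquad \mathrm{Var}(Y) = \phi\, b''(\theta) = e^{\theta} = \lambda

% (b) Model modA: Y_i ~ Poi(mu_i) independently, with log(mu_i) = eta_i = beta_0 + beta_1 i
\ell(\beta) = \sum_{i=1}^{n} \left\{ y_i (\beta_0 + \beta_1 i) - e^{\beta_0 + \beta_1 i} - \log y_i! \right\}

\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} \left( y_i - e^{\beta_0 + \beta_1 i} \right) = 0,
\qquad
\frac{\partial \ell}{\partial \beta_1} = \sum_{i=1}^{n} i \left( y_i - e^{\beta_0 + \beta_1 i} \right) = 0

\hat{\mu}_i = \exp(\hat{\beta}_0 + \hat{\beta}_1 i)

% (c) Deviance: twice the log-likelihood gap between the saturated model (mu_i = y_i) and modA
D_{\mathrm{modA}} = 2 \sum_{i=1}^{n} \left\{ y_i \log\frac{y_i}{\hat{\mu}_i} - (y_i - \hat{\mu}_i) \right\}
                  = 2 \sum_{i=1}^{n} y_i \log\frac{y_i}{\hat{\mu}_i}
```

The last equality uses the first score equation in (b), which gives $\sum_i (y_i - \hat{\mu}_i) = 0$ because the model contains an intercept; terms with $y_i = 0$ are taken to contribute zero to the sum.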
2 (of 3). The study in Question 1 yielded observations that are plotted in Figure 1 for both countries A and B.
After almost two years of epidemic, in month 23, the two countries agreed to relax the set of rules to control the disease, and this resulted in a marked increase in the number of cases, evident in both plots. In the dataset, this rule change is recorded by means of an indicator variable, with value 0 for months up to (and including) month 22, and value 1 for the months thereafter. In R, this factor covariate is denoted by step.
The following R modelling uses the notation i (a numerical continuous variable) for the covariate denoting the number of months that passed since the outbreak. Furthermore, a quadratic covariate $i^2$ is also computed and denoted by i2. The response is denoted by y for country A, and by yy for country B.
Trimmed R output from a few fitted generalised linear models appears in what follows, exploring for each country whether the linear predictor should involve the covariate i linearly or as a quadratic, thus also including i2, and whether the relaxation measure quantified through the covariate step did indeed have a significant effect on the number of cases.
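As a minimal sketch, the covariates described above could be constructed in R as follows. The numeric 0/1 coding of step is an assumption (it is consistent with the coefficient being labelled `step` in the output), and the commented `glm` calls assume the responses `y` and `yy` are available, which they are not here.

```r
# Covariates as described in the text (the 0/1 numeric coding of step is assumed)
i    <- 1:36                 # months since the outbreak
i2   <- i^2                  # quadratic covariate
step <- as.numeric(i > 22)   # 0 up to and including month 22, 1 from month 23

# With the responses y (country A) and yy (country B) loaded, the fits would be:
# modA   <- glm(y  ~ i,             family = poisson(link = "log"))
# modA.3 <- glm(y  ~ i + step,      family = poisson(link = "log"))
# modB.3 <- glm(yy ~ i + i2 + step, family = poisson(link = "log"))

table(step)                  # counts of months before and after the rule change
```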
> summary(modA)
Call:
glm(formula = y ~ i, family = poisson(link = "log"))
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.528252   0.181166   2.916  0.00355 **
i           0.066275   0.006796   9.752  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 154.911 on 35 degrees of freedom
Residual deviance:  46.507 on 34 degrees of freedom
AIC: 177.12
> summary(modA.3)
Call:
glm(formula = y ~ i + step, family = poisson(link = "log"))
Coefficients:
            Estimate Std. Error
(Intercept)  0.81985    0.20257
i            0.03444    0.01297
step         0.73580    0.26123
---
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 154.911 on 35 degrees of freedom
Residual deviance:  38.357 on 33 degrees of freedom
AIC: 170.97
> t(matrix(c(1,37,1),nrow=3))%*%vcov(modA.3)%*%matrix(c(1,37,1),nrow=3)
           [,1]
[1,] 0.01351348
> t(matrix(c(1,37,0),nrow=3))%*%vcov(modA.3)%*%matrix(c(1,37,0),nrow=3)
          [,1]
[1,] 0.1110813
> summary(modB.3)
Call:
glm(formula = yy ~ i + i2 + step, family = poisson(link = "log"))
Coefficients:
              Estimate Std. Error
(Intercept)  2.8886411  0.1040854
i            0.0632756  0.0105897
i2          -0.0006862  0.0002304
step         0.7326556  0.0810130
---
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 1549.11 on 35 degrees of freedom
Residual deviance:  374.54 on 32 degrees of freedom
AIC: 590.49
> 1-pchisq(8.15,df=1)
[1] 0.004306116
> 1-pchisq(374.54,df=32)
[1] 0
> 1-pchisq(38.36,df=33)
[1] 0.2394046
> 1-pchisq(46.51,df=34)
[1] 0.07469905
In your analysis below, you may take the significance level α = 0.05 if needed, and you may use all reported R output .
In addition to the p-values reported under the R output, you are also given the quantiles $z(0.975) \approx 1.96$, $\chi^2_1(0.95) \approx 3.84$ and $\chi^2_4(0.95) \approx 9.49$.
(a) Define the AIC and BIC and explain their use. Derive a formula that would allow you to compute the BIC directly from the AIC. Hence, or otherwise, obtain the BIC for model modA. [6]
(b) Using the R output, assess whether the covariate step is needed in the model modB.3 for country B. Justify your answer. [5]
(c) Construct an analysis of deviance to compare models modA and modA.3. Clearly justify your model choice by stating the tested hypotheses, the test statistic and its distribution. Does the factor variable step have a significant effect on the number of cases in country A? [8]
(d) Justify whether model modA.3 is a good fit for the observed data in country A. Using model modA.3, proceed to estimate the number of cases corresponding to month 37, and compute its associated 95% confidence interval. [8]
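The arithmetic behind parts (a) to (d) can be checked with a short R sketch that uses only numbers copied from the trimmed output above. It is an illustration under stated assumptions, not a model answer: the variable names are hypothetical, the BIC conversion assumes the usual definitions AIC = -2 log L + 2p and BIC = -2 log L + p log(n), and the interval for month 37 is built on the linear predictor scale and then exponentiated.

```r
# All numeric inputs are copied from the R output reported in the question.
n <- 36                                    # monthly observations per country

# (a) BIC from AIC, assuming AIC = -2*logLik + 2p and BIC = -2*logLik + p*log(n)
aic_modA <- 177.12
p_modA   <- 2                              # intercept and slope
bic_modA <- aic_modA + p_modA * (log(n) - 2)

# (b) Wald statistic for step in modB.3, to be compared with z(0.975) = 1.96
z_step_B <- 0.7326556 / 0.0810130

# (c) Analysis of deviance: modA (null) against modA.3 (alternative)
dev_diff <- 46.507 - 38.357                # drop in residual deviance
df_diff  <- 34 - 33                        # one additional parameter
p_value  <- 1 - pchisq(dev_diff, df = df_diff)

# (d) Goodness of fit of modA.3, then the month 37 prediction (step = 1)
gof_p    <- 1 - pchisq(38.357, df = 33)
beta_hat <- c(0.81985, 0.03444, 0.73580)
x_new    <- c(1, 37, 1)
eta_hat  <- sum(x_new * beta_hat)          # linear predictor at month 37
se_eta   <- sqrt(0.01351348)               # from x' vcov(modA.3) x in the output
ci_eta   <- eta_hat + c(-1, 1) * qnorm(0.975) * se_eta
mu_hat   <- exp(eta_hat)                   # estimated expected number of cases
ci_mu    <- exp(ci_eta)                    # 95% interval on the response scale

round(c(BIC = bic_modA, z_step = z_step_B, dev_diff = dev_diff, p = p_value), 4)
round(c(gof_p = gof_p, estimate = mu_hat, lower = ci_mu[1], upper = ci_mu[2]), 3)
```

Exponentiating the endpoints keeps the interval positive, which is why the sketch works on the log scale first rather than applying a symmetric interval directly to the estimated mean.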
3 (of 3). As the number of cases is high in country B, a researcher suggests that the Poisson distribution used for modelling the response may be replaced with a Gaussian distribution.
The researcher first models the independent monthly observations as $Y'_i \sim N(\mu'_i, (\sigma')^2)$, and assumes a common, unknown variance $(\sigma')^2$ across all $i = 1, \ldots, 36$ months. Following some residual checks, the researcher then decides to model $Y''_i = \log(Y'_i) \sim N(\mu''_i, (\sigma'')^2)$ across the recorded $i = 1, \ldots, 36$ months.
Trimmed R output along with some residual check plots (Figure 2) from the researcher’s investigation are reported below.
> summary(gmodB.1)
Call:
glm(formula = yy ~ i, family = gaussian(link = "log"))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.932328   0.283432  10.346 4.85e-12 ***
i           0.062547   0.009531   6.562 1.62e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Null deviance: 119075 on 35 degrees of freedom
Residual deviance:  40638 on 34 degrees of freedom
AIC: 361.21
> yy2<-log(yy)
> lgmodB.1<-glm(yy2~i,gaussian(link='identity'))
> summary(lgmodB.1)
Call:
glm(formula = yy2 ~ i, family = gaussian(link = "identity"))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.598131   0.166012  15.650  < 2e-16 ***
i           0.072199   0.007824   9.227 8.77e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.2378461)
    Null deviance: 28.3383 on 35 degrees of freedom
Residual deviance:  8.0868 on 34 degrees of freedom
AIC: 54.405
The researcher then went on to implement a local linear estimator to describe the association between the transformed monthly number of cases, $Y''_i$, and the covariate denoting the number of months that passed since the outbreak, $x_i = i$. The estimated function appears in Figure 3 (top), each curve corresponding to a different choice of bandwidth.
Before comparing the parametric and nonparametric estimates, the researcher also wants to check, using nonparametric density estimation, that the use of a Gaussian distribution in model lgmodB.1 is indeed justified. (S)he implements a kernel-based density estimator for the deviance residuals associated with model lgmodB.1, using the Epanechnikov kernel with optimal bandwidth. The histogram and resulting estimate appear in Figure 3 (bottom). You are also given that the variance of the deviance residuals is 0.23.
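To make the two nonparametric steps concrete, the following R sketch fits a local linear estimator at two bandwidths and a kernel density estimate with the Epanechnikov kernel. Everything in it is an assumption made for illustration: the data are simulated (not the exam data), the function `local_linear` is a hypothetical helper, Gaussian kernel weights are used for the regression, and `density()` is called with its default bandwidth rule rather than the optimal bandwidth mentioned above.

```r
set.seed(1)
x <- 1:36
y <- 2.6 + 0.07 * x + rnorm(36, sd = 0.5)    # simulated stand-in for log counts

# Local linear estimate at x0: weighted least squares of y on (x - x0);
# the fitted intercept estimates the regression function at x0.
local_linear <- function(x0, x, y, h) {
  w   <- dnorm((x - x0) / h)                 # Gaussian kernel weights (assumed)
  fit <- lm(y ~ I(x - x0), weights = w)
  unname(coef(fit)[1])
}

grid      <- seq(1, 36, by = 0.5)
fit_small <- vapply(grid, function(g) local_linear(g, x, y, h = 2), numeric(1))
fit_large <- vapply(grid, function(g) local_linear(g, x, y, h = 8), numeric(1))

# For a Gaussian model with identity link, deviance residuals are y - fitted.
res  <- residuals(lm(y ~ x))
dens <- density(res, kernel = "epanechnikov")  # default bandwidth, not the optimal one

plot(x, y, xlab = "Recorded month", ylab = "log cases (simulated)")
lines(grid, fit_small, col = "red")          # small bandwidth: wigglier curve
lines(grid, fit_large, col = "blue")         # large bandwidth: smoother curve
```

A smaller bandwidth follows the data more closely at the cost of higher variance, while a larger one gives a smoother, potentially more biased curve; on exactly linear data the local linear estimator reproduces the line for any bandwidth.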
[Figure 2 plots not reproduced: for each model, a "Residuals vs Fitted" panel (horizontal axis "Predicted values") and a "Normal Q-Q" panel (horizontal axis "Theoretical Quantiles", vertical axis "Std. deviance resid.").]
Figure 2: Top: Residual checks for model gmodB.1; Bottom: Residual checks for model lgmodB.1.
[Figure 3 plots not reproduced: top panel with horizontal axis "Recorded month"; bottom panel with horizontal axis "Deviance residuals".]
Figure 3: Top: Nonparametric regression estimates (red and blue curves) connecting the log of the number of cases and the number of months since the outbreak started, obtained using two different bandwidths; Bottom: Density estimate (red curve) for deviance residuals of model lgmodB.1, with superimposed theoretical normal density (blue curve).
(a) Write down the estimated models gmodB.1 and lgmodB.1, making sure you specify their respective response distribution and link function, while also stating the precise relationship between their respective dispersion parameter and variance. Formulate the deviance-based estimator of the dispersion parameter and obtain the estimated variance under model gmodB.1. [10]
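For the numerical part of (a), a minimal R sketch of the deviance-based dispersion estimate is given below. It assumes the usual estimator, residual deviance divided by residual degrees of freedom, and that for a Gaussian GLM the dispersion parameter coincides with the variance; the variable names are hypothetical and the inputs are copied from the output above.

```r
# Deviance-based dispersion estimate: residual deviance / residual df (assumed form)
phi_hat_gmodB1  <- 40638 / 34        # gmodB.1: Gaussian response, log link
sd_hat_gmodB1   <- sqrt(phi_hat_gmodB1)

# Consistency check against the dispersion that R reports for lgmodB.1
phi_hat_lgmodB1 <- 8.0868 / 34       # reported value: 0.2378461

round(c(var_gmodB1 = phi_hat_gmodB1, sd_gmodB1 = sd_hat_gmodB1,
        var_lgmodB1 = phi_hat_lgmodB1), 4)
```

The second computation recovers the dispersion printed in the lgmodB.1 summary up to the rounding of the reported deviance, which supports using the same estimator for gmodB.1.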
2022-08-13