MAST90084: Statistical Modelling Assignment 1
1. Let X and Y be two categorical random variables, X with I different categories identified with the set {1, . . . , I} and Y with J different categories identified with the set {1, . . . , J}. Suppose observations of the variable pair (X, Y) are tabulated in an I × J contingency table. Using standard notation, for a given (i, j) ∈ {1, . . . , I} × {1, . . . , J}, nij is the entry in the (i, j)-th cell, i.e. the count of observations with X equal to its i-th category and Y equal to its j-th category. A Poisson sampling model for the contingency table assumes that the nij’s are independently distributed with
nij ~ Poi(µij),
where µij denotes the Poisson mean for the cell count nij.
(a) Derive the conditional joint distribution of $\{n_{ij}\}_{(i,j) \in \{1,\dots,I\} \times \{1,\dots,J\}}$ given n, where $n := \sum_{1 \le i \le I,\; 1 \le j \le J} n_{ij}$. Identify the name of this distribution, and explicitly state what its parameter values are in terms of $\{\mu_{ij}\}_{(i,j) \in \{1,\dots,I\} \times \{1,\dots,J\}}$ and n. [5]
(b) Let I = J = 2. The quantity $\frac{\mu_{11}\,\mu_{22}}{\mu_{12}\,\mu_{21}}$, also known as the odds ratio, measures the association between X and Y. What should be the value of the odds ratio if X and Y are independent, and why? [3]
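As a minimal sketch of how the odds ratio in part (b) is evaluated, the following R snippet computes it for a 2 × 2 matrix of cell means, assuming the usual cross-product form µ11µ22/(µ12µ21). The helper name `odds_ratio` and the numbers in `mu` are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: cross-product (odds) ratio of a 2 x 2 matrix of cell means.
odds_ratio <- function(mu) {
  stopifnot(is.matrix(mu), all(dim(mu) == c(2, 2)), all(mu > 0))
  (mu[1, 1] * mu[2, 2]) / (mu[1, 2] * mu[2, 1])
}

# Illustrative (made-up) Poisson means; rows index X, columns index Y.
mu <- matrix(c(2, 3,
               4, 12), nrow = 2, byrow = TRUE)
theta <- odds_ratio(mu)
print(theta)
```

The same helper can be applied to a table of observed counts to obtain the sample odds ratio.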
2. Data in the following 2 × 2 × 3 contingency table were used to study the effect of passive smoking on lung cancer. The table summarizes the results of case-control studies from 3 countries for nonsmoking women married to smokers. (Source: Blot and Fraumeni, J. Nat. Cancer Inst., 77:993-1000 (1986) and Agresti (1996).)
Country | Spouse Smoked | Cases | Controls
Japan   | No            |    21 |       82
Japan   | Yes           |    73 |      188
UK      | No            |     5 |       16
UK      | Yes           |    19 |       38
USA     | No            |    71 |      249
USA     | Yes           |   137 |      363
(a) A log-linear model mod1 can be fitted to the data, with the results being given in the following R output. Give the mathematical formula of form ln(µ) = . . . for the mean model of mod1, where µ is the mean of the response. Any dummy variables in your formula should be explicitly defined. [5]
> pasSmoking.dat=data.frame(freq=c(21,73,5,19,71,137,82,188,16,38,249,363))
> pasSmoking.dat$Cnt=factor(rep(c("Japan","UK","USA"), times=2, each=2))
> pasSmoking.dat$Smo=factor(rep(c("No","Yes"), times=6))
> pasSmoking.dat$Can=factor(rep(c("Case","Control"), each=6))
> pasSmoking.dat
freq Cnt Smo Can
1 21 Japan No Case
2 73 Japan Yes Case
3 5 UK No Case
4 19 UK Yes Case
5 71 USA No Case
6 137 USA Yes Case
7 82 Japan No Control
8 188 Japan Yes Control
9 16 UK No Control
10 38 UK Yes Control
11 249 USA No Control
12 363 USA Yes Control
> mod1=glm(freq~Cnt+Smo+Can+Cnt:Smo+Cnt:Can+Smo:Can, family=poisson, data=pasSmoking.dat)
> anova(mod1, test="Chisq")
Analysis of Deviance Table; Model: poisson; Link: log; Response: freq
Terms added sequentially (first to last)
        Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                       11    1168.85
Cnt      2   726.43         9     442.42  < 2.2e-16
Smo      1   112.52         8     329.90  < 2.2e-16
Can      1   307.56         7      22.34  < 2.2e-16
Cnt:Smo  2    15.50         5       6.84  0.0004316
Cnt:Can  2     1.05         3       5.80  0.5919109
Smo:Can  1     5.56         2       0.24  0.0184215
> 1 - pchisq(0.24, 2)
[1] 0.8869204
> 1 - pchisq(5.80, 3)
[1] 0.1217566
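The two tail probabilities above can also be obtained with the `lower.tail` argument of `pchisq`, which avoids the subtraction from 1. This is only a sketch that re-uses the deviances and degrees of freedom from the displayed output; the variable names `p_mod1` and `p_reduced` are illustrative.

```r
# Upper-tail chi-square probabilities for the two residual deviances shown above.
p_mod1    <- pchisq(0.24, df = 2, lower.tail = FALSE)  # residual deviance 0.24 on 2 df
p_reduced <- pchisq(5.80, df = 3, lower.tail = FALSE)  # residual deviance 5.80 on 3 df

# Both should agree with the 1 - pchisq(...) form used in the output.
print(c(p_mod1, p_reduced))
print(all.equal(p_mod1, 1 - pchisq(0.24, 2)))
```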
(b) Extending the notation from Question 1, for the current contingency table we can also use nijk to denote the count in each cell, where i ∈ {1, 2}, j ∈ {1, 2}, k ∈ {1, 2, 3} are indices corresponding to Can (variable X), Smo (variable Y) and Cnt (variable Z) respectively. Moreover, if the nijk are independently distributed with
nijk ~ Poi(µijk),
one can, for any k ∈ {1, 2, 3}, define the odds ratio $\theta_{XY(k)} = \frac{\mu_{11k}\,\mu_{22k}}{\mu_{12k}\,\mu_{21k}}$ for the partial table with Z = k. The table is said to have homogeneous XY association when θXY(1) = θXY(2) = θXY(3). Explain why the model in part (a) has homogeneous XY association. [5]
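A numerical check (not a proof) of the homogeneous association property can be sketched by refitting mod1 and computing the odds ratio from the fitted means within each country. The code below re-enters the data from part (a) in tidied form; the object names `dat`, `mu_hat` and `theta_hat` are illustrative and do not appear in the original output.

```r
# Re-enter the data from part (a): Smo varies fastest, then Cnt, then Can.
dat <- data.frame(freq = c(21, 73, 5, 19, 71, 137, 82, 188, 16, 38, 249, 363))
dat$Cnt <- factor(rep(c("Japan", "UK", "USA"), times = 2, each = 2))
dat$Smo <- factor(rep(c("No", "Yes"), times = 6))
dat$Can <- factor(rep(c("Case", "Control"), each = 6))

mod1 <- glm(freq ~ Cnt + Smo + Can + Cnt:Smo + Cnt:Can + Smo:Can,
            family = poisson, data = dat)

# Arrange the fitted means in an array indexed by (Smo, Cnt, Can),
# matching the row order of the data frame.
mu_hat <- array(fitted(mod1), dim = c(2, 3, 2),
                dimnames = list(Smo = c("No", "Yes"),
                                Cnt = c("Japan", "UK", "USA"),
                                Can = c("Case", "Control")))

# Fitted Can-Smo odds ratio within each country k.
theta_hat <- sapply(dimnames(mu_hat)$Cnt, function(k) {
  (mu_hat["No", k, "Case"] * mu_hat["Yes", k, "Control"]) /
    (mu_hat["Yes", k, "Case"] * mu_hat["No", k, "Control"])
})
print(theta_hat)  # one value per country
```

Comparing `theta_hat` across the three countries shows what the fitted model implies; the written answer still needs the algebraic argument from the form of the linear predictor.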
(c) Based on the displayed R output in (a), test the significance of the interaction effect Smo:Can at significance level 0.05, eliminating the effects of all other terms in mod1. Provide your conclusion with clear explanation. [4]
(d) Based on the displayed R output in (a), test the adequacy/goodness-of-fit of model
Cnt+Smo+Can+Cnt:Smo+Cnt:Can
at significance level 0.05. Provide your conclusion with clear explanation.
(e) Are your conclusions in (c) and (d) contradictory? You must give an explanation to get any score. [5]
3. A variable Y taking values in {0, 1, 2, . . . } has a Negative Binomial (NB) distribution if its probability
mass function has the form
p(Y = y; µ, κ) =
for y = 0, 1, . . . , where µ is the mean of Y .
(a) When κ is considered as fixed (or known), the NB distribution belongs to the exponential dispersion model (EDM) discussed in class. Write out its form as an EDM explicitly. In particular, you have to identify the natural parameter θ and the dispersion parameter φ in terms of µ and κ whenever appropriate, and identify b(·) (as a function of θ). You can simply take the weight ω to be 1. [5]
(b) Let σ² be the variance of Y. From your answer above, derive the formula for σ² as a function of µ. Why do we say that the NB distribution can be used as a likelihood model to handle “overdispersion” compared to the Poisson distribution? [4]
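To make the comparison with the Poisson distribution concrete, the sketch below evaluates the mean and variance of each distribution numerically from its probability mass function. It uses R's `dnbinom` with its `mu` and `size` arguments; how `size` corresponds to κ depends on the parametrisation of the pmf in this question, so treat that correspondence as an assumption to check. The values mu = 3 and size = 2, and the helper name `moments`, are arbitrary choices for illustration.

```r
# Arbitrary illustrative values; `size` is R's NB parameter, not necessarily kappa.
mu   <- 3
size <- 2
y    <- 0:2000  # support truncated where the remaining tail mass is negligible

# Mean and variance of a distribution on y, given its pmf values p.
moments <- function(p) {
  m <- sum(y * p)
  c(mean = m, var = sum((y - m)^2 * p))
}

pois_mom <- moments(dpois(y, lambda = mu))
nb_mom   <- moments(dnbinom(y, size = size, mu = mu))

# Compare the two rows: the means agree, the variances do not.
print(rbind(Poisson = pois_mom, NegBin = nb_mom))
```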
2023-03-24