STAT4027: Advanced Statistical Modelling Semester 2, 2021
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
STAT4027: Advanced Statistical Modelling
Semester 2, 2021
Answer ALL questions.
1. A random variable Y follows the distribution with probability density function (pdf) f (y) = e_ [(y _τ )+e- (g -T ) , y ← R, τ > 0. (1)
The mean and variance of Y are respectively given by
E(Y) = τ + γ (γ = 0.5772) and Var(Y) =
6 .
(a) Show that f (y) can be written as
f (y) = exp ┌ T (y一φ)b(θ) + c(φ, y), .
Identify θ , b(θ), a(φ), c(φ, y) and the sufficient statistics T (y).
(b) Find E(T (Y)) = b\ (θ) and Var(T (Y)) = a(φ)b\\ (θ).
(c) Based on a random sample z = (y1 , . . . , yn ) from the distribution in (1),
n
(i) Write down the loglikelihood function a(τ ; z) = z ln f (yi ) and evaluate its first i=1
and second order derivative functions, a\ (τ ; z) and a\\ (τ ; z). (ii) Derive the maximum likelihood (ML) estimate and the variance of .
(iii) Show that in (ii) agrees with the result E(T (Y)) = b\ (θ) in (b).
1 n 1 n
(iv) Find an expression of AIC in terms of = z yi and = z zi , where
zi = e_yi .
(d) Using the results in (2),
(i) Derive the method of moment (MOM) estimate and the variance of . (ii) Show that the MOM estimate is less efficient than the ML estimate in (c)(ii).
2. The data dat contain 15 patients under two different treatments.
Patient |
Survival or censor |
Outcome |
Treatment |
Age (years) |
i |
time (days) ti |
wi |
ci |
xi |
1 |
1 |
Died |
1 |
75 |
2 |
1 |
Died |
1 |
79 |
3 |
4 |
Died |
1 |
85 |
4 |
5 |
Died |
1 |
76 |
5 |
6 |
Censored |
1 |
66 |
6 |
8 |
Died |
0 |
75 |
7 |
9 |
Censored |
1 |
72 |
8 |
9 |
Died |
1 |
70 |
9 |
12 |
Died |
0 |
71 |
10 |
15 |
Censored |
0 |
73 |
11 |
22 |
Died |
1 |
66 |
12 |
25 |
Censored |
0 |
73 |
13 |
37 |
Died |
0 |
68 |
14 |
55 |
Died |
0 |
59 |
15 |
72 |
Censored |
0 |
61 |
Vectors t, w (1 for died; 0 for censored), treatment and age store the values of ourvivrl or 6ensor time, Оut6ome, srertment and Age respectively.
(a) The first study investigates the overall survival rate. The cumulative hazard is
plotted in the following figure.
> Sfit=survfit(Surv(t,w)~1,data=dat)
> ggsurvplot(Sfit,conf.int = TRUE ,ggtheme=theme_bw(),fun="cumhaz")
(i) The vector of observed cumulative hazard H is calculated in the hidden procedure ‘???????’ using the observed survival rate S from survfit. The value of y[2] is also hidden by ‘*******’ . Calculate the hidden value.
> S=summary(Sfit)$surv; round(S,4)
[1] 0 .8667 0 .8000 0 .7333 0 .6600 0 .5867 0 .5029 0 .4023 0 .2682 0 .1341
> H=???????; round(H,4) #calculate obs cum hazard using S
[1] 0 .1431 ******* 0 .3102 0 .4155 0 .5333 0 .6874 0 .9106 1 .3161 2 .0092
(ii) From the above cumulative hazard plot, there seems to be two linear relation-
ships. Hence a change point model is fitted to the observed cumulative hazard H against time which is outputed as below.
> time=summary(Sfit)$time
> Hlm=lm(H~time)
> library("segmented")
> Hfit=segmented(Hlm,~time) #1 change pt; 2 lines
> round(summary(Hfit)$coef[1:3,c(1,4)],4)
Estimate Pr(>|t|)
(Intercept) 0 .0715 0 .2727
time 0 .0469 0 .0043
U1 .time -0 .0162 0 .1555
> round(summary(Hfit)$psi,4)
Initial Est . St .Err
psi1 .time 9 11 .7511 5 .0535
Write down the hazard function h(t) showing all the necessary steps.
(iii) Provide two reasons why one can approximate the first intercept 0.0715 of Hfit
to be zero. Write down the cumulative hazard function H(t) showing all neces- sary steps.
The plot of the observed and fitted cumulative hazard functions is provided below.
Fitted and observed cumulative hazard
50
time
(b) Assume that the survival time ti follows Loglogisti6 yLLA 户isk( distribution:
αλ(λt)α _ 1 (λt)α
ti ~ LL(α, λ) with pdf fLL (t) = [1 + (λt)α )]2 and cdf FLL (t) = 1 + (λt)α .
(i) Derive the baseline hazard hLL ,0 (t) and cumulative hazard HLL ,0 (t) functions without any regression model.
(ii) It is know that if T follows the LL distribution, Y = log T follows the Logistic
(L) distribution with cdf
FL (y|µ, σ2 ) = exp( )
where σ is the scale parameter, µ = bT b is the mean function, b is a vector of covariates and b is a vector of regression parameters.
Derive the survival function SLL (t) for the LL model in terms of µ and σ starting with
SLL (t) = Pr(T > t) = Pr(log T > log t) = 1 一 FL (log t) = ... (3)
(iii) Let µ = b0 + b1 x. Using the result that
exp ╱ log σ(t)一 µ 、= t exp(一b0 ) exp(一b1 x/σ) = (λt)α exp(β1 x)
where α = , λ = exp(一b0 ), λα = exp(一b0 /σ) and β1 = 一b1 /σ ,
derive HLL (t) using SLL (t) in (3) and hLL (t) using HL (t) with the regression model µ = b0 + b1 x.
Hence explain that the LL model is not a proportional hazard model, that is, HLL (t) HLL ,0 (t) exp(β1 x) and hLL (t) hLL ,0 (t) exp(β1 x)
using the results in (b)(i).
(iv) Explain if the Poisson proxy model wi ~ Poisson(µp,i), µp,i = HLL (ti ) can
be applied to estimate the parameters α , β0 and β1 . Discuss any difficulties associated with this method and suggest a workable method.
(c) The data are fitted to the independen6e model with LL distribution and the output is given below.
> summary(survreg(Surv(t,w)~treatmemt+age,dist="loglogistic"))
Call:
survreg(formula = Surv(t, w) ~ treatment + age, dist = "loglogistic") Value Std . Error z p
(Intercept) 11 .9044
treatment age Log(scale) |
-0 .9258 -0 .1239 -0 .7973 |
2 .6594 4 .48 7 .6e-06
0 .4677 -1 .98 0 .0478
0 .0385 -3 .22 0 .0013
0 .2645 -3 .01 0 .0026
Scale= 0 .451
Log logistic distribution
Loglik(model)= -35 .4 Loglik(intercept only)= -42 .9
Chisq= 14 .99 on 2 degrees of freedom, p= 0 .00056 Number of Newton-Raphson Iterations: 6
n= 15
(i) Calculate the mean µ for the patient with trertment and age b). Using the survival function SLL (t) in (3) that you have derived, evaluate the survival rate for this patient at the time t = 50.
(ii) The following plot provides the survival functions for age 65, 75 and 85 and with
and without treatment. State and explain briefly the age and treatment status for the blue dash line.
Survival functions
|
● |
|
|
|
|
|
|
|
|
|
|
|
70
survival time (days)
3. An automobile accident data contains n = 6274 injuries o。served over four factors:
1. Gender (G):
2. Location (L):
3. Seat belt (S):
4. Injury (I):
Female (F) or Male (M) (denoted by αi in the model formula);
Urban (U) or Rural (R) (denoted by βj in the model formula);
No (N) or Yes (Y) (denoted by γk in the model formula);
0 not hospitalised; 1 hospitalised but did not die; 2 died (denoted by δl in the model formula).
Two Poisson loglinear models, Model A (mA) and Model B (mB) given below:
Model A: ln(µijkl ) = µ + αi + βj + γk + δl + (αγ)ik + (αδ)il + (γδ)kl + (βδ)jl + (αγδ)ikl (4)
Model B: ln(µijkl ) = µ + αi + βj + γk + δl + (αγ)ik + (αδ)il + (γδ)kl + (βγ)jk (5)
are applied where i, j, k = 0, 1 and l = 0, 1, 2. The baseline categories are “female”, “urban”, “no” seat belt and “0” level of injury with i, j, k, l = 0.
(a) For each of Model A and B, explain briefly if it is de6omposr。le using sufficient con-
figurations and if it is decomposable, describe the dependen6y stru6ture and express pijkl , where pijkl 全 µijkl /n, in terms of some marginal probabilities.
(b) The parameter estimates for Model A are given by
> mA=glm(y~G*S*I+L*I,family=poisson) #y,G,S,I,L are 24x1 vectors
> pA=mA$coef; round(pA,3)
(Intercept) G1 S1 I1 I2 6 .7177 -0 .0774 -0 .2094 -2 .2433 -4 .4349
L1 G1:S1
0 .0287 -0 .4935
G1:I1 G1:I2 S1:I1 S1:I2 I1:L1 I2:L1 G1:S1:I1 G1:S1:I2
0 .2049 0 .4414 -0 .4445 -0 .2853 0 .5861 1 .1287 0 .2080 -0 .5244
A 。inomirl model with logit link using Model A and the sert 。elt (S; γk ) factor as the out6ome variable is applied. Parameters at baseline levels are set to 0.
pij1l
bols of the loglinear Model A in (4) with suitable subscripts and adding super- scripts “*” to the parameters to distinguish them from those in the loglinear model. Hint: look for the terms that involve γ first.
(ii) The logit model in (i) includes the term
α 1(*) = ln p1010 /p1000
(6)
1010
1000
(i = 0). State which term in the loglinear Model A in (4) is defined in the p0000
same way as (6). Hence provide the estimated value of α 1(*) .
(iii) Similarly, a higher order interaction term in the logit model in (i) also equals to a term in the loglinear Model A in (4). Write down the symbol of this higher order interaction term in the logit model and its estimated value. Interpret the parameter value of this higher order interaction term.
(c) Using Model B and the injury (I; δl ) factor as the out6ome variable, a multinomirl model given by
pijkl A A A
where i, j, k = 0, 1, l = 0, 1, 2 is applied and the parameters δl(A), αil(A), γkl(A) are 0 for the baseline level if any subscripts i, k, l are 0. The parameter estimates are given below:
> library(nnet)
> nB=multinom(ym~G+S) #ym is a 8x3 matrix; G,S are 8x1 factor vectors > round(summary(nB)$coefficients,4)
(Intercept) G1 S1
y1 -1 .9389 0 .2704 -0 .3471
y2 -3 .6320 0 .3025 -0 .5160
For an accident with a mrle driver (G; i = 1) in rny (L; j) area not wearing a seat belt (S; k = 0), estimate the probabilities for all injury levels (I; l = 0, 1, 2) as follow.
(i) Calculate the probability ratios and
(ii) By making use of the results in (i) and the fact that p1j00 + p1j01 + p1j02 = 1, calculate the probability p1j00 .
(iii) Hence calculate the probabilities p1j01 and p1j02 .
2022-11-07