MATH50011 Statistical Modelling 1 2021
MATH50011
BSc, MSci and MSc EXAMINATIONS (MATHEMATICS)
2021
Statistical Modelling 1
1. Let $Y_1, \ldots, Y_n$ be independent and identically distributed (i.i.d.) $N(\mu, \sigma^2)$ with $\sigma^2 > 0$ known. Suppose that interest lies in estimating $\theta = P(Y_i \le 0)$.
(a) State the maximum likelihood estimator (MLE) $\hat\mu$ for $\mu$ based on $Y_1, \ldots, Y_n$ and state the distribution for $Z_a$ such that $\sqrt{n}(\hat\mu - \mu) \to_d Z_a$. (2 marks)
(b) Find the MLE $\hat\theta_Y$ for $\theta$ based on $Y_1, \ldots, Y_n$ and derive the distribution for $Z_b$ such that $\sqrt{n}(\hat\theta_Y - \theta) \to_d Z_b$. (6 marks)
(c) For each $i = 1, \ldots, n$, define
$$W_i := I(Y_i \le 0) = \begin{cases} 1 & \text{if } Y_i \le 0, \\ 0 & \text{otherwise.} \end{cases}$$
What is the distribution of $W_i$? (2 marks)
(d) State the MLE $\hat\theta_W$ for $\theta$ based on $W_1, \ldots, W_n$ and state the distribution for $Z_d$ such that $\sqrt{n}(\hat\theta_W - \theta) \to_d Z_d$. (4 marks)
(e) Show that $\mathrm{Var}(Z_b) \le \mathrm{Var}(Z_d)$, where $Z_b$ and $Z_d$ follow the distributions identified in parts (b) and (d). (6 marks)
(Total: 20 marks)
2. The distribution of the concentration S (in milligrams per litre) of an enzyme in a certain biological system is assumed to be adequately described by the density function
$$f_S(s; \theta) = 3\theta^{-3} s^2, \quad 0 < s < \theta,$$
where $\theta > 0$ is an unknown parameter. A biologist is interested in making statistical inferences about $\theta$.
Suppose we have available $n$ i.i.d. random variables $S_i$, $i = 1, \ldots, n$.
(a) (i) Show that $S_{(n)} := \max\{S_1, \ldots, S_n\}$ has density function
$$f_{S_{(n)}}(s; \theta) = 3n\theta^{-3n} s^{3n-1}, \quad 0 < s < \theta.$$
(4 marks)
(ii) Using $S_1, S_2, \ldots, S_n$, construct an exact $100(1 - \alpha)\%$ upper one-sided confidence interval $(0, U)$ for the unknown parameter $\theta$, where $U$ is a function of $S_{(n)} := \max\{S_1, \ldots, S_n\}$. (4 marks)
(b) (i) Use the central limit theorem to show that
$$\sqrt{n}\left(\bar S - \tfrac{3}{4}\theta\right) \to_d N\left(0, \tfrac{3}{80}\theta^2\right).$$
(4 marks)
(ii) Using $S_1, S_2, \ldots, S_n$, construct a large-sample $100(1 - \alpha)\%$ two-sided confidence interval for the unknown parameter $\theta$ based on $\bar S = n^{-1}\sum_{i=1}^n S_i$. (4 marks)
(c) Construct a level $\alpha$ hypothesis test of $H_0 : \theta \le \theta_0$ against $H_1 : \theta > \theta_0$ based on part (a). (4 marks)
(Total: 20 marks)
3. For i = 1, . . . , n, consider the linear models
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \qquad (1)$$
$$x_i = \delta_0 + \delta_1 z_i + \eta_i. \qquad (2)$$
In matrix form we have $Y = X\beta + \epsilon$ and $x = Z\delta + \eta$. Suppose that $Z^T X$ is non-singular, that $Z$ and $X$ have full column rank, and that the errors in each model are i.i.d. with $\mathrm{Var}(\epsilon_i) = \sigma^2$ and $\mathrm{Var}(\eta_i) = \tau^2$.
Unless requested otherwise, express your responses in vector/matrix notation.
(a) Derive expressions for the hat matrix, $P$, and the vector of fitted values, $\hat x$, from least squares regression based on model (2). (3 marks)
(b) Replace $x_i$ by $\hat x_i$ in model (1) and show that the least squares estimator using the $\hat x_i$ is $\tilde\beta = (Z^T X)^{-1} Z^T Y$. (4 marks)
(c) Find $E(\tilde\beta)$ and $\mathrm{Cov}(\tilde\beta)$, treating the $x_i$ and $z_i$ as known constants. (6 marks)
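(d) Show that the estimator of $\beta_1$ obtained in part (c) is
$$\tilde\beta_1 = \frac{\sum_{i=1}^n (z_i - \bar z)(Y_i - \bar Y)}{\sum_{i=1}^n (z_i - \bar z)(x_i - \bar x)},$$
where $\bar x$, $\bar Y$, and $\bar z$ are the usual sample means. (3 marks)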
(e) Let $\hat\beta$ denote the usual least squares estimator of $\beta$ based on model (1). Show that $\mathrm{Var}(\hat\beta_1) \le \mathrm{Var}(\tilde\beta_1)$, treating the $x_i$ and $z_i$ as known constants. (4 marks)
(Total: 20 marks)
4. Suppose we wish to investigate the relationship between the daily cases of Salmonella and season (autumn, winter, spring, summer) in a particular city based on n = 365 consecutive days. Suppose further that we perform our analysis modelling the logarithm of the number of Salmonella cases as a function of season, rainfall (yes/no), and a possible season-rainfall interaction.
(a) Describe a linear model you could use to address this problem, clearly defining the predictor variables and their corresponding model parameters. Assume that the error terms are i.i.d. $N(0, \sigma^2)$ for some $\sigma^2$. (6 marks)
(b) Suppose we wish to test for the existence of a seasonal effect on Salmonella cases. State the hypotheses you would test in terms of your model parameters, the form of the test statistic you would use to test those hypotheses, and the methods you would use to determine a rejection region for the test. Provide explicit formulas wherever possible (matrix notation is allowed, provided the matrices are adequately defined). (5 marks)
(c) Suppose instead that we merely want to determine whether a season-rainfall interaction on Salmonella cases exists. State the hypotheses you would test in terms of your model parameters, the form of the test statistic you would use to test those hypotheses, and the methods you would use to determine a rejection region for the test. Provide explicit formulas wherever possible (matrix notation is allowed, provided the matrices are adequately defined).
(5 marks)
(d) A more realistic distribution for the errors is proposed to account for the correlation between days. Let $\epsilon_i = \rho\epsilon_{i-1} + \delta_i$ where the $\delta_i$ are i.i.d. $N(0, \sigma^2)$ random variables and $\epsilon_0 := 0$. Derive the joint distribution of the first three errors $(\epsilon_1, \epsilon_2, \epsilon_3)$ for this model.
(4 marks) (Total: 20 marks)
1. (a) [Seen] (2A marks) The MLE is $\hat\mu = \bar Y = n^{-1}\sum_{i=1}^n Y_i$.
Its distribution is $\bar Y \sim N(\mu, \sigma^2/n)$ for each $n = 1, 2, 3, \ldots$. Hence its asymptotic distribution is $\sqrt{n}(\bar Y - \mu) \to_d N(0, \sigma^2)$.
(b) [Seen Similar] (6B marks) By functional invariance of the MLE, we have $\hat\theta_Y = \Phi(-\bar Y/\sigma)$. To find its asymptotic distribution, we apply the delta method with $g(\mu) = \Phi(-\mu/\sigma)$ and $g'(\mu) = -\phi(-\mu/\sigma)/\sigma$. Hence,
$$\sqrt{n}(\hat\theta_Y - \theta) = \sqrt{n}(g(\bar Y) - g(\mu)) \to_d g'(\mu)\,N(0, \sigma^2) = N\left(0, \phi(-\mu/\sigma)^2\right).$$
(c) [Seen Similar] (2A marks) Since $W_i = 1$ with probability $\theta$ and $W_i = 0$ with probability $1 - \theta$, we have $W_i \sim \mathrm{Bernoulli}(\theta)$.
(d) [Seen] (4A marks) We have previously seen that the MLE is $\hat\theta_W = \bar W$.
The asymptotic distribution of $\hat\theta_W$ follows by either the central limit theorem or results about MLEs in regular models and is
$$\sqrt{n}(\hat\theta_W - \theta) \to_d N(0, \theta(1 - \theta)).$$
(e) [Unseen] (6D marks) We know that $\mathrm{Var}(Z_b)$ is the Cramér-Rao lower bound (CRLB) for a sample with $n = 1$. Hence, $n^{-1}\mathrm{Var}(Z_b)$ is the CRLB for a random sample of size $n$.
Next, we note that, for fixed sample size $n$, the MLE $\hat\theta_W$ is an unbiased estimator of $\theta$ with variance $\mathrm{Var}(\hat\theta_W) = n^{-1}\mathrm{Var}(Z_d)$.
From the above observation, we have that $n^{-1}\mathrm{Var}(Z_b) \le n^{-1}\mathrm{Var}(Z_d)$ by the CRLB. This is equivalent to $\mathrm{Var}(Z_b) \le \mathrm{Var}(Z_d)$, which completes the proof.
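As a numerical sanity check on part (e), the short sketch below evaluates the two asymptotic variances, $\phi(-\mu/\sigma)^2$ for $Z_b$ and $\theta(1-\theta)$ for $Z_d$, over a few values of $\mu$. The function names and the grid of $\mu$ values are illustrative choices and not part of the original solution; the check illustrates the inequality but does not replace the CRLB argument.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def asymptotic_variances(mu, sigma):
    """Return (Var(Z_b), Var(Z_d)) for theta = P(Y <= 0), Y ~ N(mu, sigma^2)."""
    t = -mu / sigma
    theta = Phi(t)
    return phi(t) ** 2, theta * (1.0 - theta)


# Illustrative grid of means (an assumption, any real values would do).
for mu in (-2.0, -0.5, 0.0, 0.5, 2.0):
    var_b, var_d = asymptotic_variances(mu, sigma=1.0)
    print(f"mu={mu:+.1f}  Var(Z_b)={var_b:.4f}  Var(Z_d)={var_d:.4f}")
```

At $\mu = 0$ the two values are $1/(2\pi)$ and $1/4$, so the estimator based on the full data is asymptotically more efficient there, as the proof predicts.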
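2. (a) [Seen Similar]
(i) (4A marks) The cdf of $S_{(n)}$ is $F_{S_{(n)}}(s; \theta) = F_S(s; \theta)^n = (s^3/\theta^3)^n$. Differentiating this, the density of $S_{(n)}$ is
$$f_{S_{(n)}}(s; \theta) = n\left[\frac{s^3}{\theta^3}\right]^{n-1} 3\theta^{-3} s^2 = 3n\theta^{-3n} s^{3n-1}, \quad 0 < s < \theta.$$
(ii) (4C marks) Now, since $S_{(n)} < \theta$, let $U$ have the structure $cS_{(n)}$, where $c > 1$. Then,
$$P(cS_{(n)} > \theta) = P\left(S_{(n)} > \frac{\theta}{c}\right) = \int_{\theta/c}^{\theta} 3n\theta^{-3n} s^{3n-1}\,ds = \theta^{-3n}\left[s^{3n}\right]_{\theta/c}^{\theta} = 1 - c^{-3n} = (1 - \alpha),$$
so that $c = \alpha^{-1/(3n)}$. So, the exact $100(1 - \alpha)\%$ confidence interval for $\theta$ is $(0, \alpha^{-1/(3n)} S_{(n)})$.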
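The exactness of the interval $(0, \alpha^{-1/(3n)} S_{(n)})$ can be checked by simulation. The sketch below is a minimal illustration: it draws each $S_i$ by inverting the cdf $F_S(s;\theta) = (s/\theta)^3$, so $S_i = \theta U_i^{1/3}$ with $U_i$ uniform on $(0,1)$, and estimates the coverage probability. The sample size, $\theta$, seed, and number of replications are arbitrary assumptions.

```python
import random


def sample_s(n, theta, rng):
    """Draw n i.i.d. values with density 3 s^2 / theta^3 on (0, theta) by cdf inversion."""
    return [theta * rng.random() ** (1.0 / 3.0) for _ in range(n)]


def upper_limit(sample, alpha):
    """Upper end U = alpha^(-1/(3n)) * max(sample) of the exact one-sided interval (0, U)."""
    n = len(sample)
    return alpha ** (-1.0 / (3 * n)) * max(sample)


def coverage(n, theta, alpha, reps, seed=0):
    """Monte Carlo estimate of P(theta < U)."""
    rng = random.Random(seed)
    hits = sum(upper_limit(sample_s(n, theta, rng), alpha) > theta for _ in range(reps))
    return hits / reps


# Illustrative settings (assumptions): n = 5, theta = 2, alpha = 0.05.
print(coverage(n=5, theta=2.0, alpha=0.05, reps=20000))
```

The estimated coverage should be close to $1 - \alpha$ for any $n$, since the interval is exact and not merely asymptotic.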
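(b) [Seen Similar]
(i) (4A marks) We have
$$E(S) = \int_0^{\theta} 3\theta^{-3} s^3\,ds = \tfrac{3}{4}\theta;$$
$$E(S^2) = \int_0^{\theta} 3\theta^{-3} s^4\,ds = \tfrac{3}{5}\theta^2;$$
$$\mathrm{Var}(S) = \tfrac{3}{5}\theta^2 - \tfrac{9}{16}\theta^2 = \tfrac{3}{80}\theta^2.$$
By the central limit theorem, we have
$$\sqrt{n}\left(\bar S - \tfrac{3}{4}\theta\right) \to_d N\left(0, \tfrac{3}{80}\theta^2\right).$$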
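(ii) [Seen Method] (4C marks) By the weak law of large numbers, $\tfrac{4}{3}\bar S \to_p \theta$. Using this with Slutsky's lemma, we conclude
$$\frac{\sqrt{n}\left(\bar S - \tfrac{3}{4}\theta\right)}{\sqrt{\tfrac{3}{80}}\,\tfrac{4}{3}\bar S} = \frac{\sqrt{15n}\left(\tfrac{4}{3}\bar S - \theta\right)}{\tfrac{4}{3}\bar S} \to_d N(0, 1).$$
Define $c$ to be the value so that $P(N(0, 1) > c) = \alpha/2$. Then, using the approximately pivotal distribution we have
$$\left(\tfrac{4}{3}\bar S - \frac{c}{\sqrt{15n}}\,\tfrac{4}{3}\bar S,\ \ \tfrac{4}{3}\bar S + \frac{c}{\sqrt{15n}}\,\tfrac{4}{3}\bar S\right)$$
is an asymptotically valid $100(1 - \alpha)\%$ two-sided confidence interval for $\theta$.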
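A minimal sketch of how the large-sample interval could be computed from data is given below. It assumes the point estimate $\hat\theta = \tfrac{4}{3}\bar S$ and the standard error $\hat\theta/\sqrt{15n}$ implied by $\mathrm{Var}(S) = \tfrac{3}{80}\theta^2$; the function name and the example data are illustrative only.

```python
import math
from statistics import NormalDist, mean


def large_sample_ci(sample, alpha=0.05):
    """Two-sided large-sample interval for theta based on the sample mean.

    Uses theta_hat = (4/3) * mean and standard error theta_hat / sqrt(15 n),
    which follows from Var(S) = (3/80) theta^2.
    """
    n = len(sample)
    theta_hat = 4.0 * mean(sample) / 3.0
    c = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 point of N(0, 1)
    half_width = c * theta_hat / math.sqrt(15.0 * n)
    return theta_hat - half_width, theta_hat + half_width


# Hypothetical enzyme concentrations in mg/L (made-up numbers for illustration).
data = [1.2, 1.7, 1.9, 1.5, 1.8, 1.1, 1.6, 1.95, 1.4, 1.75]
print(large_sample_ci(data))
```

Unlike the exact interval in part (a), this interval relies on the normal approximation, so it should only be trusted for moderately large $n$.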
(c) [Seen] (4A marks) Let $(0, U)$ be the $100(1 - \alpha)\%$ confidence interval from part (a). By results in the lecture notes, a test that rejects $H_0$ when $\theta_0 \notin (0, U)$ will have level $\alpha$.
3. (a) [Seen Method] (3A marks) For this model, the hat matrix is $P = Z(Z^T Z)^{-1} Z^T$. The fitted values are then $\hat x = Px$.
(b) [Seen Method] (4C marks) First, we note that $\hat X = (1 \mid \hat x) = PX$. Then,
$$\tilde\beta = (\hat X^T \hat X)^{-1} \hat X^T Y$$
$$= ((PX)^T PX)^{-1} (PX)^T Y$$
$$= (X^T P X)^{-1} X^T P Y$$
$$= (X^T Z (Z^T Z)^{-1} Z^T X)^{-1} X^T Z (Z^T Z)^{-1} Z^T Y$$
$$= (Z^T X)^{-1} (X^T Z (Z^T Z)^{-1})^{-1} X^T Z (Z^T Z)^{-1} Z^T Y$$
$$= (Z^T X)^{-1} Z^T Y.$$
Note: the penultimate step makes use of the identity $(BA)^{-1} = A^{-1} B^{-1}$.
(c) [Seen Method] (3B marks) By linearity of expectation we have
$$E(\tilde\beta) = E[(Z^T X)^{-1} Z^T Y] = (Z^T X)^{-1} Z^T E(Y) = (Z^T X)^{-1} Z^T X\beta = \beta.$$
(3B marks) Using properties of covariance and the matrix inverse we have
$$\mathrm{Cov}(\tilde\beta) = \mathrm{Cov}[(Z^T X)^{-1} Z^T Y] = (Z^T X)^{-1} Z^T [(Z^T X)^{-1} Z^T]^T \sigma^2 = (X^T P X)^{-1} \sigma^2.$$
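(d) [Seen Method] (3A marks) We found in part (b) that $\tilde\beta = (Z^T X)^{-1} Z^T Y$. Note that
$$Z^T Y = \begin{pmatrix} n\bar Y \\ z^T Y \end{pmatrix}, \qquad Z^T X = \begin{pmatrix} n & n\bar x \\ n\bar z & z^T x \end{pmatrix},$$
$$(Z^T X)^{-1} = \frac{1}{n z^T x - n^2 \bar x \bar z} \begin{pmatrix} z^T x & -n\bar x \\ -n\bar z & n \end{pmatrix}.$$
Combining these, we find that
$$(Z^T X)^{-1} Z^T Y = \frac{1}{n z^T x - n^2 \bar x \bar z} \begin{pmatrix} n\bar Y z^T x - n\bar x z^T Y \\ n z^T Y - n^2 \bar z \bar Y \end{pmatrix} = \begin{pmatrix} \bar Y - \tilde\beta_1 \bar x \\ \tilde\beta_1 \end{pmatrix},$$
where
$$\tilde\beta_1 = \frac{z^T Y - n\bar z \bar Y}{z^T x - n\bar z \bar x} = \frac{\sum_{i=1}^n (z_i - \bar z)(Y_i - \bar Y)}{\sum_{i=1}^n (z_i - \bar z)(x_i - \bar x)}.$$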
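(e) [Seen Method] (4D marks) Since $\tilde\beta_1 = c^T \tilde\beta$ for $c^T = (0, 1)$, we see that $\tilde\beta_1$ is a linear unbiased estimator for $\beta_1$. Hence, we can apply the Gauss-Markov theorem to reach the desired conclusion.
Alternatively, we have $\mathrm{Var}(\tilde\beta_1) = \sigma^2 \sum_i (z_i - \bar z)^2 / \left[\sum_i (z_i - \bar z)(x_i - \bar x)\right]^2 =: \sigma^2 S_{zz}/S_{xz}^2$ and $\mathrm{Var}(\hat\beta_1) = \sigma^2/S_{xx}$. By the Cauchy-Schwarz inequality $S_{xz}^2 \le S_{xx} S_{zz}$, so $\sigma^2/S_{xx} \le \sigma^2 S_{zz}/S_{xz}^2$, which proves the claim.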
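The equivalence in parts (b) and (d) can be illustrated numerically: fitting model (2), plugging the fitted values into model (1), and reading off the slope should give the same number as the ratio of centred cross-products of $z$ with $Y$ and of $z$ with $x$. The sketch below uses simulated data; the coefficients, sample size, and seed are arbitrary assumptions, and the helper names are hypothetical.

```python
import random


def centred_cross(a, b):
    """Sum over i of (a_i - a_bar) * (b_i - b_bar)."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))


def ols(x, y):
    """Simple linear regression of y on x; returns (intercept, slope)."""
    slope = centred_cross(x, y) / centred_cross(x, x)
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)
    return intercept, slope


def two_stage_slope(z, x, y):
    """Regress x on z, then regress y on the fitted values of x."""
    d0, d1 = ols(z, x)
    x_hat = [d0 + d1 * zi for zi in z]
    return ols(x_hat, y)[1]


def closed_form_slope(z, x, y):
    """Slope component of (Z^T X)^{-1} Z^T Y written with centred sums."""
    return centred_cross(z, y) / centred_cross(z, x)


# Simulated data (all numeric settings are assumptions for illustration).
rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(200)]
x = [1.0 + 2.0 * zi + rng.gauss(0.0, 1.0) for zi in z]
y = [0.5 + 1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

print(two_stage_slope(z, x, y), closed_form_slope(z, x, y))
# Cauchy-Schwarz check behind part (e): S_xz^2 <= S_xx * S_zz.
print(centred_cross(x, z) ** 2 <= centred_cross(x, x) * centred_cross(z, z))
```

The two slopes agree up to floating-point error, and the final comparison is the inequality that drives the variance ordering in part (e).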
4. (a) [Seen Method] (6A marks) Multiple formulations are possible, provided season is modelled with three indicator variables in the regression model. Let Y = log CASES. Here, we will define indicator variables SUMMER, AUTUMN, WINTER and RAIN and interaction terms S.R = SUMMER × RAIN, A.R = AUTUMN × RAIN, and W.R = WINTER × RAIN and use the regression model
E[Y] = β0 + β1 × RAIN + β2 × SUMMER + β3 × AUTUMN + β4 × WINTER + β5 × S.R + β6 × A.R + β7 × W.R
Note that the intercept in this model corresponds to the mean log cases on a spring day without rain. Various correct reparametrisations are possible.
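To make the parametrisation concrete, the sketch below builds one row of the design matrix for a given day, with columns ordered as (intercept, RAIN, SUMMER, AUTUMN, WINTER, S.R, A.R, W.R) and spring as the baseline season. The function name and the lowercase string coding of seasons are assumptions made for illustration.

```python
# Seasons with their own indicator columns; spring is the baseline level.
SEASONS = ("summer", "autumn", "winter")


def design_row(season, rain):
    """Return the 8-entry design row for one day.

    Column order: intercept, RAIN, SUMMER, AUTUMN, WINTER, S.R, A.R, W.R.
    """
    if season != "spring" and season not in SEASONS:
        raise ValueError(f"unknown season: {season!r}")
    r = 1 if rain else 0
    dummies = [1 if season == s else 0 for s in SEASONS]
    interactions = [d * r for d in dummies]
    return [1, r] + dummies + interactions


print(design_row("spring", False))
print(design_row("autumn", True))
```

Stacking these rows for all 365 days gives the $365 \times 8$ matrix $X$, which matches the 8 regression parameters and the $n - 8 = 357$ residual degrees of freedom.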
(b) [Seen Method] (2B marks) If there is no difference in Salmonella cases by season, then the regression
parameter for every term that involves season in some way must be 0. Hence our null hypothesis needs to be H0 : β2 = β3 = β4 = β5 = β6 = β7 = 0.
(3D marks) Using the least squares estimator $\hat\beta$ we can compute the F statistic of the form
$$Q = \frac{(A\hat\beta)^T (A(X^T X)^{-1} A^T)^{-1} A\hat\beta}{6\hat\sigma^2},$$
where $A$ contains the bottom 6 rows of an 8-dimensional identity matrix. In this case, we can use the F distribution with 6 and $n - 8 = 357$ degrees of freedom, rejecting $H_0$ if $Q > F_{1-\alpha,6,357}$. The same statistic arises by considering
$$\frac{(RSS_0 - RSS)/6}{RSS/357},$$
where $RSS = 357\hat\sigma^2$ is computed from the full model and $RSS_0$ is computed from a simple linear regression of Y on RAIN.
(c) [Seen Method] (2B marks) If there is to be no interaction between rain and season, then the regression parameter for every interaction term must be 0. Hence our null hypothesis needs to be H0 : β5 = β6 = β7 = 0.
(3D marks) This time the quadratic form is
$$Q = \frac{(A\hat\beta)^T (A(X^T X)^{-1} A^T)^{-1} A\hat\beta}{3\hat\sigma^2},$$
where $A$ contains the bottom 3 rows of an 8-dimensional identity matrix. In this case, we can use the F distribution with 3 and $n - 8 = 357$ degrees of freedom, rejecting $H_0$ if $Q > F_{1-\alpha,3,357}$. The same statistic arises by considering
$$\frac{(RSS_0 - RSS)/3}{RSS/357},$$
where $RSS = 357\hat\sigma^2$ is computed from the full model and $RSS_0$ is computed from a linear regression of Y on RAIN, SUMMER, AUTUMN, and WINTER.
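(d) [Unseen] (4B marks) Note that $\epsilon_1 = \delta_1$ so that $\epsilon_2 = \rho\delta_1 + \delta_2$ and $\epsilon_3 = \rho^2\delta_1 + \rho\delta_2 + \delta_3$. We rewrite this in matrix form as
$$\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ \rho & 1 & 0 \\ \rho^2 & \rho & 1 \end{pmatrix} \begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \end{pmatrix} =: AZ,$$
where $Z \sim N_3(0, \Sigma = \sigma^2 I_{3\times 3})$. By linearity properties of the multivariate normal distribution, we have
$$\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{pmatrix} \sim N_3(0, A\Sigma A^T) = N_3\left(0,\ \sigma^2 \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 + \rho^2 & \rho + \rho^3 \\ \rho^2 & \rho + \rho^3 & 1 + \rho^2 + \rho^4 \end{pmatrix}\right).$$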
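Because the errors satisfy $\epsilon = AZ$ with $A$ lower triangular and $Z \sim N_3(0, \sigma^2 I)$, their covariance matrix is $\sigma^2 A A^T$. The sketch below computes this product directly for a chosen $\rho$ and $\sigma^2$; the function name and the numeric inputs are illustrative assumptions.

```python
def ar1_error_covariance(rho, sigma2, k=3):
    """Covariance of (eps_1, ..., eps_k) when eps_i = rho * eps_{i-1} + delta_i, eps_0 = 0.

    Builds the lower-triangular matrix A with entries rho^(i-j) for j <= i
    and returns sigma2 * A A^T as a list of lists.
    """
    A = [[rho ** (i - j) if j <= i else 0.0 for j in range(k)] for i in range(k)]
    return [
        [sigma2 * sum(A[i][m] * A[j][m] for m in range(k)) for j in range(k)]
        for i in range(k)
    ]


# Illustrative values (assumptions): rho = 0.5, sigma^2 = 1.
for row in ar1_error_covariance(0.5, 1.0):
    print(row)
```

The diagonal entries grow with the index, which reflects that starting from $\epsilon_0 = 0$ the errors are not stationary over the first few days.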
2022-05-23