
INCOMPLETE DATA ANALYSIS

Assignment 2

• Location for submission: Gradescope over Learn. Important: When uploading your report to Gradescope please tag separately each subquestion (e.g. 1a), 1b), 1c), etc).

• This assignment is worth 40% of your final grade for the course.

• Assignments should be typed (LaTeX, Word, etc.).

• Answers to questions should be in full sentences and should provide all necessary details.

• Any output (e.g., graphs, tables) from R that you use to answer questions must be included with the assignment. Also, please include your R code in the assignment (screenshots of the

R console are not allowed) or make it available in a public repository (e.g., GitHub).

• The assignment is out of 100 marks.

1. Suppose X and Y are independent, Pareto-distributed, with cumulative distributions given by

FX(x; λ) = 1 − x^(−λ),    FY(y; µ) = 1 − y^(−µ),

with x, y ≥ 1 and λ, µ > 0. Let Z = min{X, Y} and define the (non-)censoring indicator

δ = 1 if X ≤ Y,    δ = 0 if X > Y.

(This type of censoring, where the censoring variable is itself random, is often known as "random censoring.")

(a) (10 marks) Obtain the density function of Z (fZ) and the frequency function of δ (fδ). What are the distributions of Z and δ?

(b) (5 marks) Let Z1, . . ., Zn be a random sample from fZ(z; θ), with θ = λ + µ, and let δ1, . . ., δn be a random sample from fδ(d; p), with p = λ/(λ + µ). Derive the maximum likelihood estimators of θ and p.

(c) (8 marks) Appealing to the asymptotic normality of the maximum likelihood estimator, provide a 95% confidence interval for θ and for p.
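As a hedged sanity check of parts (a)–(c) (not part of the required solution), the distributional claims and the Wald intervals can be verified on simulated data. The values of λ and µ below are arbitrary illustrative choices:

```r
# Illustrative check: Z = min(X, Y) should behave like a Pareto variable
# with index theta = lambda + mu, and delta like a Bernoulli(p) variable
# with p = lambda / (lambda + mu), as stated in part (b).
set.seed(1)
n <- 10000; lambda <- 2; mu <- 3
x <- runif(n)^(-1 / lambda)   # inverse-CDF draws from FX
y <- runif(n)^(-1 / mu)       # inverse-CDF draws from FY
z <- pmin(x, y)
delta <- as.numeric(x <= y)

theta_hat <- n / sum(log(z))  # MLE of the Pareto index on [1, Inf)
p_hat <- mean(delta)          # MLE of a Bernoulli probability

# 95% Wald intervals based on the asymptotic normality of the MLE
ci_theta <- theta_hat * (1 + c(-1, 1) * 1.96 / sqrt(n))
ci_p <- p_hat + c(-1, 1) * 1.96 * sqrt(p_hat * (1 - p_hat) / n)
```

With these settings, theta_hat should concentrate near λ + µ = 5 and p_hat near λ/(λ + µ) = 0.4, matching the parameterisation given in (b).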

2. Suppose that Yi ∼ N(µ, σ²), for i = 1, . . ., n. Further suppose that observations are (left) censored if Yi < D, for some known D, and let

Xi = Yi if Yi ≥ D,    Xi = D if Yi < D,

Ri = 1 if Yi ≥ D,    Ri = 0 if Yi < D.

Left-censored data commonly arise when measurement instruments are inaccurate below a lower limit of detection; in such cases, the detection limit itself is reported.

(a) (6 marks) Show that the log-likelihood of the observed data {(xi, ri)} is given by

log L(µ, σ² | x, r) = Σ_{i=1}^{n} { ri log φ(xi; µ, σ²) + (1 − ri) log Φ(xi; µ, σ²) },

where φ(·; µ, σ²) and Φ(·; µ, σ²) stand, respectively, for the density function and cumulative distribution function of the normal distribution with mean µ and variance σ².

(b) (6 marks) Determine the maximum likelihood estimate of µ based on the data available in the file dataex2.Rdata. Consider σ² known and equal to 1.5². Note: You can use a built-in function such as optim or the maxLik package in your implementation.
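For illustration only, here is a minimal sketch of this kind of optim-based fit on simulated data; the data in dataex2.Rdata would replace y_sim, and the values of D and the true mean below are made up:

```r
# Sketch: maximising the left-censored normal log-likelihood from part (a)
# with optim, assuming sigma is known. All numeric values are illustrative.
set.seed(1)
D <- 4; sigma <- 1.5
y_sim <- rnorm(500, mean = 5, sd = sigma)   # stand-in for the real data
x <- ifelse(y_sim >= D, y_sim, D)           # censored observations
r <- as.numeric(y_sim >= D)                 # 1 = observed, 0 = censored

negloglik <- function(mu) {
  -sum(r * dnorm(x, mu, sigma, log = TRUE) +
         (1 - r) * pnorm(x, mu, sigma, log.p = TRUE))
}

# Brent's method is convenient for a one-dimensional parameter
fit <- optim(mean(x), negloglik, method = "Brent", lower = -10, upper = 20)
mu_hat <- fit$par
```

Note the use of log = TRUE and log.p = TRUE, which evaluate the log-density and log-CDF directly and are numerically safer than taking log() of the result.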

3. Consider a bivariate normal sample (Y1, Y2) with parameters θ = (µ1, µ2, σ1², σ12, σ2²). The variable Y1 is fully observed, while some values of Y2 are missing. Let R be the missingness indicator, taking the value 1 for observed values and 0 for missing values. For each of the following missing data mechanisms, state, with justification, whether it is ignorable for likelihood-based estimation.

(a) (5 marks) logit{Pr(R = 0 | y1, y2, θ, ψ)} = ψ0 + ψ1 y1, with ψ = (ψ0, ψ1) distinct from θ.

(b) (5 marks) logit{Pr(R = 0 | y1, y2, θ, ψ)} = ψ0 + ψ1 y2, with ψ = (ψ0, ψ1) distinct from θ.

(c) (5 marks) logit{Pr(R = 0 | y1 ,y2 ,θ,ψ)} = 0.5(µ1 + ψy1 ), scalar ψ distinct from θ .

4. (25 marks) Suppose that

Yi ∼ Bernoulli{pi(β)},

pi(β) = exp(β0 + xi β1) / {1 + exp(β0 + xi β1)},

for i = 1, . . ., n and β = (β0, β1)'. Although the covariate x is fully observed, the response variable Y has missing values. Assuming ignorability, derive and implement an EM algorithm to compute the maximum likelihood estimate of β based on the data available in the file dataex4.Rdata. Note: 1) For simplicity, and without loss of generality because we have a univariate pattern of missingness, when writing down your expressions you can assume that the first m values of Y are observed and the remaining n − m are missing. 2) You can use a built-in function such as optim or the maxLik package for the M-step.
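The overall EM structure (E-step replacing each missing Yi by its conditional expectation pi(β), M-step maximising the expected complete-data log-likelihood) can be sketched as follows. This is a hedged illustration on simulated data, not the required solution: dataex4.Rdata would supply x and y, and all names and values here are made up.

```r
# Sketch of EM for a logistic model with missing responses (ignorable case).
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1 - 1 * x))   # illustrative true beta = (1, -1)
y[sample(n, 150)] <- NA                # univariate missingness in Y

em_logistic <- function(x, y, beta = c(0, 0), tol = 1e-8, maxit = 500) {
  miss <- is.na(y)
  for (it in seq_len(maxit)) {
    # E-step: E[Y_i | x_i, beta] = p_i(beta) for the missing responses
    ey <- y
    ey[miss] <- plogis(beta[1] + beta[2] * x[miss])
    # M-step: maximise Q(beta) = sum_i { ey_i * eta_i - log(1 + exp(eta_i)) }
    negQ <- function(b) {
      eta <- b[1] + b[2] * x
      -sum(ey * eta - log1p(exp(eta)))
    }
    beta_new <- optim(beta, negQ)$par
    if (sum(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}

beta_hat <- em_logistic(x, y)
```

Because the missingness pattern is univariate and ignorable, each EM iteration only needs the scalar expectations pi(β); the M-step is an ordinary weighted logistic fit, here handed to optim as the question suggests.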

5. Consider a random sample Y1 , . . . ,Yn from the mixture distribution with cumulative distri- bution function

F(y) = pFX (y;λ) + (1 − p)FY (y;µ),

where FX(x; λ) = 1 − x^(−λ), FY(y; µ) = 1 − y^(−µ), with x, y ≥ 1 and λ, µ > 0.

(a) (13 marks) Let θ = (p,λ,µ). Derive the EM algorithm to find the updating equations for θ (t+1)  = (p(t+1),λ (t+1),µ (t+1)).

(b) (12 marks) Using the dataset dataex5.Rdata, implement the algorithm and find the maximum likelihood estimates for each component of θ. As starting values, consider θ^(0) = (p^(0), λ^(0), µ^(0)) = (0.3, 0.3, 0.4), and as stopping criterion use

|p^(t+1) − p^(t)| + |λ^(t+1) − λ^(t)| + |µ^(t+1) − µ^(t)| < 0.0001.

Draw the histogram of the data with the estimated density superimposed. Hint: Use the Freedman–Diaconis rule for selecting the number of breaks in the histogram.
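The plotting step can be sketched as below. This is an illustration on simulated mixture data with made-up parameter values; in the actual solution, y would come from dataex5.Rdata and the density would use the fitted θ̂ rather than the true values:

```r
# Sketch: histogram with Freedman-Diaconis breaks plus the mixture density.
set.seed(3)
n <- 1000; p <- 0.3; lambda <- 2; mu <- 1   # illustrative values only
comp <- rbinom(n, 1, p)
y <- ifelse(comp == 1,
            runif(n)^(-1 / lambda),   # inverse-CDF Pareto(lambda) draw
            runif(n)^(-1 / mu))       # inverse-CDF Pareto(mu) draw

# Mixture density: p * f_X(y; lambda) + (1 - p) * f_Y(y; mu)
dens <- function(y, p, lambda, mu)
  p * lambda * y^(-lambda - 1) + (1 - p) * mu * y^(-mu - 1)

# breaks = "FD" asks hist() to apply the Freedman-Diaconis rule
hist(y, breaks = "FD", freq = FALSE,
     main = "Data with estimated mixture density", xlab = "y")
curve(dens(x, p, lambda, mu), from = 1, to = max(y), add = TRUE, lwd = 2)
```

The base-R hist() function accepts breaks = "FD" directly, so the Freedman–Diaconis rule does not need to be coded by hand.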