STAT6030 GENERALISED LINEAR MODELLING

Assignment 2

2023 Summer Session

Instructions

• This assignment is worth 55 marks in total and 25% of your overall marks for this course. The assignment is compulsory and must be submitted by 5pm on Monday 6 March 2023.

• You must write your answers to this assignment individually and by yourself. If you copy someone else's work or allow your work to be copied, you will receive a mark of zero for the assignment and risk severe academic consequences.

• Your answers should be individually submitted through Turnitin on Wattle as a single pdf/Word document (less than 50MB) including the following:

1. The assignment Cover Sheet (available on Wattle).

2. Your answers (no more than 10 pages including graphs, summaries, tables, etc... but not Appendix and Cover Sheet, and respecting the other requirements for each part).

3. An Appendix including all the R commands you used (no page limit).

• Assignments should be typed and not handwritten. Your assignment may include some carefully edited R output (e.g., graphs, summaries, tables, etc...) and appropriate discussion of these results, as well as some selected R commands. Please be selective about what you present and only include as many pages and as much R output as necessary to justify your solution. Clearly label each part and question of your assignment and appendix with the corresponding numbers.

• Unless otherwise advised, use a significance level of 5%.

• Round numeric answers to 4 decimal places (e.g., 0.00115 is rounded to 0.0012).

• Marks will be deducted if these instructions are not strictly respected, especially when the total report is of an unreasonable length, i.e., more than the above page limit. The Appendix will generally not be marked, but it will be checked if what you have written or done needs clarification.

• Name your submission “CourseCode_Uid”, e.g., “STAT6030_u1234567”.

• Try to submit your assignment at least 30 minutes before the deadline in case something unexpected happens, for instance an internet connection problem.

Late submissions will NOT be accepted. Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but must receive lecturer's approval at least 24 hours before the deadline.


Part 1 [16 Marks]


Please provide your answers to the following questions and include short working out if there is any. There is a limit of 3 pages on your answers for Part 1.

(a) [1 mark] What is the definition of canonical link function in the context of generalised linear models?

(b) [1 mark] Explain in words and/or by drawing a plot when a link function of a generalised linear model is valid.

(c) [1 mark] In the context of generalised linear models, does the value of the maximised log-likelihood for the saturated model depend on the choice of link function and why?

(d) [1 mark] The mean of a generalised linear model is known to lie between 1 and 2 whatever the value of the linear predictor $\eta_i$ is, i.e. $1 < \mu_i < 2$. Let $\Phi$ denote the cumulative distribution function of the standard normal distribution N(0,1) and $\Phi^{-1}$ denote the inverse function of $\Phi$. Which function below is an appropriate link function in this setting? Notes: (i) precisely one answer below is correct and the other ones are incorrect; (ii) an incorrect answer scores zero while the correct answer scores full marks for the question.

A. n = g(" where g(伉)=収〃《-1).

B. ni = g(" where g(伉)=(〃《/2).

C. ni = g(〃i), where g(〃《)="(伉-1).

D. n = g(〃i), where g(〃《)="(伉/2).

(e) [1 mark] The gamma distribution has probability density function

$$f(y; \alpha, \beta) = \{\beta^{\alpha}/\Gamma(\alpha)\}\, y^{\alpha - 1} \exp(-\beta y),$$

where $y > 0$, $\alpha > 0$ is a shape parameter, $\beta > 0$ is a rate parameter and $\Gamma(\cdot)$ is the gamma function. You may assume that

(i) the mean of the gamma distribution is given by $\mu = \alpha/\beta$;

(ii) the gamma distribution is a generalised linear model with dispersion parameter $\phi = 1/\alpha$, in the notation of equation (4.1) of Topic 4.

What is the canonical link function when the generalised linear model is gamma?

(f) [3 marks] The geometric distribution has probability mass function $f(y; p) = (1 - p)p^{y}$, for $y = 0, 1, \ldots$, where $0 < p < 1$. What are the canonical link function and variance function of the geometric distribution?

The deviance residual for observation $i$ is given by $\mathrm{sign}(y_i - \hat{\mu}_i)\sqrt{d_i^2}$, where

$$d_i^2 = \frac{2}{\phi}\left[ y_i\{h(y_i) - h(\hat{\mu}_i)\} - \{b(h(y_i)) - b(h(\hat{\mu}_i))\} \right]$$

is the deviance associated with observation $i$, which is written as a function of the response variable $y_i$ and of the fitted value $\hat{\mu}_i$, while sign(·) is the sign function defined in the lecture notes. Also recall that $b'^{-1}(\mu) = h(\mu)$. What is the expression for $d_i^2$, as a function of $y_i$ and $\hat{\mu}_i$, when the generalised linear model is geometric? Please simplify your expression as much as you can.

(g) [1 mark] Consider a generalised linear model with linear predictor $\eta_i = U_i + x_i^{\top}\beta$, where $U_i$ is an offset, $x_i$ is a vector of covariates of length $p$ and $\beta$ is a parameter vector of length $p$ to be estimated. Assuming that the model's dispersion parameter $\phi = 1$ is known, how many free parameters (i.e., parameters to estimate) are there in this model?

(h) [1 mark] A logistic regression model was fitted to a dataset consisting of a binary outcome variable, $y_i$, taking values 0 and 1, and a single numerical covariate $x_i$. The estimated intercept and slope on the linear predictor scale were found to be -0.47 and 1.3, respectively, so that the linear predictor as a function of $x_i$ is given by

$$\eta(x_i) = -0.47 + 1.3\,x_i.$$

Recall the estimated probability $\mathrm{Prob}[y_i = 1 \mid x_i]$ is given by

$$\mathrm{Prob}[y_i = 1 \mid x_i] = \exp\{\eta(x_i)\}/[1 + \exp\{\eta(x_i)\}]$$

and so the estimated probability $\mathrm{Prob}[y_i = 0 \mid x_i]$ is given by $1 - \mathrm{Prob}[y_i = 1 \mid x_i]$. What is the value of $x_i$ such that the odds of the event $y_i = 1$ is 0.75? Recall that the odds of an event that occurs with probability $\pi$ is given by $\pi/(1 - \pi)$.
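The relationship between the linear predictor, the fitted probability and the odds recalled in (h) can be checked numerically. The following is a minimal R sketch using the coefficients quoted in the question; the covariate value `x = 1` is an arbitrary illustration and is not taken from the assignment.

```r
# Coefficients quoted in question (h)
b0 <- -0.47
b1 <- 1.3

# Arbitrary covariate value, assumed purely for illustration
x <- 1

eta  <- b0 + b1 * x        # linear predictor eta(x)
prob <- plogis(eta)        # exp(eta) / (1 + exp(eta))
odds <- prob / (1 - prob)  # odds of the event y = 1

print(c(eta = eta, prob = prob, odds = odds))
```

`plogis()` is base R's logistic distribution function, so no extra packages are needed.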

(i) [2 marks] Consider a distribution with the probability density function

$$f(y; \mu) = \{1/(2\pi y^{3})\}^{1/2} \exp[-(y - \mu)^{2}/(2\mu^{2} y)],$$

where $\mu$ is the mean of the distribution and $y > 0$. What is the variance function, $V(\mu)$, of this distribution?

(j) [1 mark] The following output from a linear regression model fit in R was obtained. Calculate the value for ++++ that the R program would give if the sample size is 10.

Call:
lm(formula = y ~ x)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.08888    0.66793   0.133    0.897
x            1.06903    0.10765    ????     ++++
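As a check on how the columns of such a coefficient table relate to each other, the (Intercept) row can be reproduced from its own estimate and standard error. This sketch assumes the usual `lm()` convention that the t value is the estimate divided by its standard error and that the p-value is two-sided on n - 2 residual degrees of freedom; the variable names are illustrative.

```r
# Values read from the (Intercept) row of the output
est <- 0.08888
se  <- 0.66793

n  <- 10       # sample size stated in the question
df <- n - 2    # residual df: an intercept and one slope are estimated

t_val <- est / se                  # t statistic
p_val <- 2 * pt(-abs(t_val), df)   # two-sided p-value

print(round(c(t = t_val, p = p_val), 3))
```

The result should agree with the printed intercept row up to rounding of the inputs, and the same two lines apply to any other row of the table.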

(k) [1 mark] Suppose we fit a Poisson regression model A with log link to a dataset whose response variable is a count. No offset is included. In the fitted model we have included a covariate $x$ and the estimated coefficient of $x$ is $\hat{\beta}_A$. Suppose that we then decide to fit a second model B which is the same as model A but with $x$ included as an offset as well as included in the linear predictor as before. Suppose the estimated coefficient of $x$ is $\hat{\beta}_B$ in model B. Which of the following statements about the second fitted model is correct?

Notes: (i) precisely one answer below is correct and the other ones are incorrect; (ii) an incorrect answer scores zero while the correct answer scores full marks for the question.

A. $\hat{\beta}_B = \hat{\beta}_A - 1$ and the residual deviance of model B will (usually) change compared to that of model A.

B. $\hat{\beta}_B = \hat{\beta}_A - 1$ and the residual deviance of model B will not change compared to that of model A.

C. $\hat{\beta}_B = \hat{\beta}_A + 1$ and the residual deviance of model B will (usually) change compared to that of model A.

D. $\hat{\beta}_B = \hat{\beta}_A + 1$ and the residual deviance of model B will not change compared to that of model A.

(l) [2 marks] Suppose we have fitted a Poisson log-linear regression with extra-Poisson variation and the estimate of the dispersion parameter $\phi$ is greater than 1. If the standard Poisson model was used in this situation, would this be likely to be a case of underdispersion or overdispersion, and which assumption about the relationship between the mean and variance of the Poisson distribution would fail? What would happen to the estimates of the $\beta$ parameters for the standard Poisson model?
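One common way to examine extra-Poisson variation in R is to fit the same model with the `poisson` and `quasipoisson` families and compare the two. The sketch below uses simulated data (the covariate, sample size and negative binomial generator are assumptions made only for illustration) and shows that the dispersion reported by the quasi-Poisson summary is the Pearson statistic divided by the residual degrees of freedom.

```r
set.seed(1)  # simulated data, for illustration only

n  <- 200
x  <- runif(n)
mu <- exp(0.5 + 1.2 * x)
y  <- rnbinom(n, size = 2, mu = mu)  # counts generated with variance above the mean

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# Pearson-based dispersion estimate computed from the standard Poisson fit
phi_hat <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

print(phi_hat)
print(summary(fit_quasi)$dispersion)

# Standard errors from both fits, side by side
print(cbind(poisson = coef(summary(fit_pois))[, "Std. Error"],
            quasi   = coef(summary(fit_quasi))[, "Std. Error"]))
```

Printing `coef(fit_pois)` and `coef(fit_quasi)` as well lets the two sets of coefficient estimates be compared directly.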



Part 2 [12 Marks]


Different doses of two chemicals, A and B, were used in a trial whose purpose was to reduce cockroach numbers. The variable x1 gives the dose of chemical A and the variable x2 gives the dose of chemical B. In the R code below, the first column of c gives the number of cockroaches killed and the second column of c gives the number of cockroaches that survived. The following R outputs were obtained:

> out=glm(c~x1+x2,family=binomial)

> summary(out)

Call:

glm(formula = c ~ x1 + x2,





Please provide your answers to the following questions and include short working out if there is any. There is a limit of 2 pages on your answers for Part 2.

(a) [1 mark] What type of generalised linear model is being fitted here and what link function is being used?

(b) [5 marks] Determine the missing information indicated by the letters A, B, C, D, E, F, G, H, J and K. Note that for E you are required to specify the link function.

(c) [2 marks] Write down the relevant model in mathematical form, focusing on the contribution of observation i to the model.

(d) [2 marks] Briefly indicate your impressions of the results of the statistical analysis provided above.

(e) [2 marks] What are the next questions you would investigate in the statistical analysis? State what your next two steps would be.
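For reference, R's `glm()` accepts a binomial response as a two-column matrix of successes and failures, which is how the matrix `c` is used in the call shown in Part 2. The sketch below uses invented doses and counts (none of these numbers come from the trial) and names the matrix `counts` to avoid masking the base function `c()`.

```r
# Invented data, for illustration only
x1 <- c(0, 1, 2, 0, 1, 2)             # dose of chemical A
x2 <- c(0, 0, 0, 1, 1, 1)             # dose of chemical B
killed   <- c(2, 6, 11, 5, 9, 14)     # successes
survived <- c(18, 14, 9, 15, 11, 6)   # failures

counts <- cbind(killed, survived)     # column 1 = successes, column 2 = failures

fit <- glm(counts ~ x1 + x2, family = binomial)

print(coef(fit))
print(c(residual_df = df.residual(fit), null_df = fit$df.null))
```

With six dose combinations and three estimated coefficients, the degrees of freedom printed at the end follow directly from the counts of observations and parameters.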



Part 3 [12 Marks]

The presence of sprouted or diseased kernels in wheat can reduce the value of a wheat producer's entire crop. It is important to identify these kernels after being harvested but prior to sale. To facilitate this identification process, automated systems have been developed to separate healthy kernels from the rest. Improving these systems requires a better understanding of the measurable ways in which healthy kernels differ from kernels that have sprouted prematurely or are infected with a fungus. To this end, Martin et al. (1998) conducted a study examining numerous physical properties of kernels - density, hardness, size, weight, and moisture - measured on a sample of wheat kernels from two different classes of wheat, hard red winter (hrw) and soft red winter (srw) (represented by the categorical variable class) in the wheat.csv dataset on Wattle. Each kernel's condition was also classified as “Healthy”, “Partly Diseased” and “Diseased” by human visual inspection (represented by the categorical variable type2).

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 3 pages on your answers for Part 3.

Throughout the following questions, treat type2 as the response variable.

Suppose that we have conducted the following R analysis and obtained the R output below:

> summary(fit1)

Call:

multinom(formula = type2 ~ class + density + hardness + size + weight + moisture, data = wheat)

Coefficients:
                (Intercept)   classhrw   density     hardness       size
Healthy           -29.89783 -0.6480912 21.596892  0.015904714 -1.0691104
Partly Diseased   -10.95451 -0.4233621  6.480466 -0.005114335 -0.1935257
                   weight   moisture
Healthy         0.2896462 -0.1095777
Partly Diseased 0.2423342 -0.1525636
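A baseline-category logit fit of this kind turns linear predictors into class probabilities by exponentiating and normalising, with the baseline category's linear predictor fixed at zero. The sketch below assumes the baseline is "Diseased" (the level with no coefficient row) and uses the coefficients printed in the summary, reading the two moisture entries as -0.1095777 and -0.1525636; the kernel's covariate values are invented for illustration.

```r
# Coefficients as printed in the summary (rows: non-baseline categories)
beta <- rbind(
  "Healthy"         = c(-29.89783, -0.6480912, 21.596892,  0.015904714,
                        -1.0691104, 0.2896462, -0.1095777),
  "Partly Diseased" = c(-10.95451, -0.4233621,  6.480466, -0.005114335,
                        -0.1935257, 0.2423342, -0.1525636)
)
colnames(beta) <- c("(Intercept)", "classhrw", "density", "hardness",
                    "size", "weight", "moisture")

# Hypothetical hrw kernel with invented measurements (intercept term first)
x <- c(1, 1, 1.2, 30, 2.5, 30, 12)

eta   <- c("Diseased" = 0, drop(beta %*% x))  # baseline linear predictor is 0
probs <- exp(eta) / sum(exp(eta))             # estimated class probabilities

print(round(probs, 4))
```

For a fitted `multinom` object, `predict(fit1, newdata, type = "probs")` from the nnet package performs the same calculation.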