STAT3500/STAT7500 Assignment 4: Generalized Linear Models

Published: 2023-10-17

Due Date: 10 November 2023

Weighting: 25%

Instructions

• The assignment consists of 4 (four) problems. Each problem is worth 25 marks, and each mark is equally weighted.

• The mathematical elements of the assignment can be completed by hand, in LaTeX (preferably), or in Word (or other typesetting software). The mathematical derivations and manipulations should be accompanied by clear explanations in English of the information needed to interpret the mathematical exposition.

• Computation problems should be answered using programs in the R language.

• Computer generated plots and hand drawn graphs should be included together with the text where problems are answered.

• Submission files should include the following (whichever applies to you):

  - Scans of handwritten mathematical exposition.

  - Typeset mathematical exposition, outputted as a PDF file.

  - Typeset answers to computational problems, outputted as a PDF file.

  - Program code/scripts that you wish to submit, outputted as a txt file.

• Mathematical problems should be answered with reference to the Main Text (refer to page numbers), Remarks, Exercises, and Corollaries/Lemmas/Propositions/Theorems from the Lecture Notes, if required. If a mathematical result is used that is not presented in the Lecture Notes, then its common name (e.g., Bayes' Theorem, Intermediate Value Theorem, Borel-Cantelli Lemma, etc.) should be cited, or else a reference to a text containing the result should be provided (preferably a textbook).

• All submission files should be labeled with your name and student number, archived together in a zip file, and submitted at the TurnItIn link on Blackboard. We suggest naming using the convention:

[LastName_FirstName/StudentNumber]_STAT3500A4_[AnythingElse].[FileExtension].

• As per my.uq.edu.au/information-and-services/manage-my-program/student-integrity-and-conduct/academic-integrity-and-student-conduct, what you submit should be your own work. Even where working from sources, you should endeavour to write in your own words. You should use consistent notation throughout your assignment and define whatever is required.

Problem 1 [25 Marks]

The age_at_mar data set from the openintro R package contains $n = 5534$ replicates $Y_1, Y_2, \dots, Y_n$ of a random variable $Y$, describing the age (in years) of first marriage for a woman in the United States, surveyed between 2006 and 2010.

We can consider that $Y : \Omega \to \mathbb{Z}_{\ge 0}$ is a count and thus can be modeled using a distribution on $\mathbb{Z}_{\ge 0}$.

(a) Fit a Poisson, geometric, and negative binomial model to the data $Y_1, Y_2, \dots, Y_n$. Report the maximum likelihood parameter estimates $\hat{\theta}_n$ (for whatever the distribution's parameter $\theta$ may be) as well as asymptotic 90% confidence intervals for the maximum expected log-likelihood parameter $\theta^*$, for each of the three models. You may assume the model is correctly specified in each case. [5 Marks]
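As a rough illustration of part (a), the sketch below fits the three count models by maximum likelihood with `MASS::fitdistr` and forms Wald-type 90% intervals from the reported asymptotic standard errors. It runs on simulated counts rather than the age_at_mar data, so `y`, the seed, and the simulation settings are assumptions made only for this example, and the interval construction should be checked against the one given in the Lecture Notes.

```r
library(MASS)  # fitdistr() ships with the recommended MASS package

set.seed(1)
# Simulated stand-in for the ages (assumption); swap in the real sample.
y <- rnbinom(500, size = 20, mu = 23)

# Maximum likelihood fits; each returns $estimate and an asymptotic $sd.
fits <- list(
  poisson   = fitdistr(y, "Poisson"),
  geometric = fitdistr(y, "geometric"),
  nbinomial = fitdistr(y, "negative binomial")
)

# Wald-type 90% interval: estimate +/- z_{0.95} * standard error.
z <- qnorm(0.95)
wald_ci <- lapply(fits, function(f) {
  cbind(estimate = f$estimate,
        lower = f$estimate - z * f$sd,
        upper = f$estimate + z * f$sd)
})
print(wald_ci)
```

The same call pattern should carry over to the continuous families, since `fitdistr` also accepts "exponential", "gamma", and "lognormal".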

(b) For each k ∈ M = {Poisson, geometric, n. binomial}, compute AIC(k, n) and BIC(k, n). Using the computed information criteria, make a decision regarding which of the three models you would prefer and provide reasoning regarding your choice. [5 Marks]
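A minimal sketch of the comparison in part (b), assuming the common definitions AIC = -2 log-likelihood + 2d and BIC = -2 log-likelihood + d log(n), with d the number of free parameters. The Lecture Notes may scale these criteria differently, and the data here are simulated (an assumption), not the age_at_mar sample.

```r
library(MASS)

set.seed(1)
y <- rnbinom(500, size = 20, mu = 23)  # simulated stand-in (assumption)
n <- length(y)

fits <- list(
  poisson   = fitdistr(y, "Poisson"),
  geometric = fitdistr(y, "geometric"),
  nbinomial = fitdistr(y, "negative binomial")
)

# One row per model: maximized log-likelihood, parameter count, AIC, BIC.
ic <- t(sapply(fits, function(f) {
  d <- length(f$estimate)
  c(loglik = f$loglik, d = d,
    AIC = -2 * f$loglik + 2 * d,
    BIC = -2 * f$loglik + log(n) * d)
}))
print(ic)

# Under each criterion the preferred model has the smallest value.
print(rownames(ic)[apply(ic[, c("AIC", "BIC")], 2, which.min)])
```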

As an alternative, we can suppose that $Y : \Omega \to \mathbb{R}_{> 0}$, since the age reported as integers can be considered as imprecise reporting of an underlying continuous age variable.

(c) Fit an exponential, gamma, and log-normal model to the data $Y_1, Y_2, \dots, Y_n$. Report the maximum likelihood parameter estimates $\hat{\theta}_n$ (for whatever the distribution's parameter $\theta$ may be) as well as asymptotic 90% confidence intervals for the maximum expected (limiting) log-likelihood parameter $\theta^*$ for each of the three models. You may assume the model is correctly specified in each case. [5 Marks]

(d) For each k ∈ M = {exp., gamma, l. normal}, compute AIC(k, n) and BIC(k, n). Using the computed information criteria, make a decision regarding which of the three models you would prefer and provide reasoning regarding your choice. [5 Marks]

(e) Treating age as either discrete or continuous, there is a possibility that our assumption about age is wrong, whether it be regarding its type or the model that we suggest to model it with. Under this possibility of misspecification, provide alternative 90% confidence intervals for your chosen models from parts (b) and (d). [5 Marks]
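One common way to obtain intervals that remain valid under misspecification is a sandwich variance estimate, and the sketch below works it out by hand for the Poisson rate. Whether this is the construction intended by the Lecture Notes is an assumption, and the overdispersed counts are simulated purely for illustration.

```r
set.seed(2)
# Overdispersed counts, so the Poisson model is misspecified (assumption for the demo).
y <- rnbinom(500, size = 5, mu = 23)
n <- length(y)
lambda_hat <- mean(y)  # Poisson maximum likelihood estimate

# Per-observation score and negative Hessian of the Poisson log-likelihood.
score <- y / lambda_hat - 1
A <- mean(y) / lambda_hat^2   # average of y / lambda^2 at lambda_hat
B <- mean(score^2)            # average squared score

se_model    <- sqrt(1 / (n * A))     # valid only under correct specification
se_sandwich <- sqrt(B / (A^2 * n))   # A^{-1} B A^{-1} / n

z <- qnorm(0.95)
print(rbind(model    = lambda_hat + c(-1, 1) * z * se_model,
            sandwich = lambda_hat + c(-1, 1) * z * se_sandwich))
```

For the Poisson rate the sandwich standard error reduces to the sample standard deviation divided by the square root of n, which is why the robust interval is wider here.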

Problem 2 [25 Marks]

The gss2010 data set from the openintro R package contains $N = 2044$ replicates of pairs $(X_1, Y_1), (X_2, Y_2), \dots$ of random variables: the response grass, $Y : \Omega \to \{0, 1\} = \{\text{Legal}, \text{Not Legal}\}$, describing whether the surveyed individual believes that marijuana should be legalized, as well as covariates

$$X^\top = (X_1, X_2, X_3, X_4) = (\text{hrsrelax}, \text{mentlhlth}, \text{hrs}, \text{degree}),$$

describing the average number of leisure hours, mental health status (as number of days in the last 30 where mental health was not good), number of work hours, and educational attainment of the surveyed individual, respectively. We note that of all the $N$ pairs, there are only $n = 690$ pairs of covariates and responses that do not contain missing data. We shall only make use of these $n$ complete observations.

(a) Provide expressions for the conditional probability mass function of {Y |X = x} and the mean function

E[Y |X = x]

that is hypothesized when modeling the data via a logistic GLM, with Y regressed on all available covariates.

[5 Marks]

(b) Fit a logistic GLM to the data and report the parameter estimate $\hat{\theta}_n$ as well as individual asymptotic 90% confidence intervals for each of the elements of the maximum expected log-likelihood parameter $\theta^* = (\beta_0^*, \beta_1^*, \dots, \beta_4^*)^\top$. [5 Marks]
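The fit in part (b) can be sketched with `glm` and `confint.default`, which returns Wald intervals one coefficient at a time. The data frame `dat`, its columns, and the true coefficients are simulated assumptions and do not come from gss2010.

```r
set.seed(3)
n <- 400
# Simulated covariates and binary response (assumptions; not the gss2010 data).
dat <- data.frame(x1 = rnorm(n), x2 = rpois(n, 3))
dat$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * dat$x1 - 0.2 * dat$x2))

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)

# confint.default() gives estimate +/- z * SE for each coefficient separately.
ci90 <- confint.default(fit, level = 0.90)
print(cbind(estimate = coef(fit), ci90))
```

Note that `confint()` on a `glm` object uses profile likelihood instead, so the two functions generally give slightly different intervals.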

(c) Using the fitted GLM from Part (b), test the null hypothesis that the attitude towards marijuana is only dependent on the covariate corresponding to individual educational attainment, at the $\alpha = 0.01$ level. Provide a simultaneous asymptotic 99% confidence set for the parameter vector $\theta^*$ of this null model. [5 Marks]

(d) Using the Akaike information and Bayesian information criteria, comment on whether the null model from Part (c) is a good fit to the data. [5 Marks]
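Parts (c) and (d) both compare a full logistic model with a nested null model. The sketch below shows one possible route, a likelihood ratio test followed by AIC and BIC, on simulated data in which the response depends on a single factor `g`. The variable names are hypothetical, and the Lecture Notes may prefer a Wald or score test.

```r
set.seed(4)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  g = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
eta <- -0.3 + 0.7 * (dat$g == "b") + 1.2 * (dat$g == "c")  # depends on g only
dat$y <- rbinom(n, 1, plogis(eta))

fit_full <- glm(y ~ x1 + x2 + g, family = binomial, data = dat)
fit_null <- glm(y ~ g, family = binomial, data = dat)

# Likelihood ratio statistic: twice the gain in maximized log-likelihood,
# compared with a chi-square on the number of dropped coefficients.
lr <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null)))
df <- length(coef(fit_full)) - length(coef(fit_null))
p_value <- pchisq(lr, df = df, lower.tail = FALSE)
print(c(LR = lr, df = df, p_value = p_value))

# Smaller values are preferred under both criteria.
print(data.frame(row.names = c("null", "full"),
                 AIC = AIC(fit_null, fit_full)$AIC,
                 BIC = BIC(fit_null, fit_full)$BIC))
```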

(e) Again, using the Akaike information and Bayesian information criteria, assess whether the null model, with logistic mean function

$$\mu(x) = \frac{1}{1 + \exp(-\theta^\top \tilde{x})},$$

where $\tilde{x}^\top = (1, x^\top)$, is the most appropriate among the available choices of models in R. Provide an expression for the conditional expectation

$$\mathrm{E}[Y \mid X_4 = x_4]$$

corresponding to the best model choice and provide an interpretation of the estimated parameters $\hat{\theta}_n$ of the best model. [5 Marks]

Problem 3 [25 Marks]

The Mroz data set from the carData R package contains $n = 752$ pairs $(X_1, Y_1), (X_2, Y_2), \dots$ comprising data regarding married women's participation in the labor force in the USA. The pairs are each replicates of $(X, Y)$, where $Y : \Omega \to \mathbb{R}_{\ge 0}$ is the family's income, exclusive of the woman's income. The covariates of interest are

$$X^\top = (X_1, X_2, X_3, X_4, X_5, X_6),$$

corresponding to the labor force participation ($X_1$; No or Yes), the number of children under 5 ($X_2$; integer), the number of children between 6 and 18 ($X_3$; integer), the woman's age ($X_4$; in years), whether the woman attended college ($X_5$; No or Yes), and whether the woman's husband attended college ($X_6$; No or Yes).

(a) Fit a gamma GLM to the data using the mean model

$$\mu(x) = \exp(\theta^\top \tilde{x}),$$

and report parameter estimates for each of the regression parameters, as well as individual asymptotic 90% confidence intervals. Additionally, test the null hypothesis that $Y$ is independent of the covariates $X$, versus the alternative, that $Y$ is dependent on some covariate in $X$. [5 Marks]
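A gamma GLM with an exponential mean function corresponds to `family = Gamma(link = "log")` in R. The sketch below fits one on simulated data (an assumption; the Mroz variables are not used) and carries out a Wald test that all slopes are zero, which is only one of several tests that could serve for the independence hypothesis.

```r
set.seed(5)
n <- 300
dat <- data.frame(x1 = rbinom(n, 1, 0.5), x2 = rnorm(n))
mu <- exp(2 + 0.4 * dat$x1 - 0.3 * dat$x2)
dat$y <- rgamma(n, shape = 4, rate = 4 / mu)  # gamma response with mean mu (simulated)

# Log link, so the fitted mean is exp(linear predictor).
fit <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = dat)

print(confint.default(fit, level = 0.90))  # individual Wald intervals

# Wald statistic b' V^{-1} b for the slopes, referred to a chi-square
# distribution with as many degrees of freedom as there are slopes.
b <- coef(fit)[-1]
V <- vcov(fit)[-1, -1]
wald <- as.numeric(t(b) %*% solve(V) %*% b)
p_value <- pchisq(wald, df = length(b), lower.tail = FALSE)
print(c(wald = wald, df = length(b), p_value = p_value))
```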

(b) Use the Akaike information and Bayesian information criteria to select subsets of covariates that are most relevant in explaining the relationship between $Y$ and $X$. Provide the parameter estimates for each of the two best models (according to the AIC and BIC), and provide asymptotic 90% simultaneous confidence sets for the maximum expected log-likelihood parameter $\theta^*$ for the two best models. [5 Marks]
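Subset selection by information criterion can be sketched with `step`, whose argument `k` sets the per-parameter penalty: `k = 2` gives the AIC and `k = log(n)` gives the BIC. The example is a backward stepwise search on simulated data (both assumptions), which is not guaranteed to find the best subset; with six covariates an exhaustive search over all subsets is also feasible.

```r
set.seed(6)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
mu <- exp(1 + 0.5 * dat$x1 - 0.4 * dat$x2)   # x3 and x4 are irrelevant by design
dat$y <- rgamma(n, shape = 3, rate = 3 / mu)

full <- glm(y ~ x1 + x2 + x3 + x4, family = Gamma(link = "log"), data = dat)

# Backward elimination under each penalty; trace = 0 suppresses the step log.
best_aic <- step(full, direction = "backward", k = 2,      trace = 0)
best_bic <- step(full, direction = "backward", k = log(n), trace = 0)

print(formula(best_aic))
print(formula(best_bic))
```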

(c) Use results from Section 2.3.2 of the lecture notes to devise a pair of penalty functions $\mathrm{pen}(n, k)$ for an information criterion

$$\mathrm{IC}(k, n) = -2 \ell_{k,n}\left( \hat{\theta}_{k,n}; \mathcal{D}_n \right) + \mathrm{pen}(n, k),$$

such that, for models $\mathcal{M} = \{1, 2, \dots, c\}$, the selection rule

$$\hat{k}_n = \arg\min_{k \in \mathcal{M}} \mathrm{IC}(k, n)$$

selects the most parsimonious model, as per the main result of Section 2.3.2.1. One of your penalty functions should satisfy ICC1 and the other should satisfy ICC2. You must provide sufficient justification as to why the parsimonious selection result holds for your two penalties, and you may not use a penalty function that has already been suggested (i.e., the AIC, BIC, TIC, HQIC, or any of the SWIC examples). [10 Marks]

(d) Repeat Part (b) but instead of selecting using the AIC and BIC, use your two criteria devised in Part (c). [5 Marks]

Problem 4 [25 Marks]

The Arrests data set from the carData R package contains $n = 5226$ pairs $(X_1, Y_1), (X_2, Y_2), \dots$ pertaining to individuals arrested for small-quantity possession of marijuana in Toronto, Canada. The pairs are each replicates of $(X, Y)$, where (release) $Y : \Omega \to \{0, 1\} = \{\text{No}, \text{Yes}\}$ describes whether the arrestee was released with a summons, as well as covariates

$$X^\top = (X_1, X_2, X_3, X_4, X_5, X_6, X_7),$$

corresponding to the color of the arrestee ($X_1$; Black or White), the year of the arrest ($X_2$), the age of the arrestee ($X_3$; in years), the sex of the arrestee ($X_4$; Female or Male), the employment status ($X_5$; No or Yes), the citizenship status ($X_6$; No or Yes), and the number of police databases that the arrestee appeared in ($X_7$; number between 0 and 6).

(a) Fit a logistic GLM to the data and provide an expression for the estimated mean function

$$\hat{\mu}_n(x) = \mu\left( \hat{\theta}_n^\top \tilde{x} \right),$$

where $\tilde{x}^\top = (1, x^\top)$ and $\theta^\top = (\beta_0, \beta^\top)$, and interpret the function. Furthermore, test the null hypothesis that $Y$ is independent of the covariate vector $X$, versus the alternative hypothesis that there is some covariate upon which $Y$ depends. [5 Marks]

(b) Using the Akaike information or Bayesian information criteria, choose a subset of covariates that provide a good fit to the data. [5 Marks]

ing to the solutions of the LASSO and elastic net problems:

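$$\hat{\theta}_{\lambda,n}^{\mathrm{Lasso}} = \arg\max_{\theta} \; \ell_n(\theta; \mathcal{D}_n) - \lambda \sum_{j=1}^{7} |\beta_j|$$

and

$$\hat{\theta}_{\lambda,n}^{\mathrm{E.Net}} = \arg\max_{\theta} \; \ell_n(\theta; \mathcal{D}_n) - \lambda \left\{ \sum_{j=1}^{7} |\beta_j| + \sum_{j=1}^{7} |\beta_j|^2 \right\},$$

with respect to some sequence of penalty constants $\lambda > 0$. [5 Marks]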
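Penalized logistic fits over a sequence of λ values can be sketched with the glmnet package, assuming it is installed. Two caveats apply: glmnet scales the log-likelihood by 1/n and standardizes covariates by default, and it writes the elastic net penalty with a mixing parameter `alpha`, so its λ is not on the same scale as the λ in the problems above. The data are simulated, and `alpha = 1/3` is an illustrative choice that weights the absolute and squared terms equally.

```r
library(glmnet)  # assumed to be installed; not part of base R

set.seed(7)
n <- 500
x <- matrix(rnorm(n * 7), n, 7, dimnames = list(NULL, paste0("x", 1:7)))
y <- rbinom(n, 1, plogis(0.5 + 1.0 * x[, 1] - 0.8 * x[, 2]))  # simulated (assumption)

# alpha = 1 is the LASSO penalty. glmnet's elastic net penalty is
# lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))),
# so alpha = 1/3 puts equal weight on the two sums.
fit_lasso <- glmnet(x, y, family = "binomial", alpha = 1)
fit_enet  <- glmnet(x, y, family = "binomial", alpha = 1 / 3)

# Coefficient paths over the lambda sequence chosen by glmnet.
plot(fit_lasso, xvar = "lambda")
plot(fit_enet,  xvar = "lambda")

# Number of nonzero slopes at the first few lambda values of the LASSO path.
print(head(cbind(lambda = fit_lasso$lambda, nonzero = fit_lasso$df)))
```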

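(d) By means of an information criterion or any other method, select optimal values of the constants $\lambda$ corresponding to the two penalties, and justify your choices. Provide the expressions for the estimated mean functions

$$\hat{\mu}_{\lambda,n}^{\mathrm{Lasso}}(x) = \mu\left( \left( \hat{\theta}_{\lambda,n}^{\mathrm{Lasso}} \right)^\top \tilde{x} \right)$$

and

$$\hat{\mu}_{\lambda,n}^{\mathrm{E.Net}}(x) = \mu\left( \left( \hat{\theta}_{\lambda,n}^{\mathrm{E.Net}} \right)^\top \tilde{x} \right).$$

[5 Marks]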

(e) Use the refitting method to estimate models based on the optimal choices made via the LASSO and elastic net penalizations. Since the refitted models are likely to be misspecified, use sandwich estimators to produce inference regarding the significance of the overall models, i.e., test whether $Y$ is correlated with any of the covariates selected by the LASSO and elastic net, and compare and comment on the differences between the estimated parameters after refitting and those obtained in Part (d). [5 Marks]
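The refit-then-test idea in part (e) can be sketched as follows: refit an unpenalized logistic GLM on the selected covariates, build the sandwich covariance by hand, and use it in a Wald test that all slopes are zero. The data are simulated, and `x1` and `x2` merely stand in for whatever covariates a penalized fit retains (both assumptions). The sandwich package offers ready-made estimators that could replace the manual computation.

```r
set.seed(8)
n <- 500
# Simulated data; x1 and x2 play the role of the selected covariates (assumption).
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.3 + 0.9 * dat$x1 - 0.6 * dat$x2))

refit <- glm(y ~ x1 + x2, family = binomial, data = dat)  # unpenalized refit

X  <- model.matrix(refit)
mu <- fitted(refit)
A  <- crossprod(X * sqrt(mu * (1 - mu)))   # sum of negative Hessians, X' W X
B  <- crossprod(X * (dat$y - mu))          # sum of outer products of the scores
V  <- solve(A) %*% B %*% solve(A)          # sandwich covariance of the estimates

# Robust Wald test that every slope is zero.
b <- coef(refit)[-1]
wald <- as.numeric(t(b) %*% solve(V[-1, -1]) %*% b)
p_value <- pchisq(wald, df = length(b), lower.tail = FALSE)
print(c(wald = wald, df = length(b), p_value = p_value))
```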