STAT3500/STAT7500 Assignment 3—Linear Regression

Published: 2023-10-16

Due Date: 9th October 2023, 5 PM.

Weighting: 25%

Instructions

• The assignment consists of 4 (four) problems. Each problem is worth 25 marks, and each mark is equally weighted.

• The mathematical elements of the assignment can be completed by hand, in LaTeX (preferably), or in Word (or other typesetting software). The mathematical derivations and manipulations should be accompanied by clear explanations in English regarding necessary information required to interpret the mathematical exposition.

• Computation problems should be answered using programs in the R language.

• Computer generated plots and hand drawn graphs should be included together with the text where problems are answered.

• Submission files should include the following (whichever applies to you):

– Scans of handwritten mathematical exposition.

– Typeset mathematical exposition, outputted as a pdf file.

– Typeset answers to computational problems, outputted as a pdf file.

– Program code/scripts that you wish to submit, outputted as a txt file.

• Mathematical problems should be answered with reference to results presented in the Course Notes (refer to page numbers), if required. If a mathematical result is used that is not presented in the Lecture Notes, then its common name (e.g., “Bayes' Theorem”, “Intermediate Value Theorem”, “Borel-Cantelli Lemma”, etc.) should be cited, or else a reference to a text containing the result should be provided (preferably a textbook).

• All submission files should be labeled with your name and student number, archived together in a zip file, and submitted at the TurnItIn link on Blackboard. We suggest naming using the convention:

[LastName-FirstName/StudentNumber]-STAT3500A3-[AnythingElse].[FileExtension].

• As per https://my.uq.edu.au/information-and-services/manage-my-program/student-integrity-and-conduct/academic-integrity-and-student-conduct, what you submit should be your own work. Even where working from sources, you should endeavor to write in your own words. You should use consistent notation throughout your assignment and define whatever is required.

Problem 1 [25 Marks]

Let $X$ be a random variable where $X \in \{0, 1, \dots, K\}$ for some constant $K \in \mathbb{N}$. Using $X$, we define

$$X^{(k)} = \llbracket X = k \rrbracket = \begin{cases} 1 & \text{if } X = k, \\ 0 & \text{if } X \neq k, \end{cases}$$

and together, we call $(X^{(1)}, \dots, X^{(K)})$ the dummy variable representation of $X$. Treating $X$ as a categorical variable, we can model the relationship between $Y$ and $X$ via the regression relationship

$$\mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x^{(1)} + \dots + \beta_K x^{(K)}, \tag{1}$$

where $x$ is a realization of $X$, and

$$x^{(k)} = \llbracket x = k \rrbracket$$

for each $k \in [K]$.

(a) Using (1), provide expressions for the expected value of $Y$, given $X = x$: $\mathbb{E}[Y \mid X = x]$, for each $x \in \{0, 1, \dots, K\}$, in terms of $\beta_0$ and $\beta^\top = (\beta_1, \dots, \beta_K)$. Using the expectation expressions $\mathbb{E}[Y \mid X = x]$, provide an interpretation of the parameter $\beta_0$ and the elements $\beta_1, \beta_2, \dots, \beta_K$ of $\beta$, in regards to the relationship between $Y$ and $X$.

[5 Marks]

The chickwts dataset from the package datasets in R provides data from an experiment on the weight gain of chickens given different feed types. Here we can consider weight, in grams, as the response $Y$ and the feed type as the random categorical variable $X \in \{0, 1, \dots, K\}$, $K = 5$. We can assume that the data consists of realizations $(x_1, y_1), \dots, (x_n, y_n)$, where $(x_i, y_i)$ is a realization of the random pair $(X_i, Y_i)$, IID for each $i \in [n]$, where each $(X_i, Y_i)$ has the same DGP as $(X, Y)$, for $n = 71$.
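As a minimal sketch of the dummy variable representation in R, the snippet below builds the indicator columns for the feed type with `model.matrix()`. It assumes that the first factor level of `chickwts$feed` is taken as the baseline category $X = 0$; that labelling is a choice made here for illustration and is not fixed by the assignment.

```r
# Sketch: dummy variable representation of the feed type in chickwts.
# chickwts ships with base R (package datasets).
data(chickwts)

# Factor levels of feed; the first level is treated as the baseline X = 0
# (an assumption made for this illustration).
feed_levels <- levels(chickwts$feed)

# model.matrix() returns an intercept column plus one 0/1 indicator column
# per non-baseline level, i.e. the columns x^(1), ..., x^(K).
design <- model.matrix(~ feed, data = chickwts)

dim(design)       # number of observations by K + 1 columns
colnames(design)  # "(Intercept)" followed by one column per non-baseline feed
head(design)
```

Each row has at most one indicator equal to 1; a row of all zeros in the indicator columns corresponds to the baseline feed.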

(b) Under the assumption that the model

$$\{Y \mid X = x\} = \beta_0^* + \beta_1^* x^{(1)} + \dots + \beta_K^* x^{(K)} + E,$$

where $E \sim \mathrm{N}(0, \sigma^{*2})$, is correctly specified, for some $\theta^{*\top} = (\beta_0^*, \beta^{*\top}, \sigma^{*2})$, estimate the parameter $\theta^*$, and test the null hypothesis $H_0$: $Y$ is independent of $X$, versus the alternative hypothesis $H_1$: $Y$ and $X$ are dependent, at the $\alpha = 0.01$ significance level, using a $\chi^2$ test. For each $k \in \{1, \dots, 5\}$, provide an interpretation of the estimate $\hat{\beta}_{n,k}$ of $\beta_k^*$.

[5 Marks]

(c) Suppose instead that

$$\{Y \mid X = x\} = \beta_0^* + \beta_1^* x^{(1)} + \dots + \beta_K^* x^{(K)} + E,$$

where $E \sim \mathrm{N}(0, \sigma^{*2})$, is a misspecified model for some $\theta^{*\top} = (\beta_0^*, \beta^{*\top}, \sigma^{*2})$. Estimate the parameter $\theta^*$, and again test the null hypothesis $H_0$: $Y$ is independent of $X$, versus the alternative hypothesis $H_1$: $Y$ and $X$ are dependent, at the $\alpha = 0.01$ significance level.

[5 Marks]

Consider now that $X^\top = (U, V)$, where $U$ is a categorical random variable on $\{0, \dots, K\}$ and $V$ is a real random variable. Then, we can model the relationship between $Y$ and $X$ as

$$\mathbb{E}[Y \mid X = x] = \beta_0 + \sum_{k=1}^{K} \beta_k u^{(k)} + \tau v + \sum_{k=1}^{K} \delta_k u^{(k)} v, \tag{2}$$

where $x^\top = (u, v)$ is a realization of $X$ and $u^{(k)} = \llbracket u = k \rrbracket$, for each $k \in [K]$.
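To illustrate how a model of the form (2) maps onto an R formula, here is a small sketch on simulated data. All variable names and the true coefficient values are hypothetical and chosen only for the example; the point is the correspondence between the fitted coefficient names and the intercept, category offsets, common slope, and category-specific slope adjustments.

```r
# Sketch on simulated data (hypothetical values, K = 2 categories besides baseline).
set.seed(1)
n <- 200
u <- factor(sample(0:2, n, replace = TRUE))  # categorical covariate
v <- rnorm(n)                                # real covariate

# Simulated response with category offsets, a common slope in v,
# and a slope adjustment for category 1 only.
y <- 1 + 0.5 * (u == "1") - 0.3 * (u == "2") + 2 * v +
  0.8 * (u == "1") * v + rnorm(n, sd = 0.5)

# u * v expands to u + v + u:v, i.e. main effects plus the interaction.
fit <- lm(y ~ u * v)

# Coefficient names: intercept, one offset per non-baseline category,
# the slope in v, and one slope adjustment per non-baseline category.
names(coef(fit))
```

In this sketch the coefficients named `u1:v` and `u2:v` play the role of the interaction terms multiplying $u^{(k)} v$.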

(d) Using (2), provide expressions for the expected value of $Y$, given $U = u$ and $V = v$: $\mathbb{E}[Y \mid X = (u, v)^\top]$, for each $u \in \{0, 1, \dots, K\}$, in terms of

$$\theta^\top = (\beta_0, \beta_1, \dots, \beta_K, \tau, \delta_1, \dots, \delta_K).$$

Provide an interpretation of the parameter $\delta_k$, for each $k \in [K]$.

[5 Marks]

The babies dataset from the package openintro in R provides data regarding the birth weights of babies born in the San Francisco Bay area between 1960 and 1967. Here we can consider the birth weight, in ounces, as the response $Y$, the smoke label (whether the mother is a smoker) $U \in \{0, 1\}$ as a categorical variable, and the gestation time $V$, in days, as a real random variable.

We can assume that the set of data $(x_i, y_i)$, for $i \in [n]$, is a realization of an IID random sample $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$ (with $X_i^\top = (U_i, V_i)$), where each $(X_i, Y_i)$ has the same DGP as $(X^\top, Y)$, for $n = 1236$.

(e) Under the assumption that

$$\{Y \mid X = x\} = \beta_0^* + \beta_1^* u^{(1)} + \tau^* v + \delta_1^* u^{(1)} v + E,$$

where $E \sim \mathrm{N}(0, \sigma^{*2})$, is misspecified, for some $\theta^{*\top} = (\beta_0^*, \beta_1^*, \tau^*, \delta_1^*, \sigma^{*2})$, estimate the parameter $\theta^*$, and test the null hypothesis $H_0$: $Y$ is independent of $U$, versus the alternative hypothesis $H_1$: $Y$ and $U$ are dependent, at the $\alpha = 0.01$ significance level. Further, provide an estimate of the formula for the average weight of a baby, as a function of gestation time, when the mother is a smoker.

[5 Marks]

Problem 2 [25 Marks]

The GAGurine dataset from the package MASS in R provides data regarding the concentration of a chemical (GAG) in the urine of children. Here we can let the GAG concentration be the response $Y$ and the age of the child, in years, be the real variable $X$. We can assume that the pair $(x_i, y_i)$ is a realization of the random pair $(X_i, Y_i)$, which is IID for $i \in [n]$, and has the same DGP as $(X, Y)$, for $n = 314$.

(a) Make the misspecified assumption that

$$\{Y \mid X = x\} = \beta_0^* + \beta_1^* x + E,$$

where $E \sim \mathrm{N}(0, \sigma^{*2})$, for some $\theta^{*\top} = (\beta_0^*, \beta_1^*, \sigma^{*2})$. Estimate the parameter $\theta^*$, and test the hypothesis $H_0$: there is no decreasing relationship between $X$ and $Y$, versus the alternative hypothesis $H_1$: there is a decreasing relationship between $X$ and $Y$. Provide an interpretation for the estimated parameter $\hat{\beta}_{n,1}$.

[5 Marks]

(b) Clearly the relationship between the GAG concentration and the age of the child is not linear. One method for testing for nonlinearity is to test the null hypothesis

$$H_0: \{Y \mid X = x\} = \beta_0^* + \beta_1^* x + E$$

against a nonlinear model alternative hypothesis, say

$$H_1: \{Y \mid X = x\} = \beta_0^* + \beta_1^* x + \beta_2^* x^2 + \beta_3^* x^3 + \beta_4^* x^4 + E,$$

for some $(\beta_2^*, \beta_3^*, \beta_4^*) \neq (0, 0, 0)$, where we make the misspecified assumption that $E \sim \mathrm{N}(0, \sigma^{*2})$. Conduct the test between $H_0$ and $H_1$ above at the $\alpha = 0.05$ level of significance, and interpret the result of your test.

[10 Marks]
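One possible way to test a group of polynomial coefficients without relying on the normal noise assumption is a Wald test with a sandwich covariance estimate. The sketch below does this on simulated data with base R only. The HC0 form of the sandwich estimator is an assumption made for this example and may differ from the construction in the Lecture Notes; the data and all variable names are hypothetical.

```r
# Sketch on simulated data: Wald test that the x^2, x^3, x^4 coefficients are zero,
# using a heteroscedasticity-robust (HC0 sandwich) covariance estimate.
set.seed(2)
n <- 300
x <- runif(n, 0, 2)
y <- 5 - 0.4 * x + rnorm(n, sd = 1 + 0.2 * x)  # linear mean, non-constant variance

# Fit the larger (alternative) model with raw polynomial terms.
fit <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))
Z <- model.matrix(fit)
r <- residuals(fit)

# Sandwich covariance: (Z'Z)^{-1} [sum_i r_i^2 z_i z_i'] (Z'Z)^{-1}.
bread <- summary(fit)$cov.unscaled  # (Z'Z)^{-1}
meat <- crossprod(Z * r)            # each row of Z is scaled by its residual
V <- bread %*% meat %*% bread

# Wald statistic for the three higher-order coefficients.
idx <- 3:5
b <- coef(fit)[idx]
W <- drop(t(b) %*% solve(V[idx, idx]) %*% b)

# Compare with a chi-squared distribution with 3 degrees of freedom.
p_value <- pchisq(W, df = length(idx), lower.tail = FALSE)
c(W = W, p_value = p_value)
```

Under the null hypothesis the statistic is approximately chi-squared with as many degrees of freedom as restricted coefficients, so a small p-value would point towards the nonlinear alternative.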

(c) State the assumptions that you have made so that your constructed test statistic for part (b) has the necessary regularity to be asymptotically normal.

[5 Marks]

(d) In part (b), we test the null hypothesis of linearity against a nonlinear alternative model, characterized by a polynomial. In principle, we can propose an alternative model using any class of nonlinear functions that includes the linear model as a special case. Propose a variation of the test from part (b) and implement your test. Comment on any differences with the results from part (b).

[5 Marks]

Problem 3 [25 Marks]

The satgpa dataset from the package openintro in R provides data regarding the university entrance scores (SAT) and grade point averages (GPAs) of students at an unnamed American university. Here we can consider the first year GPA as the response $Y$ and the (total) SAT score $U$ and high school GPA $V$ as the covariate $X^\top = (U, V)$. We can assume that the set of data $(x_i, y_i)$, for $i \in [n]$, is a realization of an IID random sample $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$ (with $X_i^\top = (U_i, V_i)$), where each $(X_i, Y_i)$ has the same DGP as $(X^\top, Y)$, for $n = 1000$.

(a) Under misspecification, suppose that

$$\{Y \mid X = x\} = \beta_0^* + \beta_1^* u + \beta_2^* v + E,$$

where $E \sim \mathrm{N}(0, \sigma^{*2})$, for some $\theta^{*\top} = (\beta_0^*, \beta_1^*, \beta_2^*, \sigma^{*2})$, and estimate the parameters $\theta^*$. Provide individually valid asymptotic $100(1 - \alpha)\%$ confidence intervals for $\beta_1^*$ and $\beta_2^*$, at the $1 - \alpha = 0.9$ level.

[5 Marks]

(b) Using the confidence interval constructions from Part (a), produce an asymptotic $100(1 - \alpha)\%$ confidence set $\mathcal{C}_{\alpha,n}$, such that

$$\mathrm{P}\left((\beta_1^*, \beta_2^*) \in \mathcal{C}_{\alpha,n}\right) \ge 1 - \alpha,$$

approximately, at the $1 - \alpha = 0.9$ level.

[5 Marks]

(c) Under the same assumptions as in Part (a), construct an asymptotic $100(1 - \alpha)\%$ confidence ellipse for the parameters $(\beta_1^*, \beta_2^*)$, at the $1 - \alpha = 0.9$ level. That is, produce an elliptical asymptotic confidence set $\mathcal{E}_{\alpha,n}$, such that

$$\mathrm{P}\left((\beta_1^*, \beta_2^*) \in \mathcal{E}_{\alpha,n}\right) \ge 1 - \alpha,$$

approximately, for $1 - \alpha = 0.9$. Plot the obtained confidence ellipse along with the confidence set from Part (b).

[10 Marks]
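The geometry of a confidence ellipse next to a rectangular confidence set can be sketched as follows, on simulated data with base R only. This is an illustration under stated assumptions rather than a prescribed solution: the covariance matrix comes from `vcov()`, which is model-based (a robust estimate could be substituted), the rectangle uses a Bonferroni correction, and all data and variable names are hypothetical.

```r
# Sketch on simulated data: 90% confidence ellipse and Bonferroni rectangle
# for two regression coefficients.
set.seed(3)
n <- 500
u <- rnorm(n)
v <- 0.5 * u + rnorm(n)
y <- 1 + 0.3 * u + 0.6 * v + rnorm(n)

fit <- lm(y ~ u + v)
b <- coef(fit)[c("u", "v")]
V <- vcov(fit)[c("u", "v"), c("u", "v")]  # model-based; swap in a robust estimate if needed
alpha <- 0.1

# Ellipse boundary: points p with (p - b)' V^{-1} (p - b) = qchisq(1 - alpha, 2).
radius <- sqrt(qchisq(1 - alpha, df = 2))
angles <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(angles), sin(angles))
L <- t(chol(V))                             # lower triangular factor, V = L L'
ellipse <- t(b + radius * L %*% circle)     # 200 x 2 matrix of boundary points

# Bonferroni rectangle: two intervals, each at level 1 - alpha / 2.
z <- qnorm(1 - alpha / 4)
se <- sqrt(diag(V))
box <- cbind(lower = b - z * se, upper = b + z * se)

plot(ellipse, type = "l", xlab = "beta_1", ylab = "beta_2",
     xlim = range(ellipse[, 1], box[1, ]), ylim = range(ellipse[, 2], box[2, ]))
rect(box[1, 1], box[2, 1], box[1, 2], box[2, 2], lty = 2)
points(b[1], b[2], pch = 19)
```

The ellipse accounts for the correlation between the two estimates, while the rectangle ignores it, which is why the two sets generally differ in shape and area.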

(d) Under misspecification, test the following null hypothesis, at the $\alpha = 0.1$ level:

$$H_0: \{Y \mid X = x\} = \beta_0^* + \beta_1^* u + \beta_2^* v + E,$$

versus the alternative

$$H_1: \{Y \mid X = x\} = \beta_0^* + \beta_1^* u + \beta_2^* v + \tau^* u v + E,$$

for some $\tau^* \neq 0$, where $\mathbb{E}[E] = 0$ and $\mathrm{var}(E) = \sigma^{*2} < \infty$. Comment on the implications of your test outcomes.

[5 Marks]

Problem 4 [25 Marks]

The gpa_study_hours dataset from the package openintro in R provides data regarding the grade point averages (GPAs) of students at an unnamed American university, along with the number of hours spent studying. Here we can consider the GPA as the response $Y$ and the study hours as the covariate $X$. We can assume that the pair $(x_i, y_i)$ is a realization of the random pair $(X_i, Y_i)$, which is IID for $i \in [n]$, and has the same DGP as $(X, Y)$, for $n = 193$.

(a) Under misspecification, suppose that

$$\{Y \mid X = x\} = \beta_0^* + \beta_1^* x + E,$$

where $E \sim \mathrm{N}(0, \sigma^{*2})$, for some $\theta^{*\top} = (\beta_0^*, \beta_1^*, \sigma^{*2})$, and estimate the parameters $\theta^*$. Interpret the estimate for the coefficient $\beta_1^*$.

[5 Marks]

It is easy to see that the variance of $\{Y \mid X = x\}$ decreases as $x$ increases. We require a modification to our model in part (a), in order to account for and model this variance pattern. A possible alternative model is

$$\{Y \mid X = x\} = \beta_0^* + \beta_1^* x + \sqrt{\exp(\tau_0^* + \tau_1^* x)}\, E,$$

where $E \sim \mathrm{N}(0, 1)$, for some $\theta^{*\top} = (\beta_0^*, \beta_1^*, \tau_0^*, \tau_1^*)$.

(b) Write the conditional probability density function of $\{Y \mid X = x\}$ and interpret the parameters $\tau_0^* \in \mathbb{R}$ and $\tau_1^* \in \mathbb{R}$ in the model above.

[5 Marks]

(c) Using the optim function in R, or by any other means, estimate the parameter $\theta^*$ by maximum (conditional) likelihood estimation.

[10 Marks]
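A minimal sketch of maximum conditional likelihood estimation with `optim()` is shown below on simulated data. It assumes that the conditional variance is the exponential of a linear function of the covariate, so that the noise scale is the square root of that exponential; this reading of the model, the simulated data, the starting values, and all names are assumptions made for illustration.

```r
# Sketch on simulated data: normal regression with log-linear conditional variance.
set.seed(4)
n <- 400
x <- runif(n, 0, 10)
y <- 3 + 0.05 * x + sqrt(exp(-1 - 0.2 * x)) * rnorm(n)

# Negative conditional log-likelihood; theta = (beta0, beta1, tau0, tau1).
negloglik <- function(theta, x, y) {
  mu <- theta[1] + theta[2] * x        # conditional mean
  log_var <- theta[3] + theta[4] * x   # log of the conditional variance
  -sum(dnorm(y, mean = mu, sd = sqrt(exp(log_var)), log = TRUE))
}

# Starting values: least squares fit for the mean, constant variance for the rest.
ols <- lm(y ~ x)
start <- c(coef(ols), log(mean(residuals(ols)^2)), 0)

# Minimise the negative log-likelihood; the Hessian is requested for standard errors.
opt <- optim(start, negloglik, x = x, y = y, method = "BFGS", hessian = TRUE)

opt$convergence                        # 0 means optim reports successful convergence
opt$par                                # estimates of (beta0, beta1, tau0, tau1)
se <- sqrt(diag(solve(opt$hessian)))   # standard errors, valid under correct specification
```

Parameterising the variance on the log scale keeps it positive without constraints, which is why an unconstrained optimiser such as BFGS can be used here.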

(d) Under the assumption of correct specification, at the $\alpha = 0.01$ level of significance, test the null hypothesis $H_0$, that the variance of $\{Y \mid X = x\}$ is not a function of $x$, versus the alternative hypothesis $H_1$, that the variance of $\{Y \mid X = x\}$ is a function of $x$.

[5 Marks]