STAT6030 GENERALISED LINEAR MODELLING

Assignment 1

2023 Summer Session

Instructions

• This assignment is worth 40 marks in total and 15% of your overall marks for this course. The assignment is compulsory and must be submitted by 5pm on Friday 3 February 2023.

• You must write your answers to this assignment individually and by yourself.  If you copy someone else’s work or allow your work to be copied, you will receive a mark of zero for the assignment and risk severe academic consequences.

• Your answers should be individually submitted through Turnitin  on Wattle  as  a single pdf/Word document (less than 50MB) including the following:

1. The assignment Cover Sheet (available on Wattle).

2. Your answers (no more than 12 pages including graphs, summaries, tables, etc... but not Appendix and Cover Sheet, and respecting the other requirements for each part).

3. An Appendix including all the R commands you used (no page limit).

• Assignments should be typed and not handwritten. Your assignment may include some carefully edited R output (e.g., graphs, summaries, tables, etc...) and appropriate discussion of these results, as well as some selected R commands. Please be selective about what you present and only include as many pages and as much R output as necessary to justify your solution.  Clearly label each part and question of your assignment and appendix with the corresponding numbers.

• Unless otherwise advised, use a significance level of 5%.

• Round numeric answers to 4 decimal places (e.g., 0.00115 is rounded to 0.0012).

• Marks will be deducted if these instructions are not strictly respected, especially when the total report is of an unreasonable length, i.e., more than the above page limit. The Appendix will generally not be marked, but it may be checked if what you have written or done needs clarification.

• Name your submission “CourseCode Uid”, e.g., “STAT6030 u1234567”.

• Try to submit your assignment at least 30 minutes before the deadline in case something unexpected happens, for instance an internet connection problem.

• Late submissions will NOT be accepted.  Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but must receive the lecturer’s approval at least 24 hours before the deadline.

Part 1                    [30 Marks]

There is a limit of 10 pages on Part  1.  All R outputs, figures and tables are included in the 10 page limit, but the Cover Sheet and Appendix are not included in the page limit. Any supplementary R code that you include should go in the Appendix. The reason for the 10 page limit is that we want you to think carefully about (i) what to include or exclude and (ii) how to present what you do decide to include.  You should aim to be concise in your presentation and discussion of results. Please make sure to express your main findings in words as well.  It is not sufficient just to present numbers, tables or graphs without any explanation. Marks may be deducted if these instructions are not followed.

You will notice that some questions are open-ended in this part of the assignment.   We provide detailed instructions and yet you will find that a lot of practical decisions need to be made. Giving you questions which are in part open-ended is deliberate and reflects what often happens with data analyses that professional statisticians and data scientists undertake in the real world. Do the best you can to deal with these challenges.

The dataset “housing” available on Wattle contains information regarding the housing values in the suburbs of Boston. The data frame has 506 rows and 14 variables (in order):

crim per capita crime rate by town;

zm proportion of residential land zoned for lots over 25,000 sq ft;

indus proportion of non-retail business acres per town;

chas Charles River dummy variable (= 1 if tract bounds river, 0 otherwise);

nox nitrogen oxides concentration (parts per 10 million);

rm average number of rooms per dwelling (1 = 6 or less rooms, 2 = 7 rooms, 3 = 8 or more rooms);

age proportion of owner-occupied units built prior to 1940.

dis weighted mean of distances to five Boston employment centres;

rad index of accessibility to radial highways;

tax full-value property-tax rate per USD 10,000;

ptratio pupil-teacher ratio by town;

black $1000(B_k - 0.63)^2$ where $B_k$ is the proportion of blacks by town;

lstat percentage of lower status of the population;

medv median value of owner-occupied homes in USD 1000’s.

This is a modified version of the dataset examined in the tutorial of Topic 1 where the variable rm is now categorical.  The dataset can be loaded with a call to the function read_excel() from the R library readxl.
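A minimal sketch of the loading step follows. The file name "housing.xlsx" and the helper load_housing() are assumptions made for illustration; use the name of the file you actually downloaded from Wattle. Storing rm as a factor reflects the note above that rm is categorical in this version of the data.

```r
# Sketch only: "housing.xlsx" is an assumed file name, not given in the assignment.
path <- "housing.xlsx"

# Hypothetical helper: read the spreadsheet and prepare it for base R modelling.
load_housing <- function(path) {
  # read_excel() returns a tibble; convert it to a plain data frame
  dat <- as.data.frame(readxl::read_excel(path))
  # rm is categorical in this modified dataset, so store it as a factor
  dat$rm <- factor(dat$rm)
  dat
}

# Only attempt the read if readxl is installed and the file is present
if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  housing <- load_housing(path)
  str(housing)  # the data frame should have 506 rows and 14 variables
}
```

Checking the output of str() against the variable list above is a quick way to confirm that the columns were read with the expected names and types.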

Please provide answers to the following questions. You are welcome to perform additional exploratory data analysis of the dataset (e.g. by calculating basic summary statistics, obtaining scatterplots and box plots, etc...), if helpful.  In tutorials we have seen useful functions such as summary(), plot(), pairs(), and boxplot().  Some of these functions have useful options which you can explore.  Try Googling these functions to find out more and to find additional useful functions.  When selecting a model, perform a more detailed study of the selected model by looking at residual plots, qqplots and plots involving fitted values. As far as possible you should check the assumptions in the model.  Present selected outputs and summarise your findings in words, without forgetting the 10-page limit.
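The functions mentioned above can be tried out on any data frame. The sketch below uses R's built-in mtcars data purely as a stand-in (it is not the housing data, and the choice of variables is arbitrary) to show the exploratory calls and the standard diagnostic plots for a fitted linear model.

```r
# Illustration on R's built-in mtcars data (a stand-in, not the housing data).
dat <- mtcars[, c("mpg", "wt", "hp", "cyl")]
dat$cyl <- factor(dat$cyl)          # treat cylinder count as categorical

summary(dat)                        # basic summary statistics
pairs(dat[, c("mpg", "wt", "hp")])  # scatterplot matrix of the numeric variables
boxplot(mpg ~ cyl, data = dat)      # response by level of a categorical predictor

# Main-effects linear model with one categorical predictor
fit <- lm(mpg ~ wt + hp + cyl, data = dat)
summary(fit)

# Default diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)
```

The same calls apply to the housing data once it is loaded; only the data frame and variable names change.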

(a)  [2 marks] Produce a scatterplot of medv against nox and add a regression line. Describe the relationship.

(b)  [2 marks] Perform a simple regression of medv against nox. Using the diagnostic plots, comment on the model.

(c)  [4 marks] Fit a multiple regression of medv on the rest of the predictors, using only main linear effects (that is, no interactions or polynomial terms). Briefly explain the process by which you select the model.  Produce the diagnostic plots and regression summary and comment on the model.

(d)  [2 marks] Produce a scatterplot of medv against lstat, and include linear, quadratic and cubic regression lines. Describe the presence of potential non-linear relationships.

(e)  [2 marks] Add a suitable polynomial term to your final model from question (c). Produce the diagnostic plots and compare these against the previous model.  Comment on the suitability of the polynomial term.

(f)  [2 marks] Produce a scatterplot of medv against nox and include the regression lines grouped by the dummy variable chas. Describe the relationship between chas, nox and medv. Comment on whether the interaction between chas and nox needs to be included in the regression model.

(g)  [2 marks] Fit a multiple regression with the variables from question (e), adding an interaction term between rm and lstat and, if you believe it necessary, the potential interaction from question (f). Produce the diagnostic plots, compare these against the previous model, and comment on the suitability of the interaction between rm and lstat.

(h)  [4 marks] Using the model from question (g), remove any unnecessary terms, if present, and include any other factors you think might be appropriate such as interaction and polynomial terms. Comment on the model assumptions. Explain why this model is the most appropriate in your opinion.

(i)  [5 marks] Write down the regression equation for this final model identified in question (h). Explain if the overall regression is significant and provide the coefficient of determination.  Interpret the coefficients of your final model and also compare predictions from the model with relevant intervals.

(j)  [5 marks] Provide a non-technical summary of your main findings aiming to explain what factors appear to be influential in determining the median value of owner-occupied homes, based on the available data and your chosen model. The summary should target a generally well-informed audience that has no knowledge of statistics at a university level, using simple visualisation tools (figures, graphs, tables, etc...)  and quantities such as percentages and averages.  The summary should be not more than one page including any outputs such as plots, figures or tables. If the summary is too technical then there will be a risk of losing marks.
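Several of the questions above rely on the same few R techniques: overlaying polynomial fits on a scatterplot, drawing regression lines by group, testing an interaction, and producing intervals for predictions. The sketch below shows one possible way to do each on simulated data. The data, the variable names x, g and y, and the chosen models are all made up for illustration and are not a solution to any question.

```r
# Simulated data only: x, g and y are hypothetical stand-ins.
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
g <- factor(rbinom(n, 1, 0.3))                  # hypothetical 0/1 dummy variable
y <- 30 - 4 * x + 0.25 * x^2 + 2 * (g == "1") + rnorm(n, sd = 2)
sim <- data.frame(y, x, g)

# Linear, quadratic and cubic fits drawn over one scatterplot
plot(y ~ x, data = sim, col = "grey40")
grid <- data.frame(x = seq(0, 10, length.out = 100))
for (d in 1:3) {
  fit_d <- lm(y ~ poly(x, d), data = sim)
  lines(grid$x, predict(fit_d, newdata = grid), col = d + 1, lwd = 2)
}
legend("topright", legend = c("linear", "quadratic", "cubic"),
       col = 2:4, lwd = 2)

# Separate simple regression lines for each level of the dummy variable
plot(y ~ x, data = sim, col = as.integer(sim$g) + 1)
for (lev in levels(sim$g)) {
  sub <- sim[sim$g == lev, ]
  abline(lm(y ~ x, data = sub), col = which(levels(sim$g) == lev) + 1)
}

# Nested-model F test for the x:g interaction
fit_main <- lm(y ~ x + I(x^2) + g, data = sim)
fit_int  <- lm(y ~ x * g + I(x^2), data = sim)
anova(fit_main, fit_int)

# Confidence intervals (for the mean response) and prediction intervals
# (for a new observation) at two hypothetical covariate settings
new <- data.frame(x = c(2, 8),
                  g = factor(c("0", "1"), levels = levels(sim$g)))
ci <- predict(fit_main, newdata = new, interval = "confidence")
pi <- predict(fit_main, newdata = new, interval = "prediction")
ci
pi
```

The prediction interval is always wider than the confidence interval at the same point, because it also accounts for the error variance of a new observation.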

Part 2                  [10 Marks]

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 2 pages on your answers for Part 2.

(a)  [1 mark] Let $\hat{\beta}_1$ denote an estimator of a parameter $\beta_1$ and $\hat{\beta}_2$ denote an estimator of a parameter $\beta_2$. Suppose that the standard error of $\hat{\beta}_1$ is 1.71, the standard error of $\hat{\beta}_2$ is 1.98 and the estimated covariance between $\hat{\beta}_1$ and $\hat{\beta}_2$ is $-0.28$. What is the standard error of the estimator (or contrast) $3\hat{\beta}_1 - 2\hat{\beta}_2$?

(b)  [2 marks] In the R output from the lm() function given below, age is a continuous covariate. What are the null hypothesis H0  and alternative hypothesis H1  the p-value below refers to? What is the outcome of this test?

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
age    256.9       11.9  21.587   <2e-16 ***

(c)  [1 mark] In the lecture notes of Topic 2 we used the notation SS(A|1). What does the 1 in SS(A|1) refer to?

(d)  [1 mark] Suppose we fit a one-way ANOVA with a factor A where the residual sum of squares is SS(res|A) = 2.5 on 15 degrees of freedom and the total sum of squares is SStot  = 41.9 on 17 degrees of freedom.  What is the mean between-groups sum of squares MS(A|1)?

(e)  [1 mark] A linear model with normal errors was fitted to a dataset. Part of the Analysis of Variance table is shown below.

          Df  Sum Sq Mean Sq F value Pr(>F)
A          3  272.37     +++     +++    +++
Residuals 27 4927.48     +++

What are the F  value and the p-value Pr(>F)?

(f)  [1 mark] In a two-way ANOVA we decide to fit the model A ∗ B where factor A has g levels and factor B has h levels. What are the total degrees of freedom for only the set of interaction terms between A and B?

(g)  [2 marks] Consider a discrete distribution with probability mass function

$$f(y; p) = (y + 1)(1 - p)^2 p^y,$$

where $y = 0, 1, \ldots$ is a non-negative integer, and $f(y; p) = 0$ otherwise. In the above, $p$ is a positive parameter smaller than 1. This probability mass function can be written as a GLM $f(y; \theta, \phi)$. What are the expressions for $\phi$, $\theta$, $b(\theta)$ and $c(y, \phi)$? What is the variance function $V(\mu)$ of this distribution, where $\mu$ is the mean of the distribution?

(h)  [1 mark] Which statement below concerning inverse link functions is correct?   Note that (i) precisely one answer below is correct and the other ones are incorrect; (ii) an incorrect answer scores zero while the correct answer scores full marks for the question.

A. The inverse link function is a map from the linear predictor to the mean.

B. The inverse link function is a map from the canonical parameter to the linear predictor.

C. The inverse link function is a map from the mean to the canonical parameter.

D. The inverse link function is a map from the mean to the linear predictor.
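Some of the Part 2 calculations can be checked numerically in R. The sketch below shows two small helpers, se_lincomb() and anova_f(), which are hypothetical names introduced here for illustration. The inputs are made-up numbers, deliberately different from those in the questions, so the block illustrates the arithmetic and nothing more.

```r
# Standard error of a linear combination a' beta_hat, computed as sqrt(a' V a),
# where V is the estimated covariance matrix of the estimators.
se_lincomb <- function(a, V) {
  sqrt(drop(t(a) %*% V %*% a))
}

# Made-up inputs: standard errors 1 and 2, covariance 0.5, contrast b1 - b2
V <- matrix(c(1^2, 0.5,
              0.5, 2^2), nrow = 2, byrow = TRUE)
se_lincomb(c(1, -1), V)   # sqrt(1 + 4 - 2 * 0.5) = 2

# One-way ANOVA arithmetic: mean squares, F statistic and upper-tail p-value
anova_f <- function(ss_a, df_a, ss_res, df_res) {
  ms_a   <- ss_a / df_a        # mean square for the factor
  ms_res <- ss_res / df_res    # residual mean square
  f      <- ms_a / ms_res
  p      <- pf(f, df_a, df_res, lower.tail = FALSE)
  c(MS_A = ms_a, MS_res = ms_res, F = f, p = p)
}

# Made-up sums of squares: MS_A = 15, MS_res = 5, so F = 3
anova_f(ss_a = 30, df_a = 2, ss_res = 60, df_res = 12)
```

Writing the covariance matrix out explicitly makes the sign of the covariance term in the contrast easy to verify by hand.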