MTH 303

Linear Statistical Models

2023-2024 S1

Task 1 (50 marks)

Researchers collected data from 50 different cities of China to study whether air pollution contributes to mortality. The dependent variable for analysis is age-adjusted mortality (“Mortality”). The data include variables measuring demographic characteristics of the cities, variables measuring climate characteristics, and variables recording the pollution potential of three different air pollutants. Please use R to build regression models and answer the following questions accordingly.

#    Variable     Description
1    City         City ID
2    JanTemp      Mean January temperature (F)
3    JulyTemp     Mean July temperature (F)
4    RelHum       Relative humidity
5    Rain         Annual rainfall (mm)
6    PopDensity   Population density
7    Income       Median income
8    HCPot        HC pollution potential
9    NOxPot       Nitrous Oxide pollution potential
10   SO2Pot       Sulfur Dioxide pollution potential
11   Mortality    Age adjusted mortality

1. Start R. Install the package “readxl”. Load the libraries “Hmisc”, “leaps” and “MASS”. (2 marks)

2. Load the data from “Mortality.csv”. Show part of the dataset. (6 marks)

3. Remove “City” by assigning NULL to it, as it is an identifier column. Show part of the new dataset. (3 marks)

4. Use the new dataset from 3. Build the linear regression model with all variables considered, name it “model_mortality0” and produce a summary of it. (3 marks)

5. Use Cook's distance to detect outliers. Let the benchmark be 4 times the mean Cook's distance. Draw a threshold line in red to mark the benchmark, and label the outliers in your plot. Present detailed information on the outliers using the “head” function. (8 marks)

6. Remove the detected outliers and build a new regression model; name it “model_mortality1”. Produce a summary of it and make appropriate plots to check normality and homoscedasticity. (12 marks)

7. According to the summary of model_mortality1, which variables are significant at the 1% level? (4 marks)

8. Conduct model selection using the stepwise method. Then use anova to conduct further selection, starting from the final model given by stepwise selection. Produce a summary of your final model. (6 marks)

9. Comment on the best model in terms of R squared, adjusted R squared and significant variables, based on its summary. (6 marks)
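
Illustrative sketch (not a model answer). The workflow in questions 4 to 9 can be sketched on R's built-in mtcars data, which stands in here for “Mortality.csv”; that substitution is an assumption made only so the code runs on its own. The names dat, model_full, model_clean, model_step and model_reduced are hypothetical and are not the names required above. Reading "further selection using anova" as an F-test between nested models is one plausible interpretation, not the only one.

```r
# Sketch only: mtcars replaces Mortality.csv (assumption), mpg is the response.
library(MASS)  # provides stepAIC()

dat <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# Full linear model using every other column as a predictor.
model_full <- lm(mpg ~ ., data = dat)

# Cook's distance, with the benchmark set to 4 times its mean.
cooks_d     <- cooks.distance(model_full)
benchmark   <- 4 * mean(cooks_d)
outlier_idx <- which(cooks_d > benchmark)

# Plot the distances, draw the red threshold line, label flagged points.
plot(cooks_d, type = "h", ylab = "Cook's distance",
     main = "Cook's distance, benchmark = 4 x mean")
abline(h = benchmark, col = "red")
if (length(outlier_idx) > 0) {
  text(x = outlier_idx, y = cooks_d[outlier_idx],
       labels = names(outlier_idx), pos = 3, col = "red")
}
head(dat[outlier_idx, ])  # details of the flagged rows

# Refit without the flagged rows (guard against the no-outlier case).
dat_clean   <- if (length(outlier_idx) > 0) dat[-outlier_idx, ] else dat
model_clean <- lm(mpg ~ ., data = dat_clean)
summary(model_clean)

# Residuals vs fitted (homoscedasticity) and normal Q-Q (normality).
par(mfrow = c(1, 2))
plot(model_clean, which = 1)
plot(model_clean, which = 2)
par(mfrow = c(1, 1))
shapiro.test(residuals(model_clean))  # optional formal normality check

# Stepwise selection by AIC, in both directions.
model_step <- stepAIC(model_clean, direction = "both", trace = FALSE)

# Further selection: drop the term with the largest p-value in the
# single-term deletion table and compare the nested models with an F-test.
d1      <- drop1(model_step, test = "F")
weakest <- rownames(d1)[which.max(d1[["Pr(>F)"]])]
model_reduced <- update(model_step, as.formula(paste(". ~ . -", weakest)))
anova(model_reduced, model_step)

# R squared and adjusted R squared of the stepwise model.
summary(model_step)$r.squared
summary(model_step)$adj.r.squared
```

A small p-value in the anova table suggests the dropped term should stay; a large one suggests the reduced model is adequate. For the assignment itself, the same steps would be applied to the mortality data with the model names requested in the questions.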