STAT 4052 Lab 2
2022-09-19
STAT 4052 LAB II
Linear Regression
In this lab, we will go over an example of a multiple linear regression model with both continuous and categorical covariates, with special focus on assessing model assumptions. We will use the CPS1985 data set from the AER package. Load the data as follows:
library(AER)
data("CPS1985")
summary(CPS1985)
##       wage          education       experience         age       
##  Min.   : 1.000   Min.   : 2.00   Min.   : 0.00   Min.   :18.00  
##  1st Qu.: 5.250   1st Qu.:12.00   1st Qu.: 8.00   1st Qu.:28.00  
##  Median : 7.780   Median :12.00   Median :15.00   Median :35.00  
##  Mean   : 9.024   Mean   :13.02   Mean   :17.82   Mean   :36.83  
##  3rd Qu.:11.250   3rd Qu.:15.00   3rd Qu.:26.00   3rd Qu.:44.00  
##  Max.   :44.500   Max.   :18.00   Max.   :55.00   Max.   :64.00  
##     ethnicity    region      gender          occupation           sector   
##  cauc    :440   south:156   male  :289   worker    :156   manufacturing: 99  
##  hispanic: 27   other:378   female:245   technical :105   construction : 24  
##  other   : 67                            services  : 83   other        :411  
##                                          office    : 97                     
##                                          sales     : 38                     
##                                          management: 55                     
##  union     married  
##  no :438   no :184  
##  yes: 96   yes:350  
Let us first fit a linear regression model with the variable wage as the response and education, experience, gender and occupation as covariates.
m1<-lm(wage~education+experience+gender+occupation,data=CPS1985)
summary(m1)
##
## Call:
## lm(formula = wage ~ education + experience + gender + occupation,
## data = CPS1985)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
##  -11.214   -2.694   -0.467    2.062   35.186 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.27036    1.33024  -0.955  0.34002    
## education             0.71670    0.09905   7.235 1.66e-12 ***
## experience            0.10298    0.01651   6.238 9.11e-10 ***
## genderfemale         -1.96488    0.41629  -4.720 3.03e-06 ***
## occupationtechnical   1.39132    0.69419   2.004  0.04556 *  
## occupationservices   -1.42705    0.61205  -2.332  0.02010 *  
## occupationoffice     -0.85042    0.63211  -1.345  0.17909    
## occupationsales      -1.62564    0.81085  -2.005  0.04549 *  
## occupationmanagement  2.41271    0.75216   3.208  0.00142 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.327 on 525 degrees of freedom
## Multiple R-squared:  0.3017, Adjusted R-squared:  0.291 
## F-statistic: 28.35 on 8 and 525 DF,  p-value: < 2.2e-16
Here, notice that the lm function creates dummy variables for factor covariates automatically.
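By default, the first level of a factor is used as the baseline. If you want the dummies measured against a different baseline, you can re-level the factor before fitting; a quick sketch (the object name m1b is chosen here for illustration):

```r
# Make "management" the reference level of occupation; the occupation
# coefficients in the refitted model are then differences from management.
CPS1985$occupation <- relevel(CPS1985$occupation, ref = "management")
m1b <- lm(wage ~ education + experience + gender + occupation, data = CPS1985)
coef(m1b)
```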
Coefficient Interpretations
The coefficient of a continuous covariate is the expected change in wage when that covariate increases by 1 unit, holding the other covariates fixed. For example, the expected wage per hour increases by $0.72 for each additional year of education.
The coefficient of a dummy variable is the expected difference in wage between the class encoded by that dummy and the baseline class. For example, the expected wage per hour for males is $1.96 higher than that for females, and the expected wage per hour for the worker occupation class (the baseline) is $1.39 less than that for the technical class.
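These interpretations can be read off directly from the fitted coefficient vector; a small sketch:

```r
# Expected change in hourly wage per extra year of education,
# holding the other covariates fixed:
coef(m1)["education"]            # about 0.72

# Expected gap between the technical class and the worker baseline:
coef(m1)["occupationtechnical"]  # about 1.39
```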
Dummy variables and fastDummies
Note that education and experience are continuous variables, gender is a categorical variable with 2 classes, and occupation is a categorical variable with 6 classes. Hence, the model has one dummy variable for gender and 5 dummy variables for occupation. Knowing how to construct these dummies yourself is useful because some statistical learning tools are coded in a way that does not construct them automatically.
# install.packages("fastDummies")
library(fastDummies)
dat <- dummy_cols(CPS1985, select_columns = c("gender", "occupation"))
# Alternatively, extract the design matrix (0/1 dummies plus an intercept
# column) directly from the fitted model.
dat_mat <- model.matrix(m1)
head(dat_mat)
##      (Intercept) education experience genderfemale occupationtechnical
## 1              1         8         21            1                   0
## 1100           1         9         42            1                   0
## 2              1        12          1            0                   0
## 3              1        12          4            0                   0
## 4              1        12         17            0                   0
## 5              1        13          9            0                   0
##      occupationservices occupationoffice occupationsales occupationmanagement
## 1                     0                0               0                    0
## 1100                  0                0               0                    0
## 2                     0                0               0                    0
## 3                     0                0               0                    0
## 4                     0                0               0                    0
## 5                     0                0               0                    0
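The two approaches should produce the same dummies. A quick sanity check (the column names gender_female and genderfemale are the defaults of dummy_cols and model.matrix respectively, assumed here):

```r
# The fastDummies column and the design-matrix column should agree exactly.
all(dat$gender_female == dat_mat[, "genderfemale"])
```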
Multicollinearity
We will look at three tools for assessing multicollinearity in linear regression models: the scatterplot matrix, the correlation matrix, and variance inflation factors (VIF). Here, we create a scatterplot matrix and a correlation matrix for three variables: education, experience and age. We also demonstrate how to calculate VIF values using the vif function in the car package.
#Scatterplot matrix
pairs(~education+experience+age, data=CPS1985) # response (wage) not included
(Scatterplot matrix of education, experience and age.)
pairs(wage~education+experience+age, data=CPS1985) # response (wage) included
(Scatterplot matrix of wage, education, experience and age.)
#Correlation matrix
cor(CPS1985[, 2:4])
## education experience age
##             education experience        age
## education   1.0000000 -0.3526764 -0.1500195
## experience -0.3526764  1.0000000  0.9779612
## age        -0.1500195  0.9779612  1.0000000
#VIF
library(car)
m2<-lm(wage~education+experience+age,data=CPS1985)
vif(m2)
##  education experience        age 
##   229.5738  5147.9190  4611.4008 
m3<-lm(wage~education+age,data=CPS1985)
vif(m3)
## education       age 
##  1.023024  1.023024 
m4<-lm(wage~education+experience,data=CPS1985)
vif(m4)
##  education experience 
##   1.142049   1.142049 
We notice that the variables experience and age show a strong linear dependence, which makes intuitive sense: older people generally have more work experience (indeed, in these data experience is essentially age minus education minus 6, which also explains the high VIF for education). Hence, including both experience and age in a linear regression model leads to multicollinearity issues, confirmed by the extremely high VIF values for the terms in model m2. However, when one of the variables is removed (experience in model m3, age in model m4), the VIF values are reasonably low.
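The VIF values reported by vif can be reproduced by hand from their definition, VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing the j-th covariate on the other covariates; a sketch for education in model m2:

```r
# Auxiliary regression of education on the other covariates of m2
aux <- lm(education ~ experience + age, data = CPS1985)
1 / (1 - summary(aux)$r.squared)  # should match vif(m2)["education"]
```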
Diagnostic Plots
We use three main diagnostic plots for assessing model assumptions:
1. Scatter plot of the residuals versus the fitted values
2. QQ-plot of the standardized residuals
3. Contour plot of Cook’s distance as a function of leverage and standardized residuals

We can use the plot function on an lm object to get the required plots:
par(mfrow=c(1,3))
plot(m1, which=c(1,2,5))
(Three diagnostic plots: Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage with Cook’s distance contours. Observations 170, 409, 106 and 494 are labeled.)
Note that plot(m1) can output a total of 6 diagnostic plots, out of which 1, 2 and 5 are the ones we need here. You can play around with different numbers between 1 and 6 to check out the other diagnostic plots (e.g. try plot(m1, which=c(3,4))). From the residuals vs fitted values plot, we see that the residuals show slightly non-linear and heteroscedastic behavior; including higher-order terms for the continuous covariates might solve this issue. The normality of the residuals seems roughly satisfied from the QQ-plot. From the contour plot of Cook’s distance, we observe that data point 170 has a large residual, but its Cook’s distance is below 0.5, so we would not identify it as an influential point.
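The quantities behind these plots can also be computed directly, which is handy for flagging points numerically rather than by eye; a sketch:

```r
rs <- rstandard(m1)       # standardized residuals (y-axis of the Q-Q plot)
h  <- hatvalues(m1)       # leverages (x-axis of Residuals vs Leverage)
cd <- cooks.distance(m1)  # Cook's distances

# Largest Cook's distances; none should exceed the 0.5 contour used above
head(sort(cd, decreasing = TRUE), 3)
any(cd > 0.5)
```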
2022-09-25