STAT 4052 Lab 2
2022-09-19
STAT 4052 LAB II
Linear Regression
In this lab, we will go over an example of a multiple linear regression model with both continuous and categorical covariates, with special focus on assessing model assumptions. We will use the CPS1985 data set from the AER package. Load the data as follows:
library(AER)
data("CPS1985")
summary(CPS1985)
##       wage          education       experience         age       
##  Min.   : 1.000   Min.   : 2.00   Min.   : 0.00   Min.   :18.00  
##  1st Qu.: 5.250   1st Qu.:12.00   1st Qu.: 8.00   1st Qu.:28.00  
##  Median : 7.780   Median :12.00   Median :15.00   Median :35.00  
##  Mean   : 9.024   Mean   :13.02   Mean   :17.82   Mean   :36.83  
##  3rd Qu.:11.250   3rd Qu.:15.00   3rd Qu.:26.00   3rd Qu.:44.00  
##  Max.   :44.500   Max.   :18.00   Max.   :55.00   Max.   :64.00  
##     ethnicity    region      gender          occupation           sector   
##  cauc    :440   south:156   male  :289   worker    :156   manufacturing: 99  
##  hispanic: 27   other:378   female:245   technical :105   construction : 24  
##  other   : 67                            services  : 83   other        :411  
##                                          office    : 97                     
##                                          sales     : 38                     
##                                          management: 55                     
##  union     married  
##  no :438   no :184  
##  yes: 96   yes:350  
Let us first fit a linear regression model with the variable wage as the response and education, experience, gender and occupation as covariates.
m1<-lm(wage~education+experience+gender+occupation,data=CPS1985)
summary(m1)
##
## Call:
## lm(formula = wage ~ education + experience + gender + occupation,
## data = CPS1985)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
##  -11.214   -2.694   -0.467    2.062   35.186 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.27036    1.33024  -0.955  0.34002    
## education             0.71670    0.09905   7.235 1.66e-12 ***
## experience            0.10298    0.01651   6.238 9.11e-10 ***
## genderfemale         -1.96488    0.41629  -4.720 3.03e-06 ***
## occupationtechnical   1.39132    0.69419   2.004  0.04556 *  
## occupationservices   -1.42705    0.61205  -2.332  0.02010 *  
## occupationoffice     -0.85042    0.63211  -1.345  0.17909    
## occupationsales      -1.62564    0.81085  -2.005  0.04549 *  
## occupationmanagement  2.41271    0.75216   3.208  0.00142 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.327 on 525 degrees of freedom
## Multiple R-squared:  0.3017, Adjusted R-squared:  0.291 
## F-statistic: 28.35 on 8 and 525 DF,  p-value: < 2.2e-16
Here, notice that the lm function creates dummy variables for factor covariates automatically.
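By default, the first level of a factor is used as the baseline. If you want the dummies measured against a different baseline, you can re-level the factor before fitting; a quick sketch (the object name m1b is chosen here for illustration):

```r
# Make "management" the reference level of occupation; the occupation
# coefficients in the refitted model are then differences from management.
CPS1985$occupation <- relevel(CPS1985$occupation, ref = "management")
m1b <- lm(wage ~ education + experience + gender + occupation, data = CPS1985)
coef(m1b)
```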
Coefficient Interpretations
The coefficient of a continuous covariate is the expected change in wage when that covariate increases by 1 unit, holding the other covariates fixed. For example, the expected wage per hour increases by $0.72 for each additional year of education.
The coefficient of a dummy variable is the expected difference in wage between the class encoded by that dummy and the baseline class. For example, the expected wage per hour for males is $1.96 higher than that for females, and the expected wage per hour for the worker occupation class (the baseline) is $1.39 less than that for the technical class.
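These interpretations can be read off directly from the fitted coefficient vector; a small sketch:

```r
# Expected change in hourly wage per extra year of education,
# holding the other covariates fixed:
coef(m1)["education"]            # about 0.72

# Expected gap between the technical class and the worker baseline:
coef(m1)["occupationtechnical"]  # about 1.39
```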
Dummy variables and fastDummies
Note that education and experience are continuous variables, gender is a categorical variable with 2 classes, and occupation is a categorical variable with 6 classes. Hence, the model has one dummy variable for gender and 5 dummy variables for occupation. Knowing how to construct these dummies yourself is useful because some statistical learning tools are coded in a way that does not construct them automatically.
# install.packages("fastDummies")
library(fastDummies)
dat <- dummy_cols(CPS1985, select_columns = c("gender", "occupation"))
# Alternatively, extract the design matrix (0/1 dummies plus an intercept
# column) directly from the fitted model.
dat_mat <- model.matrix(m1)
head(dat_mat)
##      (Intercept) education experience genderfemale occupationtechnical
## 1              1         8         21            1                   0
## 1100           1         9         42            1                   0
## 2              1        12          1            0                   0
## 3              1        12          4            0                   0
## 4              1        12         17            0                   0
## 5              1        13          9            0                   0
##      occupationservices occupationoffice occupationsales occupationmanagement
## 1                     0                0               0                    0
## 1100                  0                0               0                    0
## 2                     0                0               0                    0
## 3                     0                0               0                    0
## 4                     0                0               0                    0
## 5                     0                0               0                    0
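The two approaches should produce the same dummies. A quick sanity check (the column names gender_female and genderfemale are the defaults of dummy_cols and model.matrix respectively, assumed here):

```r
# The fastDummies column and the design-matrix column should agree exactly.
all(dat$gender_female == dat_mat[, "genderfemale"])
```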
Multicollinearity
We will look at three tools for assessing multicollinearity in linear regression models: the scatterplot matrix, the correlation matrix, and variance inflation factors (VIF). Here, we create a scatterplot matrix and a correlation matrix for three variables: education, experience and age. We also demonstrate how to calculate VIF values using the vif function in the car package.
#Scatterplot matrix
pairs(~education+experience+age, data=CPS1985) # response (wage) not included
(Scatterplot matrix of education, experience and age.)
pairs(wage~education+experience+age, data=CPS1985) # response (wage) included
(Scatterplot matrix of wage, education, experience and age.)
#Correlation matrix
cor(CPS1985[, 2:4])
## education experience age
##             education experience        age
## education   1.0000000 -0.3526764 -0.1500195
## experience -0.3526764  1.0000000  0.9779612
## age        -0.1500195  0.9779612  1.0000000
#VIF
library(car)
m2<-lm(wage~education+experience+age,data=CPS1985)
vif(m2)
##  education experience        age 
##   229.5738  5147.9190  4611.4008 
m3<-lm(wage~education+age,data=CPS1985)
vif(m3)
## education       age 
##  1.023024  1.023024 
m4<-lm(wage~education+experience,data=CPS1985)
vif(m4)
##  education experience 
##   1.142049   1.142049 
We notice that the variables experience and age show a strong linear dependence, which makes intuitive sense: older people generally have more work experience (indeed, in these data experience is essentially age minus education minus 6, which also explains the high VIF for education). Hence, including both experience and age in a linear regression model leads to multicollinearity issues, confirmed by the extremely high VIF values for the terms in model m2. However, when one of the variables is removed (experience in model m3, age in model m4), the VIF values are reasonably low.
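The VIF values reported by vif can be reproduced by hand from their definition, VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing the j-th covariate on the other covariates; a sketch for education in model m2:

```r
# Auxiliary regression of education on the other covariates of m2
aux <- lm(education ~ experience + age, data = CPS1985)
1 / (1 - summary(aux)$r.squared)  # should match vif(m2)["education"]
```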
Diagnostic Plots
We use three main diagnostic plots for assessing model assumptions:
1. Scatter plot of the residuals versus the fitted values
2. QQ-plot of the standardized residuals
3. Contour plot of Cook’s distance as a function of leverage and standardized residuals

We can use the plot function on an lm object to get the required plots:
par(mfrow=c(1,3))
plot(m1, which=c(1,2,5))
(Three diagnostic plots: Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage with Cook’s distance contours. Observations 170, 409, 106 and 494 are labeled.)
Note that plot(m1) can output a total of 6 diagnostic plots, out of which 1, 2 and 5 are the ones we need here. You can play around with different numbers between 1 and 6 to check out the other diagnostic plots (e.g. try plot(m1, which=c(3,4))). From the residuals vs fitted values plot, we see that the residuals show slightly non-linear and heteroscedastic behavior; including higher-order terms for the continuous covariates might solve this issue. The normality of the residuals seems roughly satisfied from the QQ-plot. From the contour plot of Cook’s distance, we observe that data point 170 has a large residual, but its Cook’s distance is below 0.5, so we would not identify it as an influential point.
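The quantities behind these plots can also be computed directly, which is handy for flagging points numerically rather than by eye; a sketch:

```r
rs <- rstandard(m1)       # standardized residuals (y-axis of the Q-Q plot)
h  <- hatvalues(m1)       # leverages (x-axis of Residuals vs Leverage)
cd <- cooks.distance(m1)  # Cook's distances

# Largest Cook's distances; none should exceed the 0.5 contour used above
head(sort(cd, decreasing = TRUE), 3)
any(cd > 0.5)
```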
2022-09-25