RClass1-LinearModelReview
2022
Review of some aspects of the Linear Model
Continuous vs. factor predictors
Load the libraries foreign (to be able to read in the data file) and arm, and read in the data kidiq. This is how it works on my computer:
library(foreign)
library(arm)
kidiq <- read.dta("kidiq.dta")
attach(kidiq)
head(kidiq)
##   kid_score mom_hs    mom_iq mom_work mom_age
## 1        65      1 121.11753        4      27
## 2        98      1  89.36188        4      25
## 3        85      1 115.44316        4      27
## 4        83      1  99.44964        3      25
## 5       115      1  92.74571        4      27
## 6        98      0 107.90184        1      18
The response is kid_score, the child’s score in an IQ test. The predictors are mom_hs (whether or not the mom has a high-school degree), mom_iq (mom’s IQ), mom_work (mom’s working pattern in the first years of the child’s life, as described in Lecture 2a), and mom_age (mom’s age).
Next, fit two linear models to these data:
fit1, which has mom_iq and mom_work as main effects and no interaction. In this model, mom_work is treated as a factor predictor.
fit1 <- lm(kid_score~as.factor(mom_work)+mom_iq,x=TRUE)
summary(fit1)
##
## Call:
## lm(formula = kid_score ~ as.factor(mom_work) + mom_iq, x = TRUE)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -57.796 -12.103   1.892  12.019  50.582
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          24.14226    6.14276   3.930 9.89e-05 ***
## as.factor(mom_work)2  3.97026    2.78980   1.423   0.1554
## as.factor(mom_work)3  6.60140    3.23986   2.038   0.0422 *
## as.factor(mom_work)4  3.06392    2.44682   1.252   0.2112
## mom_iq                0.59478    0.05942  10.009  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.24 on 429 degrees of freedom
## Multiple R-squared: 0.2091, Adjusted R-squared: 0.2018
## F-statistic: 28.36 on 4 and 429 DF, p-value: < 2.2e-16
Explain how many parameters this model has and why: This model has five parameters: one for the intercept,
three for mom_work (because it is a factor with four levels, which requires three dummy variables), and one for mom_iq (because it is a continuous predictor).
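The counting rule above can be checked on simulated data (a sketch, not the kidiq data): a factor with four levels contributes levels − 1 = 3 dummy columns, so together with the intercept and one continuous slope the model has five coefficients.

```r
# Hypothetical simulated data, mimicking the structure of kidiq:
# a four-level factor plus one continuous predictor.
set.seed(1)
d <- data.frame(y    = rnorm(40),
                work = factor(rep(1:4, each = 10)),
                iq   = rnorm(40, mean = 100, sd = 15))
fit <- lm(y ~ work + iq, data = d)
length(coef(fit))  # 5: intercept + 3 dummies + 1 slope
```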
fit2, which has mom_iq and mom_work as main effects and no interaction. However, mom_work is treated as a continuous predictor.
fit2 <- lm(kid_score~mom_work+mom_iq,x=TRUE)
summary(fit2)
##
## Call:
## lm(formula = kid_score ~ mom_work + mom_iq, x = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.281 -12.137 1.976 12.167 48.781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.54194 6.10417 4.021 6.85e-05 ***
## mom_work 0.63140 0.74823 0.844 0.399
## mom_iq 0.60427 0.05893 10.254 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.27 on 431 degrees of freedom
## Multiple R-squared: 0.2023, Adjusted R-squared: 0.1986
## F-statistic: 54.64 on 2 and 431 DF, p-value: < 2.2e-16
Explain how many parameters this model has and why: This model has three parameters: one for the intercept,
one for mom_work (because it is now treated as a continuous predictor, so it gets a single slope), and one for mom_iq (because it is a continuous predictor). Note that treating mom_work as a continuous predictor makes sense because its values are ordered, but it is questionable whether we want to assume that the difference in expected kid_score between adjacent categories of mom_work is the same for every pair of adjacent categories (which this model does).
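The equal-spacing assumption can be made concrete with simulated data (a sketch, not the kidiq data): when the four-level predictor is coded as numeric, the fitted gap between any two adjacent levels is always the same single coefficient.

```r
# Hypothetical simulated data: work is numeric (1, 2, 3, 4), not a factor.
set.seed(2)
d <- data.frame(y    = rnorm(40),
                work = rep(1:4, each = 10),
                iq   = rnorm(40, mean = 100, sd = 15))
fit <- lm(y ~ work + iq, data = d)

# Predicted response at each level of work, holding iq fixed at 100:
nd <- data.frame(work = 1:4, iq = 100)
p  <- predict(fit, nd)
diff(p)  # three identical gaps, each equal to coef(fit)["work"]
```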
For fit1 and fit2, print the first 5 rows of the design matrix.
fit1$x[1:5,]
##   (Intercept) as.factor(mom_work)2 as.factor(mom_work)3 as.factor(mom_work)4    mom_iq
## 1           1                    0                    0                    1 121.11753
## 2           1                    0                    0                    1  89.36188
## 3           1                    0                    0                    1 115.44316
## 4           1                    0                    1                    0  99.44964
## 5           1                    0                    0                    1  92.74571
fit2$x[1:5,]
## (Intercept) mom_work mom_iq
## 1 1 4 121.11753
## 2 1 4 89.36188
## 3 1 4 115.44316
## 4 1 3 99.44964
## 5 1 4 92.74571
Comment on how they differ: For fit1, the design matrix has five columns, and the factor predictor takes three of these. The values of these columns are either 0 or 1: for example, the second column of the design matrix is 1 if mom_work == 2 and 0 otherwise, and so on. In contrast, the design matrix for fit2 has only three columns. The first and last are the same as the first and last column of the design matrix for fit1, but the middle column is different: its entry is simply the value of mom_work.
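The two codings can also be compared without fitting any model, using model.matrix() directly (a minimal sketch with made-up values of mom_work):

```r
# Hypothetical values of mom_work, covering all four categories.
w <- rep(1:4, 2)

model.matrix(~ as.factor(w))  # intercept + three 0/1 dummy columns
model.matrix(~ w)             # intercept + one numeric column
```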
Plot the data along with the fitted regression curve(s): For fit1, I get
plot(mom_iq, kid_score, xlab = "Mother IQ score",
     ylab = "Child test score", pch = 20, xaxt = "n", yaxt = "n", type = "n")
curve(coef(fit1)[1] + coef(fit1)[2] + coef(fit1)[5]*x, add = TRUE, col = "magenta")
curve(coef(fit1)[1] + coef(fit1)[5]*x, col = "red", add = TRUE)
curve(coef(fit1)[1] + coef(fit1)[3] + coef(fit1)[5]*x, add = TRUE, col = "blue")
curve(coef(fit1)[1] + coef(fit1)[4] + coef(fit1)[5]*x, add = TRUE, col = "black")
points(mom_iq[mom_work==1], kid_score[mom_work==1], pch = 20, col = "red")
points(mom_iq[mom_work==2], kid_score[mom_work==2], pch = 20, col = "magenta")
points(mom_iq[mom_work==3], kid_score[mom_work==3], pch = 20, col = "blue")
points(mom_iq[mom_work==4], kid_score[mom_work==4], pch = 20, col = "black")
axis(1, c(80, 100, 120, 140))
axis(2, c(20, 60, 100, 140))
[Figure: child test score vs. mother IQ score, with four parallel fitted lines from fit1, one colour per mom_work category]
while for fit2, I get
plot(mom_iq, kid_score, xlab = "Mother IQ score",
     ylab = "Child test score", pch = 20, xaxt = "n", yaxt = "n", type = "n")
curve(coef(fit2)[1] + 2*coef(fit2)[2] + coef(fit2)[3]*x, add = TRUE, col = "magenta")
curve(coef(fit2)[1] + coef(fit2)[2] + coef(fit2)[3]*x, col = "red", add = TRUE)
curve(coef(fit2)[1] + 3*coef(fit2)[2] + coef(fit2)[3]*x, add = TRUE, col = "blue")
curve(coef(fit2)[1] + 4*coef(fit2)[2] + coef(fit2)[3]*x, add = TRUE, col = "black")
points(mom_iq[mom_work==1], kid_score[mom_work==1], pch = 20, col = "red")
points(mom_iq[mom_work==2], kid_score[mom_work==2], pch = 20, col = "magenta")
points(mom_iq[mom_work==3], kid_score[mom_work==3], pch = 20, col = "blue")
points(mom_iq[mom_work==4], kid_score[mom_work==4], pch = 20, col = "black")
axis(1, c(80, 100, 120, 140))
axis(2, c(20, 60, 100, 140))
2022-02-23