
RClass1-LinearModelReview

2022

Review of some aspects of the Linear Model

Continuous vs. factor predictors

Load the libraries foreign (to be able to read in the data file) and arm, and read in the data kidiq. This is how it works on my computer:

library(foreign)

library(arm)

kidiq  <- read.dta("kidiq.dta")

attach(kidiq)

head(kidiq)

##   kid_score mom_hs    mom_iq mom_work mom_age
## 1        65      1 121.11753        4      27
## 2        98      1  89.36188        4      25
## 3        85      1 115.44316        4      27
## 4        83      1  99.44964        3      25
## 5       115      1  92.74571        4      27
## 6        98      0 107.90184        1      18

The response is kid_score, the child's score on an IQ test. The predictors are mom_hs (whether or not the mom has a high-school degree), mom_iq (mom's IQ), mom_work (mom's working pattern in the first years of the child's life, as described in Lecture 2a), and mom_age (mom's age).

Next, fit two linear models to these data:

fit1, which has mom_iq and mom_work as main effects and no interaction. In this model, mom_work is treated as a factor predictor.

fit1  <- lm(kid_score~as.factor(mom_work)+mom_iq,x=TRUE)

summary(fit1)

##
## Call:
## lm(formula = kid_score ~ as.factor(mom_work) + mom_iq, x = TRUE)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -57.796 -12.103   1.892  12.019  50.582
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          24.14226    6.14276   3.930 9.89e-05 ***
## as.factor(mom_work)2  3.97026    2.78980   1.423   0.1554
## as.factor(mom_work)3  6.60140    3.23986   2.038   0.0422 *
## as.factor(mom_work)4  3.06392    2.44682   1.252   0.2112
## mom_iq                0.59478    0.05942  10.009  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.24 on 429 degrees of freedom
## Multiple R-squared:  0.2091, Adjusted R-squared:  0.2018
## F-statistic: 28.36 on 4 and 429 DF,  p-value: < 2.2e-16

Explain how many parameters this model has and why: This model has five parameters: one for the intercept, three for mom_work (because it is a factor with four levels), and one for mom_iq (because it is a continuous predictor).
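As a quick sanity check, the parameter count can be read off with length(coef(...)). A minimal sketch on simulated data (not kidiq, so the numbers are illustrative only):

```r
# Sketch on simulated data: a 4-level factor contributes 3 dummy-variable
# coefficients, plus the intercept and one slope for the continuous predictor.
set.seed(1)
d <- data.frame(
  y    = rnorm(40),
  work = factor(rep(1:4, 10)),   # 4 levels, like mom_work in fit1
  iq   = rnorm(40, 100, 15)
)
f <- lm(y ~ work + iq, data = d)
length(coef(f))  # 5: intercept + 3 dummies + 1 slope
```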

fit2, which has mom_iq and mom_work as main effects and no interaction. However, mom_work is treated as a continuous predictor.

fit2  <- lm(kid_score~mom_work+mom_iq,x=TRUE)

summary(fit2)

##
## Call:
## lm(formula = kid_score ~ mom_work + mom_iq, x = TRUE)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -56.281 -12.137   1.976  12.167  48.781
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.54194    6.10417   4.021 6.85e-05 ***
## mom_work     0.63140    0.74823   0.844    0.399
## mom_iq       0.60427    0.05893  10.254  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.27 on 431 degrees of freedom
## Multiple R-squared:  0.2023, Adjusted R-squared:  0.1986
## F-statistic: 54.64 on 2 and 431 DF,  p-value: < 2.2e-16

Explain how many parameters this model has and why: This model has three parameters: one for the intercept, one for mom_work (because it is now a continuous predictor), and one for mom_iq (because it is a continuous predictor). Note that treating mom_work as a continuous predictor makes sense because its values are ordered, but it remains a question whether we want to treat the difference between each pair of adjacent categories of mom_work as the same (which this model does).
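Since the continuous coding is a special case of the factor coding, the two treatments of the predictor are nested and can be compared with an F-test. A sketch on simulated data (not kidiq; the variable names are illustrative):

```r
# Sketch on simulated data: the continuous coding (1 slope) is nested in
# the factor coding (3 dummies), so anova() gives an F-test of the
# 2 extra parameters of the factor model.
set.seed(2)
d <- data.frame(
  work = rep(1:4, each = 15),
  iq   = rnorm(60, 100, 15)
)
d$score <- 25 + 0.6 * d$iq + rnorm(60, 0, 18)
f_linear <- lm(score ~ work + iq, data = d)             # like fit2
f_factor <- lm(score ~ as.factor(work) + iq, data = d)  # like fit1
anova(f_linear, f_factor)  # a small p-value would favour the factor coding
```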

For fit1 and fit2, print the first 5 rows of the design matrix.

fit1$x[1:5,]

##   (Intercept) as.factor(mom_work)2 as.factor(mom_work)3 as.factor(mom_work)4
## 1           1                    0                    0                    1
## 2           1                    0                    0                    1
## 3           1                    0                    0                    1
## 4           1                    0                    1                    0
## 5           1                    0                    0                    1
##      mom_iq
## 1 121.11753
## 2  89.36188
## 3 115.44316
## 4  99.44964
## 5  92.74571

fit2$x[1:5,]

##   (Intercept) mom_work    mom_iq
## 1           1        4 121.11753
## 2           1        4  89.36188
## 3           1        4 115.44316
## 4           1        3  99.44964
## 5           1        4  92.74571

Comment on how they differ: For fit1, the design matrix has five columns, and the factor predictor takes three of these. The values of these columns are either 0 or 1: for example, the second column of the design matrix contains a 1 if mom_work == 2 and a 0 otherwise, and so on. In contrast, the design matrix for fit2 has only three columns: the first and last are the same as the first and last columns of the design matrix for fit1, but the second column is now different. Here, the entry is the value of mom_work itself.
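The same coding difference can be seen directly with model.matrix() on a toy example (hypothetical values, not the kidiq rows):

```r
# Toy example: how R expands a factor vs. a numeric predictor into
# design-matrix columns.
work <- c(1, 2, 3, 4)
iq   <- c(121, 89, 115, 99)
ncol(model.matrix(~ as.factor(work) + iq))  # 5: intercept, 3 dummies, iq
ncol(model.matrix(~ work + iq))             # 3: intercept, work, iq
```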

Plot the data along with the fitted regression curve(s): For fit1, I get

plot(mom_iq, kid_score, xlab="Mother IQ score",
     ylab="Child test score", pch=20, xaxt="n", yaxt="n", type="n")
curve(coef(fit1)[1] + coef(fit1)[5]*x, col="red", add=TRUE)
curve(coef(fit1)[1] + coef(fit1)[2] + coef(fit1)[5]*x, col="magenta", add=TRUE)
curve(coef(fit1)[1] + coef(fit1)[3] + coef(fit1)[5]*x, col="blue", add=TRUE)
curve(coef(fit1)[1] + coef(fit1)[4] + coef(fit1)[5]*x, col="black", add=TRUE)
points(mom_iq[mom_work==1], kid_score[mom_work==1], pch=20, col="red")
points(mom_iq[mom_work==2], kid_score[mom_work==2], pch=20, col="magenta")
points(mom_iq[mom_work==3], kid_score[mom_work==3], pch=20, col="blue")
points(mom_iq[mom_work==4], kid_score[mom_work==4], pch=20, col="black")
axis(1, c(80, 100, 120, 140))
axis(2, c(20, 60, 100, 140))


[Figure: Child test score against Mother IQ score, with four parallel fitted lines, one per mom_work category.]

while for fit2, I get

plot(mom_iq, kid_score, xlab="Mother IQ score",
     ylab="Child test score", pch=20, xaxt="n", yaxt="n", type="n")
curve(coef(fit2)[1] + 1*coef(fit2)[2] + coef(fit2)[3]*x, col="red", add=TRUE)
curve(coef(fit2)[1] + 2*coef(fit2)[2] + coef(fit2)[3]*x, col="magenta", add=TRUE)
curve(coef(fit2)[1] + 3*coef(fit2)[2] + coef(fit2)[3]*x, col="blue", add=TRUE)
curve(coef(fit2)[1] + 4*coef(fit2)[2] + coef(fit2)[3]*x, col="black", add=TRUE)
points(mom_iq[mom_work==1], kid_score[mom_work==1], pch=20, col="red")
points(mom_iq[mom_work==2], kid_score[mom_work==2], pch=20, col="magenta")
points(mom_iq[mom_work==3], kid_score[mom_work==3], pch=20, col="blue")
points(mom_iq[mom_work==4], kid_score[mom_work==4], pch=20, col="black")
axis(1, c(80, 100, 120, 140))
axis(2, c(20, 60, 100, 140))
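As an aside, the coefficient arithmetic inside curve() can be avoided with predict(), and this makes the structure of the continuous coding visible: the four lines are parallel and equally spaced, the gap always being the mom_work coefficient. A sketch on simulated data (not kidiq):

```r
# Sketch on simulated data: with the work predictor treated as continuous,
# the vertical gap between adjacent fitted lines equals its coefficient.
set.seed(3)
d <- data.frame(work = rep(1:4, each = 15), iq = rnorm(60, 100, 15))
d$score <- 25 + 0.6 * d$iq + rnorm(60, 0, 18)
f <- lm(score ~ work + iq, data = d)
grid <- expand.grid(iq = c(80, 140), work = 1:4)
grid$fitted <- predict(f, newdata = grid)
# Gaps between consecutive fitted lines at iq = 80:
gaps <- diff(grid$fitted[grid$iq == 80])
all.equal(as.numeric(gaps), rep(unname(coef(f)["work"]), 3))  # TRUE
```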