STAT 3022 Multiple Linear Regression Models - Part 3
Multiple Linear Regression Models - Part 3
Multicollinearity, Polynomial, Qualitative variables
STAT 3022
Applied linear models
Multicollinearity in MLR
Overview
• A distinct characteristic of MLR compared to SLR is that it
has more than one predictor. As such, the intercorrelation
between predictors plays an important role in the estimated
coefficients as well as in inference for the MLR.
• Such intercorrelation is known as multicollinearity (multi:
many; collinear: linear dependence).
• We will study three cases:
1. When all predictors are uncorrelated.
2. When some predictors are perfectly correlated.
3. When predictors are correlated, but not perfectly correlated.
• In this section, we denote by $r_{jk}$ the sample correlation
between two predictors $X_j$ and $X_k$.
Uncorrelated predictors
Consider the models
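$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i \tag{1}$$
$$y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i \tag{2}$$
$$y_i = \beta_0 + \beta_2 x_{i2} + \varepsilon_i \tag{3}$$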
If r12 = 0 (i.e., X1 and X2 are uncorrelated), then
• The OLS estimates for β1 of model (1) and model (2) are
exactly the same.
• The OLS estimates for β2 of model (1) and model (3) are
exactly the same.
• SSR(X1, X2) = SSR(X1) + SSR(X2)
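The three claims above can be checked numerically. The following is a minimal sketch on a small made-up balanced design (the data and the helper names `centered_cross`, `slr` and `mlr2` are hypothetical, chosen only for illustration); it solves the centered normal equations directly rather than calling a regression library.

```python
def centered_cross(u, v):
    """Sum of cross-products of u and v about their means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def slr(x, y):
    """Slope and regression sum of squares for the SLR of y on x."""
    sxx, sxy = centered_cross(x, x), centered_cross(x, y)
    return sxy / sxx, sxy ** 2 / sxx


def mlr2(x1, x2, y):
    """Slopes (b1, b2) and SSR for y ~ x1 + x2, from the centered normal equations."""
    s11, s22 = centered_cross(x1, x1), centered_cross(x2, x2)
    s12 = centered_cross(x1, x2)
    s1y, s2y = centered_cross(x1, y), centered_cross(x2, y)
    det = s11 * s22 - s12 ** 2  # zero only when x1 and x2 are perfectly correlated
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2, b1 * s1y + b2 * s2y


# Hypothetical balanced design: each x1 level is paired equally often with each x2 level.
x1 = [4, 4, 4, 4, 6, 6, 6, 6]
x2 = [2, 2, 3, 3, 2, 2, 3, 3]
y = [40, 38, 50, 52, 50, 54, 60, 62]  # made-up responses

s12 = centered_cross(x1, x2)          # numerator of r12; zero here
b1_full, b2_full, ssr_full = mlr2(x1, x2, y)
b1_alone, ssr_1 = slr(x1, y)
b2_alone, ssr_2 = slr(x2, y)

print("b1:", b1_full, b1_alone)
print("b2:", b2_full, b2_alone)
print("SSR:", ssr_full, ssr_1 + ssr_2)
```

Changing a single value of `x2` so that the design is no longer balanced makes `s12` non-zero, and the two sets of slopes then no longer agree.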
An example: Kutner et al. (Table 7.6)
Example: effect of work crew size (X1) and level of bonus pay
(X2) on crew productivity (Y). X1 and X2 are uncorrelated.
In general, if all p − 1 predictors are mutually uncorrelated:
• The estimated effect of one predictor on the response does not
depend on whether the other predictors are in the model.
• Hence, we can get the effect of one predictor Xj on the
response Y just by fitting an SLR of Y on Xj.
• We do not go into the math of this conclusion, but intuitively,
when all the predictors are uncorrelated, they have “separate”
effects on the response.
• You will see this case again when we talk about experimental
designs.
Perfectly correlated predictors
The second (extreme) case is when one or more predictors are
perfectly correlated with the others.
• Essentially, this means that one predictor can be written as a
linear combination of some other predictor variables. In this
case, the design matrix X is not of full rank, i.e., rank(X) < p.
• Recall the normal equations for OLS:
$$X^\top X b = X^\top y,$$
and that $\mathrm{rank}(X^\top X) = \mathrm{rank}(X)$. Hence, in this
case, the matrix $X^\top X$ is also not of full rank, and there are
infinitely many solutions for b.
An example: Kutner et al. (Table 7.7)
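Normal equations:
$$
\begin{bmatrix} 4 & 26 & 33 \\ 26 & 204 & 232 \\ 33 & 232 & 281 \end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}
=
\begin{bmatrix} 272 \\ 2118 \\ 2419 \end{bmatrix}.
$$
The fitted value $\hat{y}_4$ is
$$
\begin{aligned}
\hat{y}_4
&= \begin{bmatrix} 1 & 10 & 10 \end{bmatrix}
   \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} \\
&= \begin{bmatrix} -0.4 & 0.1 & 0 \end{bmatrix}
   \begin{bmatrix} 4 & 26 & 33 \\ 26 & 204 & 232 \\ 33 & 232 & 281 \end{bmatrix}
   \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} \\
&= \begin{bmatrix} -0.4 & 0.1 & 0 \end{bmatrix}
   \begin{bmatrix} 272 \\ 2118 \\ 2419 \end{bmatrix}
 = 103.
\end{aligned}
$$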
• Though there are infinitely many solutions for b, all
solutions give the same fitted values (and residuals).
• Therefore, while the individual coefficients in b have no
meaningful interpretation, the model can still provide a good fit
for the data.
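As a quick check of this point, the sketch below uses only the matrix, the right-hand side, and the fourth design row quoted in the Kutner et al. (Table 7.7) example. The particular solution `b_particular` and the direction `d` are illustrative choices of mine, not values from the slides; the code verifies them rather than assuming them.

```python
# X'X, X'y and the design row for observation 4, as quoted in the example.
XtX = [[4, 26, 33],
       [26, 204, 232],
       [33, 232, 281]]
Xty = [272, 2118, 2419]
x4 = [1, 10, 10]


def matvec(A, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Illustrative choices: one solution of the normal equations, and a
# direction d that X'X maps to zero (so b + t*d is also a solution).
b_particular = [-87, 1, 18]
d = [10, 1, -2]
d_in_null_space = matvec(XtX, d) == [0, 0, 0]

results = []
for t in (0, 8, -3):
    b = [bp + t * dj for bp, dj in zip(b_particular, d)]
    solves = matvec(XtX, b) == Xty
    results.append((b, solves, dot(x4, b)))
    print(b, solves, dot(x4, b))
```

Every value of `t` gives a different coefficient vector, yet each one satisfies the normal equations and yields the same fitted value for observation 4.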
Highly correlated predictors
Although the two cases above are extreme, in practice it is very
common to find that many predictors are highly correlated. After
all, high correlation among variables is often an inherent
characteristic of the population of interest.
• Example: In a regression of food expenditures on income, savings,
age of head of household, educational level, etc., all the
predictors are correlated with one another.
• Mathematically, although the design matrix X and the matrix
$X^\top X$ still have full rank, computing the inverse
$V = (X^\top X)^{-1}$ becomes unstable.
• Recall that $\mathrm{Var}(\hat{\beta}) = \sigma^2 V$, so
multicollinearity inflates the variance of the OLS estimator.
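To see the inflation in the simplest setting, here is a worked sketch that assumes just two predictors, centered and scaled so that $X^\top X$ equals their correlation matrix (this standardization is an assumption added for illustration, so the intercept drops out):

```latex
X^\top X =
\begin{bmatrix} 1 & r_{12} \\ r_{12} & 1 \end{bmatrix}
\quad\Longrightarrow\quad
V = (X^\top X)^{-1}
  = \frac{1}{1 - r_{12}^2}
    \begin{bmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{bmatrix}.
```

Under this assumption, the diagonal entries of V, and hence the variances of both slope estimators, are proportional to $1/(1 - r_{12}^2)$, which grows without bound as $|r_{12}| \to 1$. For instance, $r_{12} = 0.9$ gives a factor of $1/0.19 \approx 5.3$ relative to uncorrelated predictors.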
Body fat example (n=20)
[Figure: scatterplot matrix of X1, X2, X3 and y, with pairwise
sample correlations]
Pair      Corr
X1, X2    0.924***
X1, X3    0.458*
X2, X3    0.085
X1, y     0.843***
X2, y     0.878***
X3, y     0.142
Effect on sum of squares
Model               SSE
y ∼ X1              143.12
y ∼ X2              113.42
y ∼ X1 + X2         109.95
y ∼ X1 + X2 + X3     98.40
Since X1 and X2 are highly correlated, the model including both
X1 and X2 reduces SSE very little compared to the SSE of the
model that has only X2.
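The size of this effect can be read directly off the SSE values in the table. A short worked calculation, using only those numbers:

```latex
\begin{aligned}
\text{Adding } X_1 \text{ to } y \sim X_2:&\quad
  113.42 - 109.95 = 3.47
  \quad (\approx 3.1\% \text{ of } 113.42), \\
\text{Adding } X_2 \text{ to } y \sim X_1:&\quad
  143.12 - 109.95 = 33.17
  \quad (\approx 23.2\% \text{ of } 143.12).
\end{aligned}
```

So once X2 is in the model, X1 explains almost nothing further, even though X1 on its own leaves an SSE of only 143.12.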
Effect on regression coefficients
Model               β̂1     se(β̂1)   t1       β̂2      se(β̂2)   t2
y ∼ X1              0.86   0.13     6.66*
y ∼ X2                                       0.86    0.11     7.78*
y ∼ X1 + X2         0.22   0.30     0.73     0.66    0.29     2.26*
y ∼ X1 + X2 + X3    4.34   3.01     1.44     -2.86   2.58     -1.37
*: the t-test of β1 = 0 or β2 = 0 is significant at α = 0.05
• When predictors are highly correlated, regression parameters
are difficult to interpret. When new variables are added, the
OLS estimate can change dramatically.
Now consider the last model y ∼ X1 + X2 + X3. We see from the
last table that neither the t-test for β1 = 0 nor the t-test for
β2 = 0 is significant. However, suppose we conduct the F-test:
H0 : β1 = β2 = 0
H1 : at least one of β1 and β2 is non-zero
Under H0, the model is y ∼ X3, so SSE(H0) = 485.34 and
df(H0) = 18. Hence, the corresponding F statistic is
$$F = \frac{(485.34 - 98.40)/(18 - 16)}{98.40/16} = 31.46,$$
with p-value < 0.001. Hence, we reject H0 and conclude that at
least one of β1 and β2 is non-zero.
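The F statistic above is plain arithmetic on the two SSE values and their degrees of freedom. A minimal sketch (the function name `partial_f` is my own; the p-value is not computed here because it needs the F distribution, which is outside the Python standard library):

```python
def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic comparing a reduced (H0) model against the full model."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Reduced model: y ~ X3 (SSE 485.34, df 18).
# Full model:    y ~ X1 + X2 + X3 (SSE 98.40, df 16).
f_stat = partial_f(sse_reduced=485.34, df_reduced=18, sse_full=98.40, df_full=16)
print(round(f_stat, 2))  # 31.46
```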
Q: Are the results of the F-test and the t-tests contradictory?
A: No, because they test two different things.
• The t-test for β1 = 0 assumes X2 (and X3) are already included
in the model, and similarly, the t-test for β2 = 0 assumes X1
(and X3) are already included in the model.
• On the other hand, the F-test essentially checks whether at
least one of X1 and X2 has to be in the model, given X3.
• Such seemingly contradictory results between marginal t-tests
and F-tests are common when the predictors are highly correlated,
so do not decide which variables to keep just by looking at the
p-values of the t-tests in the summary table.
Implications
• Regression parameters are difficult to interpret.
- While it is mathematically true to say, 'when all the other
covariates are held fixed, a change of one unit in X1 leads to a
change of β̂1 units in the outcome,' in practice you are often
unable to change X1 while keeping the other covariates fixed.
- Possibly the only exception is when we conduct experiments
and are able to completely control the values of the
covariates.
• Multicollinearity is mostly an issue for the coefficients of X
(their estimates and interpretation); it does not seriously
affect the fitted values ŷ of the model.
• When p is large, multicollinearity tends to become more
serious, making variable and model selection more important.
• Variable selection and model evaluation have been covered in
DATA2x02:
- Information criteria: AIC, BIC
- Backward and forward selection
- Prediction performance: in-sample (R²) and out-of-sample
prediction (cross-validation)
• These topics will not be tested in the quiz/final exam,
but you are expected to apply them in the project.
Polynomial regression
First, we consider models of the following form:
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_{p-1} x_i^{p-1} + \varepsilon_i, \quad i = 1, \dots, n.$$
There is only one outcome Y and one predictor X; however, unlike
simple linear regression, the regression function is a polynomial
of degree p − 1 in X.
Fitting this model is essentially the same as fitting an MLR model
with the p − 1 predictors
$$X_1 = X,\quad X_2 = X^2,\quad \dots,\quad X_{p-1} = X^{p-1}.$$
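The design matrix is
$$
X =
\begin{bmatrix}
1 & x_1 & x_1^2 & \dots & x_1^{p-1} \\
1 & x_2 & x_2^2 & \dots & x_2^{p-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \dots & x_n^{p-1}
\end{bmatrix}
$$
and the OLS estimate for β is the same as in regular MLR:
$$\hat{\beta} = (X^\top X)^{-1} X^\top y,$$
and then all other inferences can be done as previously.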
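To make the "polynomial regression is just MLR" point concrete, here is a minimal sketch that builds the design matrix column by column and solves the normal equations with plain Python. The data are a made-up noise-free quadratic, and the helper names (`design_matrix`, `solve`, `polyfit_ols`) are hypothetical; in practice a numerical library would be used instead of hand-written elimination.

```python
def design_matrix(x, degree):
    """Rows [1, x_i, x_i^2, ..., x_i^degree]."""
    return [[xi ** k for k in range(degree + 1)] for xi in x]


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def polyfit_ols(x, y, degree):
    """OLS coefficients from the normal equations X'X b = X'y."""
    X = design_matrix(x, degree)
    p = degree + 1
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(XtX, Xty)


x = [0, 1, 2, 3, 4, 5]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]  # hypothetical noise-free quadratic
beta_hat = polyfit_ols(x, y, degree=2)
print([round(b, 6) for b in beta_hat])  # close to [1, 2, 3]
```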
Notes for polynomial regression
The polynomial terms are highly correlated.
• Example: Car data, Y = speed; X = distance (dist).
speed (Y)   dist (X1 = X)   X2 = X²   X3 = X³
4           2               4         8
4           10              100       1000
...         ...             ...       ...
25          85              7225      614125
• Correlation matrix among polynomial terms:
$$
r_{XX} =
\begin{bmatrix}
1.00 & 0.96 & 0.87 \\
0.96 & 1.00 & 0.97 \\
0.87 & 0.97 & 1.00
\end{bmatrix}
$$
A common step to reduce multicollinearity in polynomial regression
is to center the predictor before forming the higher-degree terms.
dist (X1 = X)   X1* = X1 − x̄   X2* = (X1*)²   X3* = (X1*)³
2               −41             (−41)²         (−41)³
10              −33             (−33)²         (−33)³
...             ...             ...            ...
85              42              (42)²          (42)³
Here x̄ = 42.98, and the centered values are shown rounded.
$$
r_{X^*X^*} =
\begin{bmatrix}
1.00 & 0.52 & 0.75 \\
0.52 & 1.00 & 0.83 \\
0.75 & 0.83 & 1.00
\end{bmatrix},
\qquad
r_{XX} =
\begin{bmatrix}
1.00 & 0.96 & 0.87 \\
0.96 & 1.00 & 0.97 \\
0.87 & 0.97 & 1.00
\end{bmatrix}
$$
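The effect of centering can be reproduced on any positive, right-skewed predictor. The sketch below uses a short made-up vector `x` (not the car data) and a hand-written Pearson correlation, and compares the correlation between the linear and quadratic terms before and after centering.

```python
from math import sqrt


def corr(u, v):
    """Pearson sample correlation."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


x = [2, 4, 10, 16, 17, 22, 26, 34, 46, 60]  # hypothetical predictor values
x_bar = sum(x) / len(x)
xc = [xi - x_bar for xi in x]               # centered predictor

r_raw = corr(x, [xi ** 2 for xi in x])            # X vs X^2
r_centered = corr(xc, [xi ** 2 for xi in xc])     # X* vs (X*)^2
print(round(r_raw, 2), round(r_centered, 2))
```

For a predictor that is perfectly symmetric about its mean, the centered correlation between the linear and quadratic terms is exactly zero; skewness is what keeps it away from zero here.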
• Centering the data reduces multicollinearity, but does not remove
it completely.
• One possible way to remove multicollinearity completely in
polynomial regression is through the use of orthogonal
polynomials.
- More of the math can be found on the Wikipedia page.
- In R, the set of orthogonal polynomials for a vector x up to
degree k can be created by the function poly(x, k).
• The main disadvantage of using orthogonal polynomials is that the
model is not as easy to interpret, since they transform the
covariate for mathematical convenience and the transformed terms
have little direct meaning.
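One way to see what "orthogonal polynomials" means in practice is to orthogonalize the columns 1, x, x², ... against each other. The sketch below does this with a Gram-Schmidt pass on a made-up vector `x`. It illustrates the idea only and is not a reimplementation of R's poly(); I have not matched its output, and the columns could differ from it (for example in sign).

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonal_poly(x, degree):
    """Orthonormal polynomial columns of degree 1..degree, each also
    orthogonal to the constant column, via Gram-Schmidt."""
    n = len(x)
    x_bar = sum(x) / n
    xc = [xi - x_bar for xi in x]            # centering improves conditioning
    basis = [[1 / sqrt(n)] * n]              # normalised constant column
    for k in range(1, degree + 1):
        col = [xi ** k for xi in xc]
        for q in basis:                      # remove components along earlier columns
            proj = dot(col, q)
            col = [c - proj * qi for c, qi in zip(col, q)]
        norm = sqrt(dot(col, col))
        basis.append([c / norm for c in col])
    return basis[1:]                         # drop the constant column


x = [1, 2, 4, 7, 8, 11, 13, 17]              # hypothetical predictor values
P = orthogonal_poly(x, 3)
gram = [[dot(p, q) for q in P] for p in P]   # should be close to the identity
for row in gram:
    print([round(abs(g), 6) for g in row])
```

Because the Gram matrix is (numerically) the identity, the sample correlations between the transformed terms are zero, which is the sense in which the multicollinearity is removed.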
Another important aspect of polynomial regression is hierarchy:
• First, higher-order terms should only be added sequentially.
For example, do NOT add the cubic term (X³) unless the quadratic
term (X²) is already in the model.
E.g., do NOT fit the models
$$y_i = \beta_0 + \beta_1 x_i + \beta_3 x_i^3 + \varepsilon_i$$
$$y_i = \beta_0 + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i$$
The reason is that the lower-order terms provide more basic
information about the regression function, while the
higher-order terms only provide a refinement of it.
• On the other hand, if a polynomial term of a given degree is
retained, then all related terms of lower degree should also be
retained in the model.
E.g., if you fit the cubic model
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i,$$
and β3 is significant, never drop the linear term ($x_i$) or the
quadratic term ($x_i^2$), regardless of whether β1 or β2 is
significant.
Otherwise, continue to test whether β2 is significant. If it is,
never drop the linear term $x_i$, regardless of whether β1 is
significant.
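The retention rule can be written as a very small helper. This is only a sketch of the bookkeeping: the function name `retained_degrees` is hypothetical, and it assumes the significance flags have already been obtained by testing from the highest degree downwards as described above (it does not fit or refit any model).

```python
def retained_degrees(significant):
    """Degrees to keep under the hierarchy rule.

    `significant` maps each degree (1, 2, ..., k) to True/False for the
    test of that term's coefficient. All degrees up to the highest
    significant one are kept, whatever their own test results.
    """
    highest = max((d for d, sig in significant.items() if sig), default=0)
    return list(range(1, highest + 1))


print(retained_degrees({1: False, 2: False, 3: True}))   # [1, 2, 3]
print(retained_degrees({1: False, 2: True, 3: False}))   # [1, 2]
```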
2026-03-09