
ADA S23: Linear Regression with 2 factors

1 Introduction

Today we will look at 2-factor linear regression, mainly through an example. There are two aspects we wish to examine.

• How to build the minimum squared residual line.

• Is this fit statistically significant?

In addition we will also touch on the multifactor linear model and the problem of multicollinearity.

Note: We are quietly introducing something new. Previously we were concerned with the interaction between x and y. Now we need to untangle the interaction between our two independent variables x1 and x2 .

2 Two Factor Regression

2.1 Basic Model

We consider data with two quantitative factors X1 and X2 that we will use to predict a third, Y. I.e. data of the form $(Y^{(i)}, X_1^{(i)}, X_2^{(i)})$, where each $(i)$ represents a different data point.

We will fit the data to the line

$$Y^{(i)} = \beta_0 + \beta_1 X_1^{(i)} + \beta_2 X_2^{(i)} + \epsilon^{(i)}$$

with $\epsilon^{(i)} \sim N(0, \sigma^2)$ i.i.d.
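For concreteness, here is a minimal sketch (in Python) of generating data from this model. The particular values of the coefficients and of $\sigma$ below are illustrative assumptions, not taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 18
beta0, beta1, beta2 = 50.0, 2.0, 0.1   # illustrative coefficients (assumed, not from the notes)
sigma = 20.0                           # illustrative noise standard deviation

X1 = rng.uniform(0, 30, size=n)        # first quantitative factor
X2 = rng.uniform(15, 65, size=n)       # second quantitative factor
eps = rng.normal(0, sigma, size=n)     # i.i.d. N(0, sigma^2) errors

Y = beta0 + beta1 * X1 + beta2 * X2 + eps
```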

2.2 Fitting the Model

To make our life easier we will consider recentered data:

$$x_1^{(i)} = X_1^{(i)} - \bar{X}_1, \qquad x_2^{(i)} = X_2^{(i)} - \bar{X}_2, \qquad y^{(i)} = Y^{(i)} - \bar{Y}$$

For each choice of values of $b_0, b_1, b_2$ we have a sum of squares of the residuals:

$$\mathrm{SSE} = \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - b_0 - b_1 x_1^{(i)} - b_2 x_2^{(i)}\right)^2 \qquad (1)$$

We will choose our estimators of $\beta_0, \beta_1, \beta_2$ to be the numbers $b_0, b_1, b_2$ which minimize equation (1). That is, we solve the following:

$$\frac{\partial\,\mathrm{SSE}}{\partial b_0} = 0, \qquad \frac{\partial\,\mathrm{SSE}}{\partial b_1} = 0, \qquad \frac{\partial\,\mathrm{SSE}}{\partial b_2} = 0$$

This produces the following equations (remember $(\bar{y}, \bar{x}_1, \bar{x}_2) = (0, 0, 0)$):

$$b_0 = 0$$

$$b_1 = \frac{\left[\sum (x_2^{(i)})^2\right]\left[\sum x_1^{(i)} y^{(i)}\right] - \left[\sum x_1^{(i)} x_2^{(i)}\right]\left[\sum x_2^{(i)} y^{(i)}\right]}{D}$$

$$b_2 = \frac{\left[\sum (x_1^{(i)})^2\right]\left[\sum x_2^{(i)} y^{(i)}\right] - \left[\sum x_1^{(i)} x_2^{(i)}\right]\left[\sum x_1^{(i)} y^{(i)}\right]}{D}$$

where

$$D = \left[\sum_{i=1}^{n} (x_1^{(i)})^2\right]\left[\sum_{i=1}^{n} (x_2^{(i)})^2\right] - \left[\sum_{i=1}^{n} x_1^{(i)} x_2^{(i)}\right]^2$$
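Here is a minimal sketch of how these formulas translate into code (Python with numpy; the function and variable names are mine):

```python
import numpy as np

def two_factor_fit(X1, X2, Y):
    """Least-squares coefficients for Y on X1, X2 via the centered-data formulas."""
    x1 = X1 - X1.mean()            # recenter each variable
    x2 = X2 - X2.mean()
    y = Y - Y.mean()

    S11 = np.sum(x1 * x1)          # sum of (x1)^2
    S22 = np.sum(x2 * x2)          # sum of (x2)^2
    S12 = np.sum(x1 * x2)          # sum of x1 * x2
    S1y = np.sum(x1 * y)           # sum of x1 * y
    S2y = np.sum(x2 * y)           # sum of x2 * y

    D = S11 * S22 - S12 ** 2
    b1 = (S22 * S1y - S12 * S2y) / D
    b2 = (S11 * S2y - S12 * S1y) / D
    b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()  # intercept on the original scale
    return b0, b1, b2
```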

2.3 Interpreting our formulas

Point: For 1-factor regression our line passed through $(\bar{X}, \bar{Y})$. For 2-factor regression our line passes through $(\bar{Y}, \bar{X}_1, \bar{X}_2)$.

Slopes: For 1-factor regression we had $b = \frac{\sum x^{(i)} y^{(i)}}{\sum (x^{(i)})^2} = r_{xy}\,\frac{s_y}{s_x}$.

It is easy to check that in certain cases our new formulas reduce to our old one:

$$\sum x_1^{(i)} y^{(i)} = 0 \;\Rightarrow\; b_1 = 0,\quad b_2 = r_{2y}\,\frac{s_y}{s_{x_2}}$$

$$\sum x_2^{(i)} y^{(i)} = 0 \;\Rightarrow\; b_1 = r_{1y}\,\frac{s_y}{s_{x_1}},\quad b_2 = 0$$

$$\sum x_1^{(i)} x_2^{(i)} = 0 \;\Rightarrow\; b_1 = r_{1y}\,\frac{s_y}{s_{x_1}},\quad b_2 = r_{2y}\,\frac{s_y}{s_{x_2}}$$

To understand what these conditions mean we remember:

$$\sum_{i=1}^{n} x_1^{(i)} y^{(i)} = 0 \;\Longleftrightarrow\; \mathrm{corr}(x_1, y) = 0$$

$$\sum_{i=1}^{n} x_2^{(i)} y^{(i)} = 0 \;\Longleftrightarrow\; \mathrm{corr}(x_2, y) = 0$$

$$\sum_{i=1}^{n} x_1^{(i)} x_2^{(i)} = 0 \;\Longleftrightarrow\; \mathrm{corr}(x_1, x_2) = 0$$

This means the extra complication arises when both x1 and x2 are correlated with y but are also correlated with each other.
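As a quick numerical check of the third case, here is a sketch with synthetic data chosen so that $\sum x_1^{(i)} x_2^{(i)} = 0$ (the numbers are mine, purely for illustration):

```python
import numpy as np

# Centered synthetic data with sum(x1 * x2) == 0
x1 = np.array([-1.0, 1.0, -1.0, 1.0])
x2 = np.array([-1.0, -1.0, 1.0, 1.0])
y = np.array([2.0, 5.0, 1.0, 6.0])
y = y - y.mean()

D = np.sum(x1**2) * np.sum(x2**2) - np.sum(x1 * x2)**2
b1 = (np.sum(x2**2) * np.sum(x1 * y) - np.sum(x1 * x2) * np.sum(x2 * y)) / D
b1_simple = np.sum(x1 * y) / np.sum(x1**2)   # 1-factor slope of y on x1

print(b1, b1_simple)   # identical (2.0 and 2.0) because sum(x1 * x2) == 0
```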

3 Example: Snedecor and Cochran [G67]

3.1 Stating the problem

Our data is taken from an investigation [MEZ54] of the source from which corn plants in various soils obtain their phosphorus. The concentrations of inorganic (X1 ) and organic (X2 ) phosphorus in the soils were determined chemically. The phosphorus content Y of corn grown in the soils was also measured.

Table 1: Inorganic phosphorus X1, organic phosphorus X2, and estimated plant-available phosphorus Y in 18 Iowa soils at 20°C. (Parts per million.)

Soil Sample    X1      X2      Y        Ŷ       Y - Ŷ    (Y - Ŷ)²

 1              0.4     53      64      61.6      2.4        5.8
 2              0.4     23      60      59.0      1.0        1.0
 3              3.1     19      71      63.4      7.6       57.8
 4              0.6     34      61      60.3      0.7        0.5
 5              4.7     24      54      66.7    -12.7      161.3
 6              1.7     65      77      64.9     12.1      146.4
 7              9.4     44      81      76.9      4.1       16.8
 8             10.1     31      93      77.0     16.0      256.0
 9             11.6     29      93      79.6     13.4      179.6
10             12.6     58      51      83.8    -32.8    1,075.8
11             10.9     37      76      79.0     -3.0        9.0
12             23.1     46      96     101.6     -5.6       31.4
13             23.1     50      77     101.9    -24.9      620.0
14             21.6     44      93      98.7     -5.7       32.5
15             23.1     56      95     102.4     -7.4       54.8
16              1.9     36      54      62.8     -8.8       77.4
17             26.8     58     168     109.2     58.8    3,457.4
18             29.9     51      99     114.2    -15.2      231.0

Sum           215.0    758   1,463   1,463.0      0.0      6,414
Mean          11.94   42.11   81.28
Standard Deviation   10.15   13.62   27.0    18.73    19.42

Correlations: X1, X2 = 46%;  X1, Y = 69%;  X2, Y = 35%

3.2 Fitting the line

Our formulas produce the equation

$$Y = 56.26 + 1.79\,X_1 + 0.087\,X_2$$

With no phosphorus in the soil we can expect the corn to have 56.26 ppm.

On average we see a gain of:

1.79 × 11.94 = 21.37 ppm from inorganic phosphorus.

0.087 × 42.11 ≈ 3.67 ppm from organic phosphorus.
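As a quick sanity check (my own arithmetic, not in the original notes), the intercept plus these two average gains reproduces the mean of Y:

$$56.26 + 21.37 + 3.67 \approx 81.3 \approx \bar{Y}.$$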

If we fit a 1-factor model using X1 we would have the following:

$$Y = 59.26 + 1.84\,X_1$$

The slope changes because 1.79 represents how much we expect Y to increase when X1 increases by 1 and X2 is held constant, while 1.84 represents how much we expect Y to increase as X1 increases by 1 and X2 is allowed to float. As X2 is positively correlated with X1, we would expect X2 to float higher; and as X2 is positively correlated with Y, we would see a larger increase in Y than if X2 were held constant.
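Here is a sketch of this comparison in Python, using the data from Table 1 and numpy's least-squares solver (the array names are mine; expect small rounding differences from the hand-computed coefficients):

```python
import numpy as np

# Table 1: X1 = inorganic P, X2 = organic P, Y = plant-available P (ppm)
X1 = np.array([0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6,
               12.6, 10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 26.8, 29.9])
X2 = np.array([53, 23, 19, 34, 24, 65, 44, 31, 29,
               58, 37, 46, 50, 44, 56, 36, 58, 51], dtype=float)
Y = np.array([64, 60, 71, 61, 54, 77, 81, 93, 93,
              51, 76, 96, 77, 93, 95, 54, 168, 99], dtype=float)

# Two-factor fit: Y ~ 1 + X1 + X2
A2 = np.column_stack([np.ones_like(X1), X1, X2])
coef2, *_ = np.linalg.lstsq(A2, Y, rcond=None)

# One-factor fit: Y ~ 1 + X1
A1 = np.column_stack([np.ones_like(X1), X1])
coef1, *_ = np.linalg.lstsq(A1, Y, rcond=None)

print(coef2)   # roughly [56.3, 1.79, 0.087]
print(coef1)   # roughly [59.3, 1.84]
```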

3.3 Model Selection: Testing for significance

Our goal is to predict Y. We recorded X1  and X2 with the hope that they can help us predict Y. This does not mean either (or both) of them actually help in a statistically significant way.

Consider four possible linear fits (in the recentered variables):

Name      Fit                                                Σ e_i²
Model 1   y = 0                                              12,390
Model 2   y = 0 + β̃1 x1 = 1.843 x1                            6,433
Model 3   y = 0 + β̃2 x2 = 0.702 x2                           10,833
Model 4   y = 0 + β1 x1 + β2 x2 = 1.79 x1 + 0.087 x2          6,414

Clearly Model 2 is better than Model 3: both use one explanatory variable and Model 2 has smaller residuals. As we move from Model 2 to Model 4 we see a slight improvement in fit, but we have added a second explanatory variable. Is this improvement statistically significant? The same question can be asked about moving from Model 1 to Model 2 or Model 4. We will look at two different approaches to the same math to check this.

Note: These methods rely on the assumption that $\epsilon^{(i)} \sim N(0, \sigma^2)$. This is the first time we are using this constraint.

3.3.1 ANOVA Method

We begin by testing Model 4 versus Model 1. (Model 1 vs Model 2 or Model 3 would be the technique from Linear Regression Lecture 1.) For this we perform an ANOVA. We are essentially calculating the ratio of the standard error of the residuals of the two models.

Source of Variation    Degrees of Freedom    Sum of Squares       Mean Square    F
Regression              2                    Σ ŷ² =  5,975.6       2,987.8       F = 2,987.8 / 427.6 = 6.99
Residuals              15                    Σ e² =  6,414.0         427.6
Total                  17                    Σ y² = 12,389.6         728.8

From our F-table we can read the 1% critical value F(2, 15) = 6.36. As our observed F is larger than this value we can reject Model 1; Model 4 is significantly better than Model 1.
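A sketch of the same calculation in Python (using scipy for the critical value; the 1% level matches the 6.36 read from the table above):

```python
from scipy import stats

SST = 12389.6                    # total sum of squares of the centered y
SSE = 6414.0                     # residual sum of squares for Model 4
SSR = SST - SSE                  # regression sum of squares, about 5,975.6

df_reg, df_res = 2, 15
F = (SSR / df_reg) / (SSE / df_res)        # about 6.99

crit = stats.f.ppf(0.99, df_reg, df_res)   # about 6.36, the 1% critical value
print(F, F > crit)                         # F exceeds the critical value
```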

Next we compare Model 2 to Model 4. As there is only a slight reduction in the residuals, this is interesting. Again, we are essentially comparing