
ADA S23: Linear Regression with 2 factors

1 Introduction

Today we will look at 2-factor linear regression, mainly through an example. There are two aspects we wish to examine.

• How to build the minimum squared residual line.

• Is this fit statistically significant?

In addition we will also touch on the multifactor linear model and the problem of multicollinearity.

Note: We are quietly introducing something new. Previously we were concerned with the interaction between x and y. Now we need to untangle the interaction between our two independent variables x1 and x2 .

2 Two Factor Regression

2.1 Basic Model

We consider data with two quantitative factors X1 and X2 that we will use to predict a third, Y. I.e. data of the form $(Y^{(i)}, X_1^{(i)}, X_2^{(i)})$, where each $(i)$ represents a different data point.

We will fit the data to the line

$$Y^{(i)} = \beta_0 + \beta_1 X_1^{(i)} + \beta_2 X_2^{(i)} + \epsilon^{(i)}$$

with $\epsilon^{(i)} \sim N(0, \sigma^2)$ i.i.d.
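For concreteness, here is a minimal sketch (in Python) of generating data from this model. The particular values of the coefficients and of $\sigma$ below are illustrative assumptions, not taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 18
beta0, beta1, beta2 = 50.0, 2.0, 0.1   # illustrative coefficients (assumed, not from the notes)
sigma = 20.0                           # illustrative noise standard deviation

X1 = rng.uniform(0, 30, size=n)        # first quantitative factor
X2 = rng.uniform(15, 65, size=n)       # second quantitative factor
eps = rng.normal(0, sigma, size=n)     # i.i.d. N(0, sigma^2) errors

Y = beta0 + beta1 * X1 + beta2 * X2 + eps
```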

2.2 Fitting the Model

To make our life easier we will consider recentered data:

$$x_1^{(i)} = X_1^{(i)} - \bar{X}_1, \qquad x_2^{(i)} = X_2^{(i)} - \bar{X}_2, \qquad y^{(i)} = Y^{(i)} - \bar{Y}$$

For each choice of values of $b_0, b_1, b_2$ we have a sum of squares of the residuals:

$$\mathrm{SSE} = \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - b_0 - b_1 x_1^{(i)} - b_2 x_2^{(i)}\right)^2 \qquad (1)$$

We will choose our estimators of $\beta_0, \beta_1, \beta_2$ to be the numbers $b_0, b_1, b_2$ which minimize equation (1). That is, we solve the following:

$$\frac{\partial\,\mathrm{SSE}}{\partial b_0} = 0, \qquad \frac{\partial\,\mathrm{SSE}}{\partial b_1} = 0, \qquad \frac{\partial\,\mathrm{SSE}}{\partial b_2} = 0$$

This produces the following equations (remember $(\bar{y}, \bar{x}_1, \bar{x}_2) = (0, 0, 0)$):

$$b_0 = 0$$

$$b_1 = \frac{\left[\sum (x_2^{(i)})^2\right]\left[\sum x_1^{(i)} y^{(i)}\right] - \left[\sum x_1^{(i)} x_2^{(i)}\right]\left[\sum x_2^{(i)} y^{(i)}\right]}{D}$$

$$b_2 = \frac{\left[\sum (x_1^{(i)})^2\right]\left[\sum x_2^{(i)} y^{(i)}\right] - \left[\sum x_1^{(i)} x_2^{(i)}\right]\left[\sum x_1^{(i)} y^{(i)}\right]}{D}$$

where

$$D = \left[\sum_{i=1}^{n} (x_1^{(i)})^2\right]\left[\sum_{i=1}^{n} (x_2^{(i)})^2\right] - \left[\sum_{i=1}^{n} x_1^{(i)} x_2^{(i)}\right]^2$$
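Here is a minimal sketch of how these formulas translate into code (Python with numpy; the function and variable names are mine):

```python
import numpy as np

def two_factor_fit(X1, X2, Y):
    """Least-squares coefficients for Y on X1, X2 via the centered-data formulas."""
    x1 = X1 - X1.mean()            # recenter each variable
    x2 = X2 - X2.mean()
    y = Y - Y.mean()

    S11 = np.sum(x1 * x1)          # sum of (x1)^2
    S22 = np.sum(x2 * x2)          # sum of (x2)^2
    S12 = np.sum(x1 * x2)          # sum of x1 * x2
    S1y = np.sum(x1 * y)           # sum of x1 * y
    S2y = np.sum(x2 * y)           # sum of x2 * y

    D = S11 * S22 - S12 ** 2
    b1 = (S22 * S1y - S12 * S2y) / D
    b2 = (S11 * S2y - S12 * S1y) / D
    b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()  # intercept on the original scale
    return b0, b1, b2
```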

2.3 Interpreting our formulas

Point: For 1-factor regression our line passed through $(\bar{X}, \bar{Y})$. For 2-factor regression our line passes through $(\bar{Y}, \bar{X}_1, \bar{X}_2)$.

Slopes: For 1-factor regression we had $b = \frac{\sum x^{(i)} y^{(i)}}{\sum (x^{(i)})^2} = r_{xy}\,\frac{s_y}{s_x}$.

It is easy to check that in certain cases our new formulas reduce to our old one:

$$\sum x_1^{(i)} y^{(i)} = 0 \;\Rightarrow\; b_1 = 0,\quad b_2 = r_{2y}\,\frac{s_y}{s_{x_2}}$$

$$\sum x_2^{(i)} y^{(i)} = 0 \;\Rightarrow\; b_1 = r_{1y}\,\frac{s_y}{s_{x_1}},\quad b_2 = 0$$

$$\sum x_1^{(i)} x_2^{(i)} = 0 \;\Rightarrow\; b_1 = r_{1y}\,\frac{s_y}{s_{x_1}},\quad b_2 = r_{2y}\,\frac{s_y}{s_{x_2}}$$

To understand what these conditions mean we remember:

$$\sum_{i=1}^{n} x_1^{(i)} y^{(i)} = 0 \;\Longleftrightarrow\; \mathrm{corr}(x_1, y) = 0$$

$$\sum_{i=1}^{n} x_2^{(i)} y^{(i)} = 0 \;\Longleftrightarrow\; \mathrm{corr}(x_2, y) = 0$$

$$\sum_{i=1}^{n} x_1^{(i)} x_2^{(i)} = 0 \;\Longleftrightarrow\; \mathrm{corr}(x_1, x_2) = 0$$

This means the extra complication arises when both x1 and x2 are correlated with y but are also correlated with each other.
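As a quick numerical check of the third case, here is a sketch with synthetic data chosen so that $\sum x_1^{(i)} x_2^{(i)} = 0$ (the numbers are mine, purely for illustration):

```python
import numpy as np

# Centered synthetic data with sum(x1 * x2) == 0
x1 = np.array([-1.0, 1.0, -1.0, 1.0])
x2 = np.array([-1.0, -1.0, 1.0, 1.0])
y = np.array([2.0, 5.0, 1.0, 6.0])
y = y - y.mean()

D = np.sum(x1**2) * np.sum(x2**2) - np.sum(x1 * x2)**2
b1 = (np.sum(x2**2) * np.sum(x1 * y) - np.sum(x1 * x2) * np.sum(x2 * y)) / D
b1_simple = np.sum(x1 * y) / np.sum(x1**2)   # 1-factor slope of y on x1

print(b1, b1_simple)   # identical (2.0 and 2.0) because sum(x1 * x2) == 0
```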

3 Example: Snedecor and Cochran [G67]

3.1 Stating the problem

Our data is taken from an investigation [MEZ54] of the source from which corn plants in various soils obtain their phosphorus. The concentrations of inorganic (X1 ) and organic (X2 ) phosphorus in the soils were determined chemically. The phosphorus content Y of corn grown in the soils was also measured.

Table 1: Inorganic phosphorus X1, organic phosphorus X2, and estimated plant-available phosphorus Y in 18 Iowa soils at 20°C. (Parts per million.)

Soil Sample    X1      X2      Y        Ŷ       Y - Ŷ    (Y - Ŷ)²

 1              0.4     53      64      61.6      2.4        5.8
 2              0.4     23      60      59.0      1.0        1.0
 3              3.1     19      71      63.4      7.6       57.8
 4              0.6     34      61      60.3      0.7        0.5
 5              4.7     24      54      66.7    -12.7      161.3
 6              1.7     65      77      64.9     12.1      146.4
 7              9.4     44      81      76.9      4.1       16.8
 8             10.1     31      93      77.0     16.0      256.0
 9             11.6     29      93      79.6     13.4      179.6
10             12.6     58      51      83.8    -32.8    1,075.8
11             10.9     37      76      79.0     -3.0        9.0
12             23.1     46      96     101.6     -5.6       31.4
13             23.1     50      77     101.9    -24.9      620.0
14             21.6     44      93      98.7     -5.7       32.5
15             23.1     56      95     102.4     -7.4       54.8
16              1.9     36      54      62.8     -8.8       77.4
17             26.8     58     168     109.2     58.8    3,457.4
18             29.9     51      99     114.2    -15.2      231.0

Sum           215.0    758   1,463   1,463.0      0.0      6,414
Mean          11.94   42.11   81.28
Standard Deviation   10.15   13.62   27.0    18.73    19.42

Correlations: X1, X2 = 46%;  X1, Y = 69%;  X2, Y = 35%

3.2 Fitting the line

Our formulas produce the equation

$$Y = 56.26 + 1.79\,X_1 + 0.087\,X_2$$

With no phosphorus in the soil we can expect the corn to have 56.26 ppm.

On average we see a gain of:

1.79 × 11.94 = 21.37 ppm from inorganic phosphorus.

0.087 × 42.11 ≈ 3.67 ppm from organic phosphorus.
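As a quick sanity check (my own arithmetic, not in the original notes), the intercept plus these two average gains reproduces the mean of Y:

$$56.26 + 21.37 + 3.67 \approx 81.3 \approx \bar{Y}.$$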

If we fit a 1-factor model using X1 we would have the following:

$$Y = 59.26 + 1.84\,X_1$$

The slope changes because 1.79 represents how much we expect Y to increase when X1 increases by 1 and X2 is held constant, while 1.84 represents how much we expect Y to increase as X1 increases by 1 and X2 is allowed to float. As X2 is positively correlated with X1, we would expect X2 to float higher; and as X2 is positively correlated with Y, we would see a larger increase in Y than if X2 were held constant.
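Here is a sketch of this comparison in Python, using the data from Table 1 and numpy's least-squares solver (the array names are mine; expect small rounding differences from the hand-computed coefficients):

```python
import numpy as np

# Table 1: X1 = inorganic P, X2 = organic P, Y = plant-available P (ppm)
X1 = np.array([0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6,
               12.6, 10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 26.8, 29.9])
X2 = np.array([53, 23, 19, 34, 24, 65, 44, 31, 29,
               58, 37, 46, 50, 44, 56, 36, 58, 51], dtype=float)
Y = np.array([64, 60, 71, 61, 54, 77, 81, 93, 93,
              51, 76, 96, 77, 93, 95, 54, 168, 99], dtype=float)

# Two-factor fit: Y ~ 1 + X1 + X2
A2 = np.column_stack([np.ones_like(X1), X1, X2])
coef2, *_ = np.linalg.lstsq(A2, Y, rcond=None)

# One-factor fit: Y ~ 1 + X1
A1 = np.column_stack([np.ones_like(X1), X1])
coef1, *_ = np.linalg.lstsq(A1, Y, rcond=None)

print(coef2)   # roughly [56.3, 1.79, 0.087]
print(coef1)   # roughly [59.3, 1.84]
```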

3.3 Model Selection: Testing for significance

Our goal is to predict Y. We recorded X1  and X2 with the hope that they can help us predict Y. This does not mean either (or both) of them actually help in a statistically significant way.

Consider four possible linear fits (in the recentered variables):

Name      Fit                                                Σ e_i²
Model 1   y = 0                                              12,390
Model 2   y = 0 + β̃1 x1 = 1.843 x1                            6,433
Model 3   y = 0 + β̃2 x2 = 0.702 x2                           10,833
Model 4   y = 0 + β1 x1 + β2 x2 = 1.79 x1 + 0.087 x2          6,414

Clearly Model 2 is better than Model 3: both use one explanatory variable and Model 2 has smaller residuals. As we move from Model 2 to Model 4 we see a slight improvement in fit, but we have added a second explanatory variable. Is this improvement statistically significant? The same question can be asked about moving from Model 1 to Model 2 or Model 4. We will look at two different approaches to the same math to check this.

Note: These methods rely on the assumption that $\epsilon^{(i)} \sim N(0, \sigma^2)$. This is the first time we are using this constraint.

3.3.1 ANOVA Method

We begin by testing Model 4 versus Model 1. (Model 1 vs Model 2 or Model 3 would be the technique from Linear Regression Lecture 1.) For this we perform an ANOVA. We are essentially calculating the ratio of the standard error of the residuals of the two models.

Source of Variation    Degrees of Freedom    Sum of Squares       Mean Square    F
Regression              2                    Σ ŷ² =  5,975.6       2,987.8       F = 2,987.8 / 427.6 = 6.99
Residuals              15                    Σ e² =  6,414.0         427.6
Total                  17                    Σ y² = 12,389.6         728.8

From our F-table we can read the 1% critical value F(2, 15) = 6.36. As our observed F is larger than this value we can reject Model 1; Model 4 is significantly better than Model 1.
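A sketch of the same calculation in Python (using scipy for the critical value; the 1% level matches the 6.36 read from the table above):

```python
from scipy import stats

SST = 12389.6                    # total sum of squares of the centered y
SSE = 6414.0                     # residual sum of squares for Model 4
SSR = SST - SSE                  # regression sum of squares, about 5,975.6

df_reg, df_res = 2, 15
F = (SSR / df_reg) / (SSE / df_res)        # about 6.99

crit = stats.f.ppf(0.99, df_reg, df_res)   # about 6.36, the 1% critical value
print(F, F > crit)                         # F exceeds the critical value
```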

Next we compare Model 2 to Model 4. As there is only a slight reduction in the residuals, this is interesting. Again, we are essentially comparing