
ADA Spring 2023: Linear Regression 1

January 1, 2023

1    Introduction

For today's lecture we will focus on data of the form $(x_i, y_i)$, with $x$ a quantitative factor and $y$ a quantitative factor. Our goal is to determine whether $x$ influences $y$. We restrict ourselves to models of the form

$y_i = \mu + \beta_1 (x_i - \bar{x}) + \epsilon_i \qquad (1)$

where $i$ indexes our data points. We also assume

$\epsilon_i \sim N(0, \sigma^2). \qquad (2)$

with the $\epsilon_i$ independent both across the index $i$ and of the $x$ values. We consider our $x_i$'s to be fixed; only the $y_i$ are 'random'. Our goals will be:

• Given our data, how do we estimate $\mu$, $\beta_1$, and $\sigma$?

• Given our estimated line, can we build a confidence interval for a prediction?

• Is the effect of $X$ on $Y$ statistically significant, i.e. can we say with confidence that $\beta_1 \neq 0$?

• Can we build a confidence interval for $\sigma$?

• How do we handle non-normal residuals?

•  (New) Alternative measures of correlation.

•  (New) Robust Regression.

2    Fitting our line

See posted notes LR Review 1 and LR Review 2.

2.1    Gauss-Markov Theorem:  Not on exam

In your statistical inference class you learned about various types of estimators. We have chosen estimators that minimize the sum of the squared residuals. How do we justify this? Under condition (2) we have the Gauss-Markov theorem.

Gauss (d. 1855) was able to show that under the conditions that the residuals are independent, normally distributed, and each with the same variance, our ordinary least squares estimator has the lowest sampling variance of all linear unbiased estimators.

Markov (b. 1856) was able to loosen these conditions. He showed that all that was needed was that the errors be uncorrelated, with mean zero and finite, homoscedastic variance.

3    Confidence Intervals and Linear Regression

To build a confidence interval for our estimators and for our predictions we need some understanding of how our data are generated. For this we assume that both equations (1) and (2) are true, where the only uncertainty is in the accuracy of our estimates and in the future random noise, $\epsilon_{n+1}$.

3.1    Results From Inference Class

In inference class we learned that equations (1) and (2) produce the following results.

$\hat{\mu} = \bar{y} \sim N\!\left(\mu,\ \sigma^2/n\right)$

$\hat{\beta}_1 = r\,\frac{s_y}{s_x} \sim N\!\left(\beta_1,\ \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}\right)$

$\frac{(n-2)\,s_e^2}{\sigma^2} = \frac{\sum_i e_i^2}{\sigma^2} \sim \chi^2_{n-2}$

We also learned of the independence of these 3 estimators.

Note: Since the true value of $\sigma$ is unknown we replace it with the estimate $s_e$, whose scaled square has a $\chi^2$ distribution. This causes our other two estimators to switch from a $z$ to a $t$ distribution.
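As a concrete illustration (not from the posted notes), here is a minimal Python sketch that simulates data from equations (1) and (2) and computes the estimates $\hat{\mu}$, $\hat{\beta}_1$ and $s_e$ directly; all names and numbers are illustrative.

import numpy as np
# Simulate data from y_i = mu + beta1 (x_i - xbar) + eps_i,  eps_i ~ N(0, sigma^2)
rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 10, n)                 # fixed design points
mu, beta1, sigma = 2.0, 0.7, 1.5          # "true" values, used only for the simulation
y = mu + beta1 * (x - x.mean()) + rng.normal(0, sigma, n)
# Estimates
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
mu_hat = ybar                                        # estimate of mu
beta1_hat = np.sum((x - xbar) * (y - ybar)) / Sxx    # least squares slope
resid = y - (mu_hat + beta1_hat * (x - xbar))        # residuals e_i
se = np.sqrt(np.sum(resid ** 2) / (n - 2))           # estimate of sigma
print(mu_hat, beta1_hat, se)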

3.2    The noise

To build any confidence interval we must estimate σ from equation 2. We use our observed residuals.

$s_e^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2, \qquad e_i = y_i - \hat{y}_i$

3.3    The point

We assume our value of $\bar{x}$ is exact and does not vary from experiment to experiment. The observed $\bar{y}$ is assumed to be random and to vary day to day. As this number is estimated from a given day's data, we would like to build a confidence interval for it as well.

Without performing linear regression, our confidence interval for the center of $n$ observations would be

$\bar{y} \pm t_{n-1}\,\frac{s_y}{\sqrt{n}}$

With the linear regression we can replace $s_y$ with the smaller $s_e$ and we get

$\bar{y} \pm t_{n-2}\,\frac{s_e}{\sqrt{n}}$
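A short sketch of both intervals using scipy's $t$ quantiles; the toy responses and the placeholder value of $s_e$ are purely illustrative.

import numpy as np
from scipy import stats
y = np.array([2.1, 2.9, 3.7, 5.2, 6.1, 6.8])   # toy responses
n = len(y)
ybar, sy = y.mean(), y.std(ddof=1)
se = 0.6            # residual standard error from a fitted line (placeholder value)
alpha = 0.05
half_no_reg = stats.t.ppf(1 - alpha / 2, n - 1) * sy / np.sqrt(n)   # uses s_y, t_{n-1}
half_reg = stats.t.ppf(1 - alpha / 2, n - 2) * se / np.sqrt(n)      # uses s_e, t_{n-2}
print(ybar - half_no_reg, ybar + half_no_reg)
print(ybar - half_reg, ybar + half_reg)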

3.4    The slope

We will test the null hypothesis that

$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0$

Using our results from inference class this can be done with a simple t-test

$T = \frac{\sqrt{n-1}\,(\hat{\beta}_1 - 0)}{s_e / s_x}$

which has a $t_{n-2}$ distribution under $H_0$.
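A brief Python sketch of this test on toy data; the statistic below matches the formula above and the two-sided p-value uses a $t_{n-2}$ reference distribution (all numbers are illustrative).

import numpy as np
from scipy import stats
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])
n = len(x)
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
resid = y - (ybar + b1 * (x - xbar))
se = np.sqrt(np.sum(resid ** 2) / (n - 2))
sx = x.std(ddof=1)
T = np.sqrt(n - 1) * (b1 - 0) / (se / sx)      # test statistic from above
pval = 2 * stats.t.sf(abs(T), n - 2)           # two-sided p-value, t_{n-2} reference
print(b1, T, pval)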

3.4.1    Interpretation

We will never have data that gives us a slope of exactly 0. Hence, to say $X$ has a linear effect on $Y$, we would like our slope to be statistically significantly away from 0. The above test provides that. In the case we reject the null, our line is statistically significant. If we fail to reject, we do not have enough evidence to say that $X$ affects $Y$.

3.5    Confidence intervals for a prediction

With the variance of each of our estimators we are able to build confidence intervals for predictions. We can ask our line to predict a new $y_{n+1}$ for a given $x_{n+1}$, or the mean $E(y_{n+1} \mid x_{n+1})$. The latter will have a smaller confidence interval.

$\hat{y}_{n+1} \pm t_{n-2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$

and

$\hat{y}_{n+1} \pm t_{n-2}\, s_e \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$

Both formulas have the satisfying form of our interval becoming wider as our prediction point, $x_{n+1}$, moves further from the center of our data, $\bar{x}$.
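The following sketch evaluates both intervals at an illustrative new point $x_{n+1}$; the data and all names are made up for the example.

import numpy as np
from scipy import stats
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])
n, x_new, alpha = len(x), 7.5, 0.05
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - ybar)) / Sxx
se = np.sqrt(np.sum((y - (ybar + b1 * (x - xbar))) ** 2) / (n - 2))
y_hat = ybar + b1 * (x_new - xbar)
tq = stats.t.ppf(1 - alpha / 2, n - 2)
lever = 1 / n + (x_new - xbar) ** 2 / Sxx
half_new = tq * se * np.sqrt(1 + lever)    # interval for a new observation y_{n+1}
half_mean = tq * se * np.sqrt(lever)       # interval for the mean E(y_{n+1} | x_{n+1})
print(y_hat - half_new, y_hat + half_new)
print(y_hat - half_mean, y_hat + half_mean)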

3.6    Reverse Prediction: Not on Exam

Given $y_{n+1}$ we can also predict $x_{n+1}$. Under our assumptions the MLE would be

$\hat{x}_{n+1} = \bar{x} + \frac{1}{\hat{\beta}_1}\,(y_{n+1} - \bar{y})$

and we can find

$\mathrm{var}(\hat{x}_{n+1}) \approx \frac{\sigma^2}{\beta_1^2}\left[1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right]$

4    Violations of assumptions

To detect departures from our assumptions we often:

• Look at scatter plot of the data

• Compute $R^2$

• Plot the residuals versus $x$

• Test for normality of our residuals

In the case our assumptions are violated, the easiest corrective measure is a transformation of our data. We will revisit this later. Non-parametric methods exist as well.

The second easiest corrective measure is to add other predictors to our model. Again, we will revisit this another day.

4.1    Non-normality

The impact of non-normal residuals is that our confidence intervals will not be reliable. Robust regression methods can help here.

4.2    Correlated Errors

The impact of correlated residuals is that our confidence intervals will not be reliable. Time series methods can help here. The Durbin-Watson test checks for correlation between sequential residuals, based on the model

$e_{i+1} = \rho\, e_i + \nu_i$

with $\nu_i \sim N(0, \sigma_\nu^2)$, and then we test $H_0: \rho = 0$. In the event of correlation, the Cochrane-Orcutt procedure or generalized estimating equations can help.
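A minimal sketch of the Durbin-Watson statistic computed directly from a vector of residuals (the residual values are illustrative); values near 2 are consistent with $\rho = 0$, while small values suggest positive serial correlation.

import numpy as np
e = np.array([0.3, 0.5, 0.4, -0.1, -0.4, -0.6, -0.2, 0.1, 0.4, 0.5])  # residuals, in time order
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw)   # roughly 2(1 - rho_hat); small values indicate positive correlation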

4.3    Heteroskedasticity:  Unequal Variances

The simple answer to this is to minimize a weighted sum of our squared residuals. That is,

$\sum_{i=1}^{n} w_i\, e_i^2$

The trick is finding the proper set of $\{w_i\}$. If we knew $\mathrm{var}(e_i) = \sigma_i^2$ then we would set

$w_i = \frac{1}{\sigma_i^2}$

Hence we can assume some sort of volatility structure, $\sigma_i = \sigma(x_i)$, either in closed form, estimated from our residuals, or taken from a second predictor.
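As an illustration, the sketch below assumes a volatility structure $\sigma_i \propto x_i$, sets $w_i = 1/\sigma_i^2$, and minimizes $\sum_i w_i e_i^2$ by solving the weighted normal equations; the structure and all names are assumptions for the example.

import numpy as np
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 40)
sigma_i = 0.3 * x                        # assumed volatility structure sigma(x_i)
y = 1.0 + 0.5 * x + rng.normal(0, sigma_i)
w = 1.0 / sigma_i ** 2                   # weights w_i = 1 / sigma_i^2
X = np.column_stack([np.ones_like(x), x])
W = np.diag(w)
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # minimizes sum_i w_i e_i^2
print(beta_hat)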

5    Alternative Measures of Correlation: Not on exam

To build our linear regression model we need the correlation of our data.  The standard method of measuring correlation is Pearson’s. Just as we have alternative measures of center and spread we also have alternative measures of correlation.

5.1    Pearson's Product Moment Correlation

In probability the definition of correlation is the following.

$\rho = E\!\left[\left(\frac{X - \mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right], \qquad -1 \le \rho \le 1$

A common estimate of the above is Pearson's method.

$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$

Under our model assumptions we can produce the below test statistic

$T = r\sqrt{\frac{n-2}{1-r^2}}$

which has a $t_{n-2}$ distribution under the null. The more general hypothesis $H_0: \rho = \rho_0$ uses Fisher's Z-transformation:

$Q(r) = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right), \qquad Z = \left[Q(r) - Q(\rho_0)\right]\sqrt{n-3}$

which has a standard normal distribution under the null.

Note: The problem with Pearson's correlation coefficient is that outliers can push the value around. The most famous example is two data points, which produce a correlation of 1. While two data points is a silly example, two clusters of data points, or one cluster and an outlier, is not. To address this we introduce more rank-based metrics.
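The sketch below computes $r$, the $t$-statistic for $H_0: \rho = 0$, and the Fisher $Z$-statistic for an illustrative $H_0: \rho = 0.5$; the data and the value of $\rho_0$ are made up.

import numpy as np
from scipy import stats
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 1.8, 3.5, 3.9, 5.0, 4.8, 6.2, 6.9])
n = len(x)
r = np.corrcoef(x, y)[0, 1]
T = r * np.sqrt((n - 2) / (1 - r ** 2))          # t-test of H0: rho = 0
p_t = 2 * stats.t.sf(abs(T), n - 2)
rho0 = 0.5                                       # illustrative null value
Q = lambda u: 0.5 * np.log((1 + u) / (1 - u))    # Fisher's Z-transformation
Z = (Q(r) - Q(rho0)) * np.sqrt(n - 3)
p_z = 2 * stats.norm.sf(abs(Z))
print(r, T, p_t, Z, p_z)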

5.2    Spearman's Rank Correlation Coefficient

Here we are less concerned with whether, when $x$ changes by $\delta$, $y$ changes by $\beta_1\delta$, than with whether $y$ increases when $x$ increases. To reflect this we replace our data points with their ranks. Denoting by $d_i$ the difference between the rank of $x_i$ (relative to the $x$'s) and the rank of $y_i$ (relative to the $y$'s), we define Spearman's rank correlation coefficient to be

$r_s = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$

If there are no ties in rank, then $r_s$ is the Pearson correlation of the ranks, not of the raw data.
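A small sketch checking the two routes to $r_s$ on tie-free toy data: the $d_i$ formula above and the Pearson correlation of the ranks (the data are illustrative).

import numpy as np
from scipy import stats
x = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0])
y = np.array([2.0, 0.5, 5.0, 1.0, 4.0, 8.0, 3.0])
n = len(x)
rx, ry = stats.rankdata(x), stats.rankdata(y)
d = rx - ry
rs_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))   # the d_i formula above
rs_ranks = np.corrcoef(rx, ry)[0, 1]                       # Pearson correlation of the ranks
print(rs_formula, rs_ranks, stats.spearmanr(x, y)[0])      # all agree when there are no ties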

5.3    Kendall's Tau

Kendall's Tau is another rank-based measure of association. Let $R_i$ and $S_i$ be the ranks of $X_i$ and $Y_i$ respectively. Then

$\tau = \frac{1}{\binom{n}{2}} \sum_{i<j} \operatorname{sgn}\!\big((R_i - R_j)(S_i - S_j)\big)$

with

$\operatorname{sgn}(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{if } x = 0 \\ -1, & \text{if } x < 0 \end{cases}$

Kendall's tau tends to be more robust than Spearman's and is used when the sample size is smaller.
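The sketch below evaluates the sign formula over all pairs and compares it with scipy's kendalltau on tie-free toy data (the data are illustrative).

import numpy as np
from scipy import stats
x = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0])
y = np.array([2.0, 0.5, 5.0, 1.0, 4.0, 8.0, 3.0])
n = len(x)
R, S = stats.rankdata(x), stats.rankdata(y)
total = 0.0
for i in range(n):
    for j in range(i + 1, n):
        total += np.sign((R[i] - R[j]) * (S[i] - S[j]))
tau = total / (n * (n - 1) / 2)            # average of the signs over all n-choose-2 pairs
print(tau, stats.kendalltau(x, y)[0])      # agrees with scipy when there are no ties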

6    Robust Regression Methods: Not on exam

Minimizing least squares punishes the line severely for large residuals. If, due to whatever reason, we don't want our line determined by outliers, we can choose a more linear measure of error. While this is hard to solve with calculus, computers can do the work for us. There is more than one way to define this.

6.1    Least Absolute Deviation

Here we find the estimator that minimizes, w.r.t. $\beta$,

$\sum_{i=1}^{n} \left| e_i(\beta) \right|$

Note:

• The sum of the residuals might not be 0

• Need numerical methods to solve (a sketch follows below)
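A hedged sketch of least absolute deviation using a generic numerical optimizer; a dedicated routine would typically be used in practice, and all names and data here are illustrative.

import numpy as np
from scipy.optimize import minimize
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 30)
y[-1] += 15                                   # one large outlier
def lad_loss(beta):
    resid = y - (beta[0] + beta[1] * x)
    return np.sum(np.abs(resid))              # sum of |e_i(beta)|
b_ols = np.polyfit(x, y, 1)                   # ordinary least squares, for comparison
res = minimize(lad_loss, x0=np.array([1.0, 0.5]), method="Nelder-Mead")
print("OLS slope:", b_ols[0], "LAD slope:", res.x[1])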

6.2    Least Median of Squares Regression

Previously we tried to avoid outliers by not squaring our residuals. Here we avoid them by minimizing the median as opposed to the mean.

$\operatorname{median}\left\{ e_i^2(\beta),\ i = 1, 2, \ldots, n \right\}$

• This method has a high breakdown point (roughly 50%)

• Need complex numerical methods to solve
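The same generic-optimizer approach can sketch the least median of squares objective; because the objective is non-smooth with many local minima, serious implementations use resampling-based searches instead (data and names are illustrative).

import numpy as np
from scipy.optimize import minimize
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 30)
y[:5] += 12                                   # a cluster of outliers
def lms_loss(beta):
    resid = y - (beta[0] + beta[1] * x)
    return np.median(resid ** 2)              # median of the squared residuals
res = minimize(lms_loss, x0=np.array([1.0, 0.5]), method="Nelder-Mead")
print(res.x)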

6.3    Trimmed Regression

Here we throw out some of our outliers. We minimize

$\sum_{i=1}^{q} e_{(i)}^2(\beta)$

where $e_{(1)}^2(\beta) \le \cdots \le e_{(n)}^2(\beta)$ are the ordered squared residuals and $q < n$. This is computationally intensive.
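A similar sketch for the trimmed objective, summing only the $q$ smallest squared residuals (the choice of $q$ and the data are illustrative).

import numpy as np
from scipy.optimize import minimize
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 30)
y[::7] += 10                                  # scattered outliers
q = 24                                        # keep only the q < n smallest squared residuals
def trimmed_loss(beta):
    r2 = (y - (beta[0] + beta[1] * x)) ** 2
    return np.sum(np.sort(r2)[:q])
res = minimize(trimmed_loss, x0=np.array([1.0, 0.5]), method="Nelder-Mead")
print(res.x)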

6.4    M-Estimates

Here we will replace our function $(\cdot)^2$ or $|\cdot|$ with a more general function $\rho(\cdot)$ and minimize

$\sum_{i=1}^{n} \rho\big(e_i(\beta)\big)$


The Huber loss function combines our $L_1$ and $L_2$ loss functions:

$\rho(x) = \begin{cases} \frac{1}{2}x^2, & \text{if } |x| < k \\ k|x| - \frac{1}{2}k^2, & \text{otherwise} \end{cases}$
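A hedged sketch of an M-estimate with the Huber loss above, fit by a generic optimizer; the threshold $k$ and all other values are illustrative.

import numpy as np
from scipy.optimize import minimize
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 30)
y[-3:] += 10                                  # a few outliers
k = 1.345                                     # illustrative threshold between the two regimes
def huber(u):
    # quadratic for |u| < k, linear beyond k
    return np.where(np.abs(u) < k, 0.5 * u ** 2, k * np.abs(u) - 0.5 * k ** 2)
def m_loss(beta):
    return np.sum(huber(y - (beta[0] + beta[1] * x)))
res = minimize(m_loss, x0=np.array([1.0, 0.5]), method="Nelder-Mead")
print(res.x)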