ADA Spring 2023: Linear Regression 1
January 1, 2023
For today’s lecture we will focus on data of the form (x_i, y_i), with x and y both quantitative variables. Our goal is to determine whether x influences y. We restrict ourselves to models of the form
y_i = \mu + \beta_1 (x_i - \bar{x}) + \epsilon_i \qquad (1)
where i indexes our data points. We also assume

\epsilon_i \sim N(0, \sigma^2) \qquad (2)

with the \epsilon_i independent of one another. We consider our x_i's to be fixed; only the y_i are random. Our goals will be
• Given our data, how do we estimate µ, β1 and σ?
• Given our estimated line, can we build a confidence interval for a prediction?
• Is the effect of X on Y statistically significant, i.e. can we say with confidence that β1 ≠ 0?
• Can we build a confidence interval for σ?
• How do we handle non normal residuals?
• (New) Alternative measures of correlation.
• (New) Robust Regression.
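The model above is easy to simulate, which is a useful sanity check for everything that follows. A minimal sketch (the parameter values µ = 2, β1 = 0.5, σ = 1 and the variable names are illustrative assumptions, not from the notes):

```python
import numpy as np

# Simulate from the model y_i = mu + beta1*(x_i - xbar) + eps_i.
# The parameter values below are illustrative, not from the lecture.
rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 10.0, n)            # the x_i are treated as fixed
mu, beta1, sigma = 2.0, 0.5, 1.0
eps = rng.normal(0.0, sigma, n)          # iid N(0, sigma^2) noise
y = mu + beta1 * (x - x.mean()) + eps    # equation (1)
```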
2 Fitting our line
See posted notes LR Review 1 and LR Review 2.
2.1 Gauss-Markov Theorem: Not on exam
In your statistical inference course you learned about various types of estimators. We have chosen estimators that minimize the sum of squared residuals. How do we justify this? Under condition (2) we have the Gauss-Markov theorem.
Gauss (d. 1855) was able to show that, under the conditions of the residuals being independent and normally distributed, each with the same variance, our ordinary least squares estimator has the lowest sampling variance of all linear unbiased estimators.
Markov (b. 1856) was able to loosen these conditions. He showed that all that is needed is that the residuals be uncorrelated, with mean zero and finite, homoscedastic variance.
3 Confidence Intervals and Linear Regression
To build a confidence interval for our estimators and for our predictions we need some understanding of how our data is generated. For this we assume that both equations (1) and (2) are true, where the only uncertainty is in the accuracy of our estimates and in the future random noise, \epsilon_{n+1}.
3.1 Results From Inference Class
In inference we learned that equations (1) and (2) produce the following results:

\hat{\mu} = \bar{y} \sim N(\mu, \sigma^2/n)

\hat{\beta}_1 = r\,\frac{s_y}{s_x} \sim N\!\left(\beta_1, \frac{\sigma^2}{(n-1)s_x^2}\right)

\frac{(n-2)\,s_e^2}{\sigma^2} = \frac{\sum_i e_i^2}{\sigma^2} \sim \chi^2_{n-2}
We also learned of the independence of these 3 estimators.
Note: since the true value of σ is unknown we replace it with the estimate s_e, whose scaled square has a χ² distribution. This causes the test statistics for our other two estimators to switch from a z to a t distribution.
3.2 The noise
To build any confidence interval we must estimate σ from equation (2). We use our observed residuals:

s_e^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2
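The estimates above can be computed directly in a few lines. A sketch (the helper name `ols_fit` is mine, not from the notes):

```python
import numpy as np

def ols_fit(x, y):
    """Least-squares estimates for y_i = mu + beta1*(x_i - xbar) + eps_i."""
    n = len(x)
    xbar = x.mean()
    beta1 = np.sum((x - xbar) * (y - y.mean())) / np.sum((x - xbar) ** 2)
    mu_hat = y.mean()                        # in the centered model, mu-hat = ybar
    e = y - (mu_hat + beta1 * (x - xbar))    # observed residuals
    se = np.sqrt(np.sum(e ** 2) / (n - 2))   # s_e with n-2 degrees of freedom
    return mu_hat, beta1, se
```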
3.3 The point
We assume our value of \bar{x} is exact and does not vary from experiment to experiment. The observed \bar{y} is assumed to be random and to vary from day to day. As this number is estimated from a given day's data, we would like to build a confidence interval for it as well.

Without performing linear regression, our confidence interval for the center of n observations would be

\bar{y} \pm t_{n-1}\,\frac{s_y}{\sqrt{n}}
With the linear regression we can replace s_y with the smaller s_e and get

\bar{y} \pm t_{n-2}\,\frac{s_e}{\sqrt{n}}
3.4 The slope
We will test the null hypothesis

H_0: \beta_1 = 0 \quad \text{vs} \quad H_a: \beta_1 \neq 0
Using our results from inference class this can be done with a simple t-test:

t = \frac{\sqrt{n-1}\,(b_1 - 0)}{s_e / s_x}
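As a sketch, the test can be carried out numerically (the function name `slope_t_test` is my own; `scipy` is assumed available):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """t-statistic and two-sided p-value for H0: beta1 = 0 (a sketch)."""
    n = len(x)
    xbar = x.mean()
    b1 = np.sum((x - xbar) * (y - y.mean())) / np.sum((x - xbar) ** 2)
    e = y - (y.mean() + b1 * (x - xbar))
    se = np.sqrt(np.sum(e ** 2) / (n - 2))
    sx = np.sqrt(np.sum((x - xbar) ** 2) / (n - 1))  # sample sd of x
    t = np.sqrt(n - 1) * (b1 - 0) / (se / sx)        # the statistic above
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p
```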
3.4.1 Interpretation
We will never have data that gives us a slope of exactly 0. Hence, to say X has a linear effect on Y, we would like our slope to be statistically significantly away from 0. The above test provides that. In the case we reject the null, our line is statistically significant. If we fail to reject, we do not have enough evidence to say that X affects Y.
3.5 Confidence intervals for a prediction
With the variance of each of our estimators we are able to build confidence intervals for predictions. We can ask our line to predict a new y_{n+1} for a given x_{n+1}, or the mean E(y_{n+1} | x_{n+1}). The latter has the smaller confidence interval.
\hat{y}_{n+1} \pm t_{n-2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1}-\bar{x})^2}{(n-1)s_x^2}}

and

\hat{y}_{n+1} \pm t_{n-2}\, s_e \sqrt{\frac{1}{n} + \frac{(x_{n+1}-\bar{x})^2}{(n-1)s_x^2}}
Both formulas have the satisfying form of our interval becoming wider as our prediction point, x_{n+1}, moves further from the center of our data, \bar{x}.
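Both intervals can be computed with one helper. A sketch (`prediction_interval` is a hypothetical name; the `mean_only` flag switches between the two formulas):

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, x_new, level=0.95, mean_only=False):
    """Interval for a new y at x_new, or for its mean if mean_only=True."""
    n = len(x)
    xbar = x.mean()
    sxx = np.sum((x - xbar) ** 2)                     # equals (n-1)*s_x^2
    b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
    e = y - (y.mean() + b1 * (x - xbar))
    se = np.sqrt(np.sum(e ** 2) / (n - 2))
    yhat = y.mean() + b1 * (x_new - xbar)
    var_term = 1.0 / n + (x_new - xbar) ** 2 / sxx
    if not mean_only:
        var_term += 1.0            # extra 1 for the noise of a new observation
    half = stats.t.ppf(1 - (1 - level) / 2, df=n - 2) * se * np.sqrt(var_term)
    return yhat - half, yhat + half
```

The interval for the mean is strictly narrower, since it omits the leading 1 inside the square root.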
3.6 Reverse Prediction : Not on Exam
Given y_{n+1} we can also predict x_{n+1}. Under our assumptions the MLE would be

\hat{x}_{n+1} = \bar{x} + \frac{y_{n+1} - \bar{y}}{\hat{\beta}_1}
and we can find

\operatorname{var}(\hat{x}_{n+1}) \approx \frac{s_e^2}{\hat{\beta}_1^2}\left[1 + \frac{1}{n} + \frac{(x_{n+1}-\bar{x})^2}{(n-1)s_x^2}\right]
4 Violations of assumptions
To detect departures from our assumptions we often:
• look at a scatter plot of the data
• compute R^2
• plot the residuals versus x
• test for normality of our residuals
In the case our assumptions are violated, the easiest corrective measure is a transformation of our data. We will revisit this later. Nonparametric methods exist as well.
The second easiest corrective measure is to add other predictors in our model. Again we will revisit this another day.
4.1 Non-normality
The impact of non-normal residuals is that our confidence intervals will not be reliable. Robust regression methods can help here.
4.2 Correlated Errors
The impact of correlated residuals is that our confidence intervals will not be reliable. Time-series methods can help here. The Durbin-Watson test checks for correlation between sequential residuals. We model

e_i = \rho\, e_{i-1} + \nu_i

with \nu_i \sim N(0, \sigma_\nu^2), and then test H_0: \rho = 0. In the event of correlation, the Cochrane-Orcutt procedure or generalized estimating equations can help.
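The Durbin-Watson statistic itself is simple to compute from the residuals: values near 2 suggest no lag-one correlation, while values near 0 (or 4) suggest positive (or negative) correlation. A sketch (the function name is mine):

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic: sum of squared successive differences of
    the residuals, divided by the sum of squared residuals."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```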
4.3 Heteroskedasticity: Unequal Variances
The simple answer to this is to minimize the sum of our weighted squared residuals. That is,

\sum_{i=1}^{n} w_i e_i^2

The trick is finding the proper set of weights \{w_i\}. If we knew \operatorname{var}(e_i) = \sigma_i^2 then we would set

w_i = \frac{1}{\sigma_i^2}
Hence we can assume some sort of volatility structure, σ = σ(x_i), either of closed form, observed from our residuals, or taken from a second predictor.
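For a fixed set of weights the minimizer has a closed form, since weighted least squares is just least squares on rescaled data. A sketch (the function name is mine; w_i = 1/σ_i² would be the ideal choice above):

```python
import numpy as np

def wls_fit(x, y, w):
    """Slope and intercept minimizing sum_i w_i * e_i^2 (closed form)."""
    xbar = np.sum(w * x) / np.sum(w)      # weighted means
    ybar = np.sum(w * y) / np.sum(w)
    b1 = np.sum(w * (x - xbar) * (y - ybar)) / np.sum(w * (x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1
```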
5 Alternative Measures of Correlation: Not on exam
To build our linear regression model we need the correlation of our data. The standard method of measuring correlation is Pearson’s. Just as we have alternative measures of center and spread we also have alternative measures of correlation.
5.1 Pearson’s product moment Correlation
In probability, the definition of correlation is the following:

\rho = E\!\left[\left(\frac{X - \mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right], \qquad -1 \le \rho \le 1
A common estimate for the above is Pearson's method:

r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)
Under our model assumptions we can produce the below test statistic
T = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}
which has a t_{n-2} distribution under the null. The more general hypothesis H_0: \rho = \rho_0 uses Fisher's Z-transformation:
Q(r) = \frac{1}{2}\ln\frac{1+r}{1-r}

Z = \left[Q(r) - Q(\rho_0)\right]\sqrt{n-3}

which has a standard normal distribution under the null.
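The Z-test is a one-liner once Q is written down. A sketch (`fisher_z_test` is a hypothetical name; `scipy` assumed available):

```python
import numpy as np
from scipy import stats

def fisher_z_test(r, n, rho0):
    """Two-sided test of H0: rho = rho0 via Fisher's Z-transformation."""
    Q = lambda v: 0.5 * np.log((1 + v) / (1 - v))
    z = (Q(r) - Q(rho0)) * np.sqrt(n - 3)
    p = 2 * stats.norm.sf(abs(z))
    return z, p
```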
Note: the problem with Pearson's correlation coefficient is that outliers can push its value around. The most famous example is two data points, which produce a correlation of 1. While 2 data points is a silly example, two clusters of data points, or one cluster and an outlier, is not. To address this we introduce rank-based metrics.
5.2 Spearman’s Rank Correlation Coefficient
Here we are less concerned with whether, when x changes by δ, y changes by β1 δ, than with whether, when x increases, y also increases. To reflect this we replace our data points with their ranks. Denoting by d_i the difference between the rank of x_i (relative to the x's) and the rank of y_i (relative to the y's), we define Spearman's rank correlation coefficient to be
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}
If there are no ties in rank, then r_s is the Pearson's correlation of the ranks, not of the raw data.
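That equivalence is easy to check numerically. A sketch with made-up data containing one extreme x value (`scipy` assumed available):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one extreme x value, no ties
y = np.array([2.0, 4.0, 5.0, 8.0, 9.0])     # monotone increasing in x
r_s, _ = stats.spearmanr(x, y)
# With no ties, Pearson's correlation of the ranks reproduces r_s exactly.
r_ranks = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
```

Because y is monotone in x, r_s = 1 even though the extreme x value drags Pearson's r on the raw data below 1.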
5.3 Kendall’s Tau
Kendall's Tau is another rank-based measure of association. Let R_i and S_i be the ranks of X_i and Y_i respectively. Then

T = \sum_{i<j} \operatorname{sgn}(R_i - R_j)\,\operatorname{sgn}(S_i - S_j)
with

\operatorname{sgn}(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{if } x = 0 \\ -1, & \text{if } x < 0 \end{cases}
Kendall's tau tends to be more robust than Spearman's and is used when the sample size is smaller.
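The pair-counting definition translates directly into code. A sketch assuming no ties (the function name and the normalization by the number of pairs are my additions; with no ties, signs of rank differences equal signs of raw differences):

```python
import numpy as np
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau: average sign agreement over all pairs (no ties)."""
    n = len(x)
    s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            for i, j in combinations(range(n), 2))
    return s / (n * (n - 1) / 2)    # normalize by the number of pairs
```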
6 Robust Regression Methods: Not on exam
Minimizing least squares punishes the line severely for large residuals. If, due to whatever reason, we do not want our line determined by outliers, we can choose a more linear measure of error. While this is hard to solve with calculus, computers can do the work for us. There is more than one way to define this.
6.1 Least Absolute Deviation
Here we find the estimator that minimizes, w.r.t. \beta,

\sum_{i=1}^{n} |\epsilon_i(\beta)|
Note:
• The sum of the residuals might not be 0
• Need numerical methods to solve
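A generic optimizer is enough for small problems. A sketch (`lad_fit` is a hypothetical name; dedicated linear-programming methods are used in practice):

```python
import numpy as np
from scipy.optimize import minimize

def lad_fit(x, y):
    """Minimize sum_i |y_i - (b0 + b1*x_i)| numerically."""
    def loss(beta):
        return np.sum(np.abs(y - (beta[0] + beta[1] * x)))
    slope, intercept = np.polyfit(x, y, 1)          # OLS starting point
    res = minimize(loss, x0=[intercept, slope], method="Nelder-Mead")
    return res.x                                    # [intercept, slope]
```

Nelder-Mead is used here because the objective is not differentiable where residuals hit zero.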
6.2 Least Median of Square Regressions
Previously we tried to avoid outliers by not squaring our residuals. Here we avoid them by minimizing the median as opposed to the mean:

\operatorname{median}\{\epsilon_i^2(\beta),\ i = 1, 2, \ldots, n\}
• This method has a high breakdown point (maybe 50%)
• Need complex numerical methods to solve
6.3 Trimmed Regression
Here we throw out some of our outliers. Ordering the squared residuals from smallest to largest, e_{(1)}^2 \le \cdots \le e_{(n)}^2, we minimize

\sum_{i=1}^{q} e_{(i)}(\beta)^2
where q < n. This is computationally intensive.
6.4 M-Estimates
Here we will replace our function (\cdot)^2 or |\cdot| with a more general function \rho(\cdot) and minimize

\sum_{i=1}^{n} \rho(e_i(\beta))
The Huber loss function combines our L1 and L2 loss functions:

\rho(x) = \begin{cases} \dfrac{x^2}{2}, & \text{if } |x| \le k \\ k|x| - \dfrac{k^2}{2}, & \text{otherwise} \end{cases}
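M-estimates with the Huber loss are typically computed by iteratively reweighted least squares rather than direct minimization. A sketch (the function name, the conventional tuning constant k = 1.345, and the MAD-based scale estimate are my assumptions, not from the notes):

```python
import numpy as np

def huber_fit(x, y, k=1.345, iters=50):
    """Huber M-estimate of intercept and slope via iteratively
    reweighted least squares (IRLS)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # OLS starting point
    for _ in range(iters):
        e = y - X @ beta
        s = np.median(np.abs(e)) / 0.6745 + 1e-12     # robust scale (MAD)
        u = e / s
        w = np.where(np.abs(u) <= k, 1.0, k / np.abs(u))  # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta                                        # [intercept, slope]
```

Large residuals get weight k/|u| < 1, so outliers pull on the fit linearly rather than quadratically.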