ADA S23: Linear Regression 3 - Step Selection and PCA
1 Introduction
Today we will focus on multifactor linear regression. In its general form, multivariate regression is not dissimilar from single-factor regression. We will have a model of the form
y_i = \beta_0 + \sum_{k=1}^{n} \beta_k x_i^{(k)} + \epsilon_i
Just like before, we will assume \epsilon_i \sim N(0, \sigma^2), i.i.d. Our method of choosing which coefficients are best will be the same: minimize the sum of squared residuals. Just like before, this will put our fitted line through the point (\bar{y}, \bar{x}^{(1)}, \bar{x}^{(2)}, \ldots, \bar{x}^{(n)}), that is,
\hat{y}_i = \bar{y} + \sum_{k=1}^{n} b_k \left( x_i^{(k)} - \bar{x}^{(k)} \right)
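As a concrete illustration, here is a minimal sketch of fitting such a multifactor model by least squares with numpy; the simulated data and all variable names are illustrative, not from the lecture.

```python
import numpy as np

# Simulated data: n_obs observations and p predictors (illustrative only)
rng = np.random.default_rng(0)
n_obs, p = 250, 3
X = rng.normal(size=(n_obs, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 0.7 + X @ beta_true + rng.normal(scale=0.3, size=n_obs)

# Ordinary least squares: choose coefficients minimizing the sum of squared residuals
X_design = np.column_stack([np.ones(n_obs), X])        # prepend an intercept column
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b = coef[0], coef[1:]

# The fitted surface passes through the point of means (y-bar, x-bar)
print("intercept and slopes:", b0, b)
print("y-bar vs fit at x-bar:", y.mean(), b0 + X.mean(axis=0) @ b)
```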
We will focus on one goal today: just because we can make a multivariate model doesn't mean we should. This problem is referred to as 'dimension reduction', and we will look at two general procedures: step selection and principal component analysis (PCA).
1.1 Simple Example
The S&P 500 is an index made from 500 stocks. If we used each stock included we would be able to model the S&P 500 perfectly, but let's consider two other questions.
Question 1: If we try to pick 10 stocks to best model the S&P 500, how would we do this? How much extra do we get if we choose to use 11? This is step selection.
Question 2: If I want to Monte Carlo simulate the S&P 500, do I need to generate 500 normal random variables, one for each stock, or can I effectively do it with fewer? This is PCA.
1.2 Multicollinearity
In the case of \{(y_i, x_i^{(1)}, x_i^{(2)})\} with x_i^{(1)} = x_i^{(2)} for all i \in \{1, 2, \ldots, n\}, it looks like we have two predictors for y, but in truth each carries the same information, hence in practice we only have one. Mathematically this becomes an issue because both models

y = b \, x^{(1)} \qquad y = b \, x^{(2)}

will have the same residuals. In such a simple case, where one of our predictors can be fitted completely by a linear combination of the others, we say that our correlation matrix is not of full rank and hence cannot be inverted.
Here we are concerned with the slightly different case, where one predictor can be fitted very well, but not completely, by a linear combination of the others. In this case the correlation matrix is, in theory, invertible; however, inverting it numerically is challenging. The estimate of the inverted matrix will vary greatly with slight changes in the input data. To avoid this we look at dimension reduction techniques.
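A minimal numerical sketch of this instability, using simulated data in which one predictor is nearly a copy of another (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is almost identical to x1
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2])
print("condition number of X'X:", np.linalg.cond(X.T @ X))  # huge => near-singular

# Refit after a tiny perturbation of the data: the coefficients swing wildly,
# even though the fitted values (and residuals) barely change.
for trial in range(3):
    Xp = X + rng.normal(scale=1e-3, size=X.shape)
    coef, *_ = np.linalg.lstsq(Xp, y, rcond=None)
    print("trial", trial, "coefficients:", coef)
```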
2 Step Selection: Not on Exam
If we have p predictors there are 2^p possible linear models we can make. If p is not too large we can look at all possible models and choose the best. If p is large this is not possible and we are forced to look at another method.
Note: I am assuming this material was covered in your linear regression course. We have it here for completeness.
2.1 Forward Search
The search procedure starts with an empty subset, and at each step adds the predictor variable which has the best predictive value, e.g., results in the largest increase in R^2. Once a variable is entered, it is not dropped.
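A sketch of forward search on a generic predictor matrix, greedily adding the column that gives the largest increase in R^2 at each step (the helper functions and names are illustrative, not from the lecture):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ coef
    return 1.0 - (resid @ resid) / (((y - y.mean()) ** 2).sum())

def forward_search(X, y, n_keep):
    """Greedily add the predictor column giving the largest increase in R^2."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_keep):
        best = max(remaining, key=lambda j: r_squared(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```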
2.2 Backward elimination
Begin with a model containing all potential explanatory variables. At each step we drop the explanatory variable with the least predictive value. The approach is computationally more cumbersome than the forward method.
2.3 Efroymson’s Method
Similar to the forward search approach, except that when a new variable is entered, partial correlations are considered to see if any of the variables already in the model should now be dropped.
2.4 Issues with Stepwise Model Selection
• Stepwise regression should be used only for exploratory purposes or for purposes of pure prediction.
• It should not be used for hypothesis testing.
• The nominal significance level used at each step is subject to inflation.
• Automated fitting may lead to over-fitting.
• It is affected by multicollinearity.
• Dummy variables are usually treated individually.
3 Principal Component Analysis
In practice, the term Principal Component Analysis (PCA) covers first a rotation, followed by step selection in the new coordinate system.
The rotation methods are well covered in linear algebra and used in a variety of fields. In physics they are referred to as eigenvectors and eigenvalues. In math, if something is well covered it means there are textbooks that can help us if we need them, as opposed to having to invent new infrastructure.
3.1 Principal Axis Analysis: Simple Example
Before discussing the rotation of data we will look at rotating fixed points. In junior high school we learn to sketch a graph of
\frac{x^2}{9} + \frac{y^2}{25} = 1
However, the function

5x^2 + 8xy + 5y^2 = 1
appears more complicated. Writing this in matrix form we get

5x^2 + 8xy + 5y^2 = \begin{pmatrix} x & y \end{pmatrix} \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = X^T A X
A is both symmetric and positive definite, hence we can decompose it:

A = S D S^T = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 9 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}
This gives us the information that our equation can be rewritten!
5x^2 + 8xy + 5y^2 = 1 \left( \frac{x - y}{\sqrt{2}} \right)^2 + 9 \left( \frac{x + y}{\sqrt{2}} \right)^2
And we can now see our function is a rotated ellipse .
Our two eigenvalues, 1 and 9, are the diagonal entries of the center matrix. Our two eigenvectors are the columns (rows) of the first (last) matrix. Notice that the length of each eigenvector is 1 and each eigenvector is orthogonal to the other.
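A quick numerical check of this decomposition, as a sketch; note that numpy returns the eigenvalues in ascending order.

```python
import numpy as np

A = np.array([[5.0, 4.0],
              [4.0, 5.0]])

# eigh is for symmetric matrices; it returns eigenvalues in ascending order
# and orthonormal eigenvectors as the columns of S
eigenvalues, S = np.linalg.eigh(A)
print(eigenvalues)                      # [1. 9.]
print(S)                                # columns proportional to (1,-1)/sqrt(2) and (1,1)/sqrt(2)
print(S @ np.diag(eigenvalues) @ S.T)   # reconstructs A
```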
3.2 Principal Component Analysis: Simple Example
For the last 250 days we see the closing prices of two stocks, Bank of New York (BK) and Bank of America (BAC). Below is a summary of the data.

Table 1: Stock Data Summary

        Min     Max     Mean    Variance
BK      36.96   62.07   47.38   47.82
BAC     29.77   48.54   38.16   28.82

Correlation: 96.6%
With the above information we can calculate a covariance matrix

\Sigma = \begin{pmatrix} 47.82 & 35.88 \\ 35.88 & 28.82 \end{pmatrix}

This matrix can be decomposed:

\Sigma = \begin{pmatrix} 0.79 & -0.61 \\ 0.61 & 0.79 \end{pmatrix} \begin{pmatrix} 75.44 & 0 \\ 0 & 1.21 \end{pmatrix} \begin{pmatrix} 0.79 & 0.61 \\ -0.61 & 0.79 \end{pmatrix}

Notice the sum of our two eigenvalues equals the sum of our original variances: 47.82 + 28.82 = 75.44 + 1.21 (up to rounding). Plotting our data as well as the two lines defined by our eigenvectors (0.79, 0.61) and (-0.61, 0.79) gives Figure 1.
Another way to look at this is that a portfolio that is long 0.79 shares of BK and long 0.61 shares of BAC will be uncorrelated with a portfolio that is short 0.61 shares of BK and long 0.79 shares of BAC.
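A sketch reproducing these numbers from the summary statistics above (using the variances 47.82 and 28.82 and the 96.6% correlation; the variable names are illustrative):

```python
import numpy as np

var_bk, var_bac, corr = 47.82, 28.82, 0.966
cov = corr * np.sqrt(var_bk * var_bac)
sigma = np.array([[var_bk, cov],
                  [cov, var_bac]])

eigenvalues, S = np.linalg.eigh(sigma)     # ascending: roughly 1.2 then 75.4
print(eigenvalues, eigenvalues.sum())      # their sum equals the total variance

# Returns along the two eigenvector "portfolios" are uncorrelated:
# the off-diagonal entries of S' Sigma S are (numerically) zero.
print(S.T @ sigma @ S)
```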
We can add a third stock, GameStop (GME), which has a much lower correlation with the banking sector. The covariance matrix becomes

\Sigma = \begin{pmatrix} \mathrm{Var}(BK) & \mathrm{Cov}(BK, BAC) & \mathrm{Cov}(BK, GME) \\ \mathrm{Cov}(BK, BAC) & \mathrm{Var}(BAC) & \mathrm{Cov}(BAC, GME) \\ \mathrm{Cov}(BK, GME) & \mathrm{Cov}(BAC, GME) & \mathrm{Var}(GME) \end{pmatrix}

where the BK and BAC entries are as before, Var(GME) = 52.94, and Cov(BAC, GME) = 11.52.
Decomposing this matrix we get

\Sigma = \begin{pmatrix} 0.69 & -0.39 & 0.61 \\ 0.53 & -0.31 & -0.79 \\ 0.50 & 0.87 & -0.00 \end{pmatrix} \begin{pmatrix} 86.50 & 0 & 0 \\ 0 & 41.88 & 0 \\ 0 & 0 & 1.20 \end{pmatrix} \begin{pmatrix} 0.69 & 0.53 & 0.50 \\ -0.39 & -0.31 & 0.87 \\ 0.61 & -0.79 & -0.00 \end{pmatrix}
Notice: the first eigenvector (most of the variance) is the broad market, the second eigenvector is the banks vs. GameStop, and the last eigenvector is the spread of BK to BAC. The second eigenvalue being roughly a third of the total variance gives some sense of how decoupled GME is from the banking sector.
3.3 Principal Component Analysis: Generalized
For our predictor variables we can create a covariance matrix. A covariance matrix, by definition, is symmetric and positive semi-definite. This means it can be decomposed into eigenvectors and eigenvalues. Our original predictor variables can be recombined, using the eigenvectors, into uncorrelated predictors. The eigenvalues tell us the amount of variance along these axes.
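A generic sketch of that recipe for an arbitrary predictor matrix: center the predictors, decompose their covariance, and rotate into uncorrelated components whose variances are the eigenvalues (function and variable names are illustrative).

```python
import numpy as np

def pca_components(X):
    """Rotate predictors X (n_samples x p) into uncorrelated components."""
    Xc = X - X.mean(axis=0)                    # center each predictor
    cov = np.cov(Xc, rowvar=False)             # p x p covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]      # sort by variance explained
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    Z = Xc @ eigenvectors                      # the principal components
    return Z, eigenvalues, eigenvectors

# The columns of Z are uncorrelated and the variance of Z[:, k] equals the k-th
# eigenvalue, so keeping only the first few columns is step selection in the
# rotated coordinate system.
```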
Figure 1: Two Stock Prices and Their Eigenvectors
4 Criteria for predictor selection: Not on Exam
For both PCA and step methods, each time another predictor is added our R^2 will improve. Hence we need some methodology to decide how many predictors should be added. Often this is more of an art than a science; however, some numerical criteria have been established.
4.1 Adjusted R2
The basic idea is some sort of punishment function for including too many factors in the model. One method, proposed by Mordecai Ezekiel, is

R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}

where p is the number of explanatory variables and n is the number of samples. Notice the fraction is the ratio of the degrees of freedom of the two sums of squares that appear in the regular R^2.
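A one-function sketch of this adjustment (the numbers in the example are illustrative):

```python
def adjusted_r_squared(r2, n, p):
    """Ezekiel's adjusted R^2 for n samples and p explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Adding a weak predictor raises R^2 slightly but can lower the adjusted value:
print(adjusted_r_squared(0.800, n=100, p=5))   # ~0.789
print(adjusted_r_squared(0.801, n=100, p=6))   # ~0.788
```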
4.2 The F-Statistic Criterion
This is the method illustrated in LR Lecture 2.
4.3 The Cp Criterion (Mallows' Cp)
A small value implies a precise model:

C_k = \frac{SSE_k}{\hat{\sigma}^2} - \left[ n - 2(k + 1) \right]

where SSE_k is the SSE based on a subset having k predictors and \hat{\sigma}^2 is the error variance estimated from the full model. For a subset with no important omitted variables we have

E[C_k] = k + 1
We are looking for a subset of k predictors such that
• C_k is small
• C_k \approx k + 1
Note: This is a special case of the Akaike Information Criterion that we will see later.
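A sketch of computing C_k for candidate subsets, assuming the error variance is estimated from the full model (helper names are illustrative):

```python
import numpy as np
from itertools import combinations

def sse(X, y):
    """Sum of squared errors of an OLS fit (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ coef
    return resid @ resid

def mallows_cp(X_full, y, subset):
    """C_k for the given tuple of predictor column indices."""
    n, p_full = X_full.shape
    sigma2_hat = sse(X_full, y) / (n - p_full - 1)   # error variance from the full model
    k = len(subset)
    return sse(X_full[:, list(subset)], y) / sigma2_hat - (n - 2 * (k + 1))

# Example usage: score every 2-predictor subset and keep those with C_k near k + 1.
# scores = {s: mallows_cp(X, y, s) for s in combinations(range(X.shape[1]), 2)}
```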
4.4 Akaike Information Criterion (AIC)
This is a more general technique, used often in time series.
5 Outliers and Influence Points
We all know that one outlier can greatly change our estimate of a slope. Therefore it is important to be able to determine which, if any, data points are outliers. With single-factor regression this is easy; with multivariate regression it is much harder. Luckily, PCA can help with this.
5.1 Mahalanobis Distance
In multi-dimensional (n) space there is more than one way to measure the distance between two points, x_p and x_d. The most common is the Euclidean distance:

D_E = \sqrt{ \left( x_p^{(1)} - x_d^{(1)} \right)^2 + \left( x_p^{(2)} - x_d^{(2)} \right)^2 + \cdots + \left( x_p^{(n)} - x_d^{(n)} \right)^2 }
That is not, however, the only distance. For example, there is the Manhattan distance:

D_{Mn} = \left| x_p^{(1)} - x_d^{(1)} \right| + \left| x_p^{(2)} - x_d^{(2)} \right| + \cdots + \left| x_p^{(n)} - x_d^{(n)} \right|
The idea of the Mahalanobis distance is to measure the distance between a point and a cloud of data. To accomplish this we find the eigenvectors and eigenvalues of the cloud and then measure the distance of the point to the center of the cloud as a Euclidean distance in z-scores. Using S as the covariance matrix of our cloud,

D_M = \sqrt{ (x_p - x_d)^T S^{-1} (x_p - x_d) }
P. C. Mahalanobis used this idea to measure a point's distance from the center of a distribution D with mean \mu. Mahalanobis's definition was prompted by the problem of identifying the similarity of skulls based on measurements in 1927, and it is commonly used in clustering methods.
This distance is zero for P at the mean of D and grows as P moves away from the mean along each principal component axis. If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to standard Euclidean distance in the transformed space. The Mahalanobis distance is thus unitless, scale-invariant, and takes into account the correlations of the data set.
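A minimal sketch computing this distance for points measured against a simulated data cloud (names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
cloud = rng.multivariate_normal([0, 0], [[4.0, 3.0], [3.0, 4.0]], size=500)

mu = cloud.mean(axis=0)                               # center of the cloud
S_inv = np.linalg.inv(np.cov(cloud, rowvar=False))    # inverse covariance of the cloud

def mahalanobis(x, mu, S_inv):
    """Distance of point x from the cloud center, in correlation-aware z-scores."""
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

# A point along the minor (low-variance) axis is "farther" than one equally far
# along the major (high-variance) axis, even though both are the same Euclidean
# distance from the center.
print(mahalanobis(np.array([2.0, 2.0]), mu, S_inv))   # along the major axis
print(mahalanobis(np.array([2.0, -2.0]), mu, S_inv))  # along the minor axis
```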
5.2 Hoteling’s T-Squared distribution
Hoteling was a professor at Columbia for 15 years and is credited with developing PCA . Hoteling was
able to describe the distsribution associated with a Mahalanbis distance . This allows for the tests of signficance .
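A sketch of the resulting one-sample test of whether a data cloud's mean equals a hypothesized value, using the standard rescaling of T^2 to an F statistic (the data and the hypothesized mean are illustrative):

```python
import numpy as np
from scipy import stats

def hotelling_t2_test(X, mu0):
    """One-sample Hotelling T^2 test of H0: the mean of X equals mu0."""
    n, p = X.shape
    diff = X.mean(axis=0) - mu0
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    t2 = n * diff @ S_inv @ diff                  # n times the squared Mahalanobis distance
    f_stat = (n - p) / (p * (n - 1)) * t2         # rescaled T^2 follows an F(p, n-p) distribution
    p_value = stats.f.sf(f_stat, p, n - p)
    return t2, p_value

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=100)
print(hotelling_t2_test(X, mu0=np.array([0.0, 0.0])))
```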
5.3 Useful YouTube Lectures
PCA involves rotations, so video is a nice way to present the structure. Here are three videos that I think are useful; they are from UC Santa Cruz: Video 1, Video 2, Video 3.