ADA S23: Linear Regression 3 - Step Selection and PCA
1 Introduction
Today we will focus on multifactor linear regression. In its general form, multivariate regression is not dissimilar from single-factor regression. We will have a model of the form
y_i = \beta_0 + \sum_{k=1}^{n} \beta_k x_i^{(k)} + \epsilon_i
Just like before, we will assume \epsilon_i \sim N(0, \sigma^2), i.i.d. Our method of choosing which coefficients are best will be the same: minimize the sum of squared residuals. Just like before, this will put our fitted line through the point (\bar{y}, \bar{x}^{(1)}, \bar{x}^{(2)}, \ldots, \bar{x}^{(n)}), that is,
\hat{y}_i = \bar{y} + \sum_{k=1}^{n} b_k \left( x_i^{(k)} - \bar{x}^{(k)} \right)
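As a concrete illustration, here is a minimal sketch of fitting such a multifactor model by least squares with numpy; the simulated data and all variable names are illustrative, not from the lecture.

```python
import numpy as np

# Simulated data: n_obs observations and p predictors (illustrative only)
rng = np.random.default_rng(0)
n_obs, p = 250, 3
X = rng.normal(size=(n_obs, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 0.7 + X @ beta_true + rng.normal(scale=0.3, size=n_obs)

# Ordinary least squares: choose coefficients minimizing the sum of squared residuals
X_design = np.column_stack([np.ones(n_obs), X])        # prepend an intercept column
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b = coef[0], coef[1:]

# The fitted surface passes through the point of means (y-bar, x-bar)
print("intercept and slopes:", b0, b)
print("y-bar vs fit at x-bar:", y.mean(), b0 + X.mean(axis=0) @ b)
```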
We will focus on one goal today: just because we can make a multivariate model doesn't mean we should. This problem is referred to as 'dimension reduction', and we will look at two general procedures: step selection and principal component analysis (PCA).
1.1 Simple Example
The S&P 500 is an index made from 500 stocks. If we used each stock included we would be able to model the S&P 500 perfectly, but let's consider two other questions.
Question 1: If we try to pick 10 stocks to best model the S&P 500, how would we do this? How much extra do we get if we choose to use 11? This is step selection.
Question 2: If I want to Monte Carlo simulate the S&P 500, do I need to generate 500 normal random variables, one for each stock, or can I effectively do it with fewer? This is PCA.
1.2 Multicollinearity
In the case of \{(y_i, x_i^{(1)}, x_i^{(2)})\} with x_i^{(1)} = x_i^{(2)} for all i \in \{1, 2, \ldots, n\}, it looks like we have two predictors for y, but in truth each carries the same information, hence in practice we only have one. Mathematically this becomes an issue because both models

y = b \, x^{(1)} \qquad y = b \, x^{(2)}

will have the same residuals. In such a simple case, where one of our predictors can be fitted completely by a linear combination of the others, we say that our correlation matrix is not of full rank and hence cannot be inverted.
Here we are concerned with the slightly different case, where one predictor can be fitted very well, but not completely, by a linear combination of the others. In this case the correlation matrix is, in theory, invertible; however, inverting it numerically is challenging. The estimate of the inverted matrix will vary greatly with slight changes in the input data. To avoid this we look at dimension reduction techniques.
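A minimal numerical sketch of this instability, using simulated data in which one predictor is nearly a copy of another (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is almost identical to x1
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2])
print("condition number of X'X:", np.linalg.cond(X.T @ X))  # huge => near-singular

# Refit after a tiny perturbation of the data: the coefficients swing wildly,
# even though the fitted values (and residuals) barely change.
for trial in range(3):
    Xp = X + rng.normal(scale=1e-3, size=X.shape)
    coef, *_ = np.linalg.lstsq(Xp, y, rcond=None)
    print("trial", trial, "coefficients:", coef)
```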
2 Step Selection: Not on Exam
If we have p predictors there are 2^p possible linear models we can make. If p is not too large we can look at all possible models and choose the best. If p is large this is not possible and we are forced to look at another method.
Note: I am assuming this material was covered in your linear regression course. We have it here for completeness.
2.1 Forward Search
The search procedure starts with an empty subset, and at each step adds the predictor variable which has the best predictive value, e.g., results in the largest increase in R^2. Once a variable is entered, it is not dropped.
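A sketch of forward search on a generic predictor matrix, greedily adding the column that gives the largest increase in R^2 at each step (the helper functions and names are illustrative, not from the lecture):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ coef
    return 1.0 - (resid @ resid) / (((y - y.mean()) ** 2).sum())

def forward_search(X, y, n_keep):
    """Greedily add the predictor column giving the largest increase in R^2."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_keep):
        best = max(remaining, key=lambda j: r_squared(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```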
2.2 Backward elimination
Begin with a model containing all potential explanatory variables. At each step we drop the explanatory variable with the least predictive value. The approach is computationally more cumbersome than the forward method.
2.3 Efroymson’s Method
Similar to the forward search approach, except that when a new variable is entered, partial correlations are considered to see if any of the variables already in the model should now be dropped.
2.4 Issues with Stepwise Model Selection
• Stepwise regression should be used only for exploratory purposes or for purposes of pure prediction.
• It should not be used for hypothesis testing.
• The nominal significance level used at each step is subject to inflation.
• Automated fitting may lead to over-fitting.
• It is affected by multicollinearity.
• Dummy variables are usually treated individually.
3 Principal Component Analysis
In practice, the term Principal Component Analysis (PCA) covers first a rotation, followed by step selection in the new coordinate system.
The rotation methods are well covered in linear algebra and used in a variety of fields. In physics they are referred to as eigenvectors and eigenvalues. In math, if something is well covered it means there are textbooks that can help us if we need them, as opposed to having to invent new infrastructure.
3.1 Principal Axis Analysis: Simple Example
Before discussing the rotation of data we will look at rotating fixed points. In junior high school we learn to sketch a graph of
\frac{x^2}{9} + \frac{y^2}{25} = 1
However, the function

5x^2 + 8xy + 5y^2 = 1
appears more complicated. Writing this in matrix form we get

5x^2 + 8xy + 5y^2 = \begin{pmatrix} x & y \end{pmatrix} \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = X^T A X
A is both symmetric and positive definite, hence we can decompose it:

A = S D S^T = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 9 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}
This gives us the information that our equation can be rewritten!
5x^2 + 8xy + 5y^2 = 1 \left( \frac{x - y}{\sqrt{2}} \right)^2 + 9 \left( \frac{x + y}{\sqrt{2}} \right)^2
And we can now see our function is a rotated ellipse .
Our two eigenvalues, 1 and 9, are the diagonal entries of the center matrix. Our two eigenvectors are the columns (rows) of the first (last) matrix. Notice that the length of each eigenvector is 1 and each eigenvector is orthogonal to the other.
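A quick numerical check of this decomposition, as a sketch; note that numpy returns the eigenvalues in ascending order.

```python
import numpy as np

A = np.array([[5.0, 4.0],
              [4.0, 5.0]])

# eigh is for symmetric matrices; it returns eigenvalues in ascending order
# and orthonormal eigenvectors as the columns of S
eigenvalues, S = np.linalg.eigh(A)
print(eigenvalues)                      # [1. 9.]
print(S)                                # columns proportional to (1,-1)/sqrt(2) and (1,1)/sqrt(2)
print(S @ np.diag(eigenvalues) @ S.T)   # reconstructs A
```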
3.2 Principal Component Analysis: Simple Example
For the last 250 days we see the closing prices of two stocks, Bank of New York (BK) and Bank of America (BAC). Below is a summary of the data.

Table 1: Stock Data Summary

        Min     Max     Mean    Variance
BK      36.96   62.07   47.38   47.82
BAC     29.77   48.54   38.16   28.82

Correlation: 96.6%
With the above information we can calculate a covariance matrix

\Sigma = \begin{pmatrix} 47.82 & 35.88 \\ 35.88 & 28.82 \end{pmatrix}

This matrix can be decomposed:

\Sigma = \begin{pmatrix} 0.79 & -0.61 \\ 0.61 & 0.79 \end{pmatrix} \begin{pmatrix} 75.44 & 0 \\ 0 & 1.21 \end{pmatrix} \begin{pmatrix} 0.79 & 0.61 \\ -0.61 & 0.79 \end{pmatrix}

Notice the sum of our two eigenvalues equals the sum of our original variances: 47.82 + 28.82 = 75.44 + 1.21 (up to rounding). Plotting our data as well as the two lines defined by our eigenvectors (0.79, 0.61) and (-0.61, 0.79) gives Figure 1.
Another way to look at this is that a portfolio that is long 0.79 shares of BK and long 0.61 shares of BAC will be uncorrelated with a portfolio that is short 0.61 shares of BK and long 0.79 shares of BAC.
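A sketch reproducing these numbers from the summary statistics above (using the variances 47.82 and 28.82 and the 96.6% correlation; the variable names are illustrative):

```python
import numpy as np

var_bk, var_bac, corr = 47.82, 28.82, 0.966
cov = corr * np.sqrt(var_bk * var_bac)
sigma = np.array([[var_bk, cov],
                  [cov, var_bac]])

eigenvalues, S = np.linalg.eigh(sigma)     # ascending: roughly 1.2 then 75.4
print(eigenvalues, eigenvalues.sum())      # their sum equals the total variance

# Returns along the two eigenvector "portfolios" are uncorrelated:
# the off-diagonal entries of S' Sigma S are (numerically) zero.
print(S.T @ sigma @ S)
```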
We can add a third stock, GameStop (GME), which has a much lower correlation with the banking sector. The covariance matrix becomes

\Sigma = \begin{pmatrix} \mathrm{Var}(BK) & \mathrm{Cov}(BK, BAC) & \mathrm{Cov}(BK, GME) \\ \mathrm{Cov}(BK, BAC) & \mathrm{Var}(BAC) & \mathrm{Cov}(BAC, GME) \\ \mathrm{Cov}(BK, GME) & \mathrm{Cov}(BAC, GME) & \mathrm{Var}(GME) \end{pmatrix}

where the BK and BAC entries are as before, Var(GME) = 52.94, and Cov(BAC, GME) = 11.52.
Decomposing this matrix we get

\Sigma = \begin{pmatrix} 0.69 & -0.39 & 0.61 \\ 0.53 & -0.31 & -0.79 \\ 0.50 & 0.87 & -0.00 \end{pmatrix} \begin{pmatrix} 86.50 & 0 & 0 \\ 0 & 41.88 & 0 \\ 0 & 0 & 1.20 \end{pmatrix} \begin{pmatrix} 0.69 & 0.53 & 0.50 \\ -0.39 & -0.31 & 0.87 \\ 0.61 & -0.79 & -0.00 \end{pmatrix}
Notice: the first eigenvector (most of the variance) is the broad market, the second eigenvector is the banks vs. GameStop, and the last eigenvector is the spread of BK to BAC. The second eigenvalue being roughly a third of the total variance gives some sense of how decoupled GME is from the banking sector.
3.3 Principal Component Analysis: Generalized
For our predictor variables we can create a covariance matrix. A covariance matrix, by definition, is symmetric and positive semi-definite. This means it can be decomposed into eigenvectors and eigenvalues. Our original predictor variables can be recombined, using the eigenvectors, into uncorrelated predictors. The eigenvalues tell us the amount of variance along these axes.
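A generic sketch of that recipe for an arbitrary predictor matrix: center the predictors, decompose their covariance, and rotate into uncorrelated components whose variances are the eigenvalues (function and variable names are illustrative).

```python
import numpy as np

def pca_components(X):
    """Rotate predictors X (n_samples x p) into uncorrelated components."""
    Xc = X - X.mean(axis=0)                    # center each predictor
    cov = np.cov(Xc, rowvar=False)             # p x p covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]      # sort by variance explained
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    Z = Xc @ eigenvectors                      # the principal components
    return Z, eigenvalues, eigenvectors

# The columns of Z are uncorrelated and the variance of Z[:, k] equals the k-th
# eigenvalue, so keeping only the first few columns is step selection in the
# rotated coordinate system.
```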
Figure 1: Two Stock Prices and Their Eigenvectors
4 Criteria for predictor selection: Not on Exam
For both PCA and step methods, each time another predictor is added our R^2 will improve. Hence we need some methodology to decide how many predictors should be added. Often this is more of an art than a science; however, some numerical criteria have been established.
4.1 Adjusted R2
The basic idea is some sort of punishment function for including too many factors in the model. One method, proposed by Mordecai Ezekiel, is

R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}

where p is the number of explanatory variables and n is the number of samples. Notice the fraction is the ratio of the degrees of freedom of the two sums of squares that appear in the regular R^2.
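A one-function sketch of this adjustment (the numbers in the example are illustrative):

```python
def adjusted_r_squared(r2, n, p):
    """Ezekiel's adjusted R^2 for n samples and p explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Adding a weak predictor raises R^2 slightly but can lower the adjusted value:
print(adjusted_r_squared(0.800, n=100, p=5))   # ~0.789
print(adjusted_r_squared(0.801, n=100, p=6))   # ~0.788
```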
4.2 The F-Statistic Criterion
This is the method illustrated in LR Lecture 2.
4.3 The Cp Criterion (Mallows' Cp)
A small value implies a precise model:

C_k = \frac{SSE_k}{\hat{\sigma}^2} - \left[ n - 2(k + 1) \right]

where SSE_k is the SSE based on a subset having k predictors and \hat{\sigma}^2 is the error variance estimated from the full model. For a subset with no important omitted variables we have

E[C_k] = k + 1
We are looking for a subset of k predictors such that
• C_k is small
• C_k \approx k + 1
Note: This is a special case of the Akaike Information Criterion that we will see later.
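A sketch of computing C_k for candidate subsets, assuming the error variance is estimated from the full model (helper names are illustrative):

```python
import numpy as np
from itertools import combinations

def sse(X, y):
    """Sum of squared errors of an OLS fit (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ coef
    return resid @ resid

def mallows_cp(X_full, y, subset):
    """C_k for the given tuple of predictor column indices."""
    n, p_full = X_full.shape
    sigma2_hat = sse(X_full, y) / (n - p_full - 1)   # error variance from the full model
    k = len(subset)
    return sse(X_full[:, list(subset)], y) / sigma2_hat - (n - 2 * (k + 1))

# Example usage: score every 2-predictor subset and keep those with C_k near k + 1.
# scores = {s: mallows_cp(X, y, s) for s in combinations(range(X.shape[1]), 2)}
```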
4.4 Akaike Information Criterion (AIC)
This is a more general technique, used often in time series.
5 Outliers and Influence Points
We all know that one outlier can greatly change our estimate of a slope. Therefore it is important to be able to determine which, if any, data points are outliers. With single-factor regression this is easy; with multivariate regression it is much harder. Luckily, PCA can help with this.
5.1 Mahalanobis Distance
In multi-dimensional (n) space there is more than one way to measure the distance between two points, x_p and x_d. The most common is the Euclidean distance:

D_E = \sqrt{ \left( x_p^{(1)} - x_d^{(1)} \right)^2 + \left( x_p^{(2)} - x_d^{(2)} \right)^2 + \cdots + \left( x_p^{(n)} - x_d^{(n)} \right)^2 }
That is not, however, the only distance. For example, there is the Manhattan distance:

D_{Mn} = \left| x_p^{(1)} - x_d^{(1)} \right| + \left| x_p^{(2)} - x_d^{(2)} \right| + \cdots + \left| x_p^{(n)} - x_d^{(n)} \right|
The idea of the Mahalanobis distance is to measure the distance between a point and a cloud of data. To accomplish this we find the eigenvectors and eigenvalues of the cloud and then measure the distance of the point to the center of the cloud as a Euclidean distance in z-scores. Using S as the covariance matrix of our cloud,

D_M = \sqrt{ (x_p - x_d)^T S^{-1} (x_p - x_d) }
P. C. Mahalanobis used this idea to measure a point's distance from the center of a distribution D with mean \mu. Mahalanobis's definition was prompted by the problem of identifying the similarity of skulls based on measurements in 1927, and it is commonly used in clustering methods.
This distance is zero for P at the mean of D and grows as P moves away from the mean along each principal component axis. If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to standard Euclidean distance in the transformed space. The Mahalanobis distance is thus unitless, scale-invariant, and takes into account the correlations of the data set.
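A minimal sketch computing this distance for points measured against a simulated data cloud (names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
cloud = rng.multivariate_normal([0, 0], [[4.0, 3.0], [3.0, 4.0]], size=500)

mu = cloud.mean(axis=0)                               # center of the cloud
S_inv = np.linalg.inv(np.cov(cloud, rowvar=False))    # inverse covariance of the cloud

def mahalanobis(x, mu, S_inv):
    """Distance of point x from the cloud center, in correlation-aware z-scores."""
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

# A point along the minor (low-variance) axis is "farther" than one equally far
# along the major (high-variance) axis, even though both are the same Euclidean
# distance from the center.
print(mahalanobis(np.array([2.0, 2.0]), mu, S_inv))   # along the major axis
print(mahalanobis(np.array([2.0, -2.0]), mu, S_inv))  # along the minor axis
```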
5.2 Hoteling’s T-Squared distribution
Hoteling was a professor at Columbia for 15 years and is credited with developing PCA . Hoteling was
able to describe the distsribution associated with a Mahalanbis distance . This allows for the tests of signficance .
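A sketch of the resulting one-sample test of whether a data cloud's mean equals a hypothesized value, using the standard rescaling of T^2 to an F statistic (the data and the hypothesized mean are illustrative):

```python
import numpy as np
from scipy import stats

def hotelling_t2_test(X, mu0):
    """One-sample Hotelling T^2 test of H0: the mean of X equals mu0."""
    n, p = X.shape
    diff = X.mean(axis=0) - mu0
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    t2 = n * diff @ S_inv @ diff                  # n times the squared Mahalanobis distance
    f_stat = (n - p) / (p * (n - 1)) * t2         # rescaled T^2 follows an F(p, n-p) distribution
    p_value = stats.f.sf(f_stat, p, n - p)
    return t2, p_value

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=100)
print(hotelling_t2_test(X, mu0=np.array([0.0, 0.0])))
```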
5.3 Useful YouTube Lectures
PCA involves rotations, so video is a nice way to present the structure. Here are three videos that I think are useful; they are from UC Santa Cruz: Video 1, Video 2, Video 3.