
Group project instructions

ECON 5079

Econometrics

Experiments with sparse regression models

We will be generating data from regression models in order to understand the properties of various estimators and procedures. Our basic framework requires generating p predictors in a matrix x and a target variable y as follows:

x_i ∼ N_p(0, S)                                                                 (1)
y_i = β_1 x_{i1} + ... + β_p x_{ip} + ε_i,    ε_i ∼ N(0, σ²)                    (2)

for i = 1, ..., n, where S_{jk} = ρ^{|j−k|} for some correlation level −1 ≤ ρ ≤ 1 and for elements j, k ∈ {1, ..., p}.
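For concreteness, here is a minimal MATLAB sketch of this DGP; all parameter values below are illustrative assumptions and should be varied in your experiments.

    % Sketch of the DGP in equations (1)-(2); parameter values are illustrative
    n      = 200;                          % sample size
    p      = 10;                           % number of predictors
    rho    = 0.5;                          % correlation level
    sigma2 = 1;                            % error variance
    beta   = [1; -2; 3; zeros(p-3,1)];     % example coefficient vector

    S = rho.^abs((1:p)' - (1:p));          % S(j,k) = rho^|j-k|
    x = randn(n,p) * chol(S);              % each row of x is a draw from N_p(0,S)
    y = x*beta + sqrt(sigma2)*randn(n,1);  % target variable, as in equation (2)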

1. Use the code Monte Carlo bias.m and conduct various experiments in order to demonstrate the effect of omitted variable bias on econometric estimates (see Appendix A for more details and guidance).

2. Write code that explores the opposite issue, i.e. what happens if we generate from a regression with three significant predictors but we estimate a regression with p − 3 (p > 3) additional predictors that are irrelevant? Which is more harmful for the regression: omitting an important predictor or including an irrelevant one? (Hint: be thorough and explore the effect of various choices of n, p, σ² and ρ.)

3. Variable selection for small p: Generate 10 predictors in x and perform an information-theoretic model averaging approach similar to Pesaran and Timmermann (1995, Journal of Finance) and Kapetanios, Labhard and Price (2008, Journal of Business & Economic Statistics). Write a short MATLAB code that scans through all 2^10 = 1,024 possible model specifications, estimates each one using OLS, and calculates a measure of fit of your preference (e.g. BIC, AIC, adjusted R², etc.). Find the model with the highest probability of being the “best” model. Notes on the procedure are in Appendix B.

4. Variable selection for large p: Use the lasso and elastic net to perform high-dimensional variable selection using 5-fold cross-validation. Set p large and explore cases where p ≫ n. Alongside the other choices (σ², ρ), explain in which cases the lasso/elastic net select the correct variables.
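Below is a minimal sketch of how the built-in lasso function (which you are allowed to use) can be applied to this task; the p ≫ n design and all other values are illustrative assumptions, and the same call with 'Alpha' strictly between 0 and 1 gives the elastic net.

    % Illustrative p >> n design with only 3 relevant predictors
    n = 50;  p = 200;  rho = 0.5;  sigma2 = 1;
    beta_true = [ones(3,1); zeros(p-3,1)];
    S = rho.^abs((1:p)' - (1:p));
    x = randn(n,p) * chol(S);
    y = x*beta_true + sqrt(sigma2)*randn(n,1);

    % Lasso with 5-fold cross-validation
    [B, FitInfo]   = lasso(x, y, 'CV', 5);
    sel_lasso      = find(B(:, FitInfo.Index1SE) ~= 0);    % selected variables

    % Elastic net: 'Alpha' between 0 and 1 mixes the L1 and L2 penalties
    [Be, FitInfoE] = lasso(x, y, 'CV', 5, 'Alpha', 0.5);
    sel_enet       = find(Be(:, FitInfoE.Index1SE) ~= 0);  % selected variables

You can then compare sel_lasso and sel_enet with the true set {1, 2, 3} as you vary n, p, σ² and ρ.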

HEALTH WARNINGS:

• I won’t accept a sloppy copy-paste of a million tables without structure and motivation. Your main task is to build a story and explain what works and what doesn’t, in a structured and thorough way. Your report should be scientific and evidence-based, not opinion- or intuition-based like a newspaper article or a blog piece.

• You should submit all your code in a clear and reproducible form. I won’t accept the use of built-in functions (other than the functions for lasso/elastic net).

• You can use MATLAB, Python or R. I can read other languages, but it will be harder for me to run your code and replicate things, so you are advised NOT to work in C++, Java, Stata etc.

References

[1] Kapetanios, G., Labhard, V. and Price, S. (2008) Forecasting Using Bayesian and Information-Theoretic Model Averaging, Journal of Business & Economic Statistics, 26(1), 33-41.

[2] Pesaran, M.H. and Timmermann, A. (1995). Predictability of Stock Returns: Robustness and Economic Significance. The Journal of Finance, 50, 1201-1228.

[3] Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267-288.

Appendices

A    Assessment of Omitted Variable Bias

Assume that the true DGP for our data y is

y_i = β_x x_i + β_z z_i + ε_i                                                   (A.1)

but we instead estimate

y_i = b_x x_i + ε_i                                                             (A.2)

How does omitting z_i from the model affect the least-squares (LS) estimate of b_x? Is the LS estimate b̂_x a reliable estimate of the true β_x?

Using matrix notation (y = xβ_x + zβ_z + ε), it follows that

b̂_x  =  (x'x)^{-1} x'y                                                          (A.3)
      =  (x'x)^{-1} x'(xβ_x + zβ_z + ε)                                         (A.4)
      =  β_x + β_z (x'x)^{-1} x'z + (x'x)^{-1} x'ε                              (A.5)

Thus, b̂_x is an unbiased estimate of β_x (i.e. E(b̂_x) = β_x) if

1. E(x_i ε_i) = 0;

2. β_z = 0 or E(x_i z_i) = 0.

In words,

1. The regressor x_i and error ε_i are uncorrelated (part of the usual assumptions in the linear regression model);

2. z_i is not relevant for y_i, or x_i and z_i are uncorrelated.

In practice (and especially for economic data) x_i and z_i will be highly correlated, meaning that omitted variable bias (OVB) can become a serious threat to regression analysis (especially when we omit many z variables from our regression). Your task is to use the provided code to illustrate as accurately as you can how serious this can be in different scenarios.

The code MONTE CARLO bias.m does two simple things:

1. Generate data y from a regression model with p, possibly correlated, predictors X. The coefficients β and σ² are known to us (i.e. we select their values). That way, we generate a finite sample of n values of y and X, but we know what process generated these data, which will allow us to assess how close to the “truth” (i.e. the values of β and σ² we selected) various estimators are.

2. Using the generated data y and X, it solves a simple OLS estimation problem providing two estimators: one where all p predictors are used (called beta OLS in the code), and one where we only use the first predictor in X as an example of committing omitted variable bias (the vector beta OLS omitbias in the code).

First play around with the code to get a feeling for what it does and what results it produces. Next, try to devise different scenarios of omitted variable bias. Try to see what happens for different values of the DGP parameters (see code): n, rho, p, sigma2, beta. When is the bias substantial and when is it less of a concern?
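To fix ideas, here is a minimal sketch of this type of Monte Carlo experiment; it is not the provided code, and all settings are illustrative (the variable names simply mirror those mentioned above).

    nMC = 1000;  n = 100;  p = 2;  rho = 0.8;  sigma2 = 1;
    beta = [1; 1];                        % true coefficients on x and z
    S    = rho.^abs((1:p)' - (1:p));
    beta_OLS          = zeros(nMC, p);    % estimates using all predictors
    beta_OLS_omitbias = zeros(nMC, 1);    % estimates omitting the 2nd predictor

    for m = 1:nMC
        X = randn(n,p) * chol(S);
        y = X*beta + sqrt(sigma2)*randn(n,1);
        beta_OLS(m,:)        = ((X'*X)\(X'*y))';   % OLS with all predictors
        x1 = X(:,1);
        beta_OLS_omitbias(m) = (x1'*x1)\(x1'*y);   % OLS omitting z = X(:,2)
    end

    fprintf('Mean full-model estimate of beta_x:    %.3f\n', mean(beta_OLS(:,1)));
    fprintf('Mean omitted-variable estimate of b_x: %.3f\n', mean(beta_OLS_omitbias));

The gap between the second average and the true value of 1 is the omitted variable bias; it grows with ρ and β_z, and the same loop extends naturally to task 2 (adding irrelevant predictors instead of omitting relevant ones).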

B    Information-Theoretic Model Selection and Averaging

Pesaran and Timmermann (1995, Journal of Finance) consider the following stock return prediction model

ρ_t = βX_t + ε_t,                                                               (B.1)

where ρ_t are stock returns in excess of the risk-free rate, and X_t are the following available predictors

X_t = [YSP_{t-1}, EP_{t-1}, I1_{t-1}, I1_{t-2}, I12_{t-1}, I12_{t-2}, ..., Π_{t-2}, IP_{t-2}, M_{t-2}]

• Namely: dividend yield, earnings-price ratio, 1-month T-bill rate, 12-month T-bond rate, inflation, industrial production, M0

• Some variables only appear with a second lag (e.g. ΔIP_{t-2}) because of publication lags from the relevant statistical offices

• Consider all possible model combinations based on these 9 variables

• A variable is either included (1) or excluded (0) from the regression ⇒ leading to 2^9 = 512 possible models

• Denote model M_i, i = 1, ..., 512 as

  ρ_t = βX_t^{(i)} + ε_t,                                                       (B.2)

  where X_t^{(i)} has the predictors of model i

• Estimate all models and then store BIC_i and R_i^2

• Pesaran and Timmermann actually use economic criteria (the Sharpe ratio) to select the optimal model

• With modern PCs one can easily enumerate deterministically and estimate all possible model combinations when facing 30-40 predictors

• With more than 40 predictors it is computationally infeasible to estimate all possible regression models, but stochastic algorithms exist that find the most probable models – we will see such algorithms during the lectures on Bayesian inference

• However, when forecasting stock returns or exchange rates (or inflation, as we will see next), predictors are unstable: some variables forecast well in some periods, others do not

There is a way to reduce the risk associated with selecting a single model

• This procedure is called model averaging

Consider the case of two variables, i.e. 2^2 = 4 models

ρ_t = β_0 + ε_t,                                                                (B.3)
ρ_t = β_0 + β_1 X_{1,t} + ε_t,                                                  (B.4)
ρ_t = β_0 + β_2 X_{2,t} + ε_t,                                                  (B.5)
ρ_t = β_0 + β_1 X_{1,t} + β_2 X_{2,t} + ε_t,                                    (B.6)

and their associated 4 BIC values: BIC_1, BIC_2, BIC_3, BIC_4. Kapetanios, Labhard and Price (2008, Forecasting Using Bayesian and Information-Theoretic Model Averaging, Journal of Business & Economic Statistics) show that we can convert these into model probabilities:

π_{M_i} = exp(−0.5(BIC_i − min(BIC))) / Σ_j exp(−0.5(BIC_j − min(BIC)))         (B.7)

where notice that (for numerical stability, i.e. in order to avoid overflow/underflow) we subtract from each BIC value the minimum value attained by the BIC over all models. We can now use these model probabilities to construct probabilities for each variable of interest, i.e. variable-specific probabilities:

• X_{1,t} has probability equal to ω_{X_1} = π_{M_2} + π_{M_4}

• X_{2,t} has probability equal to ω_{X_2} = π_{M_3} + π_{M_4}

Such probabilities are also called probabilities of inclusion of each variable, not to be confused with p-values from independent t-tests.

The ideas above generalize to models with p predictors as long as the number of predictors is less than 30-40. For example, with p = 50 we have 2^50 models, which is a vast number. Even if it takes you 0.001 seconds to estimate a single model, you would need 36,000 years to estimate all 2^50 models!
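To make the procedure concrete, the sketch below (with illustrative settings and data generated as in equations (1)-(2)) enumerates all 2^p models for a small p, computes the BIC of each, converts the BICs into model probabilities as in (B.7), and then forms the inclusion probability of each variable.

    % Illustrative data for a small-p enumeration exercise
    p = 10;  n = 200;  rho = 0.5;  sigma2 = 1;
    beta = [1; -2; 3; zeros(p-3,1)];
    S = rho.^abs((1:p)' - (1:p));
    X = randn(n,p) * chol(S);
    y = X*beta + sqrt(sigma2)*randn(n,1);

    nModels = 2^p;
    BIC = zeros(nModels,1);
    inc = false(nModels,p);                     % inclusion indicators for each model
    for i = 1:nModels
        inc(i,:) = bitget(i-1, 1:p) == 1;       % binary representation of model i
        Xi = [ones(n,1) X(:, inc(i,:))];        % intercept + included predictors
        b  = (Xi'*Xi)\(Xi'*y);                  % OLS
        e  = y - Xi*b;
        BIC(i) = n*log(e'*e/n) + size(Xi,2)*log(n);   % one common form of the BIC
    end

    % Model probabilities as in (B.7), then variable inclusion probabilities
    w     = exp(-0.5*(BIC - min(BIC)));
    piM   = w / sum(w);
    omega = piM' * double(inc);                 % 1 x p vector of inclusion probabilities
    [~, best] = max(piM);                       % single most probable model
    disp('Predictors in the most probable model:');  disp(find(inc(best,:)));
    disp('Inclusion probabilities for each predictor:');  disp(omega);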