Econ6083

SELECTING REGRESSORS USING THE BAYESIAN INFORMATION CRITERION (BIC)

1. SELECTING REGRESSORS

We discuss the problem of selecting relevant regressors in the context of the linear regression model. The procedures discussed here can, however, be generalized and applied similarly to nonlinear models such as logit. Consider a linear regression model with $k$ potential regressors:

\[
Y_i = \sum_{j=1}^{k} \beta_j X_{i,j} + U_i, \qquad E X_{i,j} U_i = 0, \quad j = 1, \dots, k. \tag{1.1}
\]

For now we assume that the number of potential regressors is small: $k$ is fixed and does not depend on $n$.

Let the set $A$ denote the list of regressors with non-zero coefficients:

\[
A = \{ j : \beta_j \neq 0 \}.
\]

For example, $A = \{1, 3, 7\}$ implies that only the regressors $X_{i,1}$, $X_{i,3}$, and $X_{i,7}$ have non-zero coefficients, and that the remaining regressors have coefficients equal to zero. We use $A_0$ to denote the true set of relevant regressors, i.e. the true data generating process (DGP) for $Y_i$ only includes the regressors in $A_0$:

\[
Y_i = \sum_{j \in A_0} \beta_j X_{i,j} + U_i.
\]

Our goal is to estimate $A_0$ using the data $\{(Y_i, X_i')' : i = 1, \dots, n\}$. We use $\hat{A}_n$ to denote an estimated set of relevant regressors produced by a selection procedure. We say that the selection procedure is consistent if

\[
P\left( \hat{A}_n = A_0 \right) \to 1 \tag{1.2}
\]

as $n \to \infty$.

Let $\beta = (\beta_1, \dots, \beta_k)'$, and let $\beta_A$ denote the subvector of $\beta$ that includes only the coefficients in $A$:

\[
\beta_A = (\beta_j : j \in A).
\]

We use $|A|$ to denote the number of elements in $A$, and hence $\beta_A$ is an $|A|$-subvector of the $k$-vector $\beta$.

Suppose a procedure produces a set of selected regressors $\hat{A}_n$ and a vector of estimates $\hat\beta_n = (\hat\beta_{n,1}, \dots, \hat\beta_{n,k})'$. Note that it is reasonable to set $\hat\beta_{n,j} = 0$ for $j \notin \hat{A}_n$. We say that the procedure is oracle if, in addition to the consistency property in (1.2),

\[
\sqrt{n} \left( \hat\beta_{n,A_0} - \beta_{A_0} \right) \to_d N(0, V(A_0)),
\]

where $\hat\beta_{n,A_0}$ is the subvector of $\hat\beta_n$ corresponding to $A_0$, and $V(A_0)$ is the best asymptotic variance one can obtain when the true model $A_0$ is known. The oracle property means that not only does the econometrician consistently select the true regressors, but the coefficients on the relevant regressors are also estimated as precisely as when the set of true relevant regressors in the DGP is known.

2. BIC

Recall that if the econometrician tries to select the regressors by minimizing the sample sum of squared residuals (SSR), or equivalently maximizing $R^2$, the procedure would result in overfitting: the SSR is monotone non-increasing in the number of included regressors. The idea behind BIC is to penalize the SSR for model complexity.

Let $X_i = (X_{i,1}, \dots, X_{i,k})'$, and define $X_{i,A}$ as the subvector of $X_i$ that includes only the regressors in $A$:

\[
X_{i,A} = (X_{i,j} : j \in A).
\]

Again, $X_{i,A}$ is an $|A|$-subvector of the $k$-vector $X_i$. The true DGP can now be written as

\[
Y_i = \sum_{j \in A_0} \beta_j X_{i,j} + U_i = X_{i,A_0}' \beta_{A_0} + U_i.
\]

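Let $\hat\beta_{n,A}(A)$ denote the OLS estimator of $\beta_A$ that only uses the regressors in $A$:

\[
\hat\beta_{n,A}(A) = \left( \sum_{i=1}^{n} X_{i,A} X_{i,A}' \right)^{-1} \sum_{i=1}^{n} X_{i,A} Y_i.
\]

We can set

\[
\hat\beta_{n,A^c}(A) = 0,
\]

and view $\hat\beta_n(A) = (\hat\beta_{n,A}(A)', \hat\beta_{n,A^c}(A)')'$ as the estimator of $\beta = (\beta_A', \beta_{A^c}')'$ under the model $A$. The corresponding SSR is given by

\[
SSR_n(A) = \sum_{i=1}^{n} \left( Y_i - X_{i,A}' \hat\beta_{n,A}(A) \right)^2.
\]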

The complexity of the model $A$ can be measured by the number of included regressors, i.e. the number of elements in $A$. BIC for the model $A$ is defined as

\[
BIC_n(A) = SSR_n(A) + |A| \log n,
\]

where the second term is a penalty term. Note that a model with more included regressors receives a larger penalty. A BIC-based selection procedure selects the regressors by minimizing BIC across all possible models:

\[
\hat{A}_n(BIC) = \arg\min_{A} BIC_n(A).
\]

We show below that BIC selects the relevant regressors consistently.
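The minimization over all possible models can be sketched in Python as an exhaustive search. This is a minimal illustration, not part of the original notes: it uses the criterion exactly as defined here (SSR plus the number of included regressors times log n), the function names `ssr` and `select_by_bic` are hypothetical, and the simulated DGP (five standard normal regressors, unit error variance, relevant columns 0 and 2 in zero-based indexing) is an assumption chosen for the example.

```python
import itertools
import numpy as np

def ssr(y, X, cols):
    """Sum of squared OLS residuals using only the columns in `cols`."""
    if len(cols) == 0:
        return float(y @ y)          # empty model: residual is y itself
    XA = X[:, list(cols)]
    beta_hat, *_ = np.linalg.lstsq(XA, y, rcond=None)
    resid = y - XA @ beta_hat
    return float(resid @ resid)

def select_by_bic(y, X):
    """Return the column subset minimizing SSR + |A| * log(n), and its BIC."""
    n, k = X.shape
    best_set, best_bic = (), np.inf
    for size in range(k + 1):
        for cols in itertools.combinations(range(k), size):
            bic = ssr(y, X, cols) + len(cols) * np.log(n)
            if bic < best_bic:
                best_set, best_bic = cols, bic
    return set(best_set), best_bic

# Assumed illustrative DGP (not from the notes): k = 5, true set {0, 2}.
rng = np.random.default_rng(0)
n, k = 2000, 5
X = rng.standard_normal((n, k))
beta = np.array([1.0, 0.0, -1.0, 0.0, 0.0])
y = X @ beta + rng.standard_normal(n)

selected, bic_value = select_by_bic(y, X)
print(selected, bic_value)
```

The search visits all $2^k$ subsets, so it is only practical for small $k$. Because the criterion adds an unscaled penalty to the SSR, its behaviour depends on the scale of the errors; the example fixes the error variance at one.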

Proposition 2.1. Suppose that data are iid, $E X_i X_i'$ and $E U_i^2 X_i X_i'$ are finite and positive definite, and $E U_i^2 < \infty$. Then $P\left( \hat{A}_n(BIC) = A_0 \right) \to 1$ as $n \to \infty$.

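Proof. It suffices to show that for all $A \neq A_0$,

\[
P\left( BIC_n(A) > BIC_n(A_0) \right) \to 1, \tag{2.1}
\]

i.e. the true model $A_0$ minimizes BIC with probability approaching one. This is sufficient because $k$ is fixed, so there are only finitely many candidate models $A$ to compare against $A_0$.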

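First, consider the average SSR for the true model:

\[
\begin{aligned}
n^{-1} SSR_n(A_0) &= n^{-1} \sum_{i=1}^{n} \left( Y_i - X_{i,A_0}' \hat\beta_{n,A_0}(A_0) \right)^2 \\
&= n^{-1} \sum_{i=1}^{n} \left( U_i - X_{i,A_0}' \left( \hat\beta_{n,A_0}(A_0) - \beta_{A_0} \right) \right)^2 \\
&= n^{-1} \sum_{i=1}^{n} U_i^2 + \left( \hat\beta_{n,A_0}(A_0) - \beta_{A_0} \right)' \left( n^{-1} \sum_{i=1}^{n} X_{i,A_0} X_{i,A_0}' \right) \left( \hat\beta_{n,A_0}(A_0) - \beta_{A_0} \right) \\
&\quad - 2 \left( n^{-1} \sum_{i=1}^{n} X_{i,A_0} U_i \right)' \left( \hat\beta_{n,A_0}(A_0) - \beta_{A_0} \right) \\
&= E U_i^2 + o_p(1),
\end{aligned}
\]

where the $o_p(1)$ term in the last line follows from the LLN and the consistency of the OLS estimator under the true model:

\[
\begin{aligned}
n^{-1} \sum_{i=1}^{n} U_i^2 &= E U_i^2 + o_p(1), \\
\hat\beta_{n,A_0}(A_0) &= \beta_{A_0} + o_p(1), \\
n^{-1} \sum_{i=1}^{n} X_{i,A_0} X_{i,A_0}' &= E X_{i,A_0} X_{i,A_0}' + o_p(1), \\
n^{-1} \sum_{i=1}^{n} X_{i,A_0} U_i &= o_p(1).
\end{aligned}
\]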

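Suppose that a model $A$ omits some relevant regressors:

\[
(A \cap A_0) \neq A_0.
\]

Since the OLS estimator is in general inconsistent when there are omitted relevant regressors,

\[
\hat\beta_n(A) - \beta \to_p \delta \neq 0,
\]

where $\hat\beta_{n,j}(A)$ is the corresponding element of $\hat\beta_{n,A}(A)$ for $j \in A$, and $\hat\beta_{n,j}(A) = 0$ for $j \notin A$. Note that $\delta \neq 0$ in any case, because $\hat\beta_{n,j}(A) = 0$ while $\beta_j \neq 0$ for every omitted relevant regressor $j$.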

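We have:

\[
\begin{aligned}
n^{-1} SSR_n(A) &= n^{-1} \sum_{i=1}^{n} \left( Y_i - X_i' \hat\beta_n(A) \right)^2 \\
&= n^{-1} \sum_{i=1}^{n} \left( U_i - X_i' \left( \hat\beta_n(A) - \beta \right) \right)^2 \\
&= n^{-1} \sum_{i=1}^{n} U_i^2 + \left( \hat\beta_n(A) - \beta \right)' \left( n^{-1} \sum_{i=1}^{n} X_i X_i' \right) \left( \hat\beta_n(A) - \beta \right) \\
&\quad - 2 \left( n^{-1} \sum_{i=1}^{n} X_i U_i \right)' \left( \hat\beta_n(A) - \beta \right) \\
&= E U_i^2 + \delta' E X_i X_i' \delta + o_p(1).
\end{aligned}
\]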

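Note also that

\[
\frac{|A| \log n}{n} = o(1).
\]

Therefore, for such a model $A$,

\[
\begin{aligned}
P\left( BIC_n(A) > BIC_n(A_0) \right) &= P\left( n^{-1} BIC_n(A) > n^{-1} BIC_n(A_0) \right) \\
&= P\left( n^{-1} SSR_n(A) + \frac{|A| \log n}{n} > n^{-1} SSR_n(A_0) + \frac{|A_0| \log n}{n} \right) \\
&= P\left( \delta' E X_i X_i' \delta + o_p(1) + o(1) > 0 \right) \\
&\to 1,
\end{aligned}
\]

where convergence in the last line holds because $\delta \neq 0$ and $E X_i X_i'$ is positive definite.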

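Next, consider a model $A$ such that

\[
A_0 \subset A.
\]

In this case, $A$ contains all the relevant regressors as well as some irrelevant ones. The OLS estimator $\hat\beta_{n,A}(A)$ is consistent and asymptotically normal:

\[
n^{1/2} \left( \hat\beta_{n,A}(A) - \beta_A \right) \to_d \Psi_A,
\]

where

\[
\Psi_A \sim N(0, V(A)), \qquad V(A) = \sigma^2 \left( E X_{i,A} X_{i,A}' \right)^{-1}.
\]

The result follows from

\[
n^{-1/2} \sum_{i=1}^{n} X_{i,A} U_i \to_d \Phi_A,
\]

where

\[
\Phi_A \sim N\left( 0, \sigma^2 E X_{i,A} X_{i,A}' \right).
\]

Here $\sigma^2$ denotes the error variance; the expressions for $V(A)$ and for the variance of $\Phi_A$ are written in their homoskedastic form, i.e. they use $E U_i^2 X_{i,A} X_{i,A}' = \sigma^2 E X_{i,A} X_{i,A}'$.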

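We have:

\[
\begin{aligned}
SSR_n(A) - \sum_{i=1}^{n} U_i^2 &= \sum_{i=1}^{n} \left( U_i - X_{i,A}' \left( \hat\beta_{n,A}(A) - \beta_A \right) \right)^2 - \sum_{i=1}^{n} U_i^2 \\
&= n^{1/2} \left( \hat\beta_{n,A}(A) - \beta_A \right)' \left( n^{-1} \sum_{i=1}^{n} X_{i,A} X_{i,A}' \right) n^{1/2} \left( \hat\beta_{n,A}(A) - \beta_A \right) \\
&\quad - 2 \left( n^{-1/2} \sum_{i=1}^{n} X_{i,A} U_i \right)' n^{1/2} \left( \hat\beta_{n,A}(A) - \beta_A \right) \\
&\to_d \Psi_A' \left( E X_{i,A} X_{i,A}' \right) \Psi_A - 2 \Phi_A' \Psi_A \\
&= O_p(1).
\end{aligned}
\]

By the same arguments,

\[
\begin{aligned}
SSR_n(A_0) - \sum_{i=1}^{n} U_i^2 &\to_d \Psi_{A_0}' \left( E X_{i,A_0} X_{i,A_0}' \right) \Psi_{A_0} - 2 \Phi_{A_0}' \Psi_{A_0} \\
&= O_p(1).
\end{aligned}
\]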

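Lastly, when $A_0 \subset A$,

\[
\begin{aligned}
P\left( BIC_n(A) > BIC_n(A_0) \right) &= P\left( SSR_n(A) - SSR_n(A_0) > (|A_0| - |A|) \log n \right) \\
&= P\left( O_p(1) > (|A_0| - |A|) \log n \right) \\
&\to 1,
\end{aligned}
\]

where convergence in the last line holds since $|A_0| < |A|$, and therefore

\[
(|A_0| - |A|) \log n \to -\infty.
\]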

□

3. POST BIC INFERENCE

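Suppose the econometrician selects a model using $\hat{A}_n(BIC)$ and conducts inference using $\hat\beta_n(\hat{A}_n(BIC))$. For $j \in \hat{A}_n(BIC)$, the distribution of the estimator of the $j$-th coefficient satisfies

\[
\begin{aligned}
& P\left( n^{1/2} \left( \hat\beta_{n,j}(\hat{A}_n(BIC)) - \beta_j \right) \le u \right) \\
&= P\left( n^{1/2} \left( \hat\beta_{n,j}(\hat{A}_n(BIC)) - \beta_j \right) \le u, \ \hat{A}_n(BIC) = A_0 \right) \\
&\quad + P\left( n^{1/2} \left( \hat\beta_{n,j}(\hat{A}_n(BIC)) - \beta_j \right) \le u, \ \hat{A}_n(BIC) \neq A_0 \right) \\
&= P\left( n^{1/2} \left( \hat\beta_{n,j}(\hat{A}_n(BIC)) - \beta_j \right) \le u, \ \hat{A}_n(BIC) = A_0 \right) + o(1) \\
&= P\left( n^{1/2} \left( \hat\beta_{n,j}(\hat{A}_n(BIC)) - \beta_j \right) \le u \ \middle| \ \hat{A}_n(BIC) = A_0 \right) P\left( \hat{A}_n(BIC) = A_0 \right) + o(1) \\
&= P\left( n^{1/2} \left( \hat\beta_{n,j}(A_0) - \beta_j \right) \le u \right) (1 + o(1)) + o(1) \\
&= P\left( n^{1/2} \left( \hat\beta_{n,j}(A_0) - \beta_j \right) \le u \right) + o(1),
\end{aligned}
\]

where the second equality holds by

\[
P\left( n^{1/2} \left( \hat\beta_{n,j}(\hat{A}_n(BIC)) - \beta_j \right) \le u, \ \hat{A}_n(BIC) \neq A_0 \right) \le P\left( \hat{A}_n(BIC) \neq A_0 \right) = o(1).
\]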

Hence, the BIC-based selection and estimation procedure is an oracle procedure.

4. AKAIKE INFORMATION CRITERION (AIC)

AIC is another popular criterion for model selection (and actually precedes BIC). AIC for a model $A$ is defined as

\[
AIC_n(A) = SSR_n(A) + 2|A|.
\]

In comparison with BIC, AIC penalizes the model complexity less heavily and, therefore, tends to select a bigger model with more regressors than BIC.

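By the same arguments as in the proof of Proposition 2.1, for a model $A$ that omits some relevant regressors, i.e. $(A \cap A_0) \neq A_0$,

\[
P\left( AIC_n(A) > AIC_n(A_0) \right) \to 1.
\]

However, because the AIC penalty is not sufficiently strong, if $A_0 \subset A$,

\[
P\left( AIC_n(A) > AIC_n(A_0) \right) \not\to 1.
\]

Indeed, $SSR_n(A) - SSR_n(A_0) = O_p(1)$ as before, but the penalty difference $2(|A_0| - |A|)$ stays bounded instead of diverging to $-\infty$.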

Hence, while AIC detects omitted regressors with probability approaching one, it is more likely than BIC to overfit by also including some irrelevant regressors.
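The contrast between the two penalties can be illustrated with a small Monte Carlo sketch. Everything in it is an assumption made for illustration rather than part of the notes: the helper names `ssr`, `select`, and `hit_rates`, the DGP (four standard normal regressors, only column 0 relevant, unit error variance), and the choice of 200 replications with n = 1000. It records how often each criterion picks exactly the true set.

```python
import itertools
import numpy as np

def ssr(y, X, cols):
    """Sum of squared OLS residuals using only the columns in `cols`."""
    if not cols:
        return float(y @ y)
    XA = X[:, list(cols)]
    b, *_ = np.linalg.lstsq(XA, y, rcond=None)
    r = y - XA @ b
    return float(r @ r)

def select(y, X, penalty):
    """Minimize SSR + penalty * |A| over all subsets of columns."""
    k = X.shape[1]
    models = [c for s in range(k + 1)
              for c in itertools.combinations(range(k), s)]
    return set(min(models, key=lambda c: ssr(y, X, c) + penalty * len(c)))

def hit_rates(n, reps, rng):
    """Share of replications in which AIC and BIC select exactly {0}."""
    true_set = {0}
    beta = np.array([1.0, 0.0, 0.0, 0.0])
    aic_hits = bic_hits = 0
    for _ in range(reps):
        X = rng.standard_normal((n, 4))
        y = X @ beta + rng.standard_normal(n)
        aic_hits += select(y, X, 2.0) == true_set        # AIC penalty: 2|A|
        bic_hits += select(y, X, np.log(n)) == true_set  # BIC penalty: |A| log n
    return aic_hits / reps, bic_hits / reps

rng = np.random.default_rng(1)
aic_rate, bic_rate = hit_rates(n=1000, reps=200, rng=rng)
print(f"AIC hit rate: {aic_rate:.2f}, BIC hit rate: {bic_rate:.2f}")
```

In this design neither criterion should drop the relevant regressor, so any miss comes from including irrelevant columns. Increasing `n` should push the BIC hit rate towards one while leaving the AIC hit rate roughly unchanged, in line with the argument above.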

5. LIMITATIONS

One should note several limitations of our arguments. First, we assumed that $k$ is small (fixed), and some of our arguments do not apply when the number of potential regressors is comparable to the sample size. However, this technical issue can be resolved with somewhat different arguments.

More importantly, our analysis ignores the situation where some $\beta_j$, while non-zero, are very close to zero. It is unreasonable to expect that the BIC (or any other procedure) can detect such small coefficients with a probability approaching one. Thus, even in the limit, regressors with very small coefficients are likely to be omitted from the model, which can potentially create an omitted variable bias. This shortcoming can be addressed using a double selection procedure, which will be discussed later in the context of Lasso.

Lastly, the BIC procedure may be infeasible in practice if the number of potential regressors is very large. There are $2^k$ possible models $A$, so if $k = 30$ one has to run and compare over 1 billion potential regressions, and for $k = 40$ over 1 trillion. A large $k$ arises easily. For example, suppose that the econometrician considers flexible specifications that include quadratic terms as well as pairwise interaction terms of the right-hand side variables. In that case, 10 potential right-hand side variables generate 65 potential regressors: 10 linear terms, 10 squares, and 45 pairwise interactions.
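The counts quoted in this paragraph can be checked with a few lines of Python. The function names below are hypothetical helpers introduced only for this check.

```python
from math import comb

def n_models(k):
    """Number of subsets of k potential regressors."""
    return 2 ** k

def n_flexible_regressors(p):
    """p linear terms, p squared terms, and C(p, 2) pairwise interactions."""
    return p + p + comb(p, 2)

print(n_models(30))               # 1073741824 (over 1 billion)
print(n_models(40))               # 1099511627776 (over 1 trillion)
print(n_flexible_regressors(10))  # 65
```

With 65 potential regressors, an exhaustive search would face `n_models(65)` candidate models, which is far beyond what can be enumerated.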

APPENDIX A. LAW OF LARGE NUMBERS (LLN)

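We say that $\hat\theta_n$ converges in probability to $\theta$ if for all $\epsilon > 0$,

\[
\lim_{n \to \infty} P\left( |\hat\theta_n - \theta| > \epsilon \right) = 0.
\]

Convergence in probability implies that the probability of $\hat\theta_n$ deviating from $\theta$ by more than any given amount $\epsilon > 0$ becomes negligible as $n \to \infty$. We use the notation

\[
\hat\theta_n - \theta \to_p 0
\]

and

\[
\hat\theta_n - \theta = o_p(1).
\]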

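The main device for establishing convergence in probability is the law of large numbers. Let $X_1, \dots, X_n$ be uncorrelated random variables with $E X_i = \mu$ and $Var(X_i) = \sigma^2$, and consider the average

\[
\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i.
\]

Note that

\[
E \bar{X}_n = \mu,
\]

\[
Var(\bar{X}_n) = \frac{\sigma^2}{n}. \tag{A.1}
\]

Hence, as $n \to \infty$ the distribution of the average $\bar{X}_n$ becomes concentrated around the mean $\mu$. More formally, by Markov's inequality,

\[
P\left( |\bar{X}_n - \mu| > \epsilon \right) \le \frac{E |\bar{X}_n - \mu|^2}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \to 0. \tag{A.2}
\]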

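To show (A.1), which also implies the equality in (A.2),

\[
\begin{aligned}
Var(\bar{X}_n) &= Var\left( n^{-1} \sum_{i=1}^{n} X_i \right) \\
&= n^{-2} Var\left( \sum_{i=1}^{n} X_i \right) \\
&= n^{-2} \left( \sum_{i=1}^{n} Var(X_i) + \sum_{i=1}^{n} \sum_{j \neq i} Cov(X_i, X_j) \right) \\
&= n^{-2} \sum_{i=1}^{n} Var(X_i) \\
&= n^{-1} \sigma^2,
\end{aligned}
\]

where the equality in the fourth line holds because we assume that $Cov(X_i, X_j) = 0$ for $i \neq j$, and the equality in the last line holds because $Var(X_i) = \sigma^2$.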

In the case of iid data, the following result can be used. Let $X_1, \dots, X_n$ be iid random variables such that $E|X_i| < \infty$. Then

\[
n^{-1} \sum_{i=1}^{n} X_i \to_p E X_i,
\]

or equivalently

\[
\bar{X}_n = E X_i + o_p(1).
\]
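The variance formula for the sample average and the resulting concentration around the mean can be illustrated by simulation. The sketch below is not from the notes: the helper name `sample_mean_stats`, the normal distribution with mean 1 and standard deviation 2, the tolerance 0.5, and the 5000 replications are all assumptions chosen for the example.

```python
import numpy as np

def sample_mean_stats(n, reps, rng, mu=1.0, sigma=2.0, eps=0.5):
    """Simulate `reps` sample means of size n. Return their variance and
    the share of means deviating from mu by more than eps."""
    draws = rng.normal(mu, sigma, size=(reps, n))
    means = draws.mean(axis=1)
    return float(means.var()), float(np.mean(np.abs(means - mu) > eps))

rng = np.random.default_rng(2)
results = {}
for n in (10, 100, 1000):
    var_hat, tail_share = sample_mean_stats(n, reps=5000, rng=rng)
    results[n] = (var_hat, tail_share)
    # Columns: n, simulated variance of the mean, sigma^2 / n, tail share.
    print(n, round(var_hat, 4), 2.0 ** 2 / n, tail_share)
```

The simulated variance should track $\sigma^2 / n$, and the share of sample means further than the tolerance from $\mu$ should shrink as $n$ grows.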

APPENDIX B. CONSISTENCY OF OLS

We say that the OLS estimator $\hat\beta$ is consistent for the true $\beta$ if

\[
\hat\beta = (X'X)^{-1} X'Y = \left( n^{-1} \sum_{i=1}^{n} X_i X_i' \right)^{-1} n^{-1} \sum_{i=1}^{n} X_i Y_i \to_p \beta.
\]

Proposition B.1. Suppose that data $\{(Y_i, X_i) : i = 1, \dots, n\}$ are iid,

\[
Y_i = X_i' \beta + U_i, \qquad E U_i X_i = 0, \tag{B.1}
\]

and

\[
E X_i X_i' \text{ is finite and positive definite.} \tag{B.2}
\]

Then,

\[
\hat\beta \to_p \beta.
\]

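Proof. Write

\[
\begin{aligned}
\hat\beta &= \left( n^{-1} \sum_{i=1}^{n} X_i X_i' \right)^{-1} n^{-1} \sum_{i=1}^{n} X_i Y_i \\
&= \beta + \left( n^{-1} \sum_{i=1}^{n} X_i X_i' \right)^{-1} n^{-1} \sum_{i=1}^{n} X_i U_i \\
&= \beta + \left( E X_i X_i' + o_p(1) \right)^{-1} \left( E X_i U_i + o_p(1) \right) \\
&\to_p \beta + \left( E X_i X_i' \right)^{-1} \cdot 0 \\
&= \beta.
\end{aligned}
\]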

□

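The OLS estimator is inconsistent when

\[
E X_i U_i \neq 0.
\]

In this case,

\[
\begin{aligned}
\hat\beta &= \left( n^{-1} \sum_{i=1}^{n} X_i X_i' \right)^{-1} n^{-1} \sum_{i=1}^{n} X_i Y_i \\
&= \beta + \left( n^{-1} \sum_{i=1}^{n} X_i X_i' \right)^{-1} n^{-1} \sum_{i=1}^{n} X_i U_i \\
&\to_p \beta + \left( E X_i X_i' \right)^{-1} E X_i U_i \\
&\neq \beta,
\end{aligned}
\]

where

\[
\left( E X_i X_i' \right)^{-1} E X_i U_i \neq 0
\]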

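can be viewed as asymptotic bias. For example, suppose the true model is given by

\[
Y_i = X_{i,1}' \beta_1 + X_{i,2}' \beta_2 + \epsilon_i, \qquad E X_{i,1} \epsilon_i = 0, \qquad E X_{i,2} \epsilon_i = 0,
\]

but the econometrician omits $X_{i,2}$ from the model:

\[
Y_i = X_{i,1}' \beta_1 + U_i, \qquad U_i = X_{i,2}' \beta_2 + \epsilon_i.
\]

Then, in general,

\[
E X_{i,1} U_i = E X_{i,1} X_{i,2}' \beta_2 \neq 0,
\]

and

\[
\tilde\beta_1 = \left( n^{-1} \sum_{i=1}^{n} X_{i,1} X_{i,1}' \right)^{-1} n^{-1} \sum_{i=1}^{n} X_{i,1} Y_i \to_p \beta_1 + \left( E X_{i,1} X_{i,1}' \right)^{-1} E X_{i,1} X_{i,2}' \beta_2.
\]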
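The omitted variable formula can be checked numerically in the scalar case. The sketch below rests on assumptions that are not in the notes: two scalar regressors with unit variance and correlation `rho`, coefficients `beta1 = 1` and `beta2 = 2`, standard normal errors, and no intercept. Under these assumptions the probability limit of the short-regression estimator is `beta1 + rho * beta2`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta1, beta2, rho = 1.0, 2.0, 0.5

x1 = rng.standard_normal(n)
# x2 has unit variance and E[x1 * x2] = rho by construction.
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
eps = rng.standard_normal(n)
y = beta1 * x1 + beta2 * x2 + eps

# Short regression: y on x1 only, so x2 is absorbed into the error term.
b_short = (x1 @ y) / (x1 @ x1)

# Long regression: y on (x1, x2), the correctly specified model.
X = np.column_stack([x1, x2])
b_long, *_ = np.linalg.lstsq(X, y, rcond=None)

# Limit implied by the formula: beta1 + (E x1^2)^{-1} E[x1 x2] beta2.
plim_short = beta1 + rho * beta2
print(b_short, plim_short, b_long)
```

The short-regression estimate should be close to `plim_short` rather than to `beta1`, while the long regression should recover both coefficients. Setting `rho = 0` removes the bias, which is the case where the omitted regressor is uncorrelated with the included one.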

APPENDIX C. CONVERGENCE IN DISTRIBUTION AND ASYMPTOTIC NORMALITY

Let $\hat\theta_n$ denote an estimator of a scalar parameter $\theta$. To perform hypothesis testing about $\theta$ (or construct a confidence interval for $\theta$) using the estimator $\hat\theta_n$, one needs to know the distribution of the latter. Unfortunately, in many circumstances it is impossible to derive the exact distribution of $\hat\theta_n$, either because the expression is too complicated or because the derivation of the exact finite-sample distribution requires very restrictive assumptions. In such cases, we rely on asymptotic approximations that are usually applied to $\sqrt{n}(\hat\theta_n - \theta)$, i.e. we approximate the distribution of the scaled estimation error. The scaling is necessary when $\hat\theta_n - \theta \to_p 0$, because the unscaled error then has a degenerate limit. While the probability

\[
P\left( \sqrt{n}(\hat\theta_n - \theta) \le x \right)
\]

is unknown for finite $n$, suppose we can establish that for all $x \in \mathbb{R}$,

\[
\lim_{n \to \infty} P\left( \sqrt{n}(\hat\theta_n - \theta) \le x \right) = P(X \le x),
\]

where $X \sim N(0, \omega^2)$. In such cases we say that $\sqrt{n}(\hat\theta_n - \theta)$ converges in distribution to a normal random variable, denoted as

\[
\sqrt{n}(\hat\theta_n - \theta) \to_d N(0, \omega^2). \tag{C.1}
\]

We use the $N(0, \omega^2)$ distribution to approximate that of $\sqrt{n}(\hat\theta_n - \theta)$.
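The quality of the normal approximation can be examined by simulation. The sketch below is an illustration under assumptions not made in the notes: the estimator is the sample mean of Exponential(1) draws, so the parameter and the asymptotic standard deviation both equal one, and `normal_cdf` is a hypothetical helper built on the standard library error function.

```python
import math
import numpy as np

def normal_cdf(x, omega=1.0):
    """CDF of N(0, omega^2) evaluated at x, via the error function."""
    return 0.5 * (1.0 + math.erf(x / (omega * math.sqrt(2.0))))

rng = np.random.default_rng(4)
n, reps = 200, 20_000
theta, omega = 1.0, 1.0   # mean and standard deviation of Exponential(1)

draws = rng.exponential(scale=theta, size=(reps, n))
# Scaled estimation error sqrt(n) * (theta_hat - theta), one per replication.
scaled_errors = np.sqrt(n) * (draws.mean(axis=1) - theta)

for x in (-1.0, 0.0, 1.0):
    empirical = float(np.mean(scaled_errors <= x))
    # Columns: x, simulated probability, normal approximation.
    print(x, round(empirical, 3), round(normal_cdf(x, omega), 3))
```

Although the underlying data are skewed, the simulated probabilities should be close to the normal values, and the gap should narrow further as `n` increases.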

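Suppose that (C.1) holds. Then for any $M > 0$,

\[
P\left( |\sqrt{n}(\hat\theta_n - \theta)| > M \right) \to P(|X| > M), \quad \text{where } X \sim N(0, \omega^2).
\]

Thus, in large samples the probability that $|\sqrt{n}(\hat\theta_n - \theta)|$ exceeds a large value $M$ is approximately the same as that for a normal random variable. Since $\lim_{M \to \infty} P(|X| > M) = 0$, we say that $\sqrt{n}(\hat\theta_n - \theta)$ is bounded in probability, denoted as

\[
\sqrt{n}(\hat\theta_n - \theta) = O_p(1).
\]

We can also write

\[
\hat\theta_n = \theta + \frac{1}{\sqrt{n}} O_p(1) = \theta + O_p\left( \frac{1}{\sqrt{n}} \right),
\]

i.e. $\hat\theta_n$ converges to $\theta$ at the rate $1/\sqrt{n}$.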

The concept can be extended to random vectors by considering the joint distribution of their elements. Suppose now that the random $k$-vector $\hat\theta_n$ is an estimator of $\theta \in \mathbb{R}^k$. Suppose further that for all $x = (x_1, \dots, x_k)' \in \mathbb{R}^k$,

\[
\lim_{n \to \infty} P\left( \sqrt{n}(\hat\theta_{n,1} - \theta_1) \le x_1, \dots, \sqrt{n}(\hat\theta_{n,k} - \theta_k) \le x_k \right) = P(X_1 \le x_1, \dots, X_k \le x_k),
\]

where for some positive definite and symmetric $k \times k$ matrix $\Omega$,

\[
\begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix} \sim N(0, \Omega),
\]

and $N(0, \Omega)$ denotes the multivariate normal distribution with zero means and variance-covariance matrix $\Omega$. Then we say that $\sqrt{n}(\hat\theta_n - \theta)$ converges in distribution to the $N(0, \Omega)$ random vector, denoted as