BA222 - Lecture Notes 11: Multivariate Regression Models (Part II)

Table of Contents

• Introduction

• Specifying Regression Models for Forecasting

• Adjusted R-Squared

• Variable Selection for Forecasting

• Forward Selection

• Backwards Selection:

• Specifying Regression Models for Policy Making

• Omitted Variable Bias

• Identifying Confounding Variables and Correcting OVB

• Identifying Confounding Variables

• Correcting Omitted Variable Bias

• Example of Identifying and Correcting for OVB

Introduction

A common challenge for analysts and researchers using multivariate regression models is to decide which variables to include and which ones to omit. The answer to this question depends on the main objective of the model. Models with a common dependent variable (y variable) can include a widely different set of independent variables depending on their application.

The two most important applications of regression models are: (1) Forecasting and (2) Policy Making. These notes describe processes to help you decide which variables to include (and omit) when specifying regression models.

Specifying Regression Models for Forecasting

When specifying a multivariate regression model for forecasting, the goal should be to maximize the predictive power of the model. Our first task is then to learn: (1) how to measure the predictive power of a multivariate regression model and (2) how to decide which variables to include in order to maximize that predictive power.

We are presenting the topic of prediction here separate from the analysis of time series, which we'll discuss later in the semester. But most of the principles discussed here also apply for time series.

Adjusted R-Squared

Previously we introduced the R-Squared as a metric of goodness of fit with a very simple rule: the closer the R-Squared is to one, the better the model fits the data. For multivariate regression models we'll use instead the adjusted R-Squared (adj − R2), which is a modification of the original R-Squared (R2). In summary, it penalizes models that include independent variables (regressors) that don't contribute to the overall goodness of fit. We'll call these variables redundant variables.

The formula for the adjusted R-Squared is given by:

$$\text{adj-}R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

where R2 is the original R-Squared, k is the number of explanatory variables and n is the sample size. Holding R2 fixed, an increase in k lowers the adjusted R-Squared. This means that if a new variable is added (k increases) and the R2 does not increase by an important amount, the adjusted R2 will go down.
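The penalty can be checked with a few lines of arithmetic. The sketch below assumes the standard definition adj − R2 = 1 − (1 − R2)(n − 1)/(n − k − 1); the function name and the R2 values are made up for illustration and do not come from the Brookline data.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k explanatory variables and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical numbers: a one-variable model, then two ways of adding a second variable
base = adjusted_r2(0.750, n=100, k=1)       # starting model
tiny_gain = adjusted_r2(0.751, n=100, k=2)  # new variable barely raises R2
big_gain = adjusted_r2(0.780, n=100, k=2)   # new variable raises R2 noticeably

print(round(base, 4), round(tiny_gain, 4), round(big_gain, 4))  # 0.7474 0.7459 0.7755
```

The second model has a higher R2 than the first but a lower adjusted R-Squared, which is the signature of a redundant variable.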

In simple terms, variables that are added to the model and don't add a sufficient level of explanatory power will reduce adj − R2. Variables that add enough predictive power increase adj − R2.

Just like the original R2 , adj − R2 is a model selection statistic. We use it to compare models with the same dependent variable. The higher the adj − R2 the better the ﬁt. In that way you can estimate different regression models (with many combinations of independent variables) and use the adjusted R-Squared to identify the model that best ﬁts the data.

Extracting the adjusted R-Squared in Python is as simple as extracting the regular R-Squared. You just need to use .rsquared_adj instead of .rsquared on the regression results.
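A minimal sketch of both attributes follows. It uses simulated data rather than the course dataset, so the column names x1, x2 and y and the coefficients are assumptions made only for this example; x2 is pure noise by construction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical): y depends on x1 only, x2 is unrelated noise
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 + 1.5 * df["x1"] + rng.normal(size=n)

small = smf.ols("y ~ x1", data=df).fit()
large = smf.ols("y ~ x1 + x2", data=df).fit()

# R2 never falls when a regressor is added; the adjusted version can
print("y ~ x1:     ", small.rsquared, small.rsquared_adj)
print("y ~ x1 + x2:", large.rsquared, large.rsquared_adj)

# Reproduce the adjusted R-Squared of the larger model by hand (k = 2 regressors)
manual = 1 - (1 - large.rsquared) * (n - 1) / (n - 2 - 1)
```

Comparing the two printed rows shows whether x2 earned its place in the model; the manual value should agree with large.rsquared_adj.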

Practice

1. Estimate the following regression models with price as the dependent variable. (1) size, (2) size and bedrooms, (3) size and full bathrooms

2. For each model extract the R2 and the adj − R2

3. Which one of the three models ﬁts the data best?

Using the R2 you cannot answer this question correctly, as the R2 will always go up or stay the same when new variables are added. On the other hand, the adjusted R-Squared makes it clear that adding the number of bedrooms does not improve the goodness of fit (the adjusted R-Squared is reduced), while adding the number of full bathrooms does (the adjusted R-Squared increases). The model with size and full bathrooms is therefore the best of the three.

These results don't imply that the number of bedrooms has no predictive power at all. But, in a regression that already includes size as an explanatory variable, it does not improve the goodness of fit in a meaningful way. Essentially, size already explains most of the variation in y that the number of bedrooms explains.

Let's try now repeating this exercise, but starting with a model with just the number of bedrooms and note how we reach different conclusions:

Practice

1. Estimate the following regression models with price as the dependent variable. (1) bedrooms, (2) bedrooms and size, (3) bedrooms and full bathrooms.

2. Which one of the three models ﬁts the data best?

Now the result is that size should be added but not the number of full bathrooms. This is because most of the explanatory power that the number of full bathrooms adds is already captured by the number of bedrooms. The fact that the adjusted R-Squared may go up or down depending on which variables were included first in the regression makes it hard for us to know the order in which regressors should be included. The next section describes two strategies that can be used to avoid this issue.

Variable Selection for Forecasting

There are many strategies and metrics that can be used to specify models with high predictive power. This being an introductory course I will only present two methodologies, but know that there are many other methods and criteria that can be used.

Forward Selection

In forward selection we start with the variable that individually has the most predictive power (highest correlation coefﬁcient) and then add additional variables in the order given by the magnitude of the change in adjusted R-Squared. This process is outlined in ﬁve steps below:

1. Compute the correlation with respect to the dependent variable for all variables in your dataset.

2. The ﬁrst independent variable should be the one with the highest correlation coefﬁcient.

3. Add new variables according to their contribution to the adjusted R-Squared. That is, estimate models that include the original x variable from step two plus a single additional variable. The model that increases the adj − R2 the most should be selected to continue.

4. Starting with the model selected at the end of step (3), repeat step (3).

5. Stop once no model increases the adjusted R-squared.
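The five steps above can be sketched as one reusable function. This is an illustration under stated assumptions, not the code used in these notes: it fits OLS with plain NumPy instead of statsmodels so that it stands alone, it ranks the first variable by the absolute value of its correlation, it only handles numerical variables, and the names adj_r2, forward_select, x1, x2 and x3 are hypothetical.

```python
import numpy as np

def adj_r2(y, columns):
    """Adjusted R-squared of an OLS fit of y on the given columns plus an intercept."""
    n, k = len(y), len(columns)
    X = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_select(y, data):
    """data maps variable names to 1-D arrays; returns the names in the order added."""
    remaining = list(data)
    # Steps 1-2: start with the variable most correlated with y
    first = max(remaining, key=lambda v: abs(np.corrcoef(data[v], y)[0, 1]))
    selected = [first]
    remaining.remove(first)
    best = adj_r2(y, [data[first]])
    # Steps 3-5: keep adding the variable that raises adj-R2 the most, stop when none does
    while remaining:
        scores = {v: adj_r2(y, [data[s] for s in selected] + [data[v]])
                  for v in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected, best

# Simulated data (hypothetical): x1 and x2 drive y, x3 is noise
rng = np.random.default_rng(0)
n = 200
data = {name: rng.normal(size=n) for name in ["x1", "x2", "x3"]}
y = 3 * data["x1"] + 2 * data["x2"] + rng.normal(size=n)

selected, best = forward_select(y, data)
print(selected, best)
```

With these simulated coefficients the first two picks are x1 and then x2; whether the noise variable x3 is also added depends on the random draw, because a redundant variable can raise the adjusted R-Squared slightly by chance.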

Example:

Let's apply forward selection to the brookline.csv data:

Step 1 and 2:

Compute the correlation between price and all the other variables in the model. The highest correlation is between price and size, thus our first regression should include size as the only independent variable.

price = β0 + β1 Size + Error, adj − R2 = 0.7488

Step 3:

Now, we are going to estimate new models by adding a single new variable to the original model and selecting the one with the highest adjusted R2 . Below you can see an example of using a loop to make this task easier:

import statsmodels.formula.api as smf

# br is assumed to be the brookline.csv data already loaded as a pandas DataFrame
variables = ['beacon', 'baseFloor', 'elevators', 'rooms', 'bedrooms',
             'fullBathrooms', 'halfBathrooms', 'garage', 'C(buildingStyle)']
rsquared_adjs = list()

# Estimate one model per candidate, each adding a single variable to size
for j in variables:
    formula = 'price ~ size + ' + j
    adjR2 = smf.ols(formula, data=br).fit().rsquared_adj
    rsquared_adjs.append(adjR2)
    print(formula + ':', adjR2)

maxADJR2 = max(rsquared_adjs)        # extracting maximum adj-R2
pos = rsquared_adjs.index(maxADJR2)  # finding position of max adj-R2
variable = variables[pos]            # extracting variable

print()
print("Next Variable to Include:", variable)

Therefore the next variable to include in the model is garage:

price = β0 + β1 Size + β2 Garage + Error, adj − R2 = 0.7653

Step 4:

We can now repeat the process, starting with a model that includes garage:

# garage is now part of the base model, so it is removed from the candidates
variables = ['beacon', 'baseFloor', 'elevators', 'rooms', 'bedrooms',
             'fullBathrooms', 'halfBathrooms', 'C(buildingStyle)']
rsquared_adjs = list()

for j in variables:
    formula = 'price ~ size + garage + ' + j
    adjR2 = smf.ols(formula, data=br).fit().rsquared_adj
    rsquared_adjs.append(adjR2)
    print(formula + ':', adjR2)

maxADJR2 = max(rsquared_adjs)        # extracting maximum adj-R2
pos = rsquared_adjs.index(maxADJR2)  # finding position of max adj-R2
variable = variables[pos]            # extracting variable

print()
print("Variable to Include:", variable)

Therefore the next variable to include in the model is fullBathrooms:

price = β0 + β1 Size + β2 Garage + β3 Full-Bathrooms + Error, adj − R2 = 0.7732

Step 5:

We can continue repeating this process until we end up with a model that doesn't increase the adjusted R-Squared. If you followed the procedure correctly, the ﬁnal model should include the following variables (presented in the order they should be added):

1. Size

2. Garage

3. Full-Bathrooms

4. Building Style

5. Half-Bathrooms

6. Beacon

7. Elevators

There are three redundant variables that should not be included: Base Floor, Rooms and Bedrooms. The ﬁnal Adjusted R-Squared is 0.8008.

Backwards Selection:

This is an alternative method for selecting the variables of a model designed for making predictions. In simple terms, you start with a model that includes all possible regressors in the data and then, out of all the variables that are not statistically significant, exclude the one that is least significant (you can use the magnitude of the p-value for this). Then repeat until all the remaining variables are statistically significant. The following steps outline this process:

1. Start with a model including all the independent variables in the dataset.

2. Among the variables that are not statistically significant, eliminate the one with the highest p-value (the least significant). Estimate the model without this variable. Note: For categorical variables, only eliminate them when all the categories are simultaneously insignificant.

3. Repeat step (2) until you are left with a model in which all coefficients are statistically different from zero.
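The three steps can be written as a short loop. The sketch below is only an illustration: the function name backward_select, the 0.05 cutoff passed as alpha, and the simulated columns x1 to x4 are assumptions, and it handles numerical regressors only (a categorical term such as C(buildingStyle) produces one p-value per category and would need the special rule from step 2).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_select(data, response, candidates, alpha=0.05):
    """Drop the least significant regressor until every p-value is below alpha."""
    kept = list(candidates)
    while kept:
        formula = response + " ~ " + " + ".join(kept)
        # p-values of the slopes only; the intercept is never a candidate for removal
        pvalues = smf.ols(formula, data=data).fit().pvalues.drop("Intercept")
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha:
            break  # every remaining variable is statistically significant
        kept.remove(worst)
    return kept

# Simulated data (hypothetical): only x1 and x2 affect y
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({name: rng.normal(size=n) for name in ["x1", "x2", "x3", "x4"]})
df["y"] = 1 + 3 * df["x1"] + 2 * df["x2"] + rng.normal(size=n)

kept = backward_select(df, "y", ["x1", "x2", "x3", "x4"])
print(kept)
```

The strong regressors x1 and x2 survive; a noise variable can still remain occasionally, since roughly one in twenty unrelated variables passes a 5% test by chance.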

Example:

Let's apply backward selection to the brookline.csv data:

Step 1:

Start with a model that includes all the variables:

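# Full model: every candidate regressor in the data.
# Adjacent string literals are joined, so the formula can span several lines.
reg1 = smf.ols('price ~ size + beacon + baseFloor + C(buildingStyle)'
               ' + elevators + rooms + bedrooms + fullBathrooms'
               ' + halfBathrooms + garage', data=br)
results = reg1.fit()
results.summary()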

Step 2:

Now we start by eliminating the variable with the highest p-value. Note that the category HIGH-RISE is not statistically significant, but we should not eliminate the C(buildingStyle) variable, as there are other categories that are statistically different from zero. Taking that into account, we identify baseFloor as the first variable to eliminate: its beta coefficient is not statistically different from zero and it has the highest p-value (0.428).

Step 3:

We repeat this process of elimination until we are left with a model that includes only variables that are statistically significant (p-value < 0.05). You should end up with: Size, Garage, Full-Bathrooms, Building Style, Half-Bathrooms, Beacon and Elevators. The variables that were eliminated are: Base Floor, Rooms and Bedrooms.

Note that we got the same result as with forward selection. This is common but not always the case: you may end up with different results depending on the method that you choose. You can try both procedures and use the one with the highest adjusted R-Squared.

Alternatively, you can use a combination of both methods. Apply forward selection ﬁrst and then use that as the starting model instead of a model with all the variables for backward selection.

Specifying Regression Models for Policy Making

Estimating the effect of a change in policy involves knowing the causal relation between the policy variable (x) and some outcome variable (y). Recall that causality is not the same as statistical association. A variable x is said to cause another variable y if changing x, while keeping everything else constant, changes the value of y.

Statistical association happens when changing x, y also changes, without controlling for other factors necessarily. Policy making based only on statistical association and not causality can be misleading. For instance, the fact that the sales of sunblock and ice-cream are statistically associated does not mean that in order to increase ice-cream sales we should offer a discount on sunblock.

Because causality and statistical association are not the same, regression models used for policy analysis should try to estimate the causal relation between x and y by controlling for other factors. In practice, it is impossible to control for everything, as datasets have limitations; as a result, we'll only get an approximation to the causal relation between x and y. But, in some instances, our approximation can be pretty good and useful for those making policy recommendations. In any case, it is better than using a model without any controls.

In the rest of these notes you'll learn about the statistical reasons why regression models can fail to estimate the causal relation between variables and then go over a methodology that allows us to approximate it.

Omitted Variable Bias

We'll frame this discussion by noting that our goal is policy making. In particular, we want to know the causal relation between a speciﬁc policy variable x and a policy outcome variable y. Other variables added to the model should serve that purpose and are of no particular interest to us.

A regression model suffers from omitted variable bias (OVB) when there are variables z, left out of the model, that are simultaneously related to the outcome variable y and the policy variable x. If that's the case, adding the variables z as control variables to the regression model will correct the bias on the beta coefficient of x generated by not including them in the first place.

Example:

To understand the statistical reason for this problem, let's start with a simple example. Consider a school district that wants to decide how much to invest in tutoring. Investing more in tutors will mean that more students (especially those with fewer resources) will have greater access to tutoring.

To answer this policy question we want to estimate the causal relation between having a tutor (measured as a dummy where 0 is no tutor and 1 is having a tutor) and academic performance (measured by a quiz on a 0-100 scale).

Using data from all the students in the school district, we estimate the slope coefficient using a univariate regression model and get 20. This means that students who have a tutor performed better than those without a tutor by 20 points on average. Looking at this result, one may conclude that it is a great investment for the school district to spend more money on tutoring, as the returns in terms of academic performance are very high.

Think about the role of income in this problem. Before any change in policy, students with higher income can afford to pay for a tutor while students with lower income generally cannot. This means that the policy variable (having a tutor) is going to be statistically associated with income. Additionally, individuals from lower-income households tend to perform worse academically on average (due to issues like nutrition, security, access to learning resources, parents' education, etc.), so the policy outcome variable (academic performance) is also statistically associated with income. This means that in a sample like this it is not going to be rare to find: (1) students with high income who have a tutor and high academic performance, and (2) students with low income who do not have a tutor and have low academic performance. Note how income is creating a statistical association between academic performance and having a tutor.

Not controlling for income in the regression is going to produce a biased result: part of the statistical association that we observe between tutors and academic performance is the result of the confounding variable income. As a result, we cannot consider the estimated slope of 20 to be the causal relation between tutoring and academic performance, as part of the magnitude of the estimated slope is the result of the effect of income.

Essential to the problem of OVB is that the variable z is related with BOTH the y and the x variable. If z is only related to y there is no bias. If z is only related to x there is no bias. You will only get OVB if z is related to both variables simultaneously.
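One way to see why both relationships are needed is the textbook expression for the bias from a single omitted variable. The symbols below are notation introduced here for illustration and are not used elsewhere in these notes: β2 is the coefficient of z in the model that controls for it, δ1 is the slope from an auxiliary regression of z on x, and the tilde marks the slope on x when z is left out.

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 x + \beta_2 z + u && \text{(model that controls for } z\text{)} \\
z &= \delta_0 + \delta_1 x + v && \text{(auxiliary regression of } z \text{ on } x\text{)} \\
\tilde{\beta}_1 &= \hat{\beta}_1 + \hat{\beta}_2\,\hat{\delta}_1 && \text{(slope on } x \text{ when } z \text{ is omitted)}
\end{aligned}
```

The extra term is the product of the two relationships, so it vanishes when z is unrelated to y (β2 = 0) or when z is unrelated to x (δ1 = 0).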

The solution to the example problem is pretty simple. We can estimate the original regression model and add income as a control variable. In that way the estimated slope for tutoring will be estimated while keeping income constant.
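The tutoring example can be reproduced with a small constructed dataset. Everything numeric below is invented for illustration: the group counts, the assumed rule that tutoring adds 5 points and high income adds 25, and the helper names. The counts were chosen so that the regression without controls returns the slope of 20 discussed above.

```python
def ols_slope(x, y):
    """Slope from a univariate regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def ols_slope_with_control(x, z, y):
    """Coefficient on x from a regression of y on x and z (two-regressor closed form)."""
    n = len(y)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((c - mz) ** 2 for c in z)
    sxz = sum((a - mx) * (c - mz) for a, c in zip(x, z))
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    szy = sum((c - mz) * (b - my) for c, b in zip(z, y))
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)

# (tutor, high_income, number of students): hypothetical counts
groups = [(1, 1, 40), (0, 1, 10), (1, 0, 10), (0, 0, 40)]
tutor, income, score = [], [], []
for t, inc, count in groups:
    tutor += [t] * count
    income += [inc] * count
    # Assumed data-generating rule: tutoring adds 5 points, high income adds 25
    score += [50 + 5 * t + 25 * inc] * count

naive = ols_slope(tutor, score)                            # income omitted
controlled = ols_slope_with_control(tutor, income, score)  # income held constant
print(round(naive, 2), round(controlled, 2))  # 20.0 5.0
```

In this constructed sample three quarters of the naive slope comes from income: once income is held constant, the estimated effect of tutoring drops from 20 to the 5 points built into the data.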

Identifying Confounding Variables and Correcting OVB

The challenge of identifying and correcting OVB is that many different z variables can be simultaneously causing OVB. So, just because you ﬁx the OVB caused by one variable does not mean that the beta coefﬁcient is unbiased or completely free from OVB as there are other potential z variables that you are not controlling for.

In theory, we should be able to estimate the unbiased values of the beta coefficients if we are able to include all the variables that may be related to both y and x. In practice, however, it is nearly impossible for a database to contain every potential z variable that may cause OVB. Therefore, in the majority of applications some bias will always remain. We can only reduce it as much as possible.

Identifying Confounding Variables

In order to identify confounding variables, follow this procedure:

1. Identify the outcome variable (y) and the policy variable (x).

2. For numerical variables: Produce a correlation matrix and look for variables that are simultaneously correlated with x and y. These are confounding variables.

3. For categorical variables: Find evidence that the variable is statistically associated with both x and y. You can do this using two auxiliary regressions. Let z be the categorical variable. Start by estimating a regression with y as the dependent variable and z as the only independent variable; if some of the beta coefficients for the categories of z are statistically different from zero, the two variables are related. Then proceed in the same way, but this time use x instead of y as the dependent variable. If you find that z is statistically associated with both y and x, then z is a confounding variable.
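For the numerical case in step 2, the screening can be sketched as follows. This is a rough illustration only: the 0.3 correlation cutoff is an arbitrary assumption (these notes do not prescribe a threshold), the tiny dataset is invented, and the names pearson, candidate_confounders, income and seat_row are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

def candidate_confounders(x, y, others, threshold=0.3):
    """Names of variables whose correlation with BOTH x and y reaches the threshold."""
    return [name for name, z in others.items()
            if abs(pearson(z, x)) >= threshold and abs(pearson(z, y)) >= threshold]

# Invented data: tutor is the policy variable x, score is the outcome y
tutor = [0, 0, 0, 0, 1, 1, 1, 1]
score = [50, 55, 52, 57, 70, 78, 72, 80]
others = {
    "income":   [1, 2, 1, 2, 3, 4, 3, 4],  # related to both tutor and score
    "seat_row": [1, 2, 1, 2, 1, 2, 1, 2],  # related to score but not to tutor
}

result = candidate_confounders(tutor, score, others)
print(result)  # ['income']
```

Here seat_row is dropped because it is unrelated to the policy variable, even though it moves with the outcome; only income satisfies both conditions.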

Correcting Omitted Variable Bias

Once you are done identifying confounding variables the ﬁx to OVB is actually pretty simple. You only need to include those variables in the regression model as controls.

When adding controls, some confounding variables will correct more bias than others based on the magnitude of their statistical association with respect to x and y. It is recommended that you only include variables that signiﬁcantly correct bias as other issues may arise when adding control variables (more on this next week).

The bias corrected by adding a control variable can be approximated by the change in the beta coefﬁcient of the policy variable. We have to decide, based on the scale of the variables, when a change is of practical signiﬁcance or not. If the change is very small then there is no need to add the confounding variable.

To help you decide when the change is small enough to ignore, think about the expected change in the policy variable and how the change in the beta coefficient will affect the expected change in the outcome variable. If the change in the outcome variable remains essentially the same, then adding the control has no practical importance and we can ignore it.

Example of Identifying and Correcting for OVB

Imagine that a real estate developer wants to know how to price apartments located on Beacon Street. The motivation for this comes from the fact that Beacon Street seems to be a desired location and as a result it should exhibit some form of price premium.

Therefore, in this analysis the policy variable will be beacon and the outcome variable will be price from the brookline.csv data.

Let's start the analysis by looking at the results of the univariate regression model:

reg1 = smf.ols('price ~ beacon', data=br).fit()
reg1.params

The estimated beta coefficient for beacon is equal to -$46,969.18, meaning that properties located on Beacon Street are $46,969.18 cheaper on average relative to properties not on Beacon Street.

Does it mean that the real estate developer should reduce the selling price of properties on Beacon Street? No. That would be a causal interpretation of a regression model with no correction for OVB. In other words, the estimated coefficient may be greatly biased because the model includes no control variables.

The next step is to identify which variables in the data are statistically associated with both price and beacon. Price is a numerical variable, so we just need the correlation coefficient for most of the variables (except buildingStyle, for which we need a regression). Beacon is a categorical variable, so we'll have to estimate regressions to determine whether variables are statistically associated with it.

Variables Associated with Price

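Let's extract the correlation with respect to price:

# numeric_only=True skips text columns such as buildingStyle, which recent pandas versions would otherwise reject
br.corr(numeric_only=True)['price']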

Variables that are correlated with price:

• Size (High)

• Full-Bathrooms (Moderate/High)

• Rooms (Moderate)

• Bedrooms (Moderate)

• Half-Bathrooms (Moderate)

• Garage (Moderate)

• Elevators (Weak)

• Base Floor (Weak)

To determine the statistical association with buildingStyle:

regBStyle = smf.ols('price ~ C(buildingStyle)', data=br)
results = regBStyle.fit()  # fit once; the fitted results object has no .fit() of its own
results.summary()

Because some of the categories are statistically different from zero, buildingStyle is statistically associated with price.

Variables Associated with Beacon

Now we need to use regressions to detect if any variable is statistically associated with beacon. I'll use a loop to make this process faster.

# For numerical variables
variables = ['size', 'baseFloor', 'elevators', 'rooms', 'bedrooms',
             'fullBathrooms', 'halfBathrooms', 'garage']

print('Variables statistically associated with beacon:')
print()

for j in variables:
    formula = 'beacon ~ ' + j
    # .iloc[1] is the p-value of the slope (position 0 is the intercept)
    pvalue = smf.ols(formula, data=br).fit().pvalues.iloc[1]
    if pvalue <= 0.05:
        print(j)

# For the categorical variable (only buildingStyle)
pvalue = smf.ols('beacon ~ C(buildingStyle)', data=br).fit().pvalues
# Skip the intercept and check whether any category is significant
if any(pvalue.iloc[1:] <= 0.05):
    print('Building Style')

Variables that are statistically associated with beacon:

• Size

• Full-Bathrooms

• Rooms

• Bedrooms

• Half-Bathrooms

• Building Style

Variables Associated with both Beacon and Price

Finally, we identify variables that are associated simultaneously with Beacon and Price:

• Size

• Full-Bathrooms

• Rooms

• Bedrooms

• Half-Bathrooms

• Building Style

Correcting Omitted Variable Bias

Based on the previous list we are going to estimate models adding controls sequentially. To show you the procedure I'm going to start with size, as this is the variable that is most highly correlated with y and the easiest to understand. But then, we'll develop a method for deciding the order in which variables should be included.

reg2 = smf.ols('price ~ C(beacon) + size', data=br).fit()
reg2.params

Compare the beta coefficient for Beacon from the univariate regression model with the current one. The value is now equal to $32,935, meaning that, on average, properties on Beacon Street are $32,935 more expensive than the rest, while controlling for size. Initially the coefficient was equal to -$46,969.18 and now it is $32,935, which means that the bias corrected was about -$46,969.18 - $32,935 = -$79,904.18. In other words, not controlling for size was making us estimate the beta coefficient with a bias of -$79,904.18. We consider this a significant amount for policy makers; as a result, size should be part of the model.
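The comparison above is simple arithmetic on the two estimated coefficients. The short sketch below only restates the numbers quoted in the text; the variable names are made up for the example.

```python
# Coefficients on beacon reported in the text
beta_no_control = -46969.18  # price ~ beacon
beta_with_size = 32935.00    # price ~ C(beacon) + size

# Bias corrected by the control = coefficient before minus coefficient after
bias_corrected = beta_no_control - beta_with_size
# A change of sign means the two models give opposite pricing advice
sign_flipped = (beta_no_control < 0) != (beta_with_size < 0)

print(round(bias_corrected, 2), sign_flipped)  # -79904.18 True
```

Beyond its size, the correction flips the sign of the coefficient, so the model without controls would have pointed the developer toward a discount and the model with size points toward a premium.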

Now let's think about a general procedure to decide the order in which variables should be included. We are going to produce a method similar to forward selection: we'll add new variables in the order determined by the amount of bias they correct, including first the variables that correct the most bias.

The following loop will help us with this task:

variables = ['size', 'rooms', 'bedrooms', 'fullBathrooms',

'hal

2023-04-25
