BA222 - Lecture Notes 10: Multivariate Regression Models
Table of Contents
• Introduction
• Multivariate Regression Models
• Estimation in Python
• Interpretation of Beta Coefficients
• Controlling for Other Factors
• Dummy Variables
• Interpretation of Beta Coefficients for models with Dummy Variables
• Displaying Regressions Side-By-Side
• Exporting Regression Results
Introduction
In the previous lecture notes we introduced the linear regression model as a method to quantify the relation between two variables. Regression models are used for making policy decisions and forecasting.
The linear regression model differs from a simple linear equation because it includes an error term, which represents all factors other than x that are related to y.
In these notes you'll learn how to estimate regression models with more than one independent variable (more than one x variable). There are two main reasons to add more than one independent variable to the model:
• Forecasting: Generally, adding more variables to a regression model improves its predictive power. We'll learn how to choose which variables improve predictive power and which don't.
• Policy Making: Adding more variables to a regression model can reduce the bias that can be present in the estimation of key beta coefficients used for policy making. The reason why the beta coefficients may be biased has to do with the concept of confounding variables, which we'll discuss in detail next week.
In practice, regression models with a single x variable are seldom useful and are generally used as an exploratory first step in estimating a model. As a rule, you should always question statistical results derived from models specified in this way.
Multivariate Regression Models
A multivariate regression model is a regression model that includes more than one independent variable. Formally,
y = β0 + β1 x1 + β2 x2 + … + βk xk + error
Where x1, x2, …, xk are the independent variables, β0 is the intercept, and β1, β2, …, βk are the beta coefficients associated with each respective independent variable.
In practice, we use multivariate regression models for two independent purposes:
• Improve the predictive power of a univariate regression model
• Reduce bias (more of this in the next lectures)
Estimation in Python
Estimating a multivariate regression model in Python is as simple as estimating a univariate model. As usual, start by loading packages and data. For this example we'll use the brookline.csv data:
import pandas as pd
import statsmodels.formula.api as smf
path = '/Users/ccasso/Dropbox/BU/Teaching/2022/Fall/BA222/Data/brookline/brookline.csv'
br = pd.read_csv(path)
The only difference with respect to the estimation procedure of a univariate regression model is that in the formula term of the smf.ols() function we'll include more than one independent variable, separating them with the + symbol. For instance:
reg1 = smf.ols('price ~ size', data = br).fit()
reg2 = smf.ols('price ~ size + bedrooms', data = br).fit()
The first model, reg1, is simply a univariate regression model where price is the dependent variable and size is the independent variable.
price = β0 + β1 size + error
The second model, reg2, is a multivariate regression model where price is the dependent variable and size and bedrooms are the independent variables.
price = β0 + β1 size + β2 bedrooms + error
For multivariate models, you can extract the values of the beta coefficients, fitted values, residuals, standard errors, and R², and produce the regression summary table in the same way as for univariate regression models.
# Extracting parameters
reg1.params
reg2.params
# Extracting standard errors
reg1.bse
reg2.bse
# Extracting r-squared
reg1.rsquared
reg2.rsquared
# Extracting fitted values
reg1.fittedvalues
reg2.fittedvalues
# Extracting residuals
reg1.resid
reg2.resid
# Extracting regression summary table
reg1.summary()
reg2.summary()
Interpretation of Beta Coefficients
In a multivariate regression model, the interpretation of the beta coefficients changes in an important way.
Intercept: The intercept (β0) of a multivariate regression represents the average of the dependent variable (y) when all the independent variables (x1, x2, …, xk) are simultaneously equal to zero.
Beta Coefficients: The beta coefficients of the independent variables (βi , with i ≥ 1) represent the average change in the dependent variable (y) when the independent variable i increases by one unit, while keeping the other independent variables constant.
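One way to see the "keeping the other variables constant" interpretation is to compare two fitted values that differ in a single variable. The coefficients below are made-up numbers for illustration, not estimates from the data:

```python
# Hypothetical coefficients (illustrative only, not estimates from brookline.csv)
b0, b_size, b_bedrooms = 10000.0, 400.0, -2000.0

def y_hat(size, bedrooms):
    # Fitted value: price = b0 + b_size*size + b_bedrooms*bedrooms
    return b0 + b_size * size + b_bedrooms * bedrooms

# Increase size by one unit while holding bedrooms constant at 3:
change = y_hat(1001, 3) - y_hat(1000, 3)
print(change)  # 400.0 -- exactly the beta coefficient on size
```

The difference between the two fitted values is exactly the slope on size, because every other term cancels out.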
Let's practice the interpretation of coefficients using reg1 and reg2:
Practice:
1. Estimate the beta coefficients for the regression model reg1 and interpret
the beta coefficients.
2. Estimate the beta coefficients for the regression model reg2 and interpret
the beta coefficients.
For reg1 the interpretation should be:
Intercept: The average price of a property in Brookline is about $12,934.12 when the size is zero sq. ft. This makes no sense, as properties cannot have a size of zero.
Slope: For each additional square foot of size, the average price of a property in Brookline increases by about $407.45.
For reg2 the interpretation should be:
Intercept: The average price of a property in Brookline is about $12,934.12 when the size is zero sq. ft. and the property has no bedrooms. This makes no sense, as properties cannot have a size of zero, and it is unlikely that they have no bedrooms.
Beta Coefficient for Size: For each additional square foot of size, the average price of a property in Brookline increases by about $409.79, while keeping the number of bedrooms constant.
Beta Coefficient for Bedrooms: For each additional bedroom, the average price of a property in Brookline decreases by about $1,894.33, while keeping the size constant.
Note that the interpretation for the number of bedrooms may be counter-intuitive: having more bedrooms reduces the price? Your intuition may tell you it should be the opposite. Keep in mind that a multivariate regression model estimates the effect on the dependent variable while keeping the other independent variables constant. That is, among apartments of the same size, adding more bedrooms is associated with a lower price.
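This "same size, different bedrooms" comparison can be sketched in code. The dataset below is synthetic (brookline.csv is not reproduced here), generated so that, at a fixed size, extra bedrooms lower the price:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for brookline.csv; the coefficients used to generate
# prices (400 per sq ft, -2000 per bedroom) are made-up numbers.
rng = np.random.default_rng(0)
n = 300
size = rng.uniform(500, 3000, n)
bedrooms = rng.integers(1, 6, n)
price = 10000 + 400 * size - 2000 * bedrooms + rng.normal(0, 5000, n)
df = pd.DataFrame({'price': price, 'size': size, 'bedrooms': bedrooms})

reg = smf.ols('price ~ size + bedrooms', data=df).fit()

# Two apartments of identical size, differing only in the number of bedrooms:
new = pd.DataFrame({'size': [1500, 1500], 'bedrooms': [2, 3]})
pred = reg.predict(new)
print(pred[1] - pred[0])  # equals the bedrooms coefficient (negative here)
```

The predicted price gap between the two same-size apartments is exactly the estimated bedrooms coefficient, which is the "holding size constant" comparison in action.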
Controlling for Other Factors
The great benefit of a multivariate regression model is that it allows the analyst to isolate the effect of a particular x variable on y while controlling for other variables.
Why is it important to control for other variables?
Imagine that we want to know if people are willing to pay more for properties located on Beacon Street using the brookline.csv data.
Let's start by looking at the results of the univariate regression model:
reg1 = smf.ols('price ~ beacon', data = br)
results = reg1.fit()
results.summary()
The interpretation of the slope coefficient is that "properties located on Beacon Street, relative to properties not on Beacon Street, sell on average for about $47,000 less". One may think that this is the effect on price of having a property located on Beacon Street. In other words, one may incorrectly interpret this result as the causal effect of Beacon Street on price. What the regression model reveals is that people are paying, on average, less for properties on Beacon Street, not that the location is the reason for the price difference. From the regression we know that location and price are statistically associated; we don't know if location is the cause of the price difference.
A counter-argument to the initial interpretation of the beta coefficient as causal effect would be the following: There are other factors that affect the price of properties; size, the style of the house, number of bedrooms, amenities, etc. I would not be surprised to find out that these other factors are also related with location. I can imagine how construction companies, the passage of time and many other factors affect how buildings are constructed. It is very likely that the people and the time in which the buildings on Beacon Street were built may be essentially different from other locations and that's the reason why the buildings are different and people are paying less on average, not the location per se.
If one is interested in knowing whether people are willing to pay more for properties located on Beacon Street, one would need to compare properties that are identical in all aspects related to price except for the location. That is, one would need to compare two apartments of the same size, same number of rooms, bedrooms, etc., one located on Beacon Street and the other located somewhere else, and then compare the prices. In that way we are controlling for other factors and allowing location to be the only determinant of the price difference.
Multivariate regression models allow us to do that by simply adding the other factors as additional independent variables, in which case we call them control variables. For instance, take a look at the results of a multivariate model of beacon on price, controlling for size:
reg2 = smf.ols('price ~ beacon + size', data = br)
results = reg2.fit()
results.summary()
Note the drastic change in the coefficient for beacon compared to the univariate regression model. We observe this big change because we are now estimating the relation between beacon and price while controlling for size. In other words, when estimating the beta coefficient for beacon we are keeping size constant and allowing only beacon to vary. In this way we can guarantee that size is not biasing (affecting) the estimation of the beta coefficient for beacon.
We can control for other factors besides size. We simply need to add the additional controls to the model.
reg3 = smf.ols('price ~ beacon + size + elevators', data = br)
results = reg3.fit()
results.summary()
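The way a control variable changes a coefficient can be illustrated with synthetic data (all the numbers below are made up, not the brookline.csv estimates). Here the dummy is deliberately correlated with size, so the univariate coefficient mixes the location premium with the size effect, while the controlled model recovers the premium built into the simulation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic illustration of confounding: "beacon" homes are built smaller
# on average, and size drives price, so the univariate beacon coefficient
# mixes the two effects. True premium in the simulation: 30000.
rng = np.random.default_rng(1)
n = 500
beacon = rng.integers(0, 2, n)
size = rng.normal(2000 - 400 * beacon, 300, n)   # beacon homes are smaller
price = 5000 + 400 * size + 30000 * beacon + rng.normal(0, 10000, n)
df = pd.DataFrame({'price': price, 'size': size, 'beacon': beacon})

uni = smf.ols('price ~ beacon', data=df).fit()
multi = smf.ols('price ~ beacon + size', data=df).fit()
print(uni.params['beacon'])    # negative: mixes in the size effect
print(multi.params['beacon'])  # close to the simulated premium of 30000
```

Adding size as a control flips the sign of the beacon coefficient in this simulation, mirroring the pattern described above for the real data.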
Next week we'll discuss how to decide which control variables to include and which
to omit.
Dummy Variables
A dummy variable is a variable that is coded as 0 or 1. The specific interpretation of 0 and 1 depends on the variable at hand. For instance, we can use 0 and 1 to represent if the individual has a college degree or not (0 = do not have a college degree, 1 = have a degree).
In a regression model, we represent categorical variables with dummy variables. It is very easy to implement them in Python: you just need to wrap the categorical variable in C() within the formula of the smf.ols() function.
For example, let's run a regression using price as dependent variable and building style as independent variable:
reg1 = smf.ols('price ~ C(buildingStyle)', data = br).fit()
reg1.summary()
Take a look at the output. Each building style category (except for one) has its own beta coefficient (and standard error, t-stat, etc.). What Python does under the hood is create a dummy variable for each category (except one) and add each dummy individually to the regression model. This way you don't have to manually create the dummy for each category (if you are curious about how to do it manually, take a look at the pd.get_dummies() function).
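For the curious, here is a minimal sketch of what C() builds under the hood, using pd.get_dummies on a made-up buildingStyle column (the style names are hypothetical):

```python
import pandas as pd

# A toy stand-in for the buildingStyle column:
styles = pd.DataFrame({'buildingStyle': ['condo', 'victorian', 'ranch', 'condo']})

# drop_first=True drops one category (the comparison category), just as the
# regression does to avoid perfect multicollinearity.
dummies = pd.get_dummies(styles['buildingStyle'], drop_first=True)
print(dummies)  # one 0/1 column per remaining category
```

Each remaining category gets its own 0/1 column; the dropped category ('condo' here, the first alphabetically) becomes the comparison category.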
Interpretation of Beta Coefficients for models with Dummy Variables
When you add a categorical variable with k > 1 categories to a regression model, you are representing the variable by adding k − 1 dummy variables. One of the categories is excluded; it is called the comparison category (or default category). The interpretation of the intercept and the beta coefficients of the categorical variable is relative to the comparison category. The reason for excluding one category is to avoid a problem called perfect multicollinearity, which is beyond the scope of this course. In short, if there is perfect multicollinearity the beta coefficients cannot be estimated, so we exclude one of the categories.
Intercept: The intercept represents the average of the dependent variable (y) for the comparison category when all the other independent variables are equal to
zero.
Beta Coefficients of a given category: The beta coefficients represent the change in the average of the dependent variable (y) when the category changes from the comparison category to the one associated with the coefficient, while keeping all other independent variables constant.
Example:
Let's start by estimating the following model:
reg1 = smf.ols('price ~ size + C(beacon)', data = br).fit()
reg1.summary()
Beacon is a dummy variable that is 1 when a property is located on Beacon Street and 0 otherwise. The comparison category by default is zero (not on Beacon Street).
Intercept: The average price of properties that are not on Beacon Street is $6,981 when the size is equal to zero. (It makes no sense for the size to be zero.)
Beta Coefficient of Size: An increase of 1 sq. ft. is associated with an average increase of $409.42 in the price, independent of the location relative to Beacon Street (with a dummy variable, this is the same as keeping the other independent variables constant).
Beta Coefficient of Beacon: Properties located on Beacon Street are, on average, $32,935.89 more expensive than properties that are not on Beacon Street, while keeping the size constant.
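As a quick check of this arithmetic, using the coefficients quoted above, the predicted price difference between two same-size properties on and off Beacon Street is exactly the beacon coefficient (the 1,200 sq. ft. size is an arbitrary choice):

```python
# Coefficients quoted in the notes above
b0, b_size, b_beacon = 6981.0, 409.42, 32935.89

def y_hat(size, beacon):
    # Fitted value: price = b0 + b_size*size + b_beacon*beacon
    return b0 + b_size * size + b_beacon * beacon

# Same size, on vs. off Beacon Street:
diff = y_hat(1200, 1) - y_hat(1200, 0)
print(diff)  # approximately 32935.89, the beacon coefficient
```

The size term cancels in the subtraction, so the dummy coefficient is the full predicted gap at any fixed size.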
Displaying Regressions Side-By-Side
It is a good idea to display regression results from different models in a single table; this is especially useful for comparisons among similar regressions.
In order to produce such a table we are going to import a specific function called summary_col from the statsmodels.iolib.summary2 module, like this:
from statsmodels.iolib.summary2 import summary_col
And then use it on a set of estimated regressions, specified as a list:
reg1 = smf.ols('price ~ size', data = br).fit()
reg2 = smf.ols('price ~ size + fullBathrooms', data = br).fit()
reg3 = smf.ols('price ~ size + C(beacon)', data = br).fit()
reg4 = smf.ols('price ~ size + C(beacon) + fullBathrooms', data = br).fit()
print(summary_col([reg1, reg2, reg3, reg4]))
If you read the documentation of summary_col you'll find many options to make the table look exactly how you want. Generally, you want to consider the following options:
summary_col([reg1, reg2, reg3, reg4], stars = True, float_format = '%0.2f', regressor_order = ['Intercept', 'size'])
• stars: Setting this option to True will add stars (*) to identify variables that are statistically significant. No star = not significant, one star = significant at a p-value of 0.1, two stars = significant at a p-value of 0.05 (the default), and three stars = significant at a p-value of 0.01. Therefore, for our applications only variables with two stars or more are statistically significant.
• float_format: Use this to specify the number of decimal places shown: '%0.2f' for two decimal places, '%0.3f' for three, '%0.4f' for four, etc.
• regressor_order: This option determines which variables are shown at the top of the regression table. The variables will be displayed in the same order as in the list. You always want 'Intercept' at the top, so that should be your first entry by default.
It is customary to list models with fewer variables first and models with more variables last.
Exporting Regression Results
Exporting a regression table can be achieved by using the following commands:
# Estimation
reg1 = smf.ols('price ~ size', data = br).fit()
reg2 = smf.ols('price ~ size + fullBathrooms', data = br).fit()
reg3 = smf.ols('price ~ size + C(beacon)', data = br).fit()
reg4 = smf.ols('price ~ size + C(beacon) + fullBathrooms', data = br).fit()
# Producing output table
regTable = summary_col([reg1, reg2, reg3, reg4], regressor_order = ['Intercept', 'size'], stars = True)
# Exporting table to path
path = '/Users/ccasso/Dropbox/BU/Teaching/2022/Fall/BA222/Data/brookline/regResults.csv'
regTable.tables[0].to_csv(path)
The key command is .tables[0].to_csv(path), applied to the result of summary_col(). Make sure you specify a path name different from your data file, so you don't overwrite your data.
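If you also want a plain-text version of the table, the summary_col result exposes .as_text(). Here is a self-contained sketch using a tiny made-up dataset (the data and output file names are hypothetical, standing in for brookline.csv and your own paths):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Tiny synthetic dataset standing in for brookline.csv:
df = pd.DataFrame({'price': [100, 150, 210, 260, 330],
                   'size': [10, 15, 20, 25, 30]})
reg = smf.ols('price ~ size', data=df).fit()
table = summary_col([reg], stars=True)

# .tables[0] is a pandas DataFrame, so any pandas export works:
table.tables[0].to_csv('regResults.csv')   # hypothetical output path
with open('regResults.txt', 'w') as f:     # or keep the printable version
    f.write(table.as_text())
```

The CSV is convenient for further processing in pandas or Excel, while the text version preserves the formatted layout you see when printing the table.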
2023-04-25