BA222 - Lecture Notes 10: Multivariate Regression Models
Table of Contents
• Introduction
• Multivariate Regression Models
• Estimation in Python
• Interpretation of Beta Coefficients
• Controlling for Other Factors
• Dummy Variables
• Interpretation of Beta Coefficients for models with Dummy Variables
• Displaying Regressions Side-By-Side
• Exporting Regression Results
Introduction
In the previous lecture notes we introduced the linear regression model as a method to quantify the relation between two variables. Regression models are used for making policy decisions and forecasting.
The linear regression model differs from a simple linear equation because it includes an error term, which represents all factors other than x that are related to y.
In these notes you'll learn how to estimate regression models with more than one independent variable (more than one x variable). There are two main reasons to add more than one independent variable to the model:
• Forecasting: Generally, adding more variables to a regression model improves its predictive power. We'll learn how to choose which variables improve predictive power and which don't.
• Policy Making: Adding more variables to a regression model can reduce the bias that can be present in the estimation of key beta coefficients used for policy making. The reason why the beta coefficients may be biased has to do with the concept of confounding variables, which we'll discuss in detail next week.
In practice, regression models with a single x variable are seldom useful and are generally used as an exploratory first step in estimating a model. As a rule, you should always question statistical results derived from models specified in this way.
Multivariate Regression Models
A multivariate regression model is a regression model that includes more than one independent variable. Formally,
y = β0 + β1 x1 + β2 x2 + … + βk xk + error
Where x1, x2, …, xk are the independent variables, β0 is the intercept, and β1, β2, …, βk are the beta coefficients associated with each respective independent variable.
In practice, we use multivariate regression models for two independent purposes:
• Improve the predictive power of a univariate regression model
• Reduce bias (more of this in the next lectures)
Estimation in Python
Estimating a multivariate regression model in Python is as simple as estimating a univariate model. As usual, start by loading packages and data. For this example we'll use the brookline.csv data:
import pandas as pd
import statsmodels.formula.api as smf
path = '/Users/ccasso/Dropbox/BU/Teaching/2022/Fall/BA222/Data/brookline/brookline.csv'
br = pd.read_csv(path)
The only difference with respect to the estimation procedure of a univariate regression model is that in the formula term of the smf.ols() function we'll include more than one independent variable, separating them with the + symbol. For instance:
reg1 = smf.ols('price ~ size', data = br).fit()
reg2 = smf.ols('price ~ size + bedrooms', data = br).fit()
The first model, reg1, is simply a univariate regression model where price is the dependent variable and size is the independent variable.
price = β0 + β1 size + error
The second model, reg2, is a multivariate regression model where price is the dependent variable and size and bedrooms are the independent variables.
price = β0 + β1 size + β2 bedrooms + error
For multivariate models, you can extract the values of the beta coefficients, fitted values, residuals, standard errors, and R², and produce the regression summary table in the same way as for univariate regression models.
# Extracting parameters
reg1.params
reg2.params
# Extracting standard errors
reg1.bse
reg2.bse
# Extracting r-squared
reg1.rsquared
reg2.rsquared
# Extracting fitted values
reg1.fittedvalues
reg2.fittedvalues
# Extracting residuals
reg1.resid
reg2.resid
# Extracting regression summary table
reg1.summary()
reg2.summary()
Interpretation of Beta Coefficients
In a multivariate regression model, the interpretation of the beta coefficients changes in an important way.
Intercept: The intercept (β0) of a multivariate regression represents the average of the dependent variable (y) when all the independent variables (x1, x2, …, xk) are simultaneously equal to zero.
Beta Coefficients: The beta coefficients of the independent variables (βi , with i ≥ 1) represent the average change in the dependent variable (y) when the independent variable i increases by one unit, while keeping the other independent variables constant.
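One way to see the "keeping the other variables constant" interpretation is to compare two fitted values that differ in a single variable. The coefficients below are made-up numbers for illustration, not estimates from the data:

```python
# Hypothetical coefficients (illustrative only, not estimates from brookline.csv)
b0, b_size, b_bedrooms = 10000.0, 400.0, -2000.0

def y_hat(size, bedrooms):
    # Fitted value: price = b0 + b_size*size + b_bedrooms*bedrooms
    return b0 + b_size * size + b_bedrooms * bedrooms

# Increase size by one unit while holding bedrooms constant at 3:
change = y_hat(1001, 3) - y_hat(1000, 3)
print(change)  # 400.0 -- exactly the beta coefficient on size
```

The difference between the two fitted values is exactly the slope on size, because every other term cancels out.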
Let's practice the interpretation of coefficients using reg1 and reg2:
Practice:
1. Estimate the beta coefficients for the regression model reg1 and interpret
the beta coefficients.
2. Estimate the beta coefficients for the regression model reg2 and interpret
the beta coefficients.
For reg1 the interpretation should be:
Intercept: The average price of a property in Brookline is about $12,934.12 when the size is zero sq. ft. This makes no sense, as properties cannot have a size of zero.
Slope: For each additional square foot of size, the average price of a property in Brookline increases by about $407.45.
For reg2 the interpretation should be:
Intercept: The average price of a property in Brookline is about $12,934.12 when the size is zero sq. ft. and the property has no bedrooms. This makes no sense, as properties cannot have a size of zero, and it is unlikely that they have no bedrooms.
Beta Coefficient for Size: For each additional square foot of size, the average price of a property in Brookline increases by about $409.79, while keeping the number of bedrooms constant.
Beta Coefficient for Bedrooms: For each additional bedroom, the average price of a property in Brookline decreases by about $1,894.33, while keeping the size constant.
Note that the interpretation for the number of bedrooms may be counter-intuitive: having more bedrooms reduces the price? Your intuition may tell you it should be the opposite. Keep in mind that a multivariate regression model estimates the effect on the dependent variable while keeping the other independent variables constant. That is, among apartments of the same size, adding more bedrooms is associated with a lower price.
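This "same size, different bedrooms" comparison can be sketched in code. The dataset below is synthetic (brookline.csv is not reproduced here), generated so that, at a fixed size, extra bedrooms lower the price:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for brookline.csv; the coefficients used to generate
# prices (400 per sq ft, -2000 per bedroom) are made-up numbers.
rng = np.random.default_rng(0)
n = 300
size = rng.uniform(500, 3000, n)
bedrooms = rng.integers(1, 6, n)
price = 10000 + 400 * size - 2000 * bedrooms + rng.normal(0, 5000, n)
df = pd.DataFrame({'price': price, 'size': size, 'bedrooms': bedrooms})

reg = smf.ols('price ~ size + bedrooms', data=df).fit()

# Two apartments of identical size, differing only in the number of bedrooms:
new = pd.DataFrame({'size': [1500, 1500], 'bedrooms': [2, 3]})
pred = reg.predict(new)
print(pred[1] - pred[0])  # equals the bedrooms coefficient (negative here)
```

The predicted price gap between the two same-size apartments is exactly the estimated bedrooms coefficient, which is the "holding size constant" comparison in action.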
Controlling for Other Factors
The great benefit of a multivariate regression model is that it allows the analyst to isolate the effect of a particular x variable on y while controlling for other variables.
Why is it important to control for other variables?
Imagine that we want to know if people are willing to pay more for properties located on Beacon Street using the brookline.csv data.
Let's start by looking at the results of the univariate regression model:
reg1 = smf.ols('price ~ beacon', data = br)
results = reg1.fit()
results.summary()
The interpretation of the slope coefficient is that "properties located on Beacon Street, relative to properties not on Beacon Street, sell on average for about $47,000 less". One may think that this is the effect on price of having a property located on Beacon Street. In other words, one may incorrectly interpret this result as the causal effect of Beacon Street on price. What the regression model reveals is that people are paying, on average, less for properties on Beacon Street, not that the location is the reason for the price difference. From the regression we know that location and price are statistically associated; we don't know if location is the cause of the price difference.
A counter-argument to the initial interpretation of the beta coefficient as causal effect would be the following: There are other factors that affect the price of properties; size, the style of the house, number of bedrooms, amenities, etc. I would not be surprised to find out that these other factors are also related with location. I can imagine how construction companies, the passage of time and many other factors affect how buildings are constructed. It is very likely that the people and the time in which the buildings on Beacon Street were built may be essentially different from other locations and that's the reason why the buildings are different and people are paying less on average, not the location per se.
If one is interested in knowing whether people are willing to pay more for properties located on Beacon Street, one would need to compare properties that are identical in all aspects related to price except for the location. That is, one would need to compare two apartments of the same size, same number of rooms, bedrooms, etc., one located on Beacon Street and the other located somewhere else, and then compare the prices. In that way we are controlling for other factors and allowing location to be the only determinant of the price difference.
Multivariate regression models allow us to do that by simply adding the other factors as additional independent variables, in which case we call them control variables. For instance, take a look at the results of a multivariate model of beacon on price, controlling for size:
reg2 = smf.ols('price ~ beacon + size', data = br)
results = reg2.fit()
results.summary()
Note the drastic change in the coefficient for beacon compared to the univariate regression model. We observe this big change because we are now estimating the relation between beacon and price while controlling for size. In other words, when estimating the beta coefficient for beacon we are keeping size constant and allowing only beacon to vary. In this way we can guarantee that size is not biasing (affecting) the estimation of the beta coefficient for beacon.
We can control for other factors besides size. We simply need to add the additional controls to the model.
reg3 = smf.ols('price ~ beacon + size + elevators', data = br)
results = reg3.fit()
results.summary()
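The way a control variable changes a coefficient can be illustrated with synthetic data (all the numbers below are made up, not the brookline.csv estimates). Here the dummy is deliberately correlated with size, so the univariate coefficient mixes the location premium with the size effect, while the controlled model recovers the premium built into the simulation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic illustration of confounding: "beacon" homes are built smaller
# on average, and size drives price, so the univariate beacon coefficient
# mixes the two effects. True premium in the simulation: 30000.
rng = np.random.default_rng(1)
n = 500
beacon = rng.integers(0, 2, n)
size = rng.normal(2000 - 400 * beacon, 300, n)   # beacon homes are smaller
price = 5000 + 400 * size + 30000 * beacon + rng.normal(0, 10000, n)
df = pd.DataFrame({'price': price, 'size': size, 'beacon': beacon})

uni = smf.ols('price ~ beacon', data=df).fit()
multi = smf.ols('price ~ beacon + size', data=df).fit()
print(uni.params['beacon'])    # negative: mixes in the size effect
print(multi.params['beacon'])  # close to the simulated premium of 30000
```

Adding size as a control flips the sign of the beacon coefficient in this simulation, mirroring the pattern described above for the real data.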
Next week we'll discuss how to decide which control variables to include and which
to omit.
Dummy Variables
A dummy variable is a variable that is coded as 0 or 1. The specific interpretation of 0 and 1 depends on the variable at hand. For instance, we can use 0 and 1 to represent if the individual has a college degree or not (0 = do not have a college degree, 1 = have a degree).
In a regression model, we represent categorical variables with dummy variables. It is very easy to implement them in Python: you just need to wrap the categorical variable in C() within the formula of the smf.ols() function.
For example, let's run a regression using price as dependent variable and building style as independent variable:
reg1 = smf.ols('price ~ C(buildingStyle)', data = br).fit()
reg1.summary()
Take a look at the output. Each building style category (except for one) has its own beta coefficient (and standard error, t-stat, etc.). What Python does under the hood is create a dummy variable for each category (except one) and add each dummy individually to the regression model. This way you don't have to manually create the dummy for each category (if you are curious about how to do it manually, take a look at the pd.get_dummies() function).
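For the curious, here is a minimal sketch of what C() builds under the hood, using pd.get_dummies on a made-up buildingStyle column (the style names are hypothetical):

```python
import pandas as pd

# A toy stand-in for the buildingStyle column:
styles = pd.DataFrame({'buildingStyle': ['condo', 'victorian', 'ranch', 'condo']})

# drop_first=True drops one category (the comparison category), just as the
# regression does to avoid perfect multicollinearity.
dummies = pd.get_dummies(styles['buildingStyle'], drop_first=True)
print(dummies)  # one 0/1 column per remaining category
```

Each remaining category gets its own 0/1 column; the dropped category ('condo' here, the first alphabetically) becomes the comparison category.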
Interpretation of Beta Coefficients for models with Dummy Variables
When you add a categorical variable with k > 1 categories to a regression model, you are representing the variable by adding k − 1 dummy variables. One of the categories is excluded; it is called the comparison category (or default category). The interpretation of the intercept and the beta coefficients of the categorical variable is relative to the comparison category. The reason for excluding one category is to avoid a problem called perfect multicollinearity, which is beyond the scope of this course. In short, if there is perfect multicollinearity the beta coefficients cannot be estimated, so we exclude one of the categories.
Intercept: The intercept represents the average of the dependent variable (y) for the comparison category when all the other independent variables are equal to
zero.
Beta Coefficients of a given category: The beta coefficients represent the change in the average of the dependent variable (y) when the category changes from the comparison category to the one associated with the coefficient, while keeping all other independent variables constant.
Example:
Let's start by estimating the following model:
reg1 = smf.ols('price ~ size + C(beacon)', data = br).fit()
reg1.summary()
Beacon is a dummy variable that is 1 when a property is located on Beacon Street and 0 otherwise. The comparison category by default is zero (not on Beacon Street).
Intercept: The average price of properties that are not on Beacon Street is $6,981 when the size is equal to zero. (It makes no sense for the size to be zero.)
Beta Coefficient of Size: An increase of 1 sq. ft. is associated with an average increase of $409.42 in the price, independent of the location relative to Beacon Street (with a dummy variable, this is the same as keeping the other independent variables constant).
Beta Coefficient of Beacon: Properties located on Beacon Street are, on average, $32,935.89 more expensive than properties that are not on Beacon Street, while keeping the size constant.
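As a quick check of this arithmetic, using the coefficients quoted above, the predicted price difference between two same-size properties on and off Beacon Street is exactly the beacon coefficient (the 1,200 sq. ft. size is an arbitrary choice):

```python
# Coefficients quoted in the notes above
b0, b_size, b_beacon = 6981.0, 409.42, 32935.89

def y_hat(size, beacon):
    # Fitted value: price = b0 + b_size*size + b_beacon*beacon
    return b0 + b_size * size + b_beacon * beacon

# Same size, on vs. off Beacon Street:
diff = y_hat(1200, 1) - y_hat(1200, 0)
print(diff)  # approximately 32935.89, the beacon coefficient
```

The size term cancels in the subtraction, so the dummy coefficient is the full predicted gap at any fixed size.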
Displaying Regressions Side-By-Side
It is a good idea to display regression results from different models in a single table; this is especially useful for comparisons among similar regressions.
In order to produce such a table we are going to import a specific function called summary_col from the statsmodels.iolib.summary2 module, like this:
from statsmodels.iolib.summary2 import summary_col
And then use it on a set of estimated regressions, specified as a list:
reg1 = smf.ols('price ~ size', data = br).fit()
reg2 = smf.ols('price ~ size + fullBathrooms', data = br).fit()
reg3 = smf.ols('price ~ size + C(beacon)', data = br).fit()
reg4 = smf.ols('price ~ size + C(beacon) + fullBathrooms', data = br).fit()
print(summary_col([reg1, reg2, reg3, reg4]))
If you read the documentation of summary_col you'll find many options to make the table look exactly how you want. Generally, you want to consider the following options:
summary_col([reg1, reg2, reg3, reg4], stars = True, float_format = '%0.2f', regressor_order = ['Intercept', 'size'])
• stars: Setting this option to True will add stars (*) to identify variables that are statistically significant. No star = not significant, one star = significant at a p-value of 0.1, two stars = significant at a p-value of 0.05 (the default), and three stars = significant at a p-value of 0.01. Therefore, for our applications only variables with two stars or more are statistically significant.
• float_format: Use this to specify the number of decimal places shown: '%0.2f' for two decimal places, '%0.3f' for three, '%0.4f' for four, etc.
• regressor_order: This option determines which variables are shown at the top of the regression table. The variables will be displayed in the same order as in the list. You always want 'Intercept' at the top, so that should be your first entry by default.
It is customary to list models with fewer variables first and models with more variables last.
Exporting Regression Results
Exporting a regression table can be achieved by using the following commands:
# Estimation
reg1 = smf.ols('price ~ size', data = br).fit()
reg2 = smf.ols('price ~ size + fullBathrooms', data = br).fit()
reg3 = smf.ols('price ~ size + C(beacon)', data = br).fit()
reg4 = smf.ols('price ~ size + C(beacon) + fullBathrooms', data = br).fit()
# Producing output table
regTable = summary_col([reg1, reg2, reg3, reg4], regressor_order = ['Intercept', 'size'], stars = True)
# Exporting table to path
path = '/Users/ccasso/Dropbox/BU/Teaching/2022/Fall/BA222/Data/brookline/regResults.csv'
regTable.tables[0].to_csv(path)
The key command is .tables[0].to_csv(path), applied to the result of summary_col(). Make sure you specify a path name different from your data file, so you don't overwrite your data.
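If you also want a plain-text version of the table, the summary_col result exposes .as_text(). Here is a self-contained sketch using a tiny made-up dataset (the data and output file names are hypothetical, standing in for brookline.csv and your own paths):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Tiny synthetic dataset standing in for brookline.csv:
df = pd.DataFrame({'price': [100, 150, 210, 260, 330],
                   'size': [10, 15, 20, 25, 30]})
reg = smf.ols('price ~ size', data=df).fit()
table = summary_col([reg], stars=True)

# .tables[0] is a pandas DataFrame, so any pandas export works:
table.tables[0].to_csv('regResults.csv')   # hypothetical output path
with open('regResults.txt', 'w') as f:     # or keep the printable version
    f.write(table.as_text())
```

The CSV is convenient for further processing in pandas or Excel, while the text version preserves the formatted layout you see when printing the table.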
2023-04-25