Empirical Exercise 1: Real Estate Valuation Using Regression Analysis
Optional background reading:
1. The following papers walk through cases that contrast regression analysis with the use of more manual market comparables, when valuing a real estate property. The first paper also contains helpful information on how to implement a multiple regression analysis and an appendix on the basics of what a regression “does” with the data. If you are not very familiar with regression analysis, you should read them to prepare for the assignment below:
• Benjamin, J., Guttery, R., & Sirmans, C. F. (2004). Mass appraisal: An introduction to multiple regression analysis for real estate valuation. Journal of Real Estate Practice and Education.
• Frew, J., & Wilson, B. (2002). Estimating the connection between location and property value. Journal of Real Estate Practice and Education.
2. Moreover, the natural language processing task borrows estimated “values” of different words from the following paper, which applies different machine learning techniques to listing text for houses in Los Angeles:
• Baur, K., Rosenfelder, M., & Lutz, B. (2023). Automated real estate valuation with machine learning models using property descriptions. Expert Systems with Applications, 213, 119147.
3. If you are interested in further applications of state-of-the-art text-based valuation methods, the following paper shows an interesting application of text-based analysis to measuring a home’s “uniqueness”:
• Shen & Ross (2021). Information value of property description. Journal of Urban Economics.
Task Overview:
You are an associate at a real estate fund that is focused on buying condos and single-family homes and selling them after renovations and upgrading amenities. Two opportunities have come along for buying properties in New York and Los Angeles that - so you’ve been told - may be underpriced relative to the market right now.
Your fund emphasizes a data-driven approach and you want to come up with a rough estimate for what the current market value of those units would be, based on what other properties are selling for with key characteristics similar to those of your units. That is, you want to estimate the willingness to pay for valuable aspects of the units (# of bedrooms, commuting distance etc.) to get a first-pass estimate of the price that your fund should value the properties at.
You are working with data from Redfin, which was scraped from publicly available listings on the website on January 16, 2022 (see the Excel files provided with this assignment).
The characteristics of the units that you are considering are as follows:
• Los Angeles:
o Distance from Los Angeles City Hall: 12.45 miles
o Type: Single-family detached
o Bedrooms: 4
o Bathrooms: 4
o Living area: 2451 sqft
o Age: 84
• NYC:
o Distance from New York City Hall: 7.42 miles
o Type: Single-family detached
o Bedrooms: 3
o Bathrooms: 3
o Living area: 2006 sqft
o Age: 24
Please respond to the questions below in a group, using a slide presentation. You should submit a joint set of slides for your group to both Eli Wilson ([email protected]) and Matt Shintaku ([email protected]) by Tuesday, January 24th, at 5pm (please note the names of your group members and your class section on the first slide). Every member of your group should feel comfortable presenting or explaining your slides in class!
Data description:
The variables included in the Redfin data set that describe each property are labelled with self-explanatory headers, but here are some clarifications:
• price: price of the unit in US$
• square feet: total interior livable area in sqft
• distance: distance in miles from city hall
• beds: number of bedrooms
• description: listing text
• age: age of structure measured in years
Detailed tasks:
1) Data preparation:
a) Open up the data sets and inspect the data. Check that you understand the format of all variables. If you want to see context for how these variables appear on Redfin, you can click on the included links to see the original listing (if it is still active).
b) Compute summary characteristics of houses in your sample (e.g. avg., largest/smallest values). Compare the overall characteristics of NYC and LA: What do you notice about average differences in the characteristics of the houses on the market in the two cities?
c) Based on these summary statistics, clean or limit the data (if necessary) to obtain a regression sample that is useful for estimating the value of different housing characteristics. Document and justify your data cleaning choices (even if you choose “none”).
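As a minimal sketch of steps b) and c), the snippet below computes per-city summary statistics and applies one possible cleaning rule. The rows are made up for illustration, and the thresholds `min_price` and `min_sqft` are assumptions, not recommended values; any rule you actually use should be justified from your own summary statistics.

```python
import statistics

# Made-up rows standing in for the Redfin data; column names follow the data description.
listings = [
    {"city": "LA", "price": 950_000, "square feet": 1800, "beds": 3, "distance": 10.2},
    {"city": "LA", "price": 1_450_000, "square feet": 2500, "beds": 4, "distance": 6.1},
    {"city": "LA", "price": 1, "square feet": 1200, "beds": 2, "distance": 15.0},      # implausible price
    {"city": "NYC", "price": 780_000, "square feet": 1100, "beds": 2, "distance": 5.5},
    {"city": "NYC", "price": 1_200_000, "square feet": 0, "beds": 3, "distance": 8.0},  # area recorded as 0
    {"city": "NYC", "price": 640_000, "square feet": 900, "beds": 1, "distance": 9.3},
]

def summarize(rows, column):
    """Return count, mean, min, and max of one column."""
    values = [row[column] for row in rows]
    return {"n": len(values), "mean": statistics.mean(values),
            "min": min(values), "max": max(values)}

def clean(rows, min_price=10_000, min_sqft=100):
    """Drop rows with an implausible price or living area (illustrative thresholds)."""
    return [r for r in rows if r["price"] >= min_price and r["square feet"] >= min_sqft]

cleaned = clean(listings)
by_city = {}
for city in ("LA", "NYC"):
    rows = [r for r in cleaned if r["city"] == city]
    by_city[city] = {col: summarize(rows, col)
                     for col in ("price", "square feet", "beds", "distance")}
    print(city, by_city[city]["price"])
```

Comparing the summaries before and after cleaning is one way to document how much each rule changes the sample.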
2) Descriptive data exploration:
a) Plot house price per square foot in each city as a function of distance from City Hall (use a scatter plot with a line of best fit to show the pattern). What do you notice? How do the patterns in the two cities differ?
(Note: You may have to sort the data first by distance before plotting)
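One way to obtain the line of best fit for the scatter plot is sketched below with NumPy. The numbers are invented for illustration; the drawing itself is left to whichever tool you use (Excel, R, Python, etc.).

```python
import numpy as np

# Made-up values for one city, for illustration only.
distance = np.array([10.0, 2.0, 6.0, 4.0, 8.0])              # miles from City Hall
price = np.array([600_000, 1_200_000, 900_000, 1_000_000, 700_000], dtype=float)
sqft = np.array([1000, 1000, 1000, 1000, 1000], dtype=float)

price_per_sqft = price / sqft

# Sort by distance so the fitted line is drawn from left to right.
order = np.argsort(distance)
distance, price_per_sqft = distance[order], price_per_sqft[order]

# A degree-1 polynomial fit is the least-squares line of best fit.
slope, intercept = np.polyfit(distance, price_per_sqft, deg=1)
fitted = intercept + slope * distance   # y-values of the line at each distance

print(f"price per sqft changes by about {slope:.1f} US$ per extra mile")
```

Running the same steps separately for each city gives two slopes that can be compared directly.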
3) Regression:
a) To quantify the relationship between commuting distance to City Hall and house prices, we are going to estimate a house price gradient. We assume that the total house price $P_i$ of property $i$ is a function of its characteristics $X_i$ of the following form:
$$\ln P_i = \alpha_{city} + \beta_{city} X_i + \varepsilon_i$$
where $\varepsilon_i$ is an unobserved error term, $\alpha_{city}$ represents different average price levels in the two cities, and $\beta_{city}$ captures the (semi-)elasticity of prices with regard to the different housing characteristics, that is, the approximate % response of house prices to a 1 unit change in each characteristic, holding the other characteristics included in the estimation constant. For a characteristic that enters in logs, the coefficient is the % response to a 1% change. Because the equation is linear in its parameters, we can estimate it using ordinary least squares (OLS).
Use a program of your choice, e.g. Excel, to estimate this relationship between house prices and housing characteristics in each city. You can include any characteristics $X_i$ of your choice, but should at least estimate the effects of distance to City Hall, # of bedrooms, and # of bathrooms on the house price, and report your estimates of the coefficients $\beta_{city}$, as well as their standard errors. Try at least two different sets of characteristics. You can even get creative by interacting variables (i.e. estimate the effect of $X_j \times X_m$ on house prices). Report your estimates.
Notes:
• Don’t forget to apply logarithmic transformations before your estimation where appropriate: ask yourself if you think that a one unit change in the variable would affect prices proportionately (e.g. 1 bathroom -> X% higher price) or whether it should be that a proportional change has a proportional effect (e.g. Y% more bathrooms -> X% higher price). In particular, distance should enter the regression as log(distance).
• Use total house prices – not per square foot
• You can just show screenshots of regression output tables the way they are produced by Excel, for example, or your program of choice (R, Python, etc.). No need to waste time making these look pretty – it’s the insights that matter.
(Optional question: Why are we estimating this equation using the logarithm of price rather than the level?)
b) Comment on what we learn from these estimates, i.e. how large is the price response to these characteristics in the two different cities? Use your casual knowledge of these cities, intuition about commuting costs and modes of transportation, or a bird’s eye view of the cities from Google Maps to explain the patterns: why might prices of characteristics differ between these cities in the way you observe? Provide only a short explanation of your best guess – we will discuss this topic further in class.
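For groups working in Python, the estimation in 3a) might look roughly like the sketch below. It runs OLS of log price on log(distance), bedrooms, and bathrooms for a single city and reports coefficients with classical (homoskedastic) standard errors. The data are simulated and the "true" coefficients are invented; the helper name `ols` is an assumption of this sketch, not part of the assignment.

```python
import numpy as np

def ols(y, X):
    """OLS with an intercept: returns coefficients, standard errors, adjusted R^2."""
    X = np.column_stack([np.ones(len(y)), X])       # first column is the constant
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)                # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)           # classical covariance matrix
    se = np.sqrt(np.diag(cov))
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return beta, se, adj_r2

# Simulated listings for one city; these coefficients are invented, not estimates.
rng = np.random.default_rng(0)
n = 500
distance = rng.uniform(1, 30, n)
beds = rng.integers(1, 6, n)
baths = rng.integers(1, 5, n)
log_price = (13.0 - 0.30 * np.log(distance) + 0.15 * beds + 0.10 * baths
             + rng.normal(0, 0.2, n))

# Distance enters in logs; beds and baths enter in levels.
X = np.column_stack([np.log(distance), beds, baths])
beta, se, adj_r2 = ols(log_price, X)

for name, b, s in zip(["const", "log(distance)", "beds", "baths"], beta, se):
    print(f"{name:>14}: {b: .3f} (se {s:.3f})")
print(f"adjusted R^2: {adj_r2:.3f}")
```

Running the function once per city yields a separate constant and set of slopes for each. An interaction term can be added as one more column, for example `beds * baths`.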
4) Out-of-sample price predictions:
a) Based on your preferred regression model, come up with a rough estimate of how much the two properties listed above should cost. That is, use your estimated effects and compute the predicted price for each property by summing over the value of its particular characteristics (in the right functional form!). Don’t forget to take into account any log transformations, or the constant $\alpha_{city}$. Come up with numbers in US$ for each property.
b) Also value a property of your choice in L.A. using this method, for which you know the approximate price (based on a recent transaction price known to you, or a listing on Redfin not in this data). Preferably choose a property where you feel comfortable providing pictures or a snapshot of the property’s Redfin or Zillow profile for presentation in class. How does this method do at valuing your chosen property? Why might this regression method not yield the right price? What do you think is missing from the valuation with regard to the property that you chose? What other public information could be used to get better at estimating the value of your chosen property?
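The prediction step in 4a) can be sketched as follows. The coefficients are hypothetical placeholders, not estimates from the Redfin data, and the sketch assumes a model with log(distance), bedrooms, and bathrooms only. In practice you would plug in your own estimates, separately for each city.

```python
import math

# Hypothetical coefficients from a regression of ln(price); NOT real estimates.
la_coefs = {"const": 13.0, "log_distance": -0.30, "beds": 0.15, "baths": 0.10}

def predict_price(coefs, distance_miles, beds, baths):
    """Sum the contributions on the log scale, then undo the log with exp()."""
    log_price = (coefs["const"]
                 + coefs["log_distance"] * math.log(distance_miles)  # distance enters in logs
                 + coefs["beds"] * beds
                 + coefs["baths"] * baths)
    return math.exp(log_price)

# The LA property described in the task overview.
la_price = predict_price(la_coefs, distance_miles=12.45, beds=4, baths=4)
print(f"Predicted LA price: US${la_price:,.0f}")
```

The same function applies to the NYC property once a second coefficient dictionary holds the NYC estimates.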
5) Text analysis:
a) The rise of machine learning models has enabled the use of high-dimensional data, such as information based on listing text, in pricing models. While you do not have time to estimate a full natural language processing model that transforms the listing text into semantic tokens using transformers, you will build on existing research that has done so. The appendix to the Baur et al. (2023) paper cited above provides estimates of the most important words in L.A. house listings when trying to predict house prices. You are provided with a copy of these words and their estimated values in the second tab of the LA data file. Moreover, some of the LA listings have listing text associated with them. Where applicable, set up your data so as to automatically count which, if any, of these “important” words are contained in the provided descriptions.
b) For each listing with a listing text, create an overall “text information” score based on the contained “important” words (take into account their positive / negative valence!) in whatever way you like – but justify your choice.
c) Re-estimate your preferred regression model from above, but now including your “text information” variable as a predictor. Does the information from the text help to improve the fit of the model?
d) Suggest some ways to improve this text analysis – what do you know or notice about the provided listings that could be useful in predicting prices (e.g. particular word usage for expensive properties)? Optionally, implement your suggestion to see if it improves the usefulness of the text information score.
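A minimal sketch of steps 5a) and 5b) is shown below. The word list and its values are hypothetical stand-ins for the second tab of the LA data file, and summing the values of the distinct words present is only one of many defensible scoring rules.

```python
import re

# Hypothetical word values; the real list is in the second tab of the LA data file.
word_values = {"pool": 0.05, "remodeled": 0.03, "fixer": -0.08, "view": 0.04}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_important_words(description):
    """Count how often each important word appears in one description."""
    counts = {}
    for token in tokenize(description):
        if token in word_values:
            counts[token] = counts.get(token, 0) + 1
    return counts

def text_score(description):
    """Sum the values of the distinct important words (each word counted once)."""
    return sum(word_values[word] for word in count_important_words(description))

description = "Remodeled kitchen, sparkling pool, and a pool house with a view."
counts = count_important_words(description)
score = text_score(description)
print(counts, round(score, 2))
```

Counting each word once avoids rewarding repetition. Weighting by frequency instead would mean multiplying each value by its count. Either way, the resulting score can be added as one more regressor in step c) for the listings that have text.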
2023-01-20