
# Load the main packages required for this assignment


import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

Fundamentals of Statistics

In this notebook, you review a dataset about house prices in a US county. You build a normal linear regression model to explain and predict house prices using suitable predictor variables, then check the model assumptions and the goodness of fit.

Data

This dataset contains house sale prices for King County (link). "King County is located in the U.S. state of Washington. The population was 2,269,675 in the 2020 census, making it the most populous county in Washington, and the 13th-most populous in the United States. The county seat is Seattle, also the state's most populous city." The dataset includes information on houses sold between May 2014 and May 2015 in this county.

The dataset variables are described below, but we will not be using all of them:

· id - Unique ID for each home sold

· date - Date of the home sale

· price - Price of each home sold, in dollars (response variable)

· bedrooms - Number of bedrooms

· bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower

· sqft_living - Square footage of the apartment's interior living space

· sqft_lot - Square footage of the land space

· floors - Number of floors

· view - An index from 0 to 4 of how good the view of the property was

· condition - An index from 1 to 5 on the condition of the apartment

· grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.

· sqft_above - The square footage of the interior housing space that is above ground level

· yr_built - The year the house was initially built

· sqft_basement - The square footage of the interior housing space that is below ground level

· waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not

· yr_renovated - The year of the house’s last renovation

· zipcode - What zipcode area the house is in

· lat - Latitude

· long - Longitude

· sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors

· sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

Read in the dataset by running the following code.

In [ ]:

housedata = pd.read_csv("kc_house_data.csv")

housedata

In [ ]:

#See the type, values and range of variables:

housedata.describe()

Run the code below

It adds a new column, Age, to the dataset that holds the age of each house in 2015. The age is computed as the latest yr_built value in the dataset minus each house's yr_built.

In [ ]:

housedata['Age'] = housedata['yr_built'].max() - housedata['yr_built']

housedata['Age'].describe()
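The age calculation above can be sketched on a small toy table. The construction years below are made up for illustration and are not taken from kc_house_data.csv, and the fixed reference year of 2015 is an assumption based on the sale period described earlier. The two versions agree only when the newest house in the data was built in the reference year.

```python
import pandas as pd

# Toy data: hypothetical construction years, not from kc_house_data.csv
toy = pd.DataFrame({"yr_built": [1955, 1987, 2003, 2015]})

# Same idea as the cell above: age relative to the newest house in the data
toy["Age"] = toy["yr_built"].max() - toy["yr_built"]

# Alternative: age relative to a fixed reference year (assumed to be 2015)
REFERENCE_YEAR = 2015
toy["Age_fixed"] = REFERENCE_YEAR - toy["yr_built"]

print(toy)
```

Using a fixed reference year makes the intent explicit and does not change if the data is later filtered.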

Run the code below

· It makes a new variable/column named logprice that stores the natural logarithm of the price variable/column in the dataset.

· It plots two histograms for price and logprice. The command fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,6)) is used to present the two histograms side by side.

In [ ]:

housedata["logprice"] = np.log(housedata["price"])


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,6))

fig.suptitle('Histograms of Price and Log-price')

ax1.hist(housedata["price"], bins=20)

ax1.set_xlabel('price')

ax2.hist(housedata["logprice"], bins=20)

ax2.set_xlabel('log-price')

plt.show()
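Alongside the histograms, the shape of a distribution can also be summarised numerically with the sample skewness. The sketch below is only an illustration on simulated data: the prices are drawn from a log-normal distribution with arbitrarily chosen parameters, and the names toy_price and toy_logprice are hypothetical, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Simulated, hypothetical prices (log-normal); not the King County data
rng = np.random.default_rng(0)
toy_price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))
toy_logprice = np.log(toy_price)

# Sample skewness: near 0 for a symmetric distribution,
# positive when the long tail is on the right
skew_price = toy_price.skew()
skew_logprice = toy_logprice.skew()

print(f"skewness of price:     {skew_price:.2f}")
print(f"skewness of log-price: {skew_logprice:.2f}")
```

The same two calls to `.skew()` could be applied to housedata["price"] and housedata["logprice"] to complement the visual comparison.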

Question 1 (10 marks)

We want to fit a normal linear regression model to this data in which the response variable is the house price. After looking at the plots above, we decide to use logprice as the response variable of the model instead of price. Why?

1. Because the price numbers are very large and it is better to use smaller numbers.

2. Because the response variable must be approximately normally distributed and bell shaped. So we can fit the model using the transformed new response variable logprice.

3. Because the distribution of the price values is left-skewed and not bell-shaped.

4. Because the logarithms of the prices are more informative in the model than the price values.

In [ ]:

# YOUR CODE HERE

#raise NotImplementedError()


# assign the number (1,2,3,4) of the correct option to object answer_q1; for example answer_q1 = 4

# the solution doesn't need any code


answer_q1 =


In [ ]:

assert isinstance(answer_q1, int)


Run the code below

· It makes a smaller data frame with only some of the dataset variables. It includes the response variable logprice and other variables that we are interested in and want to consider as "predictors" in a normal linear regression model.

· It calculates the Pearson correlation between every pair of those variables and plots the absolute values in a "heatmap".

Read the correlation between sqft_living and sqft_above. These two predictors have a very strong linear correlation and provide almost the same information (when one of them is high, the other is high too, and vice versa), so we decide to use only one of them in building our model (let's say sqft_living).

In [ ]:

# make a subset of data

housedata_vars = housedata[["logprice", "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",

"view", "condition", "grade", "sqft_above","sqft_living15", "Age"]]

# calculate correlations

corr = housedata_vars.corr().abs()

#plotting a heatmap

fig, ax=plt.subplots(figsize=(10,6))

fig.suptitle('Variables Correlations')

sns.heatmap(corr, cmap="Blues", xticklabels=corr.columns, yticklabels=corr.columns, annot=True)

plt.show()
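Instead of reading strongly correlated pairs off the heatmap by eye, they can also be listed programmatically. The following is a minimal sketch on simulated data, assuming a cut-off of 0.8 for "strong" correlation. The cut-off, the toy columns, and the simulated values are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Simulated toy data: sqft_above is built to track sqft_living closely,
# while sqft_lot is generated independently
rng = np.random.default_rng(1)
n = 500
sqft_living = rng.normal(2000, 500, n)
toy = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": 0.85 * sqft_living + rng.normal(0, 100, n),
    "sqft_lot": rng.normal(8000, 2000, n),
})

# Absolute Pearson correlations, as in the heatmap cell
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once
# and the diagonal (self-correlation) is excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

THRESHOLD = 0.8  # assumed cut-off for a "strong" correlation
strong_pairs = pairs[pairs > THRESHOLD]
print(strong_pairs)
```

Applied to housedata_vars, the same steps would return every pair of columns whose absolute correlation exceeds the chosen cut-off.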

Run the code below

It makes scatterplots of the response variable and some of the quantitative/discrete value predictors.

In [ ]:

sns.pairplot(data=housedata,

y_vars=['logprice'],

x_vars=["sqft_living", "sqft_lot", "sqft_living15", "Age"])

plt.show()

Question 2 (10 marks)

Which variable has the strongest linear relationship with the response variable?

1. sqft_living

2. sqft_lot

3. sqft_living15

4. Age

In [ ]:

# YOUR CODE HERE

#raise NotImplementedError()


# assign the number (1,2,3,4) of the correct option to object answer_q2; for example answer_q2 = 4

# the solution doesn't need any code


answer_q2 =



In [ ]:

assert isinstance(answer_q2, int)


Run the code below

It makes boxplots of some of the categorical/ordinal variables of interest. The y axes show the values of the response variable, and the x axes show the categories of each predictor.

In [ ]:

fig, axs = plt.subplots(ncols=6, figsize=(30,6))

sns.boxplot(x="floors", y="logprice", data=housedata, ax=axs[0])

sns.boxplot(x="view", y="logprice", data=housedata, ax=axs[1])

sns.boxplot(x="condition", y="logprice", data=housedata, ax=axs[2])

sns.boxplot(x="grade", y="logprice", data=housedata, ax=axs[3])

sns.boxplot(x="bedrooms", y="logprice", data=housedata, ax=axs[4])

sns.boxplot(x="bathrooms", y="logprice", data=housedata, ax=axs[5])

plt.show()


Question 3 (10 marks)

Which two categorical variables have the weakest linear relationships with the response?

1. "condition", "view"

2. "bedrooms", "grade"

3. "view", "bathrooms"

4. "floors", "condition"

In [ ]:

# YOUR CODE HERE

#raise NotImplementedError()


# assign the number (1,2,3,4) of the correct option to object answer_q3; for example answer_q3 = 4

# the solution doesn't need any code


answer_q3 =



In [ ]:

assert isinstance(answer_q3, int)


Run the code below

The predictors sqft_living and sqft_living15 have observed values that are much larger than the response values, which can cause numerical issues in the regression model. So we transform them too, by taking their logarithms and saving the new variables in the dataset. (There are several other transformations we could try in such a situation. For example, we could standardise them by subtracting the mean and dividing by the standard deviation, or centre them by subtracting the mean.)

In [ ]:

housedata["log_sqft_living"] = np.log(housedata["sqft_living"])

housedata["log_sqft_living15"] = np.log(housedata["sqft_living15"])
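The alternative transformations mentioned above (centring and standardising) can be sketched as follows. The values are a made-up toy column rather than the real sqft_living data, and the new column names are hypothetical. Note that pandas' `.std()` uses the sample standard deviation (ddof=1) by default.

```python
import numpy as np
import pandas as pd

# Toy values standing in for sqft_living; not taken from the dataset
toy = pd.DataFrame({"sqft_living": [1000.0, 1500.0, 2000.0, 3500.0]})

# Log transform, as used in this notebook
toy["log_sqft_living"] = np.log(toy["sqft_living"])

# Centring: subtract the mean, so the new column has mean 0
toy["centred"] = toy["sqft_living"] - toy["sqft_living"].mean()

# Standardising: centre, then divide by the standard deviation,
# so the new column has mean 0 and standard deviation 1
toy["standardised"] = toy["centred"] / toy["sqft_living"].std()

print(toy.round(3))
```

Centring and standardising keep the relationship with the response linear in the original units, whereas the log transform changes the shape of the relationship, so the choice affects how the coefficients are interpreted.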


Run the code below

· It fits a normal linear regression model using the ols function, in which logprice is the dependent response variable and log_sqft_living, log_sqft_living15, view, grade, bedrooms, bathrooms are the independent predictors.

· The model is named reg_model.

· See the summary of the model's output.

In [ ]:

from statsmodels.formula.api import ols

reg_model = ols('logprice ~ log_sqft_living + log_sqft_living15 + view + grade + bedrooms + bathrooms', data=housedata).fit()

reg_model.summary()

Run the code below

This code

· makes a QQ plot of the model residuals.

· makes a plot of the model residuals against the fitted/predicted values.

In [ ]:

# qqplot

from statsmodels.api import qqplot

qqplot(data=reg_model.resid, fit=True, line="45")

plt.show()

In [ ]:

# this runs a bit slowly because of the large sample of the data

# residuals/fitted values plot

plt.figure(figsize=(8,5))

sns.residplot(x=reg_model.fittedvalues, y=reg_model.resid, lowess=True, line_kws={'color': 'red'})

plt.title('Residuals Scatterplot')

plt.xlabel("Fitted values")

plt.ylabel("Residuals")

plt.show()


Question 4 (10 marks)

What does the plot of residuals against fitted values tell us about the model?

1. The model's residuals are normally distributed.

2. According to the random pattern of the plot, the linearity assumption is not violated.

3. The average value of the residuals is negative.

4. The variation of the residuals is larger for bigger fitted values.

In [ ]:

# YOUR CODE HERE

#raise NotImplementedError()


# assign the number (1,2,3,4) of the correct option to object answer_q4; for example answer_q4 = 4

# the solution doesn't need any code


answer_q4 =


In [ ]:

assert isinstance(answer_q4, int)


Question 5 (10 marks)

What is a typical error (or difference between a prediction and an observed response) in this model? Write the answer with four decimal places.

In [ ]:

# YOUR CODE HERE

raise NotImplementedError()



#  assign the value to answer_q5; for example answer_q5 = 106.1234


answer_q5 =


In [ ]:

assert isinstance(answer_q5, float)



Run the code below

It calculates the means of log_sqft_living and log_sqft_living15 and saves them.

In [ ]:

mean_log_sqft_living= np.mean(housedata["log_sqft_living"])

mean_log_sqft_living15= np.mean(housedata["log_sqft_living15"])

Question 6 (20 marks)

We would like to use this model to predict the price of an almost average house in this area. Apply the fitted model and predict the price of a house that has

· log_sqft_living = mean_log_sqft_living

· log_sqft_living15 = mean_log_sqft_living15

· view = 2

· grade = 8

· bedrooms = 3

· bathrooms = 2

Write your code below and print out the final prediction for the house price in dollars.

In [ ]:

# YOUR CODE HERE

raise NotImplementedError()



Question 7 (30 marks)

Write one paragraph (not more than 250 words) and explain your understanding and insights from this model about house prices in this area and how trustworthy they are. Type your text in a new markdown cell below.

In [ ]: