
Math 122a

Final project: Option B

Due: Tuesday, 12/19

Part 1: Regression with clustering

Here we use clustering to improve the performance of a regression fit. We’ll use the Boston Housing data available in Python. Recall that the task here is to predict the median house price in various neighborhoods, based on their characteristics. You can import the dataset directly using keras (although you won’t need keras for any other part of this problem):

    from keras.datasets import boston_housing

    (x_train, y_train), (x_test, y_test) = boston_housing.load_data()

The idea of this task is to improve the fit by first clustering the data, and training separate linear models on each cluster (instead of using a single linear model on the entire dataset).
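As a rough illustration of the clustering step in this idea, the following is a minimal sketch of k-means (Lloyd's algorithm) written with NumPy only. It makes several assumptions that are not part of the assignment: the data are synthetic two-dimensional blobs standing in for the housing inputs, the helper names (`nearest_centroid`, `init_centroids`, `kmeans`) are hypothetical, and the initialization is a deterministic farthest-point rule chosen only to keep the sketch reproducible. A library implementation would normally be used instead and may initialize differently.

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Index of the closest centroid (Euclidean distance) for each row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def init_centroids(x, k):
    """Farthest-point initialization (an assumption made for reproducibility)."""
    chosen = [x[0]]
    for _ in range(k - 1):
        dists = np.linalg.norm(x[:, None, :] - np.array(chosen)[None, :, :], axis=2)
        # Pick the point whose nearest chosen centroid is farthest away.
        chosen.append(x[dists.min(axis=1).argmax()])
    return np.array(chosen)

def kmeans(x, k, n_iter=100):
    """Lloyd's algorithm: alternate between assignment and centroid update."""
    centroids = init_centroids(x, k)
    for _ in range(n_iter):
        labels = nearest_centroid(x, centroids)
        new_centroids = np.array([
            # Keep the old centroid if a cluster happens to be empty.
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, nearest_centroid(x, centroids)

# Synthetic stand-in for the inputs: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
blob_id = np.repeat(np.arange(3), 50)
x_toy = centers[blob_id] + rng.normal(size=(150, 2))

# Only the inputs are clustered; no labels are involved.
centroids, labels = kmeans(x_toy, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Because the clusters are defined by centroids, `nearest_centroid(new_x, centroids)` can assign unseen inputs to a cluster without any labels. With matplotlib, a call such as `plt.scatter(x_toy[:, 0], x_toy[:, 1], c=labels)` would draw the first two variables colored by cluster.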

(i) Use ridge regression as a baseline model: train a ridge regression model on the training data, and evaluate the mean squared error on the test data.

(ii) Now use k-means clustering to cluster the training data, using only the inputs x_train for the clustering and not the labels y_train. The reason we cluster using only the inputs is that we want a model that can make predictions based only on test inputs x_test, without first seeing the labels y_test. Use k = 3 clusters. Visualize the clusters by projecting the data onto a plane and using a scatter plot (e.g. by plotting the first two variables of each data point).

(iii) Train three separate ridge regression models T1, T2, T3, one for each cluster, with each model trained using only the data from the corresponding cluster. What is the total mean squared error on the test data, where each test point is assigned to a cluster using the same centroids found for the training data and predicted with that cluster’s model?
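The comparison between a single ridge model and one ridge model per cluster could be sketched as follows. This is an illustration under stated assumptions, not a reference solution: ridge regression is written in closed form with NumPy as a stand-in for a library model, the penalty `alpha=1.0` is an arbitrary choice, the data are synthetic (each group has its own linear relationship), and the centroids are hard-coded as a placeholder for whatever k-means returns on the training inputs. All helper names are hypothetical.

```python
import numpy as np

def fit_ridge(x, y, alpha=1.0):
    """Closed-form ridge fit; the intercept is not penalized."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc = x - x_mean
    gram = xc.T @ xc + alpha * np.eye(x.shape[1])
    w = np.linalg.solve(gram, xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def predict(model, x):
    w, b = model
    return x @ w + b

def nearest_centroid(x, centroids):
    """Index of the closest centroid (Euclidean distance) for each row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Synthetic stand-in data: three groups, each with a different linear rule.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
slopes = np.array([[1.0, 2.0], [-3.0, 0.5], [4.0, -2.0]])

def make_data(n_per_group):
    ids = np.repeat(np.arange(3), n_per_group)
    x = centers[ids] + rng.normal(size=(len(ids), 2))
    y = np.sum(x * slopes[ids], axis=1) + 0.1 * rng.normal(size=len(ids))
    return x, y

x_tr, y_tr = make_data(100)
x_te, y_te = make_data(30)

# Baseline: one ridge model for the whole training set.
baseline_mse = mse(y_te, predict(fit_ridge(x_tr, y_tr), x_te))

# Assumption: centroids were already found by k-means on x_tr.
# They are hard-coded here as a placeholder for that output.
centroids = centers.copy()
train_labels = nearest_centroid(x_tr, centroids)
test_labels = nearest_centroid(x_te, centroids)  # same centroids, no test labels

# One ridge model per cluster, trained only on that cluster's points.
models = [
    fit_ridge(x_tr[train_labels == j], y_tr[train_labels == j])
    for j in range(len(centroids))
]

# Each test point is predicted by the model of its assigned cluster,
# and the squared errors are pooled over all test points.
y_pred = np.empty_like(y_te)
for j, model in enumerate(models):
    mask = test_labels == j
    y_pred[mask] = predict(model, x_te[mask])
clustered_mse = mse(y_te, y_pred)

print(f"baseline MSE: {baseline_mse:.3f}, clustered MSE: {clustered_mse:.3f}")
```

The pooled error computed here equals the average of the per-cluster mean squared errors weighted by the number of test points in each cluster. On real data the gap between the two numbers is not guaranteed to favor the clustered models. Both k-means and the ridge penalty are sensitive to feature scale, so standardizing the inputs first may change the outcome.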

Part 2: Open-ended exploration

Go beyond your findings in Part 1 to explore a question of interest to your group. For example, you could apply the method to a different dataset or propose a modified approach and compare your results. Prepare a short (4-8 minute) video sharing your findings. No particular format is required—be creative and try to have fun!