
COM4509/6509 Coursework Part 1


Hello! This is the first of the two parts. Each part accounts for 50% of the overall coursework mark.

What to submit

•     You need to submit two Jupyter notebooks (not zipped together), named:

assignment_part1_[username].ipynb

assignment_part2_[username].ipynb

replacing [username] with your username, e.g. abc18de.

•     Please do not upload the data files used in this Notebook. We just want the two python notebooks.

Assessment Criteria

•     The marks are indicated for each part: you'll get marks for correct code that does what is asked and gets the right answer. These contribute 45 marks.

•     There are also 5 marks for "Code quality" (includes both readability and efficiency).

Late submissions

We follow the department's guidelines about late submissions; see the Undergraduate handbook and the PGT handbook.

Use of unfair means

"Any form of unfair means is treated as a serious academic offence and action may be taken under the Discipline Regulations." (from the students Handbook).

A dataset of air quality

We are going to use a dataset of air pollution measurements in Beijing, archived by the UCI Machine Learning Repository.

To read about the dataset, visit the UCI page.

We are going to:

1.    Preprocess the data

2.    Build our own Lasso regression

3.    Use sklearn's tools to perform regression on the data

import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
import urllib
import matplotlib.pyplot as plt
%matplotlib inline

We will be trying to predict the pollution (PM2.5) using:

•     temperature ('TEMP')

•     pressure ('PRES')

•     dew-point temperature ('DEWP')

•     precipitation ('RAIN')

•     wind direction ('wd')

•     wind speed ('WSPM')

urllib.request.urlretrieve('https://drive.google.com/uc?id=1m1g4Xn1wMAGV_EU0Nh1HTI1ogA3-tqJk&export=download', './data.csv')

raw_df = pd.read_csv('data.csv', index_col='No')

#put the columns in a useful order
raw_df = raw_df[['PM2.5', 'year', 'month', 'day', 'hour', 'PM10', 'SO2', 'NO2', 'CO',
                 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'wd', 'WSPM', 'station']]

Some of the records have missing values. We need to handle that before we can easily use the data with most ML tools.

Question 1: Removing missing data [1 mark]

We are going to handle this by dropping those rows which have a NaN in one of these columns: ['PM2.5','hour','TEMP','PRES','DEWP','RAIN','wd','WSPM'].

Save the result in nonull_df.

We can use df.dropna to do this. Pandas documentation on this method is here.

#Put answer here

nonull_df = ?

To check you've done it correctly, you could use nonull_df.isnull().sum() to confirm that there are no NaNs left in the columns we're interested in:

nonull_df.isnull().sum()
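For reference, a minimal sketch of one possible answer, using dropna's subset argument on the raw_df loaded above:

#drop any row with a NaN in one of the listed columns
nonull_df = raw_df.dropna(subset=['PM2.5','hour','TEMP','PRES','DEWP','RAIN','wd','WSPM'])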

Question 2: Removing unwanted columns [1 mark]

Let's remove the columns we're not going to be using. We can use nonull_df.drop(list_of_column_names, axis=1) to do this. We will drop: ['year','month','day','PM10','SO2','NO2','CO','O3','station'].

Again, feel free to check if it's worked with clean_df.isnull().sum(), for example.

clean_df = ? #Put answer here

#there should be 34284 rows left in your dataframe, and 8 columns (note, this was corrected on 7/11/22)
clean_df.shape #=(34284, 8)
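One possible sketch, assuming nonull_df from Question 1:

#drop the unused columns; axis=1 means drop columns rather than rows
clean_df = nonull_df.drop(['year','month','day','PM10','SO2','NO2','CO','O3','station'], axis=1)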

Question 3: Splitting the dataset [2 marks]

Before designing any machine learning model, we need to set aside the test data. We will use the remaining training data for fitting the model. It is important to remember that the test data has to be set aside before preprocessing.

Any preprocessing that you do has to be done only on the training data, and several key statistics need to be saved for the test stage. Separating the dataset into training and test before any preprocessing has happened helps us to recreate the real-world scenario where we will deploy our system and for which the data will come without any preprocessing.

Later we will perform a grid search to select parameter values. For that we'll use cross-validation, but rather than split the data into training and validation sets here, we'll do that later. So for now we'll just split into:

•     The training (and validation) set will have 85% of the total observations,

•     The test set, the remaining 15%.

To avoid unwanted correlations connecting the training and test, we will split these by time. So:

•     Take the first 85% of the rows from clean_df and put them in train_df; take the remaining 15% of the rows and put them in test_df.

#Put answer here

train_df = ?

test_df = ?

To check the sizes are correct, we can use:

len(train_df)/len(clean_df), len(test_df)/len(clean_df)
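A minimal sketch of the time-ordered split, assuming the rows of clean_df are already in time order (as in the original file):

split_point = int(len(clean_df) * 0.85)  #index of the first test row
train_df = clean_df.iloc[:split_point]   #first 85% of rows
test_df = clean_df.iloc[split_point:]    #remaining 15% of rows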

Detour: Lasso Regression

Later we will use the sklearn toolkit, but in this section you will develop your own code to do the Lasso regression.

Ordinary Least Squares Regression

First, let's just perform normal linear regression.

We'll use a toy design matrix and labels to check our code works, and we'll also specify a weight vector for testing.

X = np.array([[0.0, 0], [1, 3], [2.2, 3]])
y = np.array([0.0, 1, 2])
w = np.array([1.0, 2])

Question 4: Prediction Function [2 marks]

The first task is to write a function to make predictions. Can you complete this function for linear regression, i.e. the predictions for all our points, $f(X, \mathbf{w}) = X\mathbf{w}$? [corrected: 7/11/22, don't need X transposed]

def prediction(X, w):
    #Put answer here
    return ?

#You can use this code to check you've written the right function
np.all(prediction(X, w) == np.array([0., 7., 8.2]))  #Should return 'True'
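A minimal sketch of one way to write this, using numpy's matrix multiplication operator:

def prediction(X, w):
    #predictions for every row of the design matrix: f(X, w) = Xw
    return X @ w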

Question 5: Objective Function [4 marks]

Now we need to write a function that returns the 'error'. We'll just do normal Ordinary Least Squares with linear regression, so if you remember the cost function for that is:

$$E = \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i, \mathbf{w})\right)^2$$

where $E$ is the error, $N$ the number of points, $y_i$ is one of the labels, $\mathbf{x}_i$ is the input for that label, and $\mathbf{w}$ is the weight vector. $f$ is the prediction function you've already written; or feel free to substitute in $\mathbf{x}_i^\top \mathbf{w}$.

def objective(X, y, w):
    """
    Computes the sum squared error (for us to perform OLS linear regression).
    """
    #Put answer here
    return ?

#You can use this code to check you've written the right function

objective(X,y,w)==74.44
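One possible implementation, reusing the prediction function above:

def objective(X, y, w):
    #sum squared error between the labels and the predictions
    residuals = y - prediction(X, w)
    return np.sum(residuals**2)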

Question 6: Objective Function Gradient [4 marks]

Now you need to find the derivative of the objective wrt the parameter vector. You've already done this in lectures, so remember the gradient of the error function (for linear regression) is:

$$\frac{\partial E}{\partial \mathbf{w}} = 2 X^\top X \mathbf{w} - 2 X^\top \mathbf{y}$$

Add code to do this here:

def objective_derivative(X, y, w):
    """
    Computes the derivative of the sum squared error, wrt the parameters.
    """
    #Put answer here
    return ?

objective_derivative(X,y,w)
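A sketch of one way to compute this gradient directly from the formula above:

def objective_derivative(X, y, w):
    #gradient of the sum squared error: 2 X^T X w - 2 X^T y
    return 2 * X.T @ X @ w - 2 * X.T @ y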

To check your gradients are correct, we can estimate the gradient numerically:

def numerical_objective_derivative(X, y, w):
    """
    Computes a numerical approximation to the derivative of the sum squared error, wrt the parameters.
    """
    g = np.zeros_like(w)
    for i, wi in enumerate(w):
        d = np.zeros_like(w)
        d[i] = 1e-6
        g[i] = (objective(X, y, w + d) - objective(X, y, w - d)) / 2e-6
    return g

The two gradient vectors should be approximately equal:

objective_derivative(X,y,w), numerical_objective_derivative(X,y,w)

Question 7: Optimise w to minimise the objective [4 marks]

Now you need to use the gradient function you've written to optimise w (minimising the objective) using gradient descent. Start with a sensible choice of w. You'll need to loop lots of times (e.g. 1000). At each iteration: compute the gradient and subtract the scaled gradient from the w parameter (you'll need to scale it by the learning rate, of e.g. 0.01).

def optimise_parameters(X, y, startw):
    """
    Returns the w that minimises the objective.
    """
    #Put answer here
    return ?

bestw = optimise_parameters(X,y,w)

print(bestw) #print our solution
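A minimal gradient-descent sketch; the iteration count and learning rate defaults here are just the suggested values, not the only valid choices:

def optimise_parameters(X, y, startw, iterations=1000, learning_rate=0.01):
    #repeatedly step downhill along the gradient of the objective
    w = startw.astype(float)
    for _ in range(iterations):
        w = w - learning_rate * objective_derivative(X, y, w)
    return w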

Let's compare this to the answer provided by sklearn:

from sklearn import linear_model
clf = linear_model.LinearRegression(fit_intercept=False)
clf.fit(X, y)
print(clf.coef_)  #matches the value of w we found above, hopefully!

Lasso Regression

Question 8: New Objective Function [3 marks]

We're now going to regularise the regression using L1 regularisation - i.e. Lasso regression. We need a new objective function:

$$E = \frac{1}{2N}\sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i, \mathbf{w})\right)^2 + \alpha \sum_j |w_j|$$

This is similar to before (but the first term is now half the mean squared error, rather than the sum squared error). The second term is the L1 regularisation term.

def objective_lasso(X, y, w, alpha):
    """
    Computes half the mean squared error, with an additional L1 regularising term.
    alpha controls the level of regularisation.
    """
    #Put answer here
    return ?
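One possible sketch, following the objective above (half the mean squared error plus the L1 penalty):

def objective_lasso(X, y, w, alpha):
    #half the mean squared error plus the L1 regularisation term
    residuals = y - prediction(X, w)
    return np.sum(residuals**2) / (2 * len(y)) + alpha * np.sum(np.abs(w))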

Question 9: The gradient of the lasso regression objective [3 marks]

The tricky bit is the derivative of the objective.

The first part is similar to before. So, with the regularising term, the derivative is:

$$\frac{\partial E}{\partial \mathbf{w}} = \frac{1}{N}\left(X^\top X \mathbf{w} - X^\top \mathbf{y}\right) + \alpha\,\mathrm{sign}(\mathbf{w})$$

where sign(w) returns a vector of the same shape as w, with +1 where the element is positive and -1 where it's negative. The np.sign method does this for you.

Have a think about why this is (think about what differentiating the 'absolute' function $|w_j|$ involves: what happens when $w_j$ is positive vs when it's negative).

def objective_lasso_derivative(X, y, w, alpha):
    """
    Returns the derivative of the Lasso objective function.
    """
    #Put answer here
    return ?
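A sketch implementing the gradient formula above with np.sign:

def objective_lasso_derivative(X, y, w, alpha):
    #gradient of half the MSE, plus the (sub)gradient of the L1 term
    N = len(y)
    return (X.T @ X @ w - X.T @ y) / N + alpha * np.sign(w)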

We can check it again, numerically. The two gradient vectors should be approximately the same:

def numerical_objective_lasso_derivative(X, y, w, alpha):
    """
    This finds a numerical approximation to the true gradient.
    """
    g = np.zeros_like(w)
    for i, wi in enumerate(w):
        d = np.zeros_like(w)
        d[i] = 1e-6
        g[i] = (objective_lasso(X, y, w + d, alpha) - objective_lasso(X, y, w - d, alpha)) / 2e-6
    return g

objective_lasso_derivative(X, y, w, 0.1), numerical_objective_lasso_derivative(X, y, w, 0.1)

Question 10: Optimise w to minimise the Lasso objective [2 marks]

As before, we need to optimise to find the optimum value of w for this Lasso objective. You'll need to loop lots of times (e.g. 5000). Start with a sensible choice of w. At each iteration: compute the gradient and subtract the scaled gradient from the w parameter (you'll need to scale it by the learning rate, of e.g. 0.05).

def optimise_parameters_lasso(X, y, startw):
    """
    Returns the w that minimises the Lasso objective.
    """
    #Put answer here
    return ?

optimise_parameters_lasso(X,y,w)
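A minimal sketch; the signature above doesn't pass alpha in, so this version assumes a fixed alpha of 0.1 to match the sklearn comparison below, and the iteration count and learning rate defaults are just the suggested values:

def optimise_parameters_lasso(X, y, startw, alpha=0.1, iterations=5000, learning_rate=0.05):
    #gradient descent on the Lasso objective
    w = startw.astype(float)
    for _ in range(iterations):
        w = w - learning_rate * objective_lasso_derivative(X, y, w, alpha)
    return w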

We can check against the sklearn method:

clf = linear_model.Lasso(alpha=0.1, fit_intercept=False)
clf.fit(X, y)
print(clf.coef_)

The above result should approximately match the one you computed.

Back to air pollution

Question 11: One-hot-encoding [4 marks]

One of the columns isn't numerical but is instead a string type: the wind direction. The best way to deal with this is one-hot encoding.

pandas has a tool for doing this: pd.get_dummies(series, prefix='prefix_to_use'). In our example the series is clean_df.wd. You'll need to:

1.    Make the one-hot encoding table using the code above.

2.    Delete the wd column from our table (hint: you did this earlier for other columns).

3.    Join the one-hot data to the table. To do this use something like dataframe1.join(dataframe2).

def add_wd_onehot(df):
    """Adds a new set of one-hot encoded columns and removes the old column
    they're made from. Returns the new dataframe."""
    #Put answer here
    # Get the one-hot encoding of the 'wd' column
    # Drop the 'wd' column as it is now encoded
    # Join the encoded dataframe
    return ?

#you could use this code to see if it's worked:

#train_df_wdencoded = add_wd_onehot(train_df)

#train_df_wdencoded
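A possible sketch of the three steps (using df.wd rather than clean_df.wd so the same function works on both the training and test dataframes):

def add_wd_onehot(df):
    onehot = pd.get_dummies(df.wd, prefix='wd')  #one column per wind direction
    df = df.drop(['wd'], axis=1)                 #remove the original string column
    return df.join(onehot)                       #attach the one-hot columns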

Question 12: Standardise the data [3 marks]

[note: updated from term 'Normalise' to 'Standardise' on 7/11/22. For clarity, I want the mean to be zero and the standard deviation to be one].

Now we need to standardise [edit: corrected] the data.

You could manipulate just some columns by using, for example: df.iloc[:,1:] - this returns a dataframe that consists of all but the first column.

Feel free to use either tools from sklearn.preprocessing, or standardise [edit: corrected on 7/11/22] it by, for example, using the mean and the standard deviation of the columns, by calling some_dataframe.mean() or some_dataframe.std().

def standardise(df):
    """
    Returns a new dataframe in which all but the PM2.5 column are
    standardised (i.e. have a mean of zero and standard deviation of 1).
    [note: the function name used to be 'normalise' but was modified for clarity]
    [addition 7/11/22: think about whether you want to standardise the test data
    using the *training* data's mean and standard deviation.]
    """
    #Put answer here
    return ?
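A simple sketch that standardises every column except PM2.5 using that dataframe's own mean and standard deviation; note the hint above about whether the test set should instead reuse the training set's statistics:

def standardise(df):
    df = df.copy()
    cols = df.columns[1:]  #all columns except PM2.5
    df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()
    return df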

train_df_preprocessed = standardise(add_wd_onehot(train_df))
test_df_preprocessed = standardise(add_wd_onehot(test_df))

Here we put the training and test inputs (X) and outputs (y) into four variables:

X = train_df_preprocessed.iloc[:, 1:].to_numpy()
y = train_df_preprocessed.iloc[:, 0].to_numpy()
Xtest = test_df_preprocessed.iloc[:, 1:].to_numpy()
ytest = test_df_preprocessed.iloc[:, 0].to_numpy()

We can use the same code we wrote before, using the Lasso from sklearn to fit the data.

Here we'll turn fit_intercept on, as we've not added a '1's column to our design matrix.

So feel free to use:

clf = linear_model.Lasso(alpha=0.1, fit_intercept=True)
clf.fit(X, y)

Question 13: Finding the RMSE of the Lasso regressor predictions [2 marks]

Next compute the RMSE of the predictions for (a) the training data and (b) the test data. The RMSE (root mean squared error) could be computed, for example with:

np.sqrt(np.mean((predicted_values - true_values)**2))

#Put answer here

rmse_lasso_train = ?

rmse_lasso_test = ?
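For example, one possible answer (assuming clf is the fitted Lasso model from the cell above):

#RMSE of the predictions on the training and test sets
rmse_lasso_train = np.sqrt(np.mean((clf.predict(X) - y)**2))
rmse_lasso_test = np.sqrt(np.mean((clf.predict(Xtest) - ytest)**2))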

We can compare this to the standard deviation of the data; we should do better than that!

np.std(y), np.std(ytest)

Question 14: Random Forest [8 marks]

The final step is to use a random forest regressor.

If we use the default random forest regressor, we find we get considerable over-fitting. So we need to explore different parameters. We will use a cross-validated grid search over the parameters:

•     max_features: the number of features to consider when looking for the best split (i.e. this controls subsampling), from 1 to the number of features in 4 steps (e.g. use np.linspace).

•     n_estimators: the number of trees in the forest, from 10 to 100 in 4 steps.

•     max_samples: the number of samples to draw from to train each base estimator, from 0.1 to 0.9 in 4 steps.

We will use GridSearchCV.

Have a look at the documentation for this; the three parameters we need to specify are:

•     the 'estimator': an INSTANCE of RandomForestRegressor.

•     param_grid: a DICTIONARY, where each item is the name of the parameter, and equals an array of the values we need to test. For example, one of the items might be {'max_samples': np.linspace(0.1, 0.9, 5)}.

•     You'll need to think carefully about how to make the lists for max_features and n_estimators, as these both need to be (positive) integers. E.g. use .astype(int).

#Put answer here:
#1. Create a grid of parameter values for n_estimators, max_features and max_samples
#2. Create a GridSearchCV object, using the random forest regressor

#Note: Because there is so much training data, using the full dataset takes too long.
#So here we'll just use 10%
np.random.seed(42)
idx = np.sort(np.random.choice(len(X), size=int(len(X)*0.1), replace=False))

#3. Fit to training data in (the subset of) X and y
grid_regression.fit(X[idx,:], y[idx])
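One possible sketch of steps 1 and 2 (these lines would go in place of the '#Put answer here' section, before the fit call); the name grid_regression is the one the fit call expects, and the parameter ranges follow the bullet points earlier:

#parameter grid: integer ranges for max_features and n_estimators, fractions for max_samples
num_features = X.shape[1]
param_grid = {'max_features': np.linspace(1, num_features, 4).astype(int),
              'n_estimators': np.linspace(10, 100, 4).astype(int),
              'max_samples': np.linspace(0.1, 0.9, 4)}
grid_regression = GridSearchCV(RandomForestRegressor(), param_grid)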

Here we print the best parameters from the grid search (on the training/validation cross-validation run):

best_n_estimators = grid_regression.best_params_['n_estimators']
best_max_features = grid_regression.best_params_['max_features']
best_max_samples = grid_regression.best_params_['max_samples']
print('Best n_estimators', best_n_estimators)
print('Best max features', best_max_features)
print('Best max samples', best_max_samples)

Question 15: RMSE for the Random Forest Regressor [1 mark]

Finally compute the RMSE for the training and test data:

#Put answer here

rmse_rf_train = ?

rmse_rf_test = ?

We can compare this to the standard deviations for the two sets of data.

np.std(y), np.std(ytest)

Question 16: Did the random forest do better than lasso regression? [1 mark]

#Put answer here