COMP24112 Summative Exercise: Air Quality Analysis

Published: 2024-06-18

COMP24112 Summative Exercise: Air Quality Analysis (30 Marks)

This lab exercise is about air quality analysis, where you will predict air quality through solving classification and regression tasks. You will submit a notebook file, a pdf report, and a trained model. You will be marked for implementation, design, result and analysis. Your code should be easy to read and your report should be concise (max 600 words). It is strongly recommended that you use a LaTeX editor, such as Overleaf (https://www.overleaf.com/), to write your report.

Please note your notebook should take no more than 10 minutes to run on lab computers. There is 1 mark for code efficiency.

1. Dataset and Knowledge Preparation

The provided dataset contains measurements of air quality from a multisensor device. The device used spectrometer analyzers (variables marked by "GT") and solid state metal oxide detectors (variables marked by "PT08.Sx"), as well as temperature (T), relative humidity (RH) and absolute humidity (AH) sensors.

The dataset contains 3304 instances of hourly averaged measurements taken at road level in a polluted city. You will predict the CO(GT) variable representing carbon monoxide levels. There are missing features in this dataset, flagged by the value -999.

You will need to pre-process the dataset to handle missing features; consult the scikit-learn documentation on imputing missing values (https://scikit-learn.org/stable/modules/impute.html). You will also need to split the dataset into training and testing sets, and to run cross-validation where you see fit. For these, consult the scikit-learn documentation on data splitting (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html).
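As a minimal sketch of the preprocessing described above, the following uses a small synthetic array (not the provided dataset) with -999 as the missing-value flag. The array contents, the `Ridge` model and the variable names are illustrative assumptions; `SimpleImputer`, `train_test_split` and `cross_val_score` are the scikit-learn tools the linked pages describe.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the sensor data: 200 rows, 4 features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Flag roughly 5% of the entries as missing using -999
mask = rng.random(X.shape) < 0.05
X[mask] = -999

# Hold out a test set before fitting any preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

# Fit the imputer on the training set only, then reuse it on the test set
imputer = SimpleImputer(missing_values=-999, strategy="mean")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# For cross-validation, a pipeline refits the imputer inside every fold
pipeline = make_pipeline(SimpleImputer(missing_values=-999, strategy="mean"), Ridge(alpha=1.0))
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="neg_mean_squared_error")
print(f"CV MSE: {-cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Fitting the imputer on the training portion only keeps test-set statistics out of the model, which is why the test set is transformed but never fitted.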

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import sklearn.model_selection

notebook_start_time = time.time()

# Import data - it should be saved in the same root directory as this notebook
sensor_data_full = pd.read_excel('sensor_data.xlsx')

# Display a sample of the data
sensor_data_full.sample(5)

2. Linear Classification via Gradient Descent (13 marks)

The air quality is assessed using the CO(GT) variable. If it is no greater than 4.5, the air quality is good (CO(GT)<=4.5); otherwise, it is bad (CO(GT)>4.5). You will perform binary classification to predict whether the air quality is good based on the other 11 variables, i.e., from PT08.S1(CO) to AH.

2.1 Model Training and Testing (4 marks)

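This practice is about training a binary linear classifier by minimising a hinge loss with L2 (ridge) regularisation, and then testing its performance. Given a set of $N$ training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$ is the feature vector and $y_i \in \{-1, +1\}$ is the class label for the $i$-th training sample, the training objective function to minimise is

$$O(\mathbf{w}, w_0) = C \sum_{i=1}^{N} \max\left(0,\ 1 - y_i\left(\mathbf{w}^\top \mathbf{x}_i + w_0\right)\right) + \frac{1}{2}\mathbf{w}^\top \mathbf{w}.$$

Here, $\mathbf{w}$ is a column weight vector of the linear model, $w_0$ is the bias parameter of the model, and $C$ is the regularisation hyperparameter.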

Recall from your lectures that gradient descent is an iterative optimisation algorithm typically used in model training. Complete the implementation of the training function linear_gd_train below, which trains your linear model by minimising the training objective function provided above using gradient descent.

The function should return the trained model weights and the corresponding objective function value (referred to as cost) per iteration. In addition to the training data, the function should take the regularisation hyperparameter $C$, the learning rate $\eta$, and the number of iterations as arguments. A default setting of these parameters is provided below and gives reasonably good performance.

Note that scikit-learn is not allowed for implementation in this section. We recommend that you avoid using for loops in your implementation of the objective function or weight update, and instead use built-in numpy operations for efficiency.
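The objective and its (sub)gradient can be checked numerically before writing the training loop. The sketch below assumes labels in {-1, +1}, a bias stored as the first entry of an augmented weight vector, and a regulariser that excludes the bias. The helper names `hinge_objective` and `hinge_subgradient` and the toy data are hypothetical and not part of the provided notebook.

```python
import numpy as np

def hinge_objective(w_tilde, X_tilde, y, c):
    """C * sum of hinge losses + 0.5 * ||w||^2 (bias w_tilde[0] not regularised)."""
    margins = y * (X_tilde @ w_tilde)
    return c * np.sum(np.maximum(0.0, 1.0 - margins)) + 0.5 * np.sum(w_tilde[1:] ** 2)

def hinge_subgradient(w_tilde, X_tilde, y, c):
    """Subgradient of hinge_objective with respect to w_tilde."""
    margins = y * (X_tilde @ w_tilde)
    active = (margins < 1).astype(float)   # samples that violate the margin
    grad = -c * (X_tilde.T @ (active * y))
    grad[1:] += w_tilde[1:]                # regulariser acts on the weights only
    return grad

# Toy data (assumption): 50 samples, 3 features, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
X_tilde = np.hstack((np.ones((50, 1)), X))
w_tilde = rng.normal(size=4)

# Central finite differences should agree with the analytic subgradient
eps = 1e-6
numeric = np.zeros_like(w_tilde)
for j in range(w_tilde.size):
    step = np.zeros_like(w_tilde)
    step[j] = eps
    numeric[j] = (hinge_objective(w_tilde + step, X_tilde, y, 0.2)
                  - hinge_objective(w_tilde - step, X_tilde, y, 0.2)) / (2 * eps)

analytic = hinge_subgradient(w_tilde, X_tilde, y, 0.2)
print(np.max(np.abs(numeric - analytic)))
```

The loop over coordinates is only for the check; the objective and subgradient themselves use vectorised numpy operations, in line with the recommendation above.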

In [11]:

def linear_gd_train(data, labels, c=0.2, n_iters=200, learning_rate=0.001, random_state=None):
    """
    Train a linear classifier by gradient descent on the hinge loss with L2 regularisation.

    data: training data
    labels: training labels (boolean)
    c: regularisation parameter
    n_iters: number of iterations
    learning_rate: learning rate for gradient descent
    random_state: seed for the random weight initialisation

    Returns the cost and the model weights per iteration.
    """
    # Set random seed for reproducibility of the random weight initialisation
    rng = np.random.default_rng(seed=random_state)

    # Create design matrix (leading column of ones for the bias) and map labels to {-1, +1}
    X_tilde = np.hstack((np.ones((data.shape[0], 1)), data))
    y = np.where(np.asarray(labels) > 0, 1.0, -1.0)

    # Weight initialisation: w[0] is the bias
    w = rng.standard_normal(X_tilde.shape[1])

    # Initialise lists to store weights and cost at each iteration
    w_all = []
    cost_all = []

    # GD update of weights
    for i in range(n_iters):
        margin = y * np.dot(X_tilde, w)
        hinge_loss = np.maximum(0, 1 - margin)

        # Cost before this iteration's update (the bias w[0] is not regularised)
        cost = c * np.sum(hinge_loss) + 0.5 * np.sum(w[1:] ** 2)

        # Subgradient: only samples with margin < 1 contribute to the hinge term
        indicator = (margin < 1).astype(float)
        grad = -c * np.dot(X_tilde.T, indicator * y)
        grad[1:] += w[1:]

        # Weight update
        w = w - learning_rate * grad

        # Save w and cost of each iteration in w_all and cost_all
        w_all.append(w)
        cost_all.append(cost)

    # Return cost and model parameters per iteration
    return cost_all, w_all

def linear_predict(data, w):
    """
    Predict binary class labels with a trained linear model.

    data: test data
    w: model weights for a single iteration (bias first)

    Returns the predicted labels (1 for the positive class, 0 otherwise).
    """
    X_tilde = np.hstack((np.ones((data.shape[0], 1)), data))
    # A positive score maps to class 1, any other score to class 0
    y_pred = np.dot(X_tilde, w) > 0
    return y_pred.astype(int)

Now, you are ready to conduct a complete experiment of air quality classification. The provided code below splits the data into training and testing sets and imputes the missing features.

In [12]:

from sklearn.impute import SimpleImputer
import sklearn.preprocessing

# Put a threshold on the labels to cast to binary: True if CO(GT) > 4.5, False otherwise
binary_targets = (sensor_data_full['CO(GT)'] > 4.5).to_numpy()
sensor_data = sensor_data_full.drop(columns=['CO(GT)']).to_numpy()

# Named _cls to keep our classification experiments distinct from regression
train_X_cls, test_X_cls, train_y_cls, test_y_cls = sklearn.model_selection.train_test_split(sensor_data, binary_targets, test_size=0.15, strat

# Impute missing values and standardise the data
imputer = SimpleImputer(missing_values=-999, strategy='mean')
scaler = sklearn.preprocessing.StandardScaler()
train_X_cls = imputer.fit_transform(train_X_cls)
train_X_cls = scaler.fit_transform(train_X_cls)

Write your code below, which should train the model, plot the training objective function value and the classification accuracy of the training set over iterations, and print the classification accuracy and F1 score of the testing set. Note: use the default settings provided for $C$, $\eta$ and the number of iterations. Your plots should have axis labels and titles.
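For reference, accuracy and the F1 score can both be computed from confusion counts. The sketch below uses made-up label arrays and a hypothetical helper `accuracy_and_f1`; it treats 1 as the positive class, which is an assumption about how the labels are encoded.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
    accuracy = np.mean(y_true == y_pred)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return accuracy, f1

# Made-up labels for illustration only
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1])
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

When one class is much rarer than the other, accuracy can stay high for a classifier that rarely predicts the rare class, so reporting both metrics gives a fuller picture.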

In [13]:

from sklearn.metrics import accuracy_score, f1_score

# Train the model with the default settings
costs, weights = linear_gd_train(train_X_cls, train_y_cls, c=0.2, n_iters=200, learning_rate=0.001)
final_weights = weights[-1]

# Apply the imputer and scaler fitted on the training set to the test set
test_X_cls = scaler.transform(imputer.transform(test_X_cls))

# Predict on both sets (linear_predict takes a single weight vector), report accuracy and F1 score
train_y_pred = linear_predict(train_X_cls, final_weights)
test_y_pred = linear_predict(test_X_cls, final_weights)
train_accuracy = accuracy_score(train_y_cls, train_y_pred)
test_accuracy = accuracy_score(test_y_cls, test_y_pred)
train_f1 = f1_score(train_y_cls, train_y_pred, zero_division=1)
test_f1 = f1_score(test_y_cls, test_y_pred, zero_division=1)
print(f"Train accuracy: {train_accuracy:.2f}, Test accuracy: {test_accuracy:.2f}")
print(f"Train F1: {train_f1:.2f}, Test F1: {test_f1:.2f}")

# Plot cost and accuracy per iteration on the training set
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(range(len(costs)), costs, label='Objective function value')
plt.title('Objective function value per iteration')
plt.xlabel('Iteration')
plt.ylabel('Objective function value')
plt.legend()

plt.subplot(1, 2, 2)
train_accuracy_per_iter = [accuracy_score(train_y_cls, linear_predict(train_X_cls, w)) for w in weights]
plt.plot(range(len(train_accuracy_per_iter)), train_accuracy_per_iter, label='Training accuracy')
plt.title('Training accuracy per iteration')
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

2.2 Learning Rate Analysis (3 marks)

The learning rate $\eta$ (Greek letter "eta") is a key parameter that affects model training and performance. Design an appropriate experiment to demonstrate the effect of $\eta$ on model training, and on the model performance during testing.
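One possible design, sketched under assumptions, is to train with several learning rates and compare the resulting cost curves and test accuracies. The code below uses synthetic data and a compact stand-in trainer `train_hinge_gd` (hypothetical, not the notebook's `linear_gd_train`); the candidate learning rates are arbitrary illustrative values.

```python
import numpy as np

def train_hinge_gd(X, y, c=0.2, n_iters=200, learning_rate=0.001):
    """Gradient descent on C * hinge + 0.5 * ||w||^2; returns costs per iteration and final weights."""
    X_tilde = np.hstack((np.ones((X.shape[0], 1)), X))
    w = np.zeros(X_tilde.shape[1])
    costs = []
    for _ in range(n_iters):
        margins = y * (X_tilde @ w)
        costs.append(c * np.sum(np.maximum(0.0, 1.0 - margins)) + 0.5 * np.sum(w[1:] ** 2))
        grad = -c * (X_tilde.T @ ((margins < 1) * y))
        grad[1:] += w[1:]                  # bias (index 0) is not regularised
        w = w - learning_rate * grad
    return np.array(costs), w

# Synthetic, linearly separable data with labels in {-1, +1} (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.where(X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) > 0, 1.0, -1.0)
X_train, y_train, X_test, y_test = X[:250], y[:250], X[250:], y[250:]

results = {}
for lr in [1e-5, 1e-4, 1e-3, 1e-2]:
    costs, w = train_hinge_gd(X_train, y_train, learning_rate=lr)
    X_test_tilde = np.hstack((np.ones((X_test.shape[0], 1)), X_test))
    test_acc = np.mean(np.where(X_test_tilde @ w >= 0, 1.0, -1.0) == y_test)
    results[lr] = (costs, test_acc)
    print(f"lr={lr:g}: final cost={costs[-1]:.3f}, test accuracy={test_acc:.3f}")
```

Plotting each entry of `results` against the iteration index would then show how quickly, and how smoothly, the cost falls for each learning rate.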

2.3 Report (6 Marks)

Answer the following questions in your report, to be submitted separately:

1. Derive step-by-step the gradient of the provided training objective function, and the update equation of your model weights based on gradient descent. (3 marks)

2. What does the figure from section 2.1 tell you, and what is the indication of the classification accuracies of your training and testing sets? (1 mark)

3. Comment on the effect of $\eta$ on model training, and on the model performance during testing, based on the results you observed in Section 2.2. (2 marks)

3. Air Quality Analysis by Neural Network (10 marks)

In this experiment, you will predict the CO(GT) value based on the other 11 variables through regression. You will use a neural network to build a nonlinear regression model. Familiarise yourself with how to build a regression model with a multilayer perceptron (MLP) using the scikit-learn tutorial (https://scikit-learn.org/stable/modules/neural_networks_supervised.html#regression).

3.1 Simple MLP Model Selection (4 marks)

This section is focused on the practical aspects of MLP implementation and model selection. We will first compare some model architectures.

The set of MLP architectures to select from is specified in param_grid below. It includes two MLPs with one hidden layer, one with a small number of 3 hidden neurons and the other with a larger number of 100 hidden neurons, and two MLPs with two hidden layers, one small (3, 3) and the other larger (100, 100). It also includes two activation function options, i.e., the logistic and the rectified linear unit ("relu"). These result in a total of 8 model options, where sklearn default parameters are used for all the MLPs and their training.

In [ ]:

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        'hidden_layer_sizes': [(3,), (100,), (3, 3), (100, 100)],
        'activation': ['relu', 'logistic'],
    }
]

Your code below should do the following: Split the dataset into the training and testing sets. Preprocess the data by imputing the missing features. Use the training set for model selection by cross-validation, and use mean squared error (MSE) as the model selection performance metric. You can use the scikit-learn module GridSearchCV (https://scikit-learn.org/stable/modules/grid_search.html#grid-search) to conduct grid search. Print the cross-validation MSE with standard deviation of the selected model. Re-train the selected model using the whole training set, and print its MSE and $R^2$ score for the testing set.
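The model-selection steps above can be sketched as follows. This is an illustration on synthetic data with a deliberately tiny grid and a reduced `max_iter` so that it runs quickly; those values, the `random_state` settings, the added scaling step and the pipeline layout are assumptions rather than the required configuration.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic regression data with -999 marking missing entries (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
X[rng.random(X.shape) < 0.05] = -999

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

# Preprocessing lives in the pipeline so each CV fold is imputed independently
pipe = Pipeline([
    ("impute", SimpleImputer(missing_values=-999, strategy="mean")),
    ("scale", StandardScaler()),
    ("mlp", MLPRegressor(max_iter=200, random_state=0)),
])
small_grid = {
    "mlp__hidden_layer_sizes": [(3,), (20,)],
    "mlp__activation": ["relu", "logistic"],
}

# scikit-learn maximises scores, so MSE is supplied as its negative
search = GridSearchCV(pipe, small_grid, scoring="neg_mean_squared_error", cv=3, refit=True)
search.fit(X_train, y_train)

best = search.best_index_
cv_mse = -search.cv_results_["mean_test_score"][best]
cv_std = search.cv_results_["std_test_score"][best]
print(f"Best params: {search.best_params_}")
print(f"CV MSE: {cv_mse:.3f} +/- {cv_std:.3f}")

# refit=True has already retrained the best model on the whole training set
y_pred = search.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)
print(f"Test MSE: {test_mse:.3f}, R2: {test_r2:.3f}")
```

Because the scorer is the negated MSE, the sign is flipped back before printing; the standard deviation needs no sign change.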

In [ ]:

# Redo split for regression

# Prepare the data

# Define MLP model

# Initialise and fit the grid search

# Report the best parameters and the CV results

# Report model performance

3.2 Training Algorithm Comparison: SGD and ADAM (2 Marks)

In this exercise, you will compare two training algorithms, stochastic gradient descent (SGD) and ADAM optimisation, for training an MLP with two hidden layers each containing 100 neurons with "relu" activation, under the settings specified in test_params as below.

In [ ]:

test_params = [
    {
        'activation': 'relu',
        'alpha': 0.001,
        'early_stopping': False,
        'hidden_layer_sizes': (100, 100),
        'solver': 'adam'
    },
    {
        'activation': 'relu',
        'alpha': 0.001,
        'early_stopping': False,
        'hidden_layer_sizes': (100, 100),
        'learning_rate': 'adaptive',
        'momentum': 0.95,
        'solver': 'sgd'
    }
]

Write the code below, where each training algorithm should run for 300 iterations (make sure to set early_stopping=False). For both algorithms, (1) plot the training loss (use the default loss setting in sklearn), as well as the MSE of both training and testing sets, over iterations; and (2) print the MSE and $R^2$ score of the trained model using the testing set.
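One way to record the MSE at every iteration, sketched under assumptions, is to call `partial_fit` once per iteration and evaluate the model after each call; the training loss is then read from `loss_curve_`. The synthetic data, the smaller network and the 30 iterations below are illustrative stand-ins for the settings in test_params, chosen only to keep the example fast.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

# Synthetic, already standardised data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_train, y_train, X_test, y_test = X[:250], y[:250], X[250:], y[250:]

n_iters = 30  # the exercise asks for 300; kept short here
curves = {}
for solver in ["adam", "sgd"]:
    model = MLPRegressor(hidden_layer_sizes=(20, 20), activation="relu",
                         alpha=0.001, solver=solver, random_state=0)
    train_mse, test_mse = [], []
    for _ in range(n_iters):
        model.partial_fit(X_train, y_train)  # one pass over the training data
        train_mse.append(mean_squared_error(y_train, model.predict(X_train)))
        test_mse.append(mean_squared_error(y_test, model.predict(X_test)))
    curves[solver] = {
        "loss": list(model.loss_curve_),     # training loss recorded by sklearn
        "train_mse": train_mse,
        "test_mse": test_mse,
    }
    print(f"{solver}: final train MSE={train_mse[-1]:.3f}, test MSE={test_mse[-1]:.3f}")
```

Each entry of `curves` can then be plotted against the iteration index, one line per solver, to compare how the two optimisers converge.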

In [ ]:

# Train models and plot learning curves

# Print final test set performance for both models

3.3 Report (4 Marks)

Answer the following questions in your report, to be submitted separately:

1. What conclusions can you draw based on your model selection results in Section 3.1? (2 marks)

2. Comment on the two training algorithms based on your results obtained in Section 3.2. (2 marks)