COMP809 Data Mining and Machine Learning ASSIGNMENT TWO Semester 1 - 2024

Published: 2024-05-29

ASSIGNMENT TWO

Semester 1 - 2024

PAPER NAME: Data Mining and Machine Learning

PAPER CODE: COMP809

TOTAL MARKS: 100

Part A: Literature Review [20 marks]

The objective of this part is to conduct a preliminary review of data mining and machine learning methods used in the estimation of Particulate Matter (PM10/PM2.5) concentration. The survey is intended to assist you in establishing a suitable framework (application area, tools, algorithms) on which your project will be based.

To achieve this objective, you need to follow the steps below:

1. Read and analyse recent peer-reviewed papers (minimum of 4 articles) on your specific topic.

2. From your research, identify at least two themes and discuss them by comparing how the papers you reviewed address each one. Some examples of themes are listed below:

a. Approaches/algorithms to solve the problem

b. Scientific results from experimentation

c. Perspectives on an issue

d. Advantages/disadvantages

3. Express your own opinions, e.g., new ideas, proposed approaches/models, or how to extend the existing work. Your opinions about machine learning and data mining related issues should be presented.

4. Write the report using LaTeX or Word. It must be a minimum of 4 pages (not including references) and no more than 5 pages, in two-column IEEE proceedings format.

Layout for Research Report

The research report must include:

• Title

• Abstract

• Introduction

• Background/motivation

• Comparison of related work (from peer-reviewed sources)

• Your opinion – new ideas, proposed approaches/models, how to extend the existing work

• Conclusion and future issues

• References

General Guidelines

• References must be clickable both in the body of the report and in the reference list (important for efficient marking).

• The “Your opinion” section is a very important part of the report. In this section, you will present your own thoughts about the existing work and, based on these, give your opinion on how it can be improved further. Your opinions about data mining and machine learning related issues should be presented.

• NOTE: If you use any material or figures in the assessment that are not your own, remember to cite/reference the source.

• All assessments will be checked through the Turnitin system and, in case of plagiarism, the University policy against plagiarism will be applied.

Part B: Predictions of Particulate Matter (PM2.5 or PM10) [80 marks]

Air pollution causes serious damage to public health and the environment; therefore, making accurate predictions of PM concentration is a crucial task. In this part, you are required to build prediction models based on a regression model, a multi-layer perceptron (MLP), and a long short-term memory (LSTM) network.

Dataset

The dataset for this experiment can be downloaded from the Environmental Auckland Data Portal. The dataset includes a dependent variable (PM2.5 or PM10) and different predictors such as air pollutants, Air Quality Index (AQI), and meteorological data, collected on an hourly basis from a single air quality monitoring station.

Two PM lag measurements, lag1 and lag2, should be included in your dataset. For example, lag1 for PM2.5 is the measurement for the previous hour (h-1) and lag2 is PM2.5 concentration for h-2.
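The lag definition above could be sketched with pandas as follows. This is a minimal sketch, assuming the data sits in a DataFrame indexed by timestamp with a column named `PM2.5`; the function name `add_pm_lags` and the column names are illustrative assumptions, not part of the brief. The series is first placed on a complete hourly grid so that a shift of one row always means one hour earlier, even where the raw export skips an hour.

```python
import pandas as pd


def add_pm_lags(df: pd.DataFrame, target: str = "PM2.5", n_lags: int = 2) -> pd.DataFrame:
    """Return a copy of df on a full hourly grid with lag1..lagN columns for target."""
    # Build a gap-free hourly index; hours missing from the raw data become NaN rows.
    full_index = pd.date_range(df.index.min(), df.index.max(), freq=pd.Timedelta(hours=1))
    out = df.sort_index().reindex(full_index)
    for k in range(1, n_lags + 1):
        # lag k = measurement taken k hours before the current row (h-k).
        out[f"{target}_lag{k}"] = out[target].shift(k)
    return out
```

The first rows, and any row that follows a gap, have NaN lags; how to treat them belongs to the data cleaning decision described later in the brief.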

Download the relevant PM concentration, air pollution data (e.g., CO, SO2, NO, NO2), and meteorological data such as Air Temperature (°C), Relative Humidity (%), Wind Direction (°), and Wind Speed (m/s). The dataset should contain hourly measurements from January 2020 to December 2022.

Note 1: Not all mentioned independent variables are collected at these monitoring stations.

Note 2: The unit of measurement for PM and air pollution data should be µg/m³.

Data Pre-processing [6 marks]

Make sure all variables in your dataset have the same temporal resolution (i.e. hourly measurements). Perform data exploration and identify missing data and outliers (data that are outside the expected range). For example, an air temperature of 45 °C for Auckland, Relative Humidity measurements above 100%, and negative or unexplained high concentrations are outliers.

• Provide attribute-specific information about outliers and missing data. How can these affect dataset quality?

• Based on this analysis, decide on and justify your approach to data cleaning. Once your dataset is cleaned, move to the next step, feature selection.
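One possible way to produce the attribute-specific counts is a range check per column. The sketch below is only illustrative: the column names and the limits in `VALID_RANGES` are hypothetical placeholders that would need to be justified for the chosen station, and masking out-of-range values as missing is one cleaning option among several.

```python
import pandas as pd

# Hypothetical plausibility limits (assumptions, not taken from the brief).
VALID_RANGES = {
    "AirTemp": (-5.0, 40.0),      # degrees Celsius
    "RelHumidity": (0.0, 100.0),  # percent
    "PM2.5": (0.0, 500.0),        # micrograms per cubic metre
}


def quality_report(df: pd.DataFrame, valid_ranges: dict) -> pd.DataFrame:
    """Count missing and out-of-range values for each listed attribute."""
    rows = []
    for col, (low, high) in valid_ranges.items():
        series = df[col]
        # Comparisons with NaN are False, so missing values are not double counted.
        out_of_range = (series < low) | (series > high)
        rows.append({
            "attribute": col,
            "missing": int(series.isna().sum()),
            "out_of_range": int(out_of_range.sum()),
        })
    return pd.DataFrame(rows).set_index("attribute")


def mask_outliers(df: pd.DataFrame, valid_ranges: dict) -> pd.DataFrame:
    """Return a copy in which out-of-range values are replaced by NaN."""
    cleaned = df.copy()
    for col, (low, high) in valid_ranges.items():
        cleaned[col] = cleaned[col].where(cleaned[col].between(low, high))
    return cleaned
```

After masking, the remaining gaps can be dropped or imputed (for example by interpolation over short gaps), and that choice is the part that needs justification.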

Feature Selection [8 marks]

Choose the number of attributes so that they do not violate the model assumption(s)/requirements. Use Pearson correlation or any other feature selection method of your choice, and justify that choice.

• Provide the correlation plot (or the results of any other feature selection method of your choice) and elaborate on the rationale for your selection.

• Describe your chosen attributes and their influence on PM concentration.

• Provide a graphical visualisation of the variation of PM concentration. Describe your observations.

• Provide a graphical visualisation of the predictors of your choice that have the highest correlation. Describe your observations.

• Provide summary statistics of the PM concentration and the predictors of your choice that have the highest correlation in tabular format.
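The correlation ranking and the summary table can be sketched in a few lines of pandas. The helper names `rank_predictors` and `summary_table` are assumptions made for illustration; the ranking uses the absolute Pearson coefficient, which is one reasonable reading of "highest correlation".

```python
import pandas as pd


def rank_predictors(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric predictor with target, strongest first."""
    corr = df.select_dtypes("number").corr(method="pearson")
    with_target = corr[target].drop(target)
    # Order by absolute value so strong negative correlations are not pushed to the end.
    order = with_target.abs().sort_values(ascending=False).index
    return with_target.reindex(order)


def summary_table(df: pd.DataFrame, target: str, n_top: int = 3) -> pd.DataFrame:
    """Summary statistics for target and its n_top most correlated predictors."""
    top = list(rank_predictors(df, target).index[:n_top])
    return df[[target] + top].describe()
```

Predictors that are also highly correlated with each other may violate the independence assumption of the linear model, so the full correlation matrix is worth inspecting, not only the column for the target.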

Experimental Methods [5 marks]

Use 70% of the data for training and the rest for testing the models. Use a workflow diagram to illustrate the process of predicting PM concentrations using the models. For all models, provide the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R²) to quantify the prediction performance of each model.
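A minimal sketch of the split and the three metrics, written with NumPy only. It assumes a chronological split (the first 70% of rows for training), which is a common choice for time series but is not stated in the brief, and it computes R² as the coefficient of determination. The function names are illustrative.

```python
import numpy as np


def chronological_split(X, y, train_fraction: float = 0.7):
    """Split ordered data so the earliest train_fraction of rows is used for training."""
    cut = int(len(X) * train_fraction)
    return X[:cut], X[cut:], y[:cut], y[cut:]


def regression_metrics(y_true, y_pred) -> dict:
    """Return RMSE, MAE and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = y_true - y_pred
    ss_res = float(np.sum(error ** 2))                     # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        "mae": float(np.mean(np.abs(error))),
        "r2": 1.0 - ss_res / ss_tot,
    }
```

Using the same two helpers for the regression model, the MLP, and the LSTM keeps the reported numbers directly comparable.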

Regression Model

Fit a linear model to estimate the chosen PM concentration. [10 marks]

1) Provide the results including regression results, statistical significance metrics, and coefficients tables from this model.

2) Describe and analyse the results of your regression model. Are all the assumptions satisfied? Provide evidence to support your answer.
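As a rough illustration of what a coefficients table and one assumption check might look like, the sketch below fits ordinary least squares with NumPy and reports coefficients, standard errors, and t-values, plus the Durbin-Watson statistic for residual autocorrelation. It is a sketch under stated assumptions, not a required method: the function names are hypothetical, p-values are omitted, and a statistics package would normally supply a fuller summary.

```python
import numpy as np
import pandas as pd


def fit_ols(X: pd.DataFrame, y: pd.Series):
    """Ordinary least squares with an intercept; returns (coefficient table, residuals)."""
    names = ["intercept"] + list(X.columns)
    A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    target = y.to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    residuals = target - A @ beta
    dof = len(target) - A.shape[1]             # residual degrees of freedom
    sigma2 = float(residuals @ residuals) / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)      # covariance of the estimates
    std_err = np.sqrt(np.diag(cov))
    table = pd.DataFrame(
        {"coef": beta, "std_err": std_err, "t_value": beta / std_err},
        index=names,
    )
    return table, residuals


def durbin_watson(residuals) -> float:
    """Durbin-Watson statistic; values near 2 suggest little lag-1 autocorrelation."""
    residuals = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))
```

Other assumptions (linearity, constant variance, normality of residuals, low multicollinearity) would need their own evidence, for example residual plots or variance inflation factors.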

Multilayer Perceptron (MLP)

1) In your own words, describe multilayer perceptron (MLP). You may use one diagram in your explanation (one page). [5 marks]

2) Use sklearn.neural_network.MLPRegressor with a single hidden layer of k = 25 neurons. Keep the default values for all other parameters and experimentally determine the learning rate that gives the best performance on the testing dataset. Use this as a baseline for comparison in later parts of this question. [5 marks]

3) Experiment with two hidden layers and experimentally determine the split of the number of neurons across the two layers that gives the highest accuracy. In part 2, all k neurons were in a single layer; in this part, transfer neurons from the first hidden layer to the second iteratively, in steps of 1. Thus, for example, in the first iteration the first hidden layer will have k-1 neurons whilst the second layer will have 1; in the second iteration, k-2 neurons will be in the first layer with 2 in the second, and so on. [5 marks]

4) From the results in part 3 of this question, you will observe a variation in the obtained performance metrics with the split of neurons across the two layers. Give explanations for some possible reasons for this variation and which architecture gives the best performance. [5 marks]
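The neuron-transfer experiment in part 3 could be organised as the loop below. This is a sketch under assumptions: the splits are taken to run from (k-1, 1) down to (1, k-1), the inputs are assumed to be scaled already, RMSE on the test set is used as the score, and `neuron_splits` and `evaluate_splits` are illustrative names. `hidden_layer_sizes`, `learning_rate_init`, `max_iter`, and `random_state` are existing `MLPRegressor` parameters.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor


def neuron_splits(k: int = 25):
    """All two-layer splits of k neurons: (k-1, 1), (k-2, 2), ..., (1, k-1)."""
    return [(k - i, i) for i in range(1, k)]


def evaluate_splits(X_train, y_train, X_test, y_test, k: int = 25,
                    learning_rate_init: float = 0.001, max_iter: int = 200,
                    seed: int = 0):
    """Fit one MLPRegressor per split and return a list of (layers, test RMSE)."""
    y_test = np.asarray(y_test, dtype=float)
    results = []
    for layers in neuron_splits(k):
        model = MLPRegressor(
            hidden_layer_sizes=layers,
            learning_rate_init=learning_rate_init,  # best rate found in part 2
            max_iter=max_iter,
            random_state=seed,                      # fixed seed so splits are comparable
        )
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results.append((layers, float(np.sqrt(np.mean((y_test - pred) ** 2)))))
    return results
```

Sorting the returned list by RMSE gives the best split; repeating the loop over several seeds would show how much of the variation is due to random initialisation rather than architecture.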

Long Short-Term Memory (LSTM)

1) Describe the LSTM architecture, including the gates and state functions. How does an LSTM differ from an MLP? Discuss how the number of neurons and the batch size affect the performance of the network. [5 marks]

2) To create the LSTM model, apply Adaptive Moment Estimation (ADAM) to train the network. Identify an appropriate cost function to measure model performance based on the training samples and the related prediction outputs. To find the best epoch, based on your cost function results, complete 30 runs keeping the learning rate and the batch size constant at 0.01 and 4 respectively. Provide a line plot of the test and train cost function scores for each epoch. Report the summary statistics (Mean, Standard Deviation, Minimum and Maximum) of the cost function as well as the run time for each epoch. Choose the best epoch with justification. [5 marks]

3) Investigate the impact of varying the batch size: complete 30 runs keeping the learning rate constant at 0.01 and using the best number of epochs obtained in step 2. Report the summary statistics (Mean, Standard Deviation, Minimum and Maximum) of the cost function as well as the run time for each batch size. Choose the best batch size with justification. [5 marks]

4) Investigate the impact of varying the number of neurons in the hidden layer while keeping the number of epochs (step 2) and the batch size (step 3) constant for 30 runs. Report the summary statistics (Mean, Standard Deviation, Minimum and Maximum) of the cost function as well as the run time. Discuss how the number of neurons affects performance and identify the optimal number of neurons in your experiment. [5 marks]
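Steps 2 to 4 all repeat the same pattern: run one configuration 30 times, then report the mean, standard deviation, minimum, and maximum of the cost together with the run time. A framework-independent sketch of that bookkeeping is shown below. `train_once` is a hypothetical callable, supplied by you, that builds and trains the LSTM for a given seed and returns the final cost (for example the test MSE); nothing here depends on a particular deep learning library.

```python
import statistics
import time
from typing import Callable, Dict


def summarise_runs(train_once: Callable[[int], float], n_runs: int = 30) -> Dict[str, float]:
    """Call train_once(seed) n_runs times and summarise the cost and the run time."""
    costs, durations = [], []
    for seed in range(n_runs):
        start = time.perf_counter()
        costs.append(float(train_once(seed)))       # one full training run
        durations.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(costs),
        "std": statistics.stdev(costs),             # sample standard deviation, needs n_runs >= 2
        "min": min(costs),
        "max": max(costs),
        "mean_runtime_s": statistics.mean(durations),
    }
```

Calling `summarise_runs` once per candidate setting (each epoch count, batch size, or neuron count) yields one table row per setting, which makes the justification of the chosen value easier to present.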

Model Comparison

1) Plot model-specific actual and predicted PM concentration to visually compare the model performance. What is your observation? [3 marks]

2) Compare the performance of the models using RMSE. Which model performed better? Justify your finding. [3 marks]

Report Presentation [5 marks]