Programming Assignment CS4001/4042: Neural Networks and Deep Learning 2024

Deadline: 18th March 2024

●    This assignment is to be done individually. You can discuss the questions with others but your submission must be your own unique work.

●    Data files and other supporting code for both parts are found in the folder ‘Programming Assignment’ under ‘Assignments’ on NTULearn. Use the helper code as a starting point and fill in your solutions following the formats given there. Submissions that do not follow the specified format for the answers risk losing marks for presentation.

●    The assessment will be based on the correctness of the code and solutions. Each part carries 45 marks, and 10 marks are assigned to the presentation and clarity of the solutions. The total number of marks is 100.

●    You do not need GPUs for this assignment; a local PC will be sufficient. Attempt the assignment using Jupyter Lab / Notebook, version 3 or greater.

●    TAs Mr. Feng Ruicheng (RUICHENG002@e.ntu.edu.sg), Ms. Liang Zhenxin ([email protected]), and Mr. Liao Chang ([email protected]) are in charge of this assignment.

●    For any inquiries, please either contact the TAs listed above or post in the Discussion Board of NTULearn.

Submission procedure

●    Complete both parts A and B of the assignment and submit your solutions online via NTULearn before the deadline.

●    All submissions should be within the notebooks provided. Do not include data or model checkpoints in your submission. Submit the 8 notebooks, plus common_utils.py, in this format:

o  For part A

.     <lastname>_<firstname>_Part_A_1.ipynb
.     <lastname>_<firstname>_Part_A_2.ipynb
.     <lastname>_<firstname>_Part_A_3.ipynb
.     <lastname>_<firstname>_Part_A_4.ipynb
.     common_utils.py

o  For Part B

.     <lastname>_<firstname>_Part_B_1.ipynb
.     <lastname>_<firstname>_Part_B_2.ipynb
.     <lastname>_<firstname>_Part_B_3.ipynb
.     <lastname>_<firstname>_Part_B_4.ipynb

●    Late submissions will be penalized: 5% for each day up to three days.

Part A: Classification Problem

Part A of this assignment aims to build neural networks that perform polarity detection from voice recordings, based on data in the National Speech Corpus, obtained from https://www.imda.gov.sg/how-we-can-help/national-speech-corpus

The National Speech Corpus is an initiative by the Info-Communications and Media Development Authority, and it is the first large-scale Singapore English corpus. The dataset contains six parts. In the fifth part, speakers are made to communicate in several different styles, including Positive Emotions and Negative Emotions. The original recordings are approximately 20 minutes long. Using the librosa library, the recordings are split into shorter segments and preprocessed into features such as chromagrams, Mel spectrograms, MFCCs and various others.

A preprocessed CSV file, simplified.csv, is provided with this assignment. Each row contains 77 engineered features that you can use, together with a filename; the “filename” column carries the associated polarity labels. The aim is to determine the speech polarity from this engineered feature dataset.


Question A1 (15 marks)

Design a feedforward deep neural network (DNN) which consists of three hidden layers of 128 neurons each with ReLU activation function, and an output layer with sigmoid activation function. Apply dropout of probability 0.2 to each of the hidden layers.
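For reference, one way to express this architecture in PyTorch is sketched below (a minimal sketch, not the required solution); the class name and the 77-feature input size are assumptions based on the dataset description above.

```python
# A minimal sketch of the A1 architecture; names and the input size are assumptions.
import torch.nn as nn

class PolarityDNN(nn.Module):
    def __init__(self, n_features=77, hidden=128, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # sigmoid output for binary polarity
        )

    def forward(self, x):
        return self.net(x)
```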

Divide the dataset into a 75:25 ratio for training and testing. Use appropriate scaling of the input features. Assume that there are only two datasets here: training and test (no separate validation set).

Use the training dataset to train the model for up to 100 epochs. Use mini-batch gradient descent with the Adam optimizer (learning rate 0.001) and batch size 128. Implement early stopping with a patience of 3.
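A hedged sketch of this training setup is given below, assuming X and y are NumPy arrays of the 77 features and binary labels, and reusing the PolarityDNN sketch above. Early stopping is monitored on the test loss here, since only two partitions exist.

```python
# Sketch: 75:25 split, scaler fitted on the training split only, Adam (lr=0.001),
# batch size 128, and simple early stopping with patience 3.
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
scaler = StandardScaler().fit(X_train)            # fit on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

train_loader = DataLoader(
    TensorDataset(torch.tensor(X_train, dtype=torch.float32),
                  torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)),
    batch_size=128, shuffle=True)

model = PolarityDNN()                             # from the A1 sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.BCELoss()

X_te = torch.tensor(X_test, dtype=torch.float32)
y_te = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
best_loss, patience, wait = float("inf"), 3, 0
for epoch in range(100):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        test_loss = criterion(model(X_te), y_te).item()
    if test_loss < best_loss:
        best_loss, wait = test_loss, 0
    else:
        wait += 1
        if wait >= patience:
            break                                 # early stopping triggered
```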

Plot the train and test accuracies and losses against training epochs and comment on the line plots.

Question A2 (10 marks)

In this question, we will determine the optimal batch size for mini-batch gradient descent by training the neural network and evaluating its performance for different batch sizes. Note: use 5-fold cross-validation on the training partition to perform the hyperparameter selection. You will have to reconsider the scaling of the dataset during the 5-fold cross-validation.

Plot the mean cross-validation accuracies at the final epoch for the different batch sizes as a scatter plot. Limit the search space to batch sizes {64, 128, 256, 512, 1024}.
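A sketch of the search loop is below; build_model() and train_and_eval(...) are hypothetical helpers (a fresh A1 network, and a routine returning final-epoch validation accuracy plus last-epoch wall time, which feeds the timing table requested next). Note the scaler is refitted inside each fold, on that fold's training split only, to avoid leaking validation statistics.

```python
# Sketch of the batch-size search with per-fold scaling; helper names are hypothetical.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

batch_sizes = [64, 128, 256, 512, 1024]
cv_acc, epoch_time = {}, {}
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for bs in batch_sizes:
    accs, times = [], []
    for tr_idx, va_idx in kf.split(X_train):
        scaler = StandardScaler().fit(X_train[tr_idx])   # fold-local scaling
        X_tr, X_va = scaler.transform(X_train[tr_idx]), scaler.transform(X_train[va_idx])
        model = build_model()                            # hypothetical helper
        acc, t = train_and_eval(model, X_tr, y_train[tr_idx],
                                X_va, y_train[va_idx], batch_size=bs)  # hypothetical
        accs.append(acc)
        times.append(t)
    cv_acc[bs], epoch_time[bs] = np.mean(accs), np.mean(times)
```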

Next, create a table of time taken to train the network on the last epoch against different batch sizes.

Finally, select the optimal batch size and state a reason for your selection.

Question A3 (10 marks)

Find the optimal combination of depth and widths of the hidden layers for the neural network designed in Questions A1 and A2.

Plot the mean cross-validation accuracies at the final epoch for the different combinations of depth and widths using a scatter plot. Limit the search space of the combinations to {[64], [128], [256], [64, 64], [128, 128], [256, 256], [64, 128], [128, 64], [128, 256], [64, 256], [256, 128], [256, 64], [64, 64, 64], [128, 128, 128], [256, 256, 256], [64, 128, 256], [64, 256, 256], [128, 64, 64], [128, 128, 64], [256, 128, 64], [256, 256, 128]}. Continue using 5-fold cross-validation on the training dataset. A helper for instantiating networks of variable depth and width is sketched below.
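The sketch assumes each hidden layer keeps the ReLU + dropout pattern from Question A1.

```python
# Sketch: build a feedforward net whose hidden sizes follow `widths`, e.g. [256, 128, 64].
import torch.nn as nn

def build_dnn(widths, n_features=77, p_drop=0.2):
    layers, in_dim = [], n_features
    for w in widths:
        layers += [nn.Linear(in_dim, w), nn.ReLU(), nn.Dropout(p_drop)]
        in_dim = w
    layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]
    return nn.Sequential(*layers)
```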

Select the optimal combination for the depth and widths. State the rationale for your selection.

Plot the train and test accuracies against training epochs for the optimal combination of depth and widths using a line plot.

[optional, +2 marks] Implement an alternative approach for searching through these combinations that significantly reduces the computational time while achieving similar search results, without enumerating all the possibilities.
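One possible direction (an assumption, not the only acceptable answer) is a greedy layer-wise search: fix the best width one layer at a time and stop when depth no longer helps, training at most 9 candidates instead of 21. cross_val_acc(widths) is a hypothetical helper returning the mean 5-fold CV accuracy for a given width list.

```python
# Sketch of a greedy layer-wise search; cross_val_acc is a hypothetical 5-fold CV helper.
best, best_score = [], -float("inf")
for depth in range(3):
    scores = {w: cross_val_acc(best + [w]) for w in (64, 128, 256)}
    w_star = max(scores, key=scores.get)
    if scores[w_star] <= best_score:
        break                                    # a deeper network no longer helps
    best, best_score = best + [w_star], scores[w_star]
```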

Note: use this optimal combination for the rest of the experiments.

Question A4 (10 marks)

In this section, we will explore the utility of such a neural network in real world scenarios.

Please use the real recording named ‘record.wav’ as a test sample. Preprocess it using the provided preprocessing script (data_preprocess.ipynb) and prepare the test dataset.

Make a model prediction on this test sample and obtain the predicted label using a threshold of 0.5. Use the optimized pretrained model with the selected optimal batch size and optimal combination of depth and widths.
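A sketch of the inference step, assuming features is the 77-dimensional row produced by data_preprocess.ipynb, and scaler / model are the fitted scaler and tuned network from the earlier questions:

```python
# Sketch: scale the engineered features, run the model, threshold at 0.5.
import torch

x = torch.tensor(scaler.transform(features.reshape(1, -1)), dtype=torch.float32)
model.eval()
with torch.no_grad():
    prob = model(x).item()       # sigmoid output in [0, 1]
label = int(prob > 0.5)          # predicted polarity label
```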

Find the most important features for the model prediction on the test sample using SHAP. Plot the local feature importance with a force plot and explain your observations. (Refer to the documentation and these three useful references:

https://christophm.github.io/interpretable-ml-book/shap.html#examples-5

https://towardsdatascience.com/deep-learning-model-interpretation-using-shap-a21786e91d16

https://medium.com/mlearning-ai/shap-force-plots-for-classification-d30be430e195)
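A hedged sketch using shap.KernelExplainer, which only needs a predict function; background (a small sample of scaled training rows), x_sample (the scaled test row) and feature_names are assumptions.

```python
# Sketch: local SHAP explanation of the single test sample with a force plot.
import shap
import torch

def predict_fn(x):
    # Wrap the trained PyTorch model so it accepts/returns NumPy arrays.
    with torch.no_grad():
        return model(torch.tensor(x, dtype=torch.float32)).numpy().ravel()

explainer = shap.KernelExplainer(predict_fn, background)   # background: ~100 scaled train rows
shap_values = explainer.shap_values(x_sample)              # x_sample: shape (1, 77)

shap.force_plot(explainer.expected_value, shap_values[0],
                features=x_sample[0], feature_names=feature_names,
                matplotlib=True)
```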

Part B: Regression Problem

Part B of this assignment uses publicly available data on HDB flat prices in Singapore, obtained from data.gov.sg on 17th August 2023. The original dataset was combined with other datasets to include more informative features, and the result is given in the ‘hdb_price_prediction.csv’ file.


Question B1 (15 marks)

Real-world datasets often have a mix of numeric and categorical features – this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:

- Numeric / Continuous features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm

- Categorical features: month, town, flat_model_type, storey_range

Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data. Do not use data from year 2022 and year 2023.

Refer to the documentation of PyTorch Tabular (https://pytorch-tabular.readthedocs.io/en/latest/#usage) and perform the following tasks:

●    Use DataConfig to define the target variable, as well as the names of the continuous and categorical variables.

●    Use TrainerConfig to automatically tune the learning rate. Set batch_size to 1024 and max_epochs to 50.

●    Use CategoryEmbeddingModelConfig to create a feedforward neural network with 1 hidden layer containing 50 neurons.

●    Use OptimizerConfig to choose the Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) or a scheduler.

●    Use TabularModel to initialise the model and put all the configs together.
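For reference, a minimal end-to-end sketch is below. It assumes a year column is available (derive one from month if necessary) and that the target column is resale_price; see the PyTorch Tabular documentation for the full APIs.

```python
# Sketch: year-based splits plus the four configs wired into TabularModel.
import pandas as pd
from pytorch_tabular import TabularModel
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig
from pytorch_tabular.models import CategoryEmbeddingModelConfig

df = pd.read_csv("hdb_price_prediction.csv")
train_df = df[df["year"] <= 2019]        # assumed `year` column
val_df = df[df["year"] == 2020]
test_df = df[df["year"] == 2021]

data_config = DataConfig(
    target=["resale_price"],
    continuous_cols=["dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality",
                     "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm"],
    categorical_cols=["month", "town", "flat_model_type", "storey_range"],
)
trainer_config = TrainerConfig(auto_lr_find=True, batch_size=1024, max_epochs=50)
model_config = CategoryEmbeddingModelConfig(task="regression", layers="50")  # 1 hidden layer, 50 neurons
optimizer_config = OptimizerConfig(optimizer="Adam")

tabular_model = TabularModel(data_config=data_config, model_config=model_config,
                             optimizer_config=optimizer_config, trainer_config=trainer_config)
tabular_model.fit(train=train_df, validation=val_df)
result = tabular_model.evaluate(test_df)
pred_df = tabular_model.predict(test_df)   # RMSE / R2 can be computed from this with sklearn.metrics
```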

Report the test RMSE and the test R2 value that you obtained.

Print out the corresponding rows of the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.

Question B2 (10 marks)

In Question B1, the default settings in PyTorch Tabular were enough to perform a quick experiment. In this part of the assignment, we will try out a new model.

In Question B1, we used the Category Embedding model, which creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will make use of a library called Pytorch-WideDeep. This library makes it easy to work with multimodal deep-learning problems combining images, text, and tabular data. We will just be utilising the deeptabular component of this library through the TabMlp network:

Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

Refer to the documentation of Pytorch-WideDeep and perform the following tasks:

https://pytorch-widedeep.readthedocs.io/en/latest/index.html

●    Use TabPreprocessor to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.

●   Create the TabMlp  model  with  2  linear  layers  in  the  MLP,  with  200  and  100  neurons respectively.

●   Create a Trainer for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the num_workers parameter to 0.)
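For reference, a minimal sketch of this pipeline is below. Argument names such as cat_embed_cols / cat_embed_input can vary slightly between pytorch-widedeep versions; cont_cols and cat_cols are the B1 feature lists and resale_price is the target, as before.

```python
# Sketch: deeptabular pipeline with TabPreprocessor -> TabMlp -> Trainer.
from pytorch_widedeep import Trainer
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor

tab_preprocessor = TabPreprocessor(cat_embed_cols=cat_cols, continuous_cols=cont_cols)
X_tab_train = tab_preprocessor.fit_transform(train_df)
X_tab_test = tab_preprocessor.transform(test_df)

tab_mlp = TabMlp(column_idx=tab_preprocessor.column_idx,
                 cat_embed_input=tab_preprocessor.cat_embed_input,
                 continuous_cols=cont_cols,
                 mlp_hidden_dims=[200, 100])     # 2 linear layers: 200 and 100 neurons
model = WideDeep(deeptabular=tab_mlp)

# The brief asks for num_workers=0; depending on the library version this kwarg
# may belong on the Trainer or on fit.
trainer = Trainer(model, objective="root_mean_squared_error", num_workers=0)
trainer.fit(X_tab=X_tab_train, target=train_df["resale_price"].values,
            n_epochs=100, batch_size=64)
preds = trainer.predict(X_tab=X_tab_test)        # for test RMSE / R2
```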

Report the test RMSE and the test R2 value that you obtained.

P.S. It is good practice to try out standard machine learning models before using deep learning techniques. XGBoost is a popular option for tabular datasets.

Question B3 (10 marks)

Besides ensuring that your neural network performs well, it is important to be able to explain the model’s decision. Captum is a very handy library that helps you to do so for PyTorch models.

Many model explainability algorithms for deep learning models are available in Captum. These algorithms are often used to generate an attribution score for each feature. Features with larger scores are more ‘important’, and some algorithms also provide information about directionality (e.g. a feature with a very negative attribution score means that the larger the value of that feature, the lower the value of the output).

In general, these algorithms can be grouped into two paradigms:

-          perturbation based approaches (e.g. Feature Ablation)

-          gradient / backpropagation based approaches (e.g. Saliency)

The former adopts a brute-force approach of removing / permuting features one by one and does not scale well. The latter depends on gradients, which can be computed relatively quickly. But unlike backpropagation, which computes gradients with respect to the weights, the gradients here are computed with respect to the input. This gives us a sense of how much a change in the input affects the model’s output.

First, load the dataset following the splits in Question B1. To keep things simple, we will limit our analysis to numeric / continuous features only. Drop all categorical features from the dataframes. Do not standardise the numerical features for now.

Follow this tutorial to generate the plots from the various model explainability algorithms (https://captum.ai/tutorials/House_Prices_Regression_Interpret).

Specifically, make the following changes:

- Use a feedforward neural network with 3 hidden layers, each having 5 neurons. Train using Adam optimiser with learning rate of 0.001.

- Use Saliency, Input x Gradients, Integrated Gradients, GradientSHAP, Feature Ablation
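A sketch of running the five attribution methods with Captum is below; model is the trained 3x5 feedforward network and X_test a float tensor of the six continuous features (both assumptions). Note that GradientShap requires an explicit baseline distribution, which is part of what the next task probes.

```python
# Sketch: compute attributions with five Captum algorithms on the test inputs.
import torch
from captum.attr import (FeatureAblation, GradientShap, InputXGradient,
                         IntegratedGradients, Saliency)

model.eval()
algorithms = {
    "Saliency": Saliency(model),
    "Input x Gradients": InputXGradient(model),
    "Integrated Gradients": IntegratedGradients(model),
    "GradientSHAP": GradientShap(model),
    "Feature Ablation": FeatureAblation(model),
}

attributions = {}
for name, algo in algorithms.items():
    if name == "GradientSHAP":
        # GradientShap needs baselines; an all-zero row mirrors the other methods' defaults.
        attributions[name] = algo.attribute(X_test,
                                            baselines=torch.zeros(1, X_test.shape[1]))
    else:
        attributions[name] = algo.attribute(X_test)
```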

Train a separate model with the same configuration, but now standardise the features via StandardScaler (fit it to the training set, then transform all splits). State your observations with respect to GradientShap and explain why this has occurred.

(Hint: Many gradient-based approaches depend on a baseline, which is an important choice to be made. Check the default baseline settings carefully.)

Read https://distill.pub/2020/attribution-baselines/ to build up your understanding of Integrated Gradients (IG). You might find the descriptions and comparisons in the Captum documentation useful as well.

Then, answer the following questions in the context of our dataset:

- Why did Saliency produce scores similar to IG?

- Why did Input x Gradients give the same attribution scores as IG?

Question B4 (10 marks)

Model degradation is a common issue faced when deploying machine learning models (including neural networks) in the real world. New data points could exhibit a different pattern from older data points due to factors such as changes in government policy or market sentiments. For instance, housing prices in Singapore have been increasing and the Singapore government has introduced 3 rounds of cooling measures over the past years (16 December 2021, 30 September 2022, 27 April 2023).

In such situations, the distribution of the new data points could differ from the original data distribution that the models were trained on. Recall that machine learning models often work with the assumption that the test distribution is similar to the train distribution. When this assumption is violated, model performance will be adversely impacted. In the last part of this assignment, we will investigate to what extent model degradation has occurred.

Your co-investigators used a linear regression model to rapidly test several combinations of train/test splits and shared their findings with you in a brief report, attached in Appendix A below. You wish to investigate whether your deep learning model corroborates their findings.

Evaluate your model from B1 on data from year 2022 and report the test R2.

Evaluate your model from B1 on data from year 2023 and report the test R2.

Did model degradation occur for the deep learning model?

Model degradation could be caused by various data distribution shifts: covariate shift (features), label shift, and/or concept drift (an altered relationship between features and labels).

Using the Alibi Detect library, apply the TabularDrift function with the training data (year 2019 and before) as the reference set and detect which features have drifted in the 2023 test dataset. Before running the statistical tests, ensure you sample 1000 data points each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide: https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)
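A sketch following the linked example is below; the 1000-point samples, feature_names, and the indices of categorical columns (cat_idx) are assumptions built from the B1 data.

```python
# Sketch: per-feature drift detection between the reference (<= 2019) and 2023 data.
import numpy as np
from alibi_detect.cd import TabularDrift

rng = np.random.default_rng(0)
X_ref = X_train_all[rng.choice(len(X_train_all), 1000, replace=False)]
X_t = X_2023[rng.choice(len(X_2023), 1000, replace=False)]

# Mark categorical columns; None lets Alibi infer the categories from X_ref.
categories_per_feature = {i: None for i in cat_idx}

cd = TabularDrift(X_ref, p_val=0.05, categories_per_feature=categories_per_feature)
fpreds = cd.predict(X_t, drift_type="feature")
for f in range(cd.n_features):
    drifted = "drift" if fpreds["data"]["is_drift"][f] else "no drift"
    print(f'{feature_names[f]}: p={fpreds["data"]["p_val"][f]:.4f} ({drifted})')
```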

Assuming that the flurry of housing measures has made an impact on the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to the model degradation?

From your analysis via TabularDrift, which features contribute to this shift?

Suggest 1 way to address model degradation and implement it, showing improved test R2  for year 2023.
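One hedged option (an assumption, not the only acceptable answer) is to retrain the B1 model with more recent, post-shift data included in the training window, reusing the config objects and the assumed year column from the B1 sketch.

```python
# Sketch: extend the training window to include post-measure data, then re-evaluate on 2023.
train_recent = df[df["year"] <= 2022]
test_2023 = df[df["year"] == 2023]

tabular_model_v2 = TabularModel(data_config=data_config, model_config=model_config,
                                optimizer_config=optimizer_config,
                                trainer_config=trainer_config)
tabular_model_v2.fit(train=train_recent)
result_2023 = tabular_model_v2.evaluate(test_2023)   # compare the R2 against the B1 model's
```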

Appendix A

Here are our results from a linear regression model. We used StandardScaler for continuous variables and OneHotEncoder for categorical variables.

While 2021 data can be predicted well, test R2  dropped rapidly for 2022 and 2023.

Training set     Test set     Test R2
Year <= 2020     2021         0.76
Year <= 2020     2022         0.41
Year <= 2020     2023         0.10

Similarly, a model trained on 2017 data can predict 2018-2021 well (with slight degradation in performance for 2021), but drops drastically in 2022 and 2023.

Training set     Test set     Test R2
2017             2018         0.90
2017             2019         0.89
2017             2020         0.87
2017             2021         0.72
2017             2022         0.37
2017             2023         0.09

With the test set fixed at year 2021, training on data from 2017-2020 still works well on the test data, with minimal degradation. Training sets closer to year 2021 generally do better.

Training set     Test set     Test R2
2020             2021         0.81
2019             2021         0.75
2018             2021         0.73
2017             2021         0.72
