MATH96007 - MATH97019 - MATH97097 Methods for Data Science
Department of Mathematics
Years 3/4/5
Coursework 1 – Supervised learning
Please read carefully the following instructions.
The goal of this coursework is to analyse two data sets using several tools and algorithms introduced in the lectures, which you have also studied in detail through the weekly Python notebooks containing the computational tasks.
You will solve the tasks in this coursework using Python. You are allowed to use Python code that you have developed in your coding tasks, as well as the basic mathematical functions in numpy that you need to write your own code. Importantly, unless explicitly stated, you are not allowed to use for your solutions any model-level Python packages (e.g., sklearn, statsmodels, etc.) or ready-made code found online.
Submission
The submission of your coursework will consist of two items:
● a Jupyter notebook (file format: ipynb) with all your tasks clearly labelled. You should use the template called SurnameCID_CW1.ipynb, which is provided on Blackboard.
The notebook should contain the cells with your code and their output, plus some brief text explaining your calculations, choices, mathematical reasoning, and discussion of results. (Note: Before submitting, you must run the notebook so that the outputs of the cells are printed.) You may produce your notebook with Google Colab, or develop it through the Anaconda environment (or any local Python environment) installed on your computer.
● Once you have executed all cells in your notebook and their outputs are printed, you should also save the notebook as an html file, which you will also submit.
Submission instructions
The submission will be done online via Turnitin on Blackboard.
The deadline is Wednesday, 23 February 2022 at 4 pm.
You will upload two documents to Blackboard, wrapped into a single zip file:
1) Your Jupyter notebook as an ipynb file.
2) Your notebook exported as an html file.
You are also required to comply with these specific requirements:
● Name your zip file as ‘SurnameCID_CW1.zip’, e.g. Smith123456_CW1.zip. Do not submit multiple files.
● Your ipynb file must produce all plots that appear in your html file, i.e., make sure you have run all cells in the notebook before exporting the html.
● The notebook should have clear headings to indicate the answers to each question, e.g. ‘Task 1.1’.
Note about online submissions:
● There are known issues with particular browsers (or settings with cookies or popup blockers) when submitting to Turnitin. If the submission 'hangs', please try another browser.
● You should also check that your files are not empty or corrupted after submission.
● To avoid last-minute problems with your online submission, we recommend that you upload versions of your coursework early, before the deadline. You will be able to update your coursework until the deadline, and having these early versions provides you with a safety backup.
Needless to say, projects must be your own work: you may discuss the analysis with your colleagues, but the code, writing, figures and analysis must be your own. The Department may use code profiling and tools such as Turnitin to check for plagiarism, which will not be tolerated.
Marks
The coursework is worth 40% of your total mark for the course.
This coursework contains a mastery component for MSc and 4th year MSci students.
Some general guidance about writing your solutions and marking scheme:
Coursework tasks are different from exams. Sometimes they can be more open-ended and may require going beyond what we have covered explicitly in lectures. In some parts of the tasks, initiative and creativity will be important, as is the ability to pull together the mathematical content of the course, drawing links between subjects and methods, and backing up your analysis with relevant computations that you will need to justify.
To gain the marks for each of the Tasks you are required to:
(1) complete the task as described;
(2) comment any code so that we can understand each step;
(3) provide a brief written introduction to the task explaining what you did and why you did it;
(4) provide appropriate, relevant, clearly labelled figures documenting and summarising your findings;
(5) provide an explanation of your findings in mathematical terms based on your own computations and analysis and linking the outcomes to concepts presented in class or in the literature;
(6) consider summarising your results for different methods and options with a judicious use of summary tables or figures.
The quality of presentation and communication is very important, so use good combinations of tables and figures to present your results, as needed.
Explanation and understanding of the mathematical concepts are crucial.
Marks will be reserved and allocated for: presentation; quality of code; clarity of arguments; explanation of choices made and alternatives considered; mathematical interpretation of the results obtained; as well as additional relevant work that shows initiative and understanding beyond the task stated in the coursework.
Code: Competent Python code is expected. As stated above, you are allowed to use your own code and the code developed in the coding tasks in the course. Copy-pasting code from other sources (e.g., online) is not allowed. You are expected to develop your own code for the specific tasks starting from your Python notebooks containing the coding tasks. You are not allowed to use Python packages like sklearn, statsmodels, etc unless explicitly stated.
Note that the mere addition of extra calculations (or ready-made 'pipelines') that are unrelated to the task without a clear explanation and justification of your rationale will not be beneficial in itself and, in fact, can also be detrimental if it reveals lack of understanding of the required task.
Coursework
In this coursework, you will work with two different data sets of high-dimensional samples:
● a data set of chemicals evaluated for toxicity
● a medical data set characterising tumour malignancy in cancer
You will perform a regression task with the former, and a binary classification task with the latter.
Task 1: Regression (45 marks)
Data set: Your first task deals with a modified chemistry data set that we have prepared based on molecular descriptors for an array of chemicals, each associated with a level of toxicity towards fish as measured in experiments with Pimephales promelas. (If you click the link you might understand why we thought it better to stick with the Greek name for the genus of this fish.) Each sample in the dataset (rows) corresponds to a chemical substance characterised by 10 features (molecular descriptors, columns). We will consider the toxicity level (column ‘LC50’) as the target variable to regress, while the other 10 variables are our predictors.
● This modified chemistry data set is made available to you on Blackboard as chemistry_samples.csv.
● We also provide on Blackboard a test set in the file chemistry_test.csv.
Important: The test set should not be used in any learning, either parameter training or hyper-parameter tuning of the models. In other words, the test set should be put aside and reserved as unseen, and only be used a posteriori to support your conclusions and to evaluate the out-of-sample performance of your models.
Questions:
1.1 Linear regression (10 marks)
1.1.1 - Use the data set chemistry_samples.csv to obtain a linear regression model to predict the toxicity factor LC50 as your target variable using all the other features as predictors. Report the inferred values of the model parameters and the in-sample R2 score for the data set.
1.1.2 - Apply the model to the test (unseen) data (chemistry_test.csv) to predict the target variable, and compute the out-of-sample R2 score on this test set. Compare the out-of-sample and the in-sample R2 score, and explain your findings.
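To make the numpy-only constraint concrete, a minimal sketch of least-squares fitting and the R2 score might look as follows (the function names are illustrative, not part of the template):

```python
import numpy as np

def fit_linear(X, y):
    # Append a column of ones for the intercept and solve the
    # least-squares problem with numpy alone (no sklearn).
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w  # w[0] is the intercept, w[1:] the feature coefficients

def predict(X, w):
    return w[0] + X @ w[1:]

def r2_score(y_true, y_pred):
    # R2 = 1 - RSS/TSS, with TSS taken about the mean of y_true.
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / tss
```

The same `r2_score` helper can then be reused for the out-of-sample evaluation on the test set.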
1.2 Ridge regression (20 marks)
1.2.1 - Use the data set chemistry_samples.csv and repeat task 1.1.1 employing Ridge regression with 5-fold cross-validation to tune the penalty hyper-parameter of the model. Using the average mean squared errors (MSE) over all folds, demonstrate with plots how you scan the penalty hyper-parameter to find its optimal value. Report the optimal value for the penalty parameter and the performance of the model in terms of MSE. Using some of your computations, explain the trend of bias, variance and MSE as a function of the penalty hyper-parameter.
1.2.2 - Fix the penalty hyper-parameter to the optimal value found in 1.2.1 and retrain the model on the entire data set chemistry_samples.csv. Obtain the in-sample R2 score when applied to chemistry_samples.csv, and compare it to the out-of-sample R2 score on the test set chemistry_test.csv. Use some of your computations to discuss the differences between ridge regression and linear regression (Task 1.1).
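As a sketch of how the closed-form ridge solution and the fold-averaged MSE could be organised in numpy (function names and the fold construction are our own choices, not prescribed by the task):

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Closed-form ridge solution; the intercept is not penalised.
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])
    P = np.eye(p + 1)
    P[0, 0] = 0.0  # do not shrink the intercept term
    return np.linalg.solve(Xb.T @ Xb + lam * P, Xb.T @ y)

def cv_mse(X, y, lam, k=5, seed=0):
    # Average mean squared error over k folds for a given penalty lam.
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        w = fit_ridge(X[train], y[train], lam)
        pred = w[0] + X[f] @ w[1:]
        errs.append(np.mean((y[f] - pred) ** 2))
    return np.mean(errs)
```

Scanning `lam` over a grid and plotting `cv_mse` against it is one way to produce the requested hyper-parameter plot.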
1.3 Relaxation of Lasso regression (15 marks)
1.3.1 - In this task, you will implement a relaxation of the Lasso optimisation that can be solved using gradient descent. Consider a smooth version of Lasso in which the penalty term of the cost function is approximated through smooth functions L_c (Huber functions) so that it is differentiable. The cost function to be optimised is given by:

E(w) = Σ_i (y_i − w^T x_i)^2 + λ Σ_{j=1}^{p} L_c(w_j),

where p is the number of predictors, λ is the penalty hyper-parameter, and c regulates the ‘sharpness’ of the Huber functions. Here we will fix c = 0.001. A skeleton for the code of this task is provided within the SurnameCID_CW1.ipynb template on Blackboard.
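For orientation only (the skeleton in the template is authoritative), one standard Huber smoothing of the absolute value is quadratic for |z| ≤ c and linear beyond, together with its derivative for the gradient-descent update:

```python
import numpy as np

def huber(z, c=0.001):
    # Smooth approximation to |z|: quadratic near zero, linear beyond c.
    return np.where(np.abs(z) <= c, z**2 / (2 * c), np.abs(z) - c / 2)

def huber_grad(z, c=0.001):
    # Derivative of the Huber function, used in the gradient-descent step.
    return np.where(np.abs(z) <= c, z / c, np.sign(z))
```

Note that `huber` and its derivative are continuous at |z| = c, which is what makes the relaxed cost differentiable everywhere.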
1.3.2 - Use the data set chemistry_samples.csv and employ this relaxed Lasso-Huber regression with 5-fold cross-validation (with the same folds as in 1.2.1) to conduct a grid search to find the optimal penalty hyper-parameter λ. Report the in-sample and out-of-sample performance of the model with the optimal penalty hyper-parameter, as in 1.2.2, using the R2 score.
1.3.3 - Discuss the differences in the regression coefficients obtained through Lasso (Task 1.3) and Ridge (Task 1.2) regressions.
Task 2: Classification (55 marks)
Data set: Your second task deals with the classification of breast tumour samples as ‘benign’ or ‘malignant’ based on 30 features. The column ‘DIAGNOSIS’ corresponds to the tumour classification, where ‘B’ stands for ‘benign’ and ‘M’ for ‘malignant’. The other 30 columns correspond to the features.
● The data set is available on Blackboard under file tumour_samples.csv.
● The test set is in the file tumour_test.csv.
● We also provide a balanced data set tumour_samples_bal.csv which is used in Tasks 2.3.3 and 3.2.2.
Important: The test set should not be used in any learning, either of parameter training or hyper-parameter tuning of the models. In other words, the test set should be reserved as unseen, and only be used a posteriori to support your conclusions and to evaluate the out-of-sample performance of your models.
Questions:
2.1 kNN classifier (10 marks)
2.1.1 - Train a k-Nearest Neighbour (kNN) classifier on the data set (tumour_samples.csv). Demonstrate that you have used a grid search with 5-fold cross-validation to find an optimal value of the hyper-parameter k.
2.1.2 - As in 1.2.2, fix the optimal k and retrain the model on the entire data set tumour_samples.csv. Use accuracy to compare the performance of your optimised classifier on the data set tumour_samples.csv to the performance on the test data set tumour_test.csv.
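A kNN prediction can be written in a few lines of numpy; the sketch below (names are ours) uses Euclidean distance and a majority vote among the k nearest training points:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    # Classify each query point by majority vote among its k nearest
    # training points under the Euclidean distance.
    preds = []
    for x in X_query:
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)
```

Wrapping this in a loop over candidate values of k, scored by 5-fold cross-validated accuracy, gives the requested grid search.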
2.2 Random forest (20 marks)
2.2.1 - Train a random forest classifier on the data set tumour_samples.csv employing cross-entropy as your information criterion for the splits in the decision trees. Use the same 5-fold cross-validation subsets as in 2.1.1 to explore and optimise over suitable ranges the following hyper-parameters: (i) number of decision trees; (ii) depth of trees. Use accuracy as the measure of performance for this hyper-parameter optimisation.
2.2.2 - Compare the performance of your optimal random forest classifier on the data set tumour_samples.csv to the performance on the test data tumour_test.csv using different measures computed from the confusion matrix, in particular commenting on accuracy, recall and F1-score.
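As a sketch of the cross-entropy (information) criterion used to score candidate splits in each decision tree, in numpy (function names are ours):

```python
import numpy as np

def entropy(y):
    # Shannon entropy (in bits) of the class labels at a node.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, mask):
    # Entropy reduction from splitting y into y[mask] and y[~mask],
    # weighting each child node by its fraction of the samples.
    n = len(y)
    left, right = y[mask], y[~mask]
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - child
```

A split is chosen to maximise this gain; the forest then combines trees grown on bootstrap samples with random feature subsets.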
2.3 Support vector machine (SVM) (25 marks)
2.3.1 - Train a soft margin linear SVM classifier on the data set tumour_samples.csv using the same 5-fold cross-validation subsets as in 2.1.1 to optimise the hardness hyper-parameter that regulates the boundary violation penalty. Use accuracy as a measure of performance for this hyper-parameter optimisation. Display the accuracy of the SVM classifiers as the hardness hyper-parameter is varied, and discuss the limits of low hardness and high hardness.
2.3.2 - Evaluate the performance of the SVM classifiers obtained as the hardness hyper-parameter is varied by applying each of them to the test data tumour_test.csv. Represent your results using a receiver operating characteristic (ROC) curve. Use the ROC curve to discuss your choice of the optimal hardness hyper-parameter obtained in 2.3.1.
2.3.3 - Repeat tasks 2.3.1 and 2.3.2 but now training on the balanced data set tumour_samples_bal.csv. Using ROC curves (or other measures), compare and discuss the performance of SVM classifiers learnt from the (unbalanced) data set tumour_samples.csv obtained in Tasks 2.3.1 and 2.3.2 versus SVM classifiers learnt from the balanced data set tumour_samples_bal.csv obtained in Task 2.3.3.
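One way to build a ROC curve from classifier scores without sklearn is to sweep the decision threshold across the sorted scores; a minimal sketch, assuming labels coded 0/1 with 1 the positive class (names are ours):

```python
import numpy as np

def roc_curve(scores, labels):
    # Sort by decreasing score, then record cumulative true/false
    # positive rates as the threshold is lowered past each score.
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)        # positives captured so far
    fp = np.cumsum(1 - labels)    # negatives captured so far
    tpr = tp / labels.sum()
    fpr = fp / (len(labels) - labels.sum())
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auc(fpr, tpr):
    # Area under the ROC curve by the trapezoidal rule.
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
```

Plotting one such curve per hardness value gives a direct visual comparison of the classifiers on the test set.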
Task 3: Mastery component (25 marks)
This task is to be completed by MSci (4th year) and MSc students.
3.1 Logistic regression and bagging (15 marks)
3.1.1 - Train a logistic regression classifier on the data set (tumour_samples.csv) with gradient descent for 5000 iterations, using a learning rate of 0.005. Measure the accuracy of the classifier using the decision threshold 0.5.
3.1.2 - Implement code that applies bagging to the training of the logistic regression classifier demonstrating that you have used a grid search with 5-fold cross-validation to choose the optimal number of bootstrap samples. Use accuracy as the measure of performance for the grid search, setting the decision threshold to 0.5, the gradient descent iterations to 5000, and the learning rate to 0.005, as in task in 3.1.1.
3.1.3 - Using the test data set tumour_test.csv, discuss the accuracy of the model achieved through bagging in 3.1.2 compared to the accuracy obtained in 3.1.1. To calculate the accuracy, use the decision threshold 0.5 as in 3.1.1.
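A minimal numpy sketch of batch gradient descent on the logistic cross-entropy loss, with the stated learning rate and iteration count as defaults (names are illustrative; bagging would wrap this by training on bootstrap resamples and averaging the predicted probabilities):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.005, n_iter=5000):
    # Batch gradient descent on the mean cross-entropy loss.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w

def predict_logistic(X, w, threshold=0.5):
    # Classify as 1 when the predicted probability reaches the threshold.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ w) >= threshold).astype(int)
```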
3.2 Kernelised SVM classifier (10 marks)
3.2.1 - Starting from the code of the linear SVM (task 2.3.1), implement a soft margin kernelised SVM classifier with a Radial Basis Function (RBF) kernel.
3.2.2 - Fix the hardness hyper-parameter to the optimal value found in 2.3.3 and train the soft margin kernelised SVM classifier with RBF kernel on the data set tumour_samples_bal.csv, using the same 5-fold cross-validation subsets as in 2.3.3. Demonstrate that you have used a grid search with 5-fold cross-validation to find the optimal hyper-parameter of the RBF kernel. Use accuracy as a measure of performance for this hyper-parameter optimisation. Using appropriate measures, compare the results of the linear SVM in 2.3.3 to the results of the kernel SVM obtained here.
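The RBF kernel matrix itself can be computed in a vectorised way in numpy; a sketch, with gamma the kernel hyper-parameter to be tuned by the grid search:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), without explicit loops:
    # expand the squared distance as ||a||^2 + ||b||^2 - 2 a.b.
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2 * X1 @ X2.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))
```

Replacing the inner products of the linear SVM from 2.3.1 with entries of this kernel matrix is the essence of the kernelisation.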
2022-02-17