Assessment 2 Coursework Project

Purpose

The purpose of this assignment is to develop your skills in analysing a dataset, and your ability to present your findings in a technical report.

Task overview

This assessment is worth 80% of your mark. The assessment combines two objectives in one report that encompasses:

1.  creating a data model to analyse the data

2.  writing a technical report about your analysis.

You will use Jupyter notebook to conduct both activities in tandem and you will submit your notebook via Minerva.

Below are the details of each objective. For this assignment you may only work individually.

Individual project: two techniques on a dataset

The main task is to analyse a small-medium dataset related to an industrial scenario. You will need to apply two techniques on the dataset for the problem in hand, whether it is prediction, classification or clustering. Pre-processing for the dataset is required, and below you will be given an indication of which pre-processing techniques to apply. You will use Jupyter notebook to conduct your analysis and write your technical report in markup. You will use Python code, with only the following packages permitted: sklearn, numpy, matplotlib and pandas.

You may also wish to use RapidMiner to conduct a prototype process to solve the problem. This step is not mandatory, but it is highly recommended. Prototyping can help you to cross check your analysis on standardised operators and guide the development of your project, and will act as an indicator of errors in the code. It also can be used to show the overall schematic structure of your analysis which can be helpful.

Technical report

A narration around the analysis that you have conducted must accompany your code inside your Jupyter notebook, so the notebook will act as a technical report that has code embedded in it. Your report will be based on your understanding of the covered topics with the results of the analysis. Guidance on the number of words for each section is provided within a Jupyter notebook template that you must utilise. The written element of the technical report, not including code or images, should consist of about 2000 words.

Justifying your choice of techniques is essential and constitutes an intrinsic part of your report. The report must lay out the key decisions that were made when assessing and evaluating the solutions for the case study and its related problem. You must also include visualisation of the analysis and testing of the dataset with performance metrics (such as confusion matrix, F1, recall, balanced accuracy etc.) with appropriate figures that fit your analysis.
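As one illustration of such figures, the sketch below plots a ROC curve and a precision-recall curve using only the permitted packages. The data is synthetic (generated with `make_classification`) and logistic regression is only a placeholder model; neither is prescribed by this brief.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in Jupyter
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: roughly 20% positive class).
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=3)

# Placeholder model; predicted probability of the positive (fraud) class.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)
precision, recall, _ = precision_recall_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(9, 4))
ax_roc.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
ax_roc.set(xlabel="False positive rate", ylabel="True positive rate",
           title="ROC curve")
ax_roc.legend()
ax_pr.plot(recall, precision)
ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision-recall curve")
fig.tight_layout()
```

With an imbalanced label, the precision-recall curve often shows differences between models that the ROC curve hides, so plotting both side by side can help justify a model choice.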

The case study

An insurance company plans to utilise their historic insurance fraud dataset to predict the likelihood or the level of risk a customer poses. You can find the dataset above. Referring genuine claims causes customer stress and directly leads to customer loss, costing the company money (assume that any referred non-fraud case will lead to losing that customer). Fraudulent claims obviously cost the company as well. Their main requirement is an unbiased predictive model capable of flagging and referring potential fraud cases for further investigation with a balanced error rate of 5% (you might or might not be able to achieve such performance).
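To make the 5% target concrete, here is a minimal sketch that assumes "balanced error rate" means the mean of the false-positive rate and the false-negative rate (equivalently, one minus balanced accuracy). The brief does not define the term, so treat this definition and the confusion-matrix counts below as assumptions for illustration.

```python
def balanced_error_rate(tn, fp, fn, tp):
    """Mean of the false-positive rate and the false-negative rate."""
    fpr = fp / (fp + tn)  # genuine claims wrongly referred
    fnr = fn / (fn + tp)  # fraudulent claims missed
    return 0.5 * (fpr + fnr)


# Hypothetical test set: 900 genuine claims and 100 fraudulent ones.
ber_model = balanced_error_rate(tn=864, fp=36, fn=6, tp=94)

# A "model" that never refers anyone is 90% accurate on this data,
# yet its balanced error rate shows it is useless for detecting fraud.
ber_never_refer = balanced_error_rate(tn=900, fp=0, fn=100, tp=0)

print(ber_model, ber_never_refer)
```

Because both error types are weighted equally regardless of class size, this measure penalises a model that ignores the minority (fraud) class, which plain accuracy does not.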

The data analysis model

Your task is to devise a data analysis model using Jupyter notebook and Python code that is capable of satisfying the client’s needs as much as possible. Use suitable metrics in order to report your results accordingly. Refer to the textbook by Tan et al. (2020) to help you decide on these issues. When working on the task, you need to take into account as many of the provided features as possible and merge the different files into a unified dataset.
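The merging step might look like the sketch below. The table names, column names and join key (`policy_id`) are hypothetical, since the brief does not list the files' contents; check the real keys before merging.

```python
import pandas as pd

# Hypothetical stand-ins for two of the provided files.
claims = pd.DataFrame({
    "policy_id": [1, 2, 3, 4],
    "claim_amount": [1200.0, 560.0, 8900.0, 300.0],
    "fraud_reported": ["N", "Y", "N", "N"],
})
customers = pd.DataFrame({
    "policy_id": [1, 2, 3, 5],
    "age": [34, 51, 29, 45],
})

# A left join keeps every claim. validate= raises an error if the key is not
# unique on both sides; indicator=True adds a `_merge` column showing which
# rows found no partner in the second table.
merged = claims.merge(customers, on="policy_id", how="left",
                      validate="one_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(merged.shape, len(unmatched))
```

Inspecting the unmatched rows before dropping or imputing them avoids silently losing claims, or silently duplicating them when a key turns out not to be unique.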

You need to find a way to quantify the loss incurred through losing customers because of your recommended model's prediction error. For that, you need a pricing model. Take the average of all claims and assume the company's gross profit must amount to double the claims to cover the overhead of running the business as well as the claims and profit. Assume also that 10% of their customers make a claim, and use this to find how many customers they have on average and how much they would need to charge for each policy. Based on this pricing model, calculate how much your model will cost them due to its error to offer them an insight into how good or bad your model will be for their business.
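One possible reading of this pricing model is sketched below. It assumes that total premium income must equal twice the total claims, that each wrongly referred genuine claim (false positive) loses one customer's premium, and that each missed fraud (false negative) costs one average claim. The last two cost assumptions, and the example figures, are illustrative and are not stated in the brief.

```python
def pricing_model(claim_amounts, claim_rate=0.10, income_multiple=2.0):
    """Derive the average claim, customer count and premium from past claims."""
    n_claims = len(claim_amounts)
    total_claims = float(sum(claim_amounts))
    avg_claim = total_claims / n_claims
    n_customers = n_claims / claim_rate        # claims come from 10% of customers
    required_income = income_multiple * total_claims
    premium = required_income / n_customers    # charge per policy
    return avg_claim, n_customers, premium


def error_cost(false_positives, false_negatives, avg_claim, premium):
    """Assumed cost: one lost premium per FP, one average claim paid per FN."""
    return false_positives * premium + false_negatives * avg_claim


# Hypothetical claim amounts and error counts, for illustration only.
claims = [1000.0, 3000.0, 2000.0, 6000.0]
avg_claim, n_customers, premium = pricing_model(claims)
cost = error_cost(false_positives=5, false_negatives=2,
                  avg_claim=avg_claim, premium=premium)
print(avg_claim, n_customers, premium, cost)
```

Under these assumptions the premium simplifies to 2 × 10% × the average claim, so a missed fraud costs five times as much as a lost customer. If your report uses a different interpretation, state it explicitly alongside the calculation.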

1- You need to pre-process the dataset and its attributes.

a. Remove noisy attributes that do not contribute to the classification problem if necessary.

b. Remove synonymous attributes if necessary.

c. Apply appropriate feature selection or feature extraction if necessary.

d. Deal with collinearity if necessary.

e. Rescale the attributes if necessary.

f. Deal with missing and duplicate values if necessary.

g. Deal with class imbalance.

h. Other.

Note that some of those operations might need to be pipelined in a later stage during hyperparameter optimisation or model performance comparison.

2- Check if there is a class-imbalance problem by conducting basic statistics and/or clustering on the dataset and propose a solution.

3- Build/Train your models with two techniques: apply your two prediction techniques and analyse the results.

a. Make sure that your techniques do not over-fit the training set.

b. If you work in a group of two, you must use four techniques (two each) to conduct the analysis and comparisons. If you work individually, you can increase the number of techniques to more than two if you wish.

4- Cross-validate your model if necessary. Split the data appropriately, use appropriate performance metrics, and compare the results with and without CV for both of your techniques.

5- Test your techniques. Split the data appropriately, use appropriate performance metrics, compare the results with the training performance, and deal with overfitting if necessary. We would like to help our client to identify as many fraudulent claims as possible under the constraint they have specified. Therefore, make sure to use appropriate performance metrics that reflect this interest.

6- Present and format your work according to the sections mentioned below.
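Steps 1 to 5 above can be sketched end to end as follows. This is a minimal outline under stated assumptions, not a model answer. The data is synthetic, logistic regression and random forest are placeholder techniques, and the hyperparameter grids are arbitrary. Class imbalance is handled here with `class_weight="balanced"`, which is one option among several.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged, numerically encoded fraud data.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2],
                           flip_y=0.02, random_state=0)
X[::30, 0] = np.nan  # inject a few missing values for illustration

# Step 2: basic statistics on the label reveal the class imbalance.
fraud_share = y.mean()

# Step 5 (split first): a stratified test set that is never used for tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def build_pipeline(clf):
    # Step 1: pre-processing sits inside the pipeline so that it is re-fit
    # on the training part of every CV fold, avoiding information leakage.
    return Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", clf),
    ])


# Step 3: two placeholder techniques with small, arbitrary grids.
candidates = {
    "logistic_regression": (
        build_pipeline(LogisticRegression(class_weight="balanced", max_iter=1000)),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        build_pipeline(RandomForestClassifier(class_weight="balanced",
                                              n_estimators=50, random_state=0)),
        {"clf__max_depth": [3, 6]},
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, (pipe, grid) in candidates.items():
    # Step 4: cross-validated hyperparameter search on the training set only.
    search = GridSearchCV(pipe, grid, scoring="balanced_accuracy", cv=cv)
    search.fit(X_train, y_train)
    best = search.best_estimator_
    test_pred = best.predict(X_test)
    results[name] = {
        "train": balanced_accuracy_score(y_train, best.predict(X_train)),
        "cv": search.best_score_,
        "test": balanced_accuracy_score(y_test, test_pred),
        "confusion": confusion_matrix(y_test, test_pred, labels=[0, 1]),
    }

for name, r in results.items():
    print(f"{name}: train={r['train']:.3f} cv={r['cv']:.3f} test={r['test']:.3f}")
```

A large gap between the training score and the cross-validated or test score is a sign of overfitting. One minus the test balanced accuracy can then be compared against the client's 5% balanced error rate target.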

Report structure and marking scheme

Your report should address the sections and subsections in the marking scheme below. You can add a table of contents to the provided template. Each section and subsection below must correspond to the same section with the same title in your Jupyter notebook. The template provided already satisfies this requirement.

1. Aims, Objectives and Plan

· Aim and Objectives.

· Plan: Simple Gantt chart

4 marks

2. Understanding The Case Study

· Case Study Analysis (state the key points that you found in the case) and how you intended to deal with them appropriately to address the client needs.

4 marks

3. Pre-Processing Applied

· Preparing the labels appropriately

· Appropriate feature selection if necessary

· Appropriate feature extraction if necessary

· Dealing with collinearity if necessary

· Dealing with missing values if necessary

· Dealing with duplicate values if necessary

· Rescaling if necessary

· Dealing with class imbalance if necessary

· Other necessary pre-processing

20 marks

4. Technique 1

· Motivation for choosing the technique and schematic figure of the analysis process

· Setting hyper parameters (rationale)

· Optimising the hyper parameters appropriately

· Performance metrics for training set

· Other items necessary for the technique

20 marks

5. Technique 2

· Motivation for choosing the technique and schematic figure of the analysis process

· Setting hyper parameters (rationale)

· Optimising the hyper parameters appropriately

· Performance metrics for training set

· Other items necessary for the technique

20 marks

6. Comparison of Metrics Performance for Testing

· Use of cross validation for both techniques to deal with overfitting, model selection and model comparison

· Use appropriate metrics for testing set

· Use appropriate model selection visualisation curve (ROC, PR etc.) that is suitable for the problem in hand

16 marks

7. Final Recommendation of Best Model

· Technical perspective: overfitting discussion, complexity and efficiency

· Business perspective: results interpretation, relevance and balance with technical perspective

8 marks

8. Conclusion: Self Reflection and Future Work

· What has been successfully accomplished and what has not

· Reflect back on the analysis and see what you could have done differently if you were to do the project again

· Add a wish list of future work that you would do to take the project forward

8 marks

Total marks

100%

Explaining your report

In order to ensure that you understand all of your submitted coding, you should be able to explain what you have done to someone else. To assist the Module Leader with marking your report, you should upload, as well as your Jupyter notebook, a short video (2-3 minutes long) explaining the main elements of your project: what was successful, what was not successful, which requirements you have fully met and which ones you have not.

This video should be a screen-capture video with your voiceover. Your explanation should be as per the sections of the report, given above.

You will not receive specific marks for your video, but this is a necessary part of your overall assessment submission.

Presentation

Your report should adhere to the guidelines on presentation of coursework as set out in Section 3.2a of the Code of Practice on Assessment.

Submission

You will need to submit your report as a Jupyter notebook (.ipynb) file and submit your explanation video as a .mp4 or .mov file. Other video file formats may be acceptable if approval is sought from the Module Leader.

Please be aware that video files and Jupyter files are typically larger and therefore may take some time to upload. Please leave at least 10 minutes to allow your file to submit correctly.

The size of any file for submission must not exceed 100 MB.

To decrease the file size, we recommend that you zip everything together and submit one zipped file. Please do not try to submit individual files because the file size will be too large. Guidance on how to create a zipped file can be found here: https://desystemshelp.leeds.ac.uk/minerva-student/assessment-and-grades/zip-your-files-for-submission-in-minerva/

The title of your zip file should include your student ID number and should be entered as the ‘submission title’ when you upload your report and your video to the submission inbox in Minerva.

Submit your files by the due date and time using the submission link found under "Submit my work".

IMPORTANT: Please be aware that once you have uploaded your report and video, you will need to click the submit button. Failure to do so will result in your submission not being logged with Minerva.

Furthermore, if you are planning to submit late, or have an extension, please note that you should only submit to the late submissions area. You should not submit a draft to the initial submission link, as a submission may be logged for you and the Digital Education Service will only accept the coursework which is submitted by the deadline, regardless of whether it is a draft submission.

Marking

This is a compulsory assessment for OCOM5101M Data Science. It is worth 80% of your overall module grade.

Your project report will be marked according to the mark scheme given above.

Feedback

You will receive your mark and feedback via Minerva, in the Gradebook area, accessed from the main navigation menu at the top of the page. We will notify you by email when your feedback is available. You will usually receive feedback within 15 working days of the assignment deadline.