FOUNDATIONAL BUSINESS ANALYTICS

Published: 2021-12-10

COURSEWORK 2021-2022

1. The Problem Definition:

To ensure business sustainability and enhance revenue, the credit card company plans to launch a proactive credit risk management programme. The goal is to enable the company to “see into the future” and know in advance which customers are more likely to default in the next repayment period. With the ability to predict customer credit default, the company can intervene at the earliest opportunity so that it can have more options for remedying the situation and avoiding losses.

The credit card company has accumulated detailed records of customer repayment history. The data includes information about customer demographics as well as their spending and repayment behaviours from April to September of this year. Crucially, the data also records whether each customer failed to repay in the next period, i.e., October.

This data can shed light on the sort of people who will miss payments in the future and the customers who are valuable assets to the company, i.e., customers who repay duly. Your task, as a consultant, is to analyse the historical dataset, and generate a model that can be used to predict if any individual will fail to repay their debt in the next period (the cost of failing to identify customer default could be very high for the company).

As well as robustly testing, justifying and unpacking your selected model (guided by the CEO’s needs, as detailed below), the credit card company also wants you to produce some business recommendations: what you think the company should focus on as a result of your investigations. You will submit a formal business report (with a strict 8-page and 3000-word maximum). Additionally, you will submit your model implementation, with instructions on how to use it to test new data (written in either Python, Orange3 or some combination; formal specifications are detailed below). Good luck!

2. Important Message from the CEO:

“Of course, to management, an overarching goal is to predict customers who are likely to miss payment in the next period. If we could identify which customers are not repaying in the next period, we could investigate and intervene early to minimize our losses. Of course, customers may get annoyed when we investigate their repaying ability and default risk, especially when they have no intention to miss payments (indeed some may even not want to deal with us again, but this is rare). But this isn’t an issue I want to focus on, as our real business cost here will be the losses caused by customer default. Avoid this if you can.”

3. The Available Dataset:

You will have been provided with a dataset in CSV form from Moodle, containing 6000 samples of customer data, and whether they repaid in the next period.

Your training dataset can be downloaded from Moodle. Note that your dataset will be different from other students’ datasets, so you should expect different results. As you will see from the first line of the datafile (which reflects its header), it follows the schema below:

4. Formal Task Specification

• You must provide a classification approach to predict which individuals are more likely to default in the next period. This will require a stage of statistical analysis, a stage of model selection, a stage of final model training, and then an analysis of implications. You may use any software you desire for your analysis, but your model must be produced in either Python3 or Orange3 for this coursework (or some combination).

• Your submission will consist of a zip of the files for your model, and a report of a maximum of 8 pages. Your model will be tested on a hidden dataset (with the same schema as the training dataset, but without the feature “Y”).

Your report must strictly adhere to the following sections, but please take into account the marks available for each in structuring your submission:

Section A: Summarization [10 marks available]

❏ In this section you must provide a summary statistical analysis of the dataset. Consider how each input feature present is related to the output variable (“Y”). Additionally, you may want to examine how they relate to each other. Please feel free to use tables, bar charts, or scatter graphs depending on the feature - it is totally up to you. Note, the point of this section is to be informative rather than overloading your client with information, so also summarize the key analytical points you have observed in the dataset.
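As a rough illustration of the kind of summary this section asks for, the sketch below relates one categorical and one numeric feature to the output variable using pandas. Only the column name “Y” comes from this brief. The feature names EDUCATION and LIMIT_BAL, the toy values, and the coding of Y as 1 = default are assumptions made for the example; substitute the columns from your own file header.

```python
import pandas as pd

# Toy stand-in for the coursework CSV. In practice you would load your own
# file, e.g. df = pd.read_csv("your_dataset.csv") (file name is a placeholder).
# "EDUCATION" and "LIMIT_BAL" are hypothetical column names; only "Y" is
# named in the brief, and coding 1 = default is an assumption.
df = pd.DataFrame({
    "EDUCATION": ["grad", "grad", "school", "school", "school", "grad"],
    "LIMIT_BAL": [50000, 20000, 10000, 30000, 15000, 80000],
    "Y":         [0, 1, 1, 0, 1, 0],
})

# Categorical feature vs. target: the mean of a 0/1 target within each
# category is the share of defaulters in that category.
default_rate = df.groupby("EDUCATION")["Y"].agg(["mean", "count"])

# Numeric feature vs. target: compare a robust summary across the two classes.
limit_by_class = df.groupby("Y")["LIMIT_BAL"].median()

print(default_rate)
print(limit_by_class)
```

Reporting the group size next to each rate helps the reader judge how much weight a difference between groups deserves.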

Section B: Exploration [20 marks available]

❏ Apply a decision tree to the dataset to unpack, examine, and identify a first cut through the influencing factors in the data. Which variables appear to be important? Do some combinations of variables allow you to identify useful sub-populations in the data? Are all variables useful? Discuss this analysis (linking to your analysis in section A if appropriate). You do not have to visually represent the resulting decision tree, but this is highly recommended and likely to aid your presentation of this initial exploratory analysis.
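One possible way to carry out this exploration in Python is sketched below, using scikit-learn's DecisionTreeClassifier on synthetic data. The data, the generic feature names, and the depth and leaf-size settings are all assumptions chosen to keep the tree readable; they are not recommended values for your dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the training data (roughly 20% positives).
# Replace X, y and feature_names with the columns of your own CSV.
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           n_redundant=0, weights=[0.8, 0.2], random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# A deliberately shallow tree: a few levels are usually enough for a first
# cut, and the result stays small enough to present in a report.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=0)
tree.fit(X, y)

# Which variables does the tree actually use, and how heavily?
ranked = sorted(zip(feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")

# Text rendering of the split rules; each root-to-leaf path describes
# one sub-population defined by a combination of variables.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Features with an importance of zero were never used for a split at this depth, which is one starting point for discussing whether all variables are useful.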

Section C: Model Evaluation [25 marks available]

❏ Select at least 3 different classification model classes (selecting only from those we cover in FBA lectures: Logistic Regression, Decision Trees, Random Forests, Naive Bayes Classifier and k-nearest neighbours), and assess their effectiveness in modelling your historical training dataset (which is unique to you) against a point predictor benchmark (i.e. the mode of the yes/no outcomes). This should be undertaken in either Python3 or Orange3.

❏ In your report, detail the models selected to test and why they were chosen. Detail the parameterizations you chose for each model, explaining why you have chosen the parameters that you have.

❏ Describe the evaluation strategy you chose to compare models to each other (including evaluation statistics and performance measures as you see fit) justifying your decisions in full.

❏ It is expected that your analysis of the outputs of each of the models will be examined in terms of the confusion matrices that they produce. Relate this to your choice of performance measure.

❏ Any code/files used in this process may also be submitted, to contribute to your code/file submission mark.
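The evaluation described in this section could be sketched along the following lines. This is a minimal example under stated assumptions, not a prescribed method: the data is synthetic, the three models and their parameters are arbitrary picks from the permitted list, and stratified 5-fold cross-validation with recall on the default class (assumed to be coded 1) is just one defensible evaluation strategy.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (roughly 20% positives).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

models = {
    # Point predictor benchmark: always predicts the most frequent class.
    "mode benchmark": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # Out-of-fold predictions: every row is predicted by a model
    # that did not see it during training.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    cm = confusion_matrix(y, y_pred, labels=[0, 1])  # rows: true, cols: predicted
    results[name] = {
        "confusion": cm,
        "recall_default": recall_score(y, y_pred, pos_label=1),
    }
    print(name, cm.tolist(), round(results[name]["recall_default"], 3))
```

The benchmark never predicts the minority class, so its recall on that class is zero even though its accuracy looks respectable. That gap is one way to relate the confusion matrix to the choice of performance measure.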

Section D: Final Assessment [5 marks available]

❏ Given the analysis in Section C, justify a ‘winning’ classifier and why you have selected it for your final model, paying close attention to the business case in your consideration of measuring success.
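One way to tie the choice of winner to the business case is to score each candidate with a measure that weights missed defaults more heavily than false alarms, such as the F-beta score with beta greater than 1. The sketch below computes it from confusion-matrix counts. The two candidates and their counts are invented purely for illustration, and beta = 2 is an assumption rather than a value taken from this brief.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts; beta > 1 weights recall above precision."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for the "default" class (not real results).
candidates = {
    # Fewer false alarms, but misses half of the 120 defaulters.
    "model_a": {"tp": 60, "fp": 40, "fn": 60},
    # More false alarms, but misses only a quarter of the defaulters.
    "model_b": {"tp": 90, "fp": 110, "fn": 30},
}

scores = {name: f_beta(**counts) for name, counts in candidates.items()}
winner = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: F2 = {score:.3f}")
print("selected:", winner)
```

With these invented counts the recall-heavy candidate scores higher, because F2 penalises a missed default more than an unnecessary investigation.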

Section E: Model Implementation [5 marks available for write up]

❏ Having selected the single, best performing model, that model must then be trained against the whole training dataset ready for deployment. This section should specify that choice, and briefly describe the resulting code/project files that are attached with your submission. In particular, this section should be used to supply brief instructions on how the recipient should use your submitted model code/files to process a new test set, and to make new predictions from your model.

❏ N.B. Marks awarded here are only for your write up/instructions, with more marks available for the assessment model’s implementation code/files; see “Further Available Marks”.
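A minimal sketch of the train, save, and predict steps described in this section is given below. It assumes a random forest was the selected model and uses synthetic data with generic column names; the helper functions, the file name final_model.pkl, and the use of pickle are illustrative choices, not requirements of the brief.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the full training CSV; only "Y" is named in the brief.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
columns = [f"feature_{i}" for i in range(5)]
train = pd.DataFrame(X, columns=columns)
train["Y"] = y


def train_final_model(train_df, target="Y"):
    """Fit the chosen model on the whole training set (no hold-out)."""
    feature_columns = [c for c in train_df.columns if c != target]
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(train_df[feature_columns], train_df[target])
    return model, feature_columns


def save_model(model, feature_columns, path):
    """Store the model together with the column order it was trained on."""
    with open(path, "wb") as handle:
        pickle.dump({"model": model, "columns": feature_columns}, handle)


def load_model(path):
    with open(path, "rb") as handle:
        bundle = pickle.load(handle)
    return bundle["model"], bundle["columns"]


model, feature_columns = train_final_model(train)

# New data has the same schema as the training set but no "Y" column.
# In practice: new_data = pd.read_csv("new_customers.csv") (placeholder name).
new_data = train.drop(columns=["Y"]).head(10)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "final_model.pkl")
    save_model(model, feature_columns, model_path)
    restored, restored_columns = load_model(model_path)

predictions = restored.predict(new_data[restored_columns])
print(predictions)
```

Saving the column list alongside the model guards against a test file whose columns arrive in a different order.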

Section F: Business Case Recommendations [5 marks available]

❏ The final section of your report should summarize the business case to the client (IMBA banking), providing business recommendations for further potential analysis.

Further Available Marks:

❏ Overall Presentation of your report, its argument and professionalism → [5 marks available]

❏ The standard of your submitted Evaluation/Final modelling code/workflows. This code/workflow is also expected to give the user some means to load in new data (in the same format as your supplied dataset) and make new predictions. → [20 marks available]

❏ The Effectiveness of your model as assessed against our held-back test dataset → [5 marks available]

Note that the models you submit will be tested on another external dataset that I have held out (and which you will not have access to, reflecting the fact that these represent “future” customer repayment status). Thus, as well as receiving marks for your report, your model implementation and how well you have tested, evaluated and justified its construction, there are also additional marks for how well it will predict our hidden test set!

6. Submission

→ In your submission please submit a zip of the following files:

1. Your Final Report (maximum 8 pages excluding the front page. Please indicate your word count at the end of your report).

2. Your Evaluation Code / Workflow files and Final Model Code / Workflow

→ Submissions must be submitted via the Moodle submission link

→ Submission must be received by: 13th December 2021, 3 pm

Potential Penalties:

→ Late submissions will lose 5% from their final mark per day.

→ Submitted reports over 8 pages will be received, but only the first 8 pages will be assessed. This is a strict rule.

7. Final Important Note on Plagiarism

→ Each of you has been provided with a slightly different training dataset, so you are expected to have different results to other people. This is obviously to ensure you are working individually, and we will test your resulting model on the dataset you were specifically allocated.

→ All code and workflows will also be examined to ensure there is no repetition between submissions, so while you are able to share ideas and strategies, the implementation and analysis must be 100% your own individual work. Any plagiarised work will immediately receive zero marks, and will be reported immediately to the School.

8. Some Additional Tips!

• Throughout this coursework, showing thought processes and understanding of how you assess a model in light of the business case is more important than the final predictive test result.

• Similarly, and as reflected in the mark scheme, illustrating your understanding of robust model evaluation and comparison is again more important than the final implementation for this coursework.

• You may use any analysis tools to formulate your report, but your submitted model must be implemented in Python or Orange (or both). You can assume the recipient is using Python 3 and Orange3 respectively, and has sklearn, scipy, numpy, pandas, matplotlib and seaborn installed. Any further requirements must be clearly specified in your submission with instructions.

• Note the page length available in total, and the available marks for each section to assess how much time and effort to place in each.

• Note that presentation of your work is also being assessed. This is a formal report directed to a business professional, and should be formatted and worded accordingly.

• Using Python rather than Orange will not necessarily gain you any extra marks. However, it will likely give you the opportunity to show off your work with more sophisticated analysis, and increase the potential of obtaining higher marks in those respective areas.

• If you choose to illustrate a decision tree, do so for a reason and make it visually useful. No one wants to see a page of hundreds of nodes, so think how best to present the insights it holds!