
MANG6556

Credit Risk & Data Analytics

SEMESTER 1 2023/24

Individual Coursework

This assessment relates to the following module learning outcomes:

A. Knowledge and Understanding

A1. Understand the potential of CRISP-DM and data analytics, particularly in the retail lending sector.

A2. Demonstrate a critical understanding of different types of data analytics methods and the problems they can solve.

A3. Interpret the output of statistical techniques used for the main data analytics applications.

B. Subject Specific Intellectual and Research Skills

B1. Identify the statistical models appropriate for analysing the various decisions that confront a data analyst in different industries.

B2. Work with software to develop data analytics solutions, such as predictive scorecards, clustering models, and different types of regressions.

B3. Assess the relevance of statistical package outputs to the decisions being addressed.

C. Transferable and Generic Skills

C1. Critically analyse practical difficulties that arise when implementing retail credit risk models; understand the cross-fertilisation potential to other business contexts (e.g., fraud detection, marketing, CRM, etc.).

C2. Demonstrate an ability to use world-class software and to interpret its output in the relevant techniques.

C3. Manage time and tasks effectively in the context of individual study.

Coursework Brief:

Question 1 (60 marks)

The data set HMEQ reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral. The data set has the following characteristics:

• BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan

• LOAN: Amount of the loan request

• MORTDUE: Amount due on existing mortgage

• VALUE: Value of current property

• REASON: DebtCon = debt consolidation; HomeImp = home improvement

• JOB: Occupational categories

• YOJ: Years at present job

• DEROG: Number of major derogatory reports

• DELINQ: Number of delinquent credit lines

• CLAGE: Age of oldest credit line in months

• NINQ: Number of recent credit inquiries

• CLNO: Number of credit lines

• DEBTINC: Debt-to-income ratio

1.1 Carefully pre-process the data set by considering the following activities (30 marks):

• exploratory data analysis

• missing value handling (if any)

• outlier detection and treatment (if any)

• categorisation of the continuous variables (if deemed useful)

• Weights of Evidence coding (note that some additional coarse classification might be needed).
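As an illustration of the Weights of Evidence step only, the following minimal sketch computes WoE and Information Value contributions for one categorical variable. It assumes the common convention WoE = ln(share of goods / share of bads), which some texts reverse. The function name `woe_table`, the `smoothing` constant, and the toy data are hypothetical and are not part of the HMEQ data set or of any required method.

```python
import math
from collections import Counter

def woe_table(categories, targets, smoothing=0.5):
    """Return {category: (woe, iv_contribution)} for a binary target (1 = bad).

    Assumption: WoE = ln(dist_good / dist_bad). The smoothing constant is an
    illustrative choice that avoids log(0) when a category has no goods or bads.
    """
    goods, bads = Counter(), Counter()
    for cat, bad in zip(categories, targets):
        if bad == 1:
            bads[cat] += 1
        else:
            goods[cat] += 1
    levels = sorted(set(goods) | set(bads))
    total_good = sum(goods.values()) + smoothing * len(levels)
    total_bad = sum(bads.values()) + smoothing * len(levels)
    table = {}
    for cat in levels:
        dist_good = (goods[cat] + smoothing) / total_good
        dist_bad = (bads[cat] + smoothing) / total_bad
        woe = math.log(dist_good / dist_bad)
        table[cat] = (woe, (dist_good - dist_bad) * woe)
    return table

# Toy data (made up for illustration, not taken from HMEQ).
reason = ["DebtCon"] * 6 + ["HomeImp"] * 4
bad = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1]
table = woe_table(reason, bad)
information_value = sum(iv for _, iv in table.values())
```

Missing values can be given their own category label (for example "Missing") before calling the function. Categories with similar WoE values are candidates for merging during coarse classification.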

1.2 Estimate a scorecard using a logistic regression classifier and report the following (30 marks):

• The most important variables

• The impact of the variables on the target

• The performance of the model. Use various performance metrics and discuss their relationship if any.

• The resulting scorecard.

• Compare this scorecard with the results of a Random Forest. Discuss your results.

• Why do most banks use Logistic Regression as their base classifier? What do banks win and lose by doing this?
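On the relationship between performance metrics, one well-known link is Gini = 2 × AUC − 1. The sketch below is illustrative only: it computes AUC, Gini, and the Kolmogorov-Smirnov (KS) statistic from predicted scores in plain Python. The function names and the toy scores are hypothetical, higher scores are assumed to indicate higher default risk, and the pairwise AUC loop is written for readability rather than speed.

```python
def auc(scores, labels):
    """Probability that a random bad (label 1) scores above a random good (label 0).

    Ties between a bad and a good count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of bads and goods."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        cum_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= t) / n_pos
        cum_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= t) / n_neg
        best = max(best, abs(cum_pos - cum_neg))
    return best

# Toy predictions (made up for illustration): score = predicted probability of BAD = 1.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

auc_value = auc(scores, labels)
gini_value = 2 * auc_value - 1
ks_value = ks_statistic(scores, labels)
```

The same functions can be applied to the scores of both the logistic regression scorecard and the Random Forest on a common hold-out sample, so that the two models are compared on identical data.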

Please carefully report the various steps of your methodology and discuss your results in a rigorous way!

NOTE: It is unlikely that different students will come up with the exact same parameter estimates. Special consideration will be given to submissions whose estimates are identical.

Question 2 (40 marks)

Find an academic paper published in 2021 or later (based on online or print publication date) discussing a real-life application of data analytics. It is important that the dataset analysed in the paper consists of real-life (not artificial) data. The publication outlets in which to look for a suitable paper are:

• Management Science

• Operations Research

• INFORMS Journal on Computing

• INFORMS Journal on Applied Analytics

• Journal of Machine Learning Research

• European Journal of Operational Research

• Production and Operations Management

• Manufacturing & Service Operations Management

• ICDM (The IEEE International Conference on Data Mining)

• NeurIPS (Conference on Neural Information Processing Systems)

• KDD (ACM SIGKDD Conference on Knowledge Discovery and Data Mining)

Papers from journals or conferences not on this list are not acceptable.

Once you have found an appropriate paper, report the following in separate subsections:

• Title, authors, and complete citation (e.g., journal name, volume/issue, year, …)

• The data mining problem considered

• The data mining techniques used

• The results reported

• A critical discussion of the model and results (assumptions made, shortcomings, limitations, …)

• Apply the methodology you reviewed to the HMEQ dataset and report the analytic steps, model performance, and business implications.

Make sure you demonstrate that you understand what the article is all about and are able to provide a critical discussion.

Do not copy and paste from the article. Turnitin will easily detect this!

NOTE: The reviewed methodology should be different from methods applied in Question 1.