
1. Introduction

1.1 Background

The global capital market is a vast pool of funds, and personal consumption loans and SME loans are among its most common products. The rapid development of the loan markets in the United States and the United Kingdom has been driven largely by improvements in market supervision and in corporate loan risk control. After going public, the US company LendingClub became for a time the leader of the loan market industry. The purpose of this report is therefore to analyze LendingClub's risk control system and loan data. This will help us understand the reasons and internal drivers behind its soaring stock price and the rapid growth of its daily operations.

LendingClub is the largest online lending platform in the United States, facilitating personal loans, business loans, and financing of medical procedures. Borrowers can easily get loans at lower interest rates through a fast online interface.

One of LendingClub's core competitive strengths is its mature and effective risk control model based on FICO credit data. The FICO credit score is a personal credit rating method developed by the Fair Isaac Corporation and is widely accepted in the United States. When a borrower submits a loan application, LendingClub's system conducts a preliminary screening and then assigns the borrower one of 7 grades from A to G. Each grade contains five sub-grades from 1 to 5, giving 35 loan grades in total. LendingClub sets the interest rate for each loan application based on the borrower's credit report, applying differential pricing: the higher the grade, the lower the interest rate.
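The grade structure described above can be enumerated in a few lines. This is a minimal sketch for illustration only; the names `GRADES`, `SUB_GRADES` and `sub_grade_rank` are hypothetical and do not come from LendingClub or from the dataset.

```python
# Enumerate the loan grades described above:
# 7 letter grades (A to G), each with 5 numbered sub-grades (1 to 5).
GRADES = "ABCDEFG"
SUB_LEVELS = range(1, 6)

# Ordered from the best sub-grade (A1) to the worst (G5).
SUB_GRADES = [f"{grade}{level}" for grade in GRADES for level in SUB_LEVELS]


def sub_grade_rank(sub_grade: str) -> int:
    """Return the 0-based position of a sub-grade, where 0 is the best (A1)."""
    return SUB_GRADES.index(sub_grade)


print(len(SUB_GRADES))       # 35
print(sub_grade_rank("B3"))  # 7
```

An integer rank of this kind is one possible way to treat the sub-grade as an ordered variable in later analysis.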

1.2 Dataset

This dataset contains LendingClub's loan issuance records from 2007 to 2015, including current loan status, repayment status, and latest repayment information. It has 74 columns and 887,379 rows.

 

Figure 1.1 Dataset size 

 

Figure 1.2 Dataset head information

It contains:

· Basic loan information, such as loan ID, member ID, loan amount, loan term, installment payment amount, loan date, and loan status;

· Credit information, such as credit rating and interest rate;

· Personal information, such as home ownership, employment, years of employment, annual income, and asset income;

· Other information, such as loan purpose;

· Geographic information, such as zip code, state;

· Public record information, such as the number of times the credit file has been overdue by more than 30 days in the past two years;

· Number of inquiries in the past 6 months (excluding home and car mortgages), number of months since the borrower last defaulted on the debt, etc.

2. Exploratory Data Analysis (EDA)

2.1 Similar Distributions

We start with the distribution of loan amounts to see when the volume of loans issued rose noticeably. The questions we want to answer are:

· What amounts were mainly disbursed to borrowers?

· In which year were the most loans disbursed?

· Do the loan amounts follow a multinomial distribution?

We analyze the distribution of loans by first exploring the distribution of loans applied for by potential borrowers, the amount disbursed to borrowers, and the amount contributed by investors.

 

Figure 2.1 Distribution of loans 

In order to more intuitively compare the distribution of loans in different years, we use a histogram to visualize the data.

 

Figure 2.2 Loans from 2007-2015

Conclusion:

· Most of the loans issued were in the range of 10,000 to 20,000 USD.

· 2015 was the year in which the most loans were issued.

· Loan issuance grew incrementally over time. (Possibly due to a recovery in the U.S. economy.)

· Loans applied for by potential borrowers, amounts disbursed to borrowers, and amounts contributed by investors appear to have a similar distribution, meaning that qualified borrowers are most likely to get the loan they applied for.

2.2 Good Loans & Bad Loans

In this section, we look at the amount of bad loans that Lending Club has declared so far. Note that some loans that are still outstanding are at risk of defaulting in the future.

2.2.1 Bad debt ratio of loans

We summarize the non-performing loans by classifying each loan as a good loan or a non-performing loan according to its loan status.
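The classification step can be sketched as follows. The report does not list which loan statuses it counts as non-performing, so the labels in `BAD_STATUSES` are assumptions chosen for illustration, as are the function names and the sample data.

```python
from collections import Counter

# Assumption: illustrative status labels. The report does not state which
# statuses it treats as non-performing.
BAD_STATUSES = {"Charged Off", "Default", "Late (31-120 days)"}


def loan_condition(status: str) -> str:
    """Map a raw loan status to 'Bad Loan' or 'Good Loan'."""
    return "Bad Loan" if status in BAD_STATUSES else "Good Loan"


def bad_loan_share(statuses) -> float:
    """Return the fraction of loans classified as bad."""
    counts = Counter(loan_condition(s) for s in statuses)
    total = sum(counts.values())
    return counts["Bad Loan"] / total if total else 0.0


# Hypothetical sample: 2 of the 8 loans fall into a bad status.
sample = [
    "Fully Paid", "Current", "Charged Off", "Current",
    "Late (31-120 days)", "Fully Paid", "Current", "Current",
]
print(bad_loan_share(sample))  # 0.25
```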

 

Figure 2.3 Loans status

We first draw a pie chart of non-performing and good loans, which shows the proportion of non-performing loans over the past seven years.

Then we generate a histogram that uses different colors for non-performing and good loans, showing the proportion of each at the annual level.

 

Figure 2.4 Information on loan conditions

Conclusion

Bad loans accounted for 7.6 percent of loans over the past seven years, with the amount increasing from 2007 to 2014 and stabilizing in 2015.

2.2.2 Geographical distribution of loans

We analyze the geographic distribution of loans by assigning each loan to the West, Southwest, Southeast, Midwest, or Northeast according to its state.

 

Figure 2.5 Loans issued by Region

We convert the date of each loan record into its year and divide the loan amount by 1,000, so that amounts read as values such as 145k, which makes the statistics easier to handle. We then set up a line chart (subject, number of lines, title), take the year, region, and loan amount of each record, and sum the loan amounts by year and region. The resulting totals for each year and region are displayed in the line chart.
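The aggregation behind the line chart can be sketched as below. The state-to-region lookup is a partial, hypothetical mapping, and the "Mon-YYYY" date format is an assumption; neither is specified in the report.

```python
from collections import defaultdict

# Assumption: a partial, illustrative state-to-region lookup. The report names
# the five regions but does not give its full state assignment.
STATE_TO_REGION = {
    "CA": "West",
    "TX": "Southwest",
    "FL": "Southeast",
    "IL": "Midwest",
    "NY": "Northeast",
}


def total_by_year_and_region(records):
    """Sum loan amounts, in thousands of USD, for each (year, region) pair.

    records: iterable of (issue_date, state, loan_amount) tuples, where
    issue_date is assumed to look like "Dec-2014".
    """
    totals = defaultdict(float)
    for issue_date, state, amount in records:
        year = int(issue_date.split("-")[1])  # keep only the year
        region = STATE_TO_REGION[state]
        totals[(year, region)] += amount / 1000  # express the amount in "k"
    return dict(totals)


# Hypothetical records for illustration.
records = [
    ("Dec-2014", "CA", 12000),
    ("Jan-2015", "CA", 20000),
    ("Mar-2015", "CA", 5000),
    ("Mar-2015", "NY", 10000),
]
totals = total_by_year_and_region(records)
print(totals[(2015, "West")])  # 25.0
```

Each (year, region) total then corresponds to one point on the line for that region.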

We then analyze bad loans by region: we count the non-performing loans, group them by region, and then count the bad loan records by loan status. This gives the number of each type of bad loan in each region and the total number of each type of non-performing loan.

 

Figure 2.6 Loans status by region

2.2.3 Deeper Look into Bad Loans

Region does not directly affect whether loans are outstanding or non-performing, so we further examine the profile of lenders in different regions in the hope of identifying influencing factors.

 

Figure 2.7 Loans by Regions

Each region's loans are classified as non-performing or good according to their loan status, but the direct factor that causes loans to be classified as non-performing is not clear. The regional differences in loan status nevertheless warrant further exploration of whether the level of risk in a particular region is responsible for the different proportions of non-performing loans.

The results below show that region (Midwest, Northeast, Southeast, Southwest, West) does not directly affect the outstanding and non-performing status of loans, i.e. there is no significant difference in the outstanding and non-performing status of loans by geographic location. Therefore, we will further group regions according to their risk level rating or group regions according to lender profile to explore the impact of risk level rating and lender profile on the outstanding and non-performing status of loans.

2.3 Business and operation

2.3.1 Overall analysis

This section reviews the operating information of Lending Club from 2007 to 2015.

First, we list the year-to-year change in loan transaction amount.

 

Figure 2.8 Changes in Loan Transaction Amount from 2007 to 2015

Then, we show changes in loan amount each year.

 

Figure 2.9 Curve of Changes in Loan Amount

2.3.2 Business operations

Next, we take a closer look at business operations by state. This shows more clearly which states have higher operating activity and lets us ask further questions: why is operating activity higher in a given state? Is it because of economic factors, or because the risk level is low and returns are fairly decent?

Our strategy is to select three key metrics for each state: the total amount of loans disbursed, the average interest rate charged to customers, and the average income of customers.
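The three metrics can be computed per state as in the sketch below. The tuple layout, the function name `state_metrics`, and the sample values are hypothetical and serve only to illustrate the calculation.

```python
from collections import defaultdict


def state_metrics(loans):
    """Compute the three per-state metrics named above.

    loans: iterable of (state, loan_amount, interest_rate, annual_income).
    """
    grouped = defaultdict(list)
    for state, amount, rate, income in loans:
        grouped[state].append((amount, rate, income))

    metrics = {}
    for state, rows in grouped.items():
        n = len(rows)
        metrics[state] = {
            "total_issued": sum(r[0] for r in rows),      # total loans disbursed
            "avg_int_rate": sum(r[1] for r in rows) / n,  # mean interest rate
            "avg_income": sum(r[2] for r in rows) / n,    # mean annual income
        }
    return metrics


# Hypothetical sample values.
loans = [
    ("CA", 10000, 12.0, 70000),
    ("CA", 20000, 14.0, 80000),
    ("NY", 15000, 13.5, 75000),
]
metrics = state_metrics(loans)
print(metrics["CA"])
```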

 

Figure 2.10 Heat map of issued loans

 

Figure 2.11 Specific number of state data

Conclusion

· California, Texas, New York and Florida are the states in which the largest amounts of loans were issued.

· Interestingly, all four states have an interest rate of approximately 13%, in line with the average interest rate across all states (13.24%).

· California, Texas and New York are all above the average annual income, while Florida is not, which may help explain why most loans are issued in these states.

2.4 Risks of Loans

To maintain a healthy business and attract customers, a P2P platform must control the risks of its loans and keep them at a reasonable level. Of all the undesirable loan statuses, we focus on loans marked as charged off, since this is the worst-case scenario, in which a loan is deemed uncollectible, and the last thing a company wants. By analyzing some of the loan attributes, we hope to identify key factors and related patterns that contribute to this outcome. The risks of loans are analyzed from several angles, such as geography and loan categories.

2.4.1 Geographical factors

Continuing from our previous state and region analysis of operations, we compare the charge-off to fully-paid ratio (abbreviated charge-off ratio) and the debt-to-income ratio (abbreviated DTI) in each state. DTI is the ratio of an individual's monthly debt payments to their income, and it is one of the important factors companies use to assess a person's ability to repay. The average employment length and income are also included in an attempt to identify any correlation between the labor market and loan risk.
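The two ratios defined above can be sketched as follows. Expressing DTI as a percentage is an assumption made to match the scale of the values reported in this section, and the status labels, function names and sample data are illustrative.

```python
from collections import defaultdict


def dti(monthly_debt_payment: float, monthly_income: float) -> float:
    """Debt-to-income ratio, expressed here as a percentage (assumption)."""
    return 100 * monthly_debt_payment / monthly_income


def charge_off_ratio_by_state(loans):
    """Return charged-off count divided by fully-paid count for each state.

    loans: iterable of (state, status). Loans with any other status are
    ignored, and states without fully-paid loans are left out.
    """
    counts = defaultdict(lambda: {"Charged Off": 0, "Fully Paid": 0})
    for state, status in loans:
        if status in counts[state]:
            counts[state][status] += 1
    return {
        state: c["Charged Off"] / c["Fully Paid"]
        for state, c in counts.items()
        if c["Fully Paid"] > 0
    }


print(dti(900, 6000))  # 15.0

# Hypothetical sample: TX has 1 charged-off and 4 fully-paid loans.
loans = (
    [("TX", "Fully Paid")] * 4 + [("TX", "Charged Off")]
    + [("NY", "Fully Paid")] * 2 + [("NY", "Charged Off"), ("NY", "Current")]
)
ratios = charge_off_ratio_by_state(loans)
print(ratios)  # {'TX': 0.25, 'NY': 0.5}
```

Note that this ratio compares charged-off loans with fully-paid loans only, so it is not the same as the share of charged-off loans among all loans.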

Table 2.1 Top 5 states ranked by estimated interest earned

| State | Charge-off ratio | No. loans | Avg. amount | Avg. int_rate | Avg. total interest | Avg. dti | Avg. income | Avg. employment length |
|---|---|---|---|---|---|---|---|---|
| CA | 0.206 | 43434 | 13573.378 | 13.722 | 6.42E+09 | 15.435 | 76346.843 | 5.762 |
| TX | 0.188 | 19524 | 14326.378 | 13.713 | 3.11E+09 | 17.5 | 77998.893 | 5.666 |
| NY | 0.24 | 21586 | 13392.79 | 13.828 | 3.04E+09 | 14.938 | 75871.098 | 5.71 |
| FL | 0.254 | 17776 | 12793.417 | 13.788 | 2.34E+09 | 16.791 | 67745.63 | 5.616 |
| NJ | 0.237 | 9734 | 14045.593 | 13.806 | 1.44E+09 | 15.247 | 81235.128 | 5.756 |

 

Table 2.2 and Table 2.3 Top 10 states ranked by average income and average DTI

| State | Average income | State | Average dti |
|---|---|---|---|
| DC | 84026.707 | IA | 13.153 |
| NJ | 81235.128 | DC | 13.863 |
| MD | 80225.455 | NY | 14.938 |
| TX | 77998.893 | MA | 15.219 |
| CT | 77575.608 | NJ | 15.247 |
| VA | 77075.212 | ID | 15.374 |
| MA | 76755.395 | CT | 15.384 |
| CA | 76346.843 | CA | 15.435 |
| NY | 75871.098 | RI | 15.997 |
| IL | 75577.197 | IL | 16.188 |

Observation:

· California, New York and New Jersey are among the states with the highest average total interest return (Table 2.1) while also ranking among the top states for average income and DTI (the lower the better) (Tables 2.2 and 2.3).

· Interestingly, Texas has the 11th lowest charge-off ratio while having a rather higher DTI than the other states in Table 2.1. Logically, one would expect a higher DTI to indicate a higher risk.

· We suspect that region and employment length do not have a large effect on the charge-off ratio, as shown in Figures 2.12 and 2.13.

 

Figure 2.12 Condition of Loans by Employment Length

 

Figure 2.13 Condition of Loans by region

 

Figure 2.14 Loan amount charged off by region

At the same time, by plotting the charged-off loan amount against time (Figure 2.14), we observe that the total amount ($) charged off fluctuates approximately every two years.

2.4.2 Loan Categories

Lending Club states that the credit score (FICO score) is a major consideration in setting the grade and interest rate of a loan [7]. Naturally, the lower the FICO score, the lower the grade of the loan. A lower grade in turn carries a higher interest rate, meaning a higher payback to investors in exchange for the higher risk (Figure 2.15).

 

Figure 2.15 Condition of Loans by Loan sub-grade

However, we observe that lower-graded loans are actually funded with higher amounts, which makes them even riskier.

An interesting trend is a sudden drop in loan amounts for "exceptional" credit scores in 2010, while larger amounts were issued to G grade loans. This may indicate a recession or financial crisis, which can deeply affect a person's credit rating (Figure 2.16).

 

 

Figure 2.16 Average loan amount and interest rate by grade and credit score

By comparing the distribution of FICO scores within each loan grade across years, we see that credit scores have become less heavily weighted in determining loan risk. Credit scores are now more 'stretched' within every grade: even a person with a 'fair' FICO score can be assigned an A grade loan (Figure 2.17). This suggests that other factors (e.g., home ownership) also affect the loan grade.

 

Figure 2.17 Violin Plots of credit score by loan grade in 2007 (left) and 2015 (right)

3. Further work

To identify other factors contributing to loan risk in a more robust manner, we preprocessed the data and applied different feature selection techniques.

3.1 Data Preprocessing

3.1.1 Unnecessary attributes

Attributes that only become known after a loan is granted were also dropped, as they would not be available at the time customers apply for loans.

3.1.2 Redundant attributes

The Pearson correlation coefficient and a correlation heat map (Figure 3.1) were used to identify collinear features. Using a coefficient threshold of 0.75, we dropped 'installment' and 'total_rev_hi_lim' to reduce redundancy.
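A minimal sketch of this collinearity check, using only the standard library, is shown below. The column values are hypothetical, and the pairing of 'installment' with 'loan_amnt' is an illustrative assumption; the report does not state which features each dropped attribute was correlated with.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.75):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Hypothetical values for three features.
columns = {
    "loan_amnt": [5000, 10000, 15000, 20000, 25000],
    "installment": [160, 330, 480, 650, 800],
    "dti": [18.0, 9.5, 22.1, 12.3, 15.0],
}
pairs = collinear_pairs(columns)
print(pairs)  # [('loan_amnt', 'installment')]
```

For each flagged pair, one of the two features can then be dropped.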

 

Figure 3.1 Correlation Heatmap of features

3.1.3 Encoding Categorical Data

To run our dataset through some of the machine learning algorithms, the categorical data entries had to be converted into numerical values. Our categorical attributes are listed in Table 3.1. The ordinal data were mapped according to their meaning, whereas the nominal data were transformed by ordinal encoding. One-hot encoding was not considered for addr_state because of its high cardinality, which can lead to overfitting in some algorithms.

Table 3.1 Nominal and ordinal data

| Nominal data | Ordinal data |
|---|---|
| home_ownership, verification_status, loan_status, purpose, addr_state | term, emp_length, mths_since_last_delinq, mths_since_last_record, mths_since_last_major_derog, mths_since_rcnt_il |
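The two encoding strategies can be sketched as below. The label formats for emp_length (such as "< 1 year" and "10+ years") are assumptions about how the raw values look, and the helper names are hypothetical.

```python
def encode_emp_length(label: str) -> int:
    """Map an employment-length label to a number of years by its meaning.

    Assumption: labels look like "< 1 year", "3 years", or "10+ years".
    """
    if label == "< 1 year":
        return 0
    if label == "10+ years":
        return 10
    return int(label.split()[0])


def ordinal_encode(values):
    """Give each distinct category an integer code, in sorted order.

    Returns the encoded values together with the category-to-code lookup.
    """
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


print(encode_emp_length("3 years"))  # 3

encoded, codes = ordinal_encode(["RENT", "OWN", "MORTGAGE", "RENT"])
print(encoded)  # [2, 1, 0, 2]
print(codes)    # {'MORTGAGE': 0, 'OWN': 1, 'RENT': 2}
```

The integer codes for nominal data carry no real order, so they are best suited to models such as trees that do not rely on distances between codes.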

3.1.4 Remove Quasi-Constant features

Attributes that have the same value for the vast majority of the dataset are not very useful, since they act almost like constants. Using the VarianceThreshold filter method of sklearn, attributes were dropped at a threshold of 99% similarity.
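The idea can be illustrated without sklearn. The sketch below swaps in a simpler rule than VarianceThreshold: it flags a feature when its most frequent value covers at least 99% of the rows. For a binary feature in which one value has share p, the variance is p(1 − p), so a 99% cut-off corresponds to a variance threshold of 0.99 × 0.01. The column names and values are hypothetical.

```python
from collections import Counter


def quasi_constant_features(columns, threshold=0.99):
    """Return names of features whose most frequent value covers at least
    `threshold` of the rows.

    This is a dominant-value-share filter, used here as a stand-in for a
    variance-based filter.
    """
    flagged = []
    for name, values in columns.items():
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical columns with 200 rows each.
columns = {
    "policy_code": [1] * 200,             # constant
    "pymnt_plan": ["n"] * 199 + ["y"],    # 99.5% the same value
    "term": [36] * 140 + [60] * 60,       # 70% / 30% split
}
flagged = quasi_constant_features(columns)
print(flagged)  # ['policy_code', 'pymnt_plan']
```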

3.2 Feature Importance

3.2.1 Filter Method

Since there are two possible loan outcomes (charged-off and fully-paid), we treated this as a binary classification problem and applied filter methods, namely ANOVA (F-test) [5] and the Chi-square test [4], to our numerical and categorical features respectively to analyze their importance (Figure 3.1, 3.2).
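For a numerical feature, the one-way ANOVA F-statistic compares the variation between the two outcome groups with the variation within them; a larger value suggests the feature separates the outcomes better. The sketch below computes it from scratch on hypothetical interest-rate values and is not the report's own implementation.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of numeric values.

    F = (SS_between / (k - 1)) / (SS_within / (N - k)),
    where k is the number of groups and N the total number of observations.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical int_rate values for the two loan outcomes.
fully_paid = [10.0, 11.0, 12.0]
charged_off = [15.0, 16.0, 17.0]
f_stat = anova_f([fully_paid, charged_off])
print(f_stat)  # 37.5
```

Ranking the numerical features by this statistic gives a filter-style importance ordering.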

 

Figure 3.1 Feature Importance of numerical features

 

Figure 3.2 Feature importance of categorical features

3.2.2 Embedded methods

Features are ranked relative to one another in a tree algorithm. By accessing the feature_importances_ attribute of a tree-based model [2], we can obtain the average feature importance across all the trees. Feature importances were obtained from 4 different models and the results were merged into a combined ranking of the features (Figure 3.3).
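The report does not say how the four sets of importances were combined. One simple possibility, shown here purely as an assumption, is to normalize each model's importances so they sum to 1 and then average them per feature. The model names and importance values below are hypothetical.

```python
def combined_ranking(importances_by_model):
    """Average normalized feature importances across models.

    importances_by_model: {model_name: {feature: importance}}; every model is
    assumed to report the same set of features.
    Returns (feature, mean_share) pairs sorted from most to least important.
    """
    features = list(next(iter(importances_by_model.values())))
    combined = {}
    for feature in features:
        shares = []
        for importances in importances_by_model.values():
            total = sum(importances.values())
            shares.append(importances[feature] / total)  # normalize per model
        combined[feature] = sum(shares) / len(shares)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical importances from four tree-based models.
importances_by_model = {
    "decision_tree": {"int_rate": 0.50, "dti": 0.30, "annual_inc": 0.20},
    "random_forest": {"int_rate": 0.40, "dti": 0.35, "annual_inc": 0.25},
    "extra_trees": {"int_rate": 0.45, "dti": 0.25, "annual_inc": 0.30},
    "gradient_boosting": {"int_rate": 0.60, "dti": 0.20, "annual_inc": 0.20},
}
ranking = combined_ranking(importances_by_model)
print([feature for feature, _ in ranking])  # ['int_rate', 'dti', 'annual_inc']
```

Averaging ranks instead of normalized importances would be an equally plausible alternative.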

 

Figure 3.3 Combined feature importance by tree-based algorithms

3.3 Modeling and evaluation

3.3.1 Linear Dependence of Charge-off on the Predictors

 

Figure 3.4 Correlation parameter

The analysis of the correlation coefficients shows that int_rate has the strongest correlation with the quality of the loan.

3.3.2 F-test & p-value

 

Figure 3.5 F-test & p-value

Peer-to-