Understanding Prosper Loan Data Set

Uirá Caiado. April 7, 2015

Abstract

I explore a data set about P2P loans. My goals with the study are understanding how this practice works, and what makes a loan to be bound to a specific risk score.

Introduction

I have to tell the truth. Initially, the number of variables intimidated me. There are 113,937 rows in this data set and 81 columns (variables). It seems to come from Prosper website, a peer-to-peer P2P lending marketplace. I have already seen a brazilian website trying to do something similar in my country, but our “SEC” shut them off. Just financial companies can lend money here.

According to wikipedia , the SEC imposed that Prosper had to cease their operations due to violations on a piece of legislation that ensures transparency and try to avoid fraudulent activities in the american securities markets. The website took from November 2008 to July 2009 to resume their activities.

Bellow you can see the effect of this time closed:

There are 6 variables in the data set just valid to loans originated after 2009 and I believe that they were created to comply regulatory rules. In general, they are all related to how much interest have been paid and how risky each loan were. Looking at other variables, I can see that they are pretty diverse but they are all related to these two factors: risk and return.

There are variable directly associated with risk measurements (what was the classification when the loan was took, the scores from Prosper), about how much was paid and how much was earned, meta data of the loan (the size in money and duration, when and where it was originated), about the profile of the borrower (employment status, his income, his debts, what he intend to do with the money), how each loan was founded and so on.

In the next sections, I will explore 11 variables of the data set, as described bellow:

Term: The length of the loan expressed in months;
LoanStatus: The current status of the loan, as Cancelled, Defaulted, etc;
ClosedDate: Closed date, when applicable;
BorrowerRate: The Borrower’s interest rate for the loan;
ProsperRating (numeric): The Risk Rating assigned at the time the listing was created. Varies from 0 (worst) to 7 (best);
ListingCategory: The category of the listing that the borrower selected when posting their listing;
IsBorrowerHomeowner: If the Borrower is classified as a homeowner by the Prosper Criteria;
DebtToIncomeRatio: The debt to income ratio of the borrower at the time the credit profile was pulle. This value is capped at 10.01;
StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created;
LoanOriginalAmount: The origination amount of the loan;
LoanOriginationDate: The date the loan was originated.

As P2P lending is something new for me, I would like to understand how it works in United States and I will use this database to achieve that.

Analysis

Univariate Section

Loan Meta Data: Amount Borrowed by Month of Loan Origination

In Brazil, where I live, I would expect the total loans rose at some months of the year. Usually at the beginning and the end of the year, when people travel, pay taxes and buy gifts. Let???s start looking at the amount of money borrowed by month in United States

As expected, the amount of loans have risen at first and last months of the year.

Loan Meta Data: Amount Defaulted by Month of Loan Origination

If there are more money borrowed on these periods, maybe the number of loans defaulted increases between them. To verify that, I have to use the ClosedDate variable and filter just the defaulted and chargedoff loans. Let’s see:

Curriously the distribution of the amount borrowed by month looks like the distribution of amount defaulted/chargedoff by month.

Loan Meta Data: How Defaulted Loans are distributed

Well, I am not sure if this filtered data is relevant, let me see how many data points there are in each status. I will group all ‘Past Due’ buckets as the same buckets and show off the amount of data on each status as percentage. As can be seem bellow, 11% of the data is related to defaulted or chargedoff loans:

## 
##              Cancelled   Chargedoff/Defaulted              Completed 
##                   0.00                   0.15                   0.33 
##                Current FinalPaymentInProgress          Past Due Date 
##                   0.50                   0.00                   0.02

Thinking again, nobody would borrow money to pay back in the next couple of months, at least not in Brazil. Let’s see how many loans were borrowed, percentually, divided by duration:

## 
##   12   36   60 
## 0.01 0.77 0.22

It explains why the distributions of Defaulted/Chargedoff loans and the amount borrowed by month looks like the same. As people take loans on regular durations, 12, 36 and 60 months, the loans will ever due to the same month in different years.

Loan Meta Data: Loan Original Amount and how it is distributed by Term

What about the median of the loan? First, let’s see the quartiles of the value of each loan:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Wow…$ 1,000 to $35,000. It is a huge range. But it seems that the most of the loans are smaller, given that 50 % of the data set is up to $ 6500.

Let’s take a close look at the loan original amount splitted by duration. I had to convert the variable “Term” to factor to make the chart below:

It seems that the loan for each duration is something that I should not plot together. The distribution of each one is pretty different.

Loan Return: Take a Look at the Distribution of Borrower Rates

Well, and the interest rate? The distribution differs at each term? First, let’s see the histogram of the interest paid:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There is something different on the right tail. Let’s reduce the bin width to 0.5%

## Warning: position_stack requires constant width: output may be incorrect

Definitely there is something different after 30% of annual interest rate.

Loan Return and Loan Meta Data: Investigating Outliers in the Interest Rates

Let’s take a close look at the range between 30%-37%.

## Warning: position_stack requires constant width: output may be incorrect

Now, more specifically in the range 31.5% and 32%.

## 
##  0.316 0.3165 0.3166 0.3169  0.317 0.3174 0.3175 0.3176 0.3177 0.3178 
##    121     60      1      1      2      3      9      2   3672      1 
## 0.3179  0.318 0.3185 0.3188 0.3189  0.319 0.3195 0.3197 0.3198 0.3199 
##      2      1     27      5      2      4     12      1      3   1651

Curiously there is a huge concentration at 31.77% and 31.99%. Let me see when those loans were took.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## Warning: position_stack requires constant width: output may be incorrect

Ok, I can see that there is no evidence of relationship with date. Maybe I find something related to the category of each term:

## 
##   36 
## 5323

Interesting, all loans with these interest rates have a term of 36 months. Let’s see how the rates are distributed by Term, considering all the data set.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The most of loans have a term of 36 months. It is curious to see that just loans with term of 36 months have a more dispersed distribution. However, it is difficult to see how the how loans with term of 12 months are distributed. Below are the Box plots.

As previously observed on the frequency polygons , the box plot confirms that 36-months loans are more dispersed. Additionally, I can notice that the median increases sharply from 12 to 36 months and then increases slightly to 60 months. It makes sense, once that a long-term loan is much more unpredictable than a 12 months loan.

Loan Risk: Prosper Scores

Following my reasoning, if the borrower rate are dispersed, I would expected that the Risk Score also was dispersed, given that the interest rate should be a function of the risk score. First, let me see the quartiles of the Prosper Rating.

## 
##     1     2     3     4     5     6     7  <NA> 
##  6935  9795 14274 18345 15581 14551  5372 29084

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.000   4.000   4.072   5.000   7.000   29084

As I can see, the risk scores are discrete and ordered. I believe that it is not a issue, given that a average rate of 5.3, for instance, has a meaning. It tells me that there is more data point on 5 than in 6. I also notice that there are many data point without a risk score. Below is a histogram just including those that has a score setted:

Well…They are not dispersed….and, thinking again, actually they shouldn’t be. The risk score is discrete, there is no way to be dispersed. I would need verify the distribution splitting the data by term, but I will check it later.

Borrower Profile: Borrower Monthly Income

Maybe the income can help me understand something:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

75% of the incomes declared are less than or equal to $ 6,825 and the maximum income value is $ 1,750,000. If I look at the income range, I can see that it was grouped just until $100,000.

Let’s look at data where the monthly income less than 100,000 and has a prosper score setted.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

It doesn’t add up much to my thoughts. I believe that these variables have to be related to another to be more meaningful.

Loan Meta Data: What Were The Reasons For The Loans ?

Now, let’s see why people borrow money,considering just who has a prosper rating setted.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

I need to rename the categories. Also, I am already seeing that there are a lot of categories that has less than 2,000 observations. I am going to rename those to ‘Other’.

Much better. The loans to roll over existing debts are predominant. Let’s see now how people were deep in debt when they took the loans using Prosper.

Borrower Profile: How Much of the Monhtly Income Is Committed to Loans ?

Given that the loans labeled as Debt Consolidation are most common, I would expect that the ratio debt o income be greater. Let’s see how the ratio is distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1400  0.2200  0.2759  0.3200 10.0100

There are some outliers here. I am going to filter out the top 1% of debt/income ratio in the next plot. Also, I will apply a log transformation here.

I am seeing that there is some kind of distribution here. The most of people have less than 30% of their income committed to debts. Also, I believe that there is no need to transform this variable in my future explorations.

Wrapping Up the Univariate Section

So far, I have seen some evidences that people take more loans in specific months and the interest rate of these loans are strongly related to the term of the each loan. There is something curious about loan with 37%-38% of interest rate and I have not understood why loan with term of 36 month are so much more dispersed than 60 months, although the median of interest rates on 60 month are slightly greater than 36 months.

Also, I didn’t find nothing so interesting about risk score and income level, two variable that I was expecting more. On the other hand, maybe this variables have more to tell when related to others. I would like to understand better why loans with term of 36-months are more dispersed than 60 months. Also I want to find out why risky loans are classified that way.

Bivariate Section

Loan Return And Loan Risk: Daes The Risk Pays Off ?

Let’s start looking at how risk score and borrower rate relate with each other. I’m expecting to find a strong relationship between them. After all, the interest rate should reflect the risk of the loan. I am going to add some jitter on the chart and change the alpha to reduce the overplotting. I am also excluding monthly income greater than $ 100,000.

As the Prosper Rating is a discrete variable and the Borrower Rate variable is continuous, the chart without the jitter can look like a bit cluttered. I can notice in the figure above that as better the risk score, more common are lower rates.

This chart also shows that loans with higher interest rate are much more common in loans with higher risk (smaller scores). Let me see the correlation between these two variables.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and BorrowerRate
## t = -854.5218, df = 75587, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9526073 -0.9512698
## sample estimates:
##        cor 
## -0.9519431

As expected, there is a strong negative relationship between the two variables, confirming that smaller scores results in higher interest rates. As I said before, It was expected.

Borrower Profile And Loan Risk: Are The Wealthy People Less Risky?

What I would like to understand is WHY this loans were classified as risky. Let me see what is the relationship between risk score and monthly income. I am also filtering out monthly income equal to 0 (zero).

Looking at the chart above, It seems that there is some relationship between the two variables tested. The greater the monthly income, it is classified with a better risk profile more commonly . On the another hand, I am also seeing that even people with high income are classified as high risk. The dispersion of the observations suggested that, the monthly income, alone, does not explain the Risk Score. Let’s see the correlation test between the two variables.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and StatedMonthlyIncome
## t = 64.8236, df = 74894, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2236993 0.2372620
## sample estimates:
##       cor 
## 0.2304918

The test shows that the correlation is significant, given that the p-value is lower than 0.05 and indicates that as higher the monthly income, better is the risk score, as I observed in the box plot. The correlation coefficient, around 0.2, indicates that this relationship is not strong.

Borrower Profile And Loan Risk: Are People in Debt Worst Classified ?

Maybe the correlation between the Prosper Rating and the Debt to Income ratio is stronger.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and DebtToIncomeRatio
## t = -33.9606, df = 69134, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1354213 -0.1207577
## sample estimates:
##        cor 
## -0.1280965

This test also shows that the correlation is significant and indicates that the debt/income ratio and the Prosper Rating are negatively correlated. Said another way, the higher the debt/income ratio, worst is the Prosper Rating. But, again, the correlation coefficient, around -0.1, is not strong.

I wonder if there is any pair of variables, that is not Prosper Rating and Borrower Ratio, that presents a strong correlation between them.

Loan Meta Data And Loan Risk: Loan Amount and Risk

Maybe there is something about the amount of the loan and Risk Score. I going to keep the same filters on the data that I used before and calculate the mean loand amount on each risk category.

It seems as some values of loans were allowed just for certain risk scores, but it does not add up to my understanding of Risk classification.

Loan Meta Data And Loan Risk: Listing Category on each Risk Score

Let me see how the loans for each risk score are divided by ListingCategory.

There are a lot of loans that were taken for Debt Consolidation on all Risk levels. If this listing category was predominant on all levels, it suggests to me that it is not a driver of risk level. Thus, I will not use this variable further on my analysis.

Borrower Profile and Return: Debts and Interest Rate

And how Debt to Income Ratio is related to Borrower Rate?

## Warning: Removed 7296 rows containing missing values (geom_point).

It is confuse. I am going to break up the Debt To Income Ratio into a categorical one and plot the two variables again. The most of classes I have defined as the quartiles of the data.

Nice, here is something useful. There is some relationship between the two variables and it is coherent. The more one is in debt, the more expensive is to take more loans.

Borrower Profile and Return: Monthly Income and Interest Rate

Talking about debt to income ratio, let’s see the box plot of this variable related to Monthly Income. As I have seen before, there are big outliers here. Then, I am going to plot just income until $ 10,000 per month.

The level of debt is directly related to income. It corroborates with what I have found before. If people with bigger Debt to Income Ratio pay higher interest rates and this ratio is negatively related to monthly income, then, the bigger income, the smaller the interest rate.

Borrower Profile and Return: Homeowners and Interest Rate

If there is this relation with income, maybe there is some relationship with the assets of the borrower. Let me check the interest rate distribution when people are home owners or not. I will filter out borrower rate smaller than 39%

Here I also found some relationship. I guess It couldn’t be different . When you talk about risk, you talk about uncertainty. If you are confident that the borrower will pay the loan back, you can charge a lower interest rate. It is how things work between countries, makes sense that it works in the same way between people. If someone earns more money per month or he has a house, you might be more confident that he will have conditions to pay you back.

Risk and Loan Meta Data: Risk Scores and Defaulted Loans

Well…I am talking over risk and uncertainty….but, first of all, loans classified as risky really are riskier? Let’s check the default ratio between risk scores.

## No id variables; using all as measure variables

Yes, they are. The number of borrowers that didn’t pay the loan back are greater at worst risk scores.

Risk, Return and Loan Meta Data: Are They All Related?

Ok… I want to check one last thing. Let’s see the distribution of borrower rate by month. In order to put into perspective, I am also going to plot the default ratio by month and the total number of loans by month.

## No id variables; using all as measure variables

Look to that. As the number of loans by month increases, the median of interest rate decreases, as well as the percentage Defaulted ratio. Right now, I can’t say why , although I can think in a reason for this behaviour. For instance, in finance, if you want to reduce the risk of your investments, you should diversify them. If I suppose that the interest rate is some function of the defaulted ratio, and then I diversified to whom I lend money, I would expect that the defaulted ratio went down and then I could reduce the interest rate.

Wrapping Up the Bivariate Section

In this section, I focused on understanding what makes someone be classified as risky. I have explored many variables. Many of them presented some correlation between what I was looking for, but not strong enough to allow me to formulate some hypotheses.

I have learned that loans with worst risk scores are, in fact, risky. They present a bigger defaulted ratio. As expected, I also have found that better risk classification is related to smaller interest rates and the amount that someone borrows.

Also, I verified that the more deep in debt someone is, related to his income, the more expensive is to take loans. My analysis has shown that the level of debt from someone is related to his level of income. I also have saw that when someone has a house, the average interest rate is smaller.

In the next section, I will try to find out how the term variable relates to what I already have found.

Multivariate Section

Return,Profile and Meta Data: Interest Rate, Term and Debt Ratio

As I have seen in the univariate section, the dispersion of interest rate with term of 36 month is greater than others. In order to understand that, I am going to plot the Borrower rate by Term, adding some colors using the buckets that I have created related to Credit to Debt Ratio.

It seems that loans with term of 36 months are predominant at higher Debt to Income ratio. It doesn’t necessarily mean something.

Borrower Profile and Meta Data: Percentages of D/I ratio Buckets by Term

Let me see how many loans are at at each bucket, divided by term. I am going to plot all buckets in one figure and just the worst debt to income buckets in another one.

I can see that all Terms have a small number of loans in the worst Debt to Income (D/I) buckets (equal to or greater than 1). On the other hand, when I filter out the smaller buckets, I see that the Term of 36 month has a bigger participation of buckets greater than 2 D/I ratio where the average borrowing rate is higher, as I already have seen before.

Meta Data and Return: How Interest Rates with different Terms Changed Over Years?

I can’t say that that is related to the dispersion of loans with Terms of 36 months, but still I can’t discard this hypothesis. I am going to plot the boxplot of the Borrower rate by Term and filtering out the D/I greater than 0.14. I am also going to facet by year to see if the behaviour changes over time.

Interesting. The dispersion is not related to the D/I ratio, given that there is dispersion even in Debt To Income Ratios smaller than 0.14. It remains true at almost every year where the Prosper Score were set (after 2008). Just in 2014 I can’t verified this relation.

Meta Data and Borrower Profile: How D/I of Different Terms Changed Over Years?

Let me plot the defaulted ratio by Term to see if it sheds some light on what makes the 36 months more dispersed.I am going to faceted the next chart by year of ClosedDate and filter the loans with D/I rate smaller than 0.14 because it accounts for 55% of the overall data.

I can see that the loans with 12 month of Term, that also is the one with smaller Borrower Rate, presented a much smaller defaulted rate overall.

This chart also shows that the loans with Term of 60 months presented a smaller defaulted ratio than the loans with Term of 36 months in 2011 and in 2012. It just changed in 2013.

As the default ratio of Terms of 60 months was becoming higher, the borrower rate was growing. Finally, borrowers rates with this term sharply increased in 2014 after 2 years of higher defaulted ratio than the ratio of shorter terms.

It would explain why the borrower rate of loans with terms of 36 months are more dispersed than those with 60 months. For me, it is due to the loans with longer terms have shown a smaller defaulted ratio in early stages, accumulating data with smaller interest rates. It suggests that I can’t use the some of features without considering the date they are related to.

Wrapping Up the Multivariate Section

In this section, I learned that the worst debt to Income Ratios are more common in loans with terms of 36 months. I also verified that the defaulted ratio changed differently over time when considered by Term, what may have affected the borrowers rates.

Final Plots and Summary

First Figure: Defaulted Loans and Interest Rate by Month of Origination

The figure bellow shows how many loans were classified as defaulted on each month, as percentage, considering the date that they were originated. Also shows how the borrower rates were distributed on each month. I called as defaulted every loan that was classified as defaulted or charged-off.

## No id variables; using all as measure variables

The percentage of defaulted loans increased from February (6%) to March (9%) with a peak in April (12%). The average borrower rate also increased during these months from 15% to 17% to 20%. Overall, the interest rate ranged from around 5% to 35% in the data selected. Also, the sazonality in both features is similar. In January, February and December were the smallest medians of Borrower rates (around 17%) and percentage of defaulted loans (5%, 6% and 6% again).

In other words, ss the percentage of defaulted loans increases, the average borrower rate also increases. With this chart, I suppose that both variable were linked somewhat and this insight drove to my last analyzes on this data set.

Second Figure: Borrower Rates By Term and Year of Origination

The next one, I choose It shows boxplots of the borrower rate splitted by term, showing separately the dispersion on each year.

The average borrower rate of loans with term of 36 months and Debt to Income Ratio smaller than 0.14, increased from 2010 (19%) to 2012 (20%) and decreased to 2014 (12%), the smallest average rate for this term. The loans with term of 60 months increased from 2010 (14%) to 2012 (21%) and decreased to 2014 (16%). The average borrower rate of terms of 60 months became higher than the term of 36 month since 2012. Although the interest rate for terms of 12 months has become more dispersed, the maximum value for its average was around 12%.

Also, the figure above shows that the loans with 12 and 60 months have begun to be made in 2010, with the average borrower rate much smaller than loans with 36 month. As pointed before, in 2011, the loans with the two Terms have already begun to be more dispersed, although the distance between the second and the third quartile always was more wide in the loans with Terms of 36 months.

I would say that the borrower rate of loans with term of 60 month has became really greater than the 36 months in 2014, when even the second quartile of the 60 months almost became greater than the third quartile of loans with term of 36 months.

This chart helped me understand how the dynamic of the borrowers rates changed through the years. Initially I had observed that loans with 36 month had been more dispersed than the 60 month and I assumed that it was a characteristic of the loans. After this chart, I wondered if there were another reason for the dispersion.

Third Figure: Defaulted Loans By Term and Closing Year

Then, following my previous reasoning, I plotted the ratio of the defaulted loans by the year when the loans were closed, splitting them by term.

The percentage of defaulted loans with term of 60 months increased from 2012 (20%) to 2013 (33%) and decreased in 2014 (29%). In comparison, the defaulted ratio with term of 36 month has become smaller than the Term of 60 months when it slightly decreased from 2012 (23%) to 2013 (22%) and in 2014 (18%). The defaulted ratio of loans with 12 months never rose from 7%.

In the chart above, I decided to use the year when the loans were closed because I had imagined that years when were verified a increase in defaulted ratio, the borrower rates probably would also increase.

As can be seen, in fact, it happened. In 2013 and 2014, when the defaulted ratio of the loans with term of 60 month sharply increasesd(as shown above), the median of borrowers rate also increased (as shows the Second Figure).

I choose this last figure because this relation between defaulted ratio and borrower rate helped me to understand why the borrower rate changed his behaviour throughout the years between different terms.

Reflection

At the beginning, I didn’t know what to looking for in the data set. I believe that the number of features confused me at first and I spent some time trying to understand the variables before start the project. Then, I just started to poke around the data, trying to find something interesting to explore, without success. Then, I decided to make some assumptions and try to check them in the collection. After that, my project took off.

As I am used to Scilab and Pandas, at the beginning I was thinking that the syntax of R would not be an issue for me. I wasn’t totally right. I suffered trying to group and reshape the data set, but after that I understood the basic logic of the libraries I was using, it wasn’t so painful.

During my analysis, I verified that many variables that I have believed that would have a greater effect on Borrowers rate, actually presented just a smaller role. Maybe if I tried to build a model to predict the borrower rate, all these variables would be useful, but not alone, just together.

At the end, I just found two variable that helped me to understand the borrower rate variations. Debt to Income Ratio and the overall Defaulted ratio. The last one was curious. It seems be connected to the level of all interest rates for a particular year.

I also did not find any good reason to transform any variable. I have expected to find something, given that the data set is related to money.

I stopped my analysis when I have started to explore how the distributions of this collection had been changing over year. Maybe if I crossed this data set with economic indicators I could find some interesting stories. For instance, would be interesting to check if there is a correlation between unemployment rate and defaulted ratio. It probably would affect the overall borrower rates.