Because this dataset has many variables, I use only 13 of them. Those are:

Univariate Plot Section

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.04228  0.00000 39.00000

## 
##     0     1     2     3     4     5     6     7     8     9    14    16 
## 87862  2438   403    78    23     8     3     3     2     5     1     2 
##    21    24    39 
##     1     1     1

Recommendations: Number of recommendations.

This is the number of recommendations that borrowers have when listing their loans. The number of recommendations is right skewed, with a mean of 0.04, which means the average borrower has almost no recommendations. It goes as high as 39 recommendations for just one borrower.

I choose a histogram because it depicts the distribution of a numerical variable. Since the distribution is right skewed, I log scale the number of recommendations.
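The skew can be checked directly from the frequency table printed above. This is a minimal sketch in R that rebuilds the variable from that table; the object names are my own, and taking log10 of the counts is only one possible reading of the log scaling described here.

```r
# Frequency table copied from the output above (value -> number of borrowers).
rec_values <- c(0:9, 14, 16, 21, 24, 39)
rec_counts <- c(87862, 2438, 403, 78, 23, 8, 3, 3, 2, 5, 1, 2, 1, 1, 1)

# Expand the table back into one value per borrower.
recommendations <- rep(rec_values, times = rec_counts)

rec_mean <- mean(recommendations)          # compare with the Mean in summary()
share_zero <- mean(recommendations == 0)   # share of borrowers with none

# log10 of the counts keeps the rare high values visible next to the spike at 0.
log_counts <- setNames(log10(rec_counts), rec_values)

print(round(rec_mean, 5))
print(round(share_zero, 4))
print(round(log_counts, 2))
```

On a log scale the single borrowers at 21, 24 and 39 sit at 0, while the spike at zero recommendations sits near 5, so both ends stay readable in one chart.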

Rate: The Borrower’s interest rate for this loan.

The borrower’s rate follows an almost unimodal distribution, with a peak around 0.16. There is a spike around 0.3, but it is not large enough to count as a second peak.

## 
##     1     2     3     4     5     6     7 
##  5487  7521 10839 15224 12633 11081  4151

Rating : The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.

The Rating can also be null if the Prosper system can’t rate the loan. About 23895 loans (the 90831 listings minus the 66936 rated ones in the table above) are not rated by Prosper, which means they originated before July 2009. The table codes the ratings from 1 (HR) to 7 (AA). The counts do not simply follow the order of the rating: they peak at 4 (C) and fall off toward both ends, with AA (7) the least common.

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 14398 47562  5242  5406  1851   561  1738  7903   150    64    76   179 
##    12    13    14    15    16    17    18    19    20 
##    44  1602   746  1117   253    41   669   597   632

Category: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

I choose a bar chart for the listing category since this is a categorical variable. Two categories stand out with more than 10.000 loans each, and Other comes third with 7903. Because of the N/A and Other categories, we can’t know the specific purpose of many loans. The highest is 1, Debt Consolidation, where one takes out a loan to pay off several others. It is very high at 47562 loans, overshadowing the rest of the categories. It could be that many Prosper visitors already have loans and are looking for a new loan to pay them off.
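To read the category table without looking up the codes, the counts can be paired with their labels. A small sketch in R, using only the counts and labels printed above; the object names are my own.

```r
# Labels for codes 0..20, in the order given in the variable description.
category_labels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans", "Household Expenses",
  "Large Purchases", "Medical/Dental", "Motorcycle", "RV", "Taxes", "Vacation",
  "Wedding Loans"
)

# Counts for codes 0..20, copied from the frequency table above.
category_counts <- c(
  14398, 47562, 5242, 5406, 1851, 561, 1738, 7903, 150, 64, 76, 179,
  44, 1602, 746, 1117, 253, 41, 669, 597, 632
)
names(category_counts) <- category_labels

# Three largest categories and their share of all listings, in percent.
top3 <- sort(category_counts, decreasing = TRUE)[1:3]
top3_share <- round(100 * top3 / sum(category_counts), 1)

print(top3)
print(top3_share)
```

Sorting by count rather than by code is also a reasonable order for the bar chart itself, since the codes carry no ranking.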

## False  True 
## 45292 45539

Homeowner: A Borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.

When looking at the listed loans, we see that borrowers who are homeowners make up about the same proportion as those who are not (45539 versus 45292), so on its own this variable tells us little.

Income: The monthly income the borrower stated at the time the listing was created. At first glance there isn’t much going on with the monthly income.

Monthly income will definitely be right skewed, since fewer people have higher salaries. So I cut the outliers and use a log 10 scale.

## 
##    0    1    2    3    4    5    6    7    9   11   12   16   21 
## 8445   66   27   19    4    6    3    5    1    1    1    1    1

PaymentsLate: Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.

Here we see that the number of late payments is also very right skewed. Since this is an integer count and not a floating point number, the histogram does not look smooth.

Looking at the frequency count for recommendations, we see that most borrowers have no recommendations when they list a loan. It seems to be really hard to get recommendations for a listing, since only a few people have more than ten of them.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7000    8644   12500   35000

Amount: The origination amount of the loan.

Now this is interesting. Here we see that the distribution is still right skewed, but the peak seems to be around 4000 dollars. In the summary table we see that this is the first quartile, while the median is 7000 dollars. Notice that the maximum loan is 35000 dollars. For someone who states a monthly income as high as ~2 million dollars, it doesn’t make sense to borrow such a low amount. Since StatedMonthlyIncome is human input, it can’t be fully trusted, hence I will exclude the outliers from now on.

## 
##    12    36    60 
##  1062 69530 20239

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   41.07   36.00   60.00

Term: The length of the loan expressed in months.

Here we see that the term of the loans takes only three values: 1 year, 3 years, and 5 years. Most of the loans run for 3 years (69530), with 20239 loans at 5 years and 1062 loans at 1 year. With only three values, the term behaves more like a categorical variable peaked at 36 months than like a normal distribution.

## 
##                                                        Accountant/CPA 
##                               3190                               2439 
##           Administrative Assistant                            Analyst 
##                               2912                               2748 
##                          Architect                           Attorney 
##                                162                                798 
##                          Biologist                         Bus Driver 
##                                 94                                275 
##                         Car Dealer                            Chemist 
##                                140                                116 
##                      Civil Service                             Clergy 
##                               1109                                159 
##                           Clerical                Computer Programmer 
##                               2558                               3301 
##                       Construction                            Dentist 
##                               1548                                 61 
##                             Doctor                Engineer - Chemical 
##                                402                                178 
##              Engineer - Electrical              Engineer - Mechanical 
##                                866                               1108 
##                          Executive                            Fireman 
##                               3400                                327 
##                   Flight Attendant                       Food Service 
##                                 95                                943 
##            Food Service Management                          Homemaker 
##                               1008                                104 
##                           Investor                              Judge 
##                                152                                 18 
##                            Laborer                        Landscaping 
##                               1343                                190 
##                 Medical Technician                  Military Enlisted 
##                                929                               1003 
##                   Military Officer                        Nurse (LPN) 
##                                250                                422 
##                         Nurse (RN)                       Nurse's Aide 
##                               2101                                437 
##                              Other                         Pharmacist 
##                              22931                                209 
##         Pilot - Private/Commercial  Police Officer/Correction Officer 
##                                146                               1272 
##                     Postal Service                          Principal 
##                                488                                257 
##                       Professional                          Professor 
##                              10368                                452 
##                       Psychologist                            Realtor 
##                                124                                439 
##                          Religious                  Retail Management 
##                                 90                               2115 
##                 Sales - Commission                     Sales - Retail 
##                               2788                               2248 
##                          Scientist                      Skilled Labor 
##                                287                               2269 
##                      Social Worker         Student - College Freshman 
##                                595                                 34 
## Student - College Graduate Student           Student - College Junior 
##                                185                                 90 
##           Student - College Senior        Student - College Sophomore 
##                                153                                 58 
##        Student - Community College         Student - Technical School 
##                                 19                                 15 
##                            Teacher                     Teacher's Aide 
##                               3002                                217 
##              Tradesman - Carpenter            Tradesman - Electrician 
##                                 98                                366 
##               Tradesman - Mechanic                Tradesman - Plumber 
##                                780                                 82 
##                       Truck Driver                    Waiter/Waitress 
##                               1423                                345

## 
##               Other        Professional           Executive 
##               22931               10368                3400 
## Computer Programmer                     
##                3301                3190

Occupation: The Occupation selected by the Borrower at the time they created the listing.

This time we have 68 levels in the category. Executive and Computer Programmer are among the top 5, with Teacher just behind at 3002 (the fifth place goes to listings with no occupation stated). We can see that Other is the highest by a wide margin. This is understandable, since it groups all borrowers whose occupation is not listed among the categories.

##              Cancelled             Chargedoff              Completed 
##                      4                  10457                  25925 
##                Current              Defaulted FinalPaymentInProgress 
##                  48042                   4644                    155 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     13                    619                    204 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    297                    237                    234

LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

Many of the loans in this dataset are still running; Completed comes second with 25925 loans. About 1600 loans are past due. We also see from the bar chart that Current has the highest count, which means many borrowers are still repaying their loans. Only a small number of borrowers have loans that are past due, which suggests that most borrowers keep up with their payments, although 10457 loans were charged off and 4644 defaulted.
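The past due total is easier to see once the six delinquency buckets are collapsed into one status. A short sketch in R using the counts from the table above; the collapsed label `PastDue` and the object names are my own choices.

```r
# Loan status counts copied from the table above.
status_counts <- c(
  "Cancelled" = 4, "Chargedoff" = 10457, "Completed" = 25925,
  "Current" = 48042, "Defaulted" = 4644, "FinalPaymentInProgress" = 155,
  "Past Due (>120 days)" = 13, "Past Due (1-15 days)" = 619,
  "Past Due (16-30 days)" = 204, "Past Due (31-60 days)" = 297,
  "Past Due (61-90 days)" = 237, "Past Due (91-120 days)" = 234
)

# Merge every "Past Due (...)" bucket into a single status.
is_past_due <- grepl("^Past Due", names(status_counts))
collapsed <- c(
  status_counts[!is_past_due],
  "PastDue" = sum(status_counts[is_past_due])
)
collapsed <- sort(collapsed, decreasing = TRUE)

print(collapsed)
```

With the buckets merged, the past due loans can be compared directly with the Chargedoff and Defaulted counts.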

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7000  1.0000  1.0000  0.9985  1.0000  1.0040

PercentFunded: Percent the listing was funded.

This is the funded share of the loan. At first the values look odd, since most of them are 1, with a minimum of 0.7 and a maximum of 1.004. Read as percentages they would make no sense, so they are most likely fractions, where 1.0 means 100% funded. Even so, a value above 1.0 shouldn’t be possible. In the density plot, we see that the values are centered around 1.0, which means that most of the listed loans actually get funded completely.

## 
##           A    AA     B     C     D     E    HR    NC 
## 67003  2711  2893  3500  4474  4188  2782  3148   132

CreditGrade: The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.

This is essentially the same as Prosper Rating, except that credit grade is the rating given to loans before 2009 and ProsperRating is given after July 2009. Looking at the frequency table, we see that credit grade is empty for 67003 loans; those are the loans listed after July 2009. Since in this plot we only want to see credit grade, I exclude the loans listed after that date. What I see now is that credit grade looks roughly like a normal distribution (I can say that since this is an ordinal categorical variable), centered at C. NC is an outlier, which in this case means “No Score”.

Univariate Analysis

What is the structure of your dataset?

There are 113937 loans in this dataset and 81 variables (I use only 13 of them). I rename the columns to make them shorter, and drop duplicate data so that 1 observation corresponds to 1 person, which leaves the 90831 observations counted in the tables above.
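The renaming and de-duplication step might look roughly like the sketch below. The toy data frame, the key column `MemberKey`, and the choice to keep the first listing per person are all assumptions made for illustration; they are not taken from the original code.

```r
# Toy stand-in for the loan data: borrower "a" appears twice.
loans <- data.frame(
  MemberKey = c("a", "a", "b"),
  BorrowerRate = c(0.10, 0.12, 0.20),
  StatedMonthlyIncome = c(4000, 4000, 6500),
  stringsAsFactors = FALSE
)

# Shorten the column names (old name -> new name).
new_names <- c(BorrowerRate = "Rate", StatedMonthlyIncome = "Income")
to_rename <- names(loans) %in% names(new_names)
names(loans)[to_rename] <- new_names[names(loans)[to_rename]]

# Keep only the first row per borrower, so 1 observation = 1 person.
loans_unique <- loans[!duplicated(loans$MemberKey), ]

print(loans_unique)
```

Which listing to keep for a repeat borrower (first, last, or largest) changes the resulting counts, so that choice is worth stating next to the tables.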

Other observations:

It’s unlikely that people with a 2 million dollar salary take out a small loan
Many people list a loan on Prosper to pay off their other loans, as shown by the high number of Debt Consolidation listings
Computer Programmer is among the top 5 occupations listing loans, with Teacher close behind.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is the Prosper rating. I would like to find out which of these 13 features work best as predictors of the Prosper rating.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I want to test the features that already exist when the borrower lists the loan. Income, loan amount, occupation, listing category, recommendations, borrower rate, whether the borrower is a homeowner, number of late payments, and term are the features I want to examine to see whether they help predict the Prosper rating. Because the Prosper rating is the feature of interest, I exclude rows with a missing Prosper rating from the bivariate analysis.

Did you create any new variables from existing variables in the dataset?

For this dataset, I don’t create any new features.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There are unusual features in this dataset, in particular monthly income. It is stated (human input) by the borrowers, so we see many unusual salaries. There’s a borrower who stated a monthly income close to two million dollars yet borrowed only 4000 dollars (the maximum loan in the data is 35000 dollars). There are also people who state a monthly salary of 0 dollars. So I tidy the feature by excluding values below the 5% quantile and above the 95% quantile. We know that income has a long right tail, with salaries very far apart, so I use a log 10 transformation. After excluding the outliers and applying the log 10 transformation, the result is close to a normal distribution.
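The trimming and transformation described above can be sketched in R as follows. The incomes are simulated (log-normal, plus one zero and one extreme value), so the numbers are illustrative only, and the strict inequalities at the two quantiles are my assumption about how the cut was made.

```r
set.seed(42)

# Simulated monthly incomes: a long right tail plus two implausible entries.
income <- c(rlnorm(1000, meanlog = log(5000), sdlog = 0.5), 0, 1750000)

# Cut-offs at the 5% and 95% quantiles.
bounds <- quantile(income, probs = c(0.05, 0.95))

# Keep only values strictly between the two cut-offs.
trimmed <- income[income > bounds[1] & income < bounds[2]]

# log10 compresses the remaining right tail.
log_income <- log10(trimmed)

print(summary(trimmed))
print(summary(log_income))
```

Trimming before taking the log matters here: a stated income of 0 would otherwise become `-Inf` under `log10`.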

Bivariate Plots Section

Let’s take a look at them in detail.

We see that even though the rating isn’t properly scattered (it takes only 7 distinct values), the rating increases as the rate gets lower. The two have a strong negative linear relationship, indicated by a correlation of about -0.95. This can be explained by the fact that the interest rate is what the lender earns on the money he/she lends.

While the distribution of the rating is similar for homeowners and non-homeowners (you can see the median is the same for both categories), the rating of homeowners has a higher Q3 than that of non-homeowners.

This is interesting. Ratings 1 and 2 have a really small IQR compared to the rest of the ratings. Ratings 3 and 4 have no outliers. Ratings 5 and 6 have similar outliers, and rating 7 has one outlier. The correlation of 0.47 shows a moderately strong positive linear relationship, and indeed we see that as the rating increases, the loan amount has a longer and longer tail.

Looking at the heatmap above, I see that a lot of the borrowers have a lower rating, which agrees with the univariate chart we saw earlier. I see almost no relationship, however: a higher number of recommendations doesn’t affect the Prosper Rating much. This matches the correlation of 0.05 between the two variables.

PaymentsLate should also be an indicator, with more late payments by the borrower leading to a worse rating. However, it shows only a weak relationship with the Prosper rating, although the outliers do point in the expected direction.

## df$Homeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2917    4167    4876    6000  394400 
## -------------------------------------------------------- 
## df$Homeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    4167    5833    6885    8167 1750000

Then I try faceting the monthly salary by homeowner status. What I see is that salary is higher for homeowners than for those who aren’t. Since we know that income has a long right tail, the mean does not represent most people; the median does. In this summary we see that both the median and the mean income are higher for homeowners.

And in the box plot, we see that homeowners tend to have a higher income as well, as expected. The outliers in this data are so high that they overshadow the interquartile range. Looking at the boxplot, I expect reasonable outliers to stay within 50000. Since StatedMonthlyIncome is stated by the borrower himself, I begin to wonder about people who state close to 2 million dollars as their monthly income. I categorize these as bad outliers in the extreme case, and replot the boxplot to stay within 50000 dollars.

There you go, now we can see that homeowners still have the higher median salary. There are still many outliers in both boxplots, as we expect since income has a long tail.
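The reason to prefer the median over the mean for a long-tailed income can be shown with a tiny example in R. The seven income values and the data frame name `toy` are made up for illustration; only the idea of grouping by `Homeowner` comes from the summary above.

```r
# Made-up incomes; the last homeowner has an extreme stated income.
toy <- data.frame(
  Homeowner = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  Income = c(2900, 4200, 6000, 4100, 5800, 8200, 1750000)
)

# Median and mean income per group.
medians <- tapply(toy$Income, toy$Homeowner, median)
means <- tapply(toy$Income, toy$Homeowner, mean)

# Cap the plotting range instead of deleting rows, as in the replotted boxplot.
within_cap <- toy[toy$Income <= 50000, ]

print(medians)
print(means)
print(nrow(within_cap))
```

One extreme value pulls the homeowner mean far above every other income in the group, while the median stays close to the typical values.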

## 
##  Pearson's product-moment correlation
## 
## data:  df$Rate and df$Income
## t = -22.813, df = 66934, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09534881 -0.08031441
## sample estimates:
##         cor 
## -0.08783661

I see almost no relationship between the two variables. In the figure, the borrower’s income shows no visible change as the interest rate increases. The correlation of -0.09 is also close to zero, which indicates almost no linear relationship. Interestingly, when I add a regression line, I can see that income decreases slightly as the interest rate increases. So based on the regression line, people with lower income tend to have a somewhat higher interest rate.
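A weak correlation and a clearly negative regression slope can coexist, which is what the paragraph above describes. The sketch below reproduces that situation in R on simulated data; the sample size, coefficients, and noise level are arbitrary assumptions and not estimates from the Prosper data.

```r
set.seed(1)

# Simulated data: income falls slightly with rate, buried in a lot of noise.
n <- 5000
rate <- runif(n, min = 0.05, max = 0.35)
income <- 6000 - 5000 * rate + rnorm(n, sd = 4000)

# Pearson correlation test, the same kind of test as in the output above.
test <- cor.test(rate, income)
r <- unname(test$estimate)

# Simple linear regression gives the slope of the fitted line.
fit <- lm(income ~ rate)
slope <- unname(coef(fit)["rate"])

print(round(r, 3))
print(round(slope, 1))
print(test$p.value < 0.05)
```

With thousands of observations even a very small correlation is statistically significant, so the p-value alone says little about how useful income is for predicting the rate.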

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Almost none of them has a strong relationship with the Prosper rating. The interest rate, however, shows a strong relationship with the Prosper rating. Based on the data, the borrower’s interest rate explains 91.15% of the variance in the Prosper rating.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

When observing the relationship between whether a person is a homeowner and their stated monthly income, it becomes clear that homeowners have a higher income, in both median and mean. In the data, the median salary of homeowners is about 1700 dollars higher than that of non-homeowners (5833 versus 4167).

What was the strongest relationship you found?

The strongest relationship is between interest rate and rating, a strong negative linear relationship. The second is between the original loan amount and the rating, a moderately strong positive linear relationship. Both loan amount and interest rate can be used to infer the rating, and they move in opposite directions relative to it.

Multivariate Plots Section

I see that income now affects the rating as well, but not by a wide margin. For the lowest rating, the distribution peaks around the center of the income range, but as the rating goes up, the center of the income distribution shifts towards the right, making it a left skewed distribution.

It is also apparent from the density plot that the lower the rating, the rarer it is. Note that we have a huge peak for the lowest rating when I plot it against interest rate and loan amount. 1 is HR in the Prosper rating, which means high risk. Interestingly, we see that borrowers with a high interest rate receive a low rating from Prosper.

But that is not the case with the loan amount. Even a really high loan amount doesn’t make the loan high risk.

Income and rating show no unusual difference when plotted against homeowner.

We can see that non-homeowners tend to have a higher interest rate, and thus a lower rating, while homeowners tend to have a lower interest rate and a higher rating. This also suggests that homeowners are a safer bet for people who lend their money.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Monthly income does reinforce the Prosper rating. We see that the center of the income distribution shifts to the right every time the rating goes up.

The number of late payments also affects the rating, though only weakly: the rating tends to go down as the borrower has a higher number of late payments.

Were there any interesting or surprising interactions between features?

A high loan amount should carry high risk, since it’s costly if someone doesn’t pay back the loan. But here the rating doesn’t even seem to consider the loan amount. If anything it is the opposite: a low loan amount can come with higher risk.

Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3430    5000    5935    7083 1750000

## [1] 4000

Description One

I choose a histogram since income is a numerical variable, and a histogram is used when we want to see the distribution. Salary income is expected to be highly right skewed, since fewer and fewer people have higher salaries. But it’s human input, so it’s expected to contain outliers as well. After excluding the outliers and log10 scaling the income, we have a tidier, roughly normal distribution.

Since this is right skewed, I use quantiles to describe it more statistically. We see that the minimum monthly salary is 0, which is not expected. Income is stated manually by the borrower, so some users may prefer not to fill it in, and it defaults to zero. The other thing is that while most of the users in the interquartile range have incomes in the thousands, there is an income as high as 1.75 million from a borrower who takes out a loan of 4000 dollars. Clearly a person with this income shouldn’t need such a small loan. And since this is human input, there’s a high chance this person is not being honest. Because of this distribution and the outliers mentioned, I choose a log10 scale to bring it closer to a normal distribution.

Plot Two

## 
##  Pearson's product-moment correlation
## 
## data:  df$Rate and df$Rating
## t = -830.36, df = 66934, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9553976 -0.9540569
## sample estimates:
##        cor 
## -0.9547321

Description Two

Looking at the statistics, Prosper Rating is strongly correlated with the borrower’s interest rate, with a negative linear relationship. This is the highest correlation among all the features. It suggests that as the interest rate decreases, the Prosper Rating gets a higher grade, and the R^2 of 0.91 means that 91% of the variance in Prosper rating can be explained by the borrower’s interest rate in the data.
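The R^2 quoted here is simply the square of the Pearson correlation, and the t statistic in the test output follows from the same r and the degrees of freedom. A small check in R using only the numbers printed in the test output above; `t_from_r` is a helper name I made up for this sketch.

```r
# t statistic of a Pearson correlation r with df = n - 2 degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# Values copied from the cor.test output for Rate and Rating.
r_rate_rating <- -0.9547321
df_resid <- 66934

# Share of variance explained in a simple linear relationship.
r_squared <- r_rate_rating^2

# Should land close to the t value reported in the output (-830.36).
t_stat <- t_from_r(r_rate_rating, df_resid)

print(round(100 * r_squared, 2))
print(round(t_stat, 2))
```

The same helper applied to the Rate and Income correlation (r = -0.08783661 with the same degrees of freedom) gives a t value near -22.8, which shows how a tiny correlation still produces a very small p-value at this sample size.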

I use a box plot and split the charts by Prosper Rating. The reason is that I want to see how the distribution of borrowers’ interest rates is skewed across Prosper Ratings. Looking at the chart, the lowest Prosper rating has the highest median interest rate. It has no outliers above Q3. Meanwhile the highest rating has the smallest IQR compared to the rest of the ratings. This suggests that the interest rate for the highest rating stays within a small range. If we observe the summary statistics of the interest rate at the highest rating,

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.0704  0.0789  0.0797  0.0849  0.2100

We see that the interest rate is at minimum 0.04, and only as high as 0.21.

Plot Three

Description Three

I choose a density plot to see where the distribution is centered as the Prosper Rating increases. While it’s not a very distinct trend, we see that the center of the distribution shifts towards the right when the rating goes higher. We can’t see a roughly normal distribution like this unless we exclude values beyond the 5% and 95% quantiles and also log10 scale the income.

## 
##  Pearson's product-moment correlation
## 
## data:  df$Income and df$Rating
## t = 22.654, df = 66934, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07970669 0.09474270
## sample estimates:
##        cor 
## 0.08722966

Even though we see a trend in the density plot, the correlation test shows only a weak positive linear relationship.

Reflection

The dataset is from Prosper loans, where people can borrow money by listing a loan on the website and filling in all of the required fields. Among the 13 features I’m exploring, the columns specified by the borrowers are homeowner, income, and occupation. Of course, since this is human input, we should expect human error. We see that the income stated by a borrower can be as high as close to 2 million dollars, and when I investigate this, the borrower asks for just 4000 dollars (the largest loan in the data is 35000 dollars). This is very unlikely given such a high monthly salary. Income is expected to be right skewed as well, so I log10 scale the income and also exclude the outliers. The feature of interest is the Prosper Rating, that is, which features contribute to the Prosper rating system. The rating is given immediately after the loan is listed, so I only consider features that actually exist at the time of the listing. So among all of the features, I select the borrower’s interest rate, homeowner, listing category, borrower’s income, number of late payments, number of recommendations, loan term, and the occupation of the borrower. Among all of these features, the interest rate is the most strongly correlated, with a lower interest rate going along with a higher Prosper rating. I’m actually surprised that the loan amount does not contribute significantly to the rating. I expected that the higher the loan, the higher the risk for the lender.

Occupation and Listing Category should be useful features for predicting the Prosper rating. But this is hard since both have many categorical levels. People with more interesting jobs could have a higher rating, but too many borrowers fall under Other and Professional, which are hard to define. While listing category could also play an important role, it also has too many categorical levels. The significance of both variables for the Prosper Rating can be tested using ANOVA, but that is outside the scope of this analysis. This analysis does succeed in selecting the features and examining their correlation with the feature of interest, which in this case is the Prosper Rating. There are 81 features in the whole dataset, and I select only 13 of them. It would be interesting to explore additional features to predict the rating of the Prosper system.

EDA on Prosper Loan Data

Napitupulu Jon

November 27, 2015