SIB (Small Industries Bank) lends money to companies in exchange for the promise of repayment. Some companies default, being unable to repay for some reason. The bank carries insurance to reduce its risk of loss in the event of default; the insured amount may cover all or only part of the loan. SIB wants to predict which companies will default on their loans based on their financial information. The provided dataset contains loan-related information such as the loan amount, term, and state, as well as company information such as the number of employees, operating sector, etc.
Using machine learning, predict which companies will default on their loans and explain how different features impact the predictions.
Notebook setup: hide messages and warnings, set the default figure size to 4 x 2 inches, and define a ggplot theme for the notebook.
knitr::opts_chunk$set(warning = FALSE,
                      message = FALSE,
                      fig.height = 2,
                      fig.width = 4)
# set up theme for ggplot
my.theme <- theme(
  panel.background = element_rect(fill = "transparent"),                   # bg of the panel
  plot.background = element_rect(fill = "transparent"),                    # bg of the plot
  panel.grid.major = element_blank(),                                      # remove major grid
  panel.grid.minor = element_blank(),                                      # remove minor grid
  legend.background = element_rect(fill = "transparent", color = NA),      # remove legend bg
  legend.box.background = element_rect(fill = "transparent", color = NA))  # remove legend box bg
There are two datasets: train.csv (labeled) and test.csv. The first 6 rows of train.csv look like this:
# packages used throughout the notebook
library(data.table); library(dplyr); library(ggplot2)
library(knitr); library(kableExtra)
library(summarytools); library(ggpubr); library(h2o)

df <- fread('./data/train.csv')
df %>%
  select(-id) %>%
  head() %>%
  kable(caption = 'raw data') %>%
  kable_styling('striped', full_width = T)
industry | state | request_date | term | employee_count | business_new | business_type | location | other_loans | loan_amount | insured_amount | default_status |
---|---|---|---|---|---|---|---|---|---|---|---|
Others | VA | 27-Apr-10 | 34 | 4 | New | 0 | Rural | N | $35,000.00 | $35,000.00 | 1 |
Manufacturing | CA | 05-Nov-09 | 107 | 1 | New | 0 | Rural | N | $15,000.00 | $13,500.00 | 1 |
Trading | CA | 26-Feb-10 | 84 | 1 | New | 0 | Rural | Y | $265,000.00 | $100,000.00 | 0 |
Engineering | MI | 10-Jun-10 | 240 | 21 | New | 0 | Rural | N | $255,000.00 | $255,000.00 | 0 |
Education | NH | 23-Sep-10 | 36 | 1 | Existing | 0 | Rural | N | $13,300.00 | $6,650.00 | 0 |
Administration | VA | 24-Dec-09 | 60 | 42 | New | 0 | Rural | Y | $40,000.00 | $20,000.00 | 0 |
Some column-type conversions were applied to clean the data: date strings such as "30-Dec-09" are parsed into Date objects ("2009-12-30"), and dollar-formatted amounts are converted to numeric.

clean_data <- function(df){
  df %>%
    mutate(
      # %b matches abbreviated month names like "Apr"
      request_date = as.Date(request_date, format = '%d-%b-%y'),
      loan_amount = as.numeric(gsub('[$,]', '', loan_amount)),
      insured_amount = as.numeric(gsub('[$,]', '', insured_amount)))
}
df <- clean_data(df)
df %>%
select(-id) %>%
head() %>%
kable(caption = 'cleaned data') %>%
kable_styling('striped',full_width = T)
industry | state | request_date | term | employee_count | business_new | business_type | location | other_loans | loan_amount | insured_amount | default_status |
---|---|---|---|---|---|---|---|---|---|---|---|
Others | VA | 2010-04-27 | 34 | 4 | New | 0 | Rural | N | 35000 | 35000 | 1 |
Manufacturing | CA | 2009-11-05 | 107 | 1 | New | 0 | Rural | N | 15000 | 13500 | 1 |
Trading | CA | 2010-02-26 | 84 | 1 | New | 0 | Rural | Y | 265000 | 100000 | 0 |
Engineering | MI | 2010-06-10 | 240 | 21 | New | 0 | Rural | N | 255000 | 255000 | 0 |
Education | NH | 2010-09-23 | 36 | 1 | Existing | 0 | Rural | N | 13300 | 6650 | 0 |
Administration | VA | 2009-12-24 | 60 | 42 | New | 0 | Rural | Y | 40000 | 20000 | 0 |
Summary of train.csv:
df %>%
select(-id) %>%
dfSummary(style = 'grid', plain.ascii = F, tmp.img.dir = './tmp', headings = F)
No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
---|---|---|---|---|---|
1 | industry [character] | 1. Trading 2. Construction 3. Consulting 4. Hotel 5. Manufacturing 6. Healthcare 7. Others 8. Administration 9. Transportation 10. Agriculture [7 others] | 508 (21.1%) 275 (11.4%) 267 (11.1%) 245 (10.2%) 245 (10.2%) 219 (9.1%) 168 (7.0%) 125 (5.2%) 106 (4.4%) 58 (2.4%) 186 (7.7%) | 2402 (100%) | 0 (0%) |
2 | state [character] | 1. CA 2. TX 3. NY 4. OH 5. FL 6. MN 7. WI 8. IL 9. UT 10. MA [40 others] | 293 (12.2%) 191 (8.0%) 133 (5.5%) 102 (4.2%) 95 (4.0%) 91 (3.8%) 80 (3.3%) 79 (3.3%) 77 (3.2%) 74 (3.1%) 1187 (49.4%) | 2402 (100%) | 0 (0%) |
3 | request_date [Date] | min: 2009-10-01 med: 2010-03-11 max: 2010-09-30 range: 11m 29d | 271 distinct values | 2402 (100%) | 0 (0%) |
4 | term [integer] | Mean (sd): 87.6 (62.7) min < med < max: 1 < 84 < 312 IQR (CV): 43 (0.7) | 155 distinct values | 2402 (100%) | 0 (0%) |
5 | employee_count [integer] | Mean (sd): 9.3 (21.2) min < med < max: 0 < 4 < 500 IQR (CV): 7 (2.3) | 88 distinct values | 2402 (100%) | 0 (0%) |
6 | business_new [character] | 1. Existing 2. New | 715 (29.8%) 1687 (70.2%) | 2402 (100%) | 0 (0%) |
7 | business_type [integer] | Min: 0 Mean: 0.1 Max: 1 | 0: 2270 (94.5%) 1: 132 (5.5%) | 2402 (100%) | 0 (0%) |
8 | location [character] | 1. Rural | 2402 (100.0%) | 2402 (100%) | 0 (0%) |
9 | other_loans [character] | 1. N 2. Y | 1741 (72.5%) 661 (27.5%) | 2402 (100%) | 0 (0%) |
10 | loan_amount [numeric] | Mean (sd): 204487.7 (364335.6) min < med < max: 100 < 50000 < 4e+06 IQR (CV): 191492.5 (1.8) | 1031 distinct values | 2402 (100%) | 0 (0%) |
11 | insured_amount [numeric] | Mean (sd): 155016.7 (311422.7) min < med < max: 1700 < 35000 < 4e+06 IQR (CV): 112250 (2) | 735 distinct values | 2402 (100%) | 0 (0%) |
12 | default_status [integer] | Min: 0 Mean: 0.3 Max: 1 | 0: 1629 (67.8%) 1: 773 (32.2%) | 2402 (100%) | 0 (0%) |
Compared with non-default loans, default loans had relatively shorter terms.
# 4.1 term histogram
df %>%
mutate(default_status = factor(default_status)) %>%
ggplot(aes(x= term, fill= default_status)) +
geom_histogram(bins=50) +
facet_grid(~default_status) +
my.theme
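The difference in terms can also be checked numerically; this is a quick sketch rather than part of the original analysis:

```r
# median and mean term by default status
df %>%
  group_by(default_status) %>%
  summarise(median_term = median(term),
            mean_term = mean(term))
```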
Loan amount and insured amount both have skewed distributions. A logarithmic transformation brings them closer to a normal-like distribution.
p1 <- df %>%
ggplot(aes(x=loan_amount)) +
geom_histogram(bins = 50, fill='lightblue') +
labs(title = 'Loan Amount') +
my.theme
p2 <- df %>%
ggplot(aes(x=insured_amount)) +
geom_histogram(bins = 50, fill='lightblue') +
labs(title = 'Insured Amount') +
my.theme
p3 <- p1 + scale_x_log10() + labs(title = 'Loan Amount (log scale)')
p4 <- p2 + scale_x_log10() + labs(title = 'Insured Amount (log scale)')
ggarrange(p1,p2,p3,p4, ncol = 2, nrow = 2)
The insured amount is strongly correlated with the loan amount; the dashed line marks full coverage (insured amount = loan amount).
df %>%
ggplot(aes(x = loan_amount, y = insured_amount)) +
geom_point(color = 'lightblue') +
geom_abline(slope = 1, intercept = 0, color = 'red', linetype = 'dashed') +
labs(title = 'Insured Amount vs. Loan Amount',
x = 'loan amount',
y = 'insured amount') +
my.theme
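To quantify the relationship instead of only eyeballing the scatter plot, the Pearson correlation can be computed directly (a sketch; the log-scale version damps the influence of very large loans):

```r
df %>%
  summarise(pearson = cor(loan_amount, insured_amount),
            pearson_log = cor(log10(loan_amount), log10(insured_amount)))
```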
Loans with lower amounts tend to have higher default rates.
# Loan Amount effect on default
df %>%
ggplot(aes(x = log10(loan_amount), fill = factor(default_status))) +
geom_density(alpha = 0.5) +
labs(fill = 'default status')+
my.theme
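The density plot can be complemented by binning loans into amount quartiles and computing the default rate per bin (a sketch; the quartile boundaries depend on the data):

```r
df %>%
  mutate(amount_bin = cut(loan_amount,
                          breaks = quantile(loan_amount, probs = seq(0, 1, 0.25)),
                          include.lowest = TRUE)) %>%
  group_by(amount_bin) %>%
  summarise(n = n(),
            default_rate = mean(default_status))
```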
The default rate was stable from 2009-10 to 2010-09, though the monthly loan volume showed some variation.
res <- df %>%
mutate(ym = format(request_date, '%Y-%m-01'),
ym = as.Date(ym)) %>%
group_by(ym) %>%
summarise(volume = n(),
default_rate = mean(default_status))
res %>%
ggplot(aes(x= ym, y=volume)) +
geom_bar(stat = 'identity', fill = 'orange', group= 1) +
geom_line(aes(y = default_rate / max(res$default_rate) * max(res$volume)), stat = 'identity', color='red', group = 1) +
geom_point(aes(y = default_rate / max(res$default_rate) * max(res$volume)), stat = 'identity', color='red', group = 1) +
scale_y_continuous(labels = function(i) format(i, big.mark = ','),
sec.axis = sec_axis(~./max(res$volume) * max(res$default_rate),
labels = scales::percent,
name = 'Default Rate'
)) +
scale_x_date()+
labs(x='Requested Date',y='Volume') +
my.theme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The construction industry has the highest default rate, while the energy industry has the lowest.
df %>%
group_by(industry) %>%
summarise(default_rate= mean(default_status)) %>%
ggplot(aes(x=reorder(industry, default_rate), y=default_rate)) +
geom_bar(stat = 'identity', fill='orange') +
scale_y_continuous(labels = scales::percent) +
labs(x='Industry', y='Default Rate') +
coord_flip(expand = F) +
my.theme
New businesses tend to have a higher default rate.
p1 <- df %>%
group_by(business_new) %>%
summarise(default_rate = mean(default_status)) %>%
ggplot(aes(x=business_new, y=default_rate)) +
geom_bar(stat = 'identity', fill='orange') +
scale_y_continuous(labels=scales::percent) +
labs(title = 'New Business Type', x='New Business',y='Default Rate') +
coord_flip() +
my.theme
p2 <- df %>%
group_by(business_type) %>%
summarise(default_rate = mean(default_status)) %>%
ggplot(aes(x = business_type, y = default_rate)) +
geom_bar(stat = 'identity', fill = 'lightblue') +
scale_y_continuous(labels = scales::percent) +
labs(title = 'Business Type Effect', x = 'Business Type', y = 'Default Rate') +
coord_flip() +
my.theme
ggarrange(p1,p2, ncol = 2)
Combining train & test data for data cleansing and feature engineering.
# combine train and test before converting to factors
df$flag <- 'train'
df.test <- fread('./data/test.csv') %>%
  clean_data() %>%
  mutate(default_status = 0, flag = 'test')
df.full <- rbind(df, df.test)
# create features from the request date
df.full <- df.full %>%
  mutate(request_ym = format(request_date, '%Y-%m'))
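A further feature that could be derived here, shown only as a hypothetical extension (it is not part of the feature set used by the models below), is the fraction of the loan that is insured:

```r
# hypothetical extra feature: insured fraction of the loan
df.full <- df.full %>%
  mutate(insured_ratio = insured_amount / loan_amount)
```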
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 minutes 47 seconds
H2O cluster timezone: Asia/Shanghai
H2O data parsing timezone: UTC
H2O cluster version: 3.26.0.2
H2O cluster version age: 7 months and 10 days !!!
H2O cluster name: H2O_started_from_R_xuelin_tms177
H2O cluster total nodes: 1
H2O cluster total memory: 1.77 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.6.1 (2019-07-05)
h2o.no_progress()
#convert data.table to h2o.frame
full.hex <- as.h2o(df.full)
#convert categorical variables to factor
full.hex$label <- h2o.asfactor(full.hex$default_status)
full.hex$industry <- h2o.asfactor(full.hex$industry)
full.hex$state <- h2o.asfactor(full.hex$state)
full.hex$business_new <- h2o.asfactor(full.hex$business_new)
full.hex$other_loans <- h2o.asfactor(full.hex$other_loans)
full.hex$request_ym <- h2o.asfactor(full.hex$request_ym)
#split train, test
train.hex <- full.hex[full.hex$flag == 'train',]
test.hex <- full.hex[full.hex$flag == 'test',]
#define feature set
feature.names <- c('industry','state','term','employee_count','request_ym','business_new',
'business_type', 'other_loans','loan_amount','insured_amount')
# 5.1 Linear Model - Logistic Regression Baseline
model_ridge <- h2o.glm(x=feature.names, y='label',
training_frame = train.hex,
family = 'binomial',
nfolds = 5, alpha = 0, lambda_search = T)
# auc
res_ridge <- h2o.auc(model_ridge, train=T, xval = T)
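The stored AUC values can be tabulated; this sketch assumes h2o.auc returns a named vector when both train and xval are requested:

```r
# display train vs. cross-validated AUC of the ridge model
data.frame(metric = names(res_ridge),
           auc = round(as.numeric(res_ridge), 3)) %>%
  kable(caption = 'Logistic Regression - ROC AUC') %>%
  kable_styling('striped')
```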
ROC AUC - logistic regression (training and 5-fold cross-validation).
Training history:
model_ridge@model$scoring_history %>%
select(iteration, deviance_train, deviance_xval) %>%
tidyr::pivot_longer(cols = -iteration,
names_to = 'data',
values_to = 'loss') %>%
ggplot(aes(x = iteration, y = loss, color = data)) +
geom_line() +
geom_point() +
my.theme +
theme(legend.position = 'bottom')
h2o.varimp(model_ridge) %>%
head(10) %>%
ggplot(aes(x = reorder(variable, scaled_importance), y = scaled_importance)) +
geom_bar(stat = 'identity', fill = 'lightblue') +
coord_flip(expand = F) +
labs(x = 'variable')+
my.theme
h2o.varimp(model_ridge) %>% head(10) %>%
kable(caption = 'Logistic Regression Variable Importance') %>%
kable_styling('striped')
variable | relative_importance | scaled_importance | percentage |
---|---|---|---|
state.AZ | 0.9324776 | 1.0000000 | 0.0400847 |
term | 0.8169718 | 0.8761302 | 0.0351194 |
state.GA | 0.7939451 | 0.8514361 | 0.0341296 |
loan_amount | 0.7502973 | 0.8046277 | 0.0322533 |
industry.Agriculture | 0.7413452 | 0.7950274 | 0.0318684 |
employee_count | 0.7084459 | 0.7597458 | 0.0304542 |
industry.Hotel | 0.6575193 | 0.7051314 | 0.0282650 |
state.MA | 0.5651523 | 0.6060760 | 0.0242944 |
state.FL | 0.5474091 | 0.5870480 | 0.0235316 |
state.NH | 0.5375559 | 0.5764813 | 0.0231081 |
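GLM variable importance reflects the magnitude of standardized coefficients, so the direction of each effect is lost; the signed coefficients can be inspected separately (a sketch using h2o.coef):

```r
# ten most negative (risk-reducing) coefficients
h2o.coef(model_ridge) %>%
  sort() %>%
  head(10)
```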
# 5.2 XGBoost Model - Tree-Based Ensemble Model
model_xgb <- h2o.xgboost(x=feature.names,y='label',
training_frame = train.hex,
max_depth = 6, eta = 0.1,
stopping_metric = 'AUC',stopping_rounds = 21,
ntrees = 500, nfolds = 5)
#auc
res_xgboost <- h2o.auc(model_xgb, train = T, xval = T)
ROC AUC - XGBoost (training and 5-fold cross-validation).
Training history:
model_xgb@model$scoring_history %>%
select(number_of_trees, training_logloss, training_auc) %>%
tidyr::pivot_longer(cols = -number_of_trees,
names_to = 'data',
values_to = 'loss') %>%
ggplot(aes(x = number_of_trees, y = loss, color = data)) +
geom_line() +
geom_point() +
my.theme +
theme(legend.position = 'bottom')
h2o.varimp(model_xgb) %>%
head(10) %>%
ggplot(aes(x = reorder(variable, scaled_importance), y = scaled_importance)) +
geom_bar(stat = 'identity', fill = 'lightblue') +
coord_flip(expand = F) +
labs(title = 'variable importance of Xgboost model', x = 'variable') + my.theme
h2o.varimp(model_xgb) %>% head(10) %>%
kable(caption = 'Xgboost Variable Importance') %>%
kable_styling('striped')
variable | relative_importance | scaled_importance | percentage |
---|---|---|---|
term | 7935.17871 | 1.0000000 | 0.5450172 |
loan_amount | 2975.59180 | 0.3749874 | 0.2043746 |
insured_amount | 1447.49463 | 0.1824149 | 0.0994192 |
employee_count | 602.33594 | 0.0759070 | 0.0413706 |
other_loans.N | 275.76093 | 0.0347517 | 0.0189403 |
industry.Trading | 109.14555 | 0.0137546 | 0.0074965 |
business_new.Existing | 91.52952 | 0.0115347 | 0.0062866 |
industry.Hotel | 87.39921 | 0.0110141 | 0.0060029 |
state.GA | 75.84157 | 0.0095576 | 0.0052091 |
request_ym.2010-01 | 56.33741 | 0.0070997 | 0.0038695 |
The XGBoost model is much better than logistic regression in terms of AUC. The rankings of the top variables in the variable importance of the two methods are generally aligned: term, loan amount, and insured amount are the most important variables for predicting the default status of a loan. Therefore, the XGBoost model is used for prediction.
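The AUC comparison behind this conclusion can be assembled from the two results collected earlier (a sketch; it assumes res_ridge and res_xgboost are named vectors with train and xval entries):

```r
data.frame(model = c('Logistic Regression', 'XGBoost'),
           train_auc = c(res_ridge['train'], res_xgboost['train']),
           xval_auc = c(res_ridge['xval'], res_xgboost['xval'])) %>%
  kable(caption = 'Model comparison - ROC AUC') %>%
  kable_styling('striped')
```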
Let's view a sample of predictions from the XGBoost model.
Caveat: a threshold of 0.5 was used to assign the predicted class.
id | p0 | p1 | predict |
---|---|---|---|
0 | 0.46 | 0.54 | default |
0 | 1.00 | 0.00 | repaid |
0 | 0.99 | 0.01 | repaid |
0 | 0.72 | 0.28 | repaid |
0 | 1.00 | 0.00 | repaid |
0 | 0.02 | 0.98 | default |
0 | 0.93 | 0.07 | repaid |
0 | 1.00 | 0.00 | repaid |
0 | 0.01 | 0.99 | default |
0 | 1.00 | 0.00 | repaid |
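The table above can be produced along these lines (a sketch: the p0/p1 columns follow H2O's binomial output convention, and mapping p1 > 0.5 to 'default' matches the stated threshold):

```r
pred <- h2o.predict(model_xgb, test.hex) %>% as.data.frame()
df.test %>%
  select(id) %>%
  mutate(p0 = round(pred$p0, 2),
         p1 = round(pred$p1, 2),
         predict = ifelse(pred$p1 > 0.5, 'default', 'repaid')) %>%
  head(10)
```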