Credit score cards are a common risk control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings, so that the bank can decide whether to issue a credit card to the applicant. Credit scores objectively quantify the magnitude of risk.
This colab shows you how to use Towhee to predict whether the bank should issue a credit card to an applicant, a task based on a dataset that is also popular on Kaggle.
First, load the data as DataFrames for further processing.
import pandas as pd
record = pd.read_csv("../input/credit-card-approval-prediction/credit_record.csv", encoding = 'utf-8')
data = pd.read_csv("../input/credit-card-approval-prediction/application_record.csv", encoding = 'utf-8')
Find the first month in which each user's data was recorded, and rename the column to something more understandable.
begin_month=pd.DataFrame(record.groupby(["ID"])["MONTHS_BALANCE"].agg(min))
begin_month=begin_month.rename(columns={'MONTHS_BALANCE':'begin_month'})
Process the STATUS column to find out whether applicants have a record of overdue payments. Each STATUS label stands for the following:
X: no loan for the month;
C: paid off that month;
0: 1-29 days past due;
1: 30-59 days past due;
2: 60-89 days overdue;
3: 90-119 days overdue;
4: 120-149 days overdue;
5: overdue or bad debts, write-offs for more than 150 days.
record.loc[record['STATUS']=='X', 'STATUS']=-1
record.loc[record['STATUS']=='C', 'STATUS']=-1
record.loc[record['STATUS']=='0', 'STATUS']=0
record.loc[record['STATUS']=='1', 'STATUS']=1
record.loc[record['STATUS']=='2', 'STATUS']=2
record.loc[record['STATUS']=='3', 'STATUS']=3
record.loc[record['STATUS']=='4', 'STATUS']=4
record.loc[record['STATUS']=='5', 'STATUS']=5
record.groupby('ID')['STATUS'].max().value_counts(normalize=True)
The result:
0 0.754202
-1 0.129455
1 0.101838
2 0.007307
5 0.004241
3 0.001914
4 0.001044
Generally, risk users should account for less than 3% of all users, so we mark applicants who have been overdue for more than 60 days (STATUS >= 2) as risk users; from the distribution above, they make up roughly 1.5% of the sample.
record.loc[record['STATUS']>=2, 'dep_value']=1
record.loc[record['STATUS']<2, 'dep_value']=0
temp = record[['ID', 'dep_value']].groupby('ID').sum()
temp.loc[temp['dep_value']!=0, 'dep_value']='Yes'
temp.loc[temp['dep_value']==0, 'dep_value']= 'No'
temp.value_counts(normalize=True)
The result:
dep_value
No 0.985495
Yes 0.014505
Merge the information into one dataframe, and mark risk users with target 1 and all other users with 0. We will use the target column as the label. Meanwhile, we drop rows with missing values to avoid noise.
new_data=pd.merge(data,begin_month,how="left",on="ID")
new_data=pd.merge(new_data, temp,how='inner',on='ID')
new_data['target']=new_data['dep_value']
new_data.loc[new_data['target']=='Yes','target']=1
new_data.loc[new_data['target']=='No','target']=0
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
new_data.head()
Before we get started, we should take a rough look at our samples, in case imbalanced data leads to misleading results.
new_data = new_data.dropna()
new_data['target'].value_counts()
The result:
0 24712
1 422
Obviously the data are extremely imbalanced, so we'll need to resample the data.
from imblearn.over_sampling import SMOTEN
X = new_data[['ID', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'NAME_INCOME_TYPE',
'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'FLAG_MOBIL',
'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'begin_month','dep_value']]
y = new_data['target'].astype('int')
X_balance,y_balance = SMOTEN().fit_resample(X, y)
X_balance = pd.DataFrame(X_balance, columns = X.columns)
X_balance.insert(0, 'target', y_balance)
new_data = X_balance
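To confirm that the resampling worked, you can check the class distribution again; after SMOTEN both classes should contain the same number of samples:
new_data['target'].value_counts()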
In the following part, we will use Towhee's DataCollection API to deal with the processed data. DataCollection provides a series of APIs to support training, prediction, and evaluation with machine learning models.
Also, Towhee has encapsulated several models as built-in operators, which we will introduce in the following training part.
Users can use from_df to load data from a dataframe, and then split the data into a training set and a test set with split_train_test; the ratio is 9:1 by default.
import towhee
out = towhee.from_df(new_data).unstream()
out = (
out.runas_op['DAYS_BIRTH', 'years_birth'](func=lambda x: -int(x)//365)
.runas_op['DAYS_EMPLOYED', 'years_employed'](func=lambda x: -int(x)//365)
)
train, test = out.split_train_test()
Another important API is runas_op, which enables users to run self-defined functions as operators effortlessly.
In the example above, we defined lambda functions to calculate age and work experience (in years) from the given info (in days), and fed them to runas_op. In this way, DataCollection wraps and registers each function as an operator and executes it.
Note that the content inside the [] specifies the input and output columns, i.e. the input column is DAYS_BIRTH and the output is stored in years_birth. This is a tricky part of the DataCollection design: in many cases we need to process particular columns inside a DataCollection, so we introduced [] to tell the DC which columns we are dealing with.
Usage:
['input', 'output']
['input', ('output_1', 'output_2')]
[('input_1', 'input_2'), 'output']
[('input_1', 'input_2'), ('output_1', 'output_2')]
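For example, assuming the multi-column forms behave as listed above, the two runas_op calls from the earlier snippet could in principle be combined into a single call with multiple inputs and outputs (an illustrative sketch, not part of the original pipeline):
out = out.runas_op[('DAYS_BIRTH', 'DAYS_EMPLOYED'), ('years_birth', 'years_employed')](
    func=lambda birth, employed: (-int(birth) // 365, -int(employed) // 365)
)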
Then we need to process the data: discretize the numeric columns, one-hot encode the categorical columns, and stack everything into a single feature vector:
def feature_extract(dc):
    return (
        dc.num_discretizer['CNT_CHILDREN', 'childnum'](n_bins=3)
          .num_discretizer['AMT_INCOME_TOTAL', 'inc'](n_bins=5)
          .num_discretizer['years_birth', 'age'](n_bins=5)
          .num_discretizer['years_employed', 'worktm'](n_bins=5)
          .num_discretizer['CNT_FAM_MEMBERS', 'fmsize'](n_bins=5)
          .cate_one_hot_encoder['NAME_INCOME_TYPE', 'inctp']()
          .cate_one_hot_encoder['OCCUPATION_TYPE', 'occyp']()
          .cate_one_hot_encoder['NAME_HOUSING_TYPE', 'houtp']()
          .cate_one_hot_encoder['NAME_EDUCATION_TYPE', 'edutp']()
          .cate_one_hot_encoder['NAME_FAMILY_STATUS', 'famtp']()
          .cate_one_hot_encoder['CODE_GENDER', 'gender']()
          .cate_one_hot_encoder['FLAG_OWN_CAR', 'car']()
          .cate_one_hot_encoder['FLAG_OWN_REALTY', 'realty']()
          .tensor_hstack[('childnum', 'inc', 'age', 'worktm', 'fmsize',
                          'inctp', 'occyp', 'houtp', 'edutp', 'famtp',
                          'gender', 'car', 'realty'), 'fea']()
    )
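If you are curious what these built-in operators roughly correspond to, here is a non-authoritative sketch using scikit-learn equivalents for one numeric and one categorical column (the actual Towhee operators may differ in binning strategy and output format):
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# roughly what num_discretizer does: bin a numeric column into n_bins buckets
inc = KBinsDiscretizer(n_bins=5, encode='ordinal').fit_transform(new_data[['AMT_INCOME_TOTAL']])

# roughly what cate_one_hot_encoder does: one-hot encode a categorical column
inctp = OneHotEncoder(handle_unknown='ignore').fit_transform(new_data[['NAME_INCOME_TYPE']])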
Towhee encapsulates several machine learning models as built-in operators so that users can access them easily. In this tutorial, we will use logistic regression, decision tree, and support vector machine.
Before training the models, make sure the DataCollection is set to training mode with dc.set_training().
To train a model, let's take logistic regression as an example:
logistic_regression[('feature', 'actual_result'), 'predict_result'](name='model_name', **kwargs)
[] specifies the input and output columns; name names the model so that its trained state can be retrieved later, and the remaining keyword arguments are passed to the model as hyperparameters.
train = feature_extract(train.set_training())
train.logistic_regression[('fea', 'target'), 'lr_predict'](name = 'logistic', max_iter=10) \
.decision_tree[('fea', 'target'), 'dt_predict'](name = 'decision_tree', splitter = 'random', max_depth = 10) \
.svc[('fea', 'target'), 'svm_predict'](name = 'svm_classifier', C = 0.8, kernel='rbf', probability=True)
test = feature_extract(test.set_evaluating(train.get_state()))
The trained models are ready and their states are properly stored, so we can predict with them and see how they work.
First, make sure the DC is set to evaluating mode. Then we can run prediction with exactly the same API as in training. There are several points to pay attention to:
when calling set_evaluating(), we need to pass the _state of the DataCollection train we trained in the previous step, where all the model states are stored;
the with_metrics function allows users to specify the metrics they are interested in.
metrics = test.set_evaluating(train._state) \
.logistic_regression[('fea', 'target'), 'lr_predict'](name='logistic') \
.decision_tree[('fea', 'target'), 'dt_predict'](name='decision_tree') \
.svc[('fea', 'target'), 'svm_predict'](name='svm_classifier') \
.with_metrics(['accuracy', 'recall', 'confusion_matrix']) \
.evaluate['target', 'lr_predict']('lr') \
.evaluate['target', 'dt_predict']('dt') \
.evaluate['target', 'svm_predict']('svm') \
.report()
Towhee also supports self-defined algorithm operators. Here is an example of an XGBoost classifier that we use to ensemble the results from logistic regression, decision tree, and support vector machine.
use the register decorator to register your operator for future calls;
the operator should subclass StatefulOperator and be initialized with a name;
a fit function and a predict function are required.
from towhee import register
from towhee.operator import StatefulOperator
from xgboost import XGBClassifier
from scipy import sparse
import numpy as np
@register
class XGBClassifierOperator(StatefulOperator):
    def __init__(self, name):
        super().__init__(name=name)

    def fit(self):
        # stack the collected training features and labels into arrays for xgboost
        X = sparse.vstack(self._data[0])
        y = np.array(self._data[1]).reshape([-1, 1])
        self._state.model = XGBClassifier(n_estimators=100)
        self._state.model.fit(X, y)

    def predict(self, *arg):
        # return the class predicted for a single feature vector
        return self._state.model.predict(arg[0])[0]
Stack the prediction results from lr, dt and svm into a new feature as the input of our ensemble model.
for i in ['lr_predict', 'dt_predict', 'svm_predict']:
    train.runas_op[i, i](func=lambda x: x[0])
    test.runas_op[i, i](func=lambda x: x[0])
train = train.tensor_hstack[('lr_predict', 'dt_predict','svm_predict'), 'ensemble_predict']()
test = test.tensor_hstack[('lr_predict', 'dt_predict', 'svm_predict'), 'ensemble_predict']()
train.set_training().XGBClassifierOperator[('ensemble_predict', 'target'), 'xgb_predict'](name='XGBClassifierOperator')
metrics = test.set_evaluating(train._state) \
.XGBClassifierOperator[('ensemble_predict', 'target'), 'xgb_predict'](name='XGBClassifierOperator') \
.with_metrics(['accuracy', 'recall', 'confusion_matrix']) \
.evaluate['target', 'xgb_predict']('xgb') \
.report()
The result from the previous process is not entirely satisfying, so we can further improve performance by analysing the features. DC provides a feature_summarize API to help understand the data. We can then discretize each feature properly according to its pattern.
train.feature_summarize['CNT_CHILDREN', 'AMT_INCOME_TOTAL','CNT_FAM_MEMBERS', 'years_birth', 'years_employed'](target='target')
The table summarizes the values of the columns listed in []:
All: the number of times this value appears in the column;
Good: the number of rows with this value whose target is labelled '0';
Bad: the number of rows with this value whose target is labelled '1';
Share: the ratio of this value in the column;
Bad Rate: Bad / All;
Distribution Good: Good divided by the total number of rows with target labelled '0';
Distribution Bad: Bad divided by the total number of rows with target labelled '1';
WoE: Weight of Evidence, a measure of how much the evidence supports or undermines a hypothesis. WoE measures the relative risk of an attribute at the binning level; its value depends on whether the value of the target variable is a nonevent or an event.
IV: Information Value, which measures a variable's ability to predict. The IV contribution of each value is the difference between the conditional positive rate and the conditional negative rate, multiplied by the WoE of that value; the total IV of the variable can be understood as the weighted sum of these differences.
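For reference, the WoE and IV columns can be reproduced with a few lines of pandas; this is a minimal sketch of the standard formulas, not necessarily the exact implementation behind feature_summarize (it assumes the target column holds numeric 0/1 labels):
import numpy as np

def woe_iv(df, col, target='target'):
    # per value of `col`: total rows (All) and rows with target == 1 (Bad)
    stats = df.groupby(col)[target].agg(All='count', Bad='sum')
    stats['Good'] = stats['All'] - stats['Bad']
    dist_good = stats['Good'] / stats['Good'].sum()   # Distribution Good
    dist_bad = stats['Bad'] / stats['Bad'].sum()      # Distribution Bad
    stats['WoE'] = np.log(dist_good / dist_bad)
    iv = ((dist_good - dist_bad) * stats['WoE']).sum()  # Information Value of the column
    return stats, iv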