Data Dictionary
The dataset contains multiple variables, divided into three categories:
Demographic information about customers
customer_id - Customer id
vintage - Vintage of the customer with the bank in number of days
age - Age of customer
gender - Gender of customer
dependents - Number of dependents
occupation - Occupation of the customer
city - City of customer (anonymised)
Customer Bank Relationship
customer_nw_category - Net worth of customer (3:Low 2:Medium 1:High)
branch_code - Branch Code for customer account
days_since_last_transaction - Number of days since the last credit transaction in the last 1 year
Transactional Information
current_balance - Balance as of today
previous_month_end_balance - End of Month Balance of previous month
average_monthly_balance_prevQ - Average monthly balances (AMB) in Previous Quarter
average_monthly_balance_prevQ2 - Average monthly balances (AMB) in previous to previous quarter
current_month_credit - Total Credit Amount current month
previous_month_credit - Total Credit Amount previous month
current_month_debit - Total Debit Amount current month
previous_month_debit - Total Debit Amount previous month
current_month_balance - Average Balance of current month
previous_month_balance - Average Balance of previous month
churn - Average balance of customer falls below minimum balance in the next quarter (1/0)
Churn Prediction using Logistic Regression
Now that I understand the dataset in detail, it is time to build a logistic regression model to predict churn. I have included the data dictionary above for reference.
Load Data & Packages for model building & preprocessing
Preprocessing & Missing value imputation
Select features on the basis of EDA Conclusions & build baseline model
Decide Evaluation Metric on the basis of business problem
Build model using all features & compare with baseline
Use Recursive Feature Elimination (RFE) to find the top features, build a model using the top 10 features & compare
Loading Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix, roc_curve, precision_score, recall_score, precision_recall_curve
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
Loading Data
df_churn = pd.read_csv('bank_customer_data.csv')
Missing Values
Before building the model, I must look for missing values in the dataset, since treating them is a necessary step before fitting a logistic regression model.
pd.isnull(df_churn).sum()

customer_id 0
vintage 0
age 0
gender 525
dependents 2463
occupation 80
city 803
customer_nw_category 0
branch_code 0
days_since_last_transaction 3223
current_balance 0
previous_month_end_balance 0
average_monthly_balance_prevQ 0
average_monthly_balance_prevQ2 0
current_month_credit 0
previous_month_credit 0
current_month_debit 0
previous_month_debit 0
current_month_balance 0
previous_month_balance 0
churn 0
dtype: int64
The result shows that there are quite a few missing values in the columns gender, dependents, occupation, city and days_since_last_transaction. I go through each of them one by one to find an appropriate imputation strategy.
Gender
For a quick recall, look at the categories within the gender column:
df_churn['gender'].value_counts()

Male 16548
Female 11309
Name: gender, dtype: int64
So there is a good mix of males and females, and arguably the missing values cannot be filled with either one. Instead, I can create a separate category by assigning the value -1 to all missing values in this column.
Before that, I will first convert gender into 0/1 and then replace the missing values with -1:
# Convert gender to 0/1
dict_gender = {'Male': 1, 'Female': 0}
df_churn.replace({'gender': dict_gender}, inplace=True)

# Keep missing gender values as their own category
df_churn['gender'] = df_churn['gender'].fillna(-1)
Dependents, occupation and city with mode
Next, I will have a quick look at the dependents and occupation columns and impute them with the mode, since dependents is more or less an ordinal variable:
df_churn['dependents'].value_counts()

0.0 21435
2.0 2150
1.0 1395
3.0 701
4.0 179
5.0 41
6.0 8
7.0 3
9.0 1
52.0 1
36.0 1
50.0 1
8.0 1
25.0 1
32.0 1
Name: dependents, dtype: int64
df_churn['occupation'].value_counts()
self_employed 17476
salaried 6704
student 2058
retired 2024
company 40
Name: occupation, dtype: int64
# Impute with the mode of each column
df_churn['dependents'] = df_churn['dependents'].fillna(0)
df_churn['occupation'] = df_churn['occupation'].fillna('self_employed')
Similarly, city can be imputed with its most common category, 1020:

df_churn['city'] = df_churn['city'].fillna(1020)
Days since Last Transaction
A fair assumption can be made for this column: since it records the number of days since the last transaction within 1 year, I can substitute missing values with a value greater than one year, say 999.
df_churn['days_since_last_transaction'] = df_churn['days_since_last_transaction'].fillna(999)
Preprocessing
Now, before applying a linear model such as logistic regression, I need to scale the data and make sure all features are strictly numeric.
Dummies with Multiple Categories
# Convert occupation to one-hot encoded features
df_churn = pd.concat([df_churn,
                      pd.get_dummies(df_churn['occupation'],
                                     prefix=str('occupation'), prefix_sep='_')],
                     axis=1)
Scaling Numerical Features for Logistic Regression
Now, I remember that there are a lot of outliers in the dataset, especially in the previous and current balance features. The distributions of these features are also skewed, as seen in the EDA. I will take two steps to deal with that here:
Log Transformation
Standard Scaler
Standard scaling is in any case a necessity for linear models, and I apply it here after log-transforming all balance features.
num_cols = ['customer_nw_category', 'current_balance', 'previous_month_end_balance',
            'average_monthly_balance_prevQ2', 'average_monthly_balance_prevQ',
            'current_month_credit', 'previous_month_credit',
            'current_month_debit', 'previous_month_debit',
            'current_month_balance', 'previous_month_balance']

# Shift by a constant before taking the log so that negative balances do not break np.log
for i in num_cols:
    df_churn[i] = np.log(df_churn[i] + 17000)

std = StandardScaler()
scaled = std.fit_transform(df_churn[num_cols])
scaled = pd.DataFrame(scaled, columns=num_cols)
# Keep an unscaled copy, then replace the numeric columns with their scaled versions
df_churn_og = df_churn.copy()
df_churn = df_churn.drop(columns=num_cols, axis=1)
df_churn = df_churn.merge(scaled, left_index=True, right_index=True, how='left')
y_all = df_churn.churn
df_churn = df_churn.drop(['churn', 'customer_id', 'occupation'], axis=1)
Model Building and Evaluation Metrics
Since this is a binary classification problem, I could use the following two popular metrics:
Recall
Area under the Receiver operating characteristic curve
Now, I am looking at the recall value here because a customer falsely marked as churn would not be as bad as a churning customer who goes undetected, leaving the bank unable to take appropriate measures to stop him/her from churning. The ROC AUC is the area under the curve obtained by plotting the false positive rate (x-axis) against the true positive rate (y-axis). My main metric here will be recall, while the ROC AUC score captures how well the predicted probabilities are able to differentiate between the two classes.
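As a quick illustration of the difference between the two metrics, here is a minimal sketch on made-up labels and scores (not the project data):

# Hypothetical ground-truth labels and predicted churn probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.60])

# Recall needs hard labels, so a threshold (0.5 here) has to be chosen first
y_pred = (y_prob >= 0.5).astype(int)
print(recall_score(y_true, y_pred))   # fraction of actual churners that were caught
print(roc_auc_score(y_true, y_prob))  # threshold-free ranking quality of the probabilities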
Conclusions from Exploratory Data Analysis (EDA)
For debit values, I see that there is a significant difference in the distribution for churn and non-churn, and it might turn out to be an important feature
For all the balance features, the lower values have a much higher proportion of churning customers
For the most frequent vintage values, churning customers are slightly more common, while for higher values of vintage I see mostly non-churning customers, which is in sync with the age variable
I see significant differences across occupations, and occupation would certainly be interesting as a feature for predicting churn
Now, I will first split my dataset into train and test sets and, using the above conclusions, select columns and build a baseline logistic regression model to check the ROC AUC score and the confusion matrix.
Baseline Columns
baseline_cols = ['current_month_debit', 'previous_month_debit', 'current_balance',
                 'previous_month_end_balance', 'vintage', 'occupation_retired',
                 'occupation_salaried', 'occupation_self_employed', 'occupation_student']
df_baseline = df_churn[baseline_cols]
Train Test Split to create a validation set
# Splitting the data into train and validation sets
xtrain, xtest, ytrain, ytest = train_test_split(df_baseline, y_all, test_size=1/3,
                                                random_state=11, stratify=y_all)
model = LogisticRegression()
model.fit(xtrain, ytrain)
pred = model.predict_proba(xtest)[:, 1]
AUC ROC Curve & Confusion Matrix
fpr, tpr, _ = roc_curve(ytest, pred)
auc = roc_auc_score(ytest, pred)

plt.figure(figsize=(12, 8))
plt.plot(fpr, tpr, label="Validation AUC-ROC=" + str(auc))
x = np.linspace(0, 1, 1000)
plt.plot(x, x, linestyle='-')  # diagonal reference line for a random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4)
plt.show()
# Confusion matrix
pred_val = model.predict(xtest)
label_preds = pred_val
cm = confusion_matrix(ytest, label_preds)

def plot_confusion_matrix(cm, normalized=True, cmap='bone'):
    plt.figure(figsize=[7, 6])
    norm_cm = cm
    if normalized:
        # Normalize each row so the colours reflect rates per actual class
        norm_cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    sns.heatmap(norm_cm, annot=cm, fmt='g',
                xticklabels=['Predicted: No', 'Predicted: Yes'],
                yticklabels=['Actual: No', 'Actual: Yes'],
                cmap=cmap)

plot_confusion_matrix(cm)
# Recall score
recall_score(ytest, pred_val)
0.11580148317170565
Cross validation
Cross validation is one of the most important concepts in any type of data modelling. The idea is simple: hold out a sample on which you do not train the model, and test the model on this sample before finalizing it.
I divide the entire population into k equal samples, train a model on k-1 samples and validate on the remaining one. At the second iteration, I train with a different sample held out for validation.
After k iterations, I have built a model on every subset and held each of them out as validation once. This reduces selection bias and the variance of the performance estimate.
Since it builds several models on different subsets of the dataset, I can be more certain of model performance if I use CV to test my models.
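For reference, sklearn ships this pattern out of the box; here is a minimal sketch using the built-in cross_val_score (the custom cv_score helper defined next additionally prints recall and precision per fold):

from sklearn.model_selection import cross_val_score

# 5-fold stratified CV, scored by ROC AUC on each held-out fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
aucs = cross_val_score(LogisticRegression(), df_churn, y_all,
                       cv=skf, scoring='roc_auc')
print(aucs, aucs.mean())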
def cv_score(ml_model, rstate=12, thres=0.5, cols=df_churn.columns):
    i = 1
    cv_scores = []
    df1 = df_churn[cols].copy()

    # 5-fold cross validation stratified on the basis of target
    kf = StratifiedKFold(n_splits=5, random_state=rstate, shuffle=True)
    for df_index, test_index in kf.split(df1, y_all):
        print('\n{} of kfold {}'.format(i, kf.n_splits))
        xtr, xvl = df1.loc[df_index], df1.loc[test_index]
        ytr, yvl = y_all.loc[df_index], y_all.loc[test_index]

        # Define model for fitting on the training set for each fold
        model = ml_model
        model.fit(xtr, ytr)
        pred_probs = model.predict_proba(xvl)

        # Use threshold to define the classes based on probability values
        pp = [1 if j > thres else 0 for j in pred_probs[:, 1]]

        # Calculate scores for each fold and print
        pred_val = pp
        roc_score = roc_auc_score(yvl, pred_probs[:, 1])
        recall = recall_score(yvl, pred_val)
        precision = precision_score(yvl, pred_val)
        print("ROC AUC Score: {}, Recall Score: {:.4f}, Precision Score: {:.4f}".format(
            roc_score, recall, precision))

        # Save scores
        cv_scores.append(roc_score)
        i += 1
    return cv_scores
baseline_scores = cv_score(LogisticRegression(), cols=baseline_cols)
1 of kfold 5
ROC AUC Score: 0.7644836090843695, Recall Score: 0.0751, Precision Score: 0.5766

2 of kfold 5
ROC AUC Score: 0.779451238310554, Recall Score: 0.0751, Precision Score: 0.6695

3 of kfold 5
ROC AUC Score: 0.7551621478942728, Recall Score: 0.1350, Precision Score: 0.6425

4 of kfold 5
ROC AUC Score: 0.7582070977015274, Recall Score: 0.1169, Precision Score: 0.6508

5 of kfold 5
ROC AUC Score: 0.7632311004249608, Recall Score: 0.1112, Precision Score: 0.5850
Now I will try using all available columns to check if I get a significant improvement.
all_feat_scores = cv_score(LogisticRegression())
1 of kfold 5
ROC AUC Score: 0.7322735587298325, Recall Score: 0.1093, Precision Score: 0.5066

2 of kfold 5
ROC AUC Score: 0.7681477751515774, Recall Score: 0.1968, Precision Score: 0.6809

3 of kfold 5
ROC AUC Score: 0.7392333107476944, Recall Score: 0.1673, Precision Score: 0.5714

4 of kfold 5
ROC AUC Score: 0.7394851378820373, Recall Score: 0.1597, Precision Score: 0.6667

5 of kfold 5
ROC AUC Score: 0.758833273580065, Recall Score: 0.1730, Precision Score: 0.5987
Recall improves noticeably, though the ROC AUC scores are actually slightly lower than the baseline. Now I can try backward selection to find the subset of features that gives the best score.
Recursive Feature Elimination (RFE) or Backward Selection
I have already built a model using all the features and a separate model using some baseline features. Next, I will check whether backward feature elimination can do better.
from sklearn.feature_selection import RFE

# Create the RFE object and rank each feature
model = LogisticRegression()
rfe = RFE(estimator=model, n_features_to_select=1, step=1)
rfe.fit(df_churn, y_all)
RFE(estimator=LogisticRegression(), n_features_to_select=1)
ranking_df = pd.DataFrame()
ranking_df['Feature_name'] = df_churn.columns
ranking_df['Rank'] = rfe.ranking_
ranked = ranking_df.sort_values(by=['Rank'])
ranked
As the table shows, the balance features are proving to be very important. The fitted RFE object can also be used to select features directly, as sketched below. I select the top 10 features from this table and check the score.
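As an aside, a minimal sketch of letting RFE pick a feature subset directly via its support_ mask (the choice of 10 features here is just for illustration):

# Ask RFE to keep the 10 best features and expose them as a boolean mask
rfe10 = RFE(estimator=LogisticRegression(), n_features_to_select=10, step=1)
rfe10.fit(df_churn, y_all)
top_cols = df_churn.columns[rfe10.support_]
print(list(top_cols))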
rfe_top_10_scores = cv_score(LogisticRegression(), cols=ranked['Feature_name'][:10].values)
1 of kfold 5
ROC AUC Score: 0.7986881101633954, Recall Score: 0.2281, Precision Score: 0.7362

2 of kfold 5
ROC AUC Score: 0.8050442914397288, Recall Score: 0.2234, Precision Score: 0.7556

3 of kfold 5
ROC AUC Score: 0.7985130070256687, Recall Score: 0.2205, Precision Score: 0.7250

4 of kfold 5
ROC AUC Score: 0.7935095616193245, Recall Score: 0.2120, Precision Score: 0.7360

5 of kfold 5
ROC AUC Score: 0.7942222838028076, Recall Score: 0.1911, Precision Score: 0.6745
Wow, the top 10 features obtained via recursive feature elimination give a much better score than any of the earlier attempts. This is the power of feature selection, and it works especially well for linear models, since tree-based models are themselves capable of doing feature selection to some extent.
The recall score here is still quite low, so I should play around with the threshold to get a better recall score. ROC AUC depends only on the predicted probabilities and is not affected by the threshold. I will try 0.14 as the threshold, which is close to the overall churn rate.
cv_score(LogisticRegression(), cols=ranked['Feature_name'][:10].values, thres=0.14)
1 of kfold 5
ROC AUC Score: 0.7986881101633954, Recall Score: 0.8308, Precision Score: 0.2836

2 of kfold 5
ROC AUC Score: 0.8050442914397288, Recall Score: 0.8375, Precision Score: 0.2902

3 of kfold 5
ROC AUC Score: 0.7985130070256687, Recall Score: 0.8279, Precision Score: 0.2897

4 of kfold 5
ROC AUC Score: 0.7935095616193245, Recall Score: 0.8213, Precision Score: 0.2840

5 of kfold 5
ROC AUC Score: 0.7942222838028076, Recall Score: 0.8108, Precision Score: 0.2927
[0.7986881101633954, 0.8050442914397288, 0.7985130070256687, 0.7935095616193245, 0.7942222838028076]
Lowering the threshold gives a large improvement in recall, but precision clearly goes down. The bank can decide on the threshold based on its business requirements. Without knowing the metrics relevant to the business, the best course of action is to optimize for the ROC AUC score, so as to get the best probability estimates; the threshold can then be tuned afterwards, as sketched below.
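For completeness, a minimal sketch of inspecting the precision/recall trade-off across thresholds with precision_recall_curve (already imported above), using the validation split and probabilities from the baseline model earlier; the same idea applies to the RFE model:

# Precision and recall at every candidate threshold on the validation set
precisions, recalls, thresholds = precision_recall_curve(ytest, pred)

# Plot the trade-off so a business-driven threshold can be read off the chart
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precisions[:-1], label='Precision')
plt.plot(thresholds, recalls[:-1], label='Recall')
plt.xlabel('Threshold')
plt.legend()
plt.show()

From such a plot, the bank can pick the threshold where the trade-off matches its cost of missing a churner versus contacting a loyal customer.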