Customer Churn Prediction

14 minute read

Why is customer retention important? (source)

1. Save Money On Marketing
2. Repeat Purchases From Repeat Customers Means Repeat Profit
3. Free Word-Of-Mouth Advertising
4. Retained Customers Will Provide Valuable Feedback
5. Previous Customers Will Pay Premium Prices. 

Why and when a customer will leave his/her bank can be a challenging question to answer.

Here, we have data from Kaggle containing historical information about each customer and whether he/she left the bank.

Our goal is to use the power of data science to help the bank identify customers who are likely to leave in the future.

Read Data

# source - https://www.kaggle.com/adammaus/predicting-churn-for-bank-customers
import pandas as pd
import numpy as np

bank_data = pd.read_csv('Churn_Modelling.csv')
bank_data.head()
RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
0 1 15634602 Hargrave 619 France Female 42 2 0.00 1 1 1 101348.88 1
1 2 15647311 Hill 608 Spain Female 41 1 83807.86 1 0 1 112542.58 0
2 3 15619304 Onio 502 France Female 42 8 159660.80 3 1 0 113931.57 1
3 4 15701354 Boni 699 France Female 39 1 0.00 2 0 0 93826.63 0
4 5 15737888 Mitchell 850 Spain Female 43 2 125510.82 1 1 1 79084.10 0
# Dimensions
print("Number of Rows: {} \nNumber of Columns: {}".format(bank_data.shape[0],bank_data.shape[1]))

Number of Rows: 10000

Number of Columns: 14

# data types, missing values and number of uniques
bank_data_info = pd.concat([pd.DataFrame(bank_data.dtypes),pd.DataFrame(bank_data.isnull().sum()),pd.DataFrame(bank_data.nunique())],axis = 1)
bank_data_info.columns = ['DataType','# missing rows','# Unique values']
print(bank_data_info)
del bank_data_info

                DataType  # missing rows  # Unique values
RowNumber          int64               0            10000
CustomerId         int64               0            10000
Surname           object               0             2932
CreditScore        int64               0              460
Geography         object               0                3
Gender            object               0                2
Age                int64               0               70
Tenure             int64               0               11
Balance          float64               0             6382
NumOfProducts      int64               0                4
HasCrCard          int64               0                2
IsActiveMember     int64               0                2
EstimatedSalary  float64               0             9999
Exited             int64               0                2

The data has 10000 rows and 14 columns. Let’s see the data description.

Data Description

1. RowNumber: Just an index number assigned to each row. Type : int64
2. CustomerId: Id of each customer of the bank. Type : int64
3. Surname: Surname of the customer. Type : Object
4. CreditScore: A measure of an individual's ability to pay back borrowed money. The higher, the better. Type : int64
5. Geography: Country of the customer. Type : Object. Values: [France, Germany, Spain]
6. Gender: Customer's gender. Type : Object. Values: [Male / Female]
7. Age: Age of the customer. Type : int64
8. Tenure: Duration for which the loan amount is sanctioned, assumed to be in years. Type : int64
9. Balance: The amount of money the customer has available in his/her account. Type: float64
10. NumOfProducts: The number of accounts and bank-affiliated products the customer holds. Type: int64
11. HasCrCard: Whether the person holds a credit card or not. 1 means he/she has a credit card and 0 means he/she doesn't. Type: int64
12. IsActiveMember: Whether the customer is actively using the account. However, the values are subjective. Type: int64
13. EstimatedSalary: The person's approximate salary. Type: float64
14. Exited: Whether the customer has left the bank or not. 1 means he/she left and 0 means he/she didn't. Type: int64

From the above, we will not require RowNumber, CustomerId, and Surname, as they merely identify individual customers and carry no predictive information.

Memory Handling

Managing memory usage in Python is a key task. With huge datasets, memory handling is not easy, so it is always good practice to reduce the memory footprint of the data.

# Before Memory reduction
print("Total memory used before Memory reduction {:5.2f}Mb".format(bank_data.memory_usage().sum() / 1024**2))

Total memory used before Memory reduction 0.84Mb
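The reduce_memory helper called in the next block isn't defined in the post; a minimal sketch of what it presumably does (downcasting numeric columns; the implementation here is my assumption):

import numpy as np
import pandas as pd

def reduce_memory(df):
    """Downcast numeric columns to the smallest dtype that can hold their values."""
    start_mb = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.integer):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif np.issubdtype(df[col].dtype, np.floating):
            df[col] = pd.to_numeric(df[col], downcast='float')
    end_mb = df.memory_usage().sum() / 1024**2
    print("Memory usage decreased to {:5.2f} Mb ({:.1f}% reduction)".format(
        end_mb, 100 * (start_mb - end_mb) / start_mb))
    return df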

# After Memory reduction
bank_data = reduce_memory(bank_data)
print("Total memory used after Memory reduction {:5.2f}Mb".format(bank_data.memory_usage().sum() / 1024**2))

Memory usage decreased to 0.31 Mb (63.6% reduction) Total memory used after Memory reduction 0.31Mb

Exploratory Data Analysis (EDA)

The purpose of EDA is to understand how different variables are related to our target (Exited) variable.

import plotly.graph_objects as go

labels = ['Exited','Continued']
values =  [bank_data.Exited[bank_data['Exited']==1].count(), bank_data.Exited[bank_data['Exited']==0].count()]
colors = ['red', 'darkorange']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_traces(hole=.4, hoverinfo='label+value',  textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(
    title_text="Ratio of Customer Churned and Retained")
fig.show()
fig.write_html(file='pie_chart.html', auto_open=True)

Pie Chart

Gender

Who is loyal to the bank? Male or Female?

Gender,HasCrCard,IsActiveMember vs Churn

import matplotlib.pyplot as plt
import seaborn as sns

f,ax = plt.subplots(3,2,figsize=(18,25))
bank_data[['Gender','Exited']].groupby(['Gender']).mean().plot.bar(ax=ax[0][0])
ax[0][0].set_title('Churn vs Gender')
sns.countplot(x='Gender',hue='Exited',data=bank_data,ax=ax[0][1])
ax[0][1].set_title('Gender: Churned vs Retained')
bank_data[['HasCrCard','Exited']].groupby(['HasCrCard']).mean().plot.bar(ax=ax[1][0])
ax[1][0].set_title('Churn vs HasCrCard')
sns.countplot(x='HasCrCard',hue='Exited',data=bank_data,ax=ax[1][1])
ax[1][1].set_title('HasCrCard: Churned vs Retained')
bank_data[['IsActiveMember','Exited']].groupby(['IsActiveMember']).mean().plot.bar(ax=ax[2][0])
ax[2][0].set_title('Churn vs IsActiveMember')
sns.countplot(x='IsActiveMember',hue='Exited',data=bank_data,ax=ax[2][1])
ax[2][1].set_title('IsActiveMember: Churned vs Retained')
plt.show()

Gender,HasCrCard,IsActiveMember vs Churn

From the above graphs we can see,

  1. There are more male customers, but female customers have a higher churn rate (i.e., even though males outnumber females, a larger fraction of the female customers quit the bank).

  2. The majority of customers have credit cards.

  3. The bank has a significant number of inactive customers, and the churn rate among inactive customers is high. The bank needs to take steps to make them active.

Geography vs Churn

Geography vs Churn

The bank has the majority of its customers in France; however, the churn rate is highest in Germany, followed by Spain, where the bank has fewer customers. This could be due to fewer branches or poorer service in those regions.
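The code for the Geography plots isn't shown in the post; it presumably follows the same pattern as the Gender plots above, e.g.:

f,ax = plt.subplots(1,2,figsize=(18,8))
bank_data[['Geography','Exited']].groupby(['Geography']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Churn vs Geography')
sns.countplot(x='Geography',hue='Exited',data=bank_data,ax=ax[1])
ax[1].set_title('Geography: Churned vs Retained')
plt.show()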

Age

# peaks for Exited/not exited customers by their age
facet = sns.FacetGrid(bank_data, hue="Exited",aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, bank_data['Age'].max()))
facet.add_legend()

# average exited customers by age
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
average_age = bank_data[["Age", "Exited"]].groupby(['Age'],as_index=False).mean()
average_age.columns = ['Age','Mean(Exited)']
sns.barplot(x='Age', y='Mean(Exited)', data=average_age)
del average_age

Peaks for age

average exited age

Customers aged around 48 to 60 are churning more than younger ones, i.e., Mean(Exited) > 0.5 in the graph above. The higher churn rate could also be due to retirement. The bank needs to revise its market strategy to focus on retaining older customers.

bank_data[(bank_data['Age'] == 56)]['Exited'].value_counts()
1    50
0    20
Name: Exited, dtype: int64

Out of 70 people whose age equals 56, 50 of them churned out.

Tenure

y0 = bank_data.Tenure[bank_data.Exited == 0].values
y1 = bank_data.Tenure[bank_data.Exited == 1].values

fig = go.Figure()
fig.add_trace(go.Box(y=y0, name='Continued',
                marker_color = 'lightseagreen'))
fig.add_trace(go.Box(y=y1, name = 'Exited',
                marker_color = 'indianred'))

fig.update_layout(
    yaxis_title='Tenure'
)

fig.show()

Tenure

With respect to Tenure, customers repaying loans in fewer years or taking longer to repay are churning out. The bank needs to provide benefits for customers who repay their loans quickly and for those who have stayed for a long time (high tenure).

Balance

Balance

We can see that many customers still had money in their accounts when they churned.

Even though the gap is not very large, customers with a high account balance are churning out. This may partly be due to the class imbalance in the data. To be on the safer side, the bank needs to address this, as it will impact its profits.
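The code for the Balance plot above isn't shown; a sketch mirroring the Tenure box plot:

y0 = bank_data.Balance[bank_data.Exited == 0].values
y1 = bank_data.Balance[bank_data.Exited == 1].values

fig = go.Figure()
fig.add_trace(go.Box(y=y0, name='Continued', marker_color='lightseagreen'))
fig.add_trace(go.Box(y=y1, name='Exited', marker_color='indianred'))
fig.update_layout(yaxis_title='Balance')
fig.show()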

Correlation check!

cordata = bank_data.corr(method ='pearson')
cordata.style.background_gradient(cmap='summer')

Cor Plot

Age has some correlation (28.5%) with the churn indicator. There are no highly correlated features overall; NumOfProducts and Balance are 30.4% correlated with each other. Apart from that, there are no multi-collinear features, which is good.

Let’s try to create some features (feature engineering) and see if we can come up with more useful ones.

Feature Engineering

One of the most important parts of a data science/ML pipeline is the ability to create good features. Feature engineering is a hard skill, and this is where the creativity and domain knowledge of the data scientist/ML practitioner are required.

Train, Validation and Test Split

Before we create features, we will split the data into train, CV, and test sets. Otherwise, we would run into the problem of data leakage.
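The post doesn't show how X and y are defined; given that RowNumber, CustomerId, and Surname were flagged as unnecessary earlier, it is presumably something like:

# assumption: drop the identifier columns flagged earlier and use Exited as the target
y = bank_data['Exited']
X = bank_data.drop(['RowNumber', 'CustomerId', 'Surname', 'Exited'], axis=1)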

# train cv test split - stratified sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y,random_state = 17)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.20, stratify=y_train,random_state = 17)

# reset index
X_train = X_train.reset_index(drop = True)
X_cv = X_cv.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)

print("train dimensions: ",X_train.shape, y_train.shape)
print("cv dimensions: ",X_cv.shape, y_cv.shape)
print("test dimensions: ",X_test.shape, y_test.shape)
train dimensions:  (6400, 10) (6400,)
cv dimensions:  (1600, 10) (1600,)
test dimensions:  (2000, 10) (2000,)

We will create the features on the train, CV, and test sets. Note: any fit must only happen on the train set.

# CreditScore and Age -- if a young man has a high credit score?
X_train['creditscore_age_ratio'] = X_train['CreditScore'] / X_train['Age']
X_cv['creditscore_age_ratio'] = X_cv['CreditScore'] / X_cv['Age']
X_test['creditscore_age_ratio'] = X_test['CreditScore'] / X_test['Age']

# log feature?
X_train['creditscore_age_ratio_log'] = np.log10(X_train['creditscore_age_ratio'])
X_cv['creditscore_age_ratio_log'] = np.log10(X_cv['creditscore_age_ratio'])
X_test['creditscore_age_ratio_log'] = np.log10(X_test['creditscore_age_ratio'])

# Given his/her age does he/she have a better credit score
mean_age = np.mean(X_train['Age']) # use mean of train data for cv and test set
mean_credit = np.mean(X_train['CreditScore']) # use mean of train data for cv and test set
# ... (additional engineered features omitted) ...
# higher salary compared to his/her age?
mean_salary = np.mean(X_train['EstimatedSalary']) # mean estimated salary of the train set
X_train['high_salary_age'] = [1 if (i > mean_salary and j < mean_age) else 0 for i,j in zip(X_train['EstimatedSalary'],X_train['Age'])]
X_cv['high_salary_age'] = [1 if (i > mean_salary and j < mean_age) else 0 for i,j in zip(X_cv['EstimatedSalary'],X_cv['Age'])]
X_test['high_salary_age'] = [1 if (i > mean_salary and j < mean_age) else 0 for i,j in zip(X_test['EstimatedSalary'],X_test['Age'])]

print("New features created!")
X_train.head(3)

New features created!
CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary balance_salary_ratio balance_or_not creditscore_age_ratio creditscore_age_ratio_log Better_Age_Credit Better_Age_Credit_Active multi_products Valuable_customer tenure_age_ratio high_salary_age
0 640 Spain Male 43 9 172478.156250 1 1 0 191084.406250 0.902628 1 14.883721 1.172712 0 0 0 0 0.209302 0
1 850 France Female 24 6 0.000000 2 1 1 13159.900391 0.000000 0 35.416667 1.549208 1 1 1 1 0.250000 0
2 494 Germany Female 38 7 174937.640625 1 1 0 40084.320312 4.364241 1 13.000000 1.113943 0 0 0 0 0.184211 0

3 rows × 22 columns

We have created some features, but we don’t know yet how useful they will be. We will check their influence when building the models.

Also note that ratio features like balance_salary_ratio, creditscore_age_ratio, etc may create multi-collinearity.

Note: It has been stated that multi-collinearity is not an issue when using scikit-learn models. Source

Data Preprocessing

Preparing the data for model building: encoding categorical features and standardizing numerical features.

Encoding Categorical data: Gender and Geography

# Encoding Gender
X_train['Gender'] = X_train['Gender'].apply(lambda x: 1 if x == 'Male' else 0)
X_cv['Gender'] = X_cv['Gender'].apply(lambda x: 1 if x == 'Male' else 0)
X_test['Gender'] = X_test['Gender'].apply(lambda x: 1 if x == 'Male' else 0)
# One Hot Encoding - Geography
from sklearn.preprocessing import OneHotEncoder

# left-to-right column order is alphabetical (France, Germany, Spain)
ohe = OneHotEncoder(sparse=False)
X_train_geo_ohe = pd.DataFrame(ohe.fit_transform(X_train[['Geography']]),columns = ['France', 'Germany', 'Spain'])
X_cv_geo_ohe = pd.DataFrame(ohe.transform(X_cv[['Geography']]),columns = ['France', 'Germany', 'Spain'])
X_test_geo_ohe = pd.DataFrame(ohe.transform(X_test[['Geography']]),columns = ['France', 'Germany', 'Spain'])

# drop the Geography column
X_train.drop('Geography',axis=1,inplace = True)
X_cv.drop('Geography',axis=1,inplace = True)
X_test.drop('Geography',axis=1,inplace = True)

# Concat the One Hot encoded columns
X_train = pd.concat([X_train, X_train_geo_ohe],axis = 1)
X_cv = pd.concat([X_cv, X_cv_geo_ohe],axis = 1)
X_test = pd.concat([X_test, X_test_geo_ohe],axis = 1)

Data Standardization

The features are on different scales/units. Features like EstimatedSalary and Balance have a much higher range of values than Age or Tenure, so we need to standardize the features before feeding them into our models.

from sklearn.preprocessing import StandardScaler
# features to standardise
cols_norm = ['CreditScore','Age','Tenure','Balance','NumOfProducts','EstimatedSalary','balance_salary_ratio',
            'creditscore_age_ratio','creditscore_age_ratio_log','tenure_age_ratio']

sc = StandardScaler()
sc.fit(X_train[cols_norm]) # fit has to happen only on train set

X_train[cols_norm] = sc.transform(X_train[cols_norm])
X_cv[cols_norm] = sc.transform(X_cv[cols_norm])
X_test[cols_norm] = sc.transform(X_test[cols_norm])

print("Standardized!")
X_train.head()
Standardized!

Model

We will try multiple types of algorithms for this data.

  1. Logistic Regression
  2. XGBoost
  3. XGBoost with SMOTE Oversampling

Logistic Regression

We will use sklearn’s Logistic Regression with hyperparameter tuning.
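The batch_predict helper used below isn't defined in the post; a minimal sketch (my assumption of its behaviour) that returns positive-class probabilities in chunks:

import numpy as np

def batch_predict(clf, data, batch_size=1000):
    """Return P(class = 1) for every row of `data`, predicting in batches to limit memory use."""
    y_pred = []
    for start in range(0, data.shape[0], batch_size):
        y_pred.extend(clf.predict_proba(data[start:start + batch_size])[:, 1])
    return np.array(y_pred)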

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from tqdm import tqdm
train_auc = []
cv_auc = []
C_list = [0.0001, 0.001 ,0.01 ,0.1, 1, 10, 100, 1000] # list of various inverse of lambda values we want to compare.
for i in tqdm(C_list): # for each values of C_list (i)
    lr = LogisticRegression(C=i,class_weight='balanced', random_state=17, solver = 'liblinear') # initialize Logistic Model with C = i
    lr.fit(X_train, y_train) # fit the lr model on the train data
    y_train_pred = batch_predict(lr, X_train) # Predict on the train data    
    y_cv_pred = batch_predict(lr, X_cv) # Predict on cross validation data
    # roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
    # not the predicted outputs        
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv, y_cv_pred))

plt.plot(C_list, train_auc, label='Train AUC')
plt.plot(C_list, cv_auc, label='CV AUC')
plt.xscale('log') # change the scale of x axis
plt.autoscale(True)
plt.scatter(C_list, train_auc, label='Train AUC points')
plt.scatter(C_list, cv_auc, label='CV AUC points')
plt.xscale('log')
plt.legend()
plt.xlabel("C: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
100%|██████████| 8/8 [00:00<00:00, 16.11it/s]

LR

From the above plot, we can see that at C = 10 the train and cross-validation AUC scores are high and close to each other, so we set the optimum C to 10.

After building the Logistic Regression model using C = 10 on train data, we get the AUC scores on train and test as shown below.
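A sketch of that final fit (reusing batch_predict from above; the variable names here are my own):

lr = LogisticRegression(C=10, class_weight='balanced', random_state=17, solver='liblinear')
lr.fit(X_train, y_train)
print("Logistic Regression - Train AUC: ", roc_auc_score(y_train, batch_predict(lr, X_train)))
print("Logistic Regression - Test AUC: ", roc_auc_score(y_test, batch_predict(lr, X_test)))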

LR Test

Logistic Regression - Test AUC:  0.836560751814989

Logistic Regression Feature importances

> In Logistic Regression, when a feature's absolute weight is large, its contribution to w.T*x is also large.

> The absolute weights of the features therefore indicate their importance.
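The table below can be produced from the fitted model along these lines (a sketch, assuming the final tuned model is stored in lr):

feature_importance = pd.DataFrame({'Weights': lr.coef_[0],
                                   'Abs_Weights': np.abs(lr.coef_[0]),
                                   'Features': X_train.columns})
print(feature_importance.sort_values('Abs_Weights', ascending=False))
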
Weights Abs_Weights Features
15 -5.567203 5.567203 multi_products
12 -2.543284 2.543284 creditscore_age_ratio_log
5 2.352321 2.352321 NumOfProducts
20 1.474438 1.474438 Germany
7 -1.005711 1.005711 IsActiveMember
0 0.949018 0.949018 CreditScore
11 0.769292 0.769292 creditscore_age_ratio
2 -0.733009 0.733009 Age
16 0.698971 0.698971 Valuable_customer
1 -0.605773 0.605773 Gender
21 0.581897 0.581897 Spain
19 0.533779 0.533779 France
18 -0.498421 0.498421 high_salary_age
13 -0.425378 0.425378 Better_Age_Credit
17 0.339650 0.339650 tenure_age_ratio
3 -0.323940 0.323940 Tenure
8 0.173293 0.173293 EstimatedSalary
14 -0.154712 0.154712 Better_Age_Credit_Active
10 -0.092624 0.092624 balance_or_not
9 0.023764 0.023764 balance_salary_ratio
4 0.010782 0.010782 Balance
6 0.002682 0.002682 HasCrCard

From the absolute weights we can see features like ‘multi_products’,’creditscore_age_ratio_log’,’NumOfProducts’,’Germany’, ‘IsActiveMember’ are most important and features like ‘HasCrCard’,’Balance’, and ‘balance_salary_ratio’ are least important.

So feature engineering has helped a lot.

Let’s drop the least important features and build the model again.

LR2 Test

Logistic Regression - Test AUC:  0.8359900732782088

Now let’s predict on the test data with the best threshold and look at the confusion matrix.

The maximum value of tpr*(1-fpr) 0.5908667088827037 for threshold 0.488
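The threshold search and the predict_with_best_t helper aren't shown in the post; a sketch of how they might work, based on the tpr*(1-fpr) criterion above (find_best_threshold is a hypothetical name of my own):

from sklearn.metrics import roc_curve
import numpy as np

def find_best_threshold(y_true, y_prob):
    """Pick the probability threshold that maximises tpr*(1-fpr) on the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    best_idx = np.argmax(tpr * (1 - fpr))
    print("The maximum value of tpr*(1-fpr)", (tpr * (1 - fpr))[best_idx],
          "for threshold", np.round(thresholds[best_idx], 3))
    return thresholds[best_idx]

def predict_with_best_t(y_prob, best_t):
    """Convert predicted probabilities into 0/1 labels using the chosen threshold."""
    return [1 if prob >= best_t else 0 for prob in y_prob]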

Confusion matrix: Train data

LR2 CM

## Classification report
from sklearn.metrics import classification_report
cr = classification_report(y_test,predict_with_best_t(y_test_pred, best_t))
print(cr)
              precision    recall  f1-score   support

           0       0.93      0.76      0.84      1593
           1       0.45      0.76      0.57       407

    accuracy                           0.76      2000
   macro avg       0.69      0.76      0.70      2000
weighted avg       0.83      0.76      0.78      2000

XGBoost Model

XGBoost documentation - Here

XGboost tutorial - Here
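The uniform_random_sample helper used below to randomly pick hyperparameter combinations isn't shown; a minimal sketch (an assumption, sampling distinct indices without replacement):

import numpy as np

def uniform_random_sample(low, high, size, seed=17):
    """Sample `size` distinct integer indices uniformly from [low, high)."""
    rng = np.random.RandomState(seed)
    return rng.choice(np.arange(low, high), size=size, replace=False)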

import xgboost as xgb

train_auc = []
cv_auc = []
n_trees = [50,100,200,300,500,750,1000] # list of various n_estimator values we want to compare.
l_rate = [0.0001,0.001,0.01,0.1,1.0,10] # list of various learning rate values we want to compare.

hyperparams = [(n,l) for n in n_trees for l in l_rate]
hyperparameter_indices = uniform_random_sample(0,len(hyperparams),size = 15)
params_list = [hyperparams[i] for i in hyperparameter_indices]
n_trees_ls = [i[0] for i in params_list]
l_rate_ls = [i[1] for i in params_list]

for i in tqdm(params_list):
    # initialize XGBoost Model with n_estimators = i[0] and learning_rate = i[1]
    xg = xgb.XGBClassifier(n_estimators=i[0],learning_rate=i[1],random_state = 17) 
    xg.fit(X_train, y_train) # fit the Xgboost model on the train data
    y_train_pred = batch_predict(xg, X_train) # Predict on the train data    
    y_cv_pred = batch_predict(xg, X_cv) # Predict on cross validation data
    # roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
    # not the predicted outputs        
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv, y_cv_pred))
100%|██████████| 15/15 [00:48<00:00,  3.20s/it]

XGB 1

From the above plot, we can see that the best set of parameters is n_estimators = 500 and learning_rate = 0.01.

After building the XGBoost classifier with n_estimators = 500 and learning_rate = 0.01, we get the train and test AUC scores as follows:

XGB1 AUC

XGBoost --> Train AUC:  0.9399423702940356
XGBoost --> Test AUC:  0.8581987226055022

Confusion Matrix – XGBoost Model

The maximum value of tpr*(1-fpr) 0.7502817077751346 for threshold 0.224

Confusion matrix: Test data

XGB1 CM

## Classification report -- XGBoost Model
XGBoost Model:
               precision    recall  f1-score   support

           0       0.92      0.84      0.88      1593
           1       0.53      0.71      0.61       407

    accuracy                           0.81      2000
   macro avg       0.72      0.78      0.74      2000
weighted avg       0.84      0.81      0.82      2000

Model with oversampling – SMOTE Sampling

from imblearn.over_sampling import SMOTE # run "conda install -c conda-forge imbalanced-learn" in the Anaconda prompt
sm = SMOTE(random_state=17)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)
print(X_train_sm.shape,y_train_sm.shape)
(10192, 22) (10192,)
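A sketch of how the refit on the oversampled data might look, using the tuned values reported in the summary table at the end (n_estimators = 300, learning_rate = 0.01; the variable name xg_sm is my own):

xg_sm = xgb.XGBClassifier(n_estimators=300, learning_rate=0.01, random_state=17)
xg_sm.fit(X_train_sm, y_train_sm)
print("XGBoost with SMOTE --> Test AUC: ", roc_auc_score(y_test, batch_predict(xg_sm, X_test)))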

Repeating the same XGBoost model building process with Oversampled data, we get:

The maximum value of tpr*(1-fpr) 0.7712574380374055 for threshold 0.466

Confusion matrix: Test data

XGB2 CM

## Classification report -- XGBoost Model with SMOTE data
XGBoost Model:
               precision    recall  f1-score   support

           0       0.92      0.84      0.88      1593
           1       0.53      0.70      0.61       407

    accuracy                           0.81      2000
   macro avg       0.73      0.77      0.74      2000
weighted avg       0.84      0.81      0.82      2000

We can still improve the scores by adding/removing some more features or trying out other algorithms.

XGBoost Feature importance

XGB2 Feature imp

We can see that features like IsActiveMember, NumOfProducts, and Age make major contributions to the prediction.
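A plot like the one above can be generated directly from the trained model with XGBoost's built-in helper (a sketch, using the xg_sm model from the earlier sketch):

xgb.plot_importance(xg_sm, max_num_features=15)
plt.show()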

Summary

+---------------------------+-----------------------------------------+--------------------+
|           Model           |             Hyper parameter             |   Auc on testset   |
+---------------------------+-----------------------------------------+--------------------+
|    Logistic Regression    |              1/λ or C = 10              | 0.8359900732782088 |
|          XGBoost          | n_estimator = 500, learning_rate = 0.01 | 0.8581987226055022 |
| XGBoost With Oversampling | n_estimator = 300, learning_rate = 0.01 | 0.8530649293361158 |
+---------------------------+-----------------------------------------+--------------------+

You can find full code here