Patti Crosswait

Telco customer churn



Predicting customer churn is crucial for telecommunication companies that want to retain customers effectively, because acquiring new customers is more costly than retaining existing ones. Large telecommunications corporations are therefore seeking to develop models that predict which customers are more likely to leave (churn) so they can take action accordingly.


I built a model to predict how likely a customer is to churn by analyzing the customer's characteristics:

(1) demographic information

(2) account information

(3) services information


The objective is to obtain a data-driven solution that will allow Telco to reduce churn rates and increase customer satisfaction and corporation revenue.


Dataset

The data set I used is available on Kaggle and contains nineteen columns (independent variables) that describe the characteristics of the clients of a fictional telecommunications corporation.


The Churn column (response variable) indicates whether the customer departed within the last month or not. The class No includes the clients that did not leave the company last month, while the class Yes contains the clients that decided to terminate their relations with the company.


The objective of the analysis is to determine the relationship between the customer’s characteristics and the churn.


The original IBM data can be found at the following link: https://www.ibm.com/docs/en/cognos-analytics/11.1.0?topic=samples-telco-customer-churn


Steps of the project

The project consists of the following sections:

1. Data Reading

2. Exploratory Data Analysis and Data Cleaning

3. Data Visualization

4. Feature Importance

5. Feature Engineering

6. Setting a baseline

7. Splitting the data in training and testing sets

8. Assessing multiple algorithms

9. Algorithm selected: Gradient Boosting

10. Hyperparameter tuning

11. Performance of the model

12. Drawing conclusions — Summary


1. Data Reading

The first step of the analysis, after importing packages, consists of reading and storing the data in a Pandas data frame using the pandas.read_csv function.

#Import packages

import pandas as pd

import matplotlib.pyplot as plt

import math

import sklearn

# import telecom dataset into a pandas data frame

df_telco_churn = pd.read_csv('Telco_Customer_Churn.csv')


# visualize column names

df_telco_churn.columns

# check unique values of each column

for col in df_telco_churn.columns:
    print('Column: {} - Unique Values: {}'.format(col, df_telco_churn[col].unique()))

As shown above, the data set contains 19 independent variables, which can be classified into 3 groups:

(1) Demographic Information
gender: Whether the client is a female or a male (Female, Male)
SeniorCitizen: Whether the client is a senior citizen or not (0, 1)
Partner: Whether the client has a partner or not (Yes, No)
Dependents: Whether the client has dependents or not (Yes, No)

(2) Customer Account Information
tenure: Number of months the customer has stayed with the company (Multiple different numeric values)
Contract: Indicates the customer's current contract type (Month-to-Month, One year, Two year)
PaperlessBilling: Whether the client has paperless billing or not (Yes, No)
PaymentMethod: The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit Card (automatic))
MonthlyCharges: The amount charged to the customer monthly (Multiple different numeric values)
TotalCharges: The total amount charged to the customer (Multiple different numeric values)

(3) Services Information
PhoneService: Whether the client has a phone service or not (Yes, No)
MultipleLines: Whether the client has multiple lines or not (No phone service, No, Yes)
InternetService: Whether the client is subscribed to Internet service with the company (DSL, Fiber optic, No)
OnlineSecurity: Whether the client has online security or not (No internet service, No, Yes)
OnlineBackup: Whether the client has online backup or not (No internet service, No, Yes)
DeviceProtection: Whether the client has device protection or not (No internet service, No, Yes)
TechSupport: Whether the client has tech support or not (No internet service, No, Yes)
StreamingTV: Whether the client has streaming TV or not (No internet service, No, Yes)
StreamingMovies: Whether the client has streaming movies or not (No internet service, No, Yes)


2. Exploratory Data Analysis and Data Cleaning


Exploratory data analysis consists of analyzing the main characteristics of a data set usually by means of visualization methods and summary statistics. The objective is to understand the data, discover patterns and anomalies, and check assumptions before performing further evaluations.

Missing values and data types

At the beginning of EDA, I want to know as much as possible about the data; this is when the pandas.DataFrame.info method comes in handy. This method prints a concise summary of the data frame, including the column names and their data types, the number of non-null values, and the amount of memory used by the data frame.

# summary of the data frame

df_telco_churn.info()


As shown above, the data set contains 7043 observations and 21 columns. There are no null values in the data set; however, the column TotalCharges was wrongly detected as an object. This column represents the total amount charged to the customer and is, therefore, a numeric variable, so I need to transform it into a numeric data type with the pd.to_numeric function. By default, this function raises an exception when it encounters non-numeric data; with the argument errors='coerce', those entries are replaced with NaN instead.

# transform the column TotalCharges into a numeric data type

df_telco_churn['TotalCharges'] = pd.to_numeric(df_telco_churn['TotalCharges'], errors='coerce')


# null observations of the TotalCharges column

df_telco_churn[df_telco_churn['TotalCharges'].isnull()]
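To see what errors='coerce' does in isolation, here is a minimal sketch on a made-up Series. The values are illustrative and not taken from the data set; the blank string is only assumed to resemble the entries that could not be parsed.

```python
import pandas as pd

# hypothetical sample: the middle entry is a blank string that cannot be parsed
sample = pd.Series(['29.85', ' ', '108.15'])

# errors='coerce' turns unparseable entries into NaN instead of raising
converted = pd.to_numeric(sample, errors='coerce')

# count how many entries became NaN
print(converted.isnull().sum())
```

The resulting Series has a float dtype, so the NaN entries can then be located with isnull() and dropped with dropna().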


The observations with a null TotalCharges also have a tenure of 0, even though MonthlyCharges is not null for them. This information appears contradictory, so I decided to remove those observations.

# drop observations with null values

df_telco_churn.dropna(inplace=True)

I also removed the customerID column, since it is useless for explaining whether or not the customer will churn.

# drop the customerID column from the dataset

df_telco_churn.drop(columns='customerID', inplace=True)

Payment method denominations

Some payment method denominations contain the text '(automatic)'. These denominations are too long to be used as tick labels in further visualizations, so I removed this clarification from the entries of the PaymentMethod column.
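The code for this step is not shown in the post. A minimal sketch of one way to do it, assuming a plain string replacement is enough, looks like this (the small payments Series is a hypothetical stand-in for the PaymentMethod column):

```python
import pandas as pd

# hypothetical miniature of the PaymentMethod column
payments = pd.Series(['Electronic check', 'Mailed check',
                      'Bank transfer (automatic)', 'Credit card (automatic)'])

# regex=False treats the parentheses literally instead of as a regex group
payments = payments.str.replace(' (automatic)', '', regex=False)

print(payments.tolist())
```

On the real data frame, the same call would be applied to df_telco_churn['PaymentMethod'] and assigned back to that column.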


3. Data Visualization


Response Variable

The following bar plot shows the percentage of observations that correspond to each class of the response variable: No and Yes. This is an imbalanced data set because the two classes are not equally represented: No is the majority class (73.42%). When modeling, this imbalance can bias a classifier toward the majority class and lead to a large number of false negatives.

# create a figure

afigure = plt.figure(figsize=(10, 6))

ax = afigure.add_subplot(111)


# proportion of observation of each class

proportion_response = df_telco_churn['Churn'].value_counts(normalize=True)


# create a bar plot showing the percentage of churn

proportion_response.plot(kind='bar',

ax=ax,

color=['springgreen','salmon'])


# set title and labels

ax.set_title('Proportion of observations of the response variable',

fontsize=18, loc='left')

ax.set_xlabel('churn',

fontsize=14)

ax.set_ylabel('proportion of observations',

fontsize=14)

ax.tick_params(rotation='auto')


# eliminate the frame from the plot

spine_names = ('top', 'right', 'bottom', 'left')

for spine_name in spine_names:
    ax.spines[spine_name].set_visible(False)


I used normalized stacked bar plots to analyze the influence of each independent categorical variable in the outcome.


A normalized stacked bar plot makes each column the same height, so it is not useful for comparing total numbers; however, it is perfect for comparing how the response variable varies across all groups of an independent variable.
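The normalization behind such a plot can be sketched on a toy table. The data below is made up for illustration; pd.crosstab with normalize='index' is one way to turn the counts of each group into row percentages.

```python
import pandas as pd

# hypothetical toy data: 4 customers per contract type
toy = pd.DataFrame({
    'Contract': ['Month-to-month'] * 4 + ['Two year'] * 4,
    'Churn': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes'],
})

# normalize='index' divides each row by its total; * 100 gives percentages
percentages = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index') * 100

print(percentages)
```

Every row sums to 100, which is why all bars have the same height and only the split between No and Yes can be compared across groups.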


I also used histograms to evaluate the influence of each independent numeric variable in the outcome. The data set is imbalanced; therefore, I needed to draw a probability density function of each class (density=True) to be able to compare both distributions properly.

Demographic Information


The following code creates a stacked percentage bar chart for each demographic attribute (gender, SeniorCitizen, Partner, Dependents), showing the percentage of Churn for each category of the attribute.

def percentage_stacked_plot(columns_to_plot, main_title):
    '''
    Prints a 100% stacked plot of the response variable for each independent variable of the list columns_to_plot.

    Parameters:
        columns_to_plot (list of string): Names of the variables to plot
        main_title (string): Super title of the visualization

    Returns:
        None
    '''
    number_of_columns = 2
    number_of_rows = math.ceil(len(columns_to_plot)/2)

    # create a figure
    aFigure = plt.figure(figsize=(12, 5 * number_of_rows))
    aFigure.suptitle(main_title, fontsize=22, y=.95)

    # loop over each column name to create a subplot
    for index, column in enumerate(columns_to_plot, 1):

        # create the subplot
        ax = aFigure.add_subplot(number_of_rows, number_of_columns, index)

        # calculate the percentage of observations of the response variable for each group of the independent variable
        proportion_by_independent = pd.crosstab(df_telco_churn[column], df_telco_churn['Churn']).apply(lambda x: x/x.sum()*100, axis=1)

        # 100% stacked bar plot
        proportion_by_independent.plot(kind='bar', ax=ax, stacked=True,
                                       rot=0, color=['springgreen','salmon'])

        # set the legend in the upper right corner
        ax.legend(loc="upper right", bbox_to_anchor=(0.62, 0.5, 0.5, 0.5),
                  title='Churn', fancybox=True)

        # set title and labels
        ax.set_title('Proportion of observations by ' + column,
                     fontsize=16, loc='left')

        ax.tick_params(rotation='auto')

        # eliminate the frame from the plot
        spine_names = ('top', 'right', 'bottom', 'left')
        for spine_name in spine_names:
            ax.spines[spine_name].set_visible(False)

# demographic column names

demograph_columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents']


# stacked plot of demographic columns

percentage_stacked_plot(demograph_columns, 'Demographic Information')


As shown above, each bar is a category of the independent variable, and it is subdivided to show the proportion of each response class (No and Yes).


I can extract the following conclusions by analyzing demographic attributes:

The churn rate of senior citizens is almost double that of younger customers.
I do not expect gender to have significant predictive power: a similar percentage of churn is shown for men and for women.
Customers with a partner churn less than customers with no partner.


Customer Account Information — Categorical variables

As I did with demographic attributes, I evaluated the percentage of Churn for each category of the customer account attributes (Contract, PaperlessBilling, PaymentMethod).

# customer account column names

account_columns = ['Contract', 'PaperlessBilling', 'PaymentMethod']


# stacked plot of customer account columns

percentage_stacked_plot(account_columns, 'Customer Account Information')

I can extract the following conclusions by analyzing customer account attributes:

Customers with month-to-month contracts have higher churn rates compared to clients with yearly contracts.
Customers who opted for an electronic check as their payment method are more likely to leave the company.
Customers subscribed to paperless billing churn more than those who are not subscribed.


Customer Account Information — Numerical variables


The following plots show the distribution of tenure, MonthlyCharges, and TotalCharges by Churn. For all numeric attributes, the distributions of both classes (No and Yes) are different, which suggests that all of the attributes will be useful to determine whether or not a customer churns.

def histogram_plots(columns_to_plot, main_title):
    '''
    Prints a histogram for each independent variable of the list columns_to_plot.

    Parameters:
        columns_to_plot (list of string): Names of the variables to plot
        main_title (string): Super title of the visualization

    Returns:
        None
    '''
    # set number of rows and number of columns
    number_of_columns = 2
    number_of_rows = math.ceil(len(columns_to_plot)/2)

    # create a figure
    aFigure = plt.figure(figsize=(12, 5 * number_of_rows))
    aFigure.suptitle(main_title, fontsize=22, y=.95)

    # loop over each column name to create a subplot
    for index, col in enumerate(columns_to_plot, 1):

        # create the subplot
        ax = aFigure.add_subplot(number_of_rows, number_of_columns, index)

        # histograms for each class (normalized histogram)
        df_telco_churn[df_telco_churn['Churn']=='No'][col].plot(kind='hist', ax=ax, density=True,
                                                               alpha=0.5, color='springgreen', label='No')
        df_telco_churn[df_telco_churn['Churn']=='Yes'][col].plot(kind='hist', ax=ax, density=True,
                                                                alpha=0.5, color='salmon', label='Yes')

        # set the legend in the upper right corner
        ax.legend(loc="upper right", bbox_to_anchor=(0.5, 0.5, 0.5, 0.5),
                  title='Churn', fancybox=True)

        # set title and labels (col is the loop variable; the original used an undefined name here)
        ax.set_title('Distribution of ' + col + ' by churn',
                     fontsize=16, loc='left')

        ax.tick_params(rotation='auto')

        # eliminate the frame from the plot
        spine_names = ('top', 'right', 'bottom', 'left')
        for spine_name in spine_names:
            ax.spines[spine_name].set_visible(False)

# customer account column names

account_columns_numeric = ['tenure', 'MonthlyCharges', 'TotalCharges']

# histogram of customer account columns

histogram_plots(account_columns_numeric, 'Customer Account Information')

I can extract the following conclusions by analyzing the histograms above:

The churn rate tends to be larger when monthly charges are high.
New customers (low tenure) are more likely to churn.
Clients with high total charges are less likely to leave the company.


Services Information


I evaluated the percentage of the target for each category of the services columns with stacked bar plots.

# services column names

services_columns = ['PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',

'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']


# stacked plot of services columns

percentage_stacked_plot(services_columns, 'Services Information')


I extracted the following conclusions by evaluating services attributes:

I do not expect phone attributes (PhoneService and MultipleLines) to have significant predictive power. The percentage of churn for all classes in both independent variables is nearly the same.
Clients with online security churn less than those without it.
Customers with no tech support tend to churn more often than those with tech support.

By looking at the plots above, I could identify the most relevant attributes for detecting churn. I expect these attributes to be discriminative in future models.


4. Feature importance


Mutual information — analysis of linear and nonlinear relationships


Mutual information measures the mutual dependency between two variables based on entropy estimations. In machine learning, I am interested in evaluating the degree of dependency between each independent variable and the response variable. Higher values of mutual information show a higher degree of dependency which indicates that the independent variable will be useful for predicting the target.

The Scikit-Learn library has implemented mutual information in the metrics package. The following code computes the mutual information score between each categorical variable of the data set and the Churn variable.

import sklearn.metrics


# function that computes the mutual information score between a categorical series and the column Churn
def compute_mutual_information(categorical_series):
    return sklearn.metrics.mutual_info_score(categorical_series, df_telco_churn.Churn)

# select categorical variables excluding the response variable
categorical_vars = df_telco_churn.select_dtypes(include=object).drop('Churn', axis=1)


# compute the mutual information score between each categorical variable and the target

feature_importance = categorical_vars.apply(compute_mutual_information).sort_values(ascending=False)


# visualize feature importance

print(feature_importance)


Mutual information not only gives a better understanding of the data but also helps identify the predictor variables that are nearly independent of the target.


Gender, PhoneService, and MultipleLines have a mutual information score really close to 0, meaning those variables do not have a strong relationship with the target.

This information is in line with the conclusions I have previously drawn by visualizing the data.


Those variables are candidates for removal from the data set before training, as they provide little useful information for predicting the outcome; note that the encoding code in the next section still includes them.


The mutual information extends the notion of correlation to nonlinear relationships since, unlike Pearson’s correlation coefficient, this method is able to detect not only linear relationships but also nonlinear ones.
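A small made-up example illustrates the difference. In the sketch below, y is fully determined by x through a nonlinear rule (y = x squared), yet the Pearson correlation is zero, while the mutual information score is clearly positive. The arrays are illustrative and unrelated to the Telco data.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# y depends on x, but not linearly
x = np.array([-2, -1, 0, 1, 2])
y = x ** 2

# Pearson's coefficient only captures linear association
pearson = np.corrcoef(x, y)[0, 1]

# mutual information (in nats) between the two discrete variables
mi = mutual_info_score(x, y)

print(pearson, mi)
```

Because y is a function of x here, the mutual information equals the entropy of y, which is positive even though the linear correlation vanishes.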

5. Feature Engineering

Feature engineering is the process of extracting features from the data and transforming them into a format that is suitable for the machine learning model. In this project, I needed to transform both numerical and categorical variables. Most machine learning algorithms require numerical values; therefore, all categorical attributes in the dataset should be encoded into numerical labels before training the model. I also needed to transform the numeric columns to a common scale, which prevents the columns with large values from dominating the learning process.

The techniques implemented in this project are described in more detail below. All transformations are implemented using only Pandas; however, there are multiple ways to solve the same problem.


No modification

The SeniorCitizen column is already a binary column and should not be modified.


Label Encoding

Label encoding replaces every category of a categorical variable with a numerical label. I used label encoding with the following binary variables:

(1) gender, (2) Partner, (3) Dependents, (4) PaperlessBilling, (5) PhoneService, and (6) Churn

df_telco_churn_transformed = df_telco_churn.copy()


# label encoding (binary variables)

label_encoding_columns = ['gender', 'Partner', 'Dependents', 'PaperlessBilling', 'PhoneService', 'Churn']


# encode categorical binary features using label encoding

for column in label_encoding_columns:
    if column == 'gender':
        df_telco_churn_transformed[column] = df_telco_churn_transformed[column].map({'Female': 1, 'Male': 0})
    else:
        df_telco_churn_transformed[column] = df_telco_churn_transformed[column].map({'Yes': 1, 'No': 0})

One-Hot Encoding


One-hot encoding creates a new binary column for each level of the categorical variable.


The new column contains zeros and ones indicating the absence or presence of the category in the data.


I applied one-hot encoding to the following categorical variables: (1) Contract, (2) PaymentMethod, (3) MultipleLines, (4) InternetService, (5) OnlineSecurity, (6) OnlineBackup, (7) DeviceProtection, (8) TechSupport, (9) StreamingTV, and (10) StreamingMovies.

# one-hot encoding (categorical variables with more than two levels)

one_hot_encoding_columns = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',

'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']


# encode categorical variables with more than two levels using one-hot encoding

df_telco_churn_transformed = pd.get_dummies(df_telco_churn_transformed, columns = one_hot_encoding_columns)

The main drawback of this encoding is the significant increase in the dimensionality of the dataset (curse of dimensionality); therefore, this method should be avoided when the categorical column has a large number of unique values.


Normalization


Data Normalization is a common practice in machine learning which consists of transforming numeric columns to a common scale.


In many data sets, some features take values that are many times larger than those of other features. The features with larger values tend to dominate the learning process; however, that does not mean those variables are more important for predicting the target.


Data normalization transforms multiscaled data to the same scale.


All variables will have a similar influence on the model, improving the stability and performance of the learning algorithm.


There are multiple normalization techniques in statistics.


I used the min-max method to rescale the numeric columns (tenure, MonthlyCharges, and TotalCharges) to a common scale. The min-max approach (often called normalization) rescales the feature to a fixed range of [0,1] by subtracting the minimum value of the feature and then dividing by the range.

# min-max normalization (numeric variables)
minmax_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']

# scale numerical variables using min max scaler
for column in minmax_columns:
    # minimum value of the column
    min_value_column = df_telco_churn_transformed[column].min()
    # maximum value of the column
    max_value_column = df_telco_churn_transformed[column].max()
    # min max scaler
    df_telco_churn_transformed[column] = (df_telco_churn_transformed[column] - min_value_column) / (max_value_column - min_value_column)
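As mentioned earlier, there are multiple ways to solve the same problem. As an alternative sketch (not the approach used in this project), scikit-learn's MinMaxScaler applies the same formula; the toy values below are made up to show that both approaches agree.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# hypothetical toy values for two of the numeric columns
toy = pd.DataFrame({'tenure': [1, 12, 72],
                    'MonthlyCharges': [20.0, 70.0, 120.0]})

# manual min-max scaling: subtract the minimum, divide by the range
manual = (toy - toy.min()) / (toy.max() - toy.min())

# the same transformation with scikit-learn
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

print(scaled)
```

One possible advantage of the scaler object is that it can be fitted on the training set only and then reused on the testing set, so that no information from the test data enters the transformation.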


6. Setting a baseline

I often use a simple classifier called baseline to evaluate the performance of a model. In this problem, the rate of customers that did not churn (most frequent class) can be used as a baseline to evaluate the quality of the models generated. These models should outperform the baseline capabilities to be considered for future predictions.
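A minimal sketch of that baseline, assuming it is simply the share of the most frequent class (the toy target below is made up and only mimics the imbalance of the data):

```python
import pandas as pd

# hypothetical toy target with a similar kind of imbalance
churn = pd.Series(['No'] * 73 + ['Yes'] * 27)

# accuracy obtained by always predicting the most frequent class
baseline_accuracy = churn.value_counts(normalize=True).max()

print(baseline_accuracy)
```

Any model whose accuracy does not exceed this value is doing no better than always answering No.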


7. Splitting the data in training and testing sets

The first step when building a model is to split the data into two groups, which are typically referred to as training and testing sets. The training set is used by the machine learning algorithm to build the model. The test set contains samples that are not part of the learning process and is used to evaluate the model’s performance. It is important to assess the quality of the model using unseen data to guarantee an objective evaluation.


First, I created a variable X to store the independent attributes of the dataset. Additionally, I created a variable y to store only the target variable (Churn).

# select independent variables

X = df_telco_churn_transformed.drop(columns='Churn')


# select dependent variables

y = df_telco_churn_transformed.loc[:, 'Churn']


# check that the independent variables were selected correctly

print(X.columns)


# check that the target variable was selected correctly

print(y.name)

Then, I used the train_test_split function from the sklearn.model_selection package to create both the training and testing sets.

from sklearn.model_selection import train_test_split


# split the data in training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,

random_state=40, shuffle=True)

8. Assessing multiple algorithms


Algorithm selection is a key challenge in any machine learning project since there is not an algorithm that is the best across all projects.


Generally, I need to evaluate a set of potential candidates and select for further evaluation those that provide better performance.


I compared 5 different algorithms, all of them already implemented in Scikit-Learn.


Dummy classifier (baseline)

K Nearest Neighbors

Support Vector Machines

Random Forest

Gradient Boosting

from sklearn.dummy import DummyClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.svm import SVC

from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import GradientBoostingClassifier


def create_models(seed=2):
    '''
    Create a list of machine learning models.

    Parameters:
        seed (integer): random seed of the models

    Returns:
        models (list): list containing the models
    '''
    models = []
    models.append(('dummy_classifier', DummyClassifier(random_state=seed, strategy='most_frequent')))
    models.append(('k_nearest_neighbors', KNeighborsClassifier()))
    models.append(('support_vector_machines', SVC(random_state=seed)))
    models.append(('random_forest', RandomForestClassifier(random_state=seed)))
    models.append(('gradient_boosting', GradientBoostingClassifier(random_state=seed)))
    return models


# create a list with all the algorithms I am going to assess

models = create_models()

from sklearn.metrics import accuracy_score


# test the accuracy of each model using default hyperparameters

results = []

names = []

scoring = 'accuracy'

for name, model in models:
    # fit the model with the training data
    model.fit(X_train, y_train)
    # make predictions with the testing data
    predictions = model.predict(X_test)
    # calculate accuracy
    accuracy = accuracy_score(y_test, predictions)
    # append the model name and the accuracy to the lists
    results.append(accuracy)
    names.append(name)
    # print classifier accuracy
    print('Classifier: {}, Accuracy: {}'.format(name, accuracy))

According to the printed accuracies, all models outperformed the dummy classifier in terms of prediction accuracy.

It is important to bear in mind that I have trained all the algorithms using the default hyperparameters. The accuracy of many machine learning algorithms is highly sensitive to the hyperparameters chosen for training the model. I will only further evaluate the model that presents the highest accuracy using the default hyperparameters. This corresponds to the gradient boosting model, which shows an accuracy of nearly 80%.


9. Algorithm selected: Gradient Boosting

Gradient Boosting is a very popular machine learning ensemble method based on the sequential training of multiple models. First, an initial model is fitted to the data (or to a random sample of it). After fitting, the model makes predictions and its residuals are computed; the residuals are the differences between the actual values and the predictions of the model. Then a new tree is trained on the residuals of the previous model, and the residuals of the updated model are computed again. This process is repeated until a stopping criterion is met, such as a fixed number of trees or residuals close to 0, meaning there is very little difference between the actual and predicted values. Finally, the predictions of all models (the prediction of the data plus the predictions of the errors) are summed to make the final prediction.
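The residual-fitting loop described above can be sketched in a few lines. This is a simplified illustration for regression with squared error on synthetic data, not the internals of GradientBoostingClassifier (which fits trees to the gradient of a classification loss); the learning rate, tree depth, and number of stages are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # contribution of each tree
n_stages = 50         # number of boosting stages

# start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
errors = [np.mean((y - prediction) ** 2)]

for _ in range(n_stages):
    # residuals of the current ensemble
    residuals = y - prediction
    # fit a small tree to the residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # add a shrunken version of the tree's predictions to the ensemble
    prediction += learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])
```

The training error shrinks stage by stage because each new tree corrects part of what the previous ensemble got wrong.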


I easily built a gradient boosting classifier with Scikit-Learn using the GradientBoostingClassifier class from the sklearn.ensemble module. After creating the model, I need to train it (using the .fit method) and test its performance by comparing the predictions (.predict method) with the actual class values, as you can see in the code above.


The GradientBoostingClassifier has multiple hyperparameters; some of them are listed below:


learning_rate: the contribution of each tree to the final prediction.

n_estimators: the number of decision trees to perform (boosting stages).

max_depth: the maximum depth of the individual regression estimators.

max_features: the number of features to consider when looking for the best split.

min_samples_split: the minimum number of samples required to split an internal node.


The next step consists of finding the combination of hyperparameters that leads to the best classification of my data. This process is called hyperparameter tuning.


10. Hyperparameter tuning


I have split my data into a training set for learning the parameters of the model, and a testing set for evaluating its performance. The next step in the machine learning process is to perform hyperparameter tuning. The selection of hyperparameters consists of testing the performance of the model against different combinations of hyperparameters, selecting those that perform best according to a chosen metric and a validation method.


For hyperparameter tuning, I need to split my training data again into a set for training and a set for testing the hyperparameters (often called the validation set). It is a very common practice to use k-fold cross-validation for hyperparameter tuning. The training set is divided into k equal-sized samples; 1 sample is used for testing and the remaining k-1 samples are used for training the model, and the process is repeated k times so that each sample is used once for testing. Then, the k evaluation metrics (in this case the accuracy) are averaged to produce a single estimate.
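As a sketch of k-fold cross-validation, the snippet below uses cross_val_score from scikit-learn with k = 5 on synthetic data generated by make_classification. The data, the number of folds, and the reduced n_estimators are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the training set
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

model = GradientBoostingClassifier(n_estimators=20, random_state=0)

# 5-fold cross-validation: one accuracy value per fold
fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')

# average the k accuracies into a single estimate
mean_score = fold_scores.mean()

print(fold_scores, mean_score)
```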


There are multiple techniques to find the best hyperparameters for a model. The most popular methods are (1) grid search, (2) random search, and (3) Bayesian optimization. Grid search tests all combinations of hyperparameters and selects the best performing one. It is a really time-consuming method, particularly when the number of hyperparameters and values to try is high.


In random search, you specify a grid of hyperparameters, and random combinations are selected where each combination of hyperparameters has an equal chance of being sampled. I do not analyze all combinations of hyperparameters, but only random samples of those combinations. This approach is much more computationally efficient than trying all combinations; however, it also has some disadvantages. The main drawback of random search is that not all areas of the grid are evenly covered, especially when the number of combinations selected from the grid is low.


I can implement random search in Scikit-learn using the RandomizedSearchCV class from the sklearn.model_selection package.


First of all, I specified the grid of hyperparameter values using a dictionary (grid_parameters) where the keys represent the hyperparameters and the values are the set of options I want to evaluate. Then, I defined the RandomizedSearchCV object for trying different random combinations from this grid. The number of hyperparameter combinations that are sampled is defined in the n_iter parameter. Naturally, increasing n_iter will in most cases lead to more accurate results, since more combinations are sampled; however, on many occasions, the improvement in performance won't be significant.
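The original code for this step did not survive on the page. The following is a minimal sketch of what such a search could look like; the grid values, n_iter, cv, and the synthetic data are all assumptions, and only the hyperparameter names come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in for X_train and y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# hypothetical grid: keys are hyperparameters, values are the options to try
grid_parameters = {
    'n_estimators': [80, 90, 100],
    'max_depth': [3, 4, 5],
    'max_features': [None, 'sqrt', 'log2'],
    'min_samples_split': [2, 3, 4],
}

# sample n_iter random combinations and evaluate each with 3-fold cross-validation
random_search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_distributions=grid_parameters,
    n_iter=5,
    cv=3,
    scoring='accuracy',
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
```

On the real data, X_demo and y_demo would be replaced by X_train and y_train; the fitted object can then be used directly for prediction, since it refits the best combination on the whole training set by default.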


After fitting the random search object, I can obtain the best hyperparameters using the best_params_ attribute. For this data, the best hyperparameters are: {'n_estimators': 90, 'min_samples_split': 3, 'max_features': 'log2', 'max_depth': 3}.


11. Performance of the model


The last step of the machine learning process is to check the performance of the model (with the best hyperparameters) by using the confusion matrix and some evaluation metrics.


Confusion matrix

The confusion matrix, also known as the error matrix, is used to evaluate the performance of a machine learning model by examining the number of observations that are correctly and incorrectly classified. Each column of the matrix contains the predicted classes while each row represents the actual classes, or vice versa. In a perfect classification, the confusion matrix will be all zeros except for the diagonal. All the elements outside the main diagonal represent misclassifications. The confusion matrix also allows me to observe patterns of misclassification (which classes were incorrectly classified and to what extent).


In binary classification problems, the confusion matrix is a 2-by-2 matrix composed of 4 elements:


TP (True Positive): number of customers who churned and are correctly classified as churners.
TN (True Negative): number of customers who did not churn and are correctly classified as non-churners.
FP (False Positive): number of customers who did not churn but are wrongly classified as churners.
FN (False Negative): number of customers who churned but are misclassified as non-churners.


from sklearn.metrics import confusion_matrix


# make the predictions

random_search_predictions = random_search.predict(X_test)


# construct the confusion matrix (stored as cm so the confusion_matrix function is not overwritten)

cm = confusion_matrix(y_test, random_search_predictions)


# visualize the confusion matrix

cm


As shown above, 1402 observations of the testing data were correctly classified by the model (1154 true negatives and 248 true positives). In contrast, there are 356 misclassifications (156 false positives and 200 false negatives).


Evaluation metrics

Evaluating the quality of the model is a fundamental part of the machine learning process. The most used performance evaluation metrics are calculated based on the elements of the confusion matrix.


Accuracy: It represents the proportion of predictions that were correctly classified. Accuracy is the most commonly used evaluation metric; however, it is important to bear in mind that accuracy can be misleading when working with imbalanced datasets.


Sensitivity: It represents the proportion of positive samples (customers who churned) that are identified as such.


Specificity: It represents the proportion of negative samples (customers who did not churn) that are identified as such.


Precision: It represents the proportion of positive predictions that are actually correct.


I can calculate the evaluation metrics manually using the numbers of the confusion matrix. Alternatively, Scikit-learn provides the function classification_report, which gives a summary of the key evaluation metrics. The classification report contains the precision, sensitivity (reported as recall), f1-score, and support (number of samples) achieved for each class.


I obtain a sensitivity of 0.55 (248/(200+248)) and a specificity of 0.88 (1154/(1154+156)). The model predicts customers who do not churn more accurately than customers who churn. This is not surprising, since gradient boosting classifiers are usually biased toward the classes with more observations.
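The manual calculation can be written out directly from the four counts of the confusion matrix reported above. This is a plain arithmetic sketch; the variable names are my own.

```python
# counts taken from the confusion matrix reported above
tn, fp, fn, tp = 1154, 156, 200, 248

# proportion of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# proportion of actual churners that are detected
sensitivity = tp / (tp + fn)

# proportion of actual non-churners that are detected
specificity = tn / (tn + fp)

# proportion of predicted churners that actually churned
precision = tp / (tp + fp)

print(round(accuracy, 2), round(sensitivity, 2),
      round(specificity, 2), round(precision, 2))
```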


As you may have noticed, the previous summary does not contain the accuracy of the classification. However, this can be easily calculated using the function accuracy_score from the metrics module.
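Since the fitted model is not available here, the sketch below rebuilds label vectors that reproduce the reported confusion matrix counts and passes them to classification_report and accuracy_score. The reconstruction is only a device for illustration; on the real data, y_test and random_search_predictions would be used instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# 1310 actual non-churners (0) followed by 448 actual churners (1)
y_true = np.array([0] * 1310 + [1] * 448)

# predictions arranged to give 1154 TN, 156 FP, 200 FN and 248 TP
y_pred = np.array([0] * 1154 + [1] * 156 + [0] * 200 + [1] * 248)

# per-class precision, recall (sensitivity), f1-score and support
print(classification_report(y_true, y_pred, target_names=['No', 'Yes']))

# overall accuracy
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)
```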


As you can observe, hyperparameter tuning has barely increased the accuracy of the model.


12. Drawing conclusions — Summary


I have walked through a complete end-to-end machine learning project using the Telco customer churn dataset. I started by cleaning the data and analyzing it with visualization. Then, to be able to build a machine learning model, I transformed the categorical data into numeric variables (feature engineering). After transforming the data, I tried 5 different machine learning algorithms using default parameters. Finally, I tuned the hyperparameters of the Gradient Boosting Classifier (the best performing model) for model optimization, obtaining an accuracy of nearly 80% (close to 6 percentage points higher than the baseline).
