Predicting Customer Churn Using Ensemble Techniques and tkinter

Minakshi Mathpal
11 min read · Sep 28, 2021

This blog will walk you through an end-to-end telecom churn prediction model. It covers EDA, data visualization, and a detailed comparison of different models. Our aim in this article is to build churn prediction machine learning models using different ensemble techniques, compare them, and select the best among them.

Losing customers is undesirable for any business, and if you can predict when a customer will stop using the service, you have an opportunity to make decisions that retain that customer.

Churn prediction means identifying customers who are likely to stop using a service, based on how they use it. It is a critical prediction for many businesses because acquiring new clients is often costlier than retaining existing ones. Once you identify customers at risk of ending their relationship with the company, you can plan exactly what marketing action to take for each individual customer to maximize the chances that they will remain.

This article uses the Telco Customer Churn dataset, available on Kaggle. All the code for this example is in Python and is available here.

Domain

Telecom: A telecom company wants to use its historical customer data to predict behavior and retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

Data Description:

Each row represents a customer, and each column contains a customer attribute, as described in the column metadata. The dataset includes information about:

  • Customers who left within the last month — the column is called Churn
  • Services that each customer has signed up for — phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Customer account information — how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  • Demographic info about customers — gender, age range, and if they have partners and dependents

The Dataset

  • Import all the given datasets
  • Explore shape and size.

We will use the pandas library to work with our dataset. First, let's read it in.

We can explore the dataset further with the function pandas.DataFrame.info(). This function gives us overall information about the DataFrame, such as the column names, their data types, and the number of non-null values in each column.
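As a sketch (using a tiny inline stand-in for the Kaggle CSV so the snippet is self-contained; in practice you would pass the path of the downloaded file to read_csv):

```python
import io

import pandas as pd

# Tiny stand-in for the Kaggle file; in practice:
# df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
csv = io.StringIO(
    "customerID,tenure,MonthlyCharges,TotalCharges,Churn\n"
    "0001,1,29.85,29.85,No\n"
    "0002,34,56.95,1889.5,Yes\n"
    "0003,2,53.85, ,No\n"  # note the blank TotalCharges value
)
df = pd.read_csv(csv)

print(df.shape)  # (rows, columns)
df.info()        # column names, dtypes, non-null counts per column
```

Notice that TotalCharges comes back with dtype object because of the blank entry, which is exactly the issue we handle next.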

Dataframe info

Our data has a column "TotalCharges" that holds numerical values, but its data type is object. We can conclude that it contains some string values.

  • NOTE: Although it is fine to have whitespace in our data in general, we can't have whitespace in category labels if we want to draw a decision tree later. So let's take care of that by replacing whitespace with _

Our data has no missing values. Now, let's see the churn distribution using the seaborn library.
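Putting those cleanup steps together (a minimal sketch on stand-in rows; the real DataFrame comes from the CSV, and the seaborn call is shown as a comment since it only draws a chart):

```python
import pandas as pd

# Stand-in rows mirroring the Telco data (a blank TotalCharges, spaces in labels)
df = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "1889.5"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Electronic check"],
    "Churn": ["No", "Yes", "No"],
})

# 1. TotalCharges is read as object; coerce blanks to NaN, then impute
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# 2. Replace whitespace inside category labels with underscores
#    (needed later when drawing a decision tree)
obj_cols = df.select_dtypes("object").columns
df[obj_cols] = df[obj_cols].apply(lambda s: s.str.replace(" ", "_"))

# 3. Check missing values and the churn distribution
print(df.isna().sum())
print(df["Churn"].value_counts(normalize=True))
# A bar chart of the same distribution: sns.countplot(x="Churn", data=df)
```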

Univariate Analysis

Target: CHURN

Churn Distribution

We are dealing with imbalanced data: the dataset has 27% churners and 73% non-churners. This can affect the model's predictions, so it needs to be handled. If we train a model directly on imbalanced data, there is a very high chance it will be biased towards the majority class, i.e. the non-churners.

Numerical Feature EDA

  • Tenure and monthly charges are fairly symmetrical
  • Total charges is positively skewed
  • Customers with lower total charges churn more
  • New customers are more likely to churn

Bivariate Analysis

  • The likelihood of a customer churning increases as charges increase
  • Customers have the highest probability of churning when monthly charges exceed 60 dollars
  • Customers who churned paid, on average, around 75 dollars per month
  • Customers who did not churn most likely paid around 20 dollars per month

Categorical Features EDA

  • The dataset has significantly fewer senior citizens than non-seniors (only 16% are seniors)
  • A higher proportion of senior citizens churn than non-senior citizens
  • Senior citizens, on average, pay higher monthly and total charges
  • Senior citizens are more likely to churn than non-senior citizens
  • Overall, those without partners are more likely to churn than those with partners
  • Customers without dependents are more likely to churn than those with dependents
  • Customers with dependents pay higher total charges
  • Customers with higher monthly charges for phone service are more likely to churn
  • Customers with higher total charges for phone service are more likely to churn
  • Customers with multiple lines pay higher monthly and total charges and are more likely to churn than others
  • Customers with no internet service are less likely to churn
  • Customers with fiber-optic service are the most likely to churn
  • On average, customers with fiber optic pay higher monthly and total charges and are thus more likely to churn
  • Fiber optic is the most common internet service; there is also a large population that doesn't have internet service
  • Customers with no online backup and no online security are more likely to churn
  • Customers with no internet service are the least likely to churn
  • Customers with no online backup pay higher monthly and total charges for the service and thus churn more when measured against monthly and total charges; the opposite holds for customers with the online backup facility

This way we can check the relationship between for every possible categorical feature and numerical feature to have a better understanding of the data. Following are my observations for rest of the features:

  • Customers with no online security pay higher monthly and total charges for the service and thus churn more when measured against monthly and total charges; the opposite holds for customers with the online security facility
  • Customers who don't have StreamingTV services and those who do have roughly equal probability of churning

Hypothesis Testing

We do hypothesis testing to draw inferences about the overall population, or about a population parameter, by running statistical tests on a sample. So what we are doing here is:

Testing the independence of categorical Features with Target Variable (Chi Square Test for independence)

- Null hypothesis: there is no relationship between the feature and the target variable
- Alternate hypothesis: there is a significant relationship between the feature and the target variable

From the chi-square test for independence, we found that gender and PhoneService have no relationship with our target variable, so we can drop these two features from the dataset.
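A sketch of the chi-square loop, using scipy.stats.chi2_contingency on a small made-up frame (the column names and counts here are illustrative):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Illustrative data; in practice this loop runs over every categorical column
df = pd.DataFrame({
    "gender": ["Male", "Female"] * 50,
    "Contract": ["Month-to-month"] * 60 + ["Two year"] * 40,
    "Churn": ["Yes"] * 27 + ["No"] * 73,
})

alpha = 0.05
for col in ["gender", "Contract"]:
    table = pd.crosstab(df[col], df["Churn"])       # contingency table
    chi2, p, dof, expected = chi2_contingency(table)
    verdict = "dependent" if p < alpha else "independent"
    print(f"{col}: p={p:.4f} -> {verdict} of Churn")
```

Features whose p-value stays above alpha (here, gender) are candidates for dropping.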

Let's analyze the relationship between the numerical features and the target column by testing the independence of numerical features with the target variable (independent t-test).

Null hypothesis: there is no significant difference in the feature variable across the categories of the target
Alternate hypothesis: there is a significant difference in the feature across the categories of the target variable

We have three numerical columns in our dataset, and all three have a significant relationship with the target.
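A sketch of the t-test, with made-up tenure samples for the two classes (Welch's variant, since the group variances clearly differ):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Illustrative samples: tenure for churners vs non-churners
tenure_churn = rng.normal(18, 10, 200)   # churners tend to be newer customers
tenure_stay = rng.normal(38, 20, 500)

# Welch's t-test (equal_var=False because the group variances differ)
t_stat, p_value = ttest_ind(tenure_churn, tenure_stay, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: tenure differs significantly between the two classes")
```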

Next, we have columns in our dataset that contain categorical features (string values); for example, Gender has categories like Male and Female. These labels have no particular order of preference, and since they are strings, a machine learning model cannot work on them directly. To address this we will use the one-hot encoding technique.

After one-hot encoding we have a separate column for each category.
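pandas' get_dummies does this in one call; the columns here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female"],
                   "Contract": ["Month-to-month", "Two year", "One year"]})

# One indicator column per category; pass drop_first=True if you want to
# avoid the redundant (collinear) column per feature
encoded = pd.get_dummies(df, columns=["gender", "Contract"])
print(encoded.columns.tolist())
```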

The Models

Now that our data is prepared for modeling, we just need to split it into train and test sets. For that we have train_test_split() from sklearn. But recall that our data is imbalanced: only 27% of it belongs to the minority class (the class of interest). We will therefore split the data with stratification, in order to maintain the same percentage of majority and minority class in the training set and the testing set.

Class imbalance in a dataset can distort the training of a classifier, yielding a model biased in favor of the majority class, so balancing is also required. I have also kept 10% of the whole data separate for future predictions.
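The split described above might look like this (synthetic features; the 10% holdout and stratified 80/20 split follow the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 270 + [0] * 730)   # 27% minority class, as in the dataset

# Hold out 10% for future predictions, then split the rest 80/20,
# stratifying on y so both classes keep their 27/73 proportions throughout
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=42)

print(y_train.mean(), y_test.mean(), y_hold.mean())  # all close to 0.27
```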

One more statistical check we can do is to see whether the train and test data have statistical characteristics similar to the original data.
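One way to run this check is to put the describe() summaries side by side (sketched here on synthetic columns):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
full = pd.DataFrame({"tenure": rng.integers(0, 72, 1000),
                     "MonthlyCharges": rng.uniform(20, 120, 1000)})
train, test = train_test_split(full, test_size=0.2, random_state=42)

# Compare summary statistics side by side; similar means and stds suggest
# the split did not distort the distributions
summary = pd.concat({"full": full.describe(), "train": train.describe(),
                     "test": test.describe()}, axis=1)
print(summary.loc[["mean", "std"]])
```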

So our train and test data share similar statistical characteristics with the original data.

As you may have noticed, before one-hot encoding there were 21 columns, but afterwards we have 41 columns (features). Not every feature actually contributes to the model's predictions, so we can eliminate a few. Tree-based algorithms have an attribute feature_importances_ that we can use for importance-based feature selection.

So what I have done is: first I modeled a decision tree with RandomUnderSampler(), tuned its hyperparameters using GridSearchCV, selected the best_estimator_, and then fetched the important features. Let's look at the code chunk for this.
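A sketch of that pipeline. One caveat: to keep it sklearn-only, class_weight="balanced" stands in for the RandomUnderSampler() step (which comes from imbalanced-learn), and the parameter grid and top-k cutoff are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the one-hot-encoded churn features
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           weights=[0.73, 0.27], random_state=42)

# class_weight="balanced" plays a role similar to undersampling here
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid={"max_depth": [3, 5, 7], "min_samples_leaf": [5, 20]},
    cv=5, scoring="balanced_accuracy")
grid.fit(X, y)

best_tree = grid.best_estimator_
importances = best_tree.feature_importances_     # sums to 1 across features
keep = np.argsort(importances)[::-1][:5]         # indices of the top-5 features
print("selected feature indices:", sorted(keep.tolist()))
```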

After this you again need to do a train/test split, because the dataset now has a reduced number of features.

Do you remember that we still have not seen how to handle the data imbalance? The strategy I used is to evaluate different variants of the classifier I decided to model.

Let's dive into the code for RandomForestClassifier.
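A sketch of the variant comparison. To stay self-contained it keeps only the sklearn variants; the article's winning BalancedRandomForestClassifier (from imbalanced-learn) would be evaluated the same way:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.73, 0.27], random_state=42)

# Variants of the same classifier, differing only in how they handle imbalance
variants = {
    "plain": RandomForestClassifier(random_state=42),
    "class_weight=balanced": RandomForestClassifier(
        class_weight="balanced", random_state=42),
    "class_weight=balanced_subsample": RandomForestClassifier(
        class_weight="balanced_subsample", random_state=42),
}

for name, model in variants.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The variant with the highest mean score and lowest standard deviation wins.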

  • From the above evaluation results, BalancedRandomForestClassifier with class_weight performs best among all the variants for our data, with the maximum test score and the least standard deviation
  • Now we will create the final model using the best model from the above results

When predicting on unseen data, I get a training accuracy of 0.853 and a test accuracy of 0.735.

With this imbalanced dataset, accuracy is not a good metric for evaluating the model: it measures the overall accuracy, and with imbalanced data it can mislead. So it's better to choose other metrics. I have used balanced accuracy as the evaluation metric, because it takes both the positive and negative outcome classes into account and is not misled by imbalance. For more clarity I have also used the confusion matrix and the classification report.
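These metrics all come from sklearn.metrics; a toy example with hand-picked labels:

```python
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             confusion_matrix)

# Illustrative true labels and predictions (1 = churn)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Mean of per-class recall, so the majority class cannot dominate the score
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # rows = actual, cols = predicted
print(classification_report(y_true, y_pred))
```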

Figure 1
Figure 2

Here, figure 1 is the performance evaluation of the BalancedRandomForestClassifier with RandomUnderSampler(), and figure 2 represents the performance evaluation of the RandomForestClassifier.

Though the accuracy of the RandomForestClassifier is better than that of the BalancedRandomForestClassifier, the BalancedRandomForestClassifier performs better for our purpose. Our class of interest is "1": we want to predict customers who are more likely to churn. Accuracy simply returns the percentage of labels you predicted correctly (if there are 1000 labels and you predicted 980 correctly, you get a score of 98%). We are interested in the recall of the model. The RandomForestClassifier has a recall of just 49% for class 1 (churn), while the BalancedRandomForestClassifier gives us a recall of 83%. One thing to note here is that when we try to increase the recall of the model, the precision decreases. Unfortunately, you can't have both precision and recall high: if you increase precision, it will reduce recall, and vice versa. This is called the precision/recall tradeoff.

Following is the Evaluation of different Model Performances

Going by the classification reports presented above, there is a fair balance between recall, precision, and F1 score for the GradientBoostingClassifier, and we have very similar results for XGBoost. Though the recall of BalancedRandomForest is the highest among all the models, its precision lags behind the others.

Finally, I dumped the GradientBoostingClassifier model so that it could be deployed.
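Dumping and reloading with joblib might look like this (trained here on synthetic data as a stand-in for the tuned model, and the file name is my own choice):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the prepared churn data
X, y = make_classification(n_samples=200, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X, y)

joblib.dump(model, "churn_model.joblib")      # serialize the fitted model
restored = joblib.load("churn_model.joblib")  # reload at deployment time
print("round-trip ok:", (restored.predict(X) == model.predict(X)).all())
```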

GUI development:

I have designed a clickable GUI desktop application. The GUI allows the user to input all the feature values and, on a click, feeds these values to the trained model above to make a prediction. It displays the prediction as well.
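A minimal tkinter sketch of such an app. The feature list is trimmed to three numeric inputs for brevity, and the model file name is assumed from the dump step; a real version would collect every feature the model was trained on:

```python
import tkinter as tk
from tkinter import ttk

FEATURES = ["tenure", "MonthlyCharges", "TotalCharges"]  # trimmed for brevity

def predict_from_entries(model, values):
    """Convert the entry-box strings to floats and return the model's label."""
    row = [[float(v) for v in values]]
    return int(model.predict(row)[0])

def launch_app(model):
    root = tk.Tk()
    root.title("Churn Predictor")
    entries = []
    for i, name in enumerate(FEATURES):
        ttk.Label(root, text=name).grid(row=i, column=0, padx=5, pady=2)
        entry = ttk.Entry(root)
        entry.grid(row=i, column=1, padx=5)
        entries.append(entry)
    result = ttk.Label(root, text="")
    result.grid(row=len(FEATURES) + 1, column=0, columnspan=2)

    def on_predict():
        label = predict_from_entries(model, [e.get() for e in entries])
        result.config(text="Churn" if label == 1 else "No churn")

    ttk.Button(root, text="Predict", command=on_predict).grid(
        row=len(FEATURES), column=0, columnspan=2, pady=5)
    root.mainloop()

# To run the app with the dumped model:
# import joblib; launch_app(joblib.load("churn_model.joblib"))
```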

Conclusion

For a prediction problem, the data cleaning and pre-processing step is important to identify and remove outliers and duplicate data, in order to create a dependable dataset. It improves the quality of the training data and helps in accurate decision-making.

Customer churn prediction is crucial to the long-term financial stability of a company. In this article, we have successfully created a machine learning model that’s able to predict customer churn with a recall score of 74%.

With more rigorous hyperparameter tuning and feature engineering, the recall could be increased further.

All the code for this example is in Python and is available here. I would highly appreciate any kind of feedback. Thanks for reading!
