Companies work hard to keep their customers happy. They launch new technologies and services so that customers engage with their products more, and they try to stay in touch with each customer so that they can tailor what they offer. In practice, though, keeping in touch with every individual customer is unrealistic. This is where Customer Segmentation comes in.
Customer Segmentation means grouping customers on the basis of similar characteristics, behavior, and needs. This helps the company in many ways: it can launch products or enhance features for a specific group, and it can target a particular segment based on its behavior. All of this contributes to the company’s overall market value.
Customer Segmentation using Unsupervised Machine Learning in Python
In this article, we will use unsupervised Machine Learning to implement Customer Segmentation.
Import Libraries
The libraries we will require are:
- Pandas – This library loads the data into a 2D DataFrame format and provides functions to analyze and manipulate it.
- NumPy – NumPy arrays are very fast and can perform large computations.
- Matplotlib / Seaborn – These libraries are used to draw visualizations.
- Sklearn – This module contains multiple libraries with pre-implemented functions for everything from data preprocessing to model development and evaluation.
Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')
Importing Dataset
The dataset taken for this task includes the details of customers: their marital status, their income, the number of items purchased, the types of items purchased, and so on.
Python3
df = pd.read_csv('new.csv')
df.head()
Output:
To check the shape of the dataset we can use the df.shape attribute.
Python3
df.shape
Output:
(2240, 25)
To get information about the dataset, such as null values and value counts, we will use the .info() method.
Data Preprocessing
Python3
df.info()
Output:
Python3
df.describe().T
Output:
Cleaning up the values in the Accepted column by stripping the redundant 'Accepted' prefix from each entry.
Python3
df['Accepted'] = df['Accepted'].str.replace('Accepted', '')
To check for null values in the dataset:
Python3
for col in df.columns:
    temp = df[col].isnull().sum()
    if temp > 0:
        print(f'Column {col} contains {temp} null values.')
Output:
Column Income contains 24 null values.
Now that we have the count of the null values and know there are very few of them, we can drop those rows (this will not affect the dataset much).
Python3
df = df.dropna()
print("Total rows after removing null values:", len(df))
Output:
Total rows after removing null values: 2216
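If you would rather keep every row, a hedged alternative (not used in the rest of this article) is to impute the missing Income values instead, for example with the column median:

Python3

# hedged alternative (not applied below): fill missing Income with the median
df['Income'] = df['Income'].fillna(df['Income'].median())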
To find the number of unique values in each column we can use the df.nunique() method.
Python3
df.nunique()
Output:
Here we can observe that some columns contain a single value throughout, so they have no relevance for model development.
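To make that observation concrete, here is a small hedged sketch that lists the constant columns programmatically; on this dataset it should flag Z_CostContact and Z_Revenue, which we drop shortly:

Python3

# list the columns whose value never varies across the dataset
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
print(constant_cols)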
The dataset also has a Dt_Customer column containing dates, which we can convert into three columns: day, month, and year.
Python3
parts = df["Dt_Customer"].str.split("-", n=3, expand=True)
df["day"] = parts[0].astype('int')
df["month"] = parts[1].astype('int')
df["year"] = parts[2].astype('int')
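As a hedged alternative to the manual string split, pandas can parse the column as datetimes directly (this sketch assumes the dates are day-first, e.g. 04-09-2012):

Python3

# alternative sketch: let pandas parse the dates instead of splitting strings
dates = pd.to_datetime(df['Dt_Customer'], dayfirst=True)
df['day'], df['month'], df['year'] = dates.dt.day, dates.dt.month, dates.dt.year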
Now that we have extracted all the important features, we can drop the columns Z_CostContact, Z_Revenue, and Dt_Customer.
Python3
df.drop(['Z_CostContact', 'Z_Revenue', 'Dt_Customer'], axis=1, inplace=True)
Data Visualization and Analysis
Data visualization is the graphical representation of information and data. Here we will use count plots for a better view of the categorical columns.
Python3
floats, objects = [], []
for col in df.columns:
    if df[col].dtype == object:
        objects.append(col)
    elif df[col].dtype == float:
        floats.append(col)

print(objects)
print(floats)
Output:
['Education', 'Marital_Status', 'Accepted']
['Income']
To get the count plots for the columns of datatype object, refer to the code below.
Python3
plt.subplots(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    sb.countplot(x=df[col])
plt.show()
Output:
Let’s check the value_counts of the Marital_Status column.
Python3
df['Marital_Status'].value_counts()
Output:
Now let’s compare these features with respect to the values of the Response column.
Python3
plt.subplots(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    sb.countplot(x=df[col], hue=df['Response'])
plt.show()
Output:
Label Encoding
Label Encoding is used to convert categorical values into numerical values so that the model can understand them.
Python3
for col in df.columns:
    if df[col].dtype == object:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
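As a quick illustration of what LabelEncoder produces (a hypothetical toy example, not part of the dataset):

Python3

# each distinct category gets an integer code, assigned in sorted order
le = LabelEncoder()
print(le.fit_transform(['Single', 'Married', 'Single', 'Divorced']))
# -> [2 1 2 0] since the sorted classes are Divorced, Married, Single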
A heatmap is a good way to visualize the correlation among the different features of the dataset. Here we plot a boolean mask of the correlation matrix that highlights pairs of features whose correlation exceeds 0.8.
Python3
plt.figure(figsize=(15, 15))
sb.heatmap(df.corr() > 0.8, annot=True, cbar=False)
plt.show()
Output:
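As a hedged follow-up, the sketch below prints the actual feature pairs behind the highlighted cells, so you can decide whether to drop one column from each highly correlated pair:

Python3

# list every pair of features whose absolute correlation exceeds 0.8
corr = df.corr()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > 0.8:
            print(a, b, round(corr.loc[a, b], 2))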
Standardization
Standardization is a feature-scaling method and an integral part of feature engineering. It rescales the data, making it easier for the machine learning model to learn from: each feature ends up with a mean of 0 and a standard deviation of 1.
Python3
scaler = StandardScaler()
data = scaler.fit_transform(df)
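A quick hedged sanity check that the scaling worked as described: every column of the scaled array should now have a mean close to 0 and a standard deviation close to 1.

Python3

# verify the standardization: per-column mean ~0 and standard deviation ~1
print(data.mean(axis=0).round(3))
print(data.std(axis=0).round(3))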
Segmentation
We will be using t-distributed Stochastic Neighbor Embedding (t-SNE), which helps in visualizing high-dimensional data. It converts similarities between data points into joint probabilities and tries to minimize the divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
Python3
from sklearn.manifold import TSNE

model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(data)
plt.figure(figsize=(7, 7))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1])
plt.show()
Output:
Some clusters are clearly visible in this 2-D representation of the data. Let’s use the KMeans algorithm to find those clusters in the high-dimensional space itself. KMeans clusters the points by assigning each one to the nearest of k cluster centers and iteratively updating those centers.
Python3
error = []
for n_clusters in range(1, 21):
    model = KMeans(init='k-means++', n_clusters=n_clusters,
                   max_iter=500, random_state=22)
    model.fit(data)
    error.append(model.inertia_)
Here, inertia is the sum of squared distances from each point to its closest cluster center.
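To make that definition concrete, here is a hedged sketch that recomputes the inertia by hand for the last model fitted in the loop above (k = 20) and compares it with model.inertia_; the two numbers should match closely:

Python3

# recompute inertia manually: squared distance of each point to its center
centers = model.cluster_centers_
labels = model.labels_
manual_inertia = ((data - centers[labels]) ** 2).sum()
print(manual_inertia, model.inertia_)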
Python3
plt.figure(figsize=(10, 5))
sb.lineplot(x=range(1, 21), y=error)
sb.scatterplot(x=range(1, 21), y=error)
plt.show()
Output:
Here, by the elbow method, we can say that k = 6 is the optimal number of clusters, since beyond k = 6 the inertia stops decreasing drastically.
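To back this up numerically, here is a hedged sketch (reusing the error list from above) that prints the percentage drop in inertia at each step; the drops should flatten out after k = 6:

Python3

# percentage drop in inertia when moving from k to k+1 clusters
drops = -np.diff(error) / np.array(error[:-1]) * 100
for k, d in zip(range(2, 21), drops):
    print(f'k={k}: inertia drops by {d:.1f}%')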
Python3
# create the clustering model with the optimal k = 6
model = KMeans(init='k-means++', n_clusters=6,
               max_iter=500, random_state=22)
segments = model.fit_predict(data)
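Before plotting, a quick hedged sanity check on the segment sizes:

Python3

# count how many customers fall into each of the 6 clusters
values, counts = np.unique(segments, return_counts=True)
for v, c in zip(values, counts):
    print(f'Cluster {v}: {c} customers')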
A scatter plot of the t-SNE embedding will be used to see all 6 clusters formed by KMeans clustering.
Python3
plt.figure(figsize=(7, 7))
sb.scatterplot(x=tsne_data[:, 0], y=tsne_data[:, 1], hue=segments)
plt.show()
Output: