Overview
In this article, we’re going to discuss employee attrition prediction i.e. predicting that employee will leave the current company (or will resign from the current company) and we will do this using several machine learning algorithms (basically 6 ML algorithms) but this article is gonna be completely step by step explanation. So let’s get started.
Need of Employee Attrition prediction
- Managing workforce: If the supervisors or HR came to know about some employees that they will be planning to leave the company then they could get in touch with those employees which can help them to stay back or they can manage the workforce by hiring the new alternative of those employees.
- Smooth pipeline: If all the employees in the current project are working continuously on a project then the pipeline of that project will be smooth but if suppose one efficient asset of the project(employee) suddenly leave that company then the workflow will be not so smooth
- Hiring Management: If HR of one particular project came to know about the employee who is willing to leave the company then he/she can manage the number of hiring and they can get the valuable asset whenever they need so for the efficient flow of work.
Table of content
- Importing libraries
- Data exploration
- Data cleaning
- Splitting data (train test split)
- Model development applying 6-ML algorithms
- Logistic Regression
- Decision tree
- KNN
- SVM
- Random Forest
- Naive Bayes
- Saving model
Importing libraries
import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.metrics import confusion_matrix from sklearn import datasets from sklearn.metrics import accuracy_score
Reading the dataset
attrdata = pd.read_csv("Table_1.csv")
Let’s look at our dataset and plan to torture it!
attrdata.head()
Output:
Data exploration
Dropping the index column
attrdata.drop(0,inplace=True) attrdata.isnull().sum()
Output:
table id 0 name 0 phone number 0 Location 0 Emp. Group 0 Function 0 Gender 0 Tenure 0 Tenure Grp. 0 Experience (YY.MM) 4 Marital Status 0 Age in YY. 0 Hiring Source 0 Promoted/Non Promoted 0 Job Role Match 2 Stay/Left 0 dtype: int64
As we can see that there are null values in the Experience and Job role so we have to drop them.
attrdata.dropna(axis=0,inplace=True)
Output:
table id 0 name 0 phone number 0 Location 0 Emp. Group 0 Function 0 Gender 0 Tenure 0 Tenure Grp. 0 Experience (YY.MM) 0 Marital Status 0 Age in YY. 0 Hiring Source 0 Promoted/Non Promoted 0 Job Role Match 0 Stay/Left 0 dtype: int64
The shape of our dataset
attrdata.shape
Output:
(895, 16)
Let’s explore all the categorical values and visualize them
Now, we will use the value_counts function so that we can get the unique values from every categorical type of data.
Gender
gender_dict = attrdata["Gender "].value_counts() gender_dict
Output:
Male 655 Female 234 other 6 Name: Gender , dtype: int64
Understanding the balancing of the Gender column visually
attrdata['Gender '].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Count of different gender")
Output:
Here, from the chart, it’s visible that the count of males is more than another category of the gender.
- Male: 655
- Female: 234
- Other: 6
Now, let’s figure out that how gender could be the reason for employees to leave the company or to stay in.
#Create a plot for crosstab pd.crosstab(attrdata['Gender '],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6)) plt.title("Stay/Left vs Gender") plt.xlabel("Stay/Left") plt.ylabel("No of people who left based on gender") plt.legend(["Left","Stay"]) plt.xticks(rotation=0)
Output:
Here, from the chart it’s visible that it heavily depends on males, also we can see that it’s either male, female or others but more number of them are staying in the company.
Promotion (Promoted/ Non-Promoted)
promoted_dict = attrdata["Promoted/Non Promoted"].value_counts() promoted_dict
Output:
Promoted 457 Non Promoted 438 Name: Promoted/Non Promoted, dtype: int64
attrdata['Promoted/Non Promoted'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Promoted and Non Promoted")
Output:
Now, from the above chart, we can see that when it comes to Promoted and Non-Promoted employees it’s quiet in balanced number.
Now, let’s figure out that how promotion could be the reason for employees to leave the company or to stay in.
#Create a plot for crosstab pd.crosstab(attrdata['Promoted/Non Promoted'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6)) plt.title("Stay/Left vs Promoted/Non Promoted") plt.xlabel("Stay/Left") plt.ylabel("No. of people who left/stay based on promotion") plt.legend(["Left","Stay"]) plt.xticks(rotation=0)
Output:
Here, from the chart, it’s visible that the ones who are not promoted are leaving the company more as compared to the ones who are promoted which is also an obvious thing likely to happen.
Function (Operation/ Support/ Sales)
func_dict = attrdata["Function"].value_counts() func_dict
Output:
Operation 831 Support 52 Sales 12 Name: Function, dtype: int64
attrdata['Function'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Functions in organization")
Output:
Now, we can see that majority of the function performed by employees are Operation itself then support and at the last it’s sales.
Now, let’s figure out that how function could be the reason for employees to leave the company or to stay in.
#Create a plot for crosstab pd.crosstab(attrdata['Function'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6)) plt.title("Stay/Left vs Function") plt.xlabel("Stay/Left") plt.ylabel("No. of people who left/stay based on function of organization") plt.legend(["Left","Stay"]) plt.xticks(rotation=0)
Output:
Here, in the chart, we can see that the maximum number of employees are in the operation section and a high number of employees in the same section are staying in the company.
Hiring (Direct/ Agency/ Employee referral)
Hiring_dict = attrdata["Hiring Source"].value_counts() Hiring_dict
Output:
Direct 708 Agency 116 Employee Referral 71 Name: Hiring Source, dtype: int64
Marital Status (Single/ Married/ Seperated/ Div./ NTBD)
Marital_dict = attrdata["Marital Status"].value_counts() print(Marital_dict)
Output:
Single 533 Marr. 356 Sep. 2 Div. 2 NTBD 2 Name: Marital Status, dtype: int64
Employee group
Emp_dict = attrdata["Emp. Group"].value_counts() Emp_dict['other group'] = 1 print(Emp_dict)
Output:
B1 537 B2 275 B3 59 B0 8 B4 7 B5 4 B7 2 D2 1 C3 1 B6 1 other group 1 Name: Emp. Group, dtype: int64
Job role match (Yes/ No)
job_dict = attrdata["Job Role Match"].value_counts() job_dict
Output:
Yes 480 No 415 Name: Job Role Match, dtype: int64
attrdata['Job Role Match'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Job Role Match")
Output:
Now, we can see that majority of the employees have their correct role in Job.
#Create a plot for crosstab pd.crosstab(attrdata['Job Role Match'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6)) plt.title("Stay/Left vs Job Role Match") plt.xlabel("Stay/Left") plt.legend(["Left","Stay"]) plt.xticks(rotation=0)
Output:
Here, in the above chart, we can see that the number of employees who got the correct job role is staying in the company rather than the ones who don’t have their right job role.
Tenure group
tenure_dict = attrdata["Tenure Grp."].value_counts() print(tenure_dict)
Output:
> 1 & < =3 626 < =1 269 Name: Tenure Grp., dtype: int64
Now let’s visualize some continuous data
# Its Age vs stay/left sns.jointplot(x='Stay/Left',y='Age in YY.',data=attrdata)
Output:
In the above graph, we can see that the ones who are having more age are staying back in the company rather than the ones who have comparatively less age.
sns.jointplot(x='Stay/Left',y='Experience (YY.MM)',data=attrdata)
Output:
Here in the above graph, we can see that the employees who have got more experience will be staying back in the company rather than the ones who have comparatively less experience.
Here, first, we are trying to get the correlation between variables where the dataset is not processed that’s why we are not able to see the results in the manner we want to, but in the latter half of the article, we will see the better correlation plot with the help of processed data.
# Let's make our correlation matrix visual corr_matrix=attrdata.corr() fig,ax=plt.subplots(figsize=(15,10)) ax=sns.heatmap(corr_matrix, annot=True, linewidths=0.5, fmt=".2f" )
Output:
Data cleaning
Encoding the locations column (categorized)
Build a new dictionary (location) to be used to categorize data columns after values are encoded. Here, in location_dict_new we are using integer values instead of the actual region name so that our machine learning model could interpret it.
location_dict = attrdata["Location"].value_counts() print(location_dict) location_dict_new = { 'Chennai': 7, 'Noida': 6, 'Bangalore': 5, 'Hyderabad': 4, 'Pune': 3, 'Madurai': 2, 'Lucknow': 1, 'other place': 0, } print(location_dict_new)
Output:
Chennai 255 Noida 236 Bangalore 210 Hyderabad 62 Pune 55 Madurai 29 Lucknow 20 Nagpur 14 Vijayawada 6 Mumbai 4 Gurgaon 3 Kolkata 1 Name: Location, dtype: int64 {'Chennai': 7, 'Noida': 6, 'Bangalore': 5, 'Hyderabad': 4, 'Pune': 3, 'Madurai': 2, 'Lucknow': 1, 'other place': 0}
Now we will make a function for the location column to make a new column where encoded location values will be there because our machine learning algorithm will only understand int/float values.
def location(x): if str(x) in location_dict_new.keys(): return location_dict_new[str(x)] else: return location_dict_new['other place'] data_l = attrdata["Location"].apply(location) attrdata['New Location'] = data_l attrdata.head()
Output:
get_dummies()
Pandas get_dummies() function is used for manipulating data, this function is used to convert the categorical values to dummy variables and the same thing has been done with:
- Function
- Hiring Source
- New Marital
- New Gender
- Tenure group
gen = pd.get_dummies(attrdata["Function"]) gen.head()
Output:
hr = pd.get_dummies(attrdata["Hiring Source"]) hr.head()
Output:
Marital Status
Here, in Mar() function we are using Maritial dictionary keys to convert those categorical values into acceptable type values for our ML models.
def Mar(x): if str(x) in Marital_dict.keys() and Marital_dict[str(x)] > 100: return str(x) else: return 'other status' data_l = attrdata["Marital Status"].apply(Mar) attrdata['New Marital'] = data_l attrdata.head()
Output:
Using the get_dummies to function for New Marital we are converting categorical values into dummy variables
Mr = pd.get_dummies(attrdata["New Marital"]) Mr.head()
Output:
Promoted/Not Promoted
Here, with the help of Promoted function, we are converting Promoted and Non promoted values into 1 and 0 respectively for encoding purposes.
def Promoted(x): if x == 'Promoted': return int(1) else: return int(0) data_l = attrdata["Promoted/Non Promoted"].apply(Promoted) attrdata['New Promotion'] = data_l attrdata.head()
Output:
Employee Group
Here first, we are creating a dictionary for the employee group and tagging each group to the respective integer values, later we are creating an emp() function where the encoding of the categorical values is done – similar to marital status.
Emp_dict_new = { 'B1': 4, 'B2': 3, 'B3': 2, 'other group': 1, }
def emp(x): if str(x) in Emp_dict_new.keys(): return str(x) else: return 'other group' data_l = attrdata["Emp. Group"].apply(emp) attrdata['New EMP'] = data_l emp = pd.get_dummies(attrdata["New EMP"]) attrdata.head()
Output:
Job Role Match
Here, we are using the Job() function where categorical values are Yes and No which needs to be converted into integer values i.e. 1/0 then we are assigning the New Job Role Match.
def Job(x): if x == 'Yes': return int(1) else: return int(0) data_l = attrdata["Job Role Match"].apply(Job) attrdata['New Job Role Match'] = data_l attrdata.head()
Output:
Gender
Here, we are using the Gen() function using gender_dict (dictionary) which will be encoded first using the dictionary keys, and then the changes will be applied to the dataset based on changes that are done.
def Gen(x): if x in gender_dict.keys(): return str(x) else: return 'other' data_l = attrdata["Gender "].apply(Gen) attrdata['New Gender'] = data_l attrdata.head()
Output:
get_dummies() function for the same purposes for New gender and Tenure groups.
gend = pd.get_dummies(attrdata["New Gender"]) gend.head()
Output:
tengrp = pd.get_dummies(attrdata["Tenure Grp."]) tengrp.head()
Output:
Now, let’s concatenate the columns which are being cleaned, sorted, and manipulated by us as processed data.
dataset = pd.concat([attrdata, hr, Mr, emp, tengrp, gen, gend], axis = 1) dataset.head()
Output:
dataset.columns
Output:
Index(['table id', 'name', 'phone number', 'Location', 'Emp. Group', 'Function', 'Gender ', 'Tenure', 'Tenure Grp.', 'Experience (YY.MM)', 'Marital Status', 'Age in YY.', 'Hiring Source', 'Promoted/Non Promoted', 'Job Role Match', 'Stay/Left', 'New Location', 'New Marital', 'New Promotion', 'New EMP', 'New Job Role Match', 'New Gender', 'Agency', 'Direct', 'Employee Referral', 'Marr.', 'Single', 'other status', 'B1', 'B2', 'B3', 'other group', ' 1 & < =3', 'Operation', 'Sales', 'Support', 'Female', 'Male', 'other'], dtype='object')
Let’s drop the columns which are not important anymore
dataset.drop(["table id", "name", "Marital Status","Promoted/Non Promoted","Function","Emp. Group","Job Role Match","Location" ,"Hiring Source","Gender ", 'Tenure', 'New Gender', 'New Marital', 'New EMP'],axis=1,inplace=True) dataset1 = dataset.drop(['Tenure Grp.', 'phone number'], axis = 1) dataset1.columns
Output:
Index(['Experience (YY.MM)', 'Age in YY.', 'Stay/Left', 'New Location', 'New Promotion', 'New Job Role Match', 'Agency', 'Direct', 'Employee Referral', 'Marr.', 'Single', 'other status', 'B1', 'B2', 'B3', 'other group', ' 1 & < =3', 'Operation', 'Sales', 'Support', 'Female', 'Male', 'other'], dtype='object')
As I mentioned, this is the correlation plot on the processed dataset.
# Let's make our correlation matrix visual corr_matrix=dataset1.corr() fig,ax=plt.subplots(figsize=(15,10)) ax=sns.heatmap(corr_matrix, annot=True, linewidths=0.5, fmt=".2f" )
Output:
Let’s see our target column
# Target """ def Target(x): if x in "Stay": return False else: return True data_l = dataset1["Stay/Left"].apply(Target) dataset1['Stay/Left'] = data_l """ dataset1['Stay/Left'].head()
Output:
1 Stay 2 Stay 3 Stay 4 Stay 5 Stay Name: Stay/Left, dtype: object
Saving the cleaned dataset into another CSV file
dataset1.to_csv("processed table.csv")
Now, from the processed data we have to separate the features and target column again.
dataset = pd.read_csv("processed table.csv") dataset = pd.DataFrame(dataset) y = dataset["Stay/Left"] X = dataset.drop("Stay/Left",axis=1)
Splitting data – Train test split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=4) X_train.head()
Output:
Model Development
from sklearn.linear_model import LogisticRegression from sklearn.tree import DecisionTreeClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.naive_bayes import GaussianNB from sklearn.neighbors import KNeighborsClassifier from sklearn import svm
Initializing the models
- Logistic Regression : C: Inverse of regularization strength (float), random state: (int), solver: sag,saga,liblinear (Here, we are using liblinear).
- Decision trees: Default parameters
- Random forest: Default parameters
- Gaussian Naive Bayes: Default parameters
- K-nearest neighbors: n_neighbors=3 – we can have another number of neighbors too.
- Support vector machines: kernel can be linear, polynomial, RBF, sigmoid. Here we are using a linear kernel function.
lr=LogisticRegression(C = 0.1, random_state = 42, solver = 'liblinear') dt=DecisionTreeClassifier() rm=RandomForestClassifier() gnb=GaussianNB() knn = KNeighborsClassifier(n_neighbors=3) svm = svm.SVC(kernel='linear')
Now, from one block of code, we will check the accuracy of all the model
for a,b in zip([lr,dt,knn,svm,rm,gnb],["Logistic Regression","Decision Tree","KNN","SVM","Random Forest","Naive Bayes"]): a.fit(X_train,y_train) prediction=a.predict(X_train) y_pred=a.predict(X_test) score1=accuracy_score(y_train,prediction) score=accuracy_score(y_test,y_pred) msg1="[%s] training data accuracy is : %f" % (b,score1) msg2="[%s] test data accuracy is : %f" % (b,score) print(msg1) print(msg2)
Output:
[Logistic Regression] training data accuracy is : 0.891061 [Logistic Regression] test data accuracy is : 0.877095 [Decision Tree] training data accuracy is : 1.000000 [Decision Tree] test data accuracy is : 0.849162 [KNN] training data accuracy is : 0.804469 [KNN] test data accuracy is : 0.586592 [SVM] training data accuracy is : 0.878492 [SVM] test data accuracy is : 0.865922 [Random Forest] training data accuracy is : 1.000000 [Random Forest] test data accuracy is : 0.888268 [Naive Bayes] training data accuracy is : 0.870112 [Naive Bayes] test data accuracy is : 0.826816
Model Scores (accuracy)
model_scores={'Logistic Regression':lr.score(X_test,y_test), 'KNN classifier':knn.score(X_test,y_test), 'Support Vector Machine':svm.score(X_test,y_test), 'Random forest':rm.score(X_test,y_test), 'Decision tree':dt.score(X_test,y_test), 'Naive Bayes':gnb.score(X_test,y_test) } model_scores
Output:
{'Logistic Regression': 0.8770949720670391, 'KNN classifier': 0.5865921787709497, 'Support Vector Machine': 0.8659217877094972, 'Random forest': 0.888268156424581, 'Decision tree': 0.8491620111731844, 'Naive Bayes': 0.8268156424581006}
Here, we can see that Logistic regression and Random forest have the best accuracy.
Classification Report of Random forest
from sklearn.metrics import classification_report rm_y_preds = rm.predict(X_test) print(classification_report(y_test,rm_y_preds))
Output:
precision recall f1-score support Left 0.81 0.77 0.79 61 Stay 0.88 0.91 0.90 118 accuracy 0.86 179 macro avg 0.85 0.84 0.84 179 weighted avg 0.86 0.86 0.86 179
Classification Report of Logistic Regression
from sklearn.metrics import classification_report lr_y_preds = lr.predict(X_test) print(classification_report(y_test,lr_y_preds))
Output:
precision recall f1-score support Left 0.81 0.84 0.82 61 Stay 0.91 0.90 0.91 118 accuracy 0.88 179 macro avg 0.86 0.87 0.86 179 weighted avg 0.88 0.88 0.88 179
Model Comparison
Based on the accuracy
model_compare=pd.DataFrame(model_scores,index=['accuracy']) model_compare
Output:
Visualize the accuracy of each model
model_compare.T.plot(kind='bar') # (T is here for transpose)
Output:
Yes, we can see that Random Forest has 1% better accuracy than Logistic regression but Random Forest is an overfitted model hence we will select Logistic regression.
Feature importance
These “coef’s” tell how much and in what way did each one of them contribute to predicting the target variable
# Logistic regression
feature_dict=dict(zip(dataset.columns,list(lr.coef_[0]))) feature_dict
This is a type of Model-driven Exploratory data analysis.
Output:
{'Unnamed: 0': -0.000369638613255323, 'Experience (YY.MM)': 0.13976826898148667, 'Age in YY.': -0.01962203036690505, 'Stay/Left': 0.024627352503955716, 'New Location': 0.09666512057880872, 'New Promotion': 2.7533361395664873, 'New Job Role Match': -0.31312348489873837, 'Agency': -0.020270885128439962, 'Direct': 0.2047872081147744, 'Employee Referral': 0.38796318779035893, 'Marr.': -0.5042922677790985, 'Single': -0.012278081923656183, 'other status': -0.22031197156191443, 'B1': 0.17649315194124499, 'B2': -0.09367965851449515, 'B3': 0.008891316222766718, 'other group': -0.12993373252552307, ' 1 & < =3': -0.05949484149777713, 'Operation': -0.018200464251483424, 'Sales': -0.05091185616314798, 'Support': 0.12249364800096194, 'Female': -0.1951606022872554, 'Male': -0.055940207626112265}
Visualize feature importance
feature_df=pd.DataFrame(feature_dict,index=[0]) feature_df.T.plot(kind="bar",legend=False,title="Feature Importance")
Output:
As we can see that “New promotion” column has the highest feature importance.
Saving the best model
Approach -1
Logistic Regression model because it has the best accuracy as well it is neither overfitted nor under fitted
import pickle # Save the trained model as a pickle string. saved_model = pickle.dumps(lr) # Load the pickled model lr_from_pickle = pickle.loads(saved_model) # Use the loaded pickled model to make predictions lr_from_pickle.predict(X_test)
Output:
array(['Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Left', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay'], dtype=object)
Approach – 2
# loading dependency import joblib # saving our model - model - model , filename - model_lr joblib.dump(lr , 'model_lr') # opening the file- model_jlib m_jlib = joblib.load('model_lr') # check prediction m_jlib.predict(X_test) # similar output
Output:
Okay, so that’s a wrap from my side!
Endnotes
Thank you for reading my article 🙂
I hope you guys will like this step-by-step learning of Employee Attrition prediction using machine learning.
Here’s the repo link to this article.
Here you can access my other articles which are published on Analytics Vidhya -Blogathon (link)
If you got stuck somewhere you can connect with me on LinkedIn, refer to this link
About me
Greeting to everyone, I’m currently working in TCS and previously I worked as a Data Science Associate Analyst in Zorba Consulting India. Along with full-time work, I’ve got an immense interest in the same field i.e. Data Science along with its other subsets of Artificial Intelligence such as, Computer Vision, Machine learning, and Deep learning feel free to collaborate with me on any project on the above-mentioned domains (LinkedIn).