Data exploration is an important part of the machine learning pipeline. Before we decide which models to train, and how many, we need an idea of what our data contains. The pandas library is equipped with a number of useful functions for this very purpose, and value_counts() is one of them. This function returns the count of each unique value in a pandas Series, such as a single column of a dataframe. However, most of the time we end up using value_counts() with its default parameters. In this brief article, I’ll show you how to achieve more by altering those defaults.
value_counts()
The value_counts() method returns a Series containing the counts of unique values. This means that, for any column in a dataframe, this method returns the number of times each unique entry occurs in that column.
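To make the shape of the return value concrete, here is a minimal sketch on a tiny, made-up Series (the values are invented for illustration and are not taken from the Titanic data). It shows that the unique values become the index of the result and the counts become its values.

```python
import pandas as pd

# Made-up data for illustration; not taken from the Titanic dataset
ports = pd.Series(["S", "C", "S", "Q", "S", "C"])

counts = ports.value_counts()

# The result is itself a Series: unique values form the index,
# and the number of occurrences of each one forms the values
print(type(counts).__name__)  # Series
print(counts.to_dict())       # {'S': 3, 'C': 2, 'Q': 1}
```

Because the result is an ordinary Series, a single count can be looked up by label, for example counts["S"].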
Syntax
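Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
Parameters
normalize: if True, return relative frequencies instead of counts.
sort: if True, sort the result by count.
ascending: if True, sort in ascending rather than descending order.
bins: group numeric values into this many equal-width intervals instead of counting each distinct value.
dropna: if True, leave NaN out of the result.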
Basic usage
Let’s see the basic usage of this method on a dataset. I’ll be using the Titanic dataset for the demo. I have also published an accompanying notebook on Kaggle, in case you want to go straight to the code.
Importing the dataset
Let’s begin by importing the necessary libraries and the dataset. This is a fundamental step in every data analysis process.
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Reading in the data
train = pd.read_csv('../input/titanic/train.csv')
Explore the first few rows of the dataset
train.head()
Calculating the number of null values
train.isnull().sum()
Thus, the Age, Cabin and Embarked columns have null values. With this, we have a basic idea of what our dataset looks like. Let’s now see how we can use value_counts() in five different ways to explore this data further.
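As a side note on the null-count step above, the following sketch shows why chaining isnull() and sum() yields one count per column. The mini-frame is hypothetical: its column names mimic three Titanic columns, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mimicking three Titanic columns; values are invented
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Embarked": ["S", "C", None, "Q"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() marks each missing cell as True; sum() then counts the Trues per column
null_counts = df.isnull().sum()
print(null_counts.to_dict())  # {'Age': 2, 'Embarked': 1, 'Fare': 0}
```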
1. value_counts() with default parameters
Let’s call value_counts() on the Embarked column of the dataset. This will return the number of occurrences of each unique value in this column.
train['Embarked'].value_counts()
-------------------------------------------------------------------
S    644
C    168
Q     77
The function returns the counts of all unique values in the given column in descending order, excluding null values. We can quickly see that most passengers embarked from Southampton, followed by Cherbourg and then Queenstown.
2. value_counts() with relative frequencies of the unique values
Sometimes a percentage is a better criterion than a raw count. By setting normalize=True, the object returned will contain the relative frequencies of the unique values. The normalize parameter is set to False by default.
train['Embarked'].value_counts(normalize=True)
-------------------------------------------------------------------
S    0.724409
C    0.188976
Q    0.086614
Knowing that 72% of the passengers with a known port of embarkation boarded at Southampton is often more informative than knowing that 644 of them did.
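If percentages read more naturally than fractions, the normalized result can be scaled and rounded. The sketch below uses an invented sample rather than the real Embarked column; the variable names are only illustrative.

```python
import pandas as pd

# Invented sample; the real Embarked column has far more rows
ports = pd.Series(["S", "S", "S", "C", "C", "Q"])

shares = ports.value_counts(normalize=True)

# Relative frequencies sum to 1; multiply by 100 to read them as percentages
percentages = (shares * 100).round(1)
print(percentages.to_dict())  # {'S': 50.0, 'C': 33.3, 'Q': 16.7}
```

Note that with the default dropna=True, the fractions are computed over the non-null entries only.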
3. value_counts() in ascending order
The Series returned by value_counts() is in descending order by default. We can reverse the order by setting the ascending parameter to True.
train['Embarked'].value_counts(ascending=True)
-------------------------------------------------------------------
Q     77
C    168
S    644
4. value_counts() displaying the NaN values
By default, null values are excluded from the result. They can be included easily by setting the dropna parameter to False.
train['Embarked'].value_counts(dropna=False)
-------------------------------------------------------------------
S      644
C      168
Q       77
NaN      2
We can easily see that there are two null values in the column.
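One quick way to check the effect of dropna is to compare the totals of the two results. This is a small sketch on an invented Series with two missing entries; the difference between the totals should equal the number of nulls.

```python
import numpy as np
import pandas as pd

# Invented sample with two missing entries
ports = pd.Series(["S", "C", np.nan, "S", np.nan])

without_nan = ports.value_counts()            # dropna=True is the default
with_nan = ports.value_counts(dropna=False)   # NaN gets its own row

# The two totals differ by exactly the number of missing entries
print(without_nan.sum(), with_nan.sum())  # 3 5
print(ports.isna().sum())                 # 2
```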
5. value_counts() to bin continuous data into discrete intervals
This is one of my favorite uses of the value_counts() function, and an underutilized one too. value_counts() can be used to bin continuous data into discrete intervals with the help of the bins parameter. This option works only with numerical data. It is similar to the pd.cut function. Let’s see how it works using the Fare column.
# applying value_counts on a numerical column without the bins parameter
train['Fare'].value_counts()
This doesn’t convey much information, as the output contains a separate category for every distinct value of Fare. Instead, let’s group them into seven bins.
train['Fare'].value_counts(bins=7)
Binning makes the distribution much easier to read. We can easily see that most passengers paid less than 73.19 for their ticket. We can also see that five bins would have served our purpose, since no passenger falls into the last two bins.
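To illustrate the similarity to pd.cut mentioned above, here is a rough sketch comparing the one-step and two-step versions on a handful of invented fares. It assumes equal-width bins and include_lowest=True on the pd.cut side, so that the smallest value lands in the first interval.

```python
import pandas as pd

# Invented fares for illustration; not the real Fare column
fares = pd.Series([7.25, 8.05, 13.0, 26.55, 71.28, 263.0, 512.33])

# One step: bins=3 splits the observed range into three equal-width intervals
binned = fares.value_counts(bins=3).sort_index()

# Two steps: cut into intervals first, then count how many values fall in each
manual = pd.cut(fares, bins=3, include_lowest=True).value_counts().sort_index()

print(binned)
print(binned.tolist() == manual.tolist())  # True
```

Sorting by index lists the intervals from lowest to highest fare, which is often easier to read than the default ordering by count.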
Thus, we can see that value_counts() is a handy tool, and we can do some interesting analysis with this single line of code.