Friday, November 15, 2024
Google search engine
HomeLanguagesHow to use pandas cut() and qcut()?

How to use pandas cut() and qcut()?

Pandas is an open-source library that is made mainly for working with relational or labeled data both easily and intuitively. It provides various data structures and operations for manipulating numerical data and time series.

In this tutorial, we’ll look at pandas’ intelligent cut and qcut functions. Basically, we use cut and qcut to convert a numerical column into a categorical one, perhaps to make it better suited for a machine learning model (in case of a fairly skewed numerical column), or just for better analyzing the data at hand. Now, rather than blurting out technical definitions of cut and qcut, we’d be better off seeing what both these functions are good at and how to use them.

We’ll first import the necessary data manipulating libraries. 
 

Python3




import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


 
 

We’ll be using the CarDekho dataset, containing data about used cars listed on the platform. You can find the dataset here

 

  • ‘Year’ is the year in which the car was purchased.
  • ‘Selling_Price’ is the price the owner wants to sell the car at.
  • ‘Present_Price’ is the current ex-showroom price of the car.
  • ‘Owner’ defines the number of owners the car has previously had, before this car was put up on the platform.

 

Rest of the columns are pretty self explanatory.
 

 

Python3




df = pd.read_csv('cardekho.csv')
df.head()


 
 

Output:

 

df_head

 

Python3




# no missing values in the dataset
df.info()


df_info

We’ll infuse a missing value to better demonstrate how cut and qcut would handle an ‘imperfect’ dataset.
 

Python3




df.loc[0, 'Year'] = np.nan
 
# these are the 'unique' years in
# the data
np.array(sorted(df.Year.unique()))


Output:

unique_yrs

pd.cut()

We can use the ‘cut’ function in broadly 2 ways: by specifying the number of bins directly and let pandas do the work of calculating equal-sized bins for us, or we can manually specify the bin edges as we desire.
 

Python3




# right = True, by default
pd.cut(df.Year, bins=3, right=True).head()


Output:

cut1

When we specified bins=3, pandas saw that the year range in the data is 2003 to 2018, hence appropriately cut it into 3 equal-width bins of 5 years each: [(2002.985, 2008.0] < (2008.0, 2013.0] < (2013.0, 2018.0]. Note that pandas automatically took the lower bound value of  the first category (2002.985) to be a fraction less that the least value in the ‘Year’ column (2003), to include the year 2003 in the results as well, because you see, the lower bounds of the bins are open ended, while the upper bounds are closed ended (as right=True)
 

Note: Did you notice that the NaN values are kept as NaNs in the output result as well? When we specify right=False, the left bounds are now closed ended, while right bounds get open ended.
 

Python3




pd.cut(df.Year, bins=3, right=False).head()


 
 

Output:

 

cut2

 

Instead of getting the intervals back, we can specify the ‘labels’ parameter as a list for better analysis.
 

 

Python3




pd.cut(df.Year, bins=3,
       labels=['old', 'medium', 'new']).head()


 
 

Output:

 

cut3

 

We will assign this series back to the original dataframe:
 

 

Python3




df['Yr_cut'] = pd.cut(df.Year, bins=3,
                      labels=['old', 'medium', 'new'])
df.head()


 
 

Output:

 

cut4df

 

If we specify labels=False, instead of bin labels, we will get numeric representation of the bins: Here, 0 represents old, 1 is medium and 2 is new.
 

 

Python3




pd.cut(df.Year, bins=3, labels=False).head()


 
 

Output:

 

cut5

 

Just to see how many values fall in each bin:
 

 

Python3




df['Yr_cut'].value_counts()


 
 

Output:

 

cut6

 

And just because drawing a graph pleases more people than offends..
 

 

Python3




df['Yr_cut'].value_counts().plot(kind='barh')
plt.show()


 
 

Output:

 

cut bar plot

 

Now, if we need the bin intervals along with the discretized series at one go, we specify retbins=True. Pandas will give us back a tuple containing 2 elements: the series, and the bin intervals. We will use tuple unpacking to grab both outputs.
 

 

Python3




cut_series, cut_intervals = pd.cut(df.Year,
                                   bins=3,
                                   retbins=True)
 
print("Cut series:")
print(cut_series.head())
print()
print("Cut intervals: ", cut_intervals)


 
 

Output:

 

cut8

 

As mentioned earlier, we can also specify bin edges manually by passing in a list:
 

 

Python3




pd.cut(df.Year, bins=[2003,
                      2007,
                      2010,
                      2015,
                      2018],
       include_lowest=True).head()


 
 

Output:

 

cut9

 

Here, we had to mention include_lowest=True. Can you guess why? Because by default ‘include_lowest’ parameter is set to False, and hence when pandas sees the list that we passed, it will exclude 2003 from calculations. For the eagle-eyed, we could have used any value less than 2003 as well, like 1999 or 2002 or 2002.255 etc and gone ahead with the default setting of include_lowest=False.
 

 

pd.qcut()

 

Qcut (quantile-cut) differs from cut in the sense that, in qcut, the number of elements in each bin will be roughly the same, but this will come at the cost of differently sized interval widths. On the other hand, in cut, the bin edges were equal sized (when we specified bins=3) with uneven number of elements in each bin or group. Also, cut is useful when you know for sure the interval ranges and the bins, 

 

For example, if binning an ‘age’ column, we know infants are between 0 and 1 years old, 1-12 years are kids, 13-19 are teenagers, 20-60 are working class grownups, and 60+ senior citizens. So we can appropriately set bins=[0, 1, 12, 19, 60, 140] and labels=[‘infant’, ‘kid’, ‘teenager’, ‘grownup’, ‘senior citizen’]. In qcut, when we specify q=5, we are telling pandas to cut the Year column into 5 equal quantiles, i.e. 0-20%, 20-40%, 40-60%, 60-80% and 80-100% buckets/bins.
 

 

Python3




pd.qcut(df.Year, q=5).head(7)


 
 

Output:

 

qcut1

 

We’ll assign this series to the dataframe.
 

 

Python3




df['Yr_qcut'] = pd.qcut(df.Year, q=5,
                        labels=['oldest',
                                'not so old',
                                'medium',
                                'newer',
                                'latest'])
df.head()
 
df['Yr_qcut'].value_counts().plot(kind='barh')
plt.show()
 
qcut_series, qcut_intervals = pd.qcut(df.Year, q=5,
                                      labels=['oldest',
                                              'not so old',
                                              'medium',
                                              'newer',
                                              'latest'],
                                      retbins=True)
 
qcut_series.value_counts()


 
 

Output:

 

qcut2qcut3gqcut4

 

Now just to highlight the fact that q=5 indeed implies splitting values into 5 equal quantiles of 20% each, we’ll manually specify the quantiles, and get the same bin distributions as above.
 

 

Python3




pd.qcut(df.Year,
        q=[0, 0.2, 0.4,
           0.6, 0.8, 1.0],
        labels=['oldest',
                'not so old',
                'medium',
                'newer',
                'latest']).value_counts()


 
 

Output:

 

qcut5

 

We’ll now see the qcut intervals array we got using tuple unpacking: 
 

 

Python3




qcut_intervals


 
 

Output:

 

qcut6

 

You see? Here in qcut, the bin edges are of unequal widths, because it is accommodating 20% of the values in each bucket, and hence it is calculating the bin widths on its own to achieve that objective.
 

 

Python3




# demonstrating with some random quantiles
pd.qcut(df.Year, q=[0,0.15,0.35,
                    0.51,0.78,1]).head()


 
 

Output:

 

qcut7

 

Now it is binning the data into our custom made list of quantiles of 0-15%, 15-35%, 35-51%, 51-78% and 78-100%.
With qcut, we’re answering the question of “which data points lie in the first 15% of the data, or in the 51-78 percentile range etc. Additionally, we can also use pandas’ interval_range, or numpy’s linspace and arange to generate a list of interval ranges and feed it to cut and qcut as the bins and q parameter respectively.
 

 

Python3




pd.cut(df.Year,
       bins=pd.interval_range(start=2002.99,
                              end=2018,
                              periods=3)).head()


 
 

Output:

 

qcut8

 

Using linspace: 
 

 

Python3




# basically dividing data into 3
# quantiles
pd.qcut(df.Kms_Driven,
        q=np.linspace(0,1,4)).head()


Output:

qcut9

Using arange:

Python3




pd.qcut(df.Selling_Price,
        q=np.arange(0, 1.01, 0.25),
        labels=['inexpensive',
                'average',
                'high cost',
                'over budget']).head()


 
 

Output:

 

qcut10

 

Sometimes when we ask pandas to calculate the bin edges for us, you may run into an error which looks like: 
ValueError: Bin edges must be unique error. This implies that while calculating the bin intervals, pandas found that some bin edges were the same on both ends, like an interval of (2014, 2014] and hence it raised that error. This usually happens when the number of bins is large and the value range of the particular column is small. You can silence this error by passing the argument of duplicates=’drop’.
 

 

RELATED ARTICLES

Most Popular

Recent Comments