Sunday, November 17, 2024

Pandas Cut – Continuous to Categorical

Continuous numerical data, often highly skewed, is common in data analysis. Analysis can become much simpler once such data is converted from continuous to discrete values. There are many ways to do this conversion; one of them is the built-in pandas cut function, which bins continuous numerical values into categories. It works with three main pieces:

  1. First is the input, which must be 1-dimensional, for example an array or a single DataFrame column.
  2. Next are the bins, given as a sequence of edges that mark the boundaries of each bin. Each number is the start of one bin and the following number is its end. Passing the edges explicitly gives full control over where every bin begins and ends.
  3. Last are the labels. The number of labels must always be one fewer than the number of bin edges, so that there is exactly one label per bin.
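
The relationship between edges and labels can be checked directly. The following is a minimal sketch with made-up ages (not data from this article): four edges define three bins, so three labels are accepted, while a mismatched label count raises a ValueError.

```python
import pandas as pd

ages = pd.Series([2, 10, 40])   # illustrative values, assumed for this sketch
edges = [0, 3, 17, 63]          # 4 edges -> 3 bins

# Three labels match the three bins.
ok = pd.cut(ages, bins=edges, labels=['Baby/Toddler', 'Child', 'Adult'])
print(ok.tolist())

# Two labels for three bins: pandas rejects the call.
try:
    pd.cut(ages, bins=edges, labels=['Child', 'Adult'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```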

Note: NA values in the input stay NA in the result. Values that fall outside the bin edges also become NA in the resulting categories.
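
A short sketch of this behaviour, using assumed values chosen only for illustration: one value inside the bins, one missing value, one value above the last edge, and one value sitting exactly on the first edge (which is excluded by default).

```python
import numpy as np
import pandas as pd

# Illustrative values: in range, missing, too large, on the excluded left edge
values = pd.Series([5, np.nan, 150, 0])

binned = pd.cut(values, bins=[0, 50, 100], labels=['low', 'high'])
print(binned)

# Only the first value receives a category; the rest are NaN.
print(binned.isna().tolist())
```

Counting the NaN entries after binning is a quick way to spot edges that do not cover the full range of the data.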

The pandas cut function does not guarantee any particular distribution of values across the bins. In fact, bins can be defined in such a way that some of them contain no values at all.
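
To see an empty bin in practice, here is a minimal sketch with assumed scores that cluster at the low end. Because the result is categorical, value_counts() also reports the bins that received nothing.

```python
import pandas as pd

scores = pd.Series([12, 15, 18, 95])  # illustrative values, mostly small

# Four equal-width bins defined by hand
binned = pd.cut(scores, bins=[0, 25, 50, 75, 100])

# sort_index() lists the bins in their natural order
counts = binned.value_counts().sort_index()
print(counts)
```

The two middle bins come out with a count of 0. If roughly equal-sized groups are needed instead, pandas.qcut, which bins by quantiles, may be a better fit.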

Syntax: pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

Parameters:

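  • x: Input array to be binned. Must be 1-dimensional.
  • bins: Defines the bin boundaries. Can be an int (number of equal-width bins), a sequence of bin edges, or an IntervalIndex.
  • right: Boolean indicating whether each bin includes its right edge. Default value is True, which gives bins such as (0, 3]; with False the bins include the left edge instead, such as [0, 3).
  • labels: Labels for the returned bins. Either an array of labels or False; with False, only the integer index of each bin is returned.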
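
The effect of right and labels is easiest to see with values that sit exactly on the edges. The sketch below uses assumed heights chosen for that purpose.

```python
import pandas as pd

heights = pd.Series([150, 157, 169])  # illustrative values, each exactly on an edge
edges = [150, 157, 169, 180]

# Default: bins are (150, 157], (157, 169], (169, 180]
right_closed = pd.cut(heights, bins=edges)

# right=False: bins are [150, 157), [157, 169), [169, 180)
left_closed = pd.cut(heights, bins=edges, right=False)

# labels=False: return the integer index of each bin instead of intervals
codes = pd.cut(heights, bins=edges, right=False, labels=False)

print(right_closed.tolist())  # 150 has no bin here, so it becomes NaN
print(left_closed.tolist())
print(codes.tolist())
```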

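Return Value: Returns a Series of categorical dtype when x is a Series, a Categorical for other array-like input, or a NumPy array of integer bin indices when labels=False. With retbins=True the bins are returned as well, as a NumPy array or an IntervalIndex.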
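
A small sketch of how the return type follows the input and the arguments. The data and edges are assumed for illustration only.

```python
import numpy as np
import pandas as pd

data = [1, 7, 5, 4, 6, 3]  # illustrative values
edges = [0, 2, 4, 8]

from_series = pd.cut(pd.Series(data), bins=edges)            # Series in, Series out
from_array = pd.cut(np.array(data), bins=edges)              # array in, Categorical out
as_codes = pd.cut(np.array(data), bins=edges, labels=False)  # integer bin indices

# retbins=True returns a (result, bins) pair
binned, returned_edges = pd.cut(np.array(data), bins=edges, retbins=True)

print(type(from_series).__name__, from_series.dtype)
print(type(from_array).__name__)
print(type(as_codes).__name__, as_codes)
print(returned_edges)
```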

Example 1: Let’s say we have a column ‘Age’ of 14 numbers between 1 and 100 and we wish to separate the data into 4 categories:

'Baby/Toddler' :- 0 to 3 years
'Child' :- 4 to 17 years
'Adult' :- 18 to 63 years
'Elderly' :- 64 to 99 years

Python3




# Importing the pandas library
import pandas as pd

# Creating a dummy DataFrame of 14 ages
# ranging from 1-100
df = pd.DataFrame({'Age': [42, 15, 67, 55, 1, 29, 75, 89, 4,
                           10, 15, 38, 22, 77]})
  
# Printing DataFrame before converting Continuous
# to Categories
print("Before: ")
print(df)
  
# A column of name 'Label' is created in DataFrame
# Categorizing Age into 4 Categories
# Baby/Toddler: (0,3], 0 is excluded & 3 is included
# Child: (3,17], 3 is excluded & 17 is included
# Adult: (17,63], 17 is excluded & 63 is included
# Elderly: (63,99], 63 is excluded & 99 is included
df['Label'] = pd.cut(x=df['Age'], bins=[0, 3, 17, 63, 99],
                     labels=['Baby/Toddler', 'Child', 'Adult',
                             'Elderly'])
  
# Printing DataFrame after converting Continuous to
# Categories
print("After: ")
print(df)
  
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())


Output: 

Before: 
    Age
0    42
1    15
2    67
3    55
4     1
5    29
6    75
7    89
8     4
9    10
10   15
11   38
12   22
13   77
After: 
    Age         Label
0    42         Adult
1    15         Child
2    67       Elderly
3    55         Adult
4     1  Baby/Toddler
5    29         Adult
6    75       Elderly
7    89       Elderly
8     4         Child
9    10         Child
10   15         Child
11   38         Adult
12   22         Adult
13   77       Elderly
Categories: 
Adult           5
Elderly         4
Child           4
Baby/Toddler    1
Name: Label, dtype: int64
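
One edge case with these bins is an age of exactly 0: the first bin is (0, 3], so 0 itself is excluded and would come out as NaN. The include_lowest parameter from the syntax above closes the first bin on the left. A minimal sketch with hypothetical ages:

```python
import pandas as pd

ages = pd.Series([0, 3, 4])  # hypothetical ages, including a newborn
edges = [0, 3, 17, 63, 99]
names = ['Baby/Toddler', 'Child', 'Adult', 'Elderly']

# Default: first bin is (0, 3], so age 0 gets no category
default = pd.cut(ages, bins=edges, labels=names)

# include_lowest=True: the first bin also contains its left edge
with_lowest = pd.cut(ages, bins=edges, labels=names, include_lowest=True)

print(default.tolist())
print(with_lowest.tolist())
```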

Example 2: Let’s say we have a column ‘Height’ with the heights of 12 people, ranging from 150 cm to 180 cm, and we wish to separate the data into 3 categories:

'Short' :- greater than 150 cm up to 157 cm
'Average' :- greater than 157 cm up to 169 cm
'Tall' :- greater than 169 cm up to 180 cm

Python3




# Importing the pandas library
import pandas as pd

# Creating a dummy DataFrame of 12 heights
# ranging from 150-180
df = pd.DataFrame({'Height': [150.4, 157.6, 170, 176, 164.2, 155,
                              159.2, 175, 162.4, 176, 153, 170.9]})
  
# Printing DataFrame before converting Continuous to Categories
print("Before: ")
print(df)
  
# A column of name 'Label' is created in DataFrame
# Categorizing Height into 3 Categories
# Short: (150,157], 150 is excluded & 157 is included
# Average: (157,169], 157 is excluded & 169 is included
# Tall: (169,180], 169 is excluded & 180 is included
df['Label'] = pd.cut(x=df['Height'],
                     bins=[150, 157, 169, 180],
                     labels=['Short', 'Average', 'Tall'])
  
# Printing DataFrame after converting Continuous to Categories
print("After: ")
print(df)
  
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())


Output:

Before: 
    Height
0    150.4
1    157.6
2    170.0
3    176.0
4    164.2
5    155.0
6    159.2
7    175.0
8    162.4
9    176.0
10   153.0
11   170.9
After: 
    Height    Label
0    150.4    Short
1    157.6  Average
2    170.0     Tall
3    176.0     Tall
4    164.2  Average
5    155.0    Short
6    159.2  Average
7    175.0     Tall
8    162.4  Average
9    176.0     Tall
10   153.0    Short
11   170.9     Tall
Categories: 
Tall       5
Average    4
Short      3
Name: Label, dtype: int64
