Continuous numerical data, often highly skewed, is common in data analysis. Analysis sometimes becomes easier once continuous values are converted to discrete categories. One way to do this conversion is pandas' built-in cut function, which sorts continuous numerical values into categorical bins. It has three main parts:
- The first is the input, which must be a 1-D array or Series (for example, a single DataFrame column).
- The second is bins, the edges that split the continuous data into intervals. Each pair of adjacent numbers marks the start point and endpoint of one bin. Passing the edges explicitly gives full control over where each bin begins and ends.
- The third is labels. When labels are given, there must be exactly one fewer label than bin edges, because n edges define n - 1 bins.
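The relationship between bin edges and labels can be checked with a small sketch. The `ages` and `edges` names and their values are made up for illustration; the point is only that four edges define three bins, so three labels are accepted and two are rejected.

```python
import pandas as pd

# Illustrative data (not from the examples in this article)
ages = pd.Series([2, 10, 30])
edges = [0, 3, 17, 63]  # 4 edges define 3 bins

# Three labels for three bins: accepted
ok = pd.cut(ages, bins=edges, labels=["Baby/Toddler", "Child", "Adult"])
print(ok.tolist())

# Two labels for three bins: pandas raises a ValueError
try:
    pd.cut(ages, bins=edges, labels=["Child", "Adult"])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```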
Note: NA values in the input stay NA in the result. Values that fall outside the bin edges also become NA in the resulting categorical bins.
The cut function does not guarantee any particular distribution of values across the bins. Depending on how the edges are chosen, a bin may end up containing no values at all.
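A minimal sketch of both behaviours, using made-up values: a missing value, a value above the last edge, and a value sitting exactly on the lowest edge all come back as NA, and the middle bin stays empty. The `include_lowest=True` call at the end shows one way to keep the value on the lowest edge.

```python
import numpy as np
import pandas as pd

# Illustrative values: NaN, 150 (above the last edge) and 0 (on the first edge)
values = pd.Series([5, 25, np.nan, 150, 0])
binned = pd.cut(values, bins=[0, 10, 20, 30])
print(binned)

# value_counts lists every bin, including the empty (10, 20] bin
counts = binned.value_counts(sort=False)
print(counts)

# include_lowest=True makes the first bin include its left edge,
# so 0 is no longer NA
with_lowest = pd.cut(values, bins=[0, 10, 20, 30], include_lowest=True)
print(with_lowest.isna().tolist())
```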
Syntax: pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
Parameters:
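- x: Input array to be binned. Must be 1-dimensional.
- bins: Defines the bin boundaries. Can be an integer (the number of equal-width bins), a sequence of bin edges, or an IntervalIndex.
- right: Boolean indicating whether each bin includes its right edge. Default value is True, so bins have the form (a, b].
- labels: Labels for the returned bins. Either an array of labels or False, in which case integer bin indicators are returned. Default value is None, which labels each bin with its interval.
Return Value: Returns a Categorical, Series, or NumPy array holding the bin for each value of x. When retbins=True, the bin edges are returned as well.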
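The parameters above can be tried out on a short Series. This is a sketch with made-up ages and edges, not part of the examples that follow: it switches the closed side with `right=False`, requests integer codes with `labels=False`, and passes an integer for `bins` together with `retbins=True` to see the edges pandas computes.

```python
import pandas as pd

# Illustrative data chosen so that some values sit exactly on an edge
ages = pd.Series([3, 17, 18, 64])
edges = [0, 4, 18, 64, 100]

# right=False makes each bin closed on the left: [0, 4), [4, 18), ...
left_closed = pd.cut(ages, bins=edges, right=False)
print(left_closed)

# labels=False returns integer bin codes instead of interval categories
codes = pd.cut(ages, bins=edges, right=False, labels=False)
print(codes.tolist())

# An integer for bins asks for that many equal-width bins;
# retbins=True also returns the edges that were computed
binned, computed_edges = pd.cut(ages, bins=2, retbins=True)
print(computed_edges)
```

With `right=False`, the values 18 and 64 fall into the bins that start at those edges rather than the bins that end there.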
Example 1: Suppose we have a column 'Age' of 14 numbers between 1 and 100 and we wish to separate the data into 4 categories:
- 'Baby/Toddler': 0 to 3 years
- 'Child': 4 to 17 years
- 'Adult': 18 to 63 years
- 'Elderly': 64 to 99 years
Python3
```python
# Import pandas
import pandas as pd

# Create a dummy DataFrame of 14 ages between 1 and 100
df = pd.DataFrame({'Age': [42, 15, 67, 55, 1, 29, 75, 89,
                           4, 10, 15, 38, 22, 77]})

# Print the DataFrame before converting continuous
# values to categories
print("Before: ")
print(df)

# Create a new column 'Label' by sorting Age into 4 categories
# Baby/Toddler: (0,3], 0 is excluded & 3 is included
# Child: (3,17], 3 is excluded & 17 is included
# Adult: (17,63], 17 is excluded & 63 is included
# Elderly: (63,99], 63 is excluded & 99 is included
df['Label'] = pd.cut(x=df['Age'],
                     bins=[0, 3, 17, 63, 99],
                     labels=['Baby/Toddler', 'Child', 'Adult', 'Elderly'])

# Print the DataFrame after converting continuous
# values to categories
print("After: ")
print(df)

# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())
```
Output:
```
Before: 
    Age
0    42
1    15
2    67
3    55
4     1
5    29
6    75
7    89
8     4
9    10
10   15
11   38
12   22
13   77
After: 
    Age         Label
0    42         Adult
1    15         Child
2    67       Elderly
3    55         Adult
4     1  Baby/Toddler
5    29         Adult
6    75       Elderly
7    89       Elderly
8     4         Child
9    10         Child
10   15         Child
11   38         Adult
12   22         Adult
13   77       Elderly
Categories: 
Adult           5
Elderly         4
Child           4
Baby/Toddler    1
Name: Label, dtype: int64
```
Example 2: Suppose we have a column 'Height' with the heights of 12 people, ranging from 150 cm to 180 cm, and we wish to separate the data into 3 categories:
- 'Short': greater than 150 cm, up to 157 cm
- 'Average': greater than 157 cm, up to 169 cm
- 'Tall': greater than 169 cm, up to 180 cm
Python3
```python
# Import pandas
import pandas as pd

# Create a dummy DataFrame of 12 heights between 150 and 180 cm
df = pd.DataFrame({'Height': [150.4, 157.6, 170, 176, 164.2, 155,
                              159.2, 175, 162.4, 176, 153, 170.9]})

# Print the DataFrame before converting continuous values to categories
print("Before: ")
print(df)

# Create a new column 'Label' by sorting Height into 3 categories
# Short: (150,157], 150 is excluded & 157 is included
# Average: (157,169], 157 is excluded & 169 is included
# Tall: (169,180], 169 is excluded & 180 is included
df['Label'] = pd.cut(x=df['Height'],
                     bins=[150, 157, 169, 180],
                     labels=['Short', 'Average', 'Tall'])

# Print the DataFrame after converting continuous values to categories
print("After: ")
print(df)

# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())
```
Output:
```
Before: 
    Height
0    150.4
1    157.6
2    170.0
3    176.0
4    164.2
5    155.0
6    159.2
7    175.0
8    162.4
9    176.0
10   153.0
11   170.9
After: 
    Height    Label
0    150.4    Short
1    157.6  Average
2    170.0     Tall
3    176.0     Tall
4    164.2  Average
5    155.0    Short
6    159.2  Average
7    175.0     Tall
8    162.4  Average
9    176.0     Tall
10   153.0    Short
11   170.9     Tall
Categories: 
Tall       5
Average    4
Short      3
Name: Label, dtype: int64
```
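Because `ordered=True` is the default in the syntax shown earlier, the labels produced by cut form an ordered categorical that follows the bin order. The sketch below reuses the first six heights and the same bins from Example 2 to show that such labels can be compared and filtered. The variable names are illustrative.

```python
import pandas as pd

# First six heights from Example 2, same bins and labels
heights = pd.Series([150.4, 157.6, 170, 176, 164.2, 155])
labels = pd.cut(heights, bins=[150, 157, 169, 180],
                labels=["Short", "Average", "Tall"])

# The result is an ordered categorical: Short < Average < Tall
print(labels.cat.ordered)

# Comparisons follow bin order, so this keeps 'Average' and 'Tall' rows
taller = heights[labels > "Short"]
print(taller.tolist())

# min and max also respect the bin order rather than alphabetical order
print(labels.min(), labels.max())
```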