The Pandas DataFrame.groupby() function is one of the most useful functions in the library. It splits the data into groups based on columns or conditions and then applies an operation to each group, e.g. size(), which counts the number of rows in each group. groupby() can also be applied to a Series.
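The split-then-apply idea can be sketched on a small hand-made table. The column names `team` and `points` and the values are made up for illustration and are not part of the data set used later.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
df = pd.DataFrame({
    "team": ["a", "b", "a", "c", "b", "a"],
    "points": [10, 7, 3, 8, 5, 1],
})

# Split the rows by 'team', then count the rows in each group
rows_per_team = df.groupby("team").size()
print(rows_per_team)

# groupby() on a Series: group 'points' by the values of 'team' and sum them
points_per_team = df["points"].groupby(df["team"]).sum()
print(points_per_team)
```

Both results are Series indexed by the group label, so they can be passed straight to further operations such as plotting.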
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
Parameters :
by : mapping, function, str, or iterable. Used to determine the groups
axis : int, default 0. Split along rows (0) or columns (1)
level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index : For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sort : Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
group_keys : When calling apply, add group keys to index to identify pieces
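squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent type. This parameter was deprecated in pandas 1.1 and removed in pandas 2.0, so omit it on current versions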
Returns : GroupBy object
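To make a few of these parameters concrete, here is a minimal sketch of how as_index and sort change the result. The `city` and `sales` columns are hypothetical example data.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
df = pd.DataFrame({
    "city": ["Pune", "Agra", "Pune", "Agra", "Delhi"],
    "sales": [5, 2, 7, 1, 4],
})

# Defaults: group labels become the index and the keys are sorted
default_result = df.groupby("city")["sales"].sum()

# as_index=False: group labels stay in an ordinary column ("SQL-style")
sql_style = df.groupby("city", as_index=False)["sales"].sum()

# sort=False: groups appear in the order they are first seen
unsorted_result = df.groupby("city", sort=False)["sales"].sum()

print(default_result)
print(sql_style)
print(unsorted_result)
```

With the defaults the result is a Series indexed by city. With as_index=False it is a DataFrame with `city` and `sales` columns.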
In the following example, we make use of two libraries, seaborn and pandas: seaborn loads the data set and draws the plots, while pandas holds and groups the data. We use seaborn's load_dataset() function to load the penguins data set, which is returned as a pandas DataFrame.
Python3
# import the module
import seaborn as sns

# load the penguins data set
dataset = sns.load_dataset('penguins')

# display the first five rows
print(dataset.head())
Output :
The info() method gives more information about the data set:
Python3
# display the number of columns and their data types
dataset.info()
Output :
Next, we group the data by the ‘island’ column using the groupby() method and plot the number of rows in each group.
Plotting using Pandas :
Python3
# apply groupby on the island column,
# count the rows per island and plot the counts as a bar chart
dataset.groupby(['island']).size().plot(kind="bar")
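Outside a notebook the chart is not displayed automatically. The sketch below is one possible way to render it, assuming matplotlib is installed. It uses made-up rows in place of the penguins data, and the in-memory buffer `buf` is only an illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is required
import pandas as pd

# Hypothetical stand-in rows; the real data set has many more
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Biscoe", "Torgersen", "Dream", "Biscoe"],
})

# Count rows per island and draw one bar per group
counts = toy.groupby("island").size()
ax = counts.plot(kind="bar")
ax.set_xlabel("island")
ax.set_ylabel("number of rows")

# Render the figure to an in-memory PNG instead of showing a window
buf = io.BytesIO()
ax.figure.savefig(buf, format="png")
```

plot() returns a matplotlib Axes, so labels and titles can be set on `ax` before the figure is saved or shown.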
Plotting using Seaborn :
Python3
# use the groupby() function to group by the island column
# and apply the size() function;
# size() counts all rows in each group, not distinct rows
result = dataset.groupby(['island']).size()

# plot the result
sns.barplot(x=result.index, y=result.values)
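Note that size() counts every row in a group, including rows with missing values, while count() counts only the non-null values of a column. The difference can be sketched on made-up rows that reuse the `island` and `body_mass_g` column names. The values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with some missing measurements
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream", "Dream"],
    "body_mass_g": [3750.0, np.nan, 3800.0, 3250.0, np.nan],
})

# size(): number of rows per group, missing values included
sizes = toy.groupby("island").size()

# count(): number of non-null 'body_mass_g' values per group
counts = toy.groupby("island")["body_mass_g"].count()

print(sizes)
print(counts)
```

When a column has missing values the two results differ, so pick size() to plot rows per group and count() to plot available measurements.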