PySpark is a powerful tool for working with large datasets in a distributed environment using Python. One of the most common tasks in data manipulation is grouping data by one or more columns. This can be accomplished using the groupBy() function in PySpark, which groups a DataFrame based on the values in one or more columns. In this article, we will explore how to use the groupBy() function in PySpark with count() and other aggregations.
Syntax of groupBy() Function
The groupBy() function in PySpark groups a DataFrame based on the values in one or more columns, so that aggregate functions can then be applied to each group. Its syntax and parameter are given below:
Syntax: DataFrame.groupBy(*cols)

Parameter: cols – the columns to group by; each element can be a column name (a string) or a Column expression. groupby() is an alias for groupBy().
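As a quick illustration before the full example in the next section, here is a minimal sketch (assuming a DataFrame df with columns DEPT, NAME, and FEE, like the one created below) showing that groupBy() accepts one or several columns:

Python3

# a minimal sketch, assuming a DataFrame `df` with columns
# DEPT, NAME and FEE (like the one created in the next section)
from pyspark.sql import functions as F

# group by a single column, then count the rows in each group
df.groupBy("DEPT").count().show()

# group by multiple columns, then aggregate another column
df.groupBy("DEPT", "NAME").agg(F.sum("FEE")).show()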
Creating a PySpark DataFrame
Here, we will create a simple DataFrame containing student data, with columns for the student's ID, NAME, DEPT, and FEE. We first import the necessary packages, then build a SparkSession with the app name 'sparkdf', and finally create the DataFrame using the spark.createDataFrame() function.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display
dataframe.show()
Output:
PySpark groupBy DataFrame with count()
Here, we use count(), which returns the number of rows in each group.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT with count()
dataframe.groupBy('DEPT').count().show()
Output:
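Note that groupBy() does not guarantee the order of the resulting groups. As a minimal sketch reusing the dataframe built above, orderBy() can be chained after count() to sort the groups:

Python3

# reusing the `dataframe` built above: sort the per-department
# counts in descending order (count() names its column 'count')
from pyspark.sql.functions import desc

dataframe.groupBy('DEPT').count().orderBy(desc('count')).show()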
PySpark groupBy DataFrame with Aggregation
Here, we import the aggregate functions from the pyspark.sql.functions module. By grouping on DEPT and applying sum(), min(), max(), mean(), and count(), we can collect identical values into groups on the PySpark DataFrame and perform aggregate functions on the grouped data.
Python3
# importing module
import pyspark

# import sum, min, avg, count, mean and max functions
from pyspark.sql.functions import sum, max, min, avg, count, mean

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT with sum(), min(), max(), mean() and count()
dataframe.groupBy("DEPT").agg(max("FEE"), sum("FEE"),
                              min("FEE"), mean("FEE"),
                              count("FEE")).show()
Output:
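By default, agg() names each result column after the expression that produced it, for example max(FEE). As a minimal sketch reusing the dataframe built above, alias() can be used to give the aggregated columns readable names:

Python3

# reusing the `dataframe` built above: rename the aggregated
# columns with alias()
from pyspark.sql import functions as F

dataframe.groupBy("DEPT").agg(
    F.max("FEE").alias("max_fee"),
    F.min("FEE").alias("min_fee"),
    F.avg("FEE").alias("avg_fee")
).show()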
In conclusion, the groupBy() function in PySpark collects identical values into groups on a DataFrame, which can then be summarized either with count() or with aggregate functions such as sum(), min(), max(), and mean() from pyspark.sql.functions.