Pyspark dataframe: Summing column while grouping over another

27 July 2024

0

In this article, we will discuss how to sum a column while grouping another in Pyspark dataframe using Python.

Let’s create the dataframe for demonstration:

Python3

# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of student  data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
 
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# display
dataframe.show()

Output:

Method 1: Using groupBy() Method

In PySpark, groupBy() is used to collect the identical data into groups on the PySpark DataFrame and perform aggregate functions on the grouped data. Here the aggregate function is sum().

sum(): This will return the total values for each group.

Syntax: dataframe.groupBy(‘column_name_group’).sum(‘column_name’)

Example: Groupby with DEPT along with FEE with sum()

Python3

# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of student  data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
 
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# Groupby with DEPT along FEE with sum()
dataframe.groupBy('DEPT').sum('FEE').show()

Output:

Method 2: Using agg() function with GroupBy()

Here we have to import the sum function from sql.functions module to be used with the aggregate method.

Syntax: dataframe.groupBy(“group_column”).agg(sum(“column_name”))

where,

dataframe is the pyspark dataframe

group_column is the grouping column

column_name is the column to get sum

Python3

# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# import sum
from pyspark.sql.functions import sum
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of student  data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
 
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# Groupby with DEPT and NAME with sum()
dataframe.groupBy("DEPT").agg(sum("FEE")).show()

Output:

Method 3: Using Window function with sum

The window function is used for partitioning the columns in the dataframe.

Syntax: Window.partitionBy(‘column_name_group’)

where, column_name_group is the column that contains multiple values for partition

We can partition the data column that contains group values and then use the aggregate function of sum() to get the sum of the grouping(partitioning) column.

Syntax: dataframe.withColumn(‘New_Column_name’, functions.sum(‘column_name’).over(Window.partitionBy(‘column_name_group’)))

where,

withColumn() method is used to get the column name

functions.sum(‘column_name’) is to get the sum

Window.partitionBy(‘column_name_group’) is to partition the column with sum by group

Example: Get the sum of fee based on the department

Python3

# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# import functions
from pyspark.sql import functions as f
 
# import window module
from pyspark.sql import Window
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of student  data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
 
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# summing using window function
dataframe.withColumn('Total Branch Sum', f.sum(
    'FEE').over(Window.partitionBy('DEPT'))).show()

Output:

Pyspark dataframe: Summing column while grouping over another

Let’s create the dataframe for demonstration:

Python3

Method 1: Using groupBy() Method

Example: Groupby with DEPT along with FEE with sum()

Python3

Method 2: Using agg() function with GroupBy()

Python3

Method 3: Using Window function with sum

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

How to factory reset the Google Pixel 8a

The 2024 YouTube Music Recap could be here any day now

How to install Proton VPN on a Fire TV Stick

Google Messages can now show your profile exactly how it’s supposed to be

Recent Comments

EDITOR PICKS

How to factory reset the Google Pixel 8a

The 2024 YouTube Music Recap could be here any day now

How to install Proton VPN on a Fire TV Stick

POPULAR POSTS

How to factory reset the Google Pixel 8a

The 2024 YouTube Music Recap could be here any day now

How to install Proton VPN on a Fire TV Stick

POPULAR CATEGORY

ABOUT US

FOLLOW US