In this article, we will discuss how to perform aggregation on multiple columns in PySpark using Python. We can do this with the groupBy() function.
Let’s create a dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display the dataframe
dataframe.show()
Output:
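For the data above, show() should print the DataFrame like this:

+--+-------+----+-----+
|ID|   NAME|DEPT|  FEE|
+--+-------+----+-----+
| 1| sravan|  IT|45000|
| 2| ojaswi|  CS|85000|
| 3| rohith|  CS|41000|
| 4|sridevi|  IT|56000|
| 5|  bobby| ECE|45000|
| 6|gayatri| ECE|49000|
| 7|gnanesh|  CS|45000|
| 8|  bhanu|Mech|21000|
+--+-------+----+-----+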
In PySpark, groupBy() is used to collect identical data into groups on the DataFrame and perform aggregate functions on the grouped data.
The aggregation operations include (see the sketch after this list):
- count(): This will return the count of rows for each group.
dataframe.groupBy('column_name_group').count()
- mean(): This will return the mean of the values for each group.
dataframe.groupBy('column_name_group').mean('column_name')
- max(): This will return the maximum of the values for each group.
dataframe.groupBy('column_name_group').max('column_name')
- min(): This will return the minimum of the values for each group.
dataframe.groupBy('column_name_group').min('column_name')
- sum(): This will return the sum of the values for each group.
dataframe.groupBy('column_name_group').sum('column_name')
- avg(): This will return the average of the values for each group (an alias of mean()).
dataframe.groupBy('column_name_group').avg('column_name')
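For instance, here is a minimal sketch using the dataframe created above, grouping on the DEPT column and aggregating FEE:

Python3

# number of students in each department
dataframe.groupBy('DEPT').count().show()

# total and minimum FEE within each department
dataframe.groupBy('DEPT').sum('FEE').show()
dataframe.groupBy('DEPT').min('FEE').show()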
We can groupBy and aggregate on multiple columns at a time by using the following syntax:
dataframe.groupBy('column_name_group1', 'column_name_group2', ..., 'column_name_groupN').aggregate_operation('column_name')
Example 1: groupBy() with mean() on DEPT and NAME
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# groupBy DEPT and NAME, taking the mean of FEE
dataframe.groupBy('DEPT', 'NAME').mean('FEE').show()
Output:
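Every (DEPT, NAME) pair is unique in this data, so each group's mean is simply that student's FEE. The result should resemble the following (row order may vary):

+----+-------+--------+
|DEPT|   NAME|avg(FEE)|
+----+-------+--------+
|  IT| sravan| 45000.0|
|  CS| ojaswi| 85000.0|
|  CS| rohith| 41000.0|
|  IT|sridevi| 56000.0|
| ECE|  bobby| 45000.0|
| ECE|gayatri| 49000.0|
|  CS|gnanesh| 45000.0|
|Mech|  bhanu| 21000.0|
+----+-------+--------+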
Example 2: groupBy() with mean() on multiple columns (DEPT, ID and NAME)
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# groupBy DEPT, ID and NAME, taking the mean of FEE
dataframe.groupBy('DEPT', 'ID', 'NAME').mean('FEE').show()
Output:
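Again, every (DEPT, ID, NAME) combination is unique, so each average equals that row's FEE. The result should resemble (row order may vary):

+----+--+-------+--------+
|DEPT|ID|   NAME|avg(FEE)|
+----+--+-------+--------+
|  IT| 1| sravan| 45000.0|
|  CS| 2| ojaswi| 85000.0|
|  CS| 3| rohith| 41000.0|
|  IT| 4|sridevi| 56000.0|
| ECE| 5|  bobby| 45000.0|
| ECE| 6|gayatri| 49000.0|
|  CS| 7|gnanesh| 45000.0|
|Mech| 8|  bhanu| 21000.0|
+----+--+-------+--------+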