PySpark is a powerful tool for working with large datasets in a distributed environment using Python. One of the most common tasks in data manipulation is grouping data by one or more columns. This can be accomplished using the groupBy() function in PySpark, which groups a DataFrame based on the values in one or more columns. In this article, we will explore how to use the groupBy() function in PySpark with count() and other aggregations.
Syntax of groupBy() Function
The groupBy() function in PySpark groups a DataFrame based on the values in one or more columns, so that aggregate functions can then be applied to each group. Its syntax and parameter are given below:
Syntax: DataFrame.groupBy(*cols)

Parameter: cols – the columns to group by; each element can be a column name (a string) or a Column expression. groupby() is an alias for groupBy().
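As a quick illustration before the full example in the next section, here is a minimal sketch (assuming a DataFrame df with columns DEPT, NAME, and FEE, like the one created below) showing that groupBy() accepts one or several columns:

Python3

# a minimal sketch, assuming a DataFrame `df` with columns
# DEPT, NAME and FEE (like the one created in the next section)
from pyspark.sql import functions as F

# group by a single column, then count the rows in each group
df.groupBy("DEPT").count().show()

# group by multiple columns, then aggregate another column
df.groupBy("DEPT", "NAME").agg(F.sum("FEE")).show()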
Creating a PySpark DataFrame
Here, we will create a simple DataFrame containing student data, with columns for the student's ID, NAME, DEPT, and FEE. We first import the necessary packages, then build a SparkSession with the app name 'sparkdf', and finally create the DataFrame using the spark.createDataFrame() function.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display
dataframe.show()
Output:
PySpark groupBy DataFrame with count()
Here, we use count(), which returns the number of rows in each group.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT with count()
dataframe.groupBy('DEPT').count().show()
Output:
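Note that groupBy() does not guarantee the order of the resulting groups. As a minimal sketch reusing the dataframe built above, orderBy() can be chained after count() to sort the groups:

Python3

# reusing the `dataframe` built above: sort the per-department
# counts in descending order (count() names its column 'count')
from pyspark.sql.functions import desc

dataframe.groupBy('DEPT').count().orderBy(desc('count')).show()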
PySpark groupBy DataFrame with Aggregation
Here, we import the aggregate functions from the pyspark.sql.functions module. By grouping on DEPT and applying sum(), min(), max(), mean(), and count(), we can collect identical values into groups on the PySpark DataFrame and perform aggregate functions on the grouped data.
Python3
# importing module
import pyspark

# import sum, min, avg, count, mean and max functions
from pyspark.sql.functions import sum, max, min, avg, count, mean

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT with sum(), min(), max(), mean() and count()
dataframe.groupBy("DEPT").agg(max("FEE"), sum("FEE"),
                              min("FEE"), mean("FEE"),
                              count("FEE")).show()
Output:
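By default, agg() names each result column after the expression that produced it, for example max(FEE). As a minimal sketch reusing the dataframe built above, alias() can be used to give the aggregated columns readable names:

Python3

# reusing the `dataframe` built above: rename the aggregated
# columns with alias()
from pyspark.sql import functions as F

dataframe.groupBy("DEPT").agg(
    F.max("FEE").alias("max_fee"),
    F.min("FEE").alias("min_fee"),
    F.avg("FEE").alias("avg_fee")
).show()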
In conclusion, the groupBy() function in PySpark collects identical values into groups on a DataFrame, which can then be summarized either with count() or with aggregate functions such as sum(), min(), max(), and mean() from pyspark.sql.functions.