In this article, we are going to find the sum of PySpark dataframe column in Python. We are going to find the sum in a column using agg() function.
Let’s create a sample dataframe.
Python3
# importing module import pyspark # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app name spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate() # list of students data data = [[ "1" , "sravan" , "vignan" , 67 , 89 ], [ "2" , "ojaswi" , "vvit" , 78 , 89 ], [ "3" , "rohith" , "vvit" , 100 , 80 ], [ "4" , "sridevi" , "vignan" , 78 , 80 ], [ "1" , "sravan" , "vignan" , 89 , 98 ], [ "5" , "gnanesh" , "iit" , 94 , 98 ]] # specify column names columns = [ 'student ID' , 'student NAME' , 'college' , 'subject 1' , 'subject 2' ] # creating a dataframe from the lists of data dataframe = spark.createDataFrame(data, columns) # display dataframe dataframe.show() |
Output:
Using agg() method:
The agg() method returns the aggregate sum of the passed parameter column.
Syntax:
dataframe.agg({'column_name': 'sum'})
Where,
- The dataframe is the input dataframe
- The column_name is the column in the dataframe
- The sum is the function to return the sum.
Example 1: Python program to find the sum in dataframe column
Python3
# importing module import pyspark # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app name spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate() # list of students data data = [[ "1" , "sravan" , "vignan" , 67 , 89 ], [ "2" , "ojaswi" , "vvit" , 78 , 89 ], [ "3" , "rohith" , "vvit" , 100 , 80 ], [ "4" , "sridevi" , "vignan" , 78 , 80 ], [ "1" , "sravan" , "vignan" , 89 , 98 ], [ "5" , "gnanesh" , "iit" , 94 , 98 ]] # specify column names columns = [ 'student ID' , 'student NAME' , 'college' , 'subject 1' , 'subject 2' ] # creating a dataframe from the lists of data dataframe = spark.createDataFrame(data, columns) # find sum of subjects column dataframe.agg({ 'subject 1' : 'sum' }).show() |
Output:
Example 2: Get sum value from multiple columns
Python3
# importing module import pyspark # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app name spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate() # list of students data data = [[ "1" , "sravan" , "vignan" , 67 , 89 ], [ "2" , "ojaswi" , "vvit" , 78 , 89 ], [ "3" , "rohith" , "vvit" , 100 , 80 ], [ "4" , "sridevi" , "vignan" , 78 , 80 ], [ "1" , "sravan" , "vignan" , 89 , 98 ], [ "5" , "gnanesh" , "iit" , 94 , 98 ]] # specify column names columns = [ 'student ID' , 'student NAME' , 'college' , 'subject 1' , 'subject 2' ] # creating a dataframe from the lists of data dataframe = spark.createDataFrame(data, columns) # find sum of multiple column dataframe.agg({ 'subject 1' : 'sum' , 'student ID' : 'sum' , 'subject 2' : 'sum' }).show() |
Output: