In this article, we will discuss how to count rows based on conditions in Pyspark dataframe.
For this, we are going to use these methods:
- Using where() function.
- Using filter() function.
Creating Dataframe for demonstration:
Python3
# importing module import pyspark # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app name spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate() # list of students data data = [[ "1" , "sravan" , "vignan" ], [ "2" , "ojaswi" , "vvit" ], [ "3" , "rohith" , "vvit" ], [ "4" , "sridevi" , "vignan" ], [ "1" , "sravan" , "vignan" ], [ "5" , "gnanesh" , "iit" ]] # specify column names columns = [ 'ID' , 'NAME' , 'college' ] # creating a dataframe from the lists of data dataframe = spark.createDataFrame(data,columns) print ( 'Actual data in dataframe' ) dataframe.show() |
Output:
Note: If we want to get all row count we can use count() function
Syntax: dataframe.count()
Where, dataframe is the pyspark input dataframe
Example: Python program to get all row count
Python3
print ( 'Total rows in dataframe' ) dataframe.count() |
Output:
Total rows in dataframe 6
Method 1: using where()
where(): This clause is used to check the condition and give the results
Syntax: dataframe.where(condition)
Where the condition is the dataframe condition
Example 1: Condition to get rows in dataframe where ID =1
Python3
# condition to get rows in dataframe # where ID =1 print ('Total rows in dataframe where\ ID = 1 with where clause') print (dataframe.where(dataframe. ID = = '1' ).count()) print ( 'They are ' ) dataframe.where(dataframe. ID = = '1' ).show() |
Output:
Example 2: Condition to get rows in dataframe with multiple conditions.
Python3
# condition to get rows in dataframe # where ID not equal to 1 print ('Total rows in dataframe where\ ID except 1 with where clause') print (dataframe.where(dataframe. ID ! = '1' ).count()) # condition to get rows in dataframe # where college is equal to vignan print ('Total rows in dataframe where\ college is vignan with where clause') print (dataframe.where(dataframe.college = = 'vignan' ).count()) # condition to get rows in dataframe # where id greater than 2 print ('Total rows in dataframe where ID greater\ than 2 with where clause') print (dataframe.where(dataframe. ID > 2 ).count()) |
Output:
Total rows in dataframe where ID except 1 with where clause
4
Total rows in dataframe where college is vignan with where clause
3
Total rows in dataframe where ID greater than 2 with where clause
3
Example 3: Python program for multiple conditions
Python3
# condition to get rows in dataframe # where ID not equal to 1 and name is sridevi print ('Total rows in dataframe where ID \ not equal to 1 and name is sridevi') print (dataframe.where((dataframe. ID ! = '1' ) & (dataframe.NAME = = 'sridevi' ) ).count()) # condition to get rows in dataframe # where college is equal to vignan or iit print ('Total rows in dataframe where college is \ vignan or iit with where clause') print (dataframe.where((dataframe.college = = 'vignan' ) | (dataframe.college = = 'iit' )).count()) |
Output:
Total rows in dataframe where ID not equal to 1 and name is sridevi
1
Total rows in dataframe where college is vignan or iit with where clause
4
Method 2: Using filter()
filter(): This clause is used to check the condition and give the results, Both are similar
Syntax: dataframe.filter(condition)
Example 1: Python program to get rows where id = 1
Python3
# condition to get rows in # dataframe where ID =1 print ('Total rows in dataframe where\ ID = 1 with filter clause') print (dataframe. filter (dataframe. ID = = '1' ).count()) print ( 'They are ' ) dataframe. filter (dataframe. ID = = '1' ).show() |
Output:
Example 2: Python program for multiple conditions
Python3
# condition to get rows in dataframe # where ID not equal to 1 and name is sridevi print ('Total rows in dataframe where ID not \ equal to 1 and name is sridevi') print (dataframe. filter ((dataframe. ID ! = '1' ) & (dataframe.NAME = = 'sridevi' )).count()) # condition to get rows in dataframe # where college is equal to vignan or iit print ('Total rows in dataframe where college\ is vignan or iit with filter clause') print (dataframe. filter ((dataframe.college = = 'vignan' ) | (dataframe.college = = 'iit' )).count()) |
Output:
Total rows in dataframe where ID not equal to 1 and name is sridevi
1
Total rows in dataframe where college is vignan or iit with filter clause
4