Sunday, November 17, 2024

Count rows based on condition in PySpark DataFrame

In this article, we will discuss how to count the rows of a PySpark DataFrame that satisfy a given condition.

For this, we are going to use these methods:

  • Using where() function.
  • Using filter() function.

Creating Dataframe for demonstration:

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of student data
data = [["1", "sravan", "vignan"],
        ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"],
        ["4", "sridevi", "vignan"],
        ["1", "sravan", "vignan"],
        ["5", "gnanesh", "iit"]]
  
# specify column names
columns = ['ID','NAME','college']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
  
print('Actual data in dataframe')
dataframe.show()


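Output:

Actual data in dataframe
+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  1| sravan| vignan|
|  2| ojaswi|   vvit|
|  3| rohith|   vvit|
|  4|sridevi| vignan|
|  1| sravan| vignan|
|  5|gnanesh|    iit|
+---+-------+-------+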

Note: To get the total number of rows, regardless of any condition, use the count() function.

Syntax: dataframe.count()

Where dataframe is the input PySpark DataFrame.

Example: Python program to get the total row count

Python3




print('Total rows in dataframe')
# count() returns an integer, so print it to see it when run as a script
print(dataframe.count())


Output:

Total rows in dataframe
6

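Method 1: Using where()

where(): This function returns a new dataframe containing only the rows that satisfy the given condition. Calling count() on the result gives the number of matching rows.

Syntax: dataframe.where(condition)

Where condition is a boolean column expression (for example, dataframe.ID == '1') or a SQL expression string.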

Example 1: Count the rows in the dataframe where ID = 1

Python3




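# count the rows in the dataframe where ID = 1
# (adjacent string literals are joined, which keeps the space
# that a backslash continuation inside the string would drop)
print('Total rows in dataframe where '
      'ID = 1 with where clause')
print(dataframe.where(dataframe.ID == '1').count())

# display the matching rows
print('They are')
dataframe.where(dataframe.ID == '1').show()


Output:

Total rows in dataframe where ID = 1 with where clause
2
They are
+---+------+-------+
| ID|  NAME|college|
+---+------+-------+
|  1|sravan| vignan|
|  1|sravan| vignan|
+---+------+-------+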

Example 2: Count rows for several different single conditions.

Python3




# count the rows where ID is not equal to 1
print('Total rows in dataframe where '
      'ID except 1 with where clause')
print(dataframe.where(dataframe.ID != '1').count())

# count the rows where college is equal to vignan
print('Total rows in dataframe where '
      'college is vignan with where clause')
print(dataframe.where(dataframe.college == 'vignan').count())

# count the rows where ID is greater than 2
# (ID is a string column, so cast it to int for a numeric comparison)
print('Total rows in dataframe where ID greater '
      'than 2 with where clause')
print(dataframe.where(dataframe.ID.cast('int') > 2).count())


Output:

Total rows in dataframe where ID except 1 with where clause

4

Total rows in dataframe where college is vignan with where clause

3

Total rows in dataframe where ID greater than 2 with where clause

3

Example 3: Python program for multiple conditions

Python3




# count the rows where ID is not equal to 1 and name is sridevi
# (each condition needs its own parentheses because & and |
# bind more tightly than == and != in Python)
print('Total rows in dataframe where ID '
      'not equal to 1 and name is sridevi')
print(dataframe.where((dataframe.ID != '1') &
                      (dataframe.NAME == 'sridevi')
                     ).count())

# count the rows where college is equal to vignan or iit
print('Total rows in dataframe where college is '
      'vignan or iit with where clause')
print(dataframe.where((dataframe.college == 'vignan') |
                      (dataframe.college == 'iit')).count())


Output:

Total rows in dataframe where ID not equal to 1 and name is sridevi

1

Total rows in dataframe where college is vignan or iit with where clause

4

Method 2: Using filter()

filter(): This function returns the rows that satisfy the given condition. where() is an alias for filter(), so the two behave identically.

Syntax: dataframe.filter(condition)

Example 1: Python program to count rows where ID = 1

Python3




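# count the rows in the dataframe where ID = 1
# (adjacent string literals are joined, which keeps the space
# that a backslash continuation inside the string would drop)
print('Total rows in dataframe where '
      'ID = 1 with filter clause')
print(dataframe.filter(dataframe.ID == '1').count())

# display the matching rows
print('They are')
dataframe.filter(dataframe.ID == '1').show()


Output:

Total rows in dataframe where ID = 1 with filter clause
2
They are
+---+------+-------+
| ID|  NAME|college|
+---+------+-------+
|  1|sravan| vignan|
|  1|sravan| vignan|
+---+------+-------+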

Example 2: Python program for multiple conditions

Python3




# count the rows where ID is not equal to 1 and name is sridevi
print('Total rows in dataframe where ID not '
      'equal to 1 and name is sridevi')
print(dataframe.filter((dataframe.ID != '1') &
                       (dataframe.NAME == 'sridevi')).count())

# count the rows where college is equal to vignan or iit
print('Total rows in dataframe where college '
      'is vignan or iit with filter clause')
print(dataframe.filter((dataframe.college == 'vignan') |
                       (dataframe.college == 'iit')).count())


Output:

Total rows in dataframe where ID not equal to 1 and name is sridevi

1

Total rows in dataframe where college is vignan or iit with filter clause

4
