Sunday, November 17, 2024
Google search engine
HomeLanguagesCount values by condition in PySpark Dataframe

Count values by condition in PySpark Dataframe

In this article, we are going to count the value of the Pyspark dataframe columns by condition.

Creating Dataframe for demonstration:

Python3




# importing module
import pyspark
 
# importing sparksession from
# pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data with 10 row values
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "IT", 30000],
        ["3", "bobby", "business", 45000],
        ["4", "rohith", "IT", 45000],
        ["5", "gnanesh", "business", 120000],
        ["6", "siva nagulu", "sales", 23000],
        ["7", "bhanu", "sales", 34000],
        ["8", "sireesha", "business", 456798],
        ["9", "ravi", "IT", 230000],
        ["10", "devi", "business", 100000],
        ]
 
# specify column names
columns = ['ID', 'NAME', 'sector', 'salary']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# display dataframe
dataframe.show()


Output:

Method 1: Using select(), where(), count()

where(): where is used to return the dataframe based on the given condition by selecting the rows in the dataframe or by extracting the particular rows or columns from the dataframe. It can take a condition and returns the dataframe

Syntax: where(dataframe.column condition)

Where, 

  • Here dataframe is the input dataframe
  • column is the column name where we have to raise a condition

count(): This function is used to return the number of values/rows in a dataframe

Syntax: dataframe.count()

Example 1: Python program to count values in NAME column where ID greater than 5

Python3




# count values in NAME column
# where ID greater than 5
dataframe.select('NAME').where(dataframe.ID>5).count()


 
 

Output:

5

 
Example 2: Python program to count values in all column count where ID greater than 3 and sector = IT

Python3




# count values in all column count
# where ID greater than 3 and sector = IT
dataframe.select().where((dataframe.ID>3) &
                         (dataframe.sector=='IT')).count()


 
 

Output:

2

Method 2: Using filter(), count()

filter(): It is used to return the dataframe based on the given condition by removing the rows in the dataframe or by extracting the particular rows or columns from the dataframe. It can take a condition and returns the dataframe

 

Syntax: filter(dataframe.column condition)

Where,

  • Here dataframe is the input dataframe
  • column  is the column name where we have to raise a condition

Example 1: Python program to count ID column  where ID =4

Python3




# count ID column  where ID =4
dataframe.select('ID').where(dataframe.ID == 4).count()


 
 

Output:

1

Example 2: Python program to count ID column  where ID > 4  and sector is sales or IT

Python3




# count ID column  where ID > 4 
# and sector is sales or IT
dataframe.select('ID').where((dataframe.ID>4) &
                             ((dataframe.sector=='sales')|
                              (dataframe.sector=='IT'))).count()


 
 

Output:

3

RELATED ARTICLES

Most Popular

Recent Comments