Wednesday, January 15, 2025
Google search engine
HomeLanguagesPyspark – Filter dataframe based on multiple conditions

Pyspark – Filter dataframe based on multiple conditions

In this article, we are going to see how to Filter dataframe based on multiple conditions.

Let’s Create a Dataframe for demonstration:

Python3




# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of students  data
data = [["1", "Amit", "DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]
 
# specify column names
columns = ['student_ID', 'student_NAME', 'college']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# show dataframe
dataframe.show()


Output:

Method 1: Using Filter()

filter(): It is a function which filters the columns/row based on SQL expression or condition.

Syntax: Dataframe.filter(Condition)

Where condition may be given Logical expression/ sql expression

Example 1: Filter single condition

Python3




dataframe.filter(dataframe.college == "DU").show()


Output:

Example 2: Filter columns with multiple conditions.

Python3




dataframe.filter((dataframe.college == "DU") &
                 (dataframe.student_ID == "1")).show()


Output:

Method 2: Using filter and SQL Col

Here we are going to use the SQL col function, this function refers the column name of the dataframe with dataframe_object.col.

Syntax: Dataframe_obj.col(column_name).

Where, Column_name is refers to the column name of dataframe.

Example 1: Filter column with a single condition.

Python3




# Using SQL col() function
from pyspark.sql.functions import col
dataframe.filter(col("college") == "DU").show()


Output:

Example 2: Filter column with multiple conditions.

Python3




# Using SQL col() function
from pyspark.sql.functions import col
 
 
dataframe.filter((col("college") == "DU") &
                 (col("student_NAME") == "Amit")).show()


Output:

Method 3: Using isin()

isin(): This function takes a list as a parameter and returns the boolean expression. The boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments

Syntax: isin(*list)

Where *list is extracted from of list.

Example 1: Filter with a single list.

Python3




list = [1, 2]
dataframe.filter(dataframe.student_ID.isin(list)).show()


Output:

Example 2: Filter with multiple lists.

Python3




Id_list = [1, 2]
college_list = ['DU','IIT']
dataframe.filter((dataframe.student_ID.isin(Id_list)) |
                 (dataframe.college.isin(college_list))).show()


Output:

Method 4: Using Startswith and endswith

Here we will use startswith and endswith function of pyspark.

startswith(): This function takes a character as a parameter and searches in the columns string whose string starting with the first character if the condition satisfied then returns True.

Syntax: startswith(character)

Example:

Python3




dataframe.filter(dataframe.student_NAME.startswith('s')).show()


Output:

endswith(): This function takes a character as a parameter and searches in the columns string whose string ending with the character if the condition satisfied then returns True.

Syntax: endswith(character)

Example:

Python3




dataframe.filter(dataframe.student_NAME.endswith('t')).show()


Output:

Here will use both functions for filtering the dataframe:

Python3




dataframe.filter((dataframe.student_NAME.endswith('t')) &
                 (dataframe.student_NAME.startswith("A"))).show()


Output:

RELATED ARTICLES

Most Popular

Recent Comments