In this article, we are going to filter the rows of a PySpark DataFrame based on column values.
Creating Dataframe for demonstration:
Python3
# importing the pyspark module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

dataframe.show()
Output:
Method 1: Using where() function
The where() function evaluates a condition for each row and returns a new DataFrame containing only the rows that satisfy it.
Syntax: dataframe.where(condition)
Here, condition is a boolean column expression built from the DataFrame's columns, for example a comparison between a column and a value. Rows for which the expression evaluates to true are kept.
Example 1: Filter rows where ID is 1.
Python3
# get the data where ID is '1'
dataframe.where(dataframe.ID == '1').show()
Output:
Example 2: Filter rows where NAME is not 'sravan'.
Python3
# get the data where NAME is not 'sravan'
dataframe.where(dataframe.NAME != 'sravan').show()
Output:
Example 3: Filtering on multiple column values with where().
Python program to filter rows where ID is greater than 2 and Company is 'company 1'
Python3
# filter rows where ID is greater than 2
# and Company is 'company 1'
# (ID is a string column, so this is a string comparison)
dataframe.where((dataframe.ID > '2') & (dataframe.Company == 'company 1')).show()
Output:
Method 2: Using filter() function
The filter() function also returns only the rows that satisfy a condition. In PySpark, where() is an alias for filter(), so the two behave identically.
Syntax: dataframe.filter(condition)
Example 1: Filter rows where Company is 'company 1'.
Python3
# get the data where Company is 'company 1'
dataframe.filter(dataframe.Company == 'company 1').show()
Output:
Example 2: Filter rows where ID is greater than 3.
Python3
# get the data where ID > 3
# (ID is a string column, so this is a string comparison)
dataframe.filter(dataframe.ID > '3').show()
Output:
Example 3: Filtering on multiple column values with filter().
Python program to filter rows where ID is greater than 2 and Company is 'company 1'
Python3
# filter rows where ID is greater than 2
# and Company is 'company 1'
# (ID is a string column, so this is a string comparison)
dataframe.filter((dataframe.ID > '2') & (dataframe.Company == 'company 1')).show()
Output: