In this article, we will select columns from a dataframe based on a condition using the where() function in PySpark.
Let’s create a sample dataframe with employee data.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display dataframe
dataframe.show()
Output:
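+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  1| sravan|company 1|
|  4|sridevi|company 1|
+---+-------+---------+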
The where() method
This method takes a condition and returns a new dataframe containing only the rows that satisfy it.
Syntax:
dataframe.where(dataframe.column condition)
- Here dataframe is the input dataframe
- column is the column on which the condition is applied
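For instance, using the dataframe created above, we can keep only the rows for company 2 (a minimal sketch; the column name comes from the sample data above):

# keep only the rows where Company is 'company 2'
dataframe.where(dataframe.Company == 'company 2').show()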
The select() method
We chain select() with where() to pick specific columns from the rows that satisfy the condition.
Syntax:
dataframe.select('column_name').where(dataframe.column condition)
- Here dataframe is the input dataframe
- column is the column on which the condition is applied
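As a quick sketch, the same condition can also be written with the col() function from pyspark.sql.functions, which is equivalent to the attribute-style column reference used in the examples below:

# importing the col() helper
from pyspark.sql.functions import col

# select NAME where ID is greater than 2
dataframe.select('NAME').where(col('ID') > 2).show()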
Example 1: Python program to return IDs based on a condition
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select ID where ID is less than 3
dataframe.select('ID').where(dataframe.ID < 3).show()
Output:
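+---+
| ID|
+---+
|  1|
|  2|
|  1|
+---+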
Example 2: Python program to select ID and NAME where ID = 4
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select ID and NAME where ID = 4
dataframe.select(['ID', 'NAME']).where(dataframe.ID == 4).show()
Output:
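+---+-------+
| ID|   NAME|
+---+-------+
|  4|sridevi|
|  4|sridevi|
+---+-------+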
Example 3: Python program to select all columns based on a condition
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select all columns where NAME = sridevi
dataframe.select(['ID', 'NAME', 'Company']).where(
    dataframe.NAME == 'sridevi').show()
Output:
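+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  4|sridevi|company 1|
|  4|sridevi|company 1|
+---+-------+---------+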