
Select Columns that Satisfy a Condition in PySpark

In this article, we will select columns from a PySpark DataFrame based on a condition using the select() and where() methods.

First, let's create a sample DataFrame with employee data.

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"], 
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]
  
# specify column names
columns = ['ID', 'NAME', 'Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# display dataframe
dataframe.show()


Output:

The where() method

This method filters the DataFrame and returns only the rows that satisfy the given condition.

Syntax:

where(dataframe.column condition)
  1. Here dataframe is the input DataFrame
  2. column is the name of the column on which the condition is applied
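
For example, using the same employee data as above, where() on its own keeps every column and drops the rows that fail the condition. The snippet below is a minimal sketch that rebuilds that DataFrame and keeps only the rows whose Company is "company 1".

Python3

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# same employee data as above
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]
columns = ['ID', 'NAME', 'Company']
dataframe = spark.createDataFrame(data, columns)

# keep only the rows where Company is "company 1";
# all columns are returned because no select() is applied
dataframe.where(dataframe.Company == "company 1").show()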

The select() method

The select() method chooses which columns to return from the DataFrame. Combined with where(), it returns those columns only for the rows that satisfy the condition.

Syntax:

dataframe.select('column_name').where(dataframe.column condition)
  1. Here dataframe is the input DataFrame
  2. column is the name of the column on which the condition is applied
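
For example, select() can also be used on its own, without where(), to keep only some of the columns for every row. The snippet below is a minimal sketch that assumes the employee DataFrame created in the first code block is still available as dataframe.

Python3

# keep only the NAME and Company columns for every row
dataframe.select('NAME', 'Company').show()

# the same selection with a where() condition returns those
# columns only for the rows where ID is less than 3
dataframe.select('NAME', 'Company').where(dataframe.ID < 3).show()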

Example 1: Python program to return IDs that satisfy a condition (ID less than 3).

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"], 
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"], 
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]
  
# specify column names
columns = ['ID', 'NAME', 'Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# select ID where ID is less than 3
dataframe.select('ID').where(dataframe.ID < 3).show()


Output:

Example 2: Python program to select ID and NAME where ID = 4.

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"], 
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]
  
# specify column names
columns = ['ID', 'NAME', 'Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# select ID and NAME where ID = 4
dataframe.select(['ID', 'NAME']).where(dataframe.ID == 4).show()


Output:

Example 3: Python program to select all columns based on a condition.

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"], 
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]
  
# specify column names
columns = ['ID', 'NAME', 'Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# select all columns where NAME is 'sridevi'
dataframe.select(['ID', 'NAME', 'Company']).where(
    dataframe.NAME == 'sridevi').show()


Output:
