In this article, we will select columns from a dataframe based on a condition using the where() function in PySpark.
Let’s create a sample dataframe with employee data.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display dataframe
dataframe.show()
Output:
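+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  1| sravan|company 1|
|  4|sridevi|company 1|
+---+-------+---------+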
The where() method
This method takes a condition and returns a new dataframe containing only the rows that satisfy it.
Syntax:
dataframe.where(dataframe.column condition)
- Here dataframe is the input dataframe
- column is the column on which the condition is applied
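For instance, using the dataframe created above, we can keep only the rows for company 2 (a minimal sketch; the column name comes from the sample data above):

# keep only the rows where Company is 'company 2'
dataframe.where(dataframe.Company == 'company 2').show()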
The select() method
We chain select() with where() to pick specific columns from the rows that satisfy the condition.
Syntax:
dataframe.select('column_name').where(dataframe.column condition)
- Here dataframe is the input dataframe
- column is the column on which the condition is applied
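As a quick sketch, the same condition can also be written with the col() function from pyspark.sql.functions, which is equivalent to the attribute-style column reference used in the examples below:

# importing the col() helper
from pyspark.sql.functions import col

# select NAME where ID is greater than 2
dataframe.select('NAME').where(col('ID') > 2).show()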
Example 1: Python program to return IDs based on a condition
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select ID where ID is less than 3
dataframe.select('ID').where(dataframe.ID < 3).show()
Output:
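+---+
| ID|
+---+
|  1|
|  2|
|  1|
+---+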
Example 2: Python program to select ID and NAME where ID = 4
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select ID and NAME where ID = 4
dataframe.select(['ID', 'NAME']).where(dataframe.ID == 4).show()
Output:
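+---+-------+
| ID|   NAME|
+---+-------+
|  4|sridevi|
|  4|sridevi|
+---+-------+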
Example 3: Python program to select all columns based on a condition
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select all columns where NAME = sridevi
dataframe.select(['ID', 'NAME', 'Company']).where(
    dataframe.NAME == 'sridevi').show()
Output:
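+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  4|sridevi|company 1|
|  4|sridevi|company 1|
+---+-------+---------+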