How to join on multiple columns in PySpark?

27 July 2024

In this article, we will discuss how to join two PySpark DataFrames on multiple columns using Python.

Let’s create the first dataframe:

Python3

# importing module 
import pyspark 
  
# importing sparksession from pyspark.sql module 
from pyspark.sql import SparkSession 
  
# creating sparksession and giving an app name 
spark = SparkSession.builder.appName('sparkdf').getOrCreate() 
  
# list  of employee data 
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")] 
  
# specify column names 
columns = ['ID1', 'NAME1'] 
  
# creating a dataframe from the lists of data 
dataframe = spark.createDataFrame(data, columns) 
  
dataframe.show() 

Output: a dataframe with three rows and the columns ID1 and NAME1.

Let’s create the second dataframe:

Python3

# importing module 
import pyspark 
  
# importing sparksession from pyspark.sql module 
from pyspark.sql import SparkSession 
  
# creating sparksession and giving an app name 
spark = SparkSession.builder.appName('sparkdf').getOrCreate() 
  
# list  of employee data 
data = [(1, "sravan"), (2, "ojsawi"), 
        (3, "bobby"), 
        (4, "rohith"), (5, "gnanesh")] 
  
# specify column names 
columns = ['ID2', 'NAME2'] 
  
# creating a dataframe from the lists of data 
dataframe1 = spark.createDataFrame(data, columns) 
  
dataframe1.show() 

Output: a dataframe with five rows and the columns ID2 and NAME2.

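We can join on multiple columns by passing the join() function a condition that combines one comparison per column with a conditional operator: & (and) or | (or). Each comparison must be wrapped in parentheses, because & and | bind more tightly than == in Python.

Syntax: dataframe.join(dataframe1, (dataframe.column1 == dataframe1.column1) & (dataframe.column2 == dataframe1.column2))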

where,

dataframe is the first dataframe

dataframe1 is the second dataframe

column1 is the first matching column in both the dataframes

column2 is the second matching column in both the dataframes

Example 1: PySpark code to join the two dataframes with multiple columns (id and name)

Python3

# importing module 
import pyspark 
  
# importing sparksession from pyspark.sql module 
from pyspark.sql import SparkSession 
  
# creating sparksession and giving an app name 
spark = SparkSession.builder.appName('sparkdf').getOrCreate() 
  
# list  of employee data 
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")] 
  
# specify column names 
columns = ['ID1', 'NAME1'] 
  
# creating a dataframe from the lists of data 
dataframe = spark.createDataFrame(data, columns) 
  
# list  of employee data 
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"), 
        (4, "rohith"), (5, "gnanesh")] 
  
# specify column names 
columns = ['ID2', 'NAME2'] 
  
# creating a dataframe from the lists of data 
dataframe1 = spark.createDataFrame(data, columns) 
  
# join based on ID and name column 
dataframe.join(dataframe1, (dataframe.ID1 == dataframe1.ID2) 
               & (dataframe.NAME1 == dataframe1.NAME2)).show() 

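Output (row order may vary): the three rows with IDs 1, 2 and 3, whose ID and name both match, shown with the columns ID1, NAME1, ID2 and NAME2.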

Example 2: Join with the OR operator (|)

Python3

# importing module 
import pyspark 
  
# importing sparksession from pyspark.sql module 
from pyspark.sql import SparkSession 
  
# creating sparksession and giving an app name 
spark = SparkSession.builder.appName('sparkdf').getOrCreate() 
  
# list  of employee data 
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")] 
  
# specify column names 
columns = ['ID1', 'NAME1'] 
  
# creating a dataframe from the lists of data 
dataframe = spark.createDataFrame(data, columns) 
  
# list  of employee data 
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"), 
        (4, "rohith"), (5, "gnanesh")] 
  
# specify column names 
columns = ['ID2', 'NAME2'] 
  
# creating a dataframe from the lists of data 
dataframe1 = spark.createDataFrame(data, columns) 
  
# join where either the ID or the name column matches 
dataframe.join(dataframe1, (dataframe.ID1 == dataframe1.ID2) 
               | (dataframe.NAME1 == dataframe1.NAME2)).show() 

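Output (row order may vary): the rows with IDs 1, 2 and 3, each paired with its matching row from the second dataframe.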
