How to avoid duplicate columns after join in PySpark?

28 July 2024

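In this article, we will discuss how to avoid duplicate columns in a DataFrame after a join in PySpark. A duplicate appears when both DataFrames contain a column with the same name, such as the join key, and the join keeps both copies.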

Create the first dataframe for demonstration:

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
  
# specify column names
columns = ['ID', 'NAME', 'Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
dataframe.show()

Output:

Create a second dataframe for demonstration:

Python3

# list of salary and department data
data1 = [["1", "45000", "IT"],
         ["2", "145000", "Manager"],
         ["6", "45000", "HR"],
         ["5", "34000", "Sales"]]
  
# specify column names
columns = ['ID', 'salary', 'department']
  
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data1, columns)
  
dataframe1.show()

Output:

Method 1: Using drop() function

We can join the two dataframes on a column comparison, for example with an inner join, and then call drop() on the result to remove one of the two copies of the common column.

Syntax: dataframe.join(dataframe1, dataframe.column_name == dataframe1.column_name, "inner").drop(dataframe.column_name)

where,

dataframe is the first dataframe

dataframe1 is the second dataframe

inner specifies inner join

drop(dataframe.column_name) removes the first dataframe's copy of the common column, so only the copy from dataframe1 remains

Example: Join two dataframes on ID and remove the duplicate ID column that comes from the first dataframe

Python3

# inner join on two dataframes
# and remove duplicate column
dataframe.join(dataframe1,
               dataframe.ID == dataframe1.ID,
               "inner").drop(dataframe.ID).show()

Output:

Method 2: Using join()

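Here we pass the name of the common column to join() as a list instead of a comparison expression. The result then contains a single copy of that column, so there is nothing to drop afterwards.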

Syntax: dataframe.join(dataframe1, ['column_name']).show()

where,

dataframe is the first dataframe

dataframe1 is the second dataframe

column_name is the common column that exists in both dataframes

Example: Join on ID so that the result contains a single ID column

Python3

# join the two dataframes on the column name;
# the result contains ID only once
dataframe.join(dataframe1, ['ID']).show()

Output:
