Removing duplicate columns after DataFrame join in PySpark

27 July 2024

0

In this article, we will discuss how to remove duplicate columns after a DataFrame join in PySpark.

Create the first dataframe for demonstration:

Python3

# Importing necessary libraries
from pyspark.sql import SparkSession
 
# Create a spark session
spark = SparkSession.builder.appName('pyspark \
- example join').getOrCreate()
 
# Create data in dataframe
data = [(('Ram'),1,'M'),
          (('Mike'),2,'M'),
          (('Rohini'),3,'M'),
          (('Maria'),4,'F'),
          (('Jenis'),5,'F')]
 
# Column names in dataframe
columns = ["Name","ID","Gender"]
 
# Create the spark dataframe
df1 = spark.createDataFrame(data=data, schema = columns)
 
# Print the dataframe
df1.show()

Output:

+------+---+------+
|  Name| ID|Gender|
+------+---+------+
|   Ram|  1|     M|
|  Mike|  2|     M|
|Rohini|  3|     M|
| Maria|  4|     F|
| Jenis|  5|     F|
+------+---+------+

Create a second dataframe for demonstration:

Python3

# Create data in dataframe
data2 = [(1,3000),
          (2,4000),
          (3,4000),
          (4,4000),
          (5, 1200)]
 
# Column names in dataframe
columns = ["ID","salary"]
 
# Create the spark dataframe
df2 = spark.createDataFrame(data=data2, 
                            schema = columns)
 
# Print the dataframe
df2.show()

Output:

+---+------+
| ID|salary|
+---+------+
|  1|  3000|
|  2|  4000|
|  3|  4000|
|  4|  4000|
|  5|  1200|
+---+------+

Using join()

This will join the two dataframes

Syntax: dataframe.join(dataframe1).show()

where,

dataframe is the first dataframe

dataframe1 is the second dataframe

Let’s see the dataframe after join:

Python3

df = df1.join(df2, df1.ID==df2.ID)
df.show()

Output:

+------+---+------+---+------+
|  Name| ID|Gender| ID|salary|
+------+---+------+---+------+
| Jenis|  5|     F|  5|  1200|
|   Ram|  1|     M|  1|  3000|
|Rohini|  3|     M|  3|  4000|
|  Mike|  2|     M|  2|  4000|
| Maria|  4|     F|  4|  4000|
+------+---+------+---+------+

Here we see the ID and Salary columns are added to our existing article.

Now, let check the columns once:

Here we check gender columns which is unique so its work fine.

Python3

df.select('Gender').show()

Output:

+------+
|Gender|
+------+
|     F|
|     M|
|     M|
|     M|
|     F|
+------+

Now let check our Duplicates Columns:

Here it will produce errors because of duplicate columns.

Python3

df.select('ID').show()

Output:

AnalysisException: Reference ‘ID’ is ambiguous, could be: ID, ID.

Removing duplicate columns after join in PySpark

If we want to drop the duplicate column, then we have to specify the duplicate column in the join function. Here we are simply using join to join two dataframes and then drop duplicate columns.

Syntax: dataframe.join(dataframe1, [‘column_name’]).show()

where,

dataframe is the first dataframe

dataframe1 is the second dataframe

column_name is the common column exists in two dataframes

Python3

new_df = df1.join(df2, ["id"])
new_df.show()

Output:

+---+------+------+------+
| ID|  Name|Gender|salary|
+---+------+------+------+
|  5| Jenis|     F|  1200|
|  1|   Ram|     M|  3000|
|  3|Rohini|     M|  4000|
|  2|  Mike|     M|  4000|
|  4| Maria|     F|  4000|
+---+------+------+------+

Removing duplicate columns after DataFrame join in PySpark

Create the first dataframe for demonstration:

Python3

Create a second dataframe for demonstration:

Python3

Using join()

Python3

Python3

Python3

Removing duplicate columns after join in PySpark

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

5 Best VPNs for Government Bypassing (Tested in 2025) by Katarina Glamoslija

5 Best Linux VPNs in 2025: Expert Tested & Reviewed by Raven Wu

5 Best VPNs for Sky Go in 2025: Fast & Easy to Use by Tim Mocan

10 Best Cheap VPNs for Torrenting in 2025: Safe Downloads by Raven Wu

Recent Comments

EDITOR PICKS

5 Best VPNs for Government Bypassing (Tested in 2025) by Katarina Glamoslija

5 Best Linux VPNs in 2025: Expert Tested & Reviewed by Raven Wu

5 Best VPNs for Sky Go in 2025: Fast & Easy to Use by Tim Mocan

POPULAR POSTS

5 Best VPNs for Government Bypassing (Tested in 2025) by Katarina Glamoslija

5 Best Linux VPNs in 2025: Expert Tested & Reviewed by Raven Wu

5 Best VPNs for Sky Go in 2025: Fast & Easy to Use by Tim Mocan

POPULAR CATEGORY

ABOUT US

FOLLOW US