In this article, we will discuss how to merge two dataframes with different numbers of columns (i.e. different schemas) in PySpark.
Let’s consider the first dataframe:
Here we have 3 columns, named ID, NAME, and Address, for demonstration purposes.
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]

# specify column names
columns = ['ID', 'NAME', 'Address']

# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)

# display
dataframe1.show()
Output:
Let’s consider the second dataframe:
Here we are going to create a dataframe with 2 columns, ID and Age.
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]

# specify column names
columns = ['ID', 'Age']

# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)

# display
dataframe2.show()
Output:
We cannot merge the dataframes directly because their columns differ, so we first have to add the missing columns. The first dataframe (dataframe1) has the columns ['ID', 'NAME', 'Address'], while the second dataframe (dataframe2) has the columns ['ID', 'Age'].
Now we have to add the Age column to the first dataframe and the NAME and Address columns to the second dataframe. We can do this with the lit() function, available in pyspark.sql.functions, which adds a column holding a literal value; here the value we add is None.
Syntax:
for column in [column for column in dataframe1.columns if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
where,
- dataframe1 is the first dataframe
- dataframe2 is the second dataframe
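As a minimal sketch of what a single pass of this loop does (using the Age column from our example), lit(None) adds a column whose every value is null; the cast shown here is optional and only needed if you want the new column to have a concrete type instead of Spark's null type:
from pyspark.sql.functions import lit

# add an all-null Age column to dataframe1; the cast to "int" is optional
# and just gives the new column a concrete data type
dataframe1 = dataframe1.withColumn("Age", lit(None).cast("int"))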
Add missing columns to both the dataframes
Using the above syntax, we add the Age column to the first dataframe and the NAME and Address columns to the second dataframe. Finally, we display the column names of both dataframes.
Python3
# importing module
import pyspark

# import the lit function
from pyspark.sql.functions import lit

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]

# specify column names
columns = ['ID', 'NAME', 'Address']

# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)

# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]

# specify column names
columns = ['ID', 'Age']

# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)

# add to dataframe1 the columns that exist only in dataframe2
for column in [column for column in dataframe2.columns
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))

# add to dataframe2 the columns that exist only in dataframe1
for column in [column for column in dataframe1.columns
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))

# now see the columns of dataframe1
print(dataframe1.columns)

# now see the columns of dataframe2
print(dataframe2.columns)
Output:
['ID', 'NAME', 'Address', 'Age']
['ID', 'Age', 'NAME', 'Address']
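If you want to verify the result beyond the column names, printSchema() shows the added columns on both sides (columns created with lit(None) appear with a null/void type unless you cast them):
# inspect the schemas of both dataframes after adding the missing columns
dataframe1.printSchema()
dataframe2.printSchema()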
Merging Dataframes
Method 1: Using union()
This merges the dataframes based on column position, so both dataframes must have their columns in the same order.
Syntax:
dataframe1.union(dataframe2)
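Because union() matches columns purely by position, make sure both dataframes list their columns in the same order before calling it. One way to do that (a small sketch using the dataframes from this article) is to select dataframe2's columns in dataframe1's order:
# reorder dataframe2's columns to match dataframe1 before a positional union
merged = dataframe1.union(dataframe2.select(dataframe1.columns))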
Example:
In this example, we merge the two dataframes using the union() method after adding the required columns to both. Since union() matches columns by position, we reorder dataframe2's columns to match dataframe1 before merging. Finally, we display the merged dataframe.
Python3
# importing module
import pyspark

# import the lit function
from pyspark.sql.functions import lit

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]

# specify column names
columns = ['ID', 'NAME', 'Address']

# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)

# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]

# specify column names
columns = ['ID', 'Age']

# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)

# add to dataframe1 the columns that exist only in dataframe2
for column in [column for column in dataframe2.columns
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))

# add to dataframe2 the columns that exist only in dataframe1
for column in [column for column in dataframe1.columns
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))

# union() matches columns by position, so reorder dataframe2's
# columns to match dataframe1 before merging
dataframe1.union(dataframe2.select(dataframe1.columns)).show()
Output:
Method 2: Using unionByName()
This merges the two dataframes based on column name, so the column order does not have to match.
Syntax:
dataframe1.unionByName(dataframe2)
Example:
In this example, we merge the two dataframes using the unionByName() method after adding the required columns to both. Finally, we display the merged dataframe.
Python3
# importing module
import pyspark

# import the lit function
from pyspark.sql.functions import lit

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]

# specify column names
columns = ['ID', 'NAME', 'Address']

# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)

# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]

# specify column names
columns = ['ID', 'Age']

# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)

# add to dataframe1 the columns that exist only in dataframe2
for column in [column for column in dataframe2.columns
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))

# add to dataframe2 the columns that exist only in dataframe1
for column in [column for column in dataframe1.columns
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))

# perform unionByName; columns are matched by name, so order does not matter
dataframe1.unionByName(dataframe2).show()
Output:
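Note: if you are running Spark 3.1 or later, unionByName() also accepts an allowMissingColumns argument that fills in missing columns with nulls automatically, so the manual lit(None) step is not needed. Applied to the original dataframe1 and dataframe2, it would look like this:
# Spark 3.1+ only: columns missing from either dataframe are added as nulls
dataframe1.unionByName(dataframe2, allowMissingColumns=True).show()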
Method 3: Using unionAll()
unionAll() is an alias of union(), so it also merges the dataframes based on column position.
Syntax:
dataframe1.unionAll(dataframe2)
Example:
In this example, we merge the two dataframes using the unionAll() method after adding the required columns to both. As with union(), we reorder dataframe2's columns to match dataframe1 before merging. Finally, we display the merged dataframe.
Python3
# importing module
import pyspark

# import the lit function
from pyspark.sql.functions import lit

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]

# specify column names
columns = ['ID', 'NAME', 'Address']

# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)

# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]

# specify column names
columns = ['ID', 'Age']

# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)

# add to dataframe1 the columns that exist only in dataframe2
for column in [column for column in dataframe2.columns
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))

# add to dataframe2 the columns that exist only in dataframe1
for column in [column for column in dataframe1.columns
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))

# unionAll() is an alias of union() and also matches columns by position,
# so reorder dataframe2's columns to match dataframe1 before merging
dataframe1.unionAll(dataframe2.select(dataframe1.columns)).show()
Output:
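One final point: union(), unionByName(), and unionAll() all keep duplicate rows (unlike UNION in SQL). If you want only distinct rows in the merged result, apply dropDuplicates() (or distinct()) afterwards:
# drop duplicate rows from the merged dataframe
dataframe1.unionByName(dataframe2).dropDuplicates().show()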