Merge two DataFrames with different amounts of columns in PySpark

27 July 2024

1

In this article, we will discuss how to perform union on two dataframes with different amounts of columns in PySpark in Python.

Let’s consider the first dataframe

Here we are having 3 columns named id, name, and address.

Python3

# importing module
import pyspark
 
# import when and lit function
from pyspark.sql.functions import when, lit
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
 
# specify column names
columns = ['ID', 'NAME', 'Address']
 
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
 
# display
dataframe1.show()

Output:

Let’s consider second dataframe

Here we are going to create dataframe with 2 columns

Python3

# importing module
import pyspark
 
# import when and lit function
from pyspark.sql.functions import when, lit
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32],
        ]
 
# specify column names
columns = ['ID', 'Age']
 
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
 
# display
dataframe2.show()

Output:

We can not perform union operations because the columns are different, so we have to add the missing columns. Here In first dataframe (dataframe1) , the columns [‘ID’, ‘NAME’, ‘Address’] and second dataframe (dataframe2 ) columns are [‘ID’,’Age’].

Now we have to add the Age column to the first dataframe and NAME and Address in the second dataframe, we can do this by using lit() function. This function is available in pyspark.sql.functions which is used to add a column with a value. Here we are going to add a value with None.

Syntax:

for column in [column for column in dataframe1.columns if column not in dataframe2.columns]:

dataframe2 = dataframe2.withColumn(column, lit(None))

where,

dataframe1 is the firstdata frame

dataframe2 is the second dataframe

Example: Add missing columns to both the dataframes

Python3

# importing module
import pyspark
 
# import lit function
from pyspark.sql.functions import lit
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
 
# list  of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
 
# specify column names
columns = ['ID', 'NAME', 'Address']
 
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
 
# list  of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32],
        ]
 
# specify column names
columns = ['ID', 'Age']
 
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
 
# add columns in dataframe1 that are missing from dataframe2
for column in [column for column in dataframe2.columns
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))
 
# add columns in dataframe2 that are missing from dataframe1
for column in [column for column in dataframe1.columns 
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
 
# now see the columns of dataframe1
print(dataframe1.columns)
 
# now see the columns of dataframe2
print(dataframe2.columns)

Output:

['ID', 'NAME', 'Address', 'Age']
['ID', 'Age', 'NAME', 'Address']

Example 1: Using union()

Now we can perform union by using union() function. This function will join two dataframes.

Syntax: dataframe1.union(dataframe2)

Example:

Python3

# importing module
import pyspark
 
# import lit function
from pyspark.sql.functions import lit
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
 
# list  of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
 
# specify column names
columns = ['ID', 'NAME', 'Address']
 
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
 
# list  of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32],
        ]
 
# specify column names
columns = ['ID', 'Age']
 
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
 
# add columns in dataframe1 that are missing from dataframe2
for column in [column for column in dataframe2.columns 
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))
 
# add columns in dataframe2 that are missing from dataframe1
for column in [column for column in dataframe1.columns 
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
 
# perform union
dataframe1.union(dataframe2).show()

Output:

Example 2: Using unionAll()

Syntax: dataframe1.unionAll(dataframe2)

Python3

# importing module
import pyspark
 
# import lit function
from pyspark.sql.functions import lit
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
 
# list  of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
 
# specify column names
columns = ['ID', 'NAME', 'Address']
 
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
 
# list  of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32],
        ]
 
# specify column names
columns = ['ID', 'Age']
 
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
 
# add columns in dataframe1 that are missing 
# from dataframe2
for column in [column for column in dataframe2.columns\
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))
 
# add columns in dataframe2 that are missing 
# from dataframe1
for column in [column for column in dataframe1.columns \
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
 
# perform unionAll
dataframe1.unionAll(dataframe2).show()

Output:

Example 3: Using unionByName

We can also perform unionByName, This will join dataframes by name.

Syntax: dataframe1.unionByName(dataframe2)

Example:

Python3

# importing module
import pyspark
 
# import lit function
from pyspark.sql.functions import lit
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
 
# list  of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
 
# specify column names
columns = ['ID', 'NAME', 'Address']
 
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
 
# list  of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32],
        ]
 
# specify column names
columns = ['ID', 'Age']
 
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
 
# add columns in dataframe1 that are missing from dataframe2
for column in [column for column in dataframe2.columns \
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))
 
# add columns in dataframe2 that are missing from dataframe1
for column in [column for column in dataframe1.columns \
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
 
# perform unionByName
dataframe1.unionByName(dataframe2).show()

Output:

Merge two DataFrames with different amounts of columns in PySpark

Let’s consider the first dataframe

Python3

Let’s consider second dataframe

Python3

Example: Add missing columns to both the dataframes

Python3

Example 1: Using union()

Python3

Example 2: Using unionAll()

Python3

Example 3: Using unionByName

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

2024년 중국에서 구글 이용하는 방법 by 주르지카 파노바

Now’s your chance to grab one of our favorite foldable phones at its lowest price yet

OnePlus design lead dishes on curved glass and the new flagship’s attention to detail

Pixel users report data drops after Google’s December update

Recent Comments

EDITOR PICKS

2024년 중국에서 구글 이용하는 방법 by 주르지카 파노바

Now’s your chance to grab one of our favorite foldable phones at its lowest price yet

OnePlus design lead dishes on curved glass and the new flagship’s attention to detail

POPULAR POSTS

2024년 중국에서 구글 이용하는 방법 by 주르지카 파노바

Now’s your chance to grab one of our favorite foldable phones at its lowest price yet

OnePlus design lead dishes on curved glass and the new flagship’s attention to detail

POPULAR CATEGORY

ABOUT US

FOLLOW US