How to union multiple dataframes in PySpark?

27 July 2024

In this article, we will discuss how to union multiple data frames in PySpark.

Method 1: union() function in PySpark

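The PySpark union() function combines two data frames that have the same structure or schema; to combine more than two, chain the calls. Columns are matched by position, not by name, so the call raises an error when the number of columns differs, and it can place values under the wrong column when the columns appear in a different order.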

Syntax: data_frame1.union(data_frame2)

Where,

data_frame1 and data_frame2 are the dataframes

Example 1:

Python3

# Python program to illustrate the
# working of union() function
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('Lazyroar.com').getOrCreate()
  
# Creating a dataframe
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)
  
# Creating another dataframe
data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)
  
# union()
answer = data_frame1.union(data_frame2)
  
# Print the result of the union()
answer.show()

Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      Naveen|            91.123|
|      Piyush|             90.51|
+------------+------------------+

Example 2:

In this example, we combine two data frames, data_frame1 and data_frame2, whose columns are in a different order. Because union() matches columns by position rather than by name, the values from data_frame2 land under the wrong headers, so the output is not the desired one.

Python3

# Python program to illustrate the working
# of union() function
  
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('Lazyroar.com').getOrCreate()
  
# Creating a data frame
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)
  
# Creating another data frame
data_frame2 = spark.createDataFrame(
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
    ["Overall Percentage", "Student Name"]
)
  
# Union both the dataframes using union() method
answer = data_frame1.union(data_frame2)
  
# Print the combination of both the dataframes
answer.show()

Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      91.123|            Naveen|
|       90.51|            Piyush|
|       87.67|            Hitesh|
+------------+------------------+

Method 2: unionByName() function in PySpark

The PySpark unionByName() function also combines two data frames, but it matches columns by name rather than by position. It can therefore combine data frames whose columns appear in a different order.

Syntax: data_frame1.unionByName(data_frame2)

Where,

data_frame1 and data_frame2 are the dataframes

Example 1:

In this example, both data frames, data_frame1 and data_frame2, have the same schema.

Python3

# Python program to illustrate the working
# of unionByName() function
  
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('Lazyroar.com').getOrCreate()
  
# Creating a dataframe
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)
  
# Creating another dataframe
data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)
  
# Union both the dataframes using unionByName() method
answer = data_frame1.unionByName(data_frame2)
  
# Print the result of the unionByName()
answer.show()

Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      Naveen|            91.123|
|      Piyush|             90.51|
+------------+------------------+

Example 2:

In this example, data_frame1 and data_frame2 have the same columns in a different order, and unionByName() still produces the desired output because it matches columns by name.

Python3

# Python program to illustrate the
# working of unionByName() function
  
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('Lazyroar.com').getOrCreate()
  
# Creating a data frame
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)
  
# Creating another data frame
data_frame2 = spark.createDataFrame(
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
    ["Overall Percentage", "Student Name"]
)
  
# Union both the dataframes using unionByName() method
answer = data_frame1.unionByName(data_frame2)
  
# Print the combination of both the dataframes
answer.show()

Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      Naveen|            91.123|
|      Piyush|             90.51|
|      Hitesh|             87.67|
+------------+------------------+

Example 3:

Let's now consider two data frames that contain an unequal number of columns. In this case, we need to pass the additional argument allowMissingColumns=True to the unionByName() function (available since Spark 3.1). Columns missing from one data frame are filled with null in the result.

Python3

# Python program to illustrate the working
# of unionByName() function with an
# additional argument
  
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('Lazyroar.com').getOrCreate()
  
# Creating a dataframe
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98, "Computer Science"),
     ("Harshit", 80.31, "Information Technology")],
    ["Student Name", "Overall Percentage", "Department"]
)
  
# Creating another dataframe
data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)
  
# Union both the dataframes using unionByName() method
res = data_frame1.unionByName(data_frame2, allowMissingColumns=True)
  
# Print the result of the unionByName()
res.show()

Output:

+------------+------------------+--------------------+
|Student Name|Overall Percentage|          Department|
+------------+------------------+--------------------+
|   Bhuwanesh|             82.98|    Computer Science|
|     Harshit|             80.31|Information Techn...|
|      Naveen|            91.123|                null|
|      Piyush|             90.51|                null|
+------------+------------------+--------------------+
