Python PySpark – Union and UnionAll

27 July 2024

In this article, we discuss the union() and unionAll() methods of PySpark data frames.

Union in PySpark

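The PySpark union() method combines the rows of two data frames that have the same structure or schema; to combine more than two, chain the calls. Columns are matched by position, not by name, so union() raises an error when the data frames have a different number of columns, but not when the columns merely appear in a different order. Like UNION ALL in SQL, it keeps duplicate rows; call distinct() on the result to remove them.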

Syntax:

dataFrame1.union(dataFrame2)

Here, dataFrame1 and dataFrame2 are the data frames to combine.

Example 1:

In this example, we have combined two data frames, data_frame1 and data_frame2. Note that the schema of both the data frames is the same.

Python3

# Python program to illustrate the 
# working of union() function 
  
import pyspark 
from pyspark.sql import SparkSession 
  
spark = SparkSession.builder.appName('Lazyroar.com').getOrCreate() 
  
# Creating a dataframe 
data_frame1 = spark.createDataFrame( 
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)], 
    ["Student Name", "Overall Percentage"] 
) 
  
# Creating another dataframe 
data_frame2 = spark.createDataFrame( 
    [("Naveen", 91.123), ("Piyush", 90.51)], 
    ["Student Name", "Overall Percentage"] 
) 
  
# union() 
answer = data_frame1.union(data_frame2) 
  
# Print the result of the union() 
answer.show() 

Output:

Example 2:

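In this example, we combine two data frames, data_frame1 and data_frame2, whose columns are declared in opposite orders. Because union() matches columns by position rather than by name, the values of data_frame2 end up under the wrong headers: the percentages fall under Student Name and the names under Overall Percentage. How the mismatched column types are reconciled depends on the Spark version and configuration, so union() is best reserved for data frames that share the same column order and types.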

Python3

# Python program to illustrate the 
# working of union() function 
  
import pyspark 
from pyspark.sql import SparkSession 
  
spark = SparkSession.builder.appName('Lazyroar.com').getOrCreate() 
  
# Creating a data frame 
data_frame1 = spark.createDataFrame( 
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)], 
    ["Student Name", "Overall Percentage"] 
) 
  
# Creating another data frame 
data_frame2 = spark.createDataFrame( 
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")], 
    ["Overall Percentage", "Student Name"] 
) 
  
# Union both the dataframes using union() function 
answer = data_frame1.union(data_frame2) 
  
# Print the union of both the dataframes 
answer.show() 

Output:
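
To make the positional matching in Example 2 concrete, here is a plain-Python model that needs no Spark installation. It is only a sketch of the idea, not Spark's implementation: the helper union_by_position is hypothetical, and it ignores the type coercion that Spark applies to mismatched columns.

```python
def union_by_position(left_columns, left_rows, right_rows):
    """Model of a positional union: keep the left headers, append the right rows as they are."""
    width = len(left_columns)
    for row in right_rows:
        if len(row) != width:
            raise ValueError("both inputs must have the same number of columns")
    return left_columns, list(left_rows) + list(right_rows)


columns, rows = union_by_position(
    ["Student Name", "Overall Percentage"],
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
)

# Label every row with the left-hand headers, as a positional union does.
for row in rows:
    print(dict(zip(columns, row)))
```

The rows that came from the second input are labelled with the first input's headers, so a percentage shows up under "Student Name".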

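unionAll() in PySpark

The unionAll() method does the same task as union(). It was deprecated in Spark 2.0.0 in favor of union(); since Spark 3.0 it is no longer deprecated and remains available as an alias for union(). The two are interchangeable, although this article recommends union().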

Syntax:

dataFrame1.unionAll(dataFrame2)

Here, dataFrame1 and dataFrame2 are the data frames to combine.

Example 1:

In this example, we have combined two data frames, data_frame1 and data_frame2. Note that the schema of both the data frames is the same.

Python3

# Python program to illustrate the 
# working of unionAll() function 
  
import pyspark 
from pyspark.sql import SparkSession 
  
spark = SparkSession.builder.appName('Lazyroar.com').getOrCreate() 
  
# Creating a dataframe 
data_frame1 = spark.createDataFrame( 
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)], 
    ["Student Name", "Overall Percentage"] 
) 
  
# Creating another dataframe 
data_frame2 = spark.createDataFrame( 
    [("Naveen", 91.123), ("Piyush", 90.51)], 
    ["Student Name", "Overall Percentage"] 
) 
  
# Union both the dataframes using unionAll() function 
answer = data_frame1.unionAll(data_frame2) 
  
# Print the union of both the dataframes 
answer.show() 

Output:

Example 2:

In this example, we combine two data frames, data_frame1 and data_frame2, whose columns are declared in opposite orders. Like union(), unionAll() matches columns by position rather than by name, so the values of data_frame2 end up under the wrong headers. unionAll() is therefore best reserved for data frames that share the same column order and types.

Python3

# Python program to illustrate the 
# working of unionAll() function 
  
import pyspark 
from pyspark.sql import SparkSession 
  
spark = SparkSession.builder.appName('Lazyroar.com').getOrCreate() 
  
# Creating a data frame 
data_frame1 = spark.createDataFrame( 
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)], 
    ["Student Name", "Overall Percentage"] 
) 
  
# Creating another data frame 
data_frame2 = spark.createDataFrame( 
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")], 
    ["Overall Percentage", "Student Name"] 
) 
  
# Union both the dataframes using unionAll() function 
answer = data_frame1.unionAll(data_frame2) 
  
# Print the union of both the dataframes 
answer.show() 

Output:
