Concatenate two PySpark dataframes

28 July 2024

0

In this article, we are going to see how to concatenate two pyspark dataframe using Python.

Creating Dataframe for demonstration:

Python3

# Importing necessary libraries
from pyspark.sql import SparkSession
  
# Create a spark session
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
  
# Create data in dataframe
data = [(('Ram'), '1991-04-01', 'M', 3000),
        (('Mike'), '2000-05-19', 'M', 4000),
        (('Rohini'), '1978-09-05', 'M', 4000),
        (('Maria'), '1967-12-01', 'F', 4000),
        (('Jenis'), '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df1 = spark.createDataFrame(data=data, schema=columns)
  
# Print the dataframe
df1.show()

Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Creating Second dataframe for demonstration:

Python3

# Create data in dataframe
data2 = [(('Mohi'), '1991-04-01', 'M', 3000),
         (('Ani'), '2000-05-19', 'F', 4300),
         (('Shipta'), '1978-09-05', 'F', 4200),
         (('Jessy'), '1967-12-01', 'F', 4010),
         (('kanne'), '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df2 = spark.createDataFrame(data=data, schema=columns)
  
# Print the dataframe
df2.show()

Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 1: Using Union()

Union() methods of the DataFrame are employed to mix two DataFrame’s of an equivalent structure/schema.

Syntax: dataframe_1.union(dataframe_2)

where,

dataframe_1 is the first dataframe

dataframe_2 is the second dataframe

Example:

Python3

# union the above created dataframes
result = df1.union(df2)
  
# display
result.show()

Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 2: Using unionByName()

In Spark 3.1, you can easily achieve this using unionByName() for Concatenating the dataframe

Syntax: dataframe_1.unionByName(dataframe_2)

where,

dataframe_1 is the first dataframe

dataframe_2 is the second dataframe

Example:

Python3

# union the two dataftames by using unionByname
result1 = df1.unionByName(df2)
  
# display
result1.show()

Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 3: Using functools

Functools module provides functions for working with other functions and callable objects to use or extend them without completely rewriting them.

Syntax:

functools.reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)

where,

df1 is the first dataframe

df2 is the second dataframe

We create dataframes with columns ‘a’ and ‘b’ of some random values and pass these three dataframes to our above-created method unionAll() and obtain the resultant dataframe as output and show the result.

Example:

Python3

import functools
# explicit function
  
  
def unionAll(dfs):
    return functools.reduce(lambda df1, df2: df1.union(
      df2.select(df1.columns)), dfs)
  
  
# unionAll
result3 = unionAll([df1, df2])
result3.show()

Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Concatenate two PySpark dataframes

Python3

Python3

Method 1: Using Union()

Python3

Method 2: Using unionByName()

Python3

Method 3: Using functools

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

How to Unban Someone on Twitch in 2025: Quick & Easy by Penka Hristovska

Best VPNs for WhatsApp in 2025: Fast and Reliable by Danica Djokic

How to Change Facebook Location in 2025 + Marketplace & Dating by Danica Djokic

Is Akinator Safe for Kids? What Parents Need to Know in 2025 by Penka Hristovska

Recent Comments

EDITOR PICKS

How to Unban Someone on Twitch in 2025: Quick & Easy by Penka Hristovska

Best VPNs for WhatsApp in 2025: Fast and Reliable by Danica Djokic

How to Change Facebook Location in 2025 + Marketplace & Dating by Danica Djokic

POPULAR POSTS

How to Unban Someone on Twitch in 2025: Quick & Easy by Penka Hristovska

Best VPNs for WhatsApp in 2025: Fast and Reliable by Danica Djokic

How to Change Facebook Location in 2025 + Marketplace & Dating by Danica Djokic

POPULAR CATEGORY

ABOUT US

FOLLOW US