Saturday, November 16, 2024
Google search engine
HomeLanguagesConcatenate two PySpark dataframes

Concatenate two PySpark dataframes

In this article, we are going to see how to concatenate two pyspark dataframe using Python.

Creating Dataframe for demonstration:

Python3




# Importing necessary libraries
from pyspark.sql import SparkSession
  
# Create a spark session
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
  
# Create data in dataframe
data = [(('Ram'), '1991-04-01', 'M', 3000),
        (('Mike'), '2000-05-19', 'M', 4000),
        (('Rohini'), '1978-09-05', 'M', 4000),
        (('Maria'), '1967-12-01', 'F', 4000),
        (('Jenis'), '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df1 = spark.createDataFrame(data=data, schema=columns)
  
# Print the dataframe
df1.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Creating Second dataframe for demonstration:

Python3




# Create data in dataframe
data2 = [(('Mohi'), '1991-04-01', 'M', 3000),
         (('Ani'), '2000-05-19', 'F', 4300),
         (('Shipta'), '1978-09-05', 'F', 4200),
         (('Jessy'), '1967-12-01', 'F', 4010),
         (('kanne'), '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df2 = spark.createDataFrame(data=data, schema=columns)
  
# Print the dataframe
df2.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 1: Using Union()

Union() methods of the DataFrame are employed to mix two DataFrame’s of an equivalent structure/schema.

Syntax: dataframe_1.union(dataframe_2)

where,

  1. dataframe_1 is the first dataframe
  2. dataframe_2 is the second dataframe

Example:

Python3




# union the above created dataframes
result = df1.union(df2)
  
# display
result.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 2: Using unionByName()

 In Spark 3.1, you can easily achieve this using unionByName() for Concatenating the dataframe

Syntax: dataframe_1.unionByName(dataframe_2)

where,

  1. dataframe_1 is the first dataframe
  2. dataframe_2 is the second dataframe

Example:

Python3




# union the two dataftames by using unionByname
result1 = df1.unionByName(df2)
  
# display
result1.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 3: Using functools

Functools module provides functions for working with other functions and callable objects to use or extend them without completely rewriting them.

Syntax:

functools.reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)

where,

  • df1 is the first dataframe
  • df2 is the second dataframe

We create dataframes with columns ‘a’ and ‘b’ of some random values and pass these three dataframes to our above-created method unionAll() and obtain the resultant dataframe as output and show the result.

Example:

Python3




import functools
# explicit function
  
  
def unionAll(dfs):
    return functools.reduce(lambda df1, df2: df1.union(
      df2.select(df1.columns)), dfs)
  
  
# unionAll
result3 = unionAll([df1, df2])
result3.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

RELATED ARTICLES

Most Popular

Recent Comments