In this article, we are going to see how to concatenate two PySpark DataFrames using Python.
Creating Dataframe for demonstration:
Python3
# Importing necessary libraries
from pyspark.sql import SparkSession

# Create a spark session
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()

# Create data in dataframe
data = [('Ram', '1991-04-01', 'M', 3000),
        ('Mike', '2000-05-19', 'M', 4000),
        ('Rohini', '1978-09-05', 'M', 4000),
        ('Maria', '1967-12-01', 'F', 4000),
        ('Jenis', '1980-02-17', 'F', 1200)]

# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]

# Create the spark dataframe
df1 = spark.createDataFrame(data=data, schema=columns)

# Print the dataframe
df1.show()
Output:
+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+
Creating the second dataframe for demonstration:
Python3
# Create data in dataframe
data2 = [('Mohi', '1991-04-01', 'M', 3000),
         ('Ani', '2000-05-19', 'F', 4300),
         ('Shipta', '1978-09-05', 'F', 4200),
         ('Jessy', '1967-12-01', 'F', 4010),
         ('kanne', '1980-02-17', 'F', 1200)]

# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]

# Create the spark dataframe from data2
df2 = spark.createDataFrame(data=data2, schema=columns)

# Print the dataframe
df2.show()
Output:
+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
Method 1: Using union()
The union() method of DataFrame is used to combine two DataFrames with an equivalent structure/schema. It matches columns by position, not by name, and it does not remove duplicate rows.
Syntax: dataframe_1.union(dataframe_2)
where,
- dataframe_1 is the first dataframe
- dataframe_2 is the second dataframe
Example:
Python3
# union the above created dataframes
result = df1.union(df2)

# display
result.show()
Output:
+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
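Because union() keeps duplicate rows, DataFrames that share rows will produce repeated entries in the result. If you want each row to appear only once, you can follow the union with distinct() (or dropDuplicates()). A minimal sketch, reusing df1 and df2 from above:
Python3
# union the dataframes and drop duplicate rows
# (distinct() removes rows that appear more than once in the result)
deduped = df1.union(df2).distinct()

# display
deduped.show()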
Method 2: Using unionByName()
You can also concatenate the DataFrames using unionByName(), which matches columns by name rather than by position (and, since Spark 3.1, can also handle DataFrames whose sets of columns differ).
Syntax: dataframe_1.unionByName(dataframe_2)
where,
- dataframe_1 is the first dataframe
- dataframe_2 is the second dataframe
Example:
Python3
# union the two dataframes by using unionByName()
result1 = df1.unionByName(df2)

# display
result1.show()
Output:
+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
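The advantage of unionByName() shows up when the two DataFrames have the same columns in a different order, or (on Spark 3.1+) when one DataFrame is missing a column entirely. The sketch below uses small hypothetical DataFrames df_a, df_b, and df_c, created only for illustration:
Python3
# df_a and df_b have the same columns but in a different order
df_a = spark.createDataFrame([('Ram', 3000)], ["Name", "salary"])
df_b = spark.createDataFrame([(4000, 'Mike')], ["salary", "Name"])

# unionByName() lines the rows up by column name, not by position
df_a.unionByName(df_b).show()

# Since Spark 3.1, missing columns can be filled with nulls
df_c = spark.createDataFrame([('Maria',)], ["Name"])
df_a.unionByName(df_c, allowMissingColumns=True).show()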
Method 3: Using functools
The functools module provides functions for working with other functions and callable objects, so you can use or extend them without completely rewriting them. Here, functools.reduce() applies union() pairwise across a list of DataFrames.
Syntax:
functools.reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)
where,
- df1 is the first dataframe
- df2 is the second dataframe
We define a helper function unionAll() that reduces a list of DataFrames with union(), selecting the columns in the order of the first DataFrame, then pass our two DataFrames to it and show the resulting DataFrame.
Example:
Python3
import functools

# explicit function
def unionAll(dfs):
    return functools.reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)

# unionAll
result3 = unionAll([df1, df2])
result3.show()
Output:
+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
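The reduce-based approach is most useful when there are more than two DataFrames to concatenate: the same helper takes a list of any length. As a quick sketch, the call below passes the two existing DataFrames plus df1 a second time, purely to illustrate that the list can grow:
Python3
# concatenate any number of dataframes in one call
result4 = unionAll([df1, df2, df1])
result4.show()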