In this article, we will discuss how to join on multiple columns in a PySpark DataFrame using Python.
Let’s create the first dataframe:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]

# specify column names
columns = ['ID1', 'NAME1']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

dataframe.show()
Output:
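With the data above, dataframe.show() produces a table like:

+---+------+
|ID1| NAME1|
+---+------+
|  1|sravan|
|  2|ojsawi|
|  3| bobby|
+---+------+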
Let’s create the second dataframe:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
        (4, "rohith"), (5, "gnanesh")]

# specify column names
columns = ['ID2', 'NAME2']

# creating a dataframe from the list of data
dataframe1 = spark.createDataFrame(data, columns)

dataframe1.show()
Output:
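Likewise, dataframe1.show() produces a table like:

+---+-------+
|ID2|  NAME2|
+---+-------+
|  1| sravan|
|  2| ojsawi|
|  3|  bobby|
|  4| rohith|
|  5|gnanesh|
+---+-------+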
We can join on multiple columns by passing a combined condition to the join() function, chaining the individual column comparisons with conditional operators (& for and, | for or). Note that each comparison must be wrapped in parentheses, because & and | bind more tightly than == in Python. (A variant for identically named join columns is sketched after the parameter list below.)
Syntax: dataframe.join(dataframe1, (dataframe.column1 == dataframe1.column1) & (dataframe.column2 == dataframe1.column2))
where,
- dataframe is the first dataframe
- dataframe1 is the second dataframe
- column1 is the first matching column in both dataframes
- column2 is the second matching column in both dataframes
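As a side note: when the matching columns have the same name in both dataframes, join() also accepts a list of column names for its on parameter, which joins on all of them and keeps a single copy of each join column in the result. A minimal sketch, assuming hypothetical dataframes df_a and df_b that share the columns ID and NAME:

Python3

# hypothetical dataframes that share the column names ID and NAME
df_a = spark.createDataFrame([(1, "sravan"), (2, "ojsawi")], ['ID', 'NAME'])
df_b = spark.createDataFrame([(1, "sravan"), (3, "bobby")], ['ID', 'NAME'])

# passing a list of column names joins on both columns and keeps
# one ID and one NAME column in the result (inner join by default)
df_a.join(df_b, ['ID', 'NAME']).show()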
Example 1: PySpark code to join the two dataframes on multiple columns (ID and name) with the and (&) operator
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]

# specify column names
columns = ['ID1', 'NAME1']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
        (4, "rohith"), (5, "gnanesh")]

# specify column names
columns = ['ID2', 'NAME2']

# creating a dataframe from the list of data
dataframe1 = spark.createDataFrame(data, columns)

# join rows where both the ID and the name columns match
dataframe.join(dataframe1,
               (dataframe.ID1 == dataframe1.ID2) &
               (dataframe.NAME1 == dataframe1.NAME2)).show()
Output:
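All three rows of the first dataframe match on both ID and name, so the inner join returns (row order may vary):

+---+------+---+------+
|ID1| NAME1|ID2| NAME2|
+---+------+---+------+
|  1|sravan|  1|sravan|
|  2|ojsawi|  2|ojsawi|
|  3| bobby|  3| bobby|
+---+------+---+------+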
Example 2: Join the two dataframes with the or (|) operator
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]

# specify column names
columns = ['ID1', 'NAME1']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
        (4, "rohith"), (5, "gnanesh")]

# specify column names
columns = ['ID2', 'NAME2']

# creating a dataframe from the list of data
dataframe1 = spark.createDataFrame(data, columns)

# join rows where either the ID or the name column matches
dataframe.join(dataframe1,
               (dataframe.ID1 == dataframe1.ID2) |
               (dataframe.NAME1 == dataframe1.NAME2)).show()
Output:
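With this particular data the | condition matches the same three rows as Example 1, since every row of the first dataframe matches on both columns and the extra rows (rohith, gnanesh) match on neither, so the result (row order may vary) is:

+---+------+---+------+
|ID1| NAME1|ID2| NAME2|
+---+------+---+------+
|  1|sravan|  1|sravan|
|  2|ojsawi|  2|ojsawi|
|  3| bobby|  3| bobby|
+---+------+---+------+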