In this article, we will drop duplicate rows from a DataFrame using PySpark in Python.
Before starting, we will create a DataFrame for demonstration:
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output: all six rows are displayed, including the repeated rows for employees 1 (sravan) and 4 (sridevi).
Method 1: Using distinct() method
distinct() returns a new DataFrame that contains only the unique rows. The original DataFrame is not modified.
Syntax: dataframe.distinct()
where dataframe is the DataFrame created from the nested lists above.
Example 1: Python program to drop duplicate data using distinct() function
Python3
print('distinct data after dropping duplicate rows')

# display distinct data
dataframe.distinct().show()
Output: four unique rows remain (employee IDs 1 to 4). The order of the rows is not guaranteed.
Example 2: Python program to select distinct data in only two columns.
We can use the select() function together with distinct() to get the distinct values of particular columns.
Syntax: dataframe.select(['column 1', 'column n']).distinct().show()
Python3
# display distinct data in
# Employee ID and Employee NAME
dataframe.select(['Employee ID', 'Employee NAME']).distinct().show()
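Output: four distinct (Employee ID, Employee NAME) pairs are displayed. The Company column is not part of the result.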
Method 2: Using dropDuplicates() method
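Called without arguments, dropDuplicates() removes rows that are identical in every column, which gives the same result as distinct().
Syntax: dataframe.dropDuplicates()
where dataframe is the DataFrame created from the nested lists above.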
Example 1: Python program to remove duplicate data from the employee table.
Python3
# remove duplicate data
# using the dropDuplicates() function
dataframe.dropDuplicates().show()
Output: four unique rows remain (employee IDs 1 to 4). The order of the rows is not guaranteed.
Example 2: Python program to remove duplicate rows after selecting specific columns
Python3
# remove duplicate data
# using the dropDuplicates() function
# in two columns
dataframe.select(['Employee ID', 'Employee NAME']).dropDuplicates().show()
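Output: four distinct (Employee ID, Employee NAME) pairs are displayed. The Company column is not part of the result.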