In this article, we are going to drop duplicate rows from a DataFrame using the distinct() and dropDuplicates() functions in PySpark with Python.
Let's create a sample DataFrame:
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output:
Method 1: Distinct
Distinct data means unique data. The distinct() function removes the duplicate rows from the DataFrame.
Syntax: dataframe.distinct()
where dataframe is the DataFrame created from the nested lists using PySpark.
Python3
print('Distinct data after dropping duplicate rows')

# display distinct data
dataframe.distinct().show()
Output:
We can use the select() function along with distinct() to get distinct values from particular columns.
Syntax: dataframe.select(['column 1', 'column n']).distinct().show()
Python3
# display distinct data in the Employee ID
# and Employee NAME columns
dataframe.select(['Employee ID', 'Employee NAME']).distinct().show()
Output:
Method 2: dropDuplicates
Syntax: dataframe.dropDuplicates()
where dataframe is the DataFrame created from the nested lists using PySpark.
Python3
# remove duplicate rows using the
# dropDuplicates() function
dataframe.dropDuplicates().show()
Output:
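As a quick sanity check, we can compare the row counts before and after removing duplicates with the count() function. This is a minimal sketch that reuses the dataframe created above.
Python3
# count rows before and after dropping duplicate rows
print('Rows before:', dataframe.count())
print('Rows after:', dataframe.dropDuplicates().count())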
Python program to remove duplicate rows based on specific columns:
Python3
# remove duplicate data using the
# dropDuplicates() function on two columns
dataframe.select(['Employee ID', 'Employee NAME']).dropDuplicates().show()
Output:
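Note that the select() approach above keeps only the selected columns in the result. If we want to drop duplicates based on specific columns while still keeping every column, dropDuplicates() also accepts a list of column names. A minimal sketch, again reusing the dataframe created above:
Python3
# drop duplicates based on two columns
# while keeping all columns in the result
dataframe.dropDuplicates(['Employee ID', 'Employee NAME']).show()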