In this article, we will discuss how to handle duplicate values in a PySpark DataFrame. A dataset may contain repeated rows or repeated data points that are not useful for our task; such repeated values in a DataFrame are called duplicate values.
One common strategy for handling duplicate values is to keep the first occurrence of each value and drop the rest.
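Conceptually, the "keep the first occurrence, drop the rest" strategy can be pictured in plain Python, independent of Spark (this helper is purely illustrative and not part of the PySpark API):

```python
def keep_first(rows, key):
    """Keep the first row seen for each key value; drop later repeats."""
    seen = set()
    result = []
    for row in rows:
        k = key(row)
        if k not in seen:      # first time we see this key
            seen.add(k)
            result.append(row)  # keep it; later rows with the same key are skipped
    return result

rows = [("Ritika", 94), ("Ritika", 84), ("Atirikt", 58)]
print(keep_first(rows, key=lambda r: r[0]))
# [('Ritika', 94), ('Atirikt', 58)]
```

PySpark's dropDuplicates() applies the same idea, but distributed across partitions of the DataFrame.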
dropDuplicates(): PySpark DataFrames provide a dropDuplicates() function that drops duplicate occurrences of data in a DataFrame.
Syntax: dataframe_name.dropDuplicates([column_names])
The function takes a list of column names as its parameter; duplicates are identified with respect to those columns. If no columns are given, entire rows are compared.
Creating Dataframe for demonstration:
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# start a Spark session
spark = SparkSession.builder.appName("Student_Info").getOrCreate()

# initialize our data
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Ritika", 20, "CS32", 84, "Writing"),
         ("Atirikt", 4, "BB21", 58, "Doctor"),
         ("Atirikt", 4, "BB21", 78, "Doctor"),
         ("Ghanshyam", 4, "DD11", 38, "Lawyer"),
         ("Reshav", 18, "EE43", 56, "Timepass")]

# define the schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])

# create the dataframe
df = spark.createDataFrame(data=data2, schema=schema)
df.show()
Output:
Example 1: This example illustrates the working of the dropDuplicates() function with a single column parameter. The dataset is custom-built, so we define a schema and use the spark.createDataFrame() function to create the DataFrame.
Python3
# drop duplicates, keeping one row per Roll Number
df.dropDuplicates(['Roll Number']).show()

# stop the session
spark.stop()
Output:
From the above output, it is clear that the rows with a duplicate Roll Number were removed and only the first occurrence was kept in the DataFrame.
Example 2: This example illustrates the working of the dropDuplicates() function with multiple column parameters. The dataset is custom-built, so we define a schema and use the spark.createDataFrame() function to create the DataFrame.
Python3
# drop rows duplicated in both Roll Number and Name
df.dropDuplicates(['Roll Number', "Name"]).show()

# stop the session
spark.stop()
Output:
From the above output, it is clear that the rows with duplicate Roll Numbers and Names were removed and only the first occurrence was kept in the DataFrame.
Note: Only rows in which both parameters were duplicated are removed. In the above example, the row for “Ghanshyam” had a duplicate Roll Number, but its Name was unique, so it was not removed from the DataFrame. Thus, the function considers all of the given columns together, not just one of them.