In this article, we are going to drop duplicate rows from a DataFrame using the distinct() and dropDuplicates() functions in PySpark with Python.
Let's create a sample DataFrame:
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output:
Method 1: Distinct
Distinct data means unique data. The distinct() function removes the duplicate rows from the DataFrame.
Syntax: dataframe.distinct()
where dataframe is the DataFrame created from the nested lists using PySpark.
Python3
print('Distinct data after dropping duplicate rows')

# display distinct data
dataframe.distinct().show()
Output:
We can use the select() function along with distinct() to get distinct values from particular columns.
Syntax: dataframe.select(['column 1', 'column n']).distinct().show()
Python3
# display distinct data in the Employee ID
# and Employee NAME columns
dataframe.select(['Employee ID', 'Employee NAME']).distinct().show()
Output:
Method 2: dropDuplicates
Syntax: dataframe.dropDuplicates()
where dataframe is the DataFrame created from the nested lists using PySpark.
Python3
# remove duplicate rows using the
# dropDuplicates() function
dataframe.dropDuplicates().show()
Output:
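As a quick sanity check, we can compare the row counts before and after removing duplicates with the count() function. This is a minimal sketch that reuses the dataframe created above.
Python3
# count rows before and after dropping duplicate rows
print('Rows before:', dataframe.count())
print('Rows after:', dataframe.dropDuplicates().count())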
Python program to remove duplicate rows based on specific columns:
Python3
# remove duplicate data using the
# dropDuplicates() function on two columns
dataframe.select(['Employee ID', 'Employee NAME']).dropDuplicates().show()
Output:
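Note that the select() approach above keeps only the selected columns in the result. If we want to drop duplicates based on specific columns while still keeping every column, dropDuplicates() also accepts a list of column names. A minimal sketch, again reusing the dataframe created above:
Python3
# drop duplicates based on two columns
# while keeping all columns in the result
dataframe.dropDuplicates(['Employee ID', 'Employee NAME']).show()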