In this article, we will drop duplicate rows from a dataframe based on specific columns using PySpark in Python. Duplicate data means the same data repeated based on some condition (column values). For this, we use the dropDuplicates() method:
Syntax: dataframe.dropDuplicates(['column 1', 'column 2', 'column n']).show()
where,
- dataframe is the input dataframe, and the column names are the specific columns on which duplicates are checked
- show() method is used to display the dataframe
Let’s create the dataframe.
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan"],
        ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"],
        ["4", "sridevi", "vignan"],
        ["1", "sravan", "vignan"],
        ["5", "gnanesh", "iit"]]

# specify column names
columns = ['student ID', 'student NAME', 'college']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output:
Dropping based on one column
Python3
# remove duplicate rows based on the
# college column
dataframe.dropDuplicates(['college']).show()
Output:
Dropping based on multiple columns
Python3
# remove duplicate rows based on the college
# and student ID columns
dataframe.dropDuplicates(['college', 'student ID']).show()
Output: