Thursday, December 26, 2024

Find duplicate rows in a Dataframe based on all or selected columns

In this article, we discuss how to find duplicate rows in a DataFrame based on all columns or on a selected list of columns. For this, we use the DataFrame.duplicated() method of pandas.
 

Syntax : DataFrame.duplicated(subset=None, keep='first')
Parameters: 
subset: A column label or a list of column labels. Its default value is None, which means all columns are compared. When columns are passed, only those columns are used to identify duplicates.
keep: Controls which occurrences are marked as duplicates. It accepts three values, and the default is 'first'. 
 

  • If 'first', the first occurrence is treated as unique and every later occurrence is marked as a duplicate.
  • If 'last', the last occurrence is treated as unique and every earlier occurrence is marked as a duplicate.
  • If False (the boolean value, not a string), all occurrences are marked as duplicates.

Returns: A Boolean Series, indexed like the DataFrame, in which True marks a duplicate row. 
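
To make the three keep options concrete, here is a minimal sketch on a small made-up frame. The columns x and y and their values are assumptions chosen for illustration and are not part of this article's data. It prints the Boolean mask returned for each option.

```python
import pandas as pd

# Hypothetical data for illustration: rows 0, 2 and 3 are identical
demo = pd.DataFrame({"x": [1, 2, 1, 1], "y": ["a", "b", "a", "a"]})

# keep='first' is the default: later occurrences are flagged
first_mask = demo.duplicated()
# keep='last': earlier occurrences are flagged
last_mask = demo.duplicated(keep="last")
# keep=False: every occurrence is flagged
all_mask = demo.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Because the result is a Boolean Series aligned with the frame's index, it can be used directly as a row filter, as in demo[first_mask].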
 

Let’s create a simple DataFrame from a list of tuples, with the column names ‘Name’, ‘Age’ and ‘City’. 
 

Python3




# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns = ['Name', 'Age', 'City'])
 
# Print the DataFrame
print(df)


Output : 
 

[Image: the resulting DataFrame]

Example 1: Select duplicate rows based on all columns. 
Here we do not pass any arguments, so both parameters take their default values, i.e. subset = None and keep = 'first'.
 

Python3




# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns = ['Name', 'Age', 'City'])
 
# Selecting duplicate rows except first
# occurrence based on all columns
duplicate = df[df.duplicated()]
 
print("Duplicate Rows :")
 
# Print the resultant DataFrame
print(duplicate)


Output : 
 

[Image: duplicate rows]
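
As a quick check of Example 1, the following sketch rebuilds the same employees data, prints the index labels of the selected rows, and counts the duplicates by summing the Boolean mask. The variable names mask and n_duplicates are illustrative only.

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Boolean mask with the default keep='first'
mask = df.duplicated()

# Index labels of the rows flagged as duplicates
print(df[mask].index.tolist())  # [3, 4]

# True counts as 1, so the sum is the number of duplicate rows
n_duplicates = int(mask.sum())
print(n_duplicates)  # 2
```

Row 1 is the first ('Saumya', 32, 'Delhi') entry, so it is treated as unique and only its repeats at rows 3 and 4 are selected.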

Example 2: Select duplicate rows based on all columns, treating the last occurrence as unique. 
To mark every duplicate except the last occurrence, pass keep = 'last' as an argument.
 

Python3




# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns = ['Name', 'Age', 'City'])
 
# Selecting duplicate rows except last
# occurrence based on all columns.
duplicate = df[df.duplicated(keep = 'last')]
 
print("Duplicate Rows :")
 
# Print the resultant DataFrame
print(duplicate)


Output : 
 

[Image: duplicate rows with keep = 'last']
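
The third keep option, False, is described in the parameter list but has no example of its own. A minimal sketch, assuming the same employees data, selects every occurrence of a repeated row and compares the result with keep = 'last'. The variable names are illustrative.

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# keep='last': the final occurrence (row 4) is treated as unique
except_last = df[df.duplicated(keep='last')]
print(except_last.index.tolist())  # [1, 3]

# keep=False: every occurrence of a repeated row is selected
all_occurrences = df[df.duplicated(keep=False)]
print(all_occurrences.index.tolist())  # [1, 3, 4]
```

Selecting all occurrences is handy when the repeated rows need to be inspected side by side before deciding which one to keep.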

Example 3: To select duplicate rows based only on selected columns, pass a column name or a list of column names as the subset argument. Here, rows are compared on the 'City' column only. 
 

Python3




# import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns = ['Name', 'Age', 'City'])
 
# Selecting duplicate rows based
# on 'City' column
duplicate = df[df.duplicated('City')]
 
print("Duplicate Rows based on City :")
 
# Print the resultant DataFrame
print(duplicate)


Output : 
 

[Image: duplicate rows based on City]
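
Example 3 passes 'City' positionally. As a small sketch on the same employees data, the call below spells out the subset keyword instead and confirms that both forms produce the same mask. The variable names by_position and by_keyword are illustrative.

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Positional form, as used in Example 3
by_position = df.duplicated('City')
# Keyword form with a one-element list of column labels
by_keyword = df.duplicated(subset=['City'])

# Both calls compare rows on the 'City' column only
print(by_position.equals(by_keyword))  # True

# Rows whose city already appeared in an earlier row
print(df[by_keyword].index.tolist())  # [3, 4, 5, 7]
```

The keyword form is slightly longer but makes it obvious to a reader that the argument restricts the comparison to certain columns.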

Example 4: Select duplicate rows based on more than one column by passing a list of column names.
 

Python3




# import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns = ['Name', 'Age', 'City'])
 
# Selecting duplicate rows based
# on list of column names
duplicate = df[df.duplicated(['Name', 'Age'])]
 
print("Duplicate Rows based on Name and Age :")
 
# Print the resultant DataFrame
print(duplicate)


Output : 
 

[Image: duplicate rows based on Name and Age]
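
The subset and keep parameters can also be combined. The sketch below, again assuming the same employees data, compares rows on 'Name' and 'Age' while varying keep. The variable names are illustrative.

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

cols = ['Name', 'Age']

# Default keep='first': row 1 is the first ('Saumya', 32) pair
first = df[df.duplicated(subset=cols)]
print(first.index.tolist())  # [3, 4, 5]

# keep='last': row 5 is the last ('Saumya', 32) pair
last = df[df.duplicated(subset=cols, keep='last')]
print(last.index.tolist())  # [1, 3, 4]

# keep=False: all four ('Saumya', 32) rows are selected
every = df[df.duplicated(subset=cols, keep=False)]
print(every.index.tolist())  # [1, 3, 4, 5]
```

The two 'Aaditya' rows are not selected in any case because their ages differ, so the ('Name', 'Age') pair is not repeated.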

 
