Tuesday, September 24, 2024

Python | Pandas dataframe.drop_duplicates()

The Pandas drop_duplicates() method removes duplicate rows from a Pandas DataFrame in Python.

Syntax of df.drop_duplicates()

Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters:

  • subset: A column label or a list of column labels. Its default value is None, in which case all columns are compared. When columns are passed, only those columns are considered when identifying duplicates.
  • keep: Controls which of the duplicate rows is kept. It accepts three values, and the default is 'first'.
    • If 'first', the first occurrence is treated as unique and the remaining identical rows as duplicates.
    • If 'last', the last occurrence is treated as unique and the remaining identical rows as duplicates.
    • If False, all identical rows are treated as duplicates, so none of them is kept.
  • inplace: Boolean, default False. If True, the duplicate rows are removed from the DataFrame itself and the method returns None.

Return type: DataFrame with duplicate rows removed according to the arguments passed, or None if inplace=True.
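
To make the three keep options concrete, here is a minimal sketch on a small, made-up frame (the column names and values are illustrative only). It compares which index labels survive under each setting.

```python
import pandas as pd

# Illustrative data: rows 0 and 4 are identical, and so are rows 1 and 2
df = pd.DataFrame({
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
})

# keep="first" (the default): the first occurrence of each repeated row survives
first = df.drop_duplicates(keep="first")

# keep="last": the last occurrence survives instead
last = df.drop_duplicates(keep="last")

# keep=False: every row that has a duplicate is dropped
none_kept = df.drop_duplicates(keep=False)

print(first.index.tolist())      # [0, 1, 3]
print(last.index.tolist())       # [2, 3, 4]
print(none_kept.index.tolist())  # [3]
```

Note that the original index labels are preserved in each result, so gaps in the index show which rows were removed.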

Example:

As we can see, one of the two TeamA rows and one of the two TeamB rows have been dropped because they are duplicates.

Python3




import pandas as pd

data = {
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
    "C": [True, False, False, False, True]
}

df = pd.DataFrame(data)

# drop rows that are identical in every column, keeping the first occurrence
print(df.drop_duplicates())


Output: 

       A   B      C
0  TeamA  50   True
1  TeamB  40  False
3  TeamC  30  False
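
The subset parameter described earlier can be sketched in the same way. The frame below is made up for illustration; it has no fully identical rows, so duplicates only appear once the comparison is restricted to column "A".

```python
import pandas as pd

# Illustrative data: "TeamA" appears twice, but with different values in "B"
df = pd.DataFrame({
    "A": ["TeamA", "TeamB", "TeamA"],
    "B": [50, 40, 60],
})

# Comparing all columns: no two rows are identical, so nothing is dropped
all_columns = df.drop_duplicates()

# Comparing on column "A" only: row 2 repeats "TeamA" and is dropped
by_team = df.drop_duplicates(subset="A")

# A list of column labels is accepted as well
by_both = df.drop_duplicates(subset=["A", "B"])

print(len(all_columns), len(by_team), len(by_both))  # 3 2 3
```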

The examples below read their data from a CSV file named employees.csv.

Example 1: Removing rows with the same First Name 

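In the following example, every row whose First Name occurs more than once is removed. Because keep=False is used, no copy of a repeated name is kept, and because inplace=True is used, the data frame is modified directly rather than a new one being returned.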

Python3




# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("employees.csv")
  
# sorting by first name
data.sort_values("First Name", inplace=True)
  
# dropping ALL duplicate values
data.drop_duplicates(subset="First Name",
                     keep=False, inplace=True)
  
# displaying data
print(data)


Output: 

In the output, all rows that shared a First Name with another row have been removed from the data frame.
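
Since the CSV file may not be at hand, the same call can be tried on an in-memory frame. This is only a sketch: the `employees` frame and its names and teams are hypothetical stand-ins, not rows from employees.csv. It also contrasts the default behaviour (a new DataFrame is returned) with inplace=True (None is returned).

```python
import pandas as pd

# Hypothetical stand-in for employees.csv; names and teams are made up
employees = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Ann", "Cara"],
    "Team": ["Sales", "Legal", "Finance", "Sales"],
})

# Without inplace, the original is untouched and a new DataFrame is returned
unique_names = employees.drop_duplicates(subset="First Name", keep=False)
print(len(employees), len(unique_names))  # 4 2

# With inplace=True, the call returns None and employees itself shrinks
result = employees.drop_duplicates(subset="First Name", keep=False, inplace=True)
print(result, len(employees))  # None 2
```

Both "Ann" rows disappear even though their Team values differ, because only the First Name column is compared.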
 

 

Example 2: Removing rows with all duplicate values

In this example, rows that are identical in every column are removed. Since the CSV file does not contain such a row, an existing row is duplicated and inserted into the data frame first.

Python3




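# importing pandas package
import pandas as pd

# re-reading the CSV, because Example 1 modified the data frame in place
data = pd.read_csv("employees.csv")

# length before adding row
length1 = len(data)

# manually inserting a duplicate of row 440 under the new index label 1001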
data.loc[1001] = [data["First Name"][440],
                  data["Gender"][440],
                  data["Start Date"][440],
                  data["Last Login Time"][440],
                  data["Salary"][440],
                  data["Bonus %"][440],
                  data["Senior Management"][440],
                  data["Team"][440]]
  
# length after adding row
length2 = len(data)
  
# sorting by first name
data.sort_values("First Name", inplace=True)
  
# dropping duplicate values
data.drop_duplicates(keep=False, inplace=True)
  
# length after removing duplicates
length3 = len(data)
  
# printing all data frame lengths
print(length1, length2, length3)


Output: 

The length after removing duplicates is 999. Since the keep parameter was set to False, both the original row 440 and its inserted copy were removed, rather than only one of them.
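
The same before-and-after length check can be reproduced without the CSV file. The sketch below uses a hypothetical three-row frame and also calls DataFrame.duplicated(keep=False), which flags the rows that drop_duplicates(keep=False) would remove.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the CSV data
data = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Cara"],
    "Salary": [100, 200, 300],
})
length1 = len(data)

# Append an exact copy of row 1 under the new index label 3
data.loc[3] = ["Bob", 200]
length2 = len(data)

# duplicated(keep=False) marks every member of a duplicate group as True
flags = data.duplicated(keep=False)
print(flags.tolist())  # [False, True, False, True]

# keep=False drops both the original row and its copy
deduped = data.drop_duplicates(keep=False)
length3 = len(deduped)
print(length1, length2, length3)  # 3 4 2
```

Inspecting the flags first is a cheap way to confirm which rows will disappear before committing to an in-place drop.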
 

 

Dominic Rubhabha-Wardslaus