The pandas drop_duplicates() method removes duplicate rows from a DataFrame in Python.
Syntax of df.drop_duplicates()
Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters:
- subset: A column label or a list of column labels. The default is None, in which case all columns are compared. When columns are passed, only those columns are considered when identifying duplicates.
- keep: Controls which of the duplicate rows, if any, is retained. It accepts three values, and the default is 'first'.
  - If 'first', the first occurrence is kept and the later matching rows are treated as duplicates.
  - If 'last', the last occurrence is kept and the earlier matching rows are treated as duplicates.
  - If False, all matching rows are treated as duplicates and every one of them is dropped.
- inplace: Boolean, default False. If True, the duplicate rows are removed from the DataFrame itself and nothing is returned. If False, the DataFrame is left unchanged and a new one is returned.
Return type: A DataFrame with duplicate rows removed according to the arguments passed, or None if inplace=True.
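To make the parameters concrete, here is a minimal sketch that applies each keep setting, and then subset, to a small made-up DataFrame. The column names and values are assumptions chosen for illustration only.

```python
import pandas as pd

# Illustrative data (assumed for this sketch): rows 0 and 2 are identical,
# and row 3 shares only its "name" with them.
df = pd.DataFrame({"name": ["x", "y", "x", "x"], "score": [1, 2, 1, 3]})

# keep="first" (the default): the first of the identical rows survives.
first = df.drop_duplicates(keep="first")

# keep="last": the last of the identical rows survives.
last = df.drop_duplicates(keep="last")

# keep=False: every row that has an identical twin is dropped.
none_kept = df.drop_duplicates(keep=False)

# subset="name": only the "name" column is compared.
by_name = df.drop_duplicates(subset="name")

print(first.index.tolist())      # [0, 1, 3]
print(last.index.tolist())       # [1, 2, 3]
print(none_kept.index.tolist())  # [1, 3]
print(by_name.index.tolist())    # [0, 1]
```

Note that the surviving rows keep their original index labels, which is why the printed indexes have gaps.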
Example:
As the output shows, one of the two TeamA rows and one of the two TeamB rows are dropped because each duplicates an earlier row.
Python3
import pandas as pd

# build a small data frame that contains two pairs of identical rows
data = {
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
    "C": [True, False, False, False, True]
}

df = pd.DataFrame(data)

# print the data frame with duplicate rows removed
print(df.drop_duplicates())
Output:
       A   B      C
0  TeamA  50   True
1  TeamB  40  False
3  TeamC  30  False
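The call in this example returns a new DataFrame and leaves df untouched. As a short sketch of how that differs from inplace=True, using the same data (the variable names deduped and result are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
    "C": [True, False, False, False, True],
})

# Default (inplace=False): a new DataFrame is returned, df keeps all 5 rows.
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 5 3

# inplace=True: df itself is modified and the call returns None.
result = df.drop_duplicates(inplace=True)
print(result, len(df))  # None 3
```

Because the in-place call returns None, assigning its result back to df would discard the data.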
The examples below read their data from a CSV file named employees.csv.
Example 1: Removing rows with the same First Name
In the following example, every row whose First Name appears more than once is removed. Because inplace=True is passed, the data frame is modified directly rather than a new one being returned.
Python3
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# sorting by first name
data.sort_values("First Name", inplace=True)

# dropping ALL duplicate values
data.drop_duplicates(subset="First Name", keep=False, inplace=True)

# displaying data
print(data)
Output:
Because keep=False, every row whose First Name occurs more than once is removed from the data frame, including the first occurrence.
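Since the CSV file is not reproduced here, the same behavior can be sketched on a small in-memory stand-in. The rows below are hypothetical; only the column names First Name and Team come from the example above.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv; the values are made up.
data = pd.DataFrame({
    "First Name": ["Maria", "Douglas", "Maria", "Thomas"],
    "Team": ["Finance", "Marketing", "Sales", "Finance"],
})

# Compare only the "First Name" column and drop every repeated name.
unique_names = data.drop_duplicates(subset="First Name", keep=False)

print(unique_names["First Name"].tolist())  # ['Douglas', 'Thomas']
```

Both Maria rows are dropped even though their Team values differ, because only the subset column is compared. The sketch skips the sort step: drop_duplicates does not need sorted input, and sorting only changes the order of the rows that remain.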
Example 2: Removing rows with all duplicate values
In this example, rows that are identical to another row in every column are removed. Since the CSV file has no such rows, an existing row is first duplicated and inserted into the data frame.
Python3
import pandas as pd

# reload the CSV file so the example starts from the unmodified data
data = pd.read_csv("employees.csv")

# length before adding row
length1 = len(data)

# manually inserting a duplicate of row 440
data.loc[1001] = [data["First Name"][440], data["Gender"][440],
                  data["Start Date"][440], data["Last Login Time"][440],
                  data["Salary"][440], data["Bonus %"][440],
                  data["Senior Management"][440], data["Team"][440]]

# length after adding row
length2 = len(data)

# sorting by first name
data.sort_values("First Name", inplace=True)

# dropping ALL rows that have an exact duplicate
data.drop_duplicates(keep=False, inplace=True)

# length after removing duplicates
length3 = len(data)

# printing all data frame lengths
print(length1, length2, length3)
Output:
The length after removing duplicates is 999. Since the keep parameter was set to False, both the original row and its inserted copy were removed, leaving two fewer rows than the data frame held after the insertion.
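The same length bookkeeping can be reproduced without the CSV file. The following is a minimal sketch on a hypothetical three-row data frame; the values are made up, and the new row is added under the next unused index label rather than 1001.

```python
import pandas as pd

# Hypothetical stand-in data; no two rows are identical to begin with.
data = pd.DataFrame({
    "First Name": ["Maria", "Douglas", "Thomas"],
    "Salary": [100, 200, 300],
})
length1 = len(data)

# Insert a copy of row 1 under a new, unused index label.
data.loc[len(data)] = data.loc[1].tolist()
length2 = len(data)

# keep=False drops both the original row and its copy.
data = data.drop_duplicates(keep=False)
length3 = len(data)

print(length1, length2, length3)  # 3 4 2
```

The final length is two less than the length after insertion, which matches the pattern described above for the larger file.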