In this article, we will write a Python script to find duplicate files in the file system or inside a particular folder.
Method 1: Using Filecmp
The Python module filecmp offers functions to compare files and directories. The cmp() function compares two files and returns True if they match, otherwise False.
Syntax: filecmp.cmp(f1, f2, shallow=True)
Parameters:
- f1: Name of the first file
- f2: Name of the second file to be compared
- shallow: Controls whether the comparison may rely on file metadata instead of content.
Note: The default value is True, meaning files with identical os.stat() signatures (file type, size, and modification time) are treated as equal without their contents being compared. Pass shallow=False to always compare content.
Return Type: Boolean value (True if the files match, otherwise False)
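For instance, here is a minimal sketch of the difference the shallow flag makes (the file names are hypothetical):

import filecmp

# Hypothetical files: with shallow=True (the default), two files whose
# size and modification time happen to match are reported as equal
# without their contents ever being read.
print(filecmp.cmp('report_v1.txt', 'report_v2.txt'))

# shallow=False always compares the actual bytes, which is why the
# duplicate finder below uses it.
print(filecmp.cmp('report_v1.txt', 'report_v2.txt', shallow=False))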
Example:
We’re assuming here for example purposes that “text_1.txt”, “text_3.txt”, “text_4.txt” are files having the same content, and “text_2.txt”, “text_5.txt” are files having the same content.
Python3
# Importing Libraries
import os
from pathlib import Path
from filecmp import cmp

# List of all documents
DATA_DIR = Path('/path/to/directory')
files = sorted(os.listdir(DATA_DIR))

# List holding the classes of documents with the same content
duplicateFiles = []

# Comparison of the documents
for file_x in files:
    if_dupl = False
    for class_ in duplicateFiles:
        # Compare files byte by byte using cmp();
        # class_[0] represents a class of files with the same content
        if_dupl = cmp(
            DATA_DIR / file_x,
            DATA_DIR / class_[0],
            shallow=False
        )
        if if_dupl:
            class_.append(file_x)
            break
    if not if_dupl:
        duplicateFiles.append([file_x])

# Print results
print(duplicateFiles)
Output:
[['text_1.txt', 'text_3.txt', 'text_4.txt'], ['text_2.txt', 'text_5.txt']]
Method 2: Using Hashing and Dictionary
To start, this script takes a single folder or a list of folders, then traverses them to find duplicate files. It computes a hash for every file in each folder, regardless of file name, and stores the results in a dictionary with the hash as the key and the file names as the value.
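As a rough sketch, the dictionary described above ends up with the following shape (the hashes and file names here are hypothetical):

# Hypothetical shape of the hash-to-files dictionary: files with
# identical content share a key, so any value list longer than one
# marks a group of duplicates.
duplicates = {
    "5d41402abc4b2a76b9719d911017c592": ["copy_1.txt", "copy_2.txt"],
    "9e107d9d372bb6826bd81d3542a419d6": ["unique.txt"],
}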
- We have to import the os, sys, hashlib, and pathlib libraries.
- Then the script iterates over the folders and calls the FindDuplicate() function on each one to find duplicates.
Syntax: FindDuplicate(Path)
Parameter:
- Path: Path to the folder containing the files
Return Type: Dictionary
- The FindDuplicate() function takes the path of a folder and calls the Hash_File() function on every file inside it.
- The Hash_File() function then returns the hex digest of that file, i.e., its hash value written as a string of hexadecimal digits.
Syntax: Hash_File(path)
Parameters:
- path: Path of the file
Return Type: Hex digest of the file
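To get a feel for what a hex digest is, here is a quick standalone illustration (not part of the script itself):

import hashlib

# MD5 produces a 128-bit hash; hexdigest() renders it as a
# 32-character hexadecimal string.
print(hashlib.md5(b"hello world").hexdigest())
# 5eb63bbbe01eeed093cb22bb8f5acdc3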
- This MD5 hash is then added to a dictionary as the key, with the file name as its value. After this, the FindDuplicate() function returns a dictionary in which some keys have multiple values, i.e., duplicate files.
- Now the Join_Dictionary() function is called, which merges each dictionary returned by FindDuplicate() into one accumulating dictionary (which starts out empty).
Syntax: Join_Dictionary(dict1, dict2)
Parameters:
- dict1, dict2: Two different dictionaries
Return Type: None (dict1 is updated in place)
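As a small illustration of the merge behaviour (using the Join_Dictionary() defined in the full example below, with hypothetical hashes and file names):

# On a key collision the value lists are appended together.
dict_1 = {"hash_a": ["x.txt"]}
dict_2 = {"hash_a": ["y.txt"], "hash_b": ["z.txt"]}
Join_Dictionary(dict_1, dict_2)
print(dict_1)   # {'hash_a': ['x.txt', 'y.txt'], 'hash_b': ['z.txt']}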
- After this, we keep only the hash groups that contain more than one file (the results list) and print the files having the same content.
Example:
We’re assuming here for example purposes that “text_1.txt”, “text_3.txt”, “text_4.txt” are files having the same content, and “text_2.txt”, “text_5.txt” are files having the same content.
Python3
# Importing Libraries
import os
import sys
from pathlib import Path
import hashlib


# Finds duplicates inside one folder
def FindDuplicate(folder):
    # Duplic is in format {hash: [names]}
    Duplic = {}
    files = sorted(os.listdir(folder))
    for file_name in files:
        # Path to the file
        path = os.path.join(folder, file_name)
        # Calculate hash
        file_hash = Hash_File(path)
        # Add or append the file name under its hash
        if file_hash in Duplic:
            Duplic[file_hash].append(file_name)
        else:
            Duplic[file_hash] = [file_name]
    return Duplic


# Joins two dictionaries, appending value lists on key collisions
def Join_Dictionary(dict_1, dict_2):
    for key in dict_2.keys():
        # Checks for an existing key
        if key in dict_1:
            # If present, append
            dict_1[key] = dict_1[key] + dict_2[key]
        else:
            # Otherwise, store
            dict_1[key] = dict_2[key]


# Calculates the MD5 hash of a file, reading it in blocks
# Returns the hex digest of the file
def Hash_File(path):
    afile = open(path, 'rb')
    hasher = hashlib.md5()
    blocksize = 65536
    buf = afile.read(blocksize)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    afile.close()
    return hasher.hexdigest()


Duplic = {}
folders = [Path('path/to/directory')]
for folder in folders:
    # Find the duplicate files in each folder
    # and merge them into Duplic
    Join_Dictionary(Duplic, FindDuplicate(folder))

# results stores the groups of files with identical content
results = list(filter(lambda x: len(x) > 1, Duplic.values()))
if len(results) > 0:
    for result in results:
        for sub_result in result:
            print('\t\t%s' % sub_result)
else:
    print('No duplicates found.')
Output:
		text_1.txt
		text_3.txt
		text_4.txt
		text_2.txt
		text_5.txt