In this article, we are going to use a concept called hashing to identify unique files and delete duplicate files using Python.
Modules required:
- tkinter: We need a way to select the folder in which the cleaning process should run, so every time we run the code a file dialog appears for choosing a folder; we use the Tkinter library for this. It provides a method called "askdirectory" which asks the user to choose a directory. Tkinter ships with the standard Python installers for Windows and macOS; on some Linux distributions it is provided by a separate system package (for example python3-tk), since it cannot be installed with pip. To check that it is available, type the following command in IDE/terminal.
python -m tkinter
- hashlib: In order to use the md5 hash function we need the hashlib module. It is part of the Python standard library, so no separate installation is required.
- os: This module helps us remove duplicate files by providing functions for walking through directories, building file paths, and deleting files. It is also part of the Python standard library, so no installation is required.
Approach:
- We will ask the user to select a folder and search this umbrella (root) directory for all duplicate and redundant files.
- We will take the content of each file and pass it through a hash function, which generates a unique string corresponding to a unique file.
- The hash string has a fixed size, and the size depends on the hash function we use. There are many hash functions such as MD5, SHA-1, SHA-256, and others. In this article we use the MD5 hash, which always produces a 32-character hexadecimal digest, irrespective of the size and type of the file.
- In order to detect duplicate files and then delete them, we are going to maintain a Python dictionary.
- We will use the hash string of each file inside every subfolder of the root directory as the keys of the dictionary and the file paths as its values.
- Every time we insert a new file record, we check whether its hash is already in the dictionary. If it is, we have found a duplicate file, so we take the path of the file and delete it (a minimal sketch of this idea follows right after this list).
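To make the approach concrete before the step-by-step walkthrough, here is a small self-contained sketch (not the article's final script) showing that an MD5 hex digest is always 32 characters long and how a dictionary keyed by hashes spots duplicate content; the byte strings and file names used here are made up purely for illustration.
Python3
import hashlib

# Two "files" with identical content produce identical digests.
content_a = b"hello world"      # hypothetical file contents
content_b = b"hello world"
content_c = b"something else"

digest = hashlib.md5(content_a).hexdigest()
print(digest, len(digest))      # a 32-character hexadecimal string

seen = dict()                   # hash string -> file path
for name, data in [("a.txt", content_a), ("b.txt", content_b), ("c.txt", content_c)]:
    file_hash = hashlib.md5(data).hexdigest()
    if file_hash in seen:
        print(f"{name} has the same content as {seen[file_hash]}")
    else:
        seen[file_hash] = name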
Stepwise Implementation
Step 1: Import Tkinter, os, hashlib & pathlib libraries.
Python3
from tkinter.filedialog import askdirectory
from tkinter import Tk
import os
import hashlib
from pathlib import Path
Step 2: We call Tk().withdraw() because we don't want the main Tkinter GUI window to appear on the screen; we only want the file dialog for selecting the folder. The line askdirectory(title="Select a folder") pops up a dialog box through which we can select a folder.
Python3
Tk().withdraw()
file_path = askdirectory(title="Select a folder")
Step 3: Next we need to list out all the files inside our root folder. To do that we use the os module: os.walk() takes the path of our root folder as an argument, walks through each subdirectory of the folder given to it, and lists out all the files. For every directory it visits, it yields a tuple with three elements: the first element is the path to that folder, the second is a list of all the subfolders inside that folder, and the third is a list of all the files inside that folder (a short sketch after the snippet below prints these tuples so the structure is easy to see).
Python3
list_of_files = os.walk(file_path)
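The following small sketch only illustrates the tuple structure described above; it makes a fresh os.walk() call for the folder chosen in Step 2 (file_path) so that list_of_files is left untouched for the later steps, and the directories it prints will of course depend on the folder you select.
Python3
import os

# Print what os.walk() yields for the selected folder.
for root, folders, files in os.walk(file_path):
    print("directory:", root)
    print("  subfolders:", folders)
    print("  files:", files)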
Step 4: Our final goal is to list out all the files in the main directory and every subdirectory, which is why we run a for loop over all the files. We need to open each file and convert its content into a hash string; to do that we define a variable called hash_file. The md5 hash function converts all the content of the file into an MD5 hash. In order to open a file we first need its full path, so we use another function from the os module, os.path.join(), to join the directory path with the file name, and then open the file in binary read mode. To get the hash string we use the hexdigest() method.
Python3
for root, folders, files in list_of_files:
    for file in files:
        file_path = Path(os.path.join(root, file))
        hash_file = hashlib.md5(open(file_path, 'rb').read()).hexdigest()
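Reading the whole file with .read() is fine for small files, but it loads the entire file into memory at once. If you expect very large files, a hedged alternative is to feed the hash in fixed-size chunks; the helper name md5_of_file and the 8192-byte chunk size below are arbitrary choices for this sketch, not part of the article's script.
Python3
import hashlib

def md5_of_file(path, chunk_size=8192):
    # Hash the file piece by piece so large files
    # never have to fit in memory all at once.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()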
Step 5: In order to detect the duplicate files we define an empty dictionary. We add elements to this dictionary, where the key of each element is the file hash and the value is the file path. If a file's hash has already been added to this unique_files dictionary, that means we have found a duplicate file, and we delete it with the os.remove() function. If the hash is not there yet, we add it to the dictionary.
Python3
# Defined once, before the loop from Step 4.
unique_files = dict()

# Inside the inner loop from Step 4:
if hash_file not in unique_files:
    unique_files[hash_file] = file_path
else:
    os.remove(file_path)
    print(f"{file_path} has been deleted")
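Since os.remove() permanently deletes files, it can be worth doing a dry run first. Below is a minimal sketch, assuming we only want to report duplicates instead of removing them; the duplicates list is a name introduced purely for illustration, and the if/else replaces the one inside the loop from Step 4.
Python3
# Defined once, before the loop: paths that would have been deleted.
duplicates = []

# Inside the inner loop from Step 4:
if hash_file not in unique_files:
    unique_files[hash_file] = file_path
else:
    duplicates.append(file_path)
    print(f"{file_path} is a duplicate of {unique_files[hash_file]}")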
Below is the full implementation:
Python3
# Importing required libraries.
from tkinter.filedialog import askdirectory
from tkinter import Tk
import os
import hashlib
from pathlib import Path

# We don't want the GUI window of
# tkinter to be appearing on our screen.
Tk().withdraw()

# Dialog box for selecting a folder.
file_path = askdirectory(title="Select a folder")

# Listing out all the files
# inside our root folder.
list_of_files = os.walk(file_path)

# In order to detect the duplicate files
# we are going to define an empty dictionary.
unique_files = dict()

for root, folders, files in list_of_files:

    # Running a for loop on all the files.
    for file in files:

        # Finding the complete file path.
        file_path = Path(os.path.join(root, file))

        # Converting all the content of
        # our file into an md5 hash.
        hash_file = hashlib.md5(open(file_path, 'rb').read()).hexdigest()

        # If the file hash has already been added,
        # we'll simply delete that file.
        if hash_file not in unique_files:
            unique_files[hash_file] = file_path
        else:
            os.remove(file_path)
            print(f"{file_path} has been deleted")
Output: