Let us see how to remove repeated lines from a file using Python’s file handling. For a small file with only a few lines, the repeated lines can be removed by hand, but for large files a short Python script is far more practical.
Eliminating repeated lines from a file in Python
Below are the methods that we will cover in this article:
- Using a List
- Using a Set
- Using the Pandas library
Input File:
For the sake of this example, let’s create a file (Lorem_input.txt) with some Lorem ipsum text in it. The line "Lorem ipsum dolor sit amet, consectetur adipiscing elit." appears three times in it.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus est neque, mollis vel massa vel, condimentum facilisis ipsum.
Mauris vitae mollis magna.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Aliquam laoreet vitae nisi quis rutrum.
Sed ut ligula nec enim consequat egestas vel a sapien.
Pellentesque sit amet euismod felis.
Pellentesque in nibh ultricies, convallis sapien id, sagittis odio.
Vivamus placerat ex sed ligula porttitor dignissim.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Morbi posuere eget odio ut venenatis.
Nam lobortis bibendum maximus.
Donec venenatis sapien sed varius accumsan.
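To confirm which lines repeat before removing anything, the lines can simply be counted. The following is a minimal sketch; the helper name find_repeated_lines and the short in-memory sample are illustrative assumptions, not part of the scripts in this article.

```python
from collections import Counter


def find_repeated_lines(lines):
    """Return a dict mapping each line that occurs more than once to its count."""
    counts = Counter(lines)
    return {line: n for line, n in counts.items() if n > 1}


# A few lines taken from the sample text (hypothetical in-memory copy)
sample_lines = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "Mauris vitae mollis magna.",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "Nam lobortis bibendum maximus.",
]

repeated = find_repeated_lines(sample_lines)
print(repeated)
```

The same helper should work on the lines of a real file, for example by passing it the result of read().splitlines().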
Eliminating Repeated Lines from a File using a List
Open the input file with the open() function in read mode ('r'), then open an output file in write mode ('w'); this is where the contents are stored after the repeated lines are removed. Use a list to keep track of the lines seen so far. Iterate over each line of the input file and check whether it is already in that list. If it is, skip the line; otherwise write it to the output file and append it to the list. Now let’s create the output file (Lorem_output.txt), where we will store the modified contents; opening it in write mode creates it if it does not already exist.
Python3
def remove_duplicates(input_file, output_file):
    # holds the lines already written, in order of first appearance
    lines_seen = []
    with open(output_file, 'w') as out_file:
        with open(input_file, 'r') as in_file:
            for line in in_file:
                # write the line only if it has not been seen before
                if line not in lines_seen:
                    out_file.write(line)
                    lines_seen.append(line)

# Usage: pass the file paths, not already opened file objects
input_file = 'C:/Users/user/Desktop/Lorem_input.txt'
output_file = 'C:/Users/user/Desktop/Lorem_output.txt'
remove_duplicates(input_file, output_file)
Output file:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus est neque, mollis vel massa vel, condimentum facilisis ipsum.
Mauris vitae mollis magna.
Aliquam laoreet vitae nisi quis rutrum.
Sed ut ligula nec enim consequat egestas vel a sapien.
Pellentesque sit amet euismod felis.
Pellentesque in nibh ultricies, convallis sapien id, sagittis odio.
Vivamus placerat ex sed ligula porttitor dignissim.
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Morbi posuere eget odio ut venenatis.
Nam lobortis bibendum maximus.
Donec venenatis sapien sed varius accumsan.
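Checking membership in a list scans the whole list, so this approach slows down as the number of distinct lines grows. For a file that fits comfortably in memory, one alternative is dict.fromkeys, which relies on dictionaries preserving insertion order (Python 3.7 and later). This is a swapped-in technique shown as a rough sketch, and the function name dedupe_in_memory is hypothetical.

```python
def dedupe_in_memory(lines):
    """Return the lines with repeats removed, keeping first occurrences in order."""
    # dict keys are unique and remember the order in which they were inserted
    return list(dict.fromkeys(lines))


lines = ["a\n", "b\n", "a\n", "c\n", "b\n"]
unique = dedupe_in_memory(lines)
print(unique)
```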
Eliminating Repeated Lines from a File using a Set
Open the input file with the open() function in read mode ('r'). Open an output file in write mode ('w'), where the contents are stored after the repeated lines are removed. Use a set to keep track of the lines seen so far; membership checks on a set are fast even when many lines have been seen. Iterate over each line of the input file and check whether it is already in the set. If it is, skip the line; otherwise write it to the output file and add it to the set. Finally, close both files.
Now let’s create an empty output file (Lorem_output.txt), where we will store the modified input file.
Python3
# creating the output file
outputFile = open('C:/Users/user/Desktop/Lorem_output.txt', "w")

# reading the input file
inputFile = open('C:/Users/user/Desktop/Lorem_input.txt', "r")

# holds lines already seen
lines_seen_so_far = set()

# iterating each line in the file
for line in inputFile:
    # checking if line is unique
    if line not in lines_seen_so_far:
        # write unique lines in output file
        outputFile.write(line)
        # adds unique lines to lines_seen_so_far
        lines_seen_so_far.add(line)

# closing the files
inputFile.close()
outputFile.close()
Running the above Python script removes all the repeated lines from the input file and writes the result to the output file. After running it, the output file (Lorem_output.txt) will look like this:
Output file:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus est neque, mollis vel massa vel, condimentum facilisis ipsum.
Mauris vitae mollis magna.
Aliquam laoreet vitae nisi quis rutrum.
Sed ut ligula nec enim consequat egestas vel a sapien.
Pellentesque sit amet euismod felis.
Pellentesque in nibh ultricies, convallis sapien id, sagittis odio.
Vivamus placerat ex sed ligula porttitor dignissim.
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Morbi posuere eget odio ut venenatis.
Nam lobortis bibendum maximus.
Donec venenatis sapien sed varius accumsan.
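One caveat when comparing raw lines: a final line without a trailing newline does not compare equal to the same text followed by a newline, so it would be kept as if it were unique. Below is a minimal sketch of one possible workaround, assuming lines should be compared without their line endings. The generator name unique_lines is hypothetical.

```python
def unique_lines(lines):
    """Yield each line once, in order of first appearance, ignoring line endings."""
    seen = set()
    for line in lines:
        key = line.rstrip("\r\n")  # compare the text without its line ending
        if key not in seen:
            seen.add(key)
            yield key + "\n"  # emit the line with a normalized newline


raw = [
    "Mauris vitae mollis magna.\n",
    "Nam lobortis bibendum maximus.\n",
    "Mauris vitae mollis magna.",  # repeated, but missing the final newline
]
result = list(unique_lines(raw))
print(result)
```

Because the generator accepts any iterable of strings, an open file object could be passed to it directly and the result written out with writelines().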
Eliminating Repeated Lines from a File using Pandas
The function remove_duplicates reads the data from the input file, represented by the input_file variable, into a pandas DataFrame. It then uses the drop_duplicates method to remove duplicate rows in place, keeping the first occurrence of each. The resulting DataFrame of unique rows is saved to a new file specified by the output_file variable. The file paths are defined, and the function is called with the input and output paths to eliminate repeated lines from the input file and save the unique content to the output file. Now let’s create the output file (Lorem_output.txt), where we will store the modified input file.
Python3
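import pandas as pd

def remove_duplicates(input_file, output_file):
    # load one row per line; pd.read_csv would split these lines at their commas
    with open(input_file, 'r') as in_file:
        df = pd.DataFrame({'line': in_file.read().splitlines()})
    # drop repeated rows in place, keeping the first occurrence
    df.drop_duplicates(inplace=True)
    # write the unique lines to the output file, one per line
    with open(output_file, 'w') as out_file:
        out_file.writelines(line + '\n' for line in df['line'])

# paths of the input and output files
input_file = 'C:/Users/user/Desktop/Lorem_input.txt'
output_file = 'C:/Users/user/Desktop/Lorem_output.txt'
remove_duplicates(input_file, output_file)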
Output file:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus est neque, mollis vel massa vel, condimentum facilisis ipsum.
Mauris vitae mollis magna.
Aliquam laoreet vitae nisi quis rutrum.
Sed ut ligula nec enim consequat egestas vel a sapien.
Pellentesque sit amet euismod felis.
Pellentesque in nibh ultricies, convallis sapien id, sagittis odio.
Vivamus placerat ex sed ligula porttitor dignissim.
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Morbi posuere eget odio ut venenatis.
Nam lobortis bibendum maximus.
Donec venenatis sapien sed varius accumsan.
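By default drop_duplicates keeps the first occurrence of each repeated row, and its keep parameter changes that choice. The short sketch below illustrates the three options on a small in-memory Series; the sample values are made up for illustration.

```python
import pandas as pd

# hypothetical sample: "a" and "b" each appear twice
lines = pd.Series(["a", "b", "a", "c", "b"])

# keep the first occurrence of each value (the default)
first_kept = lines.drop_duplicates(keep="first").tolist()
# keep the last occurrence of each value
last_kept = lines.drop_duplicates(keep="last").tolist()
# drop every value that occurs more than once
none_kept = lines.drop_duplicates(keep=False).tolist()

print(first_kept, last_kept, none_kept)
```

Passing keep=False is useful when only the lines that were never repeated should survive.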