Extract text from PDF File using Python

28 July 2024

Most readers will already be familiar with PDFs, one of the most widely used digital document formats. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.

In this article, we will extract text from PDF files using two Python libraries: PyPDF2 and PyMuPDF.

Extracting text from a PDF file using the PyPDF2 library

The PyPDF2 package can do what we need (text extraction) and more: it can also be used to generate, decrypt, and merge PDF files. Note: For more information, refer to Working with PDF files in Python.

Installation

To install this package, type the following command in the terminal.

pip install PyPDF2

Example (the input PDF is example.pdf):

Python3

# importing required modules
from PyPDF2 import PdfReader
  
# creating a pdf reader object
reader = PdfReader('example.pdf')
  
# printing number of pages in pdf file
print(len(reader.pages))
  
# getting a specific page from the pdf file
page = reader.pages[0]
  
# extracting text from page
text = page.extract_text()
print(text)

Let us try to understand the above code in chunks:

reader = PdfReader('example.pdf')

We created an object of the PdfReader class from the PyPDF2 module.
The PdfReader class takes the path to the PDF file as a required positional argument.

print(len(reader.pages))

The pages property gives a list-like sequence of PageObject instances, so we can use Python's built-in len() function to get the number of pages in the PDF file.

page = reader.pages[0]

Because reader.pages is a sequence of PageObject instances, we can get a specific page of the PDF by its index. In Python, indexing starts from 0, so reader.pages[0] gives us the first page of the PDF file.
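
People usually count pages from 1, while reader.pages is indexed from 0, so a small helper can make the conversion explicit. The following is only a minimal sketch: page_by_number is a hypothetical helper name, not part of PyPDF2, and it works on any sequence that supports len() and indexing.

```python
def page_by_number(pages, page_number):
    """Return the page for a 1-based page number (hypothetical helper)."""
    if not 1 <= page_number <= len(pages):
        raise IndexError(f"page {page_number} is out of range 1..{len(pages)}")
    # convert the human-friendly 1-based number to a 0-based index
    return pages[page_number - 1]
```

With the reader created earlier, page_by_number(reader.pages, 1) would return the same page as reader.pages[0].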

text = page.extract_text()
print(text)

The page object has an extract_text() method that extracts the text from the PDF page.
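
The example above reads only the first page. To collect the text of the whole document, one option is to loop over reader.pages and join the results. This is a minimal sketch, assuming PyPDF2 is installed; join_page_text and extract_pdf_text are illustrative names, not PyPDF2 functions.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate the text of every page-like object in `pages`."""
    chunks = []
    for page in pages:
        # a page without a text layer (e.g. a scanned image) may yield
        # an empty result, so fall back to an empty string
        chunks.append(page.extract_text() or "")
    return separator.join(chunks)


def extract_pdf_text(path):
    """Read the PDF at `path` and return the text of all its pages."""
    # imported here so the helper above can be used without PyPDF2
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return join_page_text(reader.pages)
```

For instance, print(extract_pdf_text('example.pdf')) would print the text of every page rather than only the first one.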

Extracting text from a PDF file using the PyMuPDF library

PyMuPDF is a Python library that supports file formats like XPS, PDF, CBR, and CBZ. In this article, we concentrate on PDF (Portable Document Format) files.

Installation

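pip install PyMuPDF

When this article was first written, 1.16.14 was the current version of the PyMuPDF library. The examples below call get_text(), which older releases such as 1.16.14 spelled getText(), so install a recent release rather than pinning that version.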

To extract the text from the PDF, we follow these steps:

Importing the library
Opening document
Extracting text

Note: We are using sample.pdf here; to get the PDF, use the link below.

https://www.africau.edu/images/default/sample.pdf – sample.pdf

1. Importing the library

Python3

import fitz

2. Opening document

Python3

doc = fitz.open('sample.pdf')

Here we created a document object called doc. The filename argument must be a Python string.

3. Extracting text

Python3

for page in doc:
    text = page.get_text()
    print(text)

Here, we iterated over the pages of the PDF and used the get_text() method to extract the text of each page.
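
If the text of each page needs to be kept separate rather than printed immediately, the same loop can fill a dictionary keyed by page number. This is a minimal sketch, assuming a recent PyMuPDF release; text_by_page and read_pdf_pages are illustrative names, not PyMuPDF functions.

```python
def text_by_page(doc):
    """Map 1-based page numbers to the text of each page in `doc`."""
    return {number: page.get_text() for number, page in enumerate(doc, start=1)}


def read_pdf_pages(path):
    """Open the PDF at `path` and return its text, page by page."""
    # imported here so the helper above can be used without PyMuPDF
    import fitz

    # the with-block closes the document once the text has been read
    with fitz.open(path) as doc:
        return text_by_page(doc)
```

Calling read_pdf_pages('sample.pdf')[1] would then give the text of the first page.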

Complete code to extract the text

Python3

import fitz  # PyMuPDF

# open the document
doc = fitz.open('sample.pdf')

# collect the text of every page
text = ""
for page in doc:
    text += page.get_text()
print(text)
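
Printing is fine for a quick check, but the extracted text is often more useful saved to a file. This is a minimal sketch, assuming a recent PyMuPDF release; save_text, pdf_to_txt, and the output path are illustrative choices, not part of PyMuPDF.

```python
from pathlib import Path


def save_text(text, out_path):
    """Write `text` to `out_path` as UTF-8; return the character count written."""
    return Path(out_path).write_text(text, encoding="utf-8")


def pdf_to_txt(pdf_path, txt_path):
    """Extract the text of every page of a PDF and save it to a text file."""
    # imported here so save_text can be used without PyMuPDF
    import fitz

    with fitz.open(pdf_path) as doc:
        text = "".join(page.get_text() for page in doc)
    return save_text(text, txt_path)
```

For example, pdf_to_txt('sample.pdf', 'sample.txt') would write the text of sample.pdf to a file named sample.txt (a hypothetical output name).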

Conclusion

We have seen two Python libraries, PyPDF2 and PyMuPDF, that can extract text from a PDF file. Let us know in the comments which of the two you prefer.
