Saturday, September 21, 2024

Extract hyperlinks from PDF in Python

Prerequisites: PyPDF2, Regex

In this article, we are going to extract hyperlinks from a PDF in Python. This can be done in two ways:

  • Using PyPDF2
  • Using pdfx

Method 1: Using PyPDF2.

PyPDF2 is a Python library built as a PDF toolkit. It can extract document information, page text, and more. This method works in two steps:


  • Read the PDF file and convert it into text.
  • Get the URLs from the text using a regular expression.

Let’s implement this step by step:
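The second step can be tried on an ordinary string before any PDF is involved. The following is a minimal sketch rather than part of the article's implementation: the sample text and the helper name find_urls are made up for illustration, and trimming trailing punctuation is an added assumption about where a URL ends.

```python
import re

# "http://" or "https://" followed by a run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(text):
    """Return URLs found in text, with trailing punctuation trimmed.

    Hypothetical helper for illustration; the trimming rule is an assumption.
    """
    urls = []
    for match in URL_PATTERN.findall(text):
        # \S+ also swallows punctuation that directly follows a URL,
        # so drop common sentence punctuation from the end.
        urls.append(match.rstrip(".,;:!?)"))
    return urls


sample = "Docs at https://example.com/docs, mirror at http://example.org/a?b=1."
print(find_urls(sample))
# ['https://example.com/docs', 'http://example.org/a?b=1']
```

Without the trimming step, the comma and the final period would be returned as part of the two URLs.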

Step 1: Open and Read the PDF file.


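import PyPDF2

# Path to the PDF file to read
file = "Enter PDF File Name"

# PyPDF2 3.x API; releases before 3.0 used PdfFileReader, numPages,
# getPage() and extractText() for the same operations
with open(file, 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfReader(pdfFileObject)
    for page_number in range(len(pdfReader.pages)):
        pageObject = pdfReader.pages[page_number]
        # Text of the current page
        pdf_text = pageObject.extract_text()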
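Note that the loop above rebinds pdf_text on every iteration, so once it finishes only the last page's text is left. One way to keep the text of every page is sketched below. It assumes PyPDF2 3.x naming (a reader with a pages sequence whose items offer extract_text()); the function name collect_page_texts and the stub classes are hypothetical and exist only so the sketch runs without a PDF file.

```python
def collect_page_texts(reader):
    """Return a list holding the extracted text of every page.

    `reader` is assumed to behave like PyPDF2.PdfReader: it exposes
    `pages`, and each page has an `extract_text()` method.
    """
    texts = []
    for page in reader.pages:
        # Guard against a falsy result so callers always get strings.
        texts.append(page.extract_text() or "")
    return texts


# Stand-ins used only for demonstration (these are not PyPDF2 classes).
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    def __init__(self, pages):
        self.pages = pages


demo = FakeReader([FakePage("see https://example.com"), FakePage("")])
print(collect_page_texts(demo))
# ['see https://example.com', '']
```

With a real file, a PyPDF2.PdfReader built from the opened file object would take the place of the stub reader.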



Step 2: Use a regular expression to find URLs in the extracted text.


# Import modules
import PyPDF2
import re

# Enter file name
file = "Enter PDF File Name"

# Open the file in binary mode
pdfFileObject = open(file, 'rb')
# PyPDF2 3.x API (older releases: PdfFileReader, numPages, getPage, extractText)
pdfReader = PyPDF2.PdfReader(pdfFileObject)

# Regular expression (get URLs from a string)
def Find(string):
    # findall() returns every non-overlapping match:
    # "http://" or "https://" followed by non-whitespace characters
    regex = r"(https?://\S+)"
    return re.findall(regex, string)

# Iterate through all pages
for page_number in range(len(pdfReader.pages)):
    pageObject = pdfReader.pages[page_number]
    # Extract text from the page
    pdf_text = pageObject.extract_text()
    # Print all URLs found on this page
    print(Find(pdf_text))

# Close the PDF
pdfFileObject.close()


Output:

['', '', '']
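A regular expression over extracted text only finds URLs that are printed on the page. A link whose visible text is something like "click here" keeps its target in a link annotation instead. The sketch below reads those annotations; it is an alternative to the regex approach, not the method used above. It assumes PyPDF2 3.x and the usual PDF layout in which a page's /Annots array holds /Link annotations whose /A action carries a /URI entry. The helper names are hypothetical.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers get_object()."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def uris_from_annotations(annotations):
    """Collect /URI targets from an iterable of annotation dictionaries."""
    uris = []
    for annot in annotations:
        annot = resolve(annot)
        if annot.get("/Subtype") != "/Link":
            continue  # not a link annotation
        action = resolve(annot.get("/A"))
        if action is not None and "/URI" in action:
            uris.append(str(resolve(action["/URI"])))
    return uris


def links_from_pdf(path):
    """Return link targets per page; requires PyPDF2 3.x to be installed."""
    import PyPDF2  # imported here so the helpers above work without it

    with open(path, "rb") as handle:
        reader = PyPDF2.PdfReader(handle)
        return [
            uris_from_annotations(resolve(page.get("/Annots")) or [])
            for page in reader.pages
        ]


# Plain dictionaries standing in for annotation objects (demo data only).
demo = [
    {"/Subtype": "/Link", "/A": {"/URI": "https://example.com"}},
    {"/Subtype": "/Text"},
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo"}},
]
print(uris_from_annotations(demo))  # ['https://example.com']
```

Keeping the dictionary walk separate from the file handling means the filtering logic can be checked on plain dictionaries, as in the demo, before it is pointed at a real document through links_from_pdf.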

Method 2: Using pdfx.

In this method, we use the pdfx module, which extracts references (including URLs), metadata, and plain text from a local PDF file or a PDF URL. Install it with pip:

pip install pdfx

Below is the implementation:


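# Import module
import pdfx

# Read the PDF file (a local path or a PDF URL)
pdf = pdfx.PDFx("File Name")

# Get the references, including URLs, as a dictionary
print(pdf.get_references_as_dict())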


Output:

{'url': ['',
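Since the result is a dictionary keyed by reference type, the URL list can be picked out by key. Here is a small sketch, assuming the shape suggested by the output above (a 'url' key holding a list of strings). The sample dictionary and the helper name urls_from_references are made up for illustration, and a real result from pdfx's get_references_as_dict() may contain other keys.

```python
def urls_from_references(references):
    """Return the de-duplicated, sorted URLs from a pdfx-style dictionary."""
    # A missing 'url' key is treated as "no URLs found".
    return sorted(set(references.get("url", [])))


# Hand-written stand-in for what pdf.get_references_as_dict() might return.
sample_refs = {
    "url": [
        "https://example.org/b",
        "https://example.com/a",
        "https://example.org/b",
    ],
    "pdf": ["https://example.com/paper.pdf"],
}
print(urls_from_references(sample_refs))
# ['https://example.com/a', 'https://example.org/b']
```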

