Extract hyperlinks from PDF in Python

By Nango Kala

26 July 2024

0

Prerequisite: PyPDF2, Regex

In this article, We are going to extract hyperlinks from PDF in Python. It can be done in different ways:

Using PyPDF2
Using pdfx

Method 1: Using PyPDF2.

PyPDF2 is a python library built as a PDF toolkit. It is capable of Extracting document information and many more.

Approach:

Read the PDF file and convert it into text
Get URL from text Using Regular Expression

Let’s Implement this module step-wise:

Step 1: Open and Read the PDF file.

Python3

import PyPDF2
 
file = "Enter PDF File Name"
 
pdfFileObject = open(file, 'rb')
  
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
  
for page_number in range(pdfReader.numPages):
     
    pageObject = pdfReader.getPage(page_number)
    pdf_text = pageObject.extractText()
    print(pdf_text)
     
pdfFileObject.close()

Output:

Step 2: Use Regular Expression to find URL from String

Python3

# Import Module
import PyPDF2
import re 
 
# Enter File Name
file = "Enter PDF File Name"
 
# Open File file
pdfFileObject = open(file, 'rb')
  
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
 
# Regular Expression (Get URL from String)
def Find(string): 
   
    # findall() has been used 
    # with valid conditions for urls in string 
    regex = r"(https?://\S+)"
    url = re.findall(regex,string)
    return [x for x in url] 
   
# Iterate through all pages
for page_number in range(pdfReader.numPages):
     
    pageObject = pdfReader.getPage(page_number)
     
    # Extract text from page
    pdf_text = pageObject.extractText()
     
    # Print all URL
    print(Find(pdf_text))
     
# CLose the PDF 
pdfFileObject.close()

Output:

['https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'https://www.geeksforgeeks.org/']

Method 2: Using pdfx.

In this method, we will use pdfx module. pdfx module is used to extract URL, MetaData, and Plain text from a given PDF or PDF URL. Features: Extract references and metadata from a given PDF.

pip install pdfx

Below is the implementation:

Python3

# Import Module
import pdfx 
 
# Read PDF File
pdf = pdfx.PDFx("File Name") 
 
# Get list of URL
print(pdf.get_references_as_dict())

Output:-

{'url': ['https://www.geeksforgeeks.org/',
  'https://docs.python.org/',
  'https://pythonhosted.org/PyPDF2/',
  'Lazyroar.org']}

Extract hyperlinks from PDF in Python

Python3

Python3

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

5 Best VPNs for Brunei in 2025: Surf & Stream Privately by Raven Wu

NordVPN vs. Mullvad VPN 2025: Which VPN Is Better? by Gjurgjica Panova

Surfshark vs. Atlas VPN 2025: Which VPN Is Better? by Gjurgjica Panova

PureVPN vs. Private Internet Access 2025: Which Is Better? by Gjurgjica Panova

Recent Comments

EDITOR PICKS

5 Best VPNs for Brunei in 2025: Surf & Stream Privately by Raven Wu

NordVPN vs. Mullvad VPN 2025: Which VPN Is Better? by Gjurgjica Panova

Surfshark vs. Atlas VPN 2025: Which VPN Is Better? by Gjurgjica Panova

POPULAR POSTS

5 Best VPNs for Brunei in 2025: Surf & Stream Privately by Raven Wu

NordVPN vs. Mullvad VPN 2025: Which VPN Is Better? by Gjurgjica Panova

Surfshark vs. Atlas VPN 2025: Which VPN Is Better? by Gjurgjica Panova

POPULAR CATEGORY

ABOUT US

FOLLOW US