Friday, December 27, 2024

Extract hyperlinks from PDF in Python

Prerequisites: PyPDF2, Regex

In this article, we are going to extract hyperlinks from a PDF in Python. This can be done in two ways:

  • Using PyPDF2
  • Using pdfx

Method 1: Using PyPDF2.

PyPDF2 is a Python library built as a PDF toolkit. It can extract document information, page text, and more.

Approach:

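  • Read the PDF file and convert it into text
  • Extract the URLs from the text using a regular expression

Note that this approach finds only URLs that appear as visible text on a page. Links attached to other text as clickable annotations are not captured by text extraction.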
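The second step can be tried on its own, without a PDF. The following is a minimal sketch, assuming the extracted text is already available as a string; `sample_text` is a hypothetical stand-in for real page text, and the pattern simply matches "http://" or "https://" followed by non-whitespace characters.

```python
import re

# "http://" or "https://" followed by a run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

# Hypothetical sample standing in for text extracted from a PDF page.
sample_text = (
    "Python docs are at https://docs.python.org/ and\n"
    "tutorials are at https://www.geeksforgeeks.org/ as well."
)

# findall() returns every non-overlapping match as a list of strings.
urls = URL_PATTERN.findall(sample_text)
print(urls)
```

Because the pattern stops at whitespace, a URL that is wrapped across two lines in the PDF would be cut at the line break.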

Let’s implement this step by step:

Step 1: Open and Read the PDF file.

Python3




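import PyPDF2

# Path to the PDF file (placeholder; replace with a real path)
file = "Enter PDF File Name"

# Open in binary mode; the with-block closes the file automatically
with open(file, 'rb') as pdf_file:

    # PyPDF2 3.x names: PdfReader, .pages, .extract_text()
    # (PdfFileReader, numPages, getPage and extractText were removed in 3.0)
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Print the text of every page
    for page in pdf_reader.pages:
        pdf_text = page.extract_text()
        print(pdf_text)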


Output:

The extracted text of each page is printed to the console.

Step 2: Use a regular expression to find the URLs in the extracted text.

Python3




# Import modules
import PyPDF2
import re

# Path to the PDF file (placeholder; replace with a real path)
file = "Enter PDF File Name"

# Regular expression: "http://" or "https://" followed by
# any run of non-whitespace characters
URL_REGEX = r"(https?://\S+)"

def find_urls(string):
    # findall() returns every non-overlapping match as a list of strings
    return re.findall(URL_REGEX, string)

# Open the file; the with-block closes it automatically
with open(file, 'rb') as pdf_file:

    # PyPDF2 3.x names: PdfReader, .pages, .extract_text()
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Iterate through all pages
    for page in pdf_reader.pages:

        # Extract text from the page
        pdf_text = page.extract_text()

        # Print all URLs found on this page
        print(find_urls(pdf_text))


Output:

['https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'https://www.geeksforgeeks.org/']
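A pattern that matches any run of non-whitespace characters also picks up punctuation that directly follows a URL in running text, and the same URL may appear on several pages. The sketch below is one possible clean-up, not part of PyPDF2: `collect_urls`, `TRAILING_PUNCTUATION`, and the sample `pages` list are hypothetical names introduced for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

# Characters that often trail a URL in running text (an assumed, tunable set).
TRAILING_PUNCTUATION = ".,;:!?)]}>\"'"

def collect_urls(page_texts):
    """Return unique URLs from an iterable of page texts, in first-seen order."""
    seen = set()
    result = []
    for text in page_texts:
        for match in URL_PATTERN.findall(text):
            # Drop punctuation glued to the end of the match.
            url = match.rstrip(TRAILING_PUNCTUATION)
            if url not in seen:
                seen.add(url)
                result.append(url)
    return result

# Hypothetical page texts standing in for the output of text extraction.
pages = [
    "See https://docs.python.org/, or (https://www.geeksforgeeks.org/).",
    "Again: https://docs.python.org/",
]
print(collect_urls(pages))
```

Stripping a closing parenthesis is a trade-off: it cleans up URLs quoted inside brackets, but it would also shorten a URL that legitimately ends with ")".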

Method 2: Using pdfx. 

In this method, we will use the pdfx module, which extracts URLs, references, metadata, and plain text from a given PDF file or PDF URL. Install it with pip:

pip install pdfx

Below is the implementation:

Python3




# Import Module
import pdfx
 
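# Read the PDF file (placeholder; replace with a real path or PDF URL)
pdf = pdfx.PDFx("File Name")
 
# Get the references found in the PDF as a dictionary keyed by type
print(pdf.get_references_as_dict())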


Output:

{'url': ['https://www.geeksforgeeks.org/',
  'https://docs.python.org/',
  'https://pythonhosted.org/PyPDF2/',
  'Lazyroar.org']}
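The list under 'url' can mix full URLs with bare entries that have no scheme, such as the last item above. A minimal sketch of separating the two with the standard library follows; the `references` dictionary is sample data copied from the output above, and `split_by_scheme` is a hypothetical helper, not part of pdfx.

```python
from urllib.parse import urlparse

# Sample data copied from the output above; in practice this would be
# the dictionary returned by get_references_as_dict().
references = {
    'url': [
        'https://www.geeksforgeeks.org/',
        'https://docs.python.org/',
        'https://pythonhosted.org/PyPDF2/',
        'Lazyroar.org',
    ]
}

def split_by_scheme(urls):
    """Separate URLs with an http(s) scheme from entries without one."""
    with_scheme, without_scheme = [], []
    for url in urls:
        if urlparse(url).scheme in ("http", "https"):
            with_scheme.append(url)
        else:
            without_scheme.append(url)
    return with_scheme, without_scheme

# .get() guards against a PDF in which no URLs were found.
full_urls, bare_entries = split_by_scheme(references.get('url', []))
print(full_urls)
print(bare_entries)
```

Keeping the bare entries in a separate list, rather than discarding them, leaves room to review them or prepend a scheme later.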
