In this article, we will extract hyperlinks from a PDF file in Python. There are two common ways to do this:
- Using PyPDF2
- Using pdfx
Method 1: Using PyPDF2.
PyPDF2 is a Python library built as a PDF toolkit. It can extract document information, page text, and more.
Approach:
- Read the PDF file and convert it into text
- Extract the URLs from the text using a regular expression
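The second step of this approach can be tried on its own before any PDF is involved. The sketch below is a minimal illustration, assuming the simple pattern `https?://\S+`. The helper name `find_urls` and the sample string are hypothetical, and the trailing-punctuation cleanup is an added assumption: `\S+` also captures a closing bracket or full stop that directly follows a URL.

```python
import re

# Simple pattern: "http://" or "https://" followed by non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")

def find_urls(text):
    """Return the URLs found in text, in order of appearance, without duplicates."""
    seen = []
    for match in URL_PATTERN.findall(text):
        # \S+ is greedy, so drop punctuation that usually belongs to the sentence
        url = match.rstrip(".,;:!?)]}>\"'")
        if url and url not in seen:
            seen.append(url)
    return seen

# Hypothetical sample text standing in for text extracted from a PDF page
sample = "Docs: https://docs.python.org/, see also (https://www.geeksforgeeks.org/)."
print(find_urls(sample))
# ['https://docs.python.org/', 'https://www.geeksforgeeks.org/']
```

Note that stripping trailing punctuation is a trade-off: a URL that legitimately ends in a closing bracket would be shortened, so the character set may need adjusting for your documents.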
Let’s implement this step by step:
Step 1: Open and Read the PDF file.
Python3
import PyPDF2

# Path of the PDF to read
file = "Enter PDF File Name"

# Open the file in binary mode; the with-block closes it automatically
with open(file, 'rb') as pdfFileObject:
    # PyPDF2 3.x names (older releases: PdfFileReader, numPages, getPage, extractText)
    pdfReader = PyPDF2.PdfReader(pdfFileObject)

    # Print the text of every page
    for pageObject in pdfReader.pages:
        pdf_text = pageObject.extract_text()
        print(pdf_text)
Output: the extracted text of each page is printed to the console.
Step 2: Use a regular expression to find the URLs in the extracted text.
Python3
# Import modules
import PyPDF2
import re

# Enter file name
file = "Enter PDF File Name"

# Regular expression (get URLs from a string)
def Find(string):
    # findall() returns every non-overlapping match of the URL pattern
    regex = r"(https?://\S+)"
    return re.findall(regex, string)

# Open the file; the with-block closes it when done
with open(file, 'rb') as pdfFileObject:
    # PyPDF2 3.x names (older releases: PdfFileReader, numPages, getPage, extractText)
    pdfReader = PyPDF2.PdfReader(pdfFileObject)

    # Iterate through all pages
    for pageObject in pdfReader.pages:
        # Extract text from the page
        pdf_text = pageObject.extract_text()
        # Print all URLs found on the page
        print(Find(pdf_text))
Output:
['https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'https://www.geeksforgeeks.org/']
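A regular expression only finds URLs that are printed in the page text. A clickable link whose visible label is something like "click here" keeps its target in a link annotation instead. The sketch below is one possible way to read those targets, assuming PyPDF2 2.x or later (`PdfReader`, `reader.pages`) and the `/Annots`, `/A` and `/URI` keys defined by the PDF format. The helper names `annotation_uris` and `main` are hypothetical.

```python
def annotation_uris(reader):
    """Collect the URI targets of link annotations from a PdfReader-like object."""
    uris = []
    for page in reader.pages:
        if "/Annots" not in page:
            continue  # this page has no annotations
        for ref in page["/Annots"]:
            annot = ref.get_object()  # resolve indirect references
            if "/A" not in annot:
                continue  # annotation without an action (e.g. a text note)
            action = annot["/A"]
            if "/URI" in action:
                uris.append(str(action["/URI"]))
    return uris

def main(path):
    # Imported here so the helper above does not require PyPDF2 at import time
    from PyPDF2 import PdfReader
    print(annotation_uris(PdfReader(path)))

# main("Enter PDF File Name")  # uncomment and pass a real PDF path
```

Combining this with the text-based search covers both URLs that are written out on the page and links that are only clickable.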
Method 2: Using pdfx.
In this method, we will use the pdfx module. pdfx extracts references (such as URLs), metadata, and plain text from a local PDF file or a PDF URL. Install it with pip:
pip install pdfx
Below is the implementation:
Python3
# Import module
import pdfx

# Read the PDF file
pdf = pdfx.PDFx("File Name")

# Get the references (including URLs) as a dictionary
print(pdf.get_references_as_dict())
Output:
{'url': ['https://www.geeksforgeeks.org/', 'https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'Lazyroar.org']}
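The list under the 'url' key can contain entries without a scheme, such as 'Lazyroar.org' in the output above. If uniform, absolute URLs are needed, a small post-processing step along these lines may help. This is a sketch and not part of pdfx: the helper name `normalize_urls` and the choice of `https` as the default scheme are assumptions.

```python
def normalize_urls(references, default_scheme="https"):
    """Return the 'url' entries of a references dict as absolute, unique URLs."""
    cleaned = []
    for url in references.get("url", []):
        url = url.strip()
        if not url:
            continue
        # Entries such as "Lazyroar.org" have no scheme; assume the default one
        if "://" not in url:
            url = f"{default_scheme}://{url}"
        if url not in cleaned:
            cleaned.append(url)
    return cleaned

# A dictionary shaped like the output shown above
refs = {"url": ["https://www.geeksforgeeks.org/", "https://docs.python.org/", "Lazyroar.org"]}
print(normalize_urls(refs))
# ['https://www.geeksforgeeks.org/', 'https://docs.python.org/', 'https://Lazyroar.org']
```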