Prerequisites: Scrapy, PyPDF2, urllib
In this article, we will use Scrapy to parse an online PDF without saving it to disk. To do that, we use Python's PDF parsing and editing library, known as PyPDF2.
PyPDF2 is a PDF library for Python that provides reader and writer classes, among others, for parsing, editing and modifying PDFs, whether they are fetched online or stored locally.
The PyPDF2 reader is constructed from a stream of the PDF file. Since the spider only gives us the URL of the PDF file, we need to open that URL and turn its content into a file stream. For this we use Python's urllib module: urllib.request.urlopen() is called on the URL extracted by the spider, and the bytes it returns are wrapped in io.BytesIO so that PyPDF2 receives a seekable in-memory stream.
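The URL-to-stream step can be sketched on its own as follows. This is a minimal illustration using only the standard library; the helper name `url_to_stream` is hypothetical and not part of urllib or PyPDF2.

```python
import io
import urllib.request


def url_to_stream(url):
    """Fetch `url` and return its bytes as a seekable in-memory stream.

    Hypothetical helper for illustration; not part of urllib or PyPDF2.
    """
    # urlopen() returns a file-like response; the context manager
    # closes the connection as soon as the body has been read.
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    # BytesIO supports seek() and tell(), which a PDF parser needs
    # in order to jump around inside the file.
    return io.BytesIO(payload)
```

The returned object can then be passed to the reader in place of an open file, for example `PyPDF2.PdfFileReader(url_to_stream(URL))`. Note that the whole file is held in memory, which is reasonable for small documents but may not suit very large PDFs.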
Example 1: We will perform some basic operations: getting the number of pages and checking whether the file is encrypted. For this, we open the URL of the PDF, read the response, and then query the reader's numPages and isEncrypted attributes.
The Scrapy spider crawls the web page to find the PDF file that is to be scraped. The URL of that PDF file is stored in the variable URL, urllib is used to open it, and a PyPDF2 reader object is created by passing the resulting stream to the reader's constructor.
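Links found on a page are often relative, so they have to be resolved against the page URL before they can be opened. The sketch below shows that resolution with the standard library's `urljoin`, which is roughly what Scrapy's `response.urljoin()` does for the spider. The href value `1.pdf` is a hypothetical placeholder, not a link taken from the page.

```python
from urllib.parse import urljoin

# Page crawled by the spider (the start URL used in this article)
page_url = "https://codex.cs.yale.edu/avi/os-book/OS9/practice-exer-dir/index.html"

# Hypothetical relative href, standing in for what the XPath query returns
href = "1.pdf"

# Resolve the relative link against the page it was found on
pdf_url = urljoin(page_url, href)
print(pdf_url)
# https://codex.cs.yale.edu/avi/os-book/OS9/practice-exer-dir/1.pdf
```

An href that is already absolute is returned unchanged, so the same call is safe for both kinds of link.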
Python3
import io
import PyPDF2
import urllib.request
import scrapy


class ParserspiderSpider(scrapy.Spider):
    name = 'parserspider'

    # Page that links to the PDF files: practice exercise solutions
    # for the operating systems book by Abraham Silberschatz
    start_urls = [
        'https://codex.cs.yale.edu/avi/os-book/OS9/'
        'practice-exer-dir/index.html'
    ]

    # default parse method
    def parse(self, response):
        # getting the list of PDF links on the page
        pdfs = response.xpath('//tr[3]/td[2]/a/@href')

        # building the absolute URL of the first PDF
        URL = response.urljoin(pdfs[0].extract())

        # opening the URL and wrapping its bytes in a seekable stream
        pdf_file = urllib.request.urlopen(URL)
        # PdfFileReader, numPages and isEncrypted are the PyPDF2 1.x/2.x
        # names; PyPDF2 3.0+ uses PdfReader, len(reader.pages), is_encrypted
        reader = PyPDF2.PdfFileReader(io.BytesIO(pdf_file.read()))

        # accessing some properties of the PDF file
        print("This is the number of pages: " + str(reader.numPages))
        print("Is file Encrypted? " + str(reader.isEncrypted))
Example 2: In this example, we will extract the text of the PDF file (parsing). The PyPDF2 reader object exposes the pages of the document, and we call extractText() on each page to collect its text. We will print the extracted data to the terminal.
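The page-by-page extraction can be sketched in isolation as shown here. The names `collect_text` and `FakePage` are hypothetical: `FakePage` only stands in for a PyPDF2 page so that the sketch runs without a real PDF, and the pages are joined with newlines, which is a choice made for this illustration rather than something the spider does.

```python
def collect_text(pages):
    """Return the text of all pages, separated by newlines."""
    parts = []
    for page in pages:
        # extractText() is the PyPDF2 1.x/2.x page method used in this article
        parts.append(page.extractText())
    # Joining once at the end avoids repeated string concatenation
    return "\n".join(parts)


class FakePage:
    """Hypothetical stand-in for a PyPDF2 page object."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


print(collect_text([FakePage("Chapter 1"), FakePage("Chapter 2")]))
```

With a real reader, the same helper would be called as `collect_text(reader.pages)`.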
Python3
import io
import PyPDF2
import urllib.request
import scrapy


class ParserspiderSpider(scrapy.Spider):
    name = 'parserspider'

    # Page that links to the PDF files
    start_urls = [
        'https://codex.cs.yale.edu/avi/os-book/OS9/'
        'practice-exer-dir/index.html'
    ]

    # default parse method
    def parse(self, response):
        # getting the list of PDF links on the page
        pdfs = response.xpath('//tr[3]/td[2]/a/@href')

        # building the absolute URL of the first PDF
        URL = response.urljoin(pdfs[0].extract())

        # opening the URL and wrapping its bytes in a seekable stream
        pdf_file = urllib.request.urlopen(URL)
        # PdfFileReader and extractText are the PyPDF2 1.x/2.x names;
        # PyPDF2 3.0+ uses PdfReader and extract_text
        reader = PyPDF2.PdfFileReader(io.BytesIO(pdf_file.read()))

        # collecting the text of every page
        data = ""
        for page in reader.pages:
            data += page.extractText()

        print(data)