Apache Tika is a library for detecting document types and extracting content from many file formats. With it, one can build a universal type detector and content extractor that pulls both structured text and metadata from different types of documents such as spreadsheets, text documents, images, PDFs, and, to a certain extent, multimedia formats. Tika-Python is a Python binding to the Apache Tika™ REST services that allows Tika to be called natively from Python.
Installation:
To install Tika, type the following command in the terminal:
pip install tika
Note: Tika is written in Java, so you need a Java runtime (version 7 or later) installed.
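Since the library depends on a Java runtime, it can help to confirm that one is reachable before going further. The following is a minimal sketch using only the Python standard library; the helper name java_available is hypothetical, and it assumes the runtime is exposed as a java executable on the PATH.

```python
import shutil
import subprocess


def java_available(executable="java"):
    """Return True if an executable with this name is found on the PATH."""
    return shutil.which(executable) is not None


if java_available():
    # `java -version` writes its version banner to stderr, not stdout
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print(result.stderr.strip())
else:
    print("No Java runtime found on PATH; install one before using tika.")
```

This only checks that a java command exists; it does not verify that the installed version is recent enough.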
To extract the contents of a PDF file, we will use the from_file() method of the parser object. Let's look at its description first.
Syntax: parser.from_file(filename, additional)
Parameters:
- filename: The location of the file. It is opened in rb (read binary) mode.
- additional: Optional parameters, including:
  - service: The service requested from the Tika server. The default value is 'all', which returns the recursive text content plus metadata. 'meta' returns only metadata, and 'text' returns only content.
  - xmlContent: Set this to True to get the content as XML. The default value is False.
Return type: dictionary.
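The parameters above could be exercised roughly as follows. This is a sketch, not the library's own code: parse_document and VALID_SERVICES are hypothetical names, and it assumes the additional parameters are passed to from_file() as the keyword arguments service and xmlContent, matching the names in the list above.

```python
# Service values described in the parameter list above
VALID_SERVICES = ("all", "meta", "text")


def parse_document(path, service="all", xml_content=False):
    """Hypothetical wrapper: validate `service`, then call parser.from_file()."""
    if service not in VALID_SERVICES:
        raise ValueError(f"service must be one of {VALID_SERVICES}, got {service!r}")
    # Imported here so the validation above works even without tika installed
    from tika import parser
    return parser.from_file(path, service=service, xmlContent=xml_content)


# Example calls (these need a Java runtime and a local sample.pdf):
# metadata_only = parse_document("sample.pdf", service="meta")
# content_as_xml = parse_document("sample.pdf", xml_content=True)
```

Validating the service name up front gives a clear error for a typo instead of relying on however the server reacts to an unknown value.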
Now, let's look at Python programs for extracting data from a PDF:
Example 1: Extracting the contents of the PDF file.
Python3
# import the parser object from tika
from tika import parser

# open and parse the PDF file
parsed_pdf = parser.from_file("sample.pdf")

# parsed_pdf['content'] holds the extracted text as a string
# (to request only the text, pass service='text' to from_file)
data = parsed_pdf['content']

# print the content
print(data)

# <class 'str'>
print(type(data))
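The extracted string frequently contains runs of blank lines and stray whitespace. Below is a small sketch of one way to tidy it; clean_content is a hypothetical helper, the sample dictionary is made up for illustration, and the handling of a None value assumes that content can be empty when the file has no extractable text (for example, a scanned PDF).

```python
def clean_content(parsed):
    """Return the extracted text without blank lines or edge whitespace."""
    # Treat a missing or None 'content' entry as an empty string
    content = parsed.get("content") or ""
    lines = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up stand-in for the dictionary returned by parser.from_file()
sample = {"content": "\n\n  Page one  \n\n\nPage two\n", "status": 200}
print(clean_content(sample))
```

With a real file, the same helper would be called as clean_content(parser.from_file("sample.pdf")).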
Example 2: Extracting the metadata of the PDF file.
Python3
# import the parser object from tika
from tika import parser

parsed_pdf = parser.from_file("sample.pdf")

# the 'metadata' key holds the
# key-value pairs of metadata
print(parsed_pdf['metadata'])

# <class 'dict'>
print(type(parsed_pdf['metadata']))
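Individual fields can be read from the metadata dictionary with ordinary dictionary lookups. The sketch below assumes that a metadata value may be either a single string or a list of strings, so it normalizes both cases; first_value is a hypothetical helper and the sample keys and values are invented for illustration, not taken from a real file.

```python
def first_value(metadata, key, default=None):
    """Return the value for `key`, or its first item if the value is a list."""
    value = metadata.get(key, default)
    if isinstance(value, list):
        return value[0] if value else default
    return value


# Invented sample standing in for parsed_pdf['metadata']
sample_metadata = {
    "Content-Type": "application/pdf",
    "dc:creator": ["Alice", "Bob"],
}

print(first_value(sample_metadata, "Content-Type"))
print(first_value(sample_metadata, "dc:creator"))
print(first_value(sample_metadata, "dc:title", default="(untitled)"))
```

The available keys differ from file to file, so supplying a default avoids a KeyError when a field is absent.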
Example 3: Extracting the keys of the parsed result.
Python3
from tika import parser

parsed_pdf = parser.from_file("sample.pdf")

# Returns the keys available for the given PDF.
print(parsed_pdf.keys())
Example 4: Checking the Tika server status.
Python3
from tika import parser

# The parsed result also carries the status
# returned by the Tika server: 200 means success
parsed_pdf = parser.from_file("sample.pdf")
print(parsed_pdf['status'], type(parsed_pdf['status']))
Output:
200 <class 'int'>
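The status value can be used to guard the rest of a script, so that a failed request is not mistaken for an empty document. This is a minimal sketch: require_success is a hypothetical helper, and the dictionaries are made-up stand-ins for what parser.from_file() returns.

```python
def require_success(parsed):
    """Return `parsed` unchanged if its status is 200, otherwise raise."""
    status = parsed.get("status")
    if status != 200:
        raise RuntimeError(f"Tika server returned status {status}")
    return parsed


# Made-up results standing in for parser.from_file("sample.pdf")
ok = require_success({"status": 200, "content": "hello", "metadata": {}})
print(ok["content"])

try:
    require_success({"status": 500, "content": None, "metadata": {}})
except RuntimeError as error:
    print(error)
```

In a real script the call would wrap the parser directly, as in require_success(parser.from_file("sample.pdf")).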