Apache Tika is a Java library for document type detection and content extraction from a wide range of file formats. It relies on a set of format-specific parser libraries and document type detection techniques, and exposes them through a single generic API: all of these parser libraries are encapsulated behind one interface, the Parser interface. Before running the example in this article, download Apache Tika and add it to the project's classpath.
Tika provides several built-in classes to extract and access the content of a TXT document. The following classes are used in the extraction of the content:
- BodyContentHandler is a class in the package org.apache.tika.sax that creates a handler for the text. It receives the XHTML body character events produced by a parser and, by default, stores them in an internal string buffer. It extends the parent class ContentHandlerDecorator, and the collected text can be retrieved by calling toString() on the handler.
- ParseContext is a class in the package org.apache.tika.parser, which is used to pass context information on to the Tika parsers.
- TXTParser is a class in the package org.apache.tika.parser.txt, which parses the contents of plain text documents. It detects the character encoding of the input and passes the decoded text to the content handler.
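To make the description of BodyContentHandler more concrete, here is a minimal sketch of the same idea using only the SAX classes that ship with the JDK. It is not Tika's implementation: the class name BodyTextCollector and the helper extractBodyText are hypothetical, invented for this illustration. The handler receives SAX events, appends the character events that occur inside the body element to an internal buffer, and returns the buffered text from toString().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; NOT part of Apache Tika.
final class BodyTextCollector extends DefaultHandler {

    // Internal buffer that accumulates the body text
    private final StringBuilder buffer = new StringBuilder();

    // Greater than zero while the parser is inside <body>
    private int bodyDepth = 0;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep only character events that occur inside <body>
        if (bodyDepth > 0) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    // Parses a well-formed XHTML string and returns its body text.
    static String extractBodyText(String xhtml) {
        BodyTextCollector collector = new BodyTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return collector.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>ignored</title></head>"
                     + "<body><p>Hello, Tika</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

In the Tika example in this article, the parser plays the role of the SAX parser here: it emits the events, and the handler only collects them.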
Procedure:
- Create a content handler.
- Create a TXT file in a local directory on the system.
- Create a FileInputStream for the path of that TXT file.
- Create a Metadata object and a ParseContext object for the document.
- Parse the document using the TXTParser class.
- Print the extracted content of the TXT file to show that the extraction worked.
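The file-related steps of the procedure (creating the TXT file and opening a FileInputStream on the same path) can be tried out with the JDK alone, before Tika is involved. The following is a minimal sketch under some assumptions: the class name SampleTxtFile, the temporary file location, and the sample text are all hypothetical, and readAllBytes requires Java 9 or later.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of Apache Tika.
final class SampleTxtFile {

    // Writes the given text to a new temporary .txt file
    // and returns the path of that file.
    static Path create(String text) {
        try {
            Path file = Files.createTempFile("sample", ".txt");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens a FileInputStream on the same path and reads the
    // content back, to confirm that the stream is usable.
    static String readBack(Path file) {
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = create("Hello from a plain text file.\n");
        System.out.println("Created " + file);
        System.out.print(readBack(file));
    }
}
```

A stream opened this way is the kind of object that is passed as the first argument to the parser's parse method. The try-with-resources statement closes it once reading is finished.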
Example:
Java
// Java Program to Extract Content from a TXT document

// Importing java input/output classes
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Importing Apache Tika and SAX classes
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Main Class
public class GFG {

    // Main driver method
    public static void main(String[] args)
        throws IOException, TikaException, SAXException
    {
        // Creating a content handler by
        // creating an object of BodyContentHandler class
        BodyContentHandler handler = new BodyContentHandler();

        // Creating an object of type Metadata to use
        Metadata metadata = new Metadata();

        // Creating a parse context for the text document by
        // creating an object of ParseContext class
        ParseContext pcontext = new ParseContext();

        // Now, the text document can be parsed
        // using the TXTParser class
        TXTParser txtParser = new TXTParser();

        // Create a file input stream on the path of the
        // existing TXT file; try-with-resources closes it
        try (FileInputStream fstream
             = new FileInputStream(new File("C:/test.txt"))) {

            // Method parse invoked on the TXTParser object
            txtParser.parse(fstream, handler, metadata, pcontext);
        }

        // Print and display the extracted content
        // from the TXT file
        System.out.println("Extracting contents :"
                           + handler.toString());
    }
}
Output: The program prints the text extracted from test.txt to the console, preceded by the label "Extracting contents :".