Apache Tika is a library that lets you extract data from different document formats (.PDF, .DOCX, etc.). In this tutorial, we will extract data using BodyContentHandler. The following Maven dependency is required:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>
BodyContentHandler is a content handler decorator that captures everything inside the XHTML <body> tag. The <body> and </body> tags themselves are not included in the result.
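To make the idea of "body content only" concrete, here is a minimal sketch that uses only the JDK's SAX parser, not Tika. The class and method names (BodyTextSketch, BodyOnlyHandler, extractBodyText) are hypothetical and chosen for illustration. It is a simplification of what BodyContentHandler does: it keeps character data only while the parser is inside <body> and never records the tag itself.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; it is not part of Tika
class BodyTextSketch {

    // Collects character data only while the parser is inside <body>
    static class BodyOnlyHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean insideBody = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("body".equals(qName)) {
                insideBody = true; // the tag itself is not recorded
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                insideBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (insideBody) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Parses a small XHTML string and returns the text found inside <body>
    static String extractBodyText(String xhtml) {
        try {
            BodyOnlyHandler handler = new BodyOnlyHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
            + "<body><p>Hello</p><p>Tika</p></body></html>";
        // Prints "HelloTika": the title in <head> is skipped
        System.out.println(extractBodyText(xhtml));
    }
}
```

Like BodyContentHandler, the handler exposes its result through toString(), so the calling code looks similar to the Tika examples in this tutorial.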
Let us first discuss the constructors of this class:
Constructor | Description |
---|---|
BodyContentHandler() | Writes all content into an internal string buffer; call toString() to get the content. By default, the maximum content length is 100,000 characters. If this limit is reached, a SAXException is thrown. |
BodyContentHandler(int writeLimit) | Writes all content into an internal string buffer; call toString() to get the content. writeLimit is the maximum number of characters that can be written; set it to -1 to disable the limit. If this limit is reached, a SAXException is thrown. |
BodyContentHandler(OutputStream outputStream) | Writes all content into the given outputStream, without any content limit. |
BodyContentHandler(Writer writer) | Writes all content into the given writer, without any content limit. |
BodyContentHandler(ContentHandler handler) | Passes all content to the given handler. |
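The write limit described in the table can be illustrated with a small stand-in that does not depend on Tika. The sketch below uses only the JDK's SAX classes; the names WriteLimitSketch, LimitedTextHandler, and parseWithLimit are hypothetical. It assumes, for illustration, that the text which still fits under the limit is kept before the SAXException is thrown, and that -1 disables the limit.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; it is not part of Tika
class WriteLimitSketch {

    // Buffers character data and fails once the limit would be exceeded
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private final int writeLimit; // -1 disables the limit
        boolean limitReached = false;

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length)
            throws SAXException {
            if (writeLimit != -1 && buffer.length() + length > writeLimit) {
                // Keep only the part that still fits, then signal the caller
                buffer.append(ch, start, writeLimit - buffer.length());
                limitReached = true;
                throw new SAXException(
                    "Write limit of " + writeLimit + " characters reached");
            }
            buffer.append(ch, start, length);
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Returns the collected text, truncated if the limit was reached
    static String parseWithLimit(String xml, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!handler.limitReached) {
                // A genuine parse error, not the write limit
                throw new IllegalStateException(e);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        // Prints "0123": only the first four characters fit under the limit
        System.out.println(parseWithLimit("<doc>0123456789</doc>", 4));
    }
}
```

With Tika itself, the same pattern applies: wrap parser.parse(...) in a try block, catch the SAXException, and decide whether partial content is acceptable for your use case.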
This class relies on the following helper class:
Class | Action Performed |
---|---|
MatchingContentHandler | Allows you to get data by XPath |
Note: BodyContentHandler does not itself implement any method of the ContentHandler interface; it only defines the XPath expression that MatchingContentHandler uses to select the XHTML body content.
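The idea of selecting the body with a path expression can be sketched with the JDK alone. Note the assumptions: Tika matches its path while streaming SAX events, whereas this sketch builds a DOM and uses javax.xml.xpath; the expression /html/body and the names BodyXPathSketch and selectBodyText are hypothetical and only illustrate the concept.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration class; it is not part of Tika
class BodyXPathSketch {

    // Returns the concatenated text of everything below /html/body
    static String selectBodyText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            // Evaluating a node path as a string yields the string value
            // of the first matching node, i.e. all of its descendant text
            return XPathFactory.newInstance().newXPath()
                .evaluate("/html/body", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
            + "<body><p>Hello</p><p>Tika</p></body></html>";
        // Prints "HelloTika"
        System.out.println(selectBodyText(xhtml));
    }
}
```

The DOM approach loads the whole document into memory, which is fine for a demonstration; a streaming matcher avoids that cost for large documents.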
Implementation:
Example 1: Reading everything into the inner string buffer
Java
// Java Program to Read Everything into the Internal String Buffer

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
public class GFG {

    // Method 1
    // Parses a classpath resource and returns its body text
    public String parseToStringExample(String fileName)
        throws IOException, TikaException, SAXException
    {
        // Opening the resource as an InputStream; it is closed
        // automatically when the try block ends
        try (InputStream stream = this.getClass()
                                      .getClassLoader()
                                      .getResourceAsStream(fileName)) {

            Parser parser = new AutoDetectParser();
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            // Parsing the document
            parser.parse(stream, handler, metadata, context);

            return handler.toString();
        }
    }

    // Method 2
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating object of main class in main method
        GFG example = new GFG();

        // Display message for better readability
        System.out.println("Result");

        // Calling method 1 to parse the document by
        // providing the file name as an argument
        System.out.println(
            example.parseToStringExample("test-reading.pdf"));
    }
}
Output:
The console shows "Result", followed by the text extracted from test-reading.pdf.
Example 2: Writing content into a file without a maximum content length
Java
// Java Program to Write Parsed Content into a File

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
// BodyContentHandlerWriteToFileExample
public class GFG {

    // Method 1
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating an object of the class
        GFG example = new GFG();

        // Calling Method 2 and passing the source file name
        // and the destination file path as arguments
        example.writeParsedDataToFile(
            "test-reading.pdf",
            "/Users/ali_zhagparov/Desktop/pdf-content.txt");
    }

    // Method 2
    // Writing parsed data to a file
    public void writeParsedDataToFile(String readFromFileName,
                                      String writeToFileName)
        throws IOException, TikaException, SAXException
    {
        // Creating an object of File class
        File yourFile = new File(writeToFileName);

        // If the file already exists,
        // no operation is performed
        yourFile.createNewFile();

        // Both streams are closed automatically
        // when the try block ends
        try (InputStream stream = this.getClass()
                                      .getClassLoader()
                                      .getResourceAsStream(readFromFileName);
             FileOutputStream fileOutputStream
                 = new FileOutputStream(yourFile, false)) {

            Parser parser = new AutoDetectParser();

            // This constructor applies no content length limit
            ContentHandler handler
                = new BodyContentHandler(fileOutputStream);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            parser.parse(stream, handler, metadata, context);
        }
    }
}
Output:
Nothing is printed to the console, because the program writes all of the extracted content to a file instead.
The program produces a '.txt' file that contains the text content of the '.pdf' file.