Apache Tika is a library that allows you to extract data from different document types (.PDF, .DOCX, etc.). In this tutorial, we will extract data by using BodyContentHandler. The following Maven dependency is used:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>
BodyContentHandler is a content handler decorator that lets you get everything inside the XHTML <body> element. The <body> and </body> tags themselves are not included in the resulting value.
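To make the "everything inside <body>, but not the tag itself" behaviour concrete, here is a minimal JDK-only sketch of a decorator that forwards SAX events to a wrapped handler only while the parser is inside <body>. It only imitates the effect described above and is not Tika's code: the names BodyDecoratorSketch, BodyOnlyHandler, Recorder and bodyContent are hypothetical, and the sample input ignores XML namespaces for brevity.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it imitates the effect only and is not Tika code.
class BodyDecoratorSketch {

    // Decorator: forwards SAX events to the wrapped handler only inside <body>.
    static class BodyOnlyHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private int depth = 0; // 0 = outside <body>, 1 = directly inside it

        BodyOnlyHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (depth > 0) {
                delegate.startElement(uri, local, qName, atts);
                depth++;
            } else if ("body".equals(qName)) {
                depth = 1; // the <body> tag itself is not forwarded
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (depth > 1) {
                delegate.endElement(uri, local, qName);
            }
            if (depth > 0) {
                depth--; // reaching 0 means </body> was seen; it is not forwarded
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (depth > 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    // Wrapped handler: records the markup and text it receives.
    static class Recorder extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    // Parses the given markup and returns what reached the wrapped handler.
    static String bodyContent(String xhtml) {
        Recorder recorder = new Recorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new BodyOnlyHandler(recorder));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return recorder.out.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
            + "<body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(bodyContent(xhtml)); // prints <p>Hello</p><p>Tika</p>
    }
}
```

The title in <head> never reaches the wrapped handler, and neither do the <body> tags; only the nested elements and their text do.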
Let us first discuss the constructors of this class:
Constructor | Description |
---|---|
BodyContentHandler() | Writes all content into an internal string buffer; call toString() to get the content. By default, the maximum content length is 100 000 characters. If this limit is reached, a SAXException is thrown. |
BodyContentHandler(int writeLimit) | Writes all content into an internal string buffer; call toString() to get the content. writeLimit is the maximum number of characters that can be written; set it to -1 to disable the limit. If this limit is reached, a SAXException is thrown. |
BodyContentHandler(OutputStream outputStream) | Writes all content into the given outputStream, without any content limit. |
BodyContentHandler(Writer writer) | Writes all content into the given writer, without any content limit. |
BodyContentHandler(ContentHandler handler) | Passes all content to the given handler. |
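The write limit semantics can be illustrated without a document on the classpath. The following JDK-only sketch buffers text up to a limit and then throws a SAXException, while the text collected so far stays available through toString(). It imitates the behaviour described in the table and is not Tika's implementation: WriteLimitSketch, LimitedTextHandler and parseWithLimit are hypothetical names, and treating every negative value (not only -1) as "no limit" is a simplification made for this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it imitates the documented behaviour and is not Tika code.
class WriteLimitSketch {

    // Buffers text up to writeLimit characters; a negative value such as -1
    // disables the limit in this sketch.
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private final int writeLimit;

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit < 0 || buffer.length() + length <= writeLimit) {
                buffer.append(ch, start, length);
            } else {
                // Keep what still fits, then signal that the limit was hit.
                buffer.append(ch, start, writeLimit - buffer.length());
                throw new SAXException("Write limit of " + writeLimit + " characters reached");
            }
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Returns the text collected before parsing finished or stopped.
    static String parseWithLimit(String xml, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Limit reached (or malformed input): the partial text is still buffered.
            System.err.println("Parsing stopped: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xml = "<doc>0123456789</doc>";
        System.out.println(parseWithLimit(xml, 4));  // prints 0123
        System.out.println(parseWithLimit(xml, -1)); // prints 0123456789
    }
}
```

With Tika itself, the same pattern would be to wrap parser.parse(...) in a try block, catch the SAXException, and still read the partial text from a BodyContentHandler created with a write limit.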
This class declares no methods of its own. It relies on the following class:
Class | Action Performed |
---|---|
MatchingContentHandler | Allows you to get data by XPath |
Note: The BodyContentHandler class doesn’t implement any method of the ContentHandler interface itself. It only supplies the XPath expression that MatchingContentHandler uses to select the XHTML body content.
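As a rough illustration of selecting body content with an XPath expression, the sketch below uses the JDK's own javax.xml.xpath API rather than Tika's MatchingContentHandler. The expression /html/body/descendant::text() is an assumption chosen for this demo, the sample input carries no XHTML namespace (Tika's XHTML output does), and the names BodyXPathSketch and bodyText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses the JDK XPath API, not Tika's XPath support.
class BodyXPathSketch {

    // Joins all text nodes below /html/body in document order.
    static String bodyText(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                "/html/body/descendant::text()",
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                text.append(nodes.item(i).getNodeValue());
            }
            return text.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
            + "<body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(bodyText(xhtml)); // prints HelloTika
    }
}
```

The path starts at the root and steps into body, so text under head is never matched.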
Implementation:
Example 1: Reading everything into the inner string buffer
Java
// Java Program to Read Everything into Inner String Buffer

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
public class GFG {

    // Method 1
    // To parse the file content into a string
    public String parseToStringExample(String fileName)
        throws IOException, TikaException, SAXException
    {
        // Opening the file from the classpath;
        // the stream is closed automatically when the block ends
        try (InputStream stream = this.getClass()
                                      .getClassLoader()
                                      .getResourceAsStream(fileName)) {

            Parser parser = new AutoDetectParser();
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            // Parsing the document into the handler's internal buffer
            parser.parse(stream, handler, metadata, context);

            return handler.toString();
        }
    }

    // Method 2
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating object of main class in main method
        GFG example = new GFG();

        // Display message for better readability
        System.out.println("Result");

        // Calling method 1 to parse the file given as an argument
        System.out.println(
            example.parseToStringExample("test-reading.pdf"));
    }
}
Output:
The console shows the line Result, followed by the text extracted from test-reading.pdf.
Example 2: Writing content into a file (no maximum content length applies)

Java
// Java Program to Write Parsed Content into a File
// (the OutputStream constructor applies no content length limit)

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
// BodyContentHandlerWriteToFileExample
public class GFG {

    // Method 1
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating an object of the class
        GFG example = new GFG();

        // Calling Method 2 and passing the resource name
        // and the output file path as arguments
        example.writeParsedDataToFile(
            "test-reading.pdf",
            "/Users/ali_zhagparov/Desktop/pdf-content.txt");
    }

    // Method 2
    // Writing parsed data to a file
    public void writeParsedDataToFile(String readFromFileName,
                                      String writeToFileName)
        throws IOException, TikaException, SAXException
    {
        // Creating an object of File class
        File yourFile = new File(writeToFileName);

        // Creates the file only if it does not exist yet
        yourFile.createNewFile();

        // Both streams are closed automatically when the block ends;
        // passing false overwrites any previous file content
        try (InputStream stream = this.getClass()
                                      .getClassLoader()
                                      .getResourceAsStream(readFromFileName);
             FileOutputStream fileOutputStream
                 = new FileOutputStream(yourFile, false)) {

            Parser parser = new AutoDetectParser();
            ContentHandler handler
                = new BodyContentHandler(fileOutputStream);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            // Parsing the document; the content goes to the file
            parser.parse(stream, handler, metadata, context);
        }
    }
}
Output:
Nothing is printed to the console, because all the extracted content is written to the file instead.
The program produces a ‘.txt’ file that contains the text content of the ‘.pdf’ file.