Apache Tika is a library for extracting data from documents in many formats (.pdf, .docx, etc.). In this tutorial, we will extract data using BodyContentHandler. The following Maven dependency is required:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>
BodyContentHandler is a content handler decorator that passes along only the content inside the XHTML <body> element. The <body> and </body> tags themselves are not included in the result.
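To make the "body only" behaviour concrete, here is a minimal sketch that uses only the JDK's built-in SAX parser. It is not Tika's implementation; the class `BodyTextSketch`, the handler `BodyOnlyHandler`, and the sample XHTML string are hypothetical names introduced for illustration. The handler keeps character data only while the parser is between the opening and closing body tags.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Apache Tika
class BodyTextSketch {

    // Collects character data only while the parser is inside <body>
    static class BodyOnlyHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean insideBody = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if ("body".equals(qName)) {
                insideBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                insideBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (insideBody) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Parses an XHTML string and returns only the text found inside <body>
    static String extractBodyText(String xhtml) {
        BodyOnlyHandler handler = new BodyOnlyHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored title</title></head>"
                     + "<body><p>Hello, Tika</p></body></html>";
        // Prints "Hello, Tika"; the title in <head> is skipped
        System.out.println(extractBodyText(xhtml));
    }
}
```

The title text never reaches the buffer because it is reported before the body element starts, which mirrors the effect that BodyContentHandler has on the XHTML that Tika parsers emit.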
The constructors of this class are as follows:
| Constructor | Description |
|---|---|
| BodyContentHandler() | Writes all content into an internal string buffer; call toString() to get the content. By default, the maximum content length is 100,000 characters. If this limit is reached, a SAXException is thrown. |
| BodyContentHandler(int writeLimit) | Writes all content into an internal string buffer; call toString() to get the content. writeLimit is the maximum number of characters that can be written; set it to -1 to disable the limit. If this limit is reached, a SAXException is thrown. |
| BodyContentHandler(OutputStream outputStream) | Writes all content to the given output stream, with no content limit. |
| BodyContentHandler(Writer writer) | Writes all content to the given writer, with no content limit. |
| BodyContentHandler(ContentHandler handler) | Passes all content to the given handler. |
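The write-limit behaviour described in the table can be sketched with a small stand-alone handler built on the JDK's SAX classes. This is only an illustration under stated assumptions, not Tika's code: `WriteLimitSketch`, `LimitedTextHandler`, and `exceedsLimit` are hypothetical names. The handler buffers characters until the limit is reached, then throws a SAXException; a limit of -1 disables the check.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not Tika's implementation
class WriteLimitSketch {

    // Buffers character data and enforces a maximum number of characters
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private final int writeLimit; // -1 disables the limit

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length)
            throws SAXException {
            if (writeLimit == -1
                || buffer.length() + length <= writeLimit) {
                buffer.append(ch, start, length);
            } else {
                // Keep what still fits, then signal that the limit was hit
                buffer.append(ch, start, writeLimit - buffer.length());
                throw new SAXException(
                    "Write limit of " + writeLimit + " characters reached");
            }
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Feeds the text to a handler and reports whether the limit was hit
    static boolean exceedsLimit(String text, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        try {
            handler.characters(text.toCharArray(), 0, text.length());
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(exceedsLimit("0123456789", 5));  // prints "true"
        System.out.println(exceedsLimit("0123456789", -1)); // prints "false"
    }
}
```

With Tika itself, the same choice is made through the constructor argument: for large documents that may exceed the default limit, pass -1 to BodyContentHandler(int writeLimit), or be prepared to catch the SAXException.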
Internally, this class relies on the following helper class:
| Class | Action Performed |
|---|---|
| MatchingContentHandler | Passes on only the content that matches an XPath expression |
Note: BodyContentHandler does not override any method of the ContentHandler interface itself; it only defines the XPath expression that MatchingContentHandler uses to select the XHTML body content.
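The idea of selecting the body with an XPath expression can be illustrated with the JDK's own javax.xml.xpath API. This is only an analogy: Tika applies its own XPath matcher to the stream of SAX events, whereas the sketch below evaluates an expression against a fully parsed document. The class `XPathBodySketch`, the method `bodyText`, and the expression `string(/html/body)` are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Apache Tika
class XPathBodySketch {

    // Returns the concatenated text of everything inside /html/body
    static String bodyText(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(
                "string(/html/body)",
                new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
                     + "<body><p>First</p><p>Second</p></body></html>";
        // Prints "FirstSecond"; the title in <head> is not selected
        System.out.println(bodyText(xhtml));
    }
}
```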
Implementation:
Example 1: Reading everything into the inner string buffer
Java
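// Java program to read everything into the internal string buffer

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
public class GFG {

    // Method 1
    // Parses the given classpath resource and returns its body text
    public String parseToStringExample(String fileName)
        throws IOException, TikaException, SAXException
    {
        // Opening the resource as an InputStream;
        // try-with-resources closes it automatically
        try (InputStream stream = this.getClass()
                                      .getClassLoader()
                                      .getResourceAsStream(fileName)) {

            Parser parser = new AutoDetectParser();
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            // Parsing the document; the extracted text
            // accumulates in the handler's string buffer
            parser.parse(stream, handler, metadata, context);

            return handler.toString();
        }
    }

    // Method 2
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating object of main class in main method
        GFG example = new GFG();

        // Display message for better readability
        System.out.println("Result");

        // Calling method 1 to parse the file
        // whose name is passed as an argument
        System.out.println(example.parseToStringExample(
            "test-reading.pdf"));
    }
}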
Output:
The program prints Result, followed by the text extracted from test-reading.pdf.
Example 2: Writing content into a file without a maximum content length
Java
// Java program to write content into a file.
// The OutputStream constructor applies no maximum content length.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
// BodyContentHandlerWriteToFileExample
public class GFG {

    // Method 1
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating an object of the class
        GFG example = new GFG();

        // Calling method 2 and passing the resource to read
        // and the path of the file to write as arguments
        example.writeParsedDataToFile(
            "test-reading.pdf",
            "/Users/ali_zhagparov/Desktop/pdf-content.txt");
    }

    // Method 2
    // Writing parsed data to a file
    public void writeParsedDataToFile(String readFromFileName,
                                      String writeToFileName)
        throws IOException, TikaException, SAXException
    {
        // Creating an object of File class
        File yourFile = new File(writeToFileName);

        // Creates the file only if it does not already exist
        yourFile.createNewFile();

        // Both streams are closed automatically; passing false
        // to FileOutputStream overwrites any existing content
        try (InputStream stream = this.getClass()
                                      .getClassLoader()
                                      .getResourceAsStream(readFromFileName);
             FileOutputStream fileOutputStream
                 = new FileOutputStream(yourFile, false)) {

            Parser parser = new AutoDetectParser();
            ContentHandler handler
                = new BodyContentHandler(fileOutputStream);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            parser.parse(stream, handler, metadata, context);
        }
    }
}
Output:
Nothing is printed to the console, because the program writes all the extracted content to a file.
The program produces a ‘.txt’ file containing the text content of the ‘.pdf’ file.

