Monday, November 18, 2024
Google search engine
HomeLanguagesJavaJava Program to Extract Content from a ODF File

Java Program to Extract Content from a ODF File

The full of ODF is Open Document Format. it is an international family of standards that’s the successor of commonly used deprecated vendor-specific document formats like .doc, .wpd, .xls . ODF documents are smaller when compared to other formats. OpenDocumentParser class is used from TIKA library to extract the content from the ODF file.

Methods used:

  1. BodyContentHandler(): It creates a content handler that writes XHTML body character events to an internal string buffer.
  2. Metadata() : It constructs new, empty metadata.
  3. ParseContext(): It creates a parse context object that is used to pass context information to Tika parsers.
  4. parse(): Instantiate the parser object, and invoke the parse method.

Following are the dependencies required for executing the following java code:

tika-parsers-1.24.1.jar
commons-io-2.8.0.jar
slf4j-api-2.0.0-alpha0.jar

Implementation:

Java




// Java Program to Extract Content from a ODF file
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.odf.OpenDocumentParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
 
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import sun.security.util.Length;
 
public class OdfContentExtractor {
    public static void main(String[] args)
    {
 
        try {
            BodyContentHandler handler
                = new BodyContentHandler();
 
            Metadata metadata = new Metadata();
 
            // Here .odt is open document text format.
            FileInputStream inputstream
                = new FileInputStream(
                    new File("F:\\geeks.odt"));
            ParseContext parsecontent = new ParseContext();
 
            // Parsing the open document.
            OpenDocumentParser opendocumentparser
                = new OpenDocumentParser();
 
            // Passing the InputStream , ContentHandler,
            // Metadata , ParseContext to the parse method.
            opendocumentparser.parse(inputstream, handler,
                                     metadata,
                                     parsecontent);
            System.out.println("Content in the document :"
                               + handler.toString());
 
            // Displaying the metadata of the odf file.
            System.out.println("Metadata of the document:");
            String[] metaName = metadata.names();
            int l = metaName.length;
            for (int i = 0; i < l; i++) {
                System.out.println(
                    metaName[i]
                    + " : =  " + metadata.get(metaName[i]));
            }
        }
        catch (Exception e) {
 
            System.out.println(
                "failed to extract content due to " + e);
        }
    }
}


Output:

Content in the document :Geekforgeeks has a great content on DSA.

Metadata of the document:
date : =  2020-11-21T05:38:00Z
meta:paragraph-count : =  1
meta:word-count : =  6
meta:initial-author : =  Mohan Sai
initial-creator : =  Mohan Sai
dc:creator : =  Mohan Sai
generator : =  MicrosoftOffice/15.0 MicrosoftWord
Word-Count : =  6
dcterms:created : =  2020-11-21T05:36:00Z
dcterms:modified : =  2020-11-21T05:38:00Z
Last-Modified : =  2020-11-21T05:38:00Z
nbPara : =  1
Last-Save-Date : =  2020-11-21T05:38:00Z
meta:character-count : =  40
Paragraph-Count : =  1
meta:save-date : =  2020-11-21T05:38:00Z
modified : =  2020-11-21T05:38:00Z
Edit-Time : =  PT0S
nbCharacter : =  40
nbPage : =  1
nbWord : =  6
Content-Type : =  application/vnd.oasis.opendocument.text
creator : =  Mohan Sai
meta:author : =  Mohan Sai
meta:creation-date : =  2020-11-21T05:36:00Z
Creation-Date : =  2020-11-21T05:36:00Z
xmpTPg:NPages : =  1
Character Count : =  40
editing-cycles : =  3
Page-Count : =  1
Author : =  Mohan Sai
meta:page-count : =  1

RELATED ARTICLES

Most Popular

Recent Comments