ODF stands for Open Document Format. It is an international family of standards that succeeds commonly used, older vendor-specific document formats such as .doc, .wpd, and .xls. ODF files are ZIP-compressed packages of XML, so they are often smaller than equivalent documents in those older formats. The OpenDocumentParser class from the Apache Tika library can be used to extract the content of an ODF file.
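Because an ODF file is a ZIP package of XML parts, its structure can be inspected with the JDK alone. The sketch below is only an illustration: it builds a tiny ODF-like package in memory (not a valid ODF document) and lists its entries. The class name `OdfPackageSketch` and its helper methods are hypothetical; the entry names `mimetype`, `content.xml`, and `meta.xml` are ones a real ODF package contains.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OdfPackageSketch {

    // Builds a tiny ODF-like ZIP in memory. Illustration only: this is
    // NOT a valid ODF document, just the same kind of container.
    public static byte[] buildSamplePackage() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            addEntry(zip, "mimetype", "application/vnd.oasis.opendocument.text");
            addEntry(zip, "content.xml", "<office:document-content/>");
            addEntry(zip, "meta.xml", "<office:document-meta/>");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    private static void addEntry(ZipOutputStream zip, String name, String body)
            throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(body.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    // Returns the entry names of a ZIP package in the order they are stored.
    public static List<String> listEntries(byte[] packageBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                     new ZipInputStream(new ByteArrayInputStream(packageBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null;
                 entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [mimetype, content.xml, meta.xml]
        System.out.println(listEntries(buildSamplePackage()));
    }
}
```

In a real .odt file, the document text lives in `content.xml` and the document properties in `meta.xml`; a parser such as Tika reads these parts so that you do not have to.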
Constructors and methods used:
- BodyContentHandler(): Creates a content handler that writes XHTML body character events to an internal string buffer. By default, the buffer is limited to 100,000 characters.
- Metadata(): Constructs a new, empty metadata object.
- ParseContext(): Creates a parse context object that is used to pass context information to Tika parsers.
- parse(): Called on the parser instance. It reads the document from the input stream, sends its content to the handler, and fills in the metadata object.
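To make the idea of "writing XHTML body character events to an internal string buffer" concrete, here is a minimal sketch that uses only the JDK's SAX parser. It is a simplified stand-in, not Tika's implementation: the class `BodyTextSketch`, the nested `BodyTextHandler`, and the method `extractBodyText` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    // Simplified stand-in for a body content handler:
    // buffers only the character events seen inside <body>.
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private int bodyDepth = 0; // > 0 while we are inside <body>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Parses an XHTML string and returns the text found inside <body>.
    public static String extractBodyText(String xhtml) {
        BodyTextHandler handler = new BodyTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("could not parse input", ex);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
                + "<body><p>Hello ODF</p></body></html>";
        // prints Hello ODF
        System.out.println(extractBodyText(xhtml));
    }
}
```

The pattern is the same one the Tika program relies on: the parser emits SAX events, the handler accumulates the text, and `toString()` returns what was collected.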
The following dependencies are required to run the Java code below:
- tika-parsers-1.24.1.jar
- commons-io-2.8.0.jar
- slf4j-api-2.0.0-alpha0.jar
Implementation:
Java
// Java program to extract content from an ODF file
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.odf.OpenDocumentParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class OdfContentExtractor {
    public static void main(String[] args) {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        OpenDocumentParser openDocumentParser = new OpenDocumentParser();

        // .odt is the OpenDocument text format.
        // try-with-resources closes the stream even if parsing fails.
        try (FileInputStream inputStream =
                     new FileInputStream(new File("F:\\geeks.odt"))) {
            // Pass the InputStream, ContentHandler, Metadata,
            // and ParseContext to the parse method.
            openDocumentParser.parse(inputStream, handler, metadata, parseContext);

            System.out.println("Content in the document :" + handler.toString());

            // Display the metadata of the ODF file.
            System.out.println("Metadata of the document:");
            String[] metaNames = metadata.names();
            for (String name : metaNames) {
                System.out.println(name + " : = " + metadata.get(name));
            }
        } catch (IOException | SAXException | TikaException e) {
            System.out.println("failed to extract content due to " + e);
        }
    }
}
Output:
Content in the document :Geekforgeeks has a great content on DSA.
Metadata of the document:
date : = 2020-11-21T05:38:00Z
meta:paragraph-count : = 1
meta:word-count : = 6
meta:initial-author : = Mohan Sai
initial-creator : = Mohan Sai
dc:creator : = Mohan Sai
generator : = MicrosoftOffice/15.0 MicrosoftWord
Word-Count : = 6
dcterms:created : = 2020-11-21T05:36:00Z
dcterms:modified : = 2020-11-21T05:38:00Z
Last-Modified : = 2020-11-21T05:38:00Z
nbPara : = 1
Last-Save-Date : = 2020-11-21T05:38:00Z
meta:character-count : = 40
Paragraph-Count : = 1
meta:save-date : = 2020-11-21T05:38:00Z
modified : = 2020-11-21T05:38:00Z
Edit-Time : = PT0S
nbCharacter : = 40
nbPage : = 1
nbWord : = 6
Content-Type : = application/vnd.oasis.opendocument.text
creator : = Mohan Sai
meta:author : = Mohan Sai
meta:creation-date : = 2020-11-21T05:36:00Z
Creation-Date : = 2020-11-21T05:36:00Z
xmpTPg:NPages : = 1
Character Count : = 40
editing-cycles : = 3
Page-Count : = 1
Author : = Mohan Sai
meta:page-count : = 1
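The metadata names in the output above appear in no particular order. If a stable, easier-to-scan listing is wanted, the names can be sorted before printing. The sketch below shows the idea with a plain `Map` standing in for Tika's `Metadata` object, so it runs without Tika on the classpath; the class `SortedMetadataSketch` and its `format` method are hypothetical. With Tika, the same loop could be applied to the array returned by `metadata.names()`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SortedMetadataSketch {

    // Formats name/value pairs one per line, sorted by name.
    // The Map is an assumed stand-in for Tika's Metadata object.
    public static String format(Map<String, String> metadata) {
        String[] names = metadata.keySet().toArray(new String[0]);
        Arrays.sort(names); // natural String order: uppercase before lowercase
        StringBuilder out = new StringBuilder();
        for (String name : names) {
            out.append(name).append(" : = ")
               .append(metadata.get(name)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A few entries copied from the output above.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("meta:word-count", "6");
        metadata.put("Content-Type", "application/vnd.oasis.opendocument.text");
        metadata.put("Author", "Mohan Sai");

        // Prints Author first, then Content-Type, then meta:word-count.
        System.out.print(format(metadata));
    }
}
```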