In the software industry, content travels in documents of many formats: TXT, XLS, PDF, and sometimes even MP4. With so many formats in use, there needs to be a common way to extract content and metadata from all of them. Apache Tika, a powerful and versatile content-analysis library, provides exactly that. As an introduction, let us walk through the features of Apache Tika to see how documents can be parsed and how their content, type, and other properties can be obtained, using a sample Maven project.
Example Maven Project
Project Structure:
First and foremost, we need the dependency required for Apache Tika, which must be specified in pom.xml:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>
The full dependency setup for our project is in
pom.xml
XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>apache-tika</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>apache-tika</name>
    <parent>
        <groupId>com.gfg</groupId>
        <artifactId>parent-modules</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>${tika.version}</version>
        </dependency>
    </dependencies>
    <properties>
        <tika.version>1.17</tika.version>
    </properties>
</project>
The heart of Apache Tika is the Parser API. While parsing documents, Tika mostly delegates to underlying libraries such as Apache POI (for Microsoft Office formats) or PDFBox (for PDFs). The central method looks like this:

void parse(
    InputStream inputStream,       // the input document to be parsed
    ContentHandler contentHandler, // handler that processes and exports the result in a particular form
    Metadata metadata,             // metadata properties, populated during parsing
    ParseContext parseContext      // for customizing the parsing process
) throws IOException, SAXException, TikaException
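As a minimal sketch of wiring these four arguments together (the file name sample.pdf is a placeholder; any format Tika supports works the same way):

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class MinimalParseSketch {
    public static void main(String[] args) throws Exception {
        // AutoDetectParser picks the right concrete parser (POI, PDFBox, ...)
        // based on the detected media type of the stream
        Parser parser = new AutoDetectParser();
        // -1 disables BodyContentHandler's default 100,000-character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream("sample.pdf")) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        System.out.println(handler.toString());           // extracted plain text
        System.out.println(metadata.get("Content-Type")); // detected media type
    }
}
```

Note that parse() returns void: the text accumulates in the ContentHandler and the properties accumulate in the Metadata object as side effects.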
Document type detection is done through an implementation class of the Detector interface, which exposes the following method:

MediaType detect(java.io.InputStream inputStream, Metadata metadata) throws IOException
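A small hedged sketch of calling the detector directly (the file name unknown.bin is a placeholder): detect() only inspects the first bytes of the stream, so the stream should support mark/reset; wrapping the source in a TikaInputStream takes care of that.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class DetectSketch {
    public static void main(String[] args) throws IOException {
        Detector detector = new DefaultDetector();
        // TikaInputStream supports the mark/reset that detection relies on,
        // so the stream can still be parsed afterwards
        try (TikaInputStream stream = TikaInputStream.get(Paths.get("unknown.bin"))) {
            MediaType type = detector.detect(stream, new Metadata());
            // Detection is content-based, so the media type is found
            // regardless of the file extension
            System.out.println(type);
        }
    }
}
```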
Tika can also detect the language of a document; language identification works on the text content itself, without relying on metadata information.
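For completeness, a sketch of language identification with the LanguageIdentifier class that ships in Tika 1.x (it is deprecated in favor of the tika-langdetect module in later releases, but is available in 1.17):

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageSketch {
    public static void main(String[] args) {
        // Identification is purely statistical, based on character n-grams
        // of the supplied text; no metadata is consulted
        LanguageIdentifier identifier =
            new LanguageIdentifier("Ceci est un petit texte d'exemple.");
        System.out.println(identifier.getLanguage());         // ISO 639 language code
        System.out.println(identifier.isReasonablyCertain()); // confidence flag
    }
}
```

Short texts can be misidentified, which is why the API exposes the isReasonablyCertain() flag alongside the language code. Now, via the sample project's Java files, let us cover these topics in code.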
SampleTikaAnalysis.java
This program demonstrates the following operations, each implemented both with the low-level classes (Detector/Parser) and with the Tika facade:
- Detecting document type
- Extracting the content using a parser and facade
- Extracting the metadata using a parser and facade
Java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class SampleTikaAnalysis {

    // Detecting the document type by using Detector
    public static String detectingTheDocTypeByUsingDetector(InputStream inputStream)
            throws IOException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
        MediaType mediaType = detector.detect(inputStream, metadata);
        return mediaType.toString();
    }

    // Detecting the document type by using the Tika facade
    public static String detectDocTypeUsingFacade(InputStream inputStream)
            throws IOException {
        Tika tika = new Tika();
        return tika.detect(inputStream);
    }

    // Extracting the content by using Parser
    public static String extractContentUsingParser(InputStream inputStream)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(inputStream, contentHandler, metadata, context);
        return contentHandler.toString();
    }

    // Extracting the content by using the Tika facade
    public static String extractContentUsingFacade(InputStream inputStream)
            throws IOException, TikaException {
        Tika tika = new Tika();
        return tika.parseToString(inputStream);
    }

    // Extracting the metadata by using Parser
    public static Metadata extractMetadatatUsingParser(InputStream inputStream)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(inputStream, contentHandler, metadata, context);
        return metadata;
    }

    // Extracting the metadata by using the Tika facade
    public static Metadata extractMetadatatUsingFacade(InputStream inputStream)
            throws IOException, TikaException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        tika.parse(inputStream, metadata);
        return metadata;
    }
}
Let us test the above concepts with three documents, namely exceldocument.xlsx, pdfdocument.txt, and worddocument.docx. They should be placed under the test/resources folder so that they can be loaded from the classpath as shown in the code. (Note that pdfdocument.txt is actually a PDF file with a .txt extension, which is exactly what the detector tests exploit: Tika identifies the type from the content, not the file name.) Let us test the contents now via
SampleTikaWayUnitTest.java
Java
import static org.hamcrest.CoreMatchers.containsString;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertThat;

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.junit.Test;
import org.xml.sax.SAXException;

public class SampleTikaWayUnitTest {

    @Test
    public void withDetectorFindingTheResultTypeAsDocumentType() throws IOException {
        InputStream inputStream = this.getClass().getClassLoader()
                .getResourceAsStream("pdfdocument.txt");
        String resultantMediaType = SampleTikaAnalysis.detectingTheDocTypeByUsingDetector(inputStream);
        assertEquals("application/pdf", resultantMediaType);
        inputStream.close();
    }

    @Test
    public void withFacadeFindingTheResultTypeAsDocumentType() throws IOException {
        InputStream inputStream = this.getClass().getClassLoader()
                .getResourceAsStream("pdfdocument.txt");
        String resultantMediaType = SampleTikaAnalysis.detectDocTypeUsingFacade(inputStream);
        assertEquals("application/pdf", resultantMediaType);
        inputStream.close();
    }

    @Test
    public void byUsingParserAndGettingContent() throws IOException, TikaException, SAXException {
        InputStream inputStream = this.getClass().getClassLoader()
                .getResourceAsStream("worddocument.docx");
        String documentContent = SampleTikaAnalysis.extractContentUsingParser(inputStream);
        assertThat(documentContent, containsString("OpenSource REST API URL"));
        assertThat(documentContent, containsString("Spring MVC"));
        inputStream.close();
    }

    @Test
    public void byUsingFacadeAndGettingContent() throws IOException, TikaException {
        InputStream inputStream = this.getClass().getClassLoader()
                .getResourceAsStream("worddocument.docx");
        String documentContent = SampleTikaAnalysis.extractContentUsingFacade(inputStream);
        assertThat(documentContent, containsString("OpenSource REST API URL"));
        assertThat(documentContent, containsString("Spring MVC"));
        inputStream.close();
    }

    @Test
    public void byUsingParserAndGettingMetadata() throws IOException, TikaException, SAXException {
        InputStream inputStream = this.getClass().getClassLoader()
                .getResourceAsStream("exceldocument.xlsx");
        Metadata retrieveMetadata = SampleTikaAnalysis.extractMetadatatUsingParser(inputStream);
        assertEquals("org.apache.tika.parser.DefaultParser", retrieveMetadata.get("X-Parsed-By"));
        assertEquals("Microsoft Office User", retrieveMetadata.get("Author"));
        inputStream.close();
    }

    @Test
    public void byUsingFacadeAndGettingMetadata() throws IOException, TikaException {
        InputStream inputStream = this.getClass().getClassLoader()
                .getResourceAsStream("exceldocument.xlsx");
        Metadata retrieveMetadata = SampleTikaAnalysis.extractMetadatatUsingFacade(inputStream);
        assertEquals("org.apache.tika.parser.DefaultParser", retrieveMetadata.get("X-Parsed-By"));
        assertEquals("Microsoft Office User", retrieveMetadata.get("Author"));
        inputStream.close();
    }
}
Output of the JUnit test cases:
- Test withDetectorFindingTheResultTypeAsDocumentType -> Detects the document type with the Detector class and asserts that the resultant media type is application/pdf.
- Test withFacadeFindingTheResultTypeAsDocumentType -> Detects the document type with the Tika facade and asserts that the resultant media type is application/pdf.
- Test byUsingParserAndGettingContent -> Parses the Word document from the resources folder and asserts that the extracted text contains the expected phrases.
- Test byUsingFacadeAndGettingContent -> Extracts the same Word document's content with the Tika facade and asserts the same phrases.
- Test byUsingParserAndGettingMetadata -> Parses the Excel document, retrieves its metadata, and asserts the X-Parsed-By and Author properties.
- Test byUsingFacadeAndGettingMetadata -> Retrieves the Excel document's metadata with the Tika facade and asserts the same properties.
Conclusion
Apache Tika is a versatile content-analysis library used across the software industry for many purposes: it detects document types, extracts text and metadata from a wide range of file formats, and identifies languages, all behind a single, simple API.