Apache Tika to the Rescue

This is the eighth post for the build phase in the first Build-Measure-Learn cycle. In the last post I set up the first acceptance test, which has failed because the implementation is still based on mocks. In this post, I will describe the implementation with Apache Tika to enable it to pass the test.

Context

The project is still in the build phase of the first Build-Measure-Learn cycle. There is a major user task called ‘Storing text files’ and a derived user story ‘Import a text or document file’. This story was broken down further into smaller stories. The story ‘Import a file with a single text paragraph’ results from the previous refinement step. At present, the acceptance test for this small story fails.

Apache Tika

Apache Tika is a toolkit for text and metadata extraction (see [ApacheTika]). It supports a wide range of document formats. XHTML is used to model the structured content of the document. To support different document formats, Apache Tika uses different parser libraries, which are unified under a single API. The key concept of Apache Tika is the org.apache.tika.parser.Parser interface with its central method #parse.

Central method #parse of interface org.apache.tika.parser.Parser

The method #parse takes an input stream of the document and emits a stream of XHTML SAX events instead of returning a value. This enables streamed processing. The events are sent to and handled by an org.xml.sax.ContentHandler. This event stream is then used to build an individual document model.

The Text Analyzer Platform (TAP) has an implementation of a DocumentParser that employs Apache Tika.

The #parse method of class DocumentParserTika employing Apache Tika

This blog post is not a complete introduction to using Apache Tika. If you are interested in using Apache Tika, you should consult other sources such as the official website [ApacheTika].

Collecting Parser Events

The different kinds of SAX events are predefined by the interface org.xml.sax.ContentHandler. The DocumentParser provides the TapContentHandler as an implementation of this interface. The ordered sequence of SAX events defines the content of the text file. An overview of the ContentHandler interface is given in the following table.

API of org.xml.sax.ContentHandler

Parsing the Word® file simple-text-passage.docx results in a series of SAX events. All events are collected by a ParseEventCollector. The following overview table shows the relevant SAX events emitted by Apache Tika.

Overview of relevant SAX events

I have deliberately omitted the SAX events for ´ignorable whitespace´ as well as the ´start element´ and ´end element´ events with the local name ´meta´. These events are not relevant for this case. For our example Word® file, simple-text-passage.docx, the following events are relevant:

  • The paragraph element defined by the events ´start element´ and ´end element´ with local name ´p´.
  • The ´characters´ events providing the text.

The paragraph element encloses an element ´a´ and two ´characters´ events. The link element ´a´ is not important here, particularly because the link refers to a go-back function, probably offered by Microsoft Word®[1]. Therefore, only the events for element ´p´ and the two ´characters´ events play a role when building a document model.
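The collecting step described in this section can be sketched in a few lines of Java. This is only an illustration, not the actual ParseEventCollector of the Text Analyzer Platform: the class name EventCollectorSketch and its methods are assumptions, and the JDK's built-in SAX parser stands in for Apache Tika so that the example runs without extra libraries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical collector: records SAX events as readable strings in arrival order.
class EventCollectorSketch extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The JDK parser is not namespace-aware by default, so qName carries the element name here.
        events.add("start element: " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end element: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver the text of one element in several chunks.
        events.add("characters: " + new String(ch, start, length));
    }

    List<String> events() {
        return events;
    }

    // Stand-in for Tika (assumption): feed a small XHTML-like string through the JDK SAX parser.
    static List<String> collect(String xml) throws Exception {
        EventCollectorSketch collector = new EventCollectorSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        return collector.events();
    }
}
```

Calling EventCollectorSketch.collect("<p>Hello Tika</p>") yields a ´start element´ entry for ´p´, one or more ´characters´ entries and an ´end element´ entry for ´p´, in that order.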

A Meaningful Document Model

The SAX events describe an XHTML document. The Document Object Model (DOM) provides a standardized model for HTML, XHTML and XML documents. The Text Analyzer Platform also needs a model representing documents in the system. Although the DOM is a standardized, fully-fledged document model, I have chosen a different concept. My intention is to develop a more meaningful document model, which is not technically oriented.

Document model for a simple paragraph

The above illustration shows a model that satisfies the current need to parse a document with a single paragraph. This simple model is sufficient to make the failing acceptance test pass. The last but essential step is to transform the SAX events into a document model.
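To give an impression of what such a non-technical model could look like in code, here is a minimal sketch. The class names SimpleDocument and SimpleParagraph and their accessors are assumptions made for illustration; they are not taken from the Text Analyzer Platform.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model element: one paragraph holding plain text.
final class SimpleParagraph {
    private final String text;

    SimpleParagraph(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }
}

// Hypothetical model root: an ordered, read-only list of paragraphs.
final class SimpleDocument {
    private final List<SimpleParagraph> paragraphs;

    SimpleDocument(List<SimpleParagraph> paragraphs) {
        // Defensive copy so later changes to the caller's list do not affect the document.
        this.paragraphs = Collections.unmodifiableList(new ArrayList<>(paragraphs));
    }

    List<SimpleParagraph> getParagraphs() {
        return paragraphs;
    }
}
```

The model deliberately knows nothing about elements, attributes or namespaces. It only speaks of documents and paragraphs, which is what distinguishes it from the DOM.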

Transforming the SAX Events into a Document Model

The document parser receives the event ´start element´ for local name ´p´, both ´characters´ events and finally the ´end element´ for local name ´p´. The parser maps all ´characters´ events to the paragraph element. This handling is implemented by the DocumentBuilder and its subordinate helper ParagraphBuilder.

Implementation of the DocumentBuilder

When the ´start element´ is received, a new builder for a paragraph is created. The paragraph builder then collects all the text given by the ´characters´ events. Finally, when the ParagraphBuilder receives the ´end element´, the construction of the paragraph is finished.

Implementation of the ParagraphBuilder

The current implementation of the Text Analyzer Platform only takes the paragraph element into consideration. In this iteration, other elements, like headers or tables, are ignored. They will be added in an upcoming Build-Measure-Learn cycle.
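The handling described in this section can be sketched as follows. This is a simplified illustration under several assumptions, not the platform's actual code: the class names DocumentBuilderSketch and ParagraphBuilderSketch are hypothetical, each finished paragraph is represented as a plain String to keep the example self-contained, and the JDK's SAX parser again stands in for Apache Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: accumulates the text of one paragraph.
class ParagraphBuilderSketch {
    private final StringBuilder text = new StringBuilder();

    void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String build() {
        return text.toString();
    }
}

// Hypothetical builder: reacts to SAX events and collects finished paragraphs.
class DocumentBuilderSketch extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private ParagraphBuilderSketch current; // non-null only while inside a <p> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            current = new ParagraphBuilderSketch(); // a new paragraph begins
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.characters(ch, start, length); // text outside <p> is ignored
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName) && current != null) {
            paragraphs.add(current.build()); // the paragraph is finished
            current = null;
        }
    }

    List<String> build() {
        return paragraphs;
    }

    // Stand-in for Tika (assumption): parse an XHTML-like string with the JDK SAX parser.
    static List<String> parse(String xhtml) throws Exception {
        DocumentBuilderSketch builder = new DocumentBuilderSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), builder);
        return builder.build();
    }
}
```

Because the builder only reacts to ´p´, events for other elements such as the enclosed ´a´ or a header pass through without effect, which mirrors the restriction to paragraphs in this iteration.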

What’s next?

In the next blog post, I will write about my first steps implementing a user interface to display a document. For this purpose, I will choose the Play! Framework.

Footnotes

[1] Microsoft Word is a registered trademark of Microsoft Corporation in the United States.

Resources

[ApacheTika] Apache Tika
<http://tika.apache.org> accessed November 20th, 2017

[DOMWiki] Wikipedia; ´Document Object Model´
<https://en.wikipedia.org/wiki/Document_Object_Model>
accessed December 25th, 2017

[SAX] SAX – About SAX
<http://www.saxproject.org> accessed December 17th, 2017
