Saturday, May 2, 2009

Text Engineering Frameworks

What is a text engineering framework?

The volume of unstructured text is going through the roof, and so are the efforts to analyze it and make sense of it. Many software tools for language analysis and data mining have been developed, attacking myriad language analysis problems. While each system concentrates on solving the problem at hand, there remains the unenviable task of gluing these language technologies together. All language technologies need to worry about common problems like representation of data and metadata, modularization of the software components, and interaction between them.

Each system takes its own approach to handling these problems, in addition to solving its central problem. This is where a text engineering framework steps in. It provides an architecture and out-of-the-box support for rapid development of highly modularized, scalable language technology components that can interface with other components - thus improving the process of creating language technology applications. The framework does all the plumbing necessary to create interesting language technology applications. A good analogy is that the framework is the OS platform on which applications are built.

Architecture of a Text Engineering Framework

While different systems may have their own architectures, the generic architecture described here is the one that forms the basis of the two most popular text engineering frameworks, GATE (General Architecture for Text Engineering) and UIMA (Unstructured Information Management Architecture). The two key services that the framework provides are data/metadata management services and analysis component development services.

Data Management Services

The most important problem facing NLP tools is the management of data, so the representation of data is given central importance in the framework. The basic unit of unstructured data to be analyzed is a Document. This corresponds to a single artifact to be analyzed, such as a single medical report or a news article. The unstructured data need not be restricted to text; it could be audio, video or other multimedia data. The focus of this article is text, but most of the concepts elaborated here apply to other media too. In NLP applications, it is common to process large collections of documents for analysis. The framework represents a collection of Documents by a Corpus abstraction.
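
To make the two abstractions concrete, here is a minimal Python sketch. The class names mirror the terms used above, but the fields and methods are assumptions made for illustration; they are not the GATE or UIMA APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Document:
    """A single artifact to be analyzed, e.g. one news article."""
    name: str
    text: str
    # Hypothetical slot for document-level metadata (source, language, ...).
    features: Dict[str, object] = field(default_factory=dict)


@dataclass
class Corpus:
    """A named collection of Documents that are processed together."""
    name: str
    documents: List[Document] = field(default_factory=list)

    def add(self, doc: Document) -> None:
        self.documents.append(doc)

    def __iter__(self) -> Iterator[Document]:
        return iter(self.documents)

    def __len__(self) -> int:
        return len(self.documents)


corpus = Corpus("news")
corpus.add(Document("doc1", "Cats sleep in the afternoon."))
corpus.add(Document("doc2", "In a perfect world, all the people would be like cats."))

# A tool can treat the corpus as a plain iterable of documents.
total_chars = sum(len(doc.text) for doc in corpus)
```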

Each NLP tool generates metadata for the Document. For instance, a tokenizer would generate tokens, a POS tagger would generate Part-Of-Speech tags for each token, a noun phrase chunker would identify noun phrase chunks and a named entity recognizer would generate labels for chunks of text. There needs to be a consistent method to represent all this metadata. This is achieved by using an Annotation object, which represents metadata associated with a contiguous chunk of text. To illustrate the idea, consider the following sentence:
"In a perfect world, all the people would be like cats are, at two o'clock in the afternoon."

The tokenizer would identify tokens, with each token, like "perfect", represented by an Annotation whose type is "Token". Each annotation has a start and end offset associated with it, which identifies its position in the Document. Information about the annotation can be stored in key-value pairs called Features. This allows arbitrarily complex data to be associated with the annotation. For instance, the Token annotation could have a "string" feature to represent the text of the token, a "kind" feature to indicate if the token is a word, number, or punctuation, and a "root" feature which contains its morphological root.
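
As a rough illustration, the following Python sketch represents Token annotations with offsets and features for the example sentence. The Annotation class and the regular-expression tokenizer are assumptions made for this example, not any framework's implementation, and the "root" feature is left out because it would need a morphological analyzer.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    type: str      # e.g. "Token"
    start: int     # offset of the first character
    end: int       # offset one past the last character
    features: Dict[str, object] = field(default_factory=dict)


def tokenize(text: str) -> List[Annotation]:
    """Toy tokenizer: words (with an optional apostrophe part) or single punctuation marks."""
    tokens = []
    for m in re.finditer(r"\w+(?:'\w+)?|[^\w\s]", text):
        s = m.group()
        if s.isdigit():
            kind = "number"
        elif s[0].isalnum():
            kind = "word"
        else:
            kind = "punctuation"
        tokens.append(Annotation("Token", m.start(), m.end(), {"string": s, "kind": kind}))
    return tokens


text = ("In a perfect world, all the people would be like cats are, "
        "at two o'clock in the afternoon.")
tokens = tokenize(text)

# The offsets point back into the document text.
perfect = next(t for t in tokens if t.features["string"] == "perfect")
covered = text[perfect.start:perfect.end]
```

Because the annotation stores offsets rather than a copy of the text, the document itself is never modified; the "string" feature is only a convenience.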

The scheme of representing metadata described above allows different kinds of metadata from different NLP components to be accessed and manipulated using the same interface. Positional information about the metadata can be captured, and arbitrarily complex data can be associated with it, since the feature values could be complex objects themselves. Annotations can be added at various levels of detail to the same chunk of text. For instance, the phrase "a perfect world" can have "Token" annotations for each token, "POS" annotations to represent part-of-speech information for each token, and an "NP" annotation over the entire phrase to represent a noun phrase chunk.
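
The layering over "a perfect world" can be sketched in Python as follows. The Annotation class, the helper functions and the hand-assigned Penn Treebank style tags are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    type: str
    start: int
    end: int  # exclusive
    features: Dict[str, object] = field(default_factory=dict)


text = "In a perfect world, all the people would be like cats are."
phrase = "a perfect world"
base = text.index(phrase)

annotations: List[Annotation] = []
offset = base
# Tags are assigned by hand here; a real POS tagger would produce them.
for word, tag in zip(phrase.split(), ["DT", "JJ", "NN"]):
    start = text.index(word, offset)
    end = start + len(word)
    annotations.append(Annotation("Token", start, end, {"string": word}))
    annotations.append(Annotation("POS", start, end, {"tag": tag}))
    offset = end
# One more layer over the whole phrase.
annotations.append(Annotation("NP", base, base + len(phrase)))


def of_type(anns: List[Annotation], type_: str) -> List[Annotation]:
    return [a for a in anns if a.type == type_]


def within(anns: List[Annotation], start: int, end: int) -> List[Annotation]:
    return [a for a in anns if a.start >= start and a.end <= end]


np_chunk = of_type(annotations, "NP")[0]
inside = within(annotations, np_chunk.start, np_chunk.end)
pos_tags = [a.features["tag"] for a in of_type(inside, "POS")]
```

All three layers are queried through the same two helpers, which is the point of the uniform representation.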

It should now be obvious that the annotations constitute a data exchange format between the various NLP components, which lets them build progressively more complex analyses of the text. An entire declarative type system can be built using these annotations for an application, as is done in UIMA. It is possible to do pattern matching over these annotations, as provided by the JAPE language in GATE. The frameworks provide implementations of these abstractions, thus freeing applications from the data management chores.
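
To give a feel for pattern matching over annotations, here is a small Python sketch. It is not JAPE and does not use JAPE syntax; it only shows the idea of matching a pattern (optional determiner, any adjectives, one or more nouns) over a sequence of annotations and emitting a new annotation for each match. For brevity the POS tag is stored as a hypothetical "pos" feature on each Token, and the tags are assigned by hand.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    type: str
    start: int
    end: int  # exclusive
    features: Dict[str, object] = field(default_factory=dict)


def chunk_noun_phrases(tokens: List[Annotation]) -> List[Annotation]:
    """Create an NP annotation for every token run matching DT? JJ* NN+."""
    # Encode each token's tag as one letter so a regular expression
    # can scan the annotation sequence; unknown tags become "x".
    code = {"DT": "D", "JJ": "J", "NN": "N"}
    tag_string = "".join(code.get(t.features.get("pos"), "x") for t in tokens)
    chunks = []
    for m in re.finditer(r"D?J*N+", tag_string):
        first, last = tokens[m.start()], tokens[m.end() - 1]
        chunks.append(Annotation("NP", first.start, last.end, {"rule": "DetAdjNoun"}))
    return chunks


text = "In a perfect world, all the people"
tagged = [("In", "IN"), ("a", "DT"), ("perfect", "JJ"), ("world", "NN"),
          (",", ","), ("all", "PDT"), ("the", "DT"), ("people", "NN")]

tokens: List[Annotation] = []
offset = 0
for word, tag in tagged:
    start = text.index(word, offset)
    offset = start + len(word)
    tokens.append(Annotation("Token", start, offset, {"string": word, "pos": tag}))

noun_phrases = [text[c.start:c.end] for c in chunk_noun_phrases(tokens)]
```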

The architecture described above evolved during the TIPSTER conferences. One of the popular ways of serializing this data is XML stand-off markup, which separates the annotation metadata from the data.
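
A minimal Python sketch of the stand-off idea follows. The element and attribute names are made up for this example and do not correspond to the GATE or UIMA serialization formats; the point is only that the text is stored untouched and the annotations refer to it by offset.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# (type, start, end, features) tuples keep the example short.
AnnotationTuple = Tuple[str, int, int, Dict[str, str]]


def to_standoff_xml(text: str, annotations: List[AnnotationTuple]) -> str:
    """Serialize text and annotations side by side (hypothetical schema)."""
    root = ET.Element("document")
    ET.SubElement(root, "text").text = text
    anns_el = ET.SubElement(root, "annotations")
    for i, (type_, start, end, features) in enumerate(annotations):
        attrib = {"id": str(i), "type": type_, "start": str(start), "end": str(end)}
        ann_el = ET.SubElement(anns_el, "annotation", attrib)
        for name, value in features.items():
            ET.SubElement(ann_el, "feature", {"name": name}).text = str(value)
    return ET.tostring(root, encoding="unicode")


text = "In a perfect world"
xml_string = to_standoff_xml(text, [
    ("Token", 5, 12, {"string": "perfect", "kind": "word"}),
    ("NP", 3, 18, {}),
])

# Reading it back: the text is intact and the offsets still resolve.
parsed = ET.fromstring(xml_string)
stored_text = parsed.find("text").text
first = parsed.find("annotations/annotation")
first_span = stored_text[int(first.get("start")):int(first.get("end"))]
```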

Text Analysis Development Services

NLP applications generally consist of a number of steps, each doing some part of the analysis and building upon the analysis done in the previous stage. To support this application development paradigm, the framework represents each NLP task by a processing resource (PR). The PR is a component which performs a single task like tokenizing, POS tagging, or something even simpler like mapping one set of annotations to another (for adaptation purposes). The data interface to the PR is specified by the kind of input annotations that it requires and the annotations it generates. For instance, the POS tagger requires "Token" annotations as input and generates "POS" annotations as output. The PR's role can be more accurately characterized as that of an annotator. Each PR is a reusable software component that can be used in creating NLP applications. The same POS tagger can be used in different applications as long as its input and output requirements are satisfied. A number of PRs can be strung together to create a pipeline. An NP-chunking pipeline, for example, would chain a tokenizer, a POS tagger and a noun phrase chunker.
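
The PR and pipeline ideas can be sketched in Python as below. Everything here is an assumption made for illustration: the class names, the `requires`/`produces` attributes, the tiny lookup lexicon and the crude chunking rule are not taken from GATE or UIMA.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    type: str
    start: int
    end: int  # exclusive
    features: Dict[str, object] = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def of_type(self, type_: str) -> List[Annotation]:
        return sorted((a for a in self.annotations if a.type == type_),
                      key=lambda a: a.start)


class ProcessingResource:
    requires: tuple = ()   # annotation types read
    produces: tuple = ()   # annotation types written

    def execute(self, doc: Document) -> None:
        raise NotImplementedError


class Tokenizer(ProcessingResource):
    produces = ("Token",)

    def execute(self, doc):
        for m in re.finditer(r"\w+|[^\w\s]", doc.text):
            doc.annotations.append(
                Annotation("Token", m.start(), m.end(), {"string": m.group()}))


class LookupPosTagger(ProcessingResource):
    requires = ("Token",)
    produces = ("POS",)
    # Toy lexicon; anything not listed is tagged "X".
    LEXICON = {"a": "DT", "the": "DT", "perfect": "JJ", "world": "NN", "people": "NN"}

    def execute(self, doc):
        for tok in doc.of_type("Token"):
            tag = self.LEXICON.get(tok.features["string"].lower(), "X")
            doc.annotations.append(Annotation("POS", tok.start, tok.end, {"tag": tag}))


class NpChunker(ProcessingResource):
    requires = ("POS",)
    produces = ("NP",)

    def execute(self, doc):
        run = []
        for pos in doc.of_type("POS") + [None]:  # None flushes the last run
            if pos is not None and pos.features["tag"] in ("DT", "JJ", "NN"):
                run.append(pos)
                continue
            if run and run[-1].features["tag"] == "NN":
                doc.annotations.append(Annotation("NP", run[0].start, run[-1].end))
            run = []


class SequentialPipeline:
    def __init__(self, prs):
        available = set()
        for pr in prs:  # check each PR's inputs are produced upstream
            missing = set(pr.requires) - available
            if missing:
                raise ValueError(f"{type(pr).__name__} needs {sorted(missing)}")
            available.update(pr.produces)
        self.prs = list(prs)

    def run(self, doc: Document) -> Document:
        for pr in self.prs:
            pr.execute(doc)
        return doc


pipeline = SequentialPipeline([Tokenizer(), LookupPosTagger(), NpChunker()])
doc = pipeline.run(Document("In a perfect world, all the people would be like cats."))
noun_phrases = [doc.text[a.start:a.end] for a in doc.of_type("NP")]
```

The PRs never call each other; they communicate only through annotations on the Document, which is what allows the pipeline to check the data interface up front.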

This pipeline is a sequential pipeline, but you can just as well imagine conditional, looped and other pipeline configurations. The scheme described above constitutes a modular, loosely coupled architecture for a text engineering application. Each PR in the pipeline may be replaced by an equivalent PR as long as it satisfies the data interface requirements, allowing you to test different configurations. The framework defines the common interfaces for PRs, provides different pipeline implementations and allows for declarative specification of PRs and pipelines. In a nutshell, the framework provides all the plumbing required to build an NLP application, so the developer can focus on developing the smart innovations.
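
The following Python sketch illustrates swapping one PR for an equivalent one through a declarative specification. The registry, the spec dictionaries and the function names are hypothetical; the dictionary merely stands in for whatever descriptor format a real framework uses.

```python
import re
from typing import Callable, Dict


def whitespace_tokenizer(doc: dict) -> None:
    doc["Token"] = [(m.start(), m.end()) for m in re.finditer(r"\S+", doc["text"])]


def regex_tokenizer(doc: dict) -> None:
    # Splits punctuation off as separate tokens.
    doc["Token"] = [(m.start(), m.end())
                    for m in re.finditer(r"\w+|[^\w\s]", doc["text"])]


def token_counter(doc: dict) -> None:
    doc["TokenCount"] = len(doc["Token"])


# Name -> component lookup, so pipelines can be described as data.
REGISTRY: Dict[str, Callable[[dict], None]] = {
    "whitespace_tokenizer": whitespace_tokenizer,
    "regex_tokenizer": regex_tokenizer,
    "token_counter": token_counter,
}


def build_pipeline(spec: dict) -> Callable[[str], dict]:
    steps = [REGISTRY[name] for name in spec["steps"]]

    def run(text: str) -> dict:
        doc = {"text": text}
        for step in steps:
            step(doc)
        return doc

    return run


# Two configurations that differ only in which tokenizer is plugged in.
spec_a = {"steps": ["whitespace_tokenizer", "token_counter"]}
spec_b = {"steps": ["regex_tokenizer", "token_counter"]}

text = "In a perfect world, all the people"
count_a = build_pipeline(spec_a)(text)["TokenCount"]
count_b = build_pipeline(spec_b)(text)["TokenCount"]
```

Both tokenizers write "Token" spans in the same shape, so the downstream counter works unchanged with either; only the specification differs.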

Other facilities provided by the framework

To make application development easier, a framework typically offers the following:
1. The framework provides visual tools for managing language resources, creating pipelines, running applications, observing and editing annotations, and creating training sets.
2. The framework may ship with off-the-shelf components for common NLP tasks like tokenization, sentence identification, dictionary lookups, POS tagging, machine learning interfaces, etc. This allows rapid prototyping of applications using these ready-to-use components. GATE, for example, ships with the ANNIE toolkit.
3. The framework developers maintain a component repository, which allows the developer community to share the reusable PRs that are developed and make use of the work done by others.

In summary, if you are developing NLP applications, you should use a text engineering framework to make use of the wealth of components that have already been developed, increase productivity, and build NLP applications that are modular and loosely coupled.