Tuesday, August 11, 2009

Book Review: The Numerati

With the advent of the Web and the falling price of electronics, we have seen an explosion of digital data, from huge databases collecting various pieces of information to ever larger collections of documents. The Numerati (a portmanteau of 'Number' and 'Illuminati') are the statisticians, mathematicians, computer scientists, linguists and others involved in making sense of this data using sophisticated statistical techniques. The book describes the kinds of problems being solved in the following areas, citing examples from organizations such as IBM, Intel and Umbria:
  • Workers - building employee profiles and understanding employee networks, and using them for optimal allocation of resources
  • Shoppers - microtargeting shoppers using personal information to customize service, give recommendations and increase sales
  • Voters - understanding voter intent and issues, so that campaign messages can be targeted to focused groups
  • Bloggers - understanding public opinion from the blogosphere, useful for gauging sentiment about products and more
  • Medicine - Baker focuses on futuristic health monitoring (like floor tiles which capture your walking patterns!), whereas he totally ignores contemporary challenges and work in analyzing medical records, genomic and proteomic data
  • Terrorism
  • Match Making
All this comes at a cost. The Numerati have access to vast amounts of personal data; no Orwellian Big Brother is needed for it to be used to learn about us, turn us into commodities and control our lives.

That's about it. The book makes for a brisk read, and you can give it a miss if you think you are already familiar with the topics above.

Book Review: The Lady Tasting Tea

A lady claims that the taste of tea differs depending on whether the milk is poured into the tea or the tea into the milk. Everyone at the small party scoffs at the suggestion, except Ronald Aylmer Fisher. Fisher designs an experiment that would statistically establish the lady's claim. He creates a sample set containing cups of tea prepared both ways, and lo and behold - the story goes that the lady identifies each cup correctly. Fisher uses this example to explain the design of experiments in his book 'The Design of Experiments'. This anecdote sets up the book. 'The Lady Tasting Tea' is the story of the development of statistics, with Fisher having built many of the pillars of the subject as it stands today.

I started reading this book while looking to brush up my statistics; I thought it would be a good idea to know the history of the subject I am exploring. That is particularly relevant in sciences filled with uncertainty, like statistics, economics and linguistics, where the personalities of individuals seem to shape the development of the theory, and there is a story behind things which seem arbitrary.

David Salsburg takes us through an entertaining journey, starting with the earliest breakthroughs of Karl Pearson and William Gosset and moving on to the pioneering foundational work of the acerbic genius Ronald Fisher, the cheerful Jerzy Neyman, and the multitalented Andrei Kolmogorov. Apart from these pioneers, Salsburg vividly sketches the lives and contributions of Egon Pearson (hypothesis testing), Chester Bliss (probit analysis), John Tukey (exploratory data analysis), Frank Wilcoxon (non-parametric methods), EJG Pitman (non-parametric methods), Prasanta Chandra Mahalanobis (sampling theory), Samuel Wilks (founder of the Statistical Research Group, Princeton), George Box (robust statistics) and W. Edwards Deming (statistical quality control).

Some of the chapter names are interesting, and they are as good as the title of the book; they remind me of the memorable illustrative sketches in 'The Mythical Man-Month'. Sample these:

  • The Mozart of Mathematics - Andrei Kolmogorov
  • The Picasso of Statistics - John Tukey
  • The March of the Martingales - on the work of Paul Levy

Read this if you are a fan of scientific history.

Saturday, May 2, 2009

Text Engineering Frameworks

What is a text engineering framework?

As the volume of unstructured text has gone through the roof, and with it the need to make sense of it, so have the efforts to analyze it. Different software tools for language analysis and data mining, attacking myriad language analysis problems, have been developed. While each system concentrates on solving the problem at hand, there remains the unenviable task of gluing these language technologies together. All language technologies need to worry about common problems like the representation of data and metadata, the modularization of software components, and the interaction between them.

Each system takes its own approach to handling these problems, in addition to solving its central problem. This is where a text engineering framework steps in. A text engineering framework provides an architecture and out-of-the-box support for the rapid development of highly modularized, scalable language technology components which can interface with other components, thus improving the process of creating language technology applications. The framework does all the plumbing necessary to create interesting language technology applications. A good analogy is that the framework is the OS platform on which applications are built.

Architecture of a Text Engineering Framework

While different systems may have their own architectures, the generic architecture described here is the one that forms the basis of the two most popular text engineering frameworks, GATE (General Architecture for Text Engineering) and UIMA (Unstructured Information Management Architecture). The two key services that the framework provides are data/metadata management services and analysis component development services.

Data Management Services

The most important problem facing NLP tools is the management of data, hence the representation of data is given central importance in the framework. The basic unit of unstructured data to be analyzed is a Document. This corresponds to a single artifact to be analyzed, like a single medical report, a news article, etc. The unstructured data need not be restricted to text; it could be audio, video or other multimedia data. The focus of this article is text, but most of the concepts elaborated here apply to other media too. In NLP applications, it is common to process large collections of documents for analysis; the framework represents a collection of Documents with a Corpus abstraction.

Each NLP tool generates metadata for the Document. For instance, a tokenizer would generate tokens, a POS tagger would generate part-of-speech tags for each token, a noun phrase chunker would identify noun phrase chunks and a named entity recognizer would generate labels for chunks of text. There needs to be a consistent way to represent all this metadata. This is achieved by using an Annotation object, which represents metadata associated with a contiguous chunk of text. To illustrate the idea, consider the following sentence:
"In a perfect world, all the people would be like cats are, at two o'clock in the afternoon."


The tokenizer would identify tokens, with each token like "perfect" represented by an Annotation whose type is "Token". Each annotation has a start and end offset associated with it, which identify its position in the Document. Information about the annotation can be stored in key-value pairs called Features. This allows arbitrarily complex data to be associated with the annotation. For instance, the Token annotation could have a "string" feature to represent the text of the token, a "kind" feature to indicate whether the token is a word, number or punctuation, and a "root" feature which contains its morphological root.

The scheme of representing metadata described above allows different kinds of metadata from different NLP components to be accessed and manipulated using the same interface. Positional information about the metadata is captured, and arbitrarily complex data can be associated with it, since the feature values can be complex objects themselves. Annotations can be added at various levels of detail over the same chunk of text. For instance, the phrase "a perfect world" can have "Token" annotations for each token, "POS" annotations to represent part-of-speech information for each token, and an "NP" annotation over the entire phrase to represent a noun phrase chunk.
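To make this concrete, here is a minimal sketch of such an annotation abstraction. The class and method names are illustrative assumptions, not the actual GATE or UIMA API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal, illustrative annotation model: a type, character offsets into the
// document text, and an open-ended feature map. Hypothetical names, not the
// real GATE/UIMA classes.
public class Annotation {
    private final String type;   // e.g. "Token", "POS", "NP"
    private final int start;     // start character offset in the document
    private final int end;       // end character offset (exclusive)
    private final Map<String, Object> features = new HashMap<>();

    public Annotation(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }

    public void setFeature(String key, Object value) { features.put(key, value); }
    public Object getFeature(String key) { return features.get(key); }
    public String getType() { return type; }
    public int getStart() { return start; }
    public int getEnd() { return end; }

    public static void main(String[] args) {
        String text = "In a perfect world, all the people would be like cats are, "
                + "at two o'clock in the afternoon.";
        // The token "perfect" spans the half-open offset range [5, 12).
        Annotation token = new Annotation("Token", 5, 12);
        token.setFeature("string", text.substring(5, 12));
        token.setFeature("kind", "word");
        token.setFeature("root", "perfect");
        System.out.println(token.getType() + " [" + token.getStart() + ", "
                + token.getEnd() + "): " + token.getFeature("string"));
    }
}
```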

It should now be obvious that annotations constitute a data exchange format between the various NLP components, used to build up more complex analyses of the text. An entire declarative type system can be built over these annotations for an application, as is done in UIMA. It is also possible to do pattern matching over annotations, as provided by the JAPE language in GATE. The frameworks provide implementations of these abstractions, thus freeing applications from data management chores.

The architecture described above evolved during the TIPSTER conferences. One of the popular ways of serializing this data is XML stand-off markup, which separates the annotation metadata from the data.

Text Analysis Development Services

NLP applications generally consist of a number of steps, each doing some part of the analysis and building upon the analysis done in the previous stage. To support this application development paradigm, the framework represents each NLP task by a processing resource (PR). A PR is a component which performs a single task like tokenizing or POS tagging, or something even simpler like mapping one set of annotations to another (for adaptation purposes). The data interface of a PR is specified by the kind of input annotations that it requires and the annotations it generates. For instance, the POS tagger requires "Token" annotations as input and generates "POS" annotations as output. The PR's role can be more accurately characterized as an annotator. Each PR is a reusable software component that can be used in creating NLP applications; the same POS tagger can be used in different applications as long as its input and output requirements are satisfied. A number of PRs can be strung together to create a pipeline. A sketch of an NP-chunking pipeline is shown below.
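The sketch below uses hypothetical interfaces (not the actual GATE or UIMA API) and reuses the Annotation class from the previous section: each PR reads the annotations it needs and adds its own, and a sequential pipeline simply runs its PRs in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical PR and pipeline abstractions; illustrative only, not the real
// GATE/UIMA interfaces. A Document holds the text and its annotations.
class Document {
    final String text;
    final List<Annotation> annotations = new ArrayList<>();  // Annotation as sketched earlier
    Document(String text) { this.text = text; }
}

interface ProcessingResource {
    void process(Document doc);   // reads the annotations it needs, adds its own
}

// A concrete PR: adds "Token" annotations over whitespace-separated chunks.
class Tokenizer implements ProcessingResource {
    public void process(Document doc) {
        Matcher m = Pattern.compile("\\S+").matcher(doc.text);
        while (m.find()) {
            Annotation token = new Annotation("Token", m.start(), m.end());
            token.setFeature("string", m.group());
            doc.annotations.add(token);
        }
    }
}

// Stub PRs: their data interface is what matters for the architecture.
class PosTagger implements ProcessingResource {
    public void process(Document doc) { /* reads "Token" annotations, adds "POS" annotations */ }
}

class NpChunker implements ProcessingResource {
    public void process(Document doc) { /* reads "Token"/"POS" annotations, adds "NP" annotations */ }
}

// A sequential pipeline is itself a PR that runs each stage in order.
class SequentialPipeline implements ProcessingResource {
    private final List<ProcessingResource> stages;
    SequentialPipeline(ProcessingResource... stages) { this.stages = Arrays.asList(stages); }
    public void process(Document doc) {
        for (ProcessingResource stage : stages) {
            stage.process(doc);
        }
    }
}

class NpChunkingDemo {
    public static void main(String[] args) {
        Document doc = new Document("In a perfect world, all the people would be like cats are.");
        ProcessingResource pipeline =
                new SequentialPipeline(new Tokenizer(), new PosTagger(), new NpChunker());
        pipeline.process(doc);
        System.out.println(doc.annotations.size() + " annotations added");
    }
}
```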


This is a sequential pipeline, but you can just as well imagine conditional, looped and other pipeline configurations. The scheme described above constitutes a modular, loosely coupled architecture for a text engineering application. Each PR in the pipeline may be replaced by an equivalent PR, as long as it satisfies the data interface requirements, allowing you to test different configurations. The framework defines the common interfaces for PRs, provides different pipeline implementations and allows for declarative specification of PRs and pipelines. In a nutshell, the framework provides all the plumbing required to build an NLP application, while the developer focuses on the smart innovations.

Other facilities provided by the framework

To make application development easier:
1. The framework provides visual tools for managing language resources, creating pipelines, running applications, viewing and editing annotations, and creating training sets.
2. The framework may ship with off-the-shelf components for common NLP tasks like tokenization, sentence identification, dictionary lookup, POS tagging, machine learning interfaces, etc. This allows rapid prototyping of applications using these ready-to-use components. GATE, for example, ships with the ANNIE toolkit.
3. The framework developers maintain a component repository, which allows the developer community to share the reusable PRs they develop and to make use of the work done by others.

In summary, if you are developing NLP applications, you should use a text engineering framework to make use of the wealth of components that have already been developed, increase your productivity, and build applications which are modular and loosely coupled.

Sunday, April 26, 2009

De-Identification of Personal Health Information

I recently started some work on de-identification of personal health information, and thought of putting together this primer on de-identification.

Medical researchers often need access to patients' medical records for their investigations. However, these records may contain information that compromises the identity of the individual and thus violates their right to privacy. It is therefore required that personal health information (PHI) be removed from medical records when they are released to the larger research community. The HIPAA regulation lays down the rules for the handling of PHI.

Under HIPAA, PHI must be removed from medical records before releasing them to the research community. Thus any information that may reveal the identity of the patient, such as their name, address, doctor's name, social security number, telephone numbers, etc., must be removed. This process of removing PHI from medical records is termed de-identification.

There are 18 PHI identifiers that must be de-identified to meet HIPAA regulations. These include names, addresses, etc. (Entire list here). Identifying these elements poses an interesting text mining problem. Identifying names may seem to be a Named Entity Recognition task, but there are additional complexities involved: a device or a disease named after a person is not PHI, and it would be a loss of valuable information to the researcher if it were removed. Addresses are a challenge to de-identify well enough to prevent re-identification. There is a wide range of identifiers that must be recognized - SSNs, MRNs, admission numbers, accession numbers, telephone/fax numbers, room numbers, etc. - out of the many numbers that a report may contain. What makes the task challenging is that a very high recall must be achieved to ensure compliance, while at the same time making sure that there aren't too many false positives, which would remove valuable, non-PHI information.
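As a toy illustration of the rule-based side of the problem, the sketch below masks a few easily patterned identifiers with surrogate tags. The patterns and tag names are illustrative assumptions; a real de-identification system also needs dictionaries, context rules and statistical NER for names, addresses, MRNs and the rest.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy rule-based de-identifier: masks a few easily patterned PHI identifiers
// (SSNs, phone numbers, dates) with surrogate tags. Only a sketch, not a
// HIPAA-compliant implementation.
public class SimpleDeidentifier {
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("[**SSN**]", Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b"));
        PATTERNS.put("[**PHONE**]", Pattern.compile("\\(?\\d{3}\\)?[-. ]\\d{3}[-. ]\\d{4}\\b"));
        PATTERNS.put("[**DATE**]", Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{2,4}\\b"));
    }

    // Replace each matched identifier with its surrogate tag.
    public static String deidentify(String report) {
        String result = report;
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            result = e.getValue().matcher(result)
                    .replaceAll(Matcher.quoteReplacement(e.getKey()));
        }
        return result;
    }

    public static void main(String[] args) {
        String report = "Patient seen on 04/21/2009. Contact: (617) 555-0134. SSN 123-45-6789.";
        System.out.println(deidentify(report));
    }
}
```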

A number of rule-based as well as statistical systems have been developed to tackle the problem. You can find a good survey of the research work in this paper. Here are a few de-identification systems that are available:
For research purposes, a gold standard data set containing surrogate PHI data is available on the PhysioNet page.

Saturday, April 25, 2009

Yet Another Blog On Organizing Information

Data and information are everywhere. The digital age is generating so much information that it has fast outgrown our ability to comprehend it. 'Information overload', we call it. These are the questions posed to us:
  • How do I find information that I want?
  • What information is relevant to my need?
  • OK, this is more information than I can handle. I would like a summary of it.
  • In this huge infobase, is there some useful information that isn't obvious? Some patterns or trends that may be useful?
  • There are a lot of smart people generating content. How can the collective intelligence of these people augment my search for information?
These questions have had us hooked for a long time, and so have the solutions people have developed to tackle them: search engines to help you find information, business intelligence tools to find patterns in huge volumes of data, information extraction systems to summarize human-generated content, recommendation systems to bring you information relevant to your needs, and the study of social networks to harness the "collective intelligence" of the crowd.

The rabbit hole goes deeper. These solutions are built on the more fundamental sciences of statistics, pattern recognition, artificial intelligence and natural language understanding.

This is not the end, for the more fundamental questions we are posed with are about the nature of cognition, the understanding of language, the organization of knowledge and the active role of the human observer in the perception of information. I think this is the holy grail that we are all in pursuit of.

We are beginners in this exciting field. This is a place to share what we learn and what we do, and to benefit from the "collective intelligence" of all who visit this page.

While the challenges span many problems, there are some that we are currently working on. Dhaval currently works on optimizing ad networks and takes an active interest in search engines. I currently work on information extraction from text and medical informatics. So, for now, you may find a certain bias towards these and related topics on this blog.