On Organizing Information: About making sense of unstructured information, by Anoop Kunchukuttan

No Roman Hindi please (2015-01-12)
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: inherit;">Chetan Bhagat wrote an <a href="http://blogs.timesofindia.indiatimes.com/The-underage-optimist/scripting-change-bhasha-bachao-roman-hindi-apnao/?utm_source=TOInewHP_TILwidget&utm_campaign=TOInewHP&utm_medium=Widget_Stry" target="_blank">opinion piece </a>calling for the replacement of the Devanagari script with the Roman script, describing it as an essential step for saving the Hindi language. <br /><br />This is fundamentally a bad idea since: </span></div>
<ul style="text-align: justify;">
<li><span style="font-family: inherit;">The Roman script is clearly inferior to the Devanagari script. For instance, it is ambiguous in representing sounds: c can stand for either च (as in 'touch') or क (as in 'cut'). Why would you want to throw away a script designed on <a href="http://en.wikipedia.org/wiki/Devanagari#Consonants" target="_blank">scientific principles of sound organization</a> for one which is <a href="http://en.wikipedia.org/wiki/Ghoti" target="_blank">fairly arbitrary</a>?</span></li>
<li><span style="font-family: inherit;">While language-specific hardware keyboards never took off, in the era of touch keyboards designing language-specific layouts is no barrier at all, and all the smart tricks available for English keyboards (word completion, swipe, etc.) can easily be replicated for Devanagari. In fact, we can have <a href="https://play.google.com/store/apps/details?id=iit.android.swarachakra&hl=en" target="_blank">innovative designs</a> to make input easier. We can go further and build handwriting recognition systems. </span></li>
<li><span style="font-family: inherit;">Even if we have to type on Roman physical keyboards, there is no reason to adopt the Roman script for the language. <a href="http://www.google.com/inputtools/" target="_blank">Transliteration systems</a> have become quite good at handling the wide variety of ambiguous mappings from Roman to Devanagari.</span></li>
<li><span style="font-family: inherit;">If there is a need for a common national script, then Devanagari should be the natural choice, since it can represent all major Indian scripts and follows the same principles. In fact, languages in India which don't have much of a written history should use extensions of Devanagari. That will surely be a political hot potato, so perhaps we could revive the <a href="http://en.wikipedia.org/wiki/Brahmi_script" target="_blank">Brahmi script</a> with suitable extensions to accommodate all scripts in India, since they are but variants of the Brahmi script. </span></li>
</ul>
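To make the ambiguity point concrete, here is a toy sketch (the mapping table is hypothetical and far from complete) of why a naive Roman-to-Devanagari letter mapping cannot be deterministic:

```python
# Illustrative sketch (hypothetical mapping table): the Roman letter 'c'
# has no unique Devanagari counterpart, so a naive reverse mapping must
# emit multiple candidates, while a letter like 'k' maps one-to-one.
ROMAN_TO_DEVANAGARI = {
    "c": ["\u091a", "\u0915"],  # च (as in 'touch') or क (as in 'cut')
    "k": ["\u0915"],            # क, unambiguous
}

def candidates(roman):
    """Return every Devanagari consonant the Roman letter could stand for."""
    return ROMAN_TO_DEVANAGARI.get(roman, [])
```

Real transliteration systems resolve such ambiguity with context (e.g. statistical models over surrounding letters), which is exactly why they can recover Devanagari from Roman input reliably.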
<div style="text-align: justify;">
<span style="font-family: inherit;">What is needed is for free, open source input solutions to be developed for these core input methods, so that they are widely and easily available and can become building blocks for language technologies. <br /><br />Chetan Bhagat was recently <a href="http://blogs.timesofindia.indiatimes.com/The-underage-optimist/would-we-ban-autos-or-cycle-rickshaws-if-a-rape-occurred-in-one/" target="_blank">advocating non-jugaad solutions</a> like Uber for new-generation transportation. I wonder why he proposes such a jugaadu solution in this case. In fact, he is bent on destroying a 2000-year-old, well-engineered solution. <br /><br />Technical reasons apart, I don't see a reason for this unwarranted alarm, since the language seems to be thriving. While I don't follow Hindi literature, at least in popular culture (news, TV, the Internet, etc.) the availability of Hindi content has only increased. The Union Government and the state governments of Hindi-speaking states use Hindi for their official activities. In any case, how will a change of script help preserve the language? Bhagat does not put forth any reasons. I agree that the use of English as the language of power and intellectual discourse may put regional languages at risk in the future, but the solution is to enable people to access content and communicate in their native languages, as has been done in Europe. With the rapid development of language technologies in recent times, that is <a href="http://www.tdil-dc.in/" target="_blank">clearly possible</a>. Making people use foreign scripts will only create a sense of inferiority and cut them off from the vast literature written in the Devanagari script. Instead of rejuvenating the language, it may just hasten its death. </span></div>
</div>
Statistical Machine Translation: Resources for Indian languages (2014-11-08)
<div dir="ltr" style="text-align: left;" trbidi="on">
At the <a href="http://www.cfilt.iitb.ac.in/" target="_blank">Center For Indian Language Technology</a>, IIT Bombay, we have hosted Shata-Anuvaadak (100 Translators), a broad-coverage Statistical Machine Translation system for Indian languages. It currently supports translation among 11 languages:<br /><br />
<ul>
<li> Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi, Marathi, Konkani</li>
<li> Dravidian languages: Tamil, Telugu, Malayalam</li>
<li> English </li>
</ul>
<br />It
is a Phrase-Based MT system with pre-processing and post-processing
extensions. The pre-processing includes source-side reordering for
English to Indian language translation. The post-processing includes
transliteration between Indian languages for OOV words. The system can
be accessed at: <br /><br /> <a href="http://www.cfilt.iitb.ac.in/indic-translator/" target="_blank">http://www.cfilt.iitb.ac.in/<wbr></wbr>indic-translator </a><br /><br />For more details, see the following publication: <br /><br /><div style="margin-left: 40px;">
Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya. 2014. <i>Shata-Anuvadak: Tackling Multiway Translation of Indian Languages</i>. Language Resources and Evaluation Conference <b>(LREC 2014)</b>.</div>
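The pre-/post-processing pipeline described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the reordering, decoding and transliteration components are passed in as callables, and all names are hypothetical.

```python
# Hypothetical sketch of a phrase-based SMT pipeline with source-side
# reordering (pre-processing) and OOV transliteration (post-processing).
def translate(words, src, tgt, reorder, decode, transliterate, tgt_vocab):
    if src == "en":
        # pre-processing: reorder English words towards Indian word order
        words = reorder(words)
    # core step: phrase-based SMT decoding
    output = decode(words, src, tgt)
    # post-processing: OOV words pass through the decoder unchanged,
    # so transliterate them into the target script instead
    return [w if w in tgt_vocab else transliterate(w, src, tgt) for w in output]
```

Transliterating OOVs is a natural fallback between Indian languages because their scripts share the same Brahmi-derived sound inventory, so many untranslated words (names, borrowings) are still rendered readably.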
<br />We are also making available software and resources developed at the Center, both for this system and for ongoing research. These are available under an open source license for research use. They include: <br /><b><br />Software</b><br /><ul>
<li>Indian language NLP tools: common NLP tools for Indian languages that are useful for machine translation: Unicode normalizers, tokenizers, morphology analysers and a transliteration system. </li>
<li>Source-side reordering system</li>
<li>A simple experiment management system for Moses</li>
</ul>
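As an aside on why Unicode normalizers belong in this toolkit: Devanagari letters with a nukta can be encoded two different ways, and text from the web mixes both. A minimal illustration using only Python's standard library (not the Center's tool itself):

```python
import unicodedata

# Devanagari क़ can be stored either as one precomposed code point or as
# क plus a combining nukta. The two spellings render identically but
# compare unequal as raw strings.
precomposed = "\u0958"        # DEVANAGARI LETTER QA
decomposed = "\u0915\u093C"   # DEVANAGARI LETTER KA + SIGN NUKTA

assert precomposed != decomposed
# Normalizing both to one canonical form (NFD here) makes them
# comparable, which an NLP pipeline needs before tokenization.
assert unicodedata.normalize("NFD", precomposed) == \
       unicodedata.normalize("NFD", decomposed)
```

Without this step, a phrase table learned on one encoding silently fails to match test sentences in the other.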
<b>Resources</b><br /><ul>
<li>Translation models for phrase-based SMT systems for all language pairs in Shata-Anuvaadak</li>
<li>Language models for all languages in Shata-Anuvaadak</li>
<li>Transliteration models for some language pairs (Moses-based)</li>
</ul>
<br />You can access these resources at: <br /><br /> <a href="http://www.cfilt.iitb.ac.in/static/download.html" target="_blank">http://www.cfilt.iitb.ac.in/<wbr></wbr>static/download.html</a><br /><br /></div>
LREC 2014: Some papers/posters (2014-06-04)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
I attended the Language Resources and Evaluation Conference (LREC 2014) in Reykjavik, Iceland last week. Just sharing some interesting papers/posters I came across. </div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
LREC is a rich conference for getting exposure to the many tools and datasets available across the areas of NLP research. I personally found useful tools/datasets on machine translation, crowdsourcing and grammar correction.</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<div style="font-size: 13.333333969116211px;">
In addition, the conference also emphasizes multilinguality and hence there were a lot of papers/posters showcasing resource development in many languages. A lot of these resources are made available as open source software/open datasets. </div>
<div style="font-size: 13.333333969116211px;">
<br /></div>
<div style="font-size: 13.333333969116211px;">
Along with the oral sessions, there were many poster sessions. I found the poster sessions more interesting and interactive and it was possible to cover much more material browsing the posters. </div>
</div>
<br />
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br />
The following is a small set of papers/posters I found interesting - primarily in the area of SMT, grammar correction, crowdsourcing plus some really cool ideas. You may want to look through the proceedings for literature relevant to your work: </div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<strong style="line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;">Machine Translation</span></strong></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<strong style="line-height: 1.428571em;"><br clear="none" /></strong></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Aligning Parallel Texts with InterText: Pavel Vondřička</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
A tool for automatic alignment of parallel text with post-alignment manual correction. The server version can manage projects and teams. The automatic alignment is based on the 'hunalign' algorithm. Both the server and desktop versions are open source. We could explore this tool for our corpus alignment/cleaning activities.<br />
<br clear="none" />
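The length-based intuition behind aligners like hunalign can be illustrated with a toy scoring function (a deliberate simplification; real aligners add dictionary evidence and allow 1-2/2-1 merges via dynamic programming):

```python
# Toy version of the length-based idea used by sentence aligners:
# mutual translations tend to have proportional lengths, so candidate
# 1-1 sentence pairs can be scored by character-length ratio alone.
def length_score(src, tgt):
    """Return a score in (0, 1]; 1.0 means equal character lengths."""
    a, b = len(src), len(tgt)
    if max(a, b) == 0:
        return 1.0  # two empty segments trivially "match"
    return min(a, b) / max(a, b)
```

Pairs scoring near 1.0 are kept as alignment candidates; strongly mismatched lengths suggest an insertion, deletion or misalignment.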
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus: Raivis Skadiņš and Jörg Tiedemann and Roberts Rozis and Daiga Deksne</em><br />
Describes the construction of a new parallel corpus between various European languages from the EU Bookshop service available online. The paper describes the tools and techniques used, which is quite informative; the use of a language model for correctly extracting text from PDF is interesting. A comparison of <em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">hunalign</em>, the Microsoft Bilingual Aligner (MBA) and <em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Vanilla</em> shows that the MBA outperforms the rest.</div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">The AMARA Corpus: Building Parallel Language Resources for the Educational Domain: Ahmed Abdelali, Francisco Guzman, Hassan Sajjad and Stephan Vogel</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
The paper describes the construction of a parallel corpus in the educational domain using subtitles gathered from sources like Khan Academy, TED, Udacity, Coursera, etc. The translations were obtained via AMARA, a collaborative platform for subtitle translation that many of these projects use. The corpus also contains parallel text of Hindi with many foreign languages (a few thousand sentences each). This could be useful for studying translation between Indian and foreign languages using bridge languages. </div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Machine Translationness: Machine-likeness in Machine Translation Evaluation: Joaquim Moré and Salvador Climent</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Machine Translation for Subtitling: A Large-Scale Evaluation: Thierry Etchegoyhen, Lindsay Bywood, Mark Fishel, Panayota Georgakopoulou, Jie Jiang, Gerard Van Loenhout, Arantza Del Pozo, Mirjam Sepesy Maucec, Anja Turner and Martin Volk</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
Describes a large-scale evaluation of SMT for automatic subtitling. The evaluation uses human ratings, automatic metrics and measures of post-editing productivity improvement. On all counts, the machine-translated subtitles show good quality. </div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">On the Origin of Errors: a Fine-Grained Analysis of MT and PE Errors and their Relationship: Joke Daems, Lieve Macken and Sonia Vandepitte</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Translation Errors from English to Portuguese: an Annotated Corpus: Angela Costa, Tiago Luís and Luísa Coheur</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
A classification of errors according to a taxonomy, for English-to-Portuguese translation. Output from Moses and Google Translate has been annotated. </div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling: <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Loaiciga_Sharid" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Sharid Loaiciga</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Meyer_Thomas" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Thomas Meyer</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Popescu-Belis_Andrei" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Andrei Popescu-Belis</a></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><br clear="none" /></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">The taraXÜ Corpus of Human-Annotated Machine Translations - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Avramidis_Eleftherios" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Eleftherios Avramidis</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Burchardt_Aljoscha" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Aljoscha Burchardt</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Hunsicker_Sabine" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Sabine Hunsicker</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Popovic_Maja" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Maja Popović</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Tscherwinka_Cindy" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Cindy Tscherwinka</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Vilar_David" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">David Vilar</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Uszkoreit_Hans" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Hans Uszkoreit</a></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">CFT13: a Resource for Research into the Post-editing Process - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Carl_Michael" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Michael Carl</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Martinez_Garcia_Mercedes" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Mercedes Martínez García</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Mesa-Lao_Bartolome" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Bartolomé Mesa-Lao</a></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/835.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Bojar_Ondrej" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Ondrej Bojar</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Diatka_Vojtech" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Vojtěch Diatka</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Rychly_Pavel" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Pavel Rychlý</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Stranak_Pavel" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Pavel Stranak</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Suchomel_Vit" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Vit Suchomel</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Tamchyna_Ales" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Aleš Tamchyna</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Zeman_Daniel" shape="rect" style="border: 0px; color: #047ac6; 
line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Daniel Zeman</a></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/1115.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">A Corpus of Machine Translation Errors Extracted from Translation Students Exercises</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Wisniewski_Guillaume" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Guillaume Wisniewski</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Kubler_Natalie" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Natalie Kübler</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Yvon_Francois" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">François Yvon</a></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;">Innovations in Parallel Corpus Search Tools - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Volk_Martin" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Martin Volk</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Graen_Johannes" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Johannes Graën</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Callegaro_Elena" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Elena Callegaro</a></span></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;"><br clear="none" /></span></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/510.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Gilmanov_Timur" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Timur Gilmanov</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Scrivner_Olga" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Olga Scrivner</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Kubler_Sandra" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Sandra Kübler</a></span></div>
<div>
<br /></div>
</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<strong style="line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;">GRAMMAR CORRECTION</span></strong></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><br clear="none" /></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/606.html" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">The MERLIN corpus: Learner Language and the CEFR</a><span style="font-size: 10pt; line-height: 1.428571em;">: </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Boyd_Adriane" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Adriane Boyd</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Hana_Jirka" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Jirka Hana</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Nicolas_Lionel" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Lionel Nicolas</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Meurers_Detmar" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Detmar Meurers</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Wisniewski_Katrin" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Katrin Wisniewski</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Abel_Andrea" 
shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Andrea Abel</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Schone_Karin" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Karin Schöne</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Stindlova_Barbora" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Barbora Štindlová</a><span style="font-size: 10pt; line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Vettori_Chiara" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Chiara Vettori</a></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><br clear="none" /></em>The MERLIN corpus is a corpus under development for the study of second language learning of European languages; the languages under consideration are Czech, German and Italian. It is a learner corpus annotated with several kinds of information: </div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<ul style="border: 0px; line-height: 1.428571em; list-style-position: outside; margin: 0.2857em 0px 0.714285em 2em; padding: 0px;">
<li style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Metadata about the author and the test</li>
<li style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Test ratings according to the CEFR framework</li>
<li style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Error annotations</li>
<li style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Annotations to encourage second language acquisition research</li>
</ul>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
Data-oriented research in second language learning has mostly focused on English as the L2, but corpora for other languages are now being developed. </div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">KoKo: an L1 Learner Corpus for German: <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Abel_Andrea" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Andrea Abel</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Glaznieks_Aivars" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Aivars Glaznieks</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Nicolas_Lionel" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Lionel Nicolas</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Stemle_Egon" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Egon Stemle</a></em></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
A learner corpus of German as a first language (L1); most of the learners are native German speakers who have completed one year of secondary education. The corpus is under development. </div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Building a Reference Lexicon for Countability in English: <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Kiss_Tibor" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Tibor Kiss</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Pelletier_Francis_Jeffry" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Francis Jeffry Pelletier</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Stadtfeld_Tobias" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Tobias Stadtfeld</a></em></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;">The paper describes the construction of a resource that records the lexical preference class of a large number of English noun senses (≈14,000) with respect to the distinction between mass and count interpretations. The lexicon was constructed using a questionnaire-based approach.</span></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;"><br clear="none" /></span></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Large Scale Arabic Error Annotation: Guidelines and Framework: <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Zaghouani_Wajdi" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Wajdi Zaghouani</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Mohit_Behrang" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Behrang Mohit</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Habash_Nizar" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Nizar Habash</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Obeid_Ossama" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Ossama Obeid</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Tomeh_Nadi" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Nadi Tomeh</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Rozovskaya_Alla" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Alla Rozovskaya</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Farra_Noura" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Noura Farra</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Alkuhlani_Sarah" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Sarah Alkuhlani</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Oflazer_Kemal" shape="rect" 
style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Kemal Oflazer</a></em></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;">A learner corpus for Arabic as L2, with large-scale error annotation.</span></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;"><br clear="none" /></span></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">A Comparison of MT Errors and ESL Errors - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#B._Hashemi_Homa" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Homa B. Hashemi</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Hwa_Rebecca" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Rebecca Hwa</a></em></div>
</div>
</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<strong style="color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;">Crowdsourcing</span></strong></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/497.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Sabou_Marta" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Marta Sabou</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Bontcheva_Kalina" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Kalina Bontcheva</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Derczynski_Leon" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Leon Derczynski</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Scharl_Arno" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Arno Scharl</a></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><br clear="none" /></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/132.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Design and Development of an Online Computational Framework to Facilitate Language Comprehension Research on Indian Languages</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#sinha_manjira" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Manjira Sinha</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Dasgupta_Tirthankar" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Tirthankar Dasgupta</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Basu_Anupam" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Anupam Basu</a></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><br clear="none" /></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/319.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Collaboration in the Production of a Massively Multilingual Lexicon</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Benjamin_Martin" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Martin Benjamin</a></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><br clear="none" /></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;">Online Experiments with the Percy Software Framework - Experiences and some Early Results - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Draxler_Christoph" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Christoph Draxler</a></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><strong style="line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;"><br clear="none" /></span></strong></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;">sloWCrowd: a Crowdsourcing Tool for Lexicographic Tasks - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Sabou_Marta" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Marta Sabou</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Bontcheva_Kalina" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Kalina Bontcheva</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Derczynski_Leon" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Leon Derczynski</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Scharl_Arno" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Arno Scharl</a></span></em></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 19.983333587646484px; margin: 0px; padding: 0px;">
<strong style="line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;">Some interesting papers</span></strong></div>
<div style="border: 0px; margin: 0px; padding: 0px;">
<ul style="border: 0px; list-style-position: outside; margin: 0.2857em 0px 0.714285em 2em; padding: 0px;">
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;">A Database for Measuring Linguistic Information Content - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Sproat_Richard" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Richard Sproat</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Cartoni_Bruno" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Bruno Cartoni</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Choe_Hyunjeong" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Hyunjeong Choe</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Huynh_David" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">David Huynh</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Ha_Linne" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Linne Ha</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Rajakumar_Ravindran" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Ravindran Rajakumar</a><span style="line-height: 1.428571em;"> and </span><a 
href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Wenzel-Grondie_Evelyn" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Evelyn Wenzel-Grondie</a></span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;">Developing Politeness Annotated Corpus of Hindi Blogs - Ritesh Kumar</span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;">Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Song_Zhiyi" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Zhiyi Song</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Strassel_Stephanie" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Stephanie Strassel</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Lee_Haejoong" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Haejoong Lee</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Walker_Kevin" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Kevin Walker</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Wright_Jonathan" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Jonathan Wright</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Garland_Jennifer" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" 
target="_blank">Jennifer Garland</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Fore_Dana" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Dana Fore</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Gainor_Brian" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Brian Gainor</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Cabe_Preston" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Preston Cabe</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Thomas_Thomas" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Thomas Thomas</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Callahan_Brendan" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Brendan Callahan</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Sawyer_Ann" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Ann Sawyer</a></span></span></span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;">The Ellogon Pattern Engine: Context-free Grammars over Annotations - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Petasis_Georgios" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Georgios Petasis</a></span></span></span></span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/1083.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Etymological WordNet: Tracing the History of Words</a> - Gerard de Melo</span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;">Distributed Distributional Similarities of Google Books over the Centuries - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Riedl_Martin" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Martin Riedl</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Steuer_Richard" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Richard Steuer</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Biemann_Chris" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Chris Biemann</a></span></span></span></span></span></em></li>
<li><em><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/913.html" target="_blank">Hot Topics and Schisms in NLP: Community and Trend Analysis with Saffron on ACL and LREC Proceedings</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Buitelaar_Paul" target="_blank">Paul Buitelaar</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Bordea_Georgeta" target="_blank">Georgeta Bordea</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Coughlan_Barry" target="_blank">Barry Coughlan</a></em></li>
<li><em>Linguistic Landscaping of South Asia using Digital Language Resources: Genetic vs. Areal Linguistics - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Borin_Lars" target="_blank">Lars Borin</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Saxena_Anju" target="_blank">Anju Saxena</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Rama_Taraka" target="_blank">Taraka Rama</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Comrie_Bernard" target="_blank">Bernard Comrie</a></em></li>
<li><em>Indian Subcontinent Language Vitalization - András Kornai and Pushpak Bhattacharyya</em></li>
</ul>
</div>
</div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-46818902481128331932013-11-23T19:42:00.000+05:302013-11-23T19:42:26.629+05:30A Systematic Exploration of Diversity in Machine Translation - Paper Summary<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<br />
<div style="text-align: justify;">
An interesting paper regarding generating top-k translation outputs. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Gimpel, K., Batra, D., Dyer, C., & Shakhnarovich, G. (2013). <i>A Systematic Exploration of Diversity in Machine Translation</i>. EMNLP 2013.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This paper discusses:<br /><br />1) Methods for generating the most diverse MT outputs from an SMT system based on a linear decoding model.<br />2) Applying the top-k diverse outputs to various tasks: (1) system combination (2) re-ranking top-k lists (3) human post-editing<br /><br />The motivation for the work is that top-k lists are commonly used in many NLP tasks, including MT, to examine a large set of candidates before making decisions. <br />The usual strategy is simply to take the k best outputs. However, the entries in such a list are often very similar to each other and have therefore shown mixed results. Hence the search for a method to get the top-k diverse translations. <br /><br />This is achieved by a decoding procedure which iteratively generates the best translations, one at a time. The decoding objective function adds a dissimilarity term which penalizes similarity with previously generated translations. In this work, the dissimilarity function is simply a language model over the sentences output in previous iterations (with the LM score negated, so that resembling earlier outputs is penalized). This allows the same decoding algorithm as a standard linear decoding model to be reused. The method increases decoding time, since one decoding pass has to be performed for each candidate in the top-k diverse list. The parameters n and λ are tuned on a held-out set. <br /><br />Using the top-k diverse outputs gives better results than using top-k best lists, and the difference is larger for smaller values of k. An interesting analysis is which sentences benefit most from the diverse lists: it turns out that sentences with lower BLEU scores (presumably difficult to translate) benefit from the diverse lists, whereas sentences with high BLEU scores benefit from top-k best lists. </div>
<div style="text-align: justify;">
<br />A point worth mentioning: while doing top-k re-ranking, one of the features the authors use is an LM score over word classes, and this provides very good results. Brown clustering was used to learn the word classes. </div>
<div style="text-align: justify;">
<br />With the help of confidence scores, a decision can be made dynamically about which of the lists (diverse or best) should be used. There is also scope for investigating more similarity functions.</div>
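The iterative procedure in the paper folds the dissimilarity term into the decoder itself. As a rough illustration of the idea only (not the authors' algorithm), one can greedily re-select from an existing n-best list, penalizing each candidate by λ times its n-gram overlap with the already-selected translations:

```python
def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_diverse(nbest, k, n=2, lam=0.5):
    """Greedily pick k diverse hypotheses from an n-best list.

    nbest: list of (model_score, sentence) pairs.
    At each step a candidate's score is reduced by lam times its
    n-gram overlap with the translations selected so far -- a crude
    stand-in for the paper's LM-based dissimilarity term.
    """
    selected, seen = [], set()
    candidates = list(nbest)
    for _ in range(min(k, len(candidates))):
        best = max(
            candidates,
            key=lambda sc: sc[0] - lam * sum(
                1 for g in ngrams(sc[1].split(), n) if g in seen
            ),
        )
        candidates.remove(best)
        selected.append(best)
        seen.update(ngrams(best[1].split(), n))
    return selected

nbest = [
    (0.90, "the cat sat on the mat"),
    (0.85, "the cat sat on a mat"),   # near-duplicate of the best
    (0.50, "a feline rested on the rug"),
]
print(select_diverse(nbest, 2))
```

With the near-duplicate penalized, the second pick is the genuinely different hypothesis even though its model score is lower.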
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-11007361946186948022013-11-12T14:21:00.001+05:302013-11-12T14:21:51.531+05:30Large Data sources for NLP from Google<div dir="ltr" style="text-align: left;" trbidi="on">
Google has made available two large and rich sources for NLP research:<br />
<ul style="text-align: left;">
<li><a href="http://storage.googleapis.com/books/ngrams/books/datasetsv2.html" target="_blank">Google Books N-gram corpus</a></li>
<li><a href="http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html" target="_blank">Google Books Syntactic N-gram corpus</a></li>
</ul>
These have been described in the following papers:<br />
<ul style="text-align: left;">
<li><div class="gs_citr" id="gs_cit0" tabindex="0">
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. <i>Science</i>. 2011.</div>
</li>
</ul>
<div class="gs_citr" id="gs_cit0" tabindex="0">
<ul>
<li>Yoav Goldberg and Jon Orwant. A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books. <i>*SEM</i>. 2013.</li>
</ul>
</div>
<br />
These resources have been created from the Google Books corpus, which is an outcome of Google's efforts to scan all the world's books. I will just highlight the important points from these papers in this post.<br />
<br />
<b>Google N-gram corpus</b><br />
This is a traditional n-gram corpus, where frequency counts are provided for 1 to 5 gram strings. However, there are a couple of additional features:<br />
<ul style="text-align: left;">
<li>One is the temporal aspect of the n-grams, i.e. for each n-gram, frequency counts are given per year, going back to medieval times. For English, the counts are available from the 16th century onwards.</li>
<li>Frequency counts are available for extended n-grams also. The extension is in terms of the POS tags. All the data has been POS tagged with a tagset of 12 basic tags. This makes possible queries of the following form: </li>
<ul>
<li>the burnt_NOUN car (combination of POS tag and token queries)</li>
<li>_DET_ _NOUN_ (queries involving determiners only)</li>
</ul>
</ul>
There are some restrictions on the 4- and 5-grams available, in order to prevent combinatorial explosion. <br />
<div>
<ul style="text-align: left;">
<li> Information on head-modifier relations is also available, though the relation type is not specified</li>
</ul>
You can use the<a href="https://books.google.com/ngrams" target="_blank"> Google N-gram viewer</a> to query this resource in an interactive way. The corpus has been used for studying evolution of culture over time, and can be used to a variety of such temporal studies e.g. economics, language, etc. <br />
<br />
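Concretely, each record in the downloadable version-2 n-gram files is a tab-separated line giving the n-gram, a year, a match count and a volume count; a minimal parser (the counts in the example are made up):

```python
def parse_ngram_line(line):
    """Parse one line of a Google Books Ngram (version 2) data file.

    Assumed layout: ngram TAB year TAB match_count TAB volume_count
    """
    ngram, year, matches, volumes = line.rstrip("\n").split("\t")
    return ngram, int(year), int(matches), int(volumes)

# A POS-extended n-gram of the kind described above, hypothetical counts:
record = "the burnt_NOUN car\t1950\t127\t95"
ngram, year, matches, volumes = parse_ngram_line(record)
print(ngram, year, matches, volumes)
```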
<b>Google Syntactic N-gram corpus </b><br />
While traditional n-grams consist of sequential words, a syntactic n-gram is defined as a set of words involved in a dependency relationship. Further, an order-n syntactic n-gram means an n-gram containing <b><i>n</i> content words</b>. The Google Books syntactic n-gram corpus contains dependency tree fragments of size 1-5, <i>viz. nodes, arcs, biarcs, triarcs and quadarcs.</i> There is a restriction on the types of quadarcs available in the corpus. Each fragment contains the surface forms of the words, their POS tags, the head-modifier relationships and the relative order of the words. It does not contain information about the linear distance between the words in a dependency or the existence of gaps between words in the n-gram. Counts for all the syntactic n-grams are provided. A few noteworthy points:<br />
<ul style="text-align: left;">
<li>As with the Books n-gram corpus, temporal information on the syntactic n-grams is available.</li>
<li>Additional information for dependency trees involving conjunctions and prepositions is made available. Here, the dependency tree fragments are extended to provide information about the conjunctions and prepositions, even though they are function words. This information forms the extended component of the corpus <i>(extended-arcs, extended-biarcs, etc.)</i></li>
<li>verbargs-unlex and nounargs-unlex are unlexicalized versions of the syntactic n-grams, where only the head word and the top-1000 words in the language are lexicalized. </li>
</ul>
The syntactic n-gram corpus can be very useful for studying lexical semantics, sub-categorization, etc. <br />
<br /></div>
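For concreteness, here is a sketch of reading one record of the arcs files. I am assuming the layout described in the Goldberg and Orwant paper - head word, the fragment as word/POS/dep-label/head-index tokens, a total count, then per-year counts - so verify against the corpus documentation before relying on it:

```python
from collections import namedtuple

Token = namedtuple("Token", "word pos dep head")

def parse_syntactic_ngram(record):
    """Parse one syntactic n-gram record (assumed layout, see above).

    head_word TAB fragment TAB total_count TAB year,count TAB ...
    Each fragment token is word/POS/dep-label/head-index,
    where head-index 0 marks the root of the fragment.
    """
    fields = record.rstrip("\n").split("\t")
    head_word, fragment, total = fields[0], fields[1], int(fields[2])
    by_year = {int(y): int(c)
               for y, c in (f.split(",") for f in fields[3:])}
    tokens = [Token(w, p, d, int(h))
              for w, p, d, h in (t.rsplit("/", 3) for t in fragment.split())]
    return head_word, tokens, total, by_year

# Hypothetical record for the arc (cat --nsubj--> sat):
rec = "sat\tcat/NN/nsubj/2 sat/VBD/ROOT/0\t42\t1990,12\t2000,30"
head, toks, total, by_year = parse_syntactic_ngram(rec)
```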
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-44560979277408236172013-10-19T19:54:00.003+05:302013-10-19T19:54:50.929+05:30Hierarchical Phrase Based models<div dir="ltr" style="text-align: left;" trbidi="on">
I read David Chiang's ACL'05 paper on hierarchical phrase-based models today. A quick summary:<br />
<br />
<b>Design Principles: </b><br />
<ul style="text-align: left;">
<li>Formal, but not linguistic, i.e. a synchronous CFG is used; however, the grammar learnt may not correspond to a linguistic ('human'?) grammar.</li>
<li>Leverage the strengths of phrase-based systems while moving to a syntax-based one </li>
</ul>
<div style="text-align: left;">
<br /><b>Basic Motivation: </b><br /><br />The basic idea is to handle long-distance reorderings that a phrase-based model can't handle. <br />This is done by introducing a single non-terminal 'X' and having rules of the form: <br /><br /><b> X-> a X_1 b X_2 c | d X_2 e X_1 f </b><br /><br />where the subscripts indicate the relative positions of the RHS non-terminals.<br /><br />In theory, the number of non-terminals on the RHS is not constrained. However, a limitation is that reorderings that happen at higher levels of a constituent parse tree may not be captured. The rules learnt by this system are more like lexicalized reordering templates. <br /><br />Special types of rules used: </div>
<ul style="text-align: left;">
<li> Glue rules: top level rule</li>
<li> Entity rules: for translating dates, numbers, etc.</li>
</ul>
<div style="text-align: left;">
<br /><b>Learning rules</b><br /><br />The starting point is the set of phrases learnt by a phrase-based system, called 'initial phrase pairs'. From each initial phrase pair, rules are extracted. In order to avoid too many rules and to reduce spurious derivations, some heuristics are used. One noteworthy heuristic is that rules are constructed from the smallest possible initial phrase pairs. Another is that each rule can have at most two non-terminals on the RHS. This is done for decoding efficiency, probably because the CYK algorithm expects a grammar in CNF, where every rule has two non-terminals.<br /><br /><b>The model</b><br /><br />The model is very similar to the phrase-based model: a log-linear model with the same features, except that the phrase translation probabilities are replaced by the rule translation probabilities. The probabilities are learnt in a similar way. <br /><br /><b>Decoding</b><br /><br />Decoding is done via a CYK variant. The differences from standard CYK parsing are: <br />- Parsing is done only for the source language sentence. So far so good.<br />- There is only one non-terminal. You would expect this to make the parsing easier. However, there is a catch. <br />- The language model of the target language has to be integrated into the decoder. The paper says "the language model is integrated by intersecting with the target side CFG", which I take to mean that the LM score of the sub-string spanned by a cell in the chart parsing is multiplied along with the rule weights. This means each cell has to keep track of the rule along with all the target strings that the rule can generate in that span. Each (rule, target string) pair is like a virtual non-terminal, and hence the effective number of non-terminals can be really large, especially for larger spans. <br /> What I have described here is naive, and the journal paper describes different strategies for integrating the language model. I will read up on and summarize that later.
<br />- The grammar is not CNF, though every rule still has only two non-terminals. I guess it is converted to CNF before decoding. <br /><br />Another interesting problem is how to find the top-k parses. The journal article describes this in detail too. <br /><br /><b>Optimizations to decoding</b><br /><br />- Limiting the number of entries in a cell of the chart<br />- Pruning entries in the cell with very low scores as compared to the highest scoring rule in the cell</div>
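To make the rule notation concrete, here is a toy illustration (not Chiang's decoder): integers in the target template stand for the gaps X_1, X_2, and applying the rule substitutes the gap translations in their reordered positions.

```python
def apply_rule(tgt_template, gap_translations):
    """Apply the target side of a synchronous rule.

    tgt_template: list of terminal strings and integer gap indices.
    gap_translations: gap index -> list of target words for that gap.
    """
    out = []
    for sym in tgt_template:
        if isinstance(sym, int):
            out.extend(gap_translations[sym])  # fill the gap
        else:
            out.append(sym)                    # copy the terminal
    return out

# Rule X -> < X_1 ne X_2 pas , X_1 do not X_2 >, applied with
# X_1 = "je" -> "i" and X_2 = "veux" -> "want":
target = apply_rule([1, "do", "not", 2], {1: ["i"], 2: ["want"]})
print(" ".join(target))
```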
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<i><b>References</b></i></div>
<ul style="text-align: left;">
<li>Chiang, David. "A hierarchical phrase-based model for statistical machine translation." <i>Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics</i>. Association for Computational Linguistics, 2005.</li>
<li>Chiang, David. "Hierarchical phrase-based translation." <i>Computational Linguistics</i> 33.2 (2007): 201-228.</li>
</ul>
<div style="text-align: left;">
<br /></div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-46117059309019419562012-08-28T14:53:00.004+05:302012-08-28T14:53:56.777+05:30N-gram features for text classification<div dir="ltr" style="text-align: left;" trbidi="on">
Traditionally, text classification has relied on bag-of-words count features. For some experiments, I was wondering if n-gram counts could make for a good feature set. Once I generated the features, I knew I was in trouble: for the WSJ corpus, I got about 20 million features for a trigram model. I checked the literature and found this paper showing that n-gram features don't help much:<br />
<br />
<br />
<i><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.49.133&rep=rep1&type=pdf" target="_blank">A Study Using n-gram Features for Text Categorization</a></i>, Johannes Furnkranz<br />
<br />
<br />
Bigram and trigram features may give modest gains, but feature selection is obviously required. Feature selection based on document frequency or term frequency would be a simple approach.<br />
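A document-frequency cutoff is easy to sketch in pure Python (a library vectorizer would be used in practice): n-grams occurring in fewer than min_df documents are dropped from the vocabulary.

```python
from collections import Counter

def ngram_features(docs, n_max=3, min_df=2):
    """Count word n-grams (n = 1..n_max) per document and keep only
    those occurring in at least min_df documents -- a simple
    document-frequency cutoff that tames the feature explosion."""
    df = Counter()       # document frequency of each n-gram
    per_doc = []
    for doc in docs:
        toks = doc.lower().split()
        grams = Counter()
        for n in range(1, n_max + 1):
            for i in range(len(toks) - n + 1):
                grams[tuple(toks[i:i + n])] += 1
        per_doc.append(grams)
        df.update(grams.keys())  # each doc counted once per n-gram
    vocab = sorted(g for g, c in df.items() if c >= min_df)
    return vocab, [[grams.get(g, 0) for g in vocab] for grams in per_doc]

docs = ["the cat sat", "the cat ran", "a dog ran"]
vocab, X = ngram_features(docs, n_max=2, min_df=2)
print(vocab)
```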
<br />
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-21488219947323230452012-08-23T10:45:00.000+05:302012-08-23T10:45:18.209+05:30Origins of the Brahmi Script<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
This post is motivated by chapter 2 of James Gleick's book 'The Information', which discusses the evolution of writing. <b>Brahmi</b> is the mother script from which the scripts of all modern Indian and South-East Asian languages have evolved. It was first seen in Emperor Ashoka's rock edicts dating to the 3rd century B.C. It is thus one of the ancient world's <b>"alphabets"</b> - along with Greek, Phoenician and Aramaic. The alphabet is based on the idea that symbols represent phonemes, in contrast to other writing systems like logographic ones (e.g. Chinese, which employs symbols for words) or syllabic ones (e.g. Japanese, where symbols represent syllables). </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
All the alphabetic scripts are said to be derived from a single script, the Phoenician. In fact, the very word 'alphabet' comes from the first two symbols in the Greek script 'Alpha' and 'Beta'. There is a lack of clarity on the origin of the Brahmi script, with two primary categories of theories. One propounds that the Brahmi evolved from the Aramaic script (itself an evolution over the Phoenician). This is based on the proposed orthographic similarities between symbols in the scripts. (See Figure).<br />
<br />
The other theory proposes an indigenous development of the Brahmi script, based on the wide differences in how the writing systems work. I tend to favour this theory, though I must admit that my knowledge of this area is limited to reading a few articles and knowing some of the modern-day descendants of these scripts. The modern-day alphabets of Indian scripts are organized phonetically, and there is little phonetic ambiguity - as opposed to the Roman script. The earliest Semitic scripts (Phoenician, Aramaic) and even modern Arabic do not have vowels, whereas the so-called "true" alphabets - Greek and its modern Latin-derived scripts - still have room for ambiguity. Even if some symbols were borrowed from the Aramaic script, the design seems novel enough to call it a new style of scripting. Is there an alternative line of evolution of the script? The Indus Valley script is still undeciphered - could Brahmi have evolved from there? </div>
<div style="text-align: justify;">
<br /></div>
<br /></div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-91252107742200285432012-02-12T12:15:00.000+05:302012-02-12T12:15:28.897+05:30Indian English<div dir="ltr" style="text-align: left;" trbidi="on">
From Chandan Mitra's weekly column in the Pioneer, some hilarious examples of English usage:<br />
<br />
<br />
In a newspaper, describing a case of chain-snatching in which criminals shot dead the man who tried to resist and pursue the chain-snatchers, the reporter stated: “The deceased gave chase to the criminals who, however, managed to escape”!<br />
<br />
Police notice: “Take care of belongings. You may be theft”<br />
<br />
<br />
The article is interesting reading too.<br />
<span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><a href="http://dailypioneer.com/columnists/item/51044-dont-fast-you-may-be-theft-indlish-is-on-a-roll.html" style="color: #1155cc;" target="_blank">http://dailypioneer.com/columnists/item/51044-dont-fast-you-may-be-theft-indlish-is-on-a-roll.html</a></span><br />
<br />
<br />
<br /></div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com1tag:blogger.com,1999:blog-1879978874853957111.post-86425832633683685662012-01-14T22:35:00.001+05:302014-09-19T18:18:29.126+05:30Yet Another Moses Installation Guide<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Though Moses is a versatile MT system, its installation is still from the stone age. Let me document here some of the key points for navigating the installation of Moses. The intent is not to present a complete installation guide, but to highlight key issues that may crop up (as they cropped up for me). For a complete installation, <a href="http://www.statmt.org/moses_steps.html">this</a> is probably the best guide. Another useful installation guide can be found <a href="http://www.cfilt.iitb.ac.in/Moses-Tutorial.pdf">here</a>.
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
To install the Moses system, the following tools need to be installed. </div>
<div style="text-align: justify;">
</div>
<ul>
<li>Language modelling toolkit (SRILM, IRSTLM, etc.)</li>
<li>GIZA++ package which contains GIZA++ and mkcls</li>
<li>Moses decoder (version 1.0 and above)</li>
</ul>
<br />
<div>
<b>SRILM installation</b></div>
<div>
<ul style="text-align: left;">
<li>The primary installation reference is the INSTALL document that ships with the tool.</li>
<li>Install all pre-requisites mentioned in the SRILM installation guide. On Ubuntu I had to install the following packages: csh, g++-multilib, tcl-dev</li>
<li>Set the environment variable SRILM to point to the base directory of the install package before building SRILM.</li>
<li>Following the instruction manual with the SRILM download should be enough once the pre-requisites are installed. </li>
<li>The problems you may yet face are: </li>
<ul>
<li>Problem in identifying the architecture, especially if it is a 64-bit machine. To make sure that the install script correctly identifies the architecture, set the variable MACHINE_TYPE in sbin/machine-type.</li>
<li>Problems with TCL compilation. You may not need the TCL user interfaces at all, so it may be ok to disable their compilation. Set the variable NO_TCL = X in the file common/your_architecture_specific_makefile. </li>
</ul>
<li>Make sure you have added the $SRILM/bin and $SRILM/bin/$MACHINE_TYPE to the PATH variable</li>
<li><span style="color: red;"><i>Note: </i>SRILM 1.7.1 and above are not compatible with Moses</span> </li>
</ul>
<div>
<b>IRSTLM installation</b></div>
<div>
<ul style="text-align: left;">
<li>Ubuntu packages required: libtool make autoconf autotools-dev automake</li>
<li>The installation is pretty simple, just have to follow the installation guide</li>
<li>One caveat: sometimes it may be necessary to create a directory named 'm4' manually, if the first step fails</li>
</ul>
</div>
<div>
<br /></div>
<div>
<b>GIZA++ and mkcls installation</b></div>
</div>
<div>
<div>
<ul style="text-align: left;">
<li>You get both if you download the giza-pp tool. </li>
<li>Most straightforward installation. Download and 'make'.</li>
<li>Copy the binaries - GIZA++, mkcls, snt2cooc.out - to a new directory. </li>
</ul>
<div>
<b>XMLRPC Server</b></div>
</div>
<div>
<ul style="text-align: left;">
<li>An XML-RPC server is required if you want to run a web service providing translations. If you just want to get Moses running, you can skip this step.</li>
<li>Install the following packages: libxmlrpc-core-c3 libxmlrpc-core-c3-dev libxmlrpc-c3-dev libxmlrpc-c++4 libxmlrpc-c++4-dev </li>
</ul>
<b>Boost Library</b> <br />
The C++ Boost library is required for the installation of Moses. Boost 1.48 has a serious bug which breaks Moses compilation. Unfortunately, some Linux distributions (e.g. Ubuntu 12.04) ship broken versions of the Boost library. To fix this situation you can:<br />
<ul style="text-align: left;">
<li>For Ubuntu 12.04: Remove boost 1.48 from your distribution and install Boost 1.46 which is available in the distribution. This works most of the time. If not, build Boost from source as described below. </li>
<li>To install Boost manually and making it work with Moses, follow the instructions in the section titled "Manually Installing Boost" on this page: <a href="http://www.statmt.org/moses/?n=Development.GetStarted">http://www.statmt.org/moses/?n=Development.GetStarted</a> </li>
</ul>
</div>
<div>
<b>Moses installation</b></div>
</div>
<div>
<div>
<ul style="text-align: left;">
<li>The primary installation reference is the INSTALL document that ships with the tool.</li>
<li>SRILM or IRSTLM need to be installed before Moses is installed</li>
<li>Make sure you have installed the packages automake and libtool</li>
<li>Boost has to be installed</li>
<li>It is then a matter of just following the instructions. The command to be run is: </li>
<li><i>/usr/bin/bjam --with-srilm=&lt;path_to_srilm&gt; --with-xmlrpc-c=&lt;path_to_xmlrpc&gt; --with-boost=&lt;path_to_boost&gt;</i></li>
<ul>
<li>If XML-RPC is installed in /usr/bin, then the parameter would simply be '/usr'</li>
<li><i>--with-boost</i> is required only when Boost is installed in a non-standard directory. The <i>path </i>should contain both lib/lib64 and include directories </li>
</ul>
</ul>
</div>
<div>
Now Moses is ready to cross the Red Sea.<br />
<b><br /></b>
<br />
<h3 style="text-align: left;">
<b>Alternative ways of installing Moses</b></h3>
</div>
</div>
<div>
If you fail to install from the source as mentioned above, then there are a couple of simpler alternatives you can try:</div>
<div>
<br /></div>
<div>
One, use the pre-compiled binaries provided by the Moses team: </div>
<div>
<a href="http://www.statmt.org/moses/?n=Moses.Releases">http://www.statmt.org/moses/?n=Moses.Releases</a></div>
<div>
The pre-compiled version comes with IRSTLM and does not support XML-RPC to the best of my knowledge. However, it is handy to get started. </div>
<div>
<br /></div>
<div>
If that too runs into trouble, then you can try using the virtual machine provided by the Moses team. </div>
<div>
<br /></div>
<div>
<a href="http://www.statmt.org/moses/RELEASE-2.1/vm/">http://www.statmt.org/moses/RELEASE-2.1/vm/</a></div>
<div>
<br /></div>
<div>
If you are using Virtual Box, you can import the OVA images into VirtualBox. </div>
This guide may be useful for importing OVA images into VirtualBox:
<br />
<a href="http://www.maketecheasier.com/import-export-ova-files-in-virtualbox/">http://www.maketecheasier.com/import-export-ova-files-in-virtualbox/</a><br />
<div>
<br /></div>
<div>
I have not tried the virtual machine images, so let me know if they work. </div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com4tag:blogger.com,1999:blog-1879978874853957111.post-74598363011687239512011-09-23T20:17:00.000+05:302011-09-23T20:17:42.474+05:30Incorporating Linguistic Information into SMT Models<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<i>(Summary of the chapter 'Integrating Linguistic Information' in Philip Koehn's textbook <a href="http://www.statmt.org/book/">'Statistical Machine translation'</a>)</i></div>
<div style="text-align: justify;">
<i><br /></i></div>
<br />
<div style="text-align: justify;">
Traditional phrase-based Statistical Machine Translation (SMT) has relied only on the surface forms of words, but this can carry you only so far. Without considering any linguistic phenomena, no generalization is possible and the SMT system ends up being a translation memory. Various kinds of linguistic information need to be incorporated into the SMT process, such as: </div>
<br />
<ul style="text-align: left;">
<li style="text-align: justify;">Name Transliteration and Number script conversions</li>
<li style="text-align: justify;">Morphology changes - inflections, compounding, segmentation - which, if not handled, lead to data sparsity problems</li>
<li style="text-align: justify;">Syntactic phenomena like constituent structure, attachment, and head-modifier re-orderings. Vanilla SMT is designed to handle local re-orderings, but long-range dependencies are not handled well. </li>
</ul>
<br />
<div style="text-align: justify;">
One way to handle them is to pre-process the parallel corpus before training and then run the SMT tools. Pre-processing could include:</div>
<br />
<ul style="text-align: left;">
<li style="text-align: justify;">Transliteration and back-transliteration models need to be incorporated. An important problem is to identify the named entities in the first place.</li>
<li style="text-align: justify;">Splitting words for a morphology rich input language. Compounding and segmentation can be handled similarly. </li>
<li style="text-align: justify;">Re-ordering worries can be handled by re-ordering the input language sentences in a pre-processing before feeding it to the SMT system. This re-ordering can be done either by handcrafted rules or learnt from data. This could be shallow like POS tag based re-ordering rules or full fledged parsed based. </li>
</ul>
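To make the shallow re-ordering idea concrete, here is a toy sketch (my own illustration, not from the book): a single hypothetical POS-based rule that pushes verbs to the end of the clause, mimicking an SVO-to-SOV pre-ordering. A real system would obtain the POS tags from a tagger and use many such rules.

```python
# Toy pre-ordering sketch: one hand-crafted rule that moves verbs to the
# end of the clause, mimicking SVO -> SOV re-ordering. The POS tags are
# assumed to come from a tagger; the rule itself is purely illustrative.

def reorder_svo_to_sov(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs; returns re-ordered words."""
    verbs = [w for w, p in tagged_sentence if p.startswith("VB")]
    rest = [w for w, p in tagged_sentence if not p.startswith("VB")]
    return rest + verbs  # push all verbs to the end

tagged = [("Ram", "NNP"), ("ate", "VBD"), ("the", "DT"), ("mango", "NN")]
print(reorder_svo_to_sov(tagged))  # -> ['Ram', 'the', 'mango', 'ate']
```

The output order matches the Hindi word order राम ने आम खाया, which is exactly what pre-ordering is meant to achieve.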
<br />
<div style="text-align: justify;">
Similarly, some work may be done on the post processing side: </div>
<br />
<ul style="text-align: left;">
<li style="text-align: justify;">If the output language is morphologically complex, then morphological generation can take place in a post-processing step after SMT. This assumes that the SMT system has generated enough information to drive the output morphology.</li>
<li style="text-align: justify;">Alternatively, to ensure grammaticality of the output sentences, we can re-rank the candidate translations based on syntactic features like agreement and parse correctness. Note that a distinction has been made between parse quality as defined for parsing and parse quality as required for MT systems. </li>
</ul>
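A minimal sketch of what such n-best re-ranking might look like: each candidate's model score is combined with a syntactic feature score (here a hypothetical agreement check), and the list is re-sorted. The scores and the weight below are made up for illustration.

```python
# Toy n-best re-ranking sketch: combine the decoder's model score with a
# hypothetical syntax score and re-sort. All numbers are illustrative.

def rerank(nbest, syntax_weight=0.5):
    """nbest: list of (translation, model_score, syntax_score) tuples."""
    return sorted(nbest,
                  key=lambda c: c[1] + syntax_weight * c[2],
                  reverse=True)

nbest = [("cand A", -3.0, 0.2), ("cand B", -3.2, 1.0)]
print(rerank(nbest)[0][0])  # -> cand B (syntax score outweighs model score)
```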
<br />
<div style="text-align: justify;">
The problem with such pre-processing and post-processing components is that they are themselves prone to error. The errors of the individual components are not handled in an integrated framework, and hard decisions must be made at each component boundary. A probabilistic approach which incorporates all these pre- and post-processing components would be cleaner and more elegant. That is the motivation behind <a href="http://acl.ldc.upenn.edu/D/D07/D07-1091.pdf">the factored translation model</a>. In this model, the factors are annotations on the input and output words (e.g. morphology, POS). Translation and generation functions are defined over the factors, and these are integrated using a log-linear model. This provides a principled way to test a diverse set of features in a structured way. Of course, the phrase translation table will now grow, but this can be handled by using pre-compiled data structures. Decoding could also blow up, but pruning can be used to cut the search space.</div>
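To illustrate the idea, here is a rough sketch of a factored representation and a generation step. The 'surface|lemma|POS' packing, the romanized Hindi lemmas, and the tiny generation lexicon are my own invented examples, not the actual factored-model implementation.

```python
# Sketch of a factored representation: each word carries factor annotations
# (surface|lemma|POS here), and a generation function maps output-side
# factors back to a surface form. The lexicon below is made up.

def annotate(words, lemmas, tags):
    """Pack parallel factor streams into 'surface|lemma|POS' tokens."""
    return ["|".join(f) for f in zip(words, lemmas, tags)]

# hypothetical generation table: (lemma, morphology) -> surface form
GEN = {("khaa", "past.masc.sg"): "khaayaa",
       ("khaa", "past.fem.sg"): "khaayii"}

def generate(lemma, morph):
    return GEN[(lemma, morph)]

print(annotate(["Ram", "ate"], ["Ram", "eat"], ["NNP", "VBD"]))
# -> ['Ram|Ram|NNP', 'ate|eat|VBD']
print(generate("khaa", "past.masc.sg"))  # -> khaayaa
```

The point of the factored setup is that translation can be defined on lemmas and morphology separately, with a final generation step producing the surface form, rather than requiring every inflected form to appear in the parallel corpus.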
<br />
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-24580682785226341142011-09-23T18:09:00.000+05:302011-09-23T18:10:00.738+05:30Language Divergence between English and Hindi<div dir="ltr" style="text-align: left;" trbidi="on">
Comparing two languages is interesting, especially for an application like machine translation. Languages exhibit so many differences that it is mind-boggling to realize we navigate between them with ease. This paper, <a href="http://www.springerlink.com/content/t1005w166746727l/">'Interlingua-based English–Hindi Machine Translation and Language Divergence'</a>, summarizes the major differences between Hindi and English.<br />
<br />
I have tried to tabulate the observations in the paper below, to make a handy reference:<br />
<br />
<br />
<table cellspacing="0" cols="3" frame="VOID" rules="NONE">
<colgroup><col width="230"></col><col width="357"></col><col width="373"></col></colgroup>
<tbody>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;" width="230"><b>Factor</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;" width="357"><b>English</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;" width="373"><b>Hindi</b></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Word Order</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Subject-Verb-Object</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Subject-Object-Verb</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>Ram <b>ate</b> the mango</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">राम ने आम <b>खाया </b></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Modifiers</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Post modifier</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Premodifier</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>The Prime Minister of India</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">भारत का प्रधान मंत्री </span></i></td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>play well</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">अच्छे से खेलेंगे </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>X-positions</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Prepositions</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Postpositions</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>of India </i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">भारत का </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Overloading</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Times New Roman';"><br /></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>John ate rice with curd</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Times New Roman';"><br /></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>John ate rice with a spoon</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Times New Roman';"><br /></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Compound Verbs</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">not prevalent</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">very common</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Conjunct Verbs</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">not prevalent</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">very common</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">वह गाने लगे </span></i></td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">रुक जाओ </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Times New Roman';"><br /></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Respect</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">No special words</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Words indicating respect</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><span style="font-family: 'Lohit Hindi';">आप, हम </span></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="18" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Person</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Uses 2nd person for 3rd person</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">He obtained his degree</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">आपने अमेरिका से डिग्री प्राप्त की </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Gender</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Masculine, feminine, neuter</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Masculine, feminine</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Gender-specific possessive pronouns</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">English has them</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Hindi lacks them</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>he, she</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">वह </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Morphology</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Poor</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Rich</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Null subject divergence</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Subject dropped in certain conditions</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">There was a king</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><span style="font-family: 'Lohit Hindi';">एक राजा था </span></td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">I am going</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><span style="font-family: 'Lohit Hindi';">जा रहा हूँ </span></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Pleonastic divergence</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Pleonastic dropped</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">It is raining</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">बारिश हो रही है </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Conflational divergence</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">no appropriate word</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>Brutus stabbed Caesar</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">ब्रूटस ने सीसर को छुरे से मारा </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Categorical divergence</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">change in POS category</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">They are competing</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><span style="font-family: 'Lohit Hindi';">वे मुकाबला कर रहे है </span></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Head swapping</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Head and modifier are exchanged</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>The play is on</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">खेल चल रहा है </span></i></td>
</tr>
</tbody>
</table>
<div>
<br /></div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-87343165556768148092011-09-21T21:48:00.000+05:302011-09-21T21:48:22.839+05:30Aligning Sentences to build a parallel corpus<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
This is a <a href="http://dl.acm.org/citation.cfm?id=972455">really old paper</a>, from Gale & Church, on building a sentence-aligned parallel corpus from an unaligned corpus. A dynamic programming formulation with a novel distance measure is used to align the sentences. For a method this simple, the reported results on the Hansards corpus are impressive. Of course, the input corpus is paragraph-aligned. </div>
<div>
<br /></div>
<div style="text-align: left;">
The basic premise is simple: sentences with fewer characters in one language correspond to sentences with fewer characters in the other language, and likewise for longer sentences. Based on this idea, the distance between two sentences is defined via a random variable X: the number of characters in language L2 per character of language L1. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I tried to see the behavior of this variable for the English-Hindi language pair. On a 14000 sentence parallel corpus, here are the results: </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
mean(X): 0.99, i.e. almost one Hindi character per English character, which agrees with the paper's claims. Interestingly, if whitespace is not counted, the mean drops to 0.96. </div>
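These statistics are straightforward to reproduce. Here is a minimal sketch (the sentence pairs below are made up for illustration, and whitespace is counted, as in the mean reported above):

```python
# Estimate X = characters in L2 (Hindi) per character in L1 (English)
# over a sentence-aligned parallel corpus. The pairs below are toy
# examples standing in for the 14000 sentence corpus.
pairs = [
    ("The play is on", "खेल चल रहा है"),
    ("He went home", "वह घर गया"),
    ("I did not say he stole the money", "मैंने नहीं कहा कि उसने पैसे चुराए"),
]

ratios = [len(hi) / len(en) for en, hi in pairs]
mean = sum(ratios) / len(ratios)
var = sum((r - mean) ** 2 for r in ratios) / len(ratios)
print(round(mean, 2), round(var, 4))
```

On the real corpus, the same two lines of arithmetic give the mean and variance quoted above.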
<div style="text-align: left;">
variance(X): 0.01979136 - very low, so the mean is very reliable. A linear fit can't get better than this: </div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTbFRWB0tjit2bwFUhJm1oH2qU1UuiCD_wumgayqU9skJ4Ky3-j5X2rA03nfD4gFZQNLL7RxNepJmOMh5x1UeCmlD4Rc7_TqSHpfYsoNUgL6kDltdaqru372hY2xWhm_blSakXd1oscew/s1600/Screenshot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="292" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTbFRWB0tjit2bwFUhJm1oH2qU1UuiCD_wumgayqU9skJ4Ky3-j5X2rA03nfD4gFZQNLL7RxNepJmOMh5x1UeCmlD4Rc7_TqSHpfYsoNUgL6kDltdaqru372hY2xWhm_blSakXd1oscew/s320/Screenshot.png" width="320" /></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<a href="http://code.google.com/p/nltk/source/browse/trunk#trunk%2Fnltk_contrib%2Fnltk_contrib%2Falign">NLTK provides an implementation</a> of the Gale-Church alignment algorithm. I tried running it on a perfectly parallel corpus, but the algorithm ends up misaligning the sentences. Reducing mean(X) to 0.9 did not help either. I wonder what's going on? </div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-47721513056694998462011-08-31T10:00:00.004+05:302011-08-31T10:31:29.139+05:30Watson - The Quiz Champion<p style="text-align: justify;">You must have heard of IBM's Watson system. It is, of course, the computer that won the Jeopardy competition against the show's previous champions. Jeopardy is a popular quiz show in which the competitors are provided clues and have to give questions that satisfy these clues. For example, a clue like '<em>This computer beat the reigning world chess champion</em>' would elicit a question '<em>Who is Deep Blue?</em>'. As you can see, the questions given by the competitors are easy questions of the nature '<em>What is</em>', '<em>Who is</em>', so the Jeopardy question-answer format can be considered like that of any other quiz show. The clues, however, are complex, covering a wide array of topics, and could include puns, puzzles, and maths. The competitors also place bets on each question. Competing at 'Jeopardy' thus requires the right combination of 'natural language understanding, broad knowledge, confidence and strategy'. </p><p style="text-align: justify;">Watson's victory thus represents a major milestone for natural language processing, and particularly the sub-area known as 'Question-Answering'. Question-Answering systems have great practical use in building expert systems, customer support systems, decision-making tools and enterprise search systems. </p><p style="text-align: justify;">Watch Watson's winning performance here: </p><iframe width="420" height="345" src="http://www.youtube.com/embed/qpKoIfTukrA?wmode=opaque" frameborder="0"></iframe>
<br /><iframe width="560" height="345" src="http://www.youtube.com/embed/YLR1byL0U8M?wmode=opaque" frameborder="0"></iframe>
<br /><p>
<br /></p><p style="text-align: justify;">This paper, <a href="http://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf" target="_blank">Building Watson: An Overview of the DeepQA project</a>, from IBM provides an overview of Watson and the DeepQA architecture that underlies it. The DeepQA architecture defines a framework for developing QA systems in an extensible and modular manner, allowing different components to be customized, and for building robust QA systems that can be ported across domains. Figure 1 shows a high-level diagram of Watson's major components, and how queries are routed through them.</p><ol><li style="text-align: justify;"><strong>Query Analysis</strong>: This is the first stage, where the input clue is analyzed to determine the question category (puzzle, pun, mathematical, numeric, logical, etc.) and the answer type (person, location, organization, etc.). Complex clues are also decomposed into simpler clues. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;"><strong>Hypothesis Generation</strong>: Watson has at its disposal many sources of information like encyclopedias, books, and lists of things like people, countries, etc. Watson does not attempt to get the correct answer straightaway. Instead, it first focuses on generating as many candidate answers as possible, called 'hypotheses'. This is to ensure that good answers are not missed in the pursuit of the perfect answer. The attempt is to increase recall at this stage. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;"><strong>Soft Filtering:</strong> Watson may generate hundreds or thousands of hypotheses, which then have to be analyzed in detail to find the correct answer. To limit this deep analysis to only the most relevant answers, Watson filters out the bad candidates using a few lightweight techniques, like checking for a mismatch between the expected and candidate answer types. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;"><strong>Hypothesis and Evidence scoring:</strong> Now Watson does a deep analysis of the candidate answers by employing sophisticated linguistic and statistical techniques, and looks to gather evidence for each hypothesis. This is one of the most critical parts of Watson since the evidence collected will determine how good the answer is and how confident Watson can be about it. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;"><strong>Merging and Ranking:</strong> Once the evidence is collected, confidence scores are generated for each candidate and the candidates are ranked. Then, looking at the top answer's confidence level, Watson decides whether it should answer the question or not. </li>
<br /></ol>
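To make the data flow concrete, here is a toy caricature of these five stages. Everything below - the candidate lists, the type table, and the scorer - is invented for illustration; each fake scorer stands in for Watson's deep linguistic and statistical analytics:

```python
# Toy caricature of the DeepQA stages: over-generate hypotheses,
# softly filter, score evidence, then merge and rank by confidence.
# All data and scorers below are invented for illustration.

def generate_hypotheses(clue):
    # Stage 2: over-generate candidates to maximize recall
    return ["Deep Blue", "Watson", "Kasparov", "ENIAC"]

def soft_filter(candidates, expected_type):
    # Stage 3: cheap checks drop obvious answer-type mismatches
    types = {"Deep Blue": "computer", "Watson": "computer",
             "Kasparov": "person", "ENIAC": "computer"}
    return [c for c in candidates if types[c] == expected_type]

def score_evidence(candidate, clue):
    # Stage 4: a fake evidence scorer stands in for deep analysis;
    # real scorers each return their own confidence estimate
    evidence = {"Deep Blue": 0.9, "Watson": 0.3, "ENIAC": 0.1}
    return evidence.get(candidate, 0.0)

def answer(clue, expected_type, threshold=0.5):
    # Stage 5: rank by merged confidence; abstain below the threshold
    ranked = sorted(
        ((score_evidence(c, clue), c)
         for c in soft_filter(generate_hypotheses(clue), expected_type)),
        reverse=True)
    conf, best = ranked[0]
    return best if conf >= threshold else None

clue = "This computer beat the reigning world chess champion"
print(answer(clue, "computer"))  # → Deep Blue
```

The abstention in the last stage mirrors Watson's decision to buzz in only when its merged confidence clears a threshold.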
<br /><p><a href="http://api.ning.com/files/Iyak1H5Usv*ZxLRFvSNVrz-N9VCzet5yUboI6L0ZbUNHv-GhmwvoCsrdFwtv4YUCukQpoKd8JjQSLZP2Y2UqMPIL4m9sTOqs/deepQA.png" target="_self"><img src="http://api.ning.com/files/Iyak1H5Usv*ZxLRFvSNVrz-N9VCzet5yUboI6L0ZbUNHv-GhmwvoCsrdFwtv4YUCukQpoKd8JjQSLZP2Y2UqMPIL4m9sTOqs/deepQA.png?width=600" width="600" class="align-full" /></a></p>
<br /><p style="text-align: center;">Figure 1: DeepQA Architecture (Source: The IBM paper)</p><p style="text-align: justify;">The flexibility in the DeepQA architecture is achieved through the use of the UIMA text analysis framework. At one point in the trials, Watson was taking about two hours to generate an answer. The answer was to parallelize Watson with UIMA-AS, and this got the response time down to the quiz show's average of 2 to 5 seconds. The improvement in accuracy is even more startling. When the IBM team started working on Watson, the difference between the show's participants and early prototypes of Watson was huge. Figure 2 depicts the evolution in Watson's performance. It started from a baseline where the precision and recall were nowhere near the cloud of points corresponding to actual human competitors, but gradually reached human-level performance. </p><p><a href="http://api.ning.com/files/NOSv*YHdv4D-SexdVctwURtnjiC*vHpyd4Gp8EUPdqDE4Y-LLCxi9cDD*5kBeeUnXV-yGZW4adeYjwLviUOash2z1diPgL33/watsonimprovemtn.png" target="_self"><img src="http://api.ning.com/files/NOSv*YHdv4D-SexdVctwURtnjiC*vHpyd4Gp8EUPdqDE4Y-LLCxi9cDD*5kBeeUnXV-yGZW4adeYjwLviUOash2z1diPgL33/watsonimprovemtn.png?width=600" width="600" class="align-full" /></a></p><p style="text-align: center;"> Figure 2: Watson's accuracy over time (Source: The IBM paper)</p><p style="text-align: justify;">What enabled Watson to reach this level of performance? Many of the underlying analysis algorithms aren't new, but have been around in the research community for a long time. More than groundbreaking original research, it is pragmatic engineering that lies at the core of Watson's success, and the following are the salient contributory factors:</p><ul><li style="text-align: justify;">Building an end-to-end system: Very early, the team built a baseline end-to-end system and then kept iterating and improving the system. 
They defined end-to-end evaluation metrics which captured the performance of the system as a whole, rather than focusing only on individual component accuracies in the initial stages. This helped them make the correct trade-offs. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;">Pervasive Confidence estimation: Every component in Watson gives a confidence estimate along with its response. This is critical since these confidence scores can be aggregated to get the final confidence on the answers and allows easy integration of components of varying accuracy. The rule is that no component is assumed to be perfect, but each makes available its confidence estimate of the answers. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;">Many experts: There may be competing algorithms to do the same task. Rather than using the best, the system uses multiple algorithms so as to get diverse results and evidence. The confidence estimates help to blend the diverse results. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;">Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.</li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;">Massive parallelism: As mentioned, exploiting massive parallelism allows looking through a large number of hypotheses.</li></ul><div style="text-align: justify;">
<br /></div><div style="text-align: justify;">(PS: Cross-posted from <a href="http://peepaal.org/profiles/blogs/watson-the-quiz-champion">my Peepaal blog post</a>)</div><ul></ul>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-21401953921419086842011-07-20T19:38:00.005+05:302012-04-10T20:40:35.339+05:30Statistical Machine Translation - IBM Models<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
At CFILT, a few of us have been working on understanding the IBM Models thoroughly. The <a href="http://acl.ldc.upenn.edu/J/J93/J93-2003.pdf">IBM paper</a> on SMT is a classic and seminal paper in the history of Machine Translation, and a must-read for anybody wanting to work in this area. It's not an easy read, and we spent quite a lot of time figuring out how the estimation results are derived. Some notes sprang out of these discussions; they work out, in detail, the steps missing in the original paper. Hopefully they will be useful for everybody. These scanned notes on the estimation for Model 1 and Model 2 can be found <a href="https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0BxsJNvcAVU0HZTg1MjM5ZGYtNjI5Yi00Y2Y3LTg4ZWUtZWY2ZTY4MmQ2ZTUy&hl=en_GB">here</a>. This is not a replacement for the original paper, but is just meant to supplement it. Thanks to <a href="http://www.cse.iitb.ac.in/~miteshk/">Mitesh</a> for helping out with the key steps in the derivation. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
You can find the notes <a href="https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0BxsJNvcAVU0HZTg1MjM5ZGYtNjI5Yi00Y2Y3LTg4ZWUtZWY2ZTY4MmQ2ZTUy&hl=en_GB">here</a><br />
<br />
Update: Finally I have created a PDF of the notes for Model 1 derivation. You can find them <a href="https://docs.google.com/open?id=0BxsJNvcAVU0HU1lETkdkeS0ybmc" target="_blank">here</a>. A few slides introducing SMT can be found <a href="https://docs.google.com/open?id=0BxsJNvcAVU0HUWdqbkN6OHNnQlk" target="_blank">here</a>. </div>
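To complement the derivation, here is a minimal sketch of the Model 1 EM estimation itself - the standard textbook EM loop, with a toy corpus, uniform initialization, and the NULL word omitted for brevity (this is not code from the notes):

```python
from collections import defaultdict

# Toy parallel corpus of (foreign, english) sentence pairs
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

# Initialize translation probabilities t(f|e) uniformly
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)  # expected counts c(f, e)
    total = defaultdict(float)  # normalizers per English word
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)  # normalization over e
            for e in es:
                c = t[(f, e)] / z           # fractional alignment count
                count[(f, e)] += c
                total[e] += c
    for (f, e) in list(count):
        t[(f, e)] = count[(f, e)] / total[e]  # M-step: renormalize

print(round(t[("das", "the")], 2), round(t[("haus", "house")], 2))
```

Even on this three-pair corpus, the expected counts quickly concentrate: t(das|the) and t(haus|house) grow toward 1 with each iteration, which is exactly the behavior the estimation derivation predicts.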
</div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-9754617660200947272010-12-21T22:46:00.002+05:302010-12-21T22:49:16.402+05:30Beauty of Language<p style="text-align: left;">Language is so ambiguous, and hence so difficult to analyze. I came across an extreme example the other day, which is representative of the ambiguity in dealing with language. The following sentence can have different meanings depending upon how it is spoken:<br /><br /><em>I didn't say he stole the money</em>.<br /><br />The change in meaning comes from varying which word is stressed while speaking. Here are a few interpretations of the sentence, with the stressed word in bold.<br /><br /><em><strong>I</strong> didn't say he stole the money</em><br />... someone else may have said it<br /><br /><em>I <strong>didn't</strong> say he stole the money</em><br />... the literal meaning<br /><br /><em>I didn't <strong>say</strong> he stole the money</em><br />... I just hinted or implied it<br /><br /><em>I didn't say <strong>he</strong> stole the money</em><br />... I didn't mean him<br /><br /><em>I didn't say he <strong>stole</strong> the money</em><br />... maybe he just borrowed it, with the intention of returning it</p><em>I didn't say he stole <strong>the</strong> money</em><br /><p style="text-align: left;"> ... not that money</p><em>I didn't say he stole the <strong>money</strong></em><br /><p style="text-align: left;"> ... 
not the money, I mean something else - xyz ...<br /><br />Most common situations may not be this extreme, but the example serves to highlight the challenges of understanding text, and currently the state of the art is just skimming the surface.</p><p style="text-align: left;">PS: Cross-posted from my <a href="http://peepaal.org/profiles/blogs/the-beauty-of-language">Peepaal blog post</a><br /></p>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-89748183995638664442010-01-26T13:55:00.003+05:302010-01-26T14:16:17.022+05:30Scalable Machine Learning - Apache Mahout<div style="text-align: justify;">Machine learning algorithms are pretty computationally intensive, work on huge amounts of data and take a lot of time to run. That makes them obvious candidates for running on data-parallel distributed programming models like Map-Reduce.<br /><br />Although Google's <a href="http://labs.google.com/papers/mapreduce.html">Map-Reduce paper</a> does talk about it, there was not much available in the public domain for doing machine learning at a distributed scale. <a href="http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf">Andrew Ng's paper</a> gives a common mathematical framework for modeling the most common machine learning algorithms so that they can be parallelized. It's basically built around the idea of representing computations as summations of simpler computations. Each simple computation can be a map task, with the final summation being the reduce task.<br /><br /><a href="http://www.ibm.com/developerworks/java/library/j-mahout/">Apache Mahout</a> is a project from the Apache Foundation that started off from Ng's paper and already has implementations of many ML algorithms running on Hadoop. 
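The summation idea is easy to see for linear regression, in the spirit of the regression example in Ng's paper: each mapper emits per-record sufficient statistics, and the reducer just sums them. The code below is an illustrative single-machine sketch (not Mahout's API):

```python
from functools import reduce

# Toy data: (features, target) pairs for a least-squares fit y ≈ w·x.
# The data is invented; the true weights here are (1, 2).
data = [((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]

def mapper(example):
    """Per-record sufficient statistics: x xᵀ (2x2) and x·y (2-vector)."""
    (x0, x1), y = example
    xxT = [[x0 * x0, x0 * x1], [x1 * x0, x1 * x1]]
    xy = [x0 * y, x1 * y]
    return xxT, xy

def reducer(a, b):
    """Elementwise sum of two partial statistics - the 'summation'."""
    (A1, v1), (A2, v2) = a, b
    A = [[A1[i][j] + A2[i][j] for j in range(2)] for i in range(2)]
    v = [v1[i] + v2[i] for i in range(2)]
    return A, v

A, v = reduce(reducer, map(mapper, data))
# Solve the 2x2 normal equations A w = v (Cramer's rule, for brevity)
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
w = [(A[1][1] * v[0] - A[0][1] * v[1]) / det,
     (A[0][0] * v[1] - A[1][0] * v[0]) / det]
print(w)  # → [1.0, 2.0]
```

Because the per-record statistics are summed associatively, the map calls can run on any number of machines and the reduce step only ever sees small fixed-size matrices - which is what makes the formulation Hadoop-friendly.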
In addition, Mahout also contains the Taste library for building recommendation and collaborative filtering systems.<br /><br />Hoping to read more on open-source ML and practical ML. A couple of books I am looking forward to reading:<br /><ul><li><a style="font-style: italic;" href="http://oreilly.com/catalog/9780596529321">Programming Collective Intelligence</a>, Toby Segaran</li><li><a style="font-style: italic;" href="http://www.manning.com/ingersoll/">Taming Text</a>, Grant S. Ingersoll and Thomas S. Morton</li></ul><br /></div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-31659849112010297612009-08-11T05:38:00.000+05:302009-08-11T05:39:42.816+05:30Book Review: The Numerati<div style="text-align: justify;">With the advent of the Web and the fall in the price of electronics, we have seen an explosion in digital data, from huge databases collecting various pieces of information to ever larger collections of documents. The <a href="http://www.amazon.com/Numerati-Stephen-Baker/dp/0618784608">Numerati</a> (a portmanteau of 'numbers' and 'Illuminati') are the statisticians, mathematicians, computer scientists, linguists and others involved in making sense of this data using sophisticated statistical techniques. 
The book describes the kind of problems being solved in the following areas, citing various examples from organizations like IBM, Intel, Umbria, etc.:<br /></div><ul style="text-align: justify;"><li>Workers - building employee profiles, understanding employee networks, using them for optimal use of resources</li><li>Shoppers - microtargeting shoppers using personal information to customize service, give recommendations and increase sales</li><li>Voters - understanding voter intent and issues, so that campaign messages can be targeted to focussed groups.</li><li>Bloggers - understanding public opinion from the information on the blogosphere, useful for gauging sentiment about products, etc.<br /></li><li>Medicine - Baker focusses on futuristic health monitoring (like floor tiles which capture your walking patterns!), whereas he totally ignores contemporary challenges and work in analyzing medical records, genomic and proteomic data.</li><li>Terrorism<br /></li><li>Match Making</li></ul><div style="text-align: justify;">All this comes at a cost. The Numerati have access to vast amounts of personal data, and we don't want an Orwellian Big Brother who is going to use it to learn about us, turn us into commodities and control our lives.<br /><br />That's about it in the book - it's a brisk read, and you can give it a miss if you are already familiar with the above topics.</div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-59288369595680425132009-08-11T05:36:00.000+05:302009-08-11T05:37:12.441+05:30Book Review: The Lady Tasting Tea<p align="justify">A lady claims that the taste of tea differs when milk is poured into the tea, as opposed to tea being added to a cup of milk. Everyone at the small party scoffs at the suggestion, except Ronald Aylmer Fisher. Fisher designs an experiment that would statistically establish the lady's claims. 
He creates a sample set containing cups of tea prepared in either way, and lo and behold - the story goes that the lady identifies each cup correctly. Fisher uses this example to explain the design of experiments in his book 'The Design of Experiments'. This anecdote sets up the book. '<a href="http://www.amazon.com/Lady-Tasting-Tea-Statistics-Revolutionized/dp/0805071342/ref=sr_1_1?ie=UTF8&s=books&qid=1249892070&sr=1-1">The Lady Tasting Tea</a>' is the story of the development of statistics, Fisher having built the pillars of statistics as it stands today.</p><p align="justify">I started reading this book while looking around to brush up my statistics; I thought it would be a good idea to know the history of the subject I am exploring. That's particularly relevant in sciences filled with uncertainties, like statistics, economics and linguistics, where the characteristics of individuals seem to contribute to the development of the theory, and there's a story behind things which seem arbitrary. </p><p align="justify">David Salsburg takes us through an entertaining journey starting with the earliest breakthroughs of Karl Pearson and William Gosset, going on to the pioneering foundational works of the acerbic genius Ronald Fisher, the cheerful Jerzy Neyman, and the multitalented Andrei Kolmogorov. Apart from these pioneers, Salsburg very vividly sketches the lives and contributions of Egon Pearson (hypothesis testing), Chester Bliss (probit analysis), John Tukey (exploratory data analysis), Frank Wilcoxon (non-parametric methods), EJG Pitman (non-parametric methods), Prasanta Chandra Mahalanobis (sampling theory), Samuel Wilks (founder of the Statistical Research Group, Princeton), George Box (robust statistics) and W. Edwards Deming (statistical quality control). </p><p align="justify">Some of the chapter names are interesting, and they are as good as the title of the book. 
It reminds me of <a href="http://www.amazon.com/Mythical-Man-Month-Software-Engineering-Anniversary/dp/0201835959/ref=sr_1_1?ie=UTF8&s=books&qid=1249892132&sr=1-1">'The Mythical Man-Month</a>''s memorable illustrative sketches. Sample this: </p><ul><li><div align="justify">The Mozart of Mathematics - Andrei Kolmogorov</div></li><li><div align="justify">The Picasso of Statistics - John Tukey</div></li><li><div align="justify">The March of the Martingales - on the work of Paul Levy</div></li></ul><p align="justify">Read this if you are a fan of scientific history. </p>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-31482024360657017122009-05-02T17:08:00.007+05:302009-05-02T18:44:35.462+05:30Text Engineering Frameworks<span style="font-weight: bold;font-size:100%;" >What is a text engineering framework? </span><br /><div style="text-align: justify;"><br />With the volume of unstructured text going through the roof, and the need to make sense of it, the efforts to analyze it have grown apace. Different software tools for language analysis and data mining have been developed, attacking myriad language analysis problems. While each system concentrates on solving the problem at hand, there remains the unenviable task of gluing these language technologies together. <span style="font-style: italic;">All language technologies need to worry about common problems like representation of data and metadata, modularization of the software components, and interaction between them.</span><br /><br />Each system takes its own approach to handling these problems, in addition to solving the central problem. This is where a text engineering framework steps in. 
What a text engineering framework provides is an architecture and out-of-the-box support for rapid development of highly modularized, scalable language technology components which can interface with other components - thus improving the process of creating language technology applications. The framework does all the plumbing necessary to create interesting language technology applications. <span style="font-style: italic;">A good analogy would be that the framework is the OS platform on which applications are built. </span><br /><br /><span style="font-weight: bold;font-size:100%;" >Architecture of a Text Engineering Framework</span><br /><br />While different systems may have their own architectures, the generic architecture described here is the one that forms the basis of the two most popular text engineering frameworks, <a href="http://gate.ac.uk">GATE</a> (General Architecture for Text Engineering) and <a href="http://incubator.apache.org/uima/">UIMA</a> (Unstructured Information Management Architecture). The two key services that the framework provides are data/metadata management services and analysis component development services.<br /><br /><span style="font-weight: bold;font-size:100%;" >Data Management Services</span><br /><br />The most important problem facing NLP tools is the management of data, hence the representation of data is given central importance in the framework. The basic unit of unstructured data to be analyzed is a <span style="font-style: italic; font-weight: bold;">Document</span>. This corresponds to a single artifact to be analyzed, like a single medical report, a news article, etc. The unstructured data need not be restricted to text; it could be audio, video or other multimedia data. The focus of this article is text, but most of the concepts elaborated here would apply to other media too. 
In NLP applications, it is common to process large collections of documents for analysis. The framework represents a collection of Documents by a <span style="font-style: italic; font-weight: bold;">Corpus</span> abstraction.<br /><br />Each NLP tool generates metadata for the Document. For instance, a tokeniser would generate tokens, a POS tagger would generate Part-Of-Speech tags for each token, a noun phrase chunker would identify noun phrase chunks and a named entity recognizer would generate labels for chunks of text. There needs to be a consistent method to represent all this metadata. This is achieved by using an <span style="font-weight: bold; font-style: italic;">Annotation</span> object, which represents metadata associated with a contiguous chunk of text. To illustrate the idea, consider the following sentence:<br />"<span style="font-style: italic;">In a perfect world</span><span style="font-style: italic;">, all the people would be like cats are, at two o'clock in the afternoon</span>."<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_-AgL2GKHKuFXBkRzY2y_TaNhkTKJHbJaL3PtfsjUdmtEGjgj_uar5O9c_WzSBX2pscNV4hd5Jw_gj5KUy8WFQWa9SLNfCESBbQRKQJ0u25tslkUBwleXUhADvEdZK2ru6gekKRca56E/s1600-h/annotation.jpeg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 477px; height: 109px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_-AgL2GKHKuFXBkRzY2y_TaNhkTKJHbJaL3PtfsjUdmtEGjgj_uar5O9c_WzSBX2pscNV4hd5Jw_gj5KUy8WFQWa9SLNfCESBbQRKQJ0u25tslkUBwleXUhADvEdZK2ru6gekKRca56E/s400/annotation.jpeg" alt="" id="BLOGGER_PHOTO_ID_5331210087128153058" border="0" /></a><br />The tokenizer would identify tokens, each token like "perfect" represented by an <span style="font-weight: bold;">Annotation</span>, whose type is "<span style="font-weight: bold; font-style: italic;">Token</span>". 
Each annotation has a start and end offset associated with it, which identifies its position in the <span style="font-weight: bold;">Document</span>. Information about the annotation can be stored in key-value pairs called <span style="font-weight: bold; font-style: italic;">Features</span>. This allows arbitrarily complex data to be associated with the annotation. For instance, the Token annotation could have a "string" feature to represent the text of the token, a "kind" feature to indicate if the token is a word, number, or punctuation, and a "root" feature which contains its morphological root.<br /><br />The scheme of representing metadata described above allows different kinds of metadata from different NLP components to be accessed and manipulated using the same interface. Positional information about the metadata can be captured, and arbitrarily complex data can be associated - since the feature values could be complex objects themselves. Annotations can be added at various levels of detail to the same chunk of text. For instance, the phrase "<span style="font-style: italic;">a perfect world</span>" can have "Token" annotations for each token, "POS" annotations to represent part-of-speech information for each token, and an "NP" annotation over the entire phrase to represent a noun phrase chunk.<br /><br />It should now be obvious that the annotations constitute a data exchange format between various NLP components, used to build more complex analyses of the text. An entire declarative type system can be built using these annotations for an application, as is done in UIMA. It is possible to do pattern matching over these annotations, as provided by the JAPE language in GATE. The frameworks provide implementations of these abstractions, thus freeing applications from the data management chores.<br /><br />The architecture described above evolved during the TIPSTER conferences. 
One of the popular ways of serializing this data is XML stand-off markup, which separates the annotation metadata from the data.<br /><br /><span style="font-weight: bold;font-size:100%;" >Text Analysis Development Services</span><br /><br />NLP applications generally consist of a number of steps, each doing some part of the analysis and building upon the analysis done in the previous stage. To support this application development paradigm, the framework represents each NLP task by a processing resource (PR). The PR is a component which performs a single task like tokenizing, POS tagging, or something even simpler like mapping one set of annotations to another (for adaptation purposes). The data interface to the PR is specified by the kind of input annotations that it requires and the annotations it generates. For instance, the POS tagger requires "Token" annotations as input and generates "POS" annotations as output. The PR's role can be more accurately characterized as an annotator. Each PR is a reusable software component that can be used in creating NLP applications. The same POS tagger can be used in different applications as long as its input and output requirements are satisfied. A number of PRs can be strung together to create a pipeline. 
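In Python-flavoured pseudocode (the class and method names here are illustrative, not GATE's or UIMA's actual APIs), the Document/Annotation/PR abstractions might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Metadata over a contiguous span of the document text."""
    type: str            # e.g. "Token", "POS", "NP"
    start: int           # character offset where the span begins
    end: int             # character offset where the span ends
    features: dict = field(default_factory=dict)  # key-value metadata

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

class Tokenizer:
    """A minimal processing resource (PR): adds 'Token' annotations."""
    def process(self, doc):
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            doc.annotations.append(
                Annotation("Token", start, start + len(word),
                           {"string": word}))
            pos = start + len(word)
        return doc

def run_pipeline(doc, prs):
    """A sequential pipeline: each PR consumes/produces annotations."""
    for pr in prs:
        doc = pr.process(doc)
    return doc

doc = run_pipeline(Document("In a perfect world"), [Tokenizer()])
print([a.features["string"] for a in doc.annotations])
# → ['In', 'a', 'perfect', 'world']
```

A POS tagger PR would slot into the same `prs` list after the tokenizer, reading the "Token" annotations and adding "POS" ones - which is exactly the loose coupling the frameworks are designed for.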
An example of an NP-chunking pipeline is shown below.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcJTtlZNf_h3c4RqsyzpoXO1Mbjf1bJmG8h5ri5ppRJ3bh3zcxAbSIoQiq7KDeUHrkr0PBRGWS1GI1B5Wo58_SjXcCrm9FeECykjd9ACoa5UldMGXgmLrVDxhm8-eZ9Hv24rMqJ97mOv0/s1600-h/pipeline.jpeg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 217px; height: 193px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcJTtlZNf_h3c4RqsyzpoXO1Mbjf1bJmG8h5ri5ppRJ3bh3zcxAbSIoQiq7KDeUHrkr0PBRGWS1GI1B5Wo58_SjXcCrm9FeECykjd9ACoa5UldMGXgmLrVDxhm8-eZ9Hv24rMqJ97mOv0/s400/pipeline.jpeg" alt="" id="BLOGGER_PHOTO_ID_5331212389277518642" border="0" /></a><br />This is a sequential pipeline, but you can also imagine conditional, looped and other pipeline configurations. The scheme described above constitutes a modular, loosely-coupled architecture for a text engineering application. Each PR in the pipeline may be replaced by an equivalent PR as long as it satisfies the data interface requirements, allowing you to test different configurations. The framework defines the common interfaces for PRs, provides different pipeline implementations and allows for declarative specification of PRs and pipelines. In a nutshell, the framework provides all the plumbing required to build an NLP application, while the developer can focus on developing the smart innovations.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;font-size:100%;" >Other facilities provided by the framework</span><br /><br /></span>To make application development easier:<br />1. The framework provides visual tools for managing language resources, creating pipelines, running applications, observing annotations, editing annotations, and creating training sets.<br />2. 
The framework may ship with off-the-shelf components for common NLP tasks like tokenization, sentence identification, dictionary lookups, POS tagging, machine learning interfaces, etc. This allows rapid prototyping of applications using these ready-to-use components. GATE, for example, ships with the ANNIE toolkit.<br />3. The framework developers maintain a component repository, which allows the developer community to share the reusable PRs they develop and make use of the work done by others.<br /><br />In summary, if you are developing NLP applications, you should use a text engineering framework to make use of the wealth of components that have already been developed, increase productivity and build NLP applications which are modular and loosely coupled.</div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com2tag:blogger.com,1999:blog-1879978874853957111.post-1808898534529028192009-04-26T00:06:00.005+05:302009-04-26T14:34:23.337+05:30De-Identification of Personal Health Information<div style="text-align: justify;">I recently started some work on de-identification of personal health information, and thought of putting together this primer on de-identification.<br /><br />Medical researchers often need access to patients' medical records for their investigations. However, these records may contain information that compromises the identity of the individual and thus violates his right to privacy. It is thus required that personal health information (PHI) be removed from medical records when they are released to the larger research community. The <a href="http://privacyruleandresearch.nih.gov/pr_02.asp">HIPAA regulation</a> lays down the rules for the handling of PHI.<br /><br />Under HIPAA, PHI must be removed from the medical records before releasing them to the research community. 
Thus, any information that may reveal the identity of the patient, like his name, address, doctor's name, social security number, telephone numbers, etc., must be removed. This process of removing PHI from medical records is termed de-identification.<br /><br />There are 18 PHI identifiers that must be de-identified to meet HIPAA regulations. These include names, addresses, etc. (<a href="http://cphs.berkeley.edu/content/hipaa/hipaa18.htm">Entire list here</a>). Identifying these elements poses an interesting text mining problem. Identifying names may seem to be a Named Entity Recognition task, but there are additional complexities involved - a device or a disease named after a person is not PHI, and it would be a loss of valuable information to the researcher if it were removed. Addresses are a challenge to de-identify sufficiently to prevent re-identification. There is a wide range of identifiers that must be recognized - SSNs, MRNs, admission numbers, accession numbers, telephone/fax numbers, room numbers, etc. - out of the many numbers that a report may contain. What makes the task challenging is that a very high recall must be obtained to ensure compliance, while at the same time making sure that there aren't too many false positives, which would remove valuable, non-PHI information.<br /><br />A number of rule-based as well as statistical systems have been developed to tackle the problem. You can find a good survey of the research work in this <a href="http://www.citeulike.org/user/anoop_kunchukuttan/article/4313105">paper</a>. 
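To make the rule-based approach concrete, here is a toy sketch in Python. The patterns are illustrative only - real de-identification systems combine far richer rules, dictionaries and context checks to reach the recall that compliance demands:

```python
import re

# Toy patterns for a few of the 18 HIPAA identifier classes.
# Illustrative only: real systems cover many more formats per class.
PHI_PATTERNS = [
    ("SSN",   re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("MRN",   re.compile(r"\bMRN[:#]?\s*\d{6,10}\b")),
    ("DATE",  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]

def deidentify(text):
    """Replace each matched PHI span with a bracketed category tag."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text
```

Keeping the category tag in the output (rather than deleting the span outright) preserves some utility for the researcher, who can still see what kind of information was removed.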
Here are a few de-identification systems that are available:<br /></div><ul style="text-align: justify;"><li><a href="http://www.physionet.org/physiotools/deid/">PhysioNet DeId</a> (Open Source)<br /></li><li><a href="http://spin.nci.nih.gov/content/HMS_Scrubber_v1.0b.zip">Harvard Medical School Scrubber</a> (Open Source)</li><li><a href="http://www.de-idata.com/">Data Corp DeId</a> (Commercial)</li></ul><div style="text-align: justify;">For research purposes, a gold standard data set containing surrogate PHI data is available on the <a href="http://www.physionet.org/physiotools/deid/#data">PhysioNet page</a>.</div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-21560223548605461362009-04-25T22:39:00.002+05:302009-04-25T23:46:15.403+05:30Yet Another Blog On Organizing InformationData and information everywhere. The digital age is generating so much information, that it has fast outgrown our ability to comprehend it. 'Information Overload', we call it. These are the questions that are posed to us:<br /><ul><li>How do I find information that I want?</li><li>What information is relevant to my need?</li><li>Ok, this is way too much information than I can handle. I would like to have summary of the same.</li><li>In this huge infobase, is there some useful information that isn't obvious? Some patterns, trends that may be useful.</li><li>There are a lot of smart people generating content. How can the collective intelligence of these people augment my search for information? </li></ul>These questions have had us hooked for a long time, and so have the solutions people have developed to tackle these questions. 
Search engines to help you find information, business intelligence tools to find patterns in huge volumes of data, information extraction systems to summarize information in human-generated content, recommendation systems to bring information relevant to your need, and the study of social networks to harness the "collective intelligence" of the crowd.<br /><br />The rabbit hole goes deeper. These solutions are built on the more fundamental sciences of statistics, pattern recognition, artificial intelligence and natural language understanding.<br /><br />This is not the end, for the more fundamental questions we are posed with are about the nature of cognition, the understanding of language, the organization of knowledge and the active role of the human observer in the perception of information. I think this is the holy grail that we are all in pursuit of.<br /><br />We are beginners in this exciting field. This is a place to share what we learn, what we do and to benefit from the "collective intelligence" of all who visit this page.<br /><br />While the challenges span many problems, there are some that we are currently working on. Dhaval currently works on optimizing ad networks and takes an active interest in search engines. I currently work on information extraction from text and medical informatics. So for now, you may find a certain bias towards these and related topics on this blog.Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com2