<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-1879978874853957111</id><updated>2012-02-17T08:06:37.665+05:30</updated><category term='installation'/><category term='books'/><category term='text_engineering'/><category term='watson'/><category term='language'/><category term='india'/><category term='text_mining'/><category term='machine_translation'/><category term='book-review'/><category term='linguistics nlp machine_translation language divergence'/><category term='medical_informatics'/><category term='text_mining language peepaal'/><category term='jeopardy'/><category term='UIMA'/><category term='SMT IBM model1 model2'/><category term='smt'/><category term='history'/><category term='deepqa'/><category term='moses'/><category term='GATE'/><category term='question_answering'/><category term='smt factored_model'/><category term='statistics'/><category term='nlp alignment'/><category term='parallel_programming machine_learning distributed_programming apache mahout'/><title type='text'>On Organizing Information</title><subtitle type='html'>About making sense of unstructured information</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Anoop 
Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>14</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-9125210774220028543</id><published>2012-02-12T12:15:00.000+05:30</published><updated>2012-02-12T12:15:28.897+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='india'/><category scheme='http://www.blogger.com/atom/ns#' term='language'/><title type='text'>Indian English</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;From Chandan Mitra's weekly column in the Pioneer, some hilarious examples of English usage:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In a newspaper, describing a case of chain-snatching in which criminals shot dead the man who tried to resist and pursue the chain-snatchers, the reporter stated: “The deceased gave chase to the criminals who, however, managed to escape”!&lt;br /&gt;&lt;br /&gt;Police notice: “Take care of belongings. 
You may be theft”&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The article is interesting reading too.&lt;br /&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-family: arial, sans-serif; font-size: 13px;"&gt;&lt;a href="http://dailypioneer.com/columnists/item/51044-dont-fast-you-may-be-theft-indlish-is-on-a-roll.html" style="color: #1155cc;" target="_blank"&gt;http://dailypioneer.com/&lt;wbr&gt;&lt;/wbr&gt;columnists/item/51044-dont-&lt;wbr&gt;&lt;/wbr&gt;fast-you-may-be-theft-indlish-&lt;wbr&gt;&lt;/wbr&gt;is-on-a-roll.html&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-9125210774220028543?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/9125210774220028543/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2012/02/indian-english.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/9125210774220028543'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/9125210774220028543'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2012/02/indian-english.html' title='Indian English'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-8642583263368368566</id><published>2012-01-14T22:35:00.001+05:30</published><updated>2012-01-14T22:35:49.779+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='smt'/><category scheme='http://www.blogger.com/atom/ns#' term='installation'/><category scheme='http://www.blogger.com/atom/ns#' term='moses'/><category scheme='http://www.blogger.com/atom/ns#' term='machine_translation'/><title type='text'>Yet Another Moses Installation Guide</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div style="text-align: justify;"&gt;Though Moses is a versatile MT system, its installation is still from the Stone Age. Let me document here some of the key points for navigating the installation of Moses. The intent is not to present a complete installation guide, but to highlight key issues that may crop up (as they cropped up for me). For a complete installation, &lt;a href="http://www.statmt.org/moses_steps.html"&gt;this&lt;/a&gt; is probably the best guide. Another useful installation guide can be found &lt;a href="http://www.cfilt.iitb.ac.in/Moses-Tutorial.pdf"&gt;here&lt;/a&gt;. 
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;To install the Moses system, the following tools need to be installed.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;Language modelling toolkit (SRILM, IRSTLM, etc.)&lt;/li&gt;&lt;li&gt;GIZA++ package, which contains GIZA++ and mkcls&lt;/li&gt;&lt;li&gt;Moses decoder&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;SRILM installation&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;The primary installation reference is the INSTALL document that ships with the tool.&lt;/li&gt;&lt;li&gt;Install all pre-requisites mentioned in the SRILM installation guide. On Ubuntu I had to install the following packages: csh, g++-multilib, tcl-dev&lt;/li&gt;&lt;li&gt;Set the environment variable SRILM to point to the base directory of the install package before building SRILM.&lt;/li&gt;&lt;li&gt;Following the instructions that ship with the SRILM download should be enough once the pre-requisites are installed. &amp;nbsp; &amp;nbsp;&lt;/li&gt;&lt;li&gt;The problems you may yet face are:&amp;nbsp;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Problems identifying the architecture, especially if it is a 64-bit machine. To make sure that the install script correctly identifies the architecture, set the variable MACHINE_TYPE in sbin/machine-type.&lt;/li&gt;&lt;li&gt;Problems with TCL compilation. You may not need the TCL user interfaces at all, so it may be fine to simply disable their compilation. 
Set the variable NO_TCL = X in the file common/your_architecture_specific_makefile.&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Make sure you have added $SRILM/bin and $SRILM/bin/$MACHINE_TYPE to the PATH variable&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;GIZA++ and mkcls installation&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;You get both if you download the giza-pp tool.&amp;nbsp;&lt;/li&gt;&lt;li&gt;This is the most straightforward installation. Download and 'make'.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Moses installation&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;The primary installation reference is the INSTALL document that ships with the tool.&lt;/li&gt;&lt;li&gt;SRILM or IRSTLM needs to be installed before Moses is installed&lt;/li&gt;&lt;li&gt;Make sure you have installed the packages automake and libtool&lt;/li&gt;&lt;li&gt;It is then a matter of just following the instructions. In case you have not figured it out yet, the build instructions don't mention that you need to run 'make' after 'configure' :)&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Putting it all together&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;Now, we need to get all these tools to work together.&amp;nbsp;&lt;/li&gt;&lt;li&gt;The first step is to create a directory 'bin' and copy all the GIZA++ binaries into it: GIZA++, mkcls, snt2cooc.out&lt;/li&gt;&lt;li&gt;Next, you need to tell Moses where these GIZA++ binaries are and 'release' the scripts distributed with Moses. For this, edit $MOSES_ROOT/scripts/Makefile.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Set TARGETDIR to the directory where you want the Moses scripts to be installed.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Set the BINDIR variable to the directory where the GIZA++ binaries were copied. 
&amp;nbsp;&amp;nbsp;&lt;/li&gt;&lt;li&gt;Do a 'make' in the $MOSES_ROOT/scripts directory. This should install the scripts.&amp;nbsp;&lt;/li&gt;&lt;li&gt;export SCRIPTS_ROOTDIR=(dir_where_moses_scripts_were_installed)&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;Now Moses is ready to cross the Red Sea.&amp;nbsp;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-8642583263368368566?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/8642583263368368566/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2012/01/yet-another-moses-installation-guide.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/8642583263368368566'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/8642583263368368566'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2012/01/yet-another-moses-installation-guide.html' title='Yet Another Moses Installation Guide'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-7459836301168723951</id><published>2011-09-23T20:17:00.000+05:30</published><updated>2011-09-23T20:17:42.474+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='smt 
factored_model'/><title type='text'>Incorporating Linguistic Information into SMT Models</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;(Summary of the chapter 'Integrating Linguistic Information' in Philipp Koehn's textbook &lt;a href="http://www.statmt.org/book/"&gt;'Statistical Machine Translation'&lt;/a&gt;)&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;Traditional phrase-based Statistical Machine Translation (SMT) has relied only on the surface form of words, but this can carry you only so far. Without considering any linguistic phenomena, there is no generalization possible and the SMT system ends up being a translation memory. Various kinds of linguistic information need to be incorporated into the SMT process, such as:&amp;nbsp;&lt;/div&gt;&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li style="text-align: justify;"&gt;Name transliteration and number script conversions&lt;/li&gt;&lt;li style="text-align: justify;"&gt;Morphological changes (inflection, compounding, segmentation), which lead to data sparsity if left unhandled&lt;/li&gt;&lt;li style="text-align: justify;"&gt;Syntactic phenomena like constituent structure, attachment, head-modifier re-orderings. Vanilla SMT is designed to handle local re-orderings, but long-range dependencies are not handled well.&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;One way to handle them is to pre-process the parallel corpus before training and then run the SMT tools. Pre-processing could include:&lt;/div&gt;&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li style="text-align: justify;"&gt;Transliteration and back-transliteration models need to be incorporated. 
An important problem is to identify the named entities in the first place.&lt;/li&gt;&lt;li style="text-align: justify;"&gt;Splitting words for a morphologically rich input language. Compounding and segmentation can be handled similarly.&amp;nbsp;&lt;/li&gt;&lt;li style="text-align: justify;"&gt;Re-ordering problems can be handled by re-ordering the input-language sentences in a pre-processing step before feeding them to the SMT system. This re-ordering can be done either by handcrafted rules or by rules learnt from data. The rules could be shallow, like POS-tag-based re-ordering rules, or based on a full-fledged parse.&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;Similarly, some work may be done on the post-processing side:&amp;nbsp;&lt;/div&gt;&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li style="text-align: justify;"&gt;If the output language is morphologically complex, then the morphological generation can take place in the post-processing step after SMT. This assumes that the SMT system has generated enough information to be able to generate output morphology.&lt;/li&gt;&lt;li style="text-align: justify;"&gt;Alternatively, in order to ensure grammaticality of the output sentences, we can re-rank the candidate translations on the output side based on syntactic features like agreement and parse correctness. Note that a distinction is made between syntactic parse quality as defined for parsing and as required for MT systems.&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;The problem with such pre-processing and post-processing components is that they are themselves prone to error. Such a pipeline does not handle the errors of its components in an integrated framework, and necessitates the use of hard decision boundaries. A probabilistic approach which incorporates all these pre- and post-processing components would be cleaner and more elegant. 
That is the motivation behind &lt;a href="http://acl.ldc.upenn.edu/D/D07/D07-1091.pdf"&gt;the factored translation model&lt;/a&gt;. In this model, the factors are basically annotations on the input and output words (e.g. morphology, POS factors). Translation and generation functions are defined on the factors, and these are integrated using a log-linear model. This provides a structured way to test a diverse set of features. Of course, the size of the phrase translation table will now grow, but this can be handled by using pre-compiled data structures. Decoding could also blow up, but pruning can be used to cut the search space.&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-7459836301168723951?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/7459836301168723951/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2011/09/incorporating-linguistic-information.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/7459836301168723951'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/7459836301168723951'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2011/09/incorporating-linguistic-information.html' title='Incorporating Linguistic Information into SMT Models'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-2458068278522634114</id><published>2011-09-23T18:09:00.000+05:30</published><updated>2011-09-23T18:10:00.738+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='linguistics nlp machine_translation language divergence'/><title type='text'>Language Divergence between English and Hindi</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Comparing two languages is interesting, especially for an application like machine translation. Languages exhibit so many differences that it is mind-boggling to realize that we navigate between them with ease. This paper,&amp;nbsp;&lt;a href="http://www.springerlink.com/content/t1005w166746727l/"&gt;'Interlingua-based English–Hindi Machine&amp;nbsp;Translation and Language Divergence'&lt;/a&gt;, summarizes the major differences between Hindi and English.&lt;br /&gt;&lt;br /&gt;I have tried to tabulate the observations in the paper below, to make a handy reference:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table cellspacing="0" cols="3" frame="VOID" rules="NONE"&gt;	&lt;colgroup&gt;&lt;col width="230"&gt;&lt;/col&gt;&lt;col width="357"&gt;&lt;/col&gt;&lt;col width="373"&gt;&lt;/col&gt;&lt;/colgroup&gt;	&lt;tbody&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;" width="230"&gt;&lt;b&gt;Factor&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;" width="357"&gt;&lt;b&gt;English&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid 
#000000; border-top: 1px solid #000000;" width="373"&gt;&lt;b&gt;Hindi&lt;/b&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Word Order&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Subject-Verb-Object&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Subject-Object-Verb&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;Ram &lt;b&gt;ate&lt;/b&gt; the mango&lt;/i&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span 
style="font-family: 'Lohit Hindi';"&gt;राम ने आम &lt;b&gt;खाया &lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Modifiers&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Post modifier&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Premodifier&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;The Prime Minister of India&lt;/i&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 
'Lohit Hindi';"&gt;भारत का प्रधान मंत्री &lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;play well&lt;/i&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;अच्छे से खेलेंगे&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;X-positions&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Prepositions&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid 
#000000;"&gt;Postpositions&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;of India &lt;/i&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;भारत का&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Overloading&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Times New Roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;John ate rice with curd&lt;/i&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid 
#000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Times New Roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;John ate rice with a spoon&lt;/i&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Times New Roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Compound Verbs&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;not prevalent&lt;/td&gt;			&lt;td align="LEFT" 
style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;very common&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Conjunct Verbs&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;not prevalent&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;very common&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid 
#000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;वह गाने लगे&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;रुक जाओ&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Times New Roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Respect&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;No special words&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px 
solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Words indicating respect&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;आप, हम&amp;nbsp;&lt;/span&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="18" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Person&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; 
border-top: 1px solid #000000;"&gt;Uses 2nd person for 3rd person&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;He obtained his degree&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;आपने&amp;nbsp;&amp;nbsp;अम्रीका से डिग्री प्राप्त की&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Gender&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Masculine, feminine, neuter&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px 
solid #000000; border-top: 1px solid #000000;"&gt;Masculine, feminine&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Gender-specific possessive pronouns&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;English has them&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Hindi lacks them&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;he, she&lt;/i&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Lohit 
Hindi';"&gt;वह &lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Morphology&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Poor&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Rich&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid 
#000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Null subject divergence&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Subject dropped in certain conditions&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;There was a king&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;एक राजा था &lt;/span&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;I am going&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;जा रहा हूँ&amp;nbsp;&lt;/span&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; 
border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Pleonastic divergence&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Pleonastic dropped&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;It is raining&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;बारिश हो रही है&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 
1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Conflational divergence&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;no appropriate word&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;Brutus stabbed Caesar&lt;/i&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;ब्रूटस&amp;nbsp; ने सीसर को छुरे से मारा&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; 
border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Categorical divergence&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;change in POS category&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;They are competing&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;वे मुकाबला कर रहे है &lt;/span&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid 
#000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;Head swapping&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;br /&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;Head and modifier are exchanged&lt;/td&gt;		&lt;/tr&gt;&lt;tr&gt;			&lt;td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;The play is on&lt;/i&gt;&lt;/td&gt;			&lt;td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"&gt;&lt;i&gt;&lt;span style="font-family: 'Lohit Hindi';"&gt;खेल चल रहा है &lt;/span&gt;&lt;/i&gt;&lt;/td&gt;		&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/1879978874853957111-2458068278522634114?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/2458068278522634114/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2011/09/language-divergence-between-english-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/2458068278522634114'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/2458068278522634114'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2011/09/language-divergence-between-english-and.html' title='Language Divergence between English and Hindi'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-8734316555676814809</id><published>2011-09-21T21:48:00.000+05:30</published><updated>2011-09-21T21:48:22.839+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='nlp alignment'/><title type='text'>Aligning Sentences to build a parallel corpus</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div style="text-align: left;"&gt;This is a &lt;a href="http://dl.acm.org/citation.cfm?id=972455"&gt;really old paper&lt;/a&gt;, from Gale &amp;amp; Church, on building a sentence aligned parallel corpus from a misaligned corpus.&amp;nbsp;A dynamic 
programming formulation with a novel distance measure is used to align the sentences. For a method as simple as this, the results reported on the Hansards corpus are impressive. Of course, the input corpus is already paragraph-aligned.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;The basic premise is simple: sentences with fewer characters in one language tend to have fewer characters in the other language, and correspondingly for longer sentences. Based on this idea, the distance between two sentences is defined in terms of a random variable X: the number of characters in language L2 per character of language L1.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;I looked at the behavior of this variable for the English-Hindi language pair. On a 14,000-sentence parallel corpus, here are the results:&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;mean(X): 0.99, i.e. almost one Hindi character per English character, which agrees with the paper's claims. Interestingly, if whitespace is not counted, the mean drops to 0.96.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: left;"&gt;variance(X):&amp;nbsp;0.01979136 - very low, so the mean is very reliable. 
A linear fit can't get better than this:&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-5uoTelxLbP4/TnoOBgTAWOI/AAAAAAAAB3g/9MirE5T6MuE/s1600/Screenshot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="292" src="http://4.bp.blogspot.com/-5uoTelxLbP4/TnoOBgTAWOI/AAAAAAAAB3g/9MirE5T6MuE/s320/Screenshot.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;a href="http://code.google.com/p/nltk/source/browse/trunk#trunk%2Fnltk_contrib%2Fnltk_contrib%2Falign"&gt;NLTK provides an implementation&lt;/a&gt; of the Gale-Church alignment algorithm. I tried running it on an absolutely parallel corpus, but the algorithm ends up misaligning the sentences. Reducing mean(X) to 0.9 also did not help. 
Wonder what's going on?&amp;nbsp;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-8734316555676814809?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/8734316555676814809/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2011/09/aligning-sentences-to-build-parallel.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/8734316555676814809'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/8734316555676814809'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2011/09/aligning-sentences-to-build-parallel.html' title='Aligning Sentences to build a parallel corpus'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-5uoTelxLbP4/TnoOBgTAWOI/AAAAAAAAB3g/9MirE5T6MuE/s72-c/Screenshot.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-4772151305669499846</id><published>2011-08-31T10:00:00.004+05:30</published><updated>2011-08-31T10:31:29.139+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='watson'/><category scheme='http://www.blogger.com/atom/ns#' term='jeopardy'/><category 
scheme='http://www.blogger.com/atom/ns#' term='text_mining'/><category scheme='http://www.blogger.com/atom/ns#' term='deepqa'/><category scheme='http://www.blogger.com/atom/ns#' term='question_answering'/><title type='text'>Watson - The Quiz Champion</title><content type='html'>&lt;p style="text-align: justify;"&gt;You must have heard of IBM's Watson system. It is, of course, the computer that won the Jeopardy competition against the show's previous champions. Jeopardy is a popular quiz show in which the competitors are given clues and must respond with questions that satisfy them. For example, a clue like '&lt;em&gt;This computer beat the reigning world chess champion&lt;/em&gt;' would elicit the question '&lt;em&gt;Who is Deep Blue?&lt;/em&gt;'. As you can see, the questions given by the competitors are simple ones of the form '&lt;em&gt;What is&lt;/em&gt;', '&lt;em&gt;Who is&lt;/em&gt;', so the Jeopardy question-answer format can be treated like that of any other quiz show. The clues, however, are complex, cover a wide array of topics, and can include puns, puzzles, and maths. The competitors also place bets on each question. Competing at 'Jeopardy' thus requires the right combination of 'natural language understanding, broad knowledge, confidence and strategy'.  &lt;/p&gt;&lt;p style="text-align: justify;"&gt;Watson's victory thus represents a major milestone for natural language processing, and particularly for the sub-area known as 'Question-Answering'. Question-Answering systems have great practical use in building expert systems, customer support systems, decision-making tools, and enterprise search systems. 
&lt;/p&gt;&lt;p style="text-align: justify;"&gt;Watch Watson's winning performance here: &lt;/p&gt;&lt;iframe width="420" height="345" src="http://www.youtube.com/embed/qpKoIfTukrA?wmode=opaque" frameborder="0"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;iframe width="560" height="345" src="http://www.youtube.com/embed/YLR1byL0U8M?wmode=opaque" frameborder="0"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p style="text-align: justify;"&gt;This paper, &lt;a href="http://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf" target="_blank"&gt;Building Watson: An Overview of the DeepQA project&lt;/a&gt;, from IBM provides an overview of Watson and the DeepQA architecture that underlies it. The DeepQA architecture defines a framework for developing QA systems in an extensible and modular manner, allowing individual components to be customized and making it possible to build robust QA systems that can be ported across domains. Figure 1 shows a high-level diagram of Watson's major components and how queries are routed through them.&lt;/p&gt;&lt;ol&gt;&lt;li style="text-align: justify;"&gt;&lt;strong&gt;Query Analysis&lt;/strong&gt;: This is the first stage, where the input clue is analyzed to determine the question category (puzzle, pun, mathematical, numeric, logical, etc.) and the answer type (person, location, organization, etc.). Complex clues are also decomposed into simpler clues. &lt;/li&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;li style="text-align: justify;"&gt;&lt;strong&gt;Hypothesis Generation&lt;/strong&gt;: Watson has at its disposal many sources of information, such as encyclopedias, books, and lists of people, countries, etc. Watson does not attempt to get the correct answer straightaway. Instead, it first focuses on generating as many candidate answers, called 'hypotheses', as possible. This is to ensure that good answers are not missed in the pursuit of the perfect answer. 
The aim at this stage is to maximize recall. &lt;/li&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;li style="text-align: justify;"&gt;&lt;strong&gt;Soft Filtering:&lt;/strong&gt; Watson may generate hundreds or even thousands of hypotheses, which then have to be analyzed in detail to find the correct answer. To limit this deep analysis to only the most relevant answers, Watson filters out the bad candidates using a few lightweight checks, such as a mismatch between the expected and candidate answer types. &lt;/li&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;li style="text-align: justify;"&gt;&lt;strong&gt;Hypothesis and Evidence scoring:&lt;/strong&gt; Now Watson does a deep analysis of the candidate answers by employing sophisticated linguistic and statistical techniques, and looks to gather evidence for each hypothesis. This is one of the most critical parts of Watson, since the evidence collected determines how good the answer is and how confident Watson can be about it. &lt;/li&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;li style="text-align: justify;"&gt;&lt;strong&gt;Merging and Ranking:&lt;/strong&gt; Once the evidence is collected, confidence scores are generated for each candidate and the candidates are ranked. Then, based on the top answer's confidence level, Watson decides whether or not it should answer the question. 
&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;p&gt;&lt;a href="http://api.ning.com/files/Iyak1H5Usv*ZxLRFvSNVrz-N9VCzet5yUboI6L0ZbUNHv-GhmwvoCsrdFwtv4YUCukQpoKd8JjQSLZP2Y2UqMPIL4m9sTOqs/deepQA.png" target="_self"&gt;&lt;img src="http://api.ning.com/files/Iyak1H5Usv*ZxLRFvSNVrz-N9VCzet5yUboI6L0ZbUNHv-GhmwvoCsrdFwtv4YUCukQpoKd8JjQSLZP2Y2UqMPIL4m9sTOqs/deepQA.png?width=600" width="600" class="align-full" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt;Figure 1: DeepQA Architecture (Source: The IBM paper)&lt;/p&gt;&lt;p style="text-align: justify;"&gt;The flexibility in the DeepQA architecture is achieved through the use of the UIMA text analysis framework. At one point in the trials, Watson was taking about two hours to generate an answer. The answer was to parallelize Watson with UIMA-AS and this got the response time down to the quiz show's average of 2 to 5 seconds. The improvement in accuracy is even more startling. When the IBM team stared working on Watson, the difference between the show's participants and early prototypes of Watson was huge. Figure 2 depicts the evolution in Watson's performance. It started from the baseline where the precision and recall were nowhere near the cloud of points corresponding to actual human competitors, but gradually reached human level performance. &lt;/p&gt;&lt;p&gt;&lt;a href="http://api.ning.com/files/NOSv*YHdv4D-SexdVctwURtnjiC*vHpyd4Gp8EUPdqDE4Y-LLCxi9cDD*5kBeeUnXV-yGZW4adeYjwLviUOash2z1diPgL33/watsonimprovemtn.png" target="_self"&gt;&lt;img src="http://api.ning.com/files/NOSv*YHdv4D-SexdVctwURtnjiC*vHpyd4Gp8EUPdqDE4Y-LLCxi9cDD*5kBeeUnXV-yGZW4adeYjwLviUOash2z1diPgL33/watsonimprovemtn.png?width=600" width="600" class="align-full" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p style="text-align: center;"&gt; Figure 2: Watson's accuracy over time (Source: The IBM paper)&lt;/p&gt;&lt;p style="text-align: justify;"&gt;What enabled Watson to reach this level of performance? 
Many of the underlying analysis algorithms aren't new; they have been around in the research community for a long time. More than groundbreaking original research, it is pragmatic engineering that lies at the core of Watson's success. The following are the salient contributing factors:&lt;/p&gt;&lt;ul&gt;&lt;li style="text-align: justify;"&gt;Building an end-to-end system: Very early on, the team built a baseline end-to-end system and then kept iterating on and improving it. They defined end-to-end evaluation metrics that captured the performance of the system as a whole, rather than focusing only on individual component accuracies in the initial stages. This helped them make the correct trade-offs.  &lt;/li&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;li style="text-align: justify;"&gt;Pervasive Confidence estimation: Every component in Watson gives a confidence estimate along with its response. This is critical, since these confidence scores can be aggregated to get the final confidence in the answers, and it allows easy integration of components of varying accuracy. The rule is that no component is assumed to be perfect, but each makes available its confidence estimate for its answers. &lt;/li&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;li style="text-align: justify;"&gt;Many experts: There may be competing algorithms for the same task. Rather than using only the best one, the system uses multiple algorithms so as to get diverse results and evidence. The confidence estimates help to blend the diverse results. 
&lt;/li&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;li style="text-align: justify;"&gt;Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.&lt;/li&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;li style="text-align: justify;"&gt;Massive parallelism:  As mentioned, exploiting massive parallelism allows looking through a large number of hypotheses.&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;(PS: Cross-posted from &lt;a href="http://peepaal.org/profiles/blogs/watson-the-quiz-champion"&gt;my Peepaal blog post&lt;/a&gt;)&lt;/div&gt;&lt;ul&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-4772151305669499846?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/4772151305669499846/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2011/08/you-must-have-heard-of-ibms-watson.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/4772151305669499846'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/4772151305669499846'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2011/08/you-must-have-heard-of-ibms-watson.html' title='Watson - The Quiz Champion'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://img.youtube.com/vi/qpKoIfTukrA/default.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-2140195392141908684</id><published>2011-07-20T19:38:00.005+05:30</published><updated>2011-07-20T20:20:39.061+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='SMT IBM model1 model2'/><title type='text'>Statistical Machine Translation - IBM Models</title><content type='html'>&lt;div style="text-align: justify;"&gt;At CFILT, a few of us have been working on understanding the IBM Models thoroughly. The &lt;a href="http://acl.ldc.upenn.edu/J/J93/J93-2003.pdf"&gt;IBM paper&lt;/a&gt; on SMT is a classic and seminal paper in the history of Machine Translation, and a must-read for anybody wanting to work in this area. It's not an easy read, and we spent quite a lot of time figuring out how the estimation results are derived. Some notes came out of preparing for these discussions; they work out in detail the steps missing from the original paper. Hopefully they will be useful for everybody. These scanned notes of estimation for Model 1 and Model 2 can be found &lt;a href="https://docs.google.com/viewer?a=v&amp;amp;pid=explorer&amp;amp;chrome=true&amp;amp;srcid=0BxsJNvcAVU0HZTg1MjM5ZGYtNjI5Yi00Y2Y3LTg4ZWUtZWY2ZTY4MmQ2ZTUy&amp;amp;hl=en_GB"&gt;here&lt;/a&gt;. They are not a replacement for the original paper, but are just meant to supplement it. Thanks to &lt;a href="http://www.cse.iitb.ac.in/~miteshk/"&gt;Mitesh&lt;/a&gt; for helping out with the key steps in the derivation. 
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;You can find the notes &lt;a href="https://docs.google.com/viewer?a=v&amp;amp;pid=explorer&amp;amp;chrome=true&amp;amp;srcid=0BxsJNvcAVU0HZTg1MjM5ZGYtNjI5Yi00Y2Y3LTg4ZWUtZWY2ZTY4MmQ2ZTUy&amp;amp;hl=en_GB"&gt;here&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-2140195392141908684?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/2140195392141908684/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2011/07/statistical-machine-translation-ibm.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/2140195392141908684'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/2140195392141908684'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2011/07/statistical-machine-translation-ibm.html' title='Statistical Machine Translation - IBM Models'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-975461766020094727</id><published>2010-12-21T22:46:00.002+05:30</published><updated>2010-12-21T22:49:16.402+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='text_mining 
language peepaal'/><title type='text'>Beauty of Language</title><content type='html'>&lt;p style="text-align: left;"&gt;Language is so ambiguous, and hence so difficult to analyze. I came across an extreme example the other day, which is kind of representative of the ambiguity in dealing with language. The following sentence can have different meanings depending upon how it is spoken:&lt;br /&gt;&lt;br /&gt;&lt;em&gt;I didn't say he stole the money&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;The change in meaning comes from which word is stressed while speaking. Here are a few interpretations of the sentence, with the stressed word in bold.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;&lt;strong&gt;I&lt;/strong&gt; didn't say he stole the money&lt;/em&gt;&lt;br /&gt;... someone else may have said it&lt;br /&gt;&lt;br /&gt;&lt;em&gt;I &lt;strong&gt;didn't&lt;/strong&gt; say he stole the money&lt;/em&gt;&lt;br /&gt;... the literal meaning&lt;br /&gt;&lt;br /&gt;&lt;em&gt;I didn't &lt;strong&gt;say&lt;/strong&gt; he stole the money&lt;/em&gt;&lt;br /&gt;... just hinted, implied ??&lt;br /&gt;&lt;br /&gt;&lt;em&gt;I didn't say &lt;strong&gt;he&lt;/strong&gt; stole the money&lt;/em&gt;&lt;br /&gt;... I didn't mean him&lt;br /&gt;&lt;br /&gt;&lt;em&gt;I didn't say he &lt;strong&gt;stole&lt;/strong&gt; the money&lt;/em&gt;&lt;br /&gt;... maybe he just borrowed it, with the intention of returning it&lt;/p&gt;&lt;em&gt;I didn't say he stole &lt;strong&gt;the&lt;/strong&gt; money&lt;/em&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt; ... not that money&lt;/p&gt;&lt;em&gt;I didn't say he stole the &lt;strong&gt;money&lt;/strong&gt;&lt;/em&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt; ... 
not the money, I mean something else - xyz ...&lt;br /&gt;&lt;br /&gt;Most everyday situations may not be this extreme, but the example serves to highlight the challenges in understanding text; the current state of the art is just skimming the surface.&lt;/p&gt;&lt;p style="text-align: left;"&gt;PS: Cross-posted from my &lt;a href="http://peepaal.org/profiles/blogs/the-beauty-of-language"&gt;Peepaal blog post&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-975461766020094727?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/975461766020094727/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2010/12/beauty-of-language.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/975461766020094727'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/975461766020094727'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2010/12/beauty-of-language.html' title='Beauty of Language'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-8974818399563866444</id><published>2010-01-26T13:55:00.003+05:30</published><updated>2010-01-26T14:16:17.022+05:30</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='parallel_programming machine_learning distributed_programming apache mahout'/><title type='text'>Scalable Machine Learning - Apache Mahout</title><content type='html'>&lt;div style="text-align: justify;"&gt;Machine learning algorithms are pretty computationally intensive, work on huge amounts of data and take a lot of time to run. That makes them obvious candidates for running on data-parallel distributed programming models like Map-Reduce.&lt;br /&gt;&lt;br /&gt;Although Google's &lt;a href="http://labs.google.com/papers/mapreduce.html"&gt;Map-Reduce paper&lt;/a&gt; does talk about it, there was not much available in the public domain to do machine learning on a distributed scale. &lt;a href="http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf"&gt;Andrew Ng's paper&lt;/a&gt; gives a common mathematical framework for modeling the most common machine learning algorithms, so that they can be parallelized. It's basically built around the idea of representing computations as summations of simpler computations. Each simpler computation can be a map task, with the final summation being the reduce task.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.ibm.com/developerworks/java/library/j-mahout/"&gt;Apache Mahout&lt;/a&gt; is a project from the Apache Software Foundation that started off from Ng's paper and already has implementations of many ML algorithms running on Hadoop. In addition, Mahout contains the Taste library for building recommendation systems and collaborative filtering systems.&lt;br /&gt;&lt;br /&gt;Hoping to read more on open source ML and practical ML. 
A couple of books I am looking forward to reading:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a style="font-style: italic;" href="http://oreilly.com/catalog/9780596529321"&gt;Programming Collective Intelligence&lt;/a&gt;, Toby Segaran&lt;/li&gt;&lt;li&gt;&lt;a style="font-style: italic;" href="http://www.manning.com/ingersoll/"&gt;Taming Text&lt;/a&gt;, Grant S. Ingersoll and Thomas S. Morton&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-8974818399563866444?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/8974818399563866444/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2010/01/scalable-machine-learning-apache-mahout.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/8974818399563866444'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/8974818399563866444'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2010/01/scalable-machine-learning-apache-mahout.html' title='Scalable Machine Learning - Apache Mahout'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-3165984911201029761</id><published>2009-08-11T05:38:00.000+05:30</published><updated>2009-08-11T05:39:42.816+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='book-review'/><category scheme='http://www.blogger.com/atom/ns#' term='books'/><title type='text'>Book Review: The Numerati</title><content type='html'>&lt;div style="text-align: justify;"&gt;With the advent of the Web and the fall in electronics prices, we have seen an explosion in digital data, from huge databases collecting various pieces of information to ever larger collections of documents. The &lt;a href="http://www.amazon.com/Numerati-Stephen-Baker/dp/0618784608"&gt;Numerati&lt;/a&gt; (a portmanteau of Number and Illuminati) are the statisticians, mathematicians, computer scientists, linguists and others involved in making sense of this data using sophisticated statistical techniques. 
The book describes the kinds of problems being solved in the following areas, citing various examples at a bunch of organizations like IBM, Intel, Umbria, etc.:&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;Workers - building employee profiles, understanding employee networks, using them for optimal use of resources&lt;/li&gt;&lt;li&gt;Shoppers - microtargeting shoppers using personal information to customize service, give recommendations and increase sales&lt;/li&gt;&lt;li&gt;Voters - Understanding voter intent, issues - so that campaign messages can be targeted to focussed groups.&lt;/li&gt;&lt;li&gt;Bloggers - Understanding public opinion from the information on the blogosphere, useful to understand sentiments on products, etc.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Medicine - Baker focusses on futuristic health monitoring (like floor tiles which capture your walking patterns!), whereas he totally ignores contemporary challenges and work in analyzing medical records, genomic and proteomic data.&lt;/li&gt;&lt;li&gt;Terrorism&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Match Making&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;All this comes at a cost. 
The Numerati have access to vast amounts of personal data, and we don't need an Orwellian Big Brother using it to learn about us, turn us into commodities and control our lives.&lt;br /&gt;&lt;br /&gt;That's about it in the book - it can be a brisk read, but you can give it a miss if you think you are familiar with the above topics.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-3165984911201029761?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/3165984911201029761/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2009/08/book-review-numerati.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/3165984911201029761'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/3165984911201029761'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2009/08/book-review-numerati.html' title='Book Review: The Numerati'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-5928836959568042513</id><published>2009-08-11T05:36:00.000+05:30</published><updated>2009-08-11T05:37:12.441+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category 
scheme='http://www.blogger.com/atom/ns#' term='book-review'/><category scheme='http://www.blogger.com/atom/ns#' term='books'/><category scheme='http://www.blogger.com/atom/ns#' term='history'/><title type='text'>Book Review: The Lady Tasting Tea</title><content type='html'>&lt;p align="justify"&gt;A lady claims that the taste of tea differs when milk is poured onto tea leaves as opposed to adding tea leaves to a cup of milk. Everyone at the small party scoffs at the suggestion, except Ronald Aylmer Fisher. Fisher designs an experiment that would statistically test the lady's claim. He creates a sample set containing tea prepared both ways, and lo and behold - the story goes that the lady identifies each cup correctly. Fisher uses this example to explain the design of experiments in his book 'The Design of Experiments'. This anecdote sets up the book. '&lt;a href="http://www.amazon.com/Lady-Tasting-Tea-Statistics-Revolutionized/dp/0805071342/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1249892070&amp;amp;sr=1-1"&gt;The Lady Tasting Tea&lt;/a&gt;' is the story of the development of statistics, Fisher having built the pillars of statistics as it stands today.&lt;/p&gt;&lt;p align="justify"&gt;I started reading this book while looking around to brush up on my statistics; I thought it would be a good idea to know the history of the subject I am exploring. That's particularly relevant in sciences filled with uncertainties, like statistics, economics and linguistics, where the characteristics of the individual seem to contribute to the development of the theory, and there's a story behind things which seem arbitrary. &lt;/p&gt;&lt;p align="justify"&gt;David Salsburg takes us through an entertaining journey starting with the earliest breakthroughs by Karl Pearson and William Gosset, going on to the pioneering foundational works of the acerbic genius Ronald Fisher, the cheerful Jerzy Neyman, and the multitalented Andrei Kolmogorov. 
Apart from these pioneers, Salsburg very vividly sketches the lives and contributions of Egon Pearson (hypothesis testing), Chester Bliss (probit analysis), John Tukey (exploratory data analysis), Frank Wilcoxon (non-parametric methods), EJG Pitman (non-parametric methods), Prasanta Chandra Mahalanobis (sampling theory), Samuel Wilks (Founder - Statistical Research Group, Princeton), George Box (robust statistics) and W. Edwards Deming (statistical quality control). &lt;/p&gt;&lt;p align="justify"&gt;Some of the chapter names are interesting, and they are as good as the title of the book. They remind me of '&lt;a href="http://www.amazon.com/Mythical-Man-Month-Software-Engineering-Anniversary/dp/0201835959/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1249892132&amp;amp;sr=1-1"&gt;The Mythical Man-Month&lt;/a&gt;''s memorable illustrative sketches. Sample this: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;div align="justify"&gt;The Mozart of Mathematics - Andrei Kolmogorov&lt;/div&gt;&lt;/li&gt;&lt;li&gt;&lt;div align="justify"&gt;The Picasso of Statistics - John Tukey&lt;/div&gt;&lt;/li&gt;&lt;li&gt;&lt;div align="justify"&gt;The March of the Martingales - on the work of Paul Levy&lt;/div&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p align="justify"&gt;Read this if you are a fan of scientific history. 
&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-5928836959568042513?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/5928836959568042513/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2009/08/book-review-lady-tasting-tea.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/5928836959568042513'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/5928836959568042513'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2009/08/book-review-lady-tasting-tea.html' title='Book Review: The Lady Tasting Tea'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-3148202436065701712</id><published>2009-05-02T17:08:00.007+05:30</published><updated>2009-05-02T18:44:35.462+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='UIMA'/><category scheme='http://www.blogger.com/atom/ns#' term='text_engineering'/><category scheme='http://www.blogger.com/atom/ns#' term='GATE'/><title type='text'>Text Engineering Frameworks</title><content type='html'>&lt;span style="font-weight: bold;font-size:100%;" &gt;What is a text engineering framework? 
&lt;/span&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;With the volume of unstructured text going through the roof, and the growing need to make sense of it, efforts to analyze it are growing too. Different software tools for language analysis and data mining, attacking myriad language analysis problems, have been developed. While each system concentrates on solving the problem at hand, there remains the unenviable task of gluing these language technologies together. &lt;span style="font-style: italic;"&gt;All language technologies need to worry about common problems like representation of data and metadata, modularization of the software components, and interaction between them.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Each system takes its own approach to handling these problems, in addition to solving the central problem. This is where a text engineering framework steps in. What a text engineering framework provides is an architecture and out-of-the-box support for rapid development of highly modularized, scalable language technology components, which can interface with other components - thus improving the process of creating language technology applications. The framework does all the plumbing necessary to create interesting language technology applications. &lt;span style="font-style: italic;"&gt;A good analogy would be that the framework is the OS platform on which applications are built. 
&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Architecture of a Text Engineering Framework&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;While different systems may have their own architectures, the generic architecture described here is the one that forms the basis of the two most popular text engineering frameworks, &lt;a href="http://gate.ac.uk"&gt;GATE&lt;/a&gt; (General Architecture for Text Engineering) and &lt;a href="http://incubator.apache.org/uima/"&gt;UIMA&lt;/a&gt; (Unstructured Information Management Architecture). The two key services that the framework provides are: data/metadata management services and analysis component development services.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Data Management Services&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The most important problem facing NLP tools is the management of data; hence the representation of data is given central importance in the framework. The basic unit of unstructured data to be analyzed is a &lt;span style="font-style: italic; font-weight: bold;"&gt;Document&lt;/span&gt;. This corresponds to a single artifact to be analyzed, like a single medical report, a news article, etc. The unstructured data need not be restricted to text; it could also be audio, video or other multimedia data. The focus of this article is text, but most of the concepts elaborated here apply to other media too. In NLP applications, it is common to process large collections of documents for analysis. The framework represents a collection of Documents by a &lt;span style="font-style: italic; font-weight: bold;"&gt;Corpus&lt;/span&gt; abstraction.&lt;br /&gt;&lt;br /&gt;Each NLP tool generates metadata for the Document. 
For instance, a tokenizer would generate tokens, a POS tagger would generate Part-Of-Speech tags for each token, a noun phrase chunker would identify noun phrase chunks and a named entity recognizer would generate labels for chunks of text. There needs to be a consistent method to represent all this metadata. This is achieved by using an &lt;span style="font-weight: bold; font-style: italic;"&gt;Annotation&lt;/span&gt; object, which represents metadata associated with a contiguous chunk of text. To illustrate the idea, consider the following sentence:&lt;br /&gt;"&lt;span style="font-style: italic;"&gt;In a perfect world&lt;/span&gt;&lt;span style="font-style: italic;"&gt;, all the people would be like cats are, at two o'clock in the afternoon&lt;/span&gt;."&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_muYmXXTNfps/SfxDVApAu-I/AAAAAAAABbA/GzCpSDv4fE8/s1600-h/annotation.jpeg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 477px; height: 109px;" src="http://3.bp.blogspot.com/_muYmXXTNfps/SfxDVApAu-I/AAAAAAAABbA/GzCpSDv4fE8/s400/annotation.jpeg" alt="" id="BLOGGER_PHOTO_ID_5331210087128153058" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The tokenizer would identify tokens, with each token, like "perfect", represented by an &lt;span style="font-weight: bold;"&gt;Annotation&lt;/span&gt; whose type is "&lt;span style="font-weight: bold; font-style: italic;"&gt;Token&lt;/span&gt;". Each annotation has a start and end offset associated with it, which identify its position in the &lt;span style="font-weight: bold;"&gt;Document&lt;/span&gt;. Information about the annotation can be stored in key-value pairs called &lt;span style="font-weight: bold; font-style: italic;"&gt;Features&lt;/span&gt;. This allows arbitrarily complex data to be associated with the annotation. 
For instance, the Token annotation could have a "string" feature to represent the text of the token, a "kind" feature to indicate if the token is a word, number, or punctuation, and a "root" feature which contains its morphological root.&lt;br /&gt;&lt;br /&gt;The scheme of representing metadata described above allows different kinds of metadata from different NLP components to be accessed and manipulated using the same interface. Positional information about the metadata can be captured, and arbitrarily complex data can be associated - since the feature values could be complex objects themselves. Annotations can be added at various levels of detail to the same chunk of text. For instance, the phrase "&lt;span style="font-style: italic;"&gt;a perfect world&lt;/span&gt;" can have "Token" annotations for each token, "POS" annotations to represent part-of-speech information for each token, and an "NP" annotation over the entire phrase to represent a noun phrase chunk.&lt;br /&gt;&lt;br /&gt;It should now be clear that the annotations constitute a data exchange format between various NLP components, used to build more complex analyses of the text. An entire declarative type system can be built using these annotations for an application, as is done in UIMA. It is possible to do pattern matching over these annotations, as provided by the JAPE language in GATE. The frameworks provide implementations of these abstractions, thus freeing applications from the data management chores.&lt;br /&gt;&lt;br /&gt;The architecture described above evolved during the TIPSTER conferences. 
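&lt;br /&gt;&lt;br /&gt;To make the annotation model concrete, here is a minimal Python sketch of a Document that keeps its text untouched and stores Annotations beside it as typed offset spans with a feature map. The class and method names (Annotation, Document, annotate, select, covered_text) are assumptions made up for this illustration; they are not the GATE or UIMA APIs, and the hand-rolled loop only stands in for a real tokenizer.&lt;br /&gt;
&lt;pre&gt;
```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Metadata attached to one contiguous span of the document text."""
    ann_type: str   # e.g. "Token", "POS", "NP"
    start: int      # offset of the first character of the span
    end: int        # offset just past the last character of the span
    features: dict = field(default_factory=dict)  # arbitrary key-value pairs


@dataclass
class Document:
    """The text plus every annotation that any component has added to it."""
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, ann_type, start, end, **features):
        ann = Annotation(ann_type, start, end, features)
        self.annotations.append(ann)
        return ann

    def select(self, ann_type):
        return [a for a in self.annotations if a.ann_type == ann_type]

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]


doc = Document(
    "In a perfect world, all the people would be like cats are, "
    "at two o'clock in the afternoon."
)

# Hand-rolled stand-in for a tokenizer, limited to the phrase "a perfect world".
phrase = "a perfect world"
phrase_start = doc.text.index(phrase)
cursor = phrase_start
for word in phrase.split():
    start = doc.text.index(word, cursor)
    cursor = start + len(word)
    doc.annotate("Token", start, cursor, string=word, kind="word")

# A second layer over the same span: one noun phrase chunk covering the tokens.
doc.annotate("NP", phrase_start, phrase_start + len(phrase))

for ann in doc.annotations:
    print(ann.ann_type, ann.start, ann.end, repr(doc.covered_text(ann)))
```
&lt;/pre&gt;
Because the annotations point into the text only through offsets, several layers ("Token", "NP") can cover the same span without modifying the text itself.&lt;br /&gt;&lt;br /&gt;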
One of the popular ways of serializing this data is XML stand-off markup, which separates the annotation metadata from the data.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Text Analysis Development Services&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;NLP applications generally consist of a number of steps, each doing some part of the analysis, building upon the analysis done in the previous stage. To support this application development paradigm, the framework represents each NLP task by a processing resource (PR). The PR is a component which performs a single task like tokenizing, POS tagging, or something even simpler like mapping one set of annotations to another (for adaptation purposes). The data interface to the PR is specified by the kind of input annotations that it requires, and the annotations it generates. For instance, the POS tagger requires "Token" annotations as input and generates "POS" annotations as output. The PR's role can be more accurately characterized as that of an annotator. Each PR is a reusable software component that can be used in creating NLP applications. The same POS tagger can be used in different applications as long as its input and output requirements are satisfied. A number of PRs can be strung together to create a pipeline. An example of an NP-chunking pipeline is shown below.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_muYmXXTNfps/SfxFbA0MSzI/AAAAAAAABbI/852F1XH0rVo/s1600-h/pipeline.jpeg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 217px; height: 193px;" src="http://3.bp.blogspot.com/_muYmXXTNfps/SfxFbA0MSzI/AAAAAAAABbI/852F1XH0rVo/s400/pipeline.jpeg" alt="" id="BLOGGER_PHOTO_ID_5331212389277518642" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This pipeline is a sequential pipeline, but you can as well imagine conditional, looped and other pipeline configurations. 
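&lt;br /&gt;&lt;br /&gt;The idea of PRs with a declared data interface can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Tokenizer, ToyPosTagger and Pipeline, the requires/produces attributes and the dictionary-based document are all hypothetical and do not correspond to the GATE or UIMA APIs, and the tagging rule is a toy, not a real POS tagger.&lt;br /&gt;
&lt;pre&gt;
```python
class Tokenizer:
    """PR that needs no input annotations and produces "Token" annotations."""
    requires = frozenset()
    produces = frozenset({"Token"})

    def process(self, doc):
        cursor = 0
        for word in doc["text"].split():
            start = doc["text"].index(word, cursor)
            cursor = start + len(word)
            doc["annotations"].append({"type": "Token", "start": start, "end": cursor})


class ToyPosTagger:
    """PR that reads "Token" annotations and adds a "POS" annotation for each."""
    requires = frozenset({"Token"})
    produces = frozenset({"POS"})
    DETERMINERS = {"a", "an", "the"}

    def process(self, doc):
        tokens = [a for a in doc["annotations"] if a["type"] == "Token"]
        for tok in tokens:
            word = doc["text"][tok["start"]:tok["end"]].lower()
            # Toy rule only: determiners get "DT", everything else is "UNK".
            tag = "DT" if word in self.DETERMINERS else "UNK"
            doc["annotations"].append(
                {"type": "POS", "start": tok["start"], "end": tok["end"], "tag": tag}
            )


class Pipeline:
    """Runs PRs in sequence after checking each PR's inputs are produced upstream."""

    def __init__(self, *resources):
        available = set()
        for pr in resources:
            missing = pr.requires - available
            if missing:
                name = type(pr).__name__
                raise ValueError(f"{name} is missing input annotations: {sorted(missing)}")
            available.update(pr.produces)
        self.resources = resources

    def run(self, text):
        doc = {"text": text, "annotations": []}
        for pr in self.resources:
            pr.process(doc)
        return doc


result = Pipeline(Tokenizer(), ToyPosTagger()).run("the cat sat on a mat")
pos_tags = [a["tag"] for a in result["annotations"] if a["type"] == "POS"]
print(pos_tags)
```
&lt;/pre&gt;
Constructing the pipeline in the wrong order, Pipeline(ToyPosTagger(), Tokenizer()), raises a ValueError, because the tagger's "Token" input is not produced by any earlier PR. Any other tagger exposing the same requires and produces sets could be dropped in without touching the rest of the pipeline.&lt;br /&gt;&lt;br /&gt;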
The scheme described above constitutes a modular, loosely-coupled architecture for a text engineering application. Each PR in the pipeline may be replaced by an equivalent PR as long as it satisfies the data interface requirements, allowing you to test different configurations. The framework defines the common interfaces for PRs, provides different pipeline implementations and allows for declarative specification of PRs and pipelines. In a nutshell, the framework provides all the plumbing required to build an NLP application, while the developer can focus on developing the smart innovations.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Other facilities provided by the framework&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;To make application development easier:&lt;br /&gt;1. The framework provides visual tools for managing language resources, creating pipelines, running applications, observing annotations, editing annotations, and creating training sets.&lt;br /&gt;2. The framework may ship with off-the-shelf components for common NLP tasks like tokenization, sentence identification, dictionary lookups, POS tagging, machine learning interfaces, etc. This allows rapid prototyping of applications using these ready-to-use components. GATE, for example, ships with the ANNIE toolkit.&lt;br /&gt;3. 
The framework developers maintain a component repository, which allows the developer community to share the reusable PRs that they develop and to make use of the work done by others.&lt;br /&gt;&lt;br /&gt;In summary, if you are developing NLP applications, you should use a text engineering framework to draw on the wealth of components that have already been developed, increase productivity and build NLP applications that are modular and loosely coupled.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-3148202436065701712?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/3148202436065701712/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2009/05/text-engineering-frameworks.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/3148202436065701712'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/3148202436065701712'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2009/05/text-engineering-frameworks.html' title='Text Engineering Frameworks'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_muYmXXTNfps/SfxDVApAu-I/AAAAAAAABbA/GzCpSDv4fE8/s72-c/annotation.jpeg' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-180889853452902819</id><published>2009-04-26T00:06:00.005+05:30</published><updated>2009-04-26T14:34:23.337+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='medical_informatics'/><title type='text'>De-Identification of Personal Health Information</title><content type='html'>&lt;div style="text-align: justify;"&gt;I recently started some work on de-identification of personal health information, and thought of putting together this primer on the subject.&lt;br /&gt;&lt;br /&gt;Medical researchers often need access to patients' medical records for their investigations. However, these records may contain information that compromises the identity of the individual and thus violates the patient's right to privacy. Personal health information (PHI) must therefore be removed from medical records before they are released to the larger research community. The &lt;a href="http://privacyruleandresearch.nih.gov/pr_02.asp"&gt;HIPAA regulation&lt;/a&gt; lays down the rules for the handling of PHI.&lt;br /&gt;&lt;br /&gt;Under HIPAA, any information that may reveal the identity of the patient, such as the patient's name, address, doctor's name, social security number or telephone numbers, must be removed before the records are released. This process of removing PHI from medical records is termed de-identification.&lt;br /&gt;&lt;br /&gt;There are 18 PHI identifiers that must be de-identified to meet HIPAA regulations. These include names, addresses, etc. (&lt;a href="http://cphs.berkeley.edu/content/hipaa/hipaa18.htm"&gt;Entire list here&lt;/a&gt;). Detecting these identifiers in medical records poses an interesting text mining problem. 
Identifying names may seem to be a Named Entity Recognition task, but there are additional complexities involved: a device or a disease named after a person is not PHI, and removing it would cost the researcher valuable information. Addresses are a challenge to de-identify sufficiently to prevent re-identification. A wide range of identifiers (SSN, MRN, admission number, accession number, telephone/fax numbers, room numbers, etc.) must be picked out from the many numbers that a report may contain. What makes the task challenging is that very high recall must be obtained to ensure compliance, while keeping false positives low so that valuable, non-PHI information is not removed.&lt;br /&gt;&lt;br /&gt;A number of rule-based as well as statistical systems have been developed to tackle the problem. You can find a good survey of the research work in this &lt;a href="http://www.citeulike.org/user/anoop_kunchukuttan/article/4313105"&gt;paper&lt;/a&gt;. 
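&lt;br /&gt;&lt;br /&gt;As a rough illustration of the rule-based approach, here is a minimal Python sketch that masks two identifier shapes while leaving clinical numbers alone. The patterns, the deidentify function and the sample report are all made up for this example and are not taken from any existing system; a real system needs far more rules, and context, to reach the recall that compliance demands.&lt;br /&gt;
&lt;pre&gt;
```python
import re

# Hypothetical patterns for two US-style identifier formats; real reports
# use many more formats, so these rules alone would not give high recall.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def deidentify(text):
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


report = "Pt SSN 123-45-6789, call 555-123-4567. BP 120/80, Hb 13.5."
print(deidentify(report))
# prints: Pt SSN [SSN], call [PHONE]. BP 120/80, Hb 13.5.
```
&lt;/pre&gt;
The blood pressure and haemoglobin values survive because they do not match either pattern, which is the false-positive side of the trade-off described above.&lt;br /&gt;&lt;br /&gt;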
Here are a few de-identification systems that are available:&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;a href="http://www.physionet.org/physiotools/deid/"&gt;PhysioNet DeId&lt;/a&gt; (Open Source)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://spin.nci.nih.gov/content/HMS_Scrubber_v1.0b.zip"&gt;Harvard Medical School Scrubber&lt;/a&gt; (Open Source)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.de-idata.com/"&gt;Data Corp DeId&lt;/a&gt; (Commercial)&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;For research purposes, a gold standard data set containing surrogate PHI data is available on the &lt;a href="http://www.physionet.org/physiotools/deid/#data"&gt;PhysioNet page&lt;/a&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-180889853452902819?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/180889853452902819/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2009/04/de-identification-of-personal-health.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/180889853452902819'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/180889853452902819'/><link rel='alternate' type='text/html' href='http://organize-information.blogspot.com/2009/04/de-identification-of-personal-health.html' title='De-Identification of Personal Health Information'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1879978874853957111.post-2156022354860546136</id><published>2009-04-25T22:39:00.002+05:30</published><updated>2009-04-25T23:46:15.403+05:30</updated><title type='text'>Yet Another Blog On Organizing Information</title><content type='html'>Data and information everywhere. The digital age is generating so much information that it has fast outgrown our ability to comprehend it. 'Information Overload', we call it. These are the questions posed to us:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How do I find the information that I want?&lt;/li&gt;&lt;li&gt;What information is relevant to my need?&lt;/li&gt;&lt;li&gt;OK, this is far more information than I can handle. I would like a summary of it.&lt;/li&gt;&lt;li&gt;In this huge infobase, is there some useful information that isn't obvious, such as patterns or trends?&lt;/li&gt;&lt;li&gt;There are a lot of smart people generating content. How can the collective intelligence of these people augment my search for information? &lt;/li&gt;&lt;/ul&gt;These questions have had us hooked for a long time, and so have the solutions people have developed to tackle them: search engines to help you find information, business intelligence tools to find patterns in huge volumes of data, information extraction systems to summarize information in human-generated content, recommendation systems to bring you information relevant to your need, and the study of social networks to harness the "collective intelligence" of the crowd.&lt;br /&gt;&lt;br /&gt;The rabbit hole goes deeper. 
These solutions are built on the more fundamental sciences of statistics, pattern recognition, artificial intelligence and natural language understanding.&lt;br /&gt;&lt;br /&gt;This is not the end, for the more fundamental questions we are faced with are about the nature of cognition, the understanding of language, the organization of knowledge and the active role of the human observer in the perception of information. I think this is the holy grail that we are all in pursuit of.&lt;br /&gt;&lt;br /&gt;We are beginners in this exciting field. This is a place to share what we learn and what we do, and to benefit from the "collective intelligence" of all who visit this page.&lt;br /&gt;&lt;br /&gt;While the challenges span many problems, there are some that we are currently working on. Dhaval currently works on optimizing ad networks and takes an active interest in search engines. I currently work on information extraction from text and medical informatics. So for now you may find a certain bias towards these and related topics on this blog.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1879978874853957111-2156022354860546136?l=organize-information.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://organize-information.blogspot.com/feeds/2156022354860546136/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://organize-information.blogspot.com/2009/04/yet-another-blog-on-organizing.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/2156022354860546136'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1879978874853957111/posts/default/2156022354860546136'/><link rel='alternate' type='text/html' 
href='http://organize-information.blogspot.com/2009/04/yet-another-blog-on-organizing.html' title='Yet Another Blog On Organizing Information'/><author><name>Anoop Kunchukuttan</name><uri>http://www.blogger.com/profile/03230469717630854695</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://1.bp.blogspot.com/_muYmXXTNfps/SM0w2mLjMcI/AAAAAAAAAQw/6URNFO4WFAg/S220/anoop_profile_small_photo.png'/></author><thr:total>2</thr:total></entry></feed>
