On Organizing Information: About making sense of unstructured information, by Anoop Kunchukuttan

No Roman Hindi please (2015-01-12)
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: inherit;">Chetan Bhagat wrote an <a href="http://blogs.timesofindia.indiatimes.com/The-underage-optimist/scripting-change-bhasha-bachao-roman-hindi-apnao/?utm_source=TOInewHP_TILwidget&utm_campaign=TOInewHP&utm_medium=Widget_Stry" target="_blank">opinion piece </a>calling for the replacement of the Devanagari script with the Roman script, describing it as an essential step for saving the Hindi language. <br /><br />This is fundamentally a bad idea since: </span></div>
<ul style="text-align: justify;">
<li><span style="font-family: inherit;">The Roman script is clearly inferior to the Devanagari script. For instance, it is ambiguous in representing sounds: c can stand for either च (as in 'touch') or क (as in 'cut'). Why would you want to throw away a script designed on <a href="http://en.wikipedia.org/wiki/Devanagari#Consonants" target="_blank">scientific principles of sound organization</a> for one which is <a href="http://en.wikipedia.org/wiki/Ghoti" target="_blank">fairly arbitrary</a>?</span></li>
<li><span style="font-family: inherit;">While language-specific hardware keyboards never took off, in the era of touch keyboards designing language-specific layouts is no barrier at all, and all the smart tricks available for English keyboards (word completion, swipe, etc.) can easily be replicated for Devanagari. In fact, we can have <a href="https://play.google.com/store/apps/details?id=iit.android.swarachakra&hl=en" target="_blank">innovative designs</a> to make input easier. We can go further and build handwriting recognition systems. </span></li>
<li><span style="font-family: inherit;">Even if we have to type on Roman physical keyboards, there is no reason to adopt the Roman script for the language. <a href="http://www.google.com/inputtools/" target="_blank">Transliteration systems</a> have become quite good at handling the wide variety of ambiguous mappings from Roman to Devanagari.</span></li>
<li><span style="font-family: inherit;">If there is a need for a common national script, then Devanagari should be the natural choice, since it can represent all major Indian scripts and follows the same principles. In fact, languages in India which don't have much of a written history should use extensions of Devanagari. That will surely be a political hot potato, so perhaps we could revive the <a href="http://en.wikipedia.org/wiki/Brahmi_script" target="_blank">Brahmi script</a> with suitable extensions to accommodate all scripts in India, since they are but variants of the Brahmi script. </span></li>
</ul>
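To make the ambiguity point concrete, here is a toy sketch (the mapping table is hypothetical and far from complete) of why a naive Roman-to-Devanagari letter mapping cannot be deterministic:

```python
# Illustrative sketch (hypothetical mapping table): the Roman letter 'c'
# has no unique Devanagari counterpart, so a naive reverse mapping must
# emit multiple candidates, while a letter like 'k' maps one-to-one.
ROMAN_TO_DEVANAGARI = {
    "c": ["\u091a", "\u0915"],  # च (as in 'touch') or क (as in 'cut')
    "k": ["\u0915"],            # क, unambiguous
}

def candidates(roman):
    """Return every Devanagari consonant the Roman letter could stand for."""
    return ROMAN_TO_DEVANAGARI.get(roman, [])
```

Real transliteration systems resolve such ambiguity with context (e.g. statistical models over surrounding letters), which is exactly why they can recover Devanagari from Roman input reliably.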
<div style="text-align: justify;">
<span style="font-family: inherit;">What is needed is for free, open source input solutions to be developed for these core input methods, so that they are widely and easily available and can become building blocks for language technologies. <br /><br />Chetan Bhagat was recently <a href="http://blogs.timesofindia.indiatimes.com/The-underage-optimist/would-we-ban-autos-or-cycle-rickshaws-if-a-rape-occurred-in-one/" target="_blank">advocating non-jugaad solutions</a> like Uber for new-generation transportation. I wonder why he proposes such a jugaadu solution in this case. In fact, he is bent on destroying a 2000-year-old, well-engineered solution. <br /><br />Technical reasons apart, I don't see a reason for this unwarranted alarm, since the language seems to be thriving. While I don't follow Hindi literature, at least in popular culture (news, TV, the Internet, etc.) the availability of Hindi content has only increased. The Union Government and the state governments of Hindi-speaking states use Hindi for their official activities. In any case, how will a change of script help preserve the language? Bhagat does not put forth any reasons. I agree that the use of English as the language of power and intellectual discourse may put regional languages at risk in the future, but the solution is to enable people to access content and communicate in their native languages, as has been done in Europe. With the rapid development of language technologies in recent times, that is <a href="http://www.tdil-dc.in/" target="_blank">clearly possible</a>. Making people use foreign scripts will only create a sense of inferiority and cut them off from the vast literature written in the Devanagari script. Instead of rejuvenating the language, it may just hasten its death. </span></div>
</div>
Statistical Machine Translation: Resources for Indian languages (2014-11-08)
<div dir="ltr" style="text-align: left;" trbidi="on">
At the <a href="http://www.cfilt.iitb.ac.in/" target="_blank">Center For Indian Language Technology</a>, IIT Bombay, we have hosted Shata-Anuvaadak (100 Translators), a broad-coverage Statistical Machine Translation system for Indian languages. It currently supports translation among 11 languages:<br /><br />
<ul>
<li> Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi, Marathi, Konkani</li>
<li> Dravidian languages: Tamil, Telugu, Malayalam</li>
<li> English </li>
</ul>
<br />It
is a Phrase-Based MT system with pre-processing and post-processing
extensions. The pre-processing includes source-side reordering for
English to Indian language translation. The post-processing includes
transliteration between Indian languages for OOV words. The system can
be accessed at: <br /><br /> <a href="http://www.cfilt.iitb.ac.in/indic-translator/" target="_blank">http://www.cfilt.iitb.ac.in/<wbr></wbr>indic-translator </a><br /><br />For more details, see the following publication: <br /><br /><div style="margin-left: 40px;">
Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya. 2014. <i>Shata-Anuvadak: Tackling Multiway Translation of Indian Languages</i>. Language Resources and Evaluation Conference <b>(LREC 2014)</b>.</div>
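The pre-/post-processing pipeline described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the reordering, decoding and transliteration components are passed in as callables, and all names are hypothetical.

```python
# Hypothetical sketch of a phrase-based SMT pipeline with source-side
# reordering (pre-processing) and OOV transliteration (post-processing).
def translate(words, src, tgt, reorder, decode, transliterate, tgt_vocab):
    if src == "en":
        # pre-processing: reorder English words towards Indian word order
        words = reorder(words)
    # core step: phrase-based SMT decoding
    output = decode(words, src, tgt)
    # post-processing: OOV words pass through the decoder unchanged,
    # so transliterate them into the target script instead
    return [w if w in tgt_vocab else transliterate(w, src, tgt) for w in output]
```

Transliterating OOVs is a natural fallback between Indian languages because their scripts share the same Brahmi-derived sound inventory, so many untranslated words (names, borrowings) are still rendered readably.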
<br />We are also making available software and resources developed at the Center, both for this system and for ongoing research. These are available under an open source license for research use. They include: <br /><b><br />Software</b><br /><ul>
<li>Indian language NLP tools: common NLP tools for Indian languages that are useful for machine translation: Unicode normalizers, tokenizers, morphology analysers and a transliteration system. </li>
<li>Source-side reordering system</li>
<li>A simple experiment management system for Moses</li>
</ul>
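As an aside on why Unicode normalizers belong in this toolkit: Devanagari letters with a nukta can be encoded two different ways, and text from the web mixes both. A minimal illustration using only Python's standard library (not the Center's tool itself):

```python
import unicodedata

# Devanagari क़ can be stored either as one precomposed code point or as
# क plus a combining nukta. The two spellings render identically but
# compare unequal as raw strings.
precomposed = "\u0958"        # DEVANAGARI LETTER QA
decomposed = "\u0915\u093C"   # DEVANAGARI LETTER KA + SIGN NUKTA

assert precomposed != decomposed
# Normalizing both to one canonical form (NFD here) makes them
# comparable, which an NLP pipeline needs before tokenization.
assert unicodedata.normalize("NFD", precomposed) == \
       unicodedata.normalize("NFD", decomposed)
```

Without this step, a phrase table learned on one encoding silently fails to match test sentences in the other.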
<b>Resources</b><br /><ul>
<li>Translation models for phrase-based SMT systems for all language pairs in Shata-Anuvaadak</li>
<li>Language models for all languages in Shata-Anuvaadak</li>
<li>Transliteration models for some language pairs (Moses-based)</li>
</ul>
<br />You can access these resources at: <br /><br /> <a href="http://www.cfilt.iitb.ac.in/static/download.html" target="_blank">http://www.cfilt.iitb.ac.in/<wbr></wbr>static/download.html</a><br /><br /></div>
LREC 2014: Some papers/posters (2014-06-04)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
I attended the Language Resources and Evaluation Conference (LREC 2014) in Reykjavik, Iceland last week. Just sharing some interesting papers/posters I came across. </div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
LREC is a rich conference for getting exposure to the many tools and datasets available across the areas of NLP research. I personally found useful tools/datasets on machine translation, crowdsourcing and grammar correction.</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<div style="font-size: 13.333333969116211px;">
In addition, the conference also emphasizes multilinguality and hence there were a lot of papers/posters showcasing resource development in many languages. A lot of these resources are made available as open source software/open datasets. </div>
<div style="font-size: 13.333333969116211px;">
<br /></div>
<div style="font-size: 13.333333969116211px;">
Along with the oral sessions, there were many poster sessions. I found the poster sessions more interesting and interactive and it was possible to cover much more material browsing the posters. </div>
</div>
<br />
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br />
The following is a small set of papers/posters I found interesting - primarily in the area of SMT, grammar correction, crowdsourcing plus some really cool ideas. You may want to look through the proceedings for literature relevant to your work: </div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<strong style="line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;">Machine Translation</span></strong></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<strong style="line-height: 1.428571em;"><br clear="none" /></strong></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Aligning Parallel Texts with InterText: Pavel Vondřička</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
A tool for automatic alignment of parallel text with post-alignment manual correction. The server version can manage projects and teams. The automatic alignment is based on the 'hunalign' algorithm. Both the server and desktop versions are open source. We could explore this tool for our corpus alignment/cleaning activities.<br />
<br clear="none" />
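The length-based intuition behind aligners like hunalign can be illustrated with a toy scoring function (a deliberate simplification; real aligners add dictionary evidence and allow 1-2/2-1 merges via dynamic programming):

```python
# Toy version of the length-based idea used by sentence aligners:
# mutual translations tend to have proportional lengths, so candidate
# 1-1 sentence pairs can be scored by character-length ratio alone.
def length_score(src, tgt):
    """Return a score in (0, 1]; 1.0 means equal character lengths."""
    a, b = len(src), len(tgt)
    if max(a, b) == 0:
        return 1.0  # two empty segments trivially "match"
    return min(a, b) / max(a, b)
```

Pairs scoring near 1.0 are kept as alignment candidates; strongly mismatched lengths suggest an insertion, deletion or misalignment.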
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus: Raivis Skadiņš and Jörg Tiedemann and Roberts Rozis and Daiga Deksne</em><br />
Describes the construction of a new parallel corpus between various European languages from the EU Bookshop service available online. The paper describes the tools and techniques used, which is quite informative; the use of a language model for correctly extracting text from PDF is interesting. A comparison of <em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">hunalign</em>, the Microsoft Bilingual Aligner (MBA) and <em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Vanilla</em> shows that the MBA outperforms the rest.</div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">The AMARA Corpus: Building Parallel Language Resources for the Educational Domain: Ahmed Abdelali, Francisco Guzman, Hassan Sajjad and Stephan Vogel</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
The paper describes the construction of a parallel corpus in the educational domain using subtitles gathered from sources like Khan Academy, TED, Udacity, Coursera, etc. The translations were obtained via AMARA, a collaborative platform for subtitle translation that many of these projects use. The corpus also contains parallel text of Hindi with many foreign languages (a few thousand sentences each). This could be useful for studying translation between Indian and foreign languages using bridge languages. </div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Machine Translationness: Machine-likeness in Machine Translation Evaluation: Joaquim Moré and Salvador Climent</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Machine Translation for Subtitling: A Large-Scale Evaluation: Thierry Etchegoyhen, Lindsay Bywood, Mark Fishel, Panayota Georgakopoulou, Jie Jiang, Gerard Van Loenhout, Arantza Del Pozo, Mirjam Sepesy Maucec, Anja Turner and Martin Volk</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
Describes a large-scale evaluation of SMT for automatic subtitling. The evaluation uses human ratings, automatic metrics and measures of post-editing productivity improvement. On all counts, the machine-translated subtitles show good quality. </div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">On the Origin of Errors: a Fine-Grained Analysis of MT and PE Errors and their Relationship: Joke Daems, Lieve Macken and Sonia Vandepitte</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Translation Errors from English to Portuguese: an Annotated Corpus: Angela Costa, Tiago Luís and Luísa Coheur</em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
A classification of errors according to a taxonomy, for English-to-Portuguese translation. Output from Moses and Google Translate has been annotated. </div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling: <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Loaiciga_Sharid" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Sharid Loaiciga</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Meyer_Thomas" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Thomas Meyer</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Popescu-Belis_Andrei" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Andrei Popescu-Belis</a></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><br clear="none" /></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">The taraXÜ Corpus of Human-Annotated Machine Translations - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Avramidis_Eleftherios" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Eleftherios Avramidis</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Burchardt_Aljoscha" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Aljoscha Burchardt</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Hunsicker_Sabine" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Sabine Hunsicker</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Popovic_Maja" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Maja Popović</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Tscherwinka_Cindy" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Cindy Tscherwinka</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Vilar_David" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">David Vilar</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Uszkoreit_Hans" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Hans Uszkoreit</a></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">CFT13: a Resource for Research into the Post-editing Process - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Carl_Michael" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Michael Carl</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Martinez_Garcia_Mercedes" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Mercedes Martínez García</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Mesa-Lao_Bartolome" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Bartolomé Mesa-Lao</a></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/835.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Bojar_Ondrej" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Ondrej Bojar</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Diatka_Vojtech" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Vojtěch Diatka</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Rychly_Pavel" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Pavel Rychlý</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Stranak_Pavel" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Pavel Stranak</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Suchomel_Vit" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Vit Suchomel</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Tamchyna_Ales" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Aleš Tamchyna</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Zeman_Daniel" shape="rect" style="border: 0px; color: #047ac6; 
line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Daniel Zeman</a></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/1115.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">A Corpus of Machine Translation Errors Extracted from Translation Students Exercises</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Wisniewski_Guillaume" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Guillaume Wisniewski</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Kubler_Natalie" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Natalie Kübler</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Yvon_Francois" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">François Yvon</a></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;">Innovations in Parallel Corpus Search Tools - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Volk_Martin" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Martin Volk</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Graen_Johannes" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Johannes Graën</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Callegaro_Elena" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Elena Callegaro</a></span></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;"><br clear="none" /></span></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/510.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Gilmanov_Timur" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Timur Gilmanov</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Scrivner_Olga" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Olga Scrivner</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Kubler_Sandra" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Sandra Kübler</a></span></div>
<div>
<br /></div>
</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<strong style="line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;">GRAMMAR CORRECTION</span></strong></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><br clear="none" /></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/606.html" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">The MERLIN corpus: Learner Language and the CEFR</a><span style="font-size: 10pt; line-height: 1.428571em;">: </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Boyd_Adriane" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Adriane Boyd</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Hana_Jirka" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Jirka Hana</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Nicolas_Lionel" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Lionel Nicolas</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Meurers_Detmar" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Detmar Meurers</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Wisniewski_Katrin" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Katrin Wisniewski</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Abel_Andrea" 
shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Andrea Abel</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Schone_Karin" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Karin Schöne</a><span style="font-size: 10pt; line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Stindlova_Barbora" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Barbora Štindlová</a><span style="font-size: 10pt; line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Vettori_Chiara" shape="rect" style="border: 0px; color: #047ac6; font-size: 10pt; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Chiara Vettori</a></em></div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><br clear="none" /></em>The MERLIN corpus is a corpus under development for the study of second language learning of European languages; the languages under consideration are Czech, German and Italian. It is a learner corpus annotated with several kinds of information: </div>
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<ul style="border: 0px; line-height: 1.428571em; list-style-position: outside; margin: 0.2857em 0px 0.714285em 2em; padding: 0px;">
<li style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Metadata about the author and the test</li>
<li style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Test ratings according to the CEFR framework</li>
<li style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Error annotations</li>
<li style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Annotations to encourage second language acquisition research</li>
</ul>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
Data-oriented research in second language learning has mostly focused on English as the L2, but corpora for other languages are now being developed. </div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">KoKo: an L1 Learner Corpus for German: <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Abel_Andrea" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Andrea Abel</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Glaznieks_Aivars" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Aivars Glaznieks</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Nicolas_Lionel" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Lionel Nicolas</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Stemle_Egon" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Egon Stemle</a></em></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
A learner corpus of German as a first language (L1); most of the learners are native German speakers who have completed one year of secondary education. The corpus is under development. </div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<br clear="none" /></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Building a Reference Lexicon for Countability in English: <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Kiss_Tibor" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Tibor Kiss</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Pelletier_Francis_Jeffry" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Francis Jeffry Pelletier</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Stadtfeld_Tobias" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Tobias Stadtfeld</a></em></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;">The paper describes the construction of a resource that records the lexical preference class of a large number of English noun senses (≈14,000) with respect to the distinction between mass and count interpretations. The lexicon was constructed using a questionnaire-based approach.</span></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;"><br clear="none" /></span></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">Large Scale Arabic Error Annotation: Guidelines and Framework: <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Zaghouani_Wajdi" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Wajdi Zaghouani</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Mohit_Behrang" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Behrang Mohit</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Habash_Nizar" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Nizar Habash</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Obeid_Ossama" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Ossama Obeid</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Tomeh_Nadi" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Nadi Tomeh</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Rozovskaya_Alla" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Alla Rozovskaya</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Farra_Noura" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Noura Farra</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Alkuhlani_Sarah" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Sarah Alkuhlani</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Oflazer_Kemal" shape="rect" 
style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Kemal Oflazer</a></em></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;">A learner corpus for Arabic as L2, with large-scale error annotation.</span></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<span style="line-height: 1.428571em;"><br clear="none" /></span></div>
<div style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;">A Comparison of MT Errors and ESL Errors - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#B._Hashemi_Homa" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Homa B. Hashemi</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Hwa_Rebecca" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Rebecca Hwa</a></em></div>
</div>
</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<strong style="color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;">Crowdsourcing</span></strong></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/497.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Sabou_Marta" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Marta Sabou</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Bontcheva_Kalina" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Kalina Bontcheva</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Derczynski_Leon" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Leon Derczynski</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Scharl_Arno" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Arno Scharl</a></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><br clear="none" /></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/132.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Design and Development of an Online Computational Framework to Facilitate Language Comprehension Research on Indian Languages</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#sinha_manjira" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Manjira Sinha</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Dasgupta_Tirthankar" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Tirthankar Dasgupta</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Basu_Anupam" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Anupam Basu</a></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><br clear="none" /></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/319.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Collaboration in the Production of a Massively Multilingual Lexicon</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Benjamin_Martin" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Martin Benjamin</a></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;"><br clear="none" /></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="font-size: 10pt; line-height: 1.428571em;">Online Experiments with the Percy Software Framework - Experiences and some Early Results - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Draxler_Christoph" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Christoph Draxler</a></span></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><strong style="line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;"><br clear="none" /></span></strong></em></div>
<div style="background-color: white; border: 0px; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;">
<em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;">sloWCrowd: a Crowdsourcing Tool for Lexicographic Tasks - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Sabou_Marta" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Marta Sabou</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Bontcheva_Kalina" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Kalina Bontcheva</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Derczynski_Leon" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Leon Derczynski</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Scharl_Arno" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Arno Scharl</a></span></em></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13.333333969116211px;">
<div style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 19.983333587646484px; margin: 0px; padding: 0px;">
<strong style="line-height: 1.428571em;"><span style="font-size: 14pt; line-height: 1.428571em;">Some interesting papers</span></strong></div>
<div style="border: 0px; margin: 0px; padding: 0px;">
<ul style="border: 0px; list-style-position: outside; margin: 0.2857em 0px 0.714285em 2em; padding: 0px;">
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;">A Database for Measuring Linguistic Information Content - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Sproat_Richard" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Richard Sproat</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Cartoni_Bruno" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Bruno Cartoni</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Choe_Hyunjeong" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Hyunjeong Choe</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Huynh_David" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">David Huynh</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Ha_Linne" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Linne Ha</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Rajakumar_Ravindran" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Ravindran Rajakumar</a><span style="line-height: 1.428571em;"> and </span><a 
href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Wenzel-Grondie_Evelyn" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Evelyn Wenzel-Grondie</a></span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;">Developing Politeness Annotated Corpus of Hindi Blogs - Ritesh Kumar</span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;">Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Song_Zhiyi" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Zhiyi Song</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Strassel_Stephanie" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Stephanie Strassel</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Lee_Haejoong" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Haejoong Lee</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Walker_Kevin" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Kevin Walker</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Wright_Jonathan" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Jonathan Wright</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Garland_Jennifer" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" 
target="_blank">Jennifer Garland</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Fore_Dana" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Dana Fore</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Gainor_Brian" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Brian Gainor</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Cabe_Preston" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Preston Cabe</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Thomas_Thomas" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Thomas Thomas</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Callahan_Brendan" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Brendan Callahan</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Sawyer_Ann" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Ann Sawyer</a></span></span></span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;">The Ellogon Pattern Engine: Context-free Grammars over Annotations - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Petasis_Georgios" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Georgios Petasis</a></span></span></span></span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;"><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/1083.html" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Etymological WordNet: Tracing the History of Words</a> - Gerard de Melo</span></em></li>
<li style="border: 0px; color: black; font-family: Helvetica, Arial, 'Droid Sans', sans-serif; font-size: 14px; line-height: 1.428571em; margin: 0px; padding: 0px;"><em style="border: 0px; line-height: 1.428571em; margin: 0px; padding: 0px;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;"><span style="line-height: 1.428571em;">Distributed Distributional Similarities of Google Books over the Centuries - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Riedl_Martin" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Martin Riedl</a><span style="line-height: 1.428571em;">, </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Steuer_Richard" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Richard Steuer</a><span style="line-height: 1.428571em;"> and </span><a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Biemann_Chris" shape="rect" style="border: 0px; color: #047ac6; line-height: 1.428571em; margin: 0px; padding: 0px;" target="_blank">Chris Biemann</a></span></span></span></span></span></em></li>
<li><em><a href="http://www.lrec-conf.org/proceedings/lrec2014/summaries/913.html" target="_blank">Hot Topics and Schisms in NLP: Community and Trend Analysis with Saffron on ACL and LREC Proceedings</a> - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Buitelaar_Paul" target="_blank">Paul Buitelaar</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Bordea_Georgeta" target="_blank">Georgeta Bordea</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Coughlan_Barry" target="_blank">Barry Coughlan</a></em></li>
<li><em>Linguistic Landscaping of South Asia using Digital Language Resources: Genetic vs. Areal Linguistics - <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Borin_Lars" target="_blank">Lars Borin</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Saxena_Anju" target="_blank">Anju Saxena</a>, <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Rama_Taraka" target="_blank">Taraka Rama</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2014/authors.html#Comrie_Bernard" target="_blank">Bernard Comrie</a></em></li>
<li><em>Indian Subcontinent Language Vitalization - András Kornai and Pushpak Bhattacharyya</em></li>
</ul>
</div>
</div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-46818902481128331932013-11-23T19:42:00.000+05:302013-11-23T19:42:26.629+05:30A Systematic Exploration of Diversity in Machine Translation - Paper Summary<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<br />
<div style="text-align: justify;">
An interesting paper regarding generating top-k translation outputs. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Gimpel, K., Batra, D., Dyer, C., & Shakhnarovich, G. (2013). <i>A Systematic Exploration of Diversity in Machine Translation</i>. EMNLP 2013.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This paper discusses:<br /><br />1) Methods for generating the most diverse MT outputs from an SMT system based on a linear decoding model.<br />2) Applying the top-k diverse outputs to various tasks: (1) system combination (2) re-ranking top-k lists (3) human post-editing<br /><br />The motivation for the work is that top-k lists are commonly used in many NLP tasks, including MT, to examine a large set of candidates before making decisions. <br />The usual strategy is simply to take the k best outputs. However, the entries in such a list are often very similar to each other and have therefore shown mixed results. Hence the search for a method to get the top-k diverse translations. <br /><br />This is achieved by a decoding procedure which iteratively generates the best translations, one at a time. The decoding objective function adds a dissimilarity term which penalizes similarity with previously generated translations. In this work, the dissimilarity function is simply a language model over the sentences output in previous iterations (with the LM score negated, so that resembling earlier outputs is penalized). This allows the same decoding algorithm as a standard linear decoding model to be reused. The method increases decoding time, since one decoding pass has to be performed for each candidate in the top-k diverse list. The parameters n and λ are tuned on a held-out set. <br /><br />Using the top-k diverse outputs gives better results than using top-k best lists, and the difference is larger for smaller values of k. An interesting analysis is which sentences benefit most from the diverse lists: it turns out that sentences with lower BLEU scores (presumably difficult to translate) benefit from the diverse lists, whereas sentences with high BLEU scores benefit from top-k best lists. </div>
<div style="text-align: justify;">
<br />A point worth mentioning: while doing top-k re-ranking, one of the features the authors use is an LM score over word classes, and this provides very good results. Brown clustering was used to learn the word classes. </div>
<div style="text-align: justify;">
<br />With the help of confidence scores, a decision can be made dynamically about which of the lists (diverse or best) should be used. There is also scope for investigating more similarity functions.</div>
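The iterative procedure in the paper folds the dissimilarity term into the decoder itself. As a rough illustration of the idea only (not the authors' algorithm), one can greedily re-select from an existing n-best list, penalizing each candidate by λ times its n-gram overlap with the already-selected translations:

```python
def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_diverse(nbest, k, n=2, lam=0.5):
    """Greedily pick k diverse hypotheses from an n-best list.

    nbest: list of (model_score, sentence) pairs.
    At each step a candidate's score is reduced by lam times its
    n-gram overlap with the translations selected so far -- a crude
    stand-in for the paper's LM-based dissimilarity term.
    """
    selected, seen = [], set()
    candidates = list(nbest)
    for _ in range(min(k, len(candidates))):
        best = max(
            candidates,
            key=lambda sc: sc[0] - lam * sum(
                1 for g in ngrams(sc[1].split(), n) if g in seen
            ),
        )
        candidates.remove(best)
        selected.append(best)
        seen.update(ngrams(best[1].split(), n))
    return selected

nbest = [
    (0.90, "the cat sat on the mat"),
    (0.85, "the cat sat on a mat"),   # near-duplicate of the best
    (0.50, "a feline rested on the rug"),
]
print(select_diverse(nbest, 2))
```

With the near-duplicate penalized, the second pick is the genuinely different hypothesis even though its model score is lower.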
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-11007361946186948022013-11-12T14:21:00.001+05:302013-11-12T14:21:51.531+05:30Large Data sources for NLP from Google<div dir="ltr" style="text-align: left;" trbidi="on">
Google has made available two large and rich sources for NLP research:<br />
<ul style="text-align: left;">
<li><a href="http://storage.googleapis.com/books/ngrams/books/datasetsv2.html" target="_blank">Google Books N-gram corpus</a></li>
<li><a href="http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html" target="_blank">Google Books Syntactic N-gram corpus</a></li>
</ul>
These have been described in the following papers:<br />
<ul style="text-align: left;">
<li><div class="gs_citr" id="gs_cit0" tabindex="0">
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. <i>Science</i>. 2011.</div>
</li>
</ul>
<div class="gs_citr" id="gs_cit0" tabindex="0">
<ul>
<li>Yoav Goldberg and Jon Orwant. A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books. <i>*SEM</i>. 2013.</li>
</ul>
</div>
<br />
These resources have been created from the Google Books corpus, which is an outcome of Google's efforts to scan all the world's books. I will just highlight the important points from these papers in this post.<br />
<br />
<b>Google N-gram corpus</b><br />
This is a traditional n-gram corpus, where frequency counts are provided for 1 to 5 gram strings. However, there are a couple of additional features:<br />
<ul style="text-align: left;">
<li>One is the temporal aspect of the n-grams, i.e. for each n-gram, frequency counts are given per year, going back to medieval times. For English, the counts are available from the 16th century onwards.</li>
<li>Frequency counts are available for extended n-grams also. The extension is in terms of the POS tags. All the data has been POS tagged with a tagset of 12 basic tags. This makes possible queries of the following form: </li>
<ul>
<li>the burnt_NOUN car (combination of POS tag and token queries)</li>
<li>_DET_ _NOUN_ (queries involving determiners only)</li>
</ul>
</ul>
There are some restrictions on the 4- and 5-grams available, in order to prevent combinatorial explosion. <br />
<div>
<ul style="text-align: left;">
<li> Information on head-modifier relations is also available, though the relation type is not specified</li>
</ul>
You can use the<a href="https://books.google.com/ngrams" target="_blank"> Google N-gram viewer</a> to query this resource in an interactive way. The corpus has been used for studying evolution of culture over time, and can be used to a variety of such temporal studies e.g. economics, language, etc. <br />
<br />
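Concretely, each record in the downloadable version-2 n-gram files is a tab-separated line giving the n-gram, a year, a match count and a volume count; a minimal parser (the counts in the example are made up):

```python
def parse_ngram_line(line):
    """Parse one line of a Google Books Ngram (version 2) data file.

    Assumed layout: ngram TAB year TAB match_count TAB volume_count
    """
    ngram, year, matches, volumes = line.rstrip("\n").split("\t")
    return ngram, int(year), int(matches), int(volumes)

# A POS-extended n-gram of the kind described above, hypothetical counts:
record = "the burnt_NOUN car\t1950\t127\t95"
ngram, year, matches, volumes = parse_ngram_line(record)
print(ngram, year, matches, volumes)
```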
<b>Google Syntactic N-gram corpus </b><br />
While traditional n-grams consist of sequential words, a syntactic n-gram is defined as a set of words involved in a dependency relationship. Further, an order-n syntactic n-gram means an n-gram containing <b><i>n</i> content words</b>. The Google Books syntactic n-gram corpus contains dependency tree fragments of size 1-5, <i>viz. nodes, arcs, biarcs, triarcs and quadarcs.</i> There is a restriction on the types of quadarcs available in the corpus. Each fragment contains the surface forms of the words, their POS tags, the head-modifier relationships and the relative order of the words. It does not contain information about the linear distance between the words in a dependency or the existence of gaps between words in the n-gram. Counts for all the syntactic n-grams are provided. A few noteworthy points:<br />
<ul style="text-align: left;">
<li>As with the Books n-gram corpus, temporal information on the syntactic n-grams is available.</li>
<li>Additional information for dependency trees involving conjunctions and prepositions is made available. Here, the dependency tree fragments are extended to provide information about the conjunctions and prepositions, even though they are function words. This information forms the extended component of the corpus <i>(extended-arcs, extended-biarcs, etc.)</i></li>
<li>verbargs-unlex and nounargs-unlex are unlexicalized versions of the syntactic n-grams, where only the head word and the top-1000 words in the language are lexicalized. </li>
</ul>
The syntactic n-gram corpus can be very useful for studying lexical semantics, sub-categorization, etc. <br />
<br /></div>
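For concreteness, here is a sketch of reading one record of the arcs files. I am assuming the layout described in the Goldberg and Orwant paper - head word, the fragment as word/POS/dep-label/head-index tokens, a total count, then per-year counts - so verify against the corpus documentation before relying on it:

```python
from collections import namedtuple

Token = namedtuple("Token", "word pos dep head")

def parse_syntactic_ngram(record):
    """Parse one syntactic n-gram record (assumed layout, see above).

    head_word TAB fragment TAB total_count TAB year,count TAB ...
    Each fragment token is word/POS/dep-label/head-index,
    where head-index 0 marks the root of the fragment.
    """
    fields = record.rstrip("\n").split("\t")
    head_word, fragment, total = fields[0], fields[1], int(fields[2])
    by_year = {int(y): int(c)
               for y, c in (f.split(",") for f in fields[3:])}
    tokens = [Token(w, p, d, int(h))
              for w, p, d, h in (t.rsplit("/", 3) for t in fragment.split())]
    return head_word, tokens, total, by_year

# Hypothetical record for the arc (cat --nsubj--> sat):
rec = "sat\tcat/NN/nsubj/2 sat/VBD/ROOT/0\t42\t1990,12\t2000,30"
head, toks, total, by_year = parse_syntactic_ngram(rec)
```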
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-44560979277408236172013-10-19T19:54:00.003+05:302013-10-19T19:54:50.929+05:30Hierarchical Phrase Based models<div dir="ltr" style="text-align: left;" trbidi="on">
I read David Chiang's ACL'05 paper on hierarchical phrase-based models today. A quick summary:<br />
<br />
<b>Design Principles: </b><br />
<ul style="text-align: left;">
<li>Formal, but not linguistic, i.e. a synchronous CFG is used; however, the grammar learnt may not correspond to a linguistic ('human'?) grammar.</li>
<li>Leverage the strengths of phrase-based systems while moving to a syntax-based one </li>
</ul>
<div style="text-align: left;">
<br /><b>Basic Motivation: </b><br /><br />The basic idea is to handle long-distance reorderings that a phrase-based model can't handle. <br />This is done by introducing a single non-terminal 'X' and having rules of the form: <br /><br /><b> X-> a X_1 b X_2 c | d X_2 e X_1 f </b><br /><br />where the subscripts indicate the relative positions of the RHS non-terminals.<br /><br />In theory, the number of non-terminals on the RHS is not constrained. However, a limitation is that reorderings that happen at higher levels of a constituent parse tree may not be captured. The rules learnt by this system are more like lexicalized reordering templates. <br /><br />Special types of rules used: </div>
<ul style="text-align: left;">
<li> Glue rules: top level rule</li>
<li> Entity rules: for translating dates, numbers, etc.</li>
</ul>
<div style="text-align: left;">
<br /><b>Learning rules</b><br /><br />The starting point is the set of phrases learnt by a phrase-based system, called 'initial phrase pairs'. From each initial phrase pair, rules are extracted. In order to avoid too many rules and to reduce spurious derivations, some heuristics are used. One noteworthy heuristic is that rules are constructed from the smallest possible initial phrase pairs. Another is that each rule can have at most two non-terminals on the RHS. This is done for decoding efficiency, probably because the CYK algorithm expects a grammar in CNF, where every rule has two non-terminals.<br /><br /><b>The model</b><br /><br />The model is very similar to the phrase-based model: a log-linear model with the same features, except that the phrase translation probabilities are replaced by the rule translation probabilities. The probabilities are learnt in a similar way. <br /><br /><b>Decoding</b><br /><br />Decoding is done via a CYK variant. The differences from standard CYK parsing are: <br />- Parsing is done only for the source language sentence. So far so good.<br />- There is only one non-terminal. You would expect this to make the parsing easier. However, there is a catch. <br />- The language model of the target language has to be integrated into the decoder. The paper says "the language model is integrated by intersecting with the target side CFG", which I take to mean that the LM score of the sub-string spanned by a cell in the chart parsing is multiplied along with the rule weights. This means each cell has to keep track of the rule along with all the target strings that the rule can generate in that span. Each (rule, target string) pair is like a virtual non-terminal, and hence the effective number of non-terminals can be really large, especially for larger spans. <br /> What I have described here is naive, and the journal paper describes different strategies for integrating the language model. I will read up on and summarize that later.
<br />- The grammar is not CNF, though every rule still has only two non-terminals. I guess it is converted to CNF before decoding. <br /><br />Another interesting problem is how to find the top-k parses. The journal article describes this in detail too. <br /><br /><b>Optimizations to decoding</b><br /><br />- Limiting the number of entries in a cell of the chart<br />- Pruning entries in the cell with very low scores as compared to the highest scoring rule in the cell</div>
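To make the rule notation concrete, here is a toy illustration (not Chiang's decoder): integers in the target template stand for the gaps X_1, X_2, and applying the rule substitutes the gap translations in their reordered positions.

```python
def apply_rule(tgt_template, gap_translations):
    """Apply the target side of a synchronous rule.

    tgt_template: list of terminal strings and integer gap indices.
    gap_translations: gap index -> list of target words for that gap.
    """
    out = []
    for sym in tgt_template:
        if isinstance(sym, int):
            out.extend(gap_translations[sym])  # fill the gap
        else:
            out.append(sym)                    # copy the terminal
    return out

# Rule X -> < X_1 ne X_2 pas , X_1 do not X_2 >, applied with
# X_1 = "je" -> "i" and X_2 = "veux" -> "want":
target = apply_rule([1, "do", "not", 2], {1: ["i"], 2: ["want"]})
print(" ".join(target))
```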
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<i><b>References</b></i></div>
<ul style="text-align: left;">
<li>Chiang, David. "A hierarchical phrase-based model for statistical machine translation." <i>Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics</i>. Association for Computational Linguistics, 2005.</li>
<li>Chiang, David. "Hierarchical phrase-based translation." <i>Computational Linguistics</i> 33.2 (2007): 201-228.</li>
</ul>
<div style="text-align: left;">
<br /></div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-46117059309019419562012-08-28T14:53:00.004+05:302012-08-28T14:53:56.777+05:30N-gram features for text classification<div dir="ltr" style="text-align: left;" trbidi="on">
Traditionally, text classification has relied on bag-of-words count features. For some experiments, I was wondering if n-gram counts could make for a good feature set. Once I generated the features, I knew I was in trouble: for the WSJ corpus, I got about 20 million features for a trigram model. I checked the literature and found this paper showing that n-gram features don't help much:<br />
<br />
<br />
<i><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.49.133&rep=rep1&type=pdf" target="_blank">A Study Using n-gram Features for Text Categorization</a></i>, Johannes Furnkranz<br />
<br />
<br />
Bigram and trigram features may give modest gains, but feature selection is obviously required. Feature selection based on document frequency or term frequency would be a simple approach.<br />
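A document-frequency cutoff is easy to sketch in pure Python (a library vectorizer would be used in practice): n-grams occurring in fewer than min_df documents are dropped from the vocabulary.

```python
from collections import Counter

def ngram_features(docs, n_max=3, min_df=2):
    """Count word n-grams (n = 1..n_max) per document and keep only
    those occurring in at least min_df documents -- a simple
    document-frequency cutoff that tames the feature explosion."""
    df = Counter()       # document frequency of each n-gram
    per_doc = []
    for doc in docs:
        toks = doc.lower().split()
        grams = Counter()
        for n in range(1, n_max + 1):
            for i in range(len(toks) - n + 1):
                grams[tuple(toks[i:i + n])] += 1
        per_doc.append(grams)
        df.update(grams.keys())  # each doc counted once per n-gram
    vocab = sorted(g for g, c in df.items() if c >= min_df)
    return vocab, [[grams.get(g, 0) for g in vocab] for grams in per_doc]

docs = ["the cat sat", "the cat ran", "a dog ran"]
vocab, X = ngram_features(docs, n_max=2, min_df=2)
print(vocab)
```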
<br />
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-21488219947323230452012-08-23T10:45:00.000+05:302012-08-23T10:45:18.209+05:30Origins of the Brahmi Script<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
This post is motivated by chapter 2 of James Gleick's book 'The Information', which discusses the evolution of writing. <b>Brahmi</b> is the mother script from which the scripts of all modern Indian and South-East Asian languages have evolved. It was first seen in Emperor Ashoka's rock edicts dating to the 3rd century B.C. It is thus one of the ancient world's <b>"alphabets"</b> - along with Greek, Phoenician and Aramaic. The alphabet is based on the idea that symbols represent phonemes, in contrast to other writing systems like logographic ones (e.g. Chinese, which employs symbols for words) or syllabic ones (e.g. Japanese, where symbols represent syllables). </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
All the alphabetic scripts are said to be derived from a single script, the Phoenician. In fact, the very word 'alphabet' comes from the first two symbols in the Greek script 'Alpha' and 'Beta'. There is a lack of clarity on the origin of the Brahmi script, with two primary categories of theories. One propounds that the Brahmi evolved from the Aramaic script (itself an evolution over the Phoenician). This is based on the proposed orthographic similarities between symbols in the scripts. (See Figure).<br />
<br />
The other theory proposes an indigenous development of the Brahmi script, based on the wide differences in how the writing systems work. I tend to favour this theory, though I must admit that my knowledge of this area is limited to reading a few articles and knowing some of the modern-day descendants of these scripts. The modern-day alphabets of Indian scripts are organized phonetically, and there is little phonetic ambiguity - as opposed to the Roman script. The earliest Semitic scripts (Phoenician, Aramaic) and even modern Arabic do not have vowels, whereas the so-called "true" alphabets - Greek and its modern Latin-derived scripts - still have room for ambiguity. Even if some symbols were borrowed from the Aramaic script, the design seems novel enough to call it a new style of scripting. Is there an alternative line of evolution of the script? The Indus Valley script is still undeciphered - could Brahmi have evolved from there? </div>
<div style="text-align: justify;">
<br /></div>
<br /></div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-91252107742200285432012-02-12T12:15:00.000+05:302012-02-12T12:15:28.897+05:30Indian English<div dir="ltr" style="text-align: left;" trbidi="on">
From Chandan Mitra's weekly column in the Pioneer, some hilarious examples of English usage:<br />
<br />
<br />
In a newspaper, describing a case of chain-snatching in which criminals shot dead the man who tried to resist and pursue the chain-snatchers, the reporter stated: “The deceased gave chase to the criminals who, however, managed to escape”!<br />
<br />
Police notice: “Take care of belongings. You may be theft”<br />
<br />
<br />
The article is interesting reading too.<br />
<span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><a href="http://dailypioneer.com/columnists/item/51044-dont-fast-you-may-be-theft-indlish-is-on-a-roll.html" style="color: #1155cc;" target="_blank">http://dailypioneer.com/columnists/item/51044-dont-fast-you-may-be-theft-indlish-is-on-a-roll.html</a></span><br />
<br />
<br />
<br /></div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com1tag:blogger.com,1999:blog-1879978874853957111.post-86425832633683685662012-01-14T22:35:00.001+05:302014-09-19T18:18:29.126+05:30Yet Another Moses Installation Guide<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Though Moses is a versatile MT system, its installation is still from the stone age. Let me document here some of the key points for navigating the installation of Moses. The intent is not to present a complete installation guide, but to highlight key issues that may crop up (as they cropped up for me). For a complete installation, <a href="http://www.statmt.org/moses_steps.html">this</a> is probably the best guide. Another useful installation guide can be found <a href="http://www.cfilt.iitb.ac.in/Moses-Tutorial.pdf">here</a>.
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
To install the Moses system, the following tools need to be installed. </div>
<div style="text-align: justify;">
</div>
<ul>
<li>Language modelling toolkit (SRILM, IRSTLM, etc.)</li>
<li>GIZA++ package which contains GIZA++ and mkcls</li>
<li>Moses decoder (version 1.0 and above)</li>
</ul>
<br />
<div>
<b>SRILM installation</b></div>
<div>
<ul style="text-align: left;">
<li>The primary installation reference is the INSTALL document that ships with the tool.</li>
<li>Install all pre-requisites mentioned in the SRILM installation guide. On Ubuntu I had to install the following packages: csh, g++-multilib, tcl-dev</li>
<li>Set the environment variable SRILM to point to the base directory of the install package before building SRILM.</li>
<li>Following the instruction manual with the SRILM download should be enough once the pre-requisites are installed. </li>
<li>The problems you may yet face are: </li>
<ul>
<li>Problem in identifying the architecture, especially if it is a 64-bit machine. To make sure that the install script correctly identifies the architecture, set the variable MACHINE_TYPE in sbin/machine-type.</li>
<li>Problems with TCL compilation. You may not need the TCL user interfaces at all, so it may be ok to disable their compilation. Set the variable NO_TCL = X in the file common/your_architecture_specific_makefile. </li>
</ul>
<li>Make sure you have added the $SRILM/bin and $SRILM/bin/$MACHINE_TYPE to the PATH variable</li>
<li><span style="color: red;"><i>Note: </i>SRILM 1.7.1 and above are not compatible with Moses</span> </li>
</ul>
<div>
<b>IRSTLM installation</b></div>
<div>
<ul style="text-align: left;">
<li>Ubuntu packages required: libtool make autoconf autotools-dev automake</li>
<li>The installation is pretty simple, just have to follow the installation guide</li>
<li>One caveat: sometimes it may be necessary to create a directory named 'm4' manually, if the first step fails</li>
</ul>
</div>
<div>
<br /></div>
<div>
<b>GIZA++ and mkcls installation</b></div>
</div>
<div>
<div>
<ul style="text-align: left;">
<li>You get both if you download the giza-pp tool. </li>
<li>Most straightforward installation. Download and 'make'.</li>
<li>Copy the binaries - GIZA++, mkcls, snt2cooc.out - to a new directory. </li>
</ul>
<div>
<b>XMLRPC Server</b></div>
</div>
<div>
<ul style="text-align: left;">
<li>An XML-RPC server is required if you want to run a web service providing translations. If you just want to get Moses running, you can skip this step.</li>
<li>Install the following packages: libxmlrpc-core-c3 libxmlrpc-core-c3-dev libxmlrpc-c3-dev libxmlrpc-c++4 libxmlrpc-c++4-dev </li>
</ul>
<b>Boost Library</b> <br />
The C++ Boost library is required for the installation of Moses. Boost 1.48 has a serious bug which breaks Moses compilation. Unfortunately, some Linux distributions (e.g. Ubuntu 12.04) ship broken versions of the Boost library. To fix this situation you can:<br />
<ul style="text-align: left;">
<li>For Ubuntu 12.04: Remove boost 1.48 from your distribution and install Boost 1.46 which is available in the distribution. This works most of the time. If not, build Boost from source as described below. </li>
<li>To install Boost manually and making it work with Moses, follow the instructions in the section titled "Manually Installing Boost" on this page: <a href="http://www.statmt.org/moses/?n=Development.GetStarted">http://www.statmt.org/moses/?n=Development.GetStarted</a> </li>
</ul>
</div>
<div>
<b>Moses installation</b></div>
</div>
<div>
<div>
<ul style="text-align: left;">
<li>The primary installation reference is the INSTALL document that ships with the tool.</li>
<li>SRILM or IRSTLM need to be installed before Moses is installed</li>
<li>Make sure you have installed the packages automake and libtool</li>
<li>Boost has to be installed</li>
<li>It is then a matter of just following the instructions. The command to be run is: </li>
<li><i>/usr/bin/bjam --with-srilm=&lt;path_to_srilm&gt; --with-xmlrpc-c=&lt;path_to_xmlrpc&gt; --with-boost=&lt;path_to_boost&gt;</i></li>
<ul>
<li>If XML-RPC is installed in /usr/bin, then the parameter would simply be '/usr'</li>
<li><i>--with-boost</i> is required only when Boost is installed in a non-standard directory. The <i>path </i>should contain both lib/lib64 and include directories </li>
</ul>
</ul>
</div>
<div>
Now Moses is ready to cross the Red Sea.<br />
<b><br /></b>
<br />
<h3 style="text-align: left;">
<b>Alternative ways of installing Moses</b></h3>
</div>
</div>
<div>
If you fail to install from the source as mentioned above, then there are a couple of simpler alternatives you can try:</div>
<div>
<br /></div>
<div>
One, use the pre-compiled binaries provided by the Moses team: </div>
<div>
<a href="http://www.statmt.org/moses/?n=Moses.Releases">http://www.statmt.org/moses/?n=Moses.Releases</a></div>
<div>
The pre-compiled version comes with IRSTLM and does not support XML-RPC to the best of my knowledge. However, it is handy to get started. </div>
<div>
<br /></div>
<div>
If that too runs into trouble, then you can try using the virtual machine provided by the Moses team. </div>
<div>
<br /></div>
<div>
<a href="http://www.statmt.org/moses/RELEASE-2.1/vm/">http://www.statmt.org/moses/RELEASE-2.1/vm/</a></div>
<div>
<br /></div>
<div>
If you are using Virtual Box, you can import the OVA images into VirtualBox. </div>
This guide may be useful for importing OVA images into VirtualBox:
<br />
<a href="http://www.maketecheasier.com/import-export-ova-files-in-virtualbox/">http://www.maketecheasier.com/import-export-ova-files-in-virtualbox/</a><br />
<div>
<br /></div>
<div>
I have not tried the virtual machine images, so let me know if they work. </div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com4tag:blogger.com,1999:blog-1879978874853957111.post-74598363011687239512011-09-23T20:17:00.000+05:302011-09-23T20:17:42.474+05:30Incorporating Linguistic Information into SMT Models<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<i>(Summary of the chapter 'Integrating Linguistic Information' in Philip Koehn's textbook <a href="http://www.statmt.org/book/">'Statistical Machine translation'</a>)</i></div>
<div style="text-align: justify;">
<i><br /></i></div>
<br />
<div style="text-align: justify;">
Traditional phrase-based Statistical Machine Translation (SMT) has relied only on the surface forms of words, but this can carry you only so far. Without considering any linguistic phenomena, no generalization is possible and the SMT system ends up being a translation memory. Various kinds of linguistic information need to be incorporated into the SMT process, such as: </div>
<br />
<ul style="text-align: left;">
<li style="text-align: justify;">Name Transliteration and Number script conversions</li>
<li style="text-align: justify;">Morphology changes - inflections, compounding, segmentation - which, if not handled, lead to data sparsity problems</li>
<li style="text-align: justify;">Syntactic phenomena like constituent structure, attachment, and head-modifier re-orderings. Vanilla SMT is designed to handle local re-orderings, but long-range dependencies are not handled well. </li>
</ul>
<br />
<div style="text-align: justify;">
One way to handle them is to pre-process the parallel corpus before training and then run the SMT tools. Pre-processing could include:</div>
<br />
<ul style="text-align: left;">
<li style="text-align: justify;">Transliteration and back-transliteration models need to be incorporated. An important problem is to identify the named entities in the first place.</li>
<li style="text-align: justify;">Splitting words for a morphology rich input language. Compounding and segmentation can be handled similarly. </li>
<li style="text-align: justify;">Re-ordering worries can be handled by re-ordering the input language sentences in a pre-processing before feeding it to the SMT system. This re-ordering can be done either by handcrafted rules or learnt from data. This could be shallow like POS tag based re-ordering rules or full fledged parsed based. </li>
</ul>
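To make the shallow re-ordering idea concrete, here is a toy sketch (my own illustration, not from the book): a single hypothetical POS-based rule that pushes verbs to the end of the clause, mimicking an SVO-to-SOV pre-ordering. A real system would obtain the POS tags from a tagger and use many such rules.

```python
# Toy pre-ordering sketch: one hand-crafted rule that moves verbs to the
# end of the clause, mimicking SVO -> SOV re-ordering. The POS tags are
# assumed to come from a tagger; the rule itself is purely illustrative.

def reorder_svo_to_sov(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs; returns re-ordered words."""
    verbs = [w for w, p in tagged_sentence if p.startswith("VB")]
    rest = [w for w, p in tagged_sentence if not p.startswith("VB")]
    return rest + verbs  # push all verbs to the end

tagged = [("Ram", "NNP"), ("ate", "VBD"), ("the", "DT"), ("mango", "NN")]
print(reorder_svo_to_sov(tagged))  # -> ['Ram', 'the', 'mango', 'ate']
```

The output order matches the Hindi word order राम ने आम खाया, which is exactly what pre-ordering is meant to achieve.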
<br />
<div style="text-align: justify;">
Similarly, some work may be done on the post processing side: </div>
<br />
<ul style="text-align: left;">
<li style="text-align: justify;">If the output language is morphologically complex, then morphological generation can take place in a post-processing step after SMT. This assumes that the SMT system has generated enough information to drive the output morphology.</li>
<li style="text-align: justify;">Alternatively, to ensure grammaticality of the output sentences, we can re-rank the candidate translations based on syntactic features like agreement and parse correctness. Note that a distinction has been made between parse quality as defined for parsing and parse quality as required for MT systems. </li>
</ul>
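A minimal sketch of what such n-best re-ranking might look like: each candidate's model score is combined with a syntactic feature score (here a hypothetical agreement check), and the list is re-sorted. The scores and the weight below are made up for illustration.

```python
# Toy n-best re-ranking sketch: combine the decoder's model score with a
# hypothetical syntax score and re-sort. All numbers are illustrative.

def rerank(nbest, syntax_weight=0.5):
    """nbest: list of (translation, model_score, syntax_score) tuples."""
    return sorted(nbest,
                  key=lambda c: c[1] + syntax_weight * c[2],
                  reverse=True)

nbest = [("cand A", -3.0, 0.2), ("cand B", -3.2, 1.0)]
print(rerank(nbest)[0][0])  # -> cand B (syntax score outweighs model score)
```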
<br />
<div style="text-align: justify;">
The problem with such pre-processing and post-processing components is that they are themselves prone to error. The errors of the individual components are not handled in an integrated framework, and hard decisions must be made at each component boundary. A probabilistic approach which incorporates all these pre- and post-processing components would be cleaner and more elegant. That is the motivation behind <a href="http://acl.ldc.upenn.edu/D/D07/D07-1091.pdf">the factored translation model</a>. In this model, the factors are annotations on the input and output words (e.g. morphology, POS). Translation and generation functions are defined over the factors, and these are integrated using a log-linear model. This provides a principled way to test a diverse set of features in a structured way. Of course, the phrase translation table will now grow, but this can be handled by using pre-compiled data structures. Decoding could also blow up, but pruning can be used to cut the search space.</div>
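To illustrate the idea, here is a rough sketch of a factored representation and a generation step. The 'surface|lemma|POS' packing, the romanized Hindi lemmas, and the tiny generation lexicon are my own invented examples, not the actual factored-model implementation.

```python
# Sketch of a factored representation: each word carries factor annotations
# (surface|lemma|POS here), and a generation function maps output-side
# factors back to a surface form. The lexicon below is made up.

def annotate(words, lemmas, tags):
    """Pack parallel factor streams into 'surface|lemma|POS' tokens."""
    return ["|".join(f) for f in zip(words, lemmas, tags)]

# hypothetical generation table: (lemma, morphology) -> surface form
GEN = {("khaa", "past.masc.sg"): "khaayaa",
       ("khaa", "past.fem.sg"): "khaayii"}

def generate(lemma, morph):
    return GEN[(lemma, morph)]

print(annotate(["Ram", "ate"], ["Ram", "eat"], ["NNP", "VBD"]))
# -> ['Ram|Ram|NNP', 'ate|eat|VBD']
print(generate("khaa", "past.masc.sg"))  # -> khaayaa
```

The point of the factored setup is that translation can be defined on lemmas and morphology separately, with a final generation step producing the surface form, rather than requiring every inflected form to appear in the parallel corpus.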
<br />
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-24580682785226341142011-09-23T18:09:00.000+05:302011-09-23T18:10:00.738+05:30Language Divergence between English and Hindi<div dir="ltr" style="text-align: left;" trbidi="on">
Comparing two languages is interesting, especially for an application like machine translation. Languages exhibit so many differences that it is mind-boggling to realize we navigate between them with ease. This paper, <a href="http://www.springerlink.com/content/t1005w166746727l/">'Interlingua-based English–Hindi Machine Translation and Language Divergence'</a>, summarizes the major differences between Hindi and English.<br />
<br />
I have tried to tabulate the observations in the paper below, to make a handy reference:<br />
<br />
<br />
<table cellspacing="0" cols="3" frame="VOID" rules="NONE">
<colgroup><col width="230"></col><col width="357"></col><col width="373"></col></colgroup>
<tbody>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;" width="230"><b>Factor</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;" width="357"><b>English</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;" width="373"><b>Hindi</b></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Word Order</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Subject-Verb-Object</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Subject-Object-Verb</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>Ram <b>ate</b> the mango</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">राम ने आम <b>खाया </b></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Modifiers</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Post modifier</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Premodifier</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>The Prime Minister of India</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">भारत का प्रधान मंत्री </span></i></td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>play well</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">अच्छे से खेलेंगे </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>X-positions</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Prepositions</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Postpositions</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>of India </i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">भारत का </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Overloading</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Times New Roman';"><br /></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>John ate rice with curd</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Times New Roman';"><br /></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>John ate rice with a spoon</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Times New Roman';"><br /></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Compound Verbs</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">not prevalent</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">very common</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Conjunct Verbs</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">not prevalent</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">very common</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">वह गाने लगे </span></i></td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">रुक जाओ </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Times New Roman';"><br /></span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Respect</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">No special words</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Words indicating respect</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><span style="font-family: 'Lohit Hindi';">आप, हम </span></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="18" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Person</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Uses 2nd person for 3rd person</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">He obtained his degree</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">आपने अमेरिका से डिग्री प्राप्त की </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Gender</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Masculine, feminine, neuter</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Masculine, feminine</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Gender-specific possessive pronouns</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">English has them</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Hindi lacks them</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>he, she</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">वह </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Morphology</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Poor</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Rich</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Null subject divergence</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Subject dropped in certain conditions</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">There was a king</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><span style="font-family: 'Lohit Hindi';">एक राजा था </span></td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">I am going</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><span style="font-family: 'Lohit Hindi';">जा रहा हूँ </span></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Pleonastic divergence</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Pleonastic dropped</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">It is raining</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">बारिश हो रही है </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Conflational divergence</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">no appropriate word</td>
</tr>
<tr>
<td align="LEFT" height="22" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>Brutus stabbed Caesar</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">ब्रूटस ने सीसर को छुरे से मारा </span></i></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Categorical divergence</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">change in POS category</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">They are competing</td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><span style="font-family: 'Lohit Hindi';">वे मुकाबला कर रहे है </span></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b>Head swapping</b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><br /></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;">Head and modifier are exchanged</td>
</tr>
<tr>
<td align="LEFT" height="17" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><b><br /></b></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i>The play is on</i></td>
<td align="LEFT" style="border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; border-top: 1px solid #000000;"><i><span style="font-family: 'Lohit Hindi';">खेल चल रहा है </span></i></td>
</tr>
</tbody>
</table>
<div>
<br /></div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-87343165556768148092011-09-21T21:48:00.000+05:302011-09-21T21:48:22.839+05:30Aligning Sentences to build a parallel corpus<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
This is a <a href="http://dl.acm.org/citation.cfm?id=972455">really old paper</a>, from Gale & Church, on building a sentence-aligned parallel corpus from an unaligned corpus. A dynamic programming formulation with a novel distance measure is used to align the sentences. For a method this simple, the reported results on the Hansards corpus are impressive. Of course, the input corpus is paragraph-aligned. </div>
<div>
<br /></div>
<div style="text-align: left;">
The basic premise is simple: sentences with fewer characters in one language correspond to sentences with fewer characters in the other language, and likewise for longer sentences. Based on this idea, the distance between two sentences is defined via a random variable X: the number of characters in language L2 per character of language L1. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I tried to see the behavior of this variable for the English-Hindi language pair. On a 14000 sentence parallel corpus, here are the results: </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
mean(X): 0.99, i.e. almost one Hindi character per English character, which agrees with the paper's claims. Interestingly, if whitespace is not counted, the mean drops to 0.96. </div>
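These statistics are straightforward to reproduce. Here is a minimal sketch (the sentence pairs below are made up for illustration, and whitespace is counted, as in the mean reported above):

```python
# Estimate X = characters in L2 (Hindi) per character in L1 (English)
# over a sentence-aligned parallel corpus. The pairs below are toy
# examples standing in for the 14000 sentence corpus.
pairs = [
    ("The play is on", "खेल चल रहा है"),
    ("He went home", "वह घर गया"),
    ("I did not say he stole the money", "मैंने नहीं कहा कि उसने पैसे चुराए"),
]

ratios = [len(hi) / len(en) for en, hi in pairs]
mean = sum(ratios) / len(ratios)
var = sum((r - mean) ** 2 for r in ratios) / len(ratios)
print(round(mean, 2), round(var, 4))
```

On the real corpus, the same two lines of arithmetic give the mean and variance quoted above.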
<div style="text-align: left;">
variance(X): 0.01979136 - very low, so the mean is very reliable. A linear fit can't get better than this: </div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTbFRWB0tjit2bwFUhJm1oH2qU1UuiCD_wumgayqU9skJ4Ky3-j5X2rA03nfD4gFZQNLL7RxNepJmOMh5x1UeCmlD4Rc7_TqSHpfYsoNUgL6kDltdaqru372hY2xWhm_blSakXd1oscew/s1600/Screenshot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="292" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTbFRWB0tjit2bwFUhJm1oH2qU1UuiCD_wumgayqU9skJ4Ky3-j5X2rA03nfD4gFZQNLL7RxNepJmOMh5x1UeCmlD4Rc7_TqSHpfYsoNUgL6kDltdaqru372hY2xWhm_blSakXd1oscew/s320/Screenshot.png" width="320" /></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<a href="http://code.google.com/p/nltk/source/browse/trunk#trunk%2Fnltk_contrib%2Fnltk_contrib%2Falign">NLTK provides an implementation</a> of the Gale-Church alignment algorithm. I tried running it on a perfectly parallel corpus, but the algorithm ends up misaligning the sentences. Reducing mean(X) to 0.9 did not help either. I wonder what's going on? </div>
</div>
Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-47721513056694998462011-08-31T10:00:00.004+05:302011-08-31T10:31:29.139+05:30Watson - The Quiz Champion<p style="text-align: justify;">You must have heard of IBM's Watson system. It is, of course, the computer that won the Jeopardy competition against the show's previous champions. Jeopardy is a popular quiz show in which the competitors are provided clues and have to give questions that satisfy these clues. For example, a clue like '<em>This computer beat the reigning world chess champion</em>' would elicit a question '<em>Who is Deep Blue?</em>'. As you can see, the questions given by the competitors are easy questions of the nature '<em>What is</em>', '<em>Who is</em>', so the Jeopardy question-answer format can be considered like that of any other quiz show. The clues, however, are complex, covering a wide array of topics, and could include puns, puzzles, and maths. The competitors also place bets on each question. Competing at 'Jeopardy' thus requires the right combination of 'natural language understanding, broad knowledge, confidence and strategy'. </p><p style="text-align: justify;">Watson's victory thus represents a major milestone for natural language processing, and particularly the sub-area known as 'Question-Answering'. Question-Answering systems have great practical use in building expert systems, customer support systems, decision-making tools and enterprise search systems. </p><p style="text-align: justify;">Watch Watson's winning performance here: </p><iframe width="420" height="345" src="http://www.youtube.com/embed/qpKoIfTukrA?wmode=opaque" frameborder="0"></iframe>
<br /><iframe width="560" height="345" src="http://www.youtube.com/embed/YLR1byL0U8M?wmode=opaque" frameborder="0"></iframe>
<br /><p>
<br /></p><p style="text-align: justify;">This paper, <a href="http://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf" target="_blank">Building Watson: An Overview of the DeepQA project</a>, from IBM provides an overview of Watson and the DeepQA architecture that underlies it. The DeepQA architecture defines a framework for developing QA systems in an extensible and modular manner, allowing different components to be customized, and for building robust QA systems that can be ported across domains. Figure 1 shows a high-level diagram of Watson's major components, and how queries are routed through them.</p><ol><li style="text-align: justify;"><strong>Query Analysis</strong>: This is the first stage, where the input clue is analyzed to determine the question category (puzzle, pun, mathematical, numeric, logical, etc.) and the answer type (person, location, organization, etc.). Complex clues are also decomposed into simpler clues. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;"><strong>Hypothesis Generation</strong>: Watson has at its disposal many sources of information like encyclopedias, books, and lists of things like people, countries, etc. Watson does not attempt to get the correct answer straightaway. Instead, it first focuses on generating as many candidate answers as possible, called 'hypotheses'. This is to ensure that good answers are not missed in the pursuit of the perfect answer. The attempt is to increase recall at this stage. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;"><strong>Soft Filtering:</strong> Watson may generate hundreds or thousands of hypotheses, which then have to be analyzed in detail to find the correct answer. To limit this deep analysis to only the most relevant answers, Watson filters out the bad candidates using a few lightweight techniques, like checking for a mismatch between the expected and candidate answer types. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;"><strong>Hypothesis and Evidence scoring:</strong> Now Watson does a deep analysis of the candidate answers by employing sophisticated linguistic and statistical techniques, and looks to gather evidence for each hypothesis. This is one of the most critical parts of Watson since the evidence collected will determine how good the answer is and how confident Watson can be about it. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;"><strong>Merging and Ranking:</strong> Once the evidence is collected, confidence scores are generated for each candidate and the candidates are ranked. Then, looking at the top answer's confidence level, Watson decides whether it should answer the question or not. </li>
<br /></ol>
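To make the data flow concrete, here is a toy caricature of these five stages. Everything below - the candidate lists, the type table, and the scorer - is invented for illustration; each fake scorer stands in for Watson's deep linguistic and statistical analytics:

```python
# Toy caricature of the DeepQA stages: over-generate hypotheses,
# softly filter, score evidence, then merge and rank by confidence.
# All data and scorers below are invented for illustration.

def generate_hypotheses(clue):
    # Stage 2: over-generate candidates to maximize recall
    return ["Deep Blue", "Watson", "Kasparov", "ENIAC"]

def soft_filter(candidates, expected_type):
    # Stage 3: cheap checks drop obvious answer-type mismatches
    types = {"Deep Blue": "computer", "Watson": "computer",
             "Kasparov": "person", "ENIAC": "computer"}
    return [c for c in candidates if types[c] == expected_type]

def score_evidence(candidate, clue):
    # Stage 4: a fake evidence scorer stands in for deep analysis;
    # real scorers each return their own confidence estimate
    evidence = {"Deep Blue": 0.9, "Watson": 0.3, "ENIAC": 0.1}
    return evidence.get(candidate, 0.0)

def answer(clue, expected_type, threshold=0.5):
    # Stage 5: rank by merged confidence; abstain below the threshold
    ranked = sorted(
        ((score_evidence(c, clue), c)
         for c in soft_filter(generate_hypotheses(clue), expected_type)),
        reverse=True)
    conf, best = ranked[0]
    return best if conf >= threshold else None

clue = "This computer beat the reigning world chess champion"
print(answer(clue, "computer"))  # → Deep Blue
```

The abstention in the last stage mirrors Watson's decision to buzz in only when its merged confidence clears a threshold.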
<br /><p><a href="http://api.ning.com/files/Iyak1H5Usv*ZxLRFvSNVrz-N9VCzet5yUboI6L0ZbUNHv-GhmwvoCsrdFwtv4YUCukQpoKd8JjQSLZP2Y2UqMPIL4m9sTOqs/deepQA.png" target="_self"><img src="http://api.ning.com/files/Iyak1H5Usv*ZxLRFvSNVrz-N9VCzet5yUboI6L0ZbUNHv-GhmwvoCsrdFwtv4YUCukQpoKd8JjQSLZP2Y2UqMPIL4m9sTOqs/deepQA.png?width=600" width="600" class="align-full" /></a></p>
<br /><p style="text-align: center;">Figure 1: DeepQA Architecture (Source: The IBM paper)</p><p style="text-align: justify;">The flexibility in the DeepQA architecture is achieved through the use of the UIMA text analysis framework. At one point in the trials, Watson was taking about two hours to generate an answer. The answer was to parallelize Watson with UIMA-AS, and this got the response time down to the quiz show's average of 2 to 5 seconds. The improvement in accuracy is even more startling. When the IBM team started working on Watson, the difference between the show's participants and early prototypes of Watson was huge. Figure 2 depicts the evolution in Watson's performance. It started from a baseline where the precision and recall were nowhere near the cloud of points corresponding to actual human competitors, but gradually reached human-level performance. </p><p><a href="http://api.ning.com/files/NOSv*YHdv4D-SexdVctwURtnjiC*vHpyd4Gp8EUPdqDE4Y-LLCxi9cDD*5kBeeUnXV-yGZW4adeYjwLviUOash2z1diPgL33/watsonimprovemtn.png" target="_self"><img src="http://api.ning.com/files/NOSv*YHdv4D-SexdVctwURtnjiC*vHpyd4Gp8EUPdqDE4Y-LLCxi9cDD*5kBeeUnXV-yGZW4adeYjwLviUOash2z1diPgL33/watsonimprovemtn.png?width=600" width="600" class="align-full" /></a></p><p style="text-align: center;"> Figure 2: Watson's accuracy over time (Source: The IBM paper)</p><p style="text-align: justify;">What enabled Watson to reach this level of performance? Many of the underlying analysis algorithms aren't new, but have been around in the research community for a long time. More than groundbreaking original research, it is pragmatic engineering that lies at the core of Watson's success, and the following are the salient contributory factors:</p><ul><li style="text-align: justify;">Building an end-to-end system: Very early, the team built a baseline end-to-end system and then kept iterating and improving the system. 
They defined end-to-end evaluation metrics which captured the performance of the system as a whole, rather than focusing only on individual component accuracies in the initial stages. This helped them make the correct trade-offs. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;">Pervasive Confidence estimation: Every component in Watson gives a confidence estimate along with its response. This is critical since these confidence scores can be aggregated to get the final confidence on the answers and allows easy integration of components of varying accuracy. The rule is that no component is assumed to be perfect, but each makes available its confidence estimate of the answers. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;">Many experts: There may be competing algorithms to do the same task. Rather than using the best, the system uses multiple algorithms so as to get diverse results and evidence. The confidence estimates help to blend the diverse results. </li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;">Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.</li><div style="text-align: justify;">
<br /></div><li style="text-align: justify;">Massive parallelism: As mentioned, exploiting massive parallelism allows looking through a large number of hypotheses.</li></ul><div style="text-align: justify;">
<br /></div><div style="text-align: justify;">(PS: Cross-posted from <a href="http://peepaal.org/profiles/blogs/watson-the-quiz-champion">my Peepaal blog post</a>)</div><ul></ul>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-21401953921419086842011-07-20T19:38:00.005+05:302012-04-10T20:40:35.339+05:30Statistical Machine Translation - IBM Models<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
At CFILT, a few of us have been working on understanding the IBM Models thoroughly. The <a href="http://acl.ldc.upenn.edu/J/J93/J93-2003.pdf">IBM paper</a> on SMT is a classic and seminal paper in the history of Machine Translation, and a must-read for anybody wanting to work in this area. It's not an easy read, and we spent quite a lot of time figuring out how the estimation results are derived. Some notes sprang out of these discussions; they work out, in detail, the steps missing in the original paper. Hopefully they will be useful for everybody. These scanned notes on the estimation for Model 1 and Model 2 can be found <a href="https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0BxsJNvcAVU0HZTg1MjM5ZGYtNjI5Yi00Y2Y3LTg4ZWUtZWY2ZTY4MmQ2ZTUy&hl=en_GB">here</a>. This is not a replacement for the original paper, but is just meant to supplement it. Thanks to <a href="http://www.cse.iitb.ac.in/~miteshk/">Mitesh</a> for helping out with the key steps in the derivation. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
You can find the notes <a href="https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0BxsJNvcAVU0HZTg1MjM5ZGYtNjI5Yi00Y2Y3LTg4ZWUtZWY2ZTY4MmQ2ZTUy&hl=en_GB">here</a><br />
<br />
Update: Finally I have created a PDF of the notes for Model 1 derivation. You can find them <a href="https://docs.google.com/open?id=0BxsJNvcAVU0HU1lETkdkeS0ybmc" target="_blank">here</a>. A few slides introducing SMT can be found <a href="https://docs.google.com/open?id=0BxsJNvcAVU0HUWdqbkN6OHNnQlk" target="_blank">here</a>. </div>
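To complement the derivation, here is a minimal sketch of the Model 1 EM estimation itself - the standard textbook EM loop, with a toy corpus, uniform initialization, and the NULL word omitted for brevity (this is not code from the notes):

```python
from collections import defaultdict

# Toy parallel corpus of (foreign, english) sentence pairs
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

# Initialize translation probabilities t(f|e) uniformly
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)  # expected counts c(f, e)
    total = defaultdict(float)  # normalizers per English word
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)  # normalization over e
            for e in es:
                c = t[(f, e)] / z           # fractional alignment count
                count[(f, e)] += c
                total[e] += c
    for (f, e) in list(count):
        t[(f, e)] = count[(f, e)] / total[e]  # M-step: renormalize

print(round(t[("das", "the")], 2), round(t[("haus", "house")], 2))
```

Even on this three-pair corpus, the expected counts quickly concentrate: t(das|the) and t(haus|house) grow toward 1 with each iteration, which is exactly the behavior the estimation derivation predicts.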
</div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-9754617660200947272010-12-21T22:46:00.002+05:302010-12-21T22:49:16.402+05:30Beauty of Language<p style="text-align: left;">Language is so ambiguous, and hence so difficult to analyze. I came across an extreme example the other day, which is representative of the ambiguity in dealing with language. The following sentence can have different meanings depending upon how it is spoken:<br /><br /><em>I didn't say he stole the money</em>.<br /><br />The change in meaning comes from varying which word is stressed while speaking. Here are a few interpretations of the sentence, with the stressed word in bold.<br /><br /><em><strong>I</strong> didn't say he stole the money</em><br />... someone else may have said it<br /><br /><em>I <strong>didn't</strong> say he stole the money</em><br />... the literal meaning<br /><br /><em>I didn't <strong>say</strong> he stole the money</em><br />... I just hinted or implied it<br /><br /><em>I didn't say <strong>he</strong> stole the money</em><br />... I didn't mean him<br /><br /><em>I didn't say he <strong>stole</strong> the money</em><br />... maybe he just borrowed it, with the intention of returning it</p><em>I didn't say he stole <strong>the</strong> money</em><br /><p style="text-align: left;"> ... not that money</p><em>I didn't say he stole the <strong>money</strong></em><br /><p style="text-align: left;"> ... 
not the money, I mean something else - xyz ...<br /><br />Most common situations may not be this extreme, but the example serves to highlight the challenges of understanding text, and currently the state of the art is just skimming the surface.</p><p style="text-align: left;">PS: Cross-posted from my <a href="http://peepaal.org/profiles/blogs/the-beauty-of-language">Peepaal blog post</a><br /></p>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-89748183995638664442010-01-26T13:55:00.003+05:302010-01-26T14:16:17.022+05:30Scalable Machine Learning - Apache Mahout<div style="text-align: justify;">Machine learning algorithms are pretty computationally intensive, work on huge amounts of data and take a lot of time to run. That makes them obvious candidates for running on data-parallel distributed programming models like Map-Reduce.<br /><br />Although Google's <a href="http://labs.google.com/papers/mapreduce.html">Map-Reduce paper</a> does talk about it, there was not much available in the public domain for doing machine learning at a distributed scale. <a href="http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf">Andrew Ng's paper</a> gives a common mathematical framework for modeling the most common machine learning algorithms so that they can be parallelized. It's basically built around the idea of representing computations as summations of simpler computations. Each simple computation can be a map task, with the final summation being the reduce task.<br /><br /><a href="http://www.ibm.com/developerworks/java/library/j-mahout/">Apache Mahout</a> is a project from the Apache Foundation that started off from Ng's paper and already has implementations of many ML algorithms running on Hadoop. 
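The summation idea is easy to see for linear regression, in the spirit of the regression example in Ng's paper: each mapper emits per-record sufficient statistics, and the reducer just sums them. The code below is an illustrative single-machine sketch (not Mahout's API):

```python
from functools import reduce

# Toy data: (features, target) pairs for a least-squares fit y ≈ w·x.
# The data is invented; the true weights here are (1, 2).
data = [((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]

def mapper(example):
    """Per-record sufficient statistics: x xᵀ (2x2) and x·y (2-vector)."""
    (x0, x1), y = example
    xxT = [[x0 * x0, x0 * x1], [x1 * x0, x1 * x1]]
    xy = [x0 * y, x1 * y]
    return xxT, xy

def reducer(a, b):
    """Elementwise sum of two partial statistics - the 'summation'."""
    (A1, v1), (A2, v2) = a, b
    A = [[A1[i][j] + A2[i][j] for j in range(2)] for i in range(2)]
    v = [v1[i] + v2[i] for i in range(2)]
    return A, v

A, v = reduce(reducer, map(mapper, data))
# Solve the 2x2 normal equations A w = v (Cramer's rule, for brevity)
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
w = [(A[1][1] * v[0] - A[0][1] * v[1]) / det,
     (A[0][0] * v[1] - A[1][0] * v[0]) / det]
print(w)  # → [1.0, 2.0]
```

Because the per-record statistics are summed associatively, the map calls can run on any number of machines and the reduce step only ever sees small fixed-size matrices - which is what makes the formulation Hadoop-friendly.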
In addition, Mahout also contains the Taste library for building recommendation and collaborative filtering systems.<br /><br />Hoping to read more on open-source ML and practical ML. A couple of books I am looking forward to reading:<br /><ul><li><a style="font-style: italic;" href="http://oreilly.com/catalog/9780596529321">Programming Collective Intelligence</a>, Toby Segaran</li><li><a style="font-style: italic;" href="http://www.manning.com/ingersoll/">Taming Text</a>, Grant S. Ingersoll and Thomas S. Morton</li></ul><br /></div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-31659849112010297612009-08-11T05:38:00.000+05:302009-08-11T05:39:42.816+05:30Book Review: The Numerati<div style="text-align: justify;">With the advent of the Web and the fall in the price of electronics, we have seen an explosion in digital data, from huge databases collecting various pieces of information to ever larger collections of documents. The <a href="http://www.amazon.com/Numerati-Stephen-Baker/dp/0618784608">Numerati</a> (a portmanteau of 'numbers' and 'Illuminati') are the statisticians, mathematicians, computer scientists, linguists and others involved in making sense of this data using sophisticated statistical techniques. 
The book describes the kind of problems being solved in the following areas, citing various examples from organizations like IBM, Intel, Umbria, etc.:<br /></div><ul style="text-align: justify;"><li>Workers - building employee profiles, understanding employee networks, using them for optimal use of resources</li><li>Shoppers - microtargeting shoppers using personal information to customize service, give recommendations and increase sales</li><li>Voters - understanding voter intent and issues, so that campaign messages can be targeted to focussed groups.</li><li>Bloggers - understanding public opinion from the information on the blogosphere, useful for gauging sentiment about products, etc.<br /></li><li>Medicine - Baker focusses on futuristic health monitoring (like floor tiles which capture your walking patterns!), whereas he totally ignores contemporary challenges and work in analyzing medical records, genomic and proteomic data.</li><li>Terrorism<br /></li><li>Match Making</li></ul><div style="text-align: justify;">All this comes at a cost. The Numerati have access to vast amounts of personal data, and we don't want an Orwellian Big Brother who is going to use it to learn about us, turn us into commodities and control our lives.<br /><br />That's about it in the book - it's a brisk read, and you can give it a miss if you are already familiar with the above topics.</div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-59288369595680425132009-08-11T05:36:00.000+05:302009-08-11T05:37:12.441+05:30Book Review: The Lady Tasting Tea<p align="justify">A lady claims that the taste of tea differs when milk is poured into the tea, as opposed to tea being added to a cup of milk. Everyone at the small party scoffs at the suggestion, except Ronald Aylmer Fisher. Fisher designs an experiment that would statistically establish the lady's claims. 
He creates a sample set containing cups of tea prepared in either way, and lo and behold - the story goes that the lady identifies each cup correctly. Fisher uses this example to explain the design of experiments in his book 'The Design of Experiments'. This anecdote sets up the book. '<a href="http://www.amazon.com/Lady-Tasting-Tea-Statistics-Revolutionized/dp/0805071342/ref=sr_1_1?ie=UTF8&s=books&qid=1249892070&sr=1-1">The Lady Tasting Tea</a>' is the story of the development of statistics, Fisher having built the pillars of statistics as it stands today.</p><p align="justify">I started reading this book while looking around to brush up my statistics; I thought it would be a good idea to know the history of the subject I am exploring. That's particularly relevant in sciences filled with uncertainties, like statistics, economics and linguistics, where the characteristics of individuals seem to contribute to the development of the theory, and there's a story behind things which seem arbitrary. </p><p align="justify">David Salsburg takes us through an entertaining journey starting with the earliest breakthroughs of Karl Pearson and William Gosset, going on to the pioneering foundational works of the acerbic genius Ronald Fisher, the cheerful Jerzy Neyman, and the multitalented Andrei Kolmogorov. Apart from these pioneers, Salsburg very vividly sketches the lives and contributions of Egon Pearson (hypothesis testing), Chester Bliss (probit analysis), John Tukey (exploratory data analysis), Frank Wilcoxon (non-parametric methods), EJG Pitman (non-parametric methods), Prasanta Chandra Mahalanobis (sampling theory), Samuel Wilks (founder of the Statistical Research Group, Princeton), George Box (robust statistics) and W. Edwards Deming (statistical quality control). </p><p align="justify">Some of the chapter names are interesting, and they are as good as the title of the book. 
It reminds me of <a href="http://www.amazon.com/Mythical-Man-Month-Software-Engineering-Anniversary/dp/0201835959/ref=sr_1_1?ie=UTF8&s=books&qid=1249892132&sr=1-1">'The Mythical Man-Month</a>''s memorable illustrative sketches. Sample this: </p><ul><li><div align="justify">The Mozart of Mathematics - Andrei Kolmogorov</div></li><li><div align="justify">The Picasso of Statistics - John Tukey</div></li><li><div align="justify">The March of the Martingales - on the work of Paul Levy</div></li></ul><p align="justify">Read this if you are a fan of scientific history. </p>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-31482024360657017122009-05-02T17:08:00.007+05:302009-05-02T18:44:35.462+05:30Text Engineering Frameworks<span style="font-weight: bold;font-size:100%;" >What is a text engineering framework? </span><br /><div style="text-align: justify;"><br />With the volume of unstructured text going through the roof, and the need to make sense of it, the efforts to analyze it have grown apace. Different software tools for language analysis and data mining have been developed, attacking myriad language analysis problems. While each system concentrates on solving the problem at hand, there remains the unenviable task of gluing these language technologies together. <span style="font-style: italic;">All language technologies need to worry about common problems like representation of data and metadata, modularization of the software components, and interaction between them.</span><br /><br />Each system takes its own approach to handling these problems, in addition to solving the central problem. This is where a text engineering framework steps in. 
What a text engineering framework provides is an architecture and out-of-the-box support for rapid development of highly modularized, scalable language technology components which can interface with other components - thus improving the process of creating language technology applications. The framework does all the plumbing necessary to create interesting language technology applications. <span style="font-style: italic;">A good analogy would be that the framework is the OS platform on which applications are built. </span><br /><br /><span style="font-weight: bold;font-size:100%;" >Architecture of a Text Engineering Framework</span><br /><br />While different systems may have their own architectures, the generic architecture described here is the one that forms the basis of the two most popular text engineering frameworks, <a href="http://gate.ac.uk">GATE</a> (General Architecture for Text Engineering) and <a href="http://incubator.apache.org/uima/">UIMA</a> (Unstructured Information Management Architecture). The two key services that the framework provides are data/metadata management services and analysis component development services.<br /><br /><span style="font-weight: bold;font-size:100%;" >Data Management Services</span><br /><br />The most important problem facing NLP tools is the management of data, hence the representation of data is given central importance in the framework. The basic unit of unstructured data to be analyzed is a <span style="font-style: italic; font-weight: bold;">Document</span>. This corresponds to a single artifact to be analyzed, like a single medical report, a news article, etc. The unstructured data need not be restricted to text; it could be audio, video or other multimedia data. The focus of this article is text, but most of the concepts elaborated here would apply to other media too. 
In NLP applications, it is common to process large collections of documents for analysis. The framework represents a collection of Documents by a <span style="font-style: italic; font-weight: bold;">Corpus</span> abstraction.<br /><br />Each NLP tool generates metadata for the Document. For instance, a tokeniser would generate tokens, a POS tagger would generate Part-Of-Speech tags for each token, a noun phrase chunker would identify noun phrase chunks and a named entity recognizer would generate labels for chunks of text. There needs to be a consistent method to represent all this metadata. This is achieved by using an <span style="font-weight: bold; font-style: italic;">Annotation</span> object, which represents metadata associated with a contiguous chunk of text. To illustrate the idea, consider the following sentence:<br />"<span style="font-style: italic;">In a perfect world</span><span style="font-style: italic;">, all the people would be like cats are, at two o'clock in the afternoon</span>."<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_-AgL2GKHKuFXBkRzY2y_TaNhkTKJHbJaL3PtfsjUdmtEGjgj_uar5O9c_WzSBX2pscNV4hd5Jw_gj5KUy8WFQWa9SLNfCESBbQRKQJ0u25tslkUBwleXUhADvEdZK2ru6gekKRca56E/s1600-h/annotation.jpeg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 477px; height: 109px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_-AgL2GKHKuFXBkRzY2y_TaNhkTKJHbJaL3PtfsjUdmtEGjgj_uar5O9c_WzSBX2pscNV4hd5Jw_gj5KUy8WFQWa9SLNfCESBbQRKQJ0u25tslkUBwleXUhADvEdZK2ru6gekKRca56E/s400/annotation.jpeg" alt="" id="BLOGGER_PHOTO_ID_5331210087128153058" border="0" /></a><br />The tokenizer would identify tokens, each token like "perfect" represented by an <span style="font-weight: bold;">Annotation</span>, whose type is "<span style="font-weight: bold; font-style: italic;">Token</span>". 
Each annotation has a start and end offset associated with it, which identifies its position in the <span style="font-weight: bold;">Document</span>. Information about the annotation can be stored in key-value pairs called <span style="font-weight: bold; font-style: italic;">Features</span>. This allows arbitrarily complex data to be associated with the annotation. For instance, the Token annotation could have a "string" feature to represent the text of the token, a "kind" feature to indicate if the token is a word, number, or punctuation, and a "root" feature which contains its morphological root.<br /><br />The scheme of representing metadata described above allows different kinds of metadata from different NLP components to be accessed and manipulated using the same interface. Positional information about the metadata can be captured, and arbitrarily complex data can be associated - since the feature values could be complex objects themselves. Annotations can be added at various levels of detail to the same chunk of text. For instance, the phrase "<span style="font-style: italic;">a perfect world</span>" can have "Token" annotations for each token, "POS" annotations to represent part-of-speech information for each token, and an "NP" annotation over the entire phrase to represent a noun phrase chunk.<br /><br />It should now be obvious that the annotations constitute a data exchange format between various NLP components, used to build more complex analyses of the text. An entire declarative type system can be built using these annotations for an application, as is done in UIMA. It is possible to do pattern matching over these annotations, as provided by the JAPE language in GATE. The frameworks provide implementations of these abstractions, thus freeing applications from the data management chores.<br /><br />The architecture described above evolved during the TIPSTER conferences. 
One of the popular ways of serializing this data is XML stand-off markup, which separates the annotation metadata from the data.<br /><br /><span style="font-weight: bold;font-size:100%;" >Text Analysis Development Services</span><br /><br />NLP applications generally consist of a number of steps, each doing some part of the analysis and building upon the analysis done in the previous stage. To support this application development paradigm, the framework represents each NLP task by a processing resource (PR). The PR is a component which performs a single task like tokenizing, POS tagging, or something even simpler like mapping one set of annotations to another (for adaptation purposes). The data interface to the PR is specified by the kind of input annotations that it requires and the annotations it generates. For instance, the POS tagger requires "Token" annotations as input and generates "POS" annotations as output. The PR's role can be more accurately characterized as an annotator. Each PR is a reusable software component that can be used in creating NLP applications. The same POS tagger can be used in different applications as long as its input and output requirements are satisfied. A number of PRs can be strung together to create a pipeline. 
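In Python-flavoured pseudocode (the class and method names here are illustrative, not GATE's or UIMA's actual APIs), the Document/Annotation/PR abstractions might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Metadata over a contiguous span of the document text."""
    type: str            # e.g. "Token", "POS", "NP"
    start: int           # character offset where the span begins
    end: int             # character offset where the span ends
    features: dict = field(default_factory=dict)  # key-value metadata

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

class Tokenizer:
    """A minimal processing resource (PR): adds 'Token' annotations."""
    def process(self, doc):
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            doc.annotations.append(
                Annotation("Token", start, start + len(word),
                           {"string": word}))
            pos = start + len(word)
        return doc

def run_pipeline(doc, prs):
    """A sequential pipeline: each PR consumes/produces annotations."""
    for pr in prs:
        doc = pr.process(doc)
    return doc

doc = run_pipeline(Document("In a perfect world"), [Tokenizer()])
print([a.features["string"] for a in doc.annotations])
# → ['In', 'a', 'perfect', 'world']
```

A POS tagger PR would slot into the same `prs` list after the tokenizer, reading the "Token" annotations and adding "POS" ones - which is exactly the loose coupling the frameworks are designed for.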
An example of an NP-chunking pipeline is shown below.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcJTtlZNf_h3c4RqsyzpoXO1Mbjf1bJmG8h5ri5ppRJ3bh3zcxAbSIoQiq7KDeUHrkr0PBRGWS1GI1B5Wo58_SjXcCrm9FeECykjd9ACoa5UldMGXgmLrVDxhm8-eZ9Hv24rMqJ97mOv0/s1600-h/pipeline.jpeg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 217px; height: 193px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcJTtlZNf_h3c4RqsyzpoXO1Mbjf1bJmG8h5ri5ppRJ3bh3zcxAbSIoQiq7KDeUHrkr0PBRGWS1GI1B5Wo58_SjXcCrm9FeECykjd9ACoa5UldMGXgmLrVDxhm8-eZ9Hv24rMqJ97mOv0/s400/pipeline.jpeg" alt="" id="BLOGGER_PHOTO_ID_5331212389277518642" border="0" /></a><br />This is a sequential pipeline, but you can also imagine conditional, looped and other pipeline configurations. The scheme described above constitutes a modular, loosely-coupled architecture for a text engineering application. Each PR in the pipeline may be replaced by an equivalent PR as long as it satisfies the data interface requirements, allowing you to test different configurations. The framework defines the common interfaces for PRs, provides different pipeline implementations and allows for declarative specification of PRs and pipelines. In a nutshell, the framework provides all the plumbing required to build an NLP application, while the developer can focus on developing the smart innovations.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;font-size:100%;" >Other facilities provided by the framework</span><br /><br /></span>To make application development easier:<br />1. The framework provides visual tools for managing language resources, creating pipelines, running applications, observing annotations, editing annotations, and creating training sets.<br />2. 
The framework may ship with off-the-shelf components for common NLP tasks like tokenization, sentence identification, dictionary lookups, POS tagging, machine learning interfaces, etc. This allows rapid prototyping of applications using these ready-to-use components. GATE, for example, ships with the ANNIE toolkit.<br />3. The framework developers maintain a component repository, which allows the developer community to share the reusable PRs they develop and make use of the work done by others.<br /><br />In summary, if you are developing NLP applications, you should use a text engineering framework to make use of the wealth of components that have already been developed, increase productivity and build NLP applications which are modular and loosely coupled.</div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com2tag:blogger.com,1999:blog-1879978874853957111.post-1808898534529028192009-04-26T00:06:00.005+05:302009-04-26T14:34:23.337+05:30De-Identification of Personal Health Information<div style="text-align: justify;">I recently started some work on de-identification of personal health information, and thought of putting together this primer on de-identification.<br /><br />Medical researchers often need access to patients' medical records for their investigations. However, these records may contain information that compromises the identity of the individual and thus violates his right to privacy. It is thus required that personal health information (PHI) be removed from medical records when they are released to the larger research community. The <a href="http://privacyruleandresearch.nih.gov/pr_02.asp">HIPAA regulation</a> lays down the rules for the handling of PHI.<br /><br />Under HIPAA, PHI must be removed from the medical records before releasing them to the research community. 
Thus, any information that may reveal the identity of the patient, like his name, address, doctor's name, social security number, telephone numbers, etc., must be removed. This process of removing PHI from medical records is termed de-identification.<br /><br />There are 18 PHI identifiers that must be de-identified to meet HIPAA regulations. These include names, addresses, etc. (<a href="http://cphs.berkeley.edu/content/hipaa/hipaa18.htm">Entire list here</a>). Identifying these elements poses an interesting text mining problem. Identifying names may seem to be a Named Entity Recognition task, but there are additional complexities involved - a device or a disease named after a person is not PHI, and it would be a loss of valuable information to the researcher if it were removed. Addresses are a challenge to de-identify sufficiently to prevent re-identification. There is a wide range of identifiers that must be recognized - SSNs, MRNs, admission numbers, accession numbers, telephone/fax numbers, room numbers, etc. - out of the many numbers that a report may contain. What makes the task challenging is that a very high recall must be obtained to ensure compliance, while at the same time making sure that there aren't too many false positives, which would remove valuable, non-PHI information.<br /><br />A number of rule-based as well as statistical systems have been developed to tackle the problem. You can find a good survey of the research work in this <a href="http://www.citeulike.org/user/anoop_kunchukuttan/article/4313105">paper</a>. 
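To make the rule-based approach concrete, here is a toy sketch in Python. The patterns are illustrative only - real de-identification systems combine far richer rules, dictionaries and context checks to reach the recall that compliance demands:

```python
import re

# Toy patterns for a few of the 18 HIPAA identifier classes.
# Illustrative only: real systems cover many more formats per class.
PHI_PATTERNS = [
    ("SSN",   re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("MRN",   re.compile(r"\bMRN[:#]?\s*\d{6,10}\b")),
    ("DATE",  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]

def deidentify(text):
    """Replace each matched PHI span with a bracketed category tag."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text
```

Keeping the category tag in the output (rather than deleting the span outright) preserves some utility for the researcher, who can still see what kind of information was removed.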
Here are a few de-identification systems that are available:<br /></div><ul style="text-align: justify;"><li><a href="http://www.physionet.org/physiotools/deid/">PhysioNet DeId</a> (Open Source)<br /></li><li><a href="http://spin.nci.nih.gov/content/HMS_Scrubber_v1.0b.zip">Harvard Medical School Scrubber</a> (Open Source)</li><li><a href="http://www.de-idata.com/">Data Corp DeId</a> (Commercial)</li></ul><div style="text-align: justify;">For research purposes, a gold standard data set containing surrogate PHI data is available on the <a href="http://www.physionet.org/physiotools/deid/#data">PhysioNet page</a>.</div>Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com0tag:blogger.com,1999:blog-1879978874853957111.post-21560223548605461362009-04-25T22:39:00.002+05:302009-04-25T23:46:15.403+05:30Yet Another Blog On Organizing InformationData and information everywhere. The digital age is generating so much information, that it has fast outgrown our ability to comprehend it. 'Information Overload', we call it. These are the questions that are posed to us:<br /><ul><li>How do I find information that I want?</li><li>What information is relevant to my need?</li><li>Ok, this is way too much information than I can handle. I would like to have summary of the same.</li><li>In this huge infobase, is there some useful information that isn't obvious? Some patterns, trends that may be useful.</li><li>There are a lot of smart people generating content. How can the collective intelligence of these people augment my search for information? </li></ul>These questions have had us hooked for a long time, and so have the solutions people have developed to tackle these questions. 
Search engines to help you find information, business intelligence tools to find patterns in huge volumes of data, information extraction systems to summarize information in human-generated content, recommendation systems to bring information relevant to your need, and the study of social networks to harness the "collective intelligence" of the crowd.<br /><br />The rabbit hole goes deeper. These solutions are built on the more fundamental sciences of statistics, pattern recognition, artificial intelligence and natural language understanding.<br /><br />This is not the end, for the more fundamental questions we are posed with are about the nature of cognition, the understanding of language, the organization of knowledge and the active role of the human observer in the perception of information. I think this is the holy grail that we are all in pursuit of.<br /><br />We are beginners in this exciting field. This is a place to share what we learn, what we do and to benefit from the "collective intelligence" of all who visit this page.<br /><br />While the challenges span many problems, there are some that we are currently working on. Dhaval currently works on optimizing ad networks and takes an active interest in search engines. I currently work on information extraction from text and medical informatics. So for now, you may find a certain bias towards these and related topics on this blog.Anoop Kunchukuttanhttp://www.blogger.com/profile/03230469717630854695noreply@blogger.com2