On Organizing Information: 2010

Tuesday, December 21, 2010

Beauty of Language

Language is highly ambiguous, which makes it hard to analyze. I came across an extreme example the other day that captures this ambiguity well. The following sentence can take on different meanings depending on how it is spoken:

I didn't say he stole the money.

The meaning changes with the word that is stressed when speaking. Here are a few interpretations of the sentence, with the stressed word in bold.

**I** didn't say he stole the money
... someone else may have said it

I **didn't** say he stole the money
... the literal meaning

I didn't **say** he stole the money
... just hinted or implied it

I didn't say **he** stole the money
... I didn't mean him

I didn't say he **stole** the money
... maybe he just borrowed it, with the intention of returning it

I didn't say he stole **the** money
... not that money

I didn't say he stole the **money**
... not the money, I mean something else - xyz ...

Most everyday situations are not this extreme, but the example highlights how challenging it is to understand text. The current state of the art is just skimming the surface.

PS: Cross-posted from my Peepaal blog post

Tuesday, January 26, 2010

Scalable Machine Learning - Apache Mahout

Machine learning algorithms are computationally intensive: they work on huge amounts of data and take a long time to run. That makes them obvious candidates for data-parallel distributed programming models like Map-Reduce.

Although Google's Map-Reduce paper does touch on machine learning, not much was available in the public domain for doing machine learning at a distributed scale. Andrew Ng's paper gives a common mathematical framework for the most common machine learning algorithms, so that they can be parallelized. It is built around the idea of representing a computation as a summation of simpler computations. Each simpler computation can be a map task, with the final summation being the reduce task.
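To make the summation idea concrete, here is a minimal sketch in plain Python (not Mahout or Hadoop code). It assumes a simple one-variable least-squares fit as the algorithm, which is my choice of illustration; the function names `map_partial_sums`, `reduce_sums` and `fit_line` are hypothetical. Each shard of the data produces partial sums in a map step, and the reduce step just adds them up.

```python
from functools import reduce


def map_partial_sums(shard):
    """Map task: partial sums for fitting y = intercept + slope * x on one shard."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in shard:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def reduce_sums(left, right):
    """Reduce task: element-wise addition of two tuples of partial sums."""
    return tuple(l + r for l, r in zip(left, right))


def fit_line(shards):
    """Combine per-shard sums, then solve the least-squares equations once."""
    n, sx, sy, sxx, sxy = reduce(reduce_sums, map(map_partial_sums, shards))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Illustrative data lying on y = 1 + 2x, split into two shards.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
intercept, slope = fit_line(shards)
print(intercept, slope)
```

Because addition is associative, the shards can be processed in any order and on any machine. Only the five numbers per shard have to travel to the reducer, not the data itself.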

Apache Mahout is a project from the Apache Software Foundation that started off with Ng's paper and already has implementations of many ML algorithms running on Hadoop. In addition, Mahout contains the Taste library for building recommendation and collaborative filtering systems.

I am hoping to read more on open source and practical ML. A couple of books I am looking forward to reading:

Programming Collective Intelligence, Toby Segaran
Taming Text, Grant S. Ingersoll and Thomas S. Morton