Tuesday, January 26, 2010

Scalable Machine Learning - Apache Mahout

Machine learning algorithms are computationally intensive, operate on huge amounts of data, and take a long time to run. That makes them obvious candidates for data-parallel distributed programming models like Map-Reduce.

Although Google's Map-Reduce paper does touch on it, there was not much available in the public domain for doing machine learning at a distributed scale. Andrew Ng's paper gives a common mathematical framework for modeling the most common machine learning algorithms so that they can be parallelized. It is basically built around the idea of representing a computation as a summation of simpler per-record computations: each partial computation can be a map task, with the final summation being the reduce task.
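To make the summation idea concrete, here is a minimal Python sketch (my own illustration, not code from the paper or from Mahout) of fitting a one-variable least-squares slope in summation form: each map task computes partial sums over its chunk of data, and the reduce step adds the partial sums before the final closed-form solve.

```python
from functools import reduce

# Toy dataset: (x, y) pairs for a 1-D least-squares fit y ~ w * x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

def map_partial_sums(chunk):
    """Map task: partial sums (sum of x*x, sum of x*y) over one data chunk."""
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (sxx, sxy)

def reduce_sums(a, b):
    """Reduce task: combine the partial sums from two map outputs."""
    return (a[0] + b[0], a[1] + b[1])

# Split the data into chunks, as a MapReduce framework would across workers.
chunks = [data[:2], data[2:]]
sxx, sxy = reduce(reduce_sums, map(map_partial_sums, chunks))

# The slope depends only on the global sums, so the decomposition is exact.
w = sxy / sxx
```

The point is that the final answer depends only on the global sums, so splitting the data across any number of map tasks gives exactly the same result as a sequential pass.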

Apache Mahout is a project from the Apache Foundation that started off with Ng's paper and already has implementations of many ML algorithms running on Hadoop. In addition, Mahout contains the Taste library for building recommendation and collaborative filtering systems.
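The core idea behind a Taste-style user-based recommender, scoring unseen items by similarity-weighted ratings from other users, can be sketched in a few lines of Python. This is an illustration of the technique only, not the Taste API (which is Java), and the preference data here is made up:

```python
import math

# Hypothetical toy preference data: user -> {item: rating}.
prefs = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob":   {"a": 4.0, "b": 3.0, "d": 5.0},
    "carol": {"b": 2.0, "c": 5.0, "d": 1.0},
}

def cosine_sim(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(prefs[u]) & set(prefs[v])
    if not common:
        return 0.0
    dot = sum(prefs[u][i] * prefs[v][i] for i in common)
    nu = math.sqrt(sum(prefs[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(prefs[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    """Rank items the user has not seen by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other in prefs:
        if other == user:
            continue
        s = cosine_sim(user, other)
        for item, rating in prefs[other].items():
            if item not in prefs[user]:
                scores[item] = scores.get(item, 0.0) + s * rating
                weights[item] = weights.get(item, 0.0) + s
    return sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
```

Taste structures the same pipeline as pluggable components (a data model, a similarity measure, a neighborhood, and a recommender), so each piece can be swapped independently.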

Hoping to read more on open-source ML and practical ML. A couple of books I am looking forward to reading: