Archive for April, 2012

Building a parallel text classifier in Hadoop with Pangool

Some weeks ago we introduced Pangool, a low-level, easy-to-use library that aims to be a replacement for the default Hadoop API. In this post we’ll show how to build a simple distributed text classifier with Pangool.

This example shows that, even though we are working with a low-level library, the most common tasks turn out to be very simple to perform: in a short time and with just a few lines of code, we can build something as complete as a scalable text classifier. We’re going to implement a classifier that performs a “sentiment analysis” task: given an input text, it should tell us whether the text’s sentiment is positive or negative. Our examples will be hotel reviews, such as “Fantastic hotel!”. Text classification is a machine learning task in common use today for things like spam detection and text categorization.

Our classifier will be based on Naive Bayes with “add-one smoothing”.
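To make the idea concrete, here is a minimal single-machine sketch of multinomial Naive Bayes with add-one smoothing in plain Java. It is not Pangool code and does not use Hadoop at all; the class name `NaiveBayesSketch`, the tokenizer, and the `POSITIVE` / `NEGATIVE` labels are assumptions made for illustration. Add-one smoothing means every word is scored as (count + 1) / (total tokens in the class + vocabulary size), so a word never seen in a class still gets a small non-zero probability.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative single-machine multinomial Naive Bayes with add-one smoothing. */
class NaiveBayesSketch {
    // Per-class word counts, per-class token totals and per-class document counts.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalTokens = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Naive tokenizer (an assumption): lowercase, split on anything that is not a-z.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z]+");
    }

    // Adds one labeled example to the counts.
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue; // split() can yield a leading empty token
            counts.merge(token, 1, Integer::sum);
            totalTokens.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // log P(label) + sum of log P(word | label), with add-one smoothing.
    // The label must have been seen during training.
    double logScore(String label, String text) {
        double score = Math.log(docCounts.get(label) / (double) totalDocs);
        Map<String, Integer> counts = wordCounts.get(label);
        int tokens = totalTokens.getOrDefault(label, 0);
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            int count = counts.getOrDefault(token, 0);
            score += Math.log((count + 1.0) / (tokens + vocabulary.size()));
        }
        return score;
    }

    // Returns the label with the highest log score, or null if nothing was trained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = logScore(label, text);
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

For instance, after `train("POSITIVE", "Fantastic hotel, great staff")` and `train("NEGATIVE", "Terrible hotel, dirty room")`, calling `classify("great staff")` returns `POSITIVE` under this sketch. Working in log space avoids numeric underflow when many small probabilities are multiplied together.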
