Some weeks ago we introduced Pangool, a low-level, easy-to-use library that aims to be a replacement for the default Hadoop API. In this post we’ll show how to build a simple distributed text classifier with Pangool.
This example shows that, even though we are working with a low-level library, the most common tasks remain very simple to perform: in a short amount of time and with just a few lines of code we can build something as complete as a scalable text classifier.
We’re going to implement a text classifier that performs a “sentiment analysis” task: given an input text, the classifier tells us whether the text’s sentiment is positive or negative. The examples we will work with are hotel reviews, such as: “Fantastic hotel!”. Text classification is a machine learning task that is widely used today for things like spam detection and text categorization.
Our classifier will be based on Naive Bayes with “add-one smoothing”.
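To make “add-one smoothing” concrete before moving to the distributed version, here is a minimal single-machine sketch in plain Java. It does not use Pangool or Hadoop. The class name `NaiveBayesSketch`, its methods, the tokenizer, and the choice to estimate class priors from document counts are all assumptions made for illustration, not the code of the actual example. Each word probability is estimated as (count of the word in the class + 1) / (total words in the class + vocabulary size), so a word never seen in a class still gets a small non-zero probability.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, in-memory illustration of Naive Bayes with add-one smoothing.
class NaiveBayesSketch {
    // label -> (word -> number of occurrences in that label's training texts)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // label -> total number of word occurrences seen for that label
    private final Map<String, Integer> totalWords = new HashMap<>();
    // label -> number of training documents (assumption: used for the prior)
    private final Map<String, Integer> docCounts = new HashMap<>();
    // every distinct word seen during training, across all labels
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Very naive tokenizer (assumption): lowercase, split on non-letters.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z]+");
    }

    public void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue; // split may yield an empty first token
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // log P(label) + sum over words of log P(word | label), add-one smoothed.
    public double logScore(String label, String text) {
        Map<String, Integer> counts =
            wordCounts.getOrDefault(label, new HashMap<>());
        int total = totalWords.getOrDefault(label, 0);
        double score =
            Math.log((double) docCounts.getOrDefault(label, 0) / totalDocs);
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            int count = counts.getOrDefault(token, 0);
            // Add-one smoothing: +1 in the numerator, +|vocabulary| in the denominator.
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    // Returns the trained label with the highest log score, or null if untrained.
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = logScore(label, text);
            if (best == null || score > bestScore) {
                best = label;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("POSITIVE", "Fantastic hotel!");
        nb.train("POSITIVE", "Great location and friendly staff");
        nb.train("NEGATIVE", "Terrible room, dirty bathroom");
        nb.train("NEGATIVE", "Awful service and rude staff");
        System.out.println(nb.classify("fantastic staff"));
        System.out.println(nb.classify("dirty room"));
    }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. In a distributed setting, the per-label word counts are the part that would be computed by a MapReduce job, while the scoring logic stays essentially the same.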




Hadoop has become the key technology behind what has come to be known as “Big Data”, and it has worked hard to earn this position: it is a mature technology that has been used successfully in countless projects. But now, with that experience behind us, it is time to take stock of the foundations on which it is built, particularly its interface. This article discusses some of the weaknesses of both MapReduce and Hadoop, which we, at