Blog

Building a parallel text classifier in Hadoop with Pangool

Some weeks ago we introduced Pangool, a low-level, easy-to-use library that aims to be a replacement for the default Hadoop API. In this post we’ll show how to build a simple distributed text classifier with Pangool.

This example will show us how, despite the fact that we are working with a low-level library, the most common tasks end up being extraordinarily simple to perform, so that in a short amount of time and with just a few lines of code we are able to develop something as complete as a scalable text classifier. We’re going to implement a text classifier that will perform a “sentiment analysis” task. Given an input text, we’ll expect the classifier to tell us whether the text’s sentiment is positive or negative. The examples that we will be working on will be hotel reviews, such as: “Fantastic hotel!”. Text classification is a “machine learning” task very commonly used nowadays for performing tasks like detecting spam, categorizing texts and so on.

Our classifier will be based on Naive Bayes with “add-one smoothing”.

Read more…


Pangool: Hadoop API made easy

We are proud to announce Pangool, an Open Source java library with the aim to be a replacement for the Hadoop API. Hadoop has a steep learning curve. Pangool’s goal is to simplify Hadoop development without losing the performance or flexibility of the low level Hadoop’s API.


Pangool

Pangool is a Tuple MapReduce implementation for Hadoop. By employing an intermediate Tuple-based schema and configuring a Job conveniently, many of the accidental complexities that arise from using the Hadoop Java MapReduce API disappear. Things like secondary sort and reduce-side joins become extremely easy to implement and understand. Pangool’s performance is comparable to that of the Hadoop Java MapReduce API. Pangool also augments Hadoop’s API by making multiple outputs and inputs first-class and allowing configuration via object instance instead of static classes.

Read more…


Tuple MapReduce: beyond the classic MapReduce

It’s been some years now since Google wrote the paper [“MapReduce: Simplified Data Processing on Large Clusters“] in 2004. In this paper Google presented MapReduce, a programming model and associated implementation for solving parallel computation problems with big-scale data. This model is based on the use of the functional primitives “map” and “reduce” present in LISP and other functional languages.

Today, Hadoop, the “de facto” open-source implementation of MapReduce, is used by a wide variety of companies, institutions and universities. The massive usage of this programming model has led to the creation of multiple tools associated with it (which has come to be known as the Hadoop ecosystem) and even specialized companies like Cloudera engaged in training programmers to use it. Part of the success of such tools and companies lies in the now-evident difficulty and sharp learning curve involved in MapReduce, as it was originally defined, when applied to practical problems.

In this post we’ll review the MapReduce model proposed by Google in 2004 and propound another one called Tuple MapReduce. We’ll see that this new model is a generalization of the first and we’ll explain what advantages it has to offer. We’ll provide a practical example and conclude by discussing when the implementation of Tuple MapReduce is advisable.

Read more…


MapReduce & Hadoop API revised

Nowadays, Hadoop has become the key technology behind what has come to be known as “Big Data”. It has certainly worked hard to earn this position. It is mature technology that has been used successfully in countless projects. But now, with experience behind us, it is time to take stock of the foundations upon which it is based, particularly its interface. This article discusses some of the weaknesses of both MapReduce and Hadoop, which we, at Datasalt, shall attempt to resolve with an open-source project that we will soon be releasing.

Read more…


Real-time feed processing with Storm

One of the most interesting open-source projects that has been released in recent months is Storm, programmed by Nathan Marz from Backtype (now in Twitter). This video shows the basic cases for using Storm and why Storm is needed. There’s also a very good wiki where we can learn all sorts of things about Storm.

In this article I will present a practical use case that we can implement with Storm and I will discuss why Storm is a very good tool for this use case.

Read more…


Front-end view generation with Hadoop

One of the most common uses for Hadoop is building “views”. The usual case is that of websites serving data in a front-end that uses a search index. Why do we want to use Hadoop to generate the index being served by the website? There are several reasons: Read more…


Scalable vertical search engine with Hadoop

The following slides show how to build a scalable vertical search engine using Hadoop and Solr:


Playing with Scala & Hadoop

In this post we’ll try executing a Hadoop task with Scala, a recent modern language that combines the best features of functional languages with the strong typing and object-orientation of Java. The advantages that Scala programming offers with respect to Java are a more concise, briefer code as well as the possibility of applying functional programming methods that would otherwise be nearly impossible to apply in Java. Read more…


Hadoop + Avro

In Hadoop projects that have to handle large amounts of data, the serialization we use to process and store these data can be a key decision in optimizing the time and cost of our pipeline.

We can serialize our data in different ways:

Format Main advantage Main disadvantage
In plain text, using tab-separated fields, for example. Data is easily readable by humans. Only one nesting level – arrays are expressed as comma-separated element lists.
In JSON. Flexible, multi-level schema. Waste of storage space: each field has its associated name written with it.
In binary format: Hadoop Serialization (Writables). More efficient in both space and speed. Not flexible (not “backwards compatible”: changes in the format makes it impossible to read data that was written previous to the format change).
In binary format: Using Thrift, Avro or similar APIs. More efficient in both space and speed. Flexible. Which one to choose?

Read more…


Handling dependencies and configuration in Java + Hadoop projects efficiently

When working on Java Hadoop projects it is common to encounter the situation where you have to add a large amount of JAR dependencies which will be needed by all of the tasks that are executed in any node of the cluster (Mappers, Reducers…). Synchronizing our libraries with a Hadoop cluster is not as simple as synchronizing them in a standard Java project.

There are three possible solutions to this scenario worth mentioning. First, we will show the two that we advise against and the drawbacks they have. Finally, we will talk about the one we recommend. Read more…