Pere Ferrera Bertran

A scalable groupByKey and secondary sort for (Java) Spark

Spark is, in our opinion, the new reference Big Data processing framework. Its flexible API allows for unified batch and stream processing, and can be easily extended for many purposes. It also incorporates a new API called Data Frames which makes it terribly easy to analyze data while providing a solid columnar storage by default (Parquet). Spark is changing fast, currently going through a big refactor (Project Tungsten) which will bring it to the next level.

We see currently Spark as the default tool for building Big Data projects. Those who, like us, have been working with Hadoop’s MapReduce for a long time, encounter many subtle (and sometimes not so subtle) differences when approaching Big Data problems with both frameworks. This post covers some of them.

Today we will talk about one particular problem: How to group our data in Spark, given that some groups might contain a tremendous amount of data?

Read more…

SQL on Hadoop: A state of the art

Since the mainstream adoption of Hadoop, many open-source and enterprise tools have emerged recently which aim to solve the problem of “querying Hadoop”. Indeed, Hadoop started as a simple (yet powerful) Big Data framework consisting of (1) a storage layer (HDFS, inspired by GFS) and (2) a processing layer (an implementation of MapReduce). Only after a couple of years the first releases of HBase (inspired by BigTable) started to make data on Hadoop more actionable. After that, several NoSQL’s and query engines started to work on Hadoop connectors (SOLR-1301 is one of the oldest). However, the problem of “querying Hadoop” remained unsolved in two main aspects:

  • Flexibility: Query engines outside Hadoop had discerning features and query languages. Many NoSQL’s were limited by their nature to key/value serving. Choosing a NoSQL always involved a tradeoff, and choosing more than one (the so-called “Polyglot Persistence”) made things harder.
  • Integration: Making data queryiable by a query engine was a challenge by itself (How does this database integrate with Hadoop? What does the connector look like? How mature it is? How efficient is it? Can I update all my data view atomically, in a ‘big transaction’?).

Read more…

Lambda Architecture: A state-of-the-art

It’s been some time now since Nathan Marz wrote the first Lambda Architecture post. What has happened since then? How has the community reacted to such a concept? What are the architectural trends in the Big Data space,  as well as the challenges and remaining problems?

Read more…

Parsing Qype reviews with Pangool and saving results into MongoDB

In this post we will see how easy it is to integrate a Pangool MapReduce Job with MongoDB, the famous document-oriented NoSQL database. For that, we will perform a review scraping task on Qype HTML pages. The resultant reviews will be then persisted into MongoDB.


The “Burgermeister” in Berlin is a famous burger venue. But what do really people think about it? Has their opinion changed recently? A lot of companies and start-ups work nowadays on solving this problem. ReviewPro helps hotels know what people think of them. BrandWatch helps companies watch people’s opinions on their brand. In this post we will perform a simple review parsing job for Qype HTML pages. The given implementation will be very simple, but we will also give an idea on how to extend it to a “real-life” solution.

Read more…

A practical Storm’s Trident API Overview

On the 10th of April Pere gave a Trident hackaton at Berlin’s Big Data Beers. There was also a parallel Disco hackaton by Dave from Continuum Analytics. Es war viel spaß! The people who came had the chance to learn the basics of Trident in the Storm session while trying it right away. The hackaton covered the most basic aspects of the API, the philosophy and typical use cases for Storm and included a simple exercise that manipulated a stream of fake tweets. The project with the session guideline, some runnable examples and the tweet generator can be found on github.


In this post we will see an introductory overview of Trident’s API that can be followed with the help of the aforementioned github project.

Read more…

Presenting Splout Cloud: a managed web-latency SQL querying engine in the cloud

We have created Splout Cloud, a web-latency managed service in the AWS cloud. Simply put, Splout Cloud converts any data files – regardless of their size – into a scalable, partitioned SQL querying engine. And unlike offline analytics engines, it is highly performant, fast enough so as to be able to provide sub-second queries and real-time aggregations to feed arbitrary web or mobile applications.


Splout Cloud is based on Splout SQL, an open-source SQL database for Hadoop.

Read more…

Pig + Splout SQL for a retail coupon generator: A Big Data love story

(This is the last post of a series of three posts presenting Splout SQL 0.2.2 native integration with main Hadoop processing tools: Cascading, Hive and Pig).

In this post we’ll present an example Big Data use case for a big food retail store. We will summarize individual client’s purchases using Apache Pig and we will dump the analysis into Splout SQL for being able to query it in real-time. Then, we will be able to combine the summarized information with a list of promotional products and suggest discounts on particular products for every client, in real-time. This information could be easily used by a coupon printing system for increasing client loyalty.

Combining an agile Big Data processing tool such as Pig with a flexible, low-latency SQL querying system such as Splout SQL provides a simple yet effective solution to this problem, which we will be able to simulate through this post almost with no effort.

Read more…

Hive + Splout SQL for a social media reporting webapp: A Big Data love story

(This is the second post of a series of three posts presenting Splout SQL 0.2.2 native integration with main Hadoop processing tools: Cascading, Hive and Pig).

In this post we’ll present an example Big Data use case for analyzing tweets and reporting consolidated, meaningful statistics to Twitter users through an interactive, low-latency webapp. For that we will marry Hive (an open-source warehousing solution for Hadoop that enables easy analysis and summarization over Hadoop) with Splout SQL, a highly-performant, low-latency partitioned SQL for Hadoop. We will build a very simple – yet scalable – analysis tool such as the Tweet archivist, and we will do it without even coding. The tool will provide historical mentions, summarized hashtags and popular tweets for every actor in the input dataset.

Read more…

Cascading + Splout SQL for log analysis and serving: A Big Data love story

(This is the first post of a series of three posts presenting Splout SQL 0.2.2 native integration with main Hadoop processing tools: Cascading, Hive and Pig).

In this post we’ll present an example Big Data use case for analyzing and indexing a large amount of Apache logs from an e-Commerce website, and being able to serve them in a low-latency “customer service” web application that needs fine-grained, detailed per-user information for troubleshooting and performing “loyalty campaigns”. For that we will marry Cascading, an agile Java high-level Hadoop framework with Splout SQL, a highly-performant, low-latency partitioned SQL for Hadoop. We’ll see how to develop a solution which is totally scalable both in processing and serving and develop it in barely 200 lines. We’ll also provide a simple JavaScript frontend that will use Splout SQL’s REST API via jQuery.

Read more…

An example “lambda architecture” for real-time analysis of hashtags using Trident, Hadoop and Splout SQL

In this post we will show how to use Trident, Hadoop and Splout SQL together to build a toy example “lambda architecture“. We will learn the basics of Trident, a higher-level API on top of Storm, and Splout SQL, a fast SQL read-only DB for Hadoop. The example architecture is hosted on this github project. We will simulate counting the number of appearances of hashtags in tweets, by date. The ultimate goal is to solve this simple problem in a fully scalable way, and provide a remote low-latency service for querying the evolution of the counts of a hashtag, including both consolidated and real-time statistics for it.

So for any hashtag we want to be able to query a remote service to obtain per-date counts in a data structre like this:


Read more…