Pere Ferrera Bertran

A practical Storm’s Trident API Overview

On the 10th of April Pere gave a Trident hackaton at Berlin’s Big Data Beers. There was also a parallel Disco hackaton by Dave from Continuum Analytics. Es war viel spaß! The people who came had the chance to learn the basics of Trident in the Storm session while trying it right away. The hackaton covered the most basic aspects of the API, the philosophy and typical use cases for Storm and included a simple exercise that manipulated a stream of fake tweets. The project with the session guideline, some runnable examples and the tweet generator can be found on github.

BDB4-2

In this post we will see an introductory overview of Trident’s API that can be followed with the help of the aforementioned github project.

Read more…


Presenting Splout Cloud: a managed web-latency SQL querying engine in the cloud

We have created Splout Cloud, a web-latency managed service in the AWS cloud. Simply put, Splout Cloud converts any data files – regardless of their size – into a scalable, partitioned SQL querying engine. And unlike offline analytics engines, it is highly performant, fast enough so as to be able to provide sub-second queries and real-time aggregations to feed arbitrary web or mobile applications.

sploutcloud-logo

Splout Cloud is based on Splout SQL, an open-source SQL database for Hadoop.

Read more…


Pig + Splout SQL for a retail coupon generator: A Big Data love story

(This is the last post of a series of three posts presenting Splout SQL 0.2.2 native integration with main Hadoop processing tools: Cascading, Hive and Pig).

In this post we’ll present an example Big Data use case for a big food retail store. We will summarize individual client’s purchases using Apache Pig and we will dump the analysis into Splout SQL for being able to query it in real-time. Then, we will be able to combine the summarized information with a list of promotional products and suggest discounts on particular products for every client, in real-time. This information could be easily used by a coupon printing system for increasing client loyalty.

Combining an agile Big Data processing tool such as Pig with a flexible, low-latency SQL querying system such as Splout SQL provides a simple yet effective solution to this problem, which we will be able to simulate through this post almost with no effort.

Read more…


Hive + Splout SQL for a social media reporting webapp: A Big Data love story

(This is the second post of a series of three posts presenting Splout SQL 0.2.2 native integration with main Hadoop processing tools: Cascading, Hive and Pig).

In this post we’ll present an example Big Data use case for analyzing tweets and reporting consolidated, meaningful statistics to Twitter users through an interactive, low-latency webapp. For that we will marry Hive (an open-source warehousing solution for Hadoop that enables easy analysis and summarization over Hadoop) with Splout SQL, a highly-performant, low-latency partitioned SQL for Hadoop. We will build a very simple – yet scalable – analysis tool such as the Tweet archivist, and we will do it without even coding. The tool will provide historical mentions, summarized hashtags and popular tweets for every actor in the input dataset.

Read more…


Cascading + Splout SQL for log analysis and serving: A Big Data love story

(This is the first post of a series of three posts presenting Splout SQL 0.2.2 native integration with main Hadoop processing tools: Cascading, Hive and Pig).

In this post we’ll present an example Big Data use case for analyzing and indexing a large amount of Apache logs from an e-Commerce website, and being able to serve them in a low-latency “customer service” web application that needs fine-grained, detailed per-user information for troubleshooting and performing “loyalty campaigns”. For that we will marry Cascading, an agile Java high-level Hadoop framework with Splout SQL, a highly-performant, low-latency partitioned SQL for Hadoop. We’ll see how to develop a solution which is totally scalable both in processing and serving and develop it in barely 200 lines. We’ll also provide a simple JavaScript frontend that will use Splout SQL’s REST API via jQuery.

Read more…


An example “lambda architecture” for real-time analysis of hashtags using Trident, Hadoop and Splout SQL

In this post we will show how to use Trident, Hadoop and Splout SQL together to build a toy example “lambda architecture“. We will learn the basics of Trident, a higher-level API on top of Storm, and Splout SQL, a fast SQL read-only DB for Hadoop. The example architecture is hosted on this github project. We will simulate counting the number of appearances of hashtags in tweets, by date. The ultimate goal is to solve this simple problem in a fully scalable way, and provide a remote low-latency service for querying the evolution of the counts of a hashtag, including both consolidated and real-time statistics for it.

So for any hashtag we want to be able to query a remote service to obtain per-date counts in a data structre like this:

{
  "20091022":115,
  "20091023":115,
  "20091024":158,
  "20091025":19
}

Read more…


Announcing Splout SQL: A web-latency spout for Hadoop

Today we are proud to announce Splout SQL, a web-latency SQL spout for Hadoop.

We have been working hard for the last months preparing the release of what we think is a novel solution for serving Big Data. We have taken ideas from projects such as Voldemort or ElephantDB to bring SQL back to the realm of web-latency Big Data serving.

Motivation

The motivation for Splout SQL is well explained in this post.

Splout SQL fundamentals

Splout SQL is a scalable, fault-tolerant, read-only, partitioned, RESTful, SQL database that integrates tightly with Hadoop. Splout SQL can be used to query Hadoop-generated datasets under high-load, in sub-second latencies. This means Splout SQL is suitable for web or mobile applications that demand a high level of performance. As an example, we have used Splout for serving 350GB worth of data from the “Wikipedia Pagecounts” dataset, performing arbitrary aggregations over it (see the benchmark page for more details).

The integration with Hadoop follows the same design principles that those from Elephant DB (which are well explained in these slides). In a nutshell, data is indexed and partitioned offline in a Hadoop cluster. The generated data structures are then moved to a read-only serving cluster, which is the one that answers user queries.

Read more…


A richer database spout for Big Data

In this post we will review the current approaches there are for servicing Big Data, that is, for being able to process an arbitrary number of queries with sub-second latencies in a scalable cluster of machines over a huge dataset and under high load.

Think Twitter, Facebook, Linkedin. Think servicing Hadoop-generated datasets.

What are the possibilities that the open-source world gives us for building a website whose queries impact such a huge dataset? What are the most common problems we might encounter in such a scenario and how well do these tools solve this problem?

Then, we will propose a new architecture that provides a scalable yet rich solution for this problem.
Read more…


Pangool + SOLR

In this blog we have already talked about the convenience of using Hadoop for batch-indexing for producing a Lucene inverted index, for instance, that would be deployed into SOLR. We have talked about SOLR-1301, which is commonly used for SOLR indexing with Hadoop. In this article we’re going to present the recent integration of Pangool with SOLR which is a lot simpler and more powerful.

Read more…


Pangool’s Game Of Life

In this post we’re going to do something crazier and more original than usual. We thought that it would be interesting to execute Conway’s Game of Life in Hadoop using Pangool (for those who don’t know what Pangool is, it is an open-source project that we have developed for facilitating low-level development with Hadoop). Through a parallel execution we’ll be able to calculate sequences of the game from multiple initial configurations. The ultimate purpose of the execution will be to find curious sequences, such as sequences that need a lot of iterations to reach convergence even with a small, finite space.

Read more…