Archive for January, 2013

An example “lambda architecture” for real-time analysis of hashtags using Trident, Hadoop and Splout SQL

In this post we will show how to use Trident, Hadoop and Splout SQL together to build a toy example of a “lambda architecture”. We will learn the basics of Trident, a higher-level API on top of Storm, and of Splout SQL, a fast read-only SQL database for Hadoop. The example architecture is hosted on this GitHub project. We will simulate counting the number of appearances of hashtags in tweets, by date. The ultimate goal is to solve this simple problem in a fully scalable way and to provide a remote low-latency service for querying the evolution of a hashtag’s counts, including both consolidated and real-time statistics.

So for any hashtag we want to be able to query a remote service and obtain per-date counts in a data structure like this:

{
  "20091022":115,
  "20091023":115,
  "20091024":158,
  "20091025":19
}
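
To make the real-time half more concrete, here is a minimal sketch of what a Trident topology for these per-date hashtag counts could look like. This is not the code from the GitHub project above: the tweet spout, the field names and the ExtractHashtags function are assumptions of ours, and MemoryMapState is Trident's in-memory test state, used here only for illustration.

import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.spout.IBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class HashtagCountTopology {

    // Hypothetical function: emits one (hashtag, date) pair per hashtag found in a tweet.
    public static class ExtractHashtags extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String text = tuple.getString(0);
            String date = tuple.getString(1); // e.g. "20091022"
            for (String word : text.split("\\s+")) {
                if (word.startsWith("#")) {
                    collector.emit(new Values(word.substring(1), date));
                }
            }
        }
    }

    public static TridentTopology buildTopology(IBatchSpout tweetSpout) {
        TridentTopology topology = new TridentTopology();
        // Each tweet tuple is assumed to carry (text, date) fields.
        topology.newStream("tweets", tweetSpout)
            .each(new Fields("text", "date"), new ExtractHashtags(), new Fields("hashtag", "hashtagDate"))
            .groupBy(new Fields("hashtag", "hashtagDate"))
            // In-memory state for illustration; a real deployment would use a persistent store.
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology;
    }
}

With a transactional spout and state, Trident can maintain counters like these with exactly-once semantics, which is what makes the real-time numbers safe to merge with the batch layer's consolidated counts.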



Announcing Splout SQL: A web-latency spout for Hadoop

Today we are proud to announce Splout SQL, a web-latency SQL spout for Hadoop.

We have been working hard for the last few months preparing the release of what we think is a novel solution for serving Big Data. We have taken ideas from projects such as Voldemort or ElephantDB to bring SQL back to the realm of web-latency Big Data serving.

Motivation

The motivation for Splout SQL is well explained in this post.

Splout SQL fundamentals

Splout SQL is a scalable, fault-tolerant, read-only, partitioned, RESTful SQL database that integrates tightly with Hadoop. Splout SQL can be used to query Hadoop-generated datasets under high load, with sub-second latencies. This means Splout SQL is suitable for web or mobile applications that demand a high level of performance. As an example, we have used Splout to serve 350 GB of data from the “Wikipedia Pagecounts” dataset, performing arbitrary aggregations over it (see the benchmark page for more details).
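
Because Splout SQL is queried over plain HTTP, hitting it from any client is straightforward. Below is a hedged sketch in Java: the /api/query endpoint shape and the default QNode port 4412 reflect our reading of Splout's REST API documentation, and the tablespace, table and column names are invented for this example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SploutQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed QNode address and schema; adjust to your deployment.
        String qnode = "http://localhost:4412";
        String tablespace = "hashtags";
        String hashtag = "hadoop";
        String sql = "SELECT date, SUM(count) FROM hashtag_counts WHERE hashtag = '"
            + hashtag + "' GROUP BY date";

        // Splout routes the query to the partition that owns the given key.
        URL url = new URL(qnode + "/api/query/" + tablespace
            + "?key=" + URLEncoder.encode(hashtag, "UTF-8")
            + "&sql=" + URLEncoder.encode(sql, "UTF-8"));

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // JSON result set
        }
        in.close();
    }
}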

The integration with Hadoop follows the same design principles as ElephantDB’s (they are well explained in these slides). In a nutshell, data is indexed and partitioned offline in a Hadoop cluster. The generated data structures are then moved to a read-only serving cluster, which is the one that answers user queries.
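
The routing piece of that design is easy to picture. Here is a toy, hypothetical illustration (not Splout's actual code) of how a query key can be mapped to the partition whose key range contains it, once an offline job has split the key space into ranges:

import java.util.TreeMap;

public class PartitionMapSketch {
    // Maps the lower bound of each key range to its partition id.
    private final TreeMap<String, Integer> lowerBoundToPartition = new TreeMap<String, Integer>();

    public PartitionMapSketch() {
        // Hypothetical ranges as an offline sampling job might produce them:
        lowerBoundToPartition.put("", 0);  // keys < "h"          -> partition 0
        lowerBoundToPartition.put("h", 1); // "h" <= keys < "q"   -> partition 1
        lowerBoundToPartition.put("q", 2); // keys >= "q"         -> partition 2
    }

    // Returns the partition whose key range contains the given key.
    public int partitionFor(String key) {
        return lowerBoundToPartition.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        PartitionMapSketch map = new PartitionMapSketch();
        System.out.println(map.partitionFor("hadoop"));  // -> 1
        System.out.println(map.partitionFor("bigdata")); // -> 0
    }
}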
