Lambda Architecture: A state-of-the-art
It’s been some time now since Nathan Marz wrote the first Lambda Architecture post. What has happened since then? How has the community reacted to such a concept? What are the architectural trends in the Big Data space, as well as the challenges and remaining problems?
In this post we share the talk (and the slides) Datasalt gave about analytics for ad networks at the past Big Data Spain.
In the talk, we sketch the architecture of the whole system. Specifically, we talk about how Splout SQL was designed and why it is useful for the case of Ad Networks. Other techniques that were needed for the problem, like sampling and in-memory storage, are stated as well.
The post will finish with a brief explanation of the topics covered by the talk. Enjoy it!
In this post we will see how easy it is to integrate a Pangool MapReduce Job with MongoDB, the famous document-oriented NoSQL database. For that, we will perform a review scraping task on Qype HTML pages. The resultant reviews will be then persisted into MongoDB.
” in Berlin is a famous burger venue. But what do really people think about it? Has their opinion changed recently? A lot of companies and start-ups work nowadays on solving this problem. ReviewPro
helps hotels know what people think of them. BrandWatch
helps companies watch people’s opinions on their brand. In this post we will perform a simple review parsing job for Qype
HTML pages. The given implementation will be very simple, but we will also give an idea on how to extend it to a “real-life” solution.
On the 10th of April Pere gave a Trident hackaton at Berlin’s Big Data Beers. There was also a parallel Disco hackaton by Dave from Continuum Analytics. Es war viel spaß! The people who came had the chance to learn the basics of Trident in the Storm session while trying it right away. The hackaton covered the most basic aspects of the API, the philosophy and typical use cases for Storm and included a simple exercise that manipulated a stream of fake tweets. The project with the session guideline, some runnable examples and the tweet generator can be found on github.
In this post we will see an introductory overview of Trident’s API that can be followed with the help of the aforementioned github project.
We have created Splout Cloud, a web-latency managed service in the AWS cloud. Simply put, Splout Cloud converts any data files – regardless of their size – into a scalable, partitioned SQL querying engine. And unlike offline analytics engines, it is highly performant, fast enough so as to be able to provide sub-second queries and real-time aggregations to feed arbitrary web or mobile applications.
Splout Cloud is based on Splout SQL, an open-source SQL database for Hadoop.
(This is the last post of a series of three posts presenting Splout SQL 0.2.2 native integration with main Hadoop processing tools: Cascading, Hive and Pig).
In this post we’ll present an example Big Data use case for a big food retail store. We will summarize individual client’s purchases using Apache Pig and we will dump the analysis into Splout SQL for being able to query it in real-time. Then, we will be able to combine the summarized information with a list of promotional products and suggest discounts on particular products for every client, in real-time. This information could be easily used by a coupon printing system for increasing client loyalty.
Combining an agile Big Data processing tool such as Pig with a flexible, low-latency SQL querying system such as Splout SQL provides a simple yet effective solution to this problem, which we will be able to simulate through this post almost with no effort.
(This is the second post of a series of three posts presenting Splout SQL 0.2.2 native integration with main Hadoop processing tools: Cascading, Hive and Pig).
In this post we’ll present an example Big Data use case for analyzing tweets and reporting consolidated, meaningful statistics to Twitter users through an interactive, low-latency webapp. For that we will marry Hive (an open-source warehousing solution for Hadoop that enables easy analysis and summarization over Hadoop) with Splout SQL, a highly-performant, low-latency partitioned SQL for Hadoop. We will build a very simple – yet scalable – analysis tool such as the Tweet archivist, and we will do it without even coding. The tool will provide historical mentions, summarized hashtags and popular tweets for every actor in the input dataset.
(This is the first post of a series of three posts presenting Splout SQL 0.2.2 native integration with main Hadoop processing tools: Cascading, Hive and Pig).
In this post we will show how to use Trident, Hadoop and Splout SQL together to build a toy example “lambda architecture“. We will learn the basics of Trident, a higher-level API on top of Storm, and Splout SQL, a fast SQL read-only DB for Hadoop. The example architecture is hosted on this github project. We will simulate counting the number of appearances of hashtags in tweets, by date. The ultimate goal is to solve this simple problem in a fully scalable way, and provide a remote low-latency service for querying the evolution of the counts of a hashtag, including both consolidated and real-time statistics for it.
So for any hashtag we want to be able to query a remote service to obtain per-date counts in a data structre like this:
Today we are proud to announce Splout SQL, a web-latency SQL spout for Hadoop.
We have been working hard for the last months preparing the release of what we think is a novel solution for serving Big Data. We have taken ideas from projects such as Voldemort or ElephantDB to bring SQL back to the realm of web-latency Big Data serving.
The motivation for Splout SQL is well explained in this post.
Splout SQL fundamentals
Splout SQL is a scalable, fault-tolerant, read-only, partitioned, RESTful, SQL database that integrates tightly with Hadoop. Splout SQL can be used to query Hadoop-generated datasets under high-load, in sub-second latencies. This means Splout SQL is suitable for web or mobile applications that demand a high level of performance. As an example, we have used Splout for serving 350GB worth of data from the “Wikipedia Pagecounts” dataset, performing arbitrary aggregations over it (see the benchmark page for more details).
The integration with Hadoop follows the same design principles that those from Elephant DB (which are well explained in these slides
). In a nutshell, data is indexed and partitioned
offline in a Hadoop cluster. The generated data structures are then moved to a read-only serving cluster
, which is the one that answers user queries.