It’s been some time now since Nathan Marz wrote the first Lambda Architecture post. What has happened since then? How has the community reacted to such a concept? What are the architectural trends in the Big Data space, as well as the challenges and remaining problems?
Big Data: Batch processing-only
Despite the attractiveness of a dual batch / real-time architecture, there exists a wide variety of problems in Big Data for which a batch layer is good enough, and I think it will continue to be so.
The consolidated adoption of Hadoop, together with the dramatic improvement of its available tools, makes it today often the main architectural requirement for solving many Big Data challenges. With SQL-on-Hadoop tools such as Impala or Apache Drill it is no longer the case that data entering Hadoop is not rapidly actionable. Operational applications – e.g. involving recommendations, long-history analytics, client segmentation – benefit from a richer ecosystem, resulting in shorter development cycles; as well as improved hardware with better underlying software – therefore shorter batch cycle times. More frameworks to choose (Spark, Stratosphere) and future paradigms like Tez (remember Dryad?) complete the batch-processing scenario.
(The story of Spiderio’s architectural evolution came into my mind, switching from real-time to batch at some point.)
Big Data: Real-time processing-only
Targeted to businesses where real-time is crucial, we are starting to see interesting real-time solutions such as Druid. Stream processing and NoSQL-centric solutions remain a trend (remember Facebook’s move to HBase?). Storm improved its default API by adding Trident, which adds exactly-once semantics – and Google announced Millwheel, which, as usual, seems to take the state-of-the-art one step further. Add Spark Streaming to the mix.
Still challenging to think about the human fault-tolerancy guarantees of such systems, their complexity and their overall price. And interesting to see how they kind of meet in the middle, incorporating batch storage / indexing sometimes as part of their architecture.
“Unified” Lambda Architectures
What seemed once impossible, is starting to be achieved: developing Lambda Architectures using a “unified” framework which makes the same code available to both the real-time and batch layer, and combines the results of both layers transparently. Two tools are in the scope for achieving that:
- Summingbird: Based on “monoids”, supporting Hadoop and Storm as backing infrastructure. Used at Twitter with “Manhattan” as read-only store and Memcached as “real-time” store.
- Lambdoop: Based on user-defined “operations”, Avro schemas, supports Hadoop and Storm as backing infrastructure and uses HBase as batch store and Redis as real-time store.
With Lambdoop not yet released and Summingbird being a very recent release, the usefulness of these frameworks is yet to be seen. It might be the case that they are too complex or restrictive, or that they evolve naturally into something very usable. By now the kind of operations that one can implement using these frameworks seem somehow restricted by the nature of the frameworks themselves, and the final data serving one can expect is restricted to pure key/value serving.
“Free” Lambda Architectures
What still seems to me the most common case of Lambda Architecture, is a “literal” implementation of it where each layer operates independently and results are merged depending on the business nature.
When things like complex documents and search-engines come into play, it becomes harder to reason about an hypothetic “unified” framework that would provide the guarantees and semantics of a batch-processing layer and the immediateness of a speed layer. The guys at Trovit made their system coordinate a batch and speed layer using Zookeeper.
As another example, at Datasalt we implemented analytics for Ad Networks using Splout SQL as the batch read-only datastore. This allows us to precalculate less things and calculate real-time aggregations on-the-fly i.e. ad-hoc drill-downs. Tied with Splout, it is very easy to deploy a real-time SQL layer that compliments it using well-proven technologies such as MySQL, keeping in mind that the hardest problems are already solved by the batch layer.
I also remember the guys at GameDuell somehow managed to mix Hadoop and VoltDB bi-directionally with quick, few-minutes model re-calculations.
With things like Yahoo’s Storm-Hadoop for YARN, and Big Data tools being nowadays easier and easier to install and manage (e.g. Cloudera Manager or Apache Ambari making life much easier) it will become less of a pain to deploy an open-source Big Data architecture, implement batch-processing and colocate it with real-time processing. What’s more, as stated, many cases will still be easily solvable just by batch processing. And as we have seen, newer real-time processing frameworks may already provide re-indexing / re-processing semantics underneath.
Because of the many possible use cases, data natures and serving requirements, it is unclear to me whether frameworks like Summingbird or Lambdoop will be key to implementing Lambda Architectures. But who knows, maybe the best is yet to come!