In this post we will review the current approaches for servicing Big Data, that is, for processing an arbitrary number of queries with sub-second latencies over a huge dataset, under high load, on a scalable cluster of machines.
Think Twitter, Facebook, Linkedin. Think servicing Hadoop-generated datasets.
What possibilities does the open-source world give us for building a website whose queries hit such a huge dataset? What are the most common problems we might encounter in such a scenario, and how well do these tools address them?
Then, we will propose a new architecture that provides a scalable yet rich solution for this problem.
Servicing Big Data
What does it mean to service Big Data? It means basically two things:
1) Being able to answer queries in sub-second latencies under heavy loads. We are not talking about analytic, ad-hoc queries. We are talking about web latencies.
2) Being able to update the underlying data, sometimes replacing it entirely, with minimal impact on 1).
And all the above using technologies that scale horizontally.
Database-centric systems and their complexities.
Nowadays with all the NoSQL and NewSQL products around we are surely tempted to build a database-centric system for solving this problem. We can incrementally add all the data we need to a database and just let the same database service queries for us.
But it is nothing new to say that database-centric systems are complex to manage and maintain. A few commonly stated ideas sustain this:
- Evolving schemas is difficult.
- Managing incremental state is hard and error-prone.
- Programming errors are hard to repair.
Nathan Marz explains this very well together with his Lambda Architecture proposal.
Batch processing to the rescue.
We have learned that it is a well-proven and convenient recipe to use a distributed storage and batch-oriented computing platform such as Hadoop to keep the whole raw input dataset and then process it in order to extract value from the data.
Scalable batch processing simplifies the life of developers and devops, making it easy to change everything from one day to the next without pain. And, on top of that, it scales well.
Following this philosophy you have a complete new data view from your input dataset at the end of each batch process ready to be pushed to the servicing database. The process of updating the servicing database with this data is what we call deploying.
This way of doing things is certainly a paradigm shift. Rather than incrementally updating your data view, you pre-generate it all at once. The drawback: it is not real-time. But, as Nathan says (see the link above), the benefits of batch-oriented view generation are so big that it is worth building a secondary system for dealing with the real-time part of your system, rather than trying to fit everything into the same system.
From now on, we will focus on the batch part of the “lambda”.
Spouts for Big Data pre-generated views
But once we have precomputed a data view, what are the options we have nowadays for efficiently deploying it? Frankly, there are not many.
Given that Hadoop gives us the power to pre-compute all our view data structures, we expect to be able to deploy them seamlessly, that is, atomically (for example, by performing a simple move between file folders). However this is not a common feature: the most common way of deploying precomputed views to a servicing data store is still to load them incrementally with random writes, which is both error-prone and a potential bottleneck for the system.
Currently, we have essentially two good options:
- Key / Value stores: Stores such as Voldemort or ElephantDB allow for seamless, atomic data updates from Hadoop.
- Full-text search: SOLR has a core-swap feature which allows for atomic data view updates.
However, there are tons of NoSQL and NewSQL products out there with a wide variety of query systems, and none of them play really well with Hadoop or, generally speaking, with Big Data bulk-loading. The only way of using them is through connectors such as Sqoop, which makes bulk data loading error-prone and inefficient. Some of them, like MongoDB, even have their own Hadoop OutputFormats which, despite being meant for use with Hadoop, just perform random writes to the underlying data store. This, together with the fact that the hard work of these databases (for example, building B-Trees) cannot be done offline, makes using them in conjunction with Hadoop neither efficient nor advisable.
The niche: Big Data + Batch processing + SQL
So from our point of view there is something clearly missing in the Big Data space: a tool which makes loading and servicing pre-computed Big Data views easy and seamless. And such a tool doesn’t need to be as restrictive as a Key / Value store. For example, it could support SQL.
“Why care about SQL?”, you say.
Even though SQL is not as fashionable as it used to be, SQL database engines provide key functionality such as real-time dynamic grouping, ranges, and more. Nowadays, you either pre-compute all your groups or you are out of luck. But what if you want to pre-compute only some groups and, at the same time, be able to perform groupings on the fly?
For instance, consider a “Google Analytics”-like application where you have precomputed the pageviews of your site for each day. Couldn’t you offer the user the possibility of aggregating all the pageviews between two arbitrary dates? (And don’t try to precompute every range, because there are far too many possibilities!)
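To make the point concrete, here is a small sketch using an embedded SQL engine (SQLite here, with a hypothetical `pageviews` table standing in for the batch-generated view): the daily totals are pre-computed once, and any date range is aggregated on the fly with a single query.

```python
import sqlite3

# Hypothetical pre-computed daily pageviews, as a batch job would produce them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (url TEXT, day TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?)",
    [("/home", "2012-11-01", 120),
     ("/home", "2012-11-02", 95),
     ("/home", "2012-11-03", 143)])

# The engine aggregates any date range at query time -- no need to
# pre-compute every possible (start, end) combination.
(total,) = conn.execute(
    "SELECT SUM(views) FROM pageviews "
    "WHERE url = ? AND day BETWEEN ? AND ?",
    ("/home", "2012-11-01", "2012-11-02")).fetchone()
# total is 215 (120 + 95)
```

Only the daily granularity is pre-computed; the quadratic space of date ranges never needs to be materialized.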
We didn’t give up SQL because of SQL itself. We gave up monolithic, database-centric engines in order to cope with big amounts of data. Now we want to bring SQL back to the real-time querying domain on Big Data.
The basic idea is to take a step back from what we have learned so far and use a non-distributed data store, making it distributed by leveraging Hadoop and a basic coordination system. Because we isolate the servicing from the creation of the data structures, this data store can then be anything: an embedded SQL engine or anything else you want.
Simply put, such a system behaves as follows:
- The user provides a big dataset.
- The user provides a Partitioning function and a Key selection function. The “Key selection function” gives the system a key for each row of the dataset. The “Partitioning function” then operates on that key to decide which partition each record will go to.
- The partitioning function can be omitted, in which case the system will automatically derive one that spreads the data evenly across partitions (for example, by ranges, performing sampling over the keys).
- The system builds all the underlying data view structures in a scalable building cluster (e.g. Hadoop) and deploys them seamlessly to a servicing cluster. The system also takes care of versioning the existing data set for allowing rollback operations.
- Users can query the data views by providing one or more “partition keys”, which are then used to redirect the queries to the appropriate partitions. This is analogous to a Key / Value store where the user provides a key and receives a value. In this case, because queries are rich, multiple keys may be specified and several engines impacted. (To keep it simple, we don’t add a query parser that automatically detects which shards need to be queried; we make it explicit.)
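The key selection, partitioning, and query-routing steps above can be sketched in a few lines of Python (all function names here are hypothetical illustrations, not part of any actual API):

```python
import hashlib

def key_of(record):
    # Hypothetical key selection function: here we partition
    # pageview-like records by their URL field.
    return record["url"]

def partition_of(key, n_partitions):
    # Simple hash-based partitioning function; a range-based one built
    # by sampling the keys could be plugged in instead.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions

def route_query(partition_keys, n_partitions):
    # Queries carry explicit partition keys; the system forwards the
    # query only to the partitions those keys map to.
    return sorted({partition_of(k, n_partitions) for k in partition_keys})
```

Because the same function is used at build time and at query time, every key is guaranteed to be looked up in the partition where the batch process placed it.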
The most remarkable points of this system are, on one hand, that any datastore can be used underneath (notably datastores that don’t provide any sharding or parallel architecture), as long as its structures and indexes can be pre-generated.
On the other hand, sharding and balancing are straightforward because they are handled offline in a batch process, so the overall system is very simple and yet horizontally scalable. (But notice that, for example, random writes are not supported by this system. Indeed, random writes are what make systems more complex.)
Details like replication for failover are left out of this discussion. There are also some things that are difficult to achieve in such a system because they involve coordination: for instance, handling data-loading errors properly and always servicing a consistent data version to the user. All of this can be achieved by using a coordination system such as ZooKeeper.
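The coordination pattern can be illustrated with a minimal in-memory sketch (a real system would keep this state in ZooKeeper znodes; the class and method names are hypothetical): the “current version” pointer only flips once every partition has loaded the new version, so queries always hit a complete, consistent version, and a failed or partial load simply leaves the previous version in service.

```python
class VersionCoordinator:
    """In-memory sketch of the coordination logic; in a real deployment
    this state would live in ZooKeeper so all nodes share it."""

    def __init__(self, n_partitions):
        self.n_partitions = n_partitions
        self.loaded = {}           # version -> set of partitions reporting OK
        self.current_version = None

    def report_loaded(self, version, partition):
        # Each serving node reports when it has finished loading a
        # partition of the new version.
        self.loaded.setdefault(version, set()).add(partition)

    def try_promote(self, version):
        # Flip the 'current version' pointer only when every partition
        # has loaded; otherwise keep servicing the previous version.
        if self.loaded.get(version, set()) == set(range(self.n_partitions)):
            self.current_version = version
            return True
        return False
```

Rollback falls out of the same mechanism: since old versions are kept around, repointing `current_version` at a previous version is all it takes.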
So, to recapitulate:
- Experience has shown us the benefits of batch processing for tackling Big Data problems. Nathan Marz explains this very well together with the idea of a Lambda Architecture.
- Hadoop is a great product for managing Big Data batch processes.
- There are many NoSQL and NewSQL products offering a wide variety of query engines but very few seem to play well with Hadoop.
- We have shown that it is possible to build a system that plays well with Hadoop and yet offers a rich query language like SQL for servicing web-latency queries.
- We have been building such a system and we are open-sourcing it very soon, so stay tuned!