Today we are proud to announce Splout SQL, a web-latency SQL spout for Hadoop.
We have been working hard for the last months preparing the release of what we think is a novel solution for serving Big Data. We have taken ideas from projects such as Voldemort or ElephantDB to bring SQL back to the realm of web-latency Big Data serving.
The motivation for Splout SQL is well explained in this post.
Splout SQL fundamentals
Splout SQL is a scalable, fault-tolerant, read-only, partitioned, RESTful, SQL database that integrates tightly with Hadoop. Splout SQL can be used to query Hadoop-generated datasets under high-load, in sub-second latencies. This means Splout SQL is suitable for web or mobile applications that demand a high level of performance. As an example, we have used Splout for serving 350GB worth of data from the “Wikipedia Pagecounts” dataset, performing arbitrary aggregations over it (see the benchmark page for more details).
Elephant DB does something similar. Project Voldemort can do that as well. Inspired by them, we have summarized concepts such as hot-swapping, rollback, versions and domains / stores / tablespaces into a simple and friendly API for deploying read-only datasets.
But Splout can create arbitrary tables / B-Trees, offering a real SQL view over the final dataset. Therefore Splout SQL is a richer, more powerful extension of those key / value stores, which allows for arbitrary real-time SQL aggregations over the generated views.
How it works
Data structures needed to answer queries are generated using “splout-hadoop” command-lines or API methods. The API uses a Pangool batch process underneath for performing the data distribution and indexing. The data files generated are no more than SQLite binary files and they are saved into the HDFS or any other compatible file system (for example, S3). Those files are then pulled from a different cluster (the serving cluster, running Splout SQL daemons) under a deploy request, and this distributed, asynchronous deploy is coordinated by Hazelcast until success or failure.
On the serving side, user queries are routed through a “QNode” service, who knows the distribution of the underlying dataset. “DNode” services are the ones who hold the SQL files, answer the queries and send the results back upstream. The result of the query is finally returned to the user as a JSON through its HTTP interface.
- The system is able to distribute the data evenly among shards by using sampling techniques (such as the one used by TeraSort and algorithms such as Reservoir Sampling).
- In order to sample and distribute the data there must be some sort of “partition key” concept. In Splout, datasets can be partitioned by any function of a row. Usually this will be a column or a set of columns of the dataset. This function allows the sampling process to determine the histogram of the dataset and distribute the data evenly into the desired number of partitions.
- Queries are always routed to a single partition, so the user must provide the partition key for every query. This key must of course be consistent with the partitioning strategy used. (This is the equivalent of providing a “key” to get a value in a key / value store. In this case, you provide a “key” and you get back the result of a SQL query).
- Fail-over is achieved through straight-forward replication: each generated partition is copied to multiple DNodes so that if one of them fails, the other ones can still serve that partition.
- The user can create any number of tables inside tablespaces, and deploy an arbitrary number of tablespaces. Tables can be partitioned or replicated to each partition (for example, small tables can be configured to be fully replicated to every partition so that join queries are always possible within any partition).
- Deploys are all-or-nothing and don’t affect query serving, so the overall system is very robust, simple and yet very flexible. And the user can rollback to previous versions, atomically.
Splout SQL in the Big Data space
We have designed Splout SQL to fill a gap in the Big Data space. We think Splout is the best candidate “Batch Layer” of a “Lambda architecture“.
Splout SQL integrates well with Hadoop, can answer low-latency queries and is prepared to stand datasets that exceed available RAM just like Voldemort or Elephant DB, but it offers SQL real-time aggregations instead of just key / value lookups.
On the other hand, NoSQL databases such as MongoDB, Cassandra and HBase, although not being SQL, they offer a wide variety of features, may be used in conjunction with Hadoop and may seem to overlap with Splout SQL’s objectives. However, these are much more complex systems to use and manage. And as far as we know, at the date of this document none of them provide a bridge with Hadoop that 1) is easy to use and manage 2) doesn’t affect query serving 3) is all-or-nothing, with the possibility of rollback.
And finally, there are emerging “fast analytics” engines on top of Hadoop that deliver fast SQL queries over Hadoop-generated files (Impala, Apache Drill). However, as far as we know, they are conceived for off-line, analytics applications. They are not designed to stand multi-threaded, heavy-traffic environments with low-latency queries in the sub-second realm. They may leverage column-oriented storage for making scans really fast, but they lack the necessary B-Trees to fast-lookup any row. Overall, these are suitable for queries that impact the whole dataset, whereas Splout SQL is made to answer a lot of queries that span very small portions of the dataset.
Splout is not 1.0 yet but we will appreaciate your feedback, so try it.