Since the mainstream adoption of Hadoop, many open-source and enterprise tools have emerged recently which aim to solve the problem of “querying Hadoop”. Indeed, Hadoop started as a simple (yet powerful) Big Data framework consisting of (1) a storage layer (HDFS, inspired by GFS) and (2) a processing layer (an implementation of MapReduce). Only after a couple of years the first releases of HBase (inspired by BigTable) started to make data on Hadoop more actionable. After that, several NoSQL’s and query engines started to work on Hadoop connectors (SOLR-1301 is one of the oldest). However, the problem of “querying Hadoop” remained unsolved in two main aspects:
- Flexibility: Query engines outside Hadoop had discerning features and query languages. Many NoSQL’s were limited by their nature to key/value serving. Choosing a NoSQL always involved a tradeoff, and choosing more than one (the so-called “Polyglot Persistence”) made things harder.
- Integration: Making data queryiable by a query engine was a challenge by itself (How does this database integrate with Hadoop? What does the connector look like? How mature it is? How efficient is it? Can I update all my data view atomically, in a ‘big transaction’?).
After some years of “NoSQL hipe”, voices from the past – saying: SQL, SQL, … – appeared as a solution for Hadoop’s querying flexibility. It is true that Sqoop (as old as HADOOP-5815) already made it possible to connect Hadoop with RDBMs, however this didn’t really solve the integration part of the problem. It is also true that “Batch SQL” was available since first releases of Hive (late 2008), but latency-speaking this never really made Hadoop queryiable, but rather “MapReduce avoidable”.
Even though SQL is a 40-year-old language, it has many elements which makes it convenient as a querying language for Hadoop:
- It provides the necessary querying flexibility for calculating arbitrary aggregates without needing to pre-calculate all of them beforehand.
- It is familiar to many people and companies, therefore making it possible to reuse knowledge.
- It supports joins, thus not forcing the underlying data model be completely denormalized.
- It makes it easy to integrate existent BI and visualization tools through JDBC or ODBC.
The rise of “SQL on Hadoop” solutions in the last years can be also explained by the maturity of Hadoop. Since Hadoop is now really being adopted by more and more companies, these have pushed the market to offer SQL on top of it. As an example, we see more and more mixed Data Warehouse architectures (see this good paper for reference on the matter).
In this post we will review the state of the art of Hadoop’s SQL querying engines. We will provide a table with the use cases each engine is oriented towards, namely: (1) Batch SQL (long-lived processes that read and write arbitrarily big datasets), (2) Interactive querying (MPP-like times for ad-hoc queries, in the order of seconds or minutes), (3) Point querying (sub-second reads, for web and mobile applications) and (4) Operational SQL (read+writes, ACID transactions). Sometimes an engine will overlap various use cases, which is expected. We will also try to incorporate not so well known solutions, as current “SQL on Hadoop” reports are slightly biased towards mainstream solutions. We will focus mainly on open-source solutions.
There are many details and intricacies in all of those, and it is difficult to compare them (although MapR did a very good job in this post). We recommend analyzing all of them in detail in the following dimensions: community, maturity, SQL completeness, JDBC/ODBC integration, only in-memory or disk-based, data format support, …
- Hive: It is a well-known tool on the ecosystem since late 2008. It is evolving into an interactive querying engine through “Stinger Phase III”. At Berlin’s Big Data Beers we had the opportunity to hear an update on this project.
- LINGUAL: It is Cascading’s SQL interface, akin to Hive, but unlike Hive it aims to support pure ANSI SQL. As far as Cascading runs on Tez in the future, LINGUAL will as well.
- Apache Tajo: Apache Top-level Project since 2014-03-24, designed for both batch and interactive SQL, it supports standard SQL and has cost-based join optimizer and extendable rewrite rule engines. It uses its own DAG-style distributed processing engine.
- Impala: Cloudera’s popular SQL on Hadoop. At Berlin’s Big Data Beers we had the opportunity to hear an update on its roadmap, which seems to include disk-based joins and aggregators for the future.
- Apache Drill: Analytical ad-hoc querying engine, targeted to arbitrary engines and formats. Unlike Impala, community-driven.
- Shark: Spark’s SQL layer, compatible with Hive.
- BigSQL: Interactive querying through PostgreSQL. This is quite new and it took a me a while to realize, but I finally found the proof. It has nothing to do with IBM’s interactive SQL querying, a propietary thing inside IBM BigInsights.
- Presto: Facebook’s SQL on Hadoop, aking to Apache Drill, with pluggable backends.
- Hadapt (before, HadoopDB), a propietary technology which collocates RDBMs engines (PostgreSQL) in a Hadoop cluster, and delegates queries to them.
- RainStor: Proprietary MPP SQL engine, it has its own storage format which aims to compress even more than ORC.
- Splout SQL: Decoupled serving engine from Hadoop, generates partitioned SQL files, which allows to use pure SQL and arbitrary indexes on them.
- Phoenix: SQL on HBase co-processors, oriented towards low-latency, recently accepted as Apache incubator project.
- JethroData: Indexes every column of data as it arrives. First version expected to be available soon. Proprietary code with community edition available.
- Spire: Real-time SQL on Hadoop. Ever since this announcement I’ve been wondering about this product, but now it seems to be kind of dead?
- Splice Machine: General purpose SQL on Hadoop for both OLTP and OLAP use cases, supporting ACID transactions.
- HAWK: Proprietary engine with EMC behind, leveraging research and development on Greenplum.
|Hive||Yes||Subject to Stinger (Tez)|
|LINGUAL||Yes||In the future, if Cascading supports Tez|
|Impala||Yes||Some form of, with HBase|
|Apache Drill||Yes||Some form of, with HBase|
|Presto||Yes||Some form of, with HBase|
|Splout SQL||Limited to a single partition, non-columnar storage||Yes|
|Splice Machine||Yes||Yes||Yes (ACID)|
Do you think there is some engine missing here? Do you see something wrong? Please contact us if so.