← blog

SQL on Hadoop: A state of the art

Since the mainstream adoption of Hadoop, many open-source and enterprise tools have emerged recently which aim to solve the problem of “querying Hadoop”. Indeed, Hadoop started as a simple (yet powerful) Big Data framework consisting of (1) a storage layer (HDFS, inspired by GFS) and (2) a processing layer (an implementation of MapReduce). Only after a couple of years the first releases of HBase (inspired by BigTable) started to make data on Hadoop more actionable. After that, several NoSQL’s and query engines started to work on Hadoop connectors (SOLR-1301 is one of the oldest). However, the problem of “querying Hadoop” remained unsolved in two main aspects:

  • Flexibility: Query engines outside Hadoop had discerning features and query languages. Many NoSQL’s were limited by their nature to key/value serving. Choosing a NoSQL always involved a tradeoff, and choosing more than one (the so-called “Polyglot Persistence”) made things harder.
  • Integration: Making data queryiable by a query engine was a challenge by itself (How does this database integrate with Hadoop? What does the connector look like? How mature it is? How efficient is it? Can I update all my data view atomically, in a ‘big transaction’?).

After some years of “NoSQL hipe”, voices from the past – saying: SQL, SQL, … – appeared as a solution for Hadoop’s querying flexibility. It is true that Sqoop (as old as HADOOP-5815) already made it possible to connect Hadoop with RDBMs, however this didn’t really solve the integration part of the problem. It is also true that “Batch SQL” was available since first releases of Hive (late 2008), but latency-speaking this never really made Hadoop queryiable, but rather “MapReduce avoidable”.

Even though SQL is a 40-year-old language, it has many elements which makes it convenient as a querying language for Hadoop:

  • It provides the necessary querying flexibility for calculating arbitrary aggregates without needing to pre-calculate all of them beforehand.
  • It is familiar to many people and companies, therefore making it possible to reuse knowledge.
  • It supports joins, thus not forcing the underlying data model be completely denormalized.
  • It makes it easy to integrate existent BI and visualization tools through JDBC or ODBC.

The rise of “SQL on Hadoop” solutions in the last years can be also explained by the maturity of Hadoop. Since Hadoop is now really being adopted by more and more companies, these have pushed the market to offer SQL on top of it. As an example, we see more and more mixed Data Warehouse architectures (see this good paper for reference on the matter).

In this post we will review the state of the art of Hadoop’s SQL querying engines. We will provide a table with the use cases each engine is oriented towards, namely: (1) Batch SQL (long-lived processes that read and write arbitrarily big datasets), (2) Interactive querying (MPP-like times for ad-hoc queries, in the order of seconds or minutes), (3) Point querying (sub-second reads, for web and mobile applications) and (4) Operational SQL (read+writes, ACID transactions). Sometimes an engine will overlap various use cases, which is expected. We will also try to incorporate not so well known solutions, as current “SQL on Hadoop” reports are slightly biased towards mainstream solutions. We will focus mainly on open-source solutions.

There are many details and intricacies in all of those, and it is difficult to compare them (although MapR did a very good job in this post). We recommend analyzing all of them in detail in the following dimensions: community, maturity, SQL completeness, JDBC/ODBC integration, only in-memory or disk-based, data format support, …

  1. Hive: It is a well-known tool on the ecosystem since late 2008. It is evolving into an interactive querying engine through “Stinger Phase III”. At Berlin’s Big Data Beers we had the opportunity to hear an update on this project.
  2. LINGUAL: It is Cascading’s SQL interface, akin to Hive, but unlike Hive it aims to support pure ANSI SQL. As far as Cascading runs on Tez in the future, LINGUAL will as well.
  3. Apache Tajo: Apache Top-level Project since 2014-03-24, designed for both batch and interactive SQL, it supports standard SQL and has cost-based join optimizer and extendable rewrite rule engines. It uses its own DAG-style distributed processing engine.
  4. Impala: Cloudera’s popular SQL on Hadoop. At Berlin’s Big Data Beers we had the opportunity to hear an update on its roadmap, which seems to include disk-based joins and aggregators for the future.
  5. Apache Drill: Analytical ad-hoc querying engine, targeted to arbitrary engines and formats. Unlike Impala, community-driven.
  6. Shark: Spark’s SQL layer, compatible with Hive.
  7. BigSQL: Interactive querying through PostgreSQL. This is quite new and it took a me a while to realize, but I finally found the proof. It has nothing to do with IBM’s interactive SQL querying, a propietary thing inside IBM BigInsights.
  8. Presto: Facebook’s SQL on Hadoop, aking to Apache Drill, with pluggable backends.
  9. Hadapt (before, HadoopDB), a propietary technology which collocates RDBMs engines (PostgreSQL) in a Hadoop cluster, and delegates queries to them.
  10. RainStor: Proprietary MPP SQL engine, it has its own storage format which aims to compress even more than ORC.
  11. Splout SQL: Decoupled serving engine from Hadoop, generates partitioned SQL files, which allows to use pure SQL and arbitrary indexes on them.
  12. Phoenix: SQL on HBase co-processors, oriented towards low-latency, recently accepted as Apache incubator project.
  13. JethroData: Indexes every column of data as it arrives. First version expected to be available soon. Proprietary code with community edition available.
  14. Spire: Real-time SQL on Hadoop. Ever since this announcement I’ve been wondering about this product, but now it seems to be kind of dead?
  15. Splice Machine: General purpose SQL on Hadoop for both OLTP and OLAP use cases, supporting ACID transactions.
  16. HAWK: Proprietary engine with EMC behind, leveraging research and development on Greenplum.
Engine Batch Interactive Point-querying Operational
Hive Yes Subject to Stinger (Tez)
LINGUAL Yes In the future, if Cascading supports Tez
Apache Tajo Yes Yes
Impala Yes Some form of, with HBase
Apache Drill Yes Some form of, with HBase
Shark Yes Yes
BigSQL Yes Yes
Presto Yes Some form of, with HBase
Hadapt Yes
RainStor Yes
Splout SQL Limited to a single partition, non-columnar storage Yes
Phoenix Yes Yes
JethroData Yes Yes
Spire Yes Yes Yes
Splice Machine Yes Yes Yes (ACID)
HAWK Yes Yes Yes (ACID)

Do you think there is some engine missing here? Do you see something wrong? Please contact us if so.



6 Comments on "SQL on Hadoop: A state of the art"

  1. Hyunsik Choi says:

    Hi Pere Ferrera Bertran,

    You might miss Apache Tajo which is one of the most promising SQL-on-hadoop system. Apache Tajo is a top-level project in Apache software foundation. As you can see, it is actively developed in ASF and has real use cases.

    Here is the informative links about Apache Tajo.
    * http://tajo.apache.org
    * https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces53
    * http://gigaom.com/2014/04/01/apache-tajo-sql-on-hadoop-engine-now-a-top-level-project/
    * http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space/
    * https://www.ohloh.net/p/tajo

    Also, I’d like to help you fill your summary table. Tajo is designed for both interactive and batch. Currently, it does not support point-querying and is not operational. In addition, Tajo supports SQL standard and has cost-based join optimizer and extendable rewrite rule engines. Tajo uses HDFS as a primary storage and uses its own DAG-style distributed processing engine.

    Thanks,
    Hyunsik Choi

  2. Thanks Hyunsik, I just added it!

  3. Hi,

    RainStor has an MPP SQL engine that runs natively on HDFS, as well as on NAS, Cloud and CAS. We’ve got production deployments on the major Hadoop distros. We were also recently certified on Cloudera 5.

    We’re under the ‘interactive SQL’ category. Mainly used for low-latency analytics in support of structured data archiving, although we do support SQL DELETE, so we’re not strictly read-only. We support SQL92, and have extended our grammar to include SQL features from Oracle, SQLServer and Sybase.

    We support schema versioning, data replication, encryption from top to bottom, data masking, retention rules, audit trails, and very aggressive data compression.

    Our compressed file format is similar (but more compressed) than ORC and Parquet. In addition to accessing rainstor data on HDFS through our MPP SQL engine, you can also access the same data through Hive, Pig and MapReduce.

    http://rainstor.com/solutions/rainstor-for-hadoop/

    Thanks,

    Mark Cusack

  4. Thanks Mark, I added RainStor as well.

  5. Very useful and comprehensive list – thanks for putting it together!

    A couple of clarifications regarding JethroData:
    - Point-querying is one of the main objectives of JethroData, and a key benefits of our full-indexing design
    - We expect a beta version to be available for download shortly
    - While we do not currently plan to open-sourse our code, we will provide a free community version, as well as, a paid-for enterprise version

    thanks, eli

  6. Thanks Eli, made the appropriate changes.

Leave a Comment