Since the mainstream adoption of Hadoop, many open-source and enterprise tools have emerged recently which aim to solve the problem of “querying Hadoop”. Indeed, Hadoop started as a simple (yet powerful) Big Data framework consisting of (1) a storage layer (HDFS, inspired by GFS) and (2) a processing layer (an implementation of MapReduce). Only after a couple of years the first releases of HBase (inspired by BigTable) started to make data on Hadoop more actionable. After that, several NoSQL’s and query engines started to work on Hadoop connectors (SOLR-1301 is one of the oldest). However, the problem of “querying Hadoop” remained unsolved in two main aspects:
- Flexibility: Query engines outside Hadoop had discerning features and query languages. Many NoSQL’s were limited by their nature to key/value serving. Choosing a NoSQL always involved a tradeoff, and choosing more than one (the so-called “Polyglot Persistence”) made things harder.
- Integration: Making data queryiable by a query engine was a challenge by itself (How does this database integrate with Hadoop? What does the connector look like? How mature it is? How efficient is it? Can I update all my data view atomically, in a ‘big transaction’?).