It’s been some years now since Google published the paper [“MapReduce: Simplified Data Processing on Large Clusters”] in 2004. In this paper Google presented MapReduce, a programming model and associated implementation for solving parallel computation problems over large-scale data. The model is based on the functional primitives “map” and “reduce” found in LISP and other functional languages.
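For readers unfamiliar with these primitives, here is a minimal sketch in Python (an illustration of the functional idea only, not Google's distributed implementation) that counts words the way a MapReduce word-count job conceptually does:

```python
from functools import reduce

words = ["to", "be", "or", "not", "to", "be"]

# "map": apply a function to every element, here emitting (word, 1) pairs
pairs = list(map(lambda w: (w, 1), words))

# "reduce": fold the pairs into a single result, here a dict of word counts
counts = reduce(
    lambda acc, pair: {**acc, pair[0]: acc.get(pair[0], 0) + pair[1]},
    pairs,
    {},
)

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

MapReduce generalizes this pattern: the map step runs in parallel over partitions of the input, and the framework groups the emitted pairs by key before handing each group to a reduce step.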
Today, Hadoop, the “de facto” open-source implementation of MapReduce, is used by a wide variety of companies, institutions and universities. The widespread adoption of this programming model has led to the creation of a multitude of associated tools (which have come to be known as the Hadoop ecosystem) and even specialized companies like Cloudera dedicated to training programmers to use it. Part of the success of such tools and companies lies in the now-evident difficulty and steep learning curve of MapReduce, as originally defined, when it is applied to practical problems.
In this post we’ll review the MapReduce model proposed by Google in 2004 and propose an alternative called Tuple MapReduce. We’ll see that this new model is a generalization of the original and explain what advantages it has to offer. We’ll provide a practical example and conclude by discussing when implementing Tuple MapReduce is advisable.