Spark is, in our opinion, the new reference Big Data processing framework. Its flexible API allows for unified batch and stream processing, and can be easily extended for many purposes. It also incorporates a new API called Data Frames which makes it terribly easy to analyze data while providing a solid columnar storage by default (Parquet). Spark is changing fast, currently going through a big refactor (Project Tungsten) which will bring it to the next level.
We see currently Spark as the default tool for building Big Data projects. Those who, like us, have been working with Hadoop’s MapReduce for a long time, encounter many subtle (and sometimes not so subtle) differences when approaching Big Data problems with both frameworks. This post covers some of them.
Today we will talk about one particular problem: How to group our data in Spark, given that some groups might contain a tremendous amount of data?