In this post we will review the current approaches to servicing Big Data, that is, to processing an arbitrary number of queries with sub-second latency over a huge dataset, under high load, on a scalable cluster of machines.
Think Twitter, Facebook, LinkedIn. Think servicing Hadoop-generated datasets.
What options does the open-source world give us for building a website whose queries hit such a huge dataset? What are the most common problems we might encounter in such a scenario, and how well do these tools solve them?
Then, we will propose a new architecture that provides a scalable yet rich solution for this problem.