In this post we share the talk (and the slides) Datasalt gave about analytics for ad networks at the past Big Data Spain.
In the talk, we sketch the architecture of the whole system. Specifically, we talk about how Splout SQL was designed and why it is useful for the case of Ad Networks. Other techniques that were needed for the problem, like sampling and in-memory storage, are stated as well.
The post will finish with a brief explanation of the topics covered by the talk. Enjoy it!
Ad Networks act as the middleman between advertisers and publishers on the Internet. The advertiser is the agent that wants to allocate a particular ad in different medias. The publisher is the agent who owns the medias. These medias are usually web pages or mobile applications.
Each time an ad is shown in a web page or in a mobile application an impression event is generated. These impressions and other events are the source of analytic panels that are used by the agents (advertiser and publisher) to analyze the performance of its campaigns or its web pages.
Presenting these panels to the agents is a technical challenge because Ad Networks have to deal with billions of events every day, and present interactive panels to thousands of agents. The scale of the problem requires the usage of distributed tools. Obviously, Hadoop may come to the rescue with its storage and computing capacity. It can be used to precompute several statistics that are later presented in the panels.
But that is not enough for the agents. In order to perform exploratory analysis, they need an interactive panel that allows to filter down by a particular web page, country and device in a particular time-frame, or any other ad-hoc filter you can think of.
Therefore, something more than Hadoop is needed in order to store the data, process it and be able to perform arbitrary drill-downs on it. At Datasalt, we have addressed this problem for some clients and we have found a solution that is presented in the talk.
The solution includes two modules: the off-line and the on-line.
The off-line module is in charge of storing the received events and preforming the most costly operations: cleaning the dataset, performing some aggregations in order to reduce the size of the data and creating some file structures that later will be used to serve the on-line analytics. All these tasks are handled properly by Hadoop. The most innovative part on this process is the last step where some file-structures are created for being exported to the on-line part in order to serve the analytics panels.
The on-line module is in charge of serving the analytic queries received from the agents’ panel webapp. The queries are basic statistics (count, count distinct, stddev, sum, etc) run over a subset of the input dataset represented by an ad-hoc filter. The challenge here is that the system has to serve statistics for filters “on the fly”. That makes it impossible to precalculate everything on the off-line side. Therefore, part of the calculations must be done on-demand. That would not be a problem if the scale of the data wouldn’t be that big – however, in our case, some kind of scalable serving engine is needed for this task.
Datasalt has developed the open-source distributed database Splout SQL (http://sploutsql.com/) that allows to create SQL materialized views over Hadoop data. It scales by the use of partitioning.
Splout SQL is the perfect tool for serving on-line statistics at scale for Ad Networks as the data can be partitioned by agent. In this way, it is possible to isolate queries corresponding to an individual agent to a particular partition.
A full SQL Lambda Architecture
With Splout SQL as batch serving engine, it is very easy to colocate a SQL real-time layer to form a full-SQL Lambda Architecture. This real-time layer only needs to contain a few hours or days of data, and can benefit from existing well-proven technologies such as MySQL. Because materialized views are grouped by a time-based key (day / hour), we only need to take care that the batch layer always calculates full periods (so no half-days, no half-hours). In this way, data from the batch layer always overrides the one in the real-time layer, which can be cleaned periodically.