One of our previous blog posts provided an overview of our data onboarding system, covering the basics of segmentation, matching and distribution. In this post, we will delve into the details of server-side distribution, one of the primary mechanisms we employ to transfer data to ad networks and data management platforms (DMPs).
The traditional method to transfer data to online destinations is based on placing an image pixel on a web page. When the page is opened by a browser, a request is made to retrieve the pixel, which gives the server handling that request the opportunity to drop and/or read a cookie, and issue a redirect URL with data appended to it in the form of a query string. We refer to this method as client-side distribution.
Server-side distribution, on the other hand, does not involve the browser at all, which takes away the latency and data size restrictions of the client-side approach. A client-side system is still needed to do cookie syncing, but all data transfer is done directly from server to server.
A more comprehensive description of cookie syncing and the benefits of server-side distribution is available here.
It is important to note that no personally identifiable information (PII) is used anywhere in this system. The segment data is anonymous and linked to browser-based identifiers only. Identifiers such as email address and postal address are used in the matching process and stripped from any of the data stores used for distribution.
Server-side distribution has two delivery modes: streaming and batch. Streaming mode is the continuous delivery of data by making HTTP requests to the API of the ad network/DMP. Batch mode consists of creating a file with data for all cookies synced in a time period (e.g. a day, a month) and sending it to the (S)FTP server of the ad network/DMP. We support both modes and let the receiving end choose which is better for them.
The high level architecture of our server-side distribution system is shown in the diagram below. The webapp servers do cookie syncing following the traditional client-side approach. These syncs are logged and then loaded into HDFS and processed by a workflow composed of several map-reduce jobs in our Hadoop cluster. Finally, the processed data is sent to the ad network/DMP using one of the delivery modes.
A more detailed description of each component is provided below.
The refresher is a collection of map-reduce jobs in charge of generating data sync requests on a periodic basis. A data sync request is a self-contained object that contains everything that is necessary to send data for a specific cookie to a specific ad network/DMP. The refresher processes the logs generated by the webapp servers, extracting all the information relevant to cookie syncs, and joins that with the segment data. It also maintains a data store specialized in cookie syncs, which can be thought of as a cache of cookie syncs that makes it unnecessary to reprocess logs often.
The batch formatter is also a collection of map-reduce jobs running as part of the same workflow as the refresher. It formats all data sync requests for a given destination according to some agreed template. This usually involves grouping all data by cookie id or segment and adding extra fields (e.g. expiration timestamp) if necessary. The batch nature of Hadoop makes it perfect for this task.
The file deliverer is a lightweight Java process that downloads the file produced by the batch formatter from HDFS to local disk, adds header information if necessary, and sends it using (S)FTP. A daemon is in place that monitors the status of the formatter’s output and invokes the deliverer as soon as there is a new file to be sent.
The queue is a way of reconciling the batch nature of the refresher with the streaming data syncer. There is a single, very fast producer, and several, relatively slow consumers. Our first version was handled entirely by an ActiveMQ broker. This scaled well when we had a single queue, with no noticeable performance degradation when going from just millions to hundreds of millions of items on the queue. However, once we started integrating with more ad networks/DMPs, a single queue was not enough anymore. Adding priorities was also necessary to ensure new syncs were pushed out with low latency regardless of how many total items were on the queue.
ActiveMQ proved not to be up to the challenge and became the bottleneck of the system, despite considerable efforts to tune it to our use case. To solve this problem, we implemented a hybrid HDFS-ActiveMQ queue in which the heavy lifting is done by Hadoop, and ActiveMQ only handles around 10 million items at a time, all in memory.
Streaming Data Syncer
The streaming data syncer is a multi-threaded Java process that consumes data sync requests from the queue and makes the corresponding HTTP requests to an external API.
Growth and Scaling
Our server-side distribution system is less than a year old and has already experienced very rapid growth. In the beginning, we had a server-to-server integration with a single partner handling a volume of 20 million unique cookie syncs per month. As of this writing, we are integrated with tens of partners and handle a volume of around 4.5 billion unique cookie syncs per month. To accommodate this growth, the following changes were made:
- Instead of using a single queue, we split it into several queues, one for each partner-priority combination. As mentioned above, a single ActiveMQ broker could not handle the load and we supplemented it with an HDFS-based queue.
- The number of streaming data syncer machines went from two to five, and some optimization was done to the number of threads used in each instance.
- For batch mode delivery, we negotiated with our partners to change the output format in cases where the original was inefficient (e.g. requiring the same cookie id to be duplicated a lot). We also tried to avoid sending uncompressed files as much as possible.
- General improvements and tweaks were made across the board, specially in the map-reduce jobs.
This is Just the Beginning
Server-side distribution has become a core component of our infrastructure in a matter of months. In addition to scaling to accommodate ever growing volume, implementing monitoring, reporting, and testing tools has been very important to keep the system operational. We have a long and exciting road ahead to make the system as robust, automated, and scalable as it needs to be to satisfy future growth.