Hackweek is always a special time at LiveRamp, but an additional Thursday and Friday of hacking made this Hackweek 40% more special than ever before. A long-form Hackweek meant LiveRampers could pursue projects with larger scale and greater complexity, from infrastructure improvements to analytical tools to investments in company culture.
As Hackweek has matured into its new form, so too has its menu. We mourned the loss of traditional Thursday crepes this quarter but took solace in the traditional paella, traditional large cookies and traditional pizza. We even celebrated the rise of an exciting new tradition, Hackweek Taco Bar.
Better DB Usage, Better Connect
LiveRamp’s customer-facing web app, Connect, had been experiencing multi-second load times on certain pages, coinciding with high load on the relational database it reads from. We can now use our new ELK stack (Elasticsearch, Logstash and Kibana) to home in on expensive queries. ELK allows us to ship all Connect logs to Logstash, quickly query them via Elasticsearch, and visualize load time data on a new live Kibana dashboard. Using this information, we optimized expensive queries by adding more appropriate indices and denormalizing some frequently joined tables. For example, a query which previously took 21 seconds now runs in just 0.1 seconds. In the future we can easily check this Kibana dashboard for slow page loads and iterate on expensive queries in order to keep 99th percentile response times under 1 second.
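To make the 99th percentile target concrete, here is a minimal sketch of how per-page load times could be checked against a 1 second budget. It is not how the Kibana dashboard computes anything; the page paths, the sample values, and the helper names `percentile` and `pages_over_budget` are all hypothetical, and the nearest-rank percentile definition is an assumption.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def pages_over_budget(samples, pct=99, budget_seconds=1.0):
    """Group (page, seconds) samples by page and return the pages whose percentile exceeds the budget."""
    by_page = defaultdict(list)
    for page, seconds in samples:
        by_page[page].append(seconds)
    return {
        page: percentile(times, pct)
        for page, times in by_page.items()
        if percentile(times, pct) > budget_seconds
    }


# Hypothetical load-time samples (page path, seconds); not real Connect data.
samples = [
    ("/audiences", 0.12), ("/audiences", 0.30), ("/audiences", 0.25),
    ("/reports", 0.40), ("/reports", 21.0), ("/reports", 0.35),
]
slow_pages = pages_over_budget(samples)
```

With these made-up samples only `/reports` is flagged, because its single 21 second load dominates the tail even though its typical load is fast.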
Squawker: Service Callbacks
As we move more of our workflows to a service-oriented architecture, we have an increasing need to check for updates on jobs running in a service. Previously we would have many clients polling the service every few seconds to see if a job was done, but with many simultaneous long-running jobs this can become extremely expensive. Using Apache Kafka, Squawker pushes a status update to a central cluster when a job finishes or fails, and allows clients to cheaply poll the cluster for all status updates for all jobs.
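The shape of that pattern can be sketched with an in-memory, append-only log standing in for the Kafka cluster. This is not Squawker's code and does not use the Kafka client API; `StatusLog`, `StatusUpdate`, and the job IDs are hypothetical names chosen for illustration. The point is that a client remembers an offset and picks up every new update for every job in one read, instead of asking the service about each job separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusUpdate:
    job_id: str
    status: str  # "finished" or "failed"


class StatusLog:
    """In-memory stand-in for a central, append-only log of job status updates."""

    def __init__(self):
        self._events = []

    def publish(self, job_id, status):
        # Called by the service once, when a job finishes or fails.
        self._events.append(StatusUpdate(job_id, status))

    def poll(self, offset):
        # Return every update at or after `offset`, plus the offset to resume from.
        new_events = self._events[offset:]
        return new_events, offset + len(new_events)


log = StatusLog()
log.publish("job-1", "finished")
log.publish("job-2", "failed")

first_batch, next_offset = log.poll(0)             # both updates in one read
log.publish("job-3", "finished")
second_batch, next_offset = log.poll(next_offset)  # only the new update
```

Each client keeps its own offset, so adding more clients or more jobs does not add per-job requests against the service itself.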
Serialization is a key determinant of MapReduce job performance. Finding a smaller representation of our input data saves us network communication and disk I/O time in MapReduce jobs, and faster serialization/deserialization saves us CPU time in MapReduce tasks. Having heard rumors that Protocol Buffers was a faster and more compact serialization tool than our current solution, Apache Thrift, this group did a side-by-side comparison. They found that Protobuf was not clearly a better choice. A 2 TB Thrift-serialized store was 1.3% smaller when Protobuf-serialized, but a 15 TB Thrift-serialized store holding a different object type grew by 2% when Protobuf-serialized. Serialization speed was similarly ambiguous. Though we didn’t get the benefit of a better serialization tool, we ruled out a possible optimization – scientifically.
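To put those two percentages on the same scale, the arithmetic can be worked through directly. This sketch assumes both percentages are measured relative to the Thrift-serialized size; adding the two stores together is purely illustrative, since they hold different object types, and the function name is hypothetical.

```python
def protobuf_size_tb(thrift_size_tb, percent_change):
    """Size after re-serialization, with percent_change relative to the Thrift size."""
    return thrift_size_tb * (1 + percent_change / 100)


small_store = protobuf_size_tb(2, -1.3)   # 2 TB store, 1.3% smaller
large_store = protobuf_size_tb(15, 2.0)   # 15 TB store, 2% larger

saved_tb = 2 - small_store                # about 0.026 TB saved
added_tb = large_store - 15               # about 0.3 TB added

combined_thrift = 2 + 15
combined_protobuf = small_store + large_store
net_change_percent = (combined_protobuf - combined_thrift) / combined_thrift * 100

print(f"saved {saved_tb:.3f} TB, added {added_tb:.3f} TB, net {net_change_percent:+.2f}%")
```

Under those assumptions the growth on the larger store outweighs the savings on the smaller one by roughly a factor of ten, which is consistent with the conclusion that Protobuf was not a clear win on size.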
MapReduce Job Info
A MapReduce job will always take as long as its slowest task, but the MapReduce UI doesn’t give much readable information on what the other tasks are doing. MapReduce Job Info is a lightweight internal Chrome extension which automatically fetches task completion times for a given MapReduce job and displays histograms directly on the MapReduce UI. At a glance we can tell if our slow job is truly slow, or just dragged down by a few rogue, poorly balanced tasks.
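The idea behind the histograms can be sketched in a few lines. This is not the extension's code; the task durations are made up, and both the bin width and the "more than twice the median" straggler rule are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import median


def histogram(durations, bin_width):
    """Count task durations per fixed-width bin, keyed by each bin's lower edge."""
    counts = Counter((d // bin_width) * bin_width for d in durations)
    return dict(sorted(counts.items()))


def stragglers(durations, factor=2.0):
    """Return durations more than `factor` times the median (an assumed rule of thumb)."""
    cutoff = factor * median(durations)
    return [d for d in durations if d > cutoff]


# Hypothetical task completion times in seconds for one job.
task_seconds = [58, 61, 63, 60, 59, 62, 240, 64, 57, 310]

bins = histogram(task_seconds, bin_width=60)
slow_tasks = stragglers(task_seconds)

for lower_edge, count in bins.items():
    print(f"{lower_edge:>4}s+ {'#' * count}")
```

Here eight of the ten tasks finish in about a minute and two take several times longer, which is the "dragged down by a few tasks" case rather than a uniformly slow job.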
On the day of Hackweek presentations, the creators of this project asked to have their work displayed in absolute silence. The crowd was awed. I will show the same respect here.