Hackweek XXXVII was our largest yet with twenty different projects developed by LiveRampers! Hackweek is the magical week that happens five times a year at LiveRamp when Engineers get a full week to work on whatever they choose. The week is filled with excitement and traditions such as the Hackday crepes. It kicks off with a pitch session where those with ideas look to recruit others for their project and ends with a company-wide presentation where we share what we developed. Here’s a summary of some of the projects from this Hackweek.
ConNotify – Connect Notification Center
There are many important events that our customers need to be notified of: when a dataset of theirs has been matched and delivered to a marketing platform, new product features we’ve developed, planned downtime for maintenance, etc. In the past we’ve relied on email for these notifications, but Nora, Parker, and Alfonso realized we could do better using in-app messaging. To this end, they developed a new Ruby service for receiving and managing notifications from numerous sources. Further, they added UI components to our Connect web app to show these messages, including real-time updates as new messages are received.
Sweeper
Just like a good spring cleaning, it can be quite satisfying to archive and delete old data that is no longer needed. This not only frees up disk space on our Hadoop cluster, but also gives us a clearer view of which data sets are currently being used. Unfortunately, it’s prohibitively expensive to manually clean up old data stores, and the development of automated data deletion workflows is complicated and risky (a bug that deleted active production data could grind workflows to a halt while we recovered from backup). For these reasons, we haven’t kept our cluster as tidy as we’d like.
Bruno, Damien, Tenzing, and Evan decided to clean things up this Hackweek by developing Sweeper: a general-purpose framework for archiving and deleting old data. Building off the work of a past Hackweek project, they productionized the Sweeper infrastructure and added features such as extensible rules for selecting which files to sweep. The framework is general enough to support sweeping a range of different resources: Hadoop distributed files, local files, S3 buckets, and soon database records. With Sweeper they developed the first of many new sweeping workflows in just 15 minutes. After confirming that it was deleting the files they wanted (using Sweeper’s dry-run feature), they enabled the new sweeper and it cleared out over 500k files, freeing 14TB of disk space!
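To make the idea of extensible sweep rules and a dry-run mode concrete, here is a minimal sketch in Python. All names (`Rule`, `older_than`, `Sweeper`) are hypothetical illustrations, not Sweeper's actual API, and this operates on local files only, whereas the real framework also targets HDFS, S3, and databases:

```python
import os
import time
from dataclasses import dataclass
from typing import Callable, List

# A sweep rule is just a predicate deciding whether a path should be swept.
# New rules can be added without touching the core sweeping logic.
Rule = Callable[[str], bool]

def older_than(days: float) -> Rule:
    """Rule: sweep files whose modification time is older than `days` days."""
    cutoff = time.time() - days * 86400
    return lambda path: os.path.getmtime(path) < cutoff

@dataclass
class Sweeper:
    rules: List[Rule]
    dry_run: bool = True  # dry-run mode: report what would be deleted, delete nothing

    def sweep(self, paths: List[str]) -> List[str]:
        """Return paths matching every rule; delete them unless dry_run is set."""
        matched = [p for p in paths if all(rule(p) for rule in self.rules)]
        if not self.dry_run:
            for p in matched:
                os.remove(p)
        return matched
```

Running with `dry_run=True` first, inspecting the returned list, and only then flipping to `dry_run=False` mirrors the confirm-then-enable flow described above.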
BI Tools for Customer Link
Hackweek isn’t just for engineers, and many non-engineers at LiveRamp find time to work on their novel ideas during Hackweek. Adam from sales suspected many of our customers were unnecessarily missing out on the value of LiveRamp’s Customer Link (CL) product because they didn’t think they could analyze CL data without building a dedicated Data Science team or contracting with an attribution platform. He set out to show them how they could analyze CL data on their own and gain valuable insight into consumer preferences using off-the-shelf business intelligence (BI) tools such as Tableau.
Adam partnered up with Matt, Katie, Julian, and John to analyze a 20M-record CL dataset using these BI tools. This cross-functional team of sales, marketing, and product LiveRampers proved that you don’t need Data Scientists or an attribution platform to quantify the effectiveness of different online marketing campaigns on in-store purchases using CL data. Further, they learned some best practices for working with these BI tools to share with our customers so that they too can analyze their own CL data.
One Hour Tom
We’re always looking to accelerate our data matching workflows, which translate between different anonymous identifiers for a consumer using our match data (more matching details in our graphical model for match data blog post). To this end, we’ve previously explored using random-access key-value stores (e.g. Hank and Cassandra) to manage our match data in place of Hadoop data stores. Such an approach is great for matching small data sets (10 thousand to 10 million records) in that we only need to query the specific matches needed for the records in that data set, as opposed to reading in the full match data store, which generally contains hundreds of millions of match entries.
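The cost argument above can be sketched in a few lines. This is a hypothetical illustration (the function name and the use of a plain dictionary standing in for a store like Hank or Cassandra are assumptions, not our actual implementation): lookup work scales with the size of the small input, not with the size of the full match store.

```python
from typing import Dict, List, Optional, Tuple

def match_records(
    records: List[str],           # anonymous identifiers in the small input data set
    match_store: Dict[str, str],  # key-value match store: input id -> target id
) -> List[Tuple[str, Optional[str]]]:
    """Look up only the identifiers present in the input.

    A full-scan approach would read every entry in match_store; here we do
    len(records) point lookups, so a 10k-record file costs 10k queries even
    if the store holds hundreds of millions of match entries.
    """
    return [(r, match_store.get(r)) for r in records]
```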
We’ve previously only used key-value store matching for a limited number of matching products, and this Hackweek Stefan, Takashi, Abhishek, and I wanted to see if we could apply these methods to power our core data onboarding product. In this project we built a proof-of-concept data onboarding workflow powered by key-value stores. We also put together a new self-serve UI that would allow our customers to upload small data sets and have them quickly matched and delivered to a marketing platform. While this project is far from ready to deploy to production, it was exciting to see how we could offer near-real-time matching of small data sets and open up new time-critical use cases for our customers. We’ll definitely be revisiting and productionizing this project in a future engineering dev cycle.
Map-only MSJ Repartitioning
Finding the connections between large data sets is core to LiveRamp’s data processing workflows, and to do this efficiently and scalably we’ve developed our MapSideJoin (MSJ) framework. You can learn more about MSJ in our past blog post on adding seeking indices to MSJ stores. A core concept of MSJ is consistently partitioning large data sets into small data stores by sharding on the join key. To optimize the performance of MapReduce jobs, we select an appropriate number of partitions based on the anticipated size of the MSJ store, with more partitions for larger stores.
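The two ideas in that paragraph, consistent sharding by join key and size-based partition counts, can be sketched as follows. This is an illustrative assumption, not MSJ's actual code: the hash function, the 256MB partition-size target, and both function names are invented for the example.

```python
import hashlib

def partition_for(join_key: str, num_partitions: int) -> int:
    """Deterministically shard a record by its join key.

    Two stores built with the same partition count place matching keys in
    the same partition index, which is what makes a map-side join possible.
    """
    digest = hashlib.md5(join_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def choose_num_partitions(expected_bytes: int,
                          target_partition_bytes: int = 256 * 2**20) -> int:
    """Pick more partitions for larger stores (illustrative sizing heuristic)."""
    return max(1, -(-expected_bytes // target_partition_bytes))  # ceiling division
```

Because `partition_for` depends only on the key and the partition count, any job that agrees on both can locate a key's partition without coordination.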
It used to be complicated and resource intensive to change the number of partitions (repartition an MSJ store), but thanks to Ben, Porter, and Soerian’s Hackweek project this is no longer the case. We’ll save the technical details of how they implemented map-only repartitioning in our MSJ framework for a future blog post, but suffice it to say that the new approach is not only simpler and more efficient, it can also happen within our current data processing workflows, so we needn’t pause these workflows to run a specialized repartitioning workflow.
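Since the implementation details are saved for a future post, here is only a sketch of one standard reason repartitioning can be map-only, offered as an assumption rather than a description of MSJ's actual approach: when the new partition count is a multiple of the old one, every record in an old partition lands in a small fixed set of new partitions, so each old partition can be split locally in a single map pass with no shuffle.

```python
from typing import Dict, List, Tuple

def new_partition(h: int, old_n: int, new_n: int) -> int:
    """New partition index for a record whose key hashes to h."""
    assert new_n % old_n == 0, "map-only split assumed: new_n is a multiple of old_n"
    return h % new_n

def split_partition(old_partition: List[Tuple[int, str]], old_index: int,
                    old_n: int, new_n: int) -> Dict[int, List[Tuple[int, str]]]:
    """Map-only step: route each (hash, record) pair of one old partition.

    Since h % new_n is congruent to h % old_n when old_n divides new_n,
    records never cross old partition boundaries and no shuffle is needed.
    """
    out: Dict[int, List[Tuple[int, str]]] = {}
    for h, record in old_partition:
        p = new_partition(h, old_n, new_n)
        assert p % old_n == old_index  # stays within this old partition's slice
        out.setdefault(p, []).append((h, record))
    return out
```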
And this is just a sampling of what LiveRampers accomplished in Hackweek XXXVII! Check out our previous Hackweek recaps to discover what other cool things have been developed in Hackweeks.