Introducing MockRDD for testing PySpark code

Summary: The LiveRamp Identity Data Science team is excited to share some of our PySpark testing infrastructure in the new open-source library mockrdd. This contains the class MockRDD, which mirrors the behavior of PySpark RDD with several additions: extensive sanity checks to identify invalid inputs; more meaningful error messages for debugging issues; straightforward to […]

LiveRamp Is Migrating to Google Cloud Platform

Have you ever thought about what it takes to move 97 petabytes to the cloud? Now think about those 97 petabytes of stored data flanked by 300 terabytes of memory and 90,000 CPU cores, and you have the exact size of our Hadoop environment. Over the next year, our engineering team will be migrating this […]

[Opinion] Guidelines for Effective Meetings

Meetings can be an effective way to discuss projects and reach consensus on decisions. But I believe that when run improperly, they can also be a huge waste of our valuable time. Hence, we should have some guidelines on how to create and run effective meetings. Meetings are about discussions: The whole reason we get a […]

Using Machine Learning to Auto Detect Column Types in Customer Files

Introduction: LiveRamp receives thousands of large files each day from our customers, and we need column type configuration to know how to interpret these files. We expect many files to conform to an existing configuration. For others, we need to auto-detect the type of data within the file. Here’s a contrived example of […]

Friday Thoughts: strive for phoenixes, not snowflakes

tl;dr: Avoid surprises by restarting your workloads often to ensure your app starts as expected. Server outages can produce some interesting surprises. As we continue to strive for dependable interfaces and services between our squads, I am reminded of the snowflake and phoenix analogy, a re-branding of pets vs. cattle. Snowflake: unique, fragile, unpredictable. Phoenix: rise from nothing at […]