Introducing MockRDD for testing PySpark code


The LiveRamp Identity Data Science team is excited to share some of our PySpark testing infrastructure in the new open source library mockrdd. It contains the class MockRDD, which mirrors the behavior of a PySpark RDD with several additions: extensive sanity checks to identify invalid inputs; more meaningful error messages for debugging issues; ...

[Opinion] Guidelines for Effective Meetings

Meetings can be an effective way to discuss projects and reach consensus on decisions. But I believe that when run improperly, they can also be a huge waste of our valuable time. Hence, we should have some guidelines on how to create and run effective meetings.

Meetings are about discussions

The whole reason ...

Utilizing MapReduce Combiners and HyperLogLog++ to process millions of queries over datasets with billions of records

The fundamental challenge being solved is aggregating and counting the unique values in a large multiset. [Figure: matrix visualization of one dataset] The goal is to be able to retrieve counts of unique IDs under different constraints. For example (using SQL-like syntax to illustrate), computing COUNT(DISTINCT ID) WHERE (Field 1 = 0) ...
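
To make the query shape in this excerpt concrete, here is a minimal Python sketch of an exact distinct count under field constraints. It is only a set-based baseline for illustration, not the MapReduce combiner or HyperLogLog++ approach the post describes, and the record layout, the field names (`field_1`, `field_2`) and the helper `count_distinct_ids` are all hypothetical.

```python
# Hypothetical toy records: each row carries an ID plus a few categorical fields.
records = [
    {"id": "a", "field_1": 0, "field_2": 1},
    {"id": "b", "field_1": 0, "field_2": 0},
    {"id": "a", "field_1": 0, "field_2": 0},  # the same ID appears more than once
    {"id": "c", "field_1": 1, "field_2": 1},
]


def count_distinct_ids(rows, **constraints):
    """Exact COUNT(DISTINCT id) over the rows that match every field=value constraint."""
    matching_ids = {
        row["id"]
        for row in rows
        if all(row.get(field) == value for field, value in constraints.items())
    }
    return len(matching_ids)


# Roughly: COUNT(DISTINCT ID) WHERE (Field 1 = 0)
print(count_distinct_ids(records, field_1=0))
```

An exact set like this needs memory proportional to the number of distinct IDs, which is the cost that approximate sketches such as HyperLogLog++ are meant to avoid at the scale of billions of records.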

Friday Thoughts: strive for phoenixes, not snowflakes

tl;dr: Avoid surprises: restart your workloads often to ensure your app starts as expected. Server outages can produce some interesting surprises. As we continue to strive for dependable interfaces and services between our squads, I am reminded of the snowflake and phoenix analogy, a re-branding of pets vs. cattle. Snowflake: unique, fragile, unpredictable. Phoenix: rise from nothing at any time, ...