Introducing cascading_ext: Open-Source Performance and Usability Tools for Cascading

At LiveRamp, we make heavy use of the Cascading framework to build our Hadoop-based big data applications. Over the years we’ve developed a number of tools that make it easier to build, debug, and run high-performance Cascading workflows. We’ve blogged about some of these tools before, and we’re now happy to announce that these (and more) are all open source and available via our cascading_ext project on GitHub.

Over the next few weeks we’ll be following up with posts about when and how to use these tools. Here are a few of the highlights (you can find a more complete feature list, and some examples, on the project page):

  • BloomJoin: dramatically improves the performance of asymmetric joins (joins where one dataset is much larger than the other) by using a Bloom filter.
  • BloomFilter: uses a Bloom filter to quickly filter a stream by a set of keys.
  • MultiGroupBy: groups two streams on a common key and gives access to both within the same buffer.
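
To make the idea behind a Bloom-filter-assisted join concrete, here is a minimal plain-Java sketch. It is not the cascading_ext implementation and uses none of its API; the class name `BloomJoinSketch`, the `SimpleBloomFilter` helper, the filter size, and the hash scheme are all assumptions made for illustration. The sketch builds a filter from the keys of the small side, uses it to cheaply discard rows of the large side that cannot match, and then runs an exact join on what remains.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical, in-memory stand-in for a Bloom-filter-assisted join.
final class BloomJoinSketch {

    /** Tiny Bloom filter over String keys: false positives are possible, false negatives are not. */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // Double hashing: derive the i-th bit position from two base hashes of the key.
        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false; // at least one bit unset: the key was never added
                }
            }
            return true;
        }
    }

    /** Joins {key, value} rows from the large side against the small side on key. */
    static List<String> bloomJoin(Map<String, String> small, List<String[]> large) {
        // Step 1: build the filter from the small side's keys (size chosen arbitrarily here).
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        for (String key : small.keySet()) {
            filter.add(key);
        }

        List<String> joined = new ArrayList<>();
        for (String[] row : large) {
            String key = row[0];
            // Step 2: cheap pre-filter; most non-matching rows are dropped here.
            if (!filter.mightContain(key)) {
                continue;
            }
            // Step 3: exact join on the survivors, which also removes false positives.
            String match = small.get(key);
            if (match != null) {
                joined.add(key + "," + row[1] + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> small = new HashMap<>();
        small.put("user1", "premium");

        List<String[]> large = new ArrayList<>();
        large.add(new String[] {"user1", "click"});
        large.add(new String[] {"user2", "view"});

        for (String line : bloomJoin(small, large)) {
            System.out.println(line);
        }
    }
}
```

In this toy version the small side fits in a `HashMap`, so the filter saves little. The technique pays off in a distributed job where the small side's keys may not fit in memory but a compact filter does: rows from the large side that fail the filter can be dropped early, before they are shuffled to the join.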

cascading_ext is very much a living project — we use all of these tools heavily internally, and we’ll be actively updating and expanding this project.

We’d love to hear any ideas, problems, or feature requests you have. Drop us a comment below, create a GitHub issue, or send us a pull request; contributions are very welcome.