At LiveRamp we make heavy use of the Cascading framework to build our Hadoop-based big data applications. Over the years we’ve developed a set of tools that make it easy to build, debug and run high-performance Cascading workflows. We’ve blogged about several of these tools in the past, and now we’re happy to announce that these (and more) are all open source and available via our cascading_ext project on GitHub.
Over the next few weeks we’ll be following up with posts about when and how to use these tools. Here are a few of the highlights (you can find a more complete feature list, and some examples, on the project page):
- BloomJoin: dramatically improves the performance of asymmetric joins (joins where one dataset is much larger than the other) by using a Bloom filter built from the smaller side to discard non-matching records from the larger side early.
- BloomFilter: uses a Bloom filter to quickly filter a stream by a set of keys.
- MultiGroupBy: groups two streams on a common key and gives a single buffer access to both.
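To make the idea behind BloomJoin and BloomFilter concrete, here is a minimal, in-memory sketch in plain Java. It does not use the cascading_ext or Cascading APIs; the class `BloomJoinSketch`, the nested `SimpleBloomFilter`, the method `bloomJoin`, and the filter sizing are all hypothetical names and values chosen for illustration. It assumes the general technique described above: build a Bloom filter from the keys of the small side, use it to cheaply drop large-side records that cannot match, and run an exact join only on the survivors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BloomJoinSketch {

    /** Minimal Bloom filter: never a false negative, occasionally a false positive. */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // Derive the i-th bit position from two base hashes (double hashing).
        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false; // at least one bit unset: key was never added
                }
            }
            return true; // possibly present (may be a false positive)
        }
    }

    /**
     * Inner-joins {key, value} records from the large side against the small side.
     * The filter is only a cheap pre-check; the map lookup is the exact join,
     * so false positives cost extra work but never produce wrong rows.
     */
    static List<String> bloomJoin(Map<String, String> smallSide, List<String[]> largeSide) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3); // illustrative sizing
        for (String key : smallSide.keySet()) {
            filter.add(key);
        }

        List<String> joined = new ArrayList<>();
        for (String[] record : largeSide) {
            String key = record[0];
            if (!filter.mightContain(key)) {
                continue; // definitely not in the small side: drop before the join
            }
            String smallValue = smallSide.get(key);
            if (smallValue != null) {
                joined.add(key + "," + smallValue + "," + record[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("u1", "alice");
        users.put("u2", "bob");

        List<String[]> events = Arrays.asList(
            new String[] {"u1", "click"},
            new String[] {"u9", "click"},
            new String[] {"u2", "view"});

        for (String row : bloomJoin(users, events)) {
            System.out.println(row);
        }
    }
}
```

In a distributed job the payoff comes from where the filter runs: it is small enough to ship to every task, so non-matching records from the large dataset can be dropped before they are sorted and shuffled for the join. The in-memory map here simply stands in for that exact join step.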
cascading_ext is very much a living project — we use all of these tools heavily internally, and we’ll be actively updating and expanding this project.