Engineering Blog

Learn from our challenges and triumphs as our talented engineering team offers insights for discussion and sharing.

Computing Distributions of Large Datasets with Cascading and the q-digest Algorithm

Distributions are a powerful tool for understanding datasets. As an example, imagine that you’re interested in quantifying user engagement for a new app you’re developing. To this end, you compute the distribution of monthly engagement time for your users and discover the following trends: You learn that most of your “users” rarely spend any time using ...

MultiCombiner

We’re happy to announce that we’ve added a new tool built on top of Cascading, called MultiCombiner, to our open source project cascading_ext. MultiCombiner lets you run arbitrarily many aggregation operations, each with its own distinct grouping fields, over a single stream of tuples using only a single reduce step. MultiCombiner uses combiners ...
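To make the idea concrete, here is a minimal sketch in plain Python of running several aggregations, each with its own grouping key, in one pass with map-side combining followed by a single merge. It is not the cascading_ext API: the function `multi_combine`, the aggregator tuples, and the sample fields (`user`, `country`, `secs`) are all hypothetical names chosen for illustration.

```python
from operator import add


def multi_combine(partitions, aggregators):
    """Run several aggregations over the same tuples in one pass.

    partitions:  list of lists of tuples (dicts here), one list per "map task".
    aggregators: name -> (key_fn, value_fn, merge_fn); all hypothetical.
    """
    # "Map side": every partition keeps its own partial aggregates, keyed by
    # (aggregator name, grouping key), so different groupings can coexist.
    partials = []
    for tuples in partitions:
        local = {}
        for t in tuples:
            for name, (key_fn, value_fn, merge_fn) in aggregators.items():
                k = (name, key_fn(t))
                v = value_fn(t)
                local[k] = merge_fn(local[k], v) if k in local else v
        partials.append(local)

    # "Reduce side": a single merge of the partial aggregates.
    result = {}
    for local in partials:
        for k, v in local.items():
            merge_fn = aggregators[k[0]][2]
            result[k] = merge_fn(result[k], v) if k in result else v
    return result


partitions = [
    [{"user": "a", "country": "US", "secs": 10},
     {"user": "b", "country": "US", "secs": 5}],
    [{"user": "a", "country": "FR", "secs": 7}],
]
aggregators = {
    "secs_by_user": (lambda t: t["user"], lambda t: t["secs"], add),
    "events_by_country": (lambda t: t["country"], lambda t: 1, add),
}
result = multi_combine(partitions, aggregators)
```

Tagging each partial aggregate with the aggregator's name is what allows unrelated groupings to share one shuffle in this sketch; how MultiCombiner itself organizes this is described in the full post.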

Automatic logging of MapReduce task failures

When using Cascading to run MapReduce jobs in production, the most common exception we find in our job logs looks like this:

    Caused by: cascading.flow.FlowException: step failed: (1/1), with job id: job_201307251526_37599, please see cluster logs for failure messages
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:210)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:145)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:120)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

This exception tells us that the job failed ...

Debugging “ClassCastException: cascading.tap.hadoop.io.MultiInputSplit” exceptions when testing Cascading flows

When testing our Hadoop data workflows, we've intermittently run into this error, which ends up failing the MapReduce job being tested:

    java.lang.ClassCastException: cascading.tap.hadoop.io.MultiInputSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:371)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)

A quick search for the error didn't turn up any obvious problems. When we dug into the problem a bit more, we noticed a couple ...