Automatic logging of MapReduce task failures

Engineering

Learn from our challenges and triumphs as our talented engineering team offers insights for discussion and sharing.

Automatic logging of MapReduce task failures

Engineering

When using Cascading to run MapReduce jobs in production, the most common exception we find in our job logs look like this:

Caused by: cascading.flow.FlowException: step failed: (1/1), with job id: job_201307251526_37599, please see cluster logs for failure messages
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:210)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:145)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:120)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

This exception tells us that the job failed because of map or reduce task failures, but doesn’t give us any information about the actual cause of the failures. We can get that information from task logs on the JobTracker, but that can be a hassle, especially if the job has already been flushed to the JobTracker’s history.

To save ourselves the hassle of looking up the failures, we’ve modified our LoggingFlow helper class to fetch remote task failure information and log it alongside the Cascading job failure message. Our logs are now a bit more informative:

13/08/18 22:54:33 INFO flow.Flow: step attempt failures:
Cluster Log Exception:
Caused by: java.lang.RuntimeException: Intentionally failing!
<snip>
——————————————————————————–
Caused by: cascading.flow.FlowException: step failed: (1/1), with job id: job_201307251526_37599, please see cluster logs for failure messages
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:210)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:145)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:120)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

It’s not a huge change, but has certainly saved us time and headaches hunting down the cause of failures. This logging is now enabled automatically when using our CascadingUtil helper class in our cascading_ext library.