Debugging “ClassCastException: cascading.tap.hadoop.io.MultiInputSplit” exceptions when testing Cascading flows

When testing our Hadoop data workflows, we've intermittently run into this error, which ends up failing the MapReduce job under test:

java.lang.ClassCastException: cascading.tap.hadoop.io.MultiInputSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:371)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)

A quick search for the error didn't turn up any obvious cause. When we dug into the problem a bit more, we noticed a few interesting patterns:

  • The exception was definitely transient, failing some builds and passing in others
  • The error happened almost exclusively when running Cascading jobs
  • Increasing the concurrency of the MapReduce jobs within a workflow made the exception more likely (see the sketch after this list)
  • Even running two concurrent builds on our build server (in separate JVM instances) could produce the error
  • The error never appeared in production, only in tests
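
To make the concurrency point concrete, here is a minimal sketch, assuming Cascading 2.x's Hadoop planner, of the shape of test that tickled the race; the paths and pipe names are hypothetical:

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.hadoop.Hfs;

    // Two trivial copy flows over hypothetical input/output paths.
    Flow flowA = new HadoopFlowConnector().connect(
        new Hfs(new TextLine(), "input/a"),
        new Hfs(new TextLine(), "output/a", SinkMode.REPLACE),
        new Pipe("copyA"));

    Flow flowB = new HadoopFlowConnector().connect(
        new Hfs(new TextLine(), "input/b"),
        new Hfs(new TextLine(), "output/b", SinkMode.REPLACE),
        new Pipe("copyB"));

    // start() is non-blocking, so both LocalJobRunner jobs can be in
    // flight at once -- the condition under which the race shows up.
    flowA.start();
    flowB.start();
    flowA.complete();
    flowB.complete();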

After it became clear that the problem was not a configuration error on our end and was seriously hurting our build stability, we spent some time debugging it in depth. We ended up discovering a bug in how Hadoop’s LocalJobRunner was launching jobs. The Hadoop ticket can be found here. Specifically, what was happening was:

  1. We would run a test that launched two or more MapReduce jobs concurrently.
  2. Job 1 would start, clear out the localRunner/ workspace, and write job_local_0001.xml.
  3. Job 2 would start, clear out localRunner/ again (deleting Job 1's file), and write job_local_0002.xml.
  4. Job 1's map task would start and try to read job_local_0001.xml. Since the file no longer existed, Hadoop fell back to its default configuration. Instead of the input format Cascading had written into the configuration, Hadoop used the default, TextInputFormat, which naturally failed when handed a MultiInputSplit it could not cast to a FileSplit (sketched below).
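
To see why the missing job file turns into this particular exception, consider what a JobConf looks like when the serialized job_local_0001.xml is never applied. This is only a sketch of the fallback behavior using the old mapred API, not Hadoop's actual task-launch code:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    // A JobConf that never sees job_local_0001.xml carries only Hadoop's
    // defaults; mapred.input.format.class is unset, so getInputFormat()
    // falls back to TextInputFormat.
    JobConf conf = new JobConf();
    System.out.println(conf.getInputFormat() instanceof TextInputFormat); // true

    // TextInputFormat.getRecordReader() then casts its split straight to
    // FileSplit (roughly: new LineRecordReader(job, (FileSplit) genericSplit)),
    // which is exactly the ClassCastException in the stack trace above.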

The short-term solution is to avoid running local MapReduce jobs concurrently, or to use a patched hadoop-core jar (our current approach). We're working with Cloudera on getting the Hadoop patch integrated into a future CDH release.
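
If patching hadoop-core isn't an option, the concurrency can be dialed down in the tests themselves. Here is a sketch of one way to do that, assuming Cascading 2.x; the property keys come from CascadeProps and FlowProps, so double-check them against your version:

    import java.util.Properties;
    import cascading.cascade.Cascade;
    import cascading.cascade.CascadeConnector;

    Properties props = new Properties();
    // One flow at a time within the cascade...
    props.setProperty("cascading.cascade.maxconcurrentflows", "1");
    // ...and one MapReduce step at a time within each flow.
    props.setProperty("cascading.flow.maxconcurrentsteps", "1");

    // flowA and flowB are the flows from the earlier sketch.
    Cascade cascade = new CascadeConnector(props).connect(flowA, flowB);
    cascade.complete(); // local jobs now run strictly one after another

Serializing the jobs this way sidesteps the race entirely, at the cost of slower test runs.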