LiveRamp has relied heavily on MapReduce for our big-data computation since 2009. However, the Hadoop ecosystem has grown and matured dramatically over the past five years, and one of the big changes has been the shift from the MapReduce-centric MRv1 to MRv2, better known as Hadoop YARN.
YARN separates the resource allocation layer of a Hadoop cluster from the computation framework; this means that MapReduce is just one of many services that can use resources on a YARN cluster. So while our whole cluster is currently dedicated to MapReduce processes, YARN will let us allocate cluster resources to non-MapReduce services like Spark.
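Concretely, the switch shows up in just a couple of configuration properties. The sketch below uses the standard Hadoop 2 property names, not our exact production config:

```xml
<!-- mapred-site.xml: run MapReduce as just one framework on top of YARN.
     Sketch only; values other than "yarn" ("local", "classic") select other runtimes. -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml: NodeManagers host the shuffle service that MapReduce needs,
     while other services (e.g. Spark) request containers from the same ResourceManager. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```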
There are many technical resources on the advantages of switching to Hadoop YARN (MRv2), so we won’t focus on those here. But at a high level, YARN is the future of Hadoop, and we want to get onboard.
Our cluster currently runs Cloudera Hadoop 4 (CDH4), where YARN is not considered stable. By upgrading to the latest Cloudera distribution, CDH5, we’ll be able to migrate from MRv1 applications to YARN.
Some basic stats about our production Hadoop cluster:
- 6.89 PB capacity
- 6220 map task slots
- 3732 reduce task slots
- 443 core servers
- 84 gateway machines
The raw capacity of our cluster tells only half the story, though. Over the past few years, both the number of customer datasets LiveRamp processes and the number of engineers who interact with our cluster have grown dramatically. In fact, our MapReduce cluster is rarely running fewer than 30 concurrent jobs, with close to 2,000 total MapReduce jobs each day. Meanwhile, our customer SLAs have gotten tighter as we promise faster turnaround on new files. So while a few years ago a day of downtime during an HDFS upgrade wasn't the end of the world, we now have to aggressively test any changes we push in order to minimize cluster downtime.
So, how do we test out CDH changes before upgrading in production?
While local unit and integration tests can catch many compile-time and logic errors, when upgrading a framework like Hadoop there's no substitute for running real tests on a real cluster. The complexity of HDFS plus MapReduce means that local environments rarely reproduce production faithfully, for a few reasons:
- Local tests use the local MapReduce implementation (LocalJobRunner) instead of a standalone JobTracker, which can lead to subtle behavioral differences.
- Running the MapReduce client and the local task runner in the same JVM makes it very difficult to test memory requirements or catch serialization problems.
- Native libraries, like those used for LZO compression, will not be the same locally as in production.
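The first two differences come down to client configuration: pointing the same job at a real cluster instead of the in-process runner exercises the production code paths. A sketch of the client-side overrides, with hypothetical hostnames:

```xml
<!-- Client-side overrides that send jobs to the test cluster instead of the
     LocalJobRunner. Hostnames and ports here are hypothetical placeholders. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://test-namenode.example.com:8020</value>
</property>
<property>
  <!-- MRv1: submit to a real JobTracker rather than the default "local" -->
  <name>mapred.job.tracker</name>
  <value>test-jobtracker.example.com:8021</value>
</property>
```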
We use Cloudera Manager to manage our production Hadoop cluster, so we were able to easily pull a few machines out of our production cluster and set up a separate test cluster running HDFS, MapReduce, and Zookeeper services alongside our production one.
To reduce the amount of work each upgrade requires, we run a standard compatibility test suite before upgrading. The script:
- Copies files from our production cluster to make sure data will deserialize correctly
- Runs jobs using our custom Cascading Taps and SubAssemblies
- Tests memory-intensive Functions
- Reads the output datasets and asserts correctness
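In outline, the suite looks something like the following. Paths, jar names, and job classes below are hypothetical stand-ins; the real suite drives our Cascading Taps and SubAssemblies through our own tooling rather than bare `hadoop jar` calls:

```shell
#!/bin/bash
set -euo pipefail

# 1. Copy a sample of production files onto the test cluster
#    (inter-cluster distcp; cluster addresses and paths are placeholders)
hadoop distcp hdfs://prod-nn:8020/data/sample hdfs://test-nn:8020/upgrade-test/input

# 2. Run a representative job built on our custom Cascading Taps and
#    SubAssemblies, including memory-intensive Functions
hadoop jar upgrade-tests.jar com.liveramp.upgrade.CompatibilityFlow \
    /upgrade-test/input /upgrade-test/output

# 3. Read the output datasets back and assert correctness
#    (verify_output.sh is a hypothetical checker for the expected records)
hadoop fs -cat /upgrade-test/output/part-* | ./verify_output.sh \
    && echo "compatibility suite passed"
```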
If these all work on the test cluster, we can be fairly confident we won’t run into problems post-upgrade.
Within a couple of months we hope to be up to date on CDH5 and shifting our MRv1 jobs to YARN. We'll keep you all updated on that process, and on any interesting problems we run into along the way.