A nuclear option for freeing cluster resources

Hadoop has a number of tools for allocating cluster resources intelligently, such as the fair scheduler and the YARN project, but we’ve found that for time-critical issues these tools don’t always give fine enough control. Occasionally we need to run a job VERY quickly (usually when debugging or fixing a major bug) and can’t wait for the fair scheduler to pre-empt enough slots. When the cluster is very busy with long-running map or reduce tasks, it’s not always easy to convince the JobTracker to give your job all the resources it needs, since even the fair scheduler is reluctant to pre-empt tasks that have been running for hours.

Of course, it’s always possible to kill competing jobs entirely with “hadoop job -kill”, but this causes problems for other cluster users: their job appears to have failed, and they are forced to restart it later. A better solution is to free up the required map and/or reduce slots without failing any jobs outright, using “hadoop job -kill-task”. This kills the task attempt without failing the parent job, so Hadoop will reschedule the task when a slot is free again.
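As a rough illustration of the per-attempt approach, here is a minimal Python sketch that builds one “hadoop job -kill-task” call per task attempt. The function names and the dry_run flag are hypothetical and are not taken from KillTaskAttempts; only the “hadoop job -kill-task” command itself comes from the text above.

```python
import subprocess


def kill_task_command(attempt_id):
    """Build the CLI call that kills one task attempt without failing its job."""
    return ["hadoop", "job", "-kill-task", attempt_id]


def kill_attempts(attempt_ids, dry_run=True):
    """Kill each attempt in turn.

    With dry_run=True (the default in this sketch) nothing is executed and
    the commands are only returned, so they can be reviewed first.
    """
    commands = [kill_task_command(a) for a in attempt_ids]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if the hadoop CLI fails.
            subprocess.run(cmd, check=True)
    return commands
```

Defaulting to a dry run is a design choice for this sketch only: killing attempts is disruptive, so it seems safer to make execution an explicit opt-in.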

Manually killing tasks is far too tedious when your job needs 800 map slots, so we wrote a quick script to intelligently but forcefully kill task attempts in running jobs. The script is fittingly called KillTaskAttempts, and can be found here. The arguments are simple:

[job pattern to target] [map|reduce] [tasks to kill]

The script targets jobs either by name or by job_id. When asked to free X map (or reduce) slots, the script looks across all matching jobs, finds the X task attempts of that type with the least progress, and kills them.
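The selection step described above might be sketched as follows in Python. This is not the KillTaskAttempts source: the Attempt record, the pick_attempts_to_kill function, and the choice to treat the job pattern as a regular expression are all assumptions made for illustration, and how the running attempts and their progress are fetched from the cluster is left out entirely.

```python
import heapq
import re
from typing import NamedTuple


class Attempt(NamedTuple):
    """Hypothetical record for one running task attempt."""
    attempt_id: str
    job_id: str
    job_name: str
    task_type: str   # "map" or "reduce"
    progress: float  # 0.0 (just started) to 1.0 (finished)


def pick_attempts_to_kill(attempts, job_pattern, task_type, count):
    """Return up to `count` attempts with the least progress.

    Only attempts of the requested type are considered, and only if their
    job name or job ID matches `job_pattern` (assumed here to be a regex).
    """
    pattern = re.compile(job_pattern)
    candidates = [
        a for a in attempts
        if a.task_type == task_type
        and (pattern.search(a.job_name) or pattern.search(a.job_id))
    ]
    # nsmallest returns the results sorted by ascending progress.
    return heapq.nsmallest(count, candidates, key=lambda a: a.progress)
```

Picking the attempts with the least progress presumably keeps the amount of discarded work as small as possible, since those tasks have the least to redo when Hadoop reschedules them.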

This is not a script which should be used indiscriminately, but it definitely helps us re-allocate resources when critical issues arise.