
Reduce Task Suspension for Priority Scheduling in Hadoop


Presentation Transcript


  1. Reduce Task Suspension for Priority Scheduling in Hadoop Brian Cho and Philbert Lin

  2. Goal Suspend Hadoop reduce tasks

  3. Motivation
  [Figure: cost ($) of a production cluster vs. cluster load, plotting the cost of under-utilization (e.g. server costs) against the cost of missed deadlines; regions annotated as (safe region), (missed deadlines), and (ideal).]

  4. Example and Design Goals
  [Figure: production-cluster reduce trace [1], showing # reduce slots over time; the research job runs on unused slots.]
  Current approaches kill tasks, so we lose all our work. Instead:
  • Preempt tasks by suspending them (not killing them)
  • Production jobs get resources quickly
  • Research jobs don't lose work
  • Focus on reduce rather than map: reduce tasks take longer, so there is more work to lose (median map 19 s vs. reduce 231 s) [2]
  [1] Yahoo: private correspondence
  [2] Facebook: Zaharia et al., "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," EuroSys 2010

  5. MapReduce 2.0 (YARN) Overview
  RM = Resource Manager (contains the Scheduler and Application Manager), NM = Node Manager, AM = Application Master
  [Figure: one RM coordinating several NMs; each NM hosts task containers (Container1,1, Container1,2, Container2,1, ...) and per-application masters (AM1, AM2).]
  Based on: http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/

  6. Reduce Stages
  [Figure: a reduce task's pipeline: copy sorted map outputs, merge them, then run reduce and write the result to outdir/part-00000.]
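The copy and merge stages above are handled by the Hadoop framework; user code runs only in the reduce stage, which receives each key together with its merged, sorted values and writes to the task's part-* file in the output directory. For reference, a minimal reducer against the standard Hadoop 2.x API (ordinary MapReduce code, not part of the suspension mechanism itself):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Copy and merge are done by the framework; only this reduce stage is user code.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();                          // values arrive already copied and merged
    }
    context.write(key, new IntWritable(sum));  // ends up in the task's part-* file under outdir
  }
}
```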

  7. Suspend Lifecycle
  [Figure: the YARN cluster from slide 5. When production job 3 arrives, the scheduler suspends the research job's (job 2's) tasks, Container2,2 and Container2,3, to make room for AM3 and Container3,1; once resources free up again, job 2's tasks are resumed as Container2,2R and Container2,3R.]

  8. Suspending and Resuming Tasks
  [Figure: Suspend Task, Attempt 0: during copy / merge / reduce, the task logs the merged file location and its reduce progress, writes its partial output to outdir/part-00000-00, and exits. Resume Task, Attempt 1: the new attempt parses the suspended attempt's log, skips copy and merge, and continues the reduce, writing to outdir/part-00000-01.]

  9. Suspending and Resuming Tasks (same diagram as slide 8)
  • Suspend/resume takes advantage of existing intermediate data
  • File locations and progress indicators are written to local logs
  • Fast!
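As a rough illustration of the bookkeeping on slides 8 and 9, the sketch below writes the merged file location and reduce progress to a local log on suspend, and parses that log on resume so the next attempt can skip copy and merge. The class and field names (SuspendLog, mergedFilePath, reduceProgress) and the log format are assumptions for illustration, not the authors' actual implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical suspend/resume bookkeeping; names and format are illustrative only. */
public class SuspendLog {
  public String mergedFilePath;   // location of the already-merged intermediate data
  public long reduceProgress;     // e.g. records already consumed by reduce()

  /** On suspend: persist the merge location and reduce progress to a local log file. */
  public void writeOnSuspend(Path logFile) throws IOException {
    Files.write(logFile, Arrays.asList(
        "merged=" + mergedFilePath,
        "progress=" + reduceProgress));
  }

  /** On resume: the next attempt parses the log, skips copy/merge, and continues the reduce. */
  public static SuspendLog parse(Path logFile) throws IOException {
    SuspendLog log = new SuspendLog();
    for (String line : Files.readAllLines(logFile)) {
      if (line.startsWith("merged=")) {
        log.mergedFilePath = line.substring("merged=".length());
      } else if (line.startsWith("progress=")) {
        log.reduceProgress = Long.parseLong(line.substring("progress=".length()));
      }
    }
    return log;
  }
}
```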

  10. Current Status and Limitations
  • Implemented suspension for the reduce stage
    • The most impactful stage; other stages will use a similar approach
  • If state is saved across reduce calls, it will be lost
  • Suspend/resume is currently driven through the job client
    • These functions can be used as-is within a scheduler

  11. Future Work: Preemptive Priority Scheduler Design
  • Suspend workflow
    • RM asks AM for containers within a deadline T
    • AM decides: should I kill or suspend this reduce task? Suspend if T_suspend < T_left to deadline
  • Resume workflow
    • RM tells AM that containers are available, either local or remote to the suspended task
    • AM decides: should I restart or resume this suspended task? Resume if
      • (local container) T_waiting for local container + T_resume < T_catch up on a new reduce
      • (remote container) T_migrating intermediate data + T_resume < T_catch up on a new reduce
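The two decision rules above can be read as simple threshold checks. The sketch below encodes them directly; the method and parameter names, and the assumption that these times are pre-estimated by the AM, are illustrative, since the slide gives only the inequalities:

```java
// Illustrative sketch of the suspend-or-kill and resume-or-restart decisions from slide 11.
public class PreemptionPolicy {
  /** Suspend workflow: suspend only if suspension still fits within the production deadline. */
  static boolean shouldSuspend(long tSuspend, long tLeftToDeadline) {
    return tSuspend < tLeftToDeadline;          // otherwise kill the reduce task
  }

  /** Resume workflow: resume only if it beats starting the reduce over from scratch. */
  static boolean shouldResume(boolean containerIsLocal,
                              long tWaitForLocalContainer,
                              long tMigrateIntermediateData,
                              long tResume,
                              long tCatchUpOnNewReduce) {
    long overhead = containerIsLocal ? tWaitForLocalContainer : tMigrateIntermediateData;
    return overhead + tResume < tCatchUpOnNewReduce;  // otherwise restart the task
  }
}
```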

  12. Big Picture
  • Design of a preemptive priority scheduler
    • Time-sensitive high-priority jobs can scale up instantly
    • Long-running low-priority jobs can scale down gracefully, without losing work
  • Design of estimated output over partial data
    • Jobs can output estimates when time is short
    • Naïve implementation: suspend() and use the partial data output up to that point
    • More rigorous estimates: a custom, programmer-provided estimate() function
  • … all combined into the same framework?
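One plausible shape for the programmer-provided estimate() hook is a small interface that takes the partial output produced before suspension plus the fraction of input actually processed, and returns an estimated full result. The interface name and signature below are assumptions for illustration; the slides only state that such a function would refine the naive use-partial-output approach:

```java
/** Hypothetical hook for turning partial reduce output into an estimated full result. */
public interface PartialResultEstimator<K, V> {
  /**
   * Given the partial value produced before suspension and the fraction of input
   * processed so far, return an estimate of the value over the full input.
   */
  V estimate(K key, V partialValue, double fractionProcessed);
}

/** Example: scale a partial sum up to an estimated total. */
class ScaledSumEstimator implements PartialResultEstimator<String, Long> {
  @Override
  public Long estimate(String key, Long partialSum, double fractionProcessed) {
    // Avoid dividing by zero if no input has been processed yet.
    return fractionProcessed > 0 ? Math.round(partialSum / fractionProcessed) : partialSum;
  }
}
```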

  13. Conclusions
  • Implemented a low-overhead mechanism to suspend/resume reduce tasks
  • Supports the claim that suspending tasks can be an essential tool for a responsive priority scheduler
  • Future work:
    • Implement suspend/resume for all stages of the reduce task
    • Add a user-defined function to save cross-reduce state
    • Design a scheduler that incorporates suspension
    • Experimental evaluation
