Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications Jonathan Lifflander, UIUC Sriram Krishnamoorthy, PNNL* Laxmikant Kale, UIUC HPDC 2012
Dynamic load balancing on 100,000 processor cores and beyond
Iterative Applications • Applications repeatedly executing the same computation • Static or slowly evolving execution characteristics • Execution characteristics preclude static balancing • Application characteristics (comm. pattern, sparsity,…) • Execution environment (topology, asymmetry, …) • Challenge: Load-balancing such applications
Overdecomposition • Expose greater levels of concurrency than the hardware supports • Middleware (runtime) dynamically maps the concurrent tasks to hardware resources • Abstraction supports continuous optimization and adaptation • Improvements to load balancing • New metrics (power, energy, graceful degradation, …) • New features: fault tolerance, power/energy-awareness
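A minimal sketch (not from the slides) of what overdecomposition looks like in practice: the work is cut into many more tasks than cores, and the initial task-to-core mapping is only a starting point that the runtime is free to change. The counts and names here are illustrative.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical illustration: decompose work into many more tasks than cores,
// so a runtime can migrate individual tasks instead of whole per-core partitions.
int main() {
  const int num_cores = 8;
  const int overdecomposition_factor = 16;   // illustrative tuning knob
  const int num_tasks = num_cores * overdecomposition_factor;

  // Initial (static) mapping: task i -> core i % num_cores.
  // The middleware may remap tasks between iterations as it measures load.
  std::vector<int> owner(num_tasks);
  for (int t = 0; t < num_tasks; ++t) owner[t] = t % num_cores;

  std::printf("%d tasks over %d cores (%dx overdecomposition)\n",
              num_tasks, num_cores, overdecomposition_factor);
  return 0;
}
```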
Problem Statement • Scalable load balancers for iterative overdecomposed applications • We consider two alternatives: • Persistence-based load balancing • Work stealing • How do these algorithms behave at scale? • How do they compare?
Related Work • Overdecomposition is a widely used approach • Inspector-executor approaches employ start-time load balancers • Hierarchical load balancers in the past typically do not consider localization • Scalability of work stealing not well understood – largest prior demonstration was on 8192 cores • No comparative evaluation of the two schemes
TASCEL: Task Scheduling Library • Runtime library for task-parallel programs • Manages task collections for execution on distributed memory machines • Compatible with native MPI programs • Phase-based switch between SPMD and non-SPMD modes of execution
TASCEL Execution • Task: basic unit of migratable execution • Typical workflow: • Create a task collection • Seed it with one or more tasks • Process tasks in the collection until termination detection • Processing of task collections • Manages concurrency, faults, … • Trade-offs exposed through implementation specializations • Dynamic load balancing schemes • Fault tolerance protocols • …
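A hypothetical sketch of the workflow above (create a collection, seed it, process to termination). The types and calls are placeholders and do not reproduce TASCEL's actual interface; a single-worker loop stands in for distributed termination detection.

```cpp
#include <deque>
#include <functional>

using Task = std::function<void()>;   // placeholder task type

struct TaskCollection {               // placeholder, not the real TASCEL API
  std::deque<Task> tasks;
  void seed(Task t) { tasks.push_back(std::move(t)); }  // seed with initial tasks
  void process() {                    // process until no tasks remain
    while (!tasks.empty()) {          // stand-in for termination detection
      Task t = std::move(tasks.front());
      tasks.pop_front();
      t();                            // a task may seed further tasks
    }
  }
};

int main() {
  TaskCollection tc;                  // 1. create a task collection
  tc.seed([] { /* root task */ });    // 2. seed it with one or more tasks
  tc.process();                       // 3. process tasks until termination
  return 0;
}
```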
Load Balancers Greedy localized hierarchical persistence-based load balancing Retentive work stealing
Greedy Localized Hierarchical Persistence-based LB [Figure: tasks redistributed across a hierarchy of processors 0–5] Intuition: Satisfy local imbalance first
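A rough sketch of the "satisfy local imbalance first" idea, assuming per-task costs measured in the previous iteration: within one group of processors, a greedy pass moves the largest tasks from overloaded processors to the currently least-loaded one. In the hierarchical scheme this pass would run per subtree (e.g., node, then rack, then machine), so most movement stays local. The code is illustrative, not the paper's implementation.

```cpp
#include <algorithm>
#include <vector>

// Illustrative greedy pass inside one group of processors.
// loads[p] holds the measured costs of the tasks currently mapped to processor p.
void balance_group(std::vector<std::vector<double>>& loads) {
  if (loads.empty()) return;
  auto total = [](const std::vector<double>& v) {
    double s = 0; for (double x : v) s += x; return s;
  };
  double avg = 0;
  for (auto& v : loads) avg += total(v);
  avg /= loads.size();

  // Shift the largest tasks off overloaded processors onto the least-loaded
  // processor in the same group, until everyone is near the group average.
  for (auto& src : loads) {
    std::sort(src.begin(), src.end());                 // smallest .. largest
    while (!src.empty() && total(src) > avg) {
      auto dst = std::min_element(loads.begin(), loads.end(),
          [&](const auto& a, const auto& b) { return total(a) < total(b); });
      if (total(*dst) + src.back() > avg) break;       // move would overshoot
      dst->push_back(src.back());
      src.pop_back();
    }
  }
}
```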
Retentive Work Stealing [Figure: a shared work pool divided into local queues for Proc 1 … Proc n]
Retentive Work Stealing [Figure: each worker's queue is split into a remote region (between stail and split, available to thieves) and a local region (between split and head)]
Retentive Work Stealing • addTask(): add task to local region • getTask(): remove task from local region [Figure: queue with markers stail, split, head; tasks behind stail form a buffer of locally executed tasks]
Retentive Work Stealing • acquireFromShared(): move tasks from the shared portion to the local portion • releaseToShared(): move tasks from the local portion to the shared portion [Figure: the split marker moves between stail and head]
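A simplified shared-memory sketch of the four queue operations named in the last two slides. The worker owns the region between split and head; addTask/getTask touch only that region, while releaseToShared/acquireFromShared just move the split marker. A mutex stands in for the lock-free, one-sided protocol used over distributed memory; the names mirror the slides but the layout is illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <optional>
#include <vector>

struct SplitQueue {
  std::vector<int> buf;        // task descriptors (ints as placeholders)
  std::size_t stail = 0;       // start of tasks exposed to thieves
  std::size_t split = 0;       // boundary: shared [stail, split), local [split, head)
  std::size_t head  = 0;       // end of the local region
  std::mutex  m;               // stand-in for the lock-free protocol

  void addTask(int t) {                      // local region only: append at head
    if (head == buf.size()) buf.push_back(t);
    else buf[head] = t;
    ++head;
  }
  std::optional<int> getTask() {             // local region only: pop from head
    if (head == split) return std::nullopt;  // local region is empty
    return buf[--head];
  }
  void releaseToShared(std::size_t n) {      // expose up to n local tasks to thieves
    std::lock_guard<std::mutex> g(m);
    split = std::min(split + n, head);
  }
  void acquireFromShared(std::size_t n) {    // reclaim up to n shared tasks
    std::lock_guard<std::mutex> g(m);
    split -= std::min(n, split - stail);
  }
};
```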
Retentive Work Stealing 1. Mark tasks stolen at stail and begin the transfer 2. Atomically increment itail on completion of the transfer 3. Worker updates ctail when stail == itail • stail: beginning of tasks available to be stolen • itail: number of tasks that have finished transfer • ctail: past this marker it is safe to use the buffer [Figure: queue with markers head, ctail, itail, stail, split]
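A schematic of the three-marker handshake, assuming the markers are atomics in shared memory (the paper uses one-sided operations): a thief reserves tasks by advancing stail, accounts for them in itail once its transfer completes, and the victim only advances ctail, and reuses that part of the buffer, when no transfers are in flight (stail == itail). Function names are hypothetical.

```cpp
#include <atomic>
#include <cstddef>

struct StealMarkers {
  std::atomic<std::size_t> stail{0};  // tasks reserved by thieves so far
  std::atomic<std::size_t> itail{0};  // tasks whose transfer has completed
  std::size_t              ctail = 0; // up to here the worker may reuse the buffer
};

// Thief: 1. mark n tasks stolen at stail and begin the transfer.
std::size_t reserve(StealMarkers& q, std::size_t n) {
  return q.stail.fetch_add(n);        // index of the first reserved task
}

// Thief: 2. atomically account for the n tasks once the transfer completes.
void transfer_done(StealMarkers& q, std::size_t n) {
  q.itail.fetch_add(n);
}

// Victim: 3. advance ctail only when stail == itail, i.e. every reserved
// task has also finished transferring, so that space is safe to reuse.
void try_advance_ctail(StealMarkers& q) {
  std::size_t s = q.stail.load();
  if (s == q.itail.load()) q.ctail = s;
}
```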
Retentive Work Stealing [Figure: local queues for Proc 1 … Proc n seeded for the next iteration with the tasks each processor actually executed] Intuition: Stealing indicates poor initial balance
Retentive Work Stealing • Active message based work stealing optimized for distributed memory • Exploit persistence across work stealing iterations • In each work stealing phase: • Track the tasks executed by this worker in this iteration • Seed the next iteration with the tasks this worker executed
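A minimal sketch of the retention step, under the assumption that each worker simply logs the tasks it actually executed (its own plus stolen ones) and uses that log to seed the next iteration; the runtime interaction is elided and the names are illustrative.

```cpp
#include <vector>

struct Worker {
  std::vector<int> seed;       // tasks this worker starts the phase with
  std::vector<int> executed;   // tasks it actually ran in this phase
};

// One work-stealing phase: whatever gets executed here (including tasks stolen
// from others, and minus tasks lost to thieves) is recorded in 'executed'.
void run_phase(Worker& w) {
  w.executed.clear();
  for (int task : w.seed) {
    // execute(task);           // real execution and stealing elided
    w.executed.push_back(task); // log the tasks actually executed by this worker
  }
}

// Retention: the executed set becomes the next phase's seed, so repeated
// phases start from an increasingly balanced initial distribution.
void retain(Worker& w) {
  w.seed = w.executed;
}
```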
Experimental Setup • Multi-threaded MPI; one core per node for active messages • "Flat" execution – each core is an independent worker
Hartree-Fock Benchmark • Basis for several electronic structure theories • Two-electron contribution • Schwarz screening: data-dependent sparsity screening at runtime • Tasks vary in size from milliseconds to seconds
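A rough sketch of the screening step mentioned above: a candidate integral block is kept only if its Cauchy-Schwarz bound exceeds a cutoff, so the surviving task set and task sizes are only known at runtime. The bound, cutoff, and block indexing here are illustrative.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Illustrative Schwarz screening: keep a block (ij|kl) only if the
// Cauchy-Schwarz bound sqrt((ij|ij)) * sqrt((kl|kl)) exceeds a cutoff.
bool survives_screening(double q_ij, double q_kl, double cutoff = 1e-10) {
  return std::sqrt(q_ij) * std::sqrt(q_kl) >= cutoff;
}

// Tasks are created only for blocks that pass the screen, which is why the
// task set (and hence the load) is data dependent and discovered at runtime.
std::vector<std::pair<int, int>> build_tasks(const std::vector<double>& q) {
  std::vector<std::pair<int, int>> tasks;
  for (int ij = 0; ij < static_cast<int>(q.size()); ++ij)
    for (int kl = 0; kl <= ij; ++kl)
      if (survives_screening(q[ij], q[kl])) tasks.push_back({ij, kl});
  return tasks;
}
```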
Hopper: Performance [Figure: efficiency and avg. tasks per core vs. core count for persistence-based load balancing and retentive stealing] • Persistence-based load balancing "converges" faster • Retentive stealing also improves efficiency • Stealing effective even with limited parallelism
Intrepid: Performance [Figure: efficiency and avg. tasks per core vs. core count for persistence-based load balancing and retentive stealing] • Much worse performance for the first iteration • Converges to a better efficiency than on Hopper
Titan: Performance [Figure: efficiency and avg. tasks per core vs. core count for persistence-based load balancing and retentive stealing] • Similar behavior to that on Intrepid
Intrepid: Num. Steals [Figure: attempted and successful steals vs. core count] • Retentive stealing stabilizes stealing costs • Similar trends on all systems
Utilization [Figure: utilization (%) over time for HF-Be256 on 9600 cores of Hopper: Steal (13.6 s), StealRet-final (12.6 s), PLB (12.2 s)] • Initial stealing has high costs during ramp-down • Retentive stealing does a better job reducing this cost
Summary of Insights • Retentive work stealing can scale – demonstrated on up to 163,840 cores of Intrepid, 146,400 cores of Hopper, and 128,000 cores of Titan • Retentive stealing and persistence-based load balancing perform comparably • Retentive stealing incrementally improves balance • Number of steals does not grow substantially with scale • The greedy hierarchical persistence-based load balancer achieves load balance quality comparable to a centralized scheme (details in paper)