
Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications


Presentation Transcript


  1. Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications. Jonathan Lifflander (UIUC), Sriram Krishnamoorthy (PNNL*), Laxmikant Kale (UIUC). HPDC 2012

  2. Dynamic load balancing on 100,000 processor cores and beyond

  3. Iterative Applications
  • Applications repeatedly executing the same computation
  • Static or slowly evolving execution characteristics
  • Execution characteristics preclude static balancing
    • Application characteristics (communication pattern, sparsity, …)
    • Execution environment (topology, asymmetry, …)
  • Challenge: load-balancing such applications

  4. Overdecomposition
  • Expose greater levels of concurrency than supported by the hardware
  • Middleware (runtime) dynamically maps the concurrent tasks to hardware resources
  • Abstraction supports continuous optimization and adaptation
    • Improvements to load balancing
    • New metrics (power, energy, graceful degradation, …)
    • New features: fault tolerance, power/energy-awareness

  5. Problem Statement
  • Scalable load balancers for iterative overdecomposed applications
  • We consider two alternatives:
    • Persistence-based load balancing
    • Work stealing
  • How do these algorithms behave at scale?
  • How do they compare?

  6. Related Work
  • Overdecomposition is a widely used approach
  • Inspector-executor approaches employ start-time load balancers
  • Past hierarchical load balancers typically do not consider localization
  • Scalability of work stealing is not well understood; the largest prior demonstration was on 8,192 cores
  • No comparative evaluation of the two schemes

  7. TASCEL: Task Scheduling Library
  • Runtime library for task-parallel programs
  • Manages task collections for execution on distributed-memory machines
  • Compatible with native MPI programs
  • Phase-based switch between SPMD and non-SPMD modes of execution

  8. TASCEL Execution
  • Task: basic unit of migratable execution
  • Typical workflow (sketched below):
    • Create a task collection
    • Seed it with one or more tasks
    • Process tasks in the collection until termination detection
  • Processing of task collections
    • Manages concurrency, faults, …
    • Trade-offs exposed through implementation specializations
      • Dynamic load balancing schemes
      • Fault tolerance protocols
      • …
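  The workflow above can be sketched roughly as follows. TaskCollection, seed(), and process() are hypothetical names used for illustration only, not the actual TASCEL API, and distributed termination detection is elided.

```cpp
#include <deque>
#include <functional>
#include <iostream>

// Hypothetical stand-in for a TASCEL task collection (the real API differs).
struct TaskCollection {
  using Task = std::function<void(TaskCollection&)>;
  std::deque<Task> tasks;

  // Seed the collection with one or more initial tasks.
  void seed(Task t) { tasks.push_back(std::move(t)); }

  // Process tasks until none remain.  In the distributed setting the runtime
  // performs termination detection across all workers instead.
  void process() {
    while (!tasks.empty()) {
      Task t = std::move(tasks.front());
      tasks.pop_front();
      t(*this);  // a task may add further tasks to the collection
    }
  }
};

int main() {
  TaskCollection tc;                                   // 1. create a task collection
  tc.seed([](TaskCollection& c) {                      // 2. seed it with a task
    std::cout << "root task\n";
    c.seed([](TaskCollection&) { std::cout << "child task\n"; });
  });
  tc.process();                                        // 3. process until termination
}
```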

  9. Load Balancers
  • Greedy localized hierarchical persistence-based load balancing
  • Retentive work stealing

  10. Greedy Localized Hierarchical Persistence-based LB
  [Diagram: tasks redistributed among processors 0-5 within local groups of the hierarchy.]
  Intuition: satisfy local imbalance first

  11. Greedy Localized Hierarchical Persistence-based LB
  [Diagram, continued from the previous slide.]
  Intuition: satisfy local imbalance first (a sketch of this greedy pass follows)
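  The sketch below illustrates one level of the greedy localized scheme under assumed data structures: within a group (one subtree of the processor hierarchy), tasks carrying loads measured in the previous iteration are moved greedily from processors above the group average to processors below it, and any imbalance that cannot be resolved locally would be handled at the next level up. This is an illustration only, not the exact algorithm from the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A migratable task with the load it measured in the previous iteration.
struct Task { int id; double load; };
struct Proc { std::vector<Task> tasks; double load = 0; };

// One level of the hierarchy: greedily shift tasks from processors above the
// group average to processors below it, satisfying imbalance locally first.
void balance_group(std::vector<Proc>& group) {
  double total = 0;
  for (const Proc& p : group) total += p.load;
  const double avg = total / group.size();

  for (Proc& src : group) {
    if (src.load <= avg) continue;                      // not overloaded
    // Shed the heaviest tasks first.
    std::sort(src.tasks.begin(), src.tasks.end(),
              [](const Task& a, const Task& b) { return a.load > b.load; });
    for (Proc& dst : group) {
      if (&dst == &src || dst.load >= avg) continue;    // only underloaded peers
      for (std::size_t i = 0; i < src.tasks.size() && src.load > avg;) {
        const Task t = src.tasks[i];
        if (dst.load + t.load <= avg) {                 // keep dst at or below avg
          dst.tasks.push_back(t);
          dst.load += t.load;
          src.load -= t.load;
          src.tasks.erase(src.tasks.begin() + static_cast<std::ptrdiff_t>(i));
        } else {
          ++i;
        }
      }
    }
  }
}
```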

  12. Retentive Work Stealing
  [Diagram: the work pool divided into local queues, one per processor (Proc 1 … Proc n).]

  13. Retentive Work Stealing
  [Diagram: split task queue with head, split, and stail markers; the region between stail and split is remote (stealable), the region between split and head is local.]

  14. Retentive Work Stealing
  • addTask(): add a task to the local region
  • getTask(): remove a task from the local region
  [Diagram: the split queue (head, split, stail) alongside a buffer of locally executed tasks.]

  15. Retentive Work Stealing
  • acquireFromShared(): move tasks to the local portion
  • releaseToShared(): move tasks to the shared portion
  [Diagram: the split marker moving between stail and head.]

  16. Retentive Work Stealing
  Steal protocol:
  1. Mark tasks stolen at stail and begin the transfer
  2. Atomically increment itail on completion of the transfer
  3. The worker updates ctail when stail == itail
  Markers (a sketch of the queue follows):
  • stail: beginning of tasks available to be stolen
  • itail: number of tasks that have finished transfer
  • ctail: past this marker it is safe to reuse the buffer
  [Diagram: the queue with head, split, stail, itail, and ctail markers.]
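  Under the semantics described on slides 13-16, the split queue might look roughly like the sketch below. The marker names follow the slides; everything else (fixed-capacity buffer, member functions) is illustrative and simplified relative to the actual TASCEL implementation, which must also handle wraparound, capacity limits, and concurrent thieves.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

struct Task { int id; };

// Sketch of the split task queue.  Indices grow from ctail toward head:
//   [ctail, stail)  stolen, transfer complete, reusable
//   [stail, split)  shared: visible to remote thieves
//   [split, head)   local: owned by this worker
struct SplitQueue {
  static constexpr std::size_t kCap = 1024;
  std::array<Task, kCap> buf{};       // overflow checks elided for brevity
  std::size_t head = 0;               // next free slot at the local end
  std::size_t split = 0;              // boundary between shared and local
  std::atomic<std::size_t> stail{0};  // first task still available to steal
  std::atomic<std::size_t> itail{0};  // count of tasks whose transfer is done
  std::size_t ctail = 0;              // below this, buffer slots can be reused

  // Local operations: only the owning worker touches head and split.
  void addTask(const Task& t) { buf[head++ % kCap] = t; }
  bool getTask(Task& out) {
    if (head == split) return false;  // local region is empty
    out = buf[--head % kCap];
    return true;
  }

  // Move the split point to trade work between the two regions
  // (bounds checks and synchronization with thieves elided).
  void releaseToShared(std::size_t n) { split += n; }   // expose n more tasks
  bool acquireFromShared(std::size_t n) {               // reclaim n tasks
    if (split < stail.load() + n) return false;         // already stolen
    split -= n;
    return true;
  }

  // Thief side: (1) mark n tasks stolen at stail and begin the transfer,
  // (2) atomically bump itail once the transfer has completed.
  std::size_t markStolen(std::size_t n) { return stail.fetch_add(n); }
  void transferDone(std::size_t n) { itail.fetch_add(n); }

  // Worker side: (3) when stail == itail, every marked steal has finished
  // transferring, so ctail can advance and that buffer space can be reused.
  void maybeAdvanceCtail() {
    const std::size_t s = stail.load();
    if (s == itail.load()) ctail = s;
  }
};
```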

  17. Retentive Work Stealing
  [Diagram: seeded local queues vs. the tasks each processor (Proc 1 … Proc n) actually executed.]
  Intuition: stealing indicates poor initial balance

  18. Retentive Work Stealing
  • Active-message-based work stealing optimized for distributed memory
  • Exploit persistence across work stealing iterations
  • In each work stealing phase (sketched below):
    • Track the tasks executed by this worker in this iteration
    • Seed this worker with those tasks for the next iteration
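  A sequential sketch of this retention is shown below, with hypothetical Worker and try_steal helpers standing in for the active-message machinery and the split queue above: each worker records the tasks it actually executed and seeds its queue for the next phase with exactly that set, so stealing only has to correct residual imbalance.

```cpp
#include <cstddef>
#include <vector>

struct Task { int id; };

// Hypothetical worker: a seeded queue plus a record of tasks actually run.
struct Worker {
  std::vector<Task> queue;     // tasks seeded for the current iteration
  std::vector<Task> executed;  // tasks this worker actually executed

  bool pop(Task& t) {
    if (queue.empty()) return false;
    t = queue.back();
    queue.pop_back();
    return true;
  }
};

// Stand-in for a steal: take one task from another worker's queue.  The real
// runtime steals via active messages with proper synchronization.
bool try_steal(std::vector<Worker>& ws, std::size_t thief, Task& t) {
  for (std::size_t v = 0; v < ws.size(); ++v) {
    if (v == thief || ws[v].queue.empty()) continue;
    t = ws[v].queue.front();
    ws[v].queue.erase(ws[v].queue.begin());
    return true;
  }
  return false;
}

// One work stealing phase for worker `me` (termination detection elided).
void run_phase(std::vector<Worker>& workers, std::size_t me) {
  Worker& w = workers[me];
  w.executed.clear();
  Task t;
  while (w.pop(t) || try_steal(workers, me, t)) {
    // execute(t);             // run the task
    w.executed.push_back(t);   // track what this worker actually executed
  }
  // Retention: seed the next phase with exactly the tasks executed here.
  w.queue = w.executed;
}
```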

  19. Experimental Setup
  • Multi-threaded MPI; one core per node dedicated to active messages
  • "Flat" execution: each core is an independent worker

  20. Hartree-Fock Benchmark
  • Basis for several electronic structure theories
  • Two-electron contribution
  • Schwarz screening: data-dependent sparsity screening at runtime (a sketch follows)
  • Tasks vary in size from milliseconds to seconds
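  To illustrate the data-dependent sparsity, the sketch below shows Schwarz screening during task generation. The names (schwarz, emit_task, tol), the loop structure, and the assumption of a full symmetric bound matrix are choices made for this example, not the benchmark's actual code. By the Cauchy-Schwarz bound |(ij|kl)| <= sqrt((ij|ij)(kl|kl)), a shell quartet whose bound falls below a threshold contributes negligibly and can be skipped at runtime.

```cpp
#include <cmath>
#include <functional>
#include <vector>

// schwarz[i][j] holds a precomputed bound (ij|ij) for shell pair (i, j).
// Quartets surviving the screen become tasks; how many survive, and how much
// work each one contains, depends on the input, hence the runtime screening.
void generate_tasks(const std::vector<std::vector<double>>& schwarz,
                    double tol,
                    const std::function<void(int, int, int, int)>& emit_task) {
  const int n = static_cast<int>(schwarz.size());  // number of shells
  for (int i = 0; i < n; ++i)
    for (int j = 0; j <= i; ++j)
      for (int k = 0; k <= i; ++k)
        for (int l = 0; l <= k; ++l)
          if (std::sqrt(schwarz[i][j] * schwarz[k][l]) >= tol)
            emit_task(i, j, k, l);  // this quartet becomes a task
}
```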

  21. Hopper: Performance
  [Plots: efficiency and average tasks per core vs. core count, for persistence-based load balancing and retentive stealing.]
  • Persistence-based load balancing "converges" faster
  • Retentive stealing also improves efficiency
  • Stealing is effective even with limited parallelism

  22. Intrepid: Performance
  [Plots: efficiency and average tasks per core vs. core count, for persistence-based load balancing and retentive stealing.]
  • Much worse performance for the first iteration
  • Converges to a better efficiency than on Hopper

  23. Titan: Performance
  [Plots: efficiency and average tasks per core vs. core count, for persistence-based load balancing and retentive stealing.]
  • Similar behavior as on Intrepid

  24. Intrepid: Number of Steals
  [Plots: number of attempted and successful steals vs. core count.]
  • Retentive stealing stabilizes stealing costs
  • Similar trends on all systems

  25. Utilization
  [Plots: utilization (%) over time for Steal (13.6 s), StealRet-final (12.6 s), and PLB (12.2 s); HF-Be256 on 9,600 cores of Hopper.]
  • Initial stealing has high costs during ramp-down
  • Retentive stealing does a better job of reducing this cost

  26. Summary of Insights
  • Retentive work stealing can scale: demonstrated on up to 163,840 cores of Intrepid, 146,400 cores of Hopper, and 128,000 cores of Titan
  • Retentive stealing and persistence-based load balancing perform comparably
  • Retentive stealing incrementally improves balance
  • The number of steals does not grow substantially with scale
  • The greedy hierarchical persistence-based load balancer achieves good load balance quality compared to a centralized scheme (details in the paper)
