Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications Jonathan Lifflander, UIUC Sriram Krishnamoorthy, PNNL* Laxmikant Kale, UIUC HPDC 2012
Dynamic load balancing on 100,000 processor cores and beyond
Iterative Applications • Applications repeatedly executing the same computation • Static or slowly evolving execution characteristics • Execution characteristics preclude static balancing • Application characteristics (comm. pattern, sparsity,…) • Execution environment (topology, asymmetry, …) • Challenge: Load-balancing such applications
Overdecomposition • Expose greater levels of concurrency than the hardware supports • Middleware (runtime) dynamically maps the concurrent tasks to hardware resources • Abstraction supports continuous optimization and adaptation • Improvements to load balancing • New metrics (power, energy, graceful degradation, …) • New features: fault tolerance, power/energy-awareness
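A minimal sketch (not from the slides) of what overdecomposition looks like in practice: the work is cut into many more tasks than cores, and the initial task-to-core mapping is only a starting point that the runtime is free to change. The counts and names here are illustrative.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical illustration: decompose work into many more tasks than cores,
// so a runtime can migrate individual tasks instead of whole per-core partitions.
int main() {
  const int num_cores = 8;
  const int overdecomposition_factor = 16;   // illustrative tuning knob
  const int num_tasks = num_cores * overdecomposition_factor;

  // Initial (static) mapping: task i -> core i % num_cores.
  // The middleware may remap tasks between iterations as it measures load.
  std::vector<int> owner(num_tasks);
  for (int t = 0; t < num_tasks; ++t) owner[t] = t % num_cores;

  std::printf("%d tasks over %d cores (%dx overdecomposition)\n",
              num_tasks, num_cores, overdecomposition_factor);
  return 0;
}
```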
Problem Statement • Scalable load balancers for iterative overdecomposed applications • We consider two alternatives: • Persistence-based load balancing • Work stealing • How do these algorithms behave at scale? • How do they compare?
Related Work • Overdecomposition is a widely used approach • Inspector-executor approaches employ start-time load balancers • Hierarchical load balancers in the past typically do not consider localization • Scalability of work stealing not well understood – largest prior demonstration was on 8192 cores • No comparative evaluation of the two schemes
TASCEL: Task Scheduling Library • Runtime library for task-parallel programs • Manages task collections for execution on distributed memory machines • Compatible with native MPI programs • Phase-based switch between SPMD and non-SPMD modes of execution
TASCEL Execution • Task: basic unit of migratable execution • Typical workflow: • Create a task collection • Seed it with one or more tasks • Process tasks in the collection until termination detection • Processing of task collections • Manages concurrency, faults, … • Trade-offs exposed through implementation specializations • Dynamic load balancing schemes • Fault tolerance protocols • …
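A hypothetical sketch of the workflow above (create a collection, seed it, process to termination). The types and calls are placeholders and do not reproduce TASCEL's actual interface; a single-worker loop stands in for distributed termination detection.

```cpp
#include <deque>
#include <functional>

using Task = std::function<void()>;   // placeholder task type

struct TaskCollection {               // placeholder, not the real TASCEL API
  std::deque<Task> tasks;
  void seed(Task t) { tasks.push_back(std::move(t)); }  // seed with initial tasks
  void process() {                    // process until no tasks remain
    while (!tasks.empty()) {          // stand-in for termination detection
      Task t = std::move(tasks.front());
      tasks.pop_front();
      t();                            // a task may seed further tasks
    }
  }
};

int main() {
  TaskCollection tc;                  // 1. create a task collection
  tc.seed([] { /* root task */ });    // 2. seed it with one or more tasks
  tc.process();                       // 3. process tasks until termination
  return 0;
}
```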
Load Balancers Greedy localized hierarchical persistence-based load balancing Retentive work stealing
Greedy Localized Hierarchical Persistence-based LB [Figure: tasks redistributed across a hierarchy of processors 0–5] Intuition: Satisfy local imbalance first
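A rough sketch of the "satisfy local imbalance first" idea, assuming per-task costs measured in the previous iteration: within one group of processors, a greedy pass moves the largest tasks from overloaded processors to the currently least-loaded one. In the hierarchical scheme this pass would run per subtree (e.g., node, then rack, then machine), so most movement stays local. The code is illustrative, not the paper's implementation.

```cpp
#include <algorithm>
#include <vector>

// Illustrative greedy pass inside one group of processors.
// loads[p] holds the measured costs of the tasks currently mapped to processor p.
void balance_group(std::vector<std::vector<double>>& loads) {
  if (loads.empty()) return;
  auto total = [](const std::vector<double>& v) {
    double s = 0; for (double x : v) s += x; return s;
  };
  double avg = 0;
  for (auto& v : loads) avg += total(v);
  avg /= loads.size();

  // Shift the largest tasks off overloaded processors onto the least-loaded
  // processor in the same group, until everyone is near the group average.
  for (auto& src : loads) {
    std::sort(src.begin(), src.end());                 // smallest .. largest
    while (!src.empty() && total(src) > avg) {
      auto dst = std::min_element(loads.begin(), loads.end(),
          [&](const auto& a, const auto& b) { return total(a) < total(b); });
      if (total(*dst) + src.back() > avg) break;       // move would overshoot
      dst->push_back(src.back());
      src.pop_back();
    }
  }
}
```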
Retentive Work Stealing [Figure: a shared work pool divided into local queues for Proc 1 … Proc n]
Retentive Work Stealing [Figure: each worker's queue is split into a remote region (between stail and split, available to thieves) and a local region (between split and head)]
Retentive Work Stealing • addTask(): add task to local region • getTask(): remove task from local region [Figure: queue with markers stail, split, head; tasks behind stail form a buffer of locally executed tasks]
Retentive Work Stealing • acquireFromShared(): move tasks from the shared portion to the local portion • releaseToShared(): move tasks from the local portion to the shared portion [Figure: the split marker moves between stail and head]
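A simplified shared-memory sketch of the four queue operations named in the last two slides. The worker owns the region between split and head; addTask/getTask touch only that region, while releaseToShared/acquireFromShared just move the split marker. A mutex stands in for the lock-free, one-sided protocol used over distributed memory; the names mirror the slides but the layout is illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <optional>
#include <vector>

struct SplitQueue {
  std::vector<int> buf;        // task descriptors (ints as placeholders)
  std::size_t stail = 0;       // start of tasks exposed to thieves
  std::size_t split = 0;       // boundary: shared [stail, split), local [split, head)
  std::size_t head  = 0;       // end of the local region
  std::mutex  m;               // stand-in for the lock-free protocol

  void addTask(int t) {                      // local region only: append at head
    if (head == buf.size()) buf.push_back(t);
    else buf[head] = t;
    ++head;
  }
  std::optional<int> getTask() {             // local region only: pop from head
    if (head == split) return std::nullopt;  // local region is empty
    return buf[--head];
  }
  void releaseToShared(std::size_t n) {      // expose up to n local tasks to thieves
    std::lock_guard<std::mutex> g(m);
    split = std::min(split + n, head);
  }
  void acquireFromShared(std::size_t n) {    // reclaim up to n shared tasks
    std::lock_guard<std::mutex> g(m);
    split -= std::min(n, split - stail);
  }
};
```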
Retentive Work Stealing 1. Mark tasks stolen at stail and begin the transfer 2. Atomically increment itail on completion of the transfer 3. Worker updates ctail when stail == itail • stail: beginning of tasks available to be stolen • itail: number of tasks that have finished transfer • ctail: past this marker it is safe to use the buffer [Figure: queue with markers head, ctail, itail, stail, split]
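A schematic of the three-marker handshake, assuming the markers are atomics in shared memory (the paper uses one-sided operations): a thief reserves tasks by advancing stail, accounts for them in itail once its transfer completes, and the victim only advances ctail, and reuses that part of the buffer, when no transfers are in flight (stail == itail). Function names are hypothetical.

```cpp
#include <atomic>
#include <cstddef>

struct StealMarkers {
  std::atomic<std::size_t> stail{0};  // tasks reserved by thieves so far
  std::atomic<std::size_t> itail{0};  // tasks whose transfer has completed
  std::size_t              ctail = 0; // up to here the worker may reuse the buffer
};

// Thief: 1. mark n tasks stolen at stail and begin the transfer.
std::size_t reserve(StealMarkers& q, std::size_t n) {
  return q.stail.fetch_add(n);        // index of the first reserved task
}

// Thief: 2. atomically account for the n tasks once the transfer completes.
void transfer_done(StealMarkers& q, std::size_t n) {
  q.itail.fetch_add(n);
}

// Victim: 3. advance ctail only when stail == itail, i.e. every reserved
// task has also finished transferring, so that space is safe to reuse.
void try_advance_ctail(StealMarkers& q) {
  std::size_t s = q.stail.load();
  if (s == q.itail.load()) q.ctail = s;
}
```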
Retentive Work Stealing [Figure: local queues for Proc 1 … Proc n seeded for the next iteration with the tasks each processor actually executed] Intuition: Stealing indicates poor initial balance
Retentive Work Stealing • Active message based work stealing optimized for distributed memory • Exploit persistence across work stealing iterations • In each work stealing phase: • Track the tasks executed by this worker in this iteration • Seed the next iteration with the tasks this worker executed
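A minimal sketch of the retention step, under the assumption that each worker simply logs the tasks it actually executed (its own plus stolen ones) and uses that log to seed the next iteration; the runtime interaction is elided and the names are illustrative.

```cpp
#include <vector>

struct Worker {
  std::vector<int> seed;       // tasks this worker starts the phase with
  std::vector<int> executed;   // tasks it actually ran in this phase
};

// One work-stealing phase: whatever gets executed here (including tasks stolen
// from others, and minus tasks lost to thieves) is recorded in 'executed'.
void run_phase(Worker& w) {
  w.executed.clear();
  for (int task : w.seed) {
    // execute(task);           // real execution and stealing elided
    w.executed.push_back(task); // log the tasks actually executed by this worker
  }
}

// Retention: the executed set becomes the next phase's seed, so repeated
// phases start from an increasingly balanced initial distribution.
void retain(Worker& w) {
  w.seed = w.executed;
}
```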
Experimental Setup • Multi-threaded MPI; one core per node for active messages • "Flat" execution – each core is an independent worker
Hartree-Fock Benchmark • Basis for several electronic structure theories • Two-electron contribution • Schwarz screening: data-dependent sparsity screening at runtime • Tasks vary in size from milliseconds to seconds
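A rough sketch of the screening step mentioned above: a candidate integral block is kept only if its Cauchy-Schwarz bound exceeds a cutoff, so the surviving task set and task sizes are only known at runtime. The bound, cutoff, and block indexing here are illustrative.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Illustrative Schwarz screening: keep a block (ij|kl) only if the
// Cauchy-Schwarz bound sqrt((ij|ij)) * sqrt((kl|kl)) exceeds a cutoff.
bool survives_screening(double q_ij, double q_kl, double cutoff = 1e-10) {
  return std::sqrt(q_ij) * std::sqrt(q_kl) >= cutoff;
}

// Tasks are created only for blocks that pass the screen, which is why the
// task set (and hence the load) is data dependent and discovered at runtime.
std::vector<std::pair<int, int>> build_tasks(const std::vector<double>& q) {
  std::vector<std::pair<int, int>> tasks;
  for (int ij = 0; ij < static_cast<int>(q.size()); ++ij)
    for (int kl = 0; kl <= ij; ++kl)
      if (survives_screening(q[ij], q[kl])) tasks.push_back({ij, kl});
  return tasks;
}
```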
Hopper: Performance [Figure: efficiency and avg. tasks per core vs. core count for persistence-based load balancing and retentive stealing] • Persistence-based load balancing "converges" faster • Retentive stealing also improves efficiency • Stealing effective even with limited parallelism
Intrepid: Performance [Figure: efficiency and avg. tasks per core vs. core count for persistence-based load balancing and retentive stealing] • Much worse performance for the first iteration • Converges to a better efficiency than on Hopper
Titan: Performance [Figure: efficiency and avg. tasks per core vs. core count for persistence-based load balancing and retentive stealing] • Similar behavior to that on Intrepid
Intrepid: Num. Steals [Figure: attempted and successful steals vs. core count] • Retentive stealing stabilizes stealing costs • Similar trends on all systems
Utilization [Figure: utilization (%) over time for HF-Be256 on 9600 cores of Hopper: Steal (13.6 s), StealRet-final (12.6 s), PLB (12.2 s)] • Initial stealing has high costs during ramp-down • Retentive stealing does a better job reducing this cost
Summary of Insights • Retentive work stealing can scale – demonstrated on up to 163,840 cores of Intrepid, 146,400 cores of Hopper, and 128,000 cores of Titan • Retentive stealing and persistence-based load balancing perform comparably • Retentive stealing incrementally improves balance • Number of steals does not grow substantially with scale • The greedy hierarchical persistence-based load balancer achieves load balance quality comparable to a centralized scheme (details in paper)