
Lazy Binary-Splitting: A Run-Time Adaptive Work-Stealing Scheduler




  1. Lazy Binary-Splitting: A Run-Time Adaptive Work-Stealing Scheduler Alexandros Tzannes, George C. Caragea, Rajeev Barua, Uzi Vishkin University of Maryland, College Park Tuesday Jan. 12th, 2010 PPoPP, Bangalore

  2. Outline: Introduction/Motivation, Work-Stealing Background, Lazy Binary Splitting, Experimental Evaluation, Conclusion

  3. We present a dynamic algorithm for scheduling parallel tasks onto cores, based on work-stealing. Target: shared-memory UMA systems. Contribution: our work-stealer adapts parallelism granularity to run-time conditions by avoiding the creation of excessive parallelism when the system is heavily loaded.

  4. Why Dynamic Scheduling? Static scheduling is easy (e.g., split the do-all iterations by the number of threads) and works well in some cases, e.g., when iterations do similar amounts of work, but it can cause load imbalance in other cases, e.g., nested do-alls or load-imbalanced iterations. Dynamic scheduling is more complex (some overheads), and the compiler or programmer must worry about parallelism overheads and grain size, but it gives great load balance and performance, even (and especially) for irregular or nested parallelism, and it enables dynamic coarsening of parallelism.

  5. Why Nested Parallelism?
    void quicksort(int A[], int start, int end) {
      int pivot = partition(A, start, end);
      spawn(0, 1) {                              // XMTC: start two parallel threads; $ is the thread ID
        if ($ == 0) quicksort(A, start, pivot);
        else        quicksort(A, pivot + 1, end);
      }
    }
  • Ease of programming
  • It occurs naturally in many programs, e.g., divide-and-conquer (irregular parallelism)
  • Sometimes the outer parallelism alone doesn't create enough parallelism
  • Sometimes the outer parallelism creates load-imbalanced threads
  • Modularity: a programmer should be able to call a function that creates parallelism from sequential or parallel contexts alike.

  6. Outline: Introduction/Motivation, Work-Stealing Background, Lazy Binary Splitting, Experimental Evaluation, Conclusion

  7. Work-Stealing Background. [Diagram: each core has its own deque of Task Descriptors; together the deques form the shared work-pool.] • Parallel do-all loops can introduce a huge number of potentially fine-grain iterations • A Task Descriptor (TD) is a wrapper that contains multiple fine-grain iterations

  8. Work-Stealing Scheduling. Work-stealing overview: it scales; good locality [Acar et al., The data locality of work stealing]; good memory footprint (at most P·S1, i.e., P times the serial-execution footprint); provably efficient; low synchronization overheads. Stealing phase: idle processors look for work by randomized probing of the other processors' deques.
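To make the stealing phase concrete, here is a minimal single-threaded sketch (hypothetical TD and Worker types, not the paper's implementation; a real scheduler would synchronize the deque accesses):

    #include <cstddef>
    #include <deque>
    #include <optional>
    #include <random>
    #include <vector>

    struct TD { std::size_t begin, end; };        // a range of do-all iterations
    struct Worker { std::deque<TD> deque; };      // owner works at the back, thieves steal at the front

    // An idle worker probes randomly chosen victims and takes a TD from the
    // top (front) of the first non-empty deque it finds.
    std::optional<TD> steal(std::vector<Worker>& workers, std::size_t self, std::mt19937& rng) {
        std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 1);
        for (int attempt = 0; attempt < 1000; ++attempt) {   // bounded probing, for the sketch
            std::size_t victim = pick(rng);
            if (victim == self || workers[victim].deque.empty()) continue;
            TD td = workers[victim].deque.front();
            workers[victim].deque.pop_front();
            return td;
        }
        return std::nullopt;                                  // found no work
    }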

  9. Eager Binary Splitting (EBS). Focusing now on parallel do-alls: when a TD with n iterations is created, it is recursively split, and TDs with n/2, n/4, ..., 1 iterations are pushed onto the deque. It may not be profitable to split all the way down to 1 iteration: splits mean deque transactions, which mean memory fences, which are expensive and degrade performance.
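A minimal sequential sketch of what one core does under EBS, assuming a hypothetical TD type, a local deque, and a stand-in loop body (the real scheduler runs this per core with synchronized deque operations):

    #include <cstddef>
    #include <deque>

    struct TD { std::size_t begin, end; };                 // half-open range of do-all iterations

    void run_iteration(std::size_t i) { (void)i; }         // stand-in for the do-all loop body

    void ebs_execute(TD td, std::deque<TD>& local_deque) {
        while (td.end - td.begin > 1) {                    // split all the way down to 1 iteration
            std::size_t mid = td.begin + (td.end - td.begin) / 2;
            local_deque.push_back({mid, td.end});          // each push is a deque transaction (+ fence)
            td.end = mid;                                  // keep splitting the lower half
        }
        run_iteration(td.begin);                           // finally run the single remaining iteration
    }

The pushed halves are split the same way when they are later popped or stolen, which is where the N-wide, (log N + 1)-deep tree on the next slide comes from.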

  10. EBS's splitting: Overall View. [Splitting-tree diagram for N = 4096 iterations on 64 cores: EBS keeps splitting down to single-iteration TDs, so over time the tree reaches width N and depth log N + 1.] Excessive splitting = performance loss.

  11. Solutions to Excessive Splitting. To reduce excessive splitting, TBB offers two options: the Simple Partitioner (SP) and the Auto Partitioner (AP).

  12. #1: Simple Partitioner EBS (SP). Stop splitting a TD once it contains no more than sst (stop-splitting-threshold) iterations, i.e., combine sst iterations into one chunk. [Flowchart: if TD.#it > TD.sst, split the TD and place one half on the deque; otherwise execute the remaining iterations.]
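The flowchart above corresponds to roughly the following sketch, under the same kind of assumptions as the EBS sketch (hypothetical TD type with an sst field, local deque, stand-in loop body):

    #include <cstddef>
    #include <deque>

    struct TD { std::size_t begin, end, sst; };            // iteration range + stop-splitting threshold

    void run_iteration(std::size_t i) { (void)i; }         // stand-in for the do-all loop body

    void sp_execute(TD td, std::deque<TD>& local_deque) {
        while (td.end - td.begin > td.sst) {               // TD.#it > TD.sst ?
            std::size_t mid = td.begin + (td.end - td.begin) / 2;
            local_deque.push_back({mid, td.end, td.sst});  // split: place the upper half on the deque
            td.end = mid;
        }
        for (std::size_t i = td.begin; i != td.end; ++i)   // execute the remaining <= sst iterations
            run_iteration(i);
    }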

  13. SP's splitting: Overall View. [Splitting-tree diagram for N = 4096 iterations on 64 cores with sst = 2: splitting stops at TDs of 2 iterations, so the tree still reaches width N/2 and depth log N.]

  14. What determines a good sst? It must be small enough that enough parallelism is created (to keep all processors busy and load-balanced), yet large enough to avoid excessive splitting. Goal: find a happy medium. How?

  15. TBB: Suggested Procedure for Determining sst. TBB calls sst "grain-size". This is verbatim from TBB's reference manual: "Set the grainsize parameter to 10,000. This value is high enough to amortize scheduler overhead sufficiently for practically all loop bodies, but may unnecessarily limit parallelism. Run your algorithm on one processor. Start halving the grainsize parameter and see how much the algorithm slows down as the value decreases. A slowdown of about 5-10% is a good setting for most purposes."
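For reference, this is roughly where the tuned grainsize ends up in TBB code: it is passed to blocked_range and combined with simple_partitioner, which stops splitting once a subrange holds at most grainsize iterations (the loop body here is just an illustration, not one of the paper's benchmarks):

    #include <cstddef>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/partitioner.h>

    void scale(float* a, std::size_t n, std::size_t grainsize) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, n, grainsize),      // grainsize plays the role of sst
            [=](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    a[i] *= 2.0f;                                   // illustrative loop body
            },
            tbb::simple_partitioner());                             // EBS with the SP stop test
    }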

  16. SP's Grainsize Drawbacks. SP's suggested procedure for determining sst is: manual, for each do-all loop, and requires multiple re-executions; not performance portable, neither to different datasets nor to different platforms; and not adaptive to context, e.g., executing a do-all that creates 10,000 iterations from the original serial thread vs. from a nested context. The grain size is fixed.

  17. Summary of SP problems. Manual sst: tedious, hurts productivity. Fixed sst: code is not performance portable. Excessive splitting: performance penalty.

  18. #2: Auto Partitioner EBS (AP). When a TD with N iterations is created, we want it split into enough TDs to create sufficient parallelism, but not too much: split it recursively into K*P chunks, where P is the number of cores and K is a small constant. [Flowchart: if TD.chunks > 1, split the TD and place one half on the deque; otherwise execute the remaining iterations.]
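In TBB this is the auto_partitioner, which needs no manual grainsize; a minimal usage sketch (again with an illustrative loop body):

    #include <cstddef>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/partitioner.h>

    void scale(float* a, std::size_t n) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, n),                 // no grainsize / sst to tune
            [=](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    a[i] *= 2.0f;                                   // illustrative loop body
            },
            tbb::auto_partitioner());                               // initially aims for ~K*P chunks
    }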

  19. AP's splitting: Overall View. [Splitting-tree diagram for N = 4096 iterations on P = 64 cores with chunks = 2*P (K = 2): splitting stops once 128 chunks of 32 iterations each have been created, i.e., width 128 and depth 8, instead of EBS's width N and depth log N + 1.]

  20. Comparing AP to SP. No manual sst (grain size). Performance portable: to different datasets (somewhat; not for small amounts of parallelism) and to different platforms (#cores). ...but AP is NOT adaptive to context: for n levels of nesting, (K*P)^n TDs are created. Excessive splitting!
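To put an illustrative number on that (not a measurement from the paper): with P = 64 cores and K = 2, one do-all yields 128 TDs, two nested levels already yield 128^2 = 16,384 TDs, and three levels yield over two million, all for a machine with only 64 cores.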

  21. Outline: Introduction/Motivation, Work-Stealing Background, Lazy Binary Splitting, Experimental Evaluation, Conclusion

  22. Our Approach: Lazy Binary Splitting (LBS). 1st insight: it is unlikely to be profitable to split a TD to create more work for other cores to steal if they are busy! How do we know if others are busy? 2nd insight: we can check whether the local deque of a processor is empty as an approximation of whether the other processors are likely to be busy: deque non-empty ⇒ TDs are not being stolen ⇒ others are busy. LBS: if the local deque is not empty, postpone splitting by executing a few (ppt) iterations of the TD locally, then check again. Run-time granularity adaptation based on load.

  23. Our Approach: Lazy Binary Splitting (LBS) [2]. The profitable parallelism threshold (ppt) ensures that extremely fine-grain parallelism is coarsened. ppt also ensures deque checks are not performed too frequently, which could harm performance. ppt is determined statically by the compiler. [Flowchart: while TD.#it > TD.ppt, check the local deque: if it is empty, split the TD and place one half on the deque; if it is not, execute ppt iterations and check again. Once TD.#it ≤ TD.ppt, execute the remaining iterations.]
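The flowchart corresponds to roughly the following per-core loop; this is a minimal sequential sketch under the same assumptions as the earlier EBS/SP sketches (hypothetical TD type with a ppt field, local deque, stand-in loop body), not the paper's actual implementation:

    #include <cstddef>
    #include <deque>

    struct TD { std::size_t begin, end, ppt; };            // iteration range + profitable parallelism threshold

    void run_iteration(std::size_t i) { (void)i; }         // stand-in for the do-all loop body

    void lbs_execute(TD td, std::deque<TD>& local_deque) {
        while (td.end - td.begin > td.ppt) {               // TD.#it > TD.ppt ?
            if (local_deque.empty()) {                     // deque empty: others may need work, so split
                std::size_t mid = td.begin + (td.end - td.begin) / 2;
                local_deque.push_back({mid, td.end, td.ppt});
                td.end = mid;
            } else {                                       // deque non-empty: others likely busy,
                for (std::size_t i = 0; i < td.ppt; ++i)   // so postpone splitting: run ppt iterations
                    run_iteration(td.begin++);             // and then re-check the deque
            }
        }
        for (std::size_t i = td.begin; i != td.end; ++i)   // at most ppt iterations left: just run them
            run_iteration(i);
    }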

  24. LBS: High-Level Picture. [Splitting diagram for N = 4096 iterations on P = 64 cores with ppt = 8: TDs are split only while the local deque is empty, and between checks a TD is consumed ppt iterations at a time, so the width stays around 128 (about 2·P) instead of growing toward N.]

  25. Why LBS is better. Automatic. Performance portable, across datasets and across platforms. Adaptive to context: it adapts the granularity of parallelism based on load during execution.

  26. Outline: Introduction/Motivation, Work-Stealing Background, Lazy Binary Splitting, Experimental Evaluation, Summary

  27. Evaluation Platform: XMT. XMT's goals: ease of programming, with an easy workflow from PRAM algorithm to XMT program (taught to undergrads and high-school students) and backwards compatibility with serial code (one heavy core, the Master TCU, for serial code, plus a plurality of simple parallel cores, the TCUs); and good performance, via a high-bandwidth, low-latency interconnect, efficient HW scheduling of the outer parallelism, and fast, scalable global synchronization.

  28. Evaluation Platform. We chose XMT because it is easy to program (productivity is important for general-purpose parallelism), does not compromise on performance, and has more than a few (4 or 8) cores, which lets us demonstrate the scalability of LBS. We ran our benchmarks on a 75MHz XMT FPGA prototype: 64 TCUs in 8 clusters, 8 shared 32K L1 cache modules, one mult/div unit per cluster, 4 prefetch buffers per TCU, and 32 integer registers. No floating-point support.

  29. Benchmarks Used

  30. Comparing LBS to AP. [Chart comparing three configurations across the benchmarks: APdefault, APXMT, and LBS.]

  31. Comparing LBS to AP: Results. Overall, LBS is 16.2% faster than APXMT and 38.9% faster than APdefault. LBS is faster on benchmarks with fine-grain iterations (FW, bfs, SpMV); LBS and AP are comparable on the rest.

  32. Comparing LBS to SP. SP needs manual tuning (sst); LBS doesn't. Two training scenarios: the common one, SPtr/ex (train on one dataset, execute on a different one), and the uncommon one, SPex/ex (train and execute on the same dataset). The gap between SPtr/ex and SPex/ex shows SP's lack of performance portability across datasets.

  33. Comparing LBS to SP

  34. LBS vs SP: Results. Overall vs SPtr/ex: LBS is 19.5% faster on average, and up to 65.7% faster (SpMV); it falls behind only on tsp, and only by 2.2%. Overall vs SPex/ex: LBS is 3.8% faster on average, and again only falls behind on tsp (by 2.2%). So even in the unrealistic case, LBS is preferable.

  35. Additional Comparisons. Experimental comparisons against other work-stealing algorithms (SWS, EBS1, LBS1; serializing inner parallelism), plus a quantitative comparison of the schedulers in terms of the number of deque transactions and the number of synchronization points needed ... read the paper!

  36. LBS's Scalability: Speedups vs. 1 TCU. Super-linear speedups are explained by complex cache behavior. The average speedup is linear: LBS is scalable.

  37. Performance Benefits: Speedups vs. Serial on the MTCU. Good speedups even for the irregular benchmarks (qs, tsp, queens, bfs). XMT and LBS are a promising combination.

  38. Conclusions. LBS significantly reduces splitting overheads and delivers superior performance. The combination of XMT and LBS seems promising for general-purpose parallel computing. How will LBS perform on traditional multi-cores?

  39. Questions?
