Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology & Performance

Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology & Performance. Sanjeev Kumar, Dongming Jiang, Rohit Chandra, Jaswinder Pal Singh. Classic Study on Synchronization. Software Algorithms for Locks and Barriers [Mellor-Crummey et al., TOCS’91]

Presentation Transcript


  1. Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology & Performance • Sanjeev Kumar, Dongming Jiang, Rohit Chandra, Jaswinder Pal Singh

  2. Classic Study on Synchronization • Software Algorithms for Locks and Barriers [Mellor-Crummey et al., TOCS’91] • Multiprocessor machines: BBN Butterfly, Sequent Symmetry • Microbenchmarks • Little benefit from special hardware support • Handle memory/network contention in software

  3. Case for Hardware Support • Fetch&Op [Laudon et al., ISCA’97]: Origin 2000; Microbenchmarks (Counter & Barrier) • QOLB [Kagi et al., ISCA’97]: Simulations; Microbenchmarks & Applications (Locks) • Better performance with Hardware Support

  4. Our Study • Re-examine synchronization • 64-processor Origin 2000 • New architectures: CC-NUMA • New primitives: LL-SC • Applications (SPLASH-2) and microbenchmarks • Applications: Little benefit from H/W support • Locks: Small performance benefit sometimes • Barriers: Load imbalance dominates

  5. Outline • Background • Performance evaluation: Microbenchmarks • Synchronization primitives on Origin 2000 • Lock and Barrier algorithms and performance • Performance evaluation: Applications • Is further hardware support valuable ? • Conclusions

  6. Synchronization Primitives on Origin 2000 • LL-SC: 2 instructions, cached, flexible • Fetch&Op: special locations, uncached, inflexible (e.g. Atomic Swap) • Performance tradeoffs, atomic update: contention causes retries (LL-SC); contention at memory (Fetch&Op) • Performance tradeoffs, wait: spinning in cache, cache coherence (LL-SC); spinning traffic, no cache coherence (Fetch&Op)
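The retry-versus-single-operation tradeoff on this slide can be sketched in portable C++. This is an illustration under assumptions, not the Origin 2000 primitives themselves: `compare_exchange_weak` stands in for an LL-SC pair (compilers typically lower it to one on LL-SC machines), `fetch_add` stands in for a Fetch&Op-style single operation, and the helper names `fetch_inc_retry`, `fetch_inc_single`, and `counters_agree` are made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Fetch-and-increment built from a retry loop. A failed exchange plays the
// role of a failed store-conditional: another processor updated the location
// first, so the update has to be retried.
inline int fetch_inc_retry(std::atomic<int>& x) {
    int old = x.load(std::memory_order_relaxed);
    while (!x.compare_exchange_weak(old, old + 1,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
        // 'old' now holds the freshly observed value; loop and try again.
    }
    return old;
}

// One atomic read-modify-write with no software retry loop.
inline int fetch_inc_single(std::atomic<int>& x) {
    return x.fetch_add(1, std::memory_order_acq_rel);
}

// Hypothetical demo helper: every thread increments both counters and the
// totals are compared against the expected number of increments.
inline bool counters_agree(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::atomic<int> a{0}, b{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                fetch_inc_retry(a);
                fetch_inc_single(b);
            }
        });
    for (auto& th : pool) th.join();
    return a.load() == threads * iters && b.load() == threads * iters;
}
```

Both variants give the same count; the difference the slide points at is where contention shows up (retries in the first, serialization at the memory location in the second).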

  7. Lock Algorithms (1): Simple • One location (Available? No) • Atomic test-and-set • LL-SC • Fetch&Op [Diagram: four processors (P) contending for a single lock location]
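A minimal sketch of the simple lock, assuming C++ `std::atomic` as the source of the atomic test-and-set (here `exchange`). The class name `SimpleLock`, the helper `simple_lock_count`, and the `yield()` call inside the spin loop are choices made for this example, not details from the talk.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Simple lock: one shared location, acquired with an atomic test-and-set.
class SimpleLock {
    std::atomic<bool> held{false};
public:
    void lock() {
        // exchange() writes true and returns the previous value; a previous
        // value of true means the lock was not available, so keep trying.
        while (held.exchange(true, std::memory_order_acquire))
            std::this_thread::yield();  // assumption: be polite on small hosts
    }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Hypothetical demo helper: threads increment a plain counter under the lock.
inline long simple_lock_count(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    SimpleLock lk;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lk.lock();
                ++counter;  // protected by the lock
                lk.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Every waiter hammers the same location, which is the contention problem the later algorithms try to avoid.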

  8. Lock Algorithms (2): Ticket • Like in a bakery • Proportional backoff • Atomic fetch-and-increment • LL-SC • Fetch&Op [Diagram: Next-Ticket and Now-Serving counters and four processors (P), with ticket values 125, 126, 127, 132]
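The ticket lock with proportional backoff might look roughly like the following. It is a sketch under assumptions: `fetch_add` plays the role of the atomic fetch-and-increment, the backoff is implemented as a number of `yield()` calls scaled by the hypothetical constant `kBackoffUnit`, and `TicketLock` and `ticket_lock_count` are names invented here.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TicketLock {
    static constexpr unsigned kBackoffUnit = 1;  // hypothetical tuning knob
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
public:
    void lock() {
        // Take a ticket, as in a bakery.
        const unsigned my = next_ticket.fetch_add(1, std::memory_order_relaxed);
        for (;;) {
            const unsigned cur = now_serving.load(std::memory_order_acquire);
            if (cur == my) return;
            // Proportional backoff: pause longer the more waiters are ahead.
            const unsigned ahead = my - cur;
            for (unsigned i = 0; i < ahead * kBackoffUnit; ++i)
                std::this_thread::yield();
        }
    }
    void unlock() {
        // Only the holder writes now_serving, so load-then-store is safe.
        const unsigned cur = now_serving.load(std::memory_order_relaxed);
        now_serving.store(cur + 1, std::memory_order_release);
    }
};

// Hypothetical demo helper: threads increment a plain counter under the lock.
inline long ticket_lock_count(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    TicketLock lk;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lk.lock();
                ++counter;
                lk.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Waiters only read `now_serving`, and the backoff spaces those reads out according to queue position.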

  9. Lock Algorithms (3): MCS Queuing • Queue • Local spinning • Atomic Compare-and-Swap • LL-SC • Not Fetch&Op [Diagram: queue of nodes and four processors (P)]
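For reference, a compact sketch of the MCS queuing lock in C++. It follows the standard published algorithm (atomic exchange to enqueue, compare-and-swap to release when no successor is visible); the type names, the `yield()` calls in the spin loops, and the helper `mcs_lock_count` are assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each waiter brings its own queue node and spins only on that node.
struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class MCSLock {
    std::atomic<MCSNode*> tail{nullptr};
public:
    void lock(MCSNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Atomically append this node to the queue.
        MCSNode* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            prev->next.store(&me, std::memory_order_release);
            // Local spinning: wait on our own flag, not a shared location.
            while (me.locked.load(std::memory_order_acquire))
                std::this_thread::yield();
        }
    }
    void unlock(MCSNode& me) {
        MCSNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No visible successor: try to swing the tail back to empty.
            MCSNode* expected = &me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire))
                return;
            // A successor is enqueueing; wait until it links itself in.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr)
                std::this_thread::yield();
        }
        succ->locked.store(false, std::memory_order_release);  // hand over
    }
};

// Hypothetical demo helper: each thread uses its own queue node.
inline long mcs_lock_count(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    MCSLock lk;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            MCSNode me;
            for (int i = 0; i < iters; ++i) {
                lk.lock(me);
                ++counter;
                lk.unlock(me);
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

The release path is where compare-and-swap is needed, which is consistent with the slide listing the algorithm as not implementable with Fetch&Op alone.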

  10. Lock-Delay Microbenchmark [Plot legend: Simple (LL-SC), TicketProp (LL-SC), MCS (LL-SC), TicketProp (Fetch&Op)]

  11. Barrier Algorithms (1): Central • Increment a counter • Wait on a location • Atomic fetch-and-increment • LL-SC • Fetch&Op [Diagram: "Arrived" counter (5), "Go" flag (No), four processors (P)]
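A central barrier along these lines can be sketched as below. The counter and flag are named after the slide's "Arrived" and "Go"; the sense-reversal trick that makes the barrier reusable across phases, the `yield()` in the spin loop, and the helper `central_barrier_holds` are additions assumed for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class CentralBarrier {
    const int n;
    std::atomic<int> arrived{0};   // incremented by every arriving thread
    std::atomic<bool> go{false};   // the single location everyone waits on
public:
    explicit CentralBarrier(int participants) : n(participants) {}

    // local_sense is per-thread state and must start out false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;
        if (arrived.fetch_add(1, std::memory_order_acq_rel) == n - 1) {
            // Last arrival: reset the counter, then release everyone.
            arrived.store(0, std::memory_order_relaxed);
            go.store(local_sense, std::memory_order_release);
        } else {
            while (go.load(std::memory_order_acquire) != local_sense)
                std::this_thread::yield();
        }
    }
};

// Hypothetical demo helper: after round r, all increments of rounds 0..r
// must be visible, otherwise some thread left the barrier too early.
inline bool central_barrier_holds(int threads, int rounds) {
    assert(threads > 0 && rounds >= 0);
    CentralBarrier bar(threads);
    std::atomic<int> total{0};
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            bool sense = false;
            for (int r = 0; r < rounds; ++r) {
                total.fetch_add(1);
                bar.wait(sense);
                if (total.load() < (r + 1) * threads) ok.store(false);
            }
        });
    for (auto& th : pool) th.join();
    return ok.load();
}
```

All threads increment one counter and spin on one flag, so both locations become hotspots as the processor count grows.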

  12. Barrier Algorithms (2): Tournament • Tree of locations • Spin on different locations • Avoid hotspot and contention • Atomic fetch-and-increment • LL-SC • Fetch&Op [Diagram: tree of flag locations above four processors (P)]
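One way to realize a tree of locations with fetch-and-increment is sketched below: each tree node has a two-way counter, the second arrival at a node advances to the parent, and the first arrival spins on that node's own flag. This is an assumption about what the slide has in mind; the classic tournament barrier of Mellor-Crummey and Scott fixes the winner of each match statically and needs no atomic increment. The names `TreeBarrier` and `tree_barrier_holds`, and the power-of-two restriction, are choices made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class TreeBarrier {
    struct Node {
        std::atomic<int> count{0};     // arrivals at this node (0, 1, or 2)
        std::atomic<bool> sense{false}; // flag the first arrival spins on
    };
    const int n;
    std::vector<Node> nodes;  // heap layout: node 1 is the root, 0 is unused
public:
    explicit TreeBarrier(int participants)
        : n(participants), nodes(static_cast<std::size_t>(participants)) {
        assert(n > 0 && (n & (n - 1)) == 0);  // assumption: power of two
    }

    // tid is in [0, n); local_sense is per-thread state, initially false.
    void wait(int tid, bool& local_sense) {
        local_sense = !local_sense;
        int won[32];  // nodes at which this thread arrived second
        int depth = 0;
        for (int node = (n + tid) / 2; node >= 1; node /= 2) {
            if (nodes[node].count.fetch_add(1, std::memory_order_acq_rel) == 0) {
                // First arrival: wait here, on a location few others touch.
                while (nodes[node].sense.load(std::memory_order_acquire) != local_sense)
                    std::this_thread::yield();
                break;
            }
            nodes[node].count.store(0, std::memory_order_relaxed);  // reset for reuse
            won[depth++] = node;
        }
        // Wake the waiter at every node this thread passed, top down.
        while (depth > 0)
            nodes[won[--depth]].sense.store(local_sense, std::memory_order_release);
    }
};

// Hypothetical demo helper, same check as for a central barrier.
inline bool tree_barrier_holds(int threads, int rounds) {
    TreeBarrier bar(threads);
    std::atomic<int> total{0};
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            bool sense = false;
            for (int r = 0; r < rounds; ++r) {
                total.fetch_add(1);
                bar.wait(t, sense);
                if (total.load() < (r + 1) * threads) ok.store(false);
            }
        });
    for (auto& th : pool) th.join();
    return ok.load();
}
```

Each location is touched by at most two threads per phase, which is how the tree avoids the hotspot of the central barrier.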

  13. Barrier-Null Microbenchmark [Plot legend: Central (LL-SC), Tournament (LL-SC), Central (Fetch&Op), Hybrid (LL-SC, Fetch&Op)]

  14. Microbenchmarks Summary • LL-SC: simplest algorithms perform poorly (e.g. Simple lock and Central barrier); smarter algorithms perform much better • Fetch&Op supports faster synchronization

  15. Outline • Background • Performance evaluation: Microbenchmarks • Performance evaluation: Applications • Is further hardware support valuable ? • Conclusions

  16. Choosing Applications: Methodology • Applications from SPLASH-2 • Undo optimizations (Added locks and barriers) • Problem Size • At least 25 fold speedup on 64 processors • Base case • Best LL-SC lock and barrier

  17. Base Performance

  18. Application Performance Using Different Locks • Base: MCS, LL-SC • Better algorithm helps • Fetch&Op traffic hurts [Plot annotation: 1.65]

  19. Application Performance Using Different Barriers • Base: Tournament, LL-SC • Load imbalance dominates • Fetch&Op traffic hurts [Plot annotations: 2.68, 1.52]

  20. Applications Summary • LL-SC • Locks : Better algorithm helps • Barriers : Load imbalance dominates • Fetch&Op • Traffic due to spinning hurts performance • Different from the microbenchmarks

  21. Outline • Background • Performance evaluation: Microbenchmarks • Performance evaluation: Applications • Is further hardware support valuable ? • Locks • Barriers • Conclusions

  22. Sensitivity to Lock Performance • Raytrace, Radiosity • Adding round-trip network delays • Extrapolate: 20-30% improvement from better hardware

  23. When do faster locks help Applications? • Applications sensitive to Lock performance • Raytrace, Radiosity (~20-30%) • Substantial time in synchronization • Small contended critical sections • Critical section size = actual + lock overhead • Lock overhead dilates the critical section • Effect on performance depends on size of critical section • 2 Apps: ~5 µs (1-2 updates to shared locations)
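The dilation argument on this slide can be written as a short worked relation. The symbols $T_{cs}$ (actual critical section time), $T_{lock}$ (lock overhead), $T_{eff}$, $f$, and $N$ are notation introduced here for illustration; they do not appear in the talk.

```latex
T_{eff} = T_{cs} + T_{lock},
\qquad
f = \frac{T_{lock}}{T_{cs} + T_{lock}},
\qquad
T_{serial}(N) \ge N \, T_{eff}
```

Here $f$ is the fraction of each effective critical section that is lock overhead, and $T_{serial}(N)$ is the time for $N$ acquisitions of one contended lock, which cannot overlap. For a fixed $T_{lock}$, $f$ grows as $T_{cs}$ shrinks, so a faster lock pays off most when the critical section is only a few shared updates long and is heavily contended.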

  24. Can we fix contention problems in these cases in the Application? • Yes. Fix was fairly easy • Raytrace • Global counter → Partial reductions • Radiosity • Single buffer allocation queue → Multiple • Tasks added to local queue → Distribute • Significant performance improvement • Raytrace: 90%, Radiosity: 220%
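The "global counter to partial reductions" idea can be illustrated with a small sketch. This is not the actual Raytrace code: the function names, the use of `std::mutex` for the locked baseline, and the 64-byte padding (an assumed cache line size) are all hypothetical choices for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Baseline: every update takes one global lock around one shared counter.
inline long count_with_global_lock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::mutex m;
    long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> guard(m);
                ++total;  // small, heavily contended critical section
            }
        });
    for (auto& th : pool) th.join();
    return total;
}

// Padded so that neighbouring partial counts do not share a cache line
// (64 bytes is an assumption, not a property of any particular machine).
struct PaddedCount {
    long value = 0;
    char pad[64 - sizeof(long)];
};

// Partial reduction: each thread updates a private count without locking,
// and the partial counts are summed once after the threads finish.
inline long count_with_partials(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::vector<PaddedCount> partial(static_cast<std::size_t>(threads));
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&partial, t, iters] {
            for (int i = 0; i < iters; ++i) ++partial[t].value;
        });
    for (auto& th : pool) th.join();
    long total = 0;
    for (const auto& p : partial) total += p.value;  // final reduction
    return total;
}
```

Both functions return the same total; the second removes the contended lock from the inner loop, which is the kind of application-level fix the slide describes.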

  25. Barriers • Load-imbalance dominates • Other applications • Well-balanced with little communication • Like the microbenchmarks; Real applications ? • Well-balanced computation & communication • SOR : nearest neighbor on a grid • Barriers : 61 % execution time • Still dominates. Communication Imbalance

  26. Summary & Conclusions • Fetch&Op does not help Applications • At least for well-known lock & barrier algorithms • Using applications is important • Little benefit from hardware support • Locks: helps sometimes, but fixable in the application • Barriers: load imbalance dominates • Sound Methodology

  27. Tournament barrier with Fetch&Op • Worse performance • Preliminary measurements indicated worse overhead in addition to traffic • Barrier performance did not make a difference in the applications

  28. Small problem size • Raytrace : Decreases lock time • Barnes : Load-imbalance increases • Water-Nsq : Load-imbalance and Serialization • Ocean & SOR : Barrier time remains same • Radiosity & Water-Spatial : Not available

  29. SOR Breakdown • Load imbalance dominates time spent in barriers
