
The Thrifty Barrier Energy-Aware Synchronization in Shared-Memory Multiprocessors


Presentation Transcript


  1. The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors

  2. Motivation
  • Multiprocessor architectures sprouting everywhere
    • large compute servers
    • small servers, desktops
    • chip multiprocessors
  • High energy consumption a problem – more so in MPs
  • Most power-aware techniques tailored to uniprocessors
  • Multiprocessors present unique challenges
    • processor coordination, synchronization
  The Thrifty Barrier – Li, Martínez, and Huang

  3. Case: Barrier Synchronization
  • Fast threads spin-wait for slower ones
  • Spin-wait wasteful by definition
    • quick reaction
    • but only last iteration useful
  [Diagram: per-thread timeline of compute and spin-wait phases]
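To make the spin-wait waste concrete, here is a minimal sketch (ours, not the paper's) of a classic sense-reversing spin barrier; the busy-wait loop at the end is exactly where fast threads burn energy waiting for the slowest arrival:

```python
import threading

class SpinBarrier:
    """Minimal sense-reversing spin barrier: early arrivals busy-wait
    on a shared sense flag until the last arrival flips it."""

    def __init__(self, n_threads):
        self.n = n_threads
        self.count = n_threads
        self.sense = False
        self.lock = threading.Lock()

    def wait(self):
        local_sense = not self.sense          # sense for this barrier episode
        with self.lock:
            self.count -= 1
            last = (self.count == 0)
        if last:
            self.count = self.n               # reset for the next episode
            self.sense = local_sense          # release: flip the shared sense
        else:
            while self.sense != local_sense:  # spin-wait: wasted energy
                pass
```

The thrifty barrier replaces that final spin loop, for all but the last arrival, with a transition to a processor sleep state.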

  4. Proposal: Thrifty Barrier
  • Reduce spin-wait energy waste in barriers
    • leverage existing processor sleep states (e.g. ACPI)
  • Minimize impact on execution time
    • achieve timely wake-up
  [Diagram: conventional vs. thrifty barrier timelines]

  5. Challenges
  • Should sleep?
    • transition times (sleep + wake-up) non-negligible
  • What sleep state?
    • more energy savings → longer transition times
  • When to wake up?
    • early w.r.t. barrier release → may hurt energy savings
    • late w.r.t. barrier release → may hurt performance
  Must predict barrier stall time accurately

  6. Findings
  • Many barrier stall times large enough to leverage sleep states
  • Stall times predictable
    • discriminate through PC indexing
    • predict indirectly using barrier interval times
  • Timely wake-up: combination of two mechanisms
    • coherence message bounds wake-up latency
    • watchdog timer anticipates wake-up

  7. Thrifty Barrier Mechanism
  [Flowchart: barrier arrival → stall time prediction → sleep? → sleep state S1/S2/S3 → wake-up signal → residual spin → barrier departure]

  8. Sleep Mechanism
  [Flowchart as on slide 7: barrier arrival → stall time prediction → sleep? → sleep state S1/S2/S3 → wake-up signal → residual spin → barrier departure]

  9. Predicting Stall Time
  • Splash-2's FMM example: 3 important barriers, 4 iterations
    • randomly picked thread (always the same)
  • PC indexing reduces variability
  • Interval time (BIT) more stable metric than stall time (BST)

  10. Stall Time vs. Interval Time
  • Barriers separate computation phases
  • PC indexing reduces variability
  • Barrier stall time (BST) varies considerably
    • even with PC indexing
    • barrier-, but also thread-dependent
    • computation shifts among threads across invocations
  • Barrier interval time (BIT) varies much less
    • quite stable if PC indexing used
    • barrier-, but not thread-dependent
    • last-value prediction ok for most applications

  11. Predicting Stall Time Indirectly
  • Can use BIT to predict BST indirectly
    • compute time measurable upon arrival at barrier
    • subtract from predicted BIT to derive predicted BST
  • How to manage time info?
  [Diagram: timeline marking Compute_t, BST_t, and BIT]
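The indirect prediction above can be sketched in a few lines: a last-value predictor indexed by barrier PC, with the stall-time prediction derived by subtraction (the table and function names are ours, purely illustrative):

```python
# Shared last-value predictor, indexed by the barrier's PC to
# discriminate between different barriers in the program.
bit_table = {}  # barrier PC -> last observed barrier interval time (BIT)

def predict_bst(barrier_pc, compute_time):
    """Indirect stall-time prediction: pBST = pBIT - compute time
    measured since the previous barrier release. Returns None when
    there is no history yet (thread should spin conventionally)."""
    pbit = bit_table.get(barrier_pc)
    if pbit is None:
        return None
    return max(0.0, pbit - compute_time)  # clamp: compute may exceed pBIT

def update_predictor(barrier_pc, actual_bit):
    """Last-value prediction: the releasing thread simply records
    the BIT it just observed for this barrier."""
    bit_table[barrier_pc] = actual_bit
```

A fast thread (small compute time) gets a large predicted stall and can afford a deep sleep state; a thread arriving near the end gets a prediction close to zero and keeps spinning.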

  12. Managing Time Info
  • Threads depart from barrier instance b-1 toward instance b
  • Each thread t has a local record of release timestamp BRTS_{t,b-1}
  • Assumptions:
    • no global clock
    • local wallclock active even if CPU sleeps
    • all CPUs same nominal clock frequency
  [Diagram: timeline from barrier b-1 to barrier b, marking BRTS_{t,b-1}]

  13. Managing Time Info
  • Thread t arrives, knowing BRTS_{t,b-1} and Compute_{t,b}
    • make prediction pBIT_b
    • derive pBST_{t,b} = pBIT_b – Compute_{t,b}
    • use pBST_{t,b} to pick sleep state (if warranted)
      • best fit based on transition time
  [Diagram: timeline marking Compute_{t,b}, pBST_{t,b}, pBIT_b, and BRTS_{t,b-1}]
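The "best fit based on transition time" step can be sketched as a scan over a table of sleep states ordered shallow to deep; the state names and numbers below are illustrative placeholders loosely modeled on ACPI C-states, not the paper's parameters:

```python
# Hypothetical sleep-state table: deeper states save more energy but
# take longer to enter and exit. Ordered shallow -> deep.
SLEEP_STATES = [
    # (name, sleep+wake transition time in us, relative power)
    ("S1", 1.0,   0.6),
    ("S2", 50.0,  0.3),
    ("S3", 500.0, 0.1),
]

def pick_sleep_state(pbst_us, margin=1.0):
    """Best fit: the deepest state whose round-trip transition time
    still fits within the predicted stall time (scaled by a safety
    margin). Returns None if no state fits -> keep spinning."""
    best = None
    for name, transition_us, _power in SLEEP_STATES:
        if transition_us * margin < pbst_us:
            best = name  # later (deeper) states overwrite earlier ones
    return best
```

A predicted stall shorter than the shallowest transition yields None, matching the "Should sleep?" check on slide 5.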

  14. Managing Time Info
  • Last thread u arrives, knowing BRTS_{u,b-1}
    • derive actual BIT_b = time() – BRTS_{u,b-1}
    • update (shared) predictor with BIT_b
    • release barrier
  [Diagram: timeline marking BIT_b and BRTS_{u,b-1}]

  15. Managing Time Info
  • Every thread t (possibly after waking up late)
    • reads BIT_b from the updated predictor
    • computes actual BRTS_{t,b} = BRTS_{t,b-1} + BIT_b
  • Threads never use timestamps (BRTS) from other threads
    • no global clock is needed
  [Diagram: timeline marking BIT_b, BRTS_{t,b-1}, and BRTS_{t,b}]
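The per-thread timestamp bookkeeping from slides 12-15 reduces to two small steps on each thread's own local clock, which is why no global clock is needed; a sketch, with variable names following the slides:

```python
def on_arrival(brts_prev, now, pbit):
    """Thread t arrives at barrier b knowing BRTS_{t,b-1} (its own
    release timestamp from barrier b-1, on its own clock).
    Derive compute time and the indirect stall-time prediction."""
    compute = now - brts_prev            # Compute_{t,b}
    pbst = max(0.0, pbit - compute)      # pBST_{t,b} = pBIT_b - Compute_{t,b}
    return compute, pbst

def on_release(brts_prev, actual_bit):
    """After release (possibly after waking up late), every thread
    reads the actual BIT_b from the shared predictor and reconstructs
    its own release timestamp: BRTS_{t,b} = BRTS_{t,b-1} + BIT_b."""
    return brts_prev + actual_bit
```

Only the interval BIT_b is shared between threads; since intervals are clock-offset-free, each thread's reconstructed BRTS stays consistent with its local clock.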

  16. Thrifty Barrier Mechanism
  [Flowchart as on slide 7: barrier arrival → stall time prediction → sleep? → sleep state S1/S2/S3 → wake-up signal → residual spin → barrier departure]

  17. Wake-up Mechanism
  [Flowchart as on slide 7: barrier arrival → stall time prediction → sleep? → sleep state S1/S2/S3 → wake-up signal → residual spin → barrier departure]

  18. Wake-up Mechanism
  • Communicate barrier completion to sleeping CPUs
    • signal sent to CPU pin
    • options: external vs. internal wake-up
  • External (passive): initiated by the processor that releases the barrier
    • leverages the coherence protocol – invalidation to the spinlock
    • must supply spinlock address to cache controller
  • Internal (active): triggered by watchdog timer
    • programmed with predicted BST before going to sleep

  19. Early vs. Late Wake-up
  • Early wake-up (underprediction)
    • energy waste – residual spin
  • Late wake-up (overprediction)
    • possible impact on execution time
  • External wake-up guarantees late wake-up (but bounded)
  • Internal wake-up can lead to both (late not bounded)
  • Our approach: hybrid wake-up
    • external provides upper bound
    • internal strives for timely wake-up using prediction
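The hybrid scheme above can be modeled as a race between the two wake-up sources: the CPU wakes at whichever fires first. A sketch (ours, with illustrative parameter names):

```python
def wake_time(sleep_start, pbst, release_time, signal_latency):
    """Hybrid wake-up: the internal watchdog fires at the predicted
    stall time, while the external coherence signal (sent when the
    last thread releases the barrier) bounds how late the CPU can
    wake. The CPU wakes at whichever event comes first."""
    internal = sleep_start + pbst             # watchdog: prediction-driven
    external = release_time + signal_latency  # invalidation to the spinlock
    return min(internal, external)
```

With an underprediction the watchdog fires before the release (early wake-up, residual spin); with an overprediction the external signal caps the lateness at the signal latency, so a bad prediction cannot stall the CPU indefinitely.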

  20. Other Considerations (see paper)
  • Sleep states that do not snoop for coherence requests
    • flush dirty data before sleeping
    • defer invalidations to clean data
  • Overprediction threshold
    • case of frequent, swinging BITs of modest size
    • turn off prediction if overprediction exceeds threshold
  • Interaction with context switching and I/O
    • underprediction threshold
  • Time-sharing issues: multiprogramming, overthreading

  21. Experimental Setup
  • Simulated system: 64-node CC-NUMA
    • 6-way dynamic superscalar
    • L1: 16KB, 64B lines, 2-way, 2clk; L2: 64KB, 64B lines, 8-way, 12clk
    • 16B/4clk memory bus, 60ns SDRAM
    • hypercube, wormhole, 4clk pipelined routers
    • 16clk pin to pin
  • Energy modeling: Wattch (CPU + L1 + L2)
    • sleep states along the lines of the Pentium family

  22. Experimental Setup
  • All Splash-2 applications except:
    • Raytrace – no barriers
    • LU – better version w/o barriers widely available
  • Efficiency (64p): 40-82%, avg. 58%
  • Target group: ≥ 10%

  23. Energy Savings

  24. Performance Impact

  25. Related Work Highlights
  • Quite a bit of work in the uniprocessor domain
  • Elnozahy et al.
    • server farms, clusters
    • thrifty barrier targets shared memory, parallel apps.
  • Moshovos et al., Saldanha and Lipasti
    • energy-aware cache coherence
    • probably compatible with and complementary to the thrifty barrier

  26. Conclusions
  • Energy-aware MP mechanisms can and should be pursued
  • Case of energy-aware barrier synchronization
    • simple indirect prediction of barrier stall time
    • hybrid wake-up scheme to minimize impact on execution time
  • Encouraging results for target applications
    • 17% avg. energy savings
    • 2% avg. performance impact

  27. The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors
