
Supporting Highly-Decoupled Thread-Level Redundancy for Parallel Programs


Presentation Transcript


  1. Supporting Highly-Decoupled Thread-Level Redundancy for Parallel Programs M Wasiur Rashid, Michael Huang, University of Rochester

  2. Motivation for Thread-Level Redundancy • Noise-induced hardware errors are an important threat • Shrinking transistors are fundamentally more vulnerable • Scaling increases both noise sources and victims • Error frequency will rise in unprotected circuits • Efficient and effective protection against errors in logic is: • Important: logic errors rising (Seifert et al. IRPS’01, Shivakumar et al. DSN’02) • Challenging: no ECC-equivalent partial redundancy • TLR is natural and well understood: AR-SMT, SRT, CRT, SRTR, CRTR • Advantages of TLR over circuit/device-level redundancies • Flexible: can easily be turned on or off on demand • Avoids fundamental issues of lower-level redundancies

  3. Design Goals • Support parallel programs efficiently • Not a trivial extension of supporting single-threaded apps • Cover memory subsystem logic (coherence, consistency) • Manage natural non-determinism in parallel execution • Decouple redundancy support from core logic • Minimize impact on the critical path • Minimize design intrusion into core logic • Decouple the timing of the redundant threads • Do not require lock-stepping • Validation can happen long after retirement • To tolerate long latencies in communicating and validating results

  4. High-Level Overview • The two wavefronts move independently (including through the memory hierarchy) • Architectural state is compared only once per epoch, so a large buffering capacity is needed • Non-determinism handling and the buffering are implemented entirely in off-path support
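A minimal sketch of the epoch-level comparison, assuming a simple per-core architectural snapshot (ArchState, validate_epoch, and the register-file layout are illustrative names, not structures from the paper): the two wavefronts run at their own pace, and only their architectural state is compared at each epoch boundary, which is what permits the large buffering capacity and the loose timing coupling.

```cpp
#include <array>
#include <cstdint>

// Hypothetical per-core architectural snapshot taken at an epoch boundary.
struct ArchState {
    std::array<uint64_t, 32> regs;  // integer register file
    uint64_t pc;                    // program counter at the boundary
    bool operator==(const ArchState& o) const {
        return regs == o.regs && pc == o.pc;
    }
};

// Compare the computing and verification wavefronts once per epoch.
// A mismatch means a hardware error slipped in somewhere during the epoch;
// recovery (e.g., rolling both wavefronts back to the last validated epoch
// and discarding unvalidated buffered writes) is outside this sketch.
bool validate_epoch(const ArchState& computing, const ArchState& verification) {
    return computing == verification;
}
```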

  5. Decoupling with the Post-Commit Buffer • The PCB handles redundancy; the L1 handles semantic processing • The PCB keeps written cache lines and writes them back only after validation • The L1 need not write back at all (a dirty line is simply discarded) • Stores from the processor also write into the PCB • The timing-critical path of the L1 stays intact • An L1 miss or coherence activity may need to search the PCB [Figure: L1 cache backed by the PCB]
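A rough sketch of this decoupling, assuming the PCB can be modeled as a per-core FIFO of written lines (PostCommitBuffer, on_store, and drain_validated_epoch are illustrative names, not the paper's interfaces): stores deposit their lines into the PCB as well as the L1, a miss or external request may have to search it, and lines drain to the rest of the hierarchy only once the owning epoch has been validated.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

using Addr = uint64_t;
using Line = std::vector<uint8_t>;

// Illustrative post-commit buffer: holds lines written in epochs that have
// not yet been validated; the L1 itself never writes back.
class PostCommitBuffer {
    struct Entry { Addr tag; Line data; };
    std::deque<Entry> entries;   // oldest epoch's lines at the front
public:
    // Every committed store also deposits its line here.
    void on_store(Addr tag, const Line& data) { entries.push_back({tag, data}); }

    // An L1 miss (or an external coherence request) may need to search the PCB,
    // since the freshest copy of a line can live here rather than in L1 or below.
    // This sequential newest-first scan is only for illustration; the state-based
    // scheme on a later slide makes the real search fully parallel.
    std::optional<Line> search(Addr tag) const {
        for (auto it = entries.rbegin(); it != entries.rend(); ++it)
            if (it->tag == tag) return it->data;
        return std::nullopt;
    }

    // After an epoch's architectural state is validated, its lines can be
    // written back to the rest of the memory hierarchy and released.
    template <class WriteBackFn>
    void drain_validated_epoch(size_t n_lines, WriteBackFn write_back) {
        for (size_t i = 0; i < n_lines && !entries.empty(); ++i) {
            write_back(entries.front().tag, entries.front().data);
            entries.pop_front();
        }
    }
};
```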

  6. Challenges to Address • Isn’t the PCB just a very large store queue (and therefore impractical)? – No • The PCB is searched only on a miss – not timing critical, but… • It does need many more entries (~100s) to be useful • If multiple versions exist, a search can be very slow • Either search each segment sequentially or use a priority encoder • Frequent searches are undesirable energy-wise

  7. Using States to Address Multiple Copies • 3 states: Valid, Invalid, Superseded • Superseded lines exist only for committing and do not participate in searches • Guarantees at most one valid version of a line in any PCB • Searches are always parallel; no priority encoding needed
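A sketch of how the three states keep searches simple (the structure and function names here are illustrative, not the paper's): inserting a newer version of a line demotes the previously Valid copy to Superseded, so the invariant "at most one Valid version per line" always holds; Superseded entries are skipped by searches, so every lookup can compare all tags in parallel with no priority encoder.

```cpp
#include <cstdint>
#include <vector>

enum class PcbState { Invalid, Valid, Superseded };

struct PcbEntry {
    uint64_t tag   = 0;
    PcbState state = PcbState::Invalid;
    // line data omitted for brevity
};

// Illustrative insert: demote any existing Valid copy of the same line so that
// at most one Valid version of a line ever exists in the PCB.
void pcb_insert(std::vector<PcbEntry>& pcb, uint64_t tag) {
    for (auto& e : pcb)
        if (e.state == PcbState::Valid && e.tag == tag)
            e.state = PcbState::Superseded;   // older version: commit-only
    PcbEntry fresh;
    fresh.tag = tag;
    fresh.state = PcbState::Valid;
    pcb.push_back(fresh);
}

// Searches ignore Superseded entries, so at most one comparator can match and
// the lookup needs neither a sequential scan of versions nor priority encoding.
int pcb_search(const std::vector<PcbEntry>& pcb, uint64_t tag) {
    for (int i = 0; i < (int)pcb.size(); ++i)
        if (pcb[i].state == PcbState::Valid && pcb[i].tag == tag)
            return i;        // the unique Valid version, if any
    return -1;
}
```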

  8. Using Pointers as a Filter • If line is present in the cache, no need to search PCB • Pointer also reduces bloom filter clogging [Figure: each L1 line (tag, data) carries a pointer to its PCB entry (tag, data, state), with newer and older versions of lines X, Y, Z in Valid or Superseded state; a bloom filter covers the PCB contents]
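A sketch of the two filters, under the assumption that each L1 line carries a small index into the PCB and that a bloom filter summarizes the PCB's tags (BloomFilter, ptr_to_pcb, and the hash choices are illustrative; keeping cache-resident lines out of the filter is one plausible reading of how the pointers reduce its clogging): a local access that hits in the L1 never needs an associative PCB search, and a remote coherence request searches only when the filter reports a possible hit.

```cpp
#include <bitset>
#include <cstdint>
#include <functional>

// Tiny illustrative bloom filter over PCB-resident line tags.
class BloomFilter {
    std::bitset<1024> bits;
    size_t h1(uint64_t t) const { return std::hash<uint64_t>{}(t) % bits.size(); }
    size_t h2(uint64_t t) const {
        return std::hash<uint64_t>{}(t * 0x9e3779b97f4a7c15ULL) % bits.size();
    }
public:
    void insert(uint64_t tag)            { bits.set(h1(tag)); bits.set(h2(tag)); }
    bool may_contain(uint64_t tag) const { return bits.test(h1(tag)) && bits.test(h2(tag)); }
};

struct CacheLineMeta {
    bool present    = false;
    int  ptr_to_pcb = -1;   // index of the matching PCB entry, -1 if none
};

// Local access: if the line is present in L1, its pointer already identifies
// the PCB copy (or its absence), so no associative PCB search is needed.
bool local_needs_pcb_search(const CacheLineMeta& line) {
    return !line.present;
}

// Remote coherence request: only search the PCB when the filter says "maybe".
// False positives still cost a search, which is why a sparser filter helps.
bool remote_needs_pcb_search(const BloomFilter& filter, uint64_t tag) {
    return filter.may_contain(tag);
}
```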

  9. Effectiveness of the Optimizations • Setup: multiprocessor simulator based on SimpleScalar; SPLASH-2 benchmarks plus two other shared-memory programs • Only 0.67% of PCB searches remain • The pointer and the bloom filter each filter out about half • The pointer works well for filtering same-processor searches • The bloom filter works well for remote-processor requests • Without the pointers, false positives of the bloom filter are 21X-800X higher

  10. The Issue of Non-Determinism • Non-determinism in parallel execution leads to different outcomes • Discrepancies appear as soft errors and can’t be addressed by rollbacks • Possible solutions: • Eliminate non-determinism completely – lockstepping • Ignore the root cause of non-determinism and address the symptom: pass load results via, e.g., a load value queue (LVQ) • Our approach: throttle retirement/fetch to maintain the race outcome [Figure: threads T1 and T2 race on x; in the computing wavefront the load of x observes 1 (after 'st x, 1'), while in the verification wavefront it observes 0]

  11. Subepoch-Based Instruction Partitioning • (Potential) races partition instructions into subepochs • Guarantee races only happen across subepochs • Maintain “lockstepping” of subepochs • Stall fetch or commit stage (sequential consistency) • Guarantees deterministic replay [Figure: threads Ti and Tj; the computing wavefront partitions instructions ('st x, 0', 'add', …, 'st x, 1', 'ld x (1)') into subepochs s and s+1 and records per-thread counts n_{s,j}; the verification wavefront stalls at the subepoch boundary so 'ld x' again observes 1]
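A minimal sketch of subepoch lockstepping, assuming the computing wavefront publishes per-thread instruction counts for each subepoch (the n_{s,j} values in the figure; SubepochCounts, must_stall, and advance_subepoch are hypothetical names): a verification thread may retire instructions of its current subepoch only up to the recorded count, then stalls until every thread has exhausted its quota and the whole wavefront advances to the next subepoch together, which reproduces the original race outcome without constraining timing inside a subepoch.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical record produced by the computing wavefront: for each subepoch s,
// how many instructions each thread retired in it (the n_{s,j} counts).
using SubepochCounts = std::vector<std::map<int /*thread*/, uint64_t /*n_{s,j}*/>>;

struct VerifThread {
    int      id;
    int      subepoch       = 0;   // current subepoch being replayed
    uint64_t retired_in_sub = 0;   // instructions already retired in it
};

// Before the verification wavefront retires (or fetches) the next instruction
// of thread t, check whether that would cross the recorded subepoch boundary;
// if so, the thread stalls until the whole wavefront switches subepochs.
bool must_stall(const VerifThread& t, const SubepochCounts& counts) {
    const auto& per_thread = counts.at(t.subepoch);
    auto it = per_thread.find(t.id);
    uint64_t quota = (it != per_thread.end()) ? it->second : 0;
    return t.retired_in_sub >= quota;   // quota exhausted: wait for the switch
}

// Once every verification thread has exhausted its quota for subepoch s, the
// whole wavefront advances to s+1 together: "lockstep" at subepoch granularity,
// not instruction granularity.
void advance_subepoch(std::vector<VerifThread>& threads) {
    for (auto& t : threads) { ++t.subepoch; t.retired_in_sub = 0; }
}
```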

  12. Other Issues Detailed in the Paper • PCB write-back bandwidth • The PCB moderately increases bandwidth demand • Simple support can cut down the increase significantly • Subepoch transition implementation details • Enforcing subepoch boundaries • Not always necessary to guarantee the race outcome • Studied three different policies on when to enforce them • Storage and energy overhead of all structures

  13. Experimental Setup • Modified SimpleScalar 3.0b simulator modeling a CMP • Snoopy-based MESI coherence protocol • Sequential consistency memory model • SPLASH-2 benchmark suite, plus ilink and tsp

  14. Performance Impact • Additional performance impact less than 2.5% on average [Figure: per-benchmark performance of TLR execution, configurations labeled 8 and 16]

  15. Summary • TLR offers flexible protection with fundamental advantages • Proposed support with comprehensive coverage: 1. The PCB decouples redundancy support from core logic; with optimizations, the dynamic cost is very low 2. Broad-brushed synchrony guarantees the race outcome; requires only non-intrusive support to throttle retirement • Overall performance impact is small

  16. Thank You Questions?
