
Transient Fault Detection via Simultaneous Multithreading


Presentation Transcript


    1. Transient Fault Detection via Simultaneous Multithreading
       (Speaker note: Just introduce Steve and yourself.)

    2. Transient Faults
       - Faults that persist for a "short" duration
       - Cause: cosmic rays (e.g., neutrons)
       - Effect: knock electrons loose, discharge a capacitor
       - Solution: there is no practical absorbent for cosmic rays
       - Estimated fault rate: 1 fault per 1,000 computers per year
       - The future is worse: smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins
       (Speaker note: get through this slide quickly.)

    3. Fault Detection in the Compaq Himalaya System
       (Speaker note: get through this slide quickly; replication is completely in hardware and is not visible to the OS.)

    4. Fault Detection via Simultaneous Multithreading
       (Speaker note: transition to this slide more smoothly; emphasize the cost-performance tradeoff.)

    5. (Speaker note: quickly.)

    6. Simultaneous & Redundantly Threaded Processor (SRT)
       + Less hardware compared to replicated microprocessors: SMT needs ~5% more hardware than a uniprocessor, and SRT adds very little hardware overhead to an existing SMT
       + Better performance than complete replication: better use of resources
       + Lower cost: avoids complete replication; benefits from the market volume of SMT & SRT

    7. SRT Design Challenges
       - Lockstepping doesn't work: SMT may issue the same instruction from the redundant threads in different cycles
       - Must carefully fetch/schedule instructions from the redundant threads (branch mispredictions, cache misses)

    8. Contributions & Outline
       - Sphere of Replication (SoR)
       - Output comparison for SRT
       - Input replication for SRT
       - Performance optimizations for SRT
       - SRT outperforms on-chip replicated microprocessors
       - Related work
       - Summary

    9. Sphere of Replication (SoR)
       (Speaker note: SRT provides both time and space redundancy, unlike prior work, which provides only space redundancy; identify the boundaries where redundancy ends ...)

    11. Sphere of Replication for SRT
        (Speaker note: SRT is derived from SMT; the SMT pipeline looks like a uniprocessor pipeline but holds a mix of instructions from two or more threads. Here we have corresponding loads from the two threads in the RUU/IQ. The SoR includes the IQ (e.g., the load), giving both space redundancy and time redundancy; the SoR combines logical and physical replication.)
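
To make the sphere-of-replication idea concrete, here is a minimal behavioral sketch in C (not the paper's design; the core_compute function, the injected bit flip, and the leave_sphere check are illustrative assumptions): everything inside the sphere executes twice, inputs are replicated on entry, and only values leaving the sphere are compared.

```c
/* Minimal sketch of a sphere of replication (illustrative only):
 * inputs are replicated into two redundant executions, and only
 * values that leave the sphere are compared. */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical "work inside the sphere"; the XOR models a transient
 * fault striking one of the redundant copies. */
static long core_compute(long input, int inject_fault)
{
    long result = input * 3 + 7;        /* arbitrary computation */
    if (inject_fault)
        result ^= 1L << 5;              /* flip one bit: transient fault */
    return result;
}

/* Output comparison at the sphere boundary: a value may leave the
 * sphere only if both redundant copies agree. */
static int leave_sphere(long leading, long trailing, long *out)
{
    if (leading != trailing)
        return -1;                      /* fault detected */
    *out = leading;
    return 0;
}

int main(void)
{
    long input = 42;                    /* input replication: both copies see 42 */
    long lead  = core_compute(input, 0);
    long trail = core_compute(input, 1);   /* pretend a particle strike hit this copy */

    long committed;
    if (leave_sphere(lead, trail, &committed) != 0) {
        fprintf(stderr, "transient fault detected at sphere boundary\n");
        return EXIT_FAILURE;
    }
    printf("committed value: %ld\n", committed);
    return 0;
}
```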

    12. Output Comparison in SRT

    13. Output Comparison
        - Compare and validate <address, data> for stores from the redundant threads at commit time (a sketch follows below)
        (Speaker note: note that we don't do output comparison on all instructions, only selected ones.)
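
As a rough illustration of commit-time output comparison, the sketch below models a small store comparison buffer in C (the buffer size, function names, and the printf standing in for the actual cache write are assumptions, not the paper's hardware): the leading thread's committed stores wait until the trailing thread commits the corresponding store, and only matching <address, data> pairs leave the sphere.

```c
/* Sketch of a store comparison buffer: the leading thread's committed stores
 * are held here until the trailing thread commits the corresponding store;
 * only matching <address, data> pairs are released outside the sphere. */
#include <stdint.h>
#include <stdio.h>

#define STORE_BUF_ENTRIES 16

struct store_entry {
    uint64_t addr;
    uint64_t data;
};

static struct store_entry store_buf[STORE_BUF_ENTRIES];
static int head, tail, count;

/* Leading thread commits a store: enqueue it for later comparison. */
static int leading_commit_store(uint64_t addr, uint64_t data)
{
    if (count == STORE_BUF_ENTRIES)
        return -1;                          /* buffer full: stall the leading thread */
    store_buf[tail] = (struct store_entry){ addr, data };
    tail = (tail + 1) % STORE_BUF_ENTRIES;
    count++;
    return 0;
}

/* Trailing thread commits its copy of the oldest store: compare and,
 * on a match, perform the real store (modeled here as a printf). */
static int trailing_commit_store(uint64_t addr, uint64_t data)
{
    if (count == 0)
        return -1;                          /* nothing to compare against */
    struct store_entry e = store_buf[head];
    head = (head + 1) % STORE_BUF_ENTRIES;
    count--;
    if (e.addr != addr || e.data != data)
        return -1;                          /* mismatch: transient fault detected */
    printf("store released: [%#llx] <- %#llx\n",
           (unsigned long long)addr, (unsigned long long)data);
    return 0;
}

int main(void)
{
    leading_commit_store(0x1000, 0xdeadbeef);
    if (trailing_commit_store(0x1000, 0xdeadbeef) != 0)
        fprintf(stderr, "fault detected on store\n");

    leading_commit_store(0x2000, 0x1234);
    if (trailing_commit_store(0x2000, 0x1235) != 0)   /* simulated bit error */
        fprintf(stderr, "fault detected on store\n");
    return 0;
}
```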

    14. Input Replication in SRT

    15. Input Replication
        - Cached load data: pairing loads from the redundant threads is too slow; allowing both loads to probe the cache causes false faults with I/O or multiprocessors
        - Load Value Queue (LVQ): pre-designated leading & trailing threads (a sketch follows below)
        (Speaker note: mention that the leading thread executes loads out of order and speculatively; the trailing thread doesn't.)
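
A minimal sketch of the Load Value Queue, assuming fixed leading/trailing roles and an in-order queue (the sizes and helper names are illustrative): the leading thread performs the real cached load and enqueues <address, value>; the trailing thread consumes the entry instead of probing the cache, so both threads see the same value even if memory changes in between.

```c
/* Minimal sketch of a Load Value Queue (LVQ). The leading thread performs the
 * real load and enqueues <address, value>; the trailing thread consumes the
 * entry in program order instead of probing the cache, so both threads observe
 * identical load values even if memory changes in between (e.g., I/O or
 * another processor writing the line). */
#include <stdint.h>
#include <stdio.h>

#define LVQ_ENTRIES 32

struct lvq_entry {
    uint64_t addr;
    uint64_t value;
};

static struct lvq_entry lvq[LVQ_ENTRIES];
static int lvq_head, lvq_tail, lvq_count;

/* Leading thread: do the actual load, then record it in the LVQ. */
static uint64_t leading_load(const uint64_t *mem, uint64_t addr)
{
    uint64_t value = mem[addr];             /* real cache/memory access */
    lvq[lvq_tail] = (struct lvq_entry){ addr, value };
    lvq_tail = (lvq_tail + 1) % LVQ_ENTRIES;
    lvq_count++;
    return value;
}

/* Trailing thread: take the load value from the LVQ; the address check
 * catches a corrupted effective address in either thread. */
static int trailing_load(uint64_t addr, uint64_t *value)
{
    if (lvq_count == 0)
        return -1;                          /* trailing thread must wait */
    struct lvq_entry e = lvq[lvq_head];
    lvq_head = (lvq_head + 1) % LVQ_ENTRIES;
    lvq_count--;
    if (e.addr != addr)
        return -1;                          /* address mismatch: fault detected */
    *value = e.value;
    return 0;
}

int main(void)
{
    uint64_t memory[8] = { 0, 11, 22, 33, 44, 55, 66, 77 };

    uint64_t lead_val = leading_load(memory, 3);
    memory[3] = 999;                        /* e.g., another processor writes the line */

    uint64_t trail_val;
    if (trailing_load(3, &trail_val) != 0 || trail_val != lead_val)
        fprintf(stderr, "unexpected divergence\n");
    else
        printf("both threads loaded %llu despite the intervening write\n",
               (unsigned long long)trail_val);
    return 0;
}
```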

    16. Input Replication (contd.) Cached Load Data: alternate solution Active Load Address Buffer Special Cases Cycle- or time-sensitive instructions External interrupts

    18. Performance Optimizations
        - Slack fetch: maintain a constant slack of instructions between the leading and trailing threads
          + the leading thread prefetches cache misses
          + the leading thread prefetches correct branch outcomes
        - Branch Outcome Queue: feed branch outcomes from the leading thread to the trailing thread
        - Combine the above two (a sketch of both structures follows below)
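
The sketch below combines slack fetch with a branch outcome queue in a toy fetch loop (the slack target, queue depth, and branch frequency are made-up parameters for illustration): the fetch slot goes to the leading thread until it is a fixed number of instructions ahead, and its resolved branch outcomes are queued for the trailing thread to follow instead of predicting.

```c
/* Sketch of slack fetch plus a branch outcome queue. The fetch stage favors
 * the leading thread until it is SLACK instructions ahead of the trailing
 * thread; resolved branch outcomes are queued so the trailing thread fetches
 * down the correct path instead of predicting. */
#include <stdbool.h>
#include <stdio.h>

#define SLACK 256           /* target instruction slack between the threads */
#define BOQ_ENTRIES 128     /* branch outcome queue depth */

static long leading_fetched, trailing_fetched;

static bool boq[BOQ_ENTRIES];       /* taken / not-taken outcomes, in order */
static int boq_head, boq_tail, boq_count;

/* Fetch policy: give the fetch slot to the leading thread until the slack
 * target is reached, then to the trailing thread. */
static int pick_fetch_thread(void)
{
    if (leading_fetched - trailing_fetched < SLACK)
        return 0;                    /* leading thread */
    return 1;                        /* trailing thread */
}

/* Leading thread resolves a branch: record the outcome for the trailing thread. */
static void push_branch_outcome(bool taken)
{
    if (boq_count < BOQ_ENTRIES) {
        boq[boq_tail] = taken;
        boq_tail = (boq_tail + 1) % BOQ_ENTRIES;
        boq_count++;
    }
}

/* Trailing thread fetches past a branch: use the queued outcome, no prediction. */
static bool pop_branch_outcome(void)
{
    bool taken = boq[boq_head];
    boq_head = (boq_head + 1) % BOQ_ENTRIES;
    boq_count--;
    return taken;
}

int main(void)
{
    /* Toy fetch loop: the leading thread runs ahead by SLACK instructions,
     * resolving branches that the trailing thread later consumes. */
    for (int cycle = 0; cycle < 1000; cycle++) {
        if (pick_fetch_thread() == 0) {
            leading_fetched++;
            if (leading_fetched % 5 == 0)          /* every 5th instruction is a branch */
                push_branch_outcome(leading_fetched % 2 == 0);
        } else {
            trailing_fetched++;
            if (trailing_fetched % 5 == 0 && boq_count > 0)
                (void)pop_branch_outcome();        /* follow the leading thread's path */
        }
    }
    printf("leading fetched %ld, trailing fetched %ld, slack %ld\n",
           leading_fetched, trailing_fetched, leading_fetched - trailing_fetched);
    return 0;
}
```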

    19. Baseline Architecture Parameters

    20. Target Architectures
        - SRT: SMT + fault detection; output comparison, input replication (Load Value Queue), slack fetch + Branch Outcome Queue
        - ORH-Dual: on-chip replicated hardware; each pipeline of the dual has half the resources of SRT, and the two pipelines share the fetch stage (including the branch predictor)

    21. Performance Model & Benchmarks
        - SimpleScalar 3.0, modified to support SMT by Steve Raasch, U. of Michigan; the SMT/SimpleScalar model further modified to support SRT
        - Benchmarks compiled with gcc 2.6 + full optimization; subset of the SPEC95 suite (11 benchmarks); skipped between 300 million and 20 billion instructions, then simulated 200 million instructions for each benchmark

    22. SRT vs. ORH-Dual
        (Speaker note: performance improves because output comparison and input replication don't hurt, while slack fetch and the branch outcome queue help.)

    23. Recent Related Work
        - Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998
          + first to propose the use of SMT for fault detection
        - AR-SMT, Rotenberg, FTCS, 1999
          + forwards values from the leading thread to the checker thread
        - DIVA, Austin, MICRO, 1999
          + converts the checker thread into a simple processor
        (Speaker note: our work on SRT. The sphere of replication formalizes the problem: e.g., the checker and redundant threads need to be separate, unlike in AR-SMT or DIVA; AR-SMT needs to be augmented with ECC on the register file, and DIVA cannot capture transient faults on uncached loads. Output comparison: we need to compare only instructions leaving the sphere (stores for SRT), whereas AR-SMT and DIVA compare every instruction. Input replication: AR-SMT and DIVA can report false transient faults because the cached load is performed twice.)

    24. Improvements over Prior Work
        - Sphere of Replication (SoR): e.g., the AR-SMT register file must be augmented with ECC; DIVA must handle uncached loads in a special way
        - Output Comparison: e.g., AR-SMT & DIVA compare all instructions, while SRT compares selected ones based on the SoR
        - Input Replication: e.g., AR-SMT & DIVA can detect false transient faults; SRT avoids this problem using the LVQ
        - Slack Fetch
        (Speaker note: mention that DIVA and AR-SMT don't distinguish between the redundant thread and the checker thread.)

    25. Summary
        - Simultaneous & Redundantly Threaded Processor (SRT): SMT + fault detection
        - Sphere of replication
        - Output comparison of committed store instructions
        - Input replication via the load value queue
        - Slack fetch & branch outcome queue
        - SRT outperforms equivalently-sized on-chip replicated hardware by 16% on average and up to 29%
