
Transient Fault Detection via Simultaneous Multithreading



Presentation Transcript


  1. Transient Fault Detection via Simultaneous Multithreading
  Steven K. Reinhardt (stever@eecs.umich.edu), Electrical Engineering & Computer Science, University of Michigan, Ann Arbor, Michigan
  Shubhendu S. Mukherjee (Shubu.Mukherjee@compaq.com), VSSAD, Alpha Technology, Compaq Computer Corporation, Shrewsbury, Massachusetts
  27th Annual International Symposium on Computer Architecture (ISCA), 2000

  2. Transient Faults
  • Faults that persist for a “short” duration
  • Cause: cosmic rays (e.g., neutrons)
  • Effect: knock off electrons, discharge a capacitor
  • Solution?
    • no practical absorbent for cosmic rays
  • 1 fault per 1000 computers per year (estimated fault rate)
  • Future is worse
    • smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins

  3. Fault Detection in the Compaq Himalaya System
  Replicated microprocessors + cycle-by-cycle lockstepping
  [Diagram: two microprocessors, each executing the same instruction R1 ← (R2), with input replication and cycle-by-cycle output comparison; memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC]
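
Cycle-by-cycle lockstepping can be sketched as a direct comparison of the two replicated processors' per-cycle outputs. A minimal Python model (the function name and the integer output streams are illustrative, not from the talk):

```python
def lockstep_check(outputs_a, outputs_b):
    """Compare two replicated processors' outputs cycle by cycle.

    Returns the first cycle at which they disagree (a detected
    transient fault), or None if the streams match."""
    for cycle, (a, b) in enumerate(zip(outputs_a, outputs_b)):
        if a != b:
            return cycle
    return None

# Fault-free run: both processors emit identical outputs each cycle.
print(lockstep_check([3, 1, 4], [3, 1, 4]))   # None
# A particle strike flips processor B's output in cycle 1.
print(lockstep_check([3, 1, 4], [3, 9, 4]))   # 1
```

Lockstepping relies on both processors presenting the same output in the same cycle; slide 7 explains why an SMT processor cannot guarantee that.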

  4. Fault Detection via Simultaneous Multithreading
  Can redundant threads replace replicated microprocessors + cycle-by-cycle lockstepping?
  [Diagram: two threads, each executing R1 ← (R2), with input replication and output comparison; memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC]

  5. Simultaneous Multithreading (SMT)
  [Diagram: Thread1 and Thread2 feeding a shared instruction scheduler and functional units]
  Example: Alpha 21464

  6. Simultaneous & Redundantly Threaded Processor (SRT)
  SRT = SMT + Fault Detection
  + Less hardware compared to replicated microprocessors
    • SMT needs ~5% more hardware over a uniprocessor
    • SRT adds very little hardware overhead to an existing SMT
  + Better performance than complete replication
    • better use of resources
  + Lower cost
    • avoids complete replication
    • market volume of SMT & SRT

  7. SRT Design Challenges
  • Lockstepping doesn’t work
    • SMT may issue the same instruction from redundant threads in different cycles
  • Must carefully fetch/schedule instructions from redundant threads
    • branch mispredictions
    • cache misses
  Disclaimer: this talk focuses only on fault detection, not recovery

  8. Contributions & Outline
  • Sphere of Replication (SoR)
  • Output comparison for SRT
  • Input replication for SRT
  • Performance optimizations for SRT
  • SRT outperforms on-chip replicated microprocessors
  • Related work
  • Summary

  9. Sphere of Replication (SoR)
  • Logical boundary of redundant execution within a system
  • Trade-off between information, time, & space redundancy
  [Diagram: two execution copies inside the sphere of replication; input replication and output comparison sit at the boundary to the rest of the system]

  10. Example Spheres of Replication
  • ORH-Dual: On-Chip Replicated Hardware (similar to IBM G5)
    • SoR encloses two pipelines on one chip
    • instruction and data caches covered by ECC
  • Compaq Himalaya
    • SoR encloses two microprocessors
  [Diagram: in both systems, memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC, with input replication and output comparison at each SoR boundary]

  11. Sphere of Replication for SRT
  [Diagram: SMT pipeline (fetch, decode, register rename, RUU, integer/FP register files and units, load/store units, instruction and data caches) with Thread 0 and Thread 1 executing the same instructions inside the SoR]
  • Excludes instruction and data caches
  • Alternate SoRs possible (e.g., exclude register file)… not in this talk

  12. Output Comparison in SRT
  Compare & validate output before sending it outside the SoR
  [Diagram: two execution copies inside the sphere of replication; output comparison sits at the boundary to the rest of the system]

  13. Output Comparison
  • <address, data> for stores from redundant threads
    • compare & validate at commit time
  • <address> for uncached loads from redundant threads
  • <address> for cached loads from redundant threads: not required
  • other output comparisons depend on the boundary of the SoR
  [Diagram: store queue holding stores from both threads; the matching pair of stores of R1 to address (R2) is compared before being sent to the data cache]
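
The store comparison above can be sketched as a queue that holds each leading-thread store until the trailing thread produces its twin; only a matching <address, data> pair is allowed to leave the SoR. A Python sketch under assumed names (StoreComparator and the RuntimeError fault signal are illustrative, not from the talk):

```python
from collections import deque

class StoreComparator:
    """Hold leading-thread stores until the trailing thread's matching
    store arrives; only matching <address, data> pairs leave the SoR."""
    def __init__(self):
        self.pending = deque()   # leading-thread stores awaiting their twin
        self.committed = []      # stores released to the data cache

    def leading_store(self, addr, data):
        self.pending.append((addr, data))

    def trailing_store(self, addr, data):
        expected = self.pending.popleft()
        if expected != (addr, data):
            # stands in for raising the machine's fault-detected signal
            raise RuntimeError("transient fault: store mismatch")
        self.committed.append((addr, data))

sc = StoreComparator()
sc.leading_store(0x100, 42)
sc.trailing_store(0x100, 42)      # match: the store reaches the data cache
print(sc.committed)               # [(256, 42)]
```

A corrupted store in either thread shows up as a <address, data> mismatch at commit time, before the value can escape the SoR.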

  14. Input Replication in SRT
  Replicate & deliver the same input (coming from outside the SoR) to both redundant copies
  [Diagram: input replication sits at the SoR boundary, feeding identical inputs to the two execution copies]

  15. Input Replication
  • Cached load data
    • pair loads from redundant threads: too slow
    • allow both loads to probe the cache: false faults with I/O or multiprocessors
  • Load Value Queue (LVQ)
    • pre-designated leading & trailing threads
  [Diagram: the leading thread probes the cache for its load; the loaded value flows through the LVQ to the trailing thread’s matching load]
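
The Load Value Queue can be sketched as follows: only the leading thread probes the cache, and the trailing thread consumes the replicated <address, value> pair from the queue, so an intervening change to memory (I/O, another processor) cannot cause a false fault. A Python sketch with illustrative names (the dict stands in for the data cache):

```python
from collections import deque

class LoadValueQueue:
    """Replicate cached load data from the leading to the trailing thread."""
    def __init__(self, memory):
        self.memory = memory     # stand-in for the data cache
        self.lvq = deque()

    def leading_load(self, addr):
        value = self.memory[addr]        # only the leading thread probes the cache
        self.lvq.append((addr, value))   # replicate the input for the trailing thread
        return value

    def trailing_load(self, addr):
        lead_addr, value = self.lvq.popleft()
        if lead_addr != addr:            # address mismatch signals a fault
            raise RuntimeError("transient fault: load address mismatch")
        return value                     # same value the leading thread saw

mem = {0x10: 7}
lvq = LoadValueQueue(mem)
print(lvq.leading_load(0x10))    # 7
mem[0x10] = 99                   # e.g., I/O device updates memory in between
print(lvq.trailing_load(0x10))   # 7  (no false fault from the update)
```

Comparing the trailing thread's load address against the queued one also gives a free check on the address computation itself.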

  16. Input Replication (contd.)
  • Cached load data: alternate solution
    • Active Load Address Buffer
  • Special cases
    • cycle- or time-sensitive instructions
    • external interrupts

  17. Outline
  • Sphere of Replication (SoR)
  • Output comparison for SRT
  • Input replication for SRT
  • Performance optimizations for SRT
  • SRT outperforms on-chip replicated microprocessors
  • Related work
  • Summary

  18. Performance Optimizations
  • Slack fetch
    • maintain a constant slack of instructions between the leading and trailing threads
    + leading thread prefetches cache misses
    + leading thread prefetches correct branch outcomes
  • Branch Outcome Queue
    • feed branch outcomes from the leading thread to the trailing thread
  • Combine the above two
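
Slack fetch can be sketched as a fetch policy that prefers the leading thread until it is a fixed number of instructions ahead, then lets the trailing thread catch up. A Python sketch (the slack value of 3 and the one-fetch-per-cycle model are illustrative simplifications; the real mechanism sits in an SMT fetch stage):

```python
def fetch_schedule(program_len, slack=3):
    """Choose which thread fetches each cycle (one instruction per cycle),
    keeping the trailing thread about `slack` instructions behind."""
    lead = trail = 0
    schedule = []
    while trail < program_len:
        if lead < program_len and lead - trail < slack:
            lead += 1          # leading thread runs ahead, warming caches
            schedule.append("lead")
        else:
            trail += 1         # trailing thread follows at a distance
            schedule.append("trail")
    return schedule

# The leader opens up the slack first, then the two threads alternate.
print(fetch_schedule(5))
# ['lead', 'lead', 'lead', 'trail', 'lead', 'trail', 'lead', 'trail', 'trail', 'trail']
```

Because the trailing thread runs behind, the leading thread has usually already taken the cache misses and resolved the branches by the time the trailing thread reaches them.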

  19. Baseline Architecture Parameters
  • L1 instruction cache: 64K bytes, 4-way associative, 32-byte blocks, single ported
  • L1 data cache: 64K bytes, 4-way associative, 32-byte blocks, four read/write ports
  • Unified L2 cache: 1M bytes, 4-way associative, 64-byte blocks
  • Branch predictor: hybrid local/global (like 21264); 13-bit global history register indexing 8K-entry global PHT and 8K-entry choice table; 2K 11-bit local history registers indexing 2K-entry local PHT; 4K-entry BTB, 16-entry RAS (per thread)
  • Fetch/decode/issue/commit width: 8 instructions/cycle (fetch can span 3 basic blocks)
  • Functional units: 6 Int ALU, 2 Int Multiply, 4 FP Add, 2 FP Multiply
  • Fetch-to-decode latency: 5 cycles
  • Decode-to-execution latency: 10 cycles

  20. Target Architectures
  • SRT: SMT + fault detection
    • output comparison
    • input replication (Load Value Queue)
    • slack fetch + branch outcome queue
  • ORH-Dual: On-Chip Replicated Hardware
    • each pipeline of the dual has half the resources of SRT
    • the two pipelines share the fetch stage (including the branch predictor)

  21. Performance Model & Benchmarks
  • SimpleScalar 3.0
    • modified to support SMT by Steve Raasch, U. of Michigan
    • SMT/SimpleScalar further modified to support SRT
  • Benchmarks
    • compiled with gcc 2.6 + full optimization
    • subset of the SPEC95 suite (11 benchmarks)
    • skipped between 300 million and 20 billion instructions
    • simulated 200 million instructions for each benchmark

  22. SRT vs. ORH-Dual
  [Graph: per-benchmark performance of SRT relative to ORH-Dual]
  Average improvement = 16%, maximum = 29%

  23. Recent Related Work
  • Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998
    + first to propose the use of SMT for fault detection
  • AR-SMT, Rotenberg, FTCS, 1999
    + forwards values from the leading thread to the checker thread
  • DIVA, Austin, MICRO, 1999
    + converts the checker thread into a simple processor

  24. Improvements over Prior Work
  • Sphere of Replication (SoR)
    • e.g., AR-SMT’s register file must be augmented with ECC
    • e.g., DIVA must handle uncached loads in a special way
  • Output comparison
    • e.g., AR-SMT & DIVA compare all instructions; SRT compares selected ones based on the SoR
  • Input replication
    • e.g., AR-SMT & DIVA can detect false transient faults; SRT avoids this problem using the LVQ
  • Slack fetch

  25. Summary
  • Simultaneous & Redundantly Threaded Processor (SRT) = SMT + fault detection
  • Sphere of replication
  • Output comparison of committed store instructions
  • Input replication via the load value queue
  • Slack fetch & branch outcome queue
  • SRT outperforms equivalently-sized on-chip replicated hardware by 16% on average & up to 29%
