
Transient Fault Detection via Simultaneous Multithreading



Presentation Transcript


  1. Transient Fault Detection via Simultaneous Multithreading
  Steven K. Reinhardt (stever@eecs.umich.edu), Electrical Engineering & Computer Science, University of Michigan, Ann Arbor, Michigan
  Shubhendu S. Mukherjee (Shubu.Mukherjee@compaq.com), VSSAD, Alpha Technology, Compaq Computer Corporation, Shrewsbury, Massachusetts
  27th Annual International Symposium on Computer Architecture (ISCA), 2000

  2. Transient Faults
  • Faults that persist for a “short” duration
  • Cause: cosmic rays (e.g., neutrons)
  • Effect: knock off electrons, discharge a capacitor
  • Solution?
    • no practical absorbent for cosmic rays
  • 1 fault per 1000 computers per year (estimated fault rate)
  • Future is worse
    • smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins

  3. Fault Detection in the Compaq Himalaya System
  Replicated microprocessors + cycle-by-cycle lockstepping
  [Diagram: two microprocessors, each executing the same instruction R1 ← (R2), with input replication and cycle-by-cycle output comparison; memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC]
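
Cycle-by-cycle lockstepping can be sketched as a direct comparison of the two replicated processors' per-cycle outputs. A minimal Python model (the function name and the integer output streams are illustrative, not from the talk):

```python
def lockstep_check(outputs_a, outputs_b):
    """Compare two replicated processors' outputs cycle by cycle.

    Returns the first cycle at which they disagree (a detected
    transient fault), or None if the streams match."""
    for cycle, (a, b) in enumerate(zip(outputs_a, outputs_b)):
        if a != b:
            return cycle
    return None

# Fault-free run: both processors emit identical outputs each cycle.
print(lockstep_check([3, 1, 4], [3, 1, 4]))   # None
# A particle strike flips processor B's output in cycle 1.
print(lockstep_check([3, 1, 4], [3, 9, 4]))   # 1
```

Lockstepping relies on both processors presenting the same output in the same cycle; slide 7 explains why an SMT processor cannot guarantee that.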

  4. Fault Detection via Simultaneous Multithreading
  Can redundant threads replace replicated microprocessors + cycle-by-cycle lockstepping?
  [Diagram: two threads, each executing R1 ← (R2), with input replication and output comparison; memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC]

  5. Simultaneous Multithreading (SMT)
  [Diagram: Thread1 and Thread2 feeding a shared instruction scheduler and functional units]
  Example: Alpha 21464

  6. Simultaneous & Redundantly Threaded Processor (SRT)
  SRT = SMT + Fault Detection
  + Less hardware compared to replicated microprocessors
    • SMT needs ~5% more hardware over a uniprocessor
    • SRT adds very little hardware overhead to an existing SMT
  + Better performance than complete replication
    • better use of resources
  + Lower cost
    • avoids complete replication
    • market volume of SMT & SRT

  7. SRT Design Challenges
  • Lockstepping doesn’t work
    • SMT may issue the same instruction from redundant threads in different cycles
  • Must carefully fetch/schedule instructions from redundant threads
    • branch mispredictions
    • cache misses
  Disclaimer: this talk focuses only on fault detection, not recovery

  8. Contributions & Outline
  • Sphere of Replication (SoR)
  • Output comparison for SRT
  • Input replication for SRT
  • Performance optimizations for SRT
  • SRT outperforms on-chip replicated microprocessors
  • Related work
  • Summary

  9. Sphere of Replication (SoR)
  • Logical boundary of redundant execution within a system
  • Trade-off between information, time, & space redundancy
  [Diagram: two execution copies inside the sphere of replication; input replication and output comparison sit at the boundary to the rest of the system]

  10. Example Spheres of Replication
  • ORH-Dual: On-Chip Replicated Hardware (similar to IBM G5)
    • SoR encloses two pipelines on one chip
    • instruction and data caches covered by ECC
  • Compaq Himalaya
    • SoR encloses two microprocessors
  [Diagram: in both systems, memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC, with input replication and output comparison at each SoR boundary]

  11. Sphere of Replication for SRT
  [Diagram: SMT pipeline (fetch, decode, register rename, RUU, integer/FP register files and units, load/store units, instruction and data caches) with Thread 0 and Thread 1 executing the same instructions inside the SoR]
  • Excludes instruction and data caches
  • Alternate SoRs possible (e.g., exclude register file)… not in this talk

  12. Output Comparison in SRT
  Compare & validate output before sending it outside the SoR
  [Diagram: two execution copies inside the sphere of replication; output comparison sits at the boundary to the rest of the system]

  13. Output Comparison
  • <address, data> for stores from redundant threads
    • compare & validate at commit time
  • <address> for uncached loads from redundant threads
  • <address> for cached loads from redundant threads: not required
  • other output comparisons depend on the boundary of the SoR
  [Diagram: store queue holding stores from both threads; the matching pair of stores of R1 to address (R2) is compared before being sent to the data cache]
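
The store comparison above can be sketched as a queue that holds each leading-thread store until the trailing thread produces its twin; only a matching <address, data> pair is allowed to leave the SoR. A Python sketch under assumed names (StoreComparator and the RuntimeError fault signal are illustrative, not from the talk):

```python
from collections import deque

class StoreComparator:
    """Hold leading-thread stores until the trailing thread's matching
    store arrives; only matching <address, data> pairs leave the SoR."""
    def __init__(self):
        self.pending = deque()   # leading-thread stores awaiting their twin
        self.committed = []      # stores released to the data cache

    def leading_store(self, addr, data):
        self.pending.append((addr, data))

    def trailing_store(self, addr, data):
        expected = self.pending.popleft()
        if expected != (addr, data):
            # stands in for raising the machine's fault-detected signal
            raise RuntimeError("transient fault: store mismatch")
        self.committed.append((addr, data))

sc = StoreComparator()
sc.leading_store(0x100, 42)
sc.trailing_store(0x100, 42)      # match: the store reaches the data cache
print(sc.committed)               # [(256, 42)]
```

A corrupted store in either thread shows up as a <address, data> mismatch at commit time, before the value can escape the SoR.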

  14. Input Replication in SRT
  Replicate & deliver the same input (coming from outside the SoR) to both redundant copies
  [Diagram: input replication sits at the SoR boundary, feeding identical inputs to the two execution copies]

  15. Input Replication
  • Cached load data
    • pair loads from redundant threads: too slow
    • allow both loads to probe the cache: false faults with I/O or multiprocessors
  • Load Value Queue (LVQ)
    • pre-designated leading & trailing threads
  [Diagram: the leading thread probes the cache for its load; the loaded value flows through the LVQ to the trailing thread’s matching load]
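
The Load Value Queue can be sketched as follows: only the leading thread probes the cache, and the trailing thread consumes the replicated <address, value> pair from the queue, so an intervening change to memory (I/O, another processor) cannot cause a false fault. A Python sketch with illustrative names (the dict stands in for the data cache):

```python
from collections import deque

class LoadValueQueue:
    """Replicate cached load data from the leading to the trailing thread."""
    def __init__(self, memory):
        self.memory = memory     # stand-in for the data cache
        self.lvq = deque()

    def leading_load(self, addr):
        value = self.memory[addr]        # only the leading thread probes the cache
        self.lvq.append((addr, value))   # replicate the input for the trailing thread
        return value

    def trailing_load(self, addr):
        lead_addr, value = self.lvq.popleft()
        if lead_addr != addr:            # address mismatch signals a fault
            raise RuntimeError("transient fault: load address mismatch")
        return value                     # same value the leading thread saw

mem = {0x10: 7}
lvq = LoadValueQueue(mem)
print(lvq.leading_load(0x10))    # 7
mem[0x10] = 99                   # e.g., I/O device updates memory in between
print(lvq.trailing_load(0x10))   # 7  (no false fault from the update)
```

Comparing the trailing thread's load address against the queued one also gives a free check on the address computation itself.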

  16. Input Replication (contd.)
  • Cached load data: alternate solution
    • Active Load Address Buffer
  • Special cases
    • cycle- or time-sensitive instructions
    • external interrupts

  17. Outline
  • Sphere of Replication (SoR)
  • Output comparison for SRT
  • Input replication for SRT
  • Performance optimizations for SRT
  • SRT outperforms on-chip replicated microprocessors
  • Related work
  • Summary

  18. Performance Optimizations
  • Slack fetch
    • maintain a constant slack of instructions between the leading and trailing threads
    + leading thread prefetches cache misses
    + leading thread prefetches correct branch outcomes
  • Branch Outcome Queue
    • feed branch outcomes from the leading thread to the trailing thread
  • Combine the above two
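
Slack fetch can be sketched as a fetch policy that prefers the leading thread until it is a fixed number of instructions ahead, then lets the trailing thread catch up. A Python sketch (the slack value of 3 and the one-fetch-per-cycle model are illustrative simplifications; the real mechanism sits in an SMT fetch stage):

```python
def fetch_schedule(program_len, slack=3):
    """Choose which thread fetches each cycle (one instruction per cycle),
    keeping the trailing thread about `slack` instructions behind."""
    lead = trail = 0
    schedule = []
    while trail < program_len:
        if lead < program_len and lead - trail < slack:
            lead += 1          # leading thread runs ahead, warming caches
            schedule.append("lead")
        else:
            trail += 1         # trailing thread follows at a distance
            schedule.append("trail")
    return schedule

# The leader opens up the slack first, then the two threads alternate.
print(fetch_schedule(5))
# ['lead', 'lead', 'lead', 'trail', 'lead', 'trail', 'lead', 'trail', 'trail', 'trail']
```

Because the trailing thread runs behind, the leading thread has usually already taken the cache misses and resolved the branches by the time the trailing thread reaches them.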

  19. Baseline Architecture Parameters
  • L1 instruction cache: 64K bytes, 4-way associative, 32-byte blocks, single ported
  • L1 data cache: 64K bytes, 4-way associative, 32-byte blocks, four read/write ports
  • Unified L2 cache: 1M bytes, 4-way associative, 64-byte blocks
  • Branch predictor: hybrid local/global (like 21264); 13-bit global history register indexing 8K-entry global PHT and 8K-entry choice table; 2K 11-bit local history registers indexing 2K-entry local PHT; 4K-entry BTB, 16-entry RAS (per thread)
  • Fetch/decode/issue/commit width: 8 instructions/cycle (fetch can span 3 basic blocks)
  • Functional units: 6 Int ALU, 2 Int Multiply, 4 FP Add, 2 FP Multiply
  • Fetch-to-decode latency: 5 cycles
  • Decode-to-execution latency: 10 cycles

  20. Target Architectures
  • SRT: SMT + fault detection
    • output comparison
    • input replication (Load Value Queue)
    • slack fetch + branch outcome queue
  • ORH-Dual: On-Chip Replicated Hardware
    • each pipeline of the dual has half the resources of SRT
    • the two pipelines share the fetch stage (including the branch predictor)

  21. Performance Model & Benchmarks
  • SimpleScalar 3.0
    • modified to support SMT by Steve Raasch, U. of Michigan
    • SMT/SimpleScalar further modified to support SRT
  • Benchmarks
    • compiled with gcc 2.6 + full optimization
    • subset of the SPEC95 suite (11 benchmarks)
    • skipped between 300 million and 20 billion instructions
    • simulated 200 million instructions for each benchmark

  22. SRT vs. ORH-Dual
  [Graph: per-benchmark performance of SRT relative to ORH-Dual]
  Average improvement = 16%, maximum = 29%

  23. Recent Related Work
  • Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998
    + first to propose the use of SMT for fault detection
  • AR-SMT, Rotenberg, FTCS, 1999
    + forwards values from the leading thread to the checker thread
  • DIVA, Austin, MICRO, 1999
    + converts the checker thread into a simple processor

  24. Improvements over Prior Work
  • Sphere of Replication (SoR)
    • e.g., AR-SMT’s register file must be augmented with ECC
    • e.g., DIVA must handle uncached loads in a special way
  • Output comparison
    • e.g., AR-SMT & DIVA compare all instructions; SRT compares selected ones based on the SoR
  • Input replication
    • e.g., AR-SMT & DIVA can detect false transient faults; SRT avoids this problem using the LVQ
  • Slack fetch

  25. Summary
  • Simultaneous & Redundantly Threaded Processor (SRT) = SMT + fault detection
  • Sphere of replication
  • Output comparison of committed store instructions
  • Input replication via the load value queue
  • Slack fetch & branch outcome queue
  • SRT outperforms equivalently-sized on-chip replicated hardware by 16% on average & up to 29%
