
Redundant Multithreading Techniques for Transient Fault Detection



Presentation Transcript


  1. Redundant Multithreading Techniques for Transient Fault Detection Shubu Mukherjee Michael Kontz Steve Reinhardt Intel HP (current) Intel Consultant, U. of Michigan Versions of this work have been presented at ISCA 2000 and ISCA 2002

  2. Transient Faults from Cosmic Rays & Alpha Particles • + decreasing feature size • – decreasing voltage (exponential dependence?) • – increasing number of transistors (Moore’s Law) • – increasing system size (number of processors) • – no practical absorbent for cosmic rays

  3. Fault Detection via Lockstepping (HP Himalaya) • Replicated microprocessors + cycle-by-cycle lockstepping • [Figure: two microprocessors executing R1 ← (R2) in lockstep, with input replication and output comparison at the boundary; memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC]

  4. Fault Detection via Simultaneous Multithreading • Replicated threads instead of replicated microprocessors with cycle-by-cycle lockstepping • [Figure: two threads executing R1 ← (R2) on one processor, with input replication and output comparison at the boundary; memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC]

  5. Simultaneous Multithreading (SMT) • [Figure: Thread1 and Thread2 sharing the instruction scheduler and functional units] • Example: Alpha 21464, Intel Northwood

  6. Redundant Multithreading (RMT) RMT = Multithreading + Fault Detection

  7. Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work

  8. Overview • SRT = SMT + Fault Detection • Advantages • Piggyback on an SMT processor with little extra hardware • Better performance than complete replication • Lower cost due to market volume of SMT & SRT • Challenges • Lockstepping very difficult with SRT • Must carefully fetch/schedule instructions from redundant threads

  9. Sphere of Replication • Two copies of each architecturally visible thread • Co-scheduled on SMT core • Compare results: signal fault if different • [Figure: leading thread and trailing thread inside the sphere of replication; input replication and output comparison at the boundary; memory system (incl. L1 caches) outside]
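The sphere-of-replication idea can be sketched in a few lines of Python: both copies of a thread execute redundantly, and only values crossing the sphere boundary are compared. All names here are illustrative, not from the actual design.

```python
# Toy model of the sphere of replication: two redundant copies of a thread
# run the same program; a transient fault inside the sphere is caught by
# comparing the values that leave it.

def run_thread(program, flip_at=None):
    """Execute a toy 'program' (a list of output values); optionally
    corrupt one value to model a single-bit transient fault."""
    out = []
    for i, v in enumerate(program):
        if i == flip_at:
            v ^= 1  # single-event upset flips a bit
        out.append(v)
    return out

def compare_outputs(leading, trailing):
    """Output comparison at the sphere boundary: fault iff results differ."""
    return leading == trailing

prog = [3, 7, 42]
assert compare_outputs(run_thread(prog), run_thread(prog))         # fault-free
assert not compare_outputs(run_thread(prog), run_thread(prog, 1))  # detected
```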

  10. Basic Pipeline • [Figure: Fetch → Decode → Dispatch → Execute → Commit, with the data cache attached to Execute]

  11. Load Value Queue (LVQ) • Keep threads on same path despite I/O or MP writes • Out-of-order load issue possible • [Figure: pipeline with the LVQ between Execute and the data cache]
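The LVQ mechanism can be sketched as a simple FIFO: the leading thread performs the real cache access and enqueues its result, and the trailing thread consumes the queued value instead of re-reading memory. This is a minimal sketch with hypothetical names, not the hardware interface.

```python
from collections import deque

class LoadValueQueue:
    """Toy LVQ: forwards load values from the leading to the trailing
    thread so both see identical data despite intervening writes."""

    def __init__(self):
        self.q = deque()

    def leading_load(self, memory, addr):
        value = memory[addr]          # real access by the leading thread
        self.q.append((addr, value))  # forward to the trailing thread
        return value

    def trailing_load(self, addr):
        lead_addr, value = self.q.popleft()
        assert lead_addr == addr      # threads must stay on the same path
        return value

mem = {0x100: 5}
lvq = LoadValueQueue()
v1 = lvq.leading_load(mem, 0x100)
mem[0x100] = 9                        # intervening I/O or MP write
v2 = lvq.trailing_load(0x100)
assert v1 == v2 == 5                  # trailing thread still sees the old value
```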

  12. Store Queue Comparator (STQ) • Compares outputs to data cache • Catch faults before propagating to rest of system • [Figure: pipeline with the STQ in front of the data cache]

  13. Store Queue Comparator (cont’d) • [Figure: leading-thread store queue holding entries such as st 5 → [0x120]; compare address & data against the matching trailing-thread store before sending to the data cache] • Extends residence time of leading-thread stores • Size constrained by cycle-time goal • Base CPU statically partitions single queue among threads • Potential solution: per-thread store queues • Deadlock if matching trailing store cannot commit • Several small but crucial changes to avoid this
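The comparator's behavior can be sketched as follows: a leading-thread store waits in the queue until its trailing-thread counterpart arrives, and only a matching address/data pair is released to the cache. A hypothetical sketch, not the actual queue logic:

```python
from collections import deque

class StoreQueueComparator:
    """Toy STQ comparator: stores leave the sphere of replication only
    after address & data match between the two redundant threads."""

    def __init__(self):
        self.pending = deque()   # leading-thread stores awaiting their match
        self.cache = {}

    def leading_store(self, addr, data):
        self.pending.append((addr, data))   # extends residence time

    def trailing_store(self, addr, data):
        lead = self.pending.popleft()
        if lead != (addr, data):
            raise RuntimeError("transient fault detected before cache update")
        self.cache[addr] = data  # fault-free: store reaches the data cache

sqc = StoreQueueComparator()
sqc.leading_store(0x120, 5)
sqc.trailing_store(0x120, 5)
assert sqc.cache[0x120] == 5
```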

  14. Branch Outcome Queue (BOQ) • Forward leading-thread branch targets to trailing fetch • 100% prediction accuracy in absence of faults • [Figure: pipeline with the BOQ feeding branch outcomes from Commit back to Fetch]
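The BOQ amounts to another FIFO, this time from leading-thread commit back to trailing-thread fetch; because the outcomes are real rather than predicted, the trailing thread never mispredicts in the fault-free case. A minimal sketch with illustrative names:

```python
from collections import deque

class BranchOutcomeQueue:
    """Toy BOQ: resolved leading-thread branch outcomes become perfect
    'predictions' for trailing-thread fetch."""

    def __init__(self):
        self.q = deque()

    def leading_resolve(self, taken, target):
        self.q.append((taken, target))

    def trailing_predict(self):
        # In the absence of faults this prediction is always correct.
        return self.q.popleft()

boq = BranchOutcomeQueue()
boq.leading_resolve(True, 0x280)
assert boq.trailing_predict() == (True, 0x280)
```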

  15. Line Prediction Queue (LPQ) • Alpha 21464 fetches chunks using line predictions • Chunk = contiguous block of 8 instructions • [Figure: pipeline with the LPQ feeding line predictions from Commit back to Fetch]

  16. Line Prediction Queue (cont’d) • Generate stream of “chunked” line predictions • Every leading-thread instruction carries its I-cache coordinates • Commit logic merges into fetch chunks for LPQ • Independent of leading-thread fetch chunks • Commit-to-fetch dependence raised deadlock issues • [Figure: instruction stream 1F8: add, 1FC: load R1 ← (R2), 200: beq 280, 204: and, 208: bne 200, 280: add; chunk 1 ends at the end of a cache line, chunk 2 ends at a taken branch]
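The chunk-merging step described above can be sketched in Python. This is a simplified reconstruction under stated assumptions: instructions arrive in commit order tagged with their I-cache line, and a chunk closes at a cache-line boundary, at a taken branch, or after 8 instructions.

```python
CHUNK = 8  # max contiguous instructions per fetch chunk

def merge_into_chunks(committed):
    """Toy version of the commit logic that merges committed leading-thread
    instructions into fetch chunks for the LPQ.

    committed: list of (line, offset, taken_branch) in commit order."""
    chunks, cur = [], []
    for line, offset, taken in committed:
        # Close the chunk at a cache-line boundary or at the size limit.
        if cur and (line != cur[-1][0] or len(cur) == CHUNK):
            chunks.append(cur)
            cur = []
        cur.append((line, offset))
        if taken:                # a taken branch also ends the chunk
            chunks.append(cur)
            cur = []
    if cur:
        chunks.append(cur)
    return chunks

# Mirrors the slide's example: 1F8/1FC end a cache line, the beq at 200 is
# taken to 280, so three chunks result.
seq = [(0x1C0, 6, False), (0x1C0, 7, False), (0x200, 0, True), (0x280, 0, False)]
assert len(merge_into_chunks(seq)) == 3
```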

  17. Line Prediction Queue (cont’d) • Read-out on trailing-thread fetch also complex • Base CPU “thread chooser” gets multiple line predictions, ignores all but one • Fetches must be retried on I-cache miss • Tricky to keep queue in sync with thread progress • Add handshake to advance queue head • Roll back head on I-cache miss • Track both last attempted & last successful chunks

  18. Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work

  19. Preferential Space Redundancy • SRT combines two types of redundancy • Time: same physical resource, different time • Space: different physical resource • Space redundancy preferable • Better coverage of permanent/long-duration faults • Bias towards space redundancy where possible

  20. PSR Example: Clustered Execution • Base CPU has two execution clusters • Separate instruction queues, function units • Steered in dispatch stage • [Figure: add r1,r2,r3 flowing through Fetch/Decode/Dispatch into IQ 0/Exec 0 or IQ 1/Exec 1, with the LPQ feeding Fetch]

  21. PSR Example: Clustered Execution • Leading-thread instructions record their cluster • Bit carried with fetch chunk through LPQ • Attached to trailing-thread instruction • Dispatch sends to opposite cluster if possible • [Figure: add r1,r2,r3 [0] tagged with cluster bit 0, carried through the LPQ]

  22. PSR Example: Clustered Execution • 99.94% of instruction pairs use different clusters • Full spatial redundancy for execution • No performance impact (occasional slight gain) • [Figure: trailing-thread add r1,r2,r3 [0] dispatched to the opposite cluster, Exec 1]
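The steering policy on these slides reduces to a one-line preference: send the trailing copy to the opposite cluster when it has room, otherwise fall back to the leading copy's cluster (time redundancy). A hypothetical sketch of that dispatch decision:

```python
def steer(leading_cluster, free_slots):
    """Toy PSR dispatch: prefer the cluster the leading copy did NOT use.

    leading_cluster: 0 or 1, recorded by the leading-thread instruction.
    free_slots: dict mapping cluster -> available issue-queue entries."""
    preferred = 1 - leading_cluster    # opposite cluster = space redundancy
    if free_slots[preferred] > 0:
        return preferred
    return leading_cluster             # fall back to time redundancy

assert steer(0, {0: 4, 1: 4}) == 1     # normally use the other cluster
assert steer(0, {0: 4, 1: 0}) == 0     # fall back only when it is full
```

Under this policy, space redundancy dominates whenever the opposite cluster is not saturated, which matches the slide's observation that 99.94% of instruction pairs execute on different clusters.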

  23. Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work

  24. SRT Evaluation • Used SPEC CPU95, 15M instrs/thread • Constrained by simulation environment • 120M instrs for 4 redundant thread pairs • Eight-issue, four-context SMT CPU • 128-entry instruction queue • 64-entry load and store queues • Default: statically partitioned among active threads • 22-stage pipeline • 64KB 2-way assoc. L1 caches • 3 MB 8-way assoc L2

  25. SRT Performance: One Thread • One logical thread → two hardware contexts • Performance degradation = 30% • Per-thread store queue buys extra 4%

  26. SRT Performance: Two Threads • Two logical threads → four hardware contexts • Average slowdown increases to 40% • Only 32% with per-thread store queues

  27. Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work

  28. Chip-Level Redundant Threading • SRT typically more efficient than splitting one processor into two half-size CPUs • What if you already have two CPUs? • IBM Power4, HP PA-8800 (Mako) • Conceptually easy to run these in lock-step • Benefit: full physical redundancy • Costs: • Latency through centralized checker logic • Overheads (misspeculation etc.) incurred twice • CRT combines best of SRT & lockstepping • requires multithreaded CMP cores
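CRT's cross-core arrangement can be illustrated with a toy model: each core runs the leading copy of one logical thread and the trailing (checking) copy of the other, so the forwarding queues cross between cores. All class and method names here are illustrative only.

```python
from collections import deque

class Core:
    """Toy CMP core for a CRT sketch: holds a queue of leading-thread
    results to be checked by the trailing copy on the partner core."""

    def __init__(self, name):
        self.name = name
        self.to_partner = deque()    # results crossing to the other core

    def lead_execute(self, value):
        self.to_partner.append(value)
        return value

    def trail_check(self, partner, value):
        # Trailing copy compares its result against the partner core's.
        return partner.to_partner.popleft() == value

a, b = Core("A"), Core("B")
a.lead_execute(10)            # leading thread A runs on core A
b.lead_execute(20)            # leading thread B runs on core B
assert b.trail_check(a, 10)   # trailing thread A on core B checks core A
assert a.trail_check(b, 20)   # trailing thread B on core A checks core B
```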

  29. Chip-Level Redundant Threading • [Figure: CPU A runs leading thread A and trailing thread B; CPU B runs leading thread B and trailing thread A; the LVQ, LPQ, and store comparisons for each logical thread cross between the two cores]

  30. CRT Performance • With per-thread store queues, ~13% improvement over lockstepping with 8-cycle checker latency

  31. Summary & Conclusions • SRT is applicable in a real-world SMT design • ~30% slowdown, slightly worse with two threads • Store queue capacity can limit performance • Preferential space redundancy improves coverage • Chip-level Redundant Threading = SRT for CMPs • Looser synchronization than lockstepping • Free up resources for other application threads

  32. More Information • Publications • S.K. Reinhardt & S.S. Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” International Symposium on Computer Architecture (ISCA), 2000 • S.S. Mukherjee, M. Kontz, & S.K. Reinhardt, “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” International Symposium on Computer Architecture (ISCA), 2002 • Papers available from: • http://www.cs.wisc.edu/~shubu • http://www.eecs.umich.edu/~stever • Patents • Compaq/HP filed eight patent applications on SRT
