CS 7810 Lecture 25

DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. T. Austin, Proceedings of MICRO-32, November 1999.

Presentation Transcript


  1. CS 7810 Lecture 25. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. T. Austin, Proceedings of MICRO-32, November 1999.

  2. Redundancy
     • If a processor’s output is error-prone, reliability can be provided with redundancy
     [Diagram labels: Input Program; Primary Core; Checker Core; Verify & Commit]

  3. Redundancy
     • If a processor’s output is error-prone, reliability can be provided with redundancy
     • One checker can detect errors. For recovery, we may need another checker or some other form of redundancy.
     [Diagram labels: Input Program; Primary Core; Checker Core; Checker Core; Verify & Commit]
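The difference between detection and recovery can be sketched in a few lines. This is an illustrative model only, not taken from the slides: the function names `detect` and `vote` are hypothetical, and each "core" is reduced to the single value it produced.

```python
from collections import Counter

def detect(primary, checker):
    """Two copies: a mismatch reveals an error but not which copy is wrong."""
    return primary != checker

def vote(results):
    """Three or more copies: return the majority value, or None if no majority exists."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None

# The primary core produced a corrupted result (13 instead of 12).
print(detect(13, 12))      # True: an error is detected, but 13 vs 12 is ambiguous
print(vote([13, 12, 12]))  # 12: a second checker breaks the tie
```

With only two copies, the system knows that something went wrong but needs extra information (a third copy, or protected state that is assumed correct) to decide which value to keep.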

  4. Why Redundancy?
     • Soft Errors: A high-energy particle can strike a device and deposit enough charge to flip the value
       • Cosmic rays
       • Alpha particles
     [Diagram labels: Input Program; Primary Core; Checker Core; Verify & Commit]

  5. Why Redundancy?
     • Soft Errors: voltage spikes or noise
       • Crosstalk
       • di/dt
       • Lower voltages
     [Diagram labels: Input Program; Primary Core; Checker Core; Verify & Commit]

  6. Why Redundancy?
     • Allows unverified or aggressively clocked primary cores
       • Functionally incorrect core: some corner case slips through
       • Electrically incorrect core: high temperature causes a circuit to not meet the timing constraint
     [Diagram labels: Input Program; Primary Core; Checker Core; Verify & Commit]

  7. DIVA Microarchitecture
     [Diagram labels: BPred; I-$; Dec/Ren; IQ; Rename Regs; ALU; D-$; Arch Regs]
     Example instruction: LR3 + LR7 → LR15, with values 4 + 8 = 12
     • Storage Check: Read LR3 and LR7 from Arch Regs and confirm they equal 4 and 8
     • ALU Check: Add 4+8 and confirm it equals 12
     • If both checks succeed, write 12 into LR15
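The two checks on the example instruction can be sketched as follows. This is a minimal model under stated assumptions: the function `diva_check_add` and its parameters are hypothetical names, the architected register file is a plain dictionary, and only an add instruction is handled.

```python
def diva_check_add(arch_regs, src1, src2, dst, in1, in2, result):
    """Check one committed add as on the slide, then commit it.

    arch_regs: architected register file (dict: register name -> value)
    in1, in2:  operand values the primary core claims it used
    result:    result the primary core claims it computed
    Returns True if both checks pass and the result was written.
    """
    # Storage check: the claimed operands must match architected state.
    storage_ok = arch_regs[src1] == in1 and arch_regs[src2] == in2
    # ALU check: recompute the operation from the claimed operands.
    alu_ok = in1 + in2 == result
    if storage_ok and alu_ok:
        arch_regs[dst] = result
        return True
    return False

regs = {"LR3": 4, "LR7": 8, "LR15": 0}
print(diva_check_add(regs, "LR3", "LR7", "LR15", 4, 8, 12))  # True, LR15 becomes 12
print(diva_check_add(regs, "LR3", "LR7", "LR15", 4, 8, 13))  # False, ALU check fails
```

Because the checker receives the operand values along with the result, neither check has to wait on an earlier instruction, which is what lets the two checks be modelled as independent tests here.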

  8. Microarchitecture Details
     • Instructions are fed to checker in order during commit
     • The logic and storage checks detect errors in ALUs and datapath
     • The checker core is a simple in-order pipeline: easy to design and verify
     • An error in an earlier stage (LR3 instead of LR2) can be detected by also adding a ren/decode stage to the checker
     • In-order core has no stalls (need bypass for register file): no data dependences, cache misses, branch mispredicts
     • Contention for register file and data cache can degrade primary thread

  9. Recovery
     • The architected register file and data cache are ECC-protected: when an error is detected, it is assumed that checker and architected state are correct
     • Primary core is re-started from faulting instruction
     • A fault in the primary core may result in deadlock: e.g. instruction that produces R5 is waiting for R5 to be produced (instead of R4)
     • A timeout in the checker signals an error
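The timeout in the last bullet behaves like a watchdog timer. The sketch below is an assumption about how such a timer could be modelled, not the DIVA design itself: `first_timeout`, its parameters, and the cycle numbers are all hypothetical.

```python
def first_timeout(arrival_cycles, timeout, end_cycle):
    """Return the cycle at which the checker's watchdog fires, or None.

    arrival_cycles: sorted cycles at which the primary core hands an
                    instruction to the checker
    timeout:        cycles the checker waits before declaring an error
    end_cycle:      last simulated cycle
    """
    last = 0
    for cycle in arrival_cycles:
        if cycle - last >= timeout:
            return last + timeout
        last = cycle
    if end_cycle - last >= timeout:
        return last + timeout
    return None

# The primary core deadlocks after cycle 3 and delivers nothing further.
print(first_timeout([1, 2, 3], timeout=10, end_cycle=50))  # 13
```

Once the watchdog fires, recovery proceeds as for any other detected error: the primary core is restarted from the instruction the checker was waiting for.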

  10. Redundant Multi-Threading
     • Execute two threads in parallel (CMP or SMT); each thread maintains its own register state
     • Threads execute as in a conventional processor, except
       • trailing thread commits after verifying result
       • leading thread commits stores to a buffer; these get written to cache/memory only after verification
       • load values of the leading thread are sent to trailing thread, so trailing thread never accesses data cache
       • branch outcomes are also sent to trailing thread
     [Diagram: Leading Thread sends reg results, load values, branch outcomes, and store values to Trailing Thread]
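The leading/trailing protocol can be sketched with a toy three-operation instruction set. Everything here is an illustrative assumption: the program, the names `run` and `rmt`, and the `flip_add` fault injection are hypothetical, branch outcomes are omitted, and the two threads run one after the other instead of in parallel.

```python
from collections import deque

PROGRAM = [
    ("load", "r1", 0),          # r1 = mem[0]
    ("load", "r2", 1),          # r2 = mem[1]
    ("add", "r3", "r1", "r2"),  # r3 = r1 + r2
    ("store", 2, "r3"),         # mem[2] = r3
]

def run(program, load_value, flip_add=False):
    """Execute with private registers; return one result record per instruction.

    load_value(addr) supplies load data. flip_add injects a fault into adds.
    """
    regs, results = {}, []
    for ins in program:
        if ins[0] == "load":
            _, dst, addr = ins
            regs[dst] = load_value(addr)
            results.append(("reg", dst, regs[dst]))
        elif ins[0] == "add":
            _, dst, a, b = ins
            regs[dst] = regs[a] + regs[b] + (1 if flip_add else 0)
            results.append(("reg", dst, regs[dst]))
        else:  # store: buffered, not yet written to memory
            _, addr, src = ins
            results.append(("store", addr, regs[src]))
    return results

def rmt(memory, fault=False):
    """Return True if every result was verified and the stores were released."""
    load_queue = deque()

    def leading_load(addr):
        load_queue.append(memory[addr])  # forward the load value
        return memory[addr]

    leading = run(PROGRAM, leading_load, flip_add=fault)
    # The trailing thread takes loads from the queue, never from memory.
    trailing = run(PROGRAM, lambda addr: load_queue.popleft())
    if leading != trailing:
        return False                     # mismatch: buffered stores are dropped
    for kind, addr, value in leading:    # verified: drain the store buffer
        if kind == "store":
            memory[addr] = value
    return True

mem = [4, 8, 0]
print(rmt(mem), mem)              # True [4, 8, 12]
mem = [4, 8, 0]
print(rmt(mem, fault=True), mem)  # False [4, 8, 0]
```

In the faulty run the corrupted sum never reaches memory, because the store stays in the buffer until the trailing thread has reproduced the same value.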

  11. Fault Model
     • A single error in either core can be detected
     • Since loads are not replicated, the load/store datapath must be ECC-protected
     • For recovery, a second checker thread is required
     • ECC in the checker register file will enable recovery in most cases without a second checker

  12. RMT on SMT/CMP
     + SMT does not require inter-core traffic: values can be read from shared register file/data cache
     - Single thread performance may be degraded
     - Each redundant instr executes on high-power pipeline
     + Trailing CMP core can be a simple in-order processor → low power/area overheads
     + Trailing core’s frequency can be independently controlled
     + Heterogeneous CMP where cores can be dynamically employed for throughput/reliability
     + Lower probability for errors

  13. Parallelization of Trailing Thread
     [Diagram labels: Sequential Thread; Parallel Thread 1; Parallel Thread 2; Parallel Thread 3; Parallel Thread 4]
     • Is it more power-efficient to execute the verification thread in parallel?

  14. Parallelization of Trailing Thread
     [Diagram labels: Sequential Thread; Parallel Thread 1; Parallel Thread 2; Parallel Thread 3; Parallel Thread 4]
     • If the trailing cores are frequency-scaled, dynamic power does not change, but leakage power increases
     • If the trailing cores are frequency-and-voltage scaled, dynamic power decreases, and leakage power increases
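The dynamic-power claims follow from the usual CMOS switching-power relation P = C·V²·f. The sketch below uses normalized units and hypothetical numbers (four trailing cores, a 0.8 voltage factor); the leakage model, a fixed amount per powered core at fixed voltage, is a simplifying assumption and ignores how leakage changes with voltage.

```python
def dynamic_power(c_eff, vdd, freq, cores=1):
    """Total switching power of `cores` identical cores: cores * C * V^2 * f."""
    return cores * c_eff * vdd ** 2 * freq

def leakage_power(per_core, cores=1):
    """Assumed model: every powered core leaks the same fixed amount."""
    return per_core * cores

# One full-speed verification core (normalized C = V = f = 1).
base = dynamic_power(1.0, 1.0, 1.0)
# Four cores, each at a quarter of the frequency, same voltage.
freq_scaled = dynamic_power(1.0, 1.0, 1.0 / 4, cores=4)
# Four cores at a quarter of the frequency and 0.8x the voltage.
dvfs_scaled = dynamic_power(1.0, 0.8, 1.0 / 4, cores=4)

print(base, freq_scaled)           # equal: 4 * (f/4) cancels out
print(dvfs_scaled < base)          # True: the V^2 term shrinks dynamic power
print(leakage_power(0.1, 4) > leakage_power(0.1, 1))  # True: more cores leak more
```

Splitting the work across N cores at f/N leaves the product N·f unchanged, so only the voltage term can reduce dynamic power, while the number of leaking cores grows with N.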

  15. Error Types

  16. Acronyms!!
     • MTTF & MTBF: Mean time to/between failures
     • Errors are either SDC (silent data corruption) or DUE (detected unrecoverable errors)
     • Many errors get masked:
       • ACE bits: these bits are required for architecturally correct execution
       • un-ACE bits: these bits do not affect the final output
     • AVF: architectural vulnerability factor (the percentage of time/space that a structure holds ACE state)
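The AVF definition above can be read as a ratio of bit-cycles. The following sketch assumes that reading; the function name `avf`, the 64-bit structure, and the per-cycle ACE counts are hypothetical values chosen for illustration.

```python
def avf(ace_bits_per_cycle, total_bits):
    """Fraction of a structure's bit-cycles that hold ACE state.

    ace_bits_per_cycle: number of ACE bits resident in each observed cycle
    total_bits:         capacity of the structure in bits
    """
    cycles = len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / (total_bits * cycles)

# A 64-bit structure observed for 4 cycles.
print(avf([64, 32, 0, 32], total_bits=64))  # 0.5
```

An AVF of 0.5 means that, on average, half of the bits in the structure matter to the final output at any given time, so roughly half of the raw bit flips in it would be masked.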

  17. Partial Coverage
     • RMT covers faults in the entire core (almost!)
     • If that is too expensive, provide error coverage in specific structures to reduce error probabilities
     • Are there ways to ensure that an instruction spends less time in architecturally vulnerable structures?
