Encore: Low-Cost, Fine-Grained Transient Fault Recovery

Authors: Shuguang Feng*, Shantanu Gupta, Amin Ansari, Scott Mahlke, David August (University of Michigan). *Currently with Northrop Grumman, Information Systems Sector.

Presentation Transcript


  1. Encore: Low-Cost, Fine-Grained Transient Fault Recovery
     • Authors: Shuguang Feng*, Shantanu Gupta, Amin Ansari, Scott Mahlke, David August
     • University of Michigan
     • *Currently with Northrop Grumman, Information Systems Sector

  2. “Failure to prepare is preparing to fail…” (Benjamin Franklin)
     • …many ways to fail
     • The distinction between a transient and a permanent fault is becoming blurred
     • Many permanent faults, particularly wearout-induced faults, initially manifest as timing errors
     • [Figure: a spectrum from transient (“soft”) faults to permanent (“hard”) faults, labeled Rare, Periodic, and Continuous. Fault sources shown: cosmic radiation, packaging impurities, PVT variation, NTC computing, electromigration, oxide breakdown, negative bias temperature instability. Citations on the figure: [Dreslinski`10], [Gupta`09].]

  3. The Future of Soft Errors
     • [Figure: soft-error rate across Past, Present, and Future, rising from one failure per MONTH per 100 chips, through one failure per DAY per 100 chips, to one failure per DAY per chip under aggressive voltage scaling (near-threshold computing).]

  4. Realizing a Reliability “Pipeline”
     • [Figure: two pipelines. Vulnerable Computation → Detection → Diagnosis → Repair → Recovery → Reliable Output, and Vulnerable Computation → Detection → Recovery → Reliable Output.]
     • Recent interest in low-cost fault detection
       • ReStore [DSN`05]
       • SWAT [ASPLOS`08]
       • Shoestring [ASPLOS`10]
       • Not perfect…but very low-cost
     • Recovery generally involves some form of rollback/re-execution
       • 1) Identify fault site
       • 2) Restore processor to pre-fault state, before 1)
       • 3) Resume execution from 1)
     • Many low-cost detection techniques rely on hardware speculation support
     • Commodity systems present both challenges and opportunities
       • Challenge: HW speculation support (if it exists) is limited
       • Challenge: Cannot afford expensive, heavyweight SW checkpointing
       • Opportunity: Typically not running mission-critical applications
         • Sacrifice a small degree of reliability
         • Exploit (probabilistic) idempotence in program execution

  5. The Role of Idempotence
     • Mathematical definition: an operation that can be applied multiple times without changing the result
     • Computer science definition: a region of code without any exposed write-after-read (WAR, anti-) dependencies
     • Idempotent code regions can be safely re-executed without additional checkpointing
     • [Figure: an idempotent region (X = …; … = X; … = X) beside a non-idempotent region that contains X++.]
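
To make the two definitions on this slide concrete, here is a minimal Python sketch. It is not code from the presentation: the dictionary-based `state` and the helper `reexecute` are assumptions made purely for illustration. The first region writes `x` before reading it, so running it twice gives the same result as running it once. The second region reads `x` and then overwrites it (the slide's `X++`), which is an exposed write-after-read dependency.

```python
def idempotent_region(state):
    # x is written before it is read, so the region never depends on
    # a value that it later overwrites.
    state["x"] = 5
    state["y"] = state["x"] + 1
    return state


def non_idempotent_region(state):
    # x is read and then overwritten (x++): an exposed WAR dependency.
    state["x"] = state["x"] + 1
    return state


def reexecute(region, state, times):
    # Hypothetical helper: run the same region `times` times in a row,
    # mimicking a rollback to the region entry followed by re-execution.
    for _ in range(times):
        region(state)
    return state


once = reexecute(idempotent_region, {"x": 0, "y": 0}, 1)
twice = reexecute(idempotent_region, {"x": 0, "y": 0}, 2)
drift_once = reexecute(non_idempotent_region, {"x": 0}, 1)
drift_twice = reexecute(non_idempotent_region, {"x": 0}, 2)

print(once == twice)              # the idempotent region is unaffected
print(drift_once == drift_twice)  # the non-idempotent region is not
```

Re-executing the second region without first restoring `x` double-counts the increment, which is why such a region needs a checkpoint of `x` before it can be replayed.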

  6. Does Idempotence Exist?
     • Selectively checkpointing a *few* offending stores

  7. Challenges to Exploiting Idempotence
     • Must identify where to resume execution
       • Control flow
       • Rollback distance
     • Statically identifying optimal rollback distance is inherently intractable
       • ↑ rollback dist. → ↑ Pr(recoverable)
       • ↓ rollback dist. → ↑ Pr(idempotent)
     • Simplifying engineering solution based on single-entry, multiple-exit (SEME) regions
     • [Figure: control-flow graph of basic blocks bb 1 through bb 7 with one execution path highlighted.]

  8. Encore Vision
     • Source Code
     • Code Partitioning (CFG-based)
     • Idempotence Analysis (per region): regions are classified as idempotent or non-idempotent
     • Instrumentation (per region): checkpoint (Chkpt X) and recovery code are added
     • Runtime Behavior (post-fault): Fault Detected → Redirect Control → Recovery → Restore State

  9. Identifying Idempotence (High-level)
     • With respect to a point, p, in the CFG…
       • Reachable Stores (RS): stores that may execute after p
       • Guarded Addresses (GA): addresses that are guaranteed to be overwritten before reaching p
       • Exposed Addresses (EA): addresses that may be referenced by an unguarded load prior to p
     • Idempotent IFF EA ∩ RS = ∅
     • Additional details…
       • 1) Applies to both memory and registers
         • Static, conservative alias analysis
       • 2) Scalable hierarchical analysis
         • Handles cyclic code
     • [Figure: control-flow graphs of basic blocks bb 1 through bb 8.]
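
The following Python sketch illustrates the idea behind the EA ∩ RS test on a single straight-line region. It is a deliberate simplification and not the analysis described on the slide: it assumes no branches, no loops, and no aliasing, and the function name `offending_stores` and the `(op, addr)` tuple encoding are hypothetical. An address is treated as guarded once it has been stored to before any load of it, as exposed once it has been loaded while unguarded, and a store to an exposed address is reported as offending.

```python
def offending_stores(region):
    """Return indices of stores that overwrite an exposed address.

    `region` is a list of (op, addr) tuples in program order, where
    op is "load" or "store". Straight-line code only (an assumption).
    """
    guarded = set()   # addresses written before any unguarded read
    exposed = set()   # addresses read before being written in the region
    offending = []

    for index, (op, addr) in enumerate(region):
        if op == "load":
            if addr not in guarded:
                exposed.add(addr)
        elif op == "store":
            if addr in exposed:
                # Write-after-read on an exposed address: re-execution
                # would read the new value instead of the original one.
                offending.append(index)
            else:
                guarded.add(addr)
        else:
            raise ValueError(f"unknown operation: {op!r}")

    return offending


write_first = [("store", "X"), ("load", "X"), ("load", "X")]
read_then_write = [("load", "X"), ("store", "X")]

print(offending_stores(write_first))      # no offending stores
print(offending_stores(read_then_write))  # the store at index 1
```

A region with an empty result is idempotent under these assumptions. A region with a short result is a candidate for checkpointing only the reported stores.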

  10. Code Instrumentation
     • Encore heuristics
       • 1) Selectively prune dynamically-dead code
         • ↓ offending stores → ↑ Pr(idempotent)
       • 2) Selectively fuse adjacent regions
         • ↑ region size → ↑ Pr(recoverable)
       • 3) Selectively instrument profitable regions
     • Live-in checkpointing: Save R1, Save R2, …, Save Rn
     • “On-demand” checkpointing: MemCopy B, Save Address[B]
     • Recovery code (upon fault detection): Restore R1, Restore R2, …, Restore Rn, *Restore B
     • [Figure: an instrumented control-flow graph (bb 0 through bb 8, plus recovery block bb r) containing the operations 1: Store A, 2: Store B, 3: Store C, 4: Load A, 5: Store C, 6: Load B, 7: Load B, 8: Load C, 9: Store A, 10: Store B, 11: Load C, 12: Store C.]

  11. Lightweight Checkpointing
     • Stack grows dynamically to accommodate checkpoint storage
     • Cost annotations on the figure: 1 reg2mem store, 1 mem2mem copy, 1 stack ptr increment
     • [Figure: stack layout. Traditional call stack: Input Parameters, Return Address, Frame Pointer, Local Variables, Stack Pointer. Encore extensions: Live-in Registers, followed by checkpoint entries addr_0/data_0, addr_1/data_1, …, addr_N/data_N.]
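
As a rough illustration of the address/data entries on this slide, the sketch below models checkpoint storage as a growable stack of (address, old value) pairs. It is an assumption-laden toy and not Encore's implementation: the class `CheckpointLog`, the dictionary standing in for memory, and the function `region` are all hypothetical names. The old value is saved immediately before an offending store, and recovery pops the entries in reverse order so that the oldest saved value is the one that survives.

```python
class CheckpointLog:
    """Growable stack of (address, old value) pairs, newest on top."""

    def __init__(self):
        self.entries = []

    def save(self, memory, addr):
        # Record the address and its current contents before overwriting.
        self.entries.append((addr, memory[addr]))

    def restore(self, memory):
        # Undo in reverse order so the earliest saved value wins.
        while self.entries:
            addr, old_value = self.entries.pop()
            memory[addr] = old_value


def region(memory, log):
    # Hypothetical non-idempotent region: B is read and then overwritten.
    log.save(memory, "B")            # checkpoint before the offending store
    memory["B"] = memory["B"] + 1


memory = {"B": 7}
log = CheckpointLog()

region(memory, log)    # first attempt; suppose a fault is detected here
log.restore(memory)    # roll the region's live-in state back
region(memory, log)    # re-execute from the region entry

print(memory["B"])     # the increment is applied exactly once
```

Without the `save` and `restore` calls, the second execution of `region` would start from the already incremented value and apply the increment twice.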

  12. Evaluation Methodology
     • Program analysis/instrumentation performed in the LLVM compiler
     • In-order, single-issue, embedded-class processor
       • Dynamic instruction model based on profiled execution
     • Reliability coverage
       • Analytical model in lieu of traditional fault injection
       • Decouples evaluation from microarchitectural details

  13. Inherent Idempotence
     • 76% of application code is naturally idempotent
     • [Figure legend: 0% (dynamically-dead), <5%, <10%]

  14. Dynamic Execution Breakdown
     • 91% of execution time is spent within recoverable regions
     • Impact of detection latency
       • If control has left the region containing the original fault site, re-execution cannot correct the error
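
The detection-latency point can be illustrated with a toy calculation. This is not the analytical model used in the evaluation. It assumes a single straight-line region, a fault that is equally likely at each instruction of that region, and a fixed detection latency measured in instructions. The function name and the region lengths below are hypothetical. A fault at position i is treated as recoverable only if it is detected while control is still inside the region, that is, if i plus the latency is still within the region.

```python
def recoverable_fraction(region_length, detection_latency):
    """Fraction of fault sites detected before control leaves the region.

    Toy model: faults are uniform over instruction positions
    0..region_length-1 and are detected exactly `detection_latency`
    instructions later.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    if detection_latency < 0:
        raise ValueError("detection_latency must be non-negative")

    # Positions i with i + detection_latency <= region_length - 1.
    recoverable_sites = max(0, region_length - detection_latency)
    return recoverable_sites / region_length


# Hypothetical region lengths, chosen only to show the trend.
for length in (50, 500, 5000):
    print(length, recoverable_fraction(length, detection_latency=100))
```

Under these assumptions a region shorter than the detection latency is never recoverable, which is consistent with the earlier observation that a longer rollback distance raises the probability of recovery.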

  15. Full System “Coverage”
     • 93% to 99.99% coverage, highly application dependent
     • Detection latencies evaluated: Future (~10 instrs), Existing (~100 instrs), Future (~1000 instrs)

  16. Overheads
     • 3% to 22% performance degradation

  17. Summary
     • Large portions of applications, across domains, are (probabilistically) idempotent
     • Encore is a software-only solution that exploits this property to provide low-cost fault recovery
       • 97% of faults on average are recoverable with current detection schemes
       • At a 15% performance penalty
     • Implementing Encore in a runtime system / virtual machine has the potential to yield even better results
       • Larger dynamic traces vs. static intervals
       • Dynamic vs. static memory analysis

  18. Questions? http://cccp.eecs.umich.edu
