
Towards a Hardware-Software Co-Designed Resilient System

This research focuses on developing a low-cost, unified reliability solution for handling failures in hardware and software systems. The framework includes detection, recovery, diagnosis, and repair components. The methodology involves fault injection and simulation to understand how hardware faults propagate to software and the latency of detection. The findings show that simple hardware and software approaches are effective in handling over 97% of faults, with recovery achieved through checkpoints. Future work includes improving fault propagation understanding, enhancing detection, and exploring application customizability.


Presentation Transcript


  1. Towards a Hardware-Software Co-Designed Resilient System Man-Lap (Alex) Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou University of Illinois at Urbana-Champaign In collaboration with Pradip Bose (IBM) and Subhasish Mitra (Stanford)

  2. Motivation • Failures will happen in the field • Design defects • Aging • Soft errors • Inadequate burn-in • Aggressive design for power/performance/reliability • … • Low-cost method to detect/recover from all sources of failure? • Reliability problem pervasive across many markets • Traditional solutions (e.g., n-modular redundancy (nMR)) too expensive • Must incur low performance, power overhead

  3. A Low-Cost, Unified Reliability Solution • Need to handle only faults that propagate to software • Hardware faults appear as software bugs • Leverage software reliability solutions for hardware? • One-size-fits-all, near-100% coverage often unnecessary • Solution must be customizable to application needs

  4. Outline • Motivation for the Framework • Unified Framework for H/W + S/W Reliability • Understanding the Impact of H/W Failures on S/W • Future Work

  5. Unified Framework for H/W + S/W Reliability [Diagram: execution timeline with periodic checkpoints; faults become errors, and errors surface as detected symptoms. Ideal path: symptom-based detection, followed by diagnosis, repair, and recovery by rollback to a checkpoint. Errors that go undetected by symptoms fall back on detection with more overhead (online testing).] • Unified hardware/software co-designed framework • Tackles hardware and software faults • Software-centric solutions with near-zero H/W overhead • Customizable to app needs, flexible for new error sources

  6. Framework Components • Detection: Software symptoms, online testing • Recovery: Software/hardware checkpoint and rollback • Diagnosis: Firmware layer for rollback/replay, online testing • Repair/reconfiguration: Redundant, reconfigurable hardware • Need to understand how hardware faults propagate to S/W • How do hardware faults become visible to software? • What is the latency? • Do H/W faults affect application and/or system state?
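
The four components above suggest a detect/rollback/diagnose loop. Below is a minimal, hypothetical sketch of that control flow; every name (take_checkpoint, diagnose_by_replay, repair_hardware, and so on) is illustrative, since the talk does not define an API.

```cpp
// Hypothetical sketch of the framework's control flow: run between
// checkpoints, watch for symptoms, roll back on detection, and let
// diagnosis choose between recovery alone and hardware repair.
#include <cstdio>

enum class Diagnosis { Transient, PermanentHardware, SoftwareBug };

// Stubs standing in for the real mechanisms.
static void take_checkpoint()        { /* snapshot registers + memory */ }
static void rollback_to_checkpoint() { /* restore the last snapshot   */ }
static bool run_until_checkpoint_interval() {
    // Execute the workload; return true if a symptom (fatal trap, hang,
    // invariant violation) was observed during this interval.
    return false;
}
static Diagnosis diagnose_by_replay() { return Diagnosis::Transient; }
static void repair_hardware()         { /* reconfigure around the fault */ }

int main() {
    for (int interval = 0; interval < 3; ++interval) {
        take_checkpoint();
        if (run_until_checkpoint_interval()) {   // symptom detected
            rollback_to_checkpoint();
            switch (diagnose_by_replay()) {
            case Diagnosis::Transient:
                break;                           // re-execution recovers
            case Diagnosis::PermanentHardware:
                repair_hardware();               // then re-execute
                break;
            case Diagnosis::SoftwareBug:
                std::puts("report to software-reliability layer");
                break;
            }
        }
    }
}
```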

  7. Methodology • Microarchitecture-level fault injection • Trade-off between accuracy and simulation time • GEMS timing models for out-of-order processor, memory • Simics full-system simulation of Solaris + UltraSPARC III • SPEC workloads for ten million instructions • Fault model • Stuck-at, bridging faults in many micro-arch structures • Fault detection • Crashes detected through hardware generated fatal traps • Misaligned memory access, RED state, watchdog reset, etc. • Hangs detected using simple hardware hang detector
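
To make the injection step concrete, here is a sketch of a microarchitecture-level stuck-at fault, under the assumption that the injector forces one bit of a chosen latch on every simulated cycle (a permanent stuck-at fault). The struct and function are illustrative, not the actual GEMS/Simics hooks.

```cpp
// Stuck-at fault injection: force one latch bit to a fixed value.
#include <cassert>
#include <cstdint>

struct StuckAtFault {
    unsigned bit;   // which bit of the latch is faulty
    bool stuck_at;  // false = stuck-at-0, true = stuck-at-1
};

// Applied to the latch's value on every simulated cycle.
uint64_t apply_stuck_at(uint64_t latch, StuckAtFault f) {
    const uint64_t mask = uint64_t{1} << f.bit;
    return f.stuck_at ? (latch | mask)    // force the bit high
                      : (latch & ~mask);  // force the bit low
}

int main() {
    StuckAtFault f{3, true};                    // bit 3 stuck at 1
    assert(apply_stuck_at(0x00, f) == 0x08);    // fault flips the bit
    assert(apply_stuck_at(0xFF, f) == 0xFF);    // fault is masked here
}
```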

  8. How do Hardware Faults Propagate to Software? • 97% of faults (w/o FPU) detectable with simple H/W & S/W • Need H/W support or S/W monitoring for FPU

  9. How do Hardware Faults Propagate to Software? (cont.) • > 50% of crashes/hangs occur in the OS

  10. S/W Components Corrupted • 62% of faults corrupt system state • Need to recover system state

  11. Latency to Detection from Application Corruption (latency = total instructions executed between application state corruption and detection) • 80% have latency < 100K instr, amenable to H/W recovery • Buffering for ~50 µs on a 2 GHz processor (100K instructions ÷ 2×10⁹ instructions/s ≈ 50 µs, assuming roughly one instruction per cycle) • May need to use software checkpoint/recovery for others

  12. Latency to Detection from OS Corruption (latency = OS-only instructions executed between OS state corruption and detection) • 92% of injections result in latency of < 100K OS instructions • Amenable to hardware recovery

  13. Summary so far • Hardware faults highly visible • Over 97% of faults in 6 structures result in crashes/hangs • Simple H/W and S/W sufficient • Recovery through checkpointing • S/W and/or H/W checkpoints for application recovery • H/W checkpoints and buffering for OS recovery
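
As a minimal sketch of the checkpoint/rollback contract this summary relies on, assuming the recoverable state fits in a single buffer (real schemes snapshot registers and dirty memory pages incrementally):

```cpp
// Software checkpointing reduced to its contract: snapshot, then restore.
#include <cassert>
#include <cstring>
#include <vector>

struct Checkpoint { std::vector<unsigned char> image; };

Checkpoint take_checkpoint(const void* state, std::size_t len) {
    Checkpoint c;
    c.image.assign(static_cast<const unsigned char*>(state),
                   static_cast<const unsigned char*>(state) + len);
    return c;
}

void rollback(void* state, const Checkpoint& c) {
    std::memcpy(state, c.image.data(), c.image.size());
}

int main() {
    int counter = 7;
    Checkpoint c = take_checkpoint(&counter, sizeof counter);
    counter = -1;              // a fault corrupts the state...
    rollback(&counter, c);     // ...and rollback restores it
    assert(counter == 7);
}
```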

  14. Next Steps (1 of 3) • Improving understanding of fault propagation • Accurate fault models, effect of transients, intermittents • Lower-level simulations • Better workloads • Detection • More software level monitoring • Software signals, invariants, perturbations, … • H/W support to aid detection in some structures (e.g., FPU) • Selective backup testing • Recovery • Enhanced detection may reduce latency • Explore software vs. hardware, application customizability
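
One kind of software-level monitor the detection bullet alludes to is value invariants. A hypothetical sketch: learn a [lo, hi] range for a program value during fault-free training runs, then treat any out-of-range value at runtime as a detection symptom. The class and training protocol here are assumptions, not the talk's design.

```cpp
// Range invariant: trained on fault-free runs, checked at runtime.
#include <cassert>
#include <cstdint>
#include <limits>

struct RangeInvariant {
    int64_t lo = std::numeric_limits<int64_t>::max();
    int64_t hi = std::numeric_limits<int64_t>::min();

    void train(int64_t v) {            // widen the range on fault-free runs
        if (v < lo) lo = v;
        if (v > hi) hi = v;
    }
    bool violated(int64_t v) const {   // runtime check: symptom if true
        return v < lo || v > hi;
    }
};

int main() {
    RangeInvariant inv;
    for (int64_t v : {3, 9, 5}) inv.train(v);
    assert(!inv.violated(7));
    assert(inv.violated(1000));        // likely fault-corrupted value
}
```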

  15. Next Steps (2 of 3) • Diagnosis • Assume rollback/restart mechanism, multicore system [Diagram: diagnosis decision tree] • Bug detected → rollback to previous checkpoint, restart on original core • Original symptom doesn't recur → transient H/W bug or non-deterministic S/W bug → continue execution • Original symptom recurs → deterministic S/W bug or permanent H/W bug → rollback, restart on a different core • No symptom on the different core → permanent defect in original core • Symptom recurs on the different core → deterministic S/W bug
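
The same decision tree, expressed as code; replay_on_core() is a hypothetical primitive (stubbed out here) that rolls back to the last checkpoint, re-executes on the given core, and reports whether the original symptom recurred.

```cpp
// Diagnosis via rollback/replay on a multicore system.
enum class Verdict {
    TransientOrNondeterministic,   // no recurrence on the original core
    PermanentFaultInOriginalCore,  // recurs there but not on another core
    DeterministicSoftwareBug       // recurs on every core
};

static bool replay_on_core(int /*core*/) {
    return false;  // stub: a real system would roll back and re-execute
}

Verdict diagnose(int original_core, int spare_core) {
    if (!replay_on_core(original_core))
        return Verdict::TransientOrNondeterministic;
    if (!replay_on_core(spare_core))
        return Verdict::PermanentFaultInOriginalCore;
    return Verdict::DeterministicSoftwareBug;
}

int main() {
    return diagnose(0, 1) == Verdict::TransientOrNondeterministic ? 0 : 1;
}
```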

  16. Next Steps (3 of 3) • Repair/reconfigure • What is the right field-configurable unit? • Core, FU, array entries? • Avoidance • Dynamic reliability management • Implementation architecture • Hardware + firmware + OS • Itanium machine check architecture has hooks

  17. Thank You Questions?

  18. Backup Slides

  19. Types of fatal traps • Faults cause different fatal traps to be thrown before the crash • Junk data accesses lead to memory-misalignment traps • Repeated trapping leads to RED (Reset, Error, and Debug) state
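
To make the misalignment symptom concrete: a fault that corrupts a pointer typically yields a misaligned address, and dereferencing it raises a fatal alignment trap (mem_address_not_aligned) on SPARC, whereas x86 would usually allow the access. The snippet below is deliberately undefined behavior, shown only as an illustration.

```cpp
// Junk-address access: misaligned dereference traps on SPARC.
#include <cstdint>

int main() {
    volatile std::uintptr_t junk = 0x1003;  // misaligned for a 4-byte int
    const int* p = reinterpret_cast<const int*>(junk);
    return *p;  // traps on SPARC; the OS turns the trap into a crash
}
```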
