
Autonomic Computing via Dynamic Self-Repair

Daniel J. Sorin, Department of Electrical & Computer Engineering, Duke University. A Computing Challenge for NASA: NASA relies on computers and is much more demanding than most users; its computers must operate in harsh environments that cause hard faults.


Presentation Transcript


  1. Autonomic Computing via Dynamic Self-Repair
     Daniel J. Sorin
     Department of Electrical & Computer Engineering
     Duke University

  2. A Computing Challenge for NASA
     • NASA relies on computers
     • NASA is much more demanding than most users
       • Must operate in harsh environments that cause hard faults
       • Must operate correctly for years
       • Must not require a human to repair problems
     • Our goal
       • Designing autonomic computer systems
       • Permanent faults will occur and the computer will handle them

  3. But Isn’t This a Solved Problem?
     • We could just use TMR (triple modular redundancy)
       [Figure: three CPUs feed a voter, which produces the output]
     • But too much power usage to be feasible
       • Especially for modern microprocessors
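The voter in a TMR arrangement can be sketched in a few lines. This is a minimal illustration only: the function names `tmr_vote`, `add`, and `faulty_add` are hypothetical and do not come from the presentation, and the "fault" is simulated by flipping the low bit of one replica's result.

```python
from collections import Counter


def tmr_vote(outputs):
    """Return the value that at least two of the three replicas agree on."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def add(a, b):
    """A healthy replica."""
    return a + b


def faulty_add(a, b):
    """Hypothetical replica with a hard fault that corrupts the low bit."""
    return (a + b) ^ 1


# Three replicas compute the same operation; the voter masks the faulty one.
replicas = [add, faulty_add, add]
result = tmr_vote([replica(2, 3) for replica in replicas])
print(result)
```

The sketch also shows why the approach is expensive: every operation runs three times, regardless of whether any replica is actually faulty.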

  4. Key Observation
     • Computer hardware is already modular
       • Improves performance
       • Simplifies design and verification
     • Modularity exists at many levels
       • Multiple processors per chip (CMP)
       • Multiple thread contexts per processor
       • Multiple functional units (e.g., adders) per processor
       • Multiple 4-bit adders in 64-bit adder
       • Multiple 1-bit adders in 4-bit adder
       • Etc.
     We can leverage this modularity!

  5. Modular Redundancy
     • If computer has N widgets, add extra widget(s)
     • Then provide:
       • Ability to detect errors
       • Ability to diagnose hard faults (that cause errors)
       • Ability to reconfigure and map in spare widget
     • Cost: 1 (or 2) spare widgets, an overhead of 1/N (or 2/N), instead of the 2N extra widgets (200% overhead) that TMR needs
     • Benefit: can sometimes even be better than TMR!
     • Simplistic example:
       • For processor with 8 adders, providing 2 more adders can tolerate 2 hard faults (in adders)
       • Replicating entire processor 3 times (TMR) can only tolerate one hard fault (in an adder)
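The cost comparison and the "map in a spare" step can be sketched as follows, using the slide's example of 8 adders with 2 spares. The names `relative_overhead` and `SparePool` are hypothetical, and the sketch assumes any spare can replace any faulty widget, which the presentation leaves as an open design question.

```python
def relative_overhead(n_widgets, n_spares):
    """Extra hardware as a fraction of the original N widgets."""
    return n_spares / n_widgets


TMR_OVERHEAD = 2.0  # two extra copies of everything, i.e. 200%


class SparePool:
    """Tracks which physical widget serves each logical slot."""

    def __init__(self, n_active, n_spares):
        # Logical slot i is initially served by physical widget i.
        self.mapping = list(range(n_active))
        # Spares get the next physical ids.
        self.spares = list(range(n_active, n_active + n_spares))

    def map_in_spare(self, logical_slot):
        """Reconfigure: replace the diagnosed-faulty widget with a spare."""
        if not self.spares:
            raise RuntimeError("no spare widgets left")
        self.mapping[logical_slot] = self.spares.pop(0)
        return self.mapping[logical_slot]


# 8 adders plus 2 spares: 25% overhead, and two adder faults are tolerated.
pool = SparePool(n_active=8, n_spares=2)
pool.map_in_spare(3)
pool.map_in_spare(5)
print(relative_overhead(8, 2), TMR_OVERHEAD, pool.mapping)
```

A third adder fault would exhaust the spares and raise an error, which is the point at which a higher level of the hierarchy (for example a spare processor) would have to take over.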

  6. HMR: Hierarchical Modular Redundancy
     • Provide modular redundancy at many levels
       • Processors, adders, multipliers, etc.
     • Engineering issues involved in HMR
       • Allocating resources
       • Managing costs

  7. Allocating Resources
     • For a given hardware budget, how to allocate it
     • Which level to allocate spares?
       • Better to have extra processor?
       • Or extra adders in each processor?
       • Or some combination of both?
     • How many spares at each level?
     • Can a spare be mapped in anywhere in system?

  8. Managing Costs
     • Costs: extra modules, wires, and multiplexers
     • Example: 3-bit addition, with module = 1-bit adder
       [Figure: 3-bit adder built from 1-bit adder modules, with inputs A1-A3 and B1-B3 and outputs C1-C3 routed through multiplexers]
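A rough software model of the 3-bit example is given below. It assumes four 1-bit adder modules (three plus one spare) and models the multiplexers as simple index selection: bit position i is routed to the i-th module not marked faulty. The names `full_adder`, `stuck_adder`, and `add_3bit` are hypothetical, and the real wiring in the slide's figure may differ.

```python
def full_adder(a, b, cin):
    """Healthy 1-bit adder module: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def stuck_adder(a, b, cin):
    """Hypothetical module with a hard fault: outputs stuck at 0."""
    return 0, 0


def add_3bit(a_bits, b_bits, modules, faulty):
    """Add two 3-bit values (bit lists, least significant bit first).

    `faulty` is the set of module indices diagnosed as broken. The list
    comprehension plays the role of the multiplexers: it steers each bit
    position to a healthy module, skipping the faulty ones.
    """
    healthy = [m for i, m in enumerate(modules) if i not in faulty]
    if len(healthy) < 3:
        raise RuntimeError("not enough healthy 1-bit adders")
    carry = 0
    sum_bits = []
    for bit in range(3):
        s, carry = healthy[bit](a_bits[bit], b_bits[bit], carry)
        sum_bits.append(s)
    return sum_bits, carry


# Module 1 is broken; the spare (module 3) is mapped in to cover for it.
modules = [full_adder, stuck_adder, full_adder, full_adder]
sum_bits, carry_out = add_3bit([1, 1, 0], [1, 0, 1], modules, faulty={1})
print(sum_bits, carry_out)  # 3 + 5 = 8, i.e. sum bits 000 with carry out 1
```

In hardware, each routing choice made here by list indexing costs real multiplexers and wires, which is the overhead this slide is concerned with.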

  9. Current Research Thrust #1
     • Explore modular redundancy within microprocessor
       • Add extra array entries
         • In reorder buffer (ROB), branch history table (BHT), etc.
       • Add extra functional units
         • Adders, multipliers, etc.
     • For error detection
       • Use “DIVA” or redundant threads
     • For hard fault diagnosis
       • Use threshold error counters
     • For reconfiguration
       • Use extra wires and multiplexers
     Modular array entry design published in International Symposium on Dependable Systems and Networks, 2004
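One plausible reading of "threshold error counters" is sketched below: each unit has a counter that is incremented whenever an error is attributed to it, and a unit whose count reaches a threshold is diagnosed as having a hard fault, since a transient error is unlikely to keep recurring in the same unit. The class name `ThresholdDiagnoser`, the threshold value, and the choice of cumulative (never reset) counts are all assumptions, not details from the presentation.

```python
class ThresholdDiagnoser:
    """Per-unit error counters with a hard-fault threshold (hypothetical)."""

    def __init__(self, n_units, threshold):
        self.counts = [0] * n_units
        self.threshold = threshold

    def record_error(self, unit):
        """Count one detected error against `unit`.

        Returns True once the unit has reached the threshold, meaning it
        should be treated as permanently faulty and reconfigured out.
        """
        self.counts[unit] += 1
        return self.counts[unit] >= self.threshold


diagnoser = ThresholdDiagnoser(n_units=4, threshold=3)
verdicts = [diagnoser.record_error(2) for _ in range(3)]
isolated = diagnoser.record_error(0)  # a single error elsewhere
print(verdicts, isolated)
```

Here unit 2 is flagged only on its third error, while the single error on unit 0 is treated as possibly transient.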

  10. Current Research Thrust #2
      • Explore modular redundancy within 64-bit adder
        • Start with 64-bit carry lookahead adder (CLA)
        • Hierarchy of 4-bit CLA modules
        • Add 2 extra modules
      • Detect errors as before
      • Diagnose with counters and pattern matching
        • Based on error counter values, can diagnose fault!
      • Reconfigure with clever multiplexing scheme
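A simplified model of a 64-bit adder assembled from 4-bit modules, 16 in use plus 2 spares, is sketched below. It is not the design from the presentation: carries ripple between modules here instead of using a second lookahead level, the multiplexing scheme is reduced to skipping modules marked faulty, and the names `cla4`, `stuck4`, and `add64` are hypothetical.

```python
def cla4(a, b, cin):
    """4-bit adder module built from generate/propagate signals.

    Returns (4-bit sum, carry_out). The loop evaluates the carry
    recurrence c[i+1] = g[i] | (p[i] & c[i]), which lookahead hardware
    flattens into parallel logic.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    c = [cin]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]


def stuck4(a, b, cin):
    """Hypothetical 4-bit module with a hard fault: outputs stuck at 0."""
    return 0, 0


def add64(a, b, modules, faulty=frozenset()):
    """Add two 64-bit values using the first 16 healthy 4-bit modules."""
    healthy = [m for i, m in enumerate(modules) if i not in faulty]
    if len(healthy) < 16:
        raise RuntimeError("fewer than 16 healthy 4-bit modules")
    total, carry = 0, 0
    for k in range(16):
        nibble_a = (a >> (4 * k)) & 0xF
        nibble_b = (b >> (4 * k)) & 0xF
        s, carry = healthy[k](nibble_a, nibble_b, carry)
        total |= s << (4 * k)
    return total, carry


# 18 modules: 16 needed plus 2 spares. Modules 3 and 7 have hard faults.
modules = [cla4] * 18
modules[3] = stuck4
modules[7] = stuck4
total, carry_out = add64(123456789, 987654321, modules, faulty={3, 7})
print(total, carry_out)
```

With both spares mapped in, the adder still produces the correct sum; a third faulty module would leave fewer than 16 healthy modules and the sketch raises an error.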

  11. Conclusions and Future Work
      • Hierarchical Modular Redundancy can provide high reliability at relatively low cost
      • Future directions
        • Low-level: modular designs of components besides just adders (e.g., multipliers, decoding logic)
        • Mid-level: modular designs of microprocessors that can tolerate loss of currently critical logic (e.g., decoding)
        • High-level: HMR for chip multiprocessors

  12. Acknowledgments
      Several collaborators on this work
      • Co-Investigator Prof. Sule Ozev (Duke ECE)
      • Fred Bower (Duke CS grad and IBM)
      • Mahmut Yilmaz (Duke ECE grad)
      • Derek Hower (Duke ECE undergrad)
