Autonomic Computing via Dynamic Self-Repair

Daniel J. Sorin

Department of Electrical & Computer Engineering

Duke University

A Computing Challenge for NASA

  • NASA relies on computers

  • NASA is much more demanding than most users

    • Must operate in harsh environments that cause hard faults

    • Must operate correctly for years

    • Must not require a human to repair problems

  • Our goal

    • Designing autonomic computer systems

    • Permanent faults will occur, and the computer will handle them

But Isn’t This a Solved Problem?

  • We could just use TMR (triple modular redundancy)

  • But TMR uses too much power to be feasible

    • Especially for modern microprocessors
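
For readers who have not seen TMR, the sketch below shows the idea in a few lines: three replicas compute the same result and a majority voter masks one faulty replica. The function names and the stuck-at fault model are assumptions made for illustration; they do not come from the slides.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the majority value of three replica outputs.

    Raises RuntimeError if all three outputs disagree, which means
    more than one replica is faulty and the voter cannot mask it.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica is faulty")
    return value

def good_adder(x, y):
    return x + y

def faulty_adder(x, y):
    # Hypothetical hard fault: output bit 0 is stuck at 1.
    return (x + y) | 1

# Three replicas, one of which has a hard fault.
replicas = [good_adder, faulty_adder, good_adder]
result = tmr_vote(*(adder(2, 2) for adder in replicas))
```

Here the two fault-free replicas outvote the faulty one. The price is that all three replicas run all the time, which is the power cost the slide objects to.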

Key Observation

  • Computer hardware is already modular

    • Improves performance

    • Simplifies design and verification

  • Modularity exists at many levels

    • Multiple processors per chip (CMP)

    • Multiple thread contexts per processor

    • Multiple functional units (e.g., adders) per processor

    • Multiple 4-bit adders in 64-bit adder

    • Multiple 1-bit adders in 4-bit adder

    • Etc.

      We can leverage this modularity!

Modular Redundancy

  • If computer has N widgets, add extra widget(s)

  • Then provide:

    • Ability to detect errors

    • Ability to diagnose hard faults (that cause errors)

    • Ability to reconfigure and map in spare widget

  • Cost: 1/N (or 2/N) extra hardware for one (or two) spares, instead of 2x (2*N extra widgets) for TMR

  • Benefit: can sometimes even be better than TMR!

  • Simplistic example:

    • For processor with 8 adders, providing 2 more adders can tolerate 2 hard faults (in adders)

    • Triplicating the entire processor (TMR) is guaranteed to tolerate only one hard fault (in an adder)
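
The arithmetic behind the cost and benefit bullets can be written out as a small sketch. It counts only the widgets themselves; the voter, the error detectors, and the reconfiguration multiplexers are ignored, and the function names are hypothetical.

```python
from fractions import Fraction

def spare_overhead(n, spares):
    """Extra hardware, as a fraction of the original n widgets."""
    return Fraction(spares, n)

def tmr_overhead():
    """TMR adds two full extra copies: 200% extra hardware."""
    return Fraction(2, 1)

def spares_survive(faulty_widgets, spares):
    # With n + spares widgets and n needed, the system works as long
    # as the number of faulty widgets does not exceed the spares.
    return faulty_widgets <= spares

def tmr_survives(faults_per_copy):
    # The voter needs at least two of the three copies to be fault-free.
    fault_free = sum(1 for faults in faults_per_copy if faults == 0)
    return fault_free >= 2

# The slide's example: 8 adders plus 2 spares.
overhead = spare_overhead(8, 2)              # 1/4 extra adders
two_faults_ok = spares_survive(2, 2)         # two faulty adders are covered
tmr_two_copies = tmr_survives([1, 1, 0])     # faults in two different copies
```

Under this simplified model, two faulty adders are covered by the two spares, while TMR fails if those two faults land in different copies of the processor.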

HMR: Hierarchical Modular Redundancy

  • Provide modular redundancy at many levels

    • Processors, adders, multipliers, etc.

  • Engineering issues involved in HMR

    • Allocating resources

    • Managing costs

Allocating Resources

  • For a given hardware budget, how should it be allocated?

  • At which level should spares be allocated?

    • Better to have extra processor?

    • Or extra adders in each processor?

    • Or some combination of both?

  • How many spares at each level?

  • Can a spare be mapped in anywhere in system?
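
One way to reason about these questions is a simple k-of-n reliability model, sketched below. Everything in it is an assumption made for illustration: modules fail independently, any spare can replace any failed peer at its level, the survival probabilities are invented, and the two options are not claimed to cost the same hardware.

```python
from math import comb

def k_of_n(k, n, r):
    """Probability that at least k of n independent modules work,
    when each module works with probability r."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

def processor_reliability(adders_needed, adders_present, r_adder, r_other):
    # A processor works if enough adders work and the rest of its logic works.
    return r_other * k_of_n(adders_needed, adders_present, r_adder)

def chip_reliability(procs_needed, procs_present, r_proc):
    return k_of_n(procs_needed, procs_present, r_proc)

# Assumed per-module survival probabilities over the mission (not measured).
R_ADDER, R_OTHER = 0.95, 0.99

# Option A: one spare processor, no spare adders inside each processor.
option_a = chip_reliability(2, 3, processor_reliability(4, 4, R_ADDER, R_OTHER))

# Option B: no spare processor, two spare adders inside each processor.
option_b = chip_reliability(2, 2, processor_reliability(4, 6, R_ADDER, R_OTHER))
```

Comparing option_a and option_b for different assumed probabilities shows how the better level for spares depends on which components are most likely to fail.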

Managing Costs

  • Costs: extra modules, wires, and multiplexers

  • Example: 3-bit addition, with module = 1-bit adder
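
The 3-bit example can be sketched in software as follows. This is an illustration under assumptions, not the design from the slide: it uses one spare 1-bit adder, a shift-by-one steering scheme, and a hypothetical stuck-at fault. Each steering decision in the loop corresponds to a multiplexer, and each alternative path to an extra wire, in hardware.

```python
def full_adder(a, b, cin):
    """Fault-free 1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def stuck_sum_adder(a, b, cin):
    """Hypothetical hard fault: the sum output is stuck at 0."""
    _, cout = full_adder(a, b, cin)
    return 0, cout

def add3(x, y, modules, faulty=None):
    """Add two 3-bit numbers using 3 of the 4 physical 1-bit modules.

    `faulty` is the index of the physical module to skip; None means
    the spare (module 3) is left unused. Bit i is steered to module i
    if i is below the skipped index, and to module i + 1 otherwise.
    """
    skip = 3 if faulty is None else faulty
    carry, result = 0, 0
    for bit in range(3):
        phys = bit if bit < skip else bit + 1
        s, carry = modules[phys]((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result | (carry << 3)

# Physical module 1 has a hard fault; module 3 is the spare.
modules = [full_adder, stuck_sum_adder, full_adder, full_adder]
before = add3(1, 1, modules)             # uses the faulty module
after = add3(1, 1, modules, faulty=1)    # faulty module mapped out
```

With the faulty module in use the sum is wrong; after it is mapped out, the remaining three modules produce the correct result.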

Current Research Thrust #1

  • Explore modular redundancy within microprocessor

  • Add extra array entries

    • In reorder buffer (ROB), branch history table (BHT), etc.

  • Add extra functional units

    • Adders, multipliers, etc.

  • For error detection

    • Use “DIVA” or redundant threads

  • For hard fault diagnosis

    • Use threshold error counters

  • For reconfiguration

    • Use extra wires and multiplexers

Modular array entry design published in International Symposium on Dependable Systems and Networks, 2004
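
The diagnosis and reconfiguration bullets above can be illustrated with a small sketch of threshold error counters. The class name, the unit names, and the threshold of 3 are hypothetical; the error reports are assumed to come from a separate detector such as DIVA or redundant threads, which is not modeled here.

```python
class UnitPool:
    """Tracks detected errors per functional unit and maps in spares."""

    def __init__(self, active, spares, threshold=3):
        self.active = list(active)      # units currently in use
        self.spares = list(spares)      # unused spare units
        self.threshold = threshold      # assumed value, not from the slides
        self.errors = {unit: 0 for unit in self.active}
        self.retired = []               # units diagnosed as hard-faulty

    def report_error(self, unit):
        """Record one detected error attributed to `unit`.

        Returns the name of the spare mapped in if this report triggers
        reconfiguration, otherwise None (including when no spare is left).
        """
        if unit not in self.errors:
            return None
        self.errors[unit] += 1
        if self.errors[unit] < self.threshold:
            return None                 # may still be a transient error
        # Counter reached the threshold: treat as a hard fault.
        self.active.remove(unit)
        del self.errors[unit]
        self.retired.append(unit)
        if not self.spares:
            return None
        spare = self.spares.pop(0)
        self.active.append(spare)
        self.errors[spare] = 0
        return spare

pool = UnitPool(["add0", "add1"], ["add_spare"], threshold=3)
first = pool.report_error("add1")
second = pool.report_error("add1")
third = pool.report_error("add1")       # threshold reached here
```

A counter that only ever increases is the simplest choice; a real design might also clear or decay counters over time so that rare transient errors do not accumulate into a false hard-fault diagnosis.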

Current Research Thrust #2

  • Explore modular redundancy within 64-bit adder

  • Start with 64-bit carry lookahead adder (CLA)

    • Hierarchy of 4-bit CLA modules

  • Add 2 extra modules

  • Detect errors as before

  • Diagnose with counters and pattern matching

    • Based on error counter values, can diagnose fault!

  • Reconfigure with clever multiplexing scheme
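
The sketch below illustrates the overall flow for a 64-bit adder built from 4-bit modules with two spares. It is a simplified stand-in, not the scheme from the slides: carries ripple between modules instead of using lookahead, a reference sum plays the role of the error detector, the comparison only localizes faults in sum bits, and the fault model and all names are hypothetical.

```python
MASK4 = 0xF
MASK64 = 2**64 - 1

def good_module(a, b, cin):
    """Fault-free 4-bit adder module: returns (4-bit sum, carry_out)."""
    total = a + b + cin
    return total & MASK4, total >> 4

def stuck_module(a, b, cin):
    """Hypothetical hard fault: least significant sum bit stuck at 1."""
    s, cout = good_module(a, b, cin)
    return s | 1, cout

def add64(x, y, modules, mapping):
    """Add modulo 2**64; mapping[i] is the physical module for nibble i."""
    carry, result = 0, 0
    for i in range(16):
        a = (x >> (4 * i)) & MASK4
        b = (y >> (4 * i)) & MASK4
        s, carry = modules[mapping[i]](a, b, carry)
        result |= s << (4 * i)
    return result

def diagnose(x, y, modules, mapping):
    """Return the nibble positions whose sum differs from a reference."""
    diff = ((x + y) & MASK64) ^ add64(x, y, modules, mapping)
    return [i for i in range(16) if (diff >> (4 * i)) & MASK4]

def remap(mapping, nibble, spares):
    """Return a new mapping with `nibble` served by the next spare."""
    new_mapping = list(mapping)
    new_mapping[nibble] = spares.pop(0)
    return new_mapping

# 16 modules in use plus 2 spares (indices 16 and 17); module 5 is faulty.
modules = [good_module] * 18
modules[5] = stuck_module
mapping = list(range(16))
spares = [16, 17]

suspects = diagnose(0, 0, modules, mapping)
mapping = remap(mapping, suspects[0], spares)
repaired = add64(123456789, 987654321, modules, mapping)
```

After the faulty module is mapped out, one spare remains available for a second hard fault.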

Conclusions and Future Work

  • Hierarchical Modular Redundancy can provide high reliability at relatively low cost

  • Future directions

    • Low-level: modular designs of components besides just adders (e.g., multipliers, decoding logic, etc.)

    • Mid-level: modular designs of microprocessors that can tolerate loss of currently critical logic (e.g., decoding)

    • High-level: HMR for chip multiprocessors


Several collaborators on this work

  • Co-Investigator Prof. Sule Ozev (Duke ECE)

  • Fred Bower (Duke CS grad and IBM)

  • Mahmut Yilmaz (Duke ECE grad)

  • Derek Hower (Duke ECE undergrad)
