Autonomic Computing via Dynamic Self-Repair

Daniel J. Sorin

Department of Electrical & Computer Engineering

Duke University


A Computing Challenge for NASA

  • NASA relies on computers

  • NASA is much more demanding than most users

    • Must operate in harsh environments that cause hard faults

    • Must operate correctly for years

    • Must not require a human to repair problems

  • Our goal

    • Designing autonomic computer systems

    • Permanent faults will occur, and the computer must handle them


But Isn’t This a Solved Problem?

  • We could just use TMR (triple modular redundancy)

[Figure: three CPUs feed a voter, which produces the output]

  • But too much power usage to be feasible

    • Especially for modern microprocessors
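The voting step in TMR can be illustrated with a minimal sketch. The function names and the stuck-at fault model below are hypothetical, chosen only for illustration; they are not taken from the presentation.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority value of three replica outputs.

    Raises RuntimeError if all three replicas disagree.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def good_add(x, y):
    # Fault-free replica.
    return x + y


def faulty_add(x, y):
    # Hypothetical replica with a hard fault: bit 0 of the result is stuck at 1.
    return (x + y) | 1


# One faulty replica is outvoted by the two fault-free ones.
result = tmr_vote(good_add(2, 2), faulty_add(2, 2), good_add(2, 2))
```

With one faulty replica the voter still returns the correct sum. If two replicas share the same fault, the wrong value wins the vote, which is why TMR is only guaranteed to mask a single faulty copy.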


Key Observation

  • Computer hardware is already modular

    • Improves performance

    • Simplifies design and verification

  • Modularity exists at many levels

    • Multiple processors per chip (CMP)

    • Multiple thread contexts per processor

    • Multiple functional units (e.g., adders) per processor

    • Multiple 4-bit adders in 64-bit adder

    • Multiple 1-bit adders in 4-bit adder

    • Etc.

      We can leverage this modularity!


Modular Redundancy

  • If computer has N widgets, add extra widget(s)

  • Then provide:

    • Ability to detect errors

    • Ability to diagnose hard faults (that cause errors)

    • Ability to reconfigure and map in spare widget

  • Cost: overhead of 1/N (or 2/N) for one (or two) spare widgets, versus 2N extra widgets (an overhead of 2) for TMR

  • Benefit: can sometimes even be better than TMR!

  • Simplistic example:

    • For processor with 8 adders, providing 2 more adders can tolerate 2 hard faults (in adders)

    • Triplicating the entire processor (TMR) is guaranteed to tolerate only one hard fault (in an adder)
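The cost and fault-tolerance comparison above can be checked with a small sketch. It rests on simplifying assumptions that are not from the presentation: overhead counts only the spare modules (not the extra wires and multiplexers), a spare can replace any faulty adder, and a TMR replica counts as faulty if any of its adders is faulty. All names are hypothetical.

```python
def spare_overhead(n, spares):
    """Extra hardware as a fraction of the original n widgets."""
    return spares / n


# TMR adds two full extra copies of the hardware.
TMR_OVERHEAD = 2.0


def spared_pool_survives(spares, faults):
    """A pool with spares works as long as every faulty widget can be replaced."""
    return faults <= spares


def tmr_survives(faulty_adders_per_cpu):
    """TMR works as long as at most one of the three replicas is faulty."""
    faulty_cpus = sum(1 for f in faulty_adders_per_cpu if f > 0)
    return faulty_cpus <= 1


adder_overhead = spare_overhead(8, 2)  # 8 adders plus 2 spares
```

Under these assumptions, 8 adders with 2 spares cost an overhead of 0.25 and survive any 2 adder faults, while TMR costs an overhead of 2.0 and fails when `tmr_survives([1, 1, 0])` is evaluated, that is, when two faults land in different replicas.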


HMR: Hierarchical Modular Redundancy

  • Provide modular redundancy at many levels

    • Processors, adders, multipliers, etc.

  • Engineering issues involved in HMR

    • Allocating resources

    • Managing costs


Allocating Resources

  • For a given hardware budget, how should it be allocated?

  • At which level should spares be allocated?

    • Better to have extra processor?

    • Or extra adders in each processor?

    • Or some combination of both?

  • How many spares at each level?

  • Can a spare be mapped in anywhere in system?


Managing Costs

  • Costs: extra modules, wires, and multiplexers

  • Example: 3-bit addition, with module = 1-bit adder

[Figure: four 1-bit adders connected through multiplexers to inputs A1-A3 and B1-B3 and outputs C1-C3]


Current Research Thrust #1

  • Explore modular redundancy within microprocessor

  • Add extra array entries

    • In reorder buffer (ROB), branch history table (BHT), etc.

  • Add extra functional units

    • Adders, multipliers, etc.

  • For error detection

    • Use “DIVA” or redundant threads

  • For hard fault diagnosis

    • Use threshold error counters

  • For reconfiguration

    • Use extra wires and multiplexers

Modular array entry design published in International Symposium on Dependable Systems and Networks, 2004
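The diagnose-and-reconfigure steps above can be sketched as follows. This is a minimal illustration, not the published design: the class name, the threshold value of 3, and the policy of never resetting counters are all assumptions. The idea is that a transient error raises a unit's counter only occasionally, while a hard fault keeps producing errors until the counter crosses the threshold.

```python
class ThresholdDiagnoser:
    """Counts detected errors per unit and maps in a spare at a threshold."""

    def __init__(self, units, spares, threshold=3):
        self.active = list(units)      # units currently in use
        self.spares = list(spares)     # unused spare units
        self.threshold = threshold     # assumed value, not from the slides
        self.errors = {u: 0 for u in self.active}
        self.retired = []              # units diagnosed as having hard faults

    def report_error(self, unit):
        """Record one detected error; return the spare mapped in, if any."""
        if unit not in self.errors:
            return None  # unknown or already retired unit
        self.errors[unit] += 1
        if self.errors[unit] < self.threshold:
            return None  # could still be a transient error
        # Threshold reached: treat as a hard fault and retire the unit.
        self.active.remove(unit)
        del self.errors[unit]
        self.retired.append(unit)
        if not self.spares:
            return None  # no spare left; run with fewer units
        spare = self.spares.pop(0)
        self.active.append(spare)
        self.errors[spare] = 0
        return spare


diag = ThresholdDiagnoser(["add0", "add1"], ["spare0"], threshold=3)
first = diag.report_error("add1")
second = diag.report_error("add1")
third = diag.report_error("add1")
```

In this run the first two reports return `None`, and the third retires `add1` and maps in `spare0`. In hardware the error reports would come from the detection mechanism (DIVA or redundant threads), and the remapping would be done by the extra wires and multiplexers.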


Current Research Thrust #2

  • Explore modular redundancy within 64-bit adder

  • Start with 64-bit carry lookahead adder (CLA)

    • Hierarchy of 4-bit CLA modules

  • Add 2 extra modules

  • Detect errors as before

  • Diagnose with counters and pattern matching

    • Based on error counter values, can diagnose fault!

  • Reconfigure with clever multiplexing scheme
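A behavioral sketch of a 64-bit adder built from 4-bit modules with 2 spares is given below. It is a software model under stated assumptions, not the hardware design from the presentation: carries ripple between modules rather than using carry lookahead, the fault is a hypothetical stuck-at-1 on the low sum bit, and the slot-to-module table stands in for the multiplexing scheme. All names are hypothetical.

```python
MASK4 = 0xF


def add4(a, b, cin):
    """Fault-free 4-bit adder module: returns (4-bit sum, carry out)."""
    s = a + b + cin
    return s & MASK4, s >> 4


def stuck_add4(a, b, cin):
    """Hypothetical faulty module: low sum bit stuck at 1."""
    s, cout = add4(a, b, cin)
    return s | 1, cout


class ModularAdder64:
    SLOTS = 16  # sixteen 4-bit positions make up 64 bits

    def __init__(self, spares=2):
        # Physical modules: 16 in use plus the spares.
        self.modules = [add4] * (self.SLOTS + spares)
        # Which physical module serves each 4-bit position.
        self.slot_to_module = list(range(self.SLOTS))
        self.free_spares = list(range(self.SLOTS, self.SLOTS + spares))

    def inject_fault(self, module_index):
        """Simulate a hard fault in one physical module."""
        self.modules[module_index] = stuck_add4

    def remap(self, slot):
        """Map a spare module into the given position (IndexError if none left)."""
        self.slot_to_module[slot] = self.free_spares.pop(0)

    def add(self, a, b):
        """Add two 64-bit values; the result wraps modulo 2**64."""
        result, carry = 0, 0
        for slot in range(self.SLOTS):
            module = self.modules[self.slot_to_module[slot]]
            nib_a = (a >> (4 * slot)) & MASK4
            nib_b = (b >> (4 * slot)) & MASK4
            s, carry = module(nib_a, nib_b, carry)
            result |= s << (4 * slot)
        return result


adder = ModularAdder64()
before_fault = adder.add(2, 2)
adder.inject_fault(0)
with_fault = adder.add(2, 2)
adder.remap(0)
after_repair = adder.add(2, 2)
```

Here the fault in module 0 corrupts the sum until a spare is mapped into position 0, after which the adder is correct again. Deciding which position to remap is the job of the diagnosis step, which this sketch leaves out.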


Conclusions and Future Work

  • Hierarchical Modular Redundancy can provide high reliability at relatively low cost

  • Future directions

    • Low-level: modular designs of components besides just adders (e.g., multipliers, decoding logic, etc.)

    • Mid-level: modular designs of microprocessors that can tolerate loss of currently critical logic (e.g., decoding)

    • High-level: HMR for chip multiprocessors


Acknowledgments

Several collaborators on this work

  • Co-Investigator Prof. Sule Ozev (Duke ECE)

  • Fred Bower (Duke CS grad and IBM)

  • Mahmut Yilmaz (Duke ECE grad)

  • Derek Hower (Duke ECE undergrad)

