Dynamic Run-time Fault Tolerance in Multi-Processor System on Chips

Presentation Transcript


  1. Dynamic Run-time Fault Tolerance in Multi-Processor System on Chips Prem Kumar Ramesh Department of Electrical and Computer Engineering

  2. Deep Sub Micron Era • Shrinking Transistors • Feature Size < 90 nm • Billion Device Processors • High Performance ICs • Multi-Processor System on Chip

  3. Multi-Processor System on Chip • Tens of processors on a single chip • Much harder than a single-processor system • Processor Configuration • Communication and Synchronization • Poses a challenge to reliability!

  4. Background – Fault Model • Duration • Transient • Permanent • Location • Processing Element • Network on Chip • Time to Failure • Before-Shelf • After-Shelf • Graceful Degradation

  5. Previous Work • Static Redundancy Approach • N copies of the same program run on different PEs • Majority Voting • Not very efficient! • Run-time Recovery Approach • A checker processor is assigned to each PE • The checker ‘commits’ only when its result matches the PE’s result • If not, the task is re-assigned to some other PE
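The static redundancy approach on this slide can be illustrated with a minimal Python sketch. It is not taken from the presentation or from the cited work: the PE functions, the bit-flip fault, and the `checksum` task are all hypothetical stand-ins, and a real MPSoC would vote in hardware or firmware rather than in a script.

```python
from collections import Counter


def healthy_pe(task, data):
    """Hypothetical fault-free processing element: runs the task as-is."""
    return task(data)


def faulty_pe(task, data):
    """Hypothetical faulty PE: flips the lowest bit of the result."""
    return task(data) ^ 1


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if 2 * count > len(results) else None


def run_static_redundancy(task, data, pes):
    """Run the same task on every PE and vote on the collected results."""
    return majority_vote([pe(task, data) for pe in pes])


def checksum(values):
    """Illustrative task: an 8-bit additive checksum."""
    return sum(values) & 0xFF


voted = run_static_redundancy(checksum, [1, 2, 3],
                              [healthy_pe, faulty_pe, healthy_pe])
print(voted)
```

The sketch also makes the inefficiency visible: every task occupies all N PEs, even when no fault occurs.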

  6. Proposed Work • Extends the run-time recovery approach • Dynamic • Resource Utilization • Graceful Degradation • Combines two models • Hardware model • Software model

  7. Hardware Model • Dynamically allocate checkers to PEs • Commits only when both PEs agree • Detects and Corrects Transient Faults • If one PE of a pair fails, the other can be re-allocated to some other PE, allowing graceful degradation
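One way to picture the hardware model is the following Python sketch, written under stated assumptions rather than as the proposed design. The `PE` class, its `transient_faults` counter, the `dead` flag, and the retry limit are all hypothetical; in particular, the sketch assumes that some other mechanism has already marked a permanently failed PE as `dead`.

```python
class PE:
    """Hypothetical processing element with injectable faults."""

    def __init__(self, name, transient_faults=0):
        self.name = name
        self.transient_faults = transient_faults  # upcoming runs to corrupt
        self.dead = False  # assumed to be set by some external fault detector

    def run(self, task, data):
        result = task(data)
        if self.transient_faults > 0:
            self.transient_faults -= 1
            return result + 1  # corrupted result (illustrative)
        return result


def execute_with_checker(task, data, pool, max_attempts=3):
    """Pair two healthy PEs at run time and commit only when they agree."""
    for _ in range(max_attempts):
        alive = [pe for pe in pool if not pe.dead]
        if len(alive) < 2:
            raise RuntimeError("not enough healthy PEs to form a pair")
        primary, checker = alive[0], alive[1]
        a = primary.run(task, data)
        b = checker.run(task, data)
        if a == b:
            return a, (primary.name, checker.name)  # commit
        # Disagreement: assume a transient fault and re-execute.
    raise RuntimeError("PEs did not agree within the retry limit")


def square(x):
    return x * x


pool = [PE("pe0", transient_faults=1), PE("pe1"), PE("pe2")]
print(execute_with_checker(square, 7, pool))  # first attempt disagrees, retry commits
pool[0].dead = True                           # pe0 fails permanently
print(execute_with_checker(square, 3, pool))  # pe1 is re-paired with pe2
```

Because pairs are formed from whatever healthy PEs remain, losing one PE shrinks the pool instead of disabling its partner, which is the graceful degradation the slide refers to.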

  8. Software Model • Addresses Permanent Faults • SPMD (Single Program Multiple Data) suits the situation • MPI-based approach • Splitter-Parallel Tasks-Joiner • In case of a permanent fault, only the data associated with that task needs to be migrated, as all PEs run the same program
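The splitter, parallel tasks, and joiner structure can be sketched in plain Python as below. This is a sequential stand-in, not MPI code and not the author's implementation: the `kernel`, the PE identifiers, and the round-robin choice of a replacement PE are assumptions made for illustration.

```python
def splitter(data, n_parts):
    """Split data into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def kernel(chunk):
    """The single program every PE runs (SPMD); here, a sum of squares."""
    return sum(x * x for x in chunk)


def joiner(partials):
    """Combine the partial results into the final answer."""
    return sum(partials)


def run_spmd(data, pe_ids, failed=frozenset()):
    """Run the kernel over all chunks, migrating chunks away from failed PEs."""
    healthy = [pe for pe in pe_ids if pe not in failed]
    if not healthy:
        raise RuntimeError("no healthy PE left")
    chunks = splitter(data, len(pe_ids))
    partials, placement = [], {}
    for i, (pe, chunk) in enumerate(zip(pe_ids, chunks)):
        if pe in failed:
            # Only the data chunk moves; the program is already on every PE.
            pe = healthy[i % len(healthy)]
        placement[i] = pe
        partials.append(kernel(chunk))  # stands in for execution on that PE
    return joiner(partials), placement


data = list(range(10))
print(run_spmd(data, ["pe0", "pe1", "pe2"]))
print(run_spmd(data, ["pe0", "pe1", "pe2"], failed={"pe1"}))
```

The result is the same with or without the failed PE; only the placement of the second chunk changes, which is why migration cost is limited to that chunk's data.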

  9. Things to Explore Further • MPSoC with Heterogeneous Processors • Simultaneous Multiple Application Processing • Recovering from Control Faults

  10. Simulation Framework • SystemC to model the Framework • C/C++ for the Application to be mapped

  11. Expected Result • Achieve Run-time Dynamic Fault-Recovery with negligible performance (speed-up) cost • Better Resource Utilization • Achieve graceful degradation

  12. Time Line • First and Second Week • Literature Survey • Third Week • Design of the models • Fourth and Fifth Week • Implementation, Coding and Debugging

  13. References [1] Xinping Zhu and Wei Qin, “Prototyping a Fault-Tolerant Multiprocessor SoC with Run-time Fault Recovery,” DAC 2006, July 24-28, 2006, San Francisco, California, USA. [2] Grant Martin, “Overview of the MPSoC Design Challenge,” DAC 2006, July 24-28, 2006, San Francisco, California, USA. [3] Peter Flake and Simon Davidmann and Frank Schirrmeister, “System-Level Exploration Tools for MPSoC Designs,” DAC 2006, July 24-28, 2006, San Francisco, California, USA.
