IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective

by

L. Spainhower & T.A. Gregg

Presented by

Mahmut Yilmaz

Some Terms
  • Concurrent error detection & repair: The system detects errors and repairs itself while it keeps running
  • In-line error checking: Error-detecting codes (EDC) and error-correcting codes (ECC)
  • On-line error correction: Errors are corrected while the system continues to operate
  • Transient (soft) faults: Temporary faults such as bit flips caused by single-event upsets
  • Hard faults: Persistent faults that remain active for a significant period of time (forever?)
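
As a rough illustration of in-line error checking, the sketch below stores one even-parity bit with each data word, which is the simplest form of EDC. The helper names (`parity_bit`, `passes_check`) and the 8-bit word are assumptions made for this example, not details taken from the presentation.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a non-negative integer word."""
    return bin(word).count("1") % 2


def passes_check(word: int, stored_parity: int) -> bool:
    """True if the word still agrees with the parity bit stored alongside it."""
    return parity_bit(word) == stored_parity


data = 0b1011_0010            # hypothetical 8-bit data word
stored = parity_bit(data)     # parity generated when the word is written

single_upset = data ^ (1 << 5)          # simulated soft fault: one bit flips
double_upset = single_upset ^ (1 << 0)  # a second flip in the same word

print(passes_check(data, stored))          # clean word passes
print(passes_check(single_upset, stored))  # single flip is detected
print(passes_check(double_upset, stored))  # double flip escapes plain parity
```

Plain parity only detects an odd number of flipped bits and cannot say which bit is wrong; correcting the error in place needs an ECC.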
Background
  • S/390 failure modes
    • Permanent, intermittent and transient faults
    • If an error occurs frequently and its count reaches a threshold → the fault is treated as permanent
  • Thermal Conduction Module (TCM)
    • TCM: A liquid cooling method introduced by IBM, in which a series of spring-loaded cylinders conducts heat from the chips to the cooling chamber
    • Circuit growth rates exceed reliability gains
    • Parity check and ECC were used
    • Circuits were encapsulated
    • System repair required all system resources
    • Most repairs were concurrent
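
The threshold rule in the failure-modes bullet can be sketched as a per-component error counter. This is a minimal illustration only: the class name `FaultClassifier`, the component label, and the threshold of 3 are assumptions, not values from the presentation.

```python
from collections import Counter


class FaultClassifier:
    """Count detected errors per component and apply a threshold rule."""

    def __init__(self, threshold: int = 3):
        # Hypothetical threshold; a real system would tune this per component.
        self.threshold = threshold
        self.counts = Counter()

    def record(self, component: str) -> str:
        """Register one detected error and classify the underlying fault."""
        self.counts[component] += 1
        if self.counts[component] >= self.threshold:
            return "permanent"      # repeated errors: schedule repair or sparing
        return "recoverable"        # treat as transient/intermittent: retry


classifier = FaultClassifier(threshold=3)
history = [classifier.record("cache-line-17") for _ in range(3)]
print(history)
```

A fuller version would also age the counters, so that rare, widely spaced transient errors never accumulate to the threshold.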
Background (cont.)
  • CMOS
    • G1 (1994) to G5
    • G1: Less reliable than 9020
      • System failures are more probable
    • G2: Dynamic memory sparing
    • G3: More robust ECC & CPU sparing (manual replacement)
    • G4: Concurrent CPU sparing & CPU instruction level retry
    • G5: Most reliable
      • Greatly exceeds any TCM
      • Well protected against soft faults (hard faults?)
Microprocessor Fault Tolerant Design
  • Duplication is used by several systems
    • Intel, Himalaya systems
    • Duplication requires more than 100% hardware overhead
    • Error detection only!
  • Fetch-decode (I-unit) and execute (E-unit) logic is generally not protected
    • S/390 protects both
  • Transient fault rates are increasing with decreased feature sizes
Microprocessor Fault Tolerant Design (cont.)
  • G5 Fault Tolerant Design Point
    • 9X2: Main goal is to keep CPI low
    • G5: Main goal is to keep clock period short
    • In-line error protection is not suitable for G5:
      • High fan-out/fan-in
      • Increased chip area
      • Longer wires
      • Increased path length
    • Result: Duplicated I-unit and E-unit
    • R-unit: A checker similar to the DIVA checker
    • Total hardware overhead: 35%
    • No performance penalty (?)
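
The duplicate-and-compare idea behind the duplicated I-unit and E-unit can be sketched in software as follows. This is only an analogy under stated assumptions: the function `run_duplicated`, the `flip_bits_in_copy_b` fault-injection parameter, and the use of integer addition as the operation are all hypothetical.

```python
import operator


class MismatchError(Exception):
    """Raised when the two copies of a computation disagree."""


def run_duplicated(op, a, b, flip_bits_in_copy_b=0):
    """Run the same operation twice and compare the results.

    flip_bits_in_copy_b is a hypothetical fault-injection mask that
    corrupts the second copy's result to simulate a soft fault.
    """
    result_a = op(a, b)
    result_b = op(a, b) ^ flip_bits_in_copy_b
    if result_a != result_b:
        # The comparison only detects the error; it cannot tell which
        # copy is wrong, so recovery must come from elsewhere (retry).
        raise MismatchError(f"copies disagree: {result_a} != {result_b}")
    return result_a


print(run_duplicated(operator.add, 2, 3))
try:
    run_duplicated(operator.add, 2, 3, flip_bits_in_copy_b=0b100)
except MismatchError as err:
    print("detected:", err)
```

The sketch makes the "error detection only" point concrete: a mismatch says that something went wrong, not where, which is why duplication is paired with a retry mechanism.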
Microprocessor Fault Tolerant Design (cont.)
  • G5 Fault Tolerant Design Point (cont.)
    • Recovery and on-line repair → R-unit
    • L1: Store-through cache
    • L2: Shared memory
      • Line sparing
    • Upon error detection: If the retry is not successful → the CPU is stopped
    • Dynamic CPU sparing (DCS)
    • Faulty CPU R-unit → Spare CPU R-unit
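
The recovery flow on this slide (retry from checkpointed state, stop the CPU if the retry fails, continue on a spare) can be sketched as below. Everything here is a simplified model: the `CPU` class, the `max_retries` parameter, and passing the checkpoint as a plain value are assumptions for illustration, not the G5 mechanism itself.

```python
class DetectedError(Exception):
    """Raised when the (simulated) checking hardware flags an error."""


class CPU:
    def __init__(self, name, transient_faults=0, hard_fault=False):
        self.name = name
        self.transient_faults = transient_faults  # errors that vanish on retry
        self.hard_fault = hard_fault              # error that never goes away
        self.stopped = False

    def run(self, instruction, state):
        if self.hard_fault:
            raise DetectedError(self.name)
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise DetectedError(self.name)
        return instruction(state)


def execute_with_recovery(instruction, checkpoint, cpu, spare, max_retries=1):
    """Retry from the checkpoint; if retries fail, move the work to the spare."""
    for _ in range(1 + max_retries):
        try:
            return cpu.name, cpu.run(instruction, checkpoint)
        except DetectedError:
            pass  # discard partial results and restart from the checkpoint
    cpu.stopped = True  # retry was not successful: stop the faulty CPU
    # The checkpointed state is handed to the spare, which resumes the work.
    return spare.name, spare.run(instruction, checkpoint)


increment = lambda state: state + 1
print(execute_with_recovery(increment, 41, CPU("cpu0", transient_faults=1), CPU("spare")))
print(execute_with_recovery(increment, 41, CPU("cpu1", hard_fault=True), CPU("spare")))
```

In this model, raising `max_retries` is the software counterpart of tolerating a second transient fault that strikes during the retry itself.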
Memory Fault Tolerance
  • ECC
  • Permanent fault in L1 → Cache line or quarter cache delete
  • Permanent fault in L2 → Cache delete
    • Data array or address directory marked as invalid
    • Spare lines
  • L3: Main memory
    • Background scrubbing
    • On-line repair: Built-in spare chips
    • Word line or chip kill → After reaching threshold, replace module
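
Background scrubbing combined with ECC and an error threshold can be sketched as follows. The example uses a textbook Hamming(7,4) single-error-correcting code purely as a stand-in; the actual G5 memory ECC is not described in these slides, and the names `encode`, `correct`, `scrub` and the threshold of 2 are assumptions.

```python
def encode(data):
    """Encode 4 data bits as a Hamming(7,4) codeword (positions 1..7)."""
    c = [0] * 8                      # index 0 unused for 1-based positions
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]


def correct(codeword):
    """Return (data bits, syndrome); a non-zero syndrome is the flipped position."""
    c = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    if syndrome:
        c[syndrome] ^= 1             # repair the single-bit error
    return [c[3], c[5], c[6], c[7]], syndrome


def scrub(memory, error_counts, threshold=2):
    """Walk memory, rewrite corrected words, flag addresses over the threshold."""
    flagged = []
    for addr, word in enumerate(memory):
        data, syndrome = correct(word)
        if syndrome:
            memory[addr] = encode(data)
            error_counts[addr] = error_counts.get(addr, 0) + 1
            if error_counts[addr] >= threshold:
                flagged.append(addr)   # candidate for sparing or replacement
    return flagged


memory = [encode([1, 0, 1, 1]), encode([0, 1, 1, 0])]
counts = {}
memory[1][2] ^= 1                    # simulated soft error
print(scrub(memory, counts))         # corrected, below threshold
memory[1][2] ^= 1                    # same location fails again
print(scrub(memory, counts))         # threshold reached for address 1
```

Scrubbing matters because a single-error-correcting code can only fix one bad bit per word; sweeping memory in the background removes soft errors before a second one lands in the same word.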
I/O & Power/Cooling Subsystem Fault Tolerance
  • Multiple paths → Path redundancy
  • Power/Cooling subsystems
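
Path redundancy for I/O can be sketched as a simple failover loop: the request is tried on each available path in turn until one succeeds. The function `send_over_paths` and the path names are hypothetical, and `OSError` stands in for whatever failure indication a real channel subsystem would report.

```python
def send_over_paths(request, paths):
    """Try each (name, callable) path in order; return the first success.

    Returns (path name, response, list of paths that failed first).
    """
    failed = []
    for name, path in paths:
        try:
            return name, path(request), failed
        except OSError:
            failed.append(name)      # remember the bad path and try the next
    raise RuntimeError(f"all paths failed: {failed}")


def broken_path(request):
    raise OSError("link down")       # simulated failed channel


def healthy_path(request):
    return f"ack:{request}"


print(send_over_paths("read block 7", [("path-A", broken_path), ("path-B", healthy_path)]))
```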
Questions
  • Is duplication the optimal choice? No protection against hard faults!
  • How to protect a CPU against intermittent faults (delay faults)? Generally, they are the beginning phase of a hard fault
  • How to protect an ALU by parity check? An adder? (page 868, 1st paragraph)
  • If the retry is unsuccessful, the CPU is stopped. Would it not be better to use a counter to account for transient faults? What if a transient fault occurs while retrying?
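
On the adder question above, one textbook technique is parity prediction: because each sum bit equals a_i XOR b_i XOR c_i, the parity of the sum equals parity(A) XOR parity(B) XOR parity(carries into each bit). The sketch below illustrates that identity only; it is not claimed to be what the paper describes on page 868, and the function names, the 8-bit width, and the `inject_fault` mask are assumptions.

```python
def parity(x: int) -> int:
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(x).count("1") % 2


def carries_into_each_bit(a: int, b: int, width: int) -> int:
    """Bit i of the result is the carry entering bit position i (ripple model)."""
    carries, carry = 0, 0
    for i in range(width):
        carries |= carry << i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return carries


def add_with_parity_check(a, b, width=8, inject_fault=0):
    """Add modulo 2**width and check the result against its predicted parity.

    inject_fault is a hypothetical mask XORed into the sum to simulate an
    error in the sum logic.
    """
    mask = (1 << width) - 1
    total = ((a + b) & mask) ^ inject_fault
    predicted = parity(a) ^ parity(b) ^ parity(carries_into_each_bit(a, b, width))
    return total, predicted == parity(total)


print(add_with_parity_check(0x5A, 0x3C))                      # fault-free add
print(add_with_parity_check(0x5A, 0x3C, inject_fault=0b100))  # flipped sum bit
```

In this sketch a fault in the carry logic itself is not modeled; covering that case needs additional checking, for example on the carries.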