
Taxonomy and Trends

Dan Siewiorek, Carnegie Mellon University, June 2012


Presentation Transcript


  1. Taxonomy and Trends Dan Siewiorek Carnegie Mellon University June 2012

  2. Outline • Taxonomy and Trends • General Purpose Examples • High Availability Examples • A Methodology • Conclusion

  3. Application Taxonomy
     • General purpose: wide range of applications; frequently high performance
     • High availability: occasional loss of single user but not system; rapid restart
     • Long life: no human maintenance; automatically detect and reconfigure; high coverage
     • Critical computations: usually real-time control systems; low recovery time; high coverage

  4. General Purpose Examples

  5. Error Detection Techniques in Typical General-Purpose System
     • Memory: double-error-detection code on memory data; parity on address and control information
     • Cache: parity on data, address, control information
     • I/O unit: parity on data and control
     • CPU: parity on data paths; parity on control store; duplication and comparison of control logic
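Two of the detection techniques listed on this slide, parity and duplication with comparison, can be illustrated in a few lines. This is a minimal software sketch, not a description of any particular machine: the function names (`parity_bit`, `protect`, `check`, `duplicated`) are hypothetical, and even parity is an assumption.

```python
def parity_bit(word):
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def protect(word):
    """Store a word together with its parity bit."""
    return word, parity_bit(word)


def check(word, stored_parity):
    """Return True if the word still matches its stored parity bit."""
    return parity_bit(word) == stored_parity


def duplicated(unit_a, unit_b, x):
    """Duplication and comparison: run two copies of the same logic
    and signal an error if their outputs disagree."""
    a, b = unit_a(x), unit_b(x)
    if a != b:
        raise RuntimeError("mismatch between duplicated units")
    return a


word, p = protect(0b10110010)
single_flip = word ^ 0b00000100   # one bit in error
double_flip = word ^ 0b00000110   # two bits in error
print(check(word, p), check(single_flip, p), check(double_flip, p))
```

A single parity bit catches any odd number of bit errors but misses the double flip, which is one reason memory data gets a stronger code than plain parity.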

  6. Error Recovery Techniques in Typical General-Purpose System
     • Memory: single-error-correction code on data; retry on address or control information parity error
     • Cache: retry on data, address, control information parity error
     • I/O unit: retry on data or control parity errors
     • CPU: retry on control store parity error; invert sense of control store; macroinstruction retry
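Most of the recovery techniques on this slide are forms of retry: repeat the operation and expect a transient error not to recur. The sketch below shows that pattern in software under stated assumptions. `ParityError`, `with_retry`, `make_flaky`, and the limit of three attempts are all hypothetical names and values chosen for illustration.

```python
class ParityError(Exception):
    """Stand-in for a hardware-reported parity error."""


def with_retry(operation, max_attempts=3):
    """Call operation(); on ParityError, retry up to max_attempts times.
    Returns (result, attempts_used). Re-raises if every attempt fails,
    which is the point where the fault would be treated as permanent."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except ParityError:
            if attempt == max_attempts:
                raise


def make_flaky(failures, value):
    """Build an operation that fails `failures` times, then succeeds."""
    remaining = [failures]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ParityError("parity error on transfer")
        return value

    return operation


result, attempts = with_retry(make_flaky(2, 42))
print(result, attempts)
```

Retry only helps with transient errors; an operation that keeps failing exhausts its attempts and the error propagates to a higher recovery level.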

  7. IBM 3090 Series Fault-Tolerance Features • Reliability • Low intrinsic failure rate technology • Extensive component burn-in during manufacture • Dual processor controller that incorporates switchover • Dual 3370 Direct Access Storage units support switchover • Multiple consoles for monitoring processor activity and for backup • LSI packaging vastly reduces number of circuit connections • Internal machine power and temperature monitoring • Chip sparing in memory replaces defective chips automatically

  8. IBM 3090 Series Fault-Tolerance Features • Availability • Two or four central processors • Automatic error detection and correction in central and expanded storage • Single bit error correction and double bit error detection in central storage • Double bit error correction and triple bit error detection in expanded storage • Storage deallocation in 4K-byte increments under system program control • Ability to vary channels off line in one channel increments • Instruction retry • Channel command retry • Error detection and fault isolation circuits provide improved recovery and serviceability • Multipath I/O controllers and units
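Single-bit error correction with double-bit error detection (SEC-DED) can be illustrated with an extended Hamming code on a 4-bit value. This is a textbook sketch, not the code used in the IBM 3090, whose word width and code are not given here. The function names `encode` and `decode` are hypothetical. The stronger double-correct, triple-detect code mentioned for expanded storage is not shown.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.
    Index 0 is the overall parity bit; indices 1..7 are Hamming positions
    (1, 2, 4 hold check bits; 3, 5, 6, 7 hold data bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status), status in {'ok', 'corrected', 'double-error'}."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of positions holding a 1
    overall = sum(code) % 2          # 0 when overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1          # single error; syndrome 0 means bit 0
        status = "corrected"
    else:
        return None, "double-error"  # detected but not correctable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_bad = list(word)
one_bad[5] ^= 1
two_bad = list(one_bad)
two_bad[2] ^= 1
print(decode(word), decode(one_bad), decode(two_bad))
```

The overall parity bit is what separates the two cases: an odd number of flipped bits points to a correctable single error, while a nonzero syndrome with consistent overall parity signals a double error.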

  9. IBM 3090 Series Fault-Tolerance Features • Data integrity • Key controlled storage protection (store and fetch) • Critical address storage protection • Storage error checking and correction • Processor cache error handling • Parity and other internal error checking • Segment protection (S/370 mode) • Page protection (S/370 mode) • Clear reset of registers and main storage • Automatic Remote Support authorization • Block multiplexer channel command retry • Extensive I/O recovery by hardware and control programs

  10. IBM 3090 Series Fault-Tolerance Features • Serviceability • Automatic fault isolation (analysis routines) concurrent with operation • Automatic remote support capability – auto call to IBM if authorized by the customer • Automatic customer engineer and parts dispatching • Trace facilities • Error logout recording • Microcode update distribution via remote support facilities • Remote service console capability • Automatic validation tests after repair • Customer problem analysis facilities

  11. ED/FI in IBM 308X / 3090 • Hundreds of thousands of isolation domains • Parity checks account for 70-80% of checkers – data, address, and shift/increment parity predictors • Decoder/encoder checkers • 25% of IBM 3090 circuits for RAS • Can instantaneously detect 90% of all errors • 25% of faults assumed solid for the technology • If less than two weeks between events, the cause is assumed to be the same intermittent • Call service if 24 errors in 2 hours
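The last two rules on this slide are threshold rules over an error log, and can be sketched as follows. The numbers (24 errors, 2 hours, two weeks) come from the slide. Everything else is an assumption for illustration: the class and function names, timestamps in seconds, and the choice of a sliding window rather than fixed intervals.

```python
from collections import deque

TWO_HOURS = 2 * 3600
TWO_WEEKS = 14 * 24 * 3600


class ErrorThreshold:
    """Sliding-window counter: signal a service call once `limit`
    errors fall within `window_s` seconds."""

    def __init__(self, limit=24, window_s=TWO_HOURS):
        self.limit = limit
        self.window_s = window_s
        self.events = deque()

    def record(self, t):
        """Record an error at time t (seconds); return True to call service."""
        self.events.append(t)
        while t - self.events[0] > self.window_s:
            self.events.popleft()    # drop errors older than the window
        return len(self.events) >= self.limit


def same_intermittent(previous_t, current_t):
    """Events less than two weeks apart are attributed to the same
    intermittent fault."""
    return current_t - previous_t < TWO_WEEKS


monitor = ErrorThreshold()
calls = [monitor.record(60 * i) for i in range(24)]  # one error per minute
print(calls[22], calls[23])
```

With one error per minute the 24th error trips the threshold; errors spaced ten minutes apart never accumulate 24 within any two-hour window.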

  12. High Availability Examples

  13. Tandem Design Objectives • “Nonstop” operation where failures are detected, failed components are configured out of service, and repaired components are configured back in without stopping other system components • No single hardware failure can compromise data integrity of the system • Modular system expansion through adding more processing power, memory, and peripherals without impacting application software

  14. Fault Containment • Software processes do not share state – only message passing • Hardware – no shared memory, dual porting I/O, multiple power supply

  15. Fast-Fail Modules (detection) • Software – consistency checks, defensive programming • Hardware – software generated status probes, hardware self-tests

  16. Software Bugs • Backup process does not encounter same state and environment, code takes a different path
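The idea that a backup process resumes from checkpointed state rather than reliving the primary's exact execution can be sketched with a toy process pair. This is a simplified single-threaded illustration, not Tandem's implementation: the classes `Process` and `ProcessPair`, the `fail_primary` flag, and checkpointing after every request are all assumptions.

```python
import copy


class Process:
    def __init__(self, name):
        self.name = name
        self.state = {"total": 0}

    def handle(self, amount, fail=False):
        if fail:
            raise RuntimeError(f"{self.name} failed")
        self.state["total"] += amount
        return self.state["total"]

    def receive_checkpoint(self, snapshot):
        # State arrives as a message (a copy); nothing is shared.
        self.state = copy.deepcopy(snapshot)


class ProcessPair:
    def __init__(self):
        self.primary = Process("primary")
        self.backup = Process("backup")

    def request(self, amount, fail_primary=False):
        try:
            result = self.primary.handle(amount, fail=fail_primary)
        except RuntimeError:
            # Takeover: the backup becomes primary and retries the request
            # starting from the last checkpoint it received.
            self.primary, self.backup = self.backup, Process("new-backup")
            result = self.primary.handle(amount)
        self.backup.receive_checkpoint(self.primary.state)
        return result


pair = ProcessPair()
print(pair.request(5), pair.request(3, fail_primary=True), pair.request(2))
```

Because the backup starts from a checkpoint and a fresh environment, a failure that depended on the primary's transient state is unlikely to repeat on the retry.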

  17. Software • Process pairs • Transaction processing – two phase commit protocol • Log write-ahead protocol – record before and after-image of database in an audit trail • Network systems management – programmed operators help reduce administrative errors • Tandem maintenance and diagnostic system – analyzes event logs to successfully call out the FRU 90% of the time
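The write-ahead rule on this slide, recording the before-image and after-image in an audit trail before changing the database, can be sketched with an in-memory store. This is a minimal illustration under stated assumptions: `TinyDB`, its methods, and the undo-only recovery are hypothetical, and concurrency control between interleaved transactions is ignored.

```python
MISSING = object()  # marks "key did not exist" in a before-image


class TinyDB:
    def __init__(self):
        self.data = {}
        self.audit = []          # audit trail: (txn, key, before, after)
        self.committed = set()

    def write(self, txn, key, value):
        before = self.data.get(key, MISSING)
        self.audit.append((txn, key, before, value))  # log first...
        self.data[key] = value                        # ...then update

    def commit(self, txn):
        self.committed.add(txn)

    def recover(self):
        """Undo uncommitted transactions, newest record first,
        by restoring each before-image."""
        for txn, key, before, _after in reversed(self.audit):
            if txn in self.committed:
                continue
            if before is MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = before


db = TinyDB()
db.write("t1", "a", 1)
db.commit("t1")
db.write("t2", "a", 2)   # t2 never commits
db.write("t2", "b", 9)
db.recover()
print(db.data)
```

The after-images play the symmetric role: replaying them for committed transactions would redo work lost from the database itself, which this sketch omits.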

  18. Error Handling • Error detection logic records error • Operating system runs diagnostics • Incident of failure algorithm • If transient, return board to service • If permanent, call Customer Assistance Center (CAC) • CAC determines problem • Selects board of same revision level • Prints installation instructions • Ships via overnight courier • 22 field engineers support 400 systems • Service 6% / year of LCC vs. 9% for others
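The transient-versus-permanent decision in this flow can be sketched as a small triage function. The slide does not say how the algorithm decides, so the criteria here are purely an assumption for illustration: a board is treated as transient if diagnostics pass and it has had fewer than a hypothetical `max_incidents` recent incidents. The function name and return strings are also hypothetical.

```python
def triage(diagnostics_passed, recent_incidents, max_incidents=3):
    """Classify a recorded error.
    Assumed rule (not from the source): transient if diagnostics pass
    and the board has not failed too often recently."""
    if diagnostics_passed and recent_incidents < max_incidents:
        return "return-to-service"   # transient
    return "call-CAC"                # permanent: escalate for replacement


print(triage(True, 0), triage(True, 3), triage(False, 0))
```

Counting recent incidents keeps a board that passes diagnostics but fails repeatedly from being returned to service indefinitely.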

  19. A Methodology

  20. A Methodology • Define objectives • Limit the scope • Define confinement regions • Design error handling mechanisms • Design error reporting mechanisms • Testing of error handling/reporting mechanisms • Evaluate design

  21. Exercising Latent Faults *Special commands to support exercising dormant areas are provided in BIUs and MCUs

  22. Recovery Mechanisms and Coverage

  23. Conclusion

  24. Conclusion • Designing from first principles to produce an architecture to tolerate failures achieves better reliability, availability, and cost-effectiveness than an ad-hoc, add-on approach • It is possible to build systems in which the activities of fault detection, diagnosis, and recovery are completely automated and transparent to the user
