Past High Availability Standards Efforts

Past High Availability Standards Efforts. Jim Gray, Microsoft. http://Research.Microsoft.com/~Gray/Talks/. MTTR is the key metric. Availability = MTBF/(MTBF+MTTR); unavailability = MTTR/(MTBF+MTTR). How to make things twice as good: double MTBF or cut MTTR in half. The NASDAQ Benchmark.


Presentation Transcript


  1. Past High Availability Standards Efforts • Jim Gray, Microsoft • http://Research.Microsoft.com/~Gray/Talks/

  2. MTTR is the key metric • Availability = MTBF/(MTBF+MTTR) • Unavailability = MTTR/(MTBF+MTTR) • How to make things twice as good (halve unavailability): • Double MTBF or • Cut MTTR in half
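The "twice as good" claim can be checked numerically. The sketch below is illustrative only: it uses the standard definitions availability = MTBF/(MTBF+MTTR) and unavailability = MTTR/(MTBF+MTTR), and the 1000-hour MTBF and 1-hour MTTR are assumed values, not figures from the talk.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Fraction of time the system is up."""
    return mtbf / (mtbf + mttr)


def unavailability(mtbf: float, mttr: float) -> float:
    """Fraction of time the system is down."""
    return mttr / (mtbf + mttr)


# Assumed example values (hours); not taken from the slides.
MTBF_HOURS = 1000.0
MTTR_HOURS = 1.0

base = unavailability(MTBF_HOURS, MTTR_HOURS)
double_mtbf = unavailability(2 * MTBF_HOURS, MTTR_HOURS)
half_mttr = unavailability(MTBF_HOURS, MTTR_HOURS / 2)

# Both changes cut unavailability by nearly the same factor (close to 2
# whenever MTTR is much smaller than MTBF).
print(f"baseline unavailability: {base:.6f}")
print(f"double MTBF:             {double_mtbf:.6f}")
print(f"half MTTR:               {half_mttr:.6f}")
```

Under these assumptions the two options are equivalent in effect, which is why repair time deserves as much attention as failure rate.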

  3. The NASDAQ Benchmark • 1994-1995: Unisys, IBM, DEC, Tandem, … • Wanted five 9s (99.999% availability) • Two data centers (hot standby) • Inject faults at a node (cpu, disk, controller, NIC, OS, DB, …) and measure repair time: required 1-10 second repair • Fail site, watch standby take over: required 1 minute repair (transparent to client)
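To put the five-9s target next to the required repair times, the following sketch computes the yearly downtime budget. It assumes a 365-day year and, pessimistically, that every repair counts fully as downtime; the slide itself says site failover was to be transparent to the client, so this is an upper-bound illustration rather than the benchmark's own accounting.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines."""
    allowed_unavailability = 10.0 ** (-nines)
    return allowed_unavailability * SECONDS_PER_YEAR


budget = downtime_budget_seconds(5)

# How many repairs fit in the budget if each one counted as full downtime.
node_faults_at_10s = int(budget // 10)
site_failovers_at_60s = int(budget // 60)

print(f"five 9s budget: {budget:.2f} s/year ({budget / 60:.2f} min)")
print(f"10-second node repairs that fit: {node_faults_at_10s}")
print(f"1-minute site failovers that fit: {site_failovers_at_60s}")
```

The budget is roughly five minutes per year, which shows why repair times had to be specified in seconds rather than minutes or hours.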

  4. RAID Advisory Board http://www.raid-advisory.com/ • Failure Resistant Disk System (FRDS): repair • Failure Tolerant Disk System (FTDS): mask • Disaster Tolerant Disk System (DTDS): 1 km separation • Array Controller (FRAC, FTAC, DTAC) • NO SPECIFIC TIME OR COST METRICS

  5. EDAP Classification Criteria for Disk Systems
     1) Protection Against Data Loss And Loss Of Access To Data Due To Disk Failure.
     2) Reconstruction Of Failed Disk Contents To A Replacement Disk.
     3) Protection Against Data Loss Due To A "Write Hole".
     4) Protection Against Data Loss Due To Attached Equipment Failures.
     5) Protection Against Data Loss Due To Component Failure.
     6) FRU Monitoring And Failure Indication.
     7) Disk Hot Swap.
     8) Protection Against Data Loss Due To Cache Component Failure.
     9) Protection Against Data Loss Due To External Power Failure.
     10) Protection Against Data Loss Due To A Temperature Out Of Operating Range Condition.
     11) Component And Environmental Failure Warning.
     12) Protection Against Loss Of Access To Data Due To Component Failure, Excluding Cache.
     13) Protection Against Loss Of Access To Data Due To Cache Component Failure.
     14) Protection Against Loss Of Access To Data Due To Attached Equipment Failures.
     15) Protection Against Loss Of Access To Data Due To External Power Failure.
     16) Protection Against Loss Of Data Access Due To FRU Replacement.
     17) Disk Hot Spare.
     18) Protection Against Data Loss And Loss Of Access To Data Due To Multiple Disk Failures In An FTDS+.
     19) Protection Against Loss Of Data Access Due To Zone Failure.
     20) Long Distance Protection Against Loss Of Data Access Due To Zone Failure.

  6. TPC effort (1997) • Started by John Kemeny & Dean Brock of Data General • Idea: start with a TPC-C system (benchmark it). • Then bullet-proof it (RAID, power, backup/restore, geoplex) • Then re-measure performance & price-performance • Backup time and impact on online load • Recovery time for various events (fail disks, cpus, ctlrs, nics) • Fail site (disaster recovery with symmetric/asymmetric standby) • Online change: upgrade software; add/replace hardware (cpu, memory, nic, ctlr, disk, tape)
