
Towards Benchmarks for Availability, Maintainability, and Evolutionary Growth (AME): A Case Study of Software RAID Systems

Aaron Brown

2000 Winter IRAM/ISTORE Retreat

Outline
  • Motivation: why new benchmarks, why AME?
  • Availability benchmarks: a general approach
  • Case study: availability of Software RAID
  • Conclusions and future directions
Why benchmark AME?
  • Most benchmarks measure performance
  • Today’s most pressing problems aren’t about performance
    • online service providers face new problems as they rapidly deploy, alter, and expand Internet services
      • providing 24x7 availability (A)
      • managing system maintenance (M)
      • handling evolutionary growth (E)
    • AME issues affect in-house providers and smaller installations as well
      • even UCB CS department
Why benchmarks?
  • Growing consensus that we need to focus on these problems in the research community
    • HotOS-7: “No-Futz Computing”
    • Hennessy at FCRC:
      • “other qualities [than performance] will become crucial: availability, maintainability, scalability”
      • “if access to services on servers is the killer app--availability is the key metric”
    • Lampson at SOSP ’99:
      • big challenge for systems research is building “systems that work: meet specifications, are always available, evolve while they run, grow without practical limits”
  • Benchmarks shape a field!
    • they define what we can measure
The components of AME
  • Availability
    • what factors affect the quality of service delivered by the system, and by how much/how long?
    • how well can systems survive typical failure scenarios?
  • Maintainability
    • how hard is it to keep a system running at a certain level of availability and performance?
    • how complex are maintenance tasks like upgrades, repair, tuning, troubleshooting, disaster recovery?
  • Evolutionary Growth
    • can the system be grown incrementally with heterogeneous components?
    • what’s the impact of such growth on availability, maintainability, and sustained performance?
Outline
  • Motivation: why new benchmarks, why AME?
  • Availability benchmarks: a general approach
  • Case study: availability of Software RAID
  • Conclusions and future directions
How can we measure availability?
  • Traditionally, percentage of time system is up
    • time-averaged, binary view of system state (up/down)
  • Traditional metric is too inflexible
    • doesn’t capture degraded states
      • a non-binary spectrum between “up” and “down”
    • time-averaging discards important temporal behavior
      • compare 2 systems with 96.7% traditional availability:
        • system A is down for 2 seconds per minute
        • system B is down for 1 day per month
  • Solution: measure variation in system quality of service metrics over time
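To make the contrast above concrete, here is a minimal sketch (hypothetical helper names and numbers): both systems score the same under the time-averaged metric, while a trace of a QoS metric over time preserves the behavior that separates them.

```python
# Sketch: time-averaged availability vs. a QoS-over-time view (illustrative only).

def traditional_availability(up_seconds, total_seconds):
    """Time-averaged, binary availability: fraction of time the system is 'up'."""
    return up_seconds / total_seconds

MONTH = 30 * 24 * 3600                        # seconds in a 30-day month

a_up = MONTH * (58 / 60)                      # system A: down 2 seconds per minute
b_up = MONTH - 24 * 3600                      # system B: down 1 day per month

print(traditional_availability(a_up, MONTH))  # ~0.967
print(traditional_availability(b_up, MONTH))  # ~0.967 as well; the metric cannot tell them apart

def qos_deviations(trace, normal):
    """Characterize a QoS trace (e.g., hits/sec per interval) instead of averaging it away."""
    dips = [normal - q for q in trace if q < normal]
    return {
        "degraded_intervals": len(dips),      # how long service was below normal
        "worst_shortfall": max(dips, default=0.0),
        "total_shortfall": sum(dips),         # area between normal and delivered QoS
    }
```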
Example Quality of Service metrics
  • Performance
    • e.g., user-perceived latency, server throughput
  • Degree of fault-tolerance
  • Completeness
    • e.g., how much of relevant data is used to answer query
  • Accuracy
    • e.g., of a computation or decoding/encoding process
  • Capacity
    • e.g., admission control limits, access to non-essential services
Availability benchmark methodology
  • Goal: quantify variation in QoS metrics as events occur that affect system availability
  • Leverage existing performance benchmarks
    • to generate fair workloads
    • to measure & trace quality of service metrics
  • Use fault injection to compromise system
    • hardware faults (disk, memory, network, power)
    • software faults (corrupt input, driver error returns)
    • maintenance events (repairs, SW/HW upgrades)
  • Examine single-fault and multi-fault workloads
    • the availability analogues of performance micro- and macro-benchmarks
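A skeletal driver for a single-fault "micro-benchmark" in this style might look like the sketch below; sample_qos and inject are hypothetical stand-ins for the workload generator's QoS tracing and the fault injector, and the timing constants are arbitrary.

```python
import time

def sample_qos():
    """Stub: would return the QoS metric (e.g., hits/sec) for the last interval."""
    return 0.0

def inject(fault):
    """Stub: would trigger a hardware, software, or maintenance fault."""

def single_fault_run(fault, warmup_s=600, run_s=3600, interval_s=120):
    """Run the workload, inject exactly one fault after warm-up, and trace QoS over time."""
    trace, injected, start = [], False, time.time()
    while time.time() - start < run_s:
        time.sleep(interval_s)
        trace.append(sample_qos())                 # QoS sampled on a fixed schedule
        if not injected and time.time() - start >= warmup_s:
            inject(fault)                          # single fault, no human intervention
            injected = True
    return trace                                   # compared later against no-fault runs
```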
Methodology: reporting results
  • Results are most accessible graphically
    • plot change in QoS metrics over time
    • compare to “normal” behavior
      • 99% confidence intervals calculated from no-fault runs
  • Graphs can be distilled into numbers
    • quantify distribution of deviations from normal behavior, compute area under curve for deviations, ...
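A rough sketch of that distillation, assuming QoS samples pooled from several no-fault runs and a single fault-run trace taken on the same schedule (function names are illustrative):

```python
import statistics

def normal_band(no_fault_runs, z=2.576):
    """99% confidence interval for 'normal' QoS, pooled over the no-fault runs."""
    samples = [q for run in no_fault_runs for q in run]
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
    return mean - half_width, mean + half_width

def shortfall_area(fault_run, lower_bound, interval_s):
    """Area under the curve of QoS deviations below normal (QoS-seconds lost)."""
    return sum((lower_bound - q) * interval_s for q in fault_run if q < lower_bound)
```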
Outline
  • Motivation: why new benchmarks, why AME?
  • Availability benchmarks: a general approach
  • Case study: availability of Software RAID
  • Conclusions and future directions
Case study
  • Software RAID-5 plus web server
    • Linux/Apache vs. Windows 2000/IIS
  • Why software RAID?
    • well-defined availability guarantees
      • RAID-5 volume should tolerate a single disk failure
      • reduced performance (degraded mode) after failure
      • may automatically rebuild redundancy onto spare disk
    • simple system
    • easy to inject storage faults
  • Why web server?
    • an application with measurable QoS metrics that depend on RAID availability and performance
Benchmark environment: metrics
  • QoS metrics measured
    • hits per second
      • roughly tracks response time in our experiments
    • degree of fault tolerance in storage system
  • Workload generator and data collector
    • SpecWeb99 web benchmark
      • simulates realistic high-volume user load
      • mostly static read-only workload; some dynamic content
      • modified to run continuously and to measure average hits per second over each 2-minute interval
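The per-interval measurement amounts to bucketing completed requests into fixed 2-minute windows; a hypothetical sketch of that bookkeeping (not the actual SpecWeb99 modification):

```python
from collections import Counter

INTERVAL = 120  # seconds: one 2-minute reporting window

def average_hits_per_second(completion_times, start_time):
    """Average hits/sec in each 2-minute interval, given request completion timestamps."""
    counts = Counter(int((t - start_time) // INTERVAL) for t in completion_times)
    return {bucket: n / INTERVAL for bucket, n in sorted(counts.items())}
```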
Benchmark environment: faults
  • Focus on faults in the storage system (disks)
  • How do disks fail?
    • according to the Tertiary Disk project, failures include:
      • recovered media errors
      • uncorrectable write failures
      • hardware errors (e.g., diagnostic failures)
      • SCSI timeouts
      • SCSI parity errors
    • note: no head crashes, no fail-stop failures
Disk fault injection technique
  • To inject reproducible failures, we replaced one disk in the RAID with an emulated disk
    • a PC that appears as a disk on the SCSI bus
    • I/O requests processed in software, reflected to local disk
    • fault injection performed by altering SCSI command processing in the emulation software
  • Types of emulated faults:
    • media errors (transient, correctable, uncorrectable)
    • hardware errors (firmware, mechanical)
    • parity errors
    • power failures
    • disk hangs/timeouts
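The sketch below only illustrates the idea of altering SCSI command processing inside the emulator to produce these fault classes; the class name, fault labels, and behavior are illustrative, not the actual emulation software.

```python
import time

class EmulatedDisk:
    """Illustrative model: a software 'disk' whose command processing can be perturbed."""

    def __init__(self, backing_file):
        self.backing = backing_file      # local disk file holding the disk image
        self.fault = None                # e.g., "media-transient", "firmware", "hang"

    def handle_command(self, op, lba, data=None):
        """Process a READ/WRITE request, optionally altered by the injected fault."""
        if self.fault == "hang":
            time.sleep(3600)             # never answer: the host sees a SCSI timeout
        if self.fault == "firmware":
            return ("error", "hardware")          # report a hardware/firmware failure
        if self.fault == "media-uncorrectable" and op == "READ":
            return ("error", "medium")            # uncorrectable media error on reads
        if self.fault == "media-transient" and op == "READ":
            self.fault = None                     # transient: fail once, then behave normally
            return ("error", "medium")
        return self._reflect_io(op, lba, data)    # normal path: reflect to the backing disk

    def _reflect_io(self, op, lba, data):
        ...                              # read/write the backing image (omitted in this sketch)
```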
System configuration
  • RAID-5 volume: 3 GB capacity, 1 GB used per disk
    • 3 physical disks, 1 emulated disk, 1 emulated spare disk
  • 2 web clients connected via 100 Mb switched Ethernet

[Diagram: benchmark hardware. Server: AMD K6-2-333 with 64 MB DRAM running Linux or Windows 2000, IDE system disk, Adaptec 2940 SCSI adapters, and IBM 18 GB 10k RPM RAID data disks. Disk emulator: AMD K6-2-350 running Windows NT 4.0 with the ASC VirtualSCSI library, SCSI system disk, Adaptec 2940 and AdvStor ASC-U2W adapters, presenting the emulated disk and emulated spare disk backed by an IBM 18 GB 10k RPM disk (NTFS). Buses: UltraSCSI and Fast/Wide SCSI (20 MB/sec).]
Results: single-fault experiments
  • One experiment for each type of fault (15 total)
    • only one fault injected per experiment
    • no human intervention
    • system allowed to continue until stabilized or crashed
  • Four distinct system behaviors observed:
    • (A) no effect: system ignores fault
    • (B) RAID system enters degraded mode
    • (C) RAID system begins reconstruction onto spare disk
    • (D) system failure (hang or crash)

System behavior: single-fault

[Graphs: QoS (hits per second) over time for the four single-fault behaviors: (A) no effect, (B) enter degraded mode, (C) begin reconstruction, (D) system failure.]

System behavior: single-fault (2)
  • Windows ignores benign faults
  • Windows can’t automatically rebuild
  • Linux reconstructs on all errors
  • Both systems fail when disk hangs
Interpretation: single-fault exp’ts
  • Linux and Windows take opposite approaches to managing benign and transient faults
    • these faults do not necessarily imply a failing disk
      • Tertiary Disk: 368/368 disks had transient SCSI errors; 13/368 disks had transient hardware errors, only 2/368 needed replacing.
    • Linux is paranoid and stops using a disk on any error
      • fragile: system is more vulnerable to multiple faults
      • but no chance of a slowly-failing disk impacting performance
    • Windows ignores most benign/transient faults
      • robust: less likely to lose data, more disk-efficient
      • less likely to catch slowly-failing disks and remove them
  • Neither policy is ideal!
    • need a hybrid?
Results: multiple-fault experiments
  • Scenario:
    • (1) disk fails
    • (2) data is reconstructed onto spare
    • (3) spare fails
    • (4) administrator replaces both failed disks
    • (5) data is reconstructed onto new disks

  • Requires human intervention
    • to initiate reconstruction on Windows 2000
      • simulate a 6-minute sysadmin response time
    • to replace disks
      • simulate 90 seconds to replace hot-swap disks
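One way to script that scenario with the simulated operator delays is sketched below; the fault, repair, and rebuild hooks are hypothetical, and only Windows 2000 needs the manual rebuild step:

```python
import time

ADMIN_RESPONSE_S = 6 * 60   # simulated sysadmin response time
HOT_SWAP_S = 90             # simulated time to replace a hot-swap disk

def multi_fault_scenario(inject, replace, start_rebuild, wait_for_rebuild, manual_rebuild):
    inject("data-disk")                            # (1) disk fails
    if manual_rebuild:                             # Windows 2000: operator must start the rebuild
        time.sleep(ADMIN_RESPONSE_S)
        start_rebuild()
    wait_for_rebuild()                             # (2) data reconstructed onto the spare
    inject("spare-disk")                           # (3) spare fails
    time.sleep(ADMIN_RESPONSE_S + HOT_SWAP_S)      # operator notices and swaps drives
    replace("data-disk")                           # (4) both failed disks replaced
    replace("spare-disk")
    if manual_rebuild:
        start_rebuild()
    wait_for_rebuild()                             # (5) data reconstructed onto the new disks
```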
System behavior: multiple-fault

[Graphs: hits per second over time during the multiple-fault scenario, Windows 2000/IIS vs. Linux/Apache.]

  • Windows reconstructs ~3x faster than Linux
  • Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
Interpretation: multi-fault exp’ts
  • Linux and Windows have different reconstruction philosophies
    • Linux uses idle bandwidth for reconstruction
      • little impact on application performance
      • increases length of time system is vulnerable to faults
    • Windows steals application bandwidth for reconstruction
      • reduces application performance
      • minimizes system vulnerability
      • but must be manually initiated (or scripted)
    • Windows favors fault-tolerance over performance; Linux favors performance over fault-tolerance
      • the same design philosophies seen in the single-fault experiments
Maintainability Observations
  • Scenario: administrator accidentally removes and replaces live disk in degraded mode
    • double failure; no guarantee on data integrity
    • theoretically, can recover if writes are queued
    • Windows recovers, but loses active writes
      • journalling NTFS is not corrupted
      • all data not being actively written is intact
    • Linux will not allow removed disk to be reintegrated
      • total loss of all data on RAID volume!
Maintainability Observations (2)
  • Scenario: administrator adds a new spare
    • a common task that can be done with hot-swap drive trays
    • Linux requires a reboot for the disk to be recognized
    • Windows can dynamically detect the new disk
  • Windows 2000 RAID is easier to maintain
    • easier GUI configuration
    • more flexible in adding disks
    • SCSI rescan and NTFS deal with administrator goofs
    • less likely to require administration due to transient errors
      • BUT must manually initiate reconstruction when needed
Outline
  • Motivation: why new benchmarks, why AME?
  • Availability benchmarks: a general approach
  • Case study: availability of Software RAID
  • Conclusions and future directions
Conclusions
  • AME benchmarks are needed to direct research toward today’s pressing problems
  • Our availability benchmark methodology is powerful
    • revealed undocumented design decisions in Linux and Windows 2000 software RAID implementations
      • transient error handling
      • reconstruction priorities
  • Windows & Linux SW RAIDs are imperfect
    • but Windows is easier to maintain, less likely to fail due to double faults, and less likely to waste disks
    • if spares are cheap and plentiful, Linux auto-reconstruction gives it the edge
Future Directions
  • Add maintainability
    • use methodology from availability benchmark
    • but include administrator’s response to faults
    • must develop model of typical administrator behavior
    • can we quantify administrative work needed to maintain a certain level of availability?
  • Expand availability benchmarks
    • apply to other systems: DBMSs, mail, distributed apps
    • use ISTORE-1 prototype
      • 80-node x86 cluster with built-in fault injection and diagnostics
  • Add evolutionary growth
    • just extend maintainability/availability techniques?
Availability Example: SW RAID
  • Win2k/IIS, Linux/Apache on software RAID-5 volumes

[Graphs: hits per second over time, Windows 2000/IIS vs. Linux/Apache.]

  • Windows gives more bandwidth to reconstruction, minimizing fault vulnerability at the cost of application performance
    • compare to Linux, which does the opposite

System behavior: multiple-fault

[Graphs: hits per second over time during the multiple-fault scenario, Windows 2000/IIS vs. Linux/Apache, annotated with events: (1) data disk faulted, (2) reconstruction, (3) spare faulted, (4) disks replaced, (5) reconstruction.]

  • Windows reconstructs ~3x faster than Linux
  • Windows reconstruction noticeably affects application performance, while Linux reconstruction does not