Latency as a Performability Metric: Experimental Results

Pete Broadwell

Outline
  • Motivation and background
    • Performability overview
    • Project summary
  • Test setup
    • PRESS web server
    • Mendosus fault injection system
  • Experimental results & analysis
    • How to represent latency
    • Questions for future research
Performability overview
  • Goal of ROC project: develop metrics to evaluate new recovery techniques
  • Performability – class of metrics to describe how a system performs in the presence of faults
    • First used in the fault-tolerant computing field¹
    • Now being applied to online services

¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994

Example: microbenchmark

[Figure: RAID disk failure]

Project motivation
  • Rutgers study: performability analysis of a web server, using throughput
  • Other studies (especially from the HP Labs storage group) also use response time as a metric
  • Assertion: latency and data quality are better than throughput for describing user experience
  • How best to represent latency in performability reports?
Project overview
  • Goals:
    • Replicate PRESS/Mendosus study with response time measurements
    • Discuss how to incorporate latency into performability statistics
  • Contributions:
    • Provide a latency-based analysis of a web server’s performability (currently rare)
    • Further the development of more comprehensive dependability benchmarks
Experiment components
  • The Mendosus fault injection system
    • From Rutgers (Rich Martin)
    • Goal: low-overhead emulation of a cluster of workstations, injection of likely faults
  • The PRESS web server
    • Cluster-based, uses cooperative caching. Designed by Carrera et al. (Rutgers)
    • Perf-PRESS: basic version
    • HA-PRESS: incorporates heartbeats and a master node for automated cluster management
  • Client simulators
    • Submit a set number of requests/sec, based on real traces
Mendosus design

[Figure labels: Workstations (real or VMs); Global Controller; config file; Fault config; User-level daemon (Java); Emulated LAN]

Test case timeline

- Warm-up time: 30-60 seconds

- Time to repair: up to 90 seconds

Simplifying assumptions
  • Operator repairs any non-transient failure after 90 seconds
  • Web page size is constant
  • Faults are independent
  • Each client request is independent of all others (no sessions!)
    • Request arrival times are determined by a Poisson process (not self-similar)
  • Simulated clients abandon a connection attempt after 2 seconds and give up on a page load after 8 seconds
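
The arrival and timeout assumptions above can be sketched in a few lines of Python. This is an illustrative sketch only, not the client simulator used in the study: the function names, the fixed seed, and the outcome labels are hypothetical, and it assumes the response time is measured from the start of the request. Only the Poisson arrival process and the 2-second and 8-second limits come from the slide.

```python
import random

CONNECT_TIMEOUT_S = 2.0  # client abandons the connection attempt (from the slide)
PAGE_TIMEOUT_S = 8.0     # client gives up on the page load (from the slide)


def poisson_arrival_times(rate_per_sec, duration_s, seed=0):
    """Return request arrival times for a Poisson process.

    Inter-arrival gaps of a Poisson process are exponentially distributed,
    so exponential gaps are summed until the duration is exceeded.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_per_sec)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_sec)
    return times


def classify_request(connect_time_s, response_time_s):
    """Label one request with a hypothetical outcome name."""
    if connect_time_s > CONNECT_TIMEOUT_S:
        return "aborted"     # no connection within 2 seconds
    if response_time_s > PAGE_TIMEOUT_S:
        return "timed_out"   # page not loaded within 8 seconds
    return "served"
```

Because each gap is drawn independently, the generated requests carry no session structure, which matches the "no sessions" assumption.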
Sample result: app crash

[Figure: results for the application crash test case]

Sample result: node hang

[Figure: results for the node hang test case]
Representing latency
  • Total seconds of wait time
    • Not good for comparing cases with different workloads
  • Average (mean) wait time per request
    • OK, but requires that expected (normal) response time be given separately
  • Variance of wait time
    • Not very intuitive to describe. Also, with a read-only workload, all variance is toward longer wait times anyway
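
The three candidate statistics above can be computed side by side, as in the following minimal sketch. The function name and the dictionary keys are hypothetical, and population variance is an assumed choice; the slide does not say which variance estimator was used.

```python
from statistics import mean, pvariance


def latency_summary(wait_times_s):
    """Summarize per-request wait times (in seconds)."""
    return {
        "total_s": sum(wait_times_s),            # depends on workload size
        "mean_s": mean(wait_times_s),            # needs a normal baseline to interpret
        "variance_s2": pvariance(wait_times_s),  # population variance (assumption)
    }
```

The total grows with the number of requests, which is why it is a poor basis for comparing cases with different workloads, while the mean stays comparable.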
Representing latency (2)
  • Consider “goodput”-based availability: (total responses served) / (total requests)
  • Idea: latency-based “punctuality”: (ideal total latency) / (actual total latency)
  • Like goodput, maximum value is 1
  • “Ideal” total latency: (average latency for non-fault cases) × (total number of requests); this should never be 0
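
The two ratios above can be written out as a short Python sketch. The function and parameter names are hypothetical; the formulas follow the slide. Note that the sketch does not clamp the result, so a run whose actual total latency falls below the ideal estimate would yield a value slightly above 1.

```python
def goodput_availability(responses_served, total_requests):
    """Availability = total responses served / total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return responses_served / total_requests


def punctuality(actual_latencies_s, fault_free_mean_latency_s):
    """Punctuality = ideal total latency / actual total latency.

    The ideal total is the fault-free average latency multiplied by the
    number of requests observed in the run.
    """
    ideal_total = fault_free_mean_latency_s * len(actual_latencies_s)
    actual_total = sum(actual_latencies_s)
    if actual_total <= 0:
        raise ValueError("actual total latency must be positive")
    return ideal_total / actual_total
```

For example, four requests with latencies of 0.1, 0.1, 0.4 and 0.2 seconds against a fault-free average of 0.1 seconds give an ideal total of 0.4 seconds, an actual total of 0.8 seconds, and a punctuality of about 0.5.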
Representing latency (3)
  • Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience)
    • Can capture these in a separate statistic (e.g., 1% of 100k responses took more than 8 sec)
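
A separate spike statistic of this kind could be computed as below. This is a sketch under assumptions: the function name is hypothetical, and the 8-second default is borrowed from the example on the slide.

```python
def fraction_slower_than(wait_times_s, threshold_s=8.0):
    """Return the fraction of responses whose wait time exceeds the threshold."""
    if not wait_times_s:
        raise ValueError("need at least one wait time")
    slow = sum(1 for w in wait_times_s if w > threshold_s)
    return slow / len(wait_times_s)
```

Reported next to the aggregate punctuality, this fraction exposes brief, severe spikes that the aggregate ratio averages away.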
Other metrics
  • Data quality, latency and throughput are interrelated
    • Is a 5-second wait for a response “worse” than waiting 1 second to get a “try back later”?
  • To combine DQ, latency and throughput, can use a “demerit” system (proposed by Keynote)¹
    • These can be very arbitrary, so it’s important that the demerit formula be straightforward and publicly available

¹ Zona Research and Keynote Systems, The Need for Speed II, 2001

Sample demerit system
  • Rules:
    • Each aborted connection (2-second connect timeout): 2 demerits
    • Each connection error: 1 demerit
    • Each user timeout (8 seconds): 8 demerits
    • Each second of total latency above the ideal level: (1 demerit / total number of requests) × scaling factor
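
The rules above translate directly into a small scoring function. The following is a sketch, not the formula used in the study: the function and parameter names are hypothetical, the default scaling factor of 1.0 is an assumption (the slide gives no value), and latency below the ideal level is assumed to earn no credit.

```python
def demerits(aborted_conns, conn_errors, user_timeouts,
             actual_total_latency_s, ideal_total_latency_s,
             total_requests, scaling_factor=1.0):
    """Score one test run with the sample demerit rules."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Only latency above the ideal level is penalized (assumption).
    excess_s = max(0.0, actual_total_latency_s - ideal_total_latency_s)
    latency_demerits = excess_s * (1.0 / total_requests) * scaling_factor
    return (2 * aborted_conns      # aborted connections (2-second limit)
            + 1 * conn_errors      # connection errors
            + 8 * user_timeouts    # user timeouts (8-second limit)
            + latency_demerits)
```

With 3 aborted connections, 4 connection errors, 1 user timeout, 50 seconds of excess latency over 1000 requests and a scaling factor of 10, the score is 6 + 4 + 8 + 0.5, or about 18.5 demerits.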
Online service optimization

[Figure labels: Cheap, robust & fast (optimal); Cheap, fast & flaky; Expensive, robust and fast; Cheap & robust, but slow; Expensive, fast & flaky; Expensive & robust, but slow; Cost of operations & components; Performance metrics: throughput, latency & data quality; workload & faults]

  • Latency-based punctuality and throughput-based availability give similar results for a read-only web workload
  • Applied workload is very important
    • Reliability metrics do not (and should not) reflect maximum performance/workload!
  • Latency did not degrade gracefully in proportion to workload
    • At high loads, PRESS “oscillates” between full service and 100% load shedding
Further Work
  • Combine test results & predicted component failure rates to get long-term performability estimates (are these useful?)
  • Further study will benefit from more sophisticated client & workload simulators
  • Services that generate dynamic content should lead to more interesting data (e.g., RUBiS)
Example: long-term model

Discrete-time Markov chain (DTMC) model of a RAID-5 disk array¹

D = number of data disks

p_i(t) = probability that the system is in state i at time t

w_i(t) = reward in state i at time t (disk I/O operations/sec)

μ = disk repair rate

λ = failure rate of a single disk drive

¹ Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
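
With these definitions, a common way to summarize a Markov reward model is the expected reward at time t. The expression below is the standard form, offered as a sketch: the state diagram from the slide is not recoverable, so the state set S is left abstract and is an assumption here.

```latex
\mathbb{E}[W(t)] = \sum_{i \in S} p_i(t)\, w_i(t),
\qquad \sum_{i \in S} p_i(t) = 1
```

Here W(t) denotes the reward earned at time t, so the sum weights each state's disk I/O rate by the probability of being in that state.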