Latency as a Performability Metric: Experimental Results

Pete Broadwell
[email protected]

Outline

  • Motivation and background

    • Performability overview

    • Project summary

  • Test setup

    • PRESS web server

    • Mendosus fault injection system

  • Experimental results & analysis

    • How to represent latency

    • Questions for future research

Performability overview

  • Goal of ROC project: develop metrics to evaluate new recovery techniques

  • Performability – class of metrics to describe how a system performs in the presence of faults

    • First used in the fault-tolerant computing field [1]

    • Now being applied to online services

[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994

Example: microbenchmark

RAID disk failure

Project motivation

  • Rutgers study: performability analysis of a web server, using throughput

  • Other studies (esp. from HP Labs Storage group) also use response time as a metric

  • Assertion: latency and data quality are better than throughput for describing user experience

  • How best to represent latency in performability reports?

Project overview

  • Goals:

    • Replicate PRESS/Mendosus study with response time measurements

    • Discuss how to incorporate latency into performability statistics

  • Contributions:

    • Provide a latency-based analysis of a web server’s performability (currently rare)

    • Further the development of more comprehensive dependability benchmarks

Experiment components

  • The Mendosus fault injection system

    • From Rutgers (Rich Martin)

    • Goal: low-overhead emulation of a cluster of workstations, injection of likely faults

  • The PRESS web server

    • Cluster-based, uses cooperative caching. Designed by Carreira et al. (Rutgers)

    • Perf-PRESS: basic version

    • HA-PRESS: incorporates heartbeats and a master node for automated cluster management

  • Client simulators

    • Submit a set number of requests/sec, based on real traces

Mendosus design

[Diagram: a Global Controller, configured by a config file and a fault config, drives user-level daemons (Java) on workstations (real or VMs) connected by an emulated LAN]
Test case timeline

- Warm-up time: 30-60 seconds

- Time to repair: up to 90 seconds

Simplifying assumptions

  • Operator repairs any non-transient failure after 90 seconds

  • Web page size is constant

  • Faults are independent

  • Each client request is independent of all others (no sessions!)

    • Request arrival times are determined by a Poisson process (not self-similar)

  • Simulated clients abandon connection attempt after 2 secs, give up on page load after 8 secs
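The client model above can be sketched as a small simulation: Poisson arrivals via exponential inter-arrival gaps, a 2-second connection timeout, and an 8-second page-load timeout. This is an illustrative sketch, not the actual client simulator from the study; the request rate, seed, and function names are invented.

```python
import random

def poisson_arrival_times(rate_per_sec, duration_sec, seed=42):
    """Generate request arrival times for a Poisson process:
    inter-arrival gaps are exponentially distributed."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_sec)  # exponential gap
        if t >= duration_sec:
            return arrivals
        arrivals.append(t)

CONNECT_TIMEOUT = 2.0  # client abandons connection attempt after 2 s
LOAD_TIMEOUT = 8.0     # client gives up on page load after 8 s

def classify(connect_delay, load_delay):
    """Classify a simulated request outcome under the timeout rules."""
    if connect_delay > CONNECT_TIMEOUT:
        return "aborted_connection"
    if connect_delay + load_delay > LOAD_TIMEOUT:
        return "user_timeout"
    return "served"

arrivals = poisson_arrival_times(rate_per_sec=100, duration_sec=60)
print(len(arrivals))  # on the order of rate_per_sec * duration_sec
```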

Sample result: app crash

[Graph omitted]

Sample result: node hang

[Graph omitted]
Representing latency

  • Total seconds of wait time

    • Not good for comparing cases with different workloads

  • Average (mean) wait time per request

    • OK, but requires that expected (normal) response time be given separately

  • Variance of wait time

    • Not very intuitive to describe. Also, read-only workload means that all variance is toward longer wait times anyway

Representing latency (2)

  • Consider “goodput”-based availability: (total responses served) / (total requests)

  • Idea: latency-based “punctuality”: (ideal total latency) / (actual total latency)

  • Like goodput, maximum value is 1

  • “Ideal” total latency: average latency for non-fault cases × total # of requests (must be nonzero)

Representing latency (3)

  • Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience)

    • Can capture these in a separate statistic (e.g., “1% of 100k responses took >8 sec”)
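The companion tail statistic suggested above might look like this; the threshold and data are illustrative.

```python
def tail_fraction(latencies, threshold=8.0):
    """Fraction of responses slower than a user-visible threshold."""
    return sum(1 for x in latencies if x > threshold) / len(latencies)

lat = [0.1] * 99 + [9.0]   # one severe spike in 100 responses
print(tail_fraction(lat))  # 0.01 -> "1% of responses took >8 sec"
```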

Other metrics

  • Data quality, latency and throughput are interrelated

    • Is a 5-second wait for a response “worse” than waiting 1 second to get a “try back later”?

  • To combine DQ, latency and throughput, can use a “demerit” system (proposed by Keynote) [1]

    • These can be very arbitrary, so it’s important that the demerit formula be straightforward and publicly available

[1] Zona Research and Keynote Systems, The Need for Speed II, 2001

Sample demerit system

[Chart omitted]

  • Rules:

    • Each aborted (2s) conn: 2 demerits

    • Each conn error: 1 demerit

    • Each user timeout (8s): 8 demerits

    • Each second of total latency above the ideal level: (1 demerit / total # of requests) × scaling factor
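The rules above can be sketched as a small demerit calculator. The event counts, latencies, and scaling factor below are invented for illustration, not measured values from the study.

```python
def demerits(aborted, conn_errors, timeouts,
             total_latency, ideal_latency, total_requests, scale=1000):
    """Score a test run under the demerit rules listed above."""
    score = 2 * aborted + 1 * conn_errors + 8 * timeouts
    # Each second of latency above the ideal costs
    # (1 demerit / total # of requests) * scaling factor.
    excess_sec = max(0.0, total_latency - ideal_latency)
    score += excess_sec * (1.0 / total_requests) * scale
    return score

# 5 aborts, 3 connection errors, 2 timeouts, 50 s of excess latency:
print(demerits(aborted=5, conn_errors=3, timeouts=2,
               total_latency=150.0, ideal_latency=100.0,
               total_requests=1000))  # 10 + 3 + 16 + 50 = 79.0
```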

Online service optimization

[Diagram: a trade-off space of cost of operations & components vs. performance metrics (throughput, latency & data quality) under workload & faults. Regions range from “cheap, fast & flaky”, “expensive, fast & flaky”, “cheap & robust, but slow”, and “expensive & robust, but slow” through “expensive, robust and fast” to the optimum: “cheap, robust & fast”]

  • Latency-based punctuality and throughput-based availability give similar results for a read-only web workload

  • Applied workload is very important

    • Reliability metrics do not (and should not) reflect maximum performance/workload!

  • Latency did not degrade gracefully in proportion to workload

    • At high loads, PRESS “oscillates” between full service and 100% load shedding

Further work

  • Combine test results & predicted component failure rates to get long-term performability estimates (are these useful?)

  • Further study will benefit from more sophisticated client & workload simulators

  • Services that generate dynamic content should lead to more interesting data (ex: RUBiS)

Example: long-term model

Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1]

D = number of data disks

p_i(t) = probability that the system is in state i at time t

w_i(t) = reward (disk I/O operations/sec)

μ = disk repair rate

λ = failure rate of a single disk drive

[1] Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
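A model of this shape can be sketched numerically: track p_i(t) over discrete steps and weight each state by its reward w_i to get an expected I/O rate. The per-step probabilities (standing in for μ and λ), state set, and reward values below are assumptions for illustration, not figures from Kari's thesis.

```python
# States: 0 = all disks up, 1 = one disk failed (degraded), 2 = data loss.
D = 4         # data disks (D + 1 drives total, including parity)
lam = 1e-4    # per-step failure probability of a single disk (assumed)
mu = 1e-2     # per-step repair probability (assumed)

# Per-step transition matrix P[i][j]; each row sums to 1.
P = [
    [1 - (D + 1) * lam, (D + 1) * lam,    0.0],      # any of D+1 disks fails
    [mu,                1 - mu - D * lam, D * lam],  # repair vs. 2nd failure
    [0.0,               0.0,              1.0],      # data loss is absorbing
]
w = [1000.0, 600.0, 0.0]  # reward: I/O ops/sec per state (illustrative)

def step(p, P):
    """One DTMC step: p(t+1)_j = sum_i p(t)_i * P[i][j]."""
    return [sum(p[i] * P[i][j] for i in range(3)) for j in range(3)]

p = [1.0, 0.0, 0.0]  # start with all disks operational
expected_reward = []
for t in range(1000):
    expected_reward.append(sum(pi * wi for pi, wi in zip(p, w)))
    p = step(p, P)

print(expected_reward[0])             # 1000.0 at t = 0
print(round(expected_reward[-1], 1))  # long-run expected I/O rate, degraded
```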