How do we define success? (or, benchmarks and metrics for RADS)
Downstairs Group: Amr Awadallah, Aaron Brown, Arnold de Leon, Archana Ganapathi, Kim Keeton, Matthew Merzbacher, Divya Ramachandran, Wei Xu
Approaching the problem
• Yardstick for evaluating progress/success
  • standard: a predefined target that must be met (w/ cost)
  • benchmark: a variable scale of “goodness”
• What aspects to measure?
  • utility of system to end user
  • adaptability
  • cost: capital cost, TCO, administrative cost, cost to end users
  • value: how does RADS improve value to service providers and end users?
• Proposed approach: vectors and weighting functions (sketched below)
  • collect vector of metrics: components of end-user & admin utility
  • if mapping available, weight components to compute value according to perspective of interest
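To make the vector-and-weights idea concrete, here is a minimal Python sketch: a vector of raw metrics is reduced to a scalar value by a per-perspective weighting function. The metric names, weights, and perspectives are invented for illustration and are not part of the original proposal.

```python
# Minimal sketch: raw utility metrics form a vector, and a per-perspective
# weighting function reduces that vector to a single "value" score.
# All metric names, weights, and numbers below are hypothetical.

RAW_METRICS = {
    "response_time_ms": 120.0,   # end-user latency
    "goodput_req_s": 950.0,      # useful work delivered
    "availability": 0.999,       # fraction of time service is up
    "admin_hours_week": 6.0,     # administrative effort
}

# One weight vector per perspective of interest; the sign encodes whether
# more of a metric is better (+) or worse (-) for that stakeholder.
PERSPECTIVES = {
    "end_user": {"response_time_ms": -0.01, "goodput_req_s": 0.001,
                 "availability": 100.0, "admin_hours_week": 0.0},
    "provider": {"response_time_ms": -0.001, "goodput_req_s": 0.0005,
                 "availability": 50.0, "admin_hours_week": -2.0},
}

def value(metrics, weights):
    """Weighted reduction of the metric vector to a scalar value."""
    return sum(w * metrics[name] for name, w in weights.items())

for who, weights in PERSPECTIVES.items():
    print(f"{who}: value = {value(RAW_METRICS, weights):.2f}")
```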
Evaluation Process
• Define raw metrics
  • aspects of end-user utility
• Define mapping to value
  • weights for reducing the utility vector
  • standards: sets of values representing targets
• Create the evaluation environment
  • requires a specific application context
• Develop a perturbation set
  • define bottom-up: what can go wrong?
  • categories: failures, security, workload, human, configuration
• Apply perturbation set repeatedly (sketched below)
  • measure initial behavior and adaptability/learning curve
• Evaluate management interaction
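A minimal sketch of the repeated-perturbation step, assuming hypothetical inject() and measure() hooks; the perturbation names mirror the categories on this slide, and everything else is illustrative rather than a real harness.

```python
import random
import statistics

# Perturbation categories taken from the slide; inject() and measure()
# are hypothetical stand-ins for a real fault injector and a metric probe.
PERTURBATIONS = ["failure", "security", "workload",
                 "human", "configuration"]

def inject(perturbation):
    """Placeholder: trigger one perturbation against the system under test."""
    pass

def measure():
    """Placeholder: sample one component of the utility vector."""
    return random.uniform(0.7, 1.0)   # e.g. normalized goodput

def run_benchmark(rounds=5):
    """Apply the whole perturbation set repeatedly; the per-round scores
    trace the adaptability/learning curve (round 0 = initial behavior)."""
    curve = []
    for _ in range(rounds):
        scores = []
        for p in PERTURBATIONS:
            inject(p)
            scores.append(measure())
        curve.append(statistics.mean(scores))
    return curve

print(run_benchmark())
```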
Some Possible Metrics
• End-user
  • response time, throughput (& stats, histograms)
  • action-weighted goodput (Gaw; sketched below)
  • correctness (relative to gold standard + consistency model)
  • completeness
  • restartability/preservation of context across failure
  • web-specific metrics: user-initiated aborts, clickthrough rate, coverage, abandonment rate
  • many of these require sophisticated user models & workloads
• Operations
  • complexity of interaction (goal-based?)
  • transparency of controls and system behavior
  • validation against human-factors design principles
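Gaw is only named on this slide, so the following is a hypothetical sketch of the general idea (weighting completed user actions by their importance rather than counting raw requests); the action types and weights are assumptions, and the actual Gaw definition may differ in detail.

```python
# Hypothetical sketch of action-weighted goodput: each successfully
# completed user action earns its weight; failed actions earn nothing.
# Action types and weights are illustrative assumptions.

ACTION_WEIGHTS = {"browse": 1.0, "add_to_cart": 3.0, "checkout": 10.0}

def action_weighted_goodput(log, duration_s):
    """log: iterable of (action, succeeded) pairs from one benchmark run."""
    earned = sum(ACTION_WEIGHTS[action] for action, ok in log if ok)
    return earned / duration_s   # weight units delivered per second

trace = [("browse", True), ("add_to_cart", True),
         ("checkout", False), ("browse", True)]
print(action_weighted_goodput(trace, duration_s=60.0))
```

Under a definition like this, a run that completes many low-value page views can still score below one that completes a few high-value transactions, which is the point of weighting actions.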
PhD Theses* (*or chapters, at least)
• Modeling sophisticated end-users for workload generation
  • how do users respond to failure events? adapt goals upon unexpected events?
  • how to abstract this into a workload generator?
• Measuring administrative complexity and transparency
  • comprehensibility/predictability of automatic recovery actions
  • complexity and transparency of manual processes and controls
  • cleanliness of mental models
• Multi-level benchmarking
  • extrapolating component-level benchmark results to the full system
• Value and cost modeling
• Design and implementation of the evaluation framework
• Validating the benchmark
  • is it representative of the real world?
  • assessing resilience of benchmark metrics to changes in environment and perturbations