
Understanding and dealing with operator mistakes in Internet services

K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University

OSDI 2003

Vivo Project http://vivo.cs.rutgers.edu

(based on slides from the authors’ OSDI presentation)

Fabián E. Bustamante, Winter 2006


Motivation

  • Internet services are ubiquitous, e.g., Google, Yahoo!, eBay, etc.

    • Expect 24 x 7 availability, but service outages still happen!

  • A significant number of outages in Internet services are the result of operator actions

    1: Architecture is complex

    2: Systems are constantly evolving

    3: Lack of tools for operators to reason about the impact of their actions: Offline testing, emulation, simulation

  • Very little detail on operator mistakes

    • Details strongly guarded by companies and administrators

CS 395/495 Autonomic Computing Systems, EECS, Northwestern University


This work

  • Understanding: Gather detailed data on operators’ mistakes

    • What categories of mistakes?

    • What’s the impact on the service?

    • How do mistakes correlate with experience, impact?

    • Caveat: this is not a complete study of operator behavior

  • Approaches to deal with operator mistakes: prevention, recovery, automation

  • Validation: Allow operators to evaluate the correctness of their actions prior to exposing them to the service

    • Like offline testing, but:

      • Virtual environment (extension of online environment)

      • Real workload

      • Migration back and forth with minimal operator involvement



Contributions

  • Detailed information on operator tasks and mistakes

    • 43 experiments: detailed data on operator behavior, including 42 mistakes

    • 64% immediately degraded throughput

    • 57% were software configuration mistakes

    • Human experiments are possible and valuable!

  • Designed and prototyped a validation infrastructure

    • Implemented on 2 cluster-based services: cooperative Web server (PRESS) and a multi-tier auction service

    • 2 techniques to allow operators to validate their actions

  • Demonstrated validation is a promising technique for reducing impact of operator mistakes

    • 66% of all mistakes observed in operator study caught

    • 6 of 9 mistakes caught in live operator experiments with validation

    • Successfully tested with synthetically injected mistakes



Talk outline

  • Approach and contributions

  • Operator study: Understanding the mistakes

    • Representative environment

    • Choice of human subjects and experiments

    • Results

  • Validation: Preventing exposure of mistakes

  • Conclusion and future work



Multi-tiered Internet services

  • On-line auction service, similar to eBay

  • Client emulator exercises the service

  • Code from the DynaServer project

[Figure: three-tier service. Tier 1: Web servers; Tier 2: application servers; Tier 3: database]



Tasks, operators & training

  • Tasks – two categories

    • Scheduled maintenance tasks (proactive), e.g., upgrading software

    • Diagnose-and-repair tasks (reactive), e.g. disk failure

  • Operator composition

    • 14 computer science graduate students

    • 5 professional programmers (Ask Jeeves)

    • 2 sysadmins from our department

  • Categorization of operators, based on a filled-in questionnaire

    • 11 novices – some familiarity with set up

    • 5 intermediates – experience with a similar service

    • 5 experts: in charge of a service requiring high uptime

  • Operator training

    • Novice operators given warm-up tasks

    • Material describing service, and detailed steps for tasks



Experimental setup

  • Service

    • 3-tier auction service, and client emulator from Rice University’s DynaServer Project

    • Loaded at 35% of capacity

  • Machines

    • 2 Web servers (Apache),

    • 5 application servers (Tomcat),

    • 1 database machine (MySQL)

  • Operator assistance & data capture

    • Monitor service throughput

    • Modified bash shell for command and result trace

  • Manual observation

    • Noting anomalies in operator behavior

    • Bailing out ‘lost’ operators



Example trace

[Figure: trace annotated with three events: first Apache misconfigured and restarted; second Apache misconfigured and restarted; application server added]

  • Task: Add an application server

    • Mistake: Apache misconfiguration

    • Impact: Degraded throughput



Sampling of other mistakes

  • Adding a new application server

    • Omission of new application server from backend member list

    • Syntax errors, duplicate entries, wrong hostnames

    • Launching the wrong version of software

  • Migrating the database for performance upgrade

    • Incorrect privileges for accessing the database

      • Security vulnerability

    • Database installed on wrong disk



Operator mistakes: Category vs. impact

  • 64% of all mistakes had immediate impact on service performance

    • 36% resulted in latent faults

  • Obs. #1: Significant no. of mistakes can be checked by testing with a realistic environment

  • Obs. #2: Undetectable latent errors will still require online-recovery techniques



Operator mistakes

  • Misconfigurations account for 57% of all errors

    • Config. mistakes spanning multiple components are more likely (global misconfigurations)

  • Obs. #1: Tools to manipulate & check configs are crucial

  • Obs. #2: Care is needed when maintaining multiple versions of software



Operator categories

  • Experts also made mistakes!

    • Complexity of tasks executed by experts was higher



Summary of operator study

  • 43 experiments → 42 mistakes

  • 27 (64%) mistakes caused immediate impact on service performance

  • 24 (57%) were software configuration mistakes

  • Mistakes were made across all operator categories

  • Trace of operator commands & service performance for all experiments

    • Available at http://vivo.cs.rutgers.edu
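The percentages in this summary can be checked with a few lines of arithmetic. The sketch below uses only the counts stated on the slides (42 mistakes, 27 with immediate impact, 24 configuration mistakes); the variable names are illustrative.

```python
# Counts taken from the slides; names are illustrative only.
total_mistakes = 42
immediate_impact = 27
config_mistakes = 24

# Mistakes without immediate impact are the latent faults mentioned earlier.
latent = total_mistakes - immediate_impact


def pct(count: int) -> int:
    """Share of all mistakes, rounded to the nearest whole percent."""
    return round(100 * count / total_mistakes)


print(pct(immediate_impact), pct(latent), pct(config_mistakes))
```

The three values agree with the 64%, 36%, and 57% figures quoted on the slides.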



Talk outline

  • Approach and contributions

  • Operator study: Understanding the mistakes

  • Validation: Preventing exposure of mistakes

    • Technique

    • Experimental evaluation

  • Conclusion and future work



Validation of operator’s actions

  • Validation

    • Allow operator to check correctness of his/her actions prior to exposing their impact to the service interface (clients)

    • Correctness is tested by:

      • Migrating the component(s) to a virtual sandbox environment,

      • Subjecting them to a real load,

      • Comparing their behavior to a known correct one, and

      • Migrating them back to the online environment

  • Types of validation:

    • Replica-based: Compare with online replica (real time)

    • Trace-based: Compare with logged behavior
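The difference between the two types of validation can be illustrated with a small sketch. It is not the authors' implementation: components are modeled as plain Python functions, and the names `replica_based_validation`, `trace_based_validation`, `healthy`, and `misconfigured` are hypothetical. Replica-based validation obtains the expected response from an online replica as each request arrives, while trace-based validation replays previously logged request/response pairs.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stand-in for a server component: request in, response out.
Component = Callable[[str], str]
Mismatch = Tuple[str, str, str]  # (request, expected, actual)


def replica_based_validation(requests: Iterable[str],
                             online_replica: Component,
                             candidate: Component) -> List[Mismatch]:
    """Send each request to both components and compare the responses."""
    mismatches = []
    for req in requests:
        expected = online_replica(req)  # known-correct behavior, observed live
        actual = candidate(req)         # component under validation
        if actual != expected:
            mismatches.append((req, expected, actual))
    return mismatches


def trace_based_validation(trace: Iterable[Tuple[str, str]],
                           candidate: Component) -> List[Mismatch]:
    """Replay logged (request, response) pairs against the candidate."""
    mismatches = []
    for req, expected in trace:
        actual = candidate(req)
        if actual != expected:
            mismatches.append((req, expected, actual))
    return mismatches


def healthy(req: str) -> str:
    return f"200 OK {req.upper()}"


def misconfigured(req: str) -> str:
    # Simulated operator mistake: bidding requests fail.
    return "500 ERROR" if req.startswith("bid") else healthy(req)


live = ["browse/1", "bid/7", "browse/2"]
logged = [(r, healthy(r)) for r in live]  # trace recorded earlier

replica_report = replica_based_validation(live, healthy, misconfigured)
trace_report = trace_based_validation(logged, misconfigured)
```

In this toy run both techniques flag the same single request, `bid/7`. The practical difference is where the reference behavior comes from: a live replica needs no stored trace, whereas a trace can be used when no replica is available.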



Validating a component: Replica-based

[Figure: replica-based validation. Labels: client requests; online slice with Web servers (Tier 1), application servers (Tier 2), and database (Tier 3); validation slice with Web server proxy, application server, and database proxy; shunt; compare points; application state]



Validating a component: Trace-based

[Figure: trace-based validation. Labels: client requests; online slice with Web servers (Tier 1), application servers (Tier 2), and Tier 3; validation slice with Web server proxy, application server, database, and database proxy; shunt; compare points; state]



Implementation details

  • Shunting performed in middleware layer

    • Each request tagged with a unique ID all along the request path

  • Component proxies can be constructed with little effort (MySQL proxy is ~384 NCSL (402 kNCSL))

    • Reuse discovery and communication interfaces, common messaging core

  • State management requires well-defined export and import API

    • Stateful servers often support such API

  • Comparator functions to detect errors

    • Simple throughput, flow, and content comparators
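As a rough illustration of what simple throughput, flow, and content comparators could look like, the sketch below matches observations from the two slices by request ID. The `Observation` record, the function names, and the 10% throughput tolerance are assumptions made for this example; the slides do not describe the comparators at this level of detail.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Observation:
    request_id: str  # unique ID carried along the request path (format assumed)
    path: List[str]  # components the request visited, in order
    body: str        # response content


def content_mismatches(online: Dict[str, Observation],
                       validation: Dict[str, Observation]) -> List[str]:
    """IDs of requests whose response bodies differ between the slices."""
    return sorted(rid for rid in online.keys() & validation.keys()
                  if online[rid].body != validation[rid].body)


def flow_mismatches(online: Dict[str, Observation],
                    validation: Dict[str, Observation]) -> List[str]:
    """IDs of requests that took a different path through the components."""
    return sorted(rid for rid in online.keys() & validation.keys()
                  if online[rid].path != validation[rid].path)


def throughput_ok(online_rps: float, validation_rps: float,
                  tolerance: float = 0.10) -> bool:
    """True if validation throughput is within `tolerance` of the online rate."""
    return validation_rps >= online_rps * (1.0 - tolerance)


online = {
    "r1": Observation("r1", ["web", "app", "db"], "item 42: $10"),
    "r2": Observation("r2", ["web", "app"], "home page"),
}
validation = {
    "r1": Observation("r1", ["web", "app", "db"], "item 42: $10"),
    "r2": Observation("r2", ["web", "app", "db"], "error page"),
}
```

Here request `r2` is flagged by both the flow and the content comparator, while `r1` passes. Keying on the request ID is what makes the per-request comparison possible when responses arrive in a different order on the two slices.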



Validating our prototype: results

  • Live operator experiments

    • Operator given option of type of validation, duration, and to skip validation

    • Validation caught 6 out of 9 mistakes from 8 experiments with validation

  • Mistake-injection experiments

    • Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (incorrect # of workers in Web Server degraded throughput)

  • Operator-emulation experiments

    • Operator command scripts derived from the 42 operator mistakes

    • Both trace-based and replica validation caught 22 mistakes

      • Multi-component validation caught 4 latent (component interaction) mistakes



Reduction in impact with validation



Fewer mistakes with validation



Shunting & buffering overheads

  • Shunting overhead for replica-based validation → 39% additional CPU

    • All requests and responses are captured and forwarded to validation slice

    • Trace-based validation is slightly better → 32% additional CPU

    • Overhead is incurred on single component, and only during validation

  • Various optimizations can reduce overhead to 13-22%

    • Examples: response summary (64-byte), sampling (session boundaries)

  • Buffering capacity during state checkpointing and duplication

    • Required to buffer only about 150 requests for small state sizes
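One way to picture the response-summary optimization is to forward a fixed-size digest of each response instead of the full body. The slides only say that the summary is 64 bytes; using SHA-512 (whose digest happens to be 64 bytes) is an assumption made for this sketch, as are the function names.

```python
import hashlib


def summarize(response: bytes) -> bytes:
    """Reduce a response of any size to a fixed 64-byte summary (SHA-512, assumed)."""
    return hashlib.sha512(response).digest()


def summaries_match(online_response: bytes, validation_response: bytes) -> bool:
    """Compare two responses by their summaries instead of their full bodies."""
    return summarize(online_response) == summarize(validation_response)


page = b"<html>" + b"x" * 10_000 + b"</html>"
corrupted = page.replace(b"x", b"y", 1)  # a single differing byte

summary = summarize(page)
```

With this approach only the 64-byte summary has to cross from the online slice to the validation slice, regardless of the response size. The trade-off is that a mismatch can be detected but not explained without the full content.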



Caveats, limitations & open issues

  • Non-determinism increases complexity of comparators and proxies

    • E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session-id, time stamps

  • Hard state management may require operator intervention

    • Component requires initialization prior to online migration

  • Bootstrapping the validation

    • Validating an intended modification of service behavior – nothing to compare with!

  • How long to validate? What types of validation?

    • Duration spent in validation implies reduced online capacity

  • Future work: Taking validation further…

    • Validate operator actions on databases, network components

    • Combine validation with diagnosis for assisting operators

    • Other validation techniques: Model-based validation
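The non-determinism caveat can be made concrete with a content comparator that masks fields expected to differ between slices, such as session IDs and time stamps, before comparing. The field formats and regular expressions below are invented for illustration; a real service would need patterns that match its own output.

```python
import re

# Hypothetical formats for the non-deterministic fields.
SESSION_ID = re.compile(r"sessionid=[0-9a-f]+")
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def normalize(body: str) -> str:
    """Mask fields that legitimately differ between online and validation slices."""
    body = SESSION_ID.sub("sessionid=<masked>", body)
    return TIMESTAMP.sub("<timestamp>", body)


def same_content(online_body: str, validation_body: str) -> bool:
    """Content comparison that ignores the masked fields."""
    return normalize(online_body) == normalize(validation_body)


online_body = "item=42 price=10 sessionid=a3f9 generated 2003-12-08 10:15:02"
validation_body = "item=42 price=10 sessionid=7c1e generated 2003-12-08 10:15:03"
broken_body = "item=42 price=99 sessionid=7c1e generated 2003-12-08 10:15:03"
```

A byte-for-byte comparison would reject `validation_body` even though it is correct, whereas the normalized comparison accepts it and still rejects `broken_body`, whose price differs. Each such masking rule is extra comparator logic, which is the added complexity the caveat refers to.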


