Understanding and dealing with operator mistakes in Internet services
Understanding and dealing with operator mistakes in Internet services. K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University. OSDI 2004. Vivo Project, http://vivo.cs.rutgers.edu (based on slides from the authors’ OSDI presentation).


Understanding and dealing with operator mistakes in Internet services

K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University

OSDI 2004

Vivo Project http://vivo.cs.rutgers.edu

(based on slides from the authors’ OSDI presentation)

Fabián E. Bustamante, Winter 2006


Motivation

  • Internet services are ubiquitous, e.g., Google, Yahoo!, eBay, etc.

    • Users expect 24 x 7 availability, but service outages still happen!

  • A significant number of outages in Internet services are the result of operator actions

    1: Architecture is complex

    2: Systems are constantly evolving

    3: Lack of tools for operators to reason about the impact of their actions: Offline testing, emulation, simulation

  • Very little detail on operator mistakes

    • Details strongly guarded by companies and administrators

CS 395/495 Autonomic Computing Systems, EECS, Northwestern University


This work

  • Understanding: Gather detailed data on operators’ mistakes

    • What categories of mistakes?

    • What’s the impact on the service?

    • How do mistakes correlate with experience, impact?

    • Caveat: this is not a complete study of operator behavior

  • Approaches to deal with operator mistakes: prevention, recovery, automation

  • Validation: Allow operators to evaluate the correctness of their actions prior to exposing them to the service

    • Like offline testing, but:

      • Virtual environment (extension of online environment)

      • Real workload

      • Migration back and forth with minimal operator involvement

Contributions

  • Detailed information on operator tasks and mistakes

    • 43 experiments: detailed data on operator behavior, including 42 mistakes

    • 64% immediately degraded throughput

    • 57% were software configuration mistakes

    • Human experiments are possible and valuable!

  • Designed and prototyped a validation infrastructure

    • Implemented on 2 cluster-based services: cooperative Web server (PRESS) and a multi-tier auction service

    • 2 techniques to allow operators to validate their actions

  • Demonstrated validation is a promising technique for reducing impact of operator mistakes

    • 66% of all mistakes observed in operator study caught

    • 6 of 9 mistakes caught in live operator experiments with validation

    • Successfully tested with synthetically injected mistakes

Talk outline

  • Approach and contributions

  • Operator study: Understanding the mistakes

    • Representative environment

    • Choice of human subjects and experiments

    • Results

  • Validation: Preventing exposure of mistakes

  • Conclusion and future work

Multi-tiered Internet services

  • On-line auction service, similar to eBay

  • Client emulator exercises the service

  • Tier 1: Web servers

  • Tier 2: Application servers

  • Tier 3: Database

  • Code from the DynaServer project!

Tasks, operators & training

  • Tasks – two categories

    • Scheduled maintenance tasks (proactive), e.g. upgrading software

    • Diagnose-and-repair tasks (reactive), e.g. disk failure

  • Operator composition

    • 14 computer science graduate students

    • 5 professional programmers (Ask Jeeves)

    • 2 sysadmins from our department

  • Categorization of operators – based on a filled-in questionnaire

    • 11 novices – some familiarity with the setup

    • 5 intermediates – experience with a similar service

    • 5 experts – in charge of a service requiring high uptime

  • Operator training

    • Novice operators given warm-up tasks

    • Material describing service, and detailed steps for tasks

Experimental setup

  • Service

    • 3-tier auction service, and client emulator from Rice University’s DynaServer Project

    • Loaded at 35% of capacity

  • Machines

    • 2 Web servers (Apache)

    • 5 application servers (Tomcat)

    • 1 database machine (MySQL)

  • Operator assistance & data capture

    • Monitor service throughput

    • Modified bash shell for command and result trace

  • Manual observation

    • Noting anomalies in operator behavior

    • Bailing out ‘lost’ operators
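
The study captured operator commands and their results with a modified bash shell. The sketch below is not that shell; it is a minimal Python illustration of the same idea, assuming a hypothetical helper `run_and_trace` that runs one command and records what was run and what came back.

```python
import subprocess
import sys
import time


def run_and_trace(argv, trace):
    """Run one command and append its time, arguments, exit code and output to the trace."""
    started = time.time()
    proc = subprocess.run(argv, capture_output=True, text=True)
    trace.append({
        "time": started,
        "command": argv,
        "exit_code": proc.returncode,
        "output": proc.stdout + proc.stderr,
    })
    return proc.returncode


trace = []
# Stand-in for an operator command; a real session would wrap every shell command.
run_and_trace([sys.executable, "-c", "print('restarting apache')"], trace)
```

Pairing such a trace with the monitored throughput is what makes it possible to line up a throughput drop with the command that caused it.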

Example trace

  • Task: Add an application server

    • Mistake: Apache misconfiguration

    • Impact: Degraded throughput

  • Events marked on the trace (figure not reproduced): first Apache misconfigured and restarted; second Apache misconfigured and restarted; application server added

Sampling of other mistakes

  • Adding a new application server

    • Omission of new application server from backend member list

    • Syntax errors, duplicate entries, wrong hostnames

    • Launching the wrong version of software

  • Migrating the database for performance upgrade

    • Incorrect privileges for accessing the database

      • Security vulnerability

    • Database installed on wrong disk

Operator mistakes: Category vs. impact

  • 64% of all mistakes had immediate impact on service performance

    • 36% resulted in latent faults

  • Obs. #1: A significant number of mistakes can be caught by testing in a realistic environment

  • Obs. #2: Undetectable latent errors will still require online-recovery techniques

Operator mistakes

  • Misconfigurations account for 57% of all errors

    • Config. mistakes spanning multiple components are more likely (global misconfigurations)

  • Obs. #1: Tools to manipulate & check configs are crucial

  • Obs. #2: Care is needed when maintaining multiple versions of software
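
As a rough illustration of the kind of configuration check Obs. #1 calls for, the sketch below looks for the member-list mistakes seen in the study (omitted servers, duplicate entries, wrong hostnames). The function name, the host names, and the list-of-strings representation are assumptions made for this example, not part of the authors' tooling.

```python
from collections import Counter


def check_member_list(configured, expected):
    """Return a list of problems found in a backend member list."""
    problems = []
    counts = Counter(configured)
    for host, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"duplicate entry: {host}")
        if host not in expected:
            problems.append(f"unknown host: {host}")
    # Anything expected but never configured was omitted by the operator.
    for host in sorted(set(expected) - set(configured)):
        problems.append(f"missing host: {host}")
    return problems


expected = {"app1", "app2", "app3"}
configured = ["app1", "app2", "app2", "ap3"]  # one duplicate, one typo
for problem in check_member_list(configured, expected):
    print(problem)
```

A check like this only covers one component's file; the global misconfigurations noted above would need the same comparison across every component that lists the backends.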

Operator categories

  • Experts also made mistakes!

    • The complexity of tasks executed by experts was higher

Summary of operator study

  • 43 experiments: 42 mistakes

  • 27 (64%) mistakes caused immediate impact on service performance

  • 24 (57%) were software configuration mistakes

  • Mistakes were made across all operator categories

  • Trace of operator commands & service performance for all experiments

    • Available at http://vivo.cs.rutgers.edu
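
The percentages in this summary follow directly from the counts. The short check below recomputes them; the latent-fault count is derived on the assumption that every mistake either had immediate impact or was latent, which matches the 64% / 36% split reported earlier.

```python
total_mistakes = 42
immediate = 27
config = 24

# Assumption: mistakes without immediate impact are the latent ones.
latent = total_mistakes - immediate

print(f"immediate impact: {immediate / total_mistakes:.0%}")  # 64%
print(f"latent faults:    {latent / total_mistakes:.0%}")     # 36%
print(f"configuration:    {config / total_mistakes:.0%}")     # 57%
```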

Talk outline

  • Approach and contributions

  • Operator study: Understanding the mistakes

  • Validation: Preventing exposure of mistakes

    • Technique

    • Experimental evaluation

  • Conclusion and future work

Validation of operator’s actions

  • Validation

    • Allow operators to check the correctness of their actions before exposing the impact to the service interface (clients)

    • Correctness is tested by:

      • Migrating the component(s) to a virtual sandbox environment,

      • Subjecting them to a real load,

      • Comparing their behavior to known correct behavior, and

      • Migrating them back to the online environment

  • Types of validation:

    • Replica-based: Compare with online replica (real time)

    • Trace-based: Compare with logged behavior
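
The validation steps listed above can be sketched as a single function. This is only an illustration under stated assumptions: `Component`, its `location` field, and the `validate` function are hypothetical names, and requests and responses are plain strings rather than real service traffic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Component:
    """Hypothetical stand-in for a service component, e.g. an application server."""
    name: str
    handler: Callable[[str], str]
    location: str = "online"


def validate(component, workload, reference):
    """Isolate the component, load it, compare, and migrate back only on success."""
    component.location = "validation"            # 1. move into the sandbox
    mismatches = []
    for request in workload:                     # 2. subject it to real requests
        actual = component.handler(request)
        expected = reference(request)            # 3. compare with known-correct behavior
        if actual != expected:
            mismatches.append((request, expected, actual))
    if not mismatches:
        component.location = "online"            # 4. expose it to clients again
    return mismatches


correct = lambda request: request.upper()
broken = Component("app-server-2", handler=lambda request: request.lower())
problems = validate(broken, ["bid", "browse"], reference=correct)
print(broken.location, len(problems))
```

The `reference` argument is where the two validation types differ: for replica-based validation it would be a live replica, for trace-based validation a lookup into logged responses.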

Validating a component: Replica-based

  • Figure (not reproduced): client requests flow through the online slice (Web servers in tier 1, application servers in tier 2, database in tier 3)

  • Validation slice: an application server placed between a Web server proxy and a database proxy, fed by a shunt from the online slice

  • Compare points: three comparisons are marked, including application state
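
A minimal sketch of the replica-based idea, assuming components can be modeled as plain functions from request to response: a shunt mirrors each request to the component under validation, clients only ever see the online replica's answer, and a comparator flags differences. The names `shunt`, `online_replica`, and `under_validation` are illustrative, not taken from the prototype.

```python
def shunt(requests, online_replica, under_validation, compare):
    """Serve clients from the online replica while mirroring traffic to the validation slice."""
    served, mismatches = [], []
    for request_id, request in enumerate(requests):
        online_response = online_replica(request)     # this response goes to the client
        shadow_response = under_validation(request)   # this response is never exposed
        if not compare(online_response, shadow_response):
            mismatches.append(request_id)
        served.append(online_response)
    return served, mismatches


def online_replica(request):
    return f"200 {request}"


def under_validation(request):
    # Simulated mistake: the misconfigured replica fails on bid requests.
    return "500 error" if request.startswith("bid") else f"200 {request}"


served, mismatches = shunt(
    ["browse/1", "bid/7", "browse/2"],
    online_replica,
    under_validation,
    compare=lambda a, b: a == b,
)
print(served, mismatches)
```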

Validating a component: Trace-based

  • Figure (not reproduced): client requests flow through the online slice (Web servers in tier 1, application servers in tier 2, tier 3 behind them)

  • Validation slice: an application server placed between a Web server proxy and a database proxy, with a shunt, saved state, and a database

  • Compare points: two comparisons are marked, at the proxies
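
Trace-based validation replaces the live replica with logged behavior. The sketch below assumes the trace is a list of (request, logged response) pairs and that the component under validation can be called as a function; both are simplifications made for this example.

```python
def replay_trace(trace, under_validation):
    """Replay logged requests and return those whose responses differ from the log."""
    return [request for request, logged_response in trace
            if under_validation(request) != logged_response]


# Hypothetical log captured while the component was known to behave correctly.
trace = [
    ("GET /item/1", "200 item 1"),
    ("GET /item/2", "200 item 2"),
]


def under_validation(request):
    # Simulated mistake: the upgraded component cannot find item 2.
    return "404 not found" if request.endswith("/2") else "200 item 1"


print(replay_trace(trace, under_validation))
```

Unlike the replica-based variant, this needs no second healthy copy of the component, but the log must have been recorded against matching state.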

Implementation details

  • Shunting performed in middleware layer

    • Each request tagged with a unique ID all along the request path

  • Component proxies can be constructed with little effort (MySQL proxy is ~384 NCSL (402K NCSL))

    • Reuse discovery and communication interfaces, common messaging core

  • State management requires well-defined export and import API

    • Stateful servers often support such API

  • Comparator functions to detect errors

    • Simple throughput, flow, and content comparators
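
To make the comparator idea concrete, here is a hedged sketch of a throughput comparator and a content comparator that matches responses by request ID. The 10% tolerance and the dictionary layout are assumptions chosen for illustration; the slides do not specify either.

```python
def throughput_ok(reference_rps, observed_rps, tolerance=0.1):
    """True when observed throughput is within a relative tolerance of the reference."""
    return abs(observed_rps - reference_rps) <= tolerance * reference_rps


def content_mismatches(reference, observed):
    """Compare responses keyed by request ID; return IDs that are missing or differ."""
    return sorted(request_id for request_id in reference
                  if observed.get(request_id) != reference[request_id])


reference = {1: "item page", 2: "bid accepted", 3: "user profile"}
observed = {1: "item page", 2: "bid rejected"}   # request 3 never answered

print(throughput_ok(100.0, 95.0), throughput_ok(100.0, 80.0))
print(content_mismatches(reference, observed))
```

Keying on the request ID is what the unique per-request tag buys: responses from the two slices can arrive in different orders and still be paired correctly.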

Validating our prototype: results

  • Live operator experiments

    • Operators could choose the type and duration of validation, or skip validation altogether

    • Validation caught 6 out of 9 mistakes from 8 experiments with validation

  • Mistake-injection experiments

    • Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (an incorrect number of workers in the Web server, which degraded throughput)

  • Operator-emulation experiments

    • Operator command scripts derived from the 42 operator mistakes

    • Both trace-based and replica-based validation caught 22 mistakes

      • Multi-component validation caught 4 latent (component interaction) mistakes

Reduction in impact with validation

Fewer mistakes with validation

Shunting & buffering overheads

  • Shunting overhead for replica-based validation: 39% additional CPU

    • All requests and responses are captured and forwarded to the validation slice

    • Trace-based validation is slightly better: 32% additional CPU

    • Overhead is incurred on a single component, and only during validation

  • Various optimizations can reduce overhead to 13-22%

    • Examples: response summary (64 bytes), sampling (session boundaries)

  • Buffering capacity during state checkpointing and duplication

    • Required to buffer only about 150 requests for small state sizes
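
The two optimizations named above can be sketched as follows. The slides only say that responses are summarized in 64 bytes and that sampling respects session boundaries; using SHA-512 (whose digest happens to be 64 bytes) and keeping every n-th whole session are assumptions made for this example.

```python
import hashlib


def summarize(response: bytes) -> bytes:
    """Reduce a response body to a fixed 64-byte summary (SHA-512 is an assumed choice)."""
    return hashlib.sha512(response).digest()


def sampled(session_ids, every=4):
    """Pick whole sessions to forward, so sampling never cuts a session in half."""
    return [sid for i, sid in enumerate(session_ids) if i % every == 0]


page = b"<html>" + b"x" * 10_000 + b"</html>"
print(len(page), len(summarize(page)))
print(sampled(["s0", "s1", "s2", "s3", "s4", "s5"], every=4))
```

Forwarding a summary instead of the full body trades diagnostic detail for CPU and bandwidth: a mismatch is still detected, but the comparator can no longer say where the two responses differ.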

Caveats, limitations & open issues

  • Non-determinism increases complexity of comparators and proxies

    • E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session-id, time stamps

  • Hard state management may require operator intervention

    • Component requires initialization prior to online migration

  • Bootstrapping the validation

    • Validating an intended modification of service behavior – nothing to compare with!

  • How long to validate? What types of validation?

    • Duration spent in validation implies reduced online capacity

  • Future work: Taking validation further…

    • Validate operator actions on databases, network components

    • Combine validation with diagnosis for assisting operators

    • Other validation techniques: Model-based validation
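
The non-determinism caveat above is one reason comparators are harder than a plain equality test. One possible approach, shown here only as a sketch, is to mask fields that legitimately differ between two correct runs before comparing. The session-ID and timestamp formats in the regular expressions are invented for the example.

```python
import re

# Assumed field formats; a real service would need patterns matching its own output.
SESSION_ID = re.compile(r"sessionid=[0-9a-f]+")
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def normalize(response: str) -> str:
    """Mask fields that can differ between two correct runs."""
    response = SESSION_ID.sub("sessionid=<id>", response)
    return TIMESTAMP.sub("<time>", response)


a = "sessionid=9f3a bid accepted at 2004-12-06 10:15:00"
b = "sessionid=01bc bid accepted at 2004-12-06 10:15:02"
print(a == b, normalize(a) == normalize(b))  # False True
```

Masking too aggressively hides real mistakes, so each masked field is a place where the comparator can no longer catch an error.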
