Why do upgrades fail and what can we do about it

Why Do Upgrades Fail? (and What Can We Do About It?)

Tudor Dumitraş

Priya Narasimhan

PARALLEL DATA LABORATORY

Carnegie Mellon University


Upgrades in Enterprise Systems

  • Increasing cost of downtime

  • Most outages due to planned events (e.g. software upgrades)

  • Software upgrades are unreliable

    • AOL (1996): outage, routing-table corruption

    • AT&T Wireless (2003): outages, data loss, $100M loss

    • Hospital system (2006): medication unavailable in ER

  • Causes of upgrade failures:¹

    1. Broken dependencies

    2. Removed behavior

    3. Bugs in new version

    4. Incompatibility with legacy configurations

¹ Survey from [Crameri et al., SOSP 2007]

Outline

  • Reasons for unplanned downtime

    • Four types of upgrade faults

    • Upgrade-fault frequencies

    • Upgrade-fault impacts

  • Reasons for planned downtime

  • Trade-offs for online upgrades

  • Challenges for reliable, online upgrades

Upgrade Faults

  • Procedure violations

    • Omitted action

    • Incorrect action

    • Spurious action

    • Order inversion

  • Three sources of upgrade-fault data

    • User study [Nagaraja et al., OSDI 2004]

    • Survey [Oliveira et al., USENIX 2006]

    • Field study (Apache bug reports from 2007)

Procedure violations occur in 43% of cases
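
The four kinds of procedure violation listed above can be made concrete with a small checker. This is only an illustrative sketch, not the classification method used in the cited studies. It assumes that a procedure is an ordered list of (action, argument) pairs and that each action name appears at most once; the action names in the usage note are hypothetical.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (action name, argument); an assumed representation


def classify_violations(expected: List[Step], performed: List[Step]) -> List[Tuple[str, str]]:
    """Compare the performed steps with the expected procedure."""
    expected_args = dict(expected)
    performed_args = dict(performed)
    rank = {name: i for i, (name, _) in enumerate(expected)}
    violations = []

    # Omitted action: a required step that was never performed.
    for name, _ in expected:
        if name not in performed_args:
            violations.append(("omitted action", name))

    for name, arg in performed:
        if name not in expected_args:
            # Spurious action: a step that is not part of the procedure.
            violations.append(("spurious action", name))
        elif arg != expected_args[name]:
            # Incorrect action: the right step with the wrong argument.
            violations.append(("incorrect action", name))

    # Order inversion: consecutive shared steps that ran in the wrong order.
    shared = [name for name, _ in performed if name in rank]
    for earlier, later in zip(shared, shared[1:]):
        if rank[earlier] > rank[later]:
            violations.append(("order inversion", f"{earlier} ran before {later}"))
    return violations
```

For example, with the expected procedure `stop_server`, `backup_db`, `install`, `start_server`, an operator who installs first, skips the backup, clears a cache, and restarts the wrong host triggers all four kinds of violation.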

Hidden Dependencies

Dependencies that cannot be detected automatically or that are overlooked due to their complexity

Hidden-Dependency Examples

  • Service location

    • File path

    • Network address

  • Dynamic linking

    • Library conflicts

    • Defective components

  • Database schema

    • Application/DB mismatch

    • Missing indexes

  • Access privileges (excessive / insufficient / unavailable)

    • File system

    • Database objects

    • URLs

  • Configuration-parameter constraints

  • Replication degree

  • Storage-space availability

  • Client access to system-under-upgrade

  • Cached data

    • SSL certificates

    • DNS lookups

    • Buffer cache

  • Listening-port conflicts

  • Protocol mismatch

  • Entropy for random-number generation

  • Request scheduling

  • Disk speed
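
A few of the dependencies in this list (file paths, listening-port conflicts, storage-space availability) can be probed before an upgrade starts. The following is a minimal sketch of such a pre-flight check, assuming a single host and using only the Python standard library. The function names and message strings are hypothetical, and a check like this covers only the dependencies that somebody already thought to write down; it does nothing for the ones that stay hidden.

```python
import os
import shutil
import socket
from typing import List


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def preflight(paths: List[str], ports: List[int],
              space_path: str, required_bytes: int) -> List[str]:
    """Collect the problems found before the upgrade is attempted."""
    problems = []
    for path in paths:
        # Service location: a file path the new version expects.
        if not os.path.exists(path):
            problems.append(f"missing file path: {path}")
    for port in ports:
        # Listening-port conflict with a process that is already running.
        if not port_is_free(port):
            problems.append(f"listening port in use: {port}")
    # Storage-space availability on the target volume.
    if shutil.disk_usage(space_path).free < required_bytes:
        problems.append(f"insufficient storage space on {space_path}")
    return problems
```

An empty list means that none of the listed checks failed; it does not mean the upgrade is safe.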

Statistical Cluster Analysis

[Figure: dendrogram of the upgrade faults. Axis: distance, from 0.0 to 1.0. Each leaf is one fault, labeled with its source: user study (u), survey (s), or field study (f).]

Clusters labeled in the figure:

  • Type 1: simple configuration or procedural errors

  • Type 2: semantic configuration errors

  • Type 3: broken environmental dependencies

  • Type 4: data-access errors

Upgrade-Fault Frequencies

[Figure: histogram of upgrade-failure frequencies from [Crameri et al., SOSP 2007]. x-axis: failure frequency (1% to 90%); y-axis: number of system administrators (0 to 25). Mean: 8.6%; Max: 50%.]

[Figure: fault frequency (0% to 60%) for fault Types 1 to 4, from the user study and the survey. Each entry shows a probability density, a frequency estimate (marked x), and a confidence interval (marked [ ]).]

Tolerating Upgrade Faults

  • Type 1

    • Check the syntax of configuration files

    • Currently, catching 38%–83% of typos¹

  • Type 2

    • Check constraints among configuration parameters¹

  • Type 3

    • Package management (e.g. Microsoft Update, Debian APT)

  • Type 4

    • Validate the actions of database administrators²

Best practice: phased deployment, to minimize risks³

¹ Keller et al., DSN 2008

² Oliveira et al., USENIX 2006

³ Information Technology Infrastructure Library (ITIL) 2007
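
The Type 1 and Type 2 checks above differ in kind: the first asks whether a configuration file parses at all, the second whether its values are mutually consistent. The sketch below illustrates both on an INI-style file using Python's standard `configparser`. It is not the tool evaluated by Keller et al.; the section name, the three parameter names, and the constraint `max_clients <= server_limit * threads_per_child` are assumptions chosen for illustration.

```python
import configparser
from typing import List


def validate_config(text: str) -> List[str]:
    """Return the problems found in an INI-style configuration."""
    parser = configparser.ConfigParser()

    # Type 1 check: does the file parse at all?
    try:
        parser.read_string(text)
    except configparser.Error as exc:
        return [f"syntax error: {type(exc).__name__}"]

    # Read the (hypothetical) parameters that the constraint relates.
    try:
        server_limit = parser.getint("server", "server_limit")
        threads = parser.getint("server", "threads_per_child")
        max_clients = parser.getint("server", "max_clients")
    except (configparser.Error, ValueError) as exc:
        return [f"bad or missing parameter: {exc}"]

    # Type 2 check: a constraint among several parameters.
    problems = []
    if max_clients > server_limit * threads:
        problems.append(
            "constraint violated: max_clients exceeds "
            "server_limit * threads_per_child"
        )
    return problems
```

A file can pass the syntax check and still fail the constraint check, which is why the two fault types call for different tooling.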

Upgrade-Fault Impacts

  • Rolling Upgrades

  • Big Flip

  • Imago

[Figure: one panel each for Rolling Upgrades, Big Flip, and Imago. x-axis: fault type (1 to 4); y-axis: faults injected (0 to 3). Bars are broken down by fault impact: latent error, security vulnerability, increased latency, degraded throughput, full outage.]

Why Do Upgrades Fail?

Atomic end-to-end upgrades can be more reliable than piecewise, phased upgrades

  • Online upgrades: more vulnerable to upgrade faults

  • Rolling upgrades: break hidden dependencies

    • Can have a global impact

    • Due to states with mixed versions

  • Big flip: has single points of failure

    • Example: the database (vulnerable to Type 4 faults)

  • In-place upgrades: introduce latent errors
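
The point about mixed versions can be illustrated with a toy model of which versions serve traffic at each step. This is a deliberately simplified sketch with several assumptions: nodes are interchangeable, a rolling upgrade replaces one node per step, and a big flip upgrades half of the nodes while they are offline and then switches traffic to them. The function names are hypothetical.

```python
from typing import Iterator, List


def rolling_upgrade(nodes: int) -> Iterator[List[str]]:
    """Yield the versions serving traffic after each rolling-upgrade step."""
    versions = ["v1"] * nodes
    for i in range(nodes):
        versions[i] = "v2"  # upgrade one node while the others keep serving
        yield list(versions)


def big_flip(nodes: int) -> Iterator[List[str]]:
    """Yield the versions serving traffic during a big flip."""
    half = nodes // 2
    yield ["v1"] * (nodes - half)  # first half offline for the upgrade
    yield ["v2"] * half            # flip: only upgraded nodes serve
    yield ["v2"] * nodes           # second half upgraded and back online


def has_mixed_versions(serving: List[str]) -> bool:
    """True if two different versions serve traffic at the same time."""
    return len(set(serving)) > 1
```

In this model every intermediate state of the rolling upgrade except the last is a mixed-version state, while the big flip never serves two versions at once (at the cost of running on reduced capacity).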

Why Do We Break Dependencies?

[Figure labels: Datastore, Front-end]

Some dependencies cannot be detected automatically

Dependency resolution is NP-complete

Shared-library dependencies
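
To give a feel for why dependency resolution is hard, here is a small backtracking resolver. It is a sketch under stated assumptions, not the algorithm of any real package manager: the repository is a dictionary from (package, version) to the versions it accepts for each dependency, versions are compared as plain strings, and all package names are hypothetical. In the worst case the search tries every combination of versions, which is the exponential behavior that NP-completeness leads one to expect.

```python
from typing import Dict, Optional, Set, Tuple

# (package, version) -> {dependency: set of acceptable versions}
Repo = Dict[Tuple[str, str], Dict[str, Set[str]]]


def resolve(repo: Repo, constraints: Dict[str, Set[str]],
            chosen: Optional[Dict[str, str]] = None) -> Optional[Dict[str, str]]:
    """Pick one version per package so that every constraint holds."""
    chosen = chosen or {}
    # Reject the partial assignment if a chosen version is no longer allowed.
    for pkg, allowed in constraints.items():
        if pkg in chosen and chosen[pkg] not in allowed:
            return None
    unresolved = [pkg for pkg in constraints if pkg not in chosen]
    if not unresolved:
        return chosen
    pkg = unresolved[0]
    # Try candidate versions, highest string first, and backtrack on failure.
    for version in sorted(constraints[pkg], reverse=True):
        deps = repo.get((pkg, version))
        if deps is None:
            continue  # this version is not available in the repository
        merged = {p: set(v) for p, v in constraints.items()}
        for dep, allowed in deps.items():
            merged[dep] = merged.get(dep, allowed) & allowed
        result = resolve(repo, merged, {**chosen, pkg: version})
        if result is not None:
            return result
    return None
```

In a repository where `frontend` 2 needs `libssl` 1.1 and `datastore` 2 needs `libssl` 1.0, the resolver has to fall back to `datastore` 1, which is the kind of shared-library conflict the slide refers to.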

Reasons for Planned Downtime (1)

From the history of Wikipedia upgrades:

[Figure: diagram with the labels old_title=cur_title and cur_id]

Reasons for Planned Downtime (2)

Offline upgrade

ALTER TABLE old
  ADD COLUMN old_id INT(8) UNSIGNED NOT NULL;

UPDATE old, cur
  SET old_id = cur_id
  WHERE old_title = cur_title;

ALTER TABLE old
  DROP COLUMN old_title;

Online upgrade

INSERT:

  • old: add old.old_id column

  • cur: update old.old_id

UPDATE:

  • old.old_title: update old.old_id

  • cur.cur_title: update old.old_id

  • cur.cur_id: update old.old_id

DELETE:

  • old: delete row

  • cur: update old.old_id

Stop using old schema

Cannot compute incrementally
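
The first two SQL statements on this slide can be tried out on a toy database. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the production database, so the syntax is adapted: SQLite needs a default value for the new NOT NULL column, and the multi-table UPDATE becomes a correlated subquery. The sample rows are hypothetical, and the final DROP COLUMN step is left out because older SQLite versions do not support it.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cur (cur_id INTEGER PRIMARY KEY, cur_title TEXT);
        CREATE TABLE old (old_title TEXT, old_text TEXT);
        INSERT INTO cur VALUES (1, 'Pittsburgh'), (2, 'Upgrade');
        INSERT INTO old VALUES ('Pittsburgh', 'rev a'),
                               ('Upgrade', 'rev b'),
                               ('Pittsburgh', 'rev c');
        """
    )
    return conn


def offline_upgrade(conn: sqlite3.Connection) -> None:
    """Add old.old_id and backfill it from cur by matching titles."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "ALTER TABLE old ADD COLUMN old_id INTEGER NOT NULL DEFAULT 0"
        )
        # The backfill touches every row of old in a single statement.
        conn.execute(
            """
            UPDATE old
               SET old_id = (SELECT cur_id FROM cur
                              WHERE cur_title = old_title)
             WHERE EXISTS (SELECT 1 FROM cur WHERE cur_title = old_title)
            """
        )
```

Calling `offline_upgrade(make_demo_db())` fills `old_id` for all three rows. Any write to either table while the backfill runs could leave `old_id` stale, which is what the INSERT, UPDATE, and DELETE cases listed for the online upgrade have to handle.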

Reasons for Planned Downtime (3)

  • DB index redefinitions

  • Sync changes to application servers and DB

  • Drop columns from DB

  • Table joins in DB

  • Aggregates (e.g., max(), min())

  • Convert article text to UTF8

    • Long running

    • Bulk updates can hang up DB replication

  • Might overload the infrastructure

Trade-offs for Online Upgrades

In-Place

Out-of-Place

  • Additional HW

  • Additional storage

  • Risk of propagating corrupted data

  • Need indirection layer

    • Potential overhead

    • Installation downtime

  • Risk of propagating corrupted data

  • Need indirection layer

    • Potential overhead

    • Installation downtime

  • Conversions impose overhead

  • Risk of breaking hidden dependencies

Mixed Versions

  • Additional storage

  • Conversions impose overhead

  • Risk of breaking hidden dependencies

  • Additional HW

  • Additional storage

No Mixed Versions

Conclusions

  • Hidden dependencies cause upgrade failures

    • Localized changes can induce global failures

  • Dependency tracking has fundamental limitations

  • DB-schema changes often impose downtime

  • Challenges for reliable, online upgrades:

    • Handling hidden dependencies

    • Computationally intensive data conversions

    • Upgrade testing and fault management

Backup Slides

Reasons for Upgrading

  • Protect against attacks and errors

    • No changes to interfaces or data formats

  • Add new features

    • Backwards compatible

  • Migrate to new platform (end-of-life, efficiency reasons)

    • Data conversions required

  • Switch vendors

    • Different systems, not different versions

    • Interface changes

  • Change business processes

    • Arbitrary changes

Classification Methodology

  • 55 distinct faults from three studies

  • Five classification variables:

    • Root cause of fault (e.g. procedure, configuration)

    • Broken hidden-dependency (if any)

    • Fault location

    • Original classification

    • Cognitive level involved

      • Skill-based: simple, repetitive tasks

      • Rule-based: problems solved by pattern-matching

      • Knowledge-based: reasoning from first principles
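
A cluster analysis over categorical variables like these can be sketched in a few lines. The slides do not say which distance measure or linkage was used, so everything here is an assumption: the distance is the fraction of variables on which two faults differ (which falls between 0.0 and 1.0), clusters are merged by average linkage, and the sample faults are invented for illustration rather than taken from the three studies.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Fault = Tuple[str, ...]  # one value per classification variable


def mismatch_distance(a: Fault, b: Fault) -> float:
    """Fraction of variables on which two faults differ (0.0 to 1.0)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(c1: List[int], c2: List[int], faults: Sequence[Fault]) -> float:
    """Mean pairwise distance between the members of two clusters."""
    total = sum(mismatch_distance(faults[i], faults[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster(faults: Sequence[Fault], k: int) -> List[List[int]]:
    """Agglomerative clustering: merge the closest pair until k remain."""
    clusters = [[i] for i in range(len(faults))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]], faults),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters
```

Recording the distance at which each merge happens would give the heights of a dendrogram; cutting the dendrogram at a chosen distance then yields the fault types.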

Upgrade-Fault Characteristics

[Figure: the faults from the three studies (u: user study, s: survey, f: field study), grouped into Types 1 to 4 and arranged by broken hidden-dependency and by configuration faults versus procedural faults. Broken hidden-dependency categories: database schemas, storage-space availability, access privileges, request scheduling, cached data, parameter constraints, shared libraries, listening ports, communication protocols, network addresses, file paths, replication degrees.]
