
Dependability in the Internet Era

Jim Gray

Microsoft Research

High Dependability Computing Consortium Conference

Santa Cruz, CA 7 May 2001

REVISED: 13 Feb 2005 Stanford, CA


Outline

  • The glorious past (Availability Progress)

  • The dark ages (current scene)

  • Some recommendations


Preview. The Last 10 Years: Availability Dark Ages. Ready for a Renaissance?

  • Things got better, then things got a lot worse!

[Chart: availability from 9% to 99.999%, plotted 1950 to 2010, for telephone systems, computer systems, the Internet, and cell phones]


DEPENDABILITY: The 3 ITIES

  • RELIABILITY / INTEGRITY: Does the right thing. (Also MTTF >> 1.)

  • AVAILABILITY: Does it now. (Also MTTR << MTTF.) Availability = MTTF / (MTTF + MTTR). System availability: if 90% of terminals are up & 99% of the DB is up, then only 89% of transactions are serviced on time (0.90 × 0.99 ≈ 0.89; see the sketch below).

  • Holistic vs. Reductionist view

[Diagram: Security, Integrity, Reliability, and Availability as the interlocking facets of dependability]
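To make the arithmetic concrete, here is a minimal sketch (Python, with illustrative numbers of my own choosing) of the availability formula and of how availabilities of serial components compose:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A node with a ~4-year MTTF and a ~21-minute MTTR reaches about five 9s:
print(availability(mttf_hours=4 * 365 * 24, mttr_hours=21 / 60))  # ~0.99999

# Availabilities of components in series multiply. The slide's example:
# 90% of terminals up and 99% of the DB up => ~89% of transactions
# are serviced on time.
print(0.90 * 0.99)  # 0.891
```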


Fail-Fast is Good, Repair is Needed

[Diagram: lifecycle of a module; fail-fast gives short fault latency]

  • High availability is low UN-availability: Unavailability ≈ MTTR / MTTF.

  • Improving either MTTR or MTTF gives benefit.

  • Simple redundancy does not help much.


Fault Model

  • Failures are independent, so single-fault tolerance is a big win.

  • Hardware fails fast (dead disk, blue-screen)

  • Software fails-fast (or goes to sleep)

  • Software often repaired by reboot (a sketch follows this list):

    • Heisenbugs

  • Operations tasks: major source of outage

    • Utility operations

    • Software upgrades
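To illustrate the Heisenbug point above: transient faults are roughly independent across executions, so failing fast and restarting turns a rare fault into a vanishingly rare one. A toy sketch with made-up failure rates:

```python
import random

def flaky_operation() -> str:
    """Stands in for a module with a rare, timing-dependent Heisenbug."""
    if random.random() < 0.01:
        raise RuntimeError("Heisenbug: transient fault")
    return "ok"

def run_with_restart(op, max_restarts: int = 3):
    """Fail fast, then 'reboot' and retry. With independent transient
    faults, a single retry already cuts the failure odds from 1e-2 to 1e-4."""
    for _ in range(max_restarts):
        try:
            return op()
        except RuntimeError:
            continue  # restart with fresh state and try again
    raise RuntimeError("fault persisted across restarts: not a Heisenbug")

print(run_with_restart(flaky_operation))
```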


Disks (RAID): the BIG Success Story

  • Duplex or Parity: masks faults

  • Disks @ 1M hours (~100 years)

  • But controllers fail, and big sites have 1,000s of disks.

  • Duplexing or parity, plus dual paths, give “perfect disks” (parity sketched below).

  • Wal-Mart never lost a byte (thousands of disks, hundreds of failures).

  • Only software/operations mistakes are left.
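The parity trick is simple enough to show in a few lines. A sketch of RAID-style parity (illustrative, not any vendor's implementation): the parity block is the XOR of the data blocks, so any one failed disk's contents can be rebuilt from the survivors:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length blocks byte by byte (the RAID parity operation)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"disk0 ..", b"disk1 ..", b"disk2 .."]  # toy 8-byte blocks
parity = xor_blocks(data)

# Disk 1 dies; rebuild its block from the surviving disks plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```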


Fault Tolerance vs Disaster Tolerance

  • Fault-Tolerance: mask local faults

    • RAID disks

    • Uninterruptible Power Supplies

    • Cluster Failover

  • Disaster Tolerance: masks site failures

    • Protects against fire, flood, sabotage, …

    • Also masks software changes, site moves, …

    • Redundant system and service at remote site.
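A minimal sketch of the disaster-tolerance idea (illustrative code, not any product's replication protocol): the primary applies each update locally, then ships the log record to a redundant service at a remote site, which replays it so that it can take over after a site failure:

```python
import queue
import threading

log_channel: "queue.Queue[tuple]" = queue.Queue()  # stands in for the WAN link

class Primary:
    def __init__(self):
        self.state = {}

    def update(self, key, value):
        self.state[key] = value        # local fault tolerance masks local faults
        log_channel.put((key, value))  # ship the log record to the remote site

class RemoteReplica:
    def __init__(self):
        self.state = {}

    def replay_forever(self):
        while True:
            key, value = log_channel.get()
            self.state[key] = value    # stays current; takes over on disaster

primary, replica = Primary(), RemoteReplica()
threading.Thread(target=replica.replay_forever, daemon=True).start()
primary.update("account:42", 100)
```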


Availability

[Chart: each rung of the ladder adds roughly one “9” of availability]

  • Un-managed

  • Well-managed nodes: masks some hardware failures

  • Well-managed packs & clones: masks hardware failures, operations tasks (e.g. software upgrades), and some software failures

  • Well-managed GeoPlex: masks site failures (power, network, fire, move, …) and some operations failures


Case Study - Japan: “Survey on Computer Security”, Japan Info Dev Corp., March 1986 (trans. Eiichi Watanabe)

1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, average duration ~ 90 MINUTES.

  Cause                             Share of outages   MTTF by cause
  Vendor (hardware and software)    42%                5 months
  Application software              25%                9 months
  Communications lines              12%                1.5 years
  Environment                       11.2%              2 years
  Operations                        9.3%               2 years

To get a 10-year MTTF, must attack ALL these areas.
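The causes fail roughly independently, so their failure rates (1/MTTF) add, which is exactly why all areas must be attacked. A quick check of the survey's "~10 weeks" figure from the per-cause MTTFs (my arithmetic, not the survey's):

```python
# Failure rates (1/MTTF) of independent causes add.
mttf_months = {
    "vendor": 5,
    "application software": 9,
    "communications lines": 18,
    "environment": 24,
    "operations": 24,
}
total_rate = sum(1 / m for m in mttf_months.values())  # failures per month
mttf_weeks = (1 / total_rate) * (52 / 12)              # months -> weeks
print(f"overall MTTF ~ {mttf_weeks:.1f} weeks")        # ~9.6, i.e. ~10 weeks
```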


Case Studies - Tandem Trends

MTTF improved. The mix shifted: Hardware & Maintenance fell from 50% to 10% of outages, while Software (62%) & Operations (15%) now dominate.

NOTE: systematic under-reporting of Environment, Operations errors, and Application Software.


Dependability Status circa 1995

  • ~4-year MTTF

    • 5 9s for well-managed sys. Fault Tolerance Works.

  • Hardware is GREAT (maintenance and MTTF).

  • Software masks most hardware faults.

  • Many hidden software outages in operations:

    • New Software.

    • Utilities.

    • Need to make all hardware/software changes ONLINE.

  • Software seems to define a 30-year MTTF ceiling.

  • Reasonable Goal: 100-year MTTF. Class 4 today => class 6 tomorrow.


    Honorable Mention

    • The nice folks at Tandem (now HP):

      • Made failover fast (30 seconds or less).

      • Made change online

        • Add hardware/software

        • Reorganize database.

        • Rolling upgrades.

      • Added at least one 9 to their story.


    And Then?

    • Hardware got better (& more complex)

    • Software got better (& more complex)

    • RAID is standard; snapshots are becoming standard.

    • Cluster in a box: commodity failover

    • Remote replication is standard.


    Outline

    • The glorious past (Availability Progress)

    • The dark ages (current scene)

    • Some recommendations


    Progress?

    [Chart: availability over the decades for telephone systems, computer systems, the Internet, and cell phones]

    • MTTF improved from 1950-1995

    • MTTR: incremental improvements since 1970 (failover)

    • Hardware and Software online change (pNp) is now standard

    • Then the Internet arrived:

      • No project can take more than 3 months.

      • Time to market is everything

      • Change is good.


    The Internet Changed Expectations

      1990                                2005
      Phones delivered 99.999%            Cell phones deliver 90%
      ATMs delivered 99.99%               Web sites deliver 99%
      Failures were front-page news       Failures are business-page news
      Few hackers                         Many hackers
      Outages lasted an “hour”            Outages last a “day”

    This is progress?


    Eric Brewer said it best: ACID vs BASE, the internet litmus test (“copy” of slide 8 of http://www.ccs.neu.edu/groups/IEEE/ind-acad/brewer/sld008.htm)

      ACID                                    BASE
      Atomicity, Consistency,                 Basic Availability, Soft State,
        Isolation, Durability                   Eventual Consistency
      Availability?                           Availability FIRST
      Strong consistency; isolation           Weak consistency: stale data OK,
                                                approximate answers OK
      Focus on commit                         Best effort
      Conservative (pessimistic)              Aggressive (optimistic)
      Difficult evolution (e.g. schema)       Easier evolution
      Nested transactions                     Simpler! Faster

    I think it is a spectrum.
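A minimal sketch of what "availability FIRST, stale data OK" can look like in code (the names are illustrative, not Brewer's or any real API): answer from a possibly-stale cache when the authoritative store is slow or down, instead of blocking for a strongly consistent read:

```python
import time

cache = {}  # last value seen for each key, possibly stale

def read_availability_first(key, fetch, timeout_s=0.2):
    """Best effort: try the authoritative store briefly, then fall back
    to stale data. BASE trades consistency for availability."""
    try:
        value = fetch(key, timeout=timeout_s)  # authoritative read
        cache[key] = (value, time.time())
        return value, "fresh"
    except TimeoutError:
        if key in cache:
            value, seen_at = cache[key]
            return value, f"stale (as of {time.ctime(seen_at)})"
        raise  # no answer only if the key has never been seen at all
```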


    Why (1) Complexity

    • Internet sites are MUCH more complex.

      • NAP

      • Firewall/proxy/IPsprayer

      • Web

      • DMZ

      • App server

      • DB server

      • Links to other sites

      • tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os…

    • Skill level is much reduced


    A Data Center (500 servers)


    A Schematic of HotMail

    [Diagram: HotMail schematic. The Internet feeds banks of Local Directors that spray load across front-door member servers (MSERVs), login servers, graphics servers, incoming MailServers and gateways; behind them sit the member directory, AD servers, data servers, and user stores (USTOREs), all on switched Ethernet, with Telnet management access]

    • ~7,000 servers

    • 100 backend stores with 300TB (cooked)

    • many data centers

    • Links to

      • Internet Mail gateways

      • Ad-rotator

      • Passport

    • ~ 5 B messages per day

    • 350M mailboxes, 250M active

    • ~1M new per day.

    • New software every 3 months (small changes weekly).



    Why (2) Velocity

    [Diagram: the trend favors functionality and schedule over quality]

    • No project can take more than 13 weeks.

    • Time to market is everything

    • Functionality is everything

    • Faster, cheaper, …


    Why (3) Hackers

    • Hackers are a new, increased threat

    • Any site can be attacked from anywhere

    • Motives include ego, malice, and greed.

    • Complexity makes it hard to protect sites.

    • Whole internet attacks: Slammer

    • Concentration of wealth makes attractive target:

      Reporter: “Why did you rob banks?”

      Willie Sutton: “Cause that’s where the money is!”

    Note: Eric Raymond’s How to Become a Hacker (http://www.tuxedo.org/~esr/faqs/hacker-howto.html) describes the positive use of “hacker”; here I mean malicious and anti-social hackers: black-hats, not white-hats.


    How Bad Is It?

    http://www-iepm.slac.stanford.edu/

    Connectivity is poor.

    http://www.internettrafficreport.com/main.htm


    How Bad Is It?

    http://www-iepm.slac.stanford.edu/pinger/

    • Median monthly % ping packet loss for 2/99


    And in 2006, about the same



    Keynote Measures Response Time and Up Time

    Measures response time around the world

    Business service is better than popular service

    Has many proprietary services for SLAs.




    Service Level Measurements

    • Many organizations are measured on SLAs

    • Example: 1 sec response 99% of prime time

    • Keynote, Netcraft, …

      • offer to monitor your site (probe every few minutes; a sketch follows below)

        • This probing can go deep into the tree to detect services.

      • Send alerts via email

      • Give monthly reports.
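A minimal sketch of such a probe (the URL, probe period, and thresholds are illustrative; real services probe from many locations and go deeper into the service tree):

```python
import time
import urllib.request

def probe_sla(url="https://example.com/", samples=100,
              sla_seconds=1.0, period_seconds=180):
    """Probe a site every few minutes; return the fraction of probes
    answered within the SLA (compare against, e.g., a 99% target)."""
    within_sla = 0
    for _ in range(samples):
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=10).read(1)
            if time.monotonic() - start <= sla_seconds:
                within_sla += 1
        except Exception:
            pass  # a failed or timed-out probe misses the SLA
        time.sleep(period_seconds)
    return within_sla / samples
```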


    In addition

    • Most large sites build their own instrumentation (several times).

    • This instrumentation is elaborate and essential for the Network Operations Center (NOC).

    • There are attempts now to systematize it: Tivoli, OpenView, NetIQ, WhatsUp, MOM, …


    Microsoft.Com

    • Operations mis-configured a router

    • Took a day to diagnose and repair.

    • DoS attacks cost a fraction of a day.

    • Regular security patches.


    Back-End Servers Are More Stable

    [Timeline, year 1 through 18 months: down 30 hours in July (hardware stop, auto-restart failed, operations failure); down 26 hours in September (backplane failure, I/O bus failure)]

    • Generally deliver 99.99%

    • TerraServer, for example: a single back-end failed after 2.5 years.

    • Went to a 4-node cluster.

    • Fails every 2 months, but transparent failover takes 30 seconds and software upgrades are online. So… 99.999% in the backend.


    eBay: A very honest site

    http://www2.ebay.com/aw/announce.shtml

    • Publishes operations log.

    • Has 99% of scheduled uptime

    • Schedules about 2 hours/week down.

    • Has had some operations outages

    • Has had some DOS problems.


    And 2006…

    http://www2.ebay.com/aw/announce.shtml

    Welcome to eBay's System Board. Visit this board for information on scheduled site maintenance or system issues that are affecting Marketplace trading. For general eBay news, please see our General Announcements Board.

    ***Resolved - PayPal site slowness***

    February 08, 2006 | 05:20PM PST/PT: For several hours today, members may have experienced slowness while trying to access the PayPal website. This issue has now been resolved. Thank you for your patience.

    Link to this announcement | Back to top

    ***PayPal site slowness***

    February 08, 2006 | 02:38PM PST/PT: Members may be experiencing intermittent slowness while trying to access the PayPal website. We're aware of this issue and are working to fix it as quickly as possible. Thank you for your patience.

    Link to this announcement | Back to top

    ***Scheduled Maintenance For This Week***

    February 08, 2006 | 02:03PM PST/PT: The eBay system will be undergoing general maintenance from approximately 23:00 PT on Thursday, February 9th to 01:00 PT on Friday, February 10th. During this maintenance period, certain eBay site features may be intermittently unavailable or slow.


    Some Cool New Things

    • There are 100,000 node services.

    • Google File System shows importance & benefit of Triplex

    • DB replication & mirroring works (is easy)

    • Little things I have done:

      • With Leslie Lamport: unified Paxos & 2PC.

      • Measured mean-time-to-data-loss (and continue to measure things; see the sketch below).
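On triplexing and mean-time-to-data-loss: under the standard back-of-envelope model (independent failures, repair much faster than failure), a third copy buys orders of magnitude. A sketch with illustrative numbers; this is the textbook approximation, not Gray's measured data:

```python
from math import factorial

def mttdl_years(mttf_hours=1e6, mttr_hours=24, copies=2):
    """Mean time to data loss for n-way replication: data is lost only
    if every remaining copy fails within one repair window.
    MTTDL ~ MTTF^n / (n! * MTTR^(n-1))."""
    n = copies
    mttdl_hours = mttf_hours ** n / (factorial(n) * mttr_hours ** (n - 1))
    return mttdl_hours / (24 * 365)

print(f"duplex:  {mttdl_years(copies=2):.1e} years")  # ~2.4e6 years
print(f"triplex: {mttdl_years(copies=3):.1e} years")  # ~3.3e10 years
```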


    Outline

    • The glorious past (Availability Progress)

    • The dark ages (current scene)

    • Some recommendations


    Not to throw stones but…

    • Everyone has a serious problem.

    • The BEST people publish their stats.

    • The others HIDE their stats (check Netcraft to see who I mean).

    • We have good NODE-level availability: 5-9s is reasonable.

    • We have TERRIBLE system-level availability: 2-9s “scheduled” is the goal (!).


    Gresham’s Law: “bad money drives out good”

    • People WANT features!

    • People WANT convenience!

    • People WANT cheap!

    • In exchange, they seem to be willing to tolerate some:

      • Un-availability (= inconvenience)

      • “Dirty data” that needs reconciliation

      • Insecurity

    • I see it as our task to make it easier & cheaper to get high availability and security.

    [Diagram: the trend favors functionality and schedule over quality]


    Recommendation #1

    • Continue progress on back-ends.

      • Make management easier (AUTOMATE IT!!!)

      • Measure

      • Compare best practices

      • Continue to look for better algorithms.

    • Live in fear

      • We are at 10,000 node servers

      • We are headed for 1,000,000 node servers


    Recommendation #2

    • Current security approach is unworkable:

      • Anonymous clients

      • Firewall is clueless

      • Incredible complexity

    • We can’t win this game!

    • So change the rules (redefine the problem):

      • No anonymity

      • Unified authentication/authorization model

      • Single-function devices (with simple interfaces)

      • Only one kind of interface (UDDI/WSDL/SOAP/…).


    Recommendation #3

    • Dependability requires holistic not reductionist approach.

    • It’s the WHOLE system (end-to-end, top-to-bottom)

    • Hard to publish in this area, hard to get tenure.

      • Journals want theorem+proof and crisp statements.

    • Companies want to make money, so do not share their knowledge.

    • Dependability is an important social good, so dependability research needs government or philanthropic sponsorship.


    References

    Adams, E. (1984). “Optimizing Preventive Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.

    Anderson, T. and B. Randell. (1979). Computing Systems Reliability.

    Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.

    Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.

    Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.

    Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.

    Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.

    Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.

    Long, D. D., J. L. Carroll, and C. J. Park (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.

    Siewiorek, D. and R. Swarz. Theory and Practice of Reliable System Design.

    Birman, K. P. Building Secure and Reliable Network Applications.

    Long, D., A. Muir and R. Golding (1995). “A Longitudinal Study of Internet Host Reliability.” Proc. of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, pp. 2-9.

    http://www.netcraft.com/ They have even better for-fee data as well, but the for-free data is really excellent.

    http://www2.ebay.com/aw/announce.shtml#top eBay is an excellent benchmark of best Internet practices.

    “Empirical Measurements of Disk Failure Rates and Error Rates”, with C. van Ingen; moving 2P with cheap iron.

    “Consensus on Transaction Commit”, with L. Lamport; unifies 2PC and Byzantine Paxos.

