Dependability in the Internet Era
Outline
  • The glorious past (Availability Progress)
  • The dark ages (current scene)
  • Some recommendations
Preview. The Last 5 Years: Availability Dark Ages. Ready for a Renaissance?
  • Things got better, then things got a lot worse!

[Chart: Availability (9% up to 99.999%) vs. year, 1950-2000, for telephone systems, computer systems, cell phones, and the Internet. Telephone and computer systems climbed steadily toward 99.99-99.999%; cell phones and the Internet arrived far lower.]
DEPENDABILITY: The 3 ITIES
  • RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)
  • AVAILABILITY: Does it now. (also MTTR/(MTTF+MTTR) << 1; Availability = MTTF/(MTTF+MTTR)). System availability: if 90% of terminals are up and 99% of the DB is up, then ~89% of transactions are serviced on time (0.90 × 0.99 ≈ 0.89; sketched below).
  • Holistic vs. Reductionist view

[Diagram: dependability spans Security, Integrity, Reliability, and Availability.]
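To make that arithmetic concrete, here is a minimal Python sketch (my illustration, not from the talk) of the availability formula and the serial-composition example:

```python
# Minimal sketch of the availability arithmetic above (illustrative only).

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Example (assumed numbers): fails every 1,000 h, repaired in 1 h.
print(f"module availability: {availability(1000, 1):.4%}")  # 99.9001%

# Required components in series multiply their availabilities:
# 90% of terminals up x 99% of DB up => ~89% of transactions on time.
print(f"end-to-end: {0.90 * 0.99:.1%}")  # 89.1%
```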

Fail-Fast is Good, Repair is Needed

Lifecycle of a module

fail-fast gives

short fault latency

High Availability

is low UN-Availability

Unavailability ~ MTTR

MTTF

Improving either MTTR or MTTF gives benefit

Simple redundancy does not help much.
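A small numeric sketch (the MTTF/MTTR values are assumptions, not the talk's) showing that halving MTTR buys exactly as much as doubling MTTF:

```python
# Unavailability ~ MTTR / MTTF: improving either term helps equally.
# The numbers below are assumptions for illustration.

def unavailability(mttf: float, mttr: float) -> float:
    return mttr / (mttf + mttr)  # ~ mttr/mttf when MTTF >> MTTR

print(f"baseline (MTTF=1000, MTTR=10): {unavailability(1000, 10):.3%}")  # 0.990%
print(f"double MTTF:                   {unavailability(2000, 10):.3%}")  # 0.498%
print(f"halve MTTR:                    {unavailability(1000, 5):.3%}")   # 0.498%
```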

Fault Model
  • Failures are independent, so single-fault tolerance is a big win
  • Hardware fails fast (dead disk, blue screen)
  • Software fails fast (or goes to sleep)
  • Software often repaired by reboot:
    • Heisenbugs
  • Operations tasks: major source of outage
    • Utility operations
    • Software upgrades
Disks (RAID): The BIG Success Story
  • Duplex or Parity: masks faults
  • Disks @ 1M hours (~100 years)
  • But controllers fail, and sites have 1,000s of disks.
  • Duplexing or parity, plus dual paths, gives "perfect disks" (arithmetic sketched below).
  • Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
  • Only software/operations mistakes are left.
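A back-of-envelope sketch of why duplexing makes disk loss negligible. The repair window and fleet size are my assumptions; the mirrored-pair formula MTTF²/(2·MTTR) is classic RAID arithmetic, not a figure from the talk:

```python
# Back-of-envelope RAID arithmetic (assumed MTTR and fleet size).

DISK_MTTF_H = 1_000_000   # ~100 years per disk, as above
MTTR_H = 24               # assume a failed disk is replaced within a day
N_DISKS = 1_000           # a site with 1,000s of disks

# Unprotected: the fleet loses data whenever any one disk dies.
fleet_mttf_h = DISK_MTTF_H / N_DISKS
print(f"unprotected fleet: data loss every ~{fleet_mttf_h / 24:.0f} days")

# Mirrored pair: data is lost only if the mirror dies during the repair
# window, so MTTF_pair ~ MTTF^2 / (2 * MTTR).
pair_mttf_h = DISK_MTTF_H ** 2 / (2 * MTTR_H)
fleet_of_pairs_h = pair_mttf_h / (N_DISKS / 2)  # 500 independent pairs
print(f"500 mirrored pairs: data loss every ~{fleet_of_pairs_h / 24 / 365:,.0f} years")
```

With these assumptions the unprotected fleet loses data every ~42 days, while 500 mirrored pairs lose data once in ~4,757 years, which is why only software and operations mistakes are left.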
Fault Tolerance vs Disaster Tolerance
  • Fault-Tolerance: mask local faults
    • RAID disks
    • Uninterruptible Power Supplies
    • Cluster Failover
  • Disaster Tolerance: masks site failures
    • Protects against fire, flood, sabotage, …
    • Redundant system and service at remote site.
Case Study: Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans: Eiichi Watanabe).

1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, average duration ~ 90 MINUTES.

[Pie chart, reconstructed as a list: share of outages and per-cause MTBF.]

  • Vendor (hardware and software): 42%, MTBF ~ 5 months
  • Application software: 25%, MTBF ~ 9 months
  • Communications lines: 12%, MTBF ~ 1.5 years
  • Environment: 11.2%, MTBF ~ 2 years
  • Operations: 9.3%, MTBF ~ 2 years

To get a 10-year MTTF, must attack ALL these areas (arithmetic sketched below).
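The survey's numbers hang together: dividing the overall 10-week MTTF by a cause's share of outages gives that cause's MTBF. A quick sketch of that check, using the shares as reconstructed above:

```python
# Check: per-cause MTBF ~ overall MTTF / share of outages.
OVERALL_MTTF_WEEKS = 10
shares = {
    "Vendor (HW and SW)": 0.42,
    "Application software": 0.25,
    "Communications lines": 0.12,
    "Environment": 0.112,
    "Operations": 0.093,
}
for cause, share in shares.items():
    mtbf_years = OVERALL_MTTF_WEEKS / share / 52
    print(f"{cause:22s} {share:5.1%}  MTBF ~ {mtbf_years:.1f} years")
# Vendor ~0.5 y (5 months), application ~0.8 y (9 months),
# comm lines ~1.6 y, environment ~1.7 y, operations ~2.1 y.
```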

Case Studies - Tandem Trends

MTTF improved

Shift from Hardware & Maintenance to from 50% to 10%

to Software (62%) & Operations (15%)

NOTE: Systematic under-reporting of Environment

Operations errors

Application Software

Dependability Status circa 1995
  • ~4-year MTTF => 5 9s for well-managed systems. Fault tolerance works.
  • Hardware is GREAT (maintenance and MTTF).
  • Software masks most hardware faults.
  • Many hidden software outages in operations:
    • New Software.
    • Utilities.
  • Make all hardware/software changes ONLINE.
  • Software seems to define a 30-year MTTF ceiling.
  • Reasonable goal: 100-year MTTF. Class 4 today => Class 6 tomorrow.
What’s Happened Since Then?
  • Hardware got better
  • Software got better (even though it is more complex)
  • RAID is standard; snapshots are becoming standard
  • Cluster in a box: commodity failover
  • Remote replication is standard.
Availability

[Diagram: an availability ladder; each level of management adds 9s.]

  • Un-managed
  • Well-managed nodes: masks some hardware failures
  • Well-managed packs & clones: masks hardware failures, operations tasks (e.g. software upgrades), and some software failures (redundancy arithmetic sketched below)
  • Well-managed GeoPlex: masks site failures (power, network, fire, move, …) and some operations failures
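Why clones buy extra 9s, under the fault model's independence assumption (the node numbers are mine, for illustration):

```python
# A pair of independent clones is down only when both are down at once.
node_unavail = 0.001  # assume a well-managed node: three 9s

pair_unavail = node_unavail ** 2
print(f"single node: {1 - node_unavail:.3%}")   # 99.900%
print(f"clone pair:  {1 - pair_unavail:.5%}")   # 99.99990%
```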

Outline
  • The glorious past (Availability Progress)
  • The dark ages (current scene)
  • Some recommendations
Progress?
  • MTTF improved from 1950-1995
  • MTTR has not improved much since 1970 (failover)
  • Hardware and software online change (PnP) is now standard
  • Then the Internet arrived:
    • No project can take more than 3 months.
    • Time to market is everything
    • Change is good.
The Internet Changed Expectations

  • 1990: phones delivered 99.999%; ATMs delivered 99.99%; failures were front-page news; few hackers; outages lasted an "hour".
  • 2000: cellphones deliver 90%; web sites deliver 98%; failures are business-page news; many hackers; outages last a "day".
  • This is progress?

Why (1) Complexity
  • Internet sites are MUCH more complex.
    • NAP
    • Firewall / proxy / IP sprayer
    • Web
    • DMZ
    • App server
    • DB server
    • Links to other sites
    • tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os…
  • Skill level is much reduced
A Schematic of HotMail
  • ~7,000 servers
  • 100 backend stores with 120TB (cooked)
  • 3 data centers
  • Links to
    • Passport
    • Ad-rotator
    • Internet Mail gateways
  • ~ 1B messages per day
  • 150M mailboxes, 100M active
  • ~400,000 new mailboxes per day.
Why (2) Velocity

[Diagram: the functionality-schedule-quality triangle; the trend favors functionality and schedule over quality.]
  • No project can take more than 13 weeks.
  • Time to market is everything
  • Functionality is everything
  • Faster, cheaper, badder.
Why (3) Hackers
  • Hackers are a new, increased threat
  • Any site can be attacked from anywhere
  • Motives include ego, malice, and greed.
  • Complexity makes it hard to protect sites.
  • Concentration of wealth makes an attractive target. Asked why he robbed banks, Willie Sutton answered: "Cause that's where the money is!"

Note: Eric Raymond's How to Become a Hacker (http://www.tuxedo.org/~esr/faqs/hacker-howto.html) describes the positive use of the term; here I mean malicious and anti-social hackers.

How Bad Is It?

http://www-iepm.slac.stanford.edu/

Connectivity is poor.

How Bad Is It?

http://www-iepm.slac.stanford.edu/pinger/

  • Median monthly % ping packet loss for 2/99
Microsoft.Com
  • Operations misconfigured a router.
  • It took a day to diagnose and repair.
  • DoS attacks cost a fraction of a day.
  • Regular security patches.
BackEnd Servers are More Stable

  • Generally deliver 99.99%.
  • TerraServer, for example: through the first 18 months, the single back-end was down 30 hours in July (hardware stop, auto-restart failed, operations failure) and 26 hours in September (backplane failure, I/O bus failure); it failed after 2.5 years.
  • Went to a 4-node cluster.
  • Fails every 2 months, with transparent failover in 30 seconds and online software upgrades. So… 99.999% in the backend.
eBay: A very honest site

http://www2.ebay.com/aw/announce.shtml#top

  • Publishes operations log.
  • Has 99% of scheduled uptime
  • Schedules about 2 hours/week down.
  • Has had some operations outages
  • Has had some DoS problems. (Scheduled-downtime arithmetic sketched below.)
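The 99% figure follows from the maintenance window alone; a quick check using the numbers above:

```python
# ~2 hours/week of scheduled downtime caps availability near 99%.
hours_per_week = 7 * 24  # 168
print(f"best case: {1 - 2 / hours_per_week:.1%}")  # 98.8%
```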
Outline
  • The glorious past (Availability Progress)
  • The dark ages (current scene)
  • Some recommendations
Not to throw stones but…
  • Everyone has a serious problem.
  • The BEST people publish their stats.
  • The others HIDE their stats (check Netcraft to see who I mean).
  • We have good NODE-level availability: 5-9s is reasonable.
  • We have TERRIBLE system-level availability: 2-9s is the goal.
Recommendation #1
  • Continue progress on back-ends.
    • Make management easier (AUTOMATE IT!!!)
    • Measure
    • Compare best practices
    • Continue to look for better algorithms.
  • Live in fear
    • We are at 10,000 node servers
    • We are headed for 1,000,000 node servers
Recommendation #2
  • Current security approach is unworkable:
    • Anonymous clients
    • Firewall is clueless
    • Incredible complexity
  • We can't win this game!
  • So change the rules (redefine the problem):
    • No anonymity
    • Unified authentication/authorization model
    • Single-function devices (with simple interfaces)
    • Only one kind of interface (UDDI/WSDL/SOAP/…).
References

Adams, E. (1984). "Optimizing Preventative Service of Software Products." IBM Journal of Research and Development. 28(1): 2-14.

Anderson, T. and B. Randell. (1979). Computing Systems Reliability.

Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.

Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.

Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.

Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.

Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.

Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.

Long, D. D., J. L. Carroll, and C. J. Park (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems. 177-186. Pisa, September 1991.

Long, D., A. Muir, and R. Golding (1995). "A Longitudinal Study of Internet Host Reliability." Proceedings of the Symposium on Reliable Distributed Systems. Bad Neuenahr, Germany: IEEE, September 1995. 2-9.

http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.

http://www2.ebay.com/aw/announce.shtml#top eBay is an Excellent benchmark of best Internet practices

http://www-iepm.slac.stanford.edu/pinger/ Network traffic/quality report, dated, but the others have died off!