Dependability in the Internet Era
Outline
The glorious past (Availability Progress)
The dark ages (current scene)
Some recommendations
Preview
The Last 5 Years: Availability Dark Ages. Ready for a Renaissance?
Things got better, then things got a lot worse!
99.999%
Lifecycle of a module
short fault latency
High availability is low UN-availability
Unavailability ~ MTTR / MTTF
Improving either MTTR or MTTF gives benefit
Simple redundancy does not help much.
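The effect of MTTR and MTTF on unavailability can be checked numerically. The sketch below is illustrative only: the function names and the sample figures (a 1,000-hour MTTF and a 1-hour MTTR) are assumptions made for the example, not data from the talk. It uses the standard definition availability = MTTF / (MTTF + MTTR), so unavailability is approximately MTTR / MTTF whenever MTTF is much larger than MTTR.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is down: MTTR / (MTTF + MTTR)."""
    return mttr_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of downtime per 365-day year at a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical baseline: fails every 1,000 hours, takes 1 hour to repair.
base = unavailability(1000.0, 1.0)

# Halving MTTR or doubling MTTF each roughly halves unavailability.
half_mttr = unavailability(1000.0, 0.5)
double_mttf = unavailability(2000.0, 1.0)

# Downtime budget implied by "five nines" (99.999%) availability.
five_nines_downtime = downtime_minutes_per_year(0.99999)

print(f"baseline unavailability: {base:.6f}")
print(f"half MTTR:               {half_mttr:.6f}")
print(f"double MTTF:             {double_mttf:.6f}")
print(f"99.999% allows about {five_nines_downtime:.2f} minutes of downtime per year")
```

Under these assumed figures, either improvement cuts unavailability by about half, and 99.999% availability leaves a little over five minutes of downtime per year.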
MTTF by failure source:
Vendor (hardware and software): 5 months
Application software: 9 months
Communications lines: 1.5 years
Operations: 2 years
Environment: 2 years
1,383 institutions reported (6/84 - 7/85)
7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES
To Get 10 Year MTTF, Must Attack All These Areas
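The per-source figures (5 months, 9 months, 1.5 years, 2 years, 2 years) can be cross-checked against the reported overall MTTF if they are read as per-source MTTFs. The sketch below rests on a simplifying assumption not stated in the talk: the sources fail independently at constant rates, so their failure rates add. All names are hypothetical.

```python
WEEKS_PER_MONTH = 52.0 / 12.0

# Per-source MTTF in months, read from the figures above.
mttf_months = {
    "vendor (hardware and software)": 5.0,
    "application software": 9.0,
    "communications lines": 18.0,
    "operations": 24.0,
    "environment": 24.0,
}


def combined_mttf(mttfs):
    """Combine independent failure sources: rates (1/MTTF) add."""
    return 1.0 / sum(1.0 / m for m in mttfs)


overall_months = combined_mttf(mttf_months.values())
overall_weeks = overall_months * WEEKS_PER_MONTH

# Even with vendor failures eliminated entirely, the rest still dominate.
without_vendor_months = combined_mttf(
    m for name, m in mttf_months.items() if not name.startswith("vendor")
)

# Availability implied by a 10-week MTTF and 90-minute average outage.
mttf_minutes = 10 * 7 * 24 * 60
outage_minutes = 90
implied_availability = mttf_minutes / (mttf_minutes + outage_minutes)

print(f"combined MTTF: {overall_months:.2f} months ({overall_weeks:.1f} weeks)")
print(f"combined MTTF without vendor failures: {without_vendor_months:.2f} months")
print(f"implied availability: {implied_availability:.4%}")
```

Under these assumptions the combined MTTF comes out near 2.2 months, roughly 10 weeks, and the implied availability is about 99.9%. Removing the largest single source only raises the MTTF to about 4 months, which illustrates why every area has to be addressed to approach a 10-year MTTF.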
Tele Comm lines
Shift from Hardware & Maintenance (from 50% to 10%)
to Software (62%) & Operations (15%)
NOTE: Systematic under-reporting of Environment
Masks some hardware failures
well-managed packs & clones
Masks hardware failures and operations tasks (e.g. software upgrades)
Masks some software failures
Masks site failures (power, network, fire, move,…)
Masks some operations failures
Note: Eric Raymond’s How to Become a Hacker (http://www.tuxedo.org/~esr/faqs/hacker-howto.html)
is the positive use of the term; here I mean malicious and anti-social hackers.
Connectivity is poor.
Down 30 hours in July (hardware stop, auto restart failed, operations failure)
Down 26 hours in September (backplane failure, I/O bus failure)
Back-end servers are more stable
Adams, E. (1984). “Optimizing Preventive Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N. and A. Reuter. (1993). Transaction Processing: Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
Long, D. D., J. L. Carroll, and C. J. Park. (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems, Pisa, September 1991. 177-186.
Long, D., A. Muir, and R. Golding. (1995). “A Longitudinal Study of Internet Host Reliability.” Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, IEEE, September 1995. 2-9.
http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.
http://www2.ebay.com/aw/announce.shtml#top eBay is an excellent benchmark of best Internet practices.
http://www-iepm.slac.stanford.edu/pinger/ Network traffic/quality report, dated, but the others have died off!