Dependability in the Internet Era
Jim Gray
Microsoft Research
High Dependability Computing Consortium Conference
Santa Cruz, CA, 7 May 2001
REVISED: 13 Feb 2005, Stanford, CA
Outline
The glorious past (Availability Progress)
The dark ages (current scene)
Some recommendations
Preview
The Last 10 Years: Availability Dark Ages
Ready for a Renaissance?
Lifecycle of a module
short fault latency
High availability is low UN-availability
Unavailability ≈ MTTR / MTTF
Improving either MTTR or MTTF gives benefit
Simple redundancy does not help much.
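A minimal sketch of the arithmetic behind the relation unavailability ≈ MTTR / MTTF, which holds when MTTR is much smaller than MTTF. The function names and the sample numbers (a 1,000-hour MTTF and a 1-hour MTTR) are illustrative assumptions, not figures from the talk.

```python
def unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Exact fraction of time a module is down: MTTR / (MTTF + MTTR)."""
    return mttr_hours / (mttf_hours + mttr_hours)


def approx_unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Approximation MTTR / MTTF, valid when MTTR << MTTF."""
    return mttr_hours / mttf_hours


# Hypothetical module: fails every 1,000 hours, takes 1 hour to repair.
base = approx_unavailability(1000.0, 1.0)

# Halving the repair time and doubling the time between failures
# reduce unavailability by the same factor.
halved_mttr = approx_unavailability(1000.0, 0.5)
doubled_mttf = approx_unavailability(2000.0, 1.0)

print(f"baseline:     {base:.6f}")
print(f"halved MTTR:  {halved_mttr:.6f}")
print(f"doubled MTTF: {doubled_mttf:.6f}")
print(f"exact value:  {unavailability(1000.0, 1.0):.6f}")
```

Under these assumed numbers, both improvements cut unavailability in half, which is the sense in which improving either MTTR or MTTF gives benefit.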
Masks some hardware failures
well-managed packs & clones
Masks hardware failures and operations tasks (e.g. software upgrades)
Masks some software failures
Masks site failures (power, network, fire, move,…)
Masks some operations failures
Vendor (hardware and software) 5 Months
Application software 9 Months
Communications lines 1.5 Years
Operations 2 Years
Environment 2 Years
1,383 institutions reported (6/84 - 7/85)
7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES
To Get 10 Year MTTF, Must Attack All These Areas
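A rough check of how the per-cause figures combine, reading each duration in the list as the MTTF for that cause. The sketch assumes independent causes with constant failure rates, so the rates (1/MTTF) add; that model, the function names, and the equal split across five areas are assumptions made for illustration.

```python
# Per-cause MTTF in months, taken from the list above.
mttf_months = {
    "vendor (hardware and software)": 5.0,
    "application software": 9.0,
    "communications lines": 18.0,  # 1.5 years
    "operations": 24.0,            # 2 years
    "environment": 24.0,           # 2 years
}


def combined_mttf(mttfs):
    """Overall MTTF when independent failure rates (1/MTTF) add up."""
    return 1.0 / sum(1.0 / m for m in mttfs)


def required_per_area_mttf(target_mttf, areas):
    """MTTF each area needs if all areas contribute equally to the target."""
    return target_mttf * areas


overall_months = combined_mttf(mttf_months.values())
overall_weeks = overall_months * 52.0 / 12.0
per_area_years = required_per_area_mttf(10.0, len(mttf_months))

print(f"overall MTTF: {overall_months:.2f} months (~{overall_weeks:.1f} weeks)")
print(f"per-area MTTF needed for a 10-year overall MTTF: {per_area_years:.0f} years")
```

The combined figure comes out at about 2.2 months, close to the reported overall MTTF of roughly 10 weeks. Under the equal-split assumption, a 10-year overall MTTF needs about 50 years in each of the five areas, so improving a single area is not enough.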
Tele Comm lines
Shift from Hardware & Maintenance (from 50% to 10%)
to Software (62%) & Operations (15%)
NOTE: Systematic under-reporting of Environment
Phones delivered 99.999%
ATMs delivered 99.99%
Failures were front-page news.
Outages last an “hour”
Cell phones deliver 90%
Web sites deliver 99%
Failures are business-page news
Outages last a “day”
The Internet Changed Expectations
This is progress?
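To put the percentages in concrete terms, the sketch below converts each delivered availability into expected downtime per year. The helper name and the 365-day year are assumptions for illustration; the percentages are the ones quoted in the lists above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected minutes of outage per year at a given availability."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


delivered = {
    "phones": 99.999,
    "ATMs": 99.99,
    "web sites": 99.0,
    "cell phones": 90.0,
}

for name, percent in delivered.items():
    minutes = downtime_minutes_per_year(percent)
    print(f"{name}: {percent}% -> {minutes:,.1f} min/year ({minutes / 60:,.1f} hours)")
```

At 99.999% the budget is about 5 minutes of outage a year; at 99% it is about 3.65 days, and at 90% about 36.5 days.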
Focus on commit
Difficult evolution (e.g. schema)
Basic Availability
Soft State
Eventual Consistency
Weak consistency: stale data is OK
Approximate answers OK
Eric Brewer said it best: ACID vs BASE, the internet litmus test
(“copy” of slide 8 of http://www.ccs.neu.edu/groups/IEEE/ind-acad/brewer/sld008.htm)
I think it is a spectrum
Quality
Why (2) Velocity
Reporter: “Why did you rob banks?”
Willie Sutton: “Cause that’s where the money is!”
Note: Eric Raymond’s How to Become a Hacker (http://www.tuxedo.org/~esr/faqs/hacker-howto.html)
is the positive use of “Hacker”; here I mean malicious and anti-social hackers:
black-hats, not white-hats.
Connectivity is poor.
Measures response time around the world
Business service is better than popular service
Has many proprietary services for SLAs.
Down 30 hours in July (hardware stop, auto restart failed, operations failure)
Down 26 hours in September (backplane failure, I/O bus failure)
Back-End Servers are More Stable
Welcome to eBay's System Board. Visit this board for information on scheduled site maintenance or system issues that are affecting Marketplace trading. For general eBay news, please see our General Announcements Board.
***Resolved - PayPal site slowness***
February 08, 2006 | 05:20PM PST/PT
For several hours today, members may have experienced slowness while trying to access the PayPal website. This issue has now been resolved. Thank you for your patience.
***PayPal site slowness***
February 08, 2006 | 02:38PM PST/PT
Members may be experiencing intermittent slowness while trying to access the PayPal website. We're aware of this issue and are working to fix it as quickly as possible. Thank you for your patience.
***Scheduled Maintenance For This Week***
February 08, 2006 | 02:03PM PST/PT
The eBay system will be undergoing general maintenance from approximately 23:00 PT on Thursday, February 9th to 01:00 PT on Friday, February 10th. During this maintenance period, certain eBay site features may be intermittently unavailable or slow.
People WANT convenience!
People WANT cheap!
In exchange, they seem to be willing to tolerate some
Un-availability (= inconvenience)
“Dirty data” that needs reconciliation
I see it as our task to make it easier & cheaper to get high availability and security.
Quality
Gresham’s Law: “bad money drives out good”
Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
Long, D. D., J. L. Carroll, and C. J. Park. (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.
Theory and Practice of Reliable System Design, Dan Siewiorek, Robert Swarz
Building Secure and Reliable Network Applications, Ken P. Birman
Darrell Long, Andrew Muir, and Richard Golding, “A Longitudinal Study of Internet Host Reliability,” Proc. of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, 1995, pp. 2-9.
http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.
http://www2.ebay.com/aw/announce.shtml#top eBay is an Excellent benchmark of best Internet practices
Empirical Measurements of Disk Failure Rates and Error Rates + C. van Ingen moving 2P with cheap iron
“Consensus on Transaction Commit”, +, L. Lamport, unifies 2PC and Byzantine-Paxos