Dependability in the Internet Era


Presentation Transcript


  1. Dependability in the Internet Era

  2. Outline • The glorious past (Availability Progress) • The dark ages (current scene) • Some recommendations

  3. Preview: The Last 5 Years: Availability Dark Ages. Ready for a Renaissance? • Things got better, then things got a lot worse! (Figure: availability, on a scale from 9% to 99.999%, versus year, 1950 to 2000, for telephone systems, cell phones, computer systems, and the Internet.)

  4. DEPENDABILITY: The 3 ITIES • RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1) • AVAILABILITY: Does it now. (also 1 >> MTTR) Availability = MTTF / (MTTF + MTTR) • System Availability: If 90% of terminals up & 99% of DB up? (=> 89% of transactions are serviced on time). • Holistic vs. Reductionist view (diagram labels: Security, Integrity, Reliability, Availability)
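The availability ratio and the terminals-plus-database example on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the series calculation assumes the parts fail independently, which the slide does not state explicitly.

```python
from math import prod


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a module is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(parts: list[float]) -> float:
    """Availability of a transaction that needs every part up at once.

    Assumes the parts fail independently of one another.
    """
    return prod(parts)


# The slide's example: 90% of terminals up and 99% of the database up.
terminals, database = 0.90, 0.99
print(f"{series_availability([terminals, database]):.1%}")
```

The product is 0.891, which is the slide's "89% of transactions are serviced on time".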

  5. Fail-Fast is Good, Repair is Needed • Lifecycle of a module: fail-fast gives short fault latency • High Availability is low UN-Availability: Unavailability ~ MTTR / MTTF • Improving either MTTR or MTTF gives benefit • Simple redundancy does not help much.
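The claim that improving either MTTR or MTTF helps follows directly from the ratio MTTR / MTTF. A small sketch, using made-up figures (a 1,000-hour MTTF and a 1-hour MTTR are assumptions for illustration, not numbers from the talk):

```python
def unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Exact fraction of time a module is down."""
    return mttr_hours / (mttf_hours + mttr_hours)


def approx_unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """The slide's approximation, valid when MTTR is much smaller than MTTF."""
    return mttr_hours / mttf_hours


# Hypothetical module: fails every 1,000 hours, takes 1 hour to repair.
base = approx_unavailability(1000.0, 1.0)
halved_repair = approx_unavailability(1000.0, 0.5)  # repair twice as fast
doubled_life = approx_unavailability(2000.0, 1.0)   # fail half as often

print(base, halved_repair, doubled_life)
```

Halving the repair time and doubling the time between failures both cut unavailability in half, so either lever is worth pulling.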

  6. Fault Model • Failures are independent. So, single fault tolerance is a big win • Hardware fails fast (dead disk, blue-screen) • Software fails fast (or goes to sleep) • Software often repaired by reboot: • Heisenbugs • Operations tasks: major source of outage • Utility operations • Software upgrades

  7. Disks (RAID): the BIG Success Story • Duplex or Parity: masks faults • Disks @ 1M hours (~100 years) • But controllers fail, and sites have 1,000s of disks. • Duplexing or parity, and dual path, gives “perfect disks” • Wal-Mart never lost a byte (thousands of disks, hundreds of failures). • Only software/operations mistakes are left.
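A rough calculation shows why a 1M-hour disk MTTF is not enough on its own once a site has thousands of disks, and why duplexing helps so much. This is a sketch under stated assumptions: failures are independent, the mirrored-pair figure uses the common textbook approximation MTTF² / (2 × MTTR), and the 24-hour repair time is a hypothetical value, not one from the talk.

```python
HOURS_PER_YEAR = 8760


def expected_failures_per_year(n_disks: int, disk_mttf_hours: float) -> float:
    """Expected disk failures per year across a whole installation."""
    return n_disks * HOURS_PER_YEAR / disk_mttf_hours


def mirrored_pair_mttf_hours(disk_mttf_hours: float, repair_hours: float) -> float:
    """Approximate MTTF of a duplexed pair: data is lost only if the second
    disk fails while the first is still being repaired."""
    return disk_mttf_hours ** 2 / (2 * repair_hours)


disk_mttf = 1_000_000.0  # the slide's figure: about 114 years per disk

# 1,000 disks: several failures a year, even with very reliable disks.
print(expected_failures_per_year(1000, disk_mttf))

# Duplexed pair with an assumed 24-hour repair window, in years.
print(mirrored_pair_mttf_hours(disk_mttf, 24.0) / HOURS_PER_YEAR)
```

With these inputs a thousand-disk site sees roughly nine disk failures a year, while each mirrored pair is expected to outlast the installation by a very wide margin.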

  8. Fault Tolerance vs Disaster Tolerance • Fault Tolerance: masks local faults • RAID disks • Uninterruptible Power Supplies • Cluster Failover • Disaster Tolerance: masks site failures • Protects against fire, flood, sabotage, ... • Redundant system and service at remote site.

  9. Case Study - Japan. "Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans: Eiichi Watanabe). 1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES. To Get 10 Year MTTF, Must Attack All These Areas.
     Outage source                     Share of outages   MTTF
     Vendor (hardware and software)    42%                5 months
     Application software              25%                9 months
     Communications lines              12%                1.5 years
     Environment                       11.2%              2 years
     Operations                        9.3%               2 years
     All sources                                          10 weeks
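The per-category MTTFs on this slide can be combined to check the overall 10-week figure. The sketch below assumes the categories fail independently, so their failure rates (1 / MTTF) simply add; the dictionary keys are labels chosen for illustration.

```python
# MTTF per outage source, in months, as given on the slide.
mttf_months = {
    "vendor (hardware and software)": 5,
    "application software": 9,
    "communications lines": 18,  # 1.5 years
    "operations": 24,            # 2 years
    "environment": 24,           # 2 years
}


def combined_mttf(mttfs: list[float]) -> float:
    """MTTF of a system that fails when any one source fails.

    Assumes independent sources, so failure rates add.
    """
    return 1.0 / sum(1.0 / m for m in mttfs)


months = combined_mttf(list(mttf_months.values()))
weeks = months * 52 / 12
print(f"{months:.2f} months, about {weeks:.1f} weeks")
```

The result is a little under 10 weeks, in line with the slide, and it makes the closing point concrete: the combined figure is dominated by the worst categories, so reaching a 10-year MTTF means improving every one of them.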

  10. Case Studies - Tandem Trends • MTTF improved • Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%) • NOTE: Systematic under-reporting of Environment, Operations errors, Application Software

  11. Dependability Status circa 1995 • ~4-year MTTF => 5 9s for well-managed systems. Fault Tolerance Works. • Hardware is GREAT (maintenance and MTTF). • Software masks most hardware faults. • Many hidden software outages in operations: • New Software. • Utilities. • Make all hardware/software changes ONLINE. • Software seems to define a 30-year MTTF ceiling. • Reasonable Goal: 100-year MTTF. Class 4 today => class 6 tomorrow.
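The "9s" and "class" shorthand translates into downtime as follows. This sketch assumes the usual reading in which availability class n means n leading nines, i.e. unavailability of at most 10^-n; the MTTR values in the example are hypothetical.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime allowed at a given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** -nines


def availability_class(mttf_hours: float, mttr_hours: float) -> int:
    """Number of leading nines implied by an MTTF and MTTR pair."""
    return math.floor(math.log10((mttf_hours + mttr_hours) / mttr_hours))


four_years = 4 * 8760.0  # the slide's ~4-year MTTF, in hours

for n in (4, 5, 6):
    print(n, "nines:", round(downtime_minutes_per_year(n), 3), "minutes/year")

# Hypothetical repair times: 15 minutes versus 2 hours per outage.
print(availability_class(four_years, 0.25), availability_class(four_years, 2.0))
```

With a 4-year MTTF, five nines requires each outage to be repaired in roughly 20 minutes, which is why the slide ties the figure to well-managed systems.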

  12. What’s Happened Since Then? • Hardware got better • Software got better (even though it is more complex) • RAID is standard; snapshots are becoming standard • Cluster in a box: commodity failover • Remote replication is standard.

  13. Availability (figure: a scale of 9s, rising with each level of management) • Un-managed • Well-managed nodes: masks some hardware failures • Well-managed packs & clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures • Well-managed GeoPlex: masks site failures (power, network, fire, move, …); masks some operations failures

  14. Outline • The glorious past (Availability Progress) • The dark ages (current scene) • Some recommendations

  15. Progress? • MTTF improved from 1950-1995 • MTTR has not improved much since 1970 failover • Hardware and Software online change (pNp) is now standard • Then the Internet arrived: • No project can take more than 3 months. • Time to market is everything • Change is good.

  16. The Internet Changed Expectations • 1990: Phones delivered 99.999%. ATMs delivered 99.99%. Failures were front-page news. Few hackers. Outages last an “hour”. • 2000: Cellphones deliver 90%. Web sites deliver 98%. Failures are business-page news. Many hackers. Outages last a “day”. • This is progress?

  17. Why (1) Complexity • Internet sites are MUCH more complex. • NAP • Firewall/proxy/ipsprayer • Web • DMZ • App server • DB server • Links to other sites • tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os… • Skill level is much reduced

  18. One of the Data Centers (500 servers)

  19. A Schematic of HotMail • ~7,000 servers • 100 backend stores with 120TB (cooked) • 3 data centers • Links to • Passport • Ad-rotator • Internet Mail gateways • … • ~ 1B messages per day • 150M mailboxes, 100M active • ~400,000 new per day.

  20. Why (2) Velocity • No project can take more than 13 weeks. • Time to market is everything • Functionality is everything • Faster, cheaper, badder (Diagram labels: Functionality, Schedule, Quality; trend.)

  21. Why (3) Hackers • Hackers are a new, increased threat • Any site can be attacked from anywhere • Motives include ego, malice, and greed. • Complexity makes it hard to protect sites. • Concentration of wealth makes an attractive target: • Why did you rob banks? • Willie Sutton: ’Cause that’s where the money is! Note: Eric Raymond’s How to Become a Hacker (http://www.tuxedo.org/~esr/faqs/hacker-howto.html) is the positive use of the term; here I mean malicious and anti-social hackers.

  22. How Bad Is It? http://www-iepm.slac.stanford.edu/ Connectivity is poor.

  23. How Bad Is It? http://www-iepm.slac.stanford.edu/pinger/ • Median monthly % ping packet loss for 2/99

  24. Microsoft.Com • Operations mis-configured a router • Took a day to diagnose and repair. • DOS attacks cost a fraction of a day. • Regular security patches.

  25. BackEnd Servers are More Stable • Generally deliver 99.99% • TerraServer, for example: single back-end failed after 2.5 y. • Went to 4-node cluster • Fails every 2 mo. Transparent failover in 30 sec. Online software upgrades. So… 99.999% in backend… (Chart: Year 1 through 18 months. Down 30 hours in July: hardware stop, auto restart failed, operations failure. Down 26 hours in September: backplane failure, I/O bus failure.)
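The five-nines claim for the clustered back-end follows from the slide's own numbers: one failover every two months, each lasting 30 seconds. The sketch below also estimates availability from the two long outages listed for the first 18 months; treating those 56 hours as the only downtime in that window is an assumption about how the chart should be read.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
HOURS_PER_YEAR = 365 * 24


def availability_from_downtime(down: float, period: float) -> float:
    """Fraction of the period spent up; both arguments share one unit."""
    return 1.0 - down / period


# 4-node cluster: 6 failovers a year, 30 seconds each.
cluster = availability_from_downtime(6 * 30, SECONDS_PER_YEAR)

# Two long outages (30 h + 26 h), counted over an assumed 18-month window.
before_cluster = availability_from_downtime(30 + 26, 1.5 * HOURS_PER_YEAR)

print(f"cluster:        {cluster:.5%}")
print(f"before cluster: {before_cluster:.2%}")
```

Three minutes of failover per year keeps the cluster above 99.999%, while two day-long outages in 18 months leave the earlier figure at about two nines.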

  26. eBay: A very honest site http://www2.ebay.com/aw/announce.shtml#top • Publishes operations log. • Has 99% of scheduled uptime • Schedules about 2 hours/week down. • Has had some operations outages • Has had some DOS problems.
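The two eBay figures can be combined into one end-user number. This sketch assumes "99% of scheduled uptime" means the site is up 99% of the hours outside its scheduled maintenance window; that reading is an interpretation, not something the slide spells out.

```python
HOURS_PER_WEEK = 7 * 24


def effective_availability(scheduled_down_hours_per_week: float,
                           uptime_within_schedule: float) -> float:
    """Availability as a user sees it, counting scheduled downtime too."""
    scheduled_fraction = 1.0 - scheduled_down_hours_per_week / HOURS_PER_WEEK
    return scheduled_fraction * uptime_within_schedule


overall = effective_availability(2.0, 0.99)
print(f"{overall:.1%}")
```

Under that reading the site is reachable a little under 98% of the time, which is close to the "Web sites deliver 98%" figure quoted earlier in the talk.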

  27. Outline • The glorious past (Availability Progress) • The dark ages (current scene) • Some recommendations

  28. Not to throw stones but… • Everyone has a serious problem. • The BEST people publish their stats. • The others HIDE their stats (check Netcraft to see who I mean). • We have good NODE-level availability: 5-9s is reasonable. • We have TERRIBLE system-level availability: 2-9s is the goal.

  29. Recommendation #1 • Continue progress on back-ends. • Make management easier (AUTOMATE IT!!!) • Measure • Compare best practices • Continue to look for better algorithms. • Live in fear • We are at 10,000 node servers • We are headed for 1,000,000 node servers

  30. Recommendation #2 • Current security approach is unworkable: • Anonymous clients • Firewall is clueless • Incredible complexity • We can’t win this game! • So change the rules (redefine the problem): • No anonymity • Unified authentication/authorization model • Single-function devices (with simple interfaces) • Only one kind of interface (uddi/wsdl/soap/…).

  31. References
     • Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
     • Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
     • Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
     • Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
     • Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
     • Gray, J. N. and A. Reuter. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
     • Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
     • Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
     • Long, D. D., J. L. Carroll, and C. J. Park. (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.
     • Long, D., A. Muir, and R. Golding. (1995). “A Longitudinal Study of Internet Host Reliability.” Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, September 1995, pp. 2-9.
     • http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.
     • http://www2.ebay.com/aw/announce.shtml#top eBay is an excellent benchmark of best Internet practices.
     • http://www-iepm.slac.stanford.edu/pinger/ Network traffic/quality report; dated, but the others have died off!
