
Lessons from Giant-Scale Services, IEEE Internet Computing, Vol. 5, No. 4, July/August 2001


Presentation Transcript


  1. Lessons from Giant-Scale Services, IEEE Internet Computing, Vol. 5, No. 4, July/August 2001. Eric A. Brewer, University of California, Berkeley, and Inktomi Corporation. Presented by: Ilias Tsigaridas (M484)

  2. Examples of giant-scale services • AOL • Microsoft Network • Yahoo • eBay • CNN • Instant messaging • Napster • Many more… The demand: they must always be available, despite their scale, growth rate, rapid evolution of content and features, etc.

  3. Article characteristics • An “experience” article • No pointers to the literature • Principles and approaches, not a quantitative evaluation • The reasons • Focus on high-level design • New area • Proprietary nature of the information

  4. Article scope • Look at the basic model of giant-scale services • Focus on the challenges of • High availability • Evolution • Growth • Principles for the above that simplify the design of large systems

  5. Basic Model (general) • The “infrastructure services” • Internet-based systems that provide instant messaging, wireless services and so on

  6. Basic Model (general) • We discuss • Single-site, single-owner, well-connected clusters • Perhaps a part of a larger service • We do not discuss • Wide-area issues • Network partitioning • Low or discontinuous bandwidth • Multiple administrative domains • Service monitoring • Network QoS • Security • Logging and log analysis • DBMS

  7. Basic Model (general) • We focus on • High availability • Replication • Degradation • Disaster tolerance • Online evolution The scope is to bridge the gap between the basic building blocks of giant-scale services and the real-world scalability and availability they require

  8. Basic Model (Advantages) • Access anywhere, anytime • Availability via multiple devices • Groupware support • Lower overall cost • Simplified service updates

  9. Basic Model (Advantages) • Access anywhere, anytime • The infrastructure is ubiquitous • You can access the service from home, work, an airport, and so on

  10. Basic Model (Advantages) • Availability via multiple devices • The infrastructure handles the processing (most of it, at least) • Users access the services via set-top boxes, network computers, smartphones, and so on • In this way we can offer more functionality for a given cost and battery life

  11. Basic Model (Advantages) • Groupware support • Centralizing data from many users enables groupware applications such as • Calendars • Teleconferencing systems, and so on

  12. Basic Model (Advantages) • Lower overall cost • Overall cost is hard to measure, but • Infrastructure services have an advantage over designs based on stand-alone devices • High utilization • Centralized administration reduces cost, though the savings are harder to quantify

  13. Basic Model (Advantages) • Simplified service updates • Updates without physical distribution • The most powerful long-term advantage

  14. Basic Model (Components)

  15. Basic Model (Assumptions) • The service provider has limited control over the clients and the IP network • Queries drive the service • Read-only queries greatly outnumber updates • Giant-scale services use CLUSTERS

  16. Basic Model (Components) • Clients, such as Web browsers, initiate the queries to the service • IP network, the public Internet or a private network, provides access to the service • Load manager provides indirection between the service’s external name and the servers’ physical names (IP addresses) and balances load; proxies or firewalls may sit before it • Servers combine CPU, memory, and disks into an easy-to-replicate unit • Persistent data store, a replicated or partitioned database spread across the servers; optionally external DBMSs or RAID storage • Backplane (optional) handles inter-server traffic

  17. Basic Model (Load Management) • Round-robin DNS • “Layer-4” switches • Understand TCP and port numbers • “Layer-7” switches • Parse URLs • Custom “front-end” nodes • They act like service-specific “layer-7” routers • Include the clients in the load balancing • E.g., an alternative DNS or name server
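To make the routing tiers concrete, here is a minimal sketch (not from the paper) of a “layer-7”-style load manager that parses the request path and round-robins within a server pool. The pool names, URL prefixes, and addresses are invented for illustration.

```python
import itertools

class LoadManager:
    def __init__(self, pools):
        # pools: {url_prefix: [server_ip, ...]}; round-robin within each pool
        self._cycles = {prefix: itertools.cycle(servers)
                        for prefix, servers in pools.items()}

    def route(self, url_path):
        # Longest matching prefix wins, like a layer-7 switch parsing URLs
        matches = [p for p in self._cycles if url_path.startswith(p)]
        if not matches:
            raise LookupError(f"no pool for {url_path}")
        return next(self._cycles[max(matches, key=len)])

lm = LoadManager({
    "/search": ["10.0.0.1", "10.0.0.2"],   # query servers (hypothetical)
    "/":       ["10.0.1.1"],               # static content (hypothetical)
})
print(lm.route("/search?q=cnn"))  # -> 10.0.0.1, then 10.0.0.2, ...
```

A layer-4 switch would make the same pick from TCP port numbers alone; only a layer-7 device (or a custom front end) can route on the URL.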

  18. Basic Model (Load Management) • Two opposite approaches • Simple Web Farm • Search engine cluster

  19. Basic Model (Load Management): Simple Web Farm

  20. Basic Model (Load Management): Search engine cluster

  21. High Availability (general) • Like telephone, rail, or water systems • Features • Extreme symmetry • No people • Few cables • No external disks • No monitors • Inktomi, in addition • Manages the cluster offline • Limits temperature and power variations

  22. High Availability (metrics)
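Slide 22 is a figure in the original deck. For reference, the paper’s three availability metrics are:

uptime = (MTBF - MTTR) / MTBF
yield = queries completed / queries offered
harvest = data available / complete data

Harvest and yield capture what plain uptime hides: how many queries were lost and how complete the answers were.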

  23. High Availability (DQ principle) • The system’s overall capacity has a particular physical bottleneck • E.g., total I/O bandwidth, total seeks per second • DQ = data per query × queries per second = the total amount of data the system must move per second • Measurable and tunable • E.g., by adding nodes or optimizing software, OR by faults

  24. High Availability (DQ principle) • Focus on the relative DQ value, not the absolute one • Define the DQ value of your system • Normally the DQ value scales linearly with the number of nodes

  25. High Availability (DQ principle) • Analyzing the impact of faults • Focus on how a DQ reduction influences the three metrics • Applies only to data-intensive sites
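A minimal sketch of the DQ bookkeeping these slides imply. The per-node figure is invented; as the slides say, only the relative DQ value matters.

```python
DQ_PER_NODE = 100.0   # hypothetical units of (data per query x queries/sec)
NODES = 8

def dq(total_nodes, failed=0):
    """DQ normally scales linearly with the number of live nodes."""
    return (total_nodes - failed) * DQ_PER_NODE

full = dq(NODES)
degraded = dq(NODES, failed=2)
print(f"relative DQ after 2 faults: {degraded / full:.0%}")  # 75%
# The lost 25% must surface as lower yield (dropped queries),
# lower harvest (less data per query), or a mix of the two.
```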

  26. High Availability: Replication vs. Partitioning • Example: a 2-node cluster with one node down • Replication: 100% harvest, 50% yield; DQ drops by 50% (maintain D, reduce Q) • Partitioning: 50% harvest, 100% yield; DQ drops by 50% (reduce D, maintain Q)

  27. High Availability: Replication

  28. High Availability: Replication vs. Partitioning • Replication wins if the bandwidth is the same • The extra cost is in bandwidth, not in disks • Easier recovery • We might also use partial replication and randomization
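The 2-node example from slide 26 can be written down directly. A small sketch, assuming queries touch the whole data set so that relative D tracks harvest:

```python
def two_node_failure(strategy):
    """Harvest/yield for a 2-node cluster with one node down."""
    if strategy == "replication":
        # A full copy survives, but capacity halves: drop half the queries
        return {"harvest": 1.0, "yield": 0.5}
    if strategy == "partitioning":
        # All queries answered, but half the data is unreachable
        return {"harvest": 0.5, "yield": 1.0}
    raise ValueError(strategy)

for s in ("replication", "partitioning"):
    m = two_node_failure(s)
    # Either way the DQ value is cut in half; the strategies only
    # differ in whether the cut lands on D (harvest) or Q (yield).
    print(f"{s}: harvest={m['harvest']:.0%} yield={m['yield']:.0%} "
          f"relative DQ={m['harvest'] * m['yield']:.0%}")
```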

  29. High Availability: Graceful degradation • We cannot avoid saturation, because • The peak-to-average load ratio is 1.6:1 to 6:1, and it is expensive to build capacity well above the normal peak • Single-event bursts occur (e.g., online ticket sales for special events) • Faults such as power failures or natural disasters substantially reduce the overall DQ value, and the remaining nodes become saturated • So we MUST have mechanisms for degradation

  30. High Availability: Graceful degradation • The DQ principle gives us the options • Limit Q (capacity) to maintain D • Reduce D and increase Q • Focus on harvest via admission control (AC) • Reduce Q • Reduce D on dynamic databases • Or both • Cut the effective database in half (a new approach)

  31. High Availability: Graceful degradation • More sophisticated techniques, sketched below • Cost-based AC • Estimate each query’s cost • Reduce the data per query • Increase Q • Priority- (or value-) based AC • Drop low-valued queries • E.g., execute a stock trade within 60 s or the user pays no commission • Reduced data freshness • Reducing freshness reduces the work per query • Increases yield at the expense of harvest
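A minimal sketch of the two admission-control policies named above; the cost estimates, thresholds, and priority labels are all invented.

```python
def cost_based_admit(estimated_cost, cost_budget):
    """Reject queries whose estimated cost (data touched) is too high,
    lowering the average D so more queries (higher Q) fit in the same DQ."""
    return estimated_cost <= cost_budget

def priority_based_admit(priority, load, high_water=0.9):
    """Under saturation, keep only high-valued queries (e.g., a stock
    trade that must execute within 60 s)."""
    return load < high_water or priority == "high"

assert cost_based_admit(estimated_cost=5, cost_budget=10)
assert not priority_based_admit(priority="low", load=0.95)
```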

  32. High Availability: Disaster Tolerance • A combination of managing replicas and graceful degradation • How many locations? • How many replicas in each location? • Load management • A “layer-4” switch does not help with the loss of a whole cluster • Smart clients are the solution
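A minimal sketch of what makes a client “smart” here: it knows the list of replica sites and fails over across them itself, which a layer-4 switch inside a single cluster cannot do. The hostnames are hypothetical.

```python
import socket

SITES = [("sv.example.com", 80), ("ny.example.com", 80)]  # hypothetical

def connect_with_failover(sites, timeout=2.0):
    """Try each replica site in turn; survive the loss of a whole cluster."""
    for host, port in sites:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            continue  # whole site down or unreachable: try the next one
    raise ConnectionError("all replica sites are down")
```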

  33. Online Evolution & Growth • We must plan for continuous growth and frequent functionality updates • Maintenance and upgrades are controlled failures • The total DQ value lost is • ΔDQ = n · u · (average DQ per node) = DQ · u • Where n is the number of nodes and u is the time each node is offline for the online upgrade

  34. Online Evolution & Growth: Three approaches • An example for a 4-node cluster
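Slide 34 is a figure in the original deck; the paper’s three approaches are fast reboot (upgrade everything at once), rolling upgrade (one node at a time), and the big flip (half the cluster at a time). A sketch of the ΔDQ arithmetic for the 4-node example, with invented numbers:

```python
N, U, DQ_PER_NODE = 4, 10.0, 100.0  # nodes, minutes offline per node, DQ/node

def delta_dq():
    # Total lost capacity-time is the same for all three approaches:
    # delta_DQ = n * u * (DQ per node) = DQ * u
    return N * U * DQ_PER_NODE

# What differs is the shape of the loss:
#   fast reboot:     yield = 0 for u minutes (full outage, off-peak only)
#   rolling upgrade: 1/n of DQ missing for n*u minutes
#   big flip:        1/2 of DQ missing for 2*u minutes
print(delta_dq())  # 4000 capacity-minutes either way
```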

  35. Conclusions: The basic lessons learned • Get the basics right • Professional data center, layer-7 switch, symmetry • Decide on your availability metrics • Everyone must agree on the goals • Harvest and yield > uptime • Focus on MTTR at least as much as on MTBF • MTTR is easier to improve and has the same impact • Understand load redirection during faults • Data replication is insufficient; you also need excess DQ

  36. Conclusions: The basic lessons learned • Graceful degradation is a critical part • Intelligent admission control and dynamic database reduction • Use DQ analysis on all upgrades • Capacity planning • Automate upgrades as much as possible • Have a fast, simple way to revert to the older version

  37. Final Statement • Smart clients could simplify all of the above
