Explore the structure and technology of large computing systems, with insights on clusters, data centers, replication, and fault tolerance. Learn about cutting-edge research on scalability and fault management in distributed computing environments.
Tackling Challenges of Scale in Highly Available Computing Systems
Ken Birman
Dept. of Computer Science, Cornell University
Members of the group • Ken Birman • Robbert van Renesse • Einar Vollset • Krzysztof Ostrowski • Mahesh Balakrishnan • Maya Haridasan • Amar Phanishayee
Our topic • Computing systems are growing • … larger, • … and more complex, • … and we are hoping to use them in a more and more “unattended” manner • Peek under the covers of the toughest, most powerful systems that exist • Then ask: Can we discern a research agenda?
Some “factoids” • Companies like Amazon, Google, eBay are running data centers with tens of thousands of machines • Credit card companies, banks, brokerages, insurance companies close behind • Rate of growth is staggering • Meanwhile, a new rollout of wireless sensor networks is poised to take off
How are big systems structured? • Typically a “data center” of web servers • Some human-generated traffic • Some automatic traffic from WS clients • The front-end servers are connected to a pool of clustered back-end application “services” • All of this load-balanced, multi-ported • Extensive use of caching for improved performance and scalability • Publish-subscribe very popular
A glimpse inside eStuff.com
[Diagram: "front-end applications" behind a tier of load balancers (LB), each routing requests to a pool of back-end services.]
Pub-sub combined with point-to-point communication technologies like TCP
Hierarchy of sets • A set of data centers, each having • A set of services, each structured as • A set of partitions, each consisting of • A set of programs running in a clustered manner on • A set of machines … raising the obvious question: how well do platforms support hierarchies of sets?
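To make the question concrete, here is one way that hierarchy of sets might be written down as plain data types; a sketch only, with type and field names of our own choosing.

```java
import java.util.Set;

// Purely illustrative type names for the hierarchy of sets described above.
record Machine(String host) {}
record Replica(String processId, Machine machine) {}        // one cloned program
record Partition(String id, Set<Replica> replicas) {}        // a RACS
record Service(String name, Set<Partition> partitions) {}    // a RAPS
record DataCenter(String site, Set<Service> services) {}
record Platform(Set<DataCenter> dataCenters) {}
```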
A RAPS of RACS (Jim Gray)
• RAPS: a reliable array of partitioned subservices
• RACS: a reliable array of cloned server processes
[Diagram: a RAPS built from a set of RACS. Ken Birman's query, searching for "digital camera", is routed through the partition map; Pmap "B-C" resolves to the RACS {x, y, z} (equivalent replicas), and here y gets picked, perhaps based on load.]
RAPS of RACS in Data Centers
• Services are hosted at data centers but accessible system-wide
• The pmap defines the logical partitioning of services; the l2P map takes logical services to a physical server/resource pool, perhaps many to one
• Operators can control the pmap, the l2P map, and other parameters
• Large-scale multicast is used to disseminate updates
[Diagram: data centers A and B, with a query source and an update source, each routed through pmaps and the l2P map onto the server pool.]
Technology needs? • Programs will need a way to • Find the “members” of the service • Apply the partitioning function to find contacts within a desired partition • Dynamic resource management, adaptation of RACS size and mapping to hardware • Fault detection • Within a RACS we also need to: • Replicate data for scalability, fault tolerance • Load balance or parallelize tasks
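The bullets above suggest an interface roughly like the following sketch; ServiceDirectory, partitionFor and the other names are hypothetical, not an actual QuickSilver API.

```java
import java.util.List;
import java.util.Random;

// "ServiceDirectory" and its methods are illustrative names only.
interface ServiceDirectory {
    // Current membership of a named service (the whole RAPS).
    List<String> membersOf(String serviceName);

    // Apply the partitioning function: which RACS (set of cloned replicas)
    // is responsible for this key?
    List<String> partitionFor(String serviceName, String key);

    // Fault detection and dynamic adaptation surface as membership events.
    void onMembershipChange(String serviceName, Runnable handler);
}

class ReplicaPicker {
    private static final Random RAND = new Random();

    // Find the right partition for a key, then load-balance within it
    // (here, trivially, by picking a random replica).
    static String pickReplica(ServiceDirectory dir, String service, String key) {
        List<String> racs = dir.partitionFor(service, key);
        return racs.get(RAND.nextInt(racs.size()));
    }
}
```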
• Membership: within a RACS; of the service; of the services in the data centers
• Communication: point-to-point; multicast
• Resource management: pool of machines; set of services; subdivision into RACS
• Fault-tolerance
• Consistency
Scalability makes this hard!
… hard in what sense? • Sustainable workload often drops at least linearly in system size • And this happens because overheads grow worse than linearly (quadratic is common) • Reasons vary… but share a pattern: • Frequency of “disruptive” events rises with scale • Protocols have property that whole system is impacted when these events occur
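A back-of-the-envelope version of this pattern, using assumed constants c1 and c2 rather than anything measured:

```latex
% Assumed, not measured: disruptive events arrive at rate c_1 N and each
% event triggers O(N) protocol work somewhere in the system.
\text{overhead}(N)\;\approx\;\underbrace{c_1 N}_{\text{event rate}}\times\underbrace{c_2 N}_{\text{cost per event}}\;=\;c_1 c_2 N^2,
\qquad
\text{useful work per node}\;\approx\;W-\frac{c_1 c_2 N^2}{N}\;=\;W-c_1 c_2 N .
```

Under those assumptions total overhead grows quadratically, so the sustainable per-node workload falls at least linearly as the system grows, matching the first bullet.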
QuickSilver project • We’ve been building a scalable infrastructure addressing these needs • Consists of: • Some existing technologies, notably Astrolabe, gossip “repair” protocols • Some new technology, notably a new publish-subscribe message bus and a new way to automatically create a RAPS of RACS for time-critical applications
Gossip 101 • Suppose that I know something • I’m sitting next to Fred, and I tell him • Now 2 of us “know” • Later, he tells Mimi and I tell Anne • Now 4 • This is an example of a push epidemic • Push-pull occurs if we exchange data
Gossip scales very nicely • Participants' loads independent of size • Network load linear in system size • Information spreads in log(system size) time
[Plot: % infected vs. time, an S-curve rising from 0.0 to 1.0]
Gossip in distributed systems • We can gossip about membership • Need a bootstrap mechanism, but then discuss failures, new members • Gossip to repair faults in replicated data • “I have 6 updates from Charlie” • If we aren’t in a hurry, gossip to replicate data too
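A minimal sketch of what one push-pull exchange might look like, assuming versioned key/value replicas; this is illustrative only, not code from our systems.

```java
import java.util.HashMap;
import java.util.Map;

class GossipNode {
    // key -> latest version number and value seen so far
    final Map<String, Long> versions = new HashMap<>();
    final Map<String, String> values = new HashMap<>();

    void learn(String key, long version, String value) {
        if (version > versions.getOrDefault(key, -1L)) {
            versions.put(key, version);
            values.put(key, value);
        }
    }

    // Push-pull: I push everything I know, then pull everything the peer knows.
    void exchangeWith(GossipNode peer) {
        for (Map.Entry<String, Long> e : new HashMap<>(versions).entrySet())
            peer.learn(e.getKey(), e.getValue(), values.get(e.getKey()));      // push
        for (Map.Entry<String, Long> e : new HashMap<>(peer.versions).entrySet())
            learn(e.getKey(), e.getValue(), peer.values.get(e.getKey()));      // pull
    }
}
```

Each round, every node picks a random peer and calls exchangeWith; per-participant load stays constant while an update reaches essentially all nodes in O(log N) rounds.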
Bimodal Multicast
• Send multicasts to report events
• Periodically, but not synchronously, gossip about messages
[Diagram: one process notices that the gossip source has a message from Mimi that it is missing, while the source seems to be missing two messages from Charlie that it has. It replies: "Here are some messages from Charlie that might interest you. Could you send me a copy of Mimi's 7th message?" Some messages don't get through the initial multicast; gossip repairs the gaps. Mimi's 7th message was "The meeting of our Q exam study group will start late on Wednesday…"]
ACM TOCS 1999
Stock Exchange Problem: reliable multicast is too "fragile"
Most members are healthy… but one is slow
The problem gets worse as the system scales up
[Plot: average throughput on non-perturbed members vs. perturb rate (0 to 0.9) for virtually synchronous Ensemble multicast protocols, group sizes 32, 64, and 96; throughput falls from roughly 250 toward 0 as the perturb rate rises, and larger groups collapse faster.]
Bimodal multicast with perturbed processes
[Plot: bimodal multicast scales well and sustains its throughput; traditional multicast throughput collapses under stress.]
Bimodal Multicast • Imposes a constant overhead on participants • Many optimizations and tricks needed, but nothing that isn't practical to implement • Hardest issues involve "biased" gossip to handle LANs connected by WAN long-haul links • Reliability is easy to analyze mathematically using epidemic theory • Use the theory to derive optimal parameter settings • Theory also lets us predict behavior • Despite the simplified model, the predictions work!
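As a flavor of the epidemic-theory calculation involved, here is a textbook push-gossip estimate, not the exact analysis from the TOCS paper:

```latex
% n processes; each round, every infected process gossips to one peer chosen
% uniformly at random. With I_t processes infected at round t:
\mathbb{E}[I_{t+1}] \;\approx\; I_t + (n - I_t)\Bigl(1 - \bigl(1 - \tfrac{1}{n}\bigr)^{I_t}\Bigr),
% which roughly doubles while I_t \ll n, so nearly every process is infected
% after O(\log n) rounds. If gossip continues for c further rounds once the
% system is saturated, the chance a given process still misses the message is
\Pr[\text{miss}] \;\approx\; \bigl(1 - \tfrac{1}{n}\bigr)^{c\,n} \;\approx\; e^{-c},
% the "bimodal" outcome: almost-everyone delivery, with a residual failure
% probability driven down exponentially by the parameter c.
```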
Kelips • A distributed "index" • Put("name", value) • Get("name") • Kelips can do lookups with one RPC and is self-stabilizing after disruption
Kelips: take a collection of "nodes", e.g. 110, 230, 202, 30.
Kelips: map nodes to affinity groups 0 … √N−1 (peer membership through a consistent hash), with roughly √N members per affinity group.
Kelips: each node keeps an affinity group view, pointers to the other members of its own group; here 110 knows about 230, 30, …
Kelips: each node also keeps a few contacts in every foreign affinity group; here 202 is a "contact" for 110 in group 2.
Kelips: "cnn.com" maps to group 2, so 110 tells group 2 to "route" inquiries about cnn.com to it. The resulting resource tuple is replicated cheaply within the group by the gossip protocol.
Kelips: to look up "cnn.com", just ask some contact in group 2. It returns "110" (or forwards your request).
IP2P, ACM TOIS (submitted)
Kelips • Per-participant loads are constant • Space required grows as O(√N) • Finds an object in “one hop” • Most other DHTs need log(N) hops • And isn’t disrupted by churn, either • Most other DHTs are seriously disrupted when churn occurs and might even “fail”
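A sketch of how a Kelips-style one-hop lookup might look in code; the class and method names are ours, and the hash below is just a stand-in for the consistent hash Kelips actually uses.

```java
import java.util.List;
import java.util.Map;

class KelipsView {
    int numGroups;                        // roughly sqrt(N) affinity groups
    Map<Integer, List<String>> contacts;  // a few contacts per foreign group
    Map<String, String> localTuples;      // resource tuples replicated in my own group

    // Stand-in for the consistent hash that maps a name to an affinity group.
    int groupOf(String name) {
        return Math.floorMod(name.hashCode(), numGroups);
    }

    // One-hop lookup: either my own group already replicates the tuple,
    // or a single RPC to a contact in the responsible group resolves it.
    String lookup(String name) {
        String owner = localTuples.get(name);
        if (owner != null) return owner;
        String contact = contacts.get(groupOf(name)).get(0);
        return rpcGet(contact, name);     // the contact answers or forwards
    }

    String rpcGet(String contact, String name) {
        throw new UnsupportedOperationException("network call elided in this sketch");
    }
}
```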
Astrolabe: Distributed Monitoring
• A row can have many columns
• Total size should be kilobytes, not megabytes
• A configuration certificate determines what data is pulled into the table (and can change)
[Table: one row per machine, with numeric attributes such as load (e.g. 1.9, 2.1, 1.8, 3.1, 0.9, 0.8, 1.1, 5.3, 3.6, 2.7).]
ACM TOCS 2003
State Merge: Core of Astrolabe epidemic
[Animation: swift.cs.cornell.edu and cardinal.cs.cornell.edu gossip; each adopts the fresher version of every row, so their tables converge.]
Scaling up… and up… • With a stack of domains, we don't want every system to "see" every domain • The cost would be huge • So instead, we'll see a summary
Build a hierarchy using a P2P protocol that "assembles the puzzle" without any servers. An SQL query "summarizes" the data, and the dynamically changing query output is visible system-wide.
[Diagram: leaf tables for New Jersey and San Francisco feeding a summary table.]
(1) The query goes out… (2) each region computes locally… (3) results flow to the top level of the hierarchy.
[Diagram: the three steps shown across the New Jersey and San Francisco regions.]
The hierarchy is virtual… the data is replicated.
[Diagram: the New Jersey / San Francisco hierarchy exists only as replicated data; there are no servers at the inner nodes.]
ACM TOCS 2003
Astrolabe • Load on participants, in the worst case, grows as log_rsize(N), i.e. log of N with the region size as the base • Most participants see a constant, low load • Incredibly robust, self-repairing • Information is visible in log time • And we can reconfigure or change the aggregation query in log time, too • Well matched to data mining
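To illustrate the aggregation idea, here is a rough sketch of a zone summarizing its children's rows, playing the role of an Astrolabe aggregation query such as "SELECT MIN(load), SUM(machines) FROM children"; the field names are illustrative, not Astrolabe's schema.

```java
import java.util.List;

// One "row" per child zone or machine; field names are ours.
record Row(String name, double load, int machines) {}

class ZoneSummary {
    // The summary becomes a single row in the parent zone's table.
    static Row summarize(String zoneName, List<Row> children) {
        double minLoad = children.stream().mapToDouble(Row::load).min().orElse(0.0);
        int machines = children.stream().mapToInt(Row::machines).sum();
        return new Row(zoneName, minLoad, machines);
    }
}
```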
QuickSilver: Current work • One goal is to offer scalable support for: • Publish(“topic”, data) • Subscribe(“topic”, handler) • Topic associated w/ protocol stack, properties • Many topics… hence many protocol stacks (communication groups) • Quicksilver scalable multicast is running now and demonstrates this capability in a web services framework • Primary developer is Krzys Ostrowski
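From the application's point of view, the interface is essentially the two calls listed above; a hedged sketch of how that might look, not the actual QuickSilver API:

```java
import java.util.function.Consumer;

// Hypothetical interface mirroring the publish/subscribe calls above.
interface ScalableBus {
    void publish(String topic, byte[] data);
    void subscribe(String topic, Consumer<byte[]> handler);
}

class QuoteExample {
    static void run(ScalableBus bus) {
        // Each topic is backed by its own protocol stack / communication group.
        bus.subscribe("quotes/IBM",
                bytes -> System.out.println("IBM quote: " + new String(bytes)));
        bus.publish("quotes/IBM", "83.15".getBytes());
    }
}
```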
Tempest • This project seeks to automate a new drag-and-drop style of clustered application development • Emphasis is on time-critical response • You start with a relatively standard web service application having good timing properties (inheriting from our data class) • Tempest automatically clones services, places them, load-balances, repairs faults • Uses Ricochet protocol for time-critical multicast
Ricochet • Core protocol underlying Tempest • Delivers a multicast with • Probabilistically strong timing properties • Three orders of magnitude faster than prior record! • Probability-one reliability, if desired • Key idea is to use FEC and to exploit patterns of numerous, heavily overlapping groups. • Available for download from Cornell as a library (coded in Java)
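A toy illustration of the FEC ingredient: a single XOR "repair" packet over r data packets lets a receiver that lost exactly one of them reconstruct it locally. Ricochet's actual encoding, and the way repair traffic is spread across the overlapping groups, are more involved.

```java
// Assumes all packets have the same length.
class XorRepair {
    // The repair packet is the XOR of all r data packets.
    static byte[] repairPacket(byte[][] packets) {
        byte[] repair = new byte[packets[0].length];
        for (byte[] p : packets)
            for (int i = 0; i < repair.length; i++) repair[i] ^= p[i];
        return repair;
    }

    // XOR-ing the repair packet with every packet that did arrive
    // leaves exactly the one missing packet.
    static byte[] recover(byte[] repair, byte[][] received) {
        byte[] missing = repair.clone();
        for (byte[] p : received)
            for (int i = 0; i < missing.length; i++) missing[i] ^= p[i];
        return missing;
    }
}
```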
Our system will be used in… • Massive data centers • Distributed data mining • Sensor networks • Grid computing • Air Force “Services Infosphere”
Next major project? • We're starting a completely new effort • The goal is to support a new generation of mobile platforms that can collaborate, learn, and query a surrounding mesh of sensors using wireless ad-hoc communication • Stefan Pleisch has worked on the mobile query problem. Einar Vollset and Robbert van Renesse are building the new mobile platform software. Epidemic gossip remains our key idea…
Summary • Our project builds software • Software that real people will end up running • But we tell users when it works, and we prove it! • The focus lately is on scalability and QoS • Theory, engineering, experiments and simulation • For scalability, set probabilistic goals and use epidemic protocols • But the outcome will be real systems that we believe will be widely used.