  1. G22.3250-001 Distributed Data Structures for Internet Services Robert Grimm New York University (with some slides by Steve Gribble)

  2. Altogether Now: The Three Questions • What is the problem? • What is new or different or notable? • What are the contributions and limitations?

  3. Clusters, Clusters, Clusters • Let’s broaden the goals for cluster-based services • Incremental scalability • High availability • Operational manageability • And also data consistency • But what to do if the data has to be persistent? • TACC works best for read-only data • Porcupine works best for a limited group of services • Email, news, bulletin boards, calendaring

  4. Enter Distributed Data Structures (DDS) • In-memory, single-site application interface • Persistent, distributed, replicated implementation • Clean consistency model • Atomic operations (but no transactions) • Independent of accessing nodes (functional homogeneity)
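
To make the interface concrete, here is a minimal Java sketch of what service code might see. The interface name and method signatures are hypothetical, not the actual DDS API; they are only meant to capture the single-site look of the interface, per-operation atomicity, and functional homogeneity.

```java
// Illustrative sketch only: the interface and signatures are hypothetical,
// not the DDS paper's API. Service code sees an ordinary, single-site hash
// table; the implementation behind it is persistent, partitioned, and
// replicated across the cluster.
public interface DistributedHashtable {
    // Each operation is atomic across replicas (all-or-nothing),
    // but there are no multi-operation transactions.
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);   // atomically updates every replica
    void remove(byte[] key);
    // Functional homogeneity: any cluster node can service any call,
    // and the result does not depend on which node the client uses.
}
```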

  5. DDS’s as an Intermediate Design Point • Relational databases • Strong guarantees (ACID) • But also high overhead, complexity • Logical structure very much independent of physical layout • Distributed data structures • Atomic operations, one-copy equivalence • Familiar, frequently used interface: hash table, tree, log • Distributed file systems • Weak guarantees (e.g., close/open consistency) • Low-level interface with little data independence • Applications impose structure on directories, files, bytes

  6. Design Principles • Separate concerns • Service code implements application • Storage management is reusable, recoverable • Appeal to properties of clusters • Generally secure and well-administered • Fast network, uninterruptible power • Design for high throughput and high concurrency • Use event-driven implementation • Make it easy to compose components • Make it easy to absorb bursts (in event queues)
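
A rough sketch of the event-driven, queue-composed style these principles call for; the Stage and Event classes below are invented for illustration and are not the paper's code. Each stage drains a bounded input queue (which absorbs bursts) and hands results to the next stage's queue, so components compose simply by chaining queues.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of an event-driven stage: pull an event from a
// bounded queue, handle it, push the result to the next stage's queue.
final class Stage implements Runnable {
    private final BlockingQueue<Event> in;   // bounded: absorbs bursts
    private final BlockingQueue<Event> out;  // next stage's input

    Stage(int capacity, BlockingQueue<Event> out) {
        this.in = new ArrayBlockingQueue<>(capacity);
        this.out = out;
    }

    boolean enqueue(Event e) { return in.offer(e); }  // non-blocking handoff

    public void run() {
        try {
            while (true) {
                Event e = in.take();      // one event at a time
                out.put(handle(e));       // compose stages by chaining queues
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }

    private Event handle(Event e) { /* service-specific work goes here */ return e; }
}

final class Event { /* payload omitted for brevity */ }
```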

  7. Assumptions • No network partitions within cluster • Highly redundant network • DDS components are fail-stop • Components implemented to terminate themselves • Failures are independent • Messaging is synchronous • Bounded time for delivery • Workload has no extreme hotspots (for hash table) • Population density over key space is even • Working set of hot keys is larger than # of cluster nodes

  8. Distributed Hash Tables (in a Cluster…)

  9. DHT Architecture

  10. Cluster-Wide Metadata Structures

  11. Metadata Maps • Why is two-phase commit acceptable for DDS's?
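
For reference, here is a bare-bones sketch of two-phase commit across a replica group; the Replica interface and method names are assumptions for illustration, not the DDS implementation. Under the assumptions above (no partitions, fail-stop bricks, bounded synchrony), the usual objections to 2PC, blocking and coordinator failure, are resolved by recovery rather than by a heavier protocol, which is a large part of why it is acceptable here.

```java
import java.util.List;

// Bare-bones, hypothetical sketch of two-phase commit over a replica group.
final class TwoPhaseWrite {
    interface Replica {
        boolean prepare(byte[] key, byte[] value);  // phase 1: stage the write
        void commit(byte[] key);                    // phase 2: make it visible
        void abort(byte[] key);                     //          or discard it
    }

    static boolean write(List<Replica> group, byte[] key, byte[] value) {
        for (Replica r : group) {
            if (!r.prepare(key, value)) {           // any refusal aborts all
                for (Replica a : group) a.abort(key);
                return false;
            }
        }
        for (Replica r : group) r.commit(key);      // unanimous: commit all
        return true;
    }
}
```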

  12. Recovery

  13. Experimental Evaluation • Cluster of 28 2-way SMPs and 38 4-way SMPs • For a total of 208 500 MHz Pentium CPUs • 2-way SMPs: 500 MB RAM, 100 Mb/s switched Ethernet • 4-way SMPs: 1 GB RAM, 1 Gb/s switched Ethernet • Implementation written in Java • Sun’s JDK 1.1.7v3, OpenJIT, Linux user-level threads • Load generators run within cluster • 80 nodes necessary to saturate 128 storage bricks

  14. Scalability: Reads and Writes

  15. Graceful Degradation (Reads)

  16. Unexpected Imbalance (Writes) What’s going on?

  17. Capacity

  18. Recovery Behavior [graph; annotations: normal GC in action, 1 brick fails, recovery, buffer cache warm-up]

  19. So, All Is Good?

  20. Assumptions Considered Harmful! • Central insight, based on experience with DDS • “Any system that attempts to gain robustness solely through precognition is prone to fragility” • In other words • Complex systems are too complex to understand completely, especially when they operate outside their expected range

  21. Assumptions in Action • Bounded synchrony • Timeout four orders of magnitude higher than the common-case round-trip time • But garbage collection may take a very long time • The result is a catastrophic drop in throughput • Independent failures • Race condition in two-phase commit caused a latent memory leak (10 KB/minute under normal operation) • All bricks failed predictably within 10-20 minutes of each other • After all, they were started at about the same time • The result is a catastrophic loss of data

  22. Assumptions in Action (cont.) • Fail-stop components • Session layer uses synchronous connect() method • Another graduate student adds firewalled machine to cluster, resulting in nodes locking up for 15 minutes at a time • The result is a catastrophic corruption of data
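
One way to rule out this particular failure mode is to bound every connect attempt explicitly. The sketch below uses java.net.Socket's connect-with-timeout overload, which postdates the JDK 1.1.7 used in the paper, so treat it as an illustration of the idea rather than the authors' fix; the timeout value is up to the caller.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: bound the connect attempt with an explicit timeout instead of
// relying on the OS default, so an unreachable (e.g., firewalled) peer
// cannot stall the session layer indefinitely.
final class BoundedConnect {
    static Socket connect(String host, int port, int timeoutMillis) throws IOException {
        Socket s = new Socket();
        try {
            s.connect(new InetSocketAddress(host, port), timeoutMillis);
            return s;
        } catch (IOException e) {
            s.close();      // treat the peer as failed instead of hanging
            throw e;
        }
    }
}
```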

  23. What Can We Do? • Systematically overprovision the system • But doesn’t that mean predicting the future, again? • Use admission control • But this can still result in livelock, only later… • Build introspection into the system • Need to easily quantify behavior in order to adapt • Close the control loop • Make the system adapt automatically (but see previous) • Plan for failures • Use transactions, checkpoint frequently, reboot proactively
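
As a small illustration of the admission-control bullet, the sketch below sheds load once a fixed number of requests is in flight; the class name and threshold are hypothetical. As the slide warns, a static limit only postpones livelock unless it is driven by introspection and closed-loop adaptation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical admission controller: reject new requests explicitly once
// too many are in flight, rather than letting queues grow without bound.
final class AdmissionController {
    private final int limit;
    private final AtomicInteger inFlight = new AtomicInteger();

    AdmissionController(int limit) { this.limit = limit; }

    boolean tryAdmit() {
        if (inFlight.incrementAndGet() > limit) {
            inFlight.decrementAndGet();
            return false;            // shed load early and explicitly
        }
        return true;
    }

    void done() { inFlight.decrementAndGet(); }
}
```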

  24. What Do You Think?
