Navigating in the Dark: New Options for Building Self-Configuring Embedded Systems

Presentation Transcript


  1. Navigating in the Dark: New Options for Building Self-Configuring Embedded Systems Ken Birman, Cornell University

  2. A sea change… • We’re looking at a massive change in the way we use computers • Today, we’re still very client-server oriented (Web Services will continue this) • Tomorrow, many important applications will use vast numbers of small sensors • And even standard wired systems should sometimes be treated as sensor networks

  3. Characteristics? • Large numbers of components, substantial rate of churn • … failure is common in any large deployment, small nodes are fragile (flawed assumption?) and connectivity may be poor • Need to self-configure • For obvious, practical reasons

  4. Can we map this problem to the previous one? • Not clear how we could do so • Sensors can often capture lots of data (think: Foveon 6Mb optical chip, 20 fps) • … and may even be able to process that data on chip • But rarely have the capacity to ship the data to a server (power, signal limits)

  5. This spells trouble! • The way we normally build distributed systems is mismatched to the need! • The clients of a Web Services or similar system are second-tier citizens • And they operate in the dark • About one-another • And about network/system “state” • Building sensor networks this way won’t work

  6. Is there an alternative? • We see hope in what are called peer-to-peer and “epidemic” communications protocols! • Inspired by work on (illegal) file sharing • But we’ll aim at other kinds of sharing • Goals: scalability, stability despite churn, low load and power consumption • Must overcome the tendency of many P2P technologies to be disabled by churn

  7. Astrolabe • Intended as help for applications adrift in a sea of information • Structure emerges from a randomized peer-to-peer protocol • This approach is robust and scalable even under extreme stress that cripples more traditional approaches • Developed at Cornell by Robbert van Renesse, with many others helping… • Just an example of the kind of solutions we need

  8. [Diagram] Astrolabe builds a hierarchy using a P2P protocol that “assembles the puzzle” without any servers. Dynamically changing query output is visible system-wide; an SQL query “summarizes” the data. (Regions shown: New Jersey, San Francisco.)

  9. Astrolabe in a single domain • Each node owns a single tuple, like a management information base (MIB) • Nodes discover one another through a simple broadcast scheme (“anyone out there?”) and gossip about membership • Nodes also keep replicas of one another’s rows • Periodically, merge your state with someone else chosen uniformly at random…
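
A minimal sketch of this merge step, assuming each row carries a version counter and the fresher copy wins (the class and attribute names here are illustrative, not Astrolabe's actual API):

    import random

    class AstrolabeNode:
        """Illustrative single-region participant: one row per node."""

        def __init__(self, name, peers):
            self.name = name
            self.peers = peers                      # other nodes in this region
            self.rows = {name: {"load": 0.0, "version": 0}}

        def update_own_row(self, **attrs):
            """Only the owner writes its row; each write bumps the version."""
            self.rows[self.name].update(attrs)
            self.rows[self.name]["version"] += 1

        def gossip_once(self):
            """Run periodically: merge state with a peer chosen uniformly at random."""
            peer = random.choice(self.peers)
            peer.merge(self.rows)
            self.merge(peer.rows)

        def merge(self, remote_rows):
            """Keep whichever copy of each row has the higher version."""
            for owner, row in remote_rows.items():
                mine = self.rows.get(owner)
                if mine is None or row["version"] > mine["version"]:
                    self.rows[owner] = dict(row)

The sketch omits the broadcast-based discovery and failure handling described above; its only point is that each node's work per gossip round is small and constant.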

  10.–12. State Merge: Core of Astrolabe epidemic [animation frames: swift.cs.cornell.edu and cardinal.cs.cornell.edu gossip, exchange their rows, and merge state]

  13. Observations • Merge protocol has constant cost • One message sent and received (on average) per unit time • The data changes slowly, so no need to run it quickly – we usually run it every five seconds or so • Information spreads in O(log N) time • But this assumes bounded region size • In Astrolabe, we limit regions to 50-100 rows
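
A toy simulation makes the O(log N) claim concrete (simple push gossip within one bounded region; this is a model, not Astrolabe code):

    import random

    def rounds_to_spread(n, trials=200):
        """Each informed node pushes to one random peer per round; count rounds until all n know."""
        total = 0
        for _ in range(trials):
            informed = {0}
            rounds = 0
            while len(informed) < n:
                for _node in list(informed):
                    informed.add(random.randrange(n))
                rounds += 1
            total += rounds
        return total / trials

    # Doubling the region size adds only a couple of rounds, i.e. roughly logarithmic growth.
    print(rounds_to_spread(64), rounds_to_spread(128))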

  14. Big system will have many regions • Astrolabe is usually configured by a manager who places each node in some region, but we are also playing with ways to discover structure automatically • A big system could have many regions • Looks like a pile of spreadsheets • A node only replicates data from its neighbors within its own region

  15. Scaling up… and up… • With a stack of domains, we don’t want every system to “see” every domain • Cost would be huge • So instead, we’ll see a summary [diagram: cardinal.cs.cornell.edu’s summarized view]

  16. [Diagram, repeated] Astrolabe builds a hierarchy using a P2P protocol that “assembles the puzzle” without any servers. Dynamically changing query output is visible system-wide; an SQL query “summarizes” the data. (Regions shown: New Jersey, San Francisco.)

  17. Large scale: “fake” regions • These are • Computed by queries that summarize a whole region as a single row • Gossiped in a read-only manner within a leaf region • But who runs the gossip? • Each region elects “k” members to run gossip at the next level up. • Can play with selection criteria and “k”
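ate
As an illustration, a region’s summary row might be produced like this (made-up attribute names; real Astrolabe expresses the same computation as an SQL aggregation query carried with the data):

    def summarize_region(rows):
        """Collapse a region's rows into one read-only summary row.

        Roughly what an SQL aggregate such as
            SELECT MIN(free_mb), AVG(load), COUNT(*) ...
        would produce for the region.
        """
        return {
            "free_mb": min(r["free_mb"] for r in rows),
            "load":    sum(r["load"] for r in rows) / len(rows),
            "members": len(rows),
        }

    region = [{"free_mb": 900, "load": 0.2}, {"free_mb": 250, "load": 0.7}]
    print(summarize_region(region))   # {'free_mb': 250, 'load': 0.45, 'members': 2}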

  18.–19. Hierarchy is virtual… data is replicated [diagram frames: regions New Jersey and San Francisco, showing the replicated hierarchy]

  20. Worst case load? • A small number of nodes end up participating in O(log_fanout N) epidemics • Here the fanout is something like 50, so even a million-node system needs only about four levels • In each epidemic, a message is sent and received roughly every 5 seconds • We limit message size so even during periods of turbulence, no message can become huge • Instead, data would just propagate slowly • Haven’t really looked hard at this case

  21. Astrolabe is a good fit • No central server • Hierarchical abstraction “emerges” • Moreover, this abstraction is very robust • It scales well, disruptions don’t destabilize the system, and it looks consistent in the eyes of varied beholders • Each participant runs a trivial P2P protocol • Supports distributed data aggregation and data mining. Adaptive and self-repairing…

  22. Data Mining • We can data mine using Astrolabe • The “configuration” and “aggregation” queries can be dynamically adjusted • So we can use the former to query sensors in customizable ways (in parallel) • … and then use the latter to extract an efficient summary that won’t cost an arm and a leg to transmit
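
A hypothetical query pair illustrates the two roles (the table and column names are invented for illustration, not Astrolabe’s schema):

    # Configuration query: decides what each sensor contributes to its own row.
    configuration_query = (
        "SELECT sensor_id, temperature, battery "
        "FROM local_readings WHERE temperature > 40"
    )

    # Aggregation query: summarizes a whole region into a single cheap-to-ship row.
    aggregation_query = (
        "SELECT COUNT(*) AS hot_sensors, MAX(temperature) AS peak FROM region_rows"
    )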

  23. Costs are basically constant! • Unlike many systems that experience load surges and other kinds of load variability, Astrolabe is stable under all conditions • Stress doesn’t provoke surges of message traffic • And Astrolabe remains active even while those disruptions are happening

  24. Other such abstractions • Scalable probabilistically reliable multicast based on P2P (peer-to-peer) epidemics: Bimodal Multicast • Some of the work on P2P indexing structures and file storage: Kelips • Overlay networks for end-to-end IP-style multicast: Willow

  25. Challenges • We need to do more work on • Real-time issues: right now Astrolabe is highly predictable but somewhat slow • Security (including protection against malfunctioning components) • Scheduled downtime (sensors do this quite often today; maybe less an issue in the future)

  26. Communication locality • Important in sensor networks, where messages to distant machines need costly relaying • Astrolabe does most of its communication with neighbors • Close to Kleinberg’s small worlds structure for remote gossip
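
A sketch of one way to bias partner selection toward nearby peers while keeping a few long-range contacts, in the spirit of Kleinberg’s construction (the distance function and exponent here are placeholders):

    import random

    def pick_gossip_partner(my_pos, peers, positions, alpha=1.0):
        """Choose a peer with probability proportional to 1/d**alpha.

        Kleinberg's small-worlds result suggests matching alpha to the
        dimension of the layout (1 for the 1-D positions used here).
        """
        weights = []
        for p in peers:
            d = max(1, abs(positions[p] - my_pos))
            weights.append(1.0 / d ** alpha)
        return random.choices(peers, weights=weights, k=1)[0]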

  27. Conclusions? • We’re near a breakthrough: sensor networks that behave like sentient infrastructure • They sense their own state and adapt • Self-configure and self-repair • Incredibly exciting opportunities ahead • Cornell has focused on probabilistically scalable technologies and built real systems while also exploring theoretical analyses

  28. Extra slides • Just to respond to questions

  29. Bimodal Multicast • A technology we developed several years ago • Our goal was to get better scalability without abandoning reliability

  30. Multicast historical timeline • 1980s: IP multicast, anycast, other best-effort models • Cheriton: V system • IP multicast is a standard Internet protocol • Anycast never really made it

  31. Multicast historical timeline • 1980s: IP multicast, anycast, other best-effort models • 1987-1993: Virtually synchronous multicast takes off (Isis, Horus, Ensemble, but also many other systems, like Transis, Totem, etc.). Used in many settings today, but no single system “won” • The Isis Toolkit was used by the New York Stock Exchange, the Swiss Exchange, the French Air Traffic Control System, and AEGIS radar control and communications

  32. Multicast historical timeline • 1980s: IP multicast, anycast, other best-effort models • 1987-1993: Virtually synchronous multicast takes off (Isis, Horus, Ensemble, but also many other systems, like Transis, Totem, etc.). Used in many settings today, but no single system “won” • 1995-2000: Scalability issues prompt a new generation of scalable solutions (SRM, RMTP, etc.). Cornell’s contribution was Bimodal Multicast, aka “pbcast”

  33. Virtual Synchrony Model [diagram: processes p, q, r, s, t move through the view sequence G0={p,q}, G1={p,q,r,s}, G2={q,r,s}, G3={q,r,s,t}: r and s request to join and are added with state transfer, p fails (crash), then t requests to join and is added with state transfer] … to date, the only widely adopted model for consistency and fault-tolerance in highly available networked applications

  34. Virtual Synchrony scaling issue [graph: average throughput on non-perturbed members (0–250) vs. perturb rate (0–0.9) for virtually synchronous Ensemble multicast protocols at group sizes 32, 64, and 96; throughput falls off as the perturb rate and group size grow]

  35. Bimodal Multicast • Uses some sort of best effort dissemination protocol to get the message “seeded” • E.g. IP multicast, or our own tree-based scheme running on TCP but willing to drop packets if congestion occurs • But some nodes log messages • We use a DHT scheme • Detect a missing message? Recover from a log server that should have it…

  36. Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So initial state involves partial distribution of multicast(s)

  37. Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesn’t include them.

  38. Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip

  39. Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
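
A compressed sketch of one such round, assuming messages are identified by (sender, sequence number); the real protocol bounds retransmissions, garbage-collects old messages, and uses the logging nodes described earlier:

    import random

    class PbcastNode:
        """Toy bimodal-multicast participant."""

        def __init__(self, name, peers):
            self.name = name
            self.peers = peers
            self.delivered = {}            # (sender, seqno) -> message body

        def digest(self):
            """The digest only names messages; it never carries their bodies."""
            return set(self.delivered)

        def gossip_round(self):
            """Every ~100 ms: send a digest to one randomly selected member."""
            peer = random.choice(self.peers)
            peer.receive_digest(self.digest(), from_node=self)

        def receive_digest(self, remote_digest, from_node):
            """Compare against our own history and solicit anything missing."""
            for msg_id in remote_digest - set(self.delivered):
                # The solicited node responds by retransmitting the message.
                self.delivered[msg_id] = from_node.delivered[msg_id]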

  40. Figure 5: Graphs of analytical results

  41. Our original degradation scenario

  42. Distributed Indexing • Goal is to find a copy of Norah Jones’ “I don’t know” • Index contains, for example, (machine-name, object) pairs • Operations to search and update

  43. History of the problem • Very old: Internet DNS does this • But DNS lookup is by machine name • We want the inverted map • Napster was the first really big hit • 5 million users at one time • The index itself was centralized. • Used peer-to-peer file copying once a copy was found (many issues arose…)

  44. Hot academic topic today • Systems based on a virtual ring • MIT: Chord system (Karger, Kaashoek) • Many systems use Plaxton radix search • Rice: Pastry (Druschel, Rowstron) • Berkeley: Tapestry • Cornell: a scheme that uses replication • Kelips

  45. Kelips idea? • Treat the system as sqrt(N) “affinity” groups of size sqrt(N) each • Any given index entry is mapped to a group and replicated within it • O(log N) time delay to do an update • Could accelerate this with an unreliable multicast • To do a lookup, find a group member (or a few of them) and ask for item • O(1) lookup
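
A sketch of the core mapping, assuming a hash (SHA-1 here) places both nodes and index entries into one of roughly sqrt(N) affinity groups; the real Kelips also gossips membership and ages out stale entries:

    import hashlib
    import math

    def affinity_group(key, n_nodes):
        """Map a node or object name to one of ~sqrt(N) affinity groups."""
        k = max(1, int(math.sqrt(n_nodes)))
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return h % k

    class KelipsIndex:
        """Toy view of the replicated per-group index: object -> machine."""

        def __init__(self, n_nodes):
            self.n_nodes = n_nodes
            self.groups = {}                     # group id -> {object: machine}

        def insert(self, obj, machine):
            g = affinity_group(obj, self.n_nodes)
            self.groups.setdefault(g, {})[obj] = machine

        def lookup(self, obj):
            """O(1): ask a contact in the object's group for the entry."""
            g = affinity_group(obj, self.n_nodes)
            return self.groups.get(g, {}).get(obj)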

  46. Why Kelips? • Other schemes have O(log N) lookup delay • This is quite a high cost in practical settings • Others also have fragile data structures • Background reorganization costs soar under stress, churn, flash loads • Kelips has a completely constant load!

  47. Solutions that share properties • Scalable • Robust against localized disruption • Have emergent behavior we can reason about, exploit in the application layer • Think of the way a hive of insects organizes itself or reacts to stimuli. There are many similarities

  48. Revisit our goals • Are these potential components for sentient systems? • Middleware that perceives the state of the network • It represents this knowledge in a form smart applications can exploit • Although built from large numbers of rather dumb components, the emergent behavior is intelligent. These applications are more robust, more secure, more responsive than any individual component • When something unexpected occurs, they can diagnose the problem and trigger a coordinated distributed response • They repair themselves after damage • We seem to have the basis from which to work!

  49. Brings us full circle • Our goal should be a new form of very stable “sentient middleware” • Have we accomplished this goal? • Probabilistically reliable, scalable primitives • They solve many problems • Gaining much attention now from industry, academic research community • Fundamental issue is skepticism about peer-to-peer as a computing model

  50. Conclusions? • We’re on the verge of a breakthrough – networks that behave like sentient infrastructure on behalf of smart applications • Incredibly exciting opportunities if we can build these • Cornell’s angle has focused on probabilistically scalable technologies and tried to mix real systems and experimental work with stochastic analyses
