An Overlay Infrastructure for Decentralized Object Location and Routing

Presentation Transcript


  1. An Overlay Infrastructure for Decentralized Object Location and Routing Ben Y. Zhao ravenben@eecs.berkeley.edu University of California at Berkeley, Computer Science Division

  2. Peer-based Distributed Computing • Cooperative approach to large-scale applications • peer-based: available resources scale w/ # of participants • better than client/server: limited resources & scalability • Large-scale, cooperative applications are coming • content distribution networks (e.g. FastForward) • large-scale backup / storage utilities • leverage peers’ storage for higher resiliency / availability • cooperative web caching • application-level multicast • video on-demand, streaming movies ravenben@eecs.berkeley.edu

  3. What Are the Technical Challenges? • File system: replicate files for resiliency/performance • how do you find close by replicas? • how does this scale to millions of users? billions of files? ravenben@eecs.berkeley.edu

  4. Node Membership Changes • Nodes join and leave the overlay, or fail • data or control state needs to know about available resources • node membership management a necessity ravenben@eecs.berkeley.edu

  5. A Fickle Internet • Internet disconnections are not rare (UMichTR98, IMC02) • TCP retransmission is not enough, need route-around • IP route repair takes too long: IS-IS ≈ 5 s, BGP ≈ 3-15 mins • good end-to-end performance requires fast response to faults ravenben@eecs.berkeley.edu

  6. An Infrastructure Approach • First generation of large-scale apps: vertical approach • [figure: FastForward, SETI, and Yahoo IM each re-implement data location, dynamic node membership, and reliable communication] • Hard problems, difficult to get right • instead, solve common challenges once: efficient, scalable data location, dynamic node membership algorithms, reliable communication • build single overlay infrastructure at application layer • [figure: overlay layer above the Internet stack (application, presentation, session, transport, network, link, physical)] ravenben@eecs.berkeley.edu

  7. Personal Research Roadmap • [roadmap figure] Tapestry structured overlay / DOLR (5000+ downloads): resilient overlay routing, structured overlay APIs, robust dynamic algorithms, WAN deployment (1500+ downloads) • applications built on top: multicast (Bayeux), rapid mobility (Warp), spam filtering (SpamWatch), file system (OceanStore), landmark routing (Brocade), service discovery, XSet lightweight XML DB, TSpaces • modeling of non-stationary datasets • publication venues noted on the figure: PRR 97, Mobicom 99, IPTPS 02/03/04, SPAA 02 / TOCS, ICNP 03, NOSSDAV 02, Middleware 03, ASPLOS 99 / FAST 03, JSAC 04 ravenben@eecs.berkeley.edu

  8. Talk Outline • Motivation • Decentralized object location and routing • Resilient routing • Tapestry deployment performance • Wrap-up ravenben@eecs.berkeley.edu

  9. What should this infrastructure look like? Here is one appealing direction…

  10. Structured Peer-to-Peer Overlays • Node IDs and keys from randomized namespace (SHA-1) • incremental routing towards destination ID • each node has small set of outgoing routes, e.g. prefix routing • log(n) neighbors per node, log(n) hops between any node pair • [figure: a message to key ABCD hops through nodes A930, AB5F, ABC0 to node ABCE, matching one more prefix digit per hop] ravenben@eecs.berkeley.edu
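
Below is a minimal Java sketch of the incremental prefix-routing rule described on this slide. The PrefixRouter class, hexadecimal digits, and table layout are illustrative assumptions for exposition, not Tapestry's actual data structures.

    // Sketch of incremental prefix routing (assumed helper types, not Tapestry's real API).
    final class PrefixRouter {
        private final String localId;
        // table[i][d]: a neighbor sharing the first i digits with us whose (i+1)-th digit is d, or null
        private final String[][] table;

        PrefixRouter(String localId, String[][] table) {
            this.localId = localId;
            this.table = table;
        }

        /** Next overlay hop for a destination key, or null if this node already matches the key. */
        String nextHop(String key) {
            int shared = sharedPrefixLength(localId, key);
            if (shared == key.length()) return null;             // we match the whole key
            int digit = Character.digit(key.charAt(shared), 16);
            // one more matching prefix digit per hop; a real protocol surrogate-routes
            // around empty entries, which is omitted here
            return table[shared][digit];
        }

        static int sharedPrefixLength(String a, String b) {
            int i = 0;
            while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
            return i;
        }
    }

With a fixed digit base this gives each node O(log n) outgoing routes and reaches any key in O(log n) hops, which is the property the slide relies on.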

  11. Related Work • Unstructured Peer to Peer Approaches • Napster, Gnutella, KaZaa • probabilistic search (optimized for the hay, not the needle) • locality-agnostic routing (resulting in high network b/w costs) • Structured Peer to Peer Overlays • the first protocols (2001): Tapestry, Pastry, Chord, CAN • then: Kademlia, SkipNet, Viceroy, Symphony, Koorde, Ulysseus… • distinction: how to choose your neighbors • Tapestry, Pastry: latency-optimized routing mesh • distinction: application interface • distributed hash table: put (key, data); data = get (key); • Tapestry: decentralized object location and routing ravenben@eecs.berkeley.edu

  12. Defining the Requirements • efficient routing to nodes and data • low routing stretch (ratio of latency to shortest path distance) • flexible data location • applications want/need to control data placement • allows for application-specific performance optimizations • directory interface: publish(ObjID), RouteToObj(ObjID, msg) • resilient and responsive to faults • more than just retransmission, route around failures • reduce negative impact (loss/jitter) on the application ravenben@eecs.berkeley.edu

  13. Decentralized Object Location & Routing • redirect data traffic using log(n) in-network redirection pointers • average # of pointers/machine: log(n) * avg files/machine • keys to performance: proximity-enabled routing mesh with routing convergence • [figure: a server publishes key k across the backbone; clients’ routeobj(k) messages follow the pointers and converge on a nearby replica] ravenben@eecs.berkeley.edu
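
As a rough illustration of the publish(k) / routeobj(k) behavior above, the Java sketch below stores a redirection pointer at every hop of a publish and lets a route-to-object message divert as soon as it meets one. DolrLayer, OverlayNode, and the untyped Object[] messages are invented for this sketch and are not the Tapestry API.

    import java.util.*;

    interface OverlayNode {
        String id();
        String nextHop(String key);               // prefix routing as in the earlier sketch
        void send(String nodeId, Object msg);     // deliver a message to another overlay node
    }

    final class DolrLayer {
        private final OverlayNode node;
        // key -> overlay IDs of nodes holding a replica (the in-network redirection pointers)
        private final Map<String, Set<String>> pointers = new HashMap<>();

        DolrLayer(OverlayNode node) { this.node = node; }

        /** Called at each hop of a publish(k) message originating at serverId. */
        void onPublish(String key, String serverId) {
            pointers.computeIfAbsent(key, k -> new HashSet<>()).add(serverId);
            String next = node.nextHop(key);
            if (next != null) node.send(next, new Object[] { "publish", key, serverId });
        }

        /** Called at each hop of routeobj(k): divert at the first pointer found. */
        void onRouteToObj(String key, Object msg) {
            Set<String> replicas = pointers.get(key);
            if (replicas != null && !replicas.isEmpty()) {
                // divert to a replica; a real DOLR would pick the pointer closest in network distance
                node.send(replicas.iterator().next(), msg);
                return;
            }
            String next = node.nextHop(key);
            if (next != null) node.send(next, new Object[] { "routeToObj", key, msg });
        }
    }

Because publish and routeobj follow converging paths toward the key’s root, a request usually meets a pointer well before the root, which is what keeps the pointer state near log(n) entries per object.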

  14. Why Proximity Routing? • Fewer/shorter IP hops: shorter e2e latency, less bandwidth/congestion, less likely to cross broken/lossy links • [figure: routing to node 01234 with and without proximity-aware neighbor selection] ravenben@eecs.berkeley.edu

  15. Performance Impact (Proximity) • Simulated Tapestry w/ and w/o proximity on 5000 node transit-stub network • Measure pair-wise routing stretch between 200 random nodes ravenben@eecs.berkeley.edu

  16. DOLR vs. Distributed Hash Table • DHT: hash content → name → replica placement • modifications → replicating new version into DHT • DOLR: app places copy near requests, overlay routes msgs to it ravenben@eecs.berkeley.edu

  17. Performance Impact (DOLR) • simulated Tapestry w/ DOLR and DHT interfaces on 5000 node T-S • measure route to object latency from clients in 2 stub networks • DHT: 5 object replicas; DOLR: 1 replica placed in each stub network ravenben@eecs.berkeley.edu

  18. Talk Outline • Motivation • Decentralized object location and routing • Resilient and responsive routing • Tapestry deployment performance • Wrap-up ravenben@eecs.berkeley.edu

  19. How do you get fast responses to faults? Response time = fault-detection + alternate path discovery + time to switch

  20. Fast Response via Static Resiliency • Reducing fault-detection time • monitor paths to neighbors with periodic UDP probes • O(log(n)) neighbors: higher frequency w/ low bandwidth • exponentially weighted moving average for link quality estimation • avoid route flapping due to short term loss artifacts • loss rate: Ln = (1 − α) × Ln−1 + α × p • Eliminate synchronous backup path discovery • actively maintain redundant paths, redirect traffic immediately • repair redundancy asynchronously • create and store backups at node insertion • restore redundancy via random pair-wise queries after failures • End result • fast detection + precomputed paths = increased responsiveness ravenben@eecs.berkeley.edu
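
The loss-rate filter on this slide can be written directly; here is a small sketch assuming one LinkQuality object per neighbor, fed by the periodic UDP probes. The class and method names are made up for illustration.

    // EWMA loss-rate estimate per overlay neighbor, as on the slide: Ln = (1 - a) * Ln-1 + a * p
    final class LinkQuality {
        private final double alpha;     // filter constant (slide 26 uses 0.2 and 0.4)
        private double lossRate = 0.0;  // current estimate Ln, in [0, 1]

        LinkQuality(double alpha) { this.alpha = alpha; }

        /** Fold in one probe result: p = 1 if the UDP probe was lost, 0 if it was answered. */
        void recordProbe(boolean lost) {
            double p = lost ? 1.0 : 0.0;
            lossRate = (1.0 - alpha) * lossRate + alpha * p;
        }

        double lossRate() { return lossRate; }
    }

A larger alpha weights recent probes more heavily, so failures are detected after fewer probe periods, at the cost of reacting more to short-term loss artifacts.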

  21. Routing Policies • Use estimated overlay link quality to choose shortest “usable” link • Use shortest overlay link with minimal quality > T • Alternative policies • prioritize low loss over latency • use least lossy overlay link • use path w/ minimal “cost function” cf = x × latency + y × loss rate ravenben@eecs.berkeley.edu
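
Both policies above reduce to a comparator over the candidate links. A hedged Java sketch follows, assuming each routing entry carries latency and loss estimates for its primary and backup links; the Link record and threshold names are illustrative.

    import java.util.*;

    final class RoutePolicy {
        record Link(String neighborId, double latencyMs, double lossRate) {}

        /** Default policy: shortest overlay link whose estimated loss rate is below the threshold T. */
        static Optional<Link> shortestUsable(List<Link> candidates, double maxLossT) {
            return candidates.stream()
                    .filter(l -> l.lossRate() <= maxLossT)
                    .min(Comparator.comparingDouble(Link::latencyMs));
        }

        /** Alternative policy: minimize a combined cost cf = x * latency + y * lossRate. */
        static Optional<Link> minCost(List<Link> candidates, double x, double y) {
            return candidates.stream()
                    .min(Comparator.comparingDouble(l -> x * l.latencyMs() + y * l.lossRate()));
        }
    }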

  22. Talk Outline • Motivation • Decentralized object location and routing • Resilient and responsive routing • Tapestry deployment performance • Wrap-up ravenben@eecs.berkeley.edu

  23. Tapestry, a DOLR Protocol • Routing based on incremental prefix matching • Latency-optimized routing mesh • nearest neighbor algorithm (HKRZ02) • supports massive failures and large group joins • Built-in redundant overlay links • 2 backup links maintained w/ each primary • Use “objects” as endpoints for rendezvous • nodes publish names to announce their presence • e.g. wireless proxy publishes nearby laptop’s ID • e.g. multicast listeners publish multicast session name to self-organize ravenben@eecs.berkeley.edu

  24. Weaving a Tapestry • inserting node (ID = 0123) into an existing Tapestry network • [figure: node 0123’s routing table; levels match 0, 1, 2, 3 prefix digits, with entries such as 1XXX–3XXX, 00XX–03XX, 010X–013X, 0120–0122] • 1. route to own ID, find 012X nodes, fill last column • 2. request backpointers to 01XX nodes • 3. measure distance, add to rTable • 4. prune to nearest K nodes • repeat 2–4 ravenben@eecs.berkeley.edu
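
A hedged Java rendering of insertion steps 1-4 is sketched below; routeToId, backpointers, and measureRtt are assumed helpers standing in for overlay machinery, and the loop is only an approximation of the published nearest-neighbor algorithm (HKRZ02).

    import java.util.*;

    final class NodeInsertion {
        interface Mesh {
            List<String> routeToId(String id);                      // step 1: nodes sharing the longest prefix (e.g. 012X)
            List<String> backpointers(String node, int prefixLen);  // step 2: nodes pointing at it from a shorter prefix (e.g. 01XX)
            double measureRtt(String node);                         // step 3: network distance probe
        }

        /** Build the new node's rTable, keeping the nearest k entries per level. */
        static Map<Integer, List<String>> buildRoutingTable(Mesh mesh, String myId, int k) {
            Map<Integer, List<String>> rTable = new HashMap<>();
            List<String> contacts = mesh.routeToId(myId);            // fill the last column first
            for (int prefixLen = myId.length() - 1; prefixLen >= 0; prefixLen--) {
                Set<String> candidates = new HashSet<>(contacts);
                for (String c : contacts) {                          // step 2: widen the candidate set
                    candidates.addAll(mesh.backpointers(c, prefixLen));
                }
                Map<String, Double> rtt = new HashMap<>();           // step 3: measure distances once
                for (String c : candidates) rtt.put(c, mesh.measureRtt(c));
                List<String> nearest = candidates.stream()           // step 4: prune to nearest k
                        .sorted(Comparator.comparingDouble(rtt::get))
                        .limit(k)
                        .toList();
                rTable.put(prefixLen, nearest);
                contacts = nearest;                                  // repeat 2-4 with the shorter prefix
            }
            return rTable;
        }
    }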

  25. Implementation Performance • Java implementation • 35000+ lines in core Tapestry, 1500+ downloads • Micro-benchmarks • per msg overhead: ~50 μs, most latency from byte copying • performance scales w/ CPU speedup • 5KB msgs on P-IV 2.4GHz: throughput ~10,000 msgs/sec • Routing stretch • route to node: < 2 • route to objects/endpoints: < 3 (higher stretch for nearby objects) ravenben@eecs.berkeley.edu

  26. Responsiveness to Faults (PlanetLab) • [figure: time to route around a fault, roughly 660 ms and 300 ms, for filter constants α = 0.2 and α = 0.4] • probing B/W grows with network size N: N = 300 → ~7 KB/s/node, N = 10^6 → ~20 KB/s • sim: if link failure rate < 10%, can route around 90% of survivable failures ravenben@eecs.berkeley.edu

  27. Stability Under Membership Changes • [figure: success rate (%) of routing operations under kill nodes, constant churn, and large group join] • Routing operations on 40 node Tapestry cluster • Churn: nodes join/leave every 10 seconds, average lifetime = 2 mins ravenben@eecs.berkeley.edu

  28. Talk Outline • Motivation • Decentralized object location and routing • Resilient and responsive routing • Tapestry deployment performance • Wrap-up ravenben@eecs.berkeley.edu

  29. Lessons and Takeaways • Consider system constraints in algorithm design • limited by finite resources (e.g. file descriptors, bandwidth) • simplicity wins over small performance gains • easier adoption and faster time to implementation • Wide-area state management (e.g. routing state) • reactive algorithm for best-effort, fast response • proactive periodic maintenance for correctness • Naïve event programming model is too low-level • much code complexity from managing stack state • important for protocols with asynchronous control algorithms • need explicit thread support for callbacks / stack management ravenben@eecs.berkeley.edu

  30. Future Directions • Ongoing work to explore p2p application space • resilient anonymous routing, attack resiliency • Intelligent overlay construction • router-level listeners allow application queries • efficient meshes, fault-independent backup links, failure notify • Deploying and measuring a lightweight peer-based application • focus on usability and low overhead • p2p incentives, security, deployment meet the real world • A holistic approach to overlay security and control • p2p good for self-organization, not for security/management • decouple administration from normal operation • explicit domains / hierarchy for configuration, analysis, control ravenben@eecs.berkeley.edu

  31. Thanks! Questions, comments? ravenben@eecs.berkeley.edu

  32. Impact of Correlated Events • [figure: requests A, B, C arriving from the network at a single event handler] • web / application servers • independent requests • maximize individual throughput • correlated requests: A + B + C → D • e.g. online continuous queries, sensor aggregation, p2p control layer, streaming data mining ravenben@eecs.berkeley.edu

  33. Some Details • Simple fault detection techniques • periodically probe overlay links to neighbors • exponentially weighted moving average for link quality estimation • avoid route flapping due to short term loss artifacts • loss rate: Ln = (1 − α) × Ln−1 + α × p • p = instantaneous loss rate, α = filter constant • other techniques topics of open research • How do we get and repair the backup links? • each hop has flexible routing constraint • e.g. in prefix routing, 1st hop just requires 1 fixed digit • backups always available until last hop to destination • create and store backups at node insertion • restore redundancy via random pair-wise queries after failures • e.g. to replace 123X neighbor, talk to local 12XX neighbors ravenben@eecs.berkeley.edu
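
The last two bullets describe how lost backups are refilled; the sketch below captures that pattern under assumed names (Peer, queryEntries), querying randomly chosen neighbors that share the shorter prefix until enough entries are recovered.

    import java.util.*;

    final class BackupRepair {
        interface Peer {
            /** Ask a neighbor for the entries it keeps for a given prefix, e.g. "123". */
            List<String> queryEntries(String neighborId, String prefix);
        }

        /**
         * A routing entry for prefix "123" lost a member: ask surviving neighbors that
         * share the shorter prefix "12" (they all keep their own 123X entries) and merge.
         */
        static List<String> refill(Peer peer, List<String> shorterPrefixNeighbors,
                                   String lostPrefix, List<String> currentEntries, int want) {
            Set<String> entries = new LinkedHashSet<>(currentEntries);
            List<String> order = new ArrayList<>(shorterPrefixNeighbors);
            Collections.shuffle(order);                              // random pair-wise queries
            for (String n : order) {
                if (entries.size() >= want) break;
                entries.addAll(peer.queryEntries(n, lostPrefix));
            }
            return new ArrayList<>(entries);
        }
    }

Because this repair runs asynchronously, traffic keeps flowing over the remaining precomputed backups while redundancy is restored in the background.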

  34. Route Redundancy (Simulator) • Simulation of Tapestry, 2 backup paths per routing entry • 2 backups: low maintenance overhead, good resiliency ravenben@eecs.berkeley.edu

  35. Another Perspective on Reachability • [figure legend] portion of all pair-wise paths where no failure-free paths remain • a path exists, but neither IP nor FRLS can locate the path • portion of all paths where IP and FRLS both route successfully • FRLS finds a path where short-term IP routing fails ravenben@eecs.berkeley.edu

  36. Single Node Software Architecture • [layered figure, top to bottom: applications • application programming interface • dynamic Tapestry core router, Patchwork (distance map), network • SEDA event-driven framework • Java Virtual Machine] ravenben@eecs.berkeley.edu

  37. Related Work • Unstructured Peer to Peer Applications • Napster, Gnutella, KaZaa • probabilistic search, difficult to scale, inefficient b/w • Structured Peer to Peer Overlays • Chord, CAN, Pastry, Kademlia, SkipNet, Viceroy, Symphony, Koorde, Coral, Ulysseus, … • routing efficiency • application interface • Resilient routing • traffic redirection layers • Detour, Resilient Overlay Networks (RON), Internet Indirection Infrastructure (I3) • our goals: scalability, in-network traffic redirection ravenben@eecs.berkeley.edu

  38. Node to Node Routing (PlanetLab) • Ratio of end-to-end latency to ping distance between nodes • All node pairs measured, placed into buckets • [figure: median = 31.5, 90th percentile = 135] ravenben@eecs.berkeley.edu

  39. Object Location (PlanetLab) • Ratio of end-to-end latency to client-object ping distance • Local-area stretch improved w/ additional location state • [figure: 90th percentile = 158] ravenben@eecs.berkeley.edu

  40. Micro-benchmark Results (100 Mb/s LAN) • Per msg overhead ~50 μs, latency dominated by byte copying • Performance scales with CPU speedup • For 5 KB messages, throughput ≈ 10,000 msgs/sec ravenben@eecs.berkeley.edu

  41. Traffic Tunneling • [figure: legacy nodes A and B (IP addresses) register with their proxies; each proxy does put (hash(A), P’(A)) or put (hash(B), P’(B)) into the structured peer-to-peer overlay; A’s proxy issues get (hash(B)) to learn P’(B) and tunnels traffic to B’s proxy] • Store mapping from end host IP to its proxy’s overlay ID • Similar to approach in Internet Indirection Infrastructure (I3) ravenben@eecs.berkeley.edu
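
A sketch of the register / put / get flow in the figure, written against an assumed put/get view of the overlay; TunnelProxy, Overlay, and the SHA-1 keying of legacy IP addresses are illustrative choices, not the deployed interface.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HexFormat;

    final class TunnelProxy {
        interface Overlay {
            void put(String key, String value);   // publish a mapping into the overlay
            String get(String key);               // resolve a mapping
        }

        private final Overlay overlay;
        private final String myOverlayId;         // P'(X): this proxy's overlay ID

        TunnelProxy(Overlay overlay, String myOverlayId) {
            this.overlay = overlay;
            this.myOverlayId = myOverlayId;
        }

        /** Legacy host A registers with this proxy: store hash(A) -> P'(A). */
        void register(String legacyIp) {
            overlay.put(sha1(legacyIp), myOverlayId);
        }

        /** To reach legacy host B, resolve its proxy's overlay ID, then tunnel traffic there. */
        String proxyFor(String legacyIp) {
            return overlay.get(sha1(legacyIp));
        }

        private static String sha1(String s) {
            try {
                MessageDigest md = MessageDigest.getInstance("SHA-1");
                return HexFormat.of().formatHex(md.digest(s.getBytes(StandardCharsets.UTF_8)));
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
    }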

  42. Constrained Multicast • Used only when all paths are below quality threshold • Send duplicate messages on multiple paths • Leverage route convergence • Assign unique message IDs • Mark duplicates • Keep moving window of IDs • Recognize and drop duplicates • Limitations • Assumes loss not from congestion • Ideal for local area routing • [figure: duplicate copies from node 1111 converge en route to the destination among nodes 2225, 2299, 2274, 2286, 2046, 2281, 2530] ravenben@eecs.berkeley.edu
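
The duplicate handling above amounts to a bounded window of recently seen message IDs; here is a small sketch (class name and window size are assumptions) that delivers the first copy and drops the rest.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Moving window of recently seen message IDs; later copies within the window are dropped.
    final class DuplicateFilter {
        private final int windowSize;
        private final Map<Long, Boolean> seen;

        DuplicateFilter(int windowSize) {
            this.windowSize = windowSize;
            // insertion-ordered map that evicts the oldest ID once the window is full
            this.seen = new LinkedHashMap<Long, Boolean>(windowSize, 0.75f, false) {
                @Override protected boolean removeEldestEntry(Map.Entry<Long, Boolean> e) {
                    return size() > DuplicateFilter.this.windowSize;
                }
            };
        }

        /** True the first time an ID is seen; false (drop) for duplicate copies in the window. */
        synchronized boolean firstDelivery(long messageId) {
            return seen.put(messageId, Boolean.TRUE) == null;
        }
    }

The sender marks all duplicate copies with the same ID before fanning them out over the primary and backup links, so whichever copy converges first is delivered and the rest are filtered.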

  43. Link Probing Bandwidth (PL) • Bandwidth increases logarithmically with overlay size • Medium sized routing overlays incur low probing bandwidth ravenben@eecs.berkeley.edu
