
Peer-to-Peer Networks



Presentation Transcript


  1. Peer-to-Peer Networks Distributed Algorithms for P2P Distributed Hash Tables P. Felber Pascal.Felber@eurecom.fr http://www.eurecom.fr/~felber/

  2. Agenda • What are DHTs? Why are they useful? • What makes a “good” DHT design • Case studies • Chord • Pastry (locality) • TOPLUS (topology-awareness) • What are the open problems? Peer-to-Peer Networks — P. Felber

  3. What is P2P? • A distributed system architecture • No centralized control • Typically many nodes, but unreliable and heterogeneous • Nodes are symmetric in function • Take advantage of distributed, shared resources (bandwidth, CPU, storage) on peer nodes • Fault-tolerant, self-organizing • Operates in a dynamic environment; frequent joins and leaves are the norm Peer-to-Peer Networks — P. Felber

  4. P2P Challenge: Locating Content • Simple strategy: expanding ring search until the content is found • If r of N nodes have a copy, the expected search cost is at least N / r, i.e., O(N) • Need many copies to keep the overhead small (Figure: "Who has this paper?" flooded through the network until a node answers "I have it") Peer-to-Peer Networks — P. Felber

  5. Directed Searches • Idea • Assign particular nodes to hold particular content (or know where it is) • When a node wants this content, it goes to the node that is supposed to hold it (or knows where it is) • Challenges • Avoid bottlenecks: distribute the responsibilities “evenly” among the existing nodes • Adaptation to nodes joining or leaving (or failing) • Give responsibilities to joining nodes • Redistribute responsibilities from leaving nodes Peer-to-Peer Networks — P. Felber

  6. Idea: Hash Tables • A hash table associates data with keys • The key is hashed to find its bucket in the hash table • Each bucket is expected to hold #items/#buckets items • In a Distributed Hash Table (DHT), nodes are the hash buckets • The key is hashed to find the responsible peer node • Data and load are balanced across nodes (Figure: lookup(key) → data and insert(key, data) against N buckets/nodes; the key “Beattles” is hashed with h(key)%N to position 2) Peer-to-Peer Networks — P. Felber
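
To make the bucket mapping concrete, here is a minimal sketch (not from the slides) using Python's standard hashlib; the function and parameter names are illustrative only.

```python
import hashlib

def bucket_for(key: str, num_buckets: int) -> int:
    """Classic hash-table placement: hash the key, then take the hash
    modulo the number of buckets (in a DHT, the number of nodes)."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % num_buckets

# The same key always lands in the same bucket/node.
print(bucket_for("Beattles", 8))
```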

  7. DHTs: Problems • Problem 1 (dynamicity): adding or removing nodes • With hash mod N, virtually every key will change its location! h(k) mod m ≠ h(k) mod (m+1) ≠ h(k) mod (m−1) • Solution: use consistent hashing • Define a fixed hash space • All hash values fall within that space and do not depend on the number of peers (hash buckets) • Each key goes to the peer closest to its ID in the hash space (according to some proximity metric) Peer-to-Peer Networks — P. Felber
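
The consistent-hashing idea can be sketched in a few lines, assuming a fixed 32-bit hash space and SHA-1 for IDs as in the slides' later examples; this is an illustrative sketch, not the lecture's code.

```python
import hashlib
from bisect import bisect_left

HASH_SPACE = 2 ** 32            # fixed hash space, independent of the number of peers

def hash_id(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % HASH_SPACE

def responsible_node(key: str, node_ids: list[int]) -> int:
    """Consistent hashing: a key is assigned to the first node at or after
    its hash on the circle (wrapping around), so adding or removing one
    node only affects the keys on that node's arc."""
    ids = sorted(node_ids)
    i = bisect_left(ids, hash_id(key))
    return ids[i % len(ids)]

nodes = [hash_id(f"10.0.0.{i}") for i in range(8)]
print(responsible_node("Beattles", nodes))
```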

  8. DHTs: Problems (cont’d) • Problem 2 (size): all nodes must be known to insert or lookup data • Works with small and static server populations • Solution: each peer knows of only a few “neighbors” • Messages are routed through neighbors via multiple hops (overlay routing) Peer-to-Peer Networks — P. Felber

  9. What Makes a Good DHT Design • For each object, the node(s) responsible for that object should be reachable via a “short” path (small diameter) • The different DHTs differ fundamentally only in the routing approach • The number of neighbors for each node should remain “reasonable” (small degree) • DHT routing mechanisms should be decentralized (no single point of failure or bottleneck) • Should gracefully handle nodes joining and leaving • Repartition the affected keys over existing nodes • Reorganize the neighbor sets • Bootstrap mechanisms to connect new nodes into the DHT • To achieve good performance, DHT must provide low stretch • Minimize ratio of DHT routing vs. unicast latency Peer-to-Peer Networks — P. Felber

  10. DHT Interface • Minimal interface (data-centric): Lookup(key) → IP address • Supports a wide range of applications, because there are few restrictions • Keys have no semantic meaning • The value is application-dependent • DHTs do not store the data • Data storage can be built on top of DHTs: Lookup(key) → data, Insert(key, data) Peer-to-Peer Networks — P. Felber
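
One way to picture a storage layer built on the minimal Lookup(key) → node primitive is sketched below; NodeStub and its put/get methods are hypothetical stand-ins for remote peers, not part of any real DHT library.

```python
class NodeStub:
    """Hypothetical stand-in for a remote peer; in a real system put/get
    would be RPCs to the node returned by the DHT's Lookup(key)."""
    def __init__(self):
        self.store = {}
    def put(self, key, data):
        self.store[key] = data
    def get(self, key):
        return self.store.get(key)

def insert(lookup, key, data):
    """Data storage built on top of the minimal DHT interface:
    Lookup(key) -> node, then store the value at that node."""
    lookup(key).put(key, data)

def lookup_data(lookup, key):
    return lookup(key).get(key)

# Toy example: a one-node "DHT" whose lookup always returns the same peer.
peer = NodeStub()
insert(lambda k: peer, "song", "Yesterday")
print(lookup_data(lambda k: peer, "song"))
```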

  11. DHTs in Context • User → Application (store_file, load_file) • File System [CFS]: retrieve and store files, map files to blocks (store_block, load_block) • Reliable Block Storage [DHash]: storage, replication, caching (lookup) • Lookup / Routing [Chord]: DHT (send, receive) • Transport: TCP/IP communication Peer-to-Peer Networks — P. Felber

  12. DHTs Support Many Applications • File sharing [CFS, OceanStore, PAST, …] • Web cache [Squirrel, …] • Censor-resistant stores [Eternity, FreeNet, …] • Application-layer multicast [Narada, …] • Event notification [Scribe] • Naming systems [ChordDNS, INS, …] • Query and indexing [Kademlia, …] • Communication primitives [I3, …] • Backup store [HiveNet] • Web archive [Herodotus] Peer-to-Peer Networks — P. Felber

  13. DHT Case Studies • Case Studies • Chord • Pastry • TOPLUS • Questions • How is the hash space divided evenly among nodes? • How do we locate a node? • How do we maintain routing tables? • How do we cope with (rapid) changes in membership? Peer-to-Peer Networks — P. Felber

  14. Chord (MIT) • Circular m-bit ID space for both keys and nodes (here m=6, IDs 0 to 2^m−1) • Node ID = SHA-1(IP address), Key ID = SHA-1(key) • A key is mapped to the first node whose ID is equal to or follows the key ID • Each node is responsible for O(K/N) keys • O(K/N) keys move when a node joins or leaves (Figure: ring with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56 and keys K10→N14, K24→N32, K30→N32, K38→N38, K54→N56) Peer-to-Peer Networks — P. Felber
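
A small sketch of this key-to-node mapping on the m=6 ring from the figure (illustrative only; a real deployment would derive the IDs with SHA-1 as stated above):

```python
M = 6   # ring of size 2^M, as in the figure

def successor(key_id: int, node_ids: list[int]) -> int:
    """A key is stored at the first node whose ID is equal to or follows
    the key ID on the circle (mod 2^M)."""
    for n in sorted(node_ids):
        if n >= key_id:
            return n
    return min(node_ids)        # wrap around past 2^M - 1

ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # the nodes from the figure
assert all(n < 2 ** M for n in ring)
assert successor(54, ring) == 56                # K54 -> N56
assert successor(24, ring) == 32                # K24 -> N32
```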

  15. Chord State and Lookup (1) • Basic Chord: each node knows only 2 other nodes on the ring • Successor • Predecessor (for ring management) • Lookup is achieved by forwarding requests around the ring through successor pointers • Requires O(N) hops (Figure: lookup(K54) issued at N8 is forwarded successor by successor around the ring until it reaches N56) Peer-to-Peer Networks — P. Felber

  16. Chord State and Lookup (2) • Each node knows m other nodes on the ring • Fingers (successors): finger i of n points to the node at n+2^i (or its successor) • Predecessor (for ring management) • O(log N) state per node • Lookup is achieved by following the closest preceding fingers, then the successor • O(log N) hops (Figure: N8's finger table — N8+1, N8+2, N8+4 → N14; N8+8 → N21; N8+16 → N32; N8+32 → N42; lookup(K54) from N8 jumps to N42, then N51, then reaches N56) Peer-to-Peer Networks — P. Felber
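
The "closest preceding finger" step can be sketched as follows, using N8's finger table from the figure; this is an illustrative approximation of Chord's rule, not the reference implementation.

```python
def between(x: int, a: int, b: int) -> bool:
    """True if x lies in the open circular interval (a, b) of the ID space."""
    return (a < x < b) if a < b else (x > a or x < b)

def closest_preceding_finger(n: int, fingers: list[int], key: int) -> int:
    """Scan the finger table from the farthest entry down and return the
    finger that most closely precedes the key (or n itself)."""
    for f in reversed(fingers):
        if between(f, n, key):
            return f
    return n

# N8's fingers from the figure: +1, +2, +4 -> N14; +8 -> N21; +16 -> N32; +32 -> N42
fingers_of_n8 = [14, 14, 14, 21, 32, 42]
print(closest_preceding_finger(8, fingers_of_n8, 54))   # 42: first hop of lookup(K54)
```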

  17. Chord Ring Management • For correctness, Chord needs to maintain the following invariants • For every key k, succ(k) is responsible for k • Successor pointers are correctly maintained • Finger tables are not necessary for correctness • One can always default to successor-based lookup • Finger tables can be updated lazily Peer-to-Peer Networks — P. Felber

  18. Joining the Ring • Three step process: • Initialize all fingers of new node • Update fingers of existing nodes • Transfer keys from successor to new node Peer-to-Peer Networks — P. Felber

  19. Joining the Ring — Step 1 • Initialize the new node's finger table • Locate any node n in the ring • Ask n to look up the peers at j+2^0, j+2^1, j+2^2, … • Use the results to populate the finger table of j Peer-to-Peer Networks — P. Felber
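
A sketch of this step, assuming the successor-based `lookup` primitive from the earlier Chord sketch (names are illustrative):

```python
def init_finger_table(j: int, lookup, M: int = 6) -> list[int]:
    """Joining node j asks an existing node to look up the peers at
    j+2^0, j+2^1, ..., j+2^(M-1); the answers become j's fingers."""
    return [lookup((j + 2 ** i) % 2 ** M) for i in range(M)]

# `lookup` stands in for the ring's successor-based lookup primitive.
ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
lookup = lambda k: next((n for n in sorted(ring) if n >= k), min(ring))
print(init_finger_table(28, lookup))    # fingers of a node joining as N28
```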

  20. Joining the Ring — Step 2 • Updating fingers of existing nodes • New node j calls an update function on the existing nodes that must point to j • These are the nodes p in the ranges [pred(j)−2^i+1, j−2^i], i.e., the nodes whose finger i now falls at or before j • O(log N) nodes need to be updated (Figure: N28 joins; N8's +16 finger, previously N32, is updated to N28) Peer-to-Peer Networks — P. Felber

  21. Joining the Ring — Step 3 • Transfer key responsibility • Connect to the successor • Copy keys from the successor to the new node • Update the successor pointer and remove the keys • Only keys in the range are transferred (Figure: N28 joins between N21 and N32; of the keys K24 and K30 previously held by N32, K24 moves to N28 while K30 stays at N32) Peer-to-Peer Networks — P. Felber

  22. Stabilization • Case 1: finger tables are reasonably fresh • Case 2: successor pointers are correct, but not the fingers • Case 3: successor pointers are inaccurate or key migration is incomplete — MUST BE AVOIDED! • A stabilization algorithm periodically verifies and refreshes node pointers (including fingers) • Basic principle (at node n): x = n.succ.pred; if x ∈ (n, n.succ) then n.succ = x; notify n.succ • Eventually stabilizes the system when no node joins or fails Peer-to-Peer Networks — P. Felber
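
The basic principle can be expressed as a small sketch (illustrative; real Chord nodes exchange these pointers via RPC and also refresh their fingers):

```python
def between(x: int, a: int, b: int) -> bool:
    """True if x lies in the open circular interval (a, b) of the ID space."""
    return (a < x < b) if a < b else (x > a or x < b)

class RingNode:
    """Minimal node holding only the successor/predecessor pointers."""
    def __init__(self, ident: int):
        self.id, self.succ, self.pred = ident, None, None

    def stabilize(self):
        # Ask our successor for its predecessor; if that node sits between
        # us and our successor, adopt it as the new successor, then notify.
        x = self.succ.pred
        if x is not None and between(x.id, self.id, self.succ.id):
            self.succ = x
        self.succ.notify(self)

    def notify(self, candidate):
        # Accept `candidate` as predecessor if it is a better fit.
        if self.pred is None or between(candidate.id, self.pred.id, self.id):
            self.pred = candidate

# The figure's scenario: N28 has just joined between N21 and N32.
n21, n28, n32 = RingNode(21), RingNode(28), RingNode(32)
n21.succ, n28.succ, n32.pred = n32, n32, n21     # stale pointers
n28.succ.notify(n28)    # N28 announces itself to its successor N32
n21.stabilize()         # N21 learns about N28 from N32 and re-points
assert n21.succ is n28 and n28.pred is n21
```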

  23. Dealing With Failures • Failure of nodes might cause incorrect lookups • N8 doesn't know its correct successor, so a lookup of K19 fails • Solution: successor list • Each node n knows its r immediate successors • After a failure, n knows the first live successor and updates its successor list • Correct successors guarantee correct lookups (Figure: lookup(K19) issued at N8 fails after node failures on the ring between N8 and N21) Peer-to-Peer Networks — P. Felber

  24. Dealing With Failures (cont’d) • Successor lists guarantee correct lookup with some probability • Can choose r to make the probability of lookup failure arbitrarily small • Assume half of the nodes fail and failures are independent • P(n's successor list is all dead) = 0.5^r • P(n does not break the Chord ring) = 1 − 0.5^r • P(no broken nodes) = (1 − 0.5^r)^N • r = 2·log₂(N) makes this probability ≈ 1 − 1/N • With high probability (1 − 1/N), the ring is not broken Peer-to-Peer Networks — P. Felber
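
The arithmetic behind the last two bullets can be checked numerically; this short sketch just evaluates (1 − 0.5^r)^N with r = 2·log₂(N):

```python
import math

def p_ring_intact(N: int) -> float:
    """Probability that no node loses its entire successor list, assuming
    half the nodes fail independently and r = 2*log2(N) successors."""
    r = 2 * math.log2(N)
    p_node_breaks = 0.5 ** r            # = 1 / N**2
    return (1 - p_node_breaks) ** N     # ~ 1 - 1/N for large N

for N in (100, 10_000, 1_000_000):
    print(N, p_ring_intact(N), 1 - 1 / N)
```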

  25. Evolution of P2P Systems • Nodes leave frequently, so surviving nodes must be notified of arrivals to stay connected after their original neighbors fail • Take time t with N nodes • Doubling time: time from t until N new nodes join • Halving time: time from t until N nodes leave • Half-life: minimum of halving and doubling time • Theorem: there exists a sequence of joins and leaves such that any node that has received fewer than k notifications per half-life will be disconnected with probability at least (1 − 1/(e−1))^k ≈ 0.418^k Peer-to-Peer Networks — P. Felber

  26. Chord and Network Topology • Nodes that are numerically close are not topologically close (1M nodes ⇒ 10+ hops) Peer-to-Peer Networks — P. Felber

  27. Pastry (MSR) • Circular m-bit ID space for both keys and nodes (here m=8, b=2) • Addresses in base 2^b with m/b digits • Node ID = SHA-1(IP address), Key ID = SHA-1(key) • A key is mapped to the node whose ID is numerically closest to the key ID (Figure: ring with nodes N0002, N0201, N0322, N1113, N2001, N2120, N2222, N3001, N3033, N3200 and keys K0220, K1201, K1320, K2120, K3122) Peer-to-Peer Networks — P. Felber

  28. Pastry Lookup • Prefix routing from A to B • At the h-th hop, the message arrives at a node that shares a prefix of length at least h digits with B • Example: 5324 routes to 0629 via 5324 → 0748 → 0605 → 0620 → 0629 • If there is no such node, forward the message to a neighbor numerically closer to the destination (successor): 5324 → 0748 → 0605 → 0609 → 0620 → 0629 • O(log_{2^b} N) hops Peer-to-Peer Networks — P. Felber
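
A rough sketch of the routing rule (illustrative; real Pastry consults its routing-table rows and leaf set, and IDs here are treated as plain digit strings):

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pastry_next_hop(current: str, dest: str, known: list[str]) -> str:
    """Prefer a known node sharing a strictly longer prefix with the
    destination; otherwise fall back to the known node numerically
    closest to the destination."""
    p = shared_prefix_len(current, dest)
    better = [n for n in known if shared_prefix_len(n, dest) > p]
    if better:
        return max(better, key=lambda n: shared_prefix_len(n, dest))
    return min(known + [current], key=lambda n: abs(int(n, 16) - int(dest, 16)))

# First hop of the slide's example route from 5324 toward 0629:
print(pastry_next_hop("5324", "0629", ["0748", "3001", "2120"]))   # -> 0748
```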

  29. Pastry State and Lookup • For each prefix, a node knows some other node (if any) with the same prefix and a different next digit • For instance, N0201's routing table: • N-: N1???, N2???, N3??? • N0: N00??, N01??, N03?? • N02: N021?, N022?, N023? • N020: N0200, N0202, N0203 • When there are multiple candidate nodes, choose the topologically closest • Maintains good locality properties (more on that later) (Figure: lookup(K2120) is routed through nodes sharing longer and longer prefixes with the key, ending at N2120) Peer-to-Peer Networks — P. Felber

  30. A Pastry Routing Table • Example: b=2, so node IDs are in base 4; m=16 bits; local node ID 10233102 • Leaf set: the nodes numerically closest to the local node — MUST BE UP TO DATE • smaller: 10233033, 10233021, 10233001, 10233000; larger: 10233120, 10233122, 10233230, 10233232 • Routing table: m/b rows, 2^b−1 entries per row • Entries in the n-th row share their first n digits with the current node [ common-prefix | next-digit | rest ] • Entries in the m-th column have m as next digit • Entries with no suitable node ID are left empty • Neighborhood set: the nodes closest to the local node according to the proximity metric • e.g., 13021022, 10200230, 11301233, 31301233, 02212102, 22301203, 31203203, 33213321 Peer-to-Peer Networks — P. Felber

  31. Pastry and Network Topology • Expected node distance increases with the row number in the routing table • Smaller and smaller numerical jumps • Bigger and bigger topological jumps Peer-to-Peer Networks — P. Felber

  32. Joining • X (ID 0629) joins by contacting A (5324), a node “close” to X • The join message is routed toward the node numerically closest to X's ID: A (5324) → B (0748) → C (0605) → D (0620) • X builds 0629's routing table from the nodes along the path: row 0 from A (A0: ????), row 1 from B (B1: 0???), row 2 from C (C2: 06??), row 3 from D (062?) • X takes its leaf set from D and its neighborhood set from A Peer-to-Peer Networks — P. Felber

  33. Locality • The joining phase preserves the locality property • First: A must be near X • Entries in row zero of A's routing table are close to A, and A is close to X ⇒ X0 can be A0 • The distance from B to the nodes in B1 is much larger than the distance from A to B (B is in A0) ⇒ B1 can be a reasonable choice for X1, C2 for X2, etc. • To avoid cascading errors, X requests the state from each of the nodes in its routing table and updates its own with any closer node • This scheme works “pretty well” in practice • Minimizes the distance of the next routing step, with no sense of global direction • Stretch around 2–3 Peer-to-Peer Networks — P. Felber

  34. Node Departure • A node is considered failed when its immediate neighbors in the node ID space cannot communicate with it • To replace a failed node in the leaf set, a node contacts the live node with the largest index on the side of the failed node, and asks for its leaf set • To repair a failed routing table entry R_d^l, a node first contacts the node referred to by another entry R_i^l, i ≠ d, of the same row, and asks for that node's entry for R_d^l • If a member of the neighborhood set M is not responding, the node asks the other members for their M sets, checks the distance of each newly discovered node, and updates its own M set Peer-to-Peer Networks — P. Felber

  35. CAN (Berkeley) • Cartesian space (d-dimensional), here d=2 • The space wraps around: a d-torus • The space is split incrementally between the nodes that join • The node (cell) responsible for key k is determined by hashing k once per dimension: (h_x(k), h_y(k)) (Figure: insert(k, data) and retrieve(k) hash k to the same point of the 2-d space and thus reach the same cell) Peer-to-Peer Networks — P. Felber

  36. CAN State and Lookup • A node A only maintains state for its immediate neighbors (N, S, E, W) • 2d neighbors per node • Messages are routed to the neighbor that minimizes the Cartesian distance to the key's point • More dimensions mean faster routing, but also more state • (d·N^(1/d))/4 hops on average • Multiple choices: we can route around failures Peer-to-Peer Networks — P. Felber
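
Greedy CAN forwarding can be sketched as below, assuming the unit d-torus and representing each neighbor by its zone centre (an illustrative simplification):

```python
def torus_dist(p, q, size=1.0):
    """Cartesian distance on the d-torus (each coordinate wraps around)."""
    return sum(min(abs(a - b), size - abs(a - b)) ** 2 for a, b in zip(p, q)) ** 0.5

def can_next_hop(current, neighbors, key_point):
    """Greedy CAN routing: forward to the neighbor whose zone centre is
    closest to the key's point, if that improves on the current node."""
    best = min(neighbors, key=lambda n: torus_dist(n, key_point))
    return best if torus_dist(best, key_point) < torus_dist(current, key_point) else current

# d=2 example: the four N/S/E/W neighbors of the node owning the centre cell.
neighbors = [(0.5, 0.75), (0.5, 0.25), (0.75, 0.5), (0.25, 0.5)]
print(can_next_hop((0.5, 0.5), neighbors, (0.9, 0.6)))   # -> (0.75, 0.5), the eastern neighbor
```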

  37. CAN Landmark Routing • CAN nodes do not have a pre-defined ID • Nodes can be placed according to locality • Use a well-known set of m landmark machines (e.g., root DNS servers) • Each CAN node measures its RTT to each landmark • Orders the landmarks by increasing RTT: m! possible orderings • CAN construction • Place nodes with the same ordering close together in the CAN • To do so, partition the space into m! zones: m zones on x, m−1 on y, etc. • A node interprets its ordering as the coordinate of its zone Peer-to-Peer Networks — P. Felber
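
A small sketch of the landmark-ordering step, with made-up RTT values (the landmark names and numbers are illustrative only):

```python
from itertools import permutations

def landmark_ordering(rtts: dict[str, float]) -> tuple[str, ...]:
    """Order the landmarks by increasing measured RTT; the resulting
    permutation identifies the node's zone (one of m! possibilities)."""
    return tuple(sorted(rtts, key=rtts.get))

# Hypothetical RTTs (ms) from one node to three landmarks A, B, C:
print(landmark_ordering({"A": 42.0, "B": 17.5, "C": 80.3}))   # ('B', 'A', 'C')
print(len(list(permutations("ABC"))))                          # 3! = 6 zones
```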

  38. CAN and Network Topology • Use the m landmarks to split the space into m! zones • Nodes pick a random point within their zone • Topologically close nodes tend to be in the same zone (Figure: with landmarks A, B, C, the six zones correspond to the orderings A;B;C, A;C;B, B;A;C, B;C;A, C;A;B, C;B;A) Peer-to-Peer Networks — P. Felber

  39. Topology-Awareness • Problem • P2P lookup services generally do not take topology into account • In Chord/CAN/Pastry, neighbors are often not locally nearby • Goals • Provide small stretch: route packets to their destination along a path that mimics the router-level shortest-path distance • Stretch: DHT-routing / IP-routing • Our solution • TOPLUS (TOPology-centric Look-Up Service) • An “extremist design” for topology-aware DHTs Peer-to-Peer Networks — P. Felber

  40. TOPLUS Architecture • Group nodes into nested groups using IP prefixes: AS, ISP, LAN (an IP prefix is a contiguous address range of the form w.x.y.z/n) • Use the IPv4 address range (32 bits) for node IDs and key IDs • Assumption: nodes with the same IP prefix are topologically close (Figure: a tree of nested groups, Tier 0 through Tier 3, over the IP address space, with node n in a leaf group) Peer-to-Peer Networks — P. Felber

  41. Node State • Each node n is part of a series of telescoping sets H_i (with H_0 = I, the entire space) and their sibling sets S_i • Node n must know all up nodes in its inner group • Node n must know one delegate node in each tier-i set S ∈ S_i (Figure: telescoping sets H_0 ⊃ H_1 ⊃ H_2 ⊃ H_3 around node n, with siblings S_1, S_2, S_3 at tiers 1–3) Peer-to-Peer Networks — P. Felber

  42. Routing with the XOR Metric • To look up key k, node n forwards the request to the node in its routing table whose ID j is closest to k according to the XOR metric • With j = j_31 j_30 … j_0 and k = k_31 k_30 … k_0, d(j,k) = Σ_{i=0..31} |j_i − k_i|·2^i (i.e., j XOR k read as an integer) • Refinement of longest-prefix match • Note that the closest ID is unique: d(j,k) = d(j',k) ⇔ j = j' • Example (8 bits): k = 10010110 • j = 10110110 ⇒ d(j,k) = 2^5 = 32 • j' = 10001001 ⇒ d(j',k) = 2^4 + 2^3 + 2^2 + 2^1 + 2^0 = 31 Peer-to-Peer Networks — P. Felber
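
The example can be checked directly; the sketch below simply treats the XOR of two IDs as an integer distance:

```python
def xor_distance(j: int, k: int) -> int:
    """XOR metric: interpret j XOR k as an unsigned integer distance."""
    return j ^ k

k  = 0b10010110
j  = 0b10110110
j2 = 0b10001001
assert xor_distance(j, k) == 2 ** 5 == 32
assert xor_distance(j2, k) == 2**4 + 2**3 + 2**2 + 2**1 + 2**0 == 31
# j2 shares a longer prefix with k than j does, and is indeed closer (31 < 32).
```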

  43. Prefix Routing • Lookup: compute the 32-bit key k (using a hash function) • Perform a longest-prefix match of k against the entries in the routing table, using the XOR metric • Route the message to the node in the inner group with the closest ID (according to the XOR metric) (Figure: the lookup descends the tier hierarchy, at each step reaching a group whose prefix matches more of k) Peer-to-Peer Networks — P. Felber

  44. TOPLUS and Network Topology • Smaller and smaller numerical and topological jumps • Always move closer to the destination Peer-to-Peer Networks — P. Felber

  45. Group Maintenance • To join the system, a node n finds its closest node n’ • n copies the routing and inner-group tables of n’ • n modifies its routing table to satisfy a “diversity” property • Requires that the delegate nodes of n and n’ are distinct with high probability • Allows us to find a replacement delegate in case of failure • Upon failure, update the inner-group tables • Lazy update of routing tables • Membership tracking within groups (local, small) Peer-to-Peer Networks — P. Felber

  46. On-Demand Caching • Cache data in a group (ISP, campus) with prefix w.x.y.z/r • To look up k, create k’ = k with its first r bits replaced by w.x.y.z/r (the node responsible for k’ holds the cached copy) • Extends naturally to multiple levels (cache hierarchy) Peer-to-Peer Networks — P. Felber
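
A sketch of the key rewriting, assuming 32-bit keys and a hypothetical /16 group prefix (128.178.0.0/16 is only an example value):

```python
def cached_key(k: int, prefix: int, r: int, bits: int = 32) -> int:
    """Replace the first r bits of key k with the group prefix (w.x.y.z/r),
    so the lookup for k' resolves inside the local group's cache."""
    low_mask = (1 << (bits - r)) - 1          # keeps the low (bits - r) bits of k
    return (prefix << (bits - r)) | (k & low_mask)

# Hypothetical campus prefix 128.178.0.0/16 (r = 16):
prefix16 = (128 << 8) | 178
k = 0xDEADBEEF
print(hex(cached_key(k, prefix16, 16)))   # 0x80b2beef, i.e., 128.178.190.239
```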

  47. Measuring TOPLUS Stretch • Obtained prefix information from • BGP tables from Oregon and Michigan Universities • Routing registries from Castify, RIPE • Sample of 1000 different IP addresses • Point-to-point IP measurements using King • TOPLUS distance: weighted average of all possible paths between source and destination • Weights: probability of a delegate being in each group • TOPLUS stretch = TOPLUS distance / IP distance Peer-to-Peer Networks — P. Felber

  48. Results • Original Tree • 250,562 distinct IP prefixes • Up to 11 levels of nesting • Mean stretch: 1.17 • 16-bit regrouping (>16 → 16) • Aggregate small tier-1 groups • Mean stretch: 1.19 • 8-bit regrouping (>16 → 8) • Mean stretch: 1.28 • Original+1: add one level with 256 8-bit prefixes • Mean stretch: 1.9 • Artificial, 3-tier tree • Mean stretch: 2.32 Peer-to-Peer Networks — P. Felber

  49. TOPLUS Summary • Problems • Non-uniform ID space (requires bias in hash to balance load) • Correlated node failures • Advantages • Small stretch • IP longest-prefix matching allows fast forwarding • On-demand P2P caching straightforward to implement • Can be easily deployed in a “static” environment (e.g., multi-site corporate network) • Can be used as benchmark to measure speed of other P2P services Peer-to-Peer Networks — P. Felber

  50. Other Issues: Hierarchical DHTs • The Internet is organized as a hierarchy • Should DHT designs be flat? • Hierarchical DHTs: multiple overlays managed by possibly different DHTs (Chord, CAN, etc.) • First, locate the group responsible for the key in the top-level DHT • Then, find the peer in the next-level overlay, etc. • By designating the most reliable peers as super-nodes (part of multiple overlays), the number of hops can be significantly decreased • How can we deploy and maintain such architectures? Peer-to-Peer Networks — P. Felber
