
P2P Storage Systems: Basics, Networking, Routing, and Storage

This presentation covers the basics of peer-to-peer storage systems, including networking, routing, storage, and caching. It explores the roles of client, server, router, and cache, and highlights the desired characteristics of these systems: speed, fault tolerance, scalability, reliability, and good locality. It also provides an overview of Pastry, a decentralized, fault-resilient, and scalable P2P object location and routing scheme, along with the related systems PAST, Tapestry, and OceanStore.



Presentation Transcript


  1. Peer-to-Peer (P2P) Storage Systems
  CSE 581, Winter 2002
  Sudarshan “Sun” Murthy, smurthy@sunlet.net

  2. Papers
  • Rowstron A, Druschel P. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems
  • Rowstron A, Druschel P. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
  • Zhao B, et al. Tapestry: An infrastructure for fault-resilient wide-area location and routing
  • Kubiatowicz J, et al. OceanStore: An architecture for global-scale persistent storage

  3. P2P Storage Systems: Basics
  • Needs
    • Networking, routing, storage, caching
  • Roles
    • Client, server, router, cache
  • Desired characteristics
    • Fast, tolerant, scalable, reliable, good locality
  • Small world: keep the clique, and reach everything fast!

  4. Pastry: Claims
  • A generic P2P object location and routing scheme based on a self-organizing overlay network of nodes connected to the Internet
  • Features
    • Decentralized
    • Fault-resilient
    • Scalable
    • Reliable
    • Good route locality

  5. Pastry: 100K Feet View
  • Nodes interact with local applications
  • Each node has a unique 128-bit ID, written as digits in base 2^b
    • Usually a cryptographic hash of the node’s IP address
    • Node IDs are distributed in geography, etc.
  • Nodes route messages based on the “key”
    • To a node whose ID shares more prefix digits with the key
    • Or to a node with a numerically closer ID
  • Can usually route in fewer than log_{2^b}(N) hops
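A minimal sketch of how such an ID might be derived and split into base-2^b digits, assuming SHA-1 truncated to 128 bits (the slide only says a cryptographic hash of the IP address; the exact hash is an assumption):

    # Sketch (not the paper's code): derive a 128-bit Pastry-style nodeID from an
    # IP address and render it as base-2^b digits (b = 4 -> hex digits).
    import hashlib

    B = 4                                         # bits per digit; digits are base 2^B = 16

    def node_id(ip: str) -> int:
        """128-bit ID from a cryptographic hash of the IP address (assumption: SHA-1, truncated)."""
        digest = hashlib.sha1(ip.encode()).digest()
        return int.from_bytes(digest[:16], "big")   # keep 128 bits

    def digits(nid: int, b: int = B, length: int = 128) -> list[int]:
        """Split an ID into base-2^b digits, most significant first."""
        base = 1 << b
        n = length // b
        return [(nid >> (b * (n - 1 - i))) & (base - 1) for i in range(n)]

    print(digits(node_id("10.0.0.1"))[:8])        # first 8 hex digits of the ID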

  6. Pastry: Node State
  • Routing table (R)
    • log_{2^b}(N) rows with 2^b - 1 entries in each row (|R|)
    • Row n lists nodes that share the first n digits of this node’s ID
  • Neighborhood set (M)
    • Lists the |M| nodes that are closest according to the “proximity metric” (application defined)
  • Leaf set (L)
    • Lists the |L|/2 nodes with numerically closest smaller IDs and the |L|/2 nodes with numerically closest larger IDs
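A rough sketch of this per-node state as a data structure; the field names and types are mine, not the paper's:

    # Sketch of the state each Pastry node keeps (illustrative only).
    from dataclasses import dataclass, field

    @dataclass
    class PastryNodeState:
        node_id: int
        b: int = 4                                        # digits are base 2^b
        # Routing table R, keyed by (row, digit): row n holds up to 2^b - 1 nodes that
        # share the first n digits of node_id but differ in digit n.
        routing_table: dict[tuple[int, int], int] = field(default_factory=dict)
        # Neighborhood set M: the |M| nodes closest under the application's proximity metric.
        neighborhood_set: set[int] = field(default_factory=set)
        # Leaf set L: |L|/2 numerically closest smaller IDs and |L|/2 closest larger IDs.
        leaf_smaller: list[int] = field(default_factory=list)
        leaf_larger: list[int] = field(default_factory=list)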

  7. Pastry: Parameters
  • Numeric base of IDs (2^b, set by parameter b)
    • |R| = log_{2^b}(N) * (2^b - 1); max. hops = log_{2^b}(N)
    • b = 4 and N = 10^6 → |R| = 75, max. hops = 5
    • b = 4 and N = 10^9 → |R| = 105, max. hops = 7
  • Number of entries in the leaf set (|L|)
    • Entries in L are not sensitive to the “key”; entries in R could be
    • Usually |L| = 2^b or 2^(b+1)
    • Routing can fail if |L|/2 nodes with adjacent IDs fail simultaneously
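A quick check of the slide's arithmetic, using |R| ≈ log_{2^b}(N) * (2^b - 1) with the row count rounded to the nearest integer:

    # Worked check of the table-size and hop-count figures quoted above.
    import math

    def pastry_sizes(N: int, b: int = 4) -> tuple[int, float]:
        rows = math.log(N, 2 ** b)                # ≈ populated rows and expected hops
        return round(rows) * (2 ** b - 1), rows

    print(pastry_sizes(10 ** 6))   # ≈ (75, 4.98)  -> 75 entries, ~5 hops
    print(pastry_sizes(10 ** 9))   # ≈ (105, 7.47) -> 105 entries, ~7 hops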

  8. Pastry: Routing Algorithm
  • Check if the key falls within the range of IDs in L
    • Route to the node with the numerically closest ID
  • Otherwise, check R for a node whose ID shares a longer prefix with the key than this node does
    • Route to the node that shares the longer prefix
    • The entry may be empty, or the node may be unavailable
  • Otherwise, check L for a node with the same shared-prefix length
    • Route to a node whose ID is numerically closer to the key
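A minimal sketch of this three-case decision; it is illustrative and omits details of the paper's pseudocode (for instance, the exact handling of the rare third case):

    # Sketch of Pastry's next-hop choice (illustrative, not the paper's pseudocode).
    B, NDIGITS = 4, 32                                # 128-bit IDs as 32 base-2^B digits

    def digit(x: int, i: int) -> int:
        """i-th digit of x, counted from the most-significant end."""
        return (x >> (B * (NDIGITS - 1 - i))) & ((1 << B) - 1)

    def prefix_len(a: int, b: int) -> int:
        n = 0
        while n < NDIGITS and digit(a, n) == digit(b, n):
            n += 1
        return n

    def next_hop(key: int, me: int, leaf_set: set[int],
                 routing_table: dict[tuple[int, int], int]) -> int:
        """Pick the next node for `key`; returning `me` means this node handles the message."""
        # 1. Key within the leaf set's range: deliver to the numerically closest ID.
        if leaf_set and min(leaf_set) <= key <= max(leaf_set):
            return min(leaf_set | {me}, key=lambda n: abs(n - key))
        # 2. Otherwise use the routing-table entry for (shared prefix length, next digit of key).
        p = prefix_len(key, me)
        entry = routing_table.get((p, digit(key, p)))
        if entry is not None:
            return entry
        # 3. Rare case (empty entry or unreachable node): forward to any known node whose
        #    prefix match is at least as long and whose ID is numerically closer to the key.
        known = leaf_set | set(routing_table.values())
        closer = [n for n in known if prefix_len(key, n) >= p and abs(n - key) < abs(me - key)]
        return min(closer, key=lambda n: abs(n - key)) if closer else me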

  9. Pastry: Adaptation
  • Arriving nodes
    • New node X sends a “Join” message via node A; the message is routed toward X’s ID and arrives at node Z
    • X gets its initial L from Z, M from A, and the ith row of R from the ith node visited; X then sends its state to all nodes visited
  • Departing/failed nodes
    • Nodes periodically test connectivity of the entries in M
    • Nodes repair L and M using information from other nodes
    • Nodes repair R using entries at the same level from other nodes; they borrow entries at the next level if needed

  10. Pastry: Locality
  • Entries in R and M are chosen using a proximity metric defined by the application
    • Decisions are taken with local information only; no guarantee that the complete path is the shortest distance
    • Assumes the triangle inequality holds for distances
  • Misses nearby nodes with a different prefix
    • Estimates the density of node IDs in the ID space
    • Heuristically switches between modes to address the problem; details are (very) sketchy (Section 2.5)

  11. Pastry: Evaluation (1)
  • Number of routing hops (percentage probability)
    • 2 (1.6%), 3 (15.6%), 4 (64.5%), 5 (17.5%) (Fig. 5)
  • Effect of fewer routing entries compared to a network with complete routing tables
    • Routes at least 30% longer, at most 40% longer (Fig. 6)
    • 75 entries vs. 99,999 entries for a 100K-node network!
    • Experiments with only one set of parameter values!!
  • Ability to locate the closest among k nodes
    • Closest: 76%, top 2: 92%, top 3: 96% (Fig. 8)

  12. Pastry: Evaluation (2)
  • Impact of failures and repairs on route quality
    • Number of routing hops vs. node failure (Fig. 10)
    • 2.73 (no failure), 2.96 (no repair), 2.74 (with repair)
    • 5K-node network, 10% of nodes failing
    • Poor parameters used
  • Average cost of repairing failed nodes
    • 57 remote procedure calls per failed node
    • Seems expensive

  13. PAST: Pastry Application
  • Storage management system
    • Archival storage and content-distribution utility
    • No support for search, directory lookup, or key distribution
  • Nodes and files have uniformly distributed IDs
    • Replicas of files are stored at nodes whose IDs match the file IDs closely
    • Files may be encrypted
  • Clients retrieve files using the file ID as the key

  14. PAST: Insert Operation
  • Inserts a file at k nodes; returns a 160-bit file ID
    • The file ID is a secure hash of the file name, the owner’s public key, and some salt; the operation is aborted if an ID collision occurs
  • Copies of the file are stored on the k nodes whose IDs are closest to the 128 MSBs of the file ID
  • The required storage (k * file size) is debited against the client’s storage quota
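A sketch of file-ID creation and replica placement as described above; the exact SHA-1 encoding of name, key, and salt is an assumption, not the paper's wire format:

    # Illustrative fileId construction and replica placement for PAST.
    import hashlib, os

    def make_file_id(name: str, owner_pubkey: bytes, salt: bytes | None = None) -> tuple[int, bytes]:
        """Return a 160-bit fileId and the salt that produced it."""
        salt = os.urandom(8) if salt is None else salt
        digest = hashlib.sha1(name.encode() + owner_pubkey + salt).digest()
        return int.from_bytes(digest, "big"), salt

    def replica_nodes(file_id: int, node_ids: list[int], k: int) -> list[int]:
        """The k nodes whose 128-bit nodeIds are numerically closest to the fileId's 128 MSBs."""
        msb128 = file_id >> 32                    # drop the low 32 of the 160 bits
        return sorted(node_ids, key=lambda n: abs(n - msb128))[:k]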

  15. PAST: File Certificate (FC)
  • An FC is issued when the Insert operation starts
    • Contains the file ID, a hash of the file content, the replication factor k, ...
  • The FC is routed with the file contents using Pastry
    • Each node verifies the FC and the file, stores a copy of the file, attaches a store receipt, and forwards the message
  • The operation aborts if anything goes wrong
    • ID collision, invalid FC, corrupt file contents
    • Insufficient storage space
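An illustrative shape for the certificate; the field names are assumptions based on the bullets above, not the paper's definition:

    # Sketch of a file certificate and what storing nodes do with it.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FileCertificate:
        file_id: int              # 160-bit ID returned by Insert
        content_hash: bytes       # lets each storing node verify the file it received
        replication_factor: int   # k
        owner_signature: bytes    # issued with the owner's (smartcard) key

    # Each node on the insert route verifies the certificate and the content hash, stores a
    # copy, attaches a store receipt, and forwards the message; any mismatch aborts the insert.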

  16. PAST: Other Operations
  • Retrieve a file
    • Retrieves a copy of the file with the given ID from the first node that stores it
  • Reclaim storage
    • Reclaims storage allocated to the specified file
    • The client issues a Reclaim Certificate (RC) to prove ownership of the file
    • The RC is routed with the message to all nodes; storing nodes verify the RC and issue a Reclaim Receipt
    • The client uses the reclaim receipts to get storage credit
    • No guarantees about the state of reclaimed files

  17. PAST: Storage Management
  • Goals
    • Balance free space among nodes as utilization increases
    • Ensure that a file is stored at k nodes
    • Balance the number of files stored on nodes
  • Storage capacities of nodes cannot differ by more than two orders of magnitude
    • Nodes with capacity out of bounds are rejected
    • Large-capacity nodes can form a cluster

  18. PAST: Diversions (t_pri and t_div control diversion)
  • Replica diversion, if there is no space at a node
    • A node diverts a replica when file size / free space exceeds its threshold
    • Node A asks node B (from its L) to store the replica
    • A stores a pointer to B for that file; A must retrieve the file from somewhere if B fails (there must be k copies)!
    • Node C (from A’s L) also keeps a pointer to B for that file; this is useful to reach B if A fails, but C must be on the path
  • File diversion, if k nodes can’t store the file
    • Restart the Insert operation with a different salt (3 tries)
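A sketch of the diversion decision, assuming a node diverts when file size divided by free space exceeds its threshold, with the stricter t_div applied to diverted replicas (threshold values are those the evaluation slide reports):

    # Illustrative replica-diversion check for PAST.
    from dataclasses import dataclass

    @dataclass
    class Node:
        node_id: int
        free: int                               # bytes of free storage

    T_PRI, T_DIV = 0.1, 0.05                    # thresholds reported as good values later

    def accepts(file_size: int, node: Node, threshold: float) -> bool:
        """A node keeps the replica only if file size / free space stays at or under the threshold."""
        return node.free > 0 and file_size / node.free <= threshold

    def place_replica(file_size: int, primary: Node, leaf_neighbors: list[Node]):
        if accepts(file_size, primary, T_PRI):
            return primary                      # store locally
        for b in leaf_neighbors:                # replica diversion: ask a leaf-set node B
            if accepts(file_size, b, T_DIV):    # diverted replicas face the stricter threshold
                return b                        # the primary keeps a pointer to B for retrieval
        return None                             # no taker -> may trigger file diversion (new salt)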

  19. PAST: Caching
  • Caching is optional, at the discretion of nodes
    • A file routed through a node during lookup/insert may be cached
    • Each node visited during insert stores a copy; lookup returns the first copy found; what are we missing?
  • Based on the Greedy Dual-Size (GD-S) policy developed for caching in web proxies
    • If the cache is full, replace the file d with the least c(d)/s(d), where c = cost and s = size; if c(d) = 1, this replaces the largest file
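A sketch of the simplified eviction rule stated here; the full GD-S policy also maintains an aging term, which this sketch omits:

    # Evict the cached file with the smallest c(d)/s(d); with c(d) = 1 this evicts the largest file.
    def evict_order(cache: dict[str, int], cost: dict[str, float] | None = None) -> list[str]:
        """cache maps file id -> size in bytes; returns ids in eviction order (evict-first first)."""
        c = cost or {fid: 1.0 for fid in cache}   # c(d) = 1 unless the application says otherwise
        return sorted(cache, key=lambda fid: c[fid] / cache[fid])

    # Example: with unit costs the 5 MB file is evicted before the 1 MB file.
    print(evict_order({"fileA": 1_000_000, "fileB": 5_000_000}))   # ['fileB', 'fileA']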

  20. PAST: Security
  • Smartcards ensure the integrity of IDs and certificates
  • Store receipts ensure that k nodes cooperate
  • Routing table entries are signed and can be verified by other nodes
  • A malicious node can still cause problems
    • Choosing the next node at random might help somewhat

  21. PAST: Evaluation (1)
  • Basic storage management
    • With no diversions, 51.1% of insertions fail and global storage utilization is only 60.8%: we need storage management
    • |L| = 32, t_pri = 0.1, t_div = 0.05 are the optimal values found
  • Effect of storage management (Fig. 6)
    • 10% replica diversion at 80% utilization; 15% at 95%
    • Small-file insertions tend to fail after 80% utilization; 20% of file insertions tend to fail after 95% utilization
    • Results are worse with a file-system-style workload

  22. PAST: Evaluation (2)
  • Effect of the cache (Fig. 8)
    • Experiments use no caching, LRU, and GD-S; GD-S performs marginally better than LRU
    • Hard to know whether the results are good, since there is no comparison with other systems; it only shows that caching helps
  • What we did not see
    • Retrieval and reclaim performance; perhaps the number of hops for insertion
    • Overlay routing overhead; the effort spent caching

  23. PAST: Possible Improvements
  • Avoid replica diversion
    • Forward on to the next node if there is no space
    • May have to add a directory service to improve retrieval; a directory service could be useful anyway
  • Reduce replica diversion or the number of forwards
    • Add storage statistics to the routing table; use them to pick the next node
  • How to increase storage capacity?
    • Add masters (at least at the cluster level)
    • Will not be as P2P any more!?

  24. Tapestry: Claims
  • An overlay location and routing infrastructure for location-independent routing of messages directly to the closest copy of an object or service, using only point-to-point links and no centralized resources
  • Enhances the Plaxton distributed search technique to improve availability, scalability, and adaptation
  • More formally defined and better analyzed than Pastry’s techniques; a benefit of building on Plaxton

  25. Tapestry: 100K Feet View (compare with Pastry)
  • Nodes and objects have unique 160-bit IDs
  • Nodes route messages using the destination ID
    • To a node whose ID shares a longer suffix
    • Can route in fewer than log_b(N) hops
  • Objects are located by routing to a surrogate root
    • Servers publish their objects to surrogate roots; how objects get to servers is not a concern

  26. Tapestry: Node State
  • Neighbor map
    • log_b(N) levels (rows) with b entries at each level
    • Entry i at level j points to the closest node whose ID ends with digit i followed by the last j-1 digits of the current node’s ID
  • Back-pointer list
    • IDs of nodes that refer to this node as a neighbor
  • Object location pointers
    • Tuples of the form <Object ID, Node ID>
  • Hotspot monitor
    • Tuples of the form <Object ID, Node ID, Frequency>
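A rough sketch of these four pieces of state as a data structure; the names and types are illustrative, not Tapestry's implementation:

    # Sketch of the state each Tapestry node keeps.
    from dataclasses import dataclass, field

    @dataclass
    class TapestryNodeState:
        node_id: int
        b: int = 16                                       # numeric base of IDs
        # Neighbor map: log_b(N) levels with b entries each; the entry for digit i at level j
        # is the closest node whose ID ends with digit i followed by our last j-1 digits.
        neighbor_map: list[dict[int, int]] = field(default_factory=list)
        back_pointers: set[int] = field(default_factory=set)           # nodes listing us as a neighbor
        object_pointers: dict[int, int] = field(default_factory=dict)  # object ID -> node ID storing it
        hotspot_monitor: dict[int, tuple[int, int]] = field(default_factory=dict)  # object ID -> (node ID, frequency)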

  27. Tapestry: Parameters
  • Numeric base of IDs (b)
    • Entries = b * log_b(N); max. hops = log_b(N)
    • b = 16, N = 10^6 → entries = 80, max. hops = 5
    • b = 16, N = 10^9 → entries = 120, max. hops = 8

  28. Tapestry: Routing Algorithm
  • A message at the nth node shares at least n suffix digits with the nth node’s ID
    • Go to level n+1 of the neighbor map
    • Find the closest node whose ID shares those n suffix digits and matches the destination’s next digit
    • Route to the node found
    • If no such node is found, the current node must be the (or a) root node
  • The message may contain a predicate for choosing the next node, instead of just using the closest one
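A minimal sketch of the suffix-matching step; it is illustrative (the paper follows Plaxton's scheme and adds fault handling not shown here), and neighbor_map[n] below plays the role of level n+1 in the slide's numbering:

    # Sketch of Tapestry's next-hop choice via suffix matching.
    BASE_BITS, NDIGITS = 4, 40                    # 160-bit IDs as 40 base-16 digits

    def digit(x: int, i: int) -> int:
        """i-th digit of x counted from the least-significant end (suffix order)."""
        return (x >> (BASE_BITS * i)) & ((1 << BASE_BITS) - 1)

    def suffix_len(a: int, b: int) -> int:
        n = 0
        while n < NDIGITS and digit(a, n) == digit(b, n):
            n += 1
        return n

    def next_hop(dest: int, me: int, neighbor_map: list[dict[int, int]]) -> int | None:
        """neighbor_map[n] maps a digit to the closest node sharing our last n digits and having
        that digit next; returning None means this node is the (surrogate) root for dest."""
        n = suffix_len(dest, me)
        if n == NDIGITS:
            return None                           # message has reached dest itself
        return neighbor_map[n].get(digit(dest, n))   # resolve one more suffix digit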

  29. Tapestry: Surrogate Roots
  • Uses multiple surrogate roots
    • Avoids a single point of failure
    • A constant sequence of salts is added to create IDs; the resulting IDs are published, and each ID gets a potentially different surrogate root
  • Finding surrogate roots isn’t always easy
    • Neighbors are used to find nodes that share at least a digit with the object ID
    • This part of the paper isn’t very clear; work in progress?
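A sketch of publishing an object under several salted IDs; the hash and the particular salt sequence are assumptions, the slide only requires a constant sequence of salts:

    # Illustrative generation of multiple published IDs, one per salt.
    import hashlib

    def root_ids(object_id: bytes, n_roots: int = 4) -> list[int]:
        """One published ID per salt; each maps to a potentially different surrogate root."""
        return [int.from_bytes(hashlib.sha1(object_id + bytes([salt])).digest(), "big")
                for salt in range(n_roots)]

    # A server publishes <each published ID, its own node ID>; clients try the roots in turn,
    # so a single failed root does not make the object unreachable.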

  30. Tapestry: Adaptation (1) (adding new nodes can be expensive)
  • Arriving nodes
    • New node X sends a message addressed to its own ID through node A; the last node visited is the root node for X; X gets its initial ith level of the neighbor map from the ith node visited
    • X sends “hello” to its new neighbors
  • Departing/failed nodes
    • Heartbeats are sent in UDP packets using back pointers
    • Secondary neighbors are used when a neighbor fails
    • Failed neighbors get a second-chance period; they are marked valid again if they respond within this period

  31. Tapestry: Adaptation (2)
  • Uses introspective optimizations
    • Attempts to use runtime statistics to adapt
  • Network tuning
    • Uses pings to update network latency to neighbors
    • Optimizes neighbor maps if latency exceeds a threshold
  • Hotspot caching
    • Monitors the frequency of requests to objects
    • Advises the application of the need to cache

  32. Tapestry: Evaluation
  • Locality
    • The number of overlay hops is 2 to 4 times the number of network hops (Fig. 8); better when the ID base is larger (Fig. 18)
  • Effect of multiple roots
    • Latency decreases with more roots (Fig. 13), while the bandwidth used increases (Fig. 14)
  • Performance under stress
    • Better throughput (Fig. 15) and average response time (Fig. 16) than centralized directory servers at higher loads

  33. OceanStore: Tapestry Application
  • A storage management system with support for nomadic data, constructed from a possibly untrusted infrastructure
    • Proposes a business and revenue model
    • A goal is to support roughly 100 tera users
  • Uses Tapestry for networking (a recent change)
  • Promotes promiscuous caching to improve locality
    • Replicas are not tied to any particular server (floating replicas); this contradicts the Tapestry paper

  34. OceanStore: 100K Feet View
  • Objects are identified using GUIDs
    • Clients access objects using GUIDs as the destination ID
    • Objects may be servers, routers, data, directories, …
  • Many versions of an object might be stored
    • An update creates a new version of the object; the latest, updatable version is the “active” form, the others are “archival” forms; archival forms are encoded with an erasure code
  • Sessions guarantee consistency, ranging from loose to ACID
  • Supports read/write access control

  35. OceanStore: Updates (1)
  • A client initiates an update as a set of predicates combined with actions
    • A replica applies the actions associated with the first true predicate (commit); the update fails if all predicates fail (abort)
    • The update attempt is logged regardless
  • Replicas are not trusted with unencrypted information
    • Version and size comparisons are done on plaintext metadata; other operations must be done over ciphertext
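A sketch of the predicate/action update model described above; the predicate and action types are placeholders for illustration, not OceanStore's API:

    # Ordered (predicate, action) pairs: the first true predicate commits its action, else abort.
    from typing import Callable

    Predicate = Callable[[bytes], bool]
    Action = Callable[[bytes], bytes]

    def apply_update(obj: bytes, update: list[tuple[Predicate, Action]], log: list[str]) -> bytes:
        for predicate, action in update:
            if predicate(obj):
                log.append("commit")            # the attempt is logged either way
                return action(obj)              # commit: produce a new version of the object
        log.append("abort")
        return obj                              # abort: all predicates failed

    # Example: append a block only if the object is still at the expected size.
    log: list[str] = []
    new = apply_update(b"v1-data",
                       [(lambda o: len(o) == 7, lambda o: o + b"+block")],
                       log)
    print(new, log)                             # b'v1-data+block' ['commit']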

  36. OceanStore: Updates (2) (presenter’s aside: “I just consult references”)
  • Assumes a position-dependent block cipher
    • Compare, replace, insert, delete, and append blocks
  • Uses a fancy algorithm for searching within blocks

  37. OceanStore: Consistency Guarantee
  • Replica tiers are used to serialize and authorize updates
    • A small primary tier of replicas cooperates in a Byzantine agreement protocol
    • A larger secondary tier of replicas is organized into multicast tree(s)
  • A client sends updates to the network
    • All replicas apply the updates
    • The result from the primary tier is multicast back to the network; the version at other replicas is tentative until then

  38. OceanStore: Evaluation
  • A prototype was under development at the time the paper was published
    • The web site shows a prototype is out, but no statistics
  • Issues
    • Is there such a thing as “too untrusting”?
    • Risks of version proliferation
    • Access control needs work
    • A directory service squeezed in?

  39. Conclusions
  • Pastry and Tapestry
    • Somewhat similar in routing; Tapestry is more polished
    • Tapestry stores references, Pastry stores copies
  • PAST and OceanStore
    • OceanStore needs caching more than PAST does; storage management in PAST is a good idea but needs more work
    • No directory services in PAST; OceanStore has some
  • Third-party evaluation of these systems is needed
    • Research opportunity? Object people meet systems people

  40. References
  • Visit this URL for this presentation, the list of references, etc.: http://www.cse.ogi.edu/~smurthy/p2ps/index.html
