
DISTRIBUTED HASH TABLES Building large-scale, robust distributed applications

DISTRIBUTED HASH TABLES: Building large-scale, robust distributed applications. Frans Kaashoek, kaashoek@lcs.mit.edu. Joint work with: H. Balakrishnan, P. Druschel, J. Hellerstein, D. Karger, R. Karp, J. Kubiatowicz, B. Liskov, D. Mazières, R. Morris, S. Shenker, I. Stoica.


Presentation Transcript


  1. DISTRIBUTED HASH TABLES: Building large-scale, robust distributed applications. Frans Kaashoek, kaashoek@lcs.mit.edu. Joint work with: H. Balakrishnan, P. Druschel, J. Hellerstein, D. Karger, R. Karp, J. Kubiatowicz, B. Liskov, D. Mazières, R. Morris, S. Shenker, I. Stoica

  2. P2P: an exciting social development • Internet users cooperating to share, for example, music files • Napster, Gnutella, Morpheus, KaZaA, etc. • Lots of attention from the popular press: “The ultimate form of democracy on the Internet”; “The ultimate threat to copyright protection on the Internet” • Many vendors have launched P2P efforts

  3. What is P2P? • A distributed system architecture: • No centralized control • Nodes are symmetric in function • Typically many nodes, but unreliable and heterogeneous [Figure: several clients connected through the Internet]

  4. Traditional distributed computing: client/server • Successful architecture, and will continue to be so • Tremendous engineering necessary to make server farms scalable and robust [Figure: clients reaching a server through the Internet]

  5. Application-level overlays • One per application • Nodes are decentralized • NOC is centralized • P2P systems are overlay networks without central control [Figure: overlay nodes (N) at Sites 1-4 connected across ISP1, ISP2, and ISP3]

  6. (Potential) P2P advantages • Allow scalable incremental growth • Aggregate tremendous amounts of computation and storage resources • Tolerate faults or intentional attacks

  7. Example P2P problem: lookup • At the heart of all P2P systems [Figure: a publisher holds Key=“title”, Value=file data…; a client issues Lookup(“title”) and must find the right node among N1-N6 across the Internet]

  8. Centralized lookup (Napster) • Simple, but O(N) state and a single point of failure [Figure: the publisher at N4 (Key=“title”, Value=file data…) registers SetLoc(“title”, N4) with a central DB, and the client sends Lookup(“title”) to that DB; other nodes N1-N3 and N6-N9 are shown]

  9. Flooded queries (Gnutella) • Robust, but worst case O(N) messages per lookup [Figure: the client's Lookup(“title”) is flooded across nodes N1-N4 and N6-N9 toward the publisher at N4 (Key=“title”, Value=MP3 data…)]

  10. Another approach: distributed hash tables • Nodes are the hash buckets • Key identifies data uniquely • DHT balances keys and data across nodes • DHT replicates, caches, routes lookups, etc. [Figure: distributed applications call Lookup(key), which returns data, and Insert(key, data) on the distributed hash table layer, which spans many nodes]
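
The two-call interface on this slide can be illustrated with a toy, in-process sketch. Everything here is an assumption made for illustration: the class name `ToyDHT`, the node names, and the modulo placement rule (the lookup schemes discussed later in the talk place keys differently).

```python
import hashlib

class ToyDHT:
    """In-process stand-in for a DHT: each 'node' is just a dict (a hash bucket)."""

    def __init__(self, node_names):
        # Hypothetical node names; a real DHT would use network addresses.
        self.nodes = {name: {} for name in node_names}
        self._order = sorted(self.nodes)

    def _node_for(self, key: bytes) -> str:
        # Simplest possible placement (an assumption): hash the key and
        # take the result modulo the number of nodes.
        digest = int.from_bytes(hashlib.sha1(key).digest(), "big")
        return self._order[digest % len(self._order)]

    def insert(self, key: bytes, data: bytes) -> None:
        self.nodes[self._node_for(key)][key] = data

    def lookup(self, key: bytes):
        # Returns None when the key was never inserted.
        return self.nodes[self._node_for(key)].get(key)

dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.insert(b"title", b"file data")
print(dht.lookup(b"title"))
```

The application never names a node: it only supplies a key, which is the location independence the slide points at.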

  11. Why DHTs now? • Demand pulls: growing need for security and robustness; large-scale distributed apps are difficult to build; many applications use location-independent data • Technology pushes: bigger, faster, and better (every PC can be a server); scalable lookup algorithms are available; trustworthy systems from untrusted components

  12. DHT is a good interface • Supports a wide range of applications, because few restrictions: keys have no semantic meaning; value is application dependent • Minimal interface [Figure: side-by-side interfaces. DHT: lookup(key) → data, insert(key, data). UDP/IP: Send(IP address, data), Receive(IP address) → data]

  13. DHT is a good shared infrastructure • Applications inherit some security and robustness from DHT: DHT replicates data; resistant to malicious participants • Low-cost deployment: self-organizing across administrative domains; can be shared among applications • Large scale supports Internet-scale workloads

  14. DHTs support many applications • File sharing [CFS, OceanStore, PAST, …] • Web cache [Squirrel, …] • Censorship-resistant stores [Eternity, FreeNet, …] • Event notification [Scribe] • Naming systems [ChordDNS, INS, …] • Query and indexing [Kademlia, …] • Communication primitives [I3, …] • Backup store [HiveNet] • Web archive [Herodotus] • Data is location-independent

  15. Cooperative read-only file sharing • DHT is a robust block store • Client of DHT implements file system [Figure: the file system calls Lookup(key), which returns a block, and insert(key, block) on the distributed hash table layer, which spans many nodes]

  16. File representation: self-authenticating data • DHT key for block is SHA-1(content block) • File and file systems form Merkle hash trees [Figure: file system tree with a signed root block (key=995), directory blocks (with entry “a.txt”, ID=144), an i-node block, and data blocks; child keys such as 431, 144, and 901 are labeled as SHA-1 hashes]
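
A minimal sketch of self-authenticating blocks, assuming a plain dict named `store` stands in for the DHT and a newline-separated list of keys stands in for an i-node block (both are simplifications, not the file system layout from the talk). Because a block's key is the SHA-1 of its content, any reader can check a fetched block against the key it asked for.

```python
import hashlib

store = {}  # stands in for the DHT: hex SHA-1 key -> block bytes

def put_block(block: bytes) -> str:
    """Store a block under the SHA-1 of its content and return that key."""
    key = hashlib.sha1(block).hexdigest()
    store[key] = block
    return key

def get_block(key: str) -> bytes:
    """Fetch a block and verify that it hashes back to the requested key."""
    block = store[key]
    if hashlib.sha1(block).hexdigest() != key:
        raise ValueError("block does not match its key")
    return block

# Two data blocks plus one index block that lists their keys (a tiny hash tree).
data_keys = [put_block(b"hello "), put_block(b"world")]
index_key = put_block("\n".join(data_keys).encode("ascii"))

def read_file(index_key: str) -> bytes:
    """Walk the tree from the index block, verifying every block on the way."""
    keys = get_block(index_key).decode("ascii").split("\n")
    return b"".join(get_block(k) for k in keys)

print(read_file(index_key))
```

Only the key of the top block has to be trusted; every block beneath it is checked against a hash stored in its parent.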

  17. DHT distributes blocks by hashing IDs • DHT replicates blocks for fault tolerance • DHT caches popular blocks for load balance [Figure: signed root blocks 995 (key=901, key=732) and 247 (key=407, key=992, key=705), with blocks 732, 705, 407, 901, and 992 spread over Nodes A-D across the Internet]

  18. Historical web archiver • Goal: make and archive a daily checkpoint of the Web • Estimates: Web is about 57 Tbyte, compressed HTML+img; new data per day: 580 Gbyte; 128 Tbyte per year with 5 replicas • Design: 12,810 nodes, with 100 Gbyte disk each and 61 Kbit/s per node

  19. Implementation using DHT • DHT usage: • Crawler distributes crawling load by hash(URL) • Crawler inserts Web pages by hash(URL) • Client retrieves Web pages by hash(URL) • DHT replicates data for fault tolerance [Figure: the crawler calls Insert(sha-1(URL), page) and Insert(sha-1(URL), URL), and the client calls Lookup(URL), on the distributed hash table layer, which spans many nodes]

  20. Backup store • Goal: back up on other users’ machines • Observations: many user machines are not backed up; backup requires significant manual effort; many machines have lots of spare disk space • Using DHT: Merkle tree to validate integrity of data; administrative and financial costs are less for all participants; backups are robust (automatic off-site backups); blocks are stored once, if key = sha1(data)

  21. Research challenges • Scalable lookup • Balance load (flash crowds) • Handling failures • Coping with systems in flux • Network-awareness for performance • Robustness with untrusted participants • Programming abstraction • Heterogeneity • Anonymity • Goal: simple, provably good algorithms [Slide annotation: “this talk”]

  22. 1. Scalable lookup • Map keys to nodes in a load-balanced way • Hash keys and nodes into a string of digits • Assign key to “closest” node • Forward a lookup for a key to a closer node • Insert: lookup + store • Join: insert node in ring • Examples: CAN, Chord, Kademlia, Pastry, Tapestry, Viceroy, … [Figure: circular ID space with nodes N32, N60, N90, N105 and keys K5, K20, K80]
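
A small sketch of the key-to-node mapping on a circular ID space. It assumes that “closest” means the first node at or after the key going clockwise (the successor rule Chord uses); other systems on this slide define closeness differently. The 16-bit ID space and the helper names are illustrative choices.

```python
import bisect
import hashlib

ID_BITS = 16  # small ID space for readability (an assumption, not from the talk)

def ring_id(name: str) -> int:
    """Hash an arbitrary name (key or node address) onto the ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)

def successor(node_ids, key_id):
    """First node ID at or after key_id, wrapping around the ring.

    node_ids must be sorted in ascending order.
    """
    i = bisect.bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]

# Node and key IDs taken from the slide's figure.
nodes = [32, 60, 90, 105]
for key in (5, 20, 80):
    print(f"K{key} -> N{successor(nodes, key)}")
```

When a node joins or leaves, only the keys between it and its neighbor on the ring change owner, which is what keeps joins cheap.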

  23. Chord’s routing table: fingers [Figure: fingers of node N80 pointing ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]

  24. Lookups take O(log(N)) hops • Lookup: route to closest predecessor [Figure: Lookup(K19) routed around a ring of N5, N10, N20, N32, N60, N80, N99, N110 toward key K19]
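
The finger-based lookup can be simulated in a few lines. This is a sketch under stated assumptions: a 7-bit ID space, global knowledge of the node list (a real node only knows its own fingers), and the node IDs from the slide's figure. It is not the Chord implementation itself.

```python
import bisect

M = 7                                      # ID bits (assumed): 2**7 = 128 ring positions
RING = 2 ** M
NODES = [5, 10, 20, 32, 60, 80, 99, 110]   # node IDs from the slide's figure

def successor(ident):
    """First node at or after ident, wrapping around the ring."""
    i = bisect.bisect_left(NODES, ident % RING)
    return NODES[i % len(NODES)]

def fingers(n):
    """Finger i of node n points at the successor of n + 2**i."""
    return [successor(n + 2 ** i) for i in range(M)]

def lookup(start, key):
    """Return (node responsible for key, nodes that forwarded the query)."""
    n, path = start, [start]
    while True:
        if key == n:
            return n, path
        succ = successor(n + 1)
        # If key lies in (n, succ], the immediate successor is responsible.
        if 0 < (key - n) % RING <= (succ - n) % RING:
            return succ, path
        # Otherwise forward to the farthest finger that still precedes key.
        for f in reversed(fingers(n)):
            if 0 < (f - n) % RING < (key - n) % RING:
                n = f
                break
        path.append(n)

owner, path = lookup(80, 19)
print(owner, path)
```

Starting at N80, the query for K19 is forwarded through N5 and N10 before N20 is identified as the responsible node. Each hop uses the largest finger that does not overshoot the key, which is what keeps hop counts logarithmic in typical rings.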

  25. CAN: exploit d dimensions • Each node is assigned a zone • Nodes are identified by zone boundaries • Join: choose random point, split its zone [Figure: unit square from (0,0) to (1,1) split into zones (0,0.5, 0.5,1), (0.5,0.5, 1,1), (0.5,0.25, 0.75,0.5), (0.75,0, 1,0.5), and (0,0, 0.5,0.5)]

  26. Routing in 2 dimensions • Routing is navigating a d-dimensional ID space • Route to closest neighbor in direction of destination • Routing table contains O(d) neighbors • Number of hops is O(d·N^(1/d)) [Figure: the same unit square of zones, with corners (0,0), (1,0), (0,1)]
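
To see where the O(d·N^(1/d)) hop count comes from, consider an idealized model in which N = side^d equal zones form a d-dimensional grid that wraps around at the edges. The grid model and the function name `route` are assumptions for illustration; real CAN zones are uneven because of random splits.

```python
import itertools

def route(src, dst, side):
    """Greedy route on a d-dimensional wrap-around grid.

    Moves one zone at a time toward dst along each axis, going the
    short way around, and returns the list of zones visited.
    """
    cur = list(src)
    path = [tuple(cur)]
    for axis in range(len(cur)):
        while cur[axis] != dst[axis]:
            forward = (dst[axis] - cur[axis]) % side
            step = 1 if forward <= side - forward else -1
            cur[axis] = (cur[axis] + step) % side
            path.append(tuple(cur))
    return path

# Worst case over all destinations for d = 3 and side = 4 (N = 64 zones).
worst = max(len(route((0, 0, 0), dst, 4)) - 1
            for dst in itertools.product(range(4), repeat=3))
print(worst)
```

Each axis contributes at most side/2 hops and side = N^(1/d), so the worst case in this model is d·N^(1/d)/2 hops, while each zone keeps only 2d neighbors.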

  27. 2. Balance load • Hash function balances keys over nodes • For popular keys, cache along the path [Figure: Lookup(K19) on a ring of N5, N10, N20, N32, N60, N80, N99, N110, with copies of K19 along the lookup path]

  28. Why Caching Works Well • Only O(log N) nodes have fingers pointing to N20 • This limits the single-block load on N20 [Figure: ring highlighting N20]

  29. 3. Handling failures: redundancy • Each node knows IP addresses of next r nodes • Each key is replicated at next r nodes [Figure: ring of N5, N10, N20, N32, N40, N60, N80, N99, N110 with three copies of K19]
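
A sketch of replica placement on the ring from the slide's figure. It assumes the r replica holders are the key's successor and the nodes immediately after it; whether the successor itself counts among the r is not stated on the slide. Function names are hypothetical.

```python
import bisect

NODES = [5, 10, 20, 32, 40, 60, 80, 99, 110]  # ring from the slide's figure

def replica_nodes(key, r):
    """The r nodes that follow key on the ring, starting with its successor."""
    start = bisect.bisect_left(NODES, key)
    return [NODES[(start + i) % len(NODES)] for i in range(r)]

def lookup_with_failures(key, r, failed):
    """Return the first live replica holder, or None if all r have failed."""
    for node in replica_nodes(key, r):
        if node not in failed:
            return node
    return None

print(replica_nodes(19, 3))
print(lookup_with_failures(19, 3, failed={20}))
```

With r = 3, key 19 stays reachable unless N20, N32, and N40 all fail before the data is copied to further nodes.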

  30. Lookups find replicas • Tradeoff between latency and bandwidth [Kademlia] [Figure: Lookup(K19) proceeds in numbered steps 1-4 around a ring of N5, N10, N20, N40, N50, N60, N68, N80, N99, N110 until it reaches a replica of K19]

  31. 4. Systems in flux • Lookup takes log(N) hops if the system is stable. But the system is never stable! • What we desire are theorems of the type: • In the almost-ideal state, … log(N) … • System maintains almost-ideal state as nodes join and fail

  32. Half-life [Liben-Nowell 2002] • Doubling time: time for N joins • Halving time: time for N/2 old nodes to fail • Half-life: MIN(doubling time, halving time) [Figure: a system of N nodes in which N new nodes join or N/2 old nodes leave]
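
As a worked example of the definition, the sketch below computes a half-life from constant join and failure rates. Constant rates and the function name `half_life` are simplifying assumptions; the definition itself makes no such assumption.

```python
def half_life(n_nodes, joins_per_hour, failures_per_hour):
    """Half-life in hours, assuming constant join and failure rates."""
    doubling_time = n_nodes / joins_per_hour           # time for N joins
    halving_time = (n_nodes / 2) / failures_per_hour   # time for N/2 old nodes to fail
    return min(doubling_time, halving_time)

# 1000 nodes, 100 joins/hour, 100 failures/hour:
# doubling takes 10 hours, halving takes 5 hours.
print(half_life(1000, 100, 100))
```

With equal join and failure rates the halving time dominates, since only N/2 departures are needed against N arrivals.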

  33. Applying half-life • For any node u in any P2P network: if u wishes to stay connected with high probability, then, on average, u must be notified about Ω(log N) new nodes per half-life • And so on, …

  34. 5. Optimize routing to reduce latency • Nodes close on ring, but far away in Internet • Goal: put nodes in routing table that result in few hops and low latency [Figure: ring with N20, N40, N41, N80]

  35. “Close” metric impacts choice of nearby nodes • Chord’s numerical closeness and finger table restrict choice • Prefix-based routing allows for choice • Kademlia’s metric offers choice in nodes and places nodes in absolute order: close(a, b) = XOR(a, b) [Figure: ring with nodes N06, N32, N60, N103, N105 and key K104, each node labeled with its region (USA, Europe, or Far East)]
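
The XOR metric is easy to try on the IDs from the slide's figure. This sketch treats IDs as small integers; the helper names are hypothetical.

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance: bitwise XOR of the two IDs."""
    return a ^ b

def closest_nodes(target, node_ids, k):
    """The k node IDs nearest to target under the XOR metric."""
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:k]

nodes = [6, 32, 60, 103, 105]   # node IDs from the slide's figure
print(closest_nodes(104, nodes, 2))
```

For a fixed target, distinct IDs always give distinct XOR distances, so every node sees the same total order of candidates. The distance is also symmetric, unlike clockwise distance on a ring.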

  36. Neighbor set • From k nodes, insert nearest node with appropriate prefix in routing table • Assumption: triangle inequality holds [Figure: ring with nodes N06, N32, N60, N103, N105 and key K104, each node labeled with its region (USA, Europe, or Far East)]

  37. Finding k near neighbors • Ping random nodes • Swap neighbor sets with neighbors • Combine with random pings to explore • Provably-good algorithm to find nearby neighbors based on sampling [Karger and Ruhl 02]

  38. 6. Malicious participants • Attacker denies service: flood DHT with data • Attacker returns incorrect data [detectable]: self-authenticating data • Attacker denies data exists [liveness]: bad node is responsible, but says no; bad node supplies incorrect routing info; bad nodes make a bad ring, and good node joins it • Basic approach: use redundancy

  39. Sybil attack [Douceur 02] • Attacker creates multiple identities • Attacker controls enough nodes to foil the redundancy • Need a way to control creation of node IDs [Figure: ring of N5, N10, N20, N32, N40, N60, N80, N99, N110]

  40. One solution: secure node IDs • Every node has a public key • Certificate authority signs public key of good nodes • Every node signs and verifies messages • Quotas per publisher

  41. Another solution: exploit practical Byzantine protocols • A core set of servers is pre-configured with keys and performs admission control • The servers achieve consensus with a practical Byzantine recovery protocol [Castro and Liskov ’99 and ’00] • The servers serialize updates [OceanStore] or assign secure node IDs [Configuration service] [Figure: ring with nodes N06, N32, N60, N103, N105 and a core set of servers (N)]

  42. A more decentralized solution: weak secure node IDs • ID = SHA-1(node’s IP address) • Assumption: attacker controls limited IP addresses • Before using a node, challenge it to verify its ID
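
A minimal sketch of the ID check, assuming the verifier already knows the address at which the node actually answered. The challenge exchange itself (for example a nonce sent to that address) is not modeled, and the function names and example addresses are hypothetical.

```python
import hashlib

def node_id(ip_address: str) -> str:
    """Weak secure node ID: the SHA-1 of the node's IP address, in hex."""
    return hashlib.sha1(ip_address.encode("ascii")).hexdigest()

def verify_node(claimed_id: str, observed_ip: str) -> bool:
    """Accept a node only if its claimed ID is the hash of the address it answered from."""
    return node_id(observed_ip) == claimed_id

# Addresses from the documentation range 192.0.2.0/24, used purely as examples.
honest = node_id("192.0.2.10")
print(verify_node(honest, "192.0.2.10"))   # claimed ID matches the address
print(verify_node(honest, "192.0.2.99"))   # same ID claimed from another address
```

Under this scheme an attacker can hold only as many IDs as it has reachable IP addresses, which is why the slide's assumption about limited addresses matters.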

  43. Using weak secure node IDs • Detect malicious nodes • Define verifiable system properties: each node has a successor; data is stored at its successor • Allow querier to observe lookup progress: each hop should bring the query closer; cross-check routing tables with random queries • Recovery: assume limited number of bad nodes; quota per node ID
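
One of the verifiable properties, that each hop brings the query closer, can be checked by the querier as sketched below. The sketch assumes a 7-bit ring and clockwise distance to the key, and that `path` lists only the nodes that forwarded the query (not the final successor, which lies just past the key). Names are hypothetical.

```python
RING = 2 ** 7   # 7-bit ID space (an assumption for this sketch)

def clockwise_distance(a, b):
    """Distance from a to b going clockwise around the ring."""
    return (b - a) % RING

def check_progress(path, key):
    """Index of the first hop that fails to get closer to key, or None if all do."""
    for i in range(1, len(path)):
        if clockwise_distance(path[i], key) >= clockwise_distance(path[i - 1], key):
            return i
    return None

print(check_progress([80, 5, 10], 19))   # every hop moves closer to key 19
print(check_progress([80, 5, 99], 19))   # the hop to node 99 moves away
```

A hop that fails the check gives the querier a reason to distrust the node that suggested it and to retry through a different neighbor.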

  44. 7. Programming abstraction • Blocks versus files • Database queries (join, etc.) • Mutable data (writers) • Atomicity of DHT operations

  45. Philosophical questions • How decentralized should systems be? • Gnutella versus content distribution network • Have a bit of both? (e.g., OceanStore) • Why does the distributed systems community have more problems with decentralized systems than the networking community? • “A distributed system is a system in which a computer you don’t know about renders your own computer unusable” • Internet (BGP, NetNews)

  46. What are we doing at MIT? • Building a system based on Chord • Applications: CFS, Herodotus, Melody, Backup store, R/W file system, … • Collaborate with other institutions • P2P workshop • Big ITR • Building a large-scale testbed • RON, PlanetLab

  47. Summary • Once we have DHTs, building large-scale distributed applications is easy • Single, shared infrastructure for many applications • Robust in the face of failures and attacks • Scalable to a large number of servers • Self-configuring across administrative domains • Easy to program • Let’s build DHTs …. stay tuned ….
