
Module BSI: Big-Data Storage and Processing Infrastructures

This course explores the evolution of distributed systems and the advancements in peer-to-peer (P2P) technology. Topics include distributed hash tables, gossip protocols, and big data applications. The course also covers the advantages and challenges of P2P systems, such as scalability, fault tolerance, and load balancing. Students will gain an understanding of how to decentralize and parallelize systems for improved performance, availability, and resource aggregation.


Presentation Transcript


  1. Module BSI: Big-Data Storage and Processing Infrastructures Gabriel Antoniu, KERDATA & Davide Frey, WIDE INRIA

  2. Administrivia • Gabriel Antoniu, DR INRIA, KERDATA Team gabriel.antoniu@inria.fr • Davide Frey, CR INRIA, WIDE Team davide.frey@inria.fr • Course Website: http://bsi-eit.irisa.fr/

  3. Exam • Select a research paper • Home assignment: write a report on the paper • Oral exam: give a talk

  4. Half the Course in one slide • Historical Perspective: Peer-to-Peer -> Grid -> Cloud -> Edge • Basic Technologies: Distributed Hash Tables, Gossip Protocols • Non-Big-Data Applications: Multicast, Video Streaming • Big-Data Applications: Key-Value Stores, Recommendation Systems, Private Machine Learning

  5. From Peer-to-Peer to the Edge

  6. Distributed System • Definition [Tan95]: « A collection of independent computers that appears to its users as a single coherent system » • [Diagram: distributed applications over a distributed-system layer (software: single system image), over local OSes and networked autonomous computers (hardware)]

  7. Why should we decentralize/parallelize? • Economical reasons • Performance • Availability • Resource aggregation • Flexibility (load balancing) • Privacy • Growing need to work collaboratively, sharing and aggregating geographically distributed resources

  8. How to decentralize? • Goals: scalability, reliability, availability, security/privacy • Tools (some): the client-server model, the peer-to-peer model, grid computing, the cloud

  9. The Peer-To-Peer Model • Name one peer-to-peer technology

  10. Peer-to-Peer Systems

  11. The Peer-to-Peer Model • End-nodes become active components! • previously they were just clients • Nodes participate, interact, contribute to the services they use. • Harness huge pools of resources accumulated in millions of end-nodes.

  12. P2P Application Areas • Content: file sharing; information management (discover, aggregate, filter) • Collaboration: instant messaging, shared whiteboards, co-review/edit/author, gaming • Storage: network storage, caching, replication • CPU: Internet/intranet distributed computing, grid computing • Bandwidth: content distribution, collaborative download, edge services, VoIP

  13. Grid Computing • Collection of computing resources on a LAN or WAN • Appears as a large virtual computing system • Focus on high computational capacity • May use some P2P technologies but can exploit “central” components

  14. Cloud Computing • Provide Service-level computing • On-demand instances • On-demand provisioning • May exploit • Grids • Clusters • Virtualization

  15. Edge Computing • Ever used Dropbox to send a file to your deskmate? • Augment Cloud with edge of the network • Bring back P2P ideas into the cloud model

  16. Back to P2P • Use resources at the edge of the Internet: storage, CPU cycles, bandwidth, content • Collectively produce services • Nodes share both benefits and duties • Irregularities and dynamics become the norm • Essential for large-scale distributed systems

  17. Main Advantages of P2P • Scalable: higher demand -> higher contribution! • Increased (massive) aggregate capacity • Utilize otherwise wasted resources • Fault tolerant: no single point of control; replication makes it possible to withstand failures • Inherently handle dynamic conditions

  18. Main Challenges in P2P • Fairness and Load Balancing • Dynamics and Adaptability • Fault-Tolerance: Continuous Maintenance • Self-Organization

  19. Key Concept: Overlay Network • [Diagram: overlay network linking nodes A, B, C on top of the physical network]

  20. Overlay Types • Unstructured P2P: any two nodes can establish a link; the topology evolves at random • Structured P2P: the topology reflects desired properties of linked nodes and is strictly determined by node IDs

  21. Distributed Hash Tables

  22. Hash Table • Efficient information lookup • [Diagram: a hash function maps keys to hash values, which index the stored data]

  23. Distributed Hash Table (DHT) • Store <key,value> pairs • Efficient access to a value given a key • Must route hash keys to nodes • Operations: insert(k,v), lookup(k) • [Diagram: pairs k1,v1 … k6,v6 spread over the nodes and reached by routing]
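
To make the abstraction concrete, here is a minimal Python sketch of a DHT's interface, assuming a toy modulo mapping from hash values to nodes (the ToyDHT class and the node names are illustrative, not from the slides; a real DHT replaces the modulo step with the decentralized overlay routing discussed on the next slides):

    import hashlib

    class ToyDHT:
        """Toy DHT: keys are hashed and each hash value is mapped to
        one of the participating nodes."""

        def __init__(self, nodes):
            self.nodes = nodes                      # list of node names
            self.store = {n: {} for n in nodes}     # per-node local storage

        def _node_for(self, key):
            h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
            return self.nodes[h % len(self.nodes)]  # map hash -> node

        def insert(self, key, value):
            self.store[self._node_for(key)][key] = value

        def lookup(self, key):
            return self.store[self._node_for(key)].get(key)

    dht = ToyDHT(["nodeA", "nodeB", "nodeC"])
    dht.insert("k1", "v1")
    print(dht.lookup("k1"))   # -> 'v1'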

  24. Distributed Hash Table • Insert and lookup send messages to keys: send(m,k) • The P2P overlay defines a mapping between keys and physical nodes • Decentralized routing implements this mapping • Operations: insert(k,v), lookup(k) • [Diagram: keys k1…k6 mapped onto the nodes of the P2P overlay network]

  25. DHT Examples

  26. Pastry (MSR/Rice) • NodeId = 128 bits • Nodes and keys are placed in a linear space (ring) • Mapping: a key is associated with the node whose nodeId is numerically closest to the key • [Diagram: node ids and keys on the id-space ring]

  27. Pastry (MSR/Rice) • Naming space: ring of 128-bit integers; nodeIds chosen at random; identifiers are a sequence of digits in base 16 • Key/node mapping: a key is associated with the node whose nodeId is numerically closest to it • Routing table: matrix of 128/4 rows and 16 columns; routeTable(i,j) holds a nodeId matching the current node's identifier up to level i, with digit i equal to j • Leaf set: the 8 or 16 numerically closest neighbors in the naming space • Proximity metric: biases the selection of nodes
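
A small illustrative snippet of the key/node mapping, assuming plain integer ids and a circular (wraparound) reading of numeric closeness; the names are mine, not from the slides:

    # A key is stored on the node whose nodeId is numerically closest
    # to it on the 128-bit ring.
    RING = 2 ** 128

    def ring_distance(a, b):
        d = abs(a - b)
        return min(d, RING - d)          # account for wraparound

    def responsible_node(key, node_ids):
        return min(node_ids, key=lambda n: ring_distance(n, key))

    print(responsible_node(10, [5, 100, 2**127]))   # -> 5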

  28. Pastry: Routing Table (node 65a1fcx) • [Diagram: rows 0 to 3 of the routing table; log16 N rows in total]
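
The constraint on each slot of the table can be sketched as follows, assuming ids are lowercase hex strings (fits_slot is a hypothetical helper name, not from the slides):

    # routeTable(i, j) may hold any node whose hex id shares the first
    # i digits with the local id and whose digit i is j.
    def fits_slot(local_id, candidate_id, i, j):
        digit = format(j, 'x')
        return (candidate_id[:i] == local_id[:i]   # same first i digits
                and candidate_id[i] == digit       # digit i equals j...
                and local_id[i] != digit)          # ...unlike the local id

    print(fits_slot("65a1fc", "65a2b0", 3, 2))     # -> True
    print(fits_slot("65a1fc", "65b2b0", 3, 2))     # -> False (prefix breaks)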

  29. Pastry: Routing Properties • log16 N routing hops • Size of the maintained state (routing table): O(log N) • [Diagram: route from 65a1fc to key d46a1c via d13da3, d4213f, d462ba, d467c4, d471f1]

  30. Routing Algorithm (on node A) • [Slide shows the routing pseudocode: the leaf set is checked first, then the routing table; see the sketch below]
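
A Python sketch of the next-hop choice, simplified from the Pastry paper; it assumes ids are fixed-length lowercase hex strings (so string order matches numeric order), leaf_set is a set, and route_table is the rows-by-columns matrix from slide 27:

    def shared_prefix_len(a, b):
        # Number of leading hex digits that a and b have in common.
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def num_dist(a, b):
        return abs(int(a, 16) - int(b, 16))

    def next_hop(local_id, key, leaf_set, route_table):
        # 1. Key inside the leaf set's range: deliver to the numerically
        #    closest node (possibly the local node itself).
        if leaf_set and min(leaf_set) <= key <= max(leaf_set):
            return min(leaf_set | {local_id}, key=lambda n: num_dist(n, key))
        # 2. Routing table: forward to an entry sharing one more hex
        #    digit with the key than the local node does.
        i = shared_prefix_len(local_id, key)
        entry = route_table[i][int(key[i], 16)]
        if entry is not None:
            return entry
        # 3. Rare fallback: any known node numerically closer to the key
        #    that shares at least an i-digit prefix with it.
        known = {n for row in route_table for n in row if n} | set(leaf_set)
        closer = [n for n in known if shared_prefix_len(n, key) >= i
                  and num_dist(n, key) < num_dist(local_id, key)]
        return min(closer, key=lambda n: num_dist(n, key), default=None)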

  31. Pastry Example (b=4) • Route to key 311 • [Diagram: nodes 023, 100, 132, 232, 282, 310, 313, 321, 333 on the ring]

  32.–35. Pastry Example (b=4, continued) • [Animation: the same figure, advancing one routing hop per slide toward key 311]

  36. Node Departure • Explicit departure or failure • Replacement of a node: the leaf set of the closest node in the leaf set contains the closest new node not yet in the leaf set • Update from the leaf-set information • Update the application

  37. Failure Detection • Detected when immediate neighbours in the name space (leaf set) can no longer communicate • Detected when a contact fails during routing; routing then uses an alternative route

  38. Fixing the Routing Table of A • Repair: contact another node from the same row and ask for its entry for the failed slot (the mechanism described in the Pastry paper)

  39. State Maintenance • Leaf set: aggressively monitored and fixed • Routing table: lazily repaired when a hole is detected during routing; periodic gossip-based maintenance

  40. Reducing Latency • Random assignment of nodeIds: nodes that are numerically close are geographically (topologically) distant • Objective: fill the routing table with nodes so that routing hops are as short (latency-wise) as possible • Topological metric: latency • [Diagram: nodes d467c4, d467f5, 6fdacd at different network distances]

  41. Exploiting Locality in Pastry • Neighbours are selected based on a network proximity metric: the topologically closest node satisfying the constraints of the routing table • routeTable(i,j): nodeId matching the current nodeId up to level i, with digit i equal to j • Nodes in the top rows of the routing table are close; entries in the bottom rows are farther away
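
A sketch of that selection rule, reusing the hypothetical fits_slot helper from the slide-28 sketch; latency() stands in for a real network measurement and is purely illustrative:

    def best_entry(local_id, candidates, i, j, latency):
        # Keep only candidates that legally fit slot (i, j)...
        legal = [c for c in candidates if fits_slot(local_id, c, i, j)]
        # ...and among those prefer the topologically closest one.
        return min(legal, key=latency, default=None)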

  42. Proximity Routing in Pastry • [Diagram: the route 65a1fc -> d13da3 -> d4213f -> d462ba -> d467c4 -> d471f1 toward key d46a1c, shown both in the topological space and on the naming-space ring]

  43. Locality • A joining node X asks A to route a join message toward X's id • Path A, B, … -> Z, where Z is numerically closest to X • X initializes row i of its routing table with the contents of row i of the routing table of the i-th node encountered on the path (see the sketch below) • Improving the quality of the routing table: X asks each node in its routing table for its own routing state and compares distances • Gossip-based update for each row (every ~20 min): an entry is chosen at random in the routing table, the corresponding row of that entry is sent, potential candidates are evaluated, better candidates replace existing ones, and new nodes are gradually integrated
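
A sketch of the row-copying step only (real Pastry also takes Z's leaf set and further filters entries); tables are plain lists of rows, and the function name is mine:

    # The i-th node on the join route shares at least i digits with X,
    # so its row i contains exactly the kind of entries X needs there.
    def init_routing_table(x_table, join_path_tables):
        for i, hop_table in enumerate(join_path_tables):
            if i < len(x_table):
                x_table[i] = list(hop_table[i])   # copy row i from hop i
        return x_table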

  44. Node Insertion in Pastry • New node: d46a1c • [Diagram: the join message routed from 65a1fc toward key d46a1c, shown in both the topological space and the naming space]

  45. Performance • Routing latency is 1.59 times that of direct IP, on average

  46. References • A. Rowstron and P. Druschel, "Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems", Middleware'2001, Germany, November 2001.

  47. Content-Addressable Network (CAN), UCB/ACIRI • Virtual-coordinate Cartesian space shared among peers • Each node is responsible for a part of the space (its zone) • Abstraction: CAN stores data at specific points in the space and routes information from one point of the space to another (DHT functionality) • A point is associated with the node that owns the zone in which the point lies

  48. Space Organisation in CAN • D-dimensional name space • Routing: progression within the space towards the destination • [Diagram: node zones and a key in the coordinate space]
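
For comparison with Pastry's prefix routing, here is a self-contained sketch of CAN's greedy routing step, assuming simple rectangular zones (the Zone class and all names are illustrative, not from the slides):

    import math
    from dataclasses import dataclass

    @dataclass
    class Zone:
        lo: tuple   # lower corner, one coordinate per dimension
        hi: tuple   # upper corner

        def contains(self, p):
            return all(l <= x < h for l, x, h in zip(self.lo, p, self.hi))

        def center(self):
            return tuple((l + h) / 2 for l, h in zip(self.lo, self.hi))

    def can_next_hop(my_zone, neighbor_zones, target):
        # Greedy step: stop if this node's zone owns the target point,
        # else forward to the neighbor whose zone center is closest
        # (Euclidean distance) to the target.
        if my_zone.contains(target):
            return None                    # this node owns the point
        return min(neighbor_zones,
                   key=lambda z: math.dist(z.center(), target))

    # Example in a 2-D space [0,1) x [0,1):
    me = Zone((0.0, 0.0), (0.5, 0.5))
    nbrs = [Zone((0.5, 0.0), (1.0, 0.5)), Zone((0.0, 0.5), (0.5, 1.0))]
    print(can_next_hop(me, nbrs, (0.9, 0.1)))   # -> the right-hand zone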

  49. CAN: Example

  50. CAN: Example
