
Searching and Data Sharing in P2P Systems

Searching and Data Sharing in P2P Systems. Beng Chin Ooi Department of Computer Science National University of Singapore ooibc@comp.nus.edu.sg www.comp.nus.edu.sg/~ooibc. Acknowledgement. A few ppt slides are borrowed/adapted from Hellerstein’s group and his vldb-04 tutorial slides


Presentation Transcript


  1. Searching and Data Sharing in P2P Systems Beng Chin Ooi Department of Computer Science National University of Singapore ooibc@comp.nus.edu.sg www.comp.nus.edu.sg/~ooibc

  2. Acknowledgement • A few ppt slides are borrowed/adapted from Hellerstein’s group and his vldb-04 tutorial slides • Some are screen dumps as examples

  3. What is P2P? • Client-Server Architecture • Peer-to-Peer Architecture

  4. P2P Systems? • Effective use of the Internet: connected PCs/workstations participate directly in the Internet • Sites are autonomous • Similar functionalities and responsibilities • Each peer consumes and serves • Resources are distributed

  5. Driving Forces • Main driving forces: • Exploiting existing resources • Computational efficiency is not the main goal • Sharing costs among users • Autonomy • Anonymity • Legal protection

  6. P2P Systems “A class of applications that takes advantage of resources like storage, CPU cycles, content and even human presence available at the edges of the Internet” -- Clay Shirky, an investment advisor

  7. P2P Applications Groove P2P Messenger SETI Folding@home Upriser freenet

  8. Properties of P2P Applications? • Dynamic and Self-Organizing • Enduring • Resilient • Collaborative

  9. P2P Future • Aberdeen Group’s prediction: • US$930 million by end 2004 • From US$20.6 at end of 2000 • Standardization • NPI (New Productivity Initiative) • Peer-to-Peer Working Group (P2PWG) • NAT, Taxonomy, Security, File Services, Interoperability

  10. Overlay Networks • P2P applications need to: • Track identities & (IP) addresses of peers • May be many! • May have significant churn • Best not to have n² ID references • Route messages among peers • If you don’t keep track of all peers, this is “multi-hop” • This is an overlay network • Peers are doing both naming and routing • IP becomes “just” the low-level transport • All the IP routing is opaque • Control over naming and routing is powerful • And as we’ll see, brings networks into the database era

  11. Infecting the Network, Peer-to-Peer • The Internet is hard to change. • But Overlay Nets are easy! • P2P is a wonderful “host” for infecting network designs • The “next” Internet is likely to be very different • “Naming” is a key design issue today • Querying and data independence key tomorrow? • Don’t forget: • The Internet was originally an overlay on the telephone network • There is no money to be made in the bit-shipping business • A modest goal for DB research: • Don’t query the Internet.

  12. The Evolution of P2P systems • First generation – centralized P2P systems • E.g. Napster, SETI@home • Second generation –decentralized & unstructured P2P systems • E.g. Gnutella • Third generation—structured P2P systems • DHT systems (CAN/Chord/Pastry/Tapestry) • Skip-list based systems • ….

  13. Unstructured P2P Systems • P2P with Central Servers • P2P with fully Autonomous Peers (pure p2p) • P2P with Superpeers (SuperNodes)

  14. Unstructured Centralized P2P Systems -- Napster • [Figure: peer A asks the Directory Server “Who has X?”; the server answers “B has X”; A sends “Get X” to B; B replies with X] • Searching is efficient, with only a few messages exchanged • Non-scalable, a central point of failure
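
The message exchange on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Napster's actual protocol: the class names `DirectoryServer` and `Peer`, and the methods `publish`, `who_has`, `share`, and `fetch`, are hypothetical names chosen for the example.

```python
class DirectoryServer:
    """Central index mapping a file name to the peers that share it."""

    def __init__(self):
        self.index = {}  # file name -> set of peer ids

    def publish(self, peer_id, file_name):
        self.index.setdefault(file_name, set()).add(peer_id)

    def who_has(self, file_name):
        return sorted(self.index.get(file_name, set()))


class Peer:
    def __init__(self, peer_id, server):
        self.peer_id = peer_id
        self.server = server
        self.files = {}  # file name -> content

    def share(self, file_name, content):
        self.files[file_name] = content
        self.server.publish(self.peer_id, file_name)

    def fetch(self, file_name, peers):
        """Ask the server who has the file, then download from that peer."""
        holders = self.server.who_has(file_name)    # "Who has X?" / "B has X"
        if not holders:
            return None
        return peers[holders[0]].files[file_name]   # "Get X" / "Reply with X"


server = DirectoryServer()
peers = {name: Peer(name, server) for name in ("A", "B")}
peers["B"].share("X", b"contents of X")
result = peers["A"].fetch("X", peers)
```

The lookup costs one request and one reply regardless of network size, which is why searching is cheap; every lookup also passes through `server`, which is why the design has a single point of failure.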

  15. Harnessing Idle CPU Cycles – SETI@HOME

  16. Unstructured Fully Decentralized -- Gnutella • Searching is inherently flooding (unscalable); • Time-to-Live(TTL) is used to partially address this problem;
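
A rough sketch of TTL-limited flooding, assuming a small hypothetical overlay (`graph`) and a set of peers holding the file (`holders`). It simplifies real Gnutella: a peer forwards to all of its neighbors, including the one the query came from, and duplicates are simply dropped on arrival.

```python
from collections import deque


def flood_search(graph, holders, start, ttl):
    """Return (peers that answered, number of query messages sent)."""
    seen = {start}
    queue = deque([(start, ttl)])
    hits, messages = [], 0
    while queue:
        node, remaining = queue.popleft()
        if node in holders:
            hits.append(node)
        if remaining == 0:
            continue  # TTL expired: do not forward any further
        for neighbor in graph[node]:
            messages += 1              # every forwarded copy costs a message
            if neighbor not in seen:   # duplicates are dropped by the receiver
                seen.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return hits, messages


# Hypothetical five-peer overlay; only E holds the requested file.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
holders = {"E"}

print(flood_search(graph, holders, "A", 2))  # E is three hops away: no hit
print(flood_search(graph, holders, "A", 3))  # larger TTL finds E, at more cost
```

A small TTL caps the message count but can miss files that exist just beyond the horizon, which is the sense in which TTL only partially addresses the scalability problem.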

  17. Techniques for improving search in Gnutella-like Network • Expanding Ring; • Random Walks; • Good Peer; • Local indices; • Routing indices;
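
Two of the listed techniques, expanding ring and random walks, can be illustrated as follows. This is a sketch under assumptions: the function names, the overlay `graph`, and the single-walker random walk are choices made for the example, not details given in the slides.

```python
import random
from collections import deque


def within_ttl(graph, start, ttl):
    """Peers reachable from start in at most ttl hops (breadth-first)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == ttl:
            continue
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return set(dist)


def expanding_ring(graph, holders, start, max_ttl):
    """Retry the flood with TTL 1, 2, ... and stop at the first hit."""
    for ttl in range(1, max_ttl + 1):
        hits = within_ttl(graph, start, ttl) & holders
        if hits:
            return ttl, sorted(hits)
    return None, []


def random_walk(graph, holders, start, max_steps, rng):
    """Forward the query to one random neighbor per step, not to all."""
    node = start
    for step in range(max_steps + 1):
        if node in holders:
            return step, node
        node = rng.choice(graph[node])
    return None, None


# Hypothetical overlay; only E holds the requested file.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
ring_result = expanding_ring(graph, {"E"}, "A", 5)
walk_result = random_walk(graph, {"E"}, "A", 50, random.Random(7))
```

Expanding ring pays for small floods first and only widens the search when nothing is found nearby; a random walk sends one message per step, trading response time for far less traffic.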

  18. Freenet

  19. Worst Case for Freenet • Peer F has the requested file, but the query never reaches it because of a poor routing decision made at Peer D, so the query is not matched. In this case, the query is rerouted along an alternate path

  20. Unstructured P2P with Supernodes • Combine the benefits of centralized and decentralized search; • Take advantage of the heterogeneity of peer capabilities;

  21. Morpheus Supernode Layer

  22. What is Grid? “A hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities” -- Ian Foster & Carl Kesselman, 1998 “Sharing environment implemented via the deployment of a persistent, standards-based service infrastructure that supports the creation of, and resource sharing within distributed communities” -- Ian Foster & Adriana Iamnitchi, 2003

  23. A basic concept in Grid -- “Virtual Organization”

  24. The evolution of Grid Systems • First generation systems involved proprietary solutions for sharing high performance computing resources; e.g. Condor • Second generation systems introduced middleware to cope with scale and heterogeneity, with a focus on large scale computational power and large volumes of data; e.g. Globus, EU DataGrid • Third generation systems adopt a service-oriented approach and a more holistic view of the e-Science infrastructure, are metadata-enabled and may exhibit autonomic features. • Open Grid Services Architecture (OGSA)

  25. P2P vs. Grid --similarities • Both P2P and Grid address the same problem, share the same goal • Resource sharing within distributed resources. • Both offer promising paradigms for developing distributed systems and applications

  26. P2P vs. Grid --differences • Resources • Grid– higher-end resources, better connected with high levels of availability • P2P– edge level devices, intermittently connected with highly variable availability

  27. P2P vs. Grid --differences • Services • Dependent on the nature of communities • Eg 1. Resource Discovery • Grid—very well structured and stable network making this less of an issue • P2P—unstable network • Eg 2. Security • Grid—authentication, authorization, accountability • P2P—anonymity, censorship resistance

  28. P2P vs. Grid --differences • Infrastructure • Grid – more emphasis in standardization, interoperability • P2P – little emphasis, no interoperability • Applications • Grid – large range of applications, more computation and data intensive • P2P – more social-based, less computation and data intensive

  29. P2P vs. Grid --differences • Scalability • Grid– Most services, such as resource discovery, are mainly based on centralized or hierarchical models • P2P– Most P2P systems are decentralized

  30. P2P vs. Grid --summary • Grid needs to do more to address decentralization, self-organization, fault tolerance, and scalability, which are strong points of P2P. • P2P should put more effort into standard infrastructure and provide more services. • The P2P model could help to ensure Grid scalability • The two technologies are likely to converge (grid + structured p2p)

  31. Data sharing in P2P systems • Provide only file-level sharing and lack content-based search • coarse granularity of information sharing. • Lack extensibility and flexibility • no easy and rapid means to expand applications • A node’s neighbors are typically statically defined • difficult to utilize network bandwidth and optimize system performance

  32. Relational data sharing in Unstructured P2P vs. Distributed DB

  33. P2P & DB Systems DB P2P Taken from Hellerstein’s group ppt

  34. P2P + DB = ? • P2P Database? No! • ACID transactional guarantees do not scale, nor does the everyday user want ACID semantics • Much too heavyweight of a solution for the everyday user • Query Processing on P2P! • Both P2P and DBs do data location and movement • Can be naturally unified (lessons in both directions) • P2P brings scalability & flexibility • DB brings relational model & query facilities Taken from Hellerstein’s group ppt

  35. Many New Challenges • Relative to other parallel/distributed systems • Partial failure • Churn • Few guarantees on transport, storage, etc. • Huge optimization space • Network bottlenecks & other resource constraints • No administrative organizations • Trust issues: security, privacy, incentives • Relative to IP networking • Much higher function, more flexible • Much less controllable/predictable

  36. Some Proposals on Data Sharing… • Database: • Data Mapping (SIGMOD’03) • Piazza (ICDE’03) • PeerDB (ICDE’03) • … • IR: • PlanetP (HPDC’03) • SummaryIndex (TKDE’04 special issue on P2P) • …

  37. The Birth of BestPeer… • Started in 1998 • To steal storage and CPU cycles from staff machines • To provide a virtual and parallelised content-based document retrieval system • To be able to move processes from one PC to another quickly when users need the PC back • Extended to P2P in early 2000 • VC showed interest in the project • W.S. Ng, B. C. Ooi and K.L. Tan: BestPeer: A self configurable peer-to-peer system. ICDE’2002.

  38. BestPeer Network • BestPeer is a generic P2P system designed to serve as a platform on which P2P applications can be developed easily and efficiently • Integrates mobile agents with P2P technologies • Each participant runs BestPeer software • Provides communication facilities and shares resources with other peers • Provides an environment in which agents can reside and perform their tasks

  39. BestPeer Network cont… • Large # of peers, small # of LIGLOs; • Each node holds two types of data: private data and sharable data; • New node registration: • Register with a LIGLO • Obtain a unique BPID from the LIGLO. • The LIGLO sends a list of (BPID, IP) pairs of peers that the node can communicate with directly. • The node is ready to communicate with other peers.

  40. BestPeer Network cont… • Node Rejoins: • Send node’s current IP to LIGLO • For each peer of the node, p, send p’s BPID to its registered LIGLO • p’s registered LIGLO will reply with the IP of p if it is currently connected to the network • Node has rejoined
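
The registration and rejoin steps can be sketched as below. This is an illustration under assumptions, not BestPeer's implementation: the `Liglo` class, its method names, the `refresh_peers` helper, and the `L1-1` style BPID format are all hypothetical.

```python
import itertools


class Liglo:
    """Hypothetical LIGLO server: issues BPIDs and tracks members' current IPs."""

    def __init__(self, name):
        self.name = name
        self._counter = itertools.count(1)
        self.current_ip = {}  # BPID -> IP, or None while the member is offline

    def register(self, ip):
        """Issue a new BPID and return it with the members currently online."""
        bpid = f"{self.name}-{next(self._counter)}"
        online = [(b, addr) for b, addr in self.current_ip.items()
                  if addr is not None]
        self.current_ip[bpid] = ip
        return bpid, online

    def rejoin(self, bpid, new_ip):
        self.current_ip[bpid] = new_ip

    def leave(self, bpid):
        self.current_ip[bpid] = None

    def lookup(self, bpid):
        return self.current_ip.get(bpid)


def refresh_peers(peer_bpids, liglo_of):
    """On rejoin, ask each peer's own LIGLO for that peer's current IP."""
    fresh = {}
    for bpid in peer_bpids:
        ip = liglo_of[bpid].lookup(bpid)
        if ip is not None:  # only peers connected right now
            fresh[bpid] = ip
    return fresh


liglo = Liglo("L1")
alice, _ = liglo.register("10.0.0.1")
bob, bob_peers = liglo.register("10.0.0.2")
liglo.leave(bob)
liglo.rejoin(alice, "10.0.0.9")   # alice reconnects with a different IP
liglo.rejoin(bob, "10.0.0.7")     # bob reports his own new IP first
bob_view = refresh_peers([alice], {alice: liglo})
```

Because peers are remembered by BPID rather than by IP, `bob` can find `alice` again even though her address changed while he was offline.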

  41. BestPeer Network cont… • Access Data from other nodes: • Propagation broadcast • Node with matching result will respond to initiating node directly • Two modes to access data: • Phase 1: Node with matching answer will return the result directly or Node with matching answer will only indicate that they have the information • Phase 2: The initiating node will then send a further message to some, if not all, of these nodes to obtain desired information
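
The two access modes can be illustrated with a toy in-memory network. The sketch assumes each peer holds a list of text records and that the initiator, in the second mode, picks the peer announcing the most matches; the function names and that selection rule are assumptions made for the example.

```python
def broadcast(peers, predicate, return_data):
    """Phase 1: every peer with a match replies directly to the initiator."""
    replies = {}
    for name, records in peers.items():
        matches = [r for r in records if predicate(r)]
        if matches:
            # Either ship the data now, or only announce how many matches exist.
            replies[name] = matches if return_data else len(matches)
    return replies


def fetch_selected(peers, predicate, chosen):
    """Phase 2: the initiator asks only the peers it selected for their data."""
    return {name: [r for r in peers[name] if predicate(r)] for name in chosen}


def is_p2p(record):
    return "p2p" in record


# Hypothetical peers and their sharable records.
peers = {
    "P1": ["p2p survey", "grid tutorial"],
    "P2": ["p2p search", "p2p caching"],
    "P3": ["database tuning"],
}

direct = broadcast(peers, is_p2p, return_data=True)       # answers come back at once
announced = broadcast(peers, is_p2p, return_data=False)   # only "I have matches"
best = max(announced, key=announced.get)
fetched = fetch_selected(peers, is_p2p, [best])
```

Announcing first costs an extra round trip but lets the initiator avoid pulling large results from every matching peer.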

  42. Reconfigurable BestPeer Network • A node in the BestPeer network can dynamically reconfigure itself by keeping the peers that benefit it most. • Based on the assumption that peers that benefit a node most for a query are most likely to provide the greatest gain for subsequent queries. • Every node controls the maximum number of direct peers it can have

  43. Reconfigurable BestPeer Network cont… • BestPeer applies an autonomous strategy, where each node tries to keep promising peers as close as possible with no information exchange between peers. • BestPeer provides two default reconfiguration strategies: • MaxCount • Maximizes the number of objects a node can obtain from its directly connected peers. • MinHops • Minimizes the number of hops that a node needs to travel
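
A possible reading of the two strategies, expressed as ranking functions over per-peer statistics from the last query. The `stats` layout and the function names are hypothetical, and the interpretation of MinHops (promote the answering peers that were farthest away, so that later queries reach them in one hop) is an assumption rather than something the slides state.

```python
def max_count(stats, k):
    """Keep the k peers that returned the most objects."""
    ranked = sorted(stats, key=lambda p: (-stats[p]["objects"], p))
    return ranked[:k]


def min_hops(stats, k):
    """Keep the k answering peers that were farthest away (assumed reading)."""
    answering = [p for p in stats if stats[p]["objects"] > 0]
    ranked = sorted(answering, key=lambda p: (-stats[p]["hops"], p))
    return ranked[:k]


# Hypothetical results of one query: objects returned and hop distance per peer.
stats = {
    "P1": {"objects": 12, "hops": 1},
    "P2": {"objects": 3, "hops": 4},
    "P3": {"objects": 7, "hops": 2},
    "P4": {"objects": 0, "hops": 5},
}

print(max_count(stats, 2))  # ['P1', 'P3']
print(min_hops(stats, 2))   # ['P2', 'P3']
```

Here `k` stands for the node's self-imposed limit on direct peers; both functions use only locally observed results, in line with the autonomous strategy described above.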

  44. Location-Independent Global Names Lookup Server (LIGLO) • To facilitate identification of a single node that may have different IP addresses on different occasions • A LIGLO is a node that has a fixed IP and runs LIGLO software • LIGLO: • Generates BestPeer Global Identity (BPID) • Maintains each peer’s current status • LIGLO applies a distributed approach: each LIGLO only needs to maintain its own members’ names

  45. Features of BestPeer • Combines the power of agent technology and P2P technology in a single system • Supports a finer granularity of data sharing, and sharing of computational power • Facilitates dynamic reconfiguration of BestPeer network • Adopts a distributed approach to minimize bottlenecks of servers acting as LIGLO

  46. Integration of Mobile Agent and P2P Technologies • P2P technologies provide resource sharing capabilities among nodes; Mobile Agent further extends the functionalities • Java-based Agent System • BestPeer Search Agent vs. Traditional Search Agent: • (Trad) Predefined itinerary vs. Auto and transparent • TTL / Hops based lifetime • Result/Cost-based lifespan

  47. PeerDB • PeerDB is built on top of BestPeer • Four components that are integrated and implemented on the application layer. • Data management system • Facilitates storage, manipulation and retrieval of the data • MySQL as the backend for supporting SQL query facility • Local Dictionary • Metadata stored in Local Dictionary • Export Dictionary • Metadata sharable to other nodes • Cache Manager • Caching remote data in secondary storage • Caching/replacement policy • B.C. Ooi, K.L. Tan, A. Zhou, C.H. Goh, Y.G. Li, C.Y. Liau, B. Ling, W.S. Ng, Y. Shu, X.Y. Wang, M. Zhang: PeerDB: Peering into Personal Databases. SIGMOD’2003, Demo. • W.S. Ng, B. C. Ooi, K.L. Tan, A. Zhou: PeerDB: A P2P-based System for Distributed Data Sharing. ICDE’2003
