
Information Retrieval on P2P Networking

Information Retrieval on P2P Networking. Willie Yang, November 2004. 1. Information Retrieval. What is information retrieval? Select and return to the user desired documents from a large set of documents in accordance with criteria specified by the user【Salton, 1989】


Presentation Transcript


  1. Information Retrieval on P2P Networking • Willie Yang • November 2004

  2. 1. Information Retrieval

  3. What is Information Retrieval? • Select and return to the user desired documents from a large set of documents in accordance with criteria specified by the user【Salton, 1989】 • Basic model【Belkin, 1992】

  4. Research Issues • 1. Format, Source, Type • 2. Indexing • 3. Query Expression • 4. Query Model • 5. Ranking • 6. Feedback

  5. More about Information Retrieval • Concepts related to searching - Browsing - Filtering • Technologies related to IR - Information extraction - Question answering - Classification

  6. 2. Peer-to-peer Networking

  7. What is Peer-to-peer Networking? • Peer-to-peer is a way of structuring distributed applications such that the individual nodes have symmetric roles. Rather than being divided into clients and servers, each with quite distinct roles, in P2P applications a node may act as both a client and a server.【IETF/IRTF 2004】

  8. Characteristics of P2P (1) • Multiple peers participate in the network • The number of roles is small. • The number of peers is typically large. • Every peer owns some resources and pays for its participation by providing access to them. • Distributed, decentralized, no distinguished roles • Autonomous, self-controlled, ad hoc participation. • Dynamic (e.g. peers come and go freely) • Relies very little on the underlying infrastructure. → Peers do most things on their own.

  9. Characteristics of P2P (2) • Difference from distributed computing • More dynamic (nodes join or leave, rather than merely fail or not) • Larger number of nodes • Difference from distributed databases or grid computing • No centralized mechanism (e.g. integrator or dispatcher) • Research highlights • Resource sharing • Autonomy • Load balancing

  10. When the P2P paradigm is introduced...

  11. 3. Information Retrieval on P2P Networking

  12. Search on Unstructured P2P • Query: where is X? • Example: Gnutella • Solution: broadcasting + TTL • Constraint: the search is not guaranteed to find X • Research topics - Exploring strategies - Linking strategies - Routing strategies
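
The broadcasting + TTL idea on this slide can be sketched as a breadth-first flood in which every hop uses up one unit of TTL. This is a minimal illustration, not Gnutella's actual protocol: the function name `flood_search`, the `neighbors` and `holdings` dictionaries, and the four-peer chain are all assumptions made for the example.

```python
from collections import deque

def flood_search(neighbors, holdings, start, target, ttl):
    """Flood a query from `start`; each hop decrements the TTL.

    Returns the first peer found holding `target`, or None when the
    TTL runs out first (the "not guaranteed" constraint on the slide).
    """
    seen = {start}                      # peers that already got the query
    queue = deque([(start, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if target in holdings.get(node, ()):
            return node
        if remaining > 0:               # only forward while TTL is left
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, remaining - 1))
    return None

# Hypothetical overlay: a chain a - b - c - d, with X stored only at d.
NEIGHBORS = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
HOLDINGS = {"d": {"X"}}
```

With a TTL of 2 the query from `a` stops at `c` and misses X; with a TTL of 3 it reaches `d`. This is why flooding trades coverage against traffic.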

  13. Search on Structured P2P • Query: where is X? • Example: Chord, a kind of DHT P2P • Solution: consistent hashing + routing • Constraint: supports only key-value pair lookup • [Figure: a ring of eight nodes with ids 1 to 8] Node joining: assign a node id. Object publishing: hash(X) = 3. Object lookup: the same. • Research topics - Topology and routing - Efficiency
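
A rough sketch of the consistent-hashing half of this slide: keys and nodes share one id space, and a key belongs to the first node at or after its id on the ring. The names `ring_id` and `successor`, the 16-bit id space, and the use of SHA-1 are assumptions for illustration. Chord's finger-table routing, which finds the successor in O(log n) hops, is deliberately left out.

```python
import hashlib
from bisect import bisect_left

def ring_id(key, bits=16):
    """Map a string key onto a ring of 2**bits ids (assumed id space)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

def successor(node_ids, key_id):
    """Return the first node id >= key_id, wrapping around the ring."""
    ids = sorted(node_ids)
    i = bisect_left(ids, key_id)
    return ids[i % len(ids)]            # past the largest id, wrap to the smallest
```

Publishing and lookup both call `successor(nodes, ring_id("X"))`, so they arrive at the same node without any broadcast. The price is that only exact keys can be looked up.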

  14. Keyword Search on Structured P2P • Query: where is Taiwan & information management? • Example: Chord + inverted list • Solution: routing + merge sort • Constraints: (1) storage redundancy (2) unbalanced load → Zipf's law (3) single point of failure (4) huge traffic (5) hard to rank the results • [Figure: a ring of eight nodes holding inverted lists over DOC 1 to DOC 8 for the keywords network, information management and Taiwan]

  15. Keyword Search in DHT-Based Peer-to-Peer Networks Yuh-Jzer Joung, Chien-Tse Fang, and Li-Wei Yang

  16. Outline • Background • Some Preliminaries • The Hypercube Index Scheme • Simulation • Conclusions and Related Work

  17. Our Hypercube Indexing Scheme • Assign node id: an r-bit string • Hash each keyword into the range [0, r-1] to construct a doc vector • Publish each doc to the node whose id equals its doc vector • Example (r = 7): Hash(Taiwan) = 2, Hash(network) = 0, Hash(information management) = 3 • Doc1 (keyword Taiwan) → 0010000 • Doc2 (keywords Taiwan, network) → 1010000 • Doc3 (keywords Taiwan, network, information management) → 1011000
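
The doc-vector construction can be reproduced in a few lines. The sketch below uses the hash values given on the slide for its three example keywords (written here in English as Taiwan, network and information management) and r = 7. The function name `doc_vector` and the lookup-table form of the hash are assumptions, since the slide does not say how the hash is computed.

```python
# Hash values taken from the slide; a real system would compute them.
SLIDE_HASH = {"Taiwan": 2, "network": 0, "information management": 3}

def doc_vector(keywords, keyword_hash, r):
    """Set bit h(w) for every keyword w; bit 0 is the leftmost character."""
    bits = ["0"] * r
    for w in keywords:
        bits[keyword_hash[w]] = "1"
    return "".join(bits)

doc1 = doc_vector({"Taiwan"}, SLIDE_HASH, 7)
doc2 = doc_vector({"Taiwan", "network"}, SLIDE_HASH, 7)
doc3 = doc_vector({"Taiwan", "network", "information management"}, SLIDE_HASH, 7)
```

Each resulting bit string is also the id of the node that indexes the document, so publishing needs a single DHT lookup.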

  18. Hypercube • An r-dimensional hypercube H_r(V_r, E_r) has 2^r nodes. Each node u in V_r is represented by a unique r-bit binary string. • Two nodes u, v in V_r share an edge iff they differ in exactly one bit. • An r-D hypercube can be constructed from two (r-1)-D hypercubes. • [Figure: a 4-dimensional hypercube with nodes 0000 to 1111]
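
As a small illustration of the edge rule, the neighbors of a node are exactly the strings obtained by flipping one bit. The helper names `neighbors` and `all_nodes` are chosen for this sketch only.

```python
from itertools import product

def all_nodes(r):
    """All 2**r node ids of an r-dimensional hypercube, as bit strings."""
    return ["".join(bits) for bits in product("01", repeat=r)]

def neighbors(node):
    """Every id that differs from `node` in exactly one bit position."""
    flip = {"0": "1", "1": "0"}
    return [node[:i] + flip[b] + node[i + 1:] for i, b in enumerate(node)]
```

For example, `neighbors("0101")` yields 1101, 0001, 0111 and 0100: one neighbor per dimension, so every node has degree r.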

  19. Spanning Binomial Tree Search and broadcast in hypercube can be done via traversing the spanning binomial tree.
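
One common way to obtain a spanning binomial tree, assumed here because the slide does not spell out a construction, is to let a node that was reached by flipping bit i forward only along bits to the right of i. Every node is then reached exactly once. The function name `binomial_tree_broadcast` is hypothetical.

```python
def binomial_tree_broadcast(root):
    """Visit every node of the hypercube once, starting from `root`.

    A node reached by flipping bit `last` only forwards along bits
    with a larger index, which rules out duplicate deliveries.
    """
    r = len(root)
    order = []
    stack = [(root, -1)]                # (node, index of the bit flipped to reach it)
    while stack:
        node, last = stack.pop()
        order.append(node)
        for i in range(last + 1, r):
            flipped = "1" if node[i] == "0" else "0"
            stack.append((node[:i] + flipped + node[i + 1:], i))
    return order

visited = binomial_tree_broadcast("0000")
```

The root has r children, and the number of nodes at depth k is the binomial coefficient C(r, k), which is where the tree gets its name.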

  20. Subhypercube • A subhypercube of H_r(V_r, E_r) induced by u, denoted by H_r(u), is the subgraph G = (U, F) of H_r such that every node w ∈ V_r is in U if and only if w contains u (that is, w has a 1 in every position where u has a 1), and every edge e ∈ E_r is in F if and only if both of its endpoints are in U. • [Figure: H_3 and H_4(0100)]
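
The node set of an induced subhypercube can be enumerated by fixing the 1 bits of u and letting the remaining positions vary. The sketch reads "w contains u" as "w has a 1 wherever u has a 1", which is an interpretation consistent with the x1x1 example on the superset-search slide. The function names are chosen for illustration.

```python
from itertools import product

def contains(w, u):
    """True if w has a 1 in every position where u has a 1."""
    return all(wb == "1" for wb, ub in zip(w, u) if ub == "1")

def subhypercube_nodes(u):
    """All nodes of H_r(u): keep u's 1 bits, vary its 0 bits freely."""
    free = [i for i, b in enumerate(u) if b == "0"]
    nodes = []
    for fill in product("01", repeat=len(free)):
        bits = list(u)
        for i, b in zip(free, fill):
            bits[i] = b
        nodes.append("".join(bits))
    return nodes

h4_0100 = subhypercube_nodes("0100")
```

H_4(0100) has three free positions and therefore 2^3 = 8 nodes, the same number as a 3-dimensional hypercube.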

  21. Outline • Background • Some Preliminaries • The Hypercube Index Scheme • Simulation • Conclusions

  22. System Model

  23. Our Index Scheme • A conceptual r-D hypercube is built over the DHT to index objects. • Each object o with keyword set K_o = {w1, w2, …, wk} is mapped to a unique r-bit vector F_h(K_o) by a hash h: W → {0, 1, …, r-1}: bit h(w) of the vector is 1 for each keyword w in K_o, and all other bits are 0. • [Figure: an r-bit vector with positions 0 to r-1, where h(w1) = 1 and h(w2) = 6] • The node F_h(K_o) in the hypercube is responsible for indexing o.

  24. Object Insert/Delete/Pin Search • To insert/delete an object o with keyword set K_o into/from the system: • Find the node F_h(K_o) that is responsible for o • Insert/delete the index information of o at that node. • [Figure: a 4-dimensional hypercube. Node u publishes object A with K_A = {w1, w2}; F_h(K_A) = 0101, so the index table at node 0101 holds the entry {w1, w2}: {(A, u), …} alongside an entry for {w1, w7}. Node x, searching for any object with keyword set {w1, w2}, is directed to the same node 0101.]

  25. Superset Search • To search for objects that can be described by a keyword set K (objects o with K_o ⊇ K), we need only search the subhypercube induced by the node F_h(K). • E.g., to search for objects that can be described by K_A = {w1, w2}, we need to search all nodes matching x1x1, where x is either bit (since F_h(K_A) = 0101).
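
A minimal sketch of superset search over an in-memory stand-in for the DHT. The hash table `H`, the objects A to D, and the function names are hypothetical. `H` is chosen so that F_h({w1, w2}) = 0101 as on the slide, and w7 is made to collide with w1 to show why each index entry is still checked against the query.

```python
from itertools import product

R = 4
H = {"w1": 1, "w2": 3, "w3": 0, "w7": 1}    # assumed hash values

def f_h(keywords):
    """r-bit vector with bit h(w) set for every keyword w."""
    bits = ["0"] * R
    for w in keywords:
        bits[H[w]] = "1"
    return "".join(bits)

def publish(index, obj, keywords):
    """Store the index entry for `obj` at the single node F_h(keywords)."""
    index.setdefault(f_h(keywords), []).append((frozenset(keywords), obj))

def superset_search(index, query):
    """Visit every node of the subhypercube induced by F_h(query)."""
    root = f_h(query)
    free = [i for i, b in enumerate(root) if b == "0"]
    hits = []
    for fill in product("01", repeat=len(free)):
        bits = list(root)
        for i, b in zip(free, fill):
            bits[i] = b
        for keyword_set, obj in index.get("".join(bits), []):
            if query <= keyword_set:        # drop entries that only collide
                hits.append(obj)
    return hits

INDEX = {}
publish(INDEX, "A", {"w1", "w2"})
publish(INDEX, "B", {"w1", "w2", "w3"})
publish(INDEX, "C", {"w3"})
publish(INDEX, "D", {"w7", "w2"})           # lands on 0101 too, via collision
```

Searching for {w1, w2} touches only the four nodes 0101, 0111, 1101 and 1111 instead of all sixteen, and the subset check removes D.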

  26. Flexible Superset Search • The spanning binomial tree of the subhypercube can be visited in various ways: • Top-down: general objects first • Bottom-up: specific objects first • Nodes at the same depth can also be given different priorities • Note that the hypercube is purely conceptual; each logical node corresponds directly to a physical node in the DHT. Tree traversal can therefore be flexible, as the underlying DHT provides the basic communication.
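
One way to picture the two traversal orders is to sort the nodes of the subhypercube by their depth below the root, that is, by how many extra bits they set. This is only a sketch of the ordering, assuming depth is measured as the number of differing bits; the function name `visit_order` is hypothetical, and an actual traversal would follow tree edges rather than sort a list.

```python
def visit_order(nodes, root, top_down=True):
    """Order nodes by depth: the number of bits in which they differ from root."""
    def depth(node):
        return sum(1 for a, b in zip(node, root) if a != b)
    # sorted() is stable, so nodes at the same depth keep their given order.
    return sorted(nodes, key=depth, reverse=not top_down)

SUBCUBE = ["0101", "0111", "1101", "1111"]   # nodes matching x1x1
general_first = visit_order(SUBCUBE, "0101", top_down=True)
specific_first = visit_order(SUBCUBE, "0101", top_down=False)
```

Top-down reaches objects whose keyword sets are closest to the query first; bottom-up starts with objects that carry the most additional keywords.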

  27. Simulation • Data set • 131,180 web site records from PCHome (http://www.pchome.com.tw) • Each web site record is maintained manually by experienced editors and contains the following fields: • ID, Title, URL, Category, Description, Keyword

  28. Distribution of Keyword set sizes

  29. Keyword Frequency (logarithms are base e)

  30. Load Distribution

  31. Object vs. Node Distribution (x-axis: dimensionality r of the hypercube)

  32. Query Performance: cacheless (m: keyword set size)

  33. Query Performance: with cache

  34. Query Distribution

  35. Conclusions • Our hypercube index scheme has the following characteristics: • Load balancing • Fault tolerance • Efficient object insert/delete • Direct pin search • A variety of ways to perform superset search • Ranking can be based on this diversity • Personalization services can also be built • The hypercube index scheme is decomposable • Multiple hypercubes can be built for multi-attribute search

  36. Future Challenges • Flexible Keyword Search • Boolean • Prefix / Range Query • Wildcard / Fuzzy Query • Semantic Query • Semantic Routing

  37. Supplementary Material

  38. Two Types of Services • White page service • search by names • “Lord of the rings.mpg” • Yellow page service • search by attributes • “rings”, “lord”, “mpg” • Keyword search is the basis for yellow page services • Both services can be easily supported in unstructured P2Ps or P2Ps with a centralized server. Yellow page service, however, is not easy in DHTs.

  39. Common Technique: Inverted Index

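  40. Distributed Inverted Indexing • [Figure: five nodes, each holding the inverted list of one keyword: w1 → {A, B, D}, w2 → {A, C, E}, w3 → {B, E}, w4 → {C, E}, w5 → {B, D}] • Query with Keywords = {w1, w5}: {A, B, D} ∧ {B, D}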
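
The query on this slide amounts to intersecting the inverted lists fetched from the nodes responsible for each keyword. The sketch below uses a merge-style intersection of sorted lists. The keyword-to-list pairing is read off the order in which the figure lists them; the slide itself confirms only the w1 and w5 lists, so the rest is an assumption, and the function names are hypothetical.

```python
# Pairing inferred from the figure; only w1 and w5 are confirmed by the query.
INVERTED = {
    "w1": ["A", "B", "D"],
    "w2": ["A", "C", "E"],
    "w3": ["B", "E"],
    "w4": ["C", "E"],
    "w5": ["B", "D"],
}

def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def keyword_query(keywords, inverted):
    """AND query over a non-empty keyword list, shortest list first."""
    lists = sorted((inverted[w] for w in keywords), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return result
```

In a DHT each of these lists lives on a different node, so every query ships whole lists across the network before intersecting them.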

  41. Zipf's Law • In a real-world corpus, keyword frequency (the number of objects in which a keyword occurs) varies enormously. A few keywords occur very often while many others occur rarely, following a power-law relationship. • e.g., mp3, ring, lord • Zipf's law implies that a straightforward distributed implementation of an inverted index results in an extremely imbalanced load.
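
To get a feel for the imbalance, the sketch below computes the share of all index entries that each keyword would receive if frequency were proportional to 1/rank. The exponent s = 1, the vocabulary of 1000 keywords and the function name `zipf_shares` are assumptions for illustration, not values measured from the data set used in the simulation.

```python
def zipf_shares(n, s=1.0):
    """Fraction of all postings owned by the keyword of rank 1..n."""
    weights = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

shares = zipf_shares(1000)
top_ten = sum(shares[:10])      # load on the nodes holding the 10 hottest keywords
```

Under these assumptions the ten most frequent keywords, 1% of the vocabulary, account for more than a third of all postings, so the few nodes that index them carry a disproportionate load.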

  42. Other Problems • Storage redundancy • An object o containing keywords {w1, w2, …, wk} is stored repeatedly at k different sites. • Increases insert/delete complexity • Decreases consistency • Fault tolerance • A failure at a site would block all queries containing a keyword handled by that site. • Nodes handling hot keywords may be swamped. • Object ranking is difficult • Ranking in general requires global knowledge • e.g., inverse document frequency (IDF)

  43. Our Keyword Indexing Scheme • The index entries of a single keyword are deterministically handled by a set of nodes. → Fault tolerance • The size of this set depends on the popularity of the keyword. → Load balancing • An object o with a keyword set K is indexed at exactly one node, and the node is determined uniquely by K. → No storage redundancy → Insert/delete is efficient

  44. Ranking • Given a keyword set K, the set S_K of nodes that may be responsible for a superset of K is fixed. The larger the size of K, the smaller the size of S_K. • Within S_K, the nodes are distinguished according to their responsible keyword sets as follows: • K+{w1}, K+{w2}, K+{w3}, … • K+{w1,w2}, K+{w1,w3}, K+{w1,w4}, … K+{w2,w1}, … • K+{w1,w2,w3}, K+{w1,w2,w4}, K+{w1,w2,w5}, … • … • So there is much leeway in how the nodes are visited, and objects can be retrieved in whatever order an application requires.
