
Distributed Content-based Search on Structured Peer-to-Peer Overlay Networks

This project explores a self-organizing search engine for distributed content-based search on structured peer-to-peer overlay networks. The goal is to build a scalable and efficient search engine capable of indexing and searching rich content such as HTML, plain text, music, and image files. Two algorithms, V-hash and E-hash, are proposed for controlled placement of document indices on the overlay network to improve search accuracy. Experimental results show that the system achieves comparable accuracy to centralized information retrieval systems with significantly lower resource consumption.

Presentation Transcript


  1. Distributed Content-based Search on Structured Peer-to-Peer Overlay Networks. Chunqiang Tang*, Zhichen Xu, Sandhya Dwarkadas*, Mallik Mahalingam. HP Labs, Hewlett-Packard Company; *Univ. of Rochester

  2. Motivation • 93% of information produced worldwide is in digital form • Unique data added yearly exceeds one exabyte (10^18 bytes) • The volume of digital content is estimated to double annually • The contents are becoming richer • Efforts are under way to make these contents easier to access (e.g., QBIC, MPEG-7) • This calls for scalable infrastructures capable of indexing and searching rich content such as HTML, plain text, music, and image files • This particular work focuses on content-based search Zhichen Xu

  3. Motivation (cont’d) • P2P systems are scalable, fault-tolerant, and self-organizing • Progress has been made in storage, DNS, media streaming, web caching…, raising hope for a self-organizing distributed search engine • Content-based search in P2P is NOT yet solved • Most systems use simple keyword matches • They ignore developments in information retrieval (IR) • Rich queries are hard to support, e.g., searching for a song by whistling a tune, or for an image by submitting a sample of patches • They also have efficiency and accuracy problems • Centralized indexing, index/query flooding • Inaccuracy and high maintenance cost of heuristic-based approaches

  4. The goals of our project • Build a self-organizing search engine out of P2P nodes • Extend centralized IR algorithms: • Vector space model (VSM) and latent semantic indexing (LSI) • Documents and queries as vectors; not specific to text [P. Raghavan] • Differences from “centralized” systems such as Google • Those are designed for Web search and harness explicit cross-reference information • Explicit cross references do not exist in all digital content • On the other hand, there are “richer” inter-relationships that the search engine can make use of [see our HotOS’03 paper] • P2P systems are self-organizing, low cost, easy to deploy, with infinite scalability…

  5. Our approach, pSearch • A fundamental problem of existing approaches: documents are “randomly” distributed, so a query either has to search a large number of nodes or suffers a high probability of missing important documents • Controlled placement of document indices in an overlay such that distance reflects dissimilarity in content • Two algorithms: • V-hash (whole-vector hashing), which requires the overlay to have a Cartesian space abstraction (historically pLSI) • E-hash (hashing on individual elements) (historically pVSM)

  6. Benefits of controlled placement for search • With VSM or LSI, documents and queries are vectors in a Cartesian space (the semantic space) • Similarity is measured as distance in the semantic space [Figure: a query and documents A, B, C plotted in the semantic space]

  7. V-hash: map the semantic space to CAN [Figure: the query and documents A, B, C placed in CAN zones]

  8. Highlight of results • Achieves accuracy comparable to a centralized information retrieval system while visiting a small number of nodes • E.g., with proper configuration, in a system with 128,000 nodes and 528,543 documents (from news, magazines, etc.): • pSearch searches only 19 nodes and transmits only 95.5 KB of data during the search • The top 15 documents returned by v-hash and LSI have a 91.7% intersection

  9. Overview • Background • A basic parallel LSI (v-hash) algorithm to highlight challenges • Solutions to the challenges • Experimental results • Discussions • Conclusions

  10. Background---VSM and LSI • Documents and queries are vectors in a Cartesian space • Similarity between a query and a document is measured as the cosine of the angle between their vector representations • Precision of LSI ranges from comparable to that of VSM to as much as 30% better • LSI can bring together documents that are semantically related even if they do not share terms • e.g., a search for car may return relevant documents that use automobile in the text

  11. Background---The vector space model (VSM) • If a term t appears often in a document, then a query containing t should retrieve that document • A term’s scarcity across the collection is a measure of its importance • Documents and queries are both vectors • $D_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,t})$ • $w_{d,t} = tf_{d,t} \times idf_t$, where $tf_{d,t}$ is the frequency of term t in document d and $idf_t$ is the inverse document frequency of t • There are many variations… • Similarity: $\dfrac{d \cdot q}{|d|\,|q|}$
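
The weighting and similarity on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the SMART weighting used in the experiments: the `idf = log(N / df)` variant, the toy documents, and the function names `tfidf_vectors` and `cosine` are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for a list of tokenised documents."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for d in docs for t in set(d))
    # Assumed idf variant: log(N / df). Rare terms get larger weights.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

def cosine(d, q):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    return sum(x * y for x, y in zip(d, q)) / norm if norm else 0.0

# Hypothetical toy corpus.
docs = [["car", "engine", "car"], ["automobile", "engine"], ["music", "tune"]]
vocab, vecs = tfidf_vectors(docs)
query = [1.0 if t == "car" else 0.0 for t in vocab]
scores = [cosine(v, query) for v in vecs]
```

In this toy corpus the query "car" scores zero against the "automobile" document, which is the term-mismatch limitation of plain VSM that LSI is meant to address.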

  12. Background---Latent Semantic Indexing (LSI) • Map term space to a lower-dimensional concept space • LSI uses Singular Value Decomposition (SVD) • Let $A$ be an $n \times m$ matrix of rank $r$, and let $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$ be the singular values of $A$ • $A = U D V^T$, where $D = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$ is an $r \times r$ matrix, $U = (u_1, \ldots, u_r)$ is an $n \times r$ matrix, and $V = (v_1, \ldots, v_r)$ is an $m \times r$ matrix • LSI omits all but the $k$ largest singular values of $A$, i.e., $A_k = U_k D_k V_k^T$, where $D_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, $U_k = (u_1, \ldots, u_k)$, and $V_k = (v_1, \ldots, v_k)$
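
A short worked consequence of the decomposition, assuming (the slide does not say) that $A$ is the term-by-document matrix with $n$ terms as rows and $m$ documents as columns. Because the columns of $U$ are orthonormal, multiplying $A = U D V^T$ on the left by $U_k^T$ keeps only the first $k$ singular directions:

$$U_k^T A = D_k V_k^T \quad\Longrightarrow\quad V_k^T = D_k^{-1} U_k^T A .$$

Under that assumption, column $j$ of $V_k^T$ is the $k$-dimensional representation of document $j$, and a new document or query with term vector $q \in \mathbb{R}^n$ can be projected into the same concept space with the standard LSI folding-in formula

$$\hat{q} = D_k^{-1} U_k^T q .$$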

  13. Background --- CAN [Ratnasamy01] • Cartesian space partitioned into zones • A node serves as “owner” of a zone • A key is a “point” in the Cartesian space • An object is stored on the node that owns the zone containing its key (point) [Figure: a Cartesian space divided into zones, each owned by a node]

  14. Low maintenance cost & self-organizing… • A node only needs to know the owners of its neighboring zones • Node join: pick a point and split the zone with the node that currently owns the point • Node departure: a neighboring node takes over the “state” of the departing node • Dynamism is shielded from users and applications! [Figure: a new node joining and taking over a new zone]
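
The zone ownership and join rules can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the `Zone` class, `find_owner`, and `join` are hypothetical names, the zone is split along its longest side (real CAN cycles through dimensions in a fixed order), and the joining node always takes the upper half.

```python
from dataclasses import dataclass

@dataclass
class Zone:
    owner: str
    lo: tuple  # lower corner, inclusive
    hi: tuple  # upper corner, exclusive

    def contains(self, p):
        return all(l <= x < h for l, x, h in zip(self.lo, p, self.hi))

def find_owner(zones, p):
    """Return the zone whose region contains key point p."""
    for z in zones:
        if z.contains(p):
            return z
    raise ValueError("point lies outside the Cartesian space")

def join(zones, new_owner, p):
    """A new node picks point p and splits the zone that currently contains it."""
    z = find_owner(zones, p)
    sides = [h - l for l, h in zip(z.lo, z.hi)]
    axis = sides.index(max(sides))           # assumed rule: split the longest side
    mid = (z.lo[axis] + z.hi[axis]) / 2
    old_hi = z.hi
    # The existing owner keeps the lower half, the new node takes the upper half.
    z.hi = tuple(mid if i == axis else h for i, h in enumerate(old_hi))
    new_lo = tuple(mid if i == axis else l for i, l in enumerate(z.lo))
    new_zone = Zone(new_owner, new_lo, old_hi)
    zones.append(new_zone)
    return new_zone

# One node owns the whole unit square, then two nodes join.
zones = [Zone("n1", (0.0, 0.0), (1.0, 1.0))]
join(zones, "n2", (0.7, 0.2))
join(zones, "n3", (0.7, 0.8))
```

Node departure would be the reverse operation: a neighbor absorbs the departing zone's region and stored keys. It is omitted here to keep the sketch short.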

  15. Object lookup translates to logical routing • Find the node that owns the zone containing the point • Routing: traverse a series of neighboring zones from source to destination [Figure: a routing path across neighboring zones]
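
A minimal sketch of greedy routing through neighboring zones, assuming an idealised CAN in which the unit square is divided into a uniform g x g grid of equal zones that wrap around at the edges. The function names and the grid layout are assumptions for illustration; a real CAN has irregular zones and forwards using the neighbors' zone coordinates.

```python
def zone_of(point, g):
    """Grid coordinates of the zone that contains a point in the unit square."""
    return tuple(min(int(x * g), g - 1) for x in point)

def torus_delta(a, b, g):
    """Distance between two grid indices when the axis wraps around."""
    d = abs(a - b)
    return min(d, g - d)

def route(src, key, g):
    """Greedy routing: hop to the neighboring zone closest to the key's zone."""
    dst = zone_of(key, g)
    path = [src]
    cur = src
    while cur != dst:
        neighbours = []
        for axis in range(len(cur)):
            for step in (-1, 1):
                n = list(cur)
                n[axis] = (n[axis] + step) % g
                neighbours.append(tuple(n))
        # Some neighbour is always one step closer, so the loop terminates.
        cur = min(neighbours,
                  key=lambda n: sum(torus_delta(a, b, g) for a, b in zip(n, dst)))
        path.append(cur)
    return path

path = route((0, 0), (0.9, 0.6), 4)
```

Each node only needs to know its immediate neighbors, which matches the low maintenance cost noted for CAN.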

  16. A basic parallel LSI algorithm (naïve v-hash) • 1: query routing • 2: local query + localized flooding • 3: results routing [Figure: steps 1 to 3 drawn on CAN zones holding the query and documents A, B, C]

  17. It is more complicated … • Dimensionality of the semantic space is typically very high • 50-350 for IR corpora; expected to increase with corpus size • Nearest-neighbor search in high dimensions is very difficult • Dimensionality of CAN is much lower • When k ≥ log(n) and zones are partitioned evenly, each node has only log(n) neighbors • CAN can only partition a small number of dimensions • Uneven distribution of semantic vectors in the semantic space • Global information • Solutions: hierarchical clustering, rolling-index, and content-directed search

  18. Problems due to dimension mismatch: an example • Semantic space of 4 dimensions • Vd = (-0.1, 0.55, 0.57, -0.6), Vq = (0.55, -0.1, 0.6, -0.57) • Vd and Vq are similar on elements 2 and 3, counting from 0 (shown in red on the slide) • If CAN only partitions the first two dimensions, Vd and Vq map to distant points [Figure: Vd and Vq plotted on the first two dimensions, axes from -1 to 1]
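
The mismatch in this example can be checked numerically. The sketch below uses the slide's two vectors; the helper names and the choice of Euclidean distance for the per-dimension gap are assumptions for illustration.

```python
import math

vd = (-0.1, 0.55, 0.57, -0.6)
vq = (0.55, -0.1, 0.6, -0.57)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def gap(a, b, dims):
    """Euclidean distance between a and b restricted to the given element indices."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in dims))

full_similarity = cosine(vd, vq)      # similarity over all four elements
gap_first_two = gap(vd, vq, (0, 1))   # the elements CAN actually partitions
gap_last_two = gap(vd, vq, (2, 3))    # the elements on which the vectors agree
```

On the first two elements the vectors are about 0.92 apart, while on the last two they are about 0.04 apart, so a CAN keyed only on elements 0 and 1 places the document far from the query.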

  19. Intuitions behind our solutions • The number of dimensions relevant to a particular document is typically much smaller than the dimensionality of the semantic space • Queries submitted to search engines can be very short, averaging fewer than 2.4 terms per query [Lempel & Moran]

  20. Our solutions • Use clustering algorithms to identify clusters of semantic vectors that correspond to, e.g., chemistry, computer science, etc. [Not yet evaluated] • Rotate the semantic space and map each of the rotated spaces to the same CAN • Use the contents stored on neighboring nodes and queries received in the recent past to guide the search

  21. Hierarchical clustering: high-level idea • Digests are typically made of the most important concepts (terms) in a domain • Challenge: efficiently and effectively decide which cluster a document or query falls into [Figure: a hierarchy of cluster digests (1, 2, 3; 1.1 to 1.4; 2.1 to 2.4) mapped onto CAN regions]

  22. Rolling-index • Original vector for a document (or query): e0, e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, with leading pair (e0, e1) • Vector rotated by 2 elements: e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e0, e1, with leading pair (e2, e3) • A further rotation: e9, e10, e11, e0, e1, e2, e3, e4, e5, e6, e7, e8, with leading pair (e9, e10)

  23. An example of rolling-index • Semantic space of 4 dimensions • Vd = (-0.1, 0.55, 0.57, -0.6), Vq = (0.55, -0.1, 0.6, -0.57) • Vd and Vq are similar on elements 2 and 3, counting from 0 (shown in red on the slide) • Precision at the cost of replication [Figure: Vd and Vq plotted in the original space and in the rotated space, axes from -1 to 1]
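
Rolling-index can be sketched as a cyclic rotation followed by truncation to the dimensions CAN partitions. The functions `rotate` and `rolling_keys` and their parameters are hypothetical names for this illustration; the sketch assumes each rotated space is keyed on its leading elements, and it ignores how the system chooses the rotation step.

```python
import math

def rotate(vec, shift):
    """Cyclically rotate a tuple left by `shift` elements."""
    shift %= len(vec)
    return vec[shift:] + vec[:shift]

def rolling_keys(vec, can_dims, step, spaces):
    """One CAN key per rotated space: the leading `can_dims` elements of each rotation."""
    return [rotate(vec, i * step)[:can_dims] for i in range(spaces)]

vd = (-0.1, 0.55, 0.57, -0.6)
vq = (0.55, -0.1, 0.6, -0.57)

keys_d = rolling_keys(vd, can_dims=2, step=2, spaces=2)
keys_q = rolling_keys(vq, can_dims=2, step=2, spaces=2)
# Distance between document key and query key in each rotated space.
gaps = [math.dist(kd, kq) for kd, kq in zip(keys_d, keys_q)]
```

In the original space the two keys are far apart, but in the space rotated by two elements they nearly coincide, so the query finds the document there. The price is that every document index is stored once per rotated space, which is the replication cost the slide mentions.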

  24. Properties of SVD • Sorts elements in semantic vectors by decreasing importance • A large number of documents discussing popular concepts are likely to be correctly classified by a relatively small number of low-dimension elements

  25. Query accuracy distribution • A total of 100 queries • 4 rotated spaces, each rotating the previous space by 25 • Accuracy: percentage overlap with a centralized baseline [Figure: plot not reproduced; one axis labeled time]

  26. Content-directed search • Curse of dimensionality • High-dimensional data spaces are sparsely populated. Even very large hyper-cubes in high-dimensional spaces are not likely to contain a point • The distance between a query and its nearest neighbor (NN) grows steadily with the dimensionality of the space • Use the contents stored on nodes and recently processed queries as hints to guide the search to the right places • Use samples from other nodes to determine content similarity between a query and the content stored on those nodes

  27. Content-directed search • Search for two documents • N: list of nodes to search • Step 1: N = {6,14,11,9} • Step 2: a is identified and N = {7,14, 11, …} • The closest document may not be on a direct routing neighbor [Figure: a 4 x 4 grid of nodes numbered 1 to 16, with documents a and b and query q]
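
One way to read the steps above is as a best-first search over nodes, where the candidate list N is ordered by how promising each node looks. The sketch below is an assumption-laden illustration, not the pSearch algorithm itself: `directed_search`, its parameters, and the toy topology are hypothetical, and the "hint" for a node is computed from everything the node stores rather than from a small sample.

```python
import heapq
import math

def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def directed_search(start, query, neighbours, stored, budget, top_k):
    """Visit at most `budget` nodes, always expanding the most promising candidate."""
    def hint(node):
        # Stand-in for sampling: best similarity among the node's stored vectors.
        return max((cosine(v, query) for _, v in stored[node]), default=0.0)

    visited = set()
    frontier = [(-hint(start), start)]        # max-heap via negated scores
    results = []
    while frontier and len(visited) < budget:
        _, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        results.extend((cosine(v, query), doc) for doc, v in stored[node])
        for n in neighbours[node]:
            if n not in visited:
                heapq.heappush(frontier, (-hint(n), n))
    results.sort(reverse=True)
    return [doc for _, doc in results[:top_k]], visited

# Hypothetical topology: the query lands on node "q"; d3 sits two hops away.
neighbours = {"q": ["a", "b"], "a": ["q", "c"], "b": ["q"], "c": ["a"]}
stored = {
    "q": [],
    "a": [("d1", (1.0, 0.0))],
    "b": [("d2", (0.0, 1.0))],
    "c": [("d3", (0.9, 0.1))],
}
docs, visited = directed_search("q", (1.0, 0.0), neighbours, stored, budget=3, top_k=2)
```

With a budget of three nodes the search skips the unpromising neighbor "b" and reaches "c", which is not a direct neighbor of the starting node.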

  28. Content-directed replication & caching • Selectively replicate contents stored on surrounding nodes • The threshold is set according to the node’s storage capacity, computing power, and network connectivity [Figure: a 4 x 4 grid of nodes numbered 1 to 16, with documents a and b and query q]

  29. Experimental Results • Software packages • SMART [Cornell] + LAS2 from SVDPACK [netlib] + eCAN simulator • Correctness validated using the MEDLINE corpus [Buckley] • Experiments with TREC-7,8; topics 351-450 as queries • Term-by-document matrix built by sampling 15% of the documents • 79,316 sampled docs and 83,098 indexed terms • All 528,543 docs projected onto 300 dimensions after SVD • Metrics • Number of visited nodes • Accuracy = $(|A \cap B| / |A|) \times 100\%$, where A is the set of documents returned by LSI and B is the set of documents returned by v-hash
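
The accuracy metric is a set overlap and can be written directly. The function name and the sample document ids below are made up for illustration.

```python
def accuracy(lsi_docs, vhash_docs):
    """Percentage of the centralized LSI result set A that v-hash (set B) also returns."""
    a, b = set(lsi_docs), set(vhash_docs)
    if not a:
        raise ValueError("the baseline result set A must not be empty")
    return 100.0 * len(a & b) / len(a)

# Hypothetical result lists: two of the four baseline documents are recovered.
score = accuracy([1, 2, 3, 4], [2, 3, 9, 10])
```

Because the denominator is |A|, the metric measures how much of the centralized baseline is recovered, not the relevance of the returned documents themselves.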

  30. Scalability with respect to the system size • As system size increases exponentially, the number of visited nodes increases only moderately • For a 32k-node system, v-hash can achieve an accuracy of 90% by visiting 139 nodes

  31. Effect of the number of returned documents • 10,000 nodes in total • The number of visited nodes grows quickly, but the average number of nodes that need to be searched to return one document decreases drastically

  32. Using actual contents and past queries to direct searches When queries have locality, learning from past history can increase the accuracy while reducing the number of visited nodes

  33. Replication improves search efficiency and accuracy • Visit 24 nodes in a 10,000-node system to achieve accuracy higher than 96.8% • Replicating direct neighbors’ content: the scalability declines from O(n) to O(n/log(n))

  34. An example of a large system of 128 K nodes • Repl-query series uses both the content and past queries to guide the sampling • Combining replication and the query heuristics, it can achieve an accuracy of 91.7% by visiting 19 nodes, or an accuracy of 98% by visiting 45 nodes

  35. Discussions • V-hash requires the overlay to have a Cartesian space abstraction • For an individual document or query, the most significant elements may not be contiguous • Clustering is needed for larger corpora • The element hashing (e-hash) algorithm eliminates the constraint • We expect content-directed search to improve as the corpus size increases • Selective content replication and query result caching have the potential to substantially improve performance while keeping the scalability of the storage high • Study how other IR algorithms such as PageRank can complement our approach • Integrate attribute-based, content-based, and context-based search

  36. E-hash • Query = global_rank (sigma (local_ranking)) • Intelligent storage management based on query patterns [Figure: per-term weight lists (Computer: w1, Network: w2, …; sports: w1, Network: w2, …) stored on an overlay, e.g., Chord, Pastry]
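
The slide gives only the outline "global_rank (sigma (local_ranking))", so the following is one possible reading rather than the authors' design: each term is hashed to a node, that node keeps the per-document weights for the term, and a query sums the per-term scores before ranking globally. The helper names, the SHA-1 placement, and the weights are all assumptions for illustration; a real deployment would use the overlay's own key space (e.g., Chord or Pastry identifiers).

```python
import hashlib
from collections import defaultdict

def node_for(term, n_nodes):
    """Map a term to a node id by hashing (a stand-in for an overlay lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes

def publish(index, doc_id, weights, n_nodes):
    """Store each (doc, weight) pair on the node responsible for the term."""
    for term, w in weights.items():
        index[node_for(term, n_nodes)][term].append((doc_id, w))

def search(index, query_weights, n_nodes, top_k):
    """Sum the local per-term scores, then rank globally."""
    totals = defaultdict(float)
    for term, qw in query_weights.items():
        local = index[node_for(term, n_nodes)].get(term, [])  # one node's local list
        for doc_id, w in local:
            totals[doc_id] += qw * w
    return sorted(totals, key=lambda d: (-totals[d], d))[:top_k]

index = defaultdict(lambda: defaultdict(list))   # node id -> term -> postings
publish(index, "d1", {"computer": 0.8, "network": 0.5}, n_nodes=16)
publish(index, "d2", {"sports": 0.9, "network": 0.4}, n_nodes=16)
ranked = search(index, {"network": 1.0, "computer": 1.0}, n_nodes=16, top_k=2)
```

Because placement depends only on a hash of each term, this scheme needs no Cartesian space abstraction, which is the constraint the discussion slide says e-hash removes.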

  37. Conclusion • pSearch is the first system that organizes contents around their semantics in a P2P network • This makes it possible to achieve an accuracy comparable to state-of-the-art centralized IR systems while visiting only a small number of nodes • We propose the use of hierarchical clustering, rolling-index, and content-directed search to reduce the dimensionality of the search space and to resolve the dimensionality mismatch between the semantic space and CAN • We employ content-aware node bootstrapping to balance the load
