
Text-Based Content Search and Retrieval in ad hoc P2P Communities



Presentation Transcript


  1. Text-Based Content Search and Retrieval in ad hoc P2P Communities • Francisco Matias Cuenca-Acuna • Thu D. Nguyen • http://www.panic-lab.rutgers.edu/

  2. Motivation • It is hard to find information in current P2P infrastructures • They are designed for name-based search • They don’t have quality metrics • They don’t rank results • Most are optimized to find popular content • The current Internet search model has proven effective at locating data • Intuitive term-based query model • Quality metrics and ranking are critical factors in the success of Internet search engines • They help users quickly pinpoint relevant documents in a vast repository

  3. Goals & challenges • Empower P2P communities with search capabilities similar to those of Internet search engines • No central servers • Fault tolerance • Cannot employ the current model used by Internet search engines • No central management and administration • Resources are fragmented • Peers' behavior is uncontrolled

  4. Summary of PlanetP • Nodes maintain an index of their content • Represented as Bloom filters • Indexes and directories are replicated everywhere • Gossiping keeps peers synchronized • [Figure: two peers, each with local files (XML snippets), a Bloom filter over keys [K1,..,Kn], and a local directory, linked by gossiping]
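The Bloom filter index named on this slide can be illustrated with a minimal sketch. The class name, the filter size, the number of hashes, and the SHA-256-based hashing are all assumptions chosen for illustration; the slides do not state PlanetP's actual parameters.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter summarizing the terms a peer holds (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # assumed size, not PlanetP's
        self.num_hashes = num_hashes  # assumed hash count, not PlanetP's
        self.bits = 0                 # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # Can report a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


peer_index = BloomFilter()
for word in ["gossip", "bloom", "filter"]:
    peer_index.add(word)
```

A peer would gossip a compact summary like `peer_index` instead of its full word list, so other peers can test `term in peer_index` locally at the cost of occasional false positives.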

  5. Content search in PlanetP • [Figure: search steps labeled Local lookup, Rank nodes, Contact candidates, Rank results and STOP, drawn against a local directory that maps each nickname (Alice, Bob, Charles, Diane, Edward, Fred, Gary) to its keys [K1,..,Kn]; the example query shows peers Diane, Bob and Fred and files File1, File2 and File3]

  6. The Vector Space model • Documents and queries are represented as k-dimensional vectors • Words are weighted according to their relevance for the document • Documents are weighted according to their words • The angle between a query and a document indicates their similarity • [Figure: a document vector and a query vector]
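The angle-based similarity on this slide is usually computed as the cosine between the two vectors. The sketch below assumes sparse vectors stored as dictionaries from term to weight; the function name and the sample weights are illustrative, not taken from the slides.

```python
import math


def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # an empty vector has no direction, so treat it as unrelated
    return dot / (q_norm * d_norm)


query = {"bloom": 1.0, "filter": 1.0}
doc_a = {"bloom": 2.0, "filter": 2.0}  # same direction as the query
doc_b = {"gossip": 3.0}                # shares no terms with the query
```

A document pointing in the same direction as the query scores close to 1 regardless of its length, while a document with no query terms scores 0.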

  7. Weight assignment (TFxIDF) • Idea • Use the per-document Term Frequency (TF) to weight words (W_{D,t}) • Use inverse global popularity (IDF) to find good discriminators among the query terms • Intuition • TF indicates how related a document is to a particular concept • Inverse Document Frequency (IDF) identifies the words that are good discriminators between documents • W_{D,t} = f(frequency of t in D) • IDF_t = f(no. of documents / frequency of t across documents)
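The slide leaves the function f unspecified. As a sketch only, the code below assumes logarithmic forms, W_{D,t} = 1 + ln(frequency of t in D) and IDF_t = ln(1 + N / f_t), where N is the number of documents and f_t counts the occurrences of t across all documents. These forms, the function names, and the toy collection are assumptions for illustration.

```python
import math


def tf_weight(term, doc_tokens):
    """W_{D,t}, assumed here to be 1 + ln(frequency of t in D); 0 if t is absent."""
    freq = doc_tokens.count(term)
    return 1.0 + math.log(freq) if freq > 0 else 0.0


def idf(term, collection):
    """IDF_t, assumed here to be ln(1 + N / f_t) over the whole collection."""
    total = sum(doc.count(term) for doc in collection)
    return math.log(1.0 + len(collection) / total) if total > 0 else 0.0


collection = [
    ["p2p", "search", "p2p"],
    ["search", "engine"],
    ["gossip"],
]
rare = idf("gossip", collection)    # occurs once in the collection
common = idf("search", collection)  # occurs twice in the collection
```

Under these assumed forms a rarer term such as "gossip" receives a larger IDF than "search", which is what makes it the better discriminator among query terms.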

  8. Node & document ranking in PlanetP • Unfortunately, IDF is not suited to P2P • It requires an appearance count for every word in the community • We introduce the Inverse Peer Frequency (IPF) • IPF_t = f(no. of peers / no. of peers with documents containing t) • IPF can be computed with local information • IPF is compatible across the community
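Because every peer holds a copy of the directory, IPF can be computed without contacting anyone. The sketch below makes several assumptions: f is taken to be ln(1 + N / N_t), plain Python sets stand in for the replicated Bloom filters, and peers are ranked by summing IPF over the query terms their summary contains. None of these details is given on the slide.

```python
import math


def ipf(term, directory):
    """IPF_t, assumed here to be ln(1 + N / N_t).

    N is the number of peers and N_t the number whose summary contains t.
    """
    n_peers = len(directory)
    n_with_term = sum(1 for summary in directory.values() if term in summary)
    return math.log(1.0 + n_peers / n_with_term) if n_with_term else 0.0


def rank_peers(query_terms, directory):
    """Order peers by the summed IPF of the query terms they appear to hold."""
    scores = {
        peer: sum(ipf(t, directory) for t in query_terms if t in summary)
        for peer, summary in directory.items()
    }
    candidates = [peer for peer, score in scores.items() if score > 0]
    return sorted(candidates, key=lambda peer: -scores[peer])


# Hypothetical local directory: nickname -> term summary (sets replace Bloom filters).
directory = {
    "Alice": {"p2p", "search"},
    "Bob": {"search"},
    "Charles": {"gossip"},
}
ranking = rank_peers(["p2p", "search"], directory)
```

In this toy directory Alice holds both query terms and is ranked first, Bob holds one, and Charles is dropped because none of the query terms appear in his summary.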

  9. Stopping condition • Intuitive idea: Stop as soon as k documents are retrieved • Not good • A node might have few highly ranked documents and many that have a low rank • We propose an adaptive approach: • Contact nodes one by one and keep a list of the top k documents retrieved • Stop contacting candidates when p nodes in a row fail to contribute to the top k
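The adaptive stopping rule can be sketched as a loop over peers in rank order. Here `fetch` is a hypothetical callback that returns (score, document id) pairs from one peer, and the peer names, scores, k, and p are made-up values for illustration.

```python
import heapq


def adaptive_search(ranked_peers, fetch, k, p):
    """Contact peers in order; stop once p peers in a row add nothing to the top k."""
    top_k = []      # min-heap holding the best k (score, doc_id) pairs so far
    misses = 0      # consecutive peers that left the top k unchanged
    contacted = []
    for peer in ranked_peers:
        contacted.append(peer)
        contributed = False
        for score, doc_id in fetch(peer):
            if len(top_k) < k:
                heapq.heappush(top_k, (score, doc_id))
                contributed = True
            elif score > top_k[0][0]:
                heapq.heapreplace(top_k, (score, doc_id))
                contributed = True
        misses = 0 if contributed else misses + 1
        if misses >= p:
            break
    return sorted(top_k, reverse=True), contacted


# Hypothetical per-peer results, listed in the order the peers were ranked.
results = {
    "Alice": [(0.9, "a1"), (0.2, "a2")],
    "Bob": [(0.8, "b1")],
    "Charles": [(0.1, "c1")],
    "Diane": [],
    "Edward": [(0.95, "e1")],
}
top, contacted = adaptive_search(list(results), lambda peer: results[peer], k=2, p=2)
```

With k = 2 and p = 2 the loop stops after Charles and Diane both fail to improve the top two, so Edward is never contacted. That shows the trade-off in the heuristic: a larger p contacts more peers but is less likely to miss a good document held by a low-ranked peer.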

  10. Evaluation method • We use five well-known document collections • Each collection comes with a set of queries and relevance judgments • Here we present results for one (AP89) • We measure recall and precision
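Recall and precision follow from comparing the retrieved set with the relevance judgments for a query. The sketch below uses the standard set-based definitions; the document ids are invented for illustration and do not come from AP89.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query from document-id collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # fraction of results that are relevant
    recall = hits / len(relevant) if relevant else 0.0       # fraction of relevant docs that were found
    return precision, recall


precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).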

  11. Evaluation method • We use a simulator to test our algorithm • Different file distributions • Against a central search engine • Quantifying the effect of not using an adaptive stopping condition

  12. Results

  13. Results cont.

  14. More results • Adjusting the stop condition according to the community size and number of results expected • We provide a linear function to determine p • Recall as the community grows to 1000 (scalability) • Overlap between PlanetP’s results and the ones obtained by using standard TFxIDF • 80% on average

  15. Conclusions • PlanetP matches TFxIDF's performance using the TFxIPF approximation • Gives P2P communities search capabilities as powerful as those of environments with centralized resources • TFxIPF is applicable beyond PlanetP • PlanetP matches TFxIDF’s performance regardless of how documents are distributed throughout the community • Our stopping heuristic limits searches to a small subset of the community yet allows enough peers to be contacted to guarantee good results

  16. Related Work • Tapestry, Pastry, Chord and CAN • Implement a distributed hash table for P2P environments • Oriented towards name-based searches (for FS) • They already store all the information needed to implement TFxIPF • Cori and Gloss • Address the problem of indexing and searching distributed collections of documents • They build a centralized index that has total knowledge of word usage, so they don’t contact unnecessary nodes

  17. Questions?
