Text-Based Content Search and Retrieval in ad hoc P2P Communities
amos-bishop

Presentation Transcript

  1. Text-Based Content Search and Retrieval in ad hoc P2P Communities
     Francisco Matias Cuenca-Acuna
     Thu D. Nguyen
     http://www.panic-lab.rutgers.edu/

  2. Motivation
     • It is hard to find information in current P2P infrastructures
       • They are designed for name-based search
       • They don’t have quality metrics
       • They don’t rank results
       • Most are optimized to find popular content
     • The current Internet search model has proven effective at locating data
       • Intuitive term-based query model
       • Quality metrics and ranking are critical factors in the success of Internet search engines
       • They help users quickly pinpoint relevant documents in a vast repository

  3. Goals & challenges
     • Empower P2P communities with search capabilities similar to Internet search engines
       • No central servers
       • Fault tolerance
     • Cannot employ the current model used by Internet search engines
       • No central management and administration
       • Resources are fragmented
       • Peer behavior is uncontrolled

  4. Summary of PlanetP
     • Nodes maintain an index of their content
       • Represented as Bloom filters
     • Indexes and directories are replicated everywhere
     • Gossiping keeps peers synchronized
     [Figure: two peers, each holding local files, XML snippets, a Bloom filter over keys [K1,..,Kn], and a local directory, kept in sync by gossiping]
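
The per-node term index on this slide can be illustrated with a minimal Bloom filter sketch. This is an assumption-laden illustration, not PlanetP's implementation: the class name `BloomFilter`, the helper `summarize`, the filter size, the number of hashes, and the use of SHA-256 are all hypothetical choices made for the example.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative, not tuned values from the presentation.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting the term with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def summarize(documents):
    """Build one filter covering every term in a node's local documents."""
    bf = BloomFilter()
    for text in documents:
        for term in text.lower().split():
            bf.add(term)
    return bf
```

A peer holding such a filter for every other peer can test locally whether a remote node might contain a query term, at the cost of occasional false positives.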

  5. Content search in PlanetP
     [Figure: a query passes through the stages "Local lookup", "Rank nodes", "Contact candidates", and "Rank results", ending at "STOP". The local directory maps each nickname (Alice, Bob, Charles, Diane, Edward, Fred, Gary) to its keys [K1,..,Kn]; Bob, Diane, and Fred appear as candidate nodes, and File1, File2, and File3 appear as results.]

  6. The Vector Space model
     • Documents and queries are represented as k-dimensional vectors
       • Words are weighted according to their relevance to the document
       • Documents are weighted according to their words
     • The angle between a query and a document indicates their similarity
     [Figure: a document vector and a query vector]
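
The angle-based similarity above is usually computed as the cosine between the two vectors. The following is a minimal sketch, assuming sparse vectors stored as dictionaries from term to weight; the function name `cosine_similarity` is chosen for illustration only.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # an empty vector has no direction; treat as dissimilar
    return dot / (q_norm * d_norm)
```

A value near 1 means the query and document point in almost the same direction in term space; 0 means they share no weighted terms.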

  7. Weight assignment (TFxIDF)
     • Idea
       • Use per-document Term Frequency (TF) to weight words (W_{D,t})
       • Use inverse global popularity (IDF) to find good discriminators among the query terms
     • Intuition
       • TF indicates how related a document is to a particular concept
       • Inverse Document Frequency (IDF) identifies the words that are good discriminators between documents
     • W_{D,t} = f(frequency of t in D)
     • IDF_t = f(No. documents / frequency of t across documents)
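
The slide leaves the function f unspecified. The sketch below assumes one common logarithmic choice, W_{D,t} = 1 + log(frequency) and IDF_t = log(1 + N / f_t); these forms, and the function names `term_weights` and `idf`, are assumptions for illustration rather than the exact formulas used in the presentation.

```python
import math
from collections import Counter


def term_weights(doc_tokens):
    """W_{D,t} = 1 + log(frequency of t in D) (assumed form of f)."""
    return {t: 1.0 + math.log(f) for t, f in Counter(doc_tokens).items()}


def idf(corpus):
    """IDF_t = log(1 + N / f_t) (assumed form of f).

    N is the number of documents and f_t is the frequency of t
    across all documents, as stated on the slide.
    """
    n_docs = len(corpus)
    freq = Counter(t for doc in corpus for t in doc)
    return {t: math.log(1.0 + n_docs / f) for t, f in freq.items()}
```

With these definitions a term that occurs in every document receives a low IDF, while a rare term receives a high one, which is the discriminating behavior the slide describes.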

  8. Node & document ranking in PlanetP
     • Unfortunately, IDF is not suited for P2P
       • It requires an appearance count for every word in the community
     • We introduce the use of the Inverse Peer Frequency (IPF)
       • IPF_t = f(No. peers / No. peers with documents containing t)
       • IPF can be computed with local information
       • IPF is compatible across the community
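
Because every peer holds a summary of every other peer's terms, IPF can be computed without any network traffic. The sketch below is hedged in several ways: the logarithmic form of f, the peer-scoring rule (sum of IPF over the query terms a peer may hold), and the names `ipf` and `rank_peers` are assumptions, and plain Python sets stand in for the Bloom filters.

```python
import math


def ipf(term, directory):
    """IPF_t = log(1 + N / N_t) (assumed form of f).

    directory maps a peer nickname to its term summary; N_t is the
    number of peers whose summary contains the term.
    """
    n_peers = len(directory)
    n_with_term = sum(1 for summary in directory.values() if term in summary)
    if n_with_term == 0:
        return 0.0
    return math.log(1.0 + n_peers / n_with_term)


def rank_peers(query_terms, directory):
    """Order peers by the summed IPF of the query terms their summary contains."""
    scores = {}
    for peer, summary in directory.items():
        score = sum(ipf(t, directory) for t in query_terms if t in summary)
        if score > 0.0:
            scores[peer] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Peers that hold none of the query terms are dropped from the candidate list, so they are never contacted.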

  9. Stopping condition
     • Intuitive idea: stop as soon as k documents are retrieved
       • Not good
       • A node might have a few highly ranked documents and many that have a low rank
     • We propose an adaptive approach:
       • Contact nodes one by one and keep a list of the top k documents retrieved
       • Stop contacting candidates when p nodes in a row fail to contribute to the top k
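
The adaptive approach can be sketched as a loop over the ranked candidates. This is an illustrative sketch only: `adaptive_search` and the `fetch` callback (which stands in for contacting a peer and receiving scored documents) are hypothetical names, and the top-k list is kept in a min-heap as an implementation choice.

```python
import heapq


def adaptive_search(ranked_peers, fetch, k, p):
    """Contact peers in rank order; stop after p consecutive unhelpful peers.

    fetch(peer) is a hypothetical callback returning (score, doc_id) pairs.
    Returns the top-k list (best first) and the number of peers contacted.
    """
    top = []       # min-heap holding the best k (score, doc_id) pairs so far
    misses = 0     # consecutive peers that did not change the top k
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        contributed = False
        for score, doc_id in fetch(peer):
            if len(top) < k:
                heapq.heappush(top, (score, doc_id))
                contributed = True
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, doc_id))
                contributed = True
        misses = 0 if contributed else misses + 1
        if misses >= p:
            break
    return sorted(top, reverse=True), contacted
```

A larger p contacts more peers and lowers the risk of missing a good document held by a poorly ranked peer; a smaller p ends the search sooner.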

  10. Evaluation method
     • We use five well-known document collections
       • Each collection comes with a set of queries and relevance judgments
       • Here we present results for one (AP89)
     • We measure recall and precision
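
Recall and precision follow their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal sketch, with the function name `precision_recall` and the document identifiers invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the search
    relevant:  document ids judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are among three relevant ones gives a precision of 0.5 and a recall of 2/3.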

  11. Evaluation method
     • We use a simulator to test our algorithm
       • With different file distributions
       • Against a central search engine
       • Quantifying the effect of not using an adaptive stopping condition

  12. Results

  13. Results cont.

  14. More results
     • Adjusting the stop condition according to the community size and the number of results expected
       • We provide a linear function to determine p
     • Recall as the community grows to 1000 (scalability)
     • Overlap between PlanetP’s results and the ones obtained by using standard TFxIDF
       • 80% on average

  15. Conclusions
     • PlanetP matches TFxIDF’s performance using the TFxIPF approximation
       • It gives P2P communities search capabilities as powerful as those of environments with centralized resources
       • TFxIPF is applicable beyond PlanetP
     • PlanetP matches TFxIDF’s performance regardless of how documents are distributed throughout the community
     • Our stopping heuristic limits searches to a small subset of the community yet allows enough peers to be contacted to guarantee good results

  16. Related Work
     • Tapestry, Pastry, Chord and CAN
       • Implement a distributed hash table for P2P environments
       • Oriented towards name-based searches (for FS)
       • They already store all the information needed to implement TFxIPF
     • CORI and GlOSS
       • Address the problem of indexing and searching distributed collections of documents
       • They build a centralized index that has total knowledge of word usage, so they don’t contact unnecessary nodes

  17. Questions?