Text-Based Content Search and Retrieval in ad hoc P2P Communities

Francisco Matias Cuenca-Acuna

Thu D. Nguyen

http://www.panic-lab.rutgers.edu/

Motivation
  • It is hard to find information in current P2P infrastructures
    • They are designed for name-based search
    • They don’t have quality metrics
    • They don’t rank results
    • Most are optimized to find popular content
  • The current Internet search model has proven effective at locating data
    • Intuitive term-based query model
    • Quality metrics and ranking are critical factors in the success of Internet search engines
      • They help users quickly pinpoint relevant documents in a vast repository
Goals & challenges
  • Empower P2P communities with search capabilities similar to Internet search engines
    • No central servers
    • Fault tolerance
  • Cannot employ current model used by Internet search engines
    • No central management and administration
    • Resources are fragmented
    • Peer behavior is uncontrolled
Summary of PlanetP
  • Nodes maintain an index of their content
    • Represented as Bloom filters
  • Indexes and Directories are replicated everywhere
  • Gossiping keeps peers synchronized
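
To make the Bloom filter index concrete, here is a minimal sketch of how a peer might summarize the terms in its local files. The class name `BloomFilter`, the helper `build_index`, the filter size, the number of hash functions, and the use of salted SHA-256 are all assumptions for illustration; the slides do not specify PlanetP's actual parameters or hash functions.

```python
import hashlib


class BloomFilter:
    """Compact, lossy summary of a set of terms (hypothetical parameters).

    Membership tests never give false negatives but may give false positives.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def build_index(documents, **kwargs):
    """Summarize every whitespace-separated term of every document."""
    bloom = BloomFilter(**kwargs)
    for text in documents:
        for term in text.lower().split():
            bloom.add(term)
    return bloom
```

A peer would gossip only the filter's bits, which is far smaller than the full term list; the price is that `term in bloom` can occasionally answer yes for a term the peer does not hold.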

[Figure: two peers, each with Local Files (XML snippets), a Bloom filter over keys [K1,..,Kn], and a Local Directory, connected by Gossiping]

Content search in PlanetP

[Figure: a Query is processed in stages labeled Local lookup, Rank nodes, Contact candidates, Rank results and STOP. The Local Directory maps each Nickname (Alice, Bob, Charles, Diane, Edward, Fred, Gary) to its Keys [K1,..,Kn]; the candidates shown are Diane, Bob and Fred, and the results are File1, File2 and File3.]

The Vector Space model

[Figure: vectors labeled Document and Query]
  • Documents and queries are represented as k-dimensional vectors
    • Words are weighted according to their relevance for the document
    • Documents are weighted according to their words
  • The angle between a query and a document indicates their similarity
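
As an illustration of the angle-based similarity, the sketch below computes the cosine of the angle between two sparse term-weight vectors. Using raw term counts as weights in `to_vector` is an assumption made only to keep the example short; both function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def to_vector(text):
    """Assumed weighting: raw term counts, one dimension per distinct word."""
    return dict(Counter(text.lower().split()))
```

A cosine of 1 means the two vectors point in the same direction (angle 0), and a cosine of 0 means the query and the document share no terms.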
Weight assignment (TFxIDF)
  • Idea
    • Use per-document Term Frequency (TF) to weight words ($W_{D,t}$)
    • Use inverse global popularity (IDF) to find good discriminators among the query terms
  • Intuition
    • TF indicates how related a document is to a particular concept
    • Inverse Document Frequency (IDF) identifies the words that are good discriminators between documents
  • $W_{D,t} = f(\text{frequency of } t \text{ in } D)$
  • $\mathit{IDF}_t = f(\text{no. of documents} / \text{frequency of } t \text{ across documents})$
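
The slides leave the function f unspecified. The sketch below assumes one common logarithmic choice, a term weight of 1 + ln(count) and an IDF of ln(1 + N / n_t), where n_t is taken to be the number of documents containing t. These formulas and the names `tf_weight`, `idf` and `rank_documents` are assumptions, and document-length normalization is omitted for brevity.

```python
import math
from collections import Counter


def tf_weight(count):
    """Assumed form of W_{D,t}: dampened term frequency."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def idf(num_docs, docs_with_term):
    """Assumed form of IDF_t: rarer terms score higher."""
    if docs_with_term == 0:
        return 0.0
    return math.log(1.0 + num_docs / docs_with_term)


def rank_documents(query_terms, docs):
    """docs maps doc_id -> text; returns (doc_id, score) pairs, best first."""
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    num_docs = len(docs)
    scores = {}
    for doc_id, term_counts in counts.items():
        score = 0.0
        for term in query_terms:
            docs_with_term = sum(1 for c in counts.values() if term in c)
            score += tf_weight(term_counts[term]) * idf(num_docs, docs_with_term)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Note that `idf` needs the document count for every query term across the whole collection, which is exactly the global knowledge that is hard to obtain in a P2P community.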
Node & document ranking in PlanetP
  • Unfortunately, IDF is not well suited to P2P
    • Requires an appearance count for every word in the community
  • We introduce the use of the Inverse Peer Frequency
    • $\mathit{IPF}_t = f(\text{no. of peers} / \text{no. of peers with documents containing } t)$
    • IPF can be computed with local information
    • IPF is compatible across the community
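
The sketch below shows why IPF needs only local information: every peer holds a copy of the directory of per-peer summaries, so counting the peers that contain a term requires no network traffic. The logarithmic form ln(1 + N / N_t) and the rule of scoring a peer by the summed IPF of the query terms it holds are assumptions, as are the names `ipf` and `rank_nodes`. Plain Python sets stand in for Bloom filters here; any object supporting `term in summary` would work.

```python
import math


def ipf(term, directory):
    """directory maps peer nickname -> summary supporting `term in summary`."""
    peers_with_term = sum(1 for summary in directory.values() if term in summary)
    if peers_with_term == 0:
        return 0.0
    return math.log(1.0 + len(directory) / peers_with_term)


def rank_nodes(query_terms, directory):
    """Return the peers that may hold matching documents, best first."""
    scores = {}
    for peer, summary in directory.items():
        score = sum(ipf(t, directory) for t in query_terms if t in summary)
        if score > 0:
            scores[peer] = score
    return sorted(scores, key=scores.get, reverse=True)
```

For example, with a directory of three peers where only Alice holds "gossip" and both Alice and Bob hold "bloom", a query for both terms ranks Alice ahead of Bob and leaves out any peer that holds neither term.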
Stopping condition
  • Intuitive idea: Stop as soon as k documents are retrieved
    • Not good: the first k documents retrieved need not be the k best
    • A node might have a few highly ranked documents and many that have a low rank
  • We propose an adaptive approach:
    • Contact nodes one by one and keep a list of the top k documents retrieved
    • Stop contacting candidates when p nodes in a row fail to contribute to the top k
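
The adaptive approach can be sketched as follows. The callback `fetch(peer)`, which is assumed to return `(score, doc_id)` pairs from one peer, and the function name `adaptive_search` are hypothetical; the sketch only illustrates the heuristic as described on the slide.

```python
import heapq


def adaptive_search(ranked_peers, fetch, k, p):
    """Contact peers in rank order; stop once p peers in a row add nothing.

    Returns the top-k (score, doc_id) pairs, best first, and the number
    of peers contacted.
    """
    top = []  # min-heap of the best k results seen so far
    misses = 0  # consecutive peers that did not improve the top k
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        contributed = False
        for score, doc_id in fetch(peer):
            if len(top) < k:
                heapq.heappush(top, (score, doc_id))
                contributed = True
            elif score > top[0][0]:
                # Replace the current weakest entry of the top k.
                heapq.heapreplace(top, (score, doc_id))
                contributed = True
        misses = 0 if contributed else misses + 1
        if misses >= p:
            break
    return sorted(top, reverse=True), contacted
```

With k = 2 and p = 2, a search stops after two consecutive peers return only documents that score below the current top two, so lower-ranked peers are never contacted.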
Evaluation method
  • We use five well-known document collections
    • Each collection comes with a set of queries and relevance judgments
    • Here we present results for one (AP89)
  • We measure recall and precision
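
For reference, recall and precision follow their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The helper below is a small sketch of that computation; the function name and the document identifiers in the usage note are made up for illustration and are not taken from the collections above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: doc ids returned by the search
    relevant:  doc ids judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If a search returns four documents of which two are among three relevant ones, precision is 2/4 and recall is 2/3.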
Evaluation method (continued)
  • We use a simulator to test our algorithm
    • Different file distributions
    • Against a central search engine
    • Quantifying the effect of not using an adaptive stopping condition
More results
  • Adjusting the stop condition according to the community size and number of results expected
    • We provide a linear function to determine p
  • Recall as the community grows to 1000 (scalability)
  • Overlap between PlanetP’s results and the ones obtained by using standard TFxIDF
    • 80% on average
Conclusions
  • PlanetP matches TFxIDF's performance using the TFxIPF approximation
    • Gives P2P communities search capabilities as powerful as those of environments with centralized resources
    • TFxIPF is applicable beyond PlanetP
    • PlanetP matches TFxIDF’s performance regardless of how documents are distributed throughout the community
  • Our stopping heuristic limits searches to a small subset of the community yet allows enough peers to be contacted to guarantee good results
Related Work
  • Tapestry, Pastry, Chord and CAN
    • Implement a distributed hash table for P2P environments
    • Oriented towards name-based searches (for FS)
    • They already store all the information needed to implement TFxIPF
  • CORI and GlOSS
    • Address the problem of indexing and searching distributed collections of documents
    • They build a centralized index that has total knowledge of word usage so they don’t contact unnecessary nodes