Text-Based Content Search and Retrieval in ad hoc P2P Communities

Francisco Matias Cuenca-Acuna

Thu D. Nguyen

http://www.panic-lab.rutgers.edu/


Motivation

  • It is hard to find information in current P2P infrastructures

    • They are designed for name-based search

    • They don’t have quality metrics

    • They don’t rank results

    • Most are optimized to find popular content

  • The current Internet search model has proven effective for locating data

    • Intuitive term-based query model

    • Quality metrics and ranking are critical factors in the success of Internet search engines

      • They help users quickly pinpoint relevant documents in a vast repository


Goals & challenges

  • Empower P2P communities with search capabilities similar to Internet search engines

    • No central servers

    • Fault tolerance

  • Cannot employ the model currently used by Internet search engines

    • No central management and administration

    • Resources are fragmented

    • Peers' behavior is uncontrolled


Summary of PlanetP

  • Nodes maintain an index of their content

    • Represented as Bloom filters

  • Indexes and Directories are replicated everywhere

  • Gossiping keeps peers synchronized

[Figure: two peers, each with local files, XML snippets, a key list [K1,..,Kn], a Bloom filter, and a local directory, connected by gossiping.]
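The index summary described above can be sketched as a small Bloom filter. This is only an illustration: the filter size, the number of hash functions, the SHA-256 salting scheme, and the names `BloomFilter` and `build_index` are assumptions made for the example, not details taken from the slides.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter summarizing a set of terms (illustrative sizes)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting a single SHA-256 hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # May report a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def build_index(documents):
    """Summarize every term of a peer's local documents in one filter."""
    index = BloomFilter()
    for text in documents:
        for term in text.lower().split():
            index.add(term)
    return index
```

A peer would gossip only the compact `bits` value rather than its full term list, which is what makes replicating the index everywhere affordable.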


Content search in PlanetP

[Figure: a query is processed in four steps (local lookup, rank nodes, contact candidates, rank results) and then stops. The local directory maps each peer's nickname (Alice, Bob, Charles, Diane, Edward, Fred, Gary) to its keys [K1,..,Kn]; peers Diane, Bob, and Fred and files File1, File2, and File3 appear in the example.]


The Vector Space model

  • Documents and queries are represented as k-dimensional vectors

    • Words are weighted according to their relevance to the document

    • Documents are weighted according to their words

  • The angle between the query vector and a document vector indicates their similarity
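The angle comparison above is usually computed as the cosine between the two vectors. The sketch below assumes sparse vectors stored as term-to-weight dictionaries; the function name `cosine_similarity` is chosen for this example only.

```python
import math


def cosine_similarity(query, document):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * document.get(t, 0.0) for t, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_q * norm_d)
```

A value near 1 means the query and document point in almost the same direction; a value of 0 means they share no weighted terms.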


Weight assignment (TFxIDF)

  • Idea

    • Use per-document term frequency (TF) to weight words ($W_{D,t}$)

    • Use inverse global popularity (IDF) to find good discriminators among the query terms

  • Intuition

    • TF indicates how related a document is to a particular concept

    • Inverse document frequency (IDF) identifies the words that are good discriminators between documents

  • $W_{D,t} = f(\text{frequency of } t \text{ in } D)$

  • $IDF_t = f(\text{number of documents} / \text{frequency of } t \text{ across documents})$
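The slides leave the function f unspecified. The sketch below picks one common choice, logarithmic damping, purely as an assumption: a term weight of 1 + log(frequency) and an IDF of log(1 + N / frequency). The names `term_weights`, `idf`, and `tfidf_score` are hypothetical, and the score is left unnormalized to keep the example short.

```python
import math
from collections import Counter


def term_weights(document_terms):
    """Weight of each term in one document: dampened term frequency."""
    counts = Counter(document_terms)
    return {t: 1.0 + math.log(c) for t, c in counts.items()}


def idf(term, collection):
    """IDF of a term: larger when the term is rarer across the collection."""
    occurrences = sum(doc.count(term) for doc in collection)
    if occurrences == 0:
        return 0.0
    return math.log(1.0 + len(collection) / occurrences)


def tfidf_score(query_terms, document_terms, collection):
    """Unnormalized TFxIDF score of one document for a query."""
    weights = term_weights(document_terms)
    return sum(weights.get(t, 0.0) * idf(t, collection) for t in set(query_terms))
```

Here `collection` is a list of documents, each a list of terms. Note that `idf` needs occurrence counts over the whole collection, which is the global knowledge a central search engine has.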


Node & document ranking in PlanetP

  • Unfortunately, IDF is not well suited to P2P

    • Requires an appearance count for every word in the community

  • We introduce the use of the Inverse Peer Frequency

    • $IPF_t = f(\text{number of peers} / \text{number of peers with documents containing } t)$

    • IPF can be computed with local information

    • IPF is compatible across the community
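Because every peer holds a copy of the directory, IPF can be computed without contacting anyone. The sketch below assumes the directory is a mapping from nickname to any summary that supports membership tests (a Bloom filter, or a plain set as in the usage example). The choice f = log(1 + N / N_t) and the rule of ranking peers by the summed IPF of matching query terms are assumptions made for illustration; the slides do not define them.

```python
import math


def ipf(term, directory):
    """IPF of a term, computed from the locally replicated directory."""
    peers_with_term = sum(1 for summary in directory.values() if term in summary)
    if peers_with_term == 0:
        return 0.0
    return math.log(1.0 + len(directory) / peers_with_term)


def rank_nodes(query_terms, directory):
    """Order candidate peers by the summed IPF of the query terms they may hold."""
    scores = {}
    for nickname, summary in directory.items():
        score = sum(ipf(t, directory) for t in set(query_terms) if t in summary)
        if score > 0.0:
            scores[nickname] = score
    return sorted(scores, key=scores.get, reverse=True)
```

For example, with `directory = {"Alice": {"p2p", "search"}, "Bob": {"search"}, "Charles": {"bloom"}}`, the query `["p2p", "search"]` ranks Alice ahead of Bob and leaves Charles out of the candidate list.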


Stopping condition

  • Intuitive idea: Stop as soon as k documents are retrieved

    • Not good

    • A node might have few highly ranked documents and many that have a low rank

  • We propose an adaptive approach:

    • Contact nodes one by one and keep a list of the top k documents retrieved

    • Stop contacting candidates when p nodes in a row fail to contribute to the top k
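The adaptive approach above can be sketched as a loop over the ranked candidates. The callback `fetch_results`, which stands in for contacting a peer and receiving (score, document) pairs, is hypothetical, as is the function name `adaptive_search`.

```python
import heapq


def adaptive_search(ranked_peers, fetch_results, k, p):
    """Contact peers in rank order; stop once p peers in a row add nothing to the top k.

    fetch_results(peer) is a hypothetical callback returning (score, doc_id) pairs.
    Returns the top-k list (best first) and the number of peers contacted.
    """
    top_k = []  # min-heap holding the best (score, doc_id) pairs seen so far
    misses_in_a_row = 0
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        contributed = False
        for score, doc_id in fetch_results(peer):
            if len(top_k) < k:
                heapq.heappush(top_k, (score, doc_id))
                contributed = True
            elif score > top_k[0][0]:
                # Better than the current worst of the top k: swap it in.
                heapq.heapreplace(top_k, (score, doc_id))
                contributed = True
        misses_in_a_row = 0 if contributed else misses_in_a_row + 1
        if misses_in_a_row >= p:
            break
    return sorted(top_k, reverse=True), contacted
```

Unlike stopping as soon as k documents arrive, this keeps going while later peers still improve the list, so a peer with a few strong documents is not cut off by earlier peers that returned many weak ones.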


Evaluation method

  • We use five well known document collections

    • Each collection comes with a set of queries and relevance judgments

    • Here we present results for one (AP89)

  • We measure recall and precision
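For reference, recall and precision for one query can be computed from the retrieved set and the relevance judgments as follows. This is the standard definition; the function name and the document identifiers in the note are made up for the example.

```python
def recall_and_precision(retrieved, relevant):
    """Recall: share of relevant documents retrieved. Precision: share of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

With four documents retrieved (`d1` to `d4`) and three judged relevant (`d1`, `d3`, `d5`), two hits give a recall of 2/3 and a precision of 1/2.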


Evaluation method

  • We use a simulator to test our algorithm

    • Different file distributions

    • Against a central search engine

    • Quantifying the effect of not using an adaptive stopping condition


Results


Results cont.


More results

  • Adjusting the stop condition according to the community size and number of results expected

    • We provide a linear function to determine p

  • Recall as the community grows to 1000 (scalability)

  • Overlap between PlanetP’s results and the ones obtained by using standard TFxIDF

    • 80% on average


Conclusions

  • PlanetP matches TFxIDF's performance using the TFxIPF approximation

    • Gives P2P communities search capabilities as powerful as those of environments with centralized resources

    • TFxIPF is applicable beyond PlanetP

    • PlanetP matches TFxIDF’s performance regardless of how documents are distributed throughout the community

  • Our stopping heuristic limits searches to a small subset of the community yet allows enough peers to be contacted to guarantee good results


Related Work

  • Tapestry, Pastry, Chord and CAN

    • Implement a distributed hash table for P2P environments

    • Oriented towards name-based searches (for FS)

    • They already store all the information needed to implement TFxIPF

  • CORI and GlOSS

    • Address the problem of indexing and searching distributed collections of documents

    • They build a centralized index that has total knowledge of word usage so they don’t contact unnecessary nodes


Questions?

