“Information Retrieval in Peer-to-Peer Systems”

Dept. of Computer Science & Engineering, University of California - Riverside
“Information Retrieval in Peer-to-Peer Systems”
Demetrios Zeinalipour-Yazti
M.Sc. Thesis Defense
Monday, May 5, 2003, Surge 349, 12:00-1:00 PM
Thesis Committee: Dr. Dimitrios Gunopulos (Chairperson), Dr. Vana Kalogeraki, Dr. Chinya V. Ravishankar
http://www.cs.ucr.edu/~csyiazti/msc.html

Presentation Outline
• Introduction & Motivation
• Search Techniques for P2P Systems
• The Intelligent Search Mechanism
• PeerWare Simulation Infrastructure
• Experimental Evaluation
• Conclusions & Future Work

Introduction to Peer-to-Peer
• Peer-to-Peer computing definition: “Sharing of computer resources and information through direct exchange.”
• Clients (downloaders) are also servers.
• Clients may join or leave the network at any time => highly fault-tolerant, but at a cost!
• Searches are done within the virtual network, while actual downloads are done offline (with HTTP).
[Figure: the virtual P2P topology vs. the physical topology]

Introduction to Peer-to-Peer
• Peer-to-Peer (P2P) systems are becoming increasingly popular.
• P2P file-sharing systems such as Gnutella, Napster and Freenet realized a distributed infrastructure for sharing files.
• Traditionally, files were shared using the client-server model (e.g. HTTP), which does not scale well because the service is centralized.
• P2P systems offer new advantages in simplicity of use, robustness, self-organization and scalability.

Information Retrieval in P2P
Problem: “How to efficiently retrieve information in P2P systems where each node shares a collection of documents?”
• Documents consist of keywords.
• The problem resembles Information Retrieval, but the resources are now distributed.
• Primary data structures such as global inverted indexes can’t be maintained efficiently.

Solutions for P2P Information Retrieval
1) Centralized Approaches
  • Centralized indexes
  • e.g. Napster, SETI@HOME
  • Steps: 1) Upload Index 2) Query/QueryHit 3) Download (offline)
2) Purely Distributed Approaches
  • Each node has only local knowledge.
  • I.R. is done using brute-force mechanisms.
  • e.g. Gnutella, Fasttrack (Kazaa)
  • Steps: 1) Connect 2) Query/QueryHit 3) Download (offline)
3) Hybrid Approaches
  • One or more peers have partial indexes of the contents of others.
  • e.g. Limewire's Ultrapeers
  • Steps: 1) Connect 2) IntelligentQuery/QueryHit 3) Download (offline)

Motivation
• On 1st June we crawled the Gnutella P2P network for 5 hours with 17 workstations.
• We analyzed 15,153,524 query messages.
• Observation: high locality of specific queries.
• We try to exploit this property for more efficient searches.

Search Techniques for P2P systems
1. Breadth-First Search (Gnutella)
• Idea: Each Query message is propagated along all outgoing links of a peer, using a TTL (time-to-live).
• The TTL is decremented on each forward until it becomes 0.
• This is the I.R. technique used in P2P systems such as Gnutella.
• Highlights
  • The physical network is brought to its knees.
  • Long delays for search results.
[Figure: P2P network N with peer q, node A and peer d; 1 = QUERY, 2 = QUERYHIT]
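
The flooding described on this slide can be sketched as follows. This is only an illustrative model, not Gnutella's implementation: the function name `flood_query`, the adjacency-dict `overlay`, and the choice to count one message per outgoing link (skipping the link the query arrived on) are all assumptions made for the example.

```python
from collections import deque

def flood_query(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent)."""
    seen = {source}                       # peers that already processed this query
    messages = 0
    frontier = deque([(source, None, ttl)])
    while frontier:
        peer, sender, remaining = frontier.popleft()
        if remaining == 0:
            continue                      # TTL exhausted: stop forwarding
        for neighbor in graph[peer]:
            if neighbor == sender:
                continue                  # do not send the query straight back
            messages += 1                 # one message per outgoing link
            if neighbor not in seen:      # duplicate copies are received but dropped
                seen.add(neighbor)
                frontier.append((neighbor, peer, remaining - 1))
    return seen, messages

# Hypothetical 5-peer overlay; peer "q" issues the query.
overlay = {
    "q": ["a", "b"],
    "a": ["q", "c"],
    "b": ["q", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
reached, sent = flood_query(overlay, "q", ttl=3)
```

Even on this tiny overlay some messages are duplicates that the receiver drops, which is the source of the traffic overhead noted in the highlights.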

Search Techniques for P2P systems
2. Modified Random BFS [V. Kalogeraki, D. Gunopulos, D. Zeinalipour-Yazti. CIKM 2002]
• Idea: Each Query message is forwarded to only a fraction of the outgoing links (e.g. ½ of them).
• The TTL is again decremented on each forward until it becomes 0.
• Highlights
  • Fewer messages, but possibly fewer results.
  • This algorithm is probabilistic.
  • Some segments of the network may become unreachable.
[Figure: P2P network N with peer q, nodes A, B, C and peer d; 1 = QUERY, 2 = QUERYHIT; one segment is marked unreachable]
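
A minimal sketch of the random-subset forwarding idea, under stated assumptions: the function name `random_bfs`, the `fraction` parameter, and the rule of always forwarding to at least one neighbor are illustrative choices, not details taken from the cited paper.

```python
import random

def random_bfs(graph, source, ttl, fraction=0.5, rng=None):
    """Forward each query to a random fraction of a peer's links.

    Returns (peers reached, messages sent).
    """
    rng = rng or random.Random()
    seen = {source}
    messages = 0
    frontier = [(source, None, ttl)]
    while frontier:
        peer, sender, remaining = frontier.pop(0)
        if remaining == 0:
            continue                          # TTL exhausted
        candidates = [n for n in graph[peer] if n != sender]
        if not candidates:
            continue
        # Assumption: forward to at least one neighbor, even for low degrees.
        k = max(1, int(len(candidates) * fraction))
        for neighbor in rng.sample(candidates, k):
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, peer, remaining - 1))
    return seen, messages

# Hypothetical 5-peer overlay; peer "q" issues the query.
overlay = {
    "q": ["a", "b"],
    "a": ["q", "c"],
    "b": ["q", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
reached, sent = random_bfs(overlay, "q", ttl=3, fraction=0.5, rng=random.Random(7))
```

With `fraction=1.0` the sketch degenerates to plain flooding; with `fraction=0.5` it sends fewer messages but may miss peers, which matches the trade-off listed in the highlights.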

Search Techniques for P2P systems
3. Searching Using Random Walkers [Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. ICS 2002]
• Idea: Each Query message is forwarded to 1 neighbor.
• With k walkers, after T steps we reach roughly as many nodes as 1 walker after kT steps. (They use 16-64 walkers.)
• Highlights
  • Network traffic is reduced (compared to BFS) by 2 orders of magnitude.
  • Increases the user-perceived delay (from 2-6 hops to 4-15 hops).
  • This algorithm is probabilistic, and the likelihood of locating the objects depends on the network topology.
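
The walker idea can be illustrated with a small sketch. It is a simplification under assumptions: the name `random_walk_search`, the `has_object` predicate, and stopping all walkers as soon as one of them finds the object are choices made for this example and are not taken from the cited paper. Every peer is assumed to have at least one neighbor.

```python
import random

def random_walk_search(graph, source, has_object, walkers=16, max_steps=32, rng=None):
    """Run `walkers` independent random walks from `source`.

    Returns (peer holding the object or None, steps taken, messages sent).
    """
    rng = rng or random.Random()
    messages = 0
    positions = [source] * walkers
    for step in range(1, max_steps + 1):
        for i, peer in enumerate(positions):
            nxt = rng.choice(graph[peer])   # each walker moves to ONE random neighbor
            messages += 1
            if has_object(nxt):
                return nxt, step, messages
            positions[i] = nxt
    return None, max_steps, messages
```

In this model the message count grows as walkers × steps rather than with the full fan-out of flooding, at the price of more hops before a hit.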

Search Techniques for P2P systems
4. Using Randomized Gossiping to Replicate Global State [F.M. Cuenca-Acuna, Thu D. Nguyen, HPDC-12]
• Idea: PlanetP uses Bloom filters to propagate summary indexes of the contents of a peer.
• Bloom filters are used for membership queries.
• Highlights
  • Not scalable (the technique works well for <10000 nodes).
  • No data replication required.
  • False positives are a function of m, n and k (filter size in bits, number of inserted items, number of hash functions) and can be kept small.
[Figure: an 8-bit Bloom filter with 4 hash functions h1..h4 over D = {d1, d2, ..., dn}; the bits at positions h1(d1)..h4(d1) are set to 1 and are checked for the membership query "d1?"]
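
A Bloom filter of the kind pictured on this slide can be sketched in a few lines. This is a generic illustration, not PlanetP's code: the class name `BloomFilter`, the method names, and the use of salted SHA-256 digests as the k hash functions are assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """m-bit Bloom filter with k hash functions (default: the slide's 8 bits, 4 hashes)."""

    def __init__(self, m=8, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Assumption: derive k hash functions by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

summary = BloomFilter(m=64, k=4)
summary.add("italy")
summary.add("earthquake")
```

A peer can gossip `summary.bits` instead of its full keyword list. A lookup never yields a false negative, and the false-positive rate depends on m, n and k as noted on the slide.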

Search Techniques for P2P systems
5. Searching using Local Indices [Arturo Crespo and Hector Garcia-Molina, ICDCS 2002]
• Idea: Create indices which contain “statistics” that reveal the “direction” towards the documents.
• Types of proposed indices
  • Compound Routing Index (CRI): metric = number of documents.
  • Hop-Count Routing Index (HRI): maintains a CRI for k hops.
  • Exponentially Aggregated Index (ERI): applies a cost formula to the HRI to shrink the HRI’s size.
• Highlights
  • Not scalable; routing updates are expensive, but still better than replicating data indexes.
  • Assumes a static environment, but no data replication is required.

Search Techniques for P2P systems
6. Directed BFS and the >RES Heuristic (1/2) [Beverly Yang and Hector Garcia-Molina, ICDCS 2002]
• Proposed techniques:
  • Directed BFS, based on aggregate statistics about each neighbor (e.g. number of results it returned, shortest queue, most data forwarded).
  • Iterative Deepening, until Z results are returned.
  • Local Indexes: each node maintains the actual index over the data of peers r hops away.
• Their experiments deploy the Directed BFS techniques by attaching nodes to the Gnutella network.
• The >RES heuristic is shown to work well.

Search Techniques for P2P systems
6. Directed BFS and the >RES Heuristic (2/2)
• The >RES heuristic is optimized to find Z documents efficiently, for some user-defined Z.
• >RES works well because:
  • It captures stable/large network segments.
  • It potentially reaches less overloaded peers.
• >RES is a quantitative approach.
• Drawback: >RES doesn’t route queries to the most relevant content.

Search Techniques for P2P systems
7. Depth-First Search and Freenet [I. Clarke, O. Sandberg, B. Wiley, and T.W. Hong, LNCS vol. 2009]
• Idea: Objects are hashed, and the hash of a query is routed based on “key closeness” in a DFS manner.
• Highlights
  • Uses caching of key/object pairs for future requests.
  • Data replication along the QueryHit path provides availability.
  • Anonymity of searcher and publisher.
• Drawbacks
  • Searches ONLY based on object identifier.
  • The user-perceived delay is high.
[Figure: peer q issues “Search: A” as QUERY h(A) (1); the original file A is held at R, the result (2) is returned and the file is replicated along the return path through nodes S, B, A, C]

Search Techniques for P2P systems
8. Consistent Hashing and Chord [Ion Stoica et al. SIGCOMM 2001]
• Idea: Objects and nodes are hashed to m-bit identifiers and organized in a virtual ring. Object lookup is achieved in O(log N).
• Highlights: Consistent hashing achieves
  • good load balancing of keys, and
  • little object/key movement when a node joins or leaves.
• Drawbacks
  • Searches ONLY based on object identifier.
  • Data movement may be a big overhead.
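
The ring assignment behind consistent hashing can be illustrated as below. This sketch covers only the mapping of keys to their successor node and the limited key movement on a join. It does not implement Chord's finger tables, which are what give O(log N) lookups. The names `chord_id`, `successor` and `owners`, and the use of SHA-1 truncated to m bits, are assumptions for the example.

```python
import hashlib
from bisect import bisect_left

def chord_id(name, m=16):
    """Hash a node address or object name to an m-bit identifier."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(node_ids, key_id):
    """First node clockwise from key_id on the ring (node_ids must be sorted)."""
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]    # wrap around past the largest id

def owners(node_ids, key_ids):
    """Map every key to the node responsible for it."""
    return {k: successor(node_ids, k) for k in key_ids}

keys = [5, 15, 22, 28, 31]
before = owners([10, 20, 30], keys)
after = owners([10, 20, 25, 30], keys)    # a node with id 25 joins the ring
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys between the new node and its predecessor change owner, which is the "little key movement" property listed in the highlights.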

Intelligent Search Mechanism (ISM)
Introduction
• Idea: Each Query message is forwarded intelligently, based on which queries a peer answered in the past.
• Components of ISM (for each node u)
  • Profile Mechanism, maintained for each neighbor in N(u).
  • Peer Ranking Mechanism, for ranking peers locally and sending a search query only to the ones most likely to answer.
  • Similarity Function, for finding similar search queries.
  • Search Mechanism, for propagating queries based on local indexes.
[Figure: peer q consults its profiles before forwarding a QUERY (1) through node A; peer d returns a QUERYHIT (2)]

Intelligent Search Mechanism (ISM)
Components of ISM
a) Profile Mechanism
• Maintains a list of past queries routed through that host.
• Every time a QueryHit is received, the table is updated.
• The profile manager uses a Least Recently Used policy to keep the most recent queries in the repository.
• Profiles are kept for neighbors only, so the cost of maintaining them is O(Td), where T is the size limit of each profile and d is the degree of the node (size: T*d).
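
One way to picture a per-neighbor profile with Least Recently Used eviction is the sketch below. It is not the thesis implementation: the class name `NeighborProfile`, the method `record_query_hit`, and the use of an `OrderedDict` are assumptions for illustration.

```python
from collections import OrderedDict

class NeighborProfile:
    """Keeps at most `capacity` (T) recently answered queries for one neighbor."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queries = OrderedDict()      # oldest entry first, newest last

    def record_query_hit(self, query):
        """Called when a QueryHit for `query` arrives through this neighbor."""
        self.queries[query] = True
        self.queries.move_to_end(query)   # mark as most recently used
        while len(self.queries) > self.capacity:
            self.queries.popitem(last=False)   # evict the least recently used

    def recent_queries(self):
        return list(self.queries)

# One profile per neighbor: total size is bounded by T * d entries.
T = 2
profiles = {neighbor: NeighborProfile(T) for neighbor in ["P1", "P2", "P3"]}
for q in ["italy disaster", "super bowl", "italy disaster", "san diego"]:
    profiles["P1"].record_query_hit(q)
```

After these four updates the profile for P1 holds "italy disaster" and "san diego"; "super bowl" was the least recently used entry and has been evicted.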

Intelligent Search Mechanism (ISM)
Components of ISM
b) The RelevanceRank Peer Ranking Metric
• Before forwarding a Query message, a peer performs an on-the-fly ranking of its peers to determine the best paths.
• We use the Aggregate Weighted Similarity of peer Pi to a query q, computed by a peer Pl. [Formula not preserved in this transcript; only the fragment "=2" survives.]
Example
Assume host Pl needs to forward a query q = “italy disaster” to two of its peers {P1, P2, P3}. Pk maintains queries {q1, q2, ..., q5} in its profile.
• Sim(q, q1) = 0.8, Sim(q, q2) = 0.6, Sim(q, q3) = 0.5, Sim(q, q4) = 0.4, Sim(q, q5) = 0.4
• P1 => RR(P1, q) = 0.8 x 2 = 1.6
• P2 => RR(P2, q) = (0.6 x 2 + 0.5 x 2) = 2.2
• P3 => RR(P3, q) = (0.4 x 2 + 0.3 x 2) = 1.4
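
The ranking step can be reproduced with a short sketch that follows the arithmetic of the example on this slide, where each similarity is multiplied by 2 and the products are summed per peer. The function names `relevance_rank` and `rank_peers` and the `weight` parameter are assumptions for illustration, and the sketch mirrors the slide's worked numbers rather than a formula stated in the text.

```python
def relevance_rank(similarities, weight=2.0):
    """Aggregate the similarities of a peer's past queries to the new query.

    Assumption: mirrors the slide's arithmetic (sum of similarity x 2).
    """
    return sum(weight * s for s in similarities)

def rank_peers(profile_sims, top):
    """Return the `top` peers with the highest aggregate score."""
    scores = {peer: relevance_rank(sims) for peer, sims in profile_sims.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Similarities used in the slide's RR computations for each peer.
profile_sims = {
    "P1": [0.8],
    "P2": [0.6, 0.5],
    "P3": [0.4, 0.3],
}
best_two = rank_peers(profile_sims, top=2)
```

With these numbers the query is forwarded to P2 and P1, the two peers with the highest scores in the example.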

Intelligent Search Mechanism (ISM)
Components of ISM
c) Similarity Function: the cosine similarity
• Assume that L is the set of all words (in the Profile Manager), e.g. L = {elections, bush, clinton, super, bowl, san, diego, … , italy, earthquake, disaster}.
• We define an |L|-dimensional space where each query is a vector. If q = “italy disaster”, then the vector of q is [0,0,0,…,1,0,1].
• Recall that we also have a vector for each query qi stored in the Profile Manager.
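
The vector model on this slide can be sketched as follows. The helper names `query_vector` and `cosine_similarity` are assumptions, and the vocabulary is the slide's example list with the elided middle part left out.

```python
import math

def query_vector(query, vocabulary):
    """Binary |L|-dimensional vector: 1 where the query contains the word."""
    words = set(query.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["elections", "bush", "clinton", "super", "bowl",
              "san", "diego", "italy", "earthquake", "disaster"]

q = query_vector("italy disaster", vocabulary)
past = query_vector("italy earthquake", vocabulary)
similarity = cosine_similarity(q, past)
```

Here `q` ends in `[1, 0, 1]` as on the slide. The two queries share one of their two words each, so their similarity is 1 / (√2 · √2) = 0.5.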

Intelligent Search Mechanism (ISM)
Components of ISM
d) Search Mechanism
• Utilizes the Peer Ranking Mechanism to forward queries to the nodes that are most likely to contain the information we are looking for.
[Figure: peer q consults its profiles to choose where to send the QUERY (1) on the way to peer d]

Intelligent Search Mechanism (ISM)
Breaking cycles with Random Perturbation
• Suppose that nodes answer only the conjunction of the query terms.
• Suppose that query q has no answer from A, B, C or D, and that one of them answered a similar query in the past.
• => Query q fails to explore the segment through E.
• Random Perturbation adds one additional random message.
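
The perturbation step can be sketched as picking the top-ranked neighbors and then adding one randomly chosen neighbor from the rest. The function name `choose_next_hops` and the score dictionary are assumptions for illustration.

```python
import random

def choose_next_hops(scores, m, rng=None):
    """Pick the m best-ranked neighbors plus one random neighbor from the rest.

    `scores` maps each neighbor to its ranking score for the current query.
    """
    rng = rng or random.Random()
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = ranked[:m]
    rest = ranked[m:]
    if rest:
        chosen.append(rng.choice(rest))   # the random perturbation message
    return chosen

# Hypothetical scores: E never answered a similar query, so it ranks last.
scores = {"A": 3.0, "B": 2.0, "C": 1.0, "E": 0.0}
next_hops = choose_next_hops(scores, m=2, rng=random.Random(0))
```

Without the random pick, a low-ranked neighbor such as E would never receive the query. With it, every neighbor keeps a nonzero chance of being explored.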

PeerWare Simulation Infrastructure
Introduction
• PeerWare is our distributed middleware infrastructure that allows us to benchmark various query routing algorithms.
• It is deployed on a network of 50 workstations.
• It uses public/private keys and SSH to connect to the networked hosts.
• It is implemented in Java and consists of approximately 10000 lines of code.

PeerWare Simulation Infrastructure
Why real middleware and not simulations?
• Many properties, such as network failures and dropped queries, may reveal interesting and unknown patterns.
• In a real middleware we are able to measure the actual time needed to satisfy queries.
• Finally, we avoid the assumptions (network delays etc.) that are typical in simulation environments.
• The Anthill Project (Univ. of Bologna) uses a similar approach to investigate properties of the Freenet algorithm.

PeerWare Simulation Infrastructure
PeerWare Components
• dataGen: the Dataset Generator
• graphGen: the Network Graph Generator
• dataPeer: the Data Node
• searchPeer: the Search Node
Other Administrative Components
• netLaucher: shell script that launches the network
• netStats: shell script that provides statistics
• graphPlot: shell script that plots graphs based on the generated results

PeerWare Simulation Infrastructure
1) dataGen Component
• dataGen is the Dataset Generator, which generates documents about specific topics (each peer can have some specialized knowledge).
• It uses the REUTERS News Agency dataset (22,531 documents).
• It groups documents by various properties: {Date, Topics, Places, People, Orgs, Companies}.
• In our experiments we use the Places attribute and generate document sets for 104 countries.

PeerWare Simulation Infrastructure
2) graphGen Component
• graphGen is the topology generator.
• Currently it generates random topologies given parameters such as {degree, IPs, ports}.
• It uses graphViz to generate visualizations of the generated topologies.

PeerWare Simulation Infrastructure
3) dataPeer Component
• dataPeer is a P2P client that maintains an XML repository of documents.
• It uses the PDOM-XQL engine to query its documents.
• It pre-establishes persistent TCP connections to other peers.
[Figure: a Data-Peer (e.g. usa) with its routing structures (profiles), XQL / PDOM-XML engine, XML data files, P2P Network Manager module and usa.graph file, connected to peers such as mexico, argentina, u.k, china, italy, india, france, greece and germany]

PeerWare Simulation Infrastructure
4) searchPeer Component
• searchPeer is a P2P client that connects to a PeerWare network and performs unstructured queries.
• Keywords are sampled from within the dataset.
• It logs statistics such as the query response time, which nodes answered a query, etc.

Experimental Evaluation
Introduction
• We create a distributed Newspaper application.
• We use a random network of 104 peers.
• Each peer has documents for 1 country.
• The average degree of a node is 7 ≈ log2(100) (connected graph).
• We perform two series of experiments:
  • 10x10 sequential queries with a delay of 4 sec.
  • 400 random queries with a delay of 4 sec.
• We compare Doc. Ratio (Recall Rate) vs. Num. of messages for:
  • BFS (Gnutella message flooding): forward to all degree neighbors.
  • Random BFS: randomly forward to degree/2 nodes.
  • Intelligent Search Mechanism: forward to the M = (degree/2)-1 nodes with the highest RelevanceRank, plus 1 random node.
  • >RES Heuristic: forward to the degree/2 nodes that returned the most results (>RES).

Experimental Evaluation
Reducing Query Messages (10x10 Experiment)
Recall Rate vs. Num. of messages with TTL=4
• BFS uses ~1050 messages w/ recall rate 100%
• RBFS uses ~220 (20%) msgs w/ recall rate ~50%
• >RES uses ~400 (38%) msgs w/ recall rate ~70%
• ISM uses ~400 (38%) msgs w/ recall rate ~90%
• ISM improves over time, since the peer profiles accumulate more knowledge.
• ISM and >RES start out slow, since they use RBFS until they populate their routing structures.

Experimental Evaluation
Digging Deeper by Increasing the TTL (10x10 Experiment)
Recall Rate vs. Num. of messages with TTL=5
• BFS again uses ~1050 messages w/ recall rate 100%
• RBFS uses ~450 (43%) msgs w/ recall rate ~82%
• >RES uses ~570 (54%) msgs w/ recall rate ~90%
• ISM uses ~570 (54%) msgs w/ recall rate ~99%

Experimental Evaluation
Reducing Query Response Time (QRT) (10x10 Experiment)
• BFS’s QRT is on the order of 6 seconds.
• RBFS, ISM and >RES use 30-60% of BFS’s QRT for TTL=4 and 60-80% for TTL=5.
• BFS’s unnecessary messages increase the user-perceived delay.
[Figure: the Query Response Time as a percentage of BFS]

Experimental Evaluation
The Discarded Message Problem (DMP)
• A query q is identified by a GUID.
• To avoid cycles, a node never forwards a query it has already forwarded.
• DMP occurs if a node has forwarded q with TTL1 and then receives q again with TTL2, where TTL2 > TTL1. The second copy could have travelled further, but it is discarded.
• In our experiments approximately 30% of the queries were affected by the DMP.
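
The situation can be reproduced with a toy duplicate filter. The class name `Peer`, the `receive` method and the returned labels are assumptions for illustration; the sketch only models the GUID check, not message forwarding itself.

```python
class Peer:
    """Remembers the GUIDs it has forwarded and the TTL each one carried."""

    def __init__(self):
        self.forwarded = {}               # GUID -> TTL of the copy that was forwarded

    def receive(self, guid, ttl):
        if guid not in self.forwarded:
            self.forwarded[guid] = ttl
            return "forwarded"
        if ttl > self.forwarded[guid]:
            # The new copy could reach further, but the GUID check drops it.
            return "discarded (DMP)"
        return "discarded"

peer = Peer()
first = peer.receive("guid-1", ttl=1)     # arrives over a long path, little TTL left
second = peer.receive("guid-1", ttl=3)    # arrives later over a shorter path
third = peer.receive("guid-1", ttl=1)
```

In this run the first copy is forwarded with little TTL left, and the later copy with the larger TTL is the one that gets dropped, so peers beyond this node may never see the query.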

Experimental Evaluation
Improving Recall Rate over Time (400 Experiment)
• The 10x10 queries experiment suited ISM well.
• In this experiment we perform 400 random queries.
• BFS’s overwhelming message load creates two major outbreaks.
• ISM improves over time, achieving a 96% recall rate while again using 38% of the messages.

Conclusions
• Efficient Information Retrieval in P2P networks is not feasible with the current search algorithms.
• We propose an Intelligent Search Mechanism that uses local knowledge to improve Information Retrieval in P2P.
• We implement PeerWare and evaluate the performance of various search techniques.
• The ISM achieves in some cases a 100% recall rate while using only 57% of the BFS messaging.

Future Work
• Probe different network topologies, such as ASMap with power laws.
• Deploy larger PeerWares with more queries.
• Probe different peer-profile maintenance policies.
• Use stemming/stop words to answer more accurately.
• Compare the performance of our method with newly proposed techniques (random gossiping, random walkers, etc.).
• 60% of Gnutella belongs to 20% ISPs. How can we exploit that to provide more efficient query routing schemes?

Thank You!
