
Improve search in unstructured P2P overlay


Presentation Transcript


  1. Improve search in unstructured P2P overlay

  2. Peer-to-peer Networks • Peers are connected by an overlay network. • Users cooperate to share files (e.g., music, videos, etc.)

  3. (Search in) Basic P2P Architectures • Centralized: a central directory server locates content (Napster) • Decentralized: search is performed by probing peers • Structured (DHTs): (CAN, Chord, …) data location is tightly coupled with the overlay topology - the search is routed according to the query • Supports only exact-match queries and requires a tightly controlled overlay • Unstructured: (Gnutella) search is "blind" - the peers probed are unrelated to the query

  4. Topics • Search strategies • Beverly Yang and Hector Garcia-Molina, "Improving Search in Peer-to-Peer Networks", ICDCS 2002 • Arturo Crespo, Hector Garcia-Molina, "Routing Indices For Peer-to-Peer Systems", ICDCS 2002 • Shortcuts • Kunwadee Sripanidkulchai, Bruce Maggs and Hui Zhang, "Efficient Content Location Using Interest-based Locality in Peer-to-Peer Systems", INFOCOM 2003 • Replication • Edith Cohen and Scott Shenker, "Replication Strategies in Unstructured Peer-to-Peer Networks", SIGCOMM 2002

  5. Improving Search in Peer-to-Peer Networks ICDCS 2002 Beverly Yang Hector Garcia-Molina

  6. Motivation • The purpose of a data-sharing P2P system is to accept queries from users, and to locate and return data (or pointers to the data) • Metrics • Cost • Average aggregate bandwidth • Average aggregate processing cost • Quality of results • Number of results • Satisfaction: a query is satisfied if Z (a value specified by the user) or more results are returned • Time to satisfaction

  7. Current Techniques • Gnutella • BFS with depth limit D • Wastes bandwidth and processing resources • Freenet • DFS with depth limit D • Poor response time

  8. Broadcast policies • Iterative deepening • Directed BFS • Local Indices

  9. Iterative Deepening • In systems where satisfaction is the metric of choice, iterative deepening is a good technique • Under a policy P = {a, b, c} and waiting period W: • A source node S first initiates a BFS of depth a • The query is processed and then becomes frozen at all nodes that are a hops from the source • S then waits for a time period W

  10. Iterative Deepening • If the query is not satisfied, S starts the next iteration, initiating a BFS of depth b • S sends a "Resend" message with a TTL of a • A node that receives a Resend message simply forwards it; if the node is at depth a, it drops the Resend message and unfreezes the corresponding query by forwarding the query message with a TTL of b - a to all its neighbors • A node needs to freeze a query for only slightly more than W time units before deleting it (see the sketch below)
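The source-side control loop can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation; `bfs`, `resend`, and `results_so_far` are hypothetical helpers standing in for the system's messaging layer.

```python
import time

def iterative_deepening(source, query, policy, W, Z):
    """Probe successively deeper BFS rings until the query is satisfied.

    policy -- increasing depth limits, e.g. [a, b, c]
    W      -- seconds to wait between iterations
    Z      -- number of results that satisfies the query
    """
    prev_depth = 0
    for depth in policy:
        if prev_depth == 0:
            # First iteration: an ordinary BFS of depth a; the query
            # freezes at nodes exactly `depth` hops away.
            source.bfs(query, ttl=depth)
        else:
            # Later iterations: a Resend with TTL = previous depth
            # travels to the frozen frontier, where the query is
            # unfrozen and forwarded with TTL = depth - prev_depth.
            source.resend(query, ttl=prev_depth, new_ttl=depth - prev_depth)
        time.sleep(W)                      # wait for responses
        if source.results_so_far(query) >= Z:
            return True                    # satisfied; stop probing
        prev_depth = depth
    return False                           # policy exhausted
```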

  11. Directed BFS • If minimizing response time is important to an application, iterative deepening may not be appropriate • A source sends query messages to just a subset of its neighbors • Each node maintains simple statistics on its neighbors • Number of results received from each neighbor • Latency of the connection

  12. Directed BFS (cont.) • Heuristics for choosing candidate neighbors: • The neighbor that returned the highest number of results for previous queries • The neighbor whose response messages have taken the lowest average number of hops • The neighbor that has forwarded the highest message count (see the sketch below)
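A minimal sketch of the neighbor-selection step, assuming each node keeps a per-neighbor statistics record; the field names (`results`, `avg_hops`, `messages`) are illustrative, not from the paper.

```python
def pick_neighbors(stats, k, heuristic="most_results"):
    """Rank neighbors by a simple heuristic and return the top k.

    stats -- dict mapping neighbor id to a record of past behavior,
             e.g. {"results": 12, "avg_hops": 2.5, "messages": 340}
    """
    rankers = {
        # Neighbor that returned the most results for past queries.
        "most_results": lambda s: -s["results"],
        # Neighbor whose responses took the fewest hops on average.
        "fewest_hops":  lambda s: s["avg_hops"],
        # Neighbor that forwarded the most messages overall.
        "most_traffic": lambda s: -s["messages"],
    }
    key = rankers[heuristic]
    ranked = sorted(stats, key=lambda n: key(stats[n]))
    return ranked[:k]
```

The source then forwards the query only to the returned subset rather than flooding all neighbors.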

  13. Local Indices • Each node n maintains an index over the data of all nodes within a radius of r hops • The query is processed against the index only at nodes whose depth is listed in the policy; all nodes at depths not listed simply forward the query • Example: policy P = {1, 5}
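The per-node decision under a Local Indices policy can be sketched as below, assuming the message carries its current depth; the helper names (`reply`, `local_index.lookup`, `neighbors`) are hypothetical.

```python
def handle_query(node, query, depth, policy, ttl):
    """Process or forward a query under a Local Indices policy.

    Nodes at a depth listed in the policy answer from their local
    index (which already covers all data within r hops), so they can
    respond on behalf of a whole region; all other nodes just pass
    the query along.  The TTL should stop the search at max(policy).
    """
    if depth in policy:
        node.reply(query, node.local_index.lookup(query))
    if ttl > 0:
        for neighbor in node.neighbors:
            handle_query(neighbor, query, depth + 1, policy, ttl - 1)
```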

  14. Experimental Results

  15. Routing Indices For Peer-to-Peer Systems Arturo Crespo, Hector Garcia-Molina Stanford University {crespo,hector}@db.Stanford.edu

  16. Motivation • A key part of a P2P system is document discovery • The goal is to help users efficiently find documents with content of interest across potential P2P sources • Search mechanisms can be classified into three categories • Mechanisms without an index • Mechanisms with specialized index nodes (centralized search) • Mechanisms with indices at each node (distributed search)

  17. Motivation (cont.) • Gnutella uses a mechanism in which nodes have no index • Queries are propagated from node to node until matching documents are found • Although this approach is simple and robust, it incurs the enormous cost of flooding the network every time a query is issued • Centralized-search systems such as Napster use specialized nodes that maintain an index of the documents available in the P2P system • The user queries an index node to identify the nodes holding documents with the desired content • A centralized system is vulnerable to attack, and it is difficult to keep the indices up to date

  18. Motivation (cont.) • A distributed-index mechanism • Routing Indices (RIs) • Give a "direction" towards the document, rather than its actual location • By storing "routes", the index size is proportional to the number of neighbors

  19. Peer-to-peer Systems • A P2P system is formed by a large number of nodes that can join or leave the system at any time • Each node has a local document database that can be accessed through a local index • The local index receives content queries and returns pointers to the documents with the requested content

  20. Query Processing in a Distributed Search P2P System • In a distributed-search P2P system, users submit queries to any node along with a stop condition • A node receiving a query first evaluates it against its own database and returns pointers to any results to the user • If the stop condition has not been reached, the node selects one or more of its neighbors and forwards the query to them • Queries can be forwarded to the best neighbors in parallel or sequentially • A parallel approach yields better response time, but generates higher traffic and may waste resources

  21. Routing indices • The objective of a Routing Index (RI) is to allow a node to select the "best" neighbors to which to send a query • An RI is a data structure that, given a query, returns a list of neighbors ranked according to their goodness for the query • Each node has a local index for quickly finding local documents when a query is received. Nodes also keep a compound RI (CRI) containing • the number of documents along each path (i.e., through each neighbor) • the number of documents on each topic along that path

  22. Routing indices (cont.) • Thus, the number of results in a path can be estimated as: NumDocs × Π_i ( CRI(s_i) / NumDocs ), where NumDocs is the number of documents along the path and CRI(s_i) is the value of the cell at the column for topic s_i and the row for the neighbor • In the slide's example, the goodness of B is 6, of C is 0, and of D is 75 (see the sketch below) • Note that these numbers are just estimates, and they are subject to overcounts and/or undercounts • A limitation of CRIs is that they do not take into account the difference in cost due to the number of "hops" needed to reach a document
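A small sketch of the CRI estimator. The table values below are assumptions chosen so that the estimates reproduce the slide's example numbers (B: 6, C: 0, D: 75); they are not necessarily the paper's exact table.

```python
from math import prod

def cri_goodness(row, topics):
    """Estimate the number of query results reachable through a neighbor.

    row    -- CRI row for the neighbor: total documents on the path
              under "docs", plus one count per topic
    topics -- topics appearing in the query
    """
    num_docs = row["docs"]
    if num_docs == 0:
        return 0.0
    # Treat topics as independent: scale the total document count by
    # the fraction of documents matching each query topic.
    return num_docs * prod(row[t] / num_docs for t in topics)

# Hypothetical CRI rows reproducing the slide's example estimates.
cri = {
    "B": {"docs": 100,  "DB": 20,  "L": 30},
    "C": {"docs": 1000, "DB": 0,   "L": 50},
    "D": {"docs": 200,  "DB": 100, "L": 150},
}
for neighbor, row in cri.items():
    print(neighbor, cri_goodness(row, ["DB", "L"]))  # B 6.0, C 0.0, D 75.0
```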

  23. Using Routing Indices

  24. Using Routing Indices (cont.) • The storage space required by an RI in a node is modest, as we only store index information for each neighbor • Let t be the counter size in bytes, c the number of categories, N the number of nodes, and b the branching factor • A centralized index would require t × (c + 1) × N bytes • The total for the entire distributed system is t × (c + 1) × b × N bytes • Although the RIs require more storage space overall than a centralized index, the cost of that storage is shared among the network nodes
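Plugging in some assumed numbers makes the comparison concrete; the parameter values here are illustrative, not from the paper.

```python
t, c, b, N = 4, 10, 4, 100_000   # assumed: 4-byte counters, 10 topics,
                                 # branching factor 4, 100k nodes

centralized = t * (c + 1) * N            # one big index: 4,400,000 bytes
distributed_total = t * (c + 1) * b * N  # all RIs combined: 17,600,000 bytes
per_node = t * (c + 1) * b               # each node stores only 176 bytes

print(centralized, distributed_total, per_node)
```

The distributed total is b times larger, but each node carries only a tiny fixed share of it.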

  25. Creating Routing Indices

  26. Maintaining Routing Indices • Maintaining RIs is identical to the process used for creating them • For efficiency, we may delay exporting an update for a short time so that several updates can be batched, trading RI freshness for a reduced update cost • We can also choose not to send minor updates, at the cost of reduced RI accuracy

  27. Hop-count Routing Indices

  28. Hop-count Routing Indices (cont.) • The estimator of a hop-count RI needs a cost model to compute the goodness of a neighbor • We assume that document results are uniformly distributed across the network and that the network is a regular tree with fanout F • We define the goodness (goodness_hc) of neighbor i with respect to query Q for a hop-count RI as: goodness_hc(i, Q) = Σ over j = 1..h of N_j(Q) / F^(j-1), where N_j(Q) is the number of results at j hops through neighbor i and h is the horizon of the RI • If we assume F = 3, the goodness of X for a query about "DB" documents would be 13 + 10/3 = 16.33, and for Y it would be 0 + 31/3 = 10.33 (see the sketch below)
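A sketch of the hop-count estimator under the slide's assumptions (a regular tree with fanout F); the per-hop counts for X and Y are taken from the example above.

```python
def goodness_hc(per_hop_counts, fanout):
    """Hop-count RI goodness: discount results at hop j by fanout**(j-1).

    per_hop_counts -- estimated results at 1, 2, ..., h hops
    """
    return sum(count / fanout ** j
               for j, count in enumerate(per_hop_counts))

print(goodness_hc([13, 10], fanout=3))  # X: 13 + 10/3 = 16.33
print(goodness_hc([0, 31], fanout=3))   # Y: 0 + 31/3 = 10.33
```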

  29. Exponentially aggregated RI • Each entry of the ERI for node N contains a single value that aggregates the goodness of nodes reachable through a neighbor, with the contribution of nodes j hops away decayed by F^(j-1), up to the tree height th • Here th is the height and F the fanout of the assumed regular tree, goodness() is the compound RI estimator, N[j] is the summary of the local index of neighbor j of N, and T is the topic of interest of the entry • Problems?!

  30. Exponentially aggregated RI (cont.)

  31. Cycles in the P2P Network • There are three general approaches for dealing with cycles: • No-op solution: no changes are made to the algorithms • Cycle avoidance solution: nodes are not allowed to create an "update" connection to another node if that connection would create a cycle • Hard to enforce in the absence of global information • Cycle detection and recovery: cycles are detected some time after they form, and recovery actions are then taken to eliminate their effects

  32. Experimental Results • Modeling search mechanisms in a P2P system: • We consider three kinds of network topologies: • a tree, because it has no cycles • a tree with extra edges added at random (creating cycles) • a power-law graph, considered a good model for P2P systems, which lets us test the algorithms against a "realistic" topology • We model the location of document results using two distributions: uniform and an 80/20 biased distribution • 80/20 assigns 80% of the document results uniformly to 20% of the nodes • The paper focuses on the network and uses the number of messages generated by each algorithm as the measure of cost

  33. Experimental Results (cont.)

  34. Experimental Results (cont.) • In particular, CRI uses all nodes in the network, HRI uses nodes within a predefined horizon, and ERI uses nodes until the exponentially decayed value of an index entry reaches a minimum value • In the case of the No-RI approach, an 80/20 document distribution hurts performance, as the search mechanism must visit many nodes before it finds a content-loaded node

  35. Experimental Results (cont.) • RIs perform better in a power-law network than in a tree network (in query cost) • In a power-law network a few nodes have significantly higher connectivity than the rest • Power-law distributions generate network topologies where the average path length between two nodes is lower than in tree topologies

  36. Experimental Results (cont.) • There is a tradeoff between query and update costs for RIs • The update cost of CRI is much higher than that of HRI and ERI • ERI only propagates updates to a subset of the network

  37. Conclusions • Greater efficiency is achieved by placing Routing Indices in each node. Three possible RIs: compound RIs, hop-count RIs, and exponential RIs • In the experiments, ERIs and HRIs offer significant improvements over not using an RI, while keeping update costs low

  38. Efficient Content Location Using Interest-based Locality in Peer-to-Peer Systems

  39. Background • Each peer connects to neighbors at random, and searching is done by flooding • Keyword search is allowed • (Figure: example of searching for an mp3 file in the Gnutella network; the query is flooded across the network)

  40. Background • DHT (Chord): • Given a key, Chord maps the key to a node • Each node needs to maintain O(log N) routing state • Each query uses O(log N) messages • Key search means searching by exact name • (Figure: a Chord ring with about 50 nodes; the black lines point to adjacent nodes, while the red lines are "finger" pointers that allow a node to find a key in O(log N) time)
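A minimal sketch of the finger-table lookup that yields the O(log N) bound; identifier hashing, joins, and stabilization are omitted, and the structure is simplified from the Chord paper.

```python
def in_interval(x, a, b, m):
    """True if identifier x lies in the half-open ring interval (a, b]."""
    x, a, b = x % m, a % m, b % m
    return (a < x <= b) if a < b else (x > a or x <= b)

class ChordNode:
    def __init__(self, node_id, m):
        self.id = node_id
        self.m = m                 # identifier space size, 2**bits
        self.successor = self      # set during joins (not shown)
        self.fingers = []          # i-th entry: first node >= id + 2**i

    def find_successor(self, key):
        """Route toward the node responsible for `key` via fingers.

        Each step roughly halves the remaining ring distance, so a
        lookup takes O(log N) hops.
        """
        node = self
        while not in_interval(key, node.id, node.successor.id, node.m):
            node = node.closest_preceding_finger(key)
        return node.successor

    def closest_preceding_finger(self, key):
        # Scan fingers from farthest to nearest for the closest node
        # that still precedes the key on the ring.
        for f in reversed(self.fingers):
            if in_interval(f.id, self.id, key - 1, self.m):
                return f
        return self.successor
```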

  41. Interest-based Locality • Peers that have similar interests will share similar contents

  42. Architecture • Shortcuts are modular. • Shortcuts are performance enhancement hints.

  43. Creation of shortcuts • A peer uses the underlying topology (e.g., Gnutella) for its first few searches • One of the peers that returned results is selected at random and added to the shortcut list • Shortcuts are ordered by a metric, e.g., success rate or path latency • Subsequent queries go through the shortcut list first • If they fail, the peer falls back to a lookup through the underlying topology (see the sketch below)
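A sketch of the shortcut lookup path, assuming hypothetical helpers for flooding (`overlay.flood`) and per-peer lookup (`peer.lookup`); it illustrates the scheme rather than reproducing the paper's code.

```python
import random

class ShortcutPeer:
    MAX_SHORTCUTS = 100          # cap used in part of the evaluation

    def __init__(self, overlay):
        self.overlay = overlay   # underlying topology, e.g. Gnutella
        self.shortcuts = []      # kept ordered by a metric such as
                                 # success rate (re-ranking not shown)

    def search(self, query):
        # Try shortcuts first, in rank order.
        for peer in self.shortcuts:
            results = peer.lookup(query)
            if results:
                return results
        # Fall back to flooding through the underlying overlay.
        results, responders = self.overlay.flood(query)
        if responders:
            # Add one responding peer, chosen at random, as a shortcut.
            candidate = random.choice(responders)
            if candidate not in self.shortcuts:
                self.shortcuts.append(candidate)
                self.shortcuts = self.shortcuts[: self.MAX_SHORTCUTS]
        return results
```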

  44. Performance Evaluation • Performance metrics: • success rate • load characteristics (query packets processed per peer in the system) • query scope (the fraction of peers involved in each query) • minimum reply path length • additional state kept at each node

  45. Methodology – query workload • Create traffic traces from real application traffic: • Boeing firewall proxies • Microsoft firewall proxies • Passively collected web traffic between CMU and the Internet • Passively collected typical P2P traffic (Kazaa, Gnutella) • Use exact matching rather than keyword matching in the simulation • "song.mp3" and "my artist – song.mp3" are treated as different files

  46. Methodology – underlying peer topology • Based on the Gnutella connectivity graph from 2001, with 95% of node pairs within about 7 hops • The search TTL is set to 7 • For each kind of traffic (Boeing, Microsoft, etc.), run 8 simulations of 1 hour each

  47. Simulation Results – success rate

  48. Simulation Results – load and path length • Query load for Boeing and Microsoft traffic • Average path length of the traces

  49. Enhancement of Interest-based Locality – Increase Number of Shortcuts • Two variants: add all shortcuts at a time with no limit on the shortcut list size, or add k shortcuts at a time with only 100 shortcuts kept • 7–12% performance gain, with diminishing returns

  50. Enhancement of Interest-based Locality – Using Shortcuts' Shortcuts • Idea: also add a shortcut's shortcuts • Performance gain of 7% on average
