
Range and kNN Searching in P2P




Presentation Transcript


  1. Range and kNN Searching in P2P Manesh Subhash, Ni Yuan, Sun Chong

  2. Outline • Range query searching in P2P • one-dimensional range queries • multi-dimensional range queries • comparison of range query searching in P2P • kNN searching in P2P • scalable nearest neighbor searching • PierSearch • Conclusion

  3. Motivation • Most P2P systems support only simple lookup queries • DHT-based approaches such as Chord and CAN are not suitable for range queries • More complicated queries, such as range queries and kNN search, are needed

  4. P-Tree [APJ+04] • The B+-tree is widely used for efficiently evaluating range queries in centralized databases • a distributed B+-tree is not directly applicable in a P2P environment • fully independent B+-tree • semi-independent B+-tree, i.e. the P-tree

  5. Fully independent B+-tree [figure: eight peers P1: 4, P2: 8, P3: 12, P4: 20, P5: 24, P6: 25, P7: 26, P8: 35, each maintaining its own complete B+-tree over all keys 4, 8, 12, 20, 24, 25, 26, 35]

  6. Semi-independent B+-tree [figure: the same peers P1: 4 through P8: 35; each peer stores only part of a B+-tree rooted at its own value (the P-tree), rather than a full copy]

  7. Coverage & Separation [figure: node ranges at one tree level, illustrating an "overlap" that violates separation and an "anti-coverage" gap that violates coverage]

  8. Properties of P-tree • Each peer stores O(log_d N) nodes • Total storage per peer is O(d · log_d N) • Requires no global coordination among peers • The search cost for a range query that returns m results is O(m + log_d N)

  9. Search Algorithm [figure: evaluating the range query 21 < value < 29 starting at peer P1; the search descends P1's tree levels l0, l1 to the first qualifying peer, then follows successor pointers through P5: 24, P6: 25, P7: 26; a leaf-level sketch follows below]
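
As a rough illustration (not the paper's exact algorithm), the sketch below models only the leaf level of the distributed index: each peer holds one value and a successor pointer, and the range query walks the successors once the first qualifying peer is reached. The Peer class and the starting point are assumptions for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Peer:
    value: int                           # the data value this peer stores
    successor: Optional["Peer"] = None   # next peer in key order

def range_search(start: Peer, low: int, high: int) -> List[int]:
    """Collect all values in (low, high) by walking successor pointers.
    A real P-tree first uses its B+-tree levels to reach the first
    qualifying peer in O(log_d N) hops; the walk then costs O(m) for m
    results, matching the O(m + log_d N) bound on slide 8."""
    node, out = start, []
    while node is not None and node.value < high:
        if node.value > low:
            out.append(node.value)
        node = node.successor
    return out

# Example: peers holding 4, 8, 12, 20, 24, 25, 26, 35 (slides 5-6);
# the query 21 < value < 29 from slide 9 returns [24, 25, 26].
peers = [Peer(v) for v in [4, 8, 12, 20, 24, 25, 26, 35]]
for a, b in zip(peers, peers[1:]):
    a.successor = b
print(range_search(peers[0], 21, 29))
```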

  10. Multi-dimensional range queries • Routing in a one-dimensional routing space • ZNet: Z-ordering + skip graph [STZ04] • Hilbert space-filling curve + Chord [SP03] • SCRAP [GYG04] • Routing in a multi-dimensional routing space • MURK [GYG04]

  11. Desiderata • Locality: data elements nearby in the data space should be stored on the same node or on nearby nodes • Load balance: the amount of data stored by each node should be roughly the same • Efficient routing: the number of messages exchanged between nodes to route a query should be small

  12. Hilbert SFC + Chord • SFC: maps a d-dimensional cube onto a line • the line passes once through each point in the volume of the cube [figure: an order-2 Hilbert curve over a 2-D grid; cells carry 2-bit coordinate labels (00..11 per axis) and 4-bit curve indices 0000..1111]
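
The 2-D case of this mapping can be computed with the standard iterative Hilbert conversion sketched below; this is a generic textbook algorithm, not code from the paper, and the order-2 example reproduces the 16-cell figure above.

```python
def hilbert_d2xy(order: int, d: int):
    """Map a 1-D Hilbert index d to (x, y) on a 2^order x 2^order grid
    (standard iterative conversion; 2-D case only)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Order-2 curve: indices 0..15 visit each of the 16 cells exactly once.
print([hilbert_d2xy(2, d) for d in range(16)])
```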

  13. Hilbert SFC + Chord • map the 1-dimensional index space onto the Chord overlay network topology [figure: a ring of nodes 0, 4, 8, 11, 14; data elements with keys 5, 6, 7, 8 are stored at their successor, node 8]
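
A minimal sketch of the assignment step: each 1-D index key is stored at its successor on the Chord ring. The node set below is taken from the figure; the linear scan here stands in for Chord's O(log N) finger-table lookup.

```python
def successor(nodes, key, m):
    """Return the Chord node responsible for `key`: the first node whose
    id is >= key (mod 2^m), wrapping around the ring. Minimal sketch;
    real Chord finds this node in O(log N) hops via finger tables."""
    key %= 1 << m
    candidates = sorted(nodes)
    for n in candidates:
        if n >= key:
            return n
    return candidates[0]   # wrap around to the smallest id

# The ring from the figure: nodes 0, 4, 8, 11, 14 with 4-bit ids.
nodes = [0, 4, 8, 11, 14]
print([successor(nodes, k, 4) for k in (5, 6, 7, 8)])  # all map to node 8
```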

  14. Query Processing • translate the keyword query into the relevant clusters of the SFC-based index space • query the appropriate nodes in the overlay network for the data elements [figure: the query (1*, 0*) picks out the matching cells of the 2-D grid, whose SFC indices form clusters assigned to ring nodes 0, 4, 8, 11, 14]

  15. Query Optimization [figure: the query (010, *) maps to the SFC index clusters (000100), (000111, 001000), (001011), (011000, 011001), (011101, 011110)]

  16. Query Optimization (cont.) [figure: the same clusters for (010, *) arranged as leaves of a prefix tree over the SFC index space]

  17. Query Optimization (cont.) [figure: pruning nodes from the tree; subtrees of the prefix tree that cannot contain matches for (010, *) are cut off]

  18. SCRAP [GYG04] • Use z-order or a Hilbert space-filling curve to map multi-dimensional data down to a single dimension (a bit-interleaving sketch follows below) • Range-partition the one-dimensional data across the available nodes • Use a skip graph to route the queries
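
For illustration, a minimal bit-interleaving (z-order/Morton) mapping is sketched below; the function name and the example point are assumptions, and SCRAP may equally use a Hilbert curve for this step.

```python
def z_order(coords, bits):
    """Interleave the bits of each coordinate into a single Morton key,
    mapping a multi-dimensional point to one dimension."""
    key = 0
    dims = len(coords)
    for b in range(bits):
        for i, c in enumerate(coords):
            key |= ((c >> b) & 1) << (b * dims + i)
    return key

# A 2-D point mapped to a 1-D key; keys are then range-partitioned
# across the nodes and routed with a skip graph.
print(z_order((3, 5), bits=3))   # interleaves the bits of 011 and 101
```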

  19. MURK: Multi-dimensional Rectangulation with KD-tree • Basic concept: partition the high-dimensional data space into "rectangles", each managed by one node. • Partitioning is based on the KD-tree: the space is split cyclically across the dimensions, and each leaf of the KD-tree corresponds to one rectangle.

  20. Partitioning • When a node joins, it splits the space along one dimension into two parts of equal load, keeping the load balanced (sketched below). • Each node manages the data in one rectangle, thus preserving data locality.
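
A minimal sketch of the equal-load split, assuming a region's data is just a list of points; the real system splits the region itself and hands one half to the joining node.

```python
def split_on_join(points, depth):
    """Split a region's data along one dimension (cycling by tree depth,
    as in a KD-tree) into two halves of equal load; the joining node
    takes one half. A minimal sketch."""
    dim = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]  # old node's half, new node's half

# Four 2-D points split at depth 0 (x-axis): two points per node.
pts = [(1, 9), (4, 2), (7, 5), (3, 3)]
print(split_on_join(pts, depth=0))
```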

  21. Comparison with CAN • The KD-tree-based partition is similar to that of CAN: both hash data into a multi-dimensional space and try to keep the load balanced • The major difference is that in CAN a new node splits the existing node's data space equally, rather than splitting the load equally.

  22. Routing in MURK • Each node keeps a link to every neighboring node ("grid" links). • Routing is greedy over these grid links: forward the query to the neighbor that minimizes the Manhattan distance to the target (see the sketch below).
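
A minimal sketch of the greedy step, assuming each neighbor is summarized by its rectangle's center point; MURK proper measures distance to the neighboring rectangles themselves.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def greedy_next_hop(neighbor_centers, target):
    """Forward the query to the grid neighbor whose region center is
    Manhattan-closest to the target point (a hedged sketch)."""
    return min(neighbor_centers, key=lambda c: manhattan(c, target))

# Three neighboring regions (by center) and a query point:
print(greedy_next_hop([(2, 2), (6, 2), (2, 6)], target=(7, 3)))  # (6, 2)
```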

  23. Optimization for the routing • "Grid" links alone are not efficient for routing. • Maintain skip pointers at each node to speed up routing. Two methods to choose the skip pointers: • Random: choose a node at random from the node set. • Space-filling skip graph: place skip pointers at exponentially increasing distances.

  24. Discussion • Non-uniformity of the routing neighbors, which results from per-node load balancing. • A dynamic data distribution can unbalance the data load across nodes.

  25. Performance

  26. Performance

  27. Conclusion • For locality, MURK far outperforms SCRAP. For routing cost, SCRAP is efficient enough; in MURK, skip pointers such as space-filling skip graphs are effective. • SCRAP, using space-filling curves with range partitioning, is efficient in low dimensions; MURK with space-filling skip graphs performs much better, especially in high dimensions.

  28. pSearch • Motivation • Numerous documents are spread over the Internet. • How can we efficiently find the most closely related documents without returning too many of little interest? • Problem: semantically, documents are randomly distributed. • Exhaustive search brings overhead. • No deterministic guarantees.

  29. P2P & IR techniques • Unstructured p2p search • Centralized index with the problem bottleneck. • Flooding-based techniques result in too much overhead. • Heuristic-based algorithm may miss some important documents. • Structured p2p search • DHT based can and chord are suitable for keyword matching. • Traditional IR techniques • Advanced IR ranking algorithm could be adopted into p2p search. • Two IR techniques • Vector space model (VSM). • Latent semantic indexing (LSI).

  30. pSearch • An IR system built on P2P networks. • As efficient and scalable as a DHT. • As accurate as advanced IR algorithms. • Maps the semantic space to nodes and conducts nearest neighbor search. • uses VSM and LSI to generate the semantic space • uses CAN to organize nodes.

  31. VSM &LSI • VSM • Document and queries are expressed as term vectors. • Weight of a term: Term frequency* inverse document frequency. • Rank based on the similarity of the document and query: cos (X,Y). X and Y are two term vectors. • LSI • Based on singular value decomposition, transform term vector from high-dimension to low-dimension (L) semantic vector. • Statistically based conception avoids synonymous and noise in document.

  32. pSearch system [figure: system architecture; a document (DOC) is mapped to a semantic vector and stored at the responsible node, and a QUERY is routed to the region of the semantic space where matching documents reside]

  33. Advantages of pSearch • Exhaustive search within a bounded area, which can be ideally accurate. • Communication overhead is limited to transferring the query and references to the top documents, independent of corpus size. • A good approximation of the global statistics is sufficient for pSearch.

  34. Challenges • Dimensionality mismatch between CAN and LSI. • Uneven distribution of indices. • Large search region.

  35. Dimensionality mismatch • There are not enough nodes (N) in the CAN to partition all the dimensions (L) of the LSI semantic space. • N nodes in a CAN can partition about log(N) low dimensions (the effective dimensionality), leaving the others unpartitioned.

  36. Rolling index • Motivation • A small subset of the dimensions contributes most of the similarity • Low dimensions are of high importance. • Partition more dimensions of the semantic space by rotating the semantic vectors. • Given a semantic vector V = (v0, v1, …, vl), each rotation shifts the vector by m dimensions; rotation space i uses the i-th rotated vector Vi = (v_{i·m}, …, v_l, v0, …, v_{i·m−1}), with m ≈ 2.3·ln(N). • Use the rotated vectors to route the query and guide the search (see the sketch below).
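
A minimal sketch of the rotation itself; the vector values below are made up, and m is a parameter the paper sets to roughly 2.3 ln(N) for N nodes.

```python
def rotated_vector(v, i, m):
    """The i-th rotation of semantic vector v, shifted by i*m dimensions:
    V_i = (v_{i*m}, ..., v_l, v_0, ..., v_{i*m-1}). Each rotated vector
    is used to route the query in its own rotation space."""
    k = (i * m) % len(v)
    return v[k:] + v[:k]

v = [0.9, 0.5, 0.2, 0.1, 0.05, 0.01]
print(rotated_vector(v, i=1, m=2))  # [0.2, 0.1, 0.05, 0.01, 0.9, 0.5]
```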

  37. Rolling index • Uses more storage (p times) to keep the search in local space. • Selective rotation is expected to process the high-importance dimensions efficiently.

  38. Balancing index distribution • Content-aware node bootstrapping. • Randomly select a document the node publishes. • Route toward the node responsible for it. • Transfer load from that node. • More indices are thus spread over more nodes; even though placement is random, the load balances with a large corpus.

  39. Reducing the search space • Curse of dimensionality • High-dimensional data is sparsely populated • In high dimensions, the distance between nearest neighbors becomes large. • Exploiting data locality, use the indices stored on nodes and recently processed queries to guide new searches.

  40. Content-directed search

  41. Performance

  42. Conclusion • pSearch is a P2P IR system that organizes content around semantics and achieves good accuracy with respect to system size, corpus size, and the number of returned documents. • Rolling index resolves the dimensionality mismatch while limiting the space overhead and the number of visited nodes. • Content-aware node bootstrapping balances node load to achieve index and query locality. • Content-directed search reduces the number of nodes searched.

  43. kNN searching in P2P Networks Manesh Subhash, Ni Yuan, Sun Chong

  44. Outline • Introduction to searching in P2P • Nearest neighbor queries • Presentation of the ideas in the papers 1. “A Scalable Nearest Neighbor Search in P2P Systems” 2. “Enhancing P2P File-Sharing with an Internet-Scale Query Processor”

  45. Introduction to searching in P2P • Exact-match queries • Single-key retrieval • Linear hashing • CAN, Chord, Pastry, Tapestry • Similarity-based queries • Metric-space based • What do we search for? • Rare items, popular items, or both.

  46. Nearest neighbor queries • The notion of a metric space: how similar two objects are, for a given set of objects • Extensible to exact, range, and nearest neighbor queries • Computationally expensive • The distance function satisfies non-negativity, reflexivity, symmetry, and the triangle inequality.

  47. Nearest neighbor queries (cont.) • A metric space is a pair (D, d) • D: the domain of objects • d: the distance function. • Similarity queries • Range: for F ⊆ D, a range query retrieves all objects in F whose distance to the query object q ∈ D is at most ρ • Nearest neighbor: returns the object of F closest to q; kNN returns the k closest objects (k ≤ |F|). A brute-force baseline is sketched below.
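
For concreteness, a brute-force kNN under an example metric is sketched below; GHT* exists precisely to avoid this full scan, pruning candidates with the triangle inequality.

```python
import heapq

def euclidean(a, b):
    """An example metric: non-negative, reflexive, symmetric, and it
    satisfies the triangle inequality."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn(objects, q, k, d=euclidean):
    """Brute-force kNN: return the k objects of F closest to q under
    metric d. A baseline sketch, not the GHT* algorithm."""
    return heapq.nsmallest(k, objects, key=lambda o: d(o, q))

F = [(0, 0), (1, 2), (3, 3), (5, 1)]
print(knn(F, q=(2, 2), k=2))   # the two nearest objects to (2, 2)
```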

  48. Scalable NN search • Uses the GHT* structure. • A distributed metric index • Supports range and k-NN queries • The GHT* architecture is composed of nodes: peers that can insert, store, and retrieve objects using similarity queries. • Assumptions: message passing, unique network identifiers, local buckets to store data, and each object is stored in exactly one bucket.

  49. Example of the GHT* network [figure: Peer1 and Peer2, each holding a tree of inner nodes and buckets; leaf pointers carry either a Network Node ID (NNID), pointing to another peer, or a Bucket ID (BID), pointing to a local bucket; further links go to other peers]

  50. Scalable NN search (3) • Address Search Tree (AST) • a binary search tree • inner nodes hold routing information: two pivots and pointers to the left and right subtrees • leaf nodes are pointers to data • local data is stored in buckets and can be accessed using the BID • non-local data is identified by an NNID (every AST leaf node holds one of these two pointer types); the descent is sketched below.
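
A minimal sketch of the AST descent, assuming a simple node record of my own devising: at each inner node the query follows the side of the closer pivot (generalized-hyperplane partitioning), and the leaf yields either a local bucket (BID) or a remote peer (NNID).

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ASTNode:
    p1: Any = None                      # pivots (inner nodes only)
    p2: Any = None
    left: Optional["ASTNode"] = None
    right: Optional["ASTNode"] = None
    pointer: Any = None                 # leaf: BID (local) or NNID (remote)

def route(node: ASTNode, obj, d):
    """Descend an AST toward the bucket responsible for obj; a sketch of
    the idea behind GHT* routing, not the paper's full algorithm."""
    while node.pointer is None:
        node = node.left if d(obj, node.p1) <= d(obj, node.p2) else node.right
    return node.pointer    # deliver locally (BID) or forward to peer (NNID)

# Tiny example with a 1-D metric and two leaves (labels are assumptions):
d = lambda a, b: abs(a - b)
root = ASTNode(p1=10, p2=50,
               left=ASTNode(pointer="BID:1"),
               right=ASTNode(pointer="NNID:peer2"))
print(route(root, 42, d))   # "NNID:peer2" -> forwarded to peer 2
```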
