
Nearest Neighbor Queries Sung-hsun Su April 12, 2001




  1. Nearest Neighbor Queries • Sung-hsun Su • April 12, 2001 • [1] Nick Roussopoulos, Stephen Kelley, Frederic Vincent: Nearest Neighbor Queries. SIGMOD Conference 1995: 71-79. • [2] G. R. Hjaltason, H. Samet: Distance Browsing in Spatial Databases. ACM Transactions on Database Systems 24(2), June 1999: 265-318.

  2. Outline • Introduction to Nearest Neighbor Query • Spatial data structure – R-Tree • K-NN Algorithm in [1] • Incremental NN Algorithm in [2]

  3. The Need for NN Queries • Used when data have spatial properties • Examples: geographic information systems, astronomical data • Spatial predicates: - Find the k stars nearest to the Earth - Find the k nearest stars that are at least 10 LY away - Find the nearest gas station to the east - Find the furthest TCAT bus stop

  4. Difficulties in NN Queries • The whole table must be scanned if the data are unordered • Spatial data structures: - 1D: simply use a B+ tree or another sorted data structure - 2D or higher: is there one sorted structure that serves all queries? No.

  5. Data Structure – First Attempt • A more complex data structure is needed • First attempt, fixed grids: partition the space evenly into rectangles, cubes, … - Search the neighboring grid cells first - The distance to objects in a grid cell is bounded • Disadvantages?
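The fixed-grid idea above can be sketched in a few lines. This is a minimal illustration, assuming 2D points and square cells; the names `build_grid` and `grid_nearest` are hypothetical and do not come from the slides.

```python
import math
from collections import defaultdict

def build_grid(points, cell):
    """Bucket 2D points into square cells of side `cell` (assumed layout)."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(math.floor(x / cell), math.floor(y / cell))].append((x, y))
    return grid

def grid_nearest(grid, cell, q):
    """Search rings of cells around q, innermost ring first."""
    if not grid:
        return None
    cx, cy = math.floor(q[0] / cell), math.floor(q[1] / cell)
    # No occupied cell lies beyond this ring, so the loop always terminates.
    max_ring = max(max(abs(i - cx), abs(j - cy)) for i, j in grid)
    best, best_d = None, math.inf
    for ring in range(max_ring + 1):
        # Every point in ring r is at least (r - 1) * cell away from q,
        # so once the best distance is within that bound we can stop.
        if best_d <= (ring - 1) * cell:
            break
        for i in range(cx - ring, cx + ring + 1):
            for j in range(cy - ring, cy + ring + 1):
                if max(abs(i - cx), abs(j - cy)) != ring:
                    continue  # only cells on the boundary of this ring
                for p in grid.get((i, j), ()):
                    d = math.dist(p, q)
                    if d < best_d:
                        best, best_d = p, d
    return best
```

With skewed data most cells are empty and a few are crowded, which is the weakness the next slide lists.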

  6. Disadvantages of Fixed Grids • May still access many additional objects • Copes poorly with skewed data distributions • Grid cells too large: inefficient search • Grid cells too small: wasted storage • A hierarchical and scalable data structure is needed

  7. Spatial Tree Structures • Make it possible to handle clustered data • Some trees provide a balanced structure • Nodes are inserted and split dynamically • A well-constructed tree gives efficient search • Spatial trees: K-D Tree, R-Tree, LSD-Tree, Quad-Tree, etc.

  8. The Algorithms at a Glance • [1]: k-NN query - Applies a modified DFS to an R-Tree • [2]: Incremental NN query - A priority-first search on different kinds of spatial tree structures - Incremental - Supports distance browsing

  9. R-Tree Introduction • Balanced structure, like a B+ tree • Each node is described by an MBR (Minimal Bounding Rectangle) • A node's MBR minimally bounds all of its descendants • Non-leaf entry: (RECT, pointer to a child node) • Leaf entry: (RECT, pointer to an object) • The branching factor is chosen so that a node fits in a block or page
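The node layout described above might be modeled as follows. This is only a sketch of the (RECT, pointer) entries, assuming in-memory Python objects instead of disk pages; the `Rect` and `Node` classes are illustrative names, not part of either paper.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Point = Tuple[float, ...]

@dataclass
class Rect:
    s: Point  # lower corner (smallest coordinate in every dimension)
    t: Point  # upper corner (largest coordinate in every dimension)

    def union(self, other: "Rect") -> "Rect":
        """Smallest rectangle that bounds both rectangles."""
        return Rect(tuple(map(min, self.s, other.s)),
                    tuple(map(max, self.t, other.t)))

@dataclass
class Node:
    is_leaf: bool
    # Non-leaf: (Rect, child Node); leaf: (Rect, object reference).
    entries: List[Tuple[Rect, Any]]

    def mbr(self) -> Rect:
        """Minimal bounding rectangle of all entries in this node."""
        rect = self.entries[0][0]
        for r, _ in self.entries[1:]:
            rect = rect.union(r)
        return rect
```

A parent stores `child.mbr()` next to its pointer to `child`, so a search can decide whether to descend without reading the child.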

  10. Minimal Bounding Rectangle

  11. R-Tree Example • [Figure: an example R-Tree showing the root, the bounding rectangles A through K, and the underlying objects]

  12. Good and Bad R-Trees • Bad R-Tree: contains a lot of dead space • Good R-Tree: minimizes the overlap between MBRs • In a good R-Tree, each MBR approximates its objects more closely

  13. Algorithms in [1] • Finding K Nearest Neighbors • Two metrics introduced: • MINDIST (optimistic) • MINMAXDIST (pessimistic) • Pruning • DFS Search

  14. Space and Rectangle • Euclidean space of dimension n: E(n) • A rectangle is defined by R = (S, T), where S = (s1, s2, …, sn) and T = (t1, t2, …, tn) are the two endpoints of a diagonal such that tk ≥ sk for all k = 1 to n • This convention just simplifies computation

  15. MINDIST (Optimistic) • MINDIST(RECT,q): the shortest distance from RECT to the query point q • Every descendant (node or object) in RECT has a distance to q greater than or equal to MINDIST(RECT,q) • This gives a lower bound on the distance from q to any object in RECT • The squared distance is used as the metric

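  16. Calculation of MINDIST • $\mathrm{MINDIST}(P,R) = \sum_{i=1}^{n} |p_i - r_i|^2$, where $r_i = s_i$ if $p_i < s_i$; $r_i = t_i$ if $p_i > t_i$; $r_i = p_i$ otherwise ($p_i$ between $s_i$ and $t_i$) • [Figure: rectangle with corners S(s1, s2) and T(t1, t2), query point (p1, p2), and nearest point (r1, r2) = (t1, p2)]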
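The MINDIST rule can be written as a short function. This is a minimal sketch, assuming the rectangle is given by its lower corner `s` and upper corner `t` as plain tuples; the function name and argument order are illustrative.

```python
def mindist(p, s, t):
    """Squared distance from point p to the nearest point of rectangle (s, t)."""
    total = 0.0
    for pi, si, ti in zip(p, s, t):
        # Clamp p to the rectangle along this dimension.
        if pi < si:
            ri = si
        elif pi > ti:
            ri = ti
        else:
            ri = pi  # p lies between s and t here, so this term is zero
        total += (pi - ri) ** 2
    return total
```

A point inside the rectangle has MINDIST zero, which is why a node that contains the query point is always visited first.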

  17. MINMAXDIST (Pessimistic) • MBR property: every face (edge in 2D, rectangle in 3D, hyper-face in higher dimensions) of any MBR contains at least one point of some spatial object in the DB. • MINMAXDIST: compute the maximum distance from the query point to each face, then take the minimum over all faces. • This is an upper bound on the distance to the nearest object in the MBR • The MBR contains at least one object whose distance is less than or equal to MINMAXDIST

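  18. Illustration of MINMAXDIST • [Figure: query point (p1, p2) and a rectangle with corners (s1, s2), (t1, s2) and (t1, t2); MINDIST and MINMAXDIST are marked, with (t1, p2) the point of the rectangle nearest to the query point]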

  19. Calculation of MINMAXDIST • Can be done in O(n)
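One way to get the O(n) bound is sketched below. The slide's own formula did not survive in this transcript, so this follows the definition from the previous slides (for each dimension, take the nearer face and the farthest point on it) and should be read as an assumption about the intended computation. The helper name and the `s`/`t` corner arguments are illustrative.

```python
def minmaxdist(p, s, t):
    """Squared MINMAXDIST from point p to rectangle (s, t), in O(n) time."""
    near, far = [], []
    for pi, si, ti in zip(p, s, t):
        mid = (si + ti) / 2
        rm = si if pi <= mid else ti  # coordinate of the nearer face
        rM = si if pi >= mid else ti  # coordinate of the farther face
        near.append((pi - rm) ** 2)
        far.append((pi - rM) ** 2)
    total_far = sum(far)
    # For dimension k: nearer face along k, farthest corner in all others.
    # Reusing total_far keeps the whole computation linear in n.
    return min(total_far - f + n for n, f in zip(near, far))
```

Computing the sum once and swapping a single term per dimension avoids the O(n^2) cost of recomputing the sum for every face.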

  20. Pruning • If MINDIST(M) > MINMAXDIST(M') for some other MBR M': M can be pruned • If Distance(O) > MINMAXDIST(M') for some MBR M': object O can be discarded • If MINDIST(M) > Distance(O) for an object O already found: M can be pruned
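For the single-nearest-neighbor case, the three rules collapse into one bound. This is a hedged sketch, assuming each branch arrives as a `(name, mindist, minmaxdist)` tuple with precomputed squared distances; that tuple layout and the function name are hypothetical.

```python
def prune_branches(branches, best_dist):
    """Apply the pruning rules for 1-NN search.

    branches:  list of (name, mindist, minmaxdist) tuples (assumed layout)
    best_dist: squared distance of the best object found so far
    Returns the surviving branches and the tightened upper bound.
    """
    # Rule 2: an object farther than some MINMAXDIST cannot be the NN,
    # so the smallest MINMAXDIST replaces a worse current best.
    bound = min([mm for _, _, mm in branches] + [best_dist])
    # Rules 1 and 3: drop every MBR whose MINDIST exceeds the bound.
    kept = [b for b in branches if b[1] <= bound]
    return kept, bound
```

The MINMAXDIST-based rules guarantee only one object per MBR, so this shortcut applies to k = 1 and not to general k.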

  21. DFS Search on R-Tree • Traversal: DFS • Expanding a non-leaf node: order its children by one of the metrics (MINDIST or MINMAXDIST); prune before and after visiting each child. • Expanding a leaf node: compare each object to the nearest neighbor found so far, and replace it if the new object is closer. • Not a straightforward approach: it makes only local decisions • It may visit non-optimal objects before the NN is found. • Best-first search is simple and never visits non-optimal nodes.

  22. Extending to k-NN • Maintain the k nearest neighbors found so far. • Use the furthest of the current k candidates (MBR/object) for pruning • Blocking algorithm: no pipelining.
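The DFS search and its k-NN extension can be sketched together. This is an illustrative version, assuming point objects, MINDIST ordering, and a toy in-memory `Node` class in place of a real R-Tree; it uses only the k-th candidate distance for pruning.

```python
import heapq

def mindist(p, s, t):
    """Squared distance from p to rectangle (s, t)."""
    return sum((max(si - pi, 0, pi - ti)) ** 2 for pi, si, ti in zip(p, s, t))

class Node:
    """Toy tree node (assumed structure): a leaf holds points, else children."""
    def __init__(self, children=(), points=()):
        self.children, self.points = list(children), list(points)
        corners = self.points or [c.s for c in self.children] + [c.t for c in self.children]
        dims = range(len(corners[0]))
        self.s = tuple(min(c[i] for c in corners) for i in dims)
        self.t = tuple(max(c[i] for c in corners) for i in dims)

def _search(node, q, k, best):
    if node.points:  # leaf: compare objects with the k candidates so far
        for p in node.points:
            d = sum((a - b) ** 2 for a, b in zip(p, q))
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
        return
    # Non-leaf: visit children in MINDIST order (index breaks ties).
    order = sorted((mindist(q, c.s, c.t), i, c) for i, c in enumerate(node.children))
    for d, _, child in order:
        if len(best) == k and d > -best[0][0]:
            break  # this child and all later ones cannot improve the result
        _search(child, q, k, best)

def knn(root, q, k):
    """Return up to k (squared distance, point) pairs, nearest first."""
    best = []  # max-heap via negated distances
    _search(root, q, k, best)
    return sorted((-nd, p) for nd, p in best)
```

Nothing can be returned until the recursion finishes, because a later branch may still displace a current candidate; that is the blocking behavior noted above.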

  23. Experimental Results • Real-world data: TIGER, satellite data • Synthetic data • R-Tree construction (branching factor = 50): - Presort the data by Hilbert number - Apply a packing technique • Performance measure: number of pages accessed

  24. Experimental Results (Cont'd) • Page accesses grow linearly with k (the number of neighbors to find), but slowly. • They grow linearly with the height of the tree, i.e., with log(size of data set) • MINDIST outperforms MINMAXDIST - 20% faster in general, 30% on dense data sets - Reason: the R-Tree is packed very well, so MINDIST approaches the actual minimal distance.

  25. Problems with This Algorithm • Nodes/objects are not visited in order of distance → blocking • It may access non-optimal objects and then discard or prune them → not incremental • k must be known in advance; no distance browsing; difficult to combine with other predicates.

  26. Distance Browsing • Goal: browse objects in order of distance • Example: find the k nearest stars with distance > 10 LY • How would algorithm [1] handle this query? - Select the stars with distance > 10 LY first - Materialize that result - Then build another R-Tree over it • What if the selectivity is very high (most objects qualify)?

  27. Solutions for Distance Browsing • Very low selectivity (e.g., nearest city with 2M+ population): - Perform the selection first, build an R-Tree, then perform k-NN • Otherwise: - Use incremental NN and pipeline its results to the selection operator - The search can stop at any time
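The pipelining idea can be illustrated with Python generators. In this sketch the distance-ordered stream is a stand-in built from a heap over raw points, not the tree-based algorithm of [2]; the star names, the coordinates in the usage note, and both function names are made up for illustration.

```python
import heapq
from itertools import islice

def by_distance(points, q):
    """Lazily yield (squared distance, name) pairs in increasing distance order."""
    heap = [(sum((a - b) ** 2 for a, b in zip(p, q)), name)
            for name, p in points.items()]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)

def k_nearest_where(points, q, k, predicate):
    """Pipe the distance-ordered stream through a selection, stop after k hits."""
    stream = by_distance(points, q)
    selected = (item for item in stream if predicate(item))
    return list(islice(selected, k))
```

For example, `k_nearest_where(stars, (0, 0), 2, lambda item: item[0] >= 10)` pulls items from the stream only until two of them pass the filter, so k does not have to be fixed before the selection runs.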

  28. Overview of the Algorithm in [2] • A generic algorithm for different spatial data structures and different distance definitions. • Uses a priority queue to perform best-first search, keyed on the minimal (optimistic) distance. • Ensures that no object/node is visited before another, closer object/node.

  29. Search Algorithm • Always expand the nearest node or object in the priority queue. • Treat objects as special cases of nodes. • When expanding a node, compute each child's distance from the query point and add the children to the priority queue. • When expanding an object, just report it and continue.

  30. Requirements for the Tree and Distance • The tree and the distance function must obey the following rules: - A node/object may have more than one parent - Object pointers may be duplicated in the tree - The region covered by a node must be completely contained within the union of its parents' regions - Consistent distance: for every query point q and node/object n, at least one parent n' of n satisfies d(q,n') <= d(q,n) (this ensures that nodes are expanded in order)

  31. Remarks on the Tree and Distance • Applicable trees: Quad-tree, R-Tree, R+-Tree, LSD-Tree, K-D-B Tree, etc. • Applicable distance measures: Euclidean, Manhattan, Chessboard, etc. • Almost all spatial trees have no duplicate nodes; a node is fully contained in its parent. • Some trees allow duplicate objects, which must be detected and removed. • The R-Tree has no duplicates.

  32. Example • [Figure: example tree with the root and nodes A through F, with search radii R=6 and R=11 drawn around the query point]

  33. Order of Expansion • R=0: Expand Root, { A[1], B[7] } • R=1: Expand A, { D[1], B[7], C[10] } • R=1: Expand D, { Circle[1], B[7], C[10] } • R=1: Report Circle, { B[7], C[10] } • R=7: Expand B, { E[8], C[10], F[12] } • R=8: Expand E, { Rectangle[8], C[10], F[12] } • R=8: Report Rectangle, { C[10], F[12] } • R=10: Expand C, { F[12], Triangle[13] } • R=12: Expand F, { Triangle[13], Moon[14] } • R=13: Report Triangle, { Moon[14] } • R=14: Report Moon, { }

  34. Observations • All nodes intersecting the search region (circle) have been expanded, and their children have been put in the queue. • All nodes/objects completely inside the search circle have already been taken off the queue. • Nodes/objects completely outside the search circle are not examined. • This minimizes the number of objects visited.

  35. Pseudocode
      Queue = NewPriorityQueue();
      EnQueue(Queue, Root, 0);
      While (NotEmpty(Queue)) {
        Element = Dequeue(Queue);
        If IsObject(Element) {
          /* Remove duplicates */
          Report(Element);
        } Else If IsLeaf(Element) {
          For each child object o:
            // This comparison is not needed for an R-Tree
            If Dist(o,Q) >= Dist(Element,Q): EnQueue(Queue, o, Dist(o,Q));
        } Else {  /* IsNonLeaf(Element) */
          For each child node c:
            EnQueue(Queue, c, Dist(c,Q));
        }
      }
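A runnable sketch of the same loop, written as a Python generator. It assumes point objects, squared Euclidean distance, and a toy in-memory `Node` class in place of a real spatial index; duplicate removal is omitted because this toy tree, like an R-Tree, stores each object once.

```python
import heapq
from itertools import count

def mindist(p, s, t):
    """Squared distance from p to rectangle (s, t)."""
    return sum((max(si - pi, 0, pi - ti)) ** 2 for pi, si, ti in zip(p, s, t))

class Node:
    """Toy tree node (assumed structure): a leaf holds points, else children."""
    def __init__(self, children=(), points=()):
        self.children, self.points = list(children), list(points)
        corners = self.points or [c.s for c in self.children] + [c.t for c in self.children]
        dims = range(len(corners[0]))
        self.s = tuple(min(c[i] for c in corners) for i in dims)
        self.t = tuple(max(c[i] for c in corners) for i in dims)

def inc_nearest(root, q):
    """Yield (squared distance, point) pairs in increasing distance order."""
    tie = count()  # tie-breaker so the heap never compares nodes with points
    queue = [(0, next(tie), root)]
    while queue:
        dist, _, element = heapq.heappop(queue)
        if isinstance(element, tuple):   # an object: report it
            yield dist, element
        elif element.points:             # a leaf: enqueue its objects
            for p in element.points:
                d = sum((a - b) ** 2 for a, b in zip(p, q))
                heapq.heappush(queue, (d, next(tie), p))
        else:                            # a non-leaf: enqueue its child nodes
            for child in element.children:
                heapq.heappush(queue, (mindist(q, child.s, child.t), next(tie), child))
```

Because this is a generator, a caller can take neighbors one at a time, for instance with `itertools.islice(inc_nearest(root, q), k)`, and stop whenever enough results have arrived.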

  36. Variants • k furthest: - Use MaxDist - Replace <= by >= • Distance selection, e.g., select all stars between 15 LY and 20 LY: - Prune nodes that cannot qualify • Pseudocode for the search algorithm combining these two extensions: Figure 5
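Both variants need a pessimistic counterpart to MINDIST. The sketch below shows a `maxdist` helper and a distance-range pruning test, assuming squared Euclidean distance and rectangles given by corner tuples `s` and `t`; `may_qualify` is a hypothetical name, and this is not the combined algorithm the slide refers to.

```python
def mindist(p, s, t):
    """Squared distance from p to the nearest point of rectangle (s, t)."""
    return sum((max(si - pi, 0, pi - ti)) ** 2 for pi, si, ti in zip(p, s, t))

def maxdist(p, s, t):
    """Squared distance from p to the farthest corner of rectangle (s, t)."""
    return sum(max((pi - si) ** 2, (pi - ti) ** 2) for pi, si, ti in zip(p, s, t))

def may_qualify(p, s, t, dmin, dmax):
    """True if the rectangle could hold an object with dmin <= distance <= dmax.

    A node is pruned when its whole [MINDIST, MAXDIST] interval lies
    outside the requested distance range.
    """
    return mindist(p, s, t) <= dmax ** 2 and maxdist(p, s, t) >= dmin ** 2
```

For a furthest-first search the queue would be keyed on `maxdist` and the largest key dequeued first.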

  37. Implementation of the Priority Queue • Enough memory: heap (min-heap/max-heap) • Not enough memory: use a B+ tree (sorted, so the nodes with smaller distances can be kept in memory) • Hybrid scheme: divide the queue into 3 tiers - Tier 1 uses an in-memory heap. - Tier 2 is divided into several sections; each section is an unordered in-memory bucket, and the first bucket is moved to Tier 1 when Tier 1 is empty. - Tier 3 is stored on disk and moved into memory when Tiers 1 and 2 are empty.

  38. Theoretical Analysis • Assumption: uniform distribution, 2D • Use the circular search region for analysis • k ∝ the area of the search region • Number of leaf nodes in the priority queue ∝ circumference of the search region = […] • Number of leaf nodes accessed = […] • Number of nodes accessed = […] • Non-uniform 2D case: very close to this result
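As a hedged reconstruction of the geometric step only (the slide's own expressions are not preserved here), assume objects are uniformly distributed with density $\rho$, a symbol introduced for this sketch, and let $r$ be the radius of the search circle once the k-th neighbor has been found:

```latex
k \approx \rho \pi r^2
\;\Longrightarrow\;
r \approx \sqrt{\frac{k}{\pi \rho}},
\qquad
2 \pi r \approx 2 \sqrt{\frac{\pi k}{\rho}} \;\propto\; \sqrt{k}
```

Under that assumption the area of the search region grows linearly in k while its circumference grows as $\sqrt{k}$, so any quantity proportional to the circumference grows as $\sqrt{k}$.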

  39. Experimental Results • TIGER/Line files (17,421 to 200,482 segments) • Synthetic data (infinite random segments) • Index construction: R*-Tree • Distance browsing: Inc-NN is much faster than k-NN; the ratio increases at […] • Exact k-NN query: Inc-NN is 10 to 20% faster • Scalability: close to the theoretical result • Very large k: k-NN cannot hold all k neighbors in memory

  40. Conclusion • Inc-NN outperforms other k-NN algorithms. • Inc-NN enables distance browsing. • Number of node accesses (2D) is […] • Future work: - Compare this algorithm on different spatial structures - Investigate its behavior on very large data sets where the priority queue cannot fit into memory.
