
NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms



  1. NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms Liang Jin (UC Irvine) Nick Koudas (AT&T) Chen Li (UC Irvine)

  2. Outline • Motivation: NN search • NNH: Proposed histogram structure • Main idea • Utilizing NNH in a search (KNN, join) • Constructing NNH • Incremental maintenance • Experiments

  3. NN (nearest-neighbor) search • KNN: find the k nearest neighbors of a query object q. • NN-join: for each object in the first dataset D1, find its k nearest neighbors in the second dataset D2.

  4. Example: image search • Images are represented as features (color histogram, texture moments, etc.) • Similarity search uses these features • “Find the 10 most similar images to the query image”

  5. Other Applications • Web-page search • “Find the 100 most similar pages to a given page” • A page is represented as a word-frequency vector • Similarity: vector distance • GIS: “find the 5 closest cities to Irvine” • CAD, information retrieval, molecular biology, data cleansing, … • Challenges: efficiency, scalability

  6. NN Algorithms • Distance measurement: • When objects are points, distance is well defined • Usually Euclidean • Other distances are possible • For arbitrarily-shaped objects, we assume a distance function between them is given • Most algorithms assume a high-dimensional tree structure exists for the datasets.

  7. Example: R-Trees Take 2-d space as an example.

  8. Minimal Bounding Rectangle • An MBR is an n-dimensional rectangle that bounds its corresponding objects. • MBR face property: every face of any MBR contains at least one point of some object.

  9. Search process (1-NN as an example) • Most algorithms traverse the structure (e.g., an R-tree) top down, following a branch-and-bound approach • Keep a priority queue of nodes (MBRs) to be visited, sorted by the “minimum distance” between q and each node • Improvements: • Use MINDIST and MINMAXDIST • Reduce the queue size • Avoid unnecessary disk IOs to access MBRs. A sketch of this best-first traversal appears below.
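
A minimal C++ sketch of this best-first traversal, returning the 1-NN distance for brevity. The Node layout and the names (mindist, nnSearch) are illustrative assumptions, not the authors' implementation:

```cpp
// Best-first 1-NN traversal over an R-tree-like index (sketch).
#include <algorithm>
#include <cmath>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Node {
    bool isLeaf;
    std::vector<std::vector<double>> points;  // objects, if a leaf
    std::vector<Node*> children;              // child nodes, otherwise
    std::vector<double> low, high;            // MBR corners
};

// MINDIST: minimum possible distance from query q to any point in the MBR.
double mindist(const std::vector<double>& q, const Node& n) {
    double d = 0;
    for (size_t i = 0; i < q.size(); ++i) {
        double diff = std::max({n.low[i] - q[i], q[i] - n.high[i], 0.0});
        d += diff * diff;
    }
    return std::sqrt(d);
}

double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0;
    for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(d);
}

// Returns the distance from q to its nearest neighbor.
double nnSearch(Node* root, const std::vector<double>& q) {
    using Entry = std::pair<double, Node*>;  // (MINDIST, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    pq.push({mindist(q, *root), root});
    double best = std::numeric_limits<double>::max();
    while (!pq.empty()) {
        auto [d, node] = pq.top();
        pq.pop();
        if (d >= best) break;  // no remaining node can contain a closer object
        if (node->isLeaf)
            for (auto& p : node->points) best = std::min(best, euclidean(q, p));
        else
            for (Node* c : node->children) pq.push({mindist(q, *c), c});
    }
    return best;
}
```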

  10. MINDIST & MINMAXDIST

  11. Pruning in NN search (q: query point; o: an object; mbr1, mbr2: MBRs) • 1. Discard mbr1 if MINDIST(q, mbr1) > MINMAXDIST(q, mbr2) • 2. Discard object o if dist(q, o) > MINMAXDIST(q, mbr2) • 3. Discard mbr1 if MINDIST(q, mbr1) > dist(q, o)
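
A hedged C++ sketch of MINMAXDIST following the standard definition of Roussopoulos et al.; by the MBR face property it upper-bounds the distance from q to the closest object inside the MBR (MINDIST appears in the previous sketch; low/high are assumed MBR corner vectors):

```cpp
// MINMAXDIST (sketch): an upper bound on the distance from query q to the
// nearest object inside the MBR, guaranteed by the MBR face property.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

double minmaxdist(const std::vector<double>& q,
                  const std::vector<double>& low,
                  const std::vector<double>& high) {
    size_t n = q.size();
    std::vector<double> nearFace(n), farFace(n);
    // S = sum over all dimensions of |q_i - rM_i|^2, where rM_i is the MBR
    // face coordinate on the far side of the midpoint from q on dimension i.
    double S = 0;
    for (size_t i = 0; i < n; ++i) {
        double mid = (low[i] + high[i]) / 2;
        nearFace[i] = (q[i] <= mid) ? low[i] : high[i];  // rm_i: closer face
        farFace[i]  = (q[i] >= mid) ? low[i] : high[i];  // rM_i: farther face
        S += (q[i] - farFace[i]) * (q[i] - farFace[i]);
    }
    // For each dimension k, swap in the closer face on that dimension only,
    // and take the minimum over all k.
    double best = std::numeric_limits<double>::max();
    for (size_t k = 0; k < n; ++k) {
        double cand = S - (q[k] - farFace[k]) * (q[k] - farFace[k])
                        + (q[k] - nearFace[k]) * (q[k] - nearFace[k]);
        best = std::min(best, cand);
    }
    return std::sqrt(best);
}
```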

  12. Problem • The queue size may be large: • Example: 60,000 32-d image vectors, 50 NNs • Max queue size: 15K entries • Avg queue size: about half (7.5K entries) • If the queue can’t fit in memory, more disk IOs! • The problem is worse for k-NN joins • E.g., a 1500 × 1500 join: • Max queue size: 1.7M entries: >= 1GB of memory! • 750 seconds to run • Couldn’t scale up to 2000 objects! • Disk thrashing

  13. Our Solution: Nearest-Neighbor Histogram (NNH) • Main idea • Utilizing NNH in a search (KNN, join) • Constructing NNH • Incremental maintenance

  14. NNH: Nearest-Neighbor Histograms • m: the number of pivots p1, p2, …, pm • Each pivot stores the distances of its nearest neighbors: r1, r2, …

  15. Main idea • Keep a histogram of NN distances of a pre-selected collection of objects (pivots). • Pivots are not part of the database • They give a “big picture” of the objects’ locations • Use the histogram to estimate the NN distance of a given query object. • Use these estimated NN distances to do more pruning in an NN search

  16. Structure • Nearest-neighbor vector of a pivot p: (r1, r2, …, rT), where each ri is the distance of p’s i-th NN and T is the length of each vector • Nearest-neighbor histogram: a collection of m pivots with their NN vectors

  17. Estimate NN distance for a query object • NNH does not give exact NN information for an object • But we can estimate an upper bound qest for the k-NN distance of q via the triangle inequality: each of pivot pi’s k nearest neighbors lies within dist(q, pi) + ri,k of q, where ri,k is the k-th entry of pi’s NN vector, so q’s k-NN distance is at most dist(q, pi) + ri,k

  18. Estimate NN for a query object (cont’d) • Apply the triangle inequality to all pivots • Upper-bound estimate of the k-NN distance of q: qest = min over i of ( dist(q, pi) + ri,k ) • Complexity: O(m). A sketch of the structure and this estimate follows.
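
A small C++ sketch of the NNH structure and this O(m) estimate; the names (Pivot, NNH, estimate) are illustrative, not from the paper:

```cpp
// NNH: m pivots, each carrying a vector of its T nearest-neighbor distances.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Pivot {
    std::vector<double> coords;  // pivot location
    std::vector<double> r;       // r[i] = distance to the (i+1)-th NN, i < T
};

struct NNH {
    std::vector<Pivot> pivots;

    // Upper bound on the k-NN distance of query q:
    //   qest = min_i ( dist(q, p_i) + r_i[k] )
    // Valid by the triangle inequality: each of p_i's k NNs lies within
    // dist(q, p_i) + r_i[k] of q. Cost: O(m) distance computations.
    double estimate(const std::vector<double>& q, size_t k) const {
        double best = std::numeric_limits<double>::max();
        for (const Pivot& p : pivots) {
            double d = 0;
            for (size_t j = 0; j < q.size(); ++j)
                d += (q[j] - p.coords[j]) * (q[j] - p.coords[j]);
            best = std::min(best, std::sqrt(d) + p.r[k - 1]);
        }
        return best;
    }
};
```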

  19. Utilizing estimates in NN search • More pruning: prune an MBR if MINDIST(q, mbr) > qest, since no object inside it can then be among q’s k nearest neighbors
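
Wired into the earlier nnSearch sketch (same hypothetical Node, mindist, and NNH names), the extra check is one comparison per child; qEst should be computed once per query, since each estimate costs O(m):

```cpp
// NNH-based pruning inside the traversal: skip any child MBR whose MINDIST
// already exceeds the k-NN distance bound qEst (from nnh.estimate(q, k)).
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Entry = std::pair<double, Node*>;  // (MINDIST, node), as in nnSearch

void enqueueChildren(
    const Node& node, const std::vector<double>& q, double qEst,
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>>& pq) {
    for (Node* c : node.children) {
        double d = mindist(q, *c);
        if (d <= qEst) pq.push({d, c});  // prune when MINDIST(q, mbr) > qEst
    }
}
```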

  20. Utilizing estimates in NN join • k-NN join: for each object o1 in D1, find its k nearest neighbors in D2. • Preliminary algorithm by Hjaltason and Samet [HS98] • Traverse the two trees top down; keep a priority queue of node pairs

  21. Utilizing estimates in NN join (cont’d) • Construct an NNH for D2. • For each object o1 in D1, keep its estimated NN radius o1est using the NNH of D2. • As in a k-NN query, ignore an MBR for o1 if: MINDIST(o1, mbr) > o1est

  22. More powerful: prune MBR pairs

  23. Prune MBR pairs (cont’d) • Prune the pair (mbr1, mbr2) if MINDIST(mbr1, mbr2) exceeds the estimated NN radius of every object in mbr1, i.e., if: MINDIST(mbr1, mbr2) > max over o1 in mbr1 of o1est • No object in mbr1 can then find any of its k NNs inside mbr2
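
A short sketch of this pair-pruning test, assuming the NNH-based radii of the D1 objects beneath mbr1 are available (in practice their maximum could be cached per node rather than rescanned):

```cpp
// Pair pruning in a k-NN join: if even the largest NN-radius estimate under
// mbr1 is below the minimum possible distance between the two MBRs, no
// object in mbr1 can find any of its k NNs inside mbr2.
#include <algorithm>
#include <vector>

bool canPrunePair(double mindistOfPair, const std::vector<double>& estimates) {
    double maxEst = 0;
    for (double e : estimates) maxEst = std::max(maxEst, e);
    return mindistOfPair > maxEst;
}
```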

  24. How to construct an NNH? • If we have selected the m pivots: • Just run a k-NN query for each pivot to construct the NNH • Time: O(m) k-NN queries • Done offline • Important: selecting the pivots • Size-constraint NNH construction • Error-constraint NNH construction

  25. Size-constraint NNH construction • # of pivots “m” determines • Storage size • Initial construction cost • Incremental-maintenance cost • Choose m “best” pivots

  26. Size-constraint NNH construction • Given m: the number of pivots • Assumptions: • Query objects are drawn from the database D • H(pi, k), the k-NN distance at pivot pi, doesn’t vary too much in the pivot’s vicinity • Goal: find pivots p1, p2, …, pm minimizing the sum of object distances to their closest pivots: Σ over o in D of min over i of dist(o, pi) • This is a clustering problem: • Many algorithms are available • We use k-means for its simplicity and efficiency (see the sketch below)
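
A plain C++ sketch of pivot selection with Lloyd's k-means, one simple way to realize this clustering step; the naive seeding and fixed iteration count are illustrative choices, not the paper's:

```cpp
// Select m pivots as k-means centroids of the dataset (Lloyd's algorithm).
#include <limits>
#include <vector>

using Point = std::vector<double>;

static double dist2(const Point& a, const Point& b) {
    double d = 0;
    for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Assumes data.size() >= m and all points share the same dimensionality.
std::vector<Point> selectPivots(const std::vector<Point>& data, size_t m,
                                int iterations = 20) {
    std::vector<Point> pivots(data.begin(), data.begin() + m);  // naive seeding
    std::vector<size_t> assign(data.size(), 0);
    for (int it = 0; it < iterations; ++it) {
        // Assignment step: attach each object to its closest pivot.
        for (size_t i = 0; i < data.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (size_t j = 0; j < m; ++j) {
                double d = dist2(data[i], pivots[j]);
                if (d < best) { best = d; assign[i] = j; }
            }
        }
        // Update step: move each pivot to the mean of its cluster.
        std::vector<Point> sum(m, Point(data[0].size(), 0.0));
        std::vector<size_t> cnt(m, 0);
        for (size_t i = 0; i < data.size(); ++i) {
            for (size_t d = 0; d < data[i].size(); ++d)
                sum[assign[i]][d] += data[i][d];
            ++cnt[assign[i]];
        }
        for (size_t j = 0; j < m; ++j)
            if (cnt[j] > 0)
                for (size_t d = 0; d < sum[j].size(); ++d)
                    pivots[j][d] = sum[j][d] / cnt[j];
    }
    return pivots;
}
```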

  27. Error-constraint NNH construction • Assumptions: • A threshold r is set a priori • Any estimate of the k-NN distance with error less than r is considered “good” enough. • I.e., a maximum error of r is tolerated for any distance estimate.

  28. Error-constraint NNH construction (cont’d) • Find a set of points S = {p1, p2, …, pm} from the dataset D such that • for each point pi, its k NNs are within distance r/2 • Then, for any point q within distance r/2 of some pi, we get a k-NN distance estimate for q: qest = dist(q, pi) + H(pi, k) ≤ r/2 + r/2 = r, within the error threshold

  29. Error-constraint NNH construction (cont’d) • Problem: find points such that • they cover the entire dataset with spheres of radius r/2 • the sum of distances of all points in each sphere to its center is minimized • An instance of the “k-center problem” • An efficient 2-approximation algorithm exists using a single pass over the dataset

  30. Incremental Maintenance • How do we update the NNH when inserting or deleting objects? • We may need to “shift” each NN vector: • Associate a valid length Ei with each NN vector; only its first Ei entries remain trustworthy after updates

  31. Insertion • For a newly inserted object onew, locate in each pivot pi’s NN vector the position j where its distance falls among the valid entries, i.e., the first j (j ≤ Ei) with rj > dist(pi, onew)

  32. Insertion (cont’d) • If no such j is found, we don’t need to update this pivot’s NN vector (why? the new object is farther than all valid entries, so it cannot displace any recorded NN distance) • If found: • insert the new radius at position j • shift the rest of the vector to the right • increment Ei by 1

  33. Deletion • Similar to insertion • Locate the position of the deleted object’s distance rj in each NN vector • If not found, no update is needed for this vector • If found: • remove rj • shift the rest to the left • decrement Ei by 1. A sketch of both operations follows.
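
A C++ sketch of both maintenance operations on one pivot's NN vector, following the steps above; the struct name and the exact-match deletion test are illustrative simplifications:

```cpp
// One pivot's NN vector with fixed capacity T = r.size() and valid length E:
// only r[0..E-1] are trusted after a sequence of insertions and deletions.
#include <algorithm>
#include <vector>

struct NNVector {
    std::vector<double> r;  // NN distances, ascending within the valid prefix
    size_t E;               // valid length (initially T, after a fresh k-NN run)

    explicit NNVector(std::vector<double> radii)
        : r(std::move(radii)), E(r.size()) {}

    // Insertion: d = dist(pivot, newly inserted object).
    void onInsert(double d) {
        size_t j = 0;
        while (j < E && r[j] <= d) ++j;  // first valid entry greater than d
        if (j == E) return;              // farther than all valid entries: no-op
        for (size_t k = std::min(E, r.size() - 1); k > j; --k)
            r[k] = r[k - 1];             // shift right; last entry may drop
        r[j] = d;
        if (E < r.size()) ++E;
    }

    // Deletion: d = dist(pivot, deleted object).
    void onDelete(double d) {
        size_t j = 0;
        while (j < E && r[j] != d) ++j;  // locate the removed distance
        if (j == E) return;              // not among the valid entries: no-op
        for (size_t k = j; k + 1 < E; ++k)
            r[k] = r[k + 1];             // shift left
        --E;                             // the tail is no longer trusted
    }
};
```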

  34. Experiments • Dataset: • Corel image database • Contains 60,000 images • Each image is represented by a 32-dimensional float vector • Test bed: • PC: 1.5GHz Athlon, 512MB memory, 80GB HD, Windows 2000 • GNU C++ in CYGWIN

  35. Questions to be answered • Is the pruning using NNH estimates powerful? • KNN queries • NN-join queries • Is it “cheap” to have such a structure? • Storage • Initial construction • Incremental maintenance

  36. Improvement in k-NN search • Ran the k-means algorithm to generate 400 pivots, and constructed the NNH • Performed 10-NN queries on 100 randomly selected query objects • Queue size served as the benchmark for memory usage: • max queue size • average queue size

  37. Reduced Memory Requirement

  38. Reduced running time

  39. Effects of different # of pivots

  40. Improvement in k-NN joins • Selected two subsets from the Corel dataset, each containing 1500 objects • Unfortunately, we couldn’t run on the PC due to the large memory requirement • Ran on a Sun Ultra 4 workstation with four 300MHz CPUs and 3GB of memory • Constructed an NNH (400 pivots) for D2

  41. Join: Reduced Memory Requirement

  42. Join: Reduced running time

  43. Join: Effects of different # of pivots

  44. Join: Running time for different data sizes

  45. Cost/Benefit of NNH • For 60,000 32-d float vectors • “0” means almost zero

  46. Conclusion • NNH: efficient, effective approach to improving NN-search performance. • Can be easily embedded into current implementation of NN algorithms. • Can be efficiently constructed and maintained. • Offers substantial performance advantages.

  47. Work conducted in the Flamingo Project on Data Cleansing at UC Irvine
