
iDistance -- Indexing the Distance: An Efficient Approach to KNN Indexing

C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the distance: an efficient method to KNN processing. VLDB 2001.





  1. iDistance -- Indexing the Distance: An Efficient Approach to KNN Indexing • C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. • Indexing the distance: an efficient method to KNN processing, VLDB 2001.

  2. Query Requirement • Similarity queries: similarity range queries and KNN queries. • Similarity range query: given a query point, find all data points within a given distance r of the query point. • KNN query: given a query point, find the K nearest neighbours in distance to the point. [Figure: query sphere of radius r around the query point; the Kth NN lies on its boundary]
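The two query types can be sketched as brute-force Python baselines (not the paper's method; the function names, sample points, and Euclidean distance are illustrative assumptions):

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(data, q, r):
    """Similarity range query: all points within distance r of q."""
    return [p for p in data if dist(p, q) <= r]

def knn_query(data, q, k):
    """KNN query: the k points closest to q."""
    return sorted(data, key=lambda p: dist(p, q))[:k]

data = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.1), (0.4, 0.6)]
q = (0.5, 0.5)
print(range_query(data, q, 0.2))  # [(0.5, 0.5), (0.4, 0.6)]
print(knn_query(data, q, 2))      # [(0.5, 0.5), (0.4, 0.6)]
```

Both scans are linear in the data size; the point of iDistance is to answer the same queries through a one-dimensional index instead.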

  3. Other Methods • SS-tree: R-tree-based index structure; uses bounding spheres in internal nodes. • Metric tree: R-tree-based, but uses metric distances and bounding spheres. • VA-file: uses compression via bit strings for sequential filtering of unwanted data points. • P-Sphere tree: two-level index structure; uses clusters and duplicates data based on sample queries; intended for approximate KNN. • A-tree: R-tree-based, but uses relative bounding boxes. • Problem: these structures are hard to integrate into existing DBMSs.

  4. Basic Definition • Euclidean distance: dist(p, q) = sqrt((p1 - q1)^2 + ... + (pd - qd)^2). • Relationship between data points: • Theorem 1: let q be the query object, Oi the reference point for partition i, and p an arbitrary point in partition i. If dist(p, q) <= querydist(q) holds, then it follows that dist(Oi, q) - querydist(q) <= dist(Oi, p) <= dist(Oi, q) + querydist(q).
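Theorem 1 is the triangle inequality applied through the reference point, so a quick empirical check never fails (the dimension, radius, and sample count below are arbitrary choices, not from the paper):

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

random.seed(0)
O = tuple(random.random() for _ in range(8))  # reference point O_i of a partition
q = tuple(random.random() for _ in range(8))  # query point
r = 0.4                                       # querydist(q)

for _ in range(1000):
    p = tuple(random.random() for _ in range(8))
    if dist(p, q) <= r:
        # Theorem 1: an answer's distance to O is confined to a band of width 2r
        assert dist(O, q) - r <= dist(O, p) <= dist(O, q) + r
print("Theorem 1 held on all sampled points")
```

This band is exactly what lets a KNN query be answered by scanning a small key interval per partition instead of the whole data set.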

  5. Basic Concept of iDistance • Indexing points based on similarity: y = i * c + dist(Si, p). [Figure: reference/anchor points S1, S2, S3, ..., Sk; each partition i maps to the key interval starting at i * c on a single one-dimensional axis]
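The key formula can be sketched directly; the constant C and the two reference points below are illustrative assumptions (C must exceed any distance within a partition so the key ranges never overlap):

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

C = 10.0  # stretches partitions apart on the key axis; must exceed any dist(Si, p)

def idistance_key(i, ref, p):
    """Map point p of partition i (with reference point ref) to a 1-d key."""
    return i * C + dist(ref, p)

refs = [(0.0, 0.0), (1.0, 1.0)]               # hypothetical reference points S0, S1
print(idistance_key(0, refs[0], (0.3, 0.4)))  # about 0.5
print(idistance_key(1, refs[1], (1.0, 0.5)))  # about 10.5
```

Because the key is a single number, the points can be stored in any ordinary one-dimensional structure, which is how iDistance reuses a B+-tree.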

  6. iDistance • Data points are partitioned into clusters (partitions). • Each partition has a reference point to which every data point in the partition refers. • Data points are indexed based on their similarity (metric distance) to this reference point, using a classical B+-tree. • Iterative range queries are used in KNN searching.

  7. KNN Searching • The search region is enlarged until the K NNs are found. [Figure: search spheres around S1 and S2, each mapping to a key range in the B+-tree]

  8. KNN Searching • For each partition i, the search scans the key interval [dist(Si, q) - r, dist(Si, q) + r], clipped to the partition's actual key range [Dis_min(Si), Dis_max(Si)]. • The search radius r is increased in each iteration. [Figure: query q between reference points S1 and S2, with the corresponding intervals shown on the one-dimensional key axis]
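The iterative search of slides 6-8 can be sketched with a sorted Python list standing in for the B+-tree (a simplification, not the paper's implementation; C, delta, and the reference points are assumed values, and `bisect` plays the role of the tree's range scan):

```python
import bisect
import math
import random

def dist(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

C = 10.0  # partition-separating constant; must exceed any within-partition distance

def build_index(data, refs):
    """Assign each point to its nearest reference point; sort by iDistance key."""
    index = []
    for p in data:
        i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
        index.append((i * C + dist(refs[i], p), p))
    index.sort()
    return index

def knn_search(index, refs, q, k, delta=0.05):
    """Enlarge the search radius r until the K nearest neighbours are certain."""
    keys = [key for key, _ in index]
    r = 0.0
    while True:
        r += delta
        cands = []
        for i, ref in enumerate(refs):
            d = dist(ref, q)
            # Theorem 1: answers within r of q have keys in [i*C + d - r, i*C + d + r]
            lo = bisect.bisect_left(keys, i * C + d - r)
            hi = bisect.bisect_right(keys, i * C + d + r)
            cands.extend(p for _, p in index[lo:hi])
        if len(cands) >= k:
            knn = sorted(cands, key=lambda p: dist(p, q))[:k]
            if dist(knn[-1], q) <= r:  # stopping criterion (Theorem 2, case 1)
                return knn

random.seed(1)
data = [(random.random(), random.random()) for _ in range(200)]
refs = [(0.0, 0.0), (1.0, 1.0)]
index = build_index(data, refs)
result = knn_search(index, refs, (0.5, 0.5), 3)
assert result == sorted(data, key=lambda p: dist(p, (0.5, 0.5)))[:3]
```

The final assertion checks the answer against brute force: once the Kth candidate lies within r, every unseen point must be farther than r away, so the result is exact.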

  9. KNN Searching [Figure: search example for query Q2]

  10. Over Search? • Inefficient situation: when K = 3, a query sphere with radius r retrieves 3 NNs, but among them only o1 is guaranteed to be a true NN. • Hence the search continues with enlarged r until r > dist(q, o3). [Figure: reference point S, query q at distance dist(S, q), and candidate points o1, o2, o3]

  11. Stopping Criterion • Theorem 2: the KNN search algorithm terminates when K NNs are found, and the answers are correct. • Case 1: dist(furthest(KNN'), q) < r; all K candidates are guaranteed correct. • Case 2: dist(furthest(KNN'), q) > r; the Kth candidate may not be a true NN, so the search continues with a larger r.

  12. Space-based Partitioning: Equal Partitioning • (external point, closest distance) • (centroid of hyperplane, closest distance)

  13. Space-based Partitioning: Equal Partitioning from Furthest Points • (centroid of hyperplane, furthest distance) • (external point, furthest distance)

  14. Effect of Reference Points on Query Space • Using an external point reduces the search area.

  15. Effect on Query Space • The area bounded by these arcs is the affected search area. • Using (centroid, furthest distance) can greatly reduce the search area.

  16. Data-based Partitioning I • Using cluster centroids as reference points. [Figure: clusters in the unit square with their centroids marked]

  17. Data-based Partitioning II • Using edge points as reference points. [Figure: clusters in the unit square with edge points marked]
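The centroid variant of data-based partitioning can be sketched with a minimal Lloyd's k-means whose final centroids become the reference points (the synthetic two-cluster data and the deterministic initialization are assumptions for the demo):

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(data, centroids, iters=20):
    """Minimal Lloyd's iterations; the final centroids serve as reference points."""
    k = len(centroids)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in data:
            i = min(range(k), key=lambda j: dist(centroids[j], p))
            clusters[i].append(p)
        # recompute each centroid as the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

random.seed(2)
# two synthetic Gaussian clusters around (0.2, 0.2) and (0.8, 0.8)
data = ([(random.gauss(0.2, 0.05), random.gauss(0.2, 0.05)) for _ in range(100)]
        + [(random.gauss(0.8, 0.05), random.gauss(0.8, 0.05)) for _ in range(100)])
refs = kmeans(data, [data[0], data[-1]])  # seed with one point from each cluster
print(sorted(refs))  # near (0.2, 0.2) and (0.8, 0.8)
```

Edge-point partitioning (slide 17) would instead pick, per cluster, the point furthest from the data-space center; either way, the chosen points then feed the key formula of slide 5.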

  18. Performance Study: Effect of Search Radius • 100K uniform data set • Using (external point, furthest distance) • Effect of search radius on query accuracy [Figure panels: Dimension = 8, 16, 30]

  19. I/O Cost vs Search Radius • 10-NN queries on 100K uniform data sets • Using (external point, furthest distance) • Effect of search radius on query cost

  20. Effect of Reference Points • 10-NN queries on 100K 30-d uniform data set • Different Reference Points

  21. Effect of # of Partitions on Accuracy • KNN queries on a 100K 30-d clustered data set • Effect of query radius on query accuracy for different numbers of partitions

  22. Effect of # of Partitions on I/O and CPU Cost • 10-NN queries on 100K 30-d clustered data set • Effect of # of partitions on I/O and CPU Costs

  23. Effect of Data Sizes • KNN queries on 100K and 500K 30-d clustered data sets • Effect of query radius on query accuracy for different data set sizes

  24. Effect of Clustered Data Sets • 10-NN queries on 100K and 500K 30-d clustered data sets • Effect of query radius on query cost for different data set sizes

  25. Effect of Reference Points on Clustered Data Sets • 10-NN queries on a 100K 30-d clustered data set • Effect of reference points: cluster edge vs cluster centroid

  26. iDistance Ideal for Approximate KNN? • 10-NN queries on 100K and 500K 30-d clustered data sets • Query cost at varying query accuracy for different data set sizes

  27. Performance Study -- Comparing iMinMax and iDistance • 10-NN queries on 100K 30-d clustered data sets • C. Yu, B. C. Ooi, K.-L. Tan. Progressive KNN Search Using B+-trees.

  28. iDistance vs A-tree

  29. iDistance vs A-tree

  30. Summary of iDistance • iDistance is simple but efficient. • It is a metric-based index. • The index can be integrated into existing systems easily.
