Similarity Search on Bregman Divergence: Towards Non-Metric Indexing

Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung.


Presentation Transcript


  1. Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung • Similarity Search on Bregman Divergence: Towards Non-Metric Indexing

  2. Metric vs. Non-Metric • Euclidean distance dominates DB queries • Similarity in human perception • Metric distance is not enough!

  3. Outline • Bregman Divergence • Solution (Basic solution; Better pruning bounds; Query distribution) • Experiments • Conclusion

  4. Bregman Divergence • [Figure: convex function f(x) with the points (p, f(p)) and (q, f(q)), a line h, the Bregman divergence D_f(p,q), and the Euclidean distance between p and q]

  5. Bregman Divergence • Mathematical Interpretation • The distance between p and q is defined as the difference between f(p) and the first-order Taylor expansion of f at q, evaluated at p: D_f(p,q) = f(p) - [f(q) + <∇f(q), p - q>], where f(p) is the original f(x) and the bracketed term is the first-order Taylor expansion of f(x) at q
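
The definition on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not code from the presentation: the function names (`bregman`, `sq_norm`, `neg_entropy`) are hypothetical, and the two generators shown (squared norm and negative entropy) are standard textbook choices that may or may not match the examples on the slides.

```python
import math

def bregman(f, grad_f, p, q):
    """D_f(p, q) = f(p) - f(q) - <grad f(q), p - q> for equal-length vectors."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))
    return f(p) - f(q) - inner

# Generator f(x) = sum_i x_i^2: D_f is the squared Euclidean distance.
def sq_norm(x):
    return sum(xi * xi for xi in x)

def sq_norm_grad(x):
    return [2.0 * xi for xi in x]

# Generator f(x) = sum_i x_i log x_i (requires x_i > 0): D_f is the
# generalized KL divergence, which equals KL(p || q) for probability vectors.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

p = [0.2, 0.8]
q = [0.5, 0.5]
d_sq = bregman(sq_norm, sq_norm_grad, p, q)          # squared Euclidean distance
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)  # KL(p || q)
```

Passing the generator and its gradient as arguments keeps the divergence itself generic, which mirrors the point that f alone determines D_f.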

  6. Bregman Divergence • General Properties • Uniqueness: a function f(x) uniquely determines D_f(p,q) • Non-Negativity: D_f(p,q) ≥ 0 for any p, q • Identity: D_f(p,p) = 0 for any p • Symmetry and Triangle Inequality: no longer hold
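
A small numeric check makes these properties concrete. The sketch below uses two one-dimensional generators chosen for illustration (they are assumptions, not taken from the slides): f(x) = x^2, whose divergence is (p - q)^2, and f(x) = x log x, whose divergence is p log(p/q) - p + q.

```python
import math

def d_sq(p, q):
    # Bregman divergence of f(x) = x^2 in one dimension.
    return (p - q) ** 2

def d_kl(p, q):
    # Bregman divergence of f(x) = x log x in one dimension (p, q > 0).
    return p * math.log(p / q) - p + q

# Non-negativity and identity hold.
nonneg = d_kl(1.0, 2.0) >= 0.0
identity = d_kl(2.0, 2.0) == 0.0

# Symmetry fails: swapping the arguments changes the value.
forward, backward = d_kl(1.0, 2.0), d_kl(2.0, 1.0)

# Triangle inequality fails: going 0 -> 2 directly costs more
# than the detour 0 -> 1 -> 2.
direct = d_sq(0.0, 2.0)
detour = d_sq(0.0, 1.0) + d_sq(1.0, 2.0)
```

Note that the squared Euclidean distance is symmetric yet still violates the triangle inequality, so the two failures are independent of each other.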

  7. Examples

  8. Why in a DB system? • Database applications • Retrieval of similar images, speech signals, or time series • Optimization on matrices in machine learning • Efficiency is important! • Query Types • Nearest Neighbor Query • Range Query

  9. Euclidean Space • How to answer the queries • R-Tree

  10. Euclidean Space • How to answer the queries • VA File

  11. Our goal • Re-use the infrastructure of existing DB systems to support Bregman divergence • Storage management • Indexing structures • Query processing algorithms

  12. Outline • Bregman Divergence • Solution (Basic solution; Better pruning bounds; Query distribution) • Experiments • Conclusion

  13. Basic Solution • Extended Space • Convex function f(x) = x^2
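
One plausible reading of the extended space, sketched here as an assumption since the slide's figure is not preserved: each d-dimensional point x is lifted to the (d+1)-dimensional point (x, f(x)), and the divergence from p to q is the vertical gap between the lifted point and the hyperplane tangent to f at q. The helper names `extend` and `tangent_at` are hypothetical.

```python
def f(x):
    # Generator from the slide, applied per dimension: f(x) = sum_i x_i^2.
    return sum(xi * xi for xi in x)

def grad_f(x):
    return [2.0 * xi for xi in x]

def extend(x):
    """Lift a d-dimensional point to d+1 dimensions by appending f(x)."""
    return list(x) + [f(x)]

def tangent_at(q):
    """Return h(x) = f(q) + <grad f(q), x - q>, the tangent hyperplane at q."""
    fq, gq = f(q), grad_f(q)
    return lambda x: fq + sum(g * (xi - qi) for g, xi, qi in zip(gq, x, q))

p, q = [3.0, 0.0], [1.0, 1.0]
ext_p = extend(p)                 # [3.0, 0.0, 9.0]
h = tangent_at(q)
gap = ext_p[-1] - h(ext_p[:-1])   # height of the lifted point above h
```

For this generator the gap equals the squared Euclidean distance between p and q, here (3 - 1)^2 + (0 - 1)^2 = 5.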

  14. Basic Solution • After the extension • Index extended points with R-Tree or VA File • Re-use existing algorithms with new lower and upper bound computation

  15. How to improve? • Reformulation of Bregman divergence • Tighter bounds are derived • No change to index construction or query processing algorithms

  16. A New Formulation • [Figure: lines h and h', query vector v_q, points q and p, and the quantities D_f(p,q)+Δ and D*_f(p,q)]

  17. Math. Interpretation • Reformulation of similarity search queries • k-NN query: query q, data set P, divergence D_f • Find the points p in P minimizing D_f(p,q) • Range query: query q, threshold θ, data set P • Return every point p in P with D_f(p,q) ≤ θ
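
The reformulation can be sketched as follows. This is a reconstruction under stated assumptions, since the slide's formulas are not preserved: the query vector is taken to be v_q = (-∇f(q), 1), the reformulated divergence is the inner product D*_f(p,q) = <v_q, (p, f(p))>, and the offset is Δ = f(q) - <∇f(q), q>, so that D*_f(p,q) = D_f(p,q) + Δ. The sign convention for Δ and all function names are assumptions. Because Δ depends only on q, ranking by D*_f gives the same k-NN answer as ranking by D_f, and a range query with threshold θ becomes D*_f(p,q) ≤ θ + Δ.

```python
import math

def f(x):
    # Illustrative generator: negative entropy (requires x_i > 0).
    return sum(xi * math.log(xi) for xi in x)

def grad_f(x):
    return [math.log(xi) + 1.0 for xi in x]

def bregman(p, q):
    return f(p) - f(q) - sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))

def extend(p):
    return list(p) + [f(p)]

def query_vector(q):
    return [-g for g in grad_f(q)] + [1.0]

def delta(q):
    return f(q) - sum(g * qi for g, qi in zip(grad_f(q), q))

def d_star(ext_p, v_q):
    # Linear in the extended point: a plain inner product.
    return sum(a * b for a, b in zip(ext_p, v_q))

def knn(points, q, k):
    v_q = query_vector(q)
    return sorted(points, key=lambda p: d_star(extend(p), v_q))[:k]

def range_query(points, q, theta):
    v_q, bound = query_vector(q), theta + delta(q)
    return [p for p in points if d_star(extend(p), v_q) <= bound]

P = [[0.1, 0.9], [0.4, 0.6], [0.5, 0.5], [0.8, 0.2]]
q = [0.45, 0.55]
nearest_two = knn(P, q, 2)
within = range_query(P, q, 0.01)
```

These are linear scans; the point of the linear form is that the same inner product can be bounded over a bounding rectangle, which is what an index such as an R-Tree or VA File needs for pruning.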

  18. Naïve Bounds • Check the corners of the bounding rectangles
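
A sketch of what corner checking could look like, under the assumption (not confirmed by the slide) that the quantity being bounded is an inner product between a query vector v_q = (-∇f(q), 1) and the extended point (p, f(p)). A linear function over an axis-aligned box attains its minimum and maximum at corners, so enumerating the corners yields valid bounds; the same values can be obtained in O(d) by picking the better endpoint per dimension. All names are hypothetical.

```python
from itertools import product

def f(x):
    return sum(xi * xi for xi in x)

def grad_f(x):
    return [2.0 * xi for xi in x]

def extend(p):
    return list(p) + [f(p)]

def query_vector(q):
    return [-g for g in grad_f(q)] + [1.0]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def corner_bounds(lo, hi, v_q):
    """Evaluate the linear function at every corner of the box [lo, hi]."""
    values = [dot(corner, v_q) for corner in product(*zip(lo, hi))]
    return min(values), max(values)

def corner_bounds_fast(lo, hi, v_q):
    """Same bounds in O(d): choose the best endpoint in each dimension."""
    lower = sum(min(v * l, v * h) for v, l, h in zip(v_q, lo, hi))
    upper = sum(max(v * l, v * h) for v, l, h in zip(v_q, lo, hi))
    return lower, upper

points = [[1.0, 2.0], [2.0, 3.0], [1.5, 2.5]]
ext = [extend(p) for p in points]
lo = [min(col) for col in zip(*ext)]   # bounding box of the extended points
hi = [max(col) for col in zip(*ext)]
v_q = query_vector([3.0, 1.0])
bounds = corner_bounds(lo, hi, v_q)
bounds_fast = corner_bounds_fast(lo, hi, v_q)
```

The box treats the last coordinate as independent of the others, although it is really determined by f, so most corners do not correspond to any real point and the bounds are loose.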

  19. Tighter Bounds • Take the curve f(x) into consideration
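
One way to take the curve into account, sketched under the assumption that f decomposes into a sum of one-dimensional convex functions (the slide does not give the derivation, so this is not necessarily the authors' construction). In each dimension the divergence is convex in p with its minimum at q, so over an interval the minimum sits at q clamped into the interval and the maximum at one of the endpoints. Names such as `tight_bounds` are hypothetical.

```python
def f1(x):
    # One-dimensional generator; f(x) = x^2 is used for illustration.
    return x * x

def f1_prime(x):
    return 2.0 * x

def d1(p, q):
    """One-dimensional Bregman divergence of f1."""
    return f1(p) - f1(q) - f1_prime(q) * (p - q)

def divergence(p, q):
    return sum(d1(pi, qi) for pi, qi in zip(p, q))

def tight_bounds(lo, hi, q):
    """Exact min and max of D_f(p, q) over the rectangle lo <= p <= hi."""
    lower = upper = 0.0
    for l, h, qi in zip(lo, hi, q):
        nearest = min(max(qi, l), h)        # clamp q into [l, h]
        lower += d1(nearest, qi)
        upper += max(d1(l, qi), d1(h, qi))  # convexity: max at an endpoint
    return lower, upper

lo, hi, q = [1.0, 2.0], [2.0, 3.0], [3.0, 1.0]
lower, upper = tight_bounds(lo, hi, q)
```

Because these bounds work on the original rectangle and follow the shape of f, they are attained by actual points of the rectangle, unlike bounds taken from corners of a box in the extended space.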

  20. Query distribution • Distortion of rectangles • The difference between the maximum and minimum distances from points inside the rectangle to the query

  21. Can we improve it more? • When building an R-Tree in Euclidean space • Minimize the volume/edge length of MBRs • Does it remain valid?

  22. Query distribution • Distortion of bounding rectangles • Invariant in Euclidean space (triangle inequality) • Query-dependent for Bregman Divergence
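
A one-dimensional illustration of the contrast, as a sketch with hypothetical names: the "rectangle" is the interval [0, 1] and the queries lie outside it at increasing distance. Under Euclidean distance the distortion stays equal to the interval length, while under the Bregman divergence of f(x) = x^2 it grows with the query position.

```python
def clamp(x, lo, hi):
    return min(max(x, lo), hi)

def distortion(dist, lo, hi, q):
    """Max minus min of dist(x, q) over x in [lo, hi].

    Valid for a 1-D dist that is convex in x with its minimum at x = q.
    """
    d_min = dist(clamp(q, lo, hi), q)
    d_max = max(dist(lo, q), dist(hi, q))
    return d_max - d_min

def euclid(x, q):
    return abs(x - q)

def sq_bregman(x, q):
    # Bregman divergence of f(x) = x^2.
    return (x - q) ** 2

queries = [2.0, 5.0, 10.0]
euclid_distortion = [distortion(euclid, 0.0, 1.0, q) for q in queries]
bregman_distortion = [distortion(sq_bregman, 0.0, 1.0, q) for q in queries]
```

With these inputs the Euclidean values are all 1.0, whereas the Bregman values are 3.0, 9.0 and 19.0, so the same rectangle is cheap to prune for some queries and expensive for others.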

  23. Utilize Query Distribution • Summarize the query distribution with O(d) real numbers • Estimate the expected distortion of any bounding rectangle in O(d) time • Allows a better index to be constructed for both R-Tree and VA File

  24. Outline • Bregman Divergence • Solution (Basic solution; Better pruning bounds; Query distribution) • Experiments • Conclusion

  25. Experiments • Data Sets • KDD'99 data: network data, the proportion of packets in 72 different TCP/IP connection types • DBLP data: uses the co-authorship graph to generate the probabilities of authors being related to 8 different areas

  26. Experiments • Data Sets • Uniform synthetic data: generated with a uniform distribution • Clustered synthetic data: generated with a Gaussian mixture model
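
The two synthetic generators could look roughly like the sketch below. Every parameter here (data size, dimensionality, number of clusters, standard deviation, equal mixture weights, spherical covariance, seed) is a placeholder chosen for illustration; the slides do not state the settings used in the experiments.

```python
import random

def uniform_data(n, d, rng):
    """n points with each coordinate drawn uniformly from [0, 1)."""
    return [[rng.random() for _ in range(d)] for _ in range(n)]

def clustered_data(n, d, n_clusters, sigma, rng):
    """n points from an equal-weight Gaussian mixture with spherical covariance."""
    centers = uniform_data(n_clusters, d, rng)
    data = []
    for _ in range(n):
        center = rng.choice(centers)  # pick a mixture component
        data.append([rng.gauss(mu, sigma) for mu in center])
    return data

rng = random.Random(0)  # fixed seed for reproducibility
uniform = uniform_data(1000, 8, rng)
clustered = clustered_data(1000, 8, n_clusters=5, sigma=0.05, rng=rng)
```

Generators that need positive inputs, such as negative entropy, would require clipping or renormalizing the Gaussian samples before use.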

  27. Experiments • Methods to compare

  28. Experiments • Index Construction Time

  29. Experiments • Varying dimensionality

  30. Experiments • Varying dimensionality (cont.)

  31. Experiments • Varying k for nearest neighbor query

  32. Conclusion • A general technique for similarity search on Bregman Divergence • All techniques build on the existing infrastructure of commercial databases • Extensive experiments compare performance on R-Tree and VA File with different optimizations

  33. Acknowledgment • Zhenjie Zhang, Anthony K. H. Tung and Beng Chin Ooi were supported by Singapore NRF grant R-252-000-376-279. • Srinivasan Parthasarathy was supported by NSF IIS-0347662 (CAREER) and NSF CCF-0702587.

  34. Q & A
