Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München: A Cost Model and Index Architecture for the Similarity Join


Presentation Transcript


  1. Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München: A Cost Model and Index Architecture for the Similarity Join

  2. Feature Based Similarity

  3. Simple Similarity Queries • Specify a query object and • Find similar objects – range query • Find the k most similar objects – nearest neighbor query
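The two query types on this slide can be sketched as linear scans over feature vectors. This is a minimal illustration, assuming Euclidean distance and points stored as tuples; the function names `range_query` and `knn_query` are hypothetical and do not come from the presentation.

```python
import heapq
import math


def dist(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def range_query(points, query, eps):
    """Return all points whose distance to the query object is at most eps."""
    return [p for p in points if dist(p, query) <= eps]


def knn_query(points, query, k):
    """Return the k points closest to the query object, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: dist(p, query))


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
    print(range_query(points, (0.0, 0.0), 1.5))
    print(knn_query(points, (0.0, 0.0), 2))
```

A similarity join replaces the single query object with a whole second set, which is why the per-query scan above is not a practical building block for it.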

  4. Join Applications: Catalogue Matching • Catalogue matching between two data sets R and S • E.g. astronomical catalogues

  5. Join Applications: Clustering • Clustering (e.g. DBSCAN) • Similarity self-join

  6. R-tree Spatial Join (RSJ)

         procedure r_tree_sim_join (R, S, ε)
           if IsDirpg (R) ∧ IsDirpg (S) then
             foreach r ∈ R.children do
               foreach s ∈ S.children do
                 if mindist (r,s) ≤ ε then
                   CacheLoad (r); CacheLoad (s);
                   r_tree_sim_join (r,s,ε);
           else (* assume R,S both DataPg *)
             foreach p ∈ R.points do
               foreach q ∈ S.points do
                 if |p - q| ≤ ε then report (p,q);
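The RSJ pseudocode can be turned into a small in-memory Python sketch. The `Node` class and the helpers `data_page` and `dir_page` are hypothetical stand-ins for real index pages, Euclidean distance is assumed, and the `CacheLoad` calls are omitted because everything already lives in memory. Like the pseudocode, the sketch assumes both trees have the same height.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical index page: directory pages hold children, data pages hold points."""
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)


def data_page(points):
    """Build a data page whose bounding box encloses the given points."""
    dims = range(len(points[0]))
    lo = tuple(min(p[i] for p in points) for i in dims)
    hi = tuple(max(p[i] for p in points) for i in dims)
    return Node(lo, hi, points=list(points))


def dir_page(children):
    """Build a directory page whose bounding box encloses all child pages."""
    dims = range(len(children[0].lo))
    lo = tuple(min(c.lo[i] for c in children) for i in dims)
    hi = tuple(max(c.hi[i] for c in children) for i in dims)
    return Node(lo, hi, children=list(children))


def mindist(r, s):
    """Smallest Euclidean distance between the bounding boxes of two pages."""
    gaps = (max(r.lo[i] - s.hi[i], s.lo[i] - r.hi[i], 0.0) for i in range(len(r.lo)))
    return math.sqrt(sum(g * g for g in gaps))


def r_tree_sim_join(R, S, eps, out):
    """Append every pair (p, q) with |p - q| <= eps to out and return it."""
    if R.children and S.children:
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:  # prune page pairs that cannot join
                    r_tree_sim_join(r, s, eps, out)
    else:  # assume R and S are both data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))
    return out


if __name__ == "__main__":
    R = dir_page([data_page([(0.0, 0.0), (1.0, 1.0)]), data_page([(10.0, 10.0)])])
    S = dir_page([data_page([(0.5, 0.0)]), data_page([(10.0, 10.5), (20.0, 20.0)])])
    print(r_tree_sim_join(R, S, 1.0, []))
```

In this example only two of the four page pairs pass the `mindist` test, so only those two are descended into.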

  7. Cost Modeling • Single similarity queries: the access probability of pages is modeled using the concept of the Minkowski sum

  8. Cost Modeling • Binomial formula:

  9. Cost Modeling • Mating probability of index pages: • Probability that the distance between two pages is ≤ ε • Two-fold application of the Minkowski sum
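One way to read the two-fold Minkowski sum, sketched here under simplifying assumptions: for two cube-shaped pages with side lengths a and b, the pages mate exactly when the center of one lies inside the cube of side a + b around the other's center, enlarged by a ball of radius ε. The volume of that region then follows from the cube-plus-ball formula, and a Monte Carlo estimate can check it. Cube-shaped pages, uniformly distributed page centers, a unit data space without boundary effects, and all function names are assumptions for illustration, not the model as stated on the slide.

```python
import math
import random


def minkowski_volume(d, side, eps):
    """Volume of a d-dimensional cube of the given side enlarged by a ball of radius eps."""
    return sum(
        math.comb(d, k) * side ** (d - k)
        * math.pi ** (k / 2) / math.gamma(k / 2 + 1) * eps ** k
        for k in range(d + 1)
    )


def box_mindist(c1, a, c2, b):
    """Euclidean mindist between two axis-parallel cubes given by center and side."""
    gaps = (max(abs(x - y) - (a + b) / 2, 0.0) for x, y in zip(c1, c2))
    return math.sqrt(sum(g * g for g in gaps))


def mating_probability(d, a, b, eps):
    """Analytic estimate: cube (side a) + cube (side b) + ball (radius eps)."""
    return min(1.0, minkowski_volume(d, a + b, eps))


def mating_volume_mc(d, a, b, eps, trials=100_000, seed=0):
    """Monte Carlo estimate of the volume of the mating region."""
    rng = random.Random(seed)
    window = a + b + 2 * eps  # side of a window that contains the whole mating region
    origin = (0.0,) * d
    hits = 0
    for _ in range(trials):
        center = tuple(rng.uniform(-window / 2, window / 2) for _ in range(d))
        if box_mindist(origin, a, center, b) <= eps:
            hits += 1
    return hits / trials * window ** d


if __name__ == "__main__":
    print(mating_probability(2, 0.1, 0.1, 0.05))
    print(mating_volume_mc(2, 0.1, 0.1, 0.05))
```

The two printed values should agree closely, which supports the reading that the mating region is the enlarged cube of side a + b.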

  10. Page Capacity Optimization • Cost model can determine index selectivity which depends on various parameters • Page capacity (number of stored points) is an important parameter • Known from similarity search: Page capacity optimization yields considerable improvement

  11. Analysis of the Index Overhead • Assuming 100% selectivity (the index doesn't work): how much more expensive is index usage? • CPU: • Distance between boxes is more expensive to compute than distance between points: α ≈ 5 • Smaller capacity ⇒ more box distance computations

  12. Analysis of the Index Overhead • Disk I/O: • High constant cost per page access (move disk head) • A page access is more expensive than continuous reading of a point by a factor β ≈ 10000 / d • Smaller capacity ⇒ more disk head movement

  13. Analysis of the Index Overhead • What selectivity is needed for the index to pay off?

  14. Optimization • I/O cost function: [formula missing] is optimized by: [formula missing] • CPU cost function: [formula missing] is optimized by: [formula missing]

  15. Optimization • I/O cost: • Large capacity optimum (typically several tens of thousands of points) • CPU cost: • Small capacity optimum (typically < 100 points) • No compromise achievable

  16. Multipage Index (MuX) → CPU performance like a CPU-optimized index → I/O performance like an I/O-optimized index • Separate optimization

  17. Experimental Evaluation (plots) • Uniform 4D • Uniform 8D

  18. Experimental Evaluation (plots) • CAD Data 16D • Color Images 64D

  19. Conclusions • Summary • High potential for performance gains of the similarity join by page capacity optimization • Necessary to separately optimize I/O and CPU • Future research potential • Similarity join for metric index structures • Approximate similarity join • Parallel similarity join algorithms

  20. Consequences • Assume selectivity ≈ 100% for I/O optimization • Page accesses in a nested-block-loop-like style:

          fill cache with pages of R (1 page free);
          foreach S-page s do
            if s joins some of the cached R-pages then
              load (s);
              foreach joining R-page r in cache do
                if mindist (r,s) < ε then join (r,s);
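The nested-block-loop access pattern can be sketched in Python as follows. Pages are modeled as plain dictionaries with a bounding box and a point list, which is a hypothetical representation; the outer loop over successive cache fills, the page-load counter, and the use of ≤ in both distance tests are assumptions added for illustration. The bounding boxes are assumed to be known from the directory, so the `mindist` test does not require loading the S page.

```python
import math


def make_page(points):
    """Hypothetical page: bounding box plus the points stored in it."""
    dims = range(len(points[0]))
    return {
        "lo": tuple(min(p[i] for p in points) for i in dims),
        "hi": tuple(max(p[i] for p in points) for i in dims),
        "points": list(points),
    }


def mindist(r, s):
    """Smallest Euclidean distance between the bounding boxes of two pages."""
    gaps = (
        max(rl - sh, sl - rh, 0.0)
        for rl, rh, sl, sh in zip(r["lo"], r["hi"], s["lo"], s["hi"])
    )
    return math.sqrt(sum(g * g for g in gaps))


def block_loop_join(R_pages, S_pages, eps, cache_pages):
    """Return (result pairs, number of page loads) for a cache of cache_pages slots."""
    if cache_pages < 2:
        raise ValueError("need at least one slot for R and one for S")
    results, loads = [], 0
    block = cache_pages - 1  # one slot stays free for the current S page
    for start in range(0, len(R_pages), block):
        cached = R_pages[start:start + block]  # fill cache with pages of R
        loads += len(cached)
        for s in S_pages:
            partners = [r for r in cached if mindist(r, s) <= eps]
            if not partners:
                continue  # s joins no cached R page, so it is never loaded
            loads += 1  # load(s)
            for r in partners:
                for p in r["points"]:
                    for q in s["points"]:
                        if math.dist(p, q) <= eps:
                            results.append((p, q))
    return results, loads


if __name__ == "__main__":
    R_pages = [make_page([(0.0, 0.0), (1.0, 0.0)]), make_page([(5.0, 5.0)]), make_page([(9.0, 9.0)])]
    S_pages = [make_page([(0.0, 0.5)]), make_page([(5.0, 5.5)]), make_page([(50.0, 50.0)])]
    print(block_loop_join(R_pages, S_pages, eps=1.0, cache_pages=3))
```

With a larger cache fewer passes over S are needed, which is consistent with the slide's point that I/O cost favors large units of transfer.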

