Clustered Pivot Tables for I/O-optimized Similarity Search

Juraj Moško, Jakub Lokoč, Tomáš Skopal. Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague.

Presentation Transcript


  1. Clustered Pivot Tables for I/O-optimized Similarity Search. Juraj Moško, Jakub Lokoč, Tomáš Skopal. Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague. SISAP 2011, Lipari

  2. Presentation outline
     • Similarity search in metric spaces
     • Pivot tables
     • Clustered pivot tables
       • Static variant
       • Dynamic variant
     • Experiments

  3. Similarity search
     • Suitable for unstructured data; the query is often not in the DB
     • Similarity is often modeled by a metric distance
     • Expensive distance functions: EMD, SQFD, DTW, …
     • Metric indexing
       • Based on lower-bounding
       • If |d(p, q) – d(p, o)| > r, filter out object o
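The filtering rule on this slide follows from the triangle inequality: |d(p, q) – d(p, o)| is a lower bound on d(q, o), so if it already exceeds r, the object cannot be in the result. A minimal sketch, assuming Euclidean distance as a cheap stand-in for the expensive metrics named above (the function name `can_filter` and the sample points are illustrative, not from the talk):

```python
import math

def can_filter(d_pq, d_po, r):
    """True when the pivot-based lower bound proves d(q, o) > r."""
    return abs(d_pq - d_po) > r

p = (0.0, 0.0)    # pivot
q = (1.0, 0.0)    # query object
o = (10.0, 0.0)   # database object
r = 2.0           # query radius

d_pq = math.dist(p, q)  # computed once per query
d_po = math.dist(p, o)  # would be precomputed when the index is built
filtered = can_filter(d_pq, d_po, r)
```

Here the lower bound is 9, which exceeds r = 2, so o is discarded without ever computing d(q, o).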

  4. Pivot tables
     • Simple yet efficient main-memory metric index
     • Having k static pivots Pi and a database S of n objects Oj, the pivot table stores all the distances d(Pi, Oj) in a matrix of size k x n
     • Pivot tables = two structures: distance matrix + data file
     • Cheap filtering of non-relevant objects (lower-bounding)
     • Non-filtered objects are refined by the original expensive distance function
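The two phases described above (cheap filtering on the k x n matrix, then refinement with the original distance) can be sketched as a range query. This is an illustrative outline, not the authors' implementation; the function names and the toy 2-D data are assumptions, and Euclidean distance stands in for an expensive metric:

```python
import math

def build_pivot_table(pivots, objects, dist):
    """Return a k x n matrix with table[i][j] = dist(pivots[i], objects[j])."""
    return [[dist(p, o) for o in objects] for p in pivots]

def range_query(q, r, pivots, objects, table, dist):
    """Return (results, number of objects refined with the original distance)."""
    q_to_pivots = [dist(q, p) for p in pivots]  # k distance computations
    results, refined = [], 0
    for j, o in enumerate(objects):
        # Best lower bound on dist(q, o) over all pivots.
        lower_bound = max(abs(q_to_pivots[i] - table[i][j])
                          for i in range(len(pivots)))
        if lower_bound > r:
            continue  # filtered out using only the matrix
        refined += 1
        if dist(q, o) <= r:
            results.append(o)
    return results, refined

objects = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
pivots = [(0.0, 0.0), (10.0, 10.0)]
table = build_pivot_table(pivots, objects, math.dist)
results, refined = range_query((1.0, 0.0), 1.5, pivots, objects, table, math.dist)
```

In this toy run only two of the four objects reach the refinement step; the other two are rejected from the matrix alone.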

  5. Clustered pivot tables
     • What if the pivot table does not fit into main memory?
     • Solution 1 – just slice the data file
       + simple to construct
       - sequential scan => high I/O cost
     • Solution 2 – reorganize and slice the data file
       + similar objects in one page (page = cluster) => higher probability that all objects are filtered => lower I/O cost
       - metric clustering is expensive
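The difference between the two solutions can be illustrated by counting how many pages contain at least one non-filtered object. In the sketch below, sorting 1-D values is a deliberately simple stand-in for metric clustering, and the `candidates` set stands for the objects that survive filtering; both are assumptions made for illustration only:

```python
def slice_into_pages(order, page_size):
    """Cut a list of object ids into consecutive fixed-size pages."""
    return [order[i:i + page_size] for i in range(0, len(order), page_size)]

def pages_to_load(pages, candidates):
    """Count pages that hold at least one candidate (non-filtered) object."""
    return sum(1 for page in pages if any(j in candidates for j in page))

values = [0.9, 0.1, 0.5, 0.2, 0.8, 0.15, 0.55, 0.95]
ids = list(range(len(values)))
# Objects near the query value 0.12 play the role of non-filtered objects.
candidates = {j for j in ids if abs(values[j] - 0.12) <= 0.1}

# Solution 1: slice the data file in its original order.
plain_pages = slice_into_pages(ids, 2)
# Solution 2: reorganize first (sorting stands in for clustering), then slice.
clustered_pages = slice_into_pages(sorted(ids, key=lambda j: values[j]), 2)

plain_loads = pages_to_load(plain_pages, candidates)
clustered_loads = pages_to_load(clustered_pages, candidates)
```

With this data the unordered file needs 3 page loads and the reorganized file needs 2, because the three candidates end up packed into fewer pages.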

  6. Metric clustering? M-tree!
     • Dynamic, persistent, balanced structure
     • A leaf node represents a cluster of similar objects
     • Many construction strategies that consider the quality of the M-tree hierarchy, with complexity < O(n²)
       • Single/Multi/Hybrid-way leaf selection
       • Slim-down algorithm
       • Reinsertions

  7. Static CPT
     • Data file = objects serialized from M-tree leaves
     • Classic pivot table reorganizing input
     • Fixed page size in a paged data file
     • Preserve M-tree?
       • Future re-indexing
       • Query processing
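One way to read the static variant: take the leaves produced by the clustering step, serialize their objects in leaf order, cut that order into fixed-size pages, and permute the columns of the distance matrix to match. The sketch below assumes the leaves are already given as lists of object ids (no M-tree is built here), and the name `build_static_cpt` is hypothetical:

```python
def build_static_cpt(leaves, table, page_size):
    """Serialize leaves into fixed-size pages and reorder the matrix columns.

    leaves: list of lists of object ids (one list per cluster/leaf)
    table:  k x n distance matrix indexed by original object id
    """
    order = [j for leaf in leaves for j in leaf]          # leaf-by-leaf order
    pages = [order[i:i + page_size]
             for i in range(0, len(order), page_size)]    # fixed page size
    reordered = [[row[j] for j in order] for row in table]
    return order, pages, reordered

leaves = [[2, 0], [3], [1, 4]]
table = [[10, 11, 12, 13, 14]]  # one pivot, five objects (toy distances)
order, pages, reordered = build_static_cpt(leaves, table, page_size=2)
```

Note that with a fixed page size a page may mix objects from neighbouring leaves, as `[3, 1]` does here.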

  8. Dynamic CPT
     • Data file = set of M-tree leaves
     • Distance matrix connected to the M-tree leaves
     • Internal fragmentation
       • M-tree leaves contain different numbers of data objects, so utilization is not 100%
     • Dynamic operations do not degrade the created clusters
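The internal fragmentation mentioned above can be quantified as the fraction of page slots actually occupied when each leaf is stored as its own page. A small sketch, assuming a hypothetical leaf capacity of 4 objects and made-up leaf contents:

```python
def utilization(leaves, capacity):
    """Fraction of available slots that hold an object (1.0 = no fragmentation)."""
    used = sum(len(leaf) for leaf in leaves)
    return used / (len(leaves) * capacity)

leaves = [[2, 0], [3], [1, 4]]   # three leaves holding five objects in total
u = utilization(leaves, capacity=4)
```

Here 5 of 12 slots are filled, so more pages exist than a densely packed file would need; that is the price paid for keeping clusters intact under dynamic operations.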

  9. CPT – Querying
     • Filtering based on lower-bounding
     • If all data objects from one page are filtered out, the page is not loaded from the data file into memory => I/O optimization
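The page-skipping idea can be sketched as a range query that consults the in-memory distance matrix first and touches the data file only for pages with at least one surviving object. The `load_page` callback, the page layout, and the toy data are assumptions for illustration; a real system would read the page from disk there:

```python
import math

def cpt_range_query(q, r, pivots, table, pages, load_page, dist):
    """Return (ids of matching objects, number of pages loaded)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    results, loads = [], 0
    for page_no, obj_ids in enumerate(pages):
        survivors = [
            j for j in obj_ids
            if max(abs(qp - row[j]) for qp, row in zip(q_to_pivots, table)) <= r
        ]
        if not survivors:
            continue  # every object on the page is filtered: no I/O at all
        loads += 1
        page = load_page(page_no)  # maps object id -> object
        results.extend(j for j in survivors if dist(q, page[j]) <= r)
    return results, loads

objects = [(0.0, 0.0), (1.0, 1.0), (8.0, 8.0), (9.0, 9.0)]
pivots = [(0.0, 0.0), (10.0, 10.0)]
table = [[math.dist(p, o) for o in objects] for p in pivots]
pages = [[0, 1], [2, 3]]  # clustered layout: near objects share a page

def load_page(page_no):
    # Stand-in for a disk read.
    return {j: objects[j] for j in pages[page_no]}

results, loads = cpt_range_query((0.5, 0.5), 1.0, pivots, table, pages,
                                 load_page, math.dist)
```

In this run the second page is never loaded, since both of its objects are rejected by the lower bound.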

  10. CPT – Querying problems
     • Problem 1 – the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object, which is not optimal for I/O cost
     • Solution – CPT does not sort objects => objects are processed sequentially
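Why sorting by lower bound can hurt I/O: objects with similar lower bounds may live on different pages, so visiting them in sorted order jumps between pages and may revisit a page. The sketch below only compares page access sequences under the two visiting orders; the lower bounds and page assignment are made-up values, and it assumes no page cache:

```python
def page_access_sequence(order, page_of):
    """Collapse an object visiting order into the sequence of pages touched."""
    seq = []
    for j in order:
        if not seq or seq[-1] != page_of[j]:
            seq.append(page_of[j])
    return seq

lower_bounds = [0.9, 0.1, 0.4, 0.2, 0.8, 0.3]  # one value per object
page_of = [0, 0, 1, 1, 2, 2]                   # page holding each object

# LAESA-style order: ascending lower bound.
sorted_order = sorted(range(len(lower_bounds)), key=lambda j: lower_bounds[j])
# CPT-style order: objects as they are laid out in the data file.
sequential_order = list(range(len(lower_bounds)))

sorted_pages = page_access_sequence(sorted_order, page_of)
sequential_pages = page_access_sequence(sequential_order, page_of)
```

The sorted order touches pages in the sequence 0, 1, 2, 1, 2, 0, while the sequential order touches each page once.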

  11. CPT – Querying problems
     • Problem 2 – in CPT the dynamic radius decreases more slowly during kNN processing
     • Solution – the first batch of objects is not clustered
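A sketch of why an unclustered first batch helps, under one reading of this slide: if the first page holds a mix of objects from across the space, the kNN radius shrinks early and later clustered pages can be skipped. The example uses 1-D values, absolute difference as the metric, a single pivot at 0, and a radius updated once per page; all of these are simplifying assumptions:

```python
import heapq

def knn_sequential(q, k, pages, pivot=0.0):
    """Sequential kNN over pages; returns (sorted (distance, object) list, pages loaded)."""
    dist = lambda a, b: abs(a - b)
    d_qp = dist(q, pivot)
    best = []   # max-heap of the k best so far, stored as (-distance, object)
    loads = 0
    for page in pages:
        radius = -best[0][0] if len(best) == k else float("inf")
        # dist(pivot, o) would come from the stored distance matrix.
        if all(abs(d_qp - dist(pivot, o)) > radius for o in page):
            continue  # whole page filtered by the current dynamic radius
        loads += 1
        for o in page:
            d = dist(q, o)
            if len(best) < k:
                heapq.heappush(best, (-d, o))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, o))
    return sorted((-nd, o) for nd, o in best), loads

# Fully clustered file: the radius shrinks page by page.
clustered = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [5.0, 5.1], [9.0, 9.1]]
# Same objects, but the first page is an unclustered mix.
mixed_first = [[1.1, 5.1], [0.0, 0.1], [1.0, 2.0], [2.1, 5.0], [9.0, 9.1]]

nn_clustered, loads_clustered = knn_sequential(5.0, 1, clustered)
nn_mixed, loads_mixed = knn_sequential(5.0, 1, mixed_first)
```

Both runs find the same nearest neighbour, but the fully clustered file loads 4 pages while the file with the mixed first page loads 2.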

  12. CPT – Querying problems (Problem 2, continued) [Figure: illustration with query object Q; not recoverable from the transcript]

  13. Experiments (1)
     • 2 real datasets
       • subset of CoPhIR, subset of Corel
     • 2 synthetic datasets
       • Cloud, PolygonSet
     • We considered several M-tree variants
       • Single/Multi-way leaf selection
       • Reinsertions
     • Measured I/O cost
       • CPT vs. PT vs. M-tree

  14. Experiments (2)

  15. Experiments (3)

  16. Conclusion
     • We have designed an I/O-optimized method for persistent pivot tables
     • Future work
       • Thorough experiments on SSDs
       • Use other metric clustering techniques

  17. Thank you