Richard Swinbank 9th July 2004 Bulk Loading the M-tree to Enhance Query Performance

Bulk Loading the M-tree to Enhance Query Performance
Alan P. Sexton & Richard Swinbank
University of Birmingham

Bulk Loading the M-tree
• The M-tree
• Hasn’t this been done already?!
• Our approach and motivation
• Outlier effects
• Symmetry and Deletion
• Conclusions

The M-tree
• Like B+ tree; multiway, paged, post-and-grow
• ‘Discriminators’ are metric balls, not intervals
• No concept of position, only distance
• Query performance depends critically on overlap
[Figure: example M-tree with entries labelled A to E and a to e]
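A minimal Python sketch of why overlap matters. A metric ball is a centre plus a covering radius, and a range query can skip a subtree only when the triangle inequality shows that the query ball and the subtree's ball are disjoint. The names `Ball`, `balls_overlap` and `may_contain_match` are illustrative, not taken from the talk, and Euclidean distance stands in here for an arbitrary metric.

```python
import math
from dataclasses import dataclass


@dataclass
class Ball:
    """A metric ball: the 'discriminator' stored in an M-tree entry."""
    center: tuple
    radius: float


def balls_overlap(a, b):
    # Two balls intersect when their centres are no further apart
    # than the sum of their radii.
    return math.dist(a.center, b.center) <= a.radius + b.radius


def may_contain_match(ball, query, query_radius):
    # Triangle inequality: if d(center, query) > radius + query_radius,
    # nothing inside the ball can lie within query_radius of the query,
    # so the whole subtree can be skipped.
    return math.dist(ball.center, query) <= ball.radius + query_radius
```

The more the balls at one level overlap, the more often `may_contain_match` holds for several siblings at once, and the more subtrees a query has to descend.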

Hasn’t this been done already?!
• Ciaccia et al., 1998
    • Seeded trees: top-down growth
    • Cheaper to build than insertion-built trees
    • Comparable query performance
• B+ tree
    • Sort data
    • Build bottom-up
• M-tree
    • Cluster data
    • Build bottom-up?
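For comparison, the B+ tree recipe (sort, then build bottom-up) can be sketched in a few lines of Python. This is an illustrative in-memory version under simplifying assumptions: each node is a hypothetical `(smallest_key, children)` pair, `fanout` is at least 2, and the input is non-empty. The M-tree has no total order to sort on, which is why clustering has to take the place of the sort.

```python
def bplus_bulk_load(keys, fanout):
    """Pack sorted keys into leaves, then stack parent levels on top."""
    ordered = sorted(keys)
    # Leaf level: consecutive runs of at most `fanout` keys.
    level = [(ordered[i], ordered[i:i + fanout])
             for i in range(0, len(ordered), fanout)]
    height = 1
    # Each pass groups up to `fanout` nodes under one parent, keyed by
    # the smallest key beneath it, until a single root remains.
    while len(level) > 1:
        level = [(level[i][0], level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        height += 1
    return level[0], height
```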

Bulk Loading the M-tree
• 25% to 40% query performance gain
• Top: 1-NN query results
• Bottom: leaf radii for related trees

Closest-pair clustering
• Requirements
    • Upper (CMAX) and lower (CMAX/2) bound on cardinality
    • Minimise overlap of metric representation
• Algorithm
    • Take closest pair of clusters (c1, c2)
    • If |c1| + |c2| <= CMAX, merge; otherwise remove larger cluster from working set
    • Repeat until working set is empty
• Outlier effects
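The merge loop above can be sketched in Python as follows. Several details are assumptions made for illustration rather than taken from the talk: the distance between two clusters is measured between their medoids, a medoid is the member with the smallest covering radius, the closest pair is found by brute force, and the last cluster left in the working set is simply emitted. The sketch enforces the upper bound CMAX only; it does not repair clusters that end up below CMAX/2.

```python
import math
from itertools import combinations


def medoid(cluster):
    # Assumed definition: the member whose farthest fellow member is nearest.
    return min(cluster, key=lambda p: max(math.dist(p, q) for q in cluster))


def closest_pair_clustering(points, cmax):
    working = [[p] for p in points]   # start from singleton clusters
    finished = []
    while len(working) > 1:
        # Brute-force search for the closest pair of clusters (i < j).
        i, j = min(
            combinations(range(len(working)), 2),
            key=lambda ij: math.dist(medoid(working[ij[0]]),
                                     medoid(working[ij[1]])),
        )
        c1, c2 = working[i], working[j]
        if len(c1) + len(c2) <= cmax:
            working[i] = c1 + c2      # merge the pair
            del working[j]
        else:
            # Merging would exceed CMAX: retire the larger cluster.
            k = i if len(c1) >= len(c2) else j
            finished.append(working.pop(k))
    finished.extend(working)          # emit whatever is left
    return finished
```

Every pass shrinks the working set by one cluster, so the loop always terminates. Recomputing medoids on each pass keeps the sketch short but is far too slow for real data sets.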

Outlier effects
[Figure: two panels, M-tree insertion and closest-pair clustering]

Bulk Loading
• Use closest-pair clustering to prepare a full level
• Accumulate primary medoids to populate next level up
• Algorithm
    • Cluster points
    • On-the-fly:
        • Write output clusters to disk: M-tree nodes
        • Generate parent entries: points for next level up
    • Repeat until next level is a single page
• Bottom-up growth
• Subtree containment
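A rough Python sketch of the level-by-level loop, kept in memory rather than written to disk. The clusterer is passed in as a parameter; `chunk_cluster` is only a naive stand-in so that the example runs on its own, and is not the closest-pair clustering from the talk. Entries are hypothetical dicts with `center`, `radius` and `child` fields. Each parent radius is chosen to cover the child balls in full, which is one simple way to obtain subtree containment.

```python
import math


def chunk_cluster(entries, cmax):
    # Stand-in clusterer (an assumption): sort by centre, cut into runs.
    ordered = sorted(entries, key=lambda e: e["center"])
    return [ordered[i:i + cmax] for i in range(0, len(ordered), cmax)]


def make_parent_entry(node):
    # Radius needed for a candidate centre to cover every child ball.
    def cover(c):
        return max(math.dist(c["center"], e["center"]) + e["radius"]
                   for e in node)
    best = min(node, key=cover)       # medoid: smallest covering radius
    return {"center": best["center"], "radius": cover(best), "child": node}


def bulk_load(points, cmax, cluster=chunk_cluster):
    # Assumes cmax >= 2 and a clusterer that returns fewer groups than
    # it receives entries; otherwise the loop would not terminate.
    level = [{"center": p, "radius": 0.0, "child": None} for p in points]
    while len(level) > cmax:          # until the level fits in one page
        nodes = cluster(level, cmax)  # these would be written to disk
        level = [make_parent_entry(n) for n in nodes]
    return level                      # entries of the root node
```

Because every level is built from the complete level below it, all leaves end up at the same depth.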

Subtree containment on Bulk Load

Subtree containment on Insert

SM-tree vs. M-tree

Conclusions
• Closest-pair clustering algorithm
    • Mitigates outlier effects
    • Improves query performance
• Bulk loading algorithm
    • Bottom-up, balanced growth
    • Insert/Delete symmetry: SM-tree
• Further work
• Questions?
