Tree-based indexing methods for similarity search in metric and nonmetric spaces

Department of Software Engineering

Faculty of Mathematics and Physics

Charles University in Prague

Mgr. Jakub Lokoč

Supervisor: Doc. RNDr. Tomáš Skopal, Ph.D.

MFF UK, Prague

- Introduction
- Similarity search
- M-tree

- Contributions & Results
- Metric search
- Nonmetric search

- Outlook

- How to search in large collections of unstructured data?
- We cannot use relational databases or textual annotation
- Content-based similarity searching
- Similarity → distance function δ → metric vs. nonmetric search
- Feature extraction → feature space

- Problems of similarity searching
- Effectiveness → selection of complex descriptors and an (often expensive) distance function (not a database problem)
- Efficiency → indexing → exact vs. approximate search

[Figure: feature extraction and similarity evaluation]

- δ is metric
- Allows indexing by metric access methods (e.g., M-tree)
- Objects can be organized into separate clusters

- δ is nonmetric
- Robust similarity functions suitable for domain experts
- Not constrained by the metric axioms, but metric access methods then support only approximate search

- In our work, we have focused on FAST similarity search in metric and nonmetric spaces using the M-tree

[Figure: range query Q (Euclidean 2D space)]

- Structure and properties
- Dynamic, balanced, and paged tree structure (like, e.g., the B+-tree or R-tree)
- The leaves are clusters of indexed objects Oj (ground objects)
- Routing entries in the inner nodes represent hyper-spherical metric regions (Oi, rOi), recursively bounding the object clusters in the leaves
- The triangle inequality allows irrelevant M-tree branches (i.e., metric regions) to be discarded during query evaluation
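
The pruning rule can be illustrated with a minimal sketch. It is not the M-tree implementation: it assumes a hypothetical two-level structure in which each routing entry is a tuple `(center, radius, objects)`, and the names `make_entry` and `range_query` are invented for this example. A region is skipped when the triangle inequality guarantees that none of its objects can lie within the query radius.

```python
import math

def dist(a, b):
    """Euclidean distance; it is a metric, so the triangle inequality holds."""
    return math.dist(a, b)

def make_entry(objects):
    """Build a hypothetical routing entry (center, covering radius, leaf objects)."""
    center = objects[0]
    radius = max(dist(center, o) for o in objects)
    return (center, radius, objects)

def range_query(entries, q, r):
    """Return the objects within distance r of q and the number of distance computations."""
    result, computations = [], 0
    for center, radius, objects in entries:
        d_center = dist(q, center)
        computations += 1
        # Triangle inequality: every o in the region has d(q, o) >= d(q, center) - radius.
        if d_center - radius > r:
            continue  # the region cannot intersect the query ball, so skip the leaf
        for o in objects:
            computations += 1
            if dist(q, o) <= r:
                result.append(o)
    return result, computations

entries = [
    make_entry([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]),
    make_entry([(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]),
]
hits, cost = range_query(entries, q=(0.5, 0.5), r=1.0)
print(hits, cost)
```

In this toy setting the second region is discarded after a single distance computation, so the query needs 5 computations instead of the 6 a sequential scan would need.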

- New construction techniques
- Forced reinserting
- Hybrid-way leaf selection
- Parallel dynamic batch loading

- Nonmetric search
- M-tree variant - NM-tree

- Insert new object O11
- Remove O8 and O6 and insert them into the stack
- Decrease the region’s radius (to O11)
- Insert O6 from the stack
- Remove O2 and insert it into the stack
- Decrease the region’s radius (to O6)
- Insert O2 from the stack
- Insert O8 from the stack

[Figure: reinsertion example with objects O1 to O11 and the STACK of removed objects]

- First phase of insertion: find a suitable leaf for the new object
- Classic selection strategies
- Single-way: fast indexing, less compact hierarchy
- Multi-way: slower indexing, more compact hierarchy

- Our approach
- User controls how many branches are visited
- Finds a suboptimal leaf node
- May return a full leaf node
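
The idea of a user-controlled branch budget can be sketched as follows. This is only an illustration under stated assumptions, not the published leaf selection algorithm: the `Node` layout, the `capacity` field, the ranking of children by distance to their routing object, and the preference for non-full leaves are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    center: tuple                                  # routing object (hypothetical layout)
    radius: float                                  # covering radius
    children: list = field(default_factory=list)   # inner node: child nodes
    objects: list = field(default_factory=list)    # leaf node: ground objects
    capacity: int = 3                              # assumed leaf capacity

def choose_leaf(node, obj, max_branches):
    """Return (distance to leaf center, leaf), visiting at most max_branches children per inner node."""
    if not node.children:
        return math.dist(obj, node.center), node
    # Rank the children by the distance from obj to their routing object.
    ranked = sorted(node.children, key=lambda c: math.dist(obj, c.center))
    candidates = [choose_leaf(c, obj, max_branches) for c in ranked[:max_branches]]
    # Prefer leaves with free space; if every candidate is full, a full leaf is returned.
    free = [c for c in candidates if len(c[1].objects) < c[1].capacity]
    return min(free or candidates, key=lambda c: c[0])

leaf_a = Node((0.0, 0.0), 1.0, objects=[(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)])  # full leaf
leaf_b = Node((2.0, 0.0), 1.0, objects=[(2.0, 0.0)])                          # has free space
root = Node((1.0, 0.0), 3.0, children=[leaf_a, leaf_b])

_, single = choose_leaf(root, (0.8, 0.0), max_branches=1)
_, wider = choose_leaf(root, (0.8, 0.0), max_branches=2)
print(single is leaf_a, wider is leaf_b)
```

With a budget of one branch the sketch behaves like a single-way descent and returns the nearest leaf even though it is full; with a budget of two it also sees the second leaf and can pick the one with free space.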

CoPhIR (color layout and structure), dimension 76, database size 250,000

1. Aggregation
2. Parallel batch loading
3. Traditional inserting

Not inserted objects:
- “Split generating”: will be inserted in the traditional way (exploiting limited parallelism)
- Postponed: will be inserted during the next batch

- To find scalability bottlenecks, we measured
- Parallel batch loading time (PI)
- Time of traditional inserts causing a split (ICS)
- Time of traditional inserts not causing a split (INCS)

- CoPhIR, database size 1,000,000
- Dimension 76 (12 + 64)
- L5.123456 distance
- Inner/leaf node size: 24 / 25
- Cache size: 512 MB

- Metric properties can be too restrictive
- The triangle inequality is the one most often questioned

- Semimetric distances (e.g., in molecular biology)
- But how to search efficiently?

Identity: δ(x, y) = 0 ⇔ x = y

Non-negativity: δ(x, y) ≥ 0

Symmetry: δ(x, y) = δ(y, x)

Triangle inequality: δ(x, z) ≤ δ(x, y) + δ(y, z)
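
As a small illustration, the four properties can be tested empirically on a finite sample. The helper `check_axioms` and the choice of squared Euclidean distance as a semimetric example are assumptions made here, not taken from the slides; passing the check on a sample is evidence only, not a proof.

```python
import itertools
import math

def check_axioms(d, points, eps=1e-12):
    """Test the four metric properties of d on a finite sample of points."""
    report = {"identity": True, "non_negativity": True, "symmetry": True, "triangle": True}
    for x, y in itertools.product(points, repeat=2):
        if (d(x, y) <= eps) != (x == y):
            report["identity"] = False
        if d(x, y) < 0:
            report["non_negativity"] = False
        if abs(d(x, y) - d(y, x)) > eps:
            report["symmetry"] = False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + eps:
            report["triangle"] = False
    return report

def squared_l2(a, b):
    """Squared Euclidean distance: symmetric and non-negative, but not a metric."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

points = [(0.0,), (1.0,), (2.0,)]
euclidean_report = check_axioms(math.dist, points)
squared_report = check_axioms(squared_l2, points)
print(euclidean_report)
print(squared_report)
```

On these three collinear points the squared distance gives 4 between the end points but only 1 + 1 via the middle point, so only the triangle inequality fails.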

- Related work
- Metric access methods (MAMs) can employ a semimetric dS for approximate search
- Semimetric behavior can be tuned by transformation functions f (e.g., we can turn a semimetric into a metric dM = fM(dS))
- More metric behavior: more precise but slower search
- Less metric behavior: less precise but faster search

- However, the M-tree is fixed to the employed (semi)metric (black-box distance)
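
The tuning effect can be sketched by counting triangle inequality violations before and after a transformation. The slides do not name the functions f; the concave power modifier f(x) = x^(1/w) and the squared Euclidean semimetric below are assumptions chosen for illustration.

```python
import itertools
import random

def squared_l2(a, b):
    """Squared Euclidean distance, used here as an example semimetric dS."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def modified(d, w):
    """Apply the assumed concave modifier f(x) = x ** (1 / w), w >= 1, to distance d."""
    return lambda a, b: d(a, b) ** (1.0 / w)

def triangle_violations(d, points):
    """Count ordered triplets (x, y, z) with d(x, z) > d(x, y) + d(y, z)."""
    count = 0
    for x, y, z in itertools.permutations(points, 3):
        if d(x, z) > d(x, y) + d(y, z) + 1e-9:
            count += 1
    return count

random.seed(0)
points = [(random.random(), random.random()) for _ in range(30)]

raw = triangle_violations(squared_l2, points)                    # untouched semimetric
partial = triangle_violations(modified(squared_l2, 1.5), points) # partly "metricized"
metric = triangle_violations(modified(squared_l2, 2.0), points)  # square root gives plain L2
print(raw, partial, metric)
```

For w = 2 the modifier turns the squared distance into the Euclidean metric, so no violations remain; smaller w leaves some violations, which corresponds to the faster but less precise end of the trade-off.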

- The trick
- We use inversely symmetric transformation functions: dS = f^-1(f(dS))
- fei and fM are evaluated in the initial phase
- We index data using dM = fM(dS) (to allow exact searching)
- Stored distances dM can be transformed back to dS = fM^-1(dM)
- Retrieval precision ei is set at query time
- dei = fei(fM^-1(fM(dS))) or just dei = fei(dS)
- Metric search in upper levels (by dM)
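
The round trip between stored and query-time distances can be checked numerically. The power modifier f(x) = x^(1/w) with inverse x^w is a hypothetical stand-in for fM and fei, and the weights 2.0 and 1.25 are arbitrary values chosen for this sketch.

```python
def make_power_modifier(w):
    """Return (f, f_inverse) for the assumed modifier f(x) = x ** (1 / w)."""
    return (lambda x: x ** (1.0 / w)), (lambda x: x ** w)

f_m, f_m_inv = make_power_modifier(2.0)   # stands in for fM (indexing time)
f_e, _ = make_power_modifier(1.25)        # stands in for fei (query time)

d_s = 0.36                        # some semimetric distance dS
d_m = f_m(d_s)                    # value stored in the index: dM = fM(dS)
recovered = f_m_inv(d_m)          # dS = fM^-1(dM)
d_e_from_index = f_e(recovered)   # dei = fei(fM^-1(fM(dS)))
d_e_direct = f_e(d_s)             # dei = fei(dS)
print(d_m, recovered, d_e_from_index, d_e_direct)
```

Because the modifier is invertible, distances stored under fM lose no information: the query-time distance computed from the index agrees with the one computed directly from dS, up to floating-point rounding.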

- Metric search
- Combination of more sophisticated M-tree construction techniques and parallelism
- Adopting the techniques to M-tree descendants
- Employ as a dynamic clustering technique

- Nonmetric search
- Finding better "nonmetric to metric" transformation functions
- Reuse other MAMs for nonmetric search

Ciaccia, P., Patella, M., and Zezula, P.

M-tree: An Efficient Access Method for Similarity Search in Metric Spaces

VLDB 1997

Zezula, P., Savino, P., Rabitti, F., Amato, G., and Ciaccia, P.

Processing M-Tree with Parallel Resources

EDBT 1998

Skopal, T., Pokorny, J., Kratky, M., and Snasel, V.

Revisiting M-tree Building Principles

ADBIS 2003, LNCS 2798, Springer

Skopal T.

Unified Framework for Fast Exact and Approximate Search in Dissimilarity Spaces

TODS 2007, ACM

Lokoc J. and Skopal T.

On Reinsertions in M-tree

SISAP 2008, IEEE

Skopal T. and Lokoc J.

NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces

DEXA 2008, LNCS 5181, Springer

Skopal T. and Lokoc J.

New Dynamic Construction Techniques for M-tree

Journal of Discrete Algorithms, Elsevier 2009

Lokoc J.

Parallel Dynamic Batch Loading in the M-tree

SISAP 2009, IEEE

J. Novák, T. Skopal, D. Hoksza, J. Lokoč

Improving the Similarity Search of Tandem Mass Spectra using Metric Access Methods

SISAP 2010, ACM

J. Lokoč, T. Skopal

On Applications of Parameterized Hyperplane Partitioning

SISAP 2010, ACM

T. Skopal, J. Lokoč

Answering Metric Skyline Queries by PM-tree

DATESO 2010, CEUR

- T. Skopal, J. Lokoč, B. Bustos
- D-cache: Universal Distance Cache for Metric Access Methods
- Major revision, Transactions on Knowledge and Data Engineering

- Lokoč, J. and Skopal, T. 2008. On Reinsertions in M-tree. In SISAP ’08: Proceedings of the First International Workshop on Similarity Search and Applications. IEEE Computer Society, Washington, DC, USA, 121–128.
- Roberto Uribe Paredes, Gonzalo Navarro. EGNAT: A Fully Dynamic Metric Access Method for Secondary Memory. In SISAP ’09: Proceedings of the Second International Workshop on Similarity Search and Applications, pp. 57-64, August 29-30, 2009, Prague, Czech Republic
- Marcos R. Vieira, Fabio J. T. Chino, Agma J. M. Traina, Caetano Traina Jr. Revisiting the DBM-Tree. Journal of Information and Data Management, Vol 1, No 1 (2010)
- Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs, International Joint Conference on Computational Sciences and Optimization, IEEE, 2010

- Skopal, T. and Lokoč, J. 2009. New Dynamic Construction Techniques for M-tree. Journal of Discrete Algorithms, Elsevier 7 (1): 62–77.
- Marcos R. Vieira, Fabio J. T. Chino, Agma J. M. Traina, Caetano Traina Jr. Revisiting the DBM-Tree. Journal of Information and Data Management, Vol 1, No 1 (2010)
- Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs, International Joint Conference on Computational Sciences and Optimization, IEEE, 2010
- Kaster D., Bueno R., Bugatti P., Traina A., Traina C. Jr., Incorporating Metric Access Methods for Similarity Searching on Oracle Database, SBBD 2009

- Lokoč, J. 2009. Parallel Dynamic Batch Loading in the M-tree. In SISAP ’09: Proceedings of the Second International Workshop on Similarity Search and Applications, pp. 117-123, August 29-30, 2009, Prague, Czech Republic
- Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs, International Joint Conference on Computational Sciences and Optimization, IEEE, 2010

Thank you for your attention

- σmax is not defined
- σmax is the maximal distance in the distance space

- Similarity join (SJ) is not a multi-example query type
- I agree: SJ is rather a complex operator consisting of multiple single-example queries

- What other costs must be taken into account?
- When the distance function is cheap (e.g., Lp metrics), we have to take into account the internal overhead of the particular MAM (e.g., pivot space filtering in pivot tables)

- Missing database size for figure 1.14
- DbSize = 100,000

- How are leaf node overflows during stack processing solved in conservative reinsertions?
- We perform a regular split

- If hybrid-way (HW) leaf selection is unsuccessful, single-way (SW) leaf selection is used. Does SW leaf selection employ pre-computed distances from HW?
- We do not reuse distances from HW leaf selection: it is usually successful, so we have kept the algorithm simple (which reduces internal CPU costs)
- Moreover, this can be solved by the D-cache (see publications)

- How is the number of dimensions (x axis) changed in figure 3.6?
- We used a 76-dimensional vector concatenated from two features (12 + 64) and took prefixes of this vector

- What causes the fluctuations in query costs in figure 3.9?
- Reinserting behavior is chaotic with respect to the increasing number of removed objects

- A radius change can be propagated to the upper levels of the M-tree; how is this process synchronized?
- The radius is not propagated to the upper levels (to improve parallel performance), but this is a topic of our future work

- What algorithms have been used during the first two steps of the parallel batch loading iteration?
- In the first step, we just used a simple list to aggregate new objects. In the second step, each thread used SW leaf selection, with exclusive locks for radius updates.

- What is the motivation for the random heuristic?
- The random heuristic can be faster when the distance measure is cheap. Moreover, we wanted to test whether randomly selected objects cause more splits.

- DB size is 1,000,000 and batch size is 200, so why is the number of iterations > 5000?
- This is because not all objects from the batch are inserted during one iteration.
- ICS and INCS stand for the number of real insertions (ICS = number of leaf node splits)

- What is residue time?
- Residue aggregates real-time overhead and I/O cost.

All other comments will be addressed in the online version, and I am grateful for them.