
Similarity Search without Tears: the OMNI-Family of All-Purpose Access Methods

Similarity Search without Tears: the OMNI-Family of All-Purpose Access Methods. Michael Kelleher, Kiyotaka Iwataki. The Department of Computer and Information Science and Engineering, University of Florida. Outline: Problem/Solution, Background, The Omni-concept, Members of the Omni-family


Presentation Transcript


  1. Similarity Search without Tears: the OMNI-Family of All-Purpose Access Methods • Michael Kelleher • Kiyotaka Iwataki • The Department of Computer and Information Science and Engineering, University of Florida

  2. Outline • Problem/Solution • Background • The Omni-concept • Members of the Omni-family • Experimental Results

  3. Problem • Diverse and complex data • How to search • Expensive distance calculations

  4. Solution • Reduce the number of distance calculations • The Omni-Concept/Family • Select a set of foci • Gauge all other objects with their distance from this set • The foci increase the pruning of distance calculations • Scalable

  5. Background: Metric Spaces • Given a set of objects S = {s1,s2,s3,…,sn} drawn from some domain, a distance function d() is a metric if it has the following properties: • Symmetry: d(s1,s2) = d(s2,s1) • Non-negativity: 0 < d(s1,s2) < ∞ for s1 ≠ s2, and d(s1,s1) = 0 • Triangle inequality: d(s1,s3) ≤ d(s1,s2) + d(s2,s3) • A metric space is a pair M = <S,d()> • Spatial datasets with an Lp distance function are special cases of metric spaces.

  6. Range and NN Queries • Range: Given a query object sq, and a max search distance rq: Rquery(sq,rq)= {si | si ∈ S: d(si,sq) ≤ rq} • NN: Given a query object sq ∈ S: NNquery(sq)= {sn ∈ S | ∀si ∈ S: d(sn,sq) ≤ d(si,sq)}
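
The two query definitions above can be written down directly as linear scans. This is only a baseline sketch: it assumes Euclidean distance (`math.dist`) as the metric d() and small 2-D tuples as objects, neither of which comes from the slides. Every object costs one distance calculation here, which is exactly the cost the Omni techniques try to avoid.

```python
from math import dist  # Euclidean distance, used here as one possible metric d()

def range_query(S, sq, rq, d=dist):
    """Rquery(sq, rq): every object of S within distance rq of sq."""
    return [si for si in S if d(si, sq) <= rq]

def nn_query(S, sq, d=dist):
    """NNquery(sq): an object of S at minimal distance from sq."""
    return min(S, key=lambda si: d(si, sq))

# Hypothetical toy dataset, for illustration only.
S = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (6.0, 8.0)]
print(range_query(S, (0.0, 0.0), 5.0))
print(nn_query(S, (2.5, 3.5)))
```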

  7. Current solutions • Metric tree of Uhlmann • Vantage-point tree • Generalized hyper-plane tree • Multi-vantage point tree • Geometric Near Access tree • The M-tree

  8. Intrinsic Dimensionality • Some approaches assume that the embedding dimensionality of a dataset defines its behavior on a query. • Datasets can inhabit only a small portion of the embedding space. • The intrinsic dimensionality gives a more precise estimate of selectivity. • Use the correlation fractal dimension D2 as an approximation of the intrinsic dimension.

  9. Omni-concepts • Omni-foci base (F): Given a metric space M = <S,d()>, F = {f1,f2,…,fl | fk ∈ S, fk ≠ fj for k ≠ j, l ≤ N}, where N is the number of objects in the dataset • Omni-coordinates (Ci): Ci = {<fk, d(fk,si)>, for all fk ∈ F} • mbOr: Given F and a collection of objects A = {x1,x2,…,xn} ⊂ S, the mbOr is the intersection of the metric intervals, RA = I1 ∩ I2 ∩ … ∩ Il, where Ii = [min over j of d(xj,fi), max over j of d(xj,fi)], 1 ≤ i ≤ l, 1 ≤ j ≤ n.
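
As a minimal sketch of these definitions, the code below computes the Omni-coordinates of an object and the mbOr of a collection as one [min, max] interval per focus. The Euclidean metric, the function names, and the sample points are assumptions made for illustration.

```python
from math import dist  # assumed metric d(); any metric would do

def omni_coordinates(s, foci, d=dist):
    """Ci: the distance from object s to every focus fk in F."""
    return [d(f, s) for f in foci]

def mbor(A, foci, d=dist):
    """One interval [min d(xj, fi), max d(xj, fi)] per focus fi."""
    coords = [omni_coordinates(x, foci, d) for x in A]
    return [(min(c[i] for c in coords), max(c[i] for c in coords))
            for i in range(len(foci))]

def inside_mbor(coords, region):
    """True if every coordinate falls inside its interval."""
    return all(lo <= c <= hi for c, (lo, hi) in zip(coords, region))

# Hypothetical foci and objects.
foci = [(0.0, 0.0), (10.0, 0.0)]
A = [(3.0, 4.0), (6.0, 8.0)]
region = mbor(A, foci)
print(region)
print(inside_mbor(omni_coordinates((4.5, 6.0), foci), region))
```

An object whose coordinates fall outside any one interval cannot belong to the collection, which is what makes the region usable as a cheap filter.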

  10. [Figure: diagram with distance labels df1a, df1b, df2a, df2b]

  11. Cardinality of F • A good cardinality for F lies between ceil(D2)+1 and 2*ceil(D2)+1, where ceil(D2) is the intrinsic dimension rounded up to the next integer.
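
The rule of thumb is simple arithmetic. The sketch below applies it to a made-up value D2 = 4.7, which is not taken from the presentation.

```python
from math import ceil

def foci_cardinality_range(D2):
    """Suggested bounds on |F| given the correlation fractal dimension D2."""
    low = ceil(D2) + 1
    high = 2 * ceil(D2) + 1
    return low, high

# Hypothetical intrinsic dimension: ceil(4.7) = 5, so the range is 6 to 11 foci.
print(foci_cardinality_range(4.7))
```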

  12. How to choose foci: HF-Algorithm • [Figure: example with objects s1 to s6 and the distances between them]

  13. HF-Algorithm • The HF-Algorithm is practical: O(N) in the number of objects • Requires l*N distance calculations • Finding the best foci by exhaustive search would cost O(N!/(N-l)!)
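
The slides do not spell out the steps of the HF-Algorithm, so the following is only a sketch of one farthest-point heuristic in that spirit, not the authors' procedure. It assumes the first two foci are objects near the border of the dataset, found by two farthest-object scans, and that each further focus is the object whose distances to the foci chosen so far are closest to the distance between the first two. The function name, the seed parameter, and the sample data are hypothetical. For a fixed l, the number of distance calculations grows linearly with N.

```python
import random
from math import dist  # assumed metric d()

def hf_foci(S, l, d=dist, seed=0):
    """Pick l foci from S with a farthest-point heuristic (illustrative only)."""
    rng = random.Random(seed)
    start = rng.choice(S)
    f1 = max(S, key=lambda s: d(start, s))  # farthest object from a random start
    f2 = max(S, key=lambda s: d(f1, s))     # farthest object from f1
    foci = [f1, f2]
    edge = d(f1, f2)
    error = [0.0] * len(S)  # error[i] = sum over chosen foci of |edge - d(f, S[i])|
    newest = [f1, f2]       # foci not yet accounted for in error
    while len(foci) < l:
        for f in newest:
            for i, s in enumerate(S):
                error[i] += abs(edge - d(f, s))
        candidates = [i for i, s in enumerate(S) if s not in foci]
        if not candidates:
            break
        best = min(candidates, key=lambda i: error[i])
        foci.append(S[best])
        newest = [S[best]]
    return foci[:l]

# Hypothetical dataset with two obvious extremes.
S = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(hf_foci(S, 2))
```

The sketch assumes S holds at least two distinct objects; with fewer, the two farthest-object scans return the same object.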

  14. Omni-sequential • Calculate and store Ci for every object si • Precede each distance calculation with a check over the foci: for each fk ∈ F, if |dfk(si) - dfk(sq)| > rq, then skip the distance calculation for si.
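
The check is safe because of the triangle inequality: for any focus fk, |d(fk,si) - d(fk,sq)| ≤ d(si,sq), so a difference larger than rq proves that si lies outside the query radius. A minimal sketch of the resulting range query follows, assuming Euclidean distance and hypothetical foci and data; it also counts how many real distance calculations survive the filter.

```python
from math import dist  # assumed metric d()

def build_omni_coordinates(S, foci, d=dist):
    """Precompute Ci for every object, done once when the data is loaded."""
    return [[d(f, s) for f in foci] for s in S]

def omni_sequential_range(S, coords, foci, sq, rq, d=dist):
    """Range query that skips objects ruled out by their Omni-coordinates."""
    q_coords = [d(f, sq) for f in foci]  # l distance calculations for the query
    result, computed = [], 0
    for s, c in zip(S, coords):
        # Any focus with |dfk(si) - dfk(sq)| > rq rules the object out.
        if any(abs(ci - qi) > rq for ci, qi in zip(c, q_coords)):
            continue
        computed += 1  # the expensive distance is only computed for survivors
        if d(s, sq) <= rq:
            result.append(s)
    return result, computed

# Hypothetical data and foci.
S = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (6.0, 8.0), (9.0, 1.0)]
foci = [(0.0, 0.0), (10.0, 0.0)]
coords = build_omni_coordinates(S, foci)
print(omni_sequential_range(S, coords, foci, (1.0, 0.0), 1.5))
```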

  15. OmniB+-tree • Store Ci in l B+-trees, one for each focus • Subsets Ik ⊂ S are retrieved from the corresponding B+-trees and used to generate the mbOr. • Ik contains the objects whose distance to fk lies between dfk(sq) - rq and dfk(sq) + rq • Calculate the distance from sq to each object in the intersection of the Ik.
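
The same idea can be sketched without a real B+-tree by keeping one sorted list of (distance, object id) pairs per focus and using binary search for the interval lookup. The sorted lists are a stand-in chosen for brevity and are not the structure described on the slide; the metric, names, and data are also assumptions.

```python
from bisect import bisect_left, bisect_right
from math import dist  # assumed metric d()

def build_indexes(S, foci, d=dist):
    """One sorted list of (dfk(si), i) per focus, standing in for a B+-tree."""
    return [sorted((d(f, s), i) for i, s in enumerate(S)) for f in foci]

def omni_btree_range(S, indexes, foci, sq, rq, d=dist):
    """Intersect the per-focus intervals, then verify the survivors."""
    candidates = None
    for f, index in zip(foci, indexes):
        dq = d(f, sq)
        # Entries whose distance to this focus lies in [dq - rq, dq + rq].
        lo = bisect_left(index, (dq - rq, -1))
        hi = bisect_right(index, (dq + rq, len(S)))
        ids = {i for _, i in index[lo:hi]}
        candidates = ids if candidates is None else candidates & ids
    # Only objects in the intersection need a real distance calculation.
    return [S[i] for i in sorted(candidates) if d(S[i], sq) <= rq]

# Hypothetical data and foci (at least one focus is assumed).
S = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (6.0, 8.0), (9.0, 1.0)]
foci = [(0.0, 0.0), (10.0, 0.0)]
indexes = build_indexes(S, foci)
print(omni_btree_range(S, indexes, foci, (1.0, 0.0), 1.5))
```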

  16. OmniR-tree • The algorithms for insertion, node partitioning, and range queries are the same as for the underlying R-tree. • kNN queries require the NN algorithm used in metric trees. A depth-first search is performed first to find k candidates. The search then keeps reducing the radius whenever the furthest neighbor is replaced, until every entry that overlaps the query radius has been tested.
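
The shrinking-radius idea can be illustrated without a tree. The sketch below runs it over a flat list of Omni-coordinates rather than an R-tree traversal, so it shows only the pruning logic, under the assumptions of a Euclidean metric and hypothetical data: the radius starts unbounded, becomes the distance to the current k-th candidate once k candidates exist, and shrinks each time that candidate is replaced.

```python
import heapq
from math import dist, inf  # dist is the assumed metric d()

def omni_knn(S, coords, foci, sq, k, d=dist):
    """k nearest neighbors of sq, nearest first, pruning with Omni-coordinates."""
    q = [d(f, sq) for f in foci]
    heap = []     # max-heap of the k best so far, stored as (-distance, index)
    radius = inf  # distance to the current k-th candidate
    for i, (s, c) in enumerate(zip(S, coords)):
        # An object outside the current radius for any focus cannot improve the result.
        if any(abs(ci - qi) > radius for ci, qi in zip(c, q)):
            continue
        ds = d(s, sq)
        if len(heap) < k:
            heapq.heappush(heap, (-ds, i))
        elif ds < -heap[0][0]:
            heapq.heapreplace(heap, (-ds, i))  # the furthest neighbor is replaced
        if len(heap) == k:
            radius = -heap[0][0]  # the radius shrinks as candidates improve
    return [S[i] for _, i in sorted(heap, reverse=True)]

# Hypothetical data and foci.
S = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (6.0, 8.0), (9.0, 1.0)]
foci = [(0.0, 0.0), (10.0, 0.0)]
coords = [[dist(f, s) for f in foci] for s in S]
print(omni_knn(S, coords, foci, (1.0, 0.0), 3))
```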

  17. OmniR-tree • Requires an R-tree to store the Ci • Requires a paged direct-access file to store the objects in the dataset. • When a leaf of the R-tree is retrieved and the Ci stored in that node qualify an object, the actual distance to that object is calculated.

  18. Graphs show that the intrinsic dimensionality of the data is a good reference for the number of foci.

  19. Review • Reduce the number of distance calculations • The Omni-Family • Select a set of foci • Gauge all other objects with their distance from this set • The foci increase the pruning of distance calculations • Scalable
