Christian Böhm

Feature Based Similarity



Christian Böhm¹, Florian Krebs², and Hans-Peter Kriegel². ¹University for Health Informatics and Technology, Innsbruck; ²University of Munich. Optimal Dimension Order: A Generic Technique for the Similarity Join. Feature Based Similarity. Simple Similarity Queries.

Presentation Transcript

Christian Böhm¹, Florian Krebs², and Hans-Peter Kriegel²
¹ University for Health Informatics and Technology, Innsbruck
² University of Munich
Optimal Dimension Order: A Generic Technique for the Similarity Join



Simple Similarity Queries

  • Specify query object and

    • Find similar objects – range query

    • Find the k most similar objects – nearest neighbor query
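
The two query types above can be illustrated with a plain linear scan. This is only a minimal sketch: the function names `range_query` and `knn_query` are hypothetical, Euclidean distance is assumed as the similarity measure, and no index structure is used.

```python
import heapq
import math


def range_query(data, query, eps):
    """Return all objects whose distance to the query object is at most eps."""
    return [x for x in data if math.dist(x, query) <= eps]


def knn_query(data, query, k):
    """Return the k objects closest to the query object, nearest first."""
    return heapq.nsmallest(k, data, key=lambda x: math.dist(x, query))


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
    print(range_query(points, (0.0, 0.0), 1.0))
    print(knn_query(points, (0.0, 0.0), 2))
```

Both functions take a single query object, which is what distinguishes them from the similarity join discussed on the following slides, where every object of one set acts as a query against the other set.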


Join Applications: Catalogue Matching

  • Catalogue matching

    • E.g. Astronomy catalogues


Join Applications: Clustering

  • Clustering (e.g. DBSCAN)

  • Similarity self-join


R-Tree Similarity Join

  • Depth-first traversal of two trees [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]



The ε-kdB-Tree

[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

  • Assumption: two adjacent ε-stripes fit in main memory

  • Unrealistic for large data sets which are ...

    • clustered,

    • skewed and

    • high-dimensional


Epsilon Grid Order

[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]


Common Properties

  • Decomposition of data/space into regions

  • Regions described by hyper-rectangles
      for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
          for each pair of points (p,q) on (P,Q)
              test dist(p,q) ≤ ε;

  • Most CPU effort is spent in the distance test between vectors ⇒ Idea: speed up the distance test
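
The common scheme above can be sketched as a nested loop over partition pairs. The sketch assumes that each partition is a plain list of points, that dist(P,Q) is the minimum Euclidean distance between the two bounding hyper-rectangles, and that the helper names (`bounding_box`, `box_mindist`, `similarity_join`) are hypothetical; the actual algorithms differ in how they form and pair partitions.

```python
import math


def bounding_box(points):
    """Axis-parallel bounding hyper-rectangle of a partition as (lower, upper)."""
    lower = tuple(min(coords) for coords in zip(*points))
    upper = tuple(max(coords) for coords in zip(*points))
    return lower, upper


def box_mindist(lo_p, hi_p, lo_q, hi_q):
    """Minimum Euclidean distance between two hyper-rectangles."""
    total = 0.0
    for lp, hp, lq, hq in zip(lo_p, hi_p, lo_q, hi_q):
        gap = max(lq - hp, lp - hq, 0.0)  # 0 if the projections overlap
        total += gap * gap
    return math.sqrt(total)


def similarity_join(partitions_r, partitions_s, eps):
    """Report all point pairs (p, q) with dist(p, q) <= eps."""
    result = []
    for part_p in partitions_r:
        lo_p, hi_p = bounding_box(part_p)
        for part_q in partitions_s:
            lo_q, hi_q = bounding_box(part_q)
            if box_mindist(lo_p, hi_p, lo_q, hi_q) > eps:
                continue  # no point pair of (P, Q) can be within eps
            for p in part_p:
                for q in part_q:
                    if math.dist(p, q) <= eps:  # the CPU-intensive test
                        result.append((p, q))
    return result
```

The innermost `math.dist` call is where the point-pair distance test sits, so any speed-up of that test pays off once per candidate pair.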


Related Work: Plane Sweep for Polygons

[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]

  • Observations:

    • More efficient to use x-axis as sweep direction.

    • Projection of polygons to y-axis yields high overlap

    • Decide by projections of the bounding boxes (integrate a pdf)


Feature Vectors in the Similarity Join

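  • Distance computation between feature vectors p, q:
      dist2 = 0;                                   /* squared distance so far */
      for (i = 0; i < d; i++) {
          dist2 = dist2 + (p[i] - q[i]) * (p[i] - q[i]);
          if (dist2 > eps * eps) break;            /* already farther apart than ε */
      }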

  • Order dimensions by Mating Probability (increasing)
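
To make the effect of the dimension order concrete, here is a small sketch of the distance test with early termination. The function name `within_eps` and the returned count of evaluated dimensions are additions for illustration only; the `order` argument stands for whatever dimension order has been chosen.

```python
def within_eps(p, q, eps, order=None):
    """Early-terminating distance test.

    Returns (is_within, evaluated), where `evaluated` is the number of
    dimensions inspected before the test was decided.
    """
    if order is None:
        order = range(len(p))  # natural order d0, d1, ...
    eps2 = eps * eps
    dist2 = 0.0
    evaluated = 0
    for i in order:
        diff = p[i] - q[i]
        dist2 += diff * diff
        evaluated += 1
        if dist2 > eps2:
            return False, evaluated  # pair excluded, skip remaining dimensions
    return True, evaluated


if __name__ == "__main__":
    p, q = (0.0, 0.0, 0.0), (0.1, 0.1, 5.0)
    print(within_eps(p, q, 1.0))             # natural order
    print(within_eps(p, q, 1.0, [2, 0, 1]))  # most selective dimension first
```

In the example, the pair differs strongly only in the last dimension, so testing that dimension first excludes the pair after one step instead of three.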



Computation of the Mating Probability

To determine the mating probability for dimension di:

  • Project the bounding boxes onto the di-axis

  • Consider the two projections in a 2-dimensional space

  • The d0-projection of each point pair is located in this event space

  • Mating point pairs lie on the ε-stripe: y ≤ x + ε and y ≥ x − ε

  • The intersection of the event space with the ε-stripe determines the mating probability for d0

[Figure: event space with axes d0[P] and d0[Q], crossed by the ε-stripe around the diagonal]

Optimal Dimension Order

  • For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions sorted by increasing mating probability

  • Algorithm:
      for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
          determine ODO;
          for each pair of points (p,q) on (P,Q)
              test dist(p,q) ≤ ε using ODO;
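
The inner part of this algorithm, for a single partition pair, might look as follows. The sketch assumes that the per-dimension mating probabilities for (P,Q) have already been determined and are passed in as a list; `optimal_dimension_order` and `join_partition_pair` are hypothetical names.

```python
def optimal_dimension_order(mating_probs):
    """Dimension indices sorted by increasing mating probability."""
    return sorted(range(len(mating_probs)), key=lambda i: mating_probs[i])


def join_partition_pair(part_p, part_q, eps, order):
    """Report all (p, q) with dist(p, q) <= eps, testing dimensions in `order`."""
    eps2 = eps * eps
    result = []
    for p in part_p:
        for q in part_q:
            dist2 = 0.0
            for i in order:
                diff = p[i] - q[i]
                dist2 += diff * diff
                if dist2 > eps2:
                    break  # excluded early, remaining dimensions skipped
            else:
                result.append((p, q))  # loop finished without exceeding eps
    return result


if __name__ == "__main__":
    order = optimal_dimension_order([0.9, 0.1, 0.5])
    part_p = [(0.0, 0.0, 0.0)]
    part_q = [(0.1, 5.0, 0.1), (0.1, 0.2, 0.1)]
    print(order)
    print(join_partition_pair(part_p, part_q, 1.0, order))
```

The order is computed once per partition pair and reused for every point pair, so its cost is spread over all distance tests of that pair.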


Shape of the Intersection Area

  • 20 different shapes are possible, e.g. 1223, 2233, 2223

  • Easy proof of completeness and efficient case distinction by assigning codes to the corners

    • 1: Corner is left of or above the ε-stripe

    • 2: Corner is on the ε-stripe

    • 3: Corner is right of or below the ε-stripe

  • Easy formulas (only 45° and 90° angles)
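
The corner coding can be sketched as below. The slides do not state in which order the four corner codes are written, so the order used here (upper-left, lower-left, upper-right, lower-right, with the P-projection as x and the Q-projection as y) is an assumption, as are the function names. Under that assumption, y − x is largest at the upper-left corner and smallest at the lower-right corner, and enumerating the code combinations consistent with this yields 20, matching the count on the slide.

```python
from itertools import product


def corner_code(x, y, eps):
    """1: left of/above the stripe, 2: on the stripe, 3: right of/below it."""
    if y > x + eps:
        return 1
    if y < x - eps:
        return 3
    return 2


def shape_code(p_lo, p_hi, q_lo, q_hi, eps):
    """Four-digit code of the event-space rectangle (assumed corner order)."""
    corners = [
        (p_lo, q_hi),  # upper-left
        (p_lo, q_lo),  # lower-left
        (p_hi, q_hi),  # upper-right
        (p_hi, q_lo),  # lower-right
    ]
    return "".join(str(corner_code(x, y, eps)) for x, y in corners)


def consistent_codes():
    """Code tuples compatible with the ordering of y - x over the corners."""
    return [
        (ul, ll, ur, lr)
        for ul, ll, ur, lr in product((1, 2, 3), repeat=4)
        if ul <= min(ll, ur) and max(ll, ur) <= lr
    ]


if __name__ == "__main__":
    print(shape_code(0.0, 1.0, 0.0, 1.0, 0.5))
    print(len(consistent_codes()))
```

A case distinction over the resulting code then selects the area formula for the corresponding shape.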


Experimental Evaluation: R-tree Sim. Join

  • 8-dimensional data, uniformly distributed


Experimental Evaluation: R-tree Sim. Join

  • 16-dimensional data, from CAD-similarity search


Experimental Evaluation: Scalability

[Figure panels: MuX, uniform data; Z-RSJ, uniform data]



Conclusion

  • Conclusion:

    • Similarity join is an important database primitive for knowledge discovery in databases

    • Many different basic algorithms

    • Most of them can be accelerated by our optimal dimension order

  • Future Work:

    • New applications of the similarity join

    • Further optimization (multi-parameter) of the sim. join

    • Parallel and distributed environments

