Feature Based Similarity

Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2)
(1) University for Health Informatics and Technology, Innsbruck
(2) University of Munich
Optimal Dimension Order: A Generic Technique for the Similarity Join

Simple Similarity Queries
  • Specify a query object and
    • Find similar objects: range query
    • Find the k most similar objects: nearest-neighbor query
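The two query types above can be sketched in a few lines of Python. This is only a minimal illustration, assuming objects are feature vectors stored as tuples and similarity is Euclidean distance; the function names range_query and knn_query and the radius parameter eps are hypothetical and not taken from the slides.

```python
from math import dist
import heapq

def range_query(data, query, eps):
    """Return all objects whose distance to the query object is at most eps."""
    return [p for p in data if dist(p, query) <= eps]

def knn_query(data, query, k):
    """Return the k objects closest to the query object, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: dist(p, query))
```

Both functions scan the whole data set; an index structure would normally be used to avoid that.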
Join Applications: Catalogue Matching
  • Catalogue matching: similarity join of two catalogues R and S
    • E.g. astronomy catalogues
Join Applications: Clustering
  • Clustering (e.g. DBSCAN)
  • Similarity self-join
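As a rough sketch of how a similarity self-join can feed a density-based clustering step, the following Python code joins a point set with itself and then counts neighbours to find DBSCAN-style core points. It is a naive nested loop, not the algorithm from the slides; the names similarity_self_join, core_points and min_pts are assumptions made for this illustration.

```python
from math import dist

def similarity_self_join(points, eps):
    """Return index pairs (i, j), i < j, whose points are within eps of each other."""
    return [(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if dist(points[i], points[j]) <= eps]

def core_points(points, eps, min_pts):
    """Indices of points with at least min_pts neighbours within eps (itself included)."""
    counts = [1] * len(points)            # every point is its own neighbour
    for i, j in similarity_self_join(points, eps):
        counts[i] += 1
        counts[j] += 1
    return [i for i, c in enumerate(counts) if c >= min_pts]
```

The join result contains every neighbourhood at once, which is why a clustering pass can be run on top of a single self-join.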
R-Tree Similarity Join
  • Depth-first traversal of two trees [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
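A simplified sketch of such a synchronized depth-first traversal is given below. It assumes hand-built trees made of a hypothetical Node class (bounding rectangle plus either child nodes or points) and prunes every node pair whose rectangles are more than eps apart. It illustrates the traversal idea only and omits the optimizations of the cited paper.

```python
from dataclasses import dataclass, field
from math import dist, sqrt

@dataclass
class Node:
    lo: tuple                                       # lower corner of the bounding rectangle
    hi: tuple                                       # upper corner of the bounding rectangle
    children: list = field(default_factory=list)    # non-empty for directory nodes
    points: list = field(default_factory=list)      # non-empty for leaf nodes

def mindist(a, b):
    """Smallest possible distance between any two points of two bounding rectangles."""
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return sqrt(sum(g * g for g in gaps))

def rtree_join(r, s, eps, result):
    """Append all point pairs (p, q) with dist(p, q) <= eps to result."""
    if mindist(r, s) > eps:
        return                                      # no pair below these nodes can match
    if r.children and s.children:
        for rc in r.children:
            for sc in s.children:
                rtree_join(rc, sc, eps, result)
    elif r.children:                                # s is a leaf, descend in r only
        for rc in r.children:
            rtree_join(rc, s, eps, result)
    elif s.children:                                # r is a leaf, descend in s only
        for sc in s.children:
            rtree_join(r, sc, eps, result)
    else:                                           # two leaves: test the point pairs
        result.extend((p, q) for p in r.points for q in s.points
                      if dist(p, q) <= eps)
```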

The ε-kdB-Tree

[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

  • Assumption: two adjacent ε-stripes fit in main memory
  • Unrealistic for large data sets which are
    • clustered,
    • skewed, and
    • high-dimensional
Epsilon Grid Order

[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]

Common Properties
  • Decomposition of data/space into regions
  • Regions described by hyper-rectangles
      for each pair (P,Q) of partitions having dist (P,Q) ≤ ε
        for each pair of points (p,q) on (P,Q)
          test dist (p,q) ≤ ε ;
  • Most CPU effort is spent in the distance test between vectors ⇒ Idea: speed up the distance test
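The common two-level scheme above can be sketched as follows. The code assumes that partitions are given as plain lists of points and that each region is the minimum bounding hyper-rectangle of its partition; the helper names bounding_box, box_dist and partition_join are hypothetical. It also counts the point-level distance tests, since that is where the slides locate most of the CPU effort.

```python
from math import dist, sqrt

def bounding_box(points):
    """Minimum bounding hyper-rectangle of a non-empty list of points."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi

def box_dist(box_p, box_q):
    """Smallest possible distance between points of two hyper-rectangles."""
    (plo, phi), (qlo, qhi) = box_p, box_q
    gaps = (max(pl - qh, ql - ph, 0.0)
            for pl, ph, ql, qh in zip(plo, phi, qlo, qhi))
    return sqrt(sum(g * g for g in gaps))

def partition_join(parts_r, parts_s, eps):
    """Return (matching point pairs, number of point-level distance tests)."""
    result, tests = [], 0
    for P in parts_r:
        for Q in parts_s:
            if box_dist(bounding_box(P), bounding_box(Q)) > eps:
                continue                  # partition pair cannot contain a match
            for p in P:
                for q in Q:
                    tests += 1
                    if dist(p, q) <= eps:
                        result.append((p, q))
    return result, tests
```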
Related Work: Plane Sweep for Polygons

[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]

  • Observations:
    • More efficient to use the x-axis as sweep direction
    • Projection of the polygons onto the y-axis yields high overlap
    • Decide by projections of the bounding boxes (integrate a pdf)
Feature Vectors in the Similarity Join
  • Distance computation between feature vectors p, q:
      dist2 = 0 ;
      for (i = 0 ; i < d ; i++) {
        dist2 = dist2 + (p[i] - q[i]) * (p[i] - q[i]) ;
        if (dist2 > eps * eps) break ;   // pair can no longer be within ε
      }
  • Order dimensions by mating probability (increasing)
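The effect of the dimension order on the early-terminating distance test can be tried out with the small Python sketch below. The function name within_eps and its order parameter are assumptions for this illustration; the function also reports how many dimensions were evaluated before the loop stopped.

```python
def within_eps(p, q, eps, order=None):
    """Return (is_within_eps, number_of_dimensions_evaluated)."""
    if order is None:
        order = range(len(p))             # default: natural dimension order
    eps2 = eps * eps
    dist2 = 0.0
    for used, i in enumerate(order, start=1):
        diff = p[i] - q[i]
        dist2 += diff * diff
        if dist2 > eps2:                  # partial sum already exceeds eps squared
            return False, used
    return True, len(p)
```

For p = (0, 0, 0) and q = (0.1, 0.1, 5) with eps = 1, the natural order needs all three dimensions to reject the pair, while an order that starts with the third dimension rejects it after a single step.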

Computation of the Mating Probability

To determine the mating probability for dimension d_i:

  • Project the bounding boxes onto the d_i-axis

  • Consider the two projections in a 2-dimensional space

[Figure: event space with axes d0[P] and d0[Q]. The d0-projection of each point pair is located in this event space. Mating point pairs lie on the ε-stripe, bounded by y ≤ x + ε and y ≥ x − ε. The part of the event space on the ε-stripe is marked as the mating probability for d0.]
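The construction above can be turned into a small computation. The sketch below is an illustration under a stated assumption that is not confirmed by the slides: points are taken to be uniformly and independently distributed inside the two projections, so the mating probability is the area of the event space covered by the ε-stripe divided by the total area of the event space. The function name and parameter names are hypothetical, with x standing for the projection of P and y for the projection of Q.

```python
def mating_probability(p_lo, p_hi, q_lo, q_hi, eps):
    """Share of the rectangle [p_lo, p_hi] x [q_lo, q_hi] where |x - y| <= eps."""
    if p_hi <= p_lo or q_hi <= q_lo:
        raise ValueError("both projections must have positive length")

    def stripe_height(x):
        # Length of the part of [q_lo, q_hi] that lies within eps of x.
        return max(0.0, min(q_hi, x + eps) - max(q_lo, x - eps))

    # stripe_height is piecewise linear in x and can only bend at these values,
    # so the trapezoid rule between consecutive bends gives the exact area.
    bends = (q_lo - eps, q_hi - eps, q_lo + eps, q_hi + eps)
    xs = sorted({p_lo, p_hi, *(x for x in bends if p_lo < x < p_hi)})
    area = sum((x1 - x0) * (stripe_height(x0) + stripe_height(x1)) / 2
               for x0, x1 in zip(xs, xs[1:]))
    return area / ((p_hi - p_lo) * (q_hi - q_lo))
```

A small value means that most point pairs of the two partitions already differ by more than ε in this dimension, so testing it early tends to end the distance computation quickly.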

Optimal Dimension Order
  • For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability
  • Algorithm:
      for each pair (P,Q) of partitions having dist (P,Q) ≤ ε
        determine ODO ;
        for each pair of points (p,q) on (P,Q)
          test dist (p,q) ≤ ε using ODO ;
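A compact Python sketch of the inner part of this algorithm follows. It handles one pair (P,Q) of partitions that is assumed to have already passed the dist (P,Q) ≤ ε filter, and it assumes uniformly distributed points inside non-degenerate bounding boxes when estimating the mating probabilities. All function names are hypothetical, and boxes are passed as (lower corner, upper corner) tuples.

```python
def mating_probability(p_lo, p_hi, q_lo, q_hi, eps):
    """Share of [p_lo, p_hi] x [q_lo, q_hi] with |x - y| <= eps (uniform assumption)."""
    def h(x):
        return max(0.0, min(q_hi, x + eps) - max(q_lo, x - eps))
    bends = (q_lo - eps, q_hi - eps, q_lo + eps, q_hi + eps)
    xs = sorted({p_lo, p_hi, *(x for x in bends if p_lo < x < p_hi)})
    area = sum((b - a) * (h(a) + h(b)) / 2 for a, b in zip(xs, xs[1:]))
    return area / ((p_hi - p_lo) * (q_hi - q_lo))

def optimal_dimension_order(box_p, box_q, eps):
    """Dimensions sorted by increasing mating probability for one partition pair."""
    (plo, phi), (qlo, qhi) = box_p, box_q
    probs = [mating_probability(plo[i], phi[i], qlo[i], qhi[i], eps)
             for i in range(len(plo))]
    return sorted(range(len(probs)), key=probs.__getitem__)

def join_pair(P, Q, order, eps):
    """Test all point pairs of (P, Q), visiting dimensions in the given order."""
    eps2, result = eps * eps, []
    for p in P:
        for q in Q:
            dist2 = 0.0
            for i in order:
                dist2 += (p[i] - q[i]) ** 2
                if dist2 > eps2:
                    break                 # early termination, pair rejected
            else:
                result.append((p, q))     # loop finished without exceeding eps2
    return result
```

The order is computed once per partition pair and reused for all of its point pairs, so its cost is amortized over many distance tests.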
Shape of the Intersection Area
  • 20 different shapes are possible, e.g. 1223 2233 2223
  • Easy proof of completeness and efficient case distinction by assigning codes to the corners
    • 1: Corner is left of or above the ε-stripe
    • 2: Corner is on the ε-stripe
    • 3: Corner is right of or below the ε-stripe
  • Easy formulas (only 45° and 90° angles)
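The corner codes can be computed with a few comparisons, as in the sketch below. It assumes that x is the projection of P, y is the projection of Q, and the stripe is the region x − ε ≤ y ≤ x + ε. The order in which the four corners are written into the code string (upper-left, upper-right, lower-left, lower-right) is an assumption of this sketch; the slides do not state it.

```python
def corner_code(x, y, eps):
    """1: left of or above the stripe, 2: on the stripe, 3: right of or below it."""
    if y > x + eps:
        return 1
    if y < x - eps:
        return 3
    return 2

def shape_code(p_lo, p_hi, q_lo, q_hi, eps):
    """Four-digit code of the event-space rectangle [p_lo, p_hi] x [q_lo, q_hi]."""
    # Assumed corner order: upper-left, upper-right, lower-left, lower-right.
    corners = [(p_lo, q_hi), (p_hi, q_hi), (p_lo, q_lo), (p_hi, q_lo)]
    return "".join(str(corner_code(x, y, eps)) for x, y in corners)
```

A case distinction on this string can then select the area formula for the corresponding shape.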
Experimental Evaluation: R-tree Sim. Join
  • 8-dimensional data, uniformly distributed
Experimental Evaluation: R-tree Sim. Join
  • 16-dimensional data, from CAD-similarity search
Experimental Evaluation: Scalability

[Figure panels: MuX, uniform data; Z-RSJ, uniform data]

Conclusion
  • Conclusion:
    • Similarity join is an important database primitive for knowledge discovery in databases
    • Many different basic algorithms
    • Most of them can be accelerated by our optimal dimension order
  • Future Work:
    • New applications of the similarity join
    • Further (multi-parameter) optimization of the similarity join
    • Parallel and distributed environments