## PowerPoint Slideshow about ' Feature Based Similarity' - roy


Presentation Transcript

Optimal Dimension Order: A Generic Technique for the Similarity Join

Christian Böhm¹, Florian Krebs², and Hans-Peter Kriegel²

¹University for Health Informatics and Technology, Innsbruck

²University of Munich

Simple Similarity Queries

- Specify a query object and either
- find all similar objects (range query), or
- find the k most similar objects (nearest neighbor query)
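
The two query types can be illustrated by a plain linear scan. This is only a sketch under assumptions: Euclidean distance, points stored row by row in a flat array, and k = 1 for the nearest neighbor query. The function names `sq_dist`, `range_query` and `nearest_neighbor` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between two d-dimensional vectors. */
static double sq_dist(const double *p, const double *q, size_t d) {
    double sum = 0.0;
    for (size_t i = 0; i < d; i++) {
        double diff = p[i] - q[i];
        sum += diff * diff;
    }
    return sum;
}

/* Range query: writes the indices of all points within distance eps of
 * `query` into `out` (capacity n) and returns how many were found.
 * `data` holds n points stored row by row, d values each. */
size_t range_query(const double *data, size_t n, size_t d,
                   const double *query, double eps, size_t *out) {
    assert(eps >= 0.0);
    size_t found = 0;
    for (size_t j = 0; j < n; j++) {
        if (sq_dist(data + j * d, query, d) <= eps * eps)
            out[found++] = j;
    }
    return found;
}

/* Nearest neighbor query for k = 1: index of the closest point. */
size_t nearest_neighbor(const double *data, size_t n, size_t d,
                        const double *query) {
    assert(n > 0);
    size_t best = 0;
    double best_dist = sq_dist(data, query, d);
    for (size_t j = 1; j < n; j++) {
        double dist = sq_dist(data + j * d, query, d);
        if (dist < best_dist) {
            best_dist = dist;
            best = j;
        }
    }
    return best;
}
```

Both functions inspect every point once; index structures such as the R-tree exist to avoid exactly this full scan.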

Join Applications: Clustering

- Clustering (e.g. DBSCAN)
- Similarity self-join

R-Tree Similarity Join

- Depth-first traversal of two trees [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

[Figure: the two R-trees R and S]

The ε-kdB-Tree

[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

- Assumption: two adjacent ε-stripes fit in main memory
- Unrealistic for large data sets which are
- clustered,
- skewed, and
- high-dimensional

Epsilon Grid Order

[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]

Common Properties

- Decomposition of data/space into regions
- Regions described by hyper-rectangles
- Common scheme of the algorithms: for each pair (P,Q) of partitions having dist(P,Q) ≤ ε, for each pair of points (p,q) on (P,Q), test dist(p,q) ≤ ε
- Most CPU effort goes into the distance test between vectors ⇒ Idea: speed up the distance test
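
The common scheme can be sketched as two nested filter steps: first discard partition pairs whose hyper-rectangles are farther apart than ε, then test the remaining point pairs. The code below is an illustration under assumptions (Euclidean distance, row-major point storage); the `Partition` struct and the names `box_sq_dist` and `similarity_join_count` are hypothetical and not taken from any of the cited algorithms.

```c
#include <assert.h>
#include <stddef.h>

/* A partition: n points (row-major, d values each) and their bounding box. */
typedef struct {
    const double *points;
    size_t n;
    const double *lo;   /* lower corner of the hyper-rectangle */
    const double *hi;   /* upper corner of the hyper-rectangle */
} Partition;

/* Squared minimum distance between the hyper-rectangles of P and Q. */
double box_sq_dist(const Partition *P, const Partition *Q, size_t d) {
    double sum = 0.0;
    for (size_t i = 0; i < d; i++) {
        double gap = 0.0;
        if (P->hi[i] < Q->lo[i]) gap = Q->lo[i] - P->hi[i];
        else if (Q->hi[i] < P->lo[i]) gap = P->lo[i] - Q->hi[i];
        sum += gap * gap;
    }
    return sum;
}

/* Counts all point pairs (p, q) with dist(p, q) <= eps. */
size_t similarity_join_count(const Partition *r, size_t nr,
                             const Partition *s, size_t ns,
                             size_t d, double eps) {
    assert(eps >= 0.0);
    size_t count = 0;
    for (size_t a = 0; a < nr; a++) {
        for (size_t b = 0; b < ns; b++) {
            /* Partition filter: skip pairs with dist(P, Q) > eps. */
            if (box_sq_dist(&r[a], &s[b], d) > eps * eps) continue;
            for (size_t i = 0; i < r[a].n; i++) {
                for (size_t j = 0; j < s[b].n; j++) {
                    const double *p = r[a].points + i * d;
                    const double *q = s[b].points + j * d;
                    double dist2 = 0.0;
                    for (size_t k = 0; k < d; k++)
                        dist2 += (p[k] - q[k]) * (p[k] - q[k]);
                    if (dist2 <= eps * eps) count++;
                }
            }
        }
    }
    return count;
}
```

The innermost loop over the dimensions is the distance test that dominates the CPU cost.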

Related Work: Plane Sweep for Polygons

[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]

- Observations:
- More efficient to use the x-axis as sweep direction
- Projection of the polygons onto the y-axis yields high overlap
- Decide by projections of the bounding boxes (integrate a pdf)

Feature Vectors in the Similarity Join

- Distance computation between feature vectors p, q: for (i = 0; i < d; i++) { dist2 = dist2 + (p[i] - q[i])²; if (dist2 > ε²) break; }
- Order dimensions by Mating Probability (increasing)

[Figure: axes d0 and d1]
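
A sketch of the distance test with early termination, extended so that the dimensions are visited in a caller-supplied order. The function name `within_eps` and the `visited` output parameter (added only to make the effect of the order observable) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if dist(p, q) <= eps, else 0. Dimensions are visited in the
 * order given by `order` (a permutation of 0..d-1). `*visited` receives
 * the number of dimensions evaluated before the loop ended. */
int within_eps(const double *p, const double *q, const size_t *order,
               size_t d, double eps, size_t *visited) {
    assert(eps >= 0.0);
    double dist2 = 0.0;
    size_t k = 0;
    while (k < d) {
        size_t i = order[k++];
        double diff = p[i] - q[i];
        dist2 += diff * diff;
        if (dist2 > eps * eps) break;   /* early termination */
    }
    *visited = k;
    return dist2 <= eps * eps;
}
```

If the dimension in which the two vectors differ most comes first in `order`, a non-matching pair is rejected after a single step instead of after d steps.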

Computation of the Mating Probability

To determine mating probability for di:

- Project bounding boxes on di-axis

- Consider two projections in 2-dimensional space

[Figure: event space spanned by the axes d0[P] and d0[Q]. The d0-projection of each point pair is located in this event space. Mating point pairs lie on the ε-stripe, which extends ε to either side of the diagonal (one boundary: y ≥ x − ε). The part of the event space on the ε-stripe gives the mating probability for d0.]
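
As a sketch, the mating probability for one dimension can be computed as the area of the event space that lies on the ε-stripe, divided by the total area of the event space. This assumes that points are uniformly distributed inside the bounding boxes, which the slides do not state explicitly. The projections are [a, b] for P and [c, d] for Q, and all function names are hypothetical. The code integrates the stripe directly rather than using a case distinction over shapes.

```c
#include <assert.h>

/* Antiderivative of u -> clamp(u, c, d), continuous at c and d. */
static double clamp_integral(double u, double c, double d) {
    if (u <= c) return c * u;
    if (u <= d) return 0.5 * (u * u + c * c);
    return 0.5 * (d * d + c * c) + d * (u - d);
}

/* Area of { (x, y) : a <= x <= b, c <= y <= d, y <= x + t }. */
static double area_below(double a, double b, double c, double d, double t) {
    return clamp_integral(b + t, c, d) - clamp_integral(a + t, c, d)
           - c * (b - a);
}

/* Fraction of the event space [a, b] x [c, d] with |x - y| <= eps.
 * Assumes non-degenerate projections and a uniform distribution. */
double mating_probability(double a, double b, double c, double d,
                          double eps) {
    assert(a < b && c < d && eps >= 0.0);
    double stripe = area_below(a, b, c, d, eps)
                  - area_below(a, b, c, d, -eps);
    return stripe / ((b - a) * (d - c));
}
```

For two identical unit intervals and ε = 0.5, the stripe covers the unit square except for two corner triangles of area 0.125 each, so the result is 0.75.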

Optimal Dimension Order

- For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability
- Algorithm: for each pair (P,Q) of partitions having dist(P,Q) ≤ ε, determine ODO; then for each pair of points (p,q) on (P,Q), test dist(p,q) ≤ ε using ODO
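
Determining the order for one partition pair amounts to sorting the dimension indices by their mating probabilities. A minimal sketch, assuming the probabilities have already been computed and are passed in as an array; the names `DimProb` and `optimal_dimension_order` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* A dimension index together with its mating probability. */
typedef struct {
    size_t dim;
    double prob;
} DimProb;

/* qsort comparator: smaller mating probability first. */
static int by_increasing_prob(const void *x, const void *y) {
    double a = ((const DimProb *)x)->prob;
    double b = ((const DimProb *)y)->prob;
    return (a > b) - (a < b);
}

/* Fills order[0..d-1] with the dimension indices sorted by increasing
 * mating probability. Returns 0 on success, -1 if allocation fails. */
int optimal_dimension_order(const double *mating_prob, size_t d,
                            size_t *order) {
    assert(d > 0);
    DimProb *tmp = malloc(d * sizeof *tmp);
    if (tmp == NULL) return -1;
    for (size_t i = 0; i < d; i++) {
        tmp[i].dim = i;
        tmp[i].prob = mating_prob[i];
    }
    qsort(tmp, d, sizeof *tmp, by_increasing_prob);
    for (size_t i = 0; i < d; i++) order[i] = tmp[i].dim;
    free(tmp);
    return 0;
}
```

The order is computed once per partition pair and then reused for every point pair of that partition pair, so the sorting cost is amortized over many distance tests.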

Shape of the Intersection Area

- 20 different shapes are possible, e.g. 1223 2233 2223
- Easy proof of completeness and efficient case distinction by assigning codes to the corners
- 1: Corner is left or above the ε-stripe
- 2: Corner is on the ε-stripe
- 3: Corner is right or below the ε-stripe
- Easy formulas (only 45° and 90° angles)
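
The corner codes can be sketched as follows. Two things are assumptions and not taken from the slides: x runs along d0[P] and y along d0[Q], and the four digits are listed in the order upper left, lower left, upper right, lower right. The function names are hypothetical.

```c
#include <assert.h>

/* Code of one corner (x, y) of the event-space rectangle relative to the
 * eps-stripe |x - y| <= eps:
 * 1 = left or above, 2 = on the stripe, 3 = right or below. */
int corner_code(double x, double y, double eps) {
    assert(eps >= 0.0);
    if (y > x + eps) return 1;
    if (y < x - eps) return 3;
    return 2;
}

/* Four-digit shape code of the rectangle [a, b] x [c, d].
 * The corner order is an assumption made for this illustration. */
int shape_code(double a, double b, double c, double d, double eps) {
    assert(a <= b && c <= d);
    return 1000 * corner_code(a, d, eps)   /* upper left  */
         +  100 * corner_code(a, c, eps)   /* lower left  */
         +   10 * corner_code(b, d, eps)   /* upper right */
         +        corner_code(b, c, eps);  /* lower right */
}
```

With this corner order, the rectangle [0, 4] × [0, 4] and ε = 1 yields the code 1223, and [0, 4] × [0, 1] yields 2233.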

Experimental Evaluation: R-tree Sim. Join

- 8-dimensional data, uniformly distributed

Experimental Evaluation: R-tree Sim. Join

- 16-dimensional data, from CAD-similarity search

Experimental Evaluation: Scalability

EGO, CAD data

Conclusion

- Conclusion:
- Similarity join is an important database primitive for knowledge discovery in databases
- Many different basic algorithms
- Most of them can be accelerated by our optimal dimension order
- Future Work:
- New applications of the similarity join
- Further optimization (multi-parameter) of the sim. join
- Parallel and distributed environments
