
Pivot Selection: Dimension Reduction for Distance-based Indexing



  1. Pivot Selection: Dimension Reduction for Distance-based Indexing. Rui Mao, National High Performance Computing Center at Shenzhen, College of Computer Science and Software Engineering, Shenzhen University, China. 02/23/2011

  2. Outline • Similarity query and applications • Distance-based (metric space) indexing • Pivot selection • PCA for distance-based indexing • Future direction

  3. 1. Similarity Query. Given: • A database of n data records: S = {x1, x2, …, xn} • A similarity (distance) measure d(x,y) = the distance between data records x and y • A query object q. Range query R(q,r): all records within distance r of q. KNN query (k-nearest neighbors): the k records closest to q, e.g., Google Maps top-10 results. [Figure: query q with search radius r.]
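
For concreteness, a minimal linear-scan sketch of both query types over a generic distance oracle d (the function names are illustrative, not from the talk):

```python
import heapq

def range_query(S, d, q, r):
    """R(q, r): all records within distance r of q."""
    return [x for x in S if d(q, x) <= r]

def knn_query(S, d, q, k):
    """k-nearest neighbors: the k records closest to q."""
    return heapq.nsmallest(k, S, key=lambda x: d(q, x))
```

Distance-based indexing aims to answer both query types with far fewer distance calls than these linear scans.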

  4. Example 1 • Find all students with score in [75, 85]: SELECT name FROM student WHERE ABS(score-80)<=5;

  5. Example 2: Restaurants around UCF

  6. Molecular Biological Information System (MoBIoS) http://www.cs.utexas.edu/~mobios

  7. Image retrieval [CIT05]

  8. Conserved primer pair [ISMB04] Given: • Arabidopsis genome (120M) • Rice genome (537M) Goal: • Determine a large number of paired, conserved DNA primers that may be used as primer pairs for PCR. Similarity: • Hamming distance of 18-mers
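
The similarity measure itself is simple; a sketch of Hamming distance over equal-length k-mers (the example strings are illustrative):

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

# Two hypothetical 18-mers differing only in the last base:
hamming("ACGTACGTACGTACGTAC", "ACGTACGTACGTACGTAT")  # -> 1
```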

  9. Mass-spectra coarse filter [Bioinformatics06] Given: • A mass-spectra database • A query mass spectrum (a high-dimensional vector) Goal: • A coarse filter that retrieves a small subset of the database as candidates for fine filtering. Similarity: • Semi-cosine distance

  10. Protein sequence homology [BIBE06] Given • A database of sequences • A query sequence Goal: • Local alignment Similarity: • Global alignment of 6-mers with mPAM matrix (weighted edit distance) Methodology • Break database and query into k-mers • Similarity query of k-mers • Chain the results.

  11. 2. Distance-based Indexing Indexing: • Goal: fast data lookup • Minimize number of distance calculations • Ideal case: Log or even constant time • Worst case: Sequential scan of database • Methodology: Partition and pruning

  12. Category: data type & similarity • Data type: one-dimensional (R); similarity measure: Euclidean norm (absolute value of difference); index: one-dimensional indexing, e.g., B-tree • Data type: multi-dimensional (Rn); similarity measure: Euclidean norm; index: multi-dimensional indexing, e.g., kd-tree • Data type: other types; similarity measure: other measures; index: ?; example: ?

  13. Metric space: a pair M = (D, d), where • D is a set of points • d is a [metric] distance function satisfying: • d(x,y) = d(y,x) (symmetry) • d(x,y) >= 0, and d(x,y) = 0 iff x = y (non-negativity and identity) • d(x,z) <= d(x,y) + d(y,z) (triangle inequality). [Figure: triangle on points x, y, z with sides d(x,y), d(y,z), d(x,z).]

  14. How does it work? Example: range query R(Snoopy, 2). Given d(Michael, Linc) = 1 and d(Linc, Snoopy) = 100, the triangle inequality gives 99 <= d(Michael, Snoopy) <= 101, so Michael is pruned without ever computing d(Michael, Snoopy). Advantages: • Generality: one-dimensional data, multi-dimensional data with Euclidean norm, any metric space • A uniform programming model: only the distance oracle is assumed • One index mechanism for most data types. Disadvantage: • Not fast enough?
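
A sketch of this pruning logic with a single pivot p, assuming the distances d(p, x) were precomputed at build time (names are illustrative):

```python
def range_query_pruned(S, d, q, r, p, dist_to_p):
    """Triangle inequality gives |d(p, x) - d(p, q)| <= d(q, x),
    so if that lower bound exceeds r, x cannot be in R(q, r)."""
    d_pq = d(p, q)                       # one distance call for the pivot
    result = []
    for x in S:
        if abs(dist_to_p[x] - d_pq) > r:
            continue                     # pruned, no distance call needed
        if d(q, x) <= r:                 # verify the survivors
            result.append(x)
    return result
```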

  15. Data partition: three families • Hyper-plane methods • GHT [Uhlmann 1991] • GNAT [Brin 1995] • SA-tree [Navarro 1999] • Vantage point methods • BKT [Burkhard and Keller 1973] • VPT [Uhlmann 1991, Yianilos 1993] • MVPT [Bozkaya et al. 1997] • Bounding sphere methods • BST [Kalantari and McDonald 1983] • M-tree [Ciaccia et al. 1997] • Slim-tree [Traina et al. 2000]

  16. Hyper-plane methods [Uhlmann 1991] • Choose two centers C1 and C2 • Partition the data by the generalized hyper-plane L between them: points closer to C1 fall on the left of L, points closer to C2 on the right
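
A minimal sketch of the split rule (assuming the two centers are already chosen; ties go left here):

```python
def hyperplane_partition(S, d, c1, c2):
    """Each point goes to the side of the generalized hyper-plane L
    belonging to its closer center."""
    left  = [x for x in S if d(x, c1) <= d(x, c2)]   # closer to C1
    right = [x for x in S if d(x, c1) >  d(x, c2)]   # closer to C2
    return left, right
```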

  17. Vantage Point Tree (VPT) [Uhlmann 1991 & Yianilos 1993] • Choose a vantage point VP1 and radius R1 • Partition the data: points with d(VP1, x) <= R1 go inside the sphere, points with d(VP1, x) > R1 go outside; recurse with (VP21, R21), (VP22, R22), and so on. Searching R(q, r): • Case 1: if d(VP1, q) > R1 + r, search only outside the sphere • Case 2: if d(VP1, q) < R1 - r, search only inside the sphere • Case 3 (bad case): the query object is close to the partition boundary, so descend both children
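
A sketch of the three-case descent, assuming each internal node stores its vantage point vp, radius R, and inner/outer children (the node layout is an assumption, not from the talk):

```python
def vpt_range_search(node, d, q, r, out):
    """Recursive R(q, r) search over a VPT node."""
    if node is None:
        return
    if node.is_leaf:
        out.extend(x for x in node.points if d(q, x) <= r)
        return
    dq = d(node.vp, q)
    if dq <= r:                    # the vantage point itself may qualify
        out.append(node.vp)
    if dq > node.R + r:            # Case 1: search outside the sphere only
        vpt_range_search(node.outer, d, q, r, out)
    elif dq < node.R - r:          # Case 2: search inside the sphere only
        vpt_range_search(node.inner, d, q, r, out)
    else:                          # Case 3: boundary overlap, descend both
        vpt_range_search(node.inner, d, q, r, out)
        vpt_range_search(node.outer, d, q, r, out)
```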

  18. Bounding sphere methods [Ciaccia et al. 1997] • Choose centers C1, C2, C3, … • Partition the data; each subtree is covered by a bounding sphere (Ci, R(Ci)) around its center
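
The corresponding pruning test is a single comparison per subtree (a sketch, assuming each node stores its center Ci and covering radius R(Ci)):

```python
def sphere_may_contain_results(d, q, r, center, radius):
    """A subtree covered by the sphere (center, radius) can hold an
    R(q, r) answer only if the query ball intersects that sphere."""
    return d(center, q) <= radius + r
```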

  19. Difficulties and problems • No coordinates • Mathematical tools not directly applicable • Mostly heuristic • Lack of theoretical analysis • 3 families of indices • Not unified • Hard to compare, analyze and predict (SISAP 2010 Best Paper)

  20. General Methodology • metric space → Rk (pivot selection) • multi-dimensional indexing → query cube (data partition) • direct evaluation of the cube (post-processing)

  21. P S Pivot space Mapping: M  Rk :x  Pivot space: The image of S in Rk

  22. Complete pivot space. Let all the points be pivots: P = S, M → Rn. L∞((a1, a2, …, an), (b1, b2, …, bn)) = maxi |ai − bi|
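
A sketch of the mapping and the L∞ distance between images (names illustrative). Note that with P = S the mapping is isometric: the coordinate for p = x contributes |d(x,x) − d(y,x)| = d(x,y), and the triangle inequality caps every other coordinate at d(x,y), so the max equals d(x,y) exactly.

```python
def pivot_map(x, pivots, d):
    """x -> (d(x, p1), ..., d(x, pk)): the image of x in pivot space."""
    return tuple(d(x, p) for p in pivots)

def l_inf(a, b):
    """L-infinity distance between two pivot-space images."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))
```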

  23. Distance-based indexing → high-dimensional indexing: general metric space → (isometric mapping) → high-dimensional vector space → (dimension reduction = pivot selection) → low-dimensional vector space → (multi-dimensional indexing = data partition) → result set in low-dim space → (sequential comparison) → result set in metric space

  24. 3. Pivot selection Dimension reduction for distance-based indexing • answer queries directly in the complete pivot space? • dimension reduction for the complete pivot space? • why is pivot selection important? • how to select pivots?

  25. 3.1 Answer queries directly in the complete pivot space? Theorem: Evaluation of similarity queries in the complete pivot space degrades the query performance to linear scan. • Dimension reduction is inevitable

  26. 3.2 Dimension reduction for the complete pivot space? Theorem: If a dimension reduction technique creates new dimensions based on all existing dimensions, evaluation of similarity queries degrades to a linear scan. • Therefore we can only select among existing dimensions: pivot selection

  27. 3.3 Why is pivot selection important? • Building an index tree is a process of information loss • The information available to data partition is determined by pivot selection. Example: three collinear points A, B, C at positions 1, 2, 3:

point            A  B  C
original value   1  2  3
d(x, A)          0  1  2
d(x, C)          2  1  0
d(x, B)          1  0  1

With pivot B, A and C map to the same image, so the distinction between them is lost.
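
A tiny numeric check of the example above (positions as on the slide):

```python
A, B, C = 1, 2, 3
d = lambda x, y: abs(x - y)

print([d(x, A) for x in (A, B, C)])   # [0, 1, 2] -- all three distinguishable
print([d(x, C) for x in (A, B, C)])   # [2, 1, 0] -- all three distinguishable
print([d(x, B) for x in (A, B, C)])   # [1, 0, 1] -- A and C collide
```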

  28. Importance of pivot selection. Uniformly distributed points in the unit square. [Figure: distances to the pivots for two pivot choices: (0,0) and (1,1) vs. (0,0) and (1,0).]

  29. Importance of pivot selection. 14-bit Hamming strings ("0/1" strings). [Figure: distances to the pivots for two pivot choices: opposite corners (00 0000 0000 0000 and 11 1111 1111 1111) vs. nearby corners (00 0000 0000 0000 and 00 0000 0111 1111).]

  30. 3.4 How to select pivots? • Heuristic: for each new dimension, select the point with the largest projection on that new dimension in the pivot space • Use mathematical tools in Rn • Yet what is a good objective function for pivot selection? • Empirically: select pivots, build the tree, run queries, and measure the average query speed

  31. 4. PCA for distance-based indexing • Pivot selection • Estimate the intrinsic dimension

  32. PCA for pivot selection • Run PCA on the complete pivot space • Apply the heuristic: for each PC, find the most similar (minimal-angle) dimension (point) in the complete pivot space
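
One plausible reading of this rule as code, assuming the complete pivot-space matrix D (D[i, j] = d(xi, xj)) fits in memory; the function name and the loading-based angle test are my assumptions, not the talk's implementation:

```python
import numpy as np

def pca_select_pivots(D, k):
    """For each of the top-k principal components of the complete pivot
    space, pick the coordinate axis (an existing point/dimension) with
    the largest absolute loading, i.e., the smallest angle to the PC."""
    X = D - D.mean(axis=0)                    # center each dimension
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pivots = []
    for pc in Vt[:k]:                         # rows of Vt are the PCs
        for j in np.argsort(-np.abs(pc)):     # closest axis first
            if j not in pivots:
                pivots.append(int(j))
                break
    return pivots
```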

  33. Estimate the intrinsic dimension 1. Pairwise distances: ρ = μ² / (2σ²), where μ and σ² are the mean and variance of the pairwise distance distribution 2. |Range(q,r)| ∝ r^d; linear regression: log |Range(q,r)| = d·log(r) + c 3. Where the PCA eigenvalues change the most: argmax_i (λi / λi+1)
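
Sketches of the three estimators, assuming a sample of pairwise distances, range-count measurements, and PCA eigenvalues are already at hand (names illustrative):

```python
import numpy as np

def idim_moments(pairwise_distances):
    """Estimator 1: rho = mu^2 / (2 * sigma^2)."""
    mu = np.mean(pairwise_distances)
    var = np.var(pairwise_distances)
    return mu * mu / (2.0 * var)

def idim_growth(radii, counts):
    """Estimator 2: slope d of the fit log|Range(q, r)| = d*log(r) + c."""
    d, _c = np.polyfit(np.log(radii), np.log(counts), 1)
    return d

def idim_eigengap(eigenvalues):
    """Estimator 3: argmax_i (lambda_i / lambda_{i+1}), 1-based."""
    lam = np.asarray(eigenvalues, dtype=float)
    return int(np.argmax(lam[:-1] / lam[1:])) + 1
```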

  34. Intrinsic dimension

  35. 5. Future work • Other dimension reduction methods • Objective function of pivot selection • Pair-wise distances • Multi-variable regression methods • Forward selection, backward elimination • Choice of y: mean? standard deviation? • Variable selection • Non-linear regression

  36. Credit • Daniel P. Miranker, UT Austin • Willard L. Miranker, Yale University. Rui Mao, Willard L. Miranker and Daniel P. Miranker, "Dimension Reduction for Distance-Based Indexing", in Proceedings of the Third International Conference on Similarity Search and Applications (SISAP 2010), pages 25-32, Istanbul, Turkey, September 18-19, 2010.

  37. Thank you! mao@szu.edu.cn http://nhpcc.szu.edu.cn/mao/eng
