This presentation explores pivot selection and PCA for distance-based indexing in metric spaces, with the goal of fast data lookup that minimizes the number of distance calculations. It reviews similarity queries and their applications, contrasts distance-based indexing with one-dimensional structures such as the B-tree and multi-dimensional structures such as the kd-tree, surveys the three families of metric-space indices, analyzes why pivot selection matters for dimension reduction, and closes with directions for future work.
Pivot Selection: Dimension Reduction for Distance-based Indexing
Rui Mao
National High Performance Computing Center at Shenzhen
College of Computer Science and Software Engineering, Shenzhen University, China
02/23/2011
Outline • Similarity query and applications • Distance-based (metric space) indexing • Pivot selection • PCA for distance-based indexing • Future direction
1. Similarity Query
Given:
• A database of n data records: S = {x1, x2, …, xn}
• A similarity (distance) measure d(x,y) = the distance between data records x and y
• A query q
Query types:
• Range query R(q,r): all records x in S with d(q,x) ≤ r
• kNN query (k-nearest neighbor): the k records closest to q, e.g., Google Maps' top 10 results
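As a baseline, both query types can be answered by a sequential scan, the worst case that indexing aims to beat. A minimal sketch, assuming only that the distance function d is given as an oracle:

```python
import heapq

def range_query(S, d, q, r):
    """R(q, r): all records within distance r of q (one distance call per record)."""
    return [x for x in S if d(q, x) <= r]

def knn_query(S, d, q, k):
    """kNN(q, k): the k records nearest to q (one distance call per record)."""
    return heapq.nsmallest(k, S, key=lambda x: d(q, x))
```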
Example 1 • Find all students with score in [75, 85]: SELECT name FROM student WHERE ABS(score-80)<=5;
Molecular Biological Information System (MoBIoS) http://www.cs.utexas.edu/~mobios
Conserved primer pair [ISMB04]
Given:
• Arabidopsis genome (120M)
• Rice genome (537M)
Goal:
• Determine a large number of paired, conserved DNA primers that may be used as primer pairs for PCR
Similarity:
• Hamming distance of 18-mers
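For concreteness, the Hamming distance between two equal-length strings simply counts mismatching positions; a minimal sketch (the example 18-mers below are made up for illustration):

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings (here, 18-mers)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

# hamming("ACGTACGTACGTACGTAC", "ACGTACGTACGTACGTAA") == 1
```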
Mass-spectra coarse filter [Bioinformatics06]
Given:
• A mass-spectra database
• A query mass spectrum (high-dimensional vector)
Goal:
• A coarse filter that retrieves a small subset of the database as candidates for fine filtering
Similarity:
• Semi-cosine distance
Protein sequence homology [BIBE06]
Given:
• A database of sequences
• A query sequence
Goal:
• Local alignment
Similarity:
• Global alignment of 6-mers with the mPAM matrix (weighted edit distance)
Methodology:
• Break the database and query into k-mers
• Run similarity queries on the k-mers
• Chain the results
2. Distance-based Indexing
Indexing:
• Goal: fast data lookup
• Minimize the number of distance calculations
• Ideal case: logarithmic or even constant time
• Worst case: sequential scan of the database
• Methodology: partition and pruning
Category: data type & similarity
• One-dimensional data (R), Euclidean norm (absolute value of difference) → one-dimensional indexing, e.g., B-tree
• Multi-dimensional data (Rn), Euclidean norm → multi-dimensional indexing, e.g., kd-tree
• Other data types, other similarity measures → index: ? example: ?
Metric Space
A pair M = (D, d), where
• D is a set of points
• d is a metric distance function satisfying:
  • d(x,y) = d(y,x) (symmetry)
  • d(x,y) ≥ 0, and d(x,y) = 0 iff x = y (non-negativity)
  • d(x,z) ≤ d(x,y) + d(y,z) (triangle inequality)
(Figure: a triangle on points x, y, z with edge lengths d(x,y), d(y,z), d(x,z).)
How it works?
Example: range query R(Snoopy, 2). Given d(Michael, Linc) = 1 and d(Linc, Snoopy) = 100, the triangle inequality gives 99 ≤ d(Michael, Snoopy) ≤ 101, so Michael can be pruned without ever computing d(Michael, Snoopy).
Advantages:
• Generality: one-dimensional data, multi-dimensional data with Euclidean norm, any metric space
• A uniform programming model: the distance oracle is given
• One index mechanism for most data types
Disadvantages:
• Not fast enough?
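A minimal sketch of this pruning rule, assuming a precomputed distance d(p, x) from some reference object p to each database object x:

```python
def can_prune(d_q_p: float, d_p_x: float, r: float) -> bool:
    """For range query R(q, r): the triangle inequality bounds d(q, x) within
    [|d(q, p) - d(p, x)|, d(q, p) + d(p, x)], so if the lower bound already
    exceeds r, x cannot be an answer and needs no distance computation."""
    return abs(d_q_p - d_p_x) > r

# The slide's example: d(Snoopy, Linc) = 100, d(Linc, Michael) = 1,
# so d(Snoopy, Michael) >= 99 > 2 and Michael is pruned from R(Snoopy, 2).
assert can_prune(100, 1, 2)
```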
Data partition: three families • Hyper-plane methods • GHT [Uhlmann 1991] • GNAT [Brin 1995] • SA-tree [Navarro 1999] • Vantage point methods • BKT [Burkhard and Keller 1973] • VPT [Uhlmann 1991, Yianilos 1993] • MVPT [Bozkaya et al. 1997] • Bounding sphere methods • BST [Kalantari and McDonald 1983] • M-tree [Ciaccia et al. 1997] • Slim-tree [Traina et al. 2000]
Hyper-plane methods [Uhlmann 1991]
• Choose two centers C1, C2
• Partition the data by the generalized hyper-plane L between the centers: points closer to C1 fall left of L, points closer to C2 fall right of L
Vantage Point Tree (VPT) [Uhlmann 1991 & Yianilos 1993]
• Choose vantage points
• Partition the data: the root stores (VP1, R1); points with d(VP1, x) ≤ R1 go to the inside subtree, points with d(VP1, x) > R1 to the outside subtree; the children partition recursively with (VP21, R21) and (VP22, R22)
Searching a range query R(q, r):
• Case 1. If d(VP1, q) > R1 + r, then search only outside the sphere
• Case 2. If d(VP1, q) < R1 − r, then search only inside the sphere
• Case 3. Bad case: the query object is close to the partition boundary; descend both children
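A sketch of this descent rule, assuming a node layout with fields vp, radius, inside, and outside (the names are illustrative, not prescribed by the slides):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class VPNode:
    vp: Any                        # vantage point stored at this node
    radius: float                  # partitioning radius R
    inside: Optional["VPNode"]     # subtree with d(vp, x) <= radius
    outside: Optional["VPNode"]    # subtree with d(vp, x) > radius

def vpt_range_search(node, d, q, r, out):
    """Collect answers to R(q, r) from a vantage point tree, applying cases 1-3."""
    if node is None:
        return
    dist = d(node.vp, q)                    # one distance computation per visited node
    if dist <= r:
        out.append(node.vp)
    if dist > node.radius + r:              # Case 1: answers can only be outside
        vpt_range_search(node.outside, d, q, r, out)
    elif dist < node.radius - r:            # Case 2: answers can only be inside
        vpt_range_search(node.inside, d, q, r, out)
    else:                                   # Case 3: near the boundary, descend both
        vpt_range_search(node.inside, d, q, r, out)
        vpt_range_search(node.outside, d, q, r, out)
```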
Bounding sphere methods [Ciaccia et al. 1997]
• Choose centers C1, C2, C3, …
• Partition the data into bounding spheres (C1, R(C1)), (C2, R(C2)), (C3, R(C3)), each covering the points assigned to its center
Difficulties and problems
• No coordinates: mathematical tools are not directly applicable
• Mostly heuristic: lack of theoretical analysis
• 3 families of indices: not unified; hard to compare, analyze and predict
(SISAP 2010 Best Paper)
General Methodology
• Metric space → Rk (pivot selection)
• A range query maps to a query cube in Rk, answered with multi-dimensional indexing (data partition)
• Direct evaluation of the candidates in the cube (post-processing)
Pivot space
Given a set of pivots P = {p1, p2, …, pk} ⊆ S, define the mapping F: M → Rk, x ↦ (d(x, p1), d(x, p2), …, d(x, pk)).
Pivot space: the image of S in Rk.
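A minimal sketch of this mapping and of the query cube it induces for a range query R(q, r); the cube follows from the triangle-inequality bound on each coordinate:

```python
def pivot_map(x, pivots, d):
    """F(x) = (d(x, p1), ..., d(x, pk)): the pivot-space image of x."""
    return tuple(d(x, p) for p in pivots)

def query_cube(q, r, pivots, d):
    """Per-coordinate intervals [d(q, pi) - r, d(q, pi) + r]; by the triangle
    inequality every answer of R(q, r) maps into this cube, so anything
    outside it can be pruned before any direct distance evaluation."""
    return [(a - r, a + r) for a in pivot_map(q, pivots, d)]
```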
Complete pivot space
Let all the points be pivots: P = S, so F: M → Rn with the L∞ distance
L∞((a1, a2, …, an), (b1, b2, …, bn)) = maxi |ai − bi|
Under L∞ the complete pivot-space mapping is isometric on S: maxi |d(x, xi) − d(y, xi)| = d(x, y).
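A quick check of this isometry (the triangle inequality caps each coordinate difference at d(x, y), and the coordinate for pivot x attains it), using random Euclidean points as an illustrative metric space:

```python
import math, random

def d(a, b):                           # an example metric: Euclidean distance in R^2
    return math.dist(a, b)

S = [(random.random(), random.random()) for _ in range(50)]
x, y = S[0], S[1]
# L-infinity distance between the complete pivot-space images of x and y
linf = max(abs(d(x, p) - d(y, p)) for p in S)
assert abs(linf - d(x, y)) < 1e-12     # isometry: the max is attained at pivot x
```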
Distance-based indexing mirrors high-dimensional indexing (figure):
General metric space → (isometric mapping) → high-dimensional vector space → (dimension reduction = pivot selection) → low-dimensional vector space → (multi-dimensional indexing = data partition) → result set in low-dim space → (sequential comparison) → result set in metric space
3. Pivot Selection
Dimension reduction for distance-based indexing:
• Can we answer queries directly in the complete pivot space?
• Can we apply dimension reduction to the complete pivot space?
• Why is pivot selection important?
• How to select pivots?
3.1 Answer queries directly in the complete pivot space?
Theorem: Evaluation of similarity queries in the complete pivot space degrades the query performance to a linear scan.
• Therefore, dimension reduction is inevitable.
3.2 Dimension reduction for the complete pivot space?
Theorem: If a dimension reduction technique creates new dimensions based on all existing dimensions, evaluation of similarity queries degrades to a linear scan.
• Therefore, we can only select among existing dimensions: pivot selection.
3.3 Why is pivot selection important?
• Building an index tree is a process of information loss
• The information available for data partition is determined by pivot selection
Example: points A, B, C at positions 1, 2, 3 on the real line.

point:          A  B  C
original value: 1  2  3
d(x, A):        0  1  2
d(x, C):        2  1  0
d(x, B):        1  0  1  (A and C collide: information is lost)
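The loss is easy to reproduce: mapping the same three points under each single pivot shows that pivot B merges A and C (a sketch with A, B, C as reals under absolute difference):

```python
A, B, C = 1, 2, 3                      # the three points on the real line
d = lambda x, pivot: abs(x - pivot)    # one-dimensional metric
for pivot in (A, B, C):
    print(pivot, [d(x, pivot) for x in (A, B, C)])
# pivot A -> [0, 1, 2]; pivot C -> [2, 1, 0]; pivot B -> [1, 0, 1]: A and C collide
```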
Importance of pivot selection
Uniformly distributed points in the unit square. (Figure: scatter plots of distances to the two pivots, comparing pivots (0,0) and (1,1) against pivots (0,0) and (1,0).)
Importance of pivot selection
14-bit Hamming strings ("0/1" strings). (Figure: scatter plots of distances to the two pivots, comparing opposite-corner pivots 00 0000 0000 0000 and 11 1111 1111 1111 against neighboring-corner pivots 00 0000 0000 0000 and 00 0000 0111 1111.)
3.4 How to select pivots?
• Heuristic: for each new dimension, select the point with the largest projection on that new dimension in the pivot space
• Use mathematical tools available in Rn
• Yet what is a good objective function for pivot selection?
• Empirically: select pivots, build the tree, run queries, and measure the average query speed
4. PCA for distance-based indexing • Pivot selection • Estimate the intrinsic dimension
PCA for pivot selection
• Run PCA on the complete pivot space
• Apply the heuristic: for each principal component, find the most similar (minimal angle) dimension (point) of the complete pivot space, as sketched below
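A sketch under one reading of this heuristic: build the n×n complete pivot-space matrix, run PCA, and for each principal component pick the coordinate axis (hence the point) at minimal angle to it. Since cos(angle(pc, e_i)) = |pc[i]| / ||pc||, the closest axis is the one with the largest absolute loading:

```python
import numpy as np

def select_pivots(D: np.ndarray, k: int) -> list:
    """Select k pivots from an n x n pairwise distance matrix D, where row j
    is the complete pivot-space image of point j. For each of the top-k
    principal components, choose the axis (point) with the largest absolute
    loading, i.e., the minimal angle to that component."""
    X = D - D.mean(axis=0)                         # center the pivot-space data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pivots = []
    for pc in Vt[:k]:                              # PCs in decreasing variance order
        for i in np.argsort(-np.abs(pc)):          # axes by decreasing |loading|
            if int(i) not in pivots:               # keep pivots distinct across PCs
                pivots.append(int(i))
                break
    return pivots
```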
Estimate the intrinsic dimension (three estimators):
1. From pairwise distances: ρ = μ² / (2σ²), where μ and σ² are the mean and variance of the pairwise distance distribution
2. From range-query growth: |Range(q,r)| ∝ r^d, so fit the linear regression log(|Range(q,r)|) = d·log(r) + c and read off the slope d
3. From where the PCA eigenvalues change the most: argmaxi (λi / λi+1)
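Sketches of the three estimators; the inputs are assumed to be supplied by the caller (a sample of pairwise distances, empirical (r, |Range(q,r)|) pairs, and PCA eigenvalues sorted in decreasing order):

```python
import numpy as np

def rho_estimate(dists: np.ndarray) -> float:
    """Estimator 1: rho = mu^2 / (2 sigma^2) from sampled pairwise distances."""
    return dists.mean() ** 2 / (2.0 * dists.var())

def regression_estimate(radii: np.ndarray, counts: np.ndarray) -> float:
    """Estimator 2: slope d of the fit log|Range(q, r)| = d * log(r) + c."""
    d, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return d

def eigengap_estimate(eigvals: np.ndarray) -> int:
    """Estimator 3: argmax_i (lambda_i / lambda_{i+1}) over decreasing eigenvalues."""
    ratios = eigvals[:-1] / eigvals[1:]
    return int(np.argmax(ratios)) + 1              # number of dimensions to keep
```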
5. Future work
• Other dimension reduction methods
• Objective function for pivot selection, based on pair-wise distances
• Multi-variable regression methods:
  • Variable selection: forward selection, backward elimination
  • Choice of the response y: mean? standard deviation?
  • Non-linear regression
Credit
• Daniel P. Miranker, UT Austin
• Willard L. Miranker, Yale University
Rui Mao, Willard L. Miranker and Daniel P. Miranker, "Dimension Reduction for Distance-Based Indexing", in Proceedings of the Third International Conference on SImilarity Search and APplications (SISAP 2010), pages 25-32, Istanbul, Turkey, September 18-19, 2010.
Thank you! mao@szu.edu.cn http://nhpcc.szu.edu.cn/mao/eng