This presentation explores pivot selection and PCA for distance-based indexing in metric spaces, with the goal of fast data lookup that minimizes the number of distance calculations. It reviews similarity queries and their applications, contrasts distance-based indexing with one-dimensional structures such as the B-tree and multi-dimensional structures such as the kd-tree, surveys the three families of metric-space indices, analyzes why pivot selection matters for dimension reduction, and closes with directions for future work.
Pivot Selection: Dimension Reduction for Distance-based Indexing
Rui Mao
National High Performance Computing Center at Shenzhen
College of Computer Science and Software Engineering, Shenzhen University, China
02/23/2011
Outline • Similarity query and applications • Distance-based (metric space) indexing • Pivot selection • PCA for distance-based indexing • Future direction
1. Similarity Query
Given:
• A database of n data records: S = {x1, x2, …, xn}
• A similarity (distance) measure d(x,y) = the distance between data records x and y
• A query q
Query types:
• Range query R(q,r): all records x in S with d(q,x) ≤ r
• kNN query (k-nearest neighbor): the k records closest to q, e.g., Google Maps' top 10 results
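As a baseline, both query types can be answered by a sequential scan, the worst case that indexing aims to beat. A minimal sketch, assuming only that the distance function d is given as an oracle:

```python
import heapq

def range_query(S, d, q, r):
    """R(q, r): all records within distance r of q (one distance call per record)."""
    return [x for x in S if d(q, x) <= r]

def knn_query(S, d, q, k):
    """kNN(q, k): the k records nearest to q (one distance call per record)."""
    return heapq.nsmallest(k, S, key=lambda x: d(q, x))
```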
Example 1 • Find all students with score in [75, 85]: SELECT name FROM student WHERE ABS(score-80)<=5;
Molecular Biological Information System (MoBIoS) http://www.cs.utexas.edu/~mobios
Conserved primer pair [ISMB04]
Given:
• Arabidopsis genome (120M)
• Rice genome (537M)
Goal:
• Determine a large number of paired, conserved DNA primers that may be used as primer pairs for PCR
Similarity:
• Hamming distance of 18-mers
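For concreteness, the Hamming distance between two equal-length strings simply counts mismatching positions; a minimal sketch (the example 18-mers below are made up for illustration):

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings (here, 18-mers)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

# hamming("ACGTACGTACGTACGTAC", "ACGTACGTACGTACGTAA") == 1
```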
Mass-spectra coarse filter [Bioinformatics06]
Given:
• A mass-spectra database
• A query mass spectrum (high-dimensional vector)
Goal:
• A coarse filter that retrieves a small subset of the database as candidates for fine filtering
Similarity:
• Semi-cosine distance
Protein sequence homology [BIBE06]
Given:
• A database of sequences
• A query sequence
Goal:
• Local alignment
Similarity:
• Global alignment of 6-mers with the mPAM matrix (weighted edit distance)
Methodology:
• Break the database and query into k-mers
• Run similarity queries on the k-mers
• Chain the results
2. Distance-based Indexing
Indexing:
• Goal: fast data lookup
• Minimize the number of distance calculations
• Ideal case: logarithmic or even constant time
• Worst case: sequential scan of the database
• Methodology: partition and pruning
Category: data type & similarity
• One-dimensional data (R), Euclidean norm (absolute value of difference) → one-dimensional indexing, e.g., B-tree
• Multi-dimensional data (Rn), Euclidean norm → multi-dimensional indexing, e.g., kd-tree
• Other data types, other similarity measures → index: ? example: ?
Metric Space
A pair M = (D, d), where
• D is a set of points
• d is a metric distance function satisfying:
  • d(x,y) = d(y,x) (symmetry)
  • d(x,y) ≥ 0, and d(x,y) = 0 iff x = y (non-negativity)
  • d(x,z) ≤ d(x,y) + d(y,z) (triangle inequality)
(Figure: a triangle on points x, y, z with edge lengths d(x,y), d(y,z), d(x,z).)
How it works?
Example: range query R(Snoopy, 2). Given d(Michael, Linc) = 1 and d(Linc, Snoopy) = 100, the triangle inequality gives 99 ≤ d(Michael, Snoopy) ≤ 101, so Michael can be pruned without ever computing d(Michael, Snoopy).
Advantages:
• Generality: one-dimensional data, multi-dimensional data with Euclidean norm, any metric space
• A uniform programming model: the distance oracle is given
• One index mechanism for most data types
Disadvantages:
• Not fast enough?
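A minimal sketch of this pruning rule, assuming a precomputed distance d(p, x) from some reference object p to each database object x:

```python
def can_prune(d_q_p: float, d_p_x: float, r: float) -> bool:
    """For range query R(q, r): the triangle inequality bounds d(q, x) within
    [|d(q, p) - d(p, x)|, d(q, p) + d(p, x)], so if the lower bound already
    exceeds r, x cannot be an answer and needs no distance computation."""
    return abs(d_q_p - d_p_x) > r

# The slide's example: d(Snoopy, Linc) = 100, d(Linc, Michael) = 1,
# so d(Snoopy, Michael) >= 99 > 2 and Michael is pruned from R(Snoopy, 2).
assert can_prune(100, 1, 2)
```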
Data partition: three families • Hyper-plane methods • GHT [Uhlmann 1991] • GNAT [Brin 1995] • SA-tree [Navarro 1999] • Vantage point methods • BKT [Burkhard and Keller 1973] • VPT [Uhlmann 1991, Yianilos 1993] • MVPT [Bozkaya et al. 1997] • Bounding sphere methods • BST [Kalantari and McDonald 1983] • M-tree [Ciaccia et al. 1997] • Slim-tree [Traina et al. 2000]
Hyper-plane methods [Uhlmann 1991]
• Choose two centers C1, C2
• Partition the data by the generalized hyper-plane L between the centers: points closer to C1 fall left of L, points closer to C2 fall right of L
Vantage Point Tree (VPT) [Uhlmann 1991 & Yianilos 1993]
• Choose vantage points
• Partition the data: the root stores (VP1, R1); points with d(VP1, x) ≤ R1 go to the inside subtree, points with d(VP1, x) > R1 to the outside subtree; the children partition recursively with (VP21, R21) and (VP22, R22)
Searching a range query R(q, r):
• Case 1. If d(VP1, q) > R1 + r, then search only outside the sphere
• Case 2. If d(VP1, q) < R1 − r, then search only inside the sphere
• Case 3. Bad case: the query object is close to the partition boundary; descend both children
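A sketch of this descent rule, assuming a node layout with fields vp, radius, inside, and outside (the names are illustrative, not prescribed by the slides):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class VPNode:
    vp: Any                        # vantage point stored at this node
    radius: float                  # partitioning radius R
    inside: Optional["VPNode"]     # subtree with d(vp, x) <= radius
    outside: Optional["VPNode"]    # subtree with d(vp, x) > radius

def vpt_range_search(node, d, q, r, out):
    """Collect answers to R(q, r) from a vantage point tree, applying cases 1-3."""
    if node is None:
        return
    dist = d(node.vp, q)                    # one distance computation per visited node
    if dist <= r:
        out.append(node.vp)
    if dist > node.radius + r:              # Case 1: answers can only be outside
        vpt_range_search(node.outside, d, q, r, out)
    elif dist < node.radius - r:            # Case 2: answers can only be inside
        vpt_range_search(node.inside, d, q, r, out)
    else:                                   # Case 3: near the boundary, descend both
        vpt_range_search(node.inside, d, q, r, out)
        vpt_range_search(node.outside, d, q, r, out)
```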
Bounding sphere methods [Ciaccia et al. 1997]
• Choose centers C1, C2, C3, …
• Partition the data into bounding spheres (C1, R(C1)), (C2, R(C2)), (C3, R(C3)), each covering the points assigned to its center
Difficulties and problems
• No coordinates: mathematical tools are not directly applicable
• Mostly heuristic: lack of theoretical analysis
• 3 families of indices: not unified; hard to compare, analyze and predict
(SISAP 2010 Best Paper)
General Methodology
• Metric space → Rk (pivot selection)
• A range query maps to a query cube in Rk, answered with multi-dimensional indexing (data partition)
• Direct evaluation of the candidates in the cube (post-processing)
Pivot space
Given a set of pivots P = {p1, p2, …, pk} ⊆ S, define the mapping F: M → Rk, x ↦ (d(x, p1), d(x, p2), …, d(x, pk)).
Pivot space: the image of S in Rk.
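A minimal sketch of this mapping and of the query cube it induces for a range query R(q, r); the cube follows from the triangle-inequality bound on each coordinate:

```python
def pivot_map(x, pivots, d):
    """F(x) = (d(x, p1), ..., d(x, pk)): the pivot-space image of x."""
    return tuple(d(x, p) for p in pivots)

def query_cube(q, r, pivots, d):
    """Per-coordinate intervals [d(q, pi) - r, d(q, pi) + r]; by the triangle
    inequality every answer of R(q, r) maps into this cube, so anything
    outside it can be pruned before any direct distance evaluation."""
    return [(a - r, a + r) for a in pivot_map(q, pivots, d)]
```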
Complete pivot space
Let all the points be pivots: P = S, so F: M → Rn with the L∞ distance
L∞((a1, a2, …, an), (b1, b2, …, bn)) = maxi |ai − bi|
Under L∞ the complete pivot-space mapping is isometric on S: maxi |d(x, xi) − d(y, xi)| = d(x, y).
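A quick check of this isometry (the triangle inequality caps each coordinate difference at d(x, y), and the coordinate for pivot x attains it), using random Euclidean points as an illustrative metric space:

```python
import math, random

def d(a, b):                           # an example metric: Euclidean distance in R^2
    return math.dist(a, b)

S = [(random.random(), random.random()) for _ in range(50)]
x, y = S[0], S[1]
# L-infinity distance between the complete pivot-space images of x and y
linf = max(abs(d(x, p) - d(y, p)) for p in S)
assert abs(linf - d(x, y)) < 1e-12     # isometry: the max is attained at pivot x
```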
Distance-based indexing mirrors high-dimensional indexing (figure):
General metric space → (isometric mapping) → high-dimensional vector space → (dimension reduction = pivot selection) → low-dimensional vector space → (multi-dimensional indexing = data partition) → result set in low-dim space → (sequential comparison) → result set in metric space
3. Pivot Selection
Dimension reduction for distance-based indexing:
• Can we answer queries directly in the complete pivot space?
• Can we apply dimension reduction to the complete pivot space?
• Why is pivot selection important?
• How to select pivots?
3.1 Answer queries directly in the complete pivot space?
Theorem: Evaluation of similarity queries in the complete pivot space degrades the query performance to a linear scan.
• Therefore, dimension reduction is inevitable.
3.2 Dimension reduction for the complete pivot space?
Theorem: If a dimension reduction technique creates new dimensions based on all existing dimensions, evaluation of similarity queries degrades to a linear scan.
• Therefore, we can only select among existing dimensions: pivot selection.
3.3 Why is pivot selection important?
• Building an index tree is a process of information loss
• The information available for data partition is determined by pivot selection
Example: points A, B, C at positions 1, 2, 3 on the real line.

point:          A  B  C
original value: 1  2  3
d(x, A):        0  1  2
d(x, C):        2  1  0
d(x, B):        1  0  1  (A and C collide: information is lost)
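The loss is easy to reproduce: mapping the same three points under each single pivot shows that pivot B merges A and C (a sketch with A, B, C as reals under absolute difference):

```python
A, B, C = 1, 2, 3                      # the three points on the real line
d = lambda x, pivot: abs(x - pivot)    # one-dimensional metric
for pivot in (A, B, C):
    print(pivot, [d(x, pivot) for x in (A, B, C)])
# pivot A -> [0, 1, 2]; pivot C -> [2, 1, 0]; pivot B -> [1, 0, 1]: A and C collide
```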
Importance of pivot selection
Uniformly distributed points in the unit square. (Figure: scatter plots of distances to the two pivots, comparing pivots (0,0) and (1,1) against pivots (0,0) and (1,0).)
Importance of pivot selection
14-bit Hamming strings ("0/1" strings). (Figure: scatter plots of distances to the two pivots, comparing opposite-corner pivots 00 0000 0000 0000 and 11 1111 1111 1111 against neighboring-corner pivots 00 0000 0000 0000 and 00 0000 0111 1111.)
3.4 How to select pivots?
• Heuristic: for each new dimension, select the point with the largest projection on that new dimension in the pivot space
• Use mathematical tools available in Rn
• Yet what is a good objective function for pivot selection?
• Empirically: select pivots, build the tree, run queries, and measure the average query speed
4. PCA for distance-based indexing • Pivot selection • Estimate the intrinsic dimension
PCA for pivot selection
• Run PCA on the complete pivot space
• Apply the heuristic: for each principal component, find the most similar (minimal angle) dimension (point) of the complete pivot space, as sketched below
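A sketch under one reading of this heuristic: build the n×n complete pivot-space matrix, run PCA, and for each principal component pick the coordinate axis (hence the point) at minimal angle to it. Since cos(angle(pc, e_i)) = |pc[i]| / ||pc||, the closest axis is the one with the largest absolute loading:

```python
import numpy as np

def select_pivots(D: np.ndarray, k: int) -> list:
    """Select k pivots from an n x n pairwise distance matrix D, where row j
    is the complete pivot-space image of point j. For each of the top-k
    principal components, choose the axis (point) with the largest absolute
    loading, i.e., the minimal angle to that component."""
    X = D - D.mean(axis=0)                         # center the pivot-space data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pivots = []
    for pc in Vt[:k]:                              # PCs in decreasing variance order
        for i in np.argsort(-np.abs(pc)):          # axes by decreasing |loading|
            if int(i) not in pivots:               # keep pivots distinct across PCs
                pivots.append(int(i))
                break
    return pivots
```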
Estimate the intrinsic dimension (three estimators):
1. From pairwise distances: ρ = μ² / (2σ²), where μ and σ² are the mean and variance of the pairwise distance distribution
2. From range-query growth: |Range(q,r)| ∝ r^d, so fit the linear regression log(|Range(q,r)|) = d·log(r) + c and read off the slope d
3. From where the PCA eigenvalues change the most: argmaxi (λi / λi+1)
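Sketches of the three estimators; the inputs are assumed to be supplied by the caller (a sample of pairwise distances, empirical (r, |Range(q,r)|) pairs, and PCA eigenvalues sorted in decreasing order):

```python
import numpy as np

def rho_estimate(dists: np.ndarray) -> float:
    """Estimator 1: rho = mu^2 / (2 sigma^2) from sampled pairwise distances."""
    return dists.mean() ** 2 / (2.0 * dists.var())

def regression_estimate(radii: np.ndarray, counts: np.ndarray) -> float:
    """Estimator 2: slope d of the fit log|Range(q, r)| = d * log(r) + c."""
    d, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return d

def eigengap_estimate(eigvals: np.ndarray) -> int:
    """Estimator 3: argmax_i (lambda_i / lambda_{i+1}) over decreasing eigenvalues."""
    ratios = eigvals[:-1] / eigvals[1:]
    return int(np.argmax(ratios)) + 1              # number of dimensions to keep
```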
5. Future work
• Other dimension reduction methods
• Objective function for pivot selection, based on pair-wise distances
• Multi-variable regression methods:
  • Variable selection: forward selection, backward elimination
  • Choice of the response y: mean? standard deviation?
  • Non-linear regression
Credit
• Daniel P. Miranker, UT Austin
• Willard L. Miranker, Yale University
Rui Mao, Willard L. Miranker and Daniel P. Miranker, "Dimension Reduction for Distance-Based Indexing", in Proceedings of the Third International Conference on SImilarity Search and APplications (SISAP 2010), pages 25-32, Istanbul, Turkey, September 18-19, 2010.
Thank you! mao@szu.edu.cn http://nhpcc.szu.edu.cn/mao/eng