
Distance-Based Indexing: Applications in Bioinformatics & the Pivot Space Model

Distance-Based Indexing: Applications in Bioinformatics & the Pivot Space Model. Rui Mao National High Performance Computing Center at Shenzhen College of Computer Science and Software Engineering Shenzhen University, China 02/16/2011. Outline. Similarity query and biological applications


Presentation Transcript


  1. Distance-Based Indexing: Applications in Bioinformatics & the Pivot Space Model Rui Mao National High Performance Computing Center at Shenzhen College of Computer Science and Software Engineering Shenzhen University, China 02/16/2011

  2. Outline • Similarity query and biological applications • Indexing for similarity query • Distance-based (metric space) indexing • The pivot space model

  3. 1. Similarity Query. Given: • A database of n data records: S = {x1, x2, …, xn} • A similarity (distance) measure d(x,y) = the distance between data records x and y • A query q. Range query R(q,r): all records within distance r of q. KNN query (k-nearest neighbor): the k records closest to q, e.g. Google Maps top 10 results. [Figure: query point q with radius r]

  4. Example 1 • Find all students with score in [75, 85]: SELECT name FROM student WHERE ABS(score-80)<=5;
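
The range and KNN queries can be sketched as a plain linear scan, which is the baseline any index tries to beat. This is a minimal illustration, not the MoBIoS implementation; the function names and the score list are hypothetical, and the 1-D distance mirrors the ABS(score-80)<=5 condition of Example 1.

```python
import heapq

def range_query(data, q, r, dist):
    """R(q, r): every record x with dist(q, x) <= r, found by linear scan."""
    return [x for x in data if dist(q, x) <= r]

def knn_query(data, q, k, dist):
    """KNN: the k records with the smallest distance to q, by linear scan."""
    return heapq.nsmallest(k, data, key=lambda x: dist(q, x))

def abs_diff(a, b):
    """One-dimensional distance: absolute value of the difference."""
    return abs(a - b)

# Hypothetical student scores, used only for illustration.
scores = [62, 76, 79, 80, 85, 86, 93]
in_range = range_query(scores, 80, 5, abs_diff)   # scores in [75, 85]
nearest3 = knn_query(scores, 80, 3, abs_diff)     # three scores closest to 80
```

Both functions call the distance once per record, so the cost is n distance calculations per query.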

  5. Example 2: Gas station near Purdue

  6. Molecular Biological Information System (MoBIoS) http://www.cs.utexas.edu/~mobios

  7. Image retrieval [CIT05]

  8. Conserved primer pair [ISMB04] Given: • Arabidopsis genome (120M) • Rice genome (537M) Goal: • Determine a large number of paired, conserved DNA primers that may be used as primer pairs for PCR. Similarity: • Hamming distance of 18-mers
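
The Hamming distance used for 18-mers counts the positions at which two equal-length strings differ. A minimal sketch follows; the two 18-mers are made-up sequences, not data from the cited work.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

# Hypothetical 18-mers that differ at two positions.
primer_a = "ACGTACGTACGTACGTAC"
primer_b = "ACGTACGAACGTACGTTC"
mismatches = hamming(primer_a, primer_b)
```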

  9. Mass-spectra coarse filter [Bioinformatics06] Given: • A mass-spectra database • A query mass spectrum (high-dim vector) Goal: • A coarse filter: retrieve a small subset of the database as candidates for fine filtering. Similarity: • Semi-cosine distance

  10. Protein sequence homology [BIBE06] Given • A database of sequences • A query sequence Goal: • Local alignment Similarity: • Global alignment of 6-mers with mPAM matrix (weighted edit distance) Methodology • Break database and query into k-mers • Similarity query of k-mers • Chain the results.
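
The first step of the methodology, breaking a sequence into k-mers, might look like the sketch below. It assumes overlapping k-mers taken at every offset, which the slide does not specify; the offset is kept so that hits can later be chained. The sequence is a made-up example, and the mPAM-weighted distance is not reproduced here.

```python
def kmers(seq: str, k: int):
    """All overlapping k-mers of seq, each paired with its start offset."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

# Hypothetical protein fragment, split into 6-mers.
fragments = kmers("MKVLAAGIV", 6)
```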

  11. 2. Indexing for similarity query • Goal: fast data lookup • Minimize the number of distance calculations • Ideal case: logarithmic or even constant time • Worst case: sequential scan of the database • Methodology: partition and pruning

  12. Category: data type & similarity
      • Data type: one-dimensional, R. Similarity measure: Euclidean norm (absolute value of difference). Index: one-dimensional indexing. Example: B-tree
      • Data type: multi-dimensional, R^n. Similarity measure: Euclidean norm. Index: multi-dimensional indexing. Example: kd-tree
      • Data type: other type. Similarity measure: other measurement. Index: ? Example: ?

  13. 3. Distance-based indexing. Metric space: a pair M = (D, d), where • D is a set of points • d is a [metric] distance function with the following properties: • d(x,y) = d(y,x) (symmetry) • d(x,y) >= 0, and d(x,y) = 0 iff x = y (non-negativity) • d(x,z) <= d(x,y) + d(y,z) (triangle inequality). [Figure: triangle on points x, y, z with sides d(x,y), d(y,z), d(x,z)]
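
One way to make the three properties concrete is to test them by brute force on a small sample. The helper below is a hypothetical sketch; passing it on a sample does not prove that d is a metric on all of D, but a failure is a definite counterexample.

```python
from itertools import product

def is_metric_on(points, d, tol=1e-12):
    """Check symmetry, non-negativity and the triangle inequality on a sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0 or abs(d(x, y) - d(y, x)) > tol:
            return False              # negative or asymmetric
        if (d(x, y) <= tol) != (x == y):
            return False              # d(x, y) = 0 must hold iff x = y
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False              # triangle inequality violated
    return True

sample = [0, 1, 3, 7]
abs_ok = is_metric_on(sample, lambda a, b: abs(a - b))
# Squared difference breaks the triangle inequality: (0-3)^2 > (0-1)^2 + (1-3)^2.
sq_ok = is_metric_on(sample, lambda a, b: (a - b) ** 2)
```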

  14. How does it work? Range query R(Snoopy, 2): d(Michael, Linc) = 1 and d(Linc, Snoopy) = 100, so by the triangle inequality 99 <= d(Michael, Snoopy) <= 101. Since 99 > 2, Michael is excluded without computing d(Michael, Snoopy). Advantages • Generality: one-dimensional data; multi-dimensional data with Euclidean norm; any metric space • A uniform programming model: the distance oracle is given • One index mechanism for most. Disadvantages • Not fast enough?
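
The pruning argument rests on the two-sided bound |d(x,p) - d(p,q)| <= d(x,q) <= d(x,p) + d(p,q), which follows from the triangle inequality. A small sketch with the slide's numbers (the function names are illustrative):

```python
def triangle_bounds(d_xp, d_pq):
    """Lower and upper bound on d(x, q), given d(x, p) and d(p, q)."""
    return abs(d_xp - d_pq), d_xp + d_pq

def can_prune(d_xp, d_pq, r):
    """True if x cannot be in R(q, r), so d(x, q) need not be computed."""
    lower, _ = triangle_bounds(d_xp, d_pq)
    return lower > r

# x = Michael, p = Linc, q = Snoopy
bounds = triangle_bounds(1, 100)
pruned = can_prune(1, 100, 2)
```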

  15. Data partition: three families
      • Hyper-plane methods: GHT [Uhlmann 1991], GNAT [Brin 1995], SA-tree [Navarro 1999]
      • Vantage point methods: BKT [Burkhard and Keller 1973], VPT [Uhlmann 1991, Yianilos 1993], MVPT [Bozkaya et al. 1997]
      • Bounding sphere methods: BST [Kalantari and McDonald 1983], M-tree [Ciaccia et al. 1997], Slim-tree [Traina et al. 2000]

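  16. Hyper-plane methods [Uhlmann 1991] • Choose centers C1, C2 • Partition the data by the boundary L into "Left of L" and "Right of L". [Figure: centers C1 and C2 separated by L; tree node (C1, C2) with children "Left of L" and "Right of L"]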
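
A single level of hyper-plane partitioning can be sketched as follows. The sketch assumes that "left of L" means "at least as close to C1 as to C2"; the pruning test (skip a side when the query is more than 2r closer to the other center) follows from the triangle inequality. Names and data are hypothetical, and a real GHT would recurse on both sides.

```python
def ght_partition(data, c1, c2, dist):
    """Split data by the generalized hyperplane between centers c1 and c2."""
    left = [x for x in data if dist(x, c1) <= dist(x, c2)]
    right = [x for x in data if dist(x, c1) > dist(x, c2)]
    return left, right

def ght_range_query(left, right, c1, c2, q, r, dist):
    """Answer R(q, r), scanning a side only if it may contain results."""
    d1, d2 = dist(q, c1), dist(q, c2)
    result, visited = [], []
    if d1 - d2 <= 2 * r:            # left side cannot be ruled out
        visited.append("left")
        result += [x for x in left if dist(q, x) <= r]
    if d2 - d1 <= 2 * r:            # right side cannot be ruled out
        visited.append("right")
        result += [x for x in right if dist(q, x) <= r]
    return result, visited

def abs_diff(a, b):
    return abs(a - b)

points = [1, 2, 3, 4, 10, 11, 12, 13]   # hypothetical 1-D data
left, right = ght_partition(points, 2, 12, abs_diff)
hits, visited = ght_range_query(left, right, 2, 12, 3, 1, abs_diff)
```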

  17. Vantage Point Tree (VPT) [Uhlmann 1991 & Yianilos 1993] • Choose vantage points • Partition the data. Range query R(q,r) at node (VP1, R1): Case 1. If d(VP1,q) > R1 + r then search outside the sphere. Case 2. If d(VP1,q) < R1 - r then search inside the sphere. Case 3. Bad case: query object close to partition boundary, descend both children. [Figure: root (VP1, R1) splits the data into d(VP1, x) ≤ R1 and d(VP1, x) > R1; children (VP21, R21) and (VP22, R22) split again, e.g. d(VP22, x) ≤ R22 and d(VP22, x) > R22; query q at distance d(VP1,q) with radius r]
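
The three cases translate directly into a recursive search. The sketch below is a simplified vantage point tree, not the cited authors' code: it assumes the first point of each subset is the vantage point and the median distance is the radius, both chosen for brevity.

```python
import statistics

class Leaf:
    def __init__(self, items):
        self.items = items

class Node:
    def __init__(self, vp, radius, inside, outside):
        self.vp, self.radius = vp, radius
        self.inside, self.outside = inside, outside

def build_vpt(points, dist, leaf_size=2):
    """Build a VP tree from a list of points; leaf_size must be >= 1."""
    if len(points) <= leaf_size:
        return Leaf(list(points))
    vp, rest = points[0], points[1:]      # assumption: first point is the pivot
    dists = [dist(vp, x) for x in rest]
    radius = statistics.median(dists)     # assumption: split at the median
    inside = [x for x, dx in zip(rest, dists) if dx <= radius]
    outside = [x for x, dx in zip(rest, dists) if dx > radius]
    return Node(vp, radius,
                build_vpt(inside, dist, leaf_size),
                build_vpt(outside, dist, leaf_size))

def vpt_range_query(node, q, r, dist):
    """Return all points x with dist(q, x) <= r."""
    if isinstance(node, Leaf):
        return [x for x in node.items if dist(q, x) <= r]
    d = dist(node.vp, q)
    result = [node.vp] if d <= r else []
    if d > node.radius + r:               # Case 1: only outside the sphere
        result += vpt_range_query(node.outside, q, r, dist)
    elif d < node.radius - r:             # Case 2: only inside the sphere
        result += vpt_range_query(node.inside, q, r, dist)
    else:                                 # Case 3: descend both children
        result += vpt_range_query(node.inside, q, r, dist)
        result += vpt_range_query(node.outside, q, r, dist)
    return result

def abs_diff(a, b):
    return abs(a - b)

pts = [17, 3, 29, 8, 41, 12, 35, 22, 5, 38]   # hypothetical 1-D data
tree = build_vpt(pts, abs_diff)
hits = sorted(vpt_range_query(tree, 10, 3, abs_diff))
```

Case 3 is the costly one: the closer q lies to the partition boundary, the more subtrees must be visited.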

  18. Bounding sphere methods [Ciaccia et al. 1997] • Choose centers C1, C2, C3 • Partition the data into bounding spheres (C1, R(C1)), (C2, R(C2)), (C3, R(C3))
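
A one-level sketch of the bounding sphere idea: each record is assigned to its nearest center, each center keeps a covering radius R(C), and a sphere is skipped when d(q, C) > R(C) + r. This is a flat illustration under those assumptions, not an M-tree; the centers and data are hypothetical.

```python
def build_spheres(data, centers, dist):
    """Group records by nearest center; return (center, radius, members)."""
    clusters = {c: [] for c in centers}
    for x in data:
        nearest = min(centers, key=lambda c: dist(x, c))
        clusters[nearest].append(x)
    return [(c, max((dist(c, x) for x in members), default=0), members)
            for c, members in clusters.items()]

def sphere_range_query(spheres, q, r, dist):
    """Answer R(q, r), skipping spheres that cannot intersect the query ball."""
    result = []
    for center, radius, members in spheres:
        if dist(q, center) > radius + r:
            continue                      # whole sphere pruned
        result.extend(x for x in members if dist(q, x) <= r)
    return result

def abs_diff(a, b):
    return abs(a - b)

records = [1, 2, 3, 10, 11, 12, 20, 21, 22]
spheres = build_spheres(records, [2, 11, 21], abs_diff)
hits = sphere_range_query(spheres, 10, 1, abs_diff)
```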

  19. Difficulties and problems • No coordinates • Mathematical tools not directly applicable • Mostly heuristic • Lack of theoretical analysis • 3 families of indices • Not unified • Hard to compare, analyze and predict

  20. 4. The pivot space model. General methodology: • metric space → R^k • multi-dimensional indexing → query cube • direct evaluation of cube

  21. P S Pivot space Mapping: M  Rk :x  Pivot space: The image of S in Rk

  22. Complete pivot space. Let all the points be pivots: P = S, M → R^n. L∞((a1,a2,…,an),(b1,b2,…,bn)) = max_i |ai - bi|
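
With every point used as a pivot, the L∞ distance between two images equals the original distance: the triangle inequality gives |d(x,p) - d(y,p)| <= d(x,y) for every pivot p, and the pivot p = y attains it. The sketch below checks this on a few made-up strings under Hamming distance; all names are illustrative.

```python
def hamming(a, b):
    """Number of differing positions between equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def linf(a, b):
    """L-infinity distance: the largest coordinate-wise difference."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

def complete_pivot_space(data, dist):
    """Map every point to its distances to all n points (P = S)."""
    return {x: tuple(dist(x, p) for p in data) for x in data}

data = ["ACGT", "ACGA", "TCGA", "TTTT"]
images = complete_pivot_space(data, hamming)
isometric = all(linf(images[x], images[y]) == hamming(x, y)
                for x in data for y in data)
```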

  23. Distance-based indexing → High dimensional indexing. [Flowchart: General metric space → (isometric mapping) → High-dimensional vector space → (dimension reduction) → Low-dimensional vector space → (multi-dimensional indexing, data partition) → Result set in low-dim space → (sequential comparison) → Result set in metric space] Are we done?

  24. Two distinctions • 1. Pivot selection vs. dimension reduction • 2. query ball vs. query cube
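
The second distinction can be illustrated with a filter-and-refine sketch. Every result of the query ball R(q,r) maps inside the cube of half-width r around the image of q, because |d(x,p) - d(q,p)| <= d(x,q) for each pivot p. The cube can also contain non-results, so its candidates are verified with the real distance. Data, pivots and names are hypothetical; Euclidean distance via math.dist stands in for an arbitrary metric.

```python
import math

def pivot_map(x, pivots, dist):
    """Distances from x to each pivot."""
    return tuple(dist(x, p) for p in pivots)

def cube_candidates(images, q_image, r):
    """Objects whose image lies in the L-infinity cube of half-width r."""
    return [x for x, img in images.items()
            if all(abs(a - b) <= r for a, b in zip(img, q_image))]

data = [(0, 0), (1, 1), (2, 0), (5, 5), (6, 4), (9, 9)]
pivots = [(0, 0), (9, 9)]
q, r = (1, 0), 1.5

images = {x: pivot_map(x, pivots, math.dist) for x in data}
candidates = cube_candidates(images, pivot_map(q, pivots, math.dist), r)
results = [x for x in candidates if math.dist(q, x) <= r]   # refine step
```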

  25. 4.1 Pivot selection: dimension reduction for distance-based indexing (SISAP 2010 Best Paper) 1. Answer queries directly in the complete pivot space? Dimension reduction is inevitable. 2. Dimension reduction for the complete pivot space? Can only select existing dimensions: pivot selection. 3. How to select pivots? Use an R^n method to create a new dimension, then find the closest existing dimension (pivot).

  26. 4.2 Hyperplane partition in pivot space. [Figure, two panels. General Hyperplane Tree (GHT): axes x = d(p1, xi) and y = d(p2, xi), partition boundary L: y = x. Multiple Vantage Point Tree (MVPT): axes x = d(vp1, xi) and y = d(vp2, xi), split values d11, d21, d22, partitions child1 to child4]

  27. Complete GHT (CGHT). [Figure panels, with pivots p1 and p2: GHT: pivot space; CGHT: pivot space; MVPT: pivot space; MVPT: metric space; CGHT: metric space]

  28. r-neighborhood. Nr(L), the r-neighborhood of a partition boundary L, is the region around L in the pivot space such that, if a query object q falls into it, R(q,r) could have results on both sides of L. • Assuming q has the same distribution as the database, |Nr(L)| dominates query performance. • Width & density

  29. Width of the r-neighborhood. (a) Special case L: x = μ. Nr(L): |x - μ| ≤ r. Width = 2r. (b) Special case L: y = x. Nr(L): |y - x| ≤ 2r, the band between y = x - 2r and y = x + 2r. Min width of r-neighborhood: the MVPT partition has the minimal width of r-neighborhood. [Figure: axes d(p1, x) and d(p2, x), query q with radius r]
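
The two special cases can be compared numerically. The membership tests below restate the conditions |x - μ| ≤ r and |y - x| ≤ 2r. The width of the band in case (b) is computed here as the perpendicular distance between the lines y = x + 2r and y = x - 2r, which is an assumption about how width is measured; the transcript does not give the value for that case.

```python
import math

def in_vantage_neighborhood(d_q_p, mu, r):
    """Case (a), L: x = mu. q must descend both children if |x - mu| <= r."""
    return abs(d_q_p - mu) <= r

def in_hyperplane_neighborhood(d_q_p1, d_q_p2, r):
    """Case (b), L: y = x. q must descend both children if |y - x| <= 2r."""
    return abs(d_q_p2 - d_q_p1) <= 2 * r

def vantage_width(r):
    """Width of the strip |x - mu| <= r."""
    return 2 * r

def hyperplane_width(r):
    """Perpendicular width of the band |y - x| <= 2r (assumed measure)."""
    return 4 * r / math.sqrt(2)
```

Under this assumption the band in case (b) is a factor of sqrt(2) wider than the strip in case (a), which is consistent with the statement that the vantage point style of split has the smaller width.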

  30. |Nr(L)|: analytical & empirical • 2-d normal dist.: N(0, 1, 0, 1, -ρ), 0 ≤ ρ ≤ 1 • Empirical results. |NLV(r)| ∝ PLV(r) = P(|x| ≤ r | x ~ N(0,1))
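
The quantity P(|x| ≤ r | x ~ N(0,1)) has the closed form erf(r / sqrt(2)), so it can be evaluated with the standard library. The function name is illustrative.

```python
import math

def p_within(r):
    """P(|x| <= r) for a standard normal variable x."""
    return math.erf(r / math.sqrt(2))

# Probability mass within one and two standard deviations of the mean.
p1, p2 = p_within(1.0), p_within(2.0)
```

The value grows quickly with r, which matches the intuition that larger query radii put more queries near the boundary.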

  31. Dimension rotation might not be helpful! • A counter example

  32. Conclusions and Future work Conclusions • Distance-based indexing is a very general approach • Pivot space model establishes an analogy between distance-based indexing and high dimensional indexing. • There are two distinctions between them. Future work • Multi-dimensional/statistical methods • Non-linear partition • Moving to cloud environment • Applications.

  33. Credit • Daniel P. Miranker, UT Austin • Willard L. Miranker, Yale University

  34. Thank you! mao@szu.edu.cn http://nhpcc.szu.edu.cn/mao/eng
