This paper presents new approaches to the Substring Near Neighbor (SNN) problem, i.e., text indexing with mismatches, which arises in applications such as computational biology. We construct a data structure on a given text that, for a query pattern, efficiently identifies near neighbor substrings. By building on Locality-Sensitive Hashing (LSH), we optimize both space and query time without prior knowledge of the pattern length. The result is a significant improvement in space complexity and query efficiency, making it a valuable contribution to approximate text searching.
Efficient Algorithms for Substring Near Neighbor Problem Alexandr Andoni Piotr Indyk MIT
What’s SNN? • SNN ≈ Text Indexing with mismatches • Text Indexing: • Construct a data structure on a text T[1..n], s.t. • Given query P[1..m], finds occurrences of P in T • Text indexing with mismatches: • Given P, find the substrings of T that are equal to P except in ≤R characters • Motivation: e.g., computational biology (BLAST) T= GAGTAACTCAATA P= AGTA
Outline • General approach • View: Near Neighbor in Hamming • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Approach (Or, why SNN?) • SNN = a near neighbor problem in Hamming metric with m dimensions: • Construct data structure on D={all substrings of T of length m}, s.t. • Given P, find a point in D that is at distance ≤R from P • Use a NN data structure for Hamming T= GAGTAACTCAATA D={GAGT, AGTA, GTAA, …, AATA} P= AGTA
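To make the reduction concrete, here is a minimal brute-force sketch (my own illustration, not the paper's data structure): it materializes the view of SNN as exact near neighbor search over the point set D of length-m substrings of T.

```python
def hamming(p, q):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))

def snn_brute_force(T, P, R):
    """Scan D = {length-m substrings of T} and return the start
    positions of every point within Hamming distance R of P."""
    m = len(P)
    return [i for i in range(len(T) - m + 1)
            if hamming(T[i:i + m], P) <= R]

# The slides' example: AGTA (exact match) and AATA (1 mismatch).
print(snn_brute_force("GAGTAACTCAATA", "AGTA", 1))  # [1, 9]
```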
Approximate NN • Exact NN seems hard (i.e., hard to solve without exponential space or O(n) query time) • Approximate NN is easier • Defined for approximation c=1+ε as: • OK to report a point at distance ≤cR (when there is a point at distance ≤R) [Figure: query point q with an inner radius R and an outer radius cR]
Our contribution • Problem: need m in advance for NN • Have to construct a data structure for each m≤M • Here: approx SNN data structure for unknown m • Without degradation in space or query time • Our algorithm for SNN based on LSH: • Supports patterns of length m≤M • Optimal* space: n^(1+1/c) • Optimal* query time: n^(1/c) • Slightly worse preprocessing time if c>3 • (* Optimal w.r.t. LSH, modulo subpoly factors) • Also extends to l1
Outline • General approach • View: Near Neighbor in Hamming • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Locality-Sensitive Hashing • Based on a family of hash functions {g} • For points P[1..m], Q[1..m]: • If dist(P,Q) ≤ R, Pr_g[g(P)=g(Q)] = “medium” • If dist(P,Q) > cR, Pr_g[g(P)=g(Q)] = “low” • Idea: • Construct L hash tables with random g1, g2, …, gL • For query P, look at buckets g1(P), g2(P), …, gL(P) • Space: L·n • Query time: L
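A minimal sketch of this generic scheme (the class and parameter names are mine): build L tables keyed by independent draws g_1..g_L from the family, and answer a query by scanning only the L buckets the query hashes to.

```python
from collections import defaultdict

class LSHIndex:
    """Generic LSH index: L hash tables over random functions g_1..g_L."""
    def __init__(self, points, sample_g, L):
        self.gs = [sample_g() for _ in range(L)]            # g_1 .. g_L
        self.tables = [defaultdict(list) for _ in range(L)]
        for p in points:
            for g, table in zip(self.gs, self.tables):
                table[g(p)].append(p)

    def query(self, q, dist, cR):
        """Look only at buckets g_1(q) .. g_L(q); report any point ≤ cR.
        (A full implementation would cap how many points it inspects.)"""
        for g, table in zip(self.gs, self.tables):
            for p in table[g(q)]:
                if dist(p, q) <= cR:
                    return p
        return None
```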
LSH for Hamming • Hash function g: • Projection on k random coordinates • E.g.: g1(“AGTA”)=“AA” (k=2) • L = #hash tables = n^(1/c) • k = log n / log(1/(1−cR/m)) < m·log n T= GAGTAACTCAATA D={GAGT, AGTA, GTAA, …, AATA} HT1: GT->GAGT AA->AGTA, AATA GA->GTAA … P= AGTA R=1
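A sketch of this Hamming family and its parameters (hedged: the rounding of k and L is my choice). One g projects onto k random coordinates, with k set so that far points (distance > cR) collide in a single table with probability about 1/n.

```python
import math
import random

def make_projection(m, k):
    """One g: project an m-character string onto k random coordinates."""
    coords = sorted(random.sample(range(m), k))
    return lambda s: "".join(s[j] for j in coords)

def lsh_params(n, m, R, c):
    """Per-table collision prob. is (1 - dist/m)^k for distance dist."""
    # Far points (dist > cR) should collide w.p. ~1/n, giving:
    k = math.ceil(math.log(n) / math.log(1.0 / (1.0 - c * R / m)))
    L = math.ceil(n ** (1.0 / c))      # number of hash tables
    return k, L

# Plugs into the generic index above:
#   k, L = lsh_params(n, m, R, c)
#   index = LSHIndex(D, lambda: make_projection(m, k), L)
```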
Outline • General approach • View: Near Neighbor in Hamming • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Unknown m • Bad news • k depends on m! • Distinct m ⇒ distinct hash tables T= GAGTAACTCAATA D={GAG, AGT, …, ACT, …} HT1: GG->GAG AT->AGT, ACT, … … P= AGT R=1 g1(“AGT”)=“AT”
Solution • Let’s just reuse the same data structure for all m • g(“AGTA”)=“AA” • On “AGT” we have to guess the last char • g(“AGT?”) = “A?” • Like in [exact] text indexing… T= GAGTAACTCAATA D={GAGT, AGTA, … ACTA, …} HT1: GT->GAGT AA->AGTA, AATA GA->GTAA AC->ACTC … P= AGT R=1
Tries*! • Replace HT1 with a trie on g1(suffixes) • Stop the search when the next sampled coordinate falls outside P • Same analysis! [Figure: trie over projected suffixes; the search for P=AGT stops at the branch matching “A?”, whose subtree holds AGTA, AATA, ACTC, AACT] T= GAGTAACTCAATA D={GAGT, AGTA, … ACTA, …} HT1: GT->GAGT AA->AGTA, AATA GA->GTAA AC->ACTC … P= AGT R=1 * Tries have been used with LSH before in [MS02], but in a different context
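The following is a rough sketch of the trie variant (the data layout and names are my own guesses at an implementation, not the authors' code). Each trie stores every suffix of T keyed by its projected characters in increasing coordinate order; a query for a short P stops descending once the next sampled coordinate falls beyond |P| and takes the whole subtree as its bucket.

```python
def build_trie(T, coords):
    """Trie over g(suffixes of T); coords = g's coordinates, ascending."""
    root = {}
    for i in range(len(T)):                    # one entry per suffix T[i..]
        node = root
        for j in coords:
            if i + j >= len(T):                # suffix too short: stop early
                break
            node = node.setdefault(T[i + j], {})
        node.setdefault("$starts", []).append(i)
    return root

def trie_query(root, P, coords):
    """Descend along g(P); stop when a coordinate falls outside P."""
    node = root
    for j in coords:
        if j >= len(P):                        # outside P: take the subtree
            break
        if P[j] not in node:
            return []
        node = node[P[j]]
    return collect(node)

def collect(node):
    """All stored start positions at or below this trie node."""
    starts = list(node.get("$starts", []))
    for key, child in node.items():
        if key != "$starts":
            starts.extend(collect(child))
    return starts
```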
Resulting performance • Space: • n^(1+1/c) (using compressed tries, one trie takes n space) • Optimal! • Query time: • n^(1/c)·m (m = length of P) • Not [yet] really optimal: originally, one could do dimensionality reduction • Can improve to n^(1/c) + m·n^(o(1)) • Preprocessing time: • n^(1+1/c)·M (M = max m) • Not optimal (optimal = n^(1+1/c)) • Can improve to n^(1+1/c) + M^(1/3)·n^(1+o(1)) • Optimal for c<3
Outline • General approach • View: Near Neighbor in Hamming • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Better query & preprocessing • Redesign LSH to improve query and preprocessing: • Query: n^(1/c)·m → n^(1/c) + m·n^(o(1)) • Preprocessing: n^(1+1/c)·M → n^(1+1/c) + n^(1+o(1))·M • Idea for new LSH • Use the same # of hash tables/tries (# = L = n^(1/c)) • But use “less randomness” in choosing hash functions g1, g2, …, gL • S.t. each gi looks random, but the g’s are not independent
New LSH scheme • Old scheme: • Choose L hash functions gi • Each gi = projection on k random coordinates • New scheme: • Construct the L functions gi from a smaller number of “base” hash functions • A “base” hash function = projection on k/2 random coordinates • {gi, i=1..L} = all pairs of “base” hash functions • Need only ~L^(1/2) “base” hash functions!
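A sketch of the pairing construction (naming mine): pick the smallest w with C(w,2) ≥ L, draw w base projections on k/2 coordinates each, and form each g as the concatenation of a pair of base functions.

```python
import math
import random
from itertools import combinations

def make_paired_family(m, k, L):
    """L functions g built from only ~sqrt(2L) base projections."""
    w = math.ceil((1 + math.sqrt(1 + 8 * L)) / 2)   # smallest w: C(w,2) >= L
    base = [sorted(random.sample(range(m), k // 2)) for _ in range(w)]
    us = [lambda s, c=c: "".join(s[j] for j in c) for c in base]
    gs = [lambda s, a=a, b=b: a(s) + b(s) for a, b in combinations(us, 2)]
    return us, gs[:L]
```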
Example • k=4, w = #base functions = 4, L = (w choose 2) = (4 choose 2) = 6 • Base functions u1, u2, u3, u4: each a projection on k/2 = 2 random coordinates • g1 = <u1, u2>, g2 = <u1, u3>, g3 = <u1, u4>, …
Saving time • Can save time since there are fewer “base” hash functions • E.g.: computing fingerprints • Want to compute FP(gi(P)) for i=1..L • FP(gi(P)) = (Σj P[j]·χij·2^j) mod prime, where χij = 1 iff gi samples coordinate j • Old way • Would take L·m time for L functions g • New way • Takes L^(1/2)·m time for L^(1/2) functions ui • Need only L time to combine the FP(u(P)) into the FP(g(P)) • If g = <u1, u2>, then FP(g(P)) = (FP(u1(P)) + FP(u2(P))) mod prime • Total: L + L^(1/2)·m
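A sketch of the fingerprint bookkeeping (PRIME and the character encoding are my choices; it also assumes the two base coordinate sets behind each g are disjoint, which the additive combination requires):

```python
from itertools import combinations

PRIME = (1 << 61) - 1   # an arbitrary large prime, not taken from the paper

def fp(P, coords):
    """FP over one coordinate set: sum of P[j]*2^j for sampled j, mod PRIME."""
    return sum(ord(P[j]) * pow(2, j, PRIME) for j in coords) % PRIME

def all_g_fingerprints(P, base_coord_sets):
    # L^(1/2) * m work: one fingerprint per base function u_i ...
    base_fps = [fp(P, coords) for coords in base_coord_sets]
    # ... then O(1) per g = <u_a, u_b>: with disjoint coordinate sets,
    # the 2^j terms do not overlap and the fingerprints simply add.
    return [(fa + fb) % PRIME for fa, fb in combinations(base_fps, 2)]
```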
Better query & preproc (2) • E.g., for the query • Use fingerprints to leap faster in the trie • Yields time n^(1/c) + n^(1/(2c))·m (since L = n^(1/c)) • To get n^(1/c) + n^(o(1))·m, generalize: • g = tuple of t base functions • A base function = projection on k/t random coordinates • Other details similar to the fingerprints
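The t-tuple generalization, sketched under the same hypothetical naming as above: w base projections on k/t coordinates each, with w grown until C(w,t) ≥ L.

```python
import math
import random
from itertools import combinations

def make_tuple_family(m, k, L, t):
    """L functions g, each the concatenation of t base projections."""
    w = t
    while math.comb(w, t) < L:                  # smallest w with C(w,t) >= L
        w += 1
    base = [sorted(random.sample(range(m), k // t)) for _ in range(w)]
    gs = [lambda s, tup=tup: "".join(s[j] for c in tup for j in c)
          for tup in combinations(base, t)]
    return gs[:L]
```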
Better preprocessing (3) • Preprocessing: can get n^(1+1/c) + n^(1+o(1))·M • Can improve to n^(1+1/c) + n^(1+o(1))·M^(1/3) • Can construct a trie in n·M^(1/3) time (instead of n·M) • Using FFT, etc.
Outline • General approach • View: Near Neighbor problem in Hamming metric • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution = LSH + Tries • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Conclusions • Problem: • Substring Near Neighbor (a.k.a., text indexing with mismatches) • Approach: • View as NN in m-dimensional Hamming • Use LSH • Challenge: • Variable-length pattern w/o degradation in performance • Solution: • Space/query optimal (w.r.t. LSH) • Preprocessing optimal (w.r.t. LSH) for c<3
Extensions • Extends to l1 • Nontrivial, since we need quite different LSH functions • Preprocessing slightly worse: n^(1+1/c) + n^(1+o(1))·M^(2/3) • Uses the “Less-than-matching” problem [Amir-Farach’95]
Remarks • Other approaches? • Or, why LSH for SNN? • Since a better SNN ⇒ a better NN… • And LSH is the “best” known algorithm for high-dimensional NN (using reasonable space)