This paper presents new approaches to the Substring Near Neighbor (SNN) problem, i.e., text indexing with mismatches, which arises in applications such as computational biology. We construct a data structure on a given text that, for a query pattern, efficiently identifies near neighbor substrings. By building on Locality-Sensitive Hashing (LSH), we optimize both space and query time without prior knowledge of the pattern length. The result is a significant improvement in space complexity and query efficiency, making it a valuable contribution to approximate text searching.
Efficient Algorithms for Substring Near Neighbor Problem Alexandr Andoni Piotr Indyk MIT
What’s SNN? • SNN ≈ Text Indexing with mismatches • Text Indexing: • Construct a data structure on a text T[1..n], s.t. • Given query P[1..m], finds occurrences of P in T • Text indexing with mismatches: • Given P, find the substrings of T that are equal to P except in ≤R characters • Motivation: e.g., computational biology (BLAST) T= GAGTAACTCAATA P= AGTA
Outline • General approach • View: Near Neighbor in Hamming • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Approach (Or, why SNN?) • SNN = a near neighbor problem in Hamming metric with m dimensions: • Construct data structure on D={all substrings of T of length m}, s.t. • Given P, find a point in D that is at distance ≤R from P • Use a NN data structure for Hamming T= GAGTAACTCAATA D={GAGT, AGTA, GTAA, …, AATA} P= AGTA
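To make the reduction concrete, here is a minimal brute-force sketch (my own illustration, not the paper's data structure): it materializes the view of SNN as exact near neighbor search over the point set D of length-m substrings of T.

```python
def hamming(p, q):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))

def snn_brute_force(T, P, R):
    """Scan D = {length-m substrings of T} and return the start
    positions of every point within Hamming distance R of P."""
    m = len(P)
    return [i for i in range(len(T) - m + 1)
            if hamming(T[i:i + m], P) <= R]

# The slides' example: AGTA (exact match) and AATA (1 mismatch).
print(snn_brute_force("GAGTAACTCAATA", "AGTA", 1))  # [1, 9]
```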
Approximate NN • Exact NN seems hard (i.e., hard to solve without exponential space or O(n) query time) • Approximate NN is easier • Defined for approximation c=1+ε as: • OK to report a point at distance ≤cR (when there is a point at distance ≤R) [Figure: query point q with an inner radius R and an outer radius cR]
Our contribution • Problem: need m in advance for NN • Have to construct a data structure for each m≤M • Here: approx SNN data structure for unknown m • Without degradation in space or query time • Our algorithm for SNN based on LSH: • Supports patterns of length m≤M • Optimal* space: n^(1+1/c) • Optimal* query time: n^(1/c) • Slightly worse preprocessing time if c>3 • (* Optimal w.r.t. LSH, modulo subpoly factors) • Also extends to l1
Outline • General approach • View: Near Neighbor in Hamming • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Locality-Sensitive Hashing • Based on a family of hash functions {g} • For points P[1..m], Q[1..m]: • If dist(P,Q) ≤ R, Pr_g[g(P)=g(Q)] = “medium” • If dist(P,Q) > cR, Pr_g[g(P)=g(Q)] = “low” • Idea: • Construct L hash tables with random g1, g2, …, gL • For query P, look at buckets g1(P), g2(P), …, gL(P) • Space: L·n • Query time: L
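A minimal sketch of this generic scheme (the class and parameter names are mine): build L tables keyed by independent draws g_1..g_L from the family, and answer a query by scanning only the L buckets the query hashes to.

```python
from collections import defaultdict

class LSHIndex:
    """Generic LSH index: L hash tables over random functions g_1..g_L."""
    def __init__(self, points, sample_g, L):
        self.gs = [sample_g() for _ in range(L)]            # g_1 .. g_L
        self.tables = [defaultdict(list) for _ in range(L)]
        for p in points:
            for g, table in zip(self.gs, self.tables):
                table[g(p)].append(p)

    def query(self, q, dist, cR):
        """Look only at buckets g_1(q) .. g_L(q); report any point ≤ cR.
        (A full implementation would cap how many points it inspects.)"""
        for g, table in zip(self.gs, self.tables):
            for p in table[g(q)]:
                if dist(p, q) <= cR:
                    return p
        return None
```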
LSH for Hamming • Hash function g: • Projection on k random coordinates • E.g.: g1(“AGTA”)=“AA” (k=2) • L = #hash tables = n^(1/c) • k = log n / log(1/(1−cR/m)) < m·log n T= GAGTAACTCAATA D={GAGT, AGTA, GTAA, …, AATA} HT1: GT->GAGT AA->AGTA, AATA GA->GTAA … P= AGTA R=1
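A sketch of this Hamming family and its parameters (hedged: the rounding of k and L is my choice). One g projects onto k random coordinates, with k set so that far points (distance > cR) collide in a single table with probability about 1/n.

```python
import math
import random

def make_projection(m, k):
    """One g: project an m-character string onto k random coordinates."""
    coords = sorted(random.sample(range(m), k))
    return lambda s: "".join(s[j] for j in coords)

def lsh_params(n, m, R, c):
    """Per-table collision prob. is (1 - dist/m)^k for distance dist."""
    # Far points (dist > cR) should collide w.p. ~1/n, giving:
    k = math.ceil(math.log(n) / math.log(1.0 / (1.0 - c * R / m)))
    L = math.ceil(n ** (1.0 / c))      # number of hash tables
    return k, L

# Plugs into the generic index above:
#   k, L = lsh_params(n, m, R, c)
#   index = LSHIndex(D, lambda: make_projection(m, k), L)
```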
Outline • General approach • View: Near Neighbor in Hamming • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Unknown m • Bad news • k depends on m! • Distinct m ⇒ distinct hash tables T= GAGTAACTCAATA D={GAG, AGT, …, ACT, …} HT1: GG->GAG AT->AGT, ACT, … … P= AGT R=1 g1(“AGT”)=“AT”
Solution • Let’s just reuse the same data structure for all m • g(“AGTA”)=“AA” • On “AGT” we have to guess the last char • g(“AGT?”) = “A?” • Like in [exact] text indexing… T= GAGTAACTCAATA D={GAGT, AGTA, … ACTA, …} HT1: GT->GAGT AA->AGTA, AATA GA->GTAA AC->ACTC … P= AGT R=1
Tries*! • Replace HT1 with a trie on g1(suffixes) • Stop the search when the next sampled coordinate falls outside P • Same analysis! [Figure: trie over projected suffixes; the search for P=AGT stops at the branch matching “A?”, whose subtree holds AGTA, AATA, ACTC, AACT] T= GAGTAACTCAATA D={GAGT, AGTA, … ACTA, …} HT1: GT->GAGT AA->AGTA, AATA GA->GTAA AC->ACTC … P= AGT R=1 * Tries have been used with LSH before in [MS02], but in a different context
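The following is a rough sketch of the trie variant (the data layout and names are my own guesses at an implementation, not the authors' code). Each trie stores every suffix of T keyed by its projected characters in increasing coordinate order; a query for a short P stops descending once the next sampled coordinate falls beyond |P| and takes the whole subtree as its bucket.

```python
def build_trie(T, coords):
    """Trie over g(suffixes of T); coords = g's coordinates, ascending."""
    root = {}
    for i in range(len(T)):                    # one entry per suffix T[i..]
        node = root
        for j in coords:
            if i + j >= len(T):                # suffix too short: stop early
                break
            node = node.setdefault(T[i + j], {})
        node.setdefault("$starts", []).append(i)
    return root

def trie_query(root, P, coords):
    """Descend along g(P); stop when a coordinate falls outside P."""
    node = root
    for j in coords:
        if j >= len(P):                        # outside P: take the subtree
            break
        if P[j] not in node:
            return []
        node = node[P[j]]
    return collect(node)

def collect(node):
    """All stored start positions at or below this trie node."""
    starts = list(node.get("$starts", []))
    for key, child in node.items():
        if key != "$starts":
            starts.extend(collect(child))
    return starts
```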
Resulting performance • Space: • n^(1+1/c) (using compressed tries, one trie takes n space) • Optimal! • Query time: • n^(1/c)·m (m = length of P) • Not [yet] really optimal: originally, one could do dimensionality reduction • Can improve to n^(1/c) + m·n^(o(1)) • Preprocessing time: • n^(1+1/c)·M (M = max m) • Not optimal (optimal = n^(1+1/c)) • Can improve to n^(1+1/c) + M^(1/3)·n^(1+o(1)) • Optimal for c<3
Outline • General approach • View: Near Neighbor in Hamming • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Better query & preprocessing • Redesign LSH to improve query and preprocessing: • Query: n^(1/c)·m → n^(1/c) + m·n^(o(1)) • Preprocessing: n^(1+1/c)·M → n^(1+1/c) + n^(1+o(1))·M • Idea for new LSH • Use the same # of hash tables/tries (# = L = n^(1/c)) • But use “less randomness” in choosing hash functions g1, g2, …, gL • S.t. each gi looks random, but the g’s are not independent
New LSH scheme • Old scheme: • Choose L hash functions gi • Each gi = projection on k random coordinates • New scheme: • Construct the L functions gi from a smaller number of “base” hash functions • A “base” hash function = projection on k/2 random coordinates • {gi, i=1..L} = all pairs of “base” hash functions • Need only ~L^(1/2) “base” hash functions!
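A sketch of the pairing construction (naming mine): pick the smallest w with C(w,2) ≥ L, draw w base projections on k/2 coordinates each, and form each g as the concatenation of a pair of base functions.

```python
import math
import random
from itertools import combinations

def make_paired_family(m, k, L):
    """L functions g built from only ~sqrt(2L) base projections."""
    w = math.ceil((1 + math.sqrt(1 + 8 * L)) / 2)   # smallest w: C(w,2) >= L
    base = [sorted(random.sample(range(m), k // 2)) for _ in range(w)]
    us = [lambda s, c=c: "".join(s[j] for j in c) for c in base]
    gs = [lambda s, a=a, b=b: a(s) + b(s) for a, b in combinations(us, 2)]
    return us, gs[:L]
```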
Example • k=4, w = #base functions = 4, L = (w choose 2) = (4 choose 2) = 6 • Base functions u1, u2, u3, u4: each a projection on k/2 = 2 random coordinates • g1 = <u1, u2>, g2 = <u1, u3>, g3 = <u1, u4>, …
Saving time • Can save time since there are fewer “base” hash functions • E.g.: computing fingerprints • Want to compute FP(gi(P)) for i=1..L • FP(gi(P)) = (Σj P[j]·χij·2^j) mod prime, where χij = 1 iff gi samples coordinate j • Old way • Would take L·m time for L functions g • New way • Takes L^(1/2)·m time for L^(1/2) functions ui • Need only L time to combine the FP(u(P)) into the FP(g(P)) • If g = <u1, u2>, then FP(g(P)) = (FP(u1(P)) + FP(u2(P))) mod prime • Total: L + L^(1/2)·m
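A sketch of the fingerprint bookkeeping (PRIME and the character encoding are my choices; it also assumes the two base coordinate sets behind each g are disjoint, which the additive combination requires):

```python
from itertools import combinations

PRIME = (1 << 61) - 1   # an arbitrary large prime, not taken from the paper

def fp(P, coords):
    """FP over one coordinate set: sum of P[j]*2^j for sampled j, mod PRIME."""
    return sum(ord(P[j]) * pow(2, j, PRIME) for j in coords) % PRIME

def all_g_fingerprints(P, base_coord_sets):
    # L^(1/2) * m work: one fingerprint per base function u_i ...
    base_fps = [fp(P, coords) for coords in base_coord_sets]
    # ... then O(1) per g = <u_a, u_b>: with disjoint coordinate sets,
    # the 2^j terms do not overlap and the fingerprints simply add.
    return [(fa + fb) % PRIME for fa, fb in combinations(base_fps, 2)]
```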
Better query & preproc (2) • E.g., for the query • Use fingerprints to leap faster in the trie • Yields time n^(1/c) + n^(1/(2c))·m (since L = n^(1/c)) • To get n^(1/c) + n^(o(1))·m, generalize: • g = tuple of t base functions • A base function = projection on k/t random coordinates • Other details similar to the fingerprints
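The t-tuple generalization, sketched under the same hypothetical naming as above: w base projections on k/t coordinates each, with w grown until C(w,t) ≥ L.

```python
import math
import random
from itertools import combinations

def make_tuple_family(m, k, L, t):
    """L functions g, each the concatenation of t base projections."""
    w = t
    while math.comb(w, t) < L:                  # smallest w with C(w,t) >= L
        w += 1
    base = [sorted(random.sample(range(m), k // t)) for _ in range(w)]
    gs = [lambda s, tup=tup: "".join(s[j] for c in tup for j in c)
          for tup in combinations(base, t)]
    return gs[:L]
```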
Better preprocessing (3) • Preprocessing: can get n^(1+1/c) + n^(1+o(1))·M • Can improve to n^(1+1/c) + n^(1+o(1))·M^(1/3) • Can construct a trie in n·M^(1/3) time (instead of n·M) • Using FFT, etc.
Outline • General approach • View: Near Neighbor problem in Hamming metric • Focus: reducing space • Background • Locality-Sensitive Hashing (LSH) • Solution = LSH + Tries • Reducing query & preprocessing • Redesign LSH • Concluding remarks
Conclusions • Problem: • Substring Near Neighbor (a.k.a., text indexing with mismatches) • Approach: • View as NN in m-dimensional Hamming • Use LSH • Challenge: • Variable-length pattern w/o degradation in performance • Solution: • Space/query optimal (w.r.t. LSH) • Preprocessing optimal (w.r.t. LSH) for c<3
Extensions • Extends to l1 • Nontrivial, since we need quite different LSH functions • Preprocessing slightly worse: n^(1+1/c) + n^(1+o(1))·M^(2/3) • Uses the “Less-than-matching” problem [Amir-Farach’95]
Remarks • Other approaches? • Or, why LSH for SNN? • Since a better SNN ⇒ a better NN… • And LSH is the “best” known algorithm for high-dimensional NN (using reasonable space)