Filter Algorithms for Approximate String Matching

Stefan Burkhardt

Outline • Motivation • Filter Algorithms • Gapped q-grams • Experimental Analysis

Problems and Motivation: Motivation
Computational Biology: • EST Clustering • Assembly • Genome comparison (e.g. Human/Mouse)
Information Retrieval: • Phonebooks • Dictionaries • Search Engines
Many more…

Problems and Motivation: The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y): find all substrings y of S with d(P,y) ≤ k.
Example: P = GAT, S = ACTGATAACGTTAGCCATGG

Problems and Motivation: The global approximate string matching problem
• d(x,y) = Hamming distance: the k-mismatches problem • d(x,y) = edit distance: the k-differences problem
Example: P = GAT, S = ACTGATAACGTTAGCCATGG
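
The two problem variants can be made concrete with a brute-force baseline. The following Python sketch is not from the talk: the function names are hypothetical, and the k-differences routine uses the standard Sellers-style dynamic programming column update, which reports the end positions of approximate occurrences rather than the substrings themselves. It is the kind of exact algorithm that a filter tries to avoid running over all of S.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def k_mismatches(pattern: str, text: str, k: int) -> list:
    """Start positions i with hamming(pattern, text[i:i+|P|]) <= k."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming_distance(pattern, text[i:i + m]) <= k]


def k_differences(pattern: str, text: str, k: int) -> list:
    """End positions j (exclusive) such that some substring of text
    ending at j has edit distance <= k to pattern."""
    m = len(pattern)
    col = list(range(m + 1))       # column for the empty text prefix
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]         # D[0][j-1]
        col[0] = 0                 # a match may start anywhere in the text
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                         # skip text character
                      col[i - 1] + 1,                     # skip pattern character
                      prev_diag + (pattern[i - 1] != c))  # match or substitution
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends


P, S = "GAT", "ACTGATAACGTTAGCCATGG"
print(k_mismatches(P, S, 0))   # [3]
print(k_mismatches(P, S, 1))   # [3, 9, 15]
```

Both routines touch every position of S, which is what makes a cheap filtration phase attractive for long targets.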

Filter Algorithms: How?
Diagram: P and S → Filter Algorithm (filtration phase, apply the filter criterion) → potential matches → Exact Algorithm (verification phase, examine the potential matches) → true matches and false matches

Filter Algorithms: BLAST (Altschul, Karlin, et al.)
• Sequential scan of S locates all q-grams matching P • Iterative extension with cutoff to find good matches
Problem for high similarity: • sequential scan is quite time consuming • single q-grams are unspecific

Filter Algorithms: Indexed Filter Algorithm
Diagram: S → Preprocess → Index; P and Index → Indexed Filter Algorithm → potential matches → Exact Algorithm (verification phase, examine the potential matches) → true matches and false matches

Filter Algorithms: Indexed Filter Algorithm
Diagram: S → Preprocess → Index; P and Index → Indexed Filter Algorithm → potential matches
Pro: • potentially faster evaluation of the filter criterion
Con: • preprocessing time • extra space required • only good for some filter criteria

Filter Algorithms: QUASAR (Burkhardt, Rivals et al. 1999)
Diagram: S → Preprocess → Index; P and Index → Indexed Filter Algorithm → potential matches
• Filter criterion: q-gram Lemma (Jokinen, Ukkonen 91) • Index structure: lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90) • Match detection: overlapping rectangles in the DP matrix
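
To illustrate what a q-gram index provides, here is a minimal sketch that uses a plain Python dictionary as the lookup table. This is an assumption made for illustration only: QUASAR combines a lookup table with a suffix array, which this sketch does not reproduce, and the names build_qgram_index and qgram_hits are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(text: str, q: int) -> dict:
    """Preprocessing: map every q-gram of the text to its start positions."""
    index = defaultdict(list)
    for j in range(len(text) - q + 1):
        index[text[j:j + q]].append(j)
    return dict(index)


def qgram_hits(index: dict, pattern: str, q: int) -> list:
    """Query: all (pattern position, text position) pairs with equal q-grams."""
    hits = []
    for i in range(len(pattern) - q + 1):
        for j in index.get(pattern[i:i + q], []):
            hits.append((i, j))
    return hits


index = build_qgram_index("ACTGATAACGTTAGCCATGG", 2)
print(qgram_hits(index, "GAT", 2))   # [(0, 3), (1, 4), (1, 16)]
```

Once the index is built, a query only visits text positions that actually produce a hit, instead of scanning all of S.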

Filter Algorithms: The q-gram Lemma
Example (the two strings differ by a single substitution):
TCGCGAGATATTTTATCGATTACTAC
TCGCGAGATATTTTATCGAATACTAC
|P| = 8, q = 3; total number of q-grams: |P| - q + 1 = 6
Each error can 'destroy' up to q matching q-grams, so k errors lose at most kq q-grams.
The q-gram Lemma (Jokinen, Ukkonen, 1991): For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least t = |P| - q + 1 - kq substrings of length q (q-grams).
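
The lemma can be checked on the example from this slide. In the sketch below the pattern is taken to be TCGATTAC (an assumption: it is the 8-character stretch of the first string that differs from the second string by one substitution), and the helper names are hypothetical.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list:
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m: int, q: int, k: int) -> int:
    """t = |P| - q + 1 - kq from the q-gram Lemma."""
    return m - q + 1 - k * q


def shared_qgrams(p: str, y: str, q: int) -> int:
    """Number of q-grams common to p and y, counted with multiplicity."""
    common = Counter(qgrams(p, q)) & Counter(qgrams(y, q))
    return sum(common.values())


P = "TCGATTAC"   # assumed pattern, |P| = 8
y = "TCGAATAC"   # same stretch with one substitution
print(len(qgrams(P, 3)))           # 6
print(qgram_threshold(8, 3, 1))    # 3
print(shared_qgrams(P, y, 3))      # 3
```

Here the single substitution destroys exactly q = 3 q-grams, so the bound of the lemma is met with equality.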

Filter Algorithms: Match Detection (Jokinen, Ukkonen 91)
• overlapping rectangles of width 2|P| in the DP matrix • a rectangle with at least t hits is a potential match
Figure: DP matrix of S against P with hit counts per rectangle (3 hits, 3 hits, 2 hits, 2 hits, 1 hit) and t = 3.

Filter Algorithms: Match Detection
Match Detection (Jokinen, Ukkonen 91): • overlapping rectangles of width 2|P| in the DP matrix • a rectangle with at least t hits is a potential match
QUASAR (Burkhardt, Rivals et al. 1999): • wider rectangles are efficient in practice (width 2048 for QUASAR)
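
The counting step can be sketched in a simplified, one-dimensional form. The code below is an illustration under stated assumptions, not the procedure from the talk: it handles Hamming distance only, counts hits per overlapping text block instead of per DP-matrix rectangle, counts each text position at most once, and uses the hypothetical name candidate_blocks. Consecutive blocks overlap by |P| characters, so every window of length |P| lies completely inside at least one block.

```python
from collections import defaultdict
from typing import Optional


def candidate_blocks(pattern: str, text: str, q: int, k: int,
                     width: Optional[int] = None) -> list:
    """Return (start, end) text blocks that contain at least t q-gram hits."""
    m = len(pattern)
    t = m - q + 1 - k * q                 # threshold from the q-gram Lemma
    if t <= 0:
        raise ValueError("threshold t must be positive to filter")
    if width is None:
        width = 2 * m                     # block width 2|P| as on the slide
    if width < 2 * m:
        raise ValueError("this sketch needs width >= 2|P|")
    step = width - m                      # consecutive blocks overlap by |P|
    pattern_qgrams = {pattern[i:i + q] for i in range(m - q + 1)}
    hits = defaultdict(int)
    for j in range(len(text) - q + 1):
        if text[j:j + q] in pattern_qgrams:
            b = j // step
            hits[b] += 1                  # block starting at b * step
            if b > 0 and j - b * step < m:
                hits[b - 1] += 1          # position also lies in the previous block
    return [(b * step, min(len(text), b * step + width))
            for b in sorted(hits) if hits[b] >= t]


blocks = candidate_blocks("TCGATTAC", "TCGCGAGATATTTTATCGAATACTAC", q=3, k=1)
print(blocks)
```

Passing a larger width, such as the 2048 mentioned for QUASAR, yields fewer and larger blocks to verify; the trade-off between counter bookkeeping and verification work is the design choice the slide refers to.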

Filter Algorithms: QUASAR (Burkhardt, Rivals et al. 1999)
• BLAST for the verification of the potential matches • wider rectangles as match regions • index is a combination of lookup table and suffix array • used for EST clustering at the DKFZ in Heidelberg • searches for EST clustering about 30 times faster than BLAST

Gapped q-grams • A new (old?) idea • Hamming Distance • Finding good shapes

Gapped q-grams: A new (old?) idea
General idea: • use gapped q-grams • call the arrangement of gaps the shape
Example: gapped 3-shape ##.# (# = match, . = don't care)
TCGATTAC yields TC.A, CG.T, GA.T, AT.A, TT.C
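
Extracting gapped q-grams for a given shape takes only a few lines. This is a minimal sketch with a hypothetical function name; it writes the don't-care positions as "." so that the output matches the notation of the slide.

```python
def gapped_qgrams(s: str, shape: str) -> list:
    """Slide the shape over s; keep characters under '#', mask the rest."""
    span = len(shape)
    return ["".join(c if mark == "#" else "."
                    for c, mark in zip(s[i:i + span], shape))
            for i in range(len(s) - span + 1)]


print(gapped_qgrams("TCGATTAC", "##.#"))
# ['TC.A', 'CG.T', 'GA.T', 'AT.A', 'TT.C']
```

With a contiguous shape such as ### the same function returns ordinary q-grams.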

Gapped q-grams: A new (old?) idea
Previous work: • Califano, Rigoutsos (1993) • Pevzner, Waterman (1995) • Lehtinen, Sutinen, Tarhio (1996) • no exact threshold for the general case given • limited attention paid to the choice of shapes
Recently: • Buhler (2001): multiple shapes • Ma, Tromp, Li (2002): PatternHunter • threshold t = 1

Gapped q-grams: The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors (O = match, X = error).
Classic 3-shape ###, k = 3: OOXOOXOOXOO gives the q-grams OOX OXO XOO OOX OXO XOO OOX OXO XOO, so t = 0: no filter!
Gapped 3-shape ##.#, k = 3: OOOXXOOXOOO gives the q-grams OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O, so t = 1.

Gapped q-grams: The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors (O = match, X = error).
Classic 3-shape ###, k = 3: t = 0, no filter!
Gapped 3-shape ##.#, k = 3: OOOXXOOXOOO gives the q-grams OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O, so t = 1.
• gapped shapes can have higher(!) thresholds t than ungapped shapes • no simple formula for t • we used a DP-based approach to compute t
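
For small patterns the threshold can be checked by exhaustive search. The sketch below is a brute-force stand-in and not the DP-based approach mentioned on the slide: it tries every placement of k mismatches in a pattern of length m (Hamming distance) and keeps the worst case. The function name is hypothetical, and the running time grows with the number of error placements, so it is only practical for short patterns and small k.

```python
from itertools import combinations


def threshold(shape: str, m: int, k: int) -> int:
    """Minimum number of surviving q-grams over all placements of k errors."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    n_pos = m - len(shape) + 1            # number of shape positions in P
    if n_pos <= 0:
        return 0
    worst = n_pos
    for errors in combinations(range(m), k):
        bad = set(errors)
        # A q-gram survives if none of its '#' positions carries an error.
        surviving = sum(all(i + o not in bad for o in offsets)
                        for i in range(n_pos))
        worst = min(worst, surviving)
    return worst


print(threshold("###", 11, 3))    # 0
print(threshold("##.#", 11, 3))   # 1
```

The two printed values correspond to the classic and the gapped 3-shape for a pattern of length 11, the length of the error strings shown on the slide.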

Gapped q-grams: Finding good shapes
Figure: number of q-gram hits (filtration time) and number of potential matches (verification time), each from low to high, against q (low to high); a tradeoff line separates good filters from bad filters.

Gapped q-grams: Finding good shapes
Figure: number of potential matches (low to high) against q (low to high); a tradeoff line separates good filters from bad filters.
# of q-gram hits ≈ |S| · 1/|Σ|^q
# of potential matches: ?

Gapped q-grams: Finding good shapes
For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
Reason: two overlapping copies of ##.# cover 5 characters, two overlapping copies of ### only 4. A random match therefore requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.

Gapped q-grams: Finding good shapes
We define the minimum coverage cm as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.
Example: CGACGATTGAT and ACTCGATTAGA, with two overlapping copies of ##.# covering 5 characters.
For t = 2 and the shape ##.# the minimum coverage is 5.
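
The minimum coverage of a shape can be computed by brute force for small cases. The sketch below follows one reading of the definition above, namely the smallest number of distinct character positions covered by t copies of the shape placed at t different start positions inside a pattern of length m. The function name and the parameter m are assumptions introduced for the example.

```python
from itertools import combinations


def minimum_coverage(shape: str, t: int, m: int) -> int:
    """Fewest characters covered by t distinct placements of the shape."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    n_pos = m - len(shape) + 1            # possible start positions in P
    if t < 1 or t > n_pos:
        raise ValueError("need 1 <= t <= number of shape positions")
    best = m
    for starts in combinations(range(n_pos), t):
        covered = {s + o for s in starts for o in offsets}
        best = min(best, len(covered))
    return best


print(minimum_coverage("##.#", 2, 13))   # 5
print(minimum_coverage("###", 2, 13))    # 4
```

The two values reproduce the 5 versus 4 matching characters quoted for the gapped and the contiguous 3-shape.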

Gapped q-grams: Finding good shapes
Figure: minimum coverage cm (low to high) against q (low to high); a tradeoff line separates good filters from bad filters.
# of q-gram hits ≈ |S| · 1/|Σ|^q
# of potential matches ≈ |S| · 1/|Σ|^cm

Gapped q-grams: Finding good shapes
• compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6
Figure: number of shapes (0 to 600) with a given minimum coverage (8 to 22) for k = 5, q = 8. Legend: t = 1, t = 2, t = 3, t = 4, t = 5; markers: median, best, contiguous.

Experimental Analysis • Speed and Filtration Efficiency • The Heuristic Zone

Experimental Analysis: Speed and Filtration Efficiency
Figure (k = 5, |P| = 50, |S| = 50 Mbp): hits, matches and minimum coverage against q (6 to 12) for gapped (Hamming) and contiguous shapes.

From Hits to Matches: Describing Filter Properties
Filters usually have 3 'recognition zones' depending on k:
• Guarantee zone (finds all approximate matches) • Heuristic zone (finds some of the approximate matches) • Negative zone (guaranteed not to find matches)
Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with k and |P|-mc marked on the error axis.

Experimental Analysis: The Heuristic Zone
Problem: behaviour in the heuristic zone is hard to predict.
Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with k and |P|-mc marked on the error axis and the heuristic zone labelled.

Experimental Analysis: The Heuristic Zone
A simple idea: sampling! For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the heuristic zone.
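
The three sampling steps can be sketched as follows. This is a simplified model and not the experimental setup of the talk: errors are mismatches at random distinct positions (Hamming distance), a sample counts as recognized when at least t q-grams of the given shape contain no error, and chance matches elsewhere in a text are ignored. The function name, the seed parameter and the default sample count are assumptions.

```python
import random


def recognition_rate(shape: str, m: int, t: int, errors: int,
                     samples: int = 1000, seed: int = 0) -> float:
    """Percentage of samples with `errors` mismatches that keep >= t q-grams."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    n_pos = m - len(shape) + 1
    rng = random.Random(seed)             # fixed seed for repeatable runs
    recognized = 0
    for _ in range(samples):
        # Step 1: one sample = a random set of mismatch positions.
        bad = set(rng.sample(range(m), errors))
        # Step 2: run the filter criterion, i.e. count error-free q-grams.
        hits = sum(all(i + o not in bad for o in offsets)
                   for i in range(n_pos))
        # Step 3: record whether the sample was recognized.
        if hits >= t:
            recognized += 1
    return 100.0 * recognized / samples


for i in range(0, 14):
    print(i, recognition_rate("##.#", 13, 2, i))
```

For i up to the k that the threshold was computed for, the rate stays at 100%; beyond that the printed values trace out the heuristic zone for this shape.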

Experimental Analysis: The Heuristic Zone
Figure (|P| = 50, 1000 samples for each error level): recognition rate (0% to 100%) against the number of errors (0 to 30). Curves: contiguous (k=3, q=11; k=4, q=9), gapped, edit (k=3, q=11; k=4, q=11; k=5, q=10) and BLAST (k=3, q=11; k=4, q=10).

Experimental Analysis: The Heuristic Zone
Figure (|P| = 50, 1000 samples for each error level): recognition rate (50% to 100%) against the number of errors (0 to 15). Curves: contiguous (k=3, q=11; k=4, q=9), gapped, edit (k=3, q=11; k=4, q=11; k=5, q=10) and BLAST (k=3, q=11; k=4, q=10).

Conclusion - Future Work
Our Work: • Significant sensitivity improvement over existing filters • Required modifications easy to implement • Methods for describing filter properties
Future Work: • Combination of 'orthogonal' shapes into one filter • Use of word neighborhoods • Database of filter properties for good shapes
