Filter Algorithms for Approximate String Matching

Filter Algorithms for Approximate String Matching. Stefan Burkhardt. Outline: Motivation, Filter Algorithms, Gapped q-grams, Experimental Analysis. Problems and Motivation: Computational Biology (EST Clustering, Assembly, Genome comparison, e.g. Human/Mouse).

Presentation Transcript


  1. Filter Algorithms for Approximate String Matching. Stefan Burkhardt

  2. Outline • Motivation • Filter Algorithms • Gapped q-grams • Experimental Analysis

  3. Problems and Motivation (section outline: Why? • Approximate String Matching • Edit and Hamming Distance). Motivation. Computational Biology: • EST Clustering • Assembly • Genome comparison (e.g. Human/Mouse). Information Retrieval: • Phonebooks • Dictionaries • Search Engines. Many more…

  4. Problems and Motivation: The global approximate string matching problem. Given a pattern P, a target S, an error level k and a string distance d(x,y): find all substrings y of S with d(P,y) ≤ k. Example: P = GAT, S = ACTGATAACGTTAGCCATGG.

  5. Problems and Motivation: The global approximate string matching problem. • d(x,y) = Hamming distance: the k-mismatches problem • d(x,y) = edit distance: the k-differences problem. Example: P = GAT, S = ACTGATAACGTTAGCCATGG.
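The two distances named on this slide can be sketched in a few lines. This is a minimal illustration, not code from the presentation: the function names are hypothetical, and `k_mismatches` is a naive quadratic scan that a filter algorithm is meant to avoid.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance, computed row by row with the standard DP."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]


def k_mismatches(pattern: str, target: str, k: int) -> list[int]:
    """Naive scan: start positions where pattern fits with at most k mismatches."""
    m = len(pattern)
    return [i for i in range(len(target) - m + 1)
            if hamming_distance(pattern, target[i:i + m]) <= k]
```

With the slide's example, `k_mismatches("GAT", "ACTGATAACGTTAGCCATGG", 0)` reports the single exact occurrence at position 3 (0-based).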

  6. Filter Algorithms (section outline: How? • BLAST • The q-gram Lemma and QUASAR). A filter algorithm takes P and S and works in two phases: • Filtration phase: apply a filter criterion to obtain potential matches • Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches.

  7. Filter Algorithms: BLAST (Altschul, Karlin, et al.). • Sequential scan of S locates all matching q-grams with P • Iterative extension with cutoff to find good matches. Problem for high similarity: • sequential scan is quite time consuming • single q-grams are unspecific.

  8. Filter Algorithms: Indexed filter algorithm. S is preprocessed into an index; the indexed filter algorithm uses the index and P to produce potential matches. Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches.

  9. Filter Algorithms: Indexed filter algorithm (S is preprocessed into an index, which is used with P to find potential matches). Pro: • potentially faster evaluation of the filter criterion. Con: • preprocessing time • extra space required • only good for some filter criteria.

  10. Filter Algorithms: QUASAR (Burkhardt, Rivals et al. 1999), an indexed filter algorithm. • Filter criterion: q-gram Lemma (Jokinen, Ukkonen 1991) • Index structure: lookup table (Jokinen, Ukkonen 1991) with suffix array (Manber, Myers 1990) • Match detection: overlapping rectangles in the DP matrix.

  11. Filter Algorithms: The q-gram Lemma. Example (figure): the sequence TCGCGAGATATTTTATCGATTACTAC and a copy with one substitution, TCGCGAGATATTTTATCGAATACTAC; |P| = 8, q = 3. Total # of q-grams: |P| - q + 1 = 6. Each error can 'destroy' q matching q-grams => for k errors lose kq q-grams. The q-gram Lemma (Jokinen, Ukkonen, 1991): For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least t = |P| - q + 1 - kq substrings of length q (q-grams).
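The arithmetic of the lemma can be checked with a small sketch. The helper names are hypothetical, and the two 8-character strings are an assumed window taken from the slide's sequence pair (the slide does not say which window is P).

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m: int, q: int, k: int) -> int:
    """Lower bound t = m - q + 1 - k*q from the q-gram lemma (m = |P|)."""
    return m - q + 1 - k * q


def shared_qgrams(p: str, y: str, q: int) -> int:
    """Number of q-grams common to p and y, counted with multiplicity."""
    common = Counter(qgrams(p, q)) & Counter(qgrams(y, q))
    return sum(common.values())


# Assumed window: ATCGATTA versus the same window with one substitution.
p, y = "ATCGATTA", "ATCGAATA"
t = qgram_threshold(len(p), q=3, k=1)
print(len(qgrams(p, 3)), t, shared_qgrams(p, y, 3))
```

For this window the pattern has 6 q-grams, the single substitution destroys 3 of them, and the 3 survivors meet the threshold t = 3 exactly.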

  12. Filter Algorithms: Match detection (Jokinen, Ukkonen 1991). • Overlapping rectangles of width 2|P| in the DP matrix of S and P • A rectangle with at least t hits => potential match. Example (figure): t = 3; the rectangles shown contain 3, 3, 2, 2 and 1 hits.

  13. Filter Algorithms: Match detection (Jokinen, Ukkonen 1991): overlapping rectangles of width 2|P| in the DP matrix; a rectangle with at least t hits => potential match. QUASAR (Burkhardt, Rivals et al. 1999): wider rectangles are efficient in practice (2048 for QUASAR).
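The counting step can be illustrated with a simplified sketch. It is not QUASAR's implementation: a plain dictionary stands in for the lookup table and suffix array, hits are binned only by their position in S (one-dimensional blocks instead of rectangles in the DP matrix), and all names are hypothetical. Blocks have length `2 * width` and start every `width` characters, so any region of at most `width` characters lies entirely inside some block.

```python
from collections import defaultdict


def build_qgram_index(s: str, q: int) -> dict[str, list[int]]:
    """Map every q-gram of s to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(s) - q + 1):
        index[s[i:i + q]].append(i)
    return index


def potential_match_blocks(pattern: str, index: dict[str, list[int]],
                           q: int, t: int, width: int) -> list[int]:
    """Start offsets of overlapping blocks of s that collect at least t hits."""
    counts = defaultdict(int)
    for j in range(len(pattern) - q + 1):
        for i in index.get(pattern[j:j + q], ()):
            b = i // width
            counts[b] += 1          # block starting at b * width
            if b > 0:
                counts[b - 1] += 1  # overlapping block starting at (b - 1) * width
    return sorted(b * width for b, c in counts.items() if c >= t)


s = "TCGCGAGATATTTTATCGATTACTAC"
p = "ATCGAATA"  # assumed pattern with one substitution
blocks = potential_match_blocks(p, build_qgram_index(s, 3), q=3, t=3, width=len(p))
print(blocks)
```

Each reported block would then be handed to the verification phase; blocks below the threshold are never examined.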

  14. Filter Algorithms: QUASAR (Burkhardt, Rivals et al. 1999): • BLAST for the verification of the potential matches • wider rectangles as match regions • index is a combination of lookup table and suffix array • used for EST clustering at the DKFZ in Heidelberg • searches for EST clustering about 30 times faster than BLAST

  15. Gapped q-grams • A new (old?) idea • Hamming Distance • Finding good shapes

  16. Gapped q-grams: A new (old?) idea. General idea: • use gapped q-grams • call the arrangement of gaps the shape. Example: gapped 3-shape ##.# (# = match, . = don't care); TCGATTAC yields the gapped q-grams TC.A, CG.T, GA.T, AT.A, TT.C.
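Extracting gapped q-grams for a given shape is a one-liner. The sketch below assumes the slide's notation (`#` for a match position, `.` for a don't-care position); the function name is hypothetical.

```python
def gapped_qgrams(s: str, shape: str) -> list[str]:
    """Slide the shape over s, keeping '#' positions and masking the rest with '.'."""
    span = len(shape)
    return ["".join(c if m == "#" else "." for c, m in zip(s[i:i + span], shape))
            for i in range(len(s) - span + 1)]


print(gapped_qgrams("TCGATTAC", "##.#"))
# ['TC.A', 'CG.T', 'GA.T', 'AT.A', 'TT.C']
```

A contiguous shape such as `###` gives the ordinary q-grams, so the same function covers both cases.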

  17. Gapped q-grams: A new (old?) idea. Previous work: • Califano, Rigoutsos (1993) • Pevzner, Waterman (1995) • Lehtinen, Sutinen, Tarhio (1996) • no exact threshold for the general case given • limited attention paid to choice of shapes. Recently: • Buhler (2001): multiple shapes • Ma, Tromp, Li (2002): PatternHunter • threshold t = 1

  18. Gapped q-grams: The threshold t. Definition: t is the number of remaining q-grams in a worst-case placement of k errors. Example with k = 3 (O = match, X = error): • classic 3-shape ###, error placement OOXOOXOOXOO, q-grams OOX OXO XOO OOX OXO XOO OOX OXO XOO: t = 0, no filter! • gapped 3-shape ##.#, error placement OOOXXOOXOOO, q-grams OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O: t = 1.

  19. Gapped q-grams: The threshold t. Definition: t is the number of remaining q-grams in a worst-case placement of k errors. Example with k = 3 (O = match, X = error): classic 3-shape ### gives t = 0 (no filter!); gapped 3-shape ##.# with error placement OOOXXOOXOOO, q-grams OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O, gives t = 1. • gapped shapes can have higher(!) thresholds t than ungapped shapes • no simple formula for t • we used a DP-based approach to compute t
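For small inputs the threshold can be checked by exhaustive search. The sketch below is a brute force over all placements of k mismatches in a pattern of length m, not the DP-based approach mentioned on the slide; the function name and parameters are hypothetical, and the running time grows as C(m, k).

```python
from itertools import combinations


def shape_threshold(shape: str, m: int, k: int) -> int:
    """Fewest q-grams of the shape that survive any placement of k mismatches."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    worst = len(starts)
    for errors in combinations(range(m), k):
        bad = set(errors)
        # A q-gram survives if none of its '#' positions carries an error.
        surviving = sum(all(s + o not in bad for o in offsets) for s in starts)
        worst = min(worst, surviving)
    return worst


print(shape_threshold("###", 11, 3), shape_threshold("##.#", 11, 3))
```

For the 11-character example with k = 3 this reproduces the slide's values: 0 for the contiguous shape and 1 for the gapped shape.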

  20. Gapped q-grams: Finding good shapes. [Figure: # of potential matches (verification time, low to high) plotted against # of q-gram hits (filtration time, low to high) as q varies from low to high; a tradeoff line separates good filters from bad filters.]

  21. Gapped q-grams: Finding good shapes. [Figure: # of potential matches (low to high) against q (low to high), with a tradeoff line separating good filters from bad filters.] # of q-gram hits: $|S| \cdot \frac{1}{|\Sigma|^{q}}$ (reading the garbled symbol as the alphabet $\Sigma$); # of potential matches: ?

  22. Gapped q-grams: Finding good shapes. For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches. Reason: two overlapping copies of ##.# cover 5 characters, two overlapping copies of ### only 4. A random match requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.

  23. Gapped q-grams: Finding good shapes. We define the minimum coverage cm as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S. For t = 2 and the shape ##.# the minimum coverage is 5. Example (figure): two overlapping copies of ##.# cover the 5 matching characters CGATT shared by CGACGATTGAT and ACTCGATTAGA.
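The minimum coverage can also be found by exhaustive search for small cases. This sketch assumes that a "distinct arrangement" means t different start positions of the shape within a pattern of length m; the function name is hypothetical, and it raises `ValueError` when fewer than t start positions exist.

```python
from itertools import combinations


def minimum_coverage(shape: str, t: int, m: int) -> int:
    """Smallest number of characters covered by t distinct placements of the shape."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(len({s + o for s in placement for o in offsets})
               for placement in combinations(starts, t))


print(minimum_coverage("##.#", 2, 13), minimum_coverage("###", 2, 13))
```

For |P| = 13 and t = 2 this gives 5 for ##.# and 4 for ###, matching the comparison on the previous slide.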

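  24. Gapped q-grams: Finding good shapes. [Figure: minimum coverage cm (low to high) against q (low to high), with a tradeoff line separating good filters from bad filters.] # of potential matches: $|S| \cdot \frac{1}{|\Sigma|^{c_m}}$; # of q-gram hits: $|S| \cdot \frac{1}{|\Sigma|^{q}}$ (reading the garbled symbol as the alphabet $\Sigma$).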

  25. Gapped q-grams: Finding good shapes. • compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6. [Figure: number of shapes (0 to 600) with a given minimum coverage (8 to 22) for k = 5, q = 8; curves for t = 1 to t = 5; markers for the contiguous shape, the median and the best shape.]

  26. Experimental Analysis • Speed and Filtration Efficiency • The Heuristic Zone

  27. Experimental Analysis: Speed and Filtration Efficiency. k = 5, |P| = 50, |S| = 50 Mbps. [Figure: number of matches, number of hits (both on logarithmic scales in powers of 2) and minimum coverage (8 to 20), plotted against q = 6 to 12, for contiguous shapes and for gapped shapes under Hamming distance.]

  28. From Hits to Matches: Describing Filter Properties. Filters usually have 3 'recognition zones' depending on k: • Guarantee zone (finds all approximate matches) • Heuristic zone (finds some of the approximate matches) • Negative zone (guaranteed not to find matches). [Figure: recognition rate (0% to 100%) against number of errors (0 to |P|).]

  29. From Hits to Matches: Describing Filter Properties. Filters usually have 3 'recognition zones' depending on k: • Guarantee zone (finds all approximate matches) • Heuristic zone (finds some of the approximate matches) • Negative zone (guaranteed not to find matches). [Figure: recognition rate (0% to 100%) against number of errors (0 to |P|), with k marked on the errors axis.]

  30. From Hits to Matches: Describing Filter Properties. Filters usually have 3 'recognition zones' depending on k: • Guarantee zone (finds all approximate matches) • Heuristic zone (finds some of the approximate matches) • Negative zone (guaranteed not to find matches). [Figure: recognition rate (0% to 100%) against number of errors (0 to |P|), with k and |P|-mc marked on the errors axis.]

  32. Experimental Analysis: The Heuristic Zone. Problem: behaviour in the heuristic zone is hard to predict. [Figure: recognition rate (0% to 100%) against number of errors (0 to |P|), with the heuristic zone highlighted and k and |P|-mc marked on the errors axis.]

  33. Experimental Analysis: The Heuristic Zone. A simple idea: sampling! For a value i: 1. Generate s sample strings with i random errors each 2. Run a filter algorithm on these samples 3. Record how many strings were recognized (in percent). This allows an experimental evaluation of the heuristic zone.
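The three sampling steps can be sketched as follows. This is an illustration under simplifying assumptions, not the experimental setup of the talk: errors are substitutions only (Hamming setting), each sample is compared with the pattern position by position, the alphabet is assumed to be `ACGT`, and all function names and defaults are hypothetical.

```python
import random


def count_hits(pattern: str, sample: str, shape: str) -> int:
    """Number of shape placements where all '#' positions agree (aligned strings)."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    return sum(all(pattern[s + o] == sample[s + o] for o in offsets)
               for s in range(len(pattern) - len(shape) + 1))


def mutate(pattern: str, i: int, rng: random.Random, alphabet: str = "ACGT") -> str:
    """Copy of pattern with exactly i positions replaced by a different character."""
    chars = list(pattern)
    for pos in rng.sample(range(len(chars)), i):
        chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return "".join(chars)


def recognition_rate(pattern: str, shape: str, t: int, i: int,
                     samples: int = 1000, seed: int = 0) -> float:
    """Percentage of samples with i errors that still reach t hits."""
    rng = random.Random(seed)
    recognized = sum(count_hits(pattern, mutate(pattern, i, rng), shape) >= t
                     for _ in range(samples))
    return 100.0 * recognized / samples


pattern = "ACGT" * 12 + "AC"   # |P| = 50
t = 50 - 11 + 1 - 3 * 11       # contiguous q = 11, designed for k = 3
for i in (0, 3, 5, 10):
    print(i, recognition_rate(pattern, "#" * 11, t, i))
```

Error levels up to the design value k are always recognized (guarantee zone); the rates printed for larger i give an empirical picture of the heuristic zone for this one filter.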

  34. Experimental Analysis: The Heuristic Zone. |P| = 50, 1000 samples for each error level. [Figure: recognition rate (0% to 100%) against number of errors (0 to 30); curves for contiguous shapes with k=3, q=11 and k=4, q=9.]

  35. Experimental Analysis: The Heuristic Zone. |P| = 50, 1000 samples for each error level. [Figure: recognition rate (0% to 100%) against number of errors (0 to 30); curves for contiguous shapes (k=3, q=11; k=4, q=9) and for gapped shapes under edit distance (k=3, q=11; k=4, q=11; k=5, q=10).]

  36. Experimental Analysis: The Heuristic Zone. |P| = 50, 1000 samples for each error level. [Figure: recognition rate (0% to 100%) against number of errors (0 to 30); curves for contiguous shapes (k=3, q=11; k=4, q=9), gapped shapes under edit distance (k=3, q=11; k=4, q=11; k=5, q=10) and BLAST, plus two curve labels k=3, q=11 and k=4, q=10 that appear next to the BLAST label.]

  38. Experimental Analysis: The Heuristic Zone. |P| = 50, 1000 samples for each error level. [Figure: enlarged view with recognition rate from 50% to 100% and number of errors from 0 to 15; curves for contiguous shapes (k=3, q=11; k=4, q=9), gapped shapes under edit distance (k=3, q=11; k=4, q=11; k=5, q=10) and BLAST, plus two curve labels k=3, q=11 and k=4, q=10 that appear next to the BLAST label.]

  39. Conclusion - Future Work. Our work: • Significant sensitivity improvement over existing filters • Required modifications easy to implement • Methods for describing filter properties. Future work: • Combination of 'orthogonal' shapes into one filter • Use of word neighborhoods • Database of filter properties for good shapes
