
Approximate String Matching


Presentation Transcript


  1. Approximate String Matching A Guided Tour to Approximate String Matching Gonzalo Navarro Justin Wiseman

  2. Outline: • Definition of approximate string matching (ASM) • Applications of ASM • Algorithms • Conclusion

  3. Approximate string matching • Approximate string matching is the process of matching strings while allowing for errors.

  4. The edit distance • Strings are compared based on how close they are • This closeness is measured by the edit distance • The edit distance is the minimum number of operations required to transform one string into the other

  5. Levenshtein / edit distance • Named after Vladimir Levenshtein, who introduced the distance in 1965 • Accounts for three basic operations: • Insertions, deletions, and substitutions • In the simplified version, all operations have a cost of 1 • Example: “mash” and “march” have an edit distance of 2 (substitute ‘s’ with ‘r’, insert ‘c’)

  6. Other distance algorithms • Hamming distance: • Allows only substitutions with a cost of one each • Episode distance: • Allows only insertions with a cost of one each • Longest Common Subsequence distance: • Allows only insertions and deletions costing one each
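Two of these simplified distances are easy to sketch; the Python fragment below is illustrative (not from the slides) and computes the Hamming distance and the LCS distance exactly as defined above.

```python
def hamming(x, y):
    """Hamming distance: substitutions only, defined for equal-length strings."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(x, y))

def lcs_distance(x, y):
    """LCS distance: insertions and deletions only.
    Equals |x| + |y| - 2*LCS(x, y), with LCS computed by dynamic programming."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return m + n - 2 * L[m][n]
```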

  7. Outline: • What is approximate string matching (ASM)? • What are the applications of ASM? • Algorithms • Conclusion

  8. Applications • Computational biology • Signal processing • Information retrieval

  9. Computational biology • DNA is composed of Adenine, Cytosine, Guanine, and Thymine (A,C,G,T) • One can think of the set {A,C,G,T} as the alphabet for DNA sequences • ASM is used to find specific or similar DNA sequences • Knowing how different two sequences are can give insight into the evolutionary process

  10. Signal processing • Used heavily in speech recognition software • Error correction for receiving signals • Multimedia and song recognition

  11. Information Retrieval • Spell checkers • Search engines • Web searches (Google) • Personal files (agrep for unix) • Searching texts with errors such as digitized books • Handwriting recognition

  12. Outline: • What is approximate string matching (ASM)? • What are the applications of ASM? • Algorithms • Conclusion

  13. Algorithms • Definitions • Dynamic Programming algorithms • Automatons • Bit-parallelism • Filters

  14. Definitions • Let Σ be a finite alphabet of size |Σ| = σ • Let T ∈ Σ* be a text of length n = |T| • Let P ∈ Σ* be a pattern of length m = |P| • Let k ∈ ℝ be the maximum error allowed • Let d : Σ* × Σ* → ℝ be a distance function • The problem: given T, P, k, and d(·), return the set of all text positions j such that there exists i ≤ j with d(P, Ti..j) ≤ k

  15. Algorithms • Definitions • Dynamic Programming algorithms • Automatons • Bit-parallelism • Filters

  16. Dynamic Programming • The oldest approach to solving the approximate string matching problem • Not very efficient: runtime of O(|x||y|) • However, space is only O(min(|x|,|y|)), since the matrix can be computed one column at a time • The most flexible approach when adapting to different distance functions

  17. Computing the edit distance • To compute the edit distance ed(x,y): • Create a matrix C0..|x|,0..|y| where Ci,j represents the minimum operations needed to match x1..i to y1..j • Ci,0 = i • C0,j = j • Ci,j = if(xi = yj) then Ci-1,j-1 else 1 + min(Ci-1,j, Ci,j-1, Ci-1,j-1)

  18. Edit distance example • Ci,0 = i • C0,j = j • if(xi = yj) Ci,j = Ci-1,j-1 else Ci,j = 1 + min(Ci-1,j, Ci,j-1, Ci-1,j-1)
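The recurrence of slides 17–18 can be sketched directly in Python (an illustrative implementation, not part of the original deck); it reproduces the “mash”/“march” example from slide 5.

```python
def edit_distance(x, y):
    """Levenshtein distance via the matrix C of slides 17-18.
    C[i][j] = minimum operations to turn x[:i] into y[:j]."""
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        C[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j],      # deletion
                                  C[i][j - 1],      # insertion
                                  C[i - 1][j - 1])  # substitution
    return C[m][n]
```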

  19. Text searching • The previous algorithm can be converted to search a text for a given pattern with a few changes • Let y = Pattern, and x = Text • Set C0,j = 0 so that any text position can be the start of a match • Ci,j = if(Pi = Tj) then Ci-1,j-1 else 1 + min(Ci-1,j, Ci,j-1, Ci-1,j-1) • A match ends at position j whenever Cm,j ≤ k

  20. Text search example • In English: if the characters at the current indexes are the same, then the current cell equals the top-left cell. If they differ, the current cell is one plus the minimum of the left, top, and top-left cells.
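The search variant of slide 19 can be sketched as follows (a Python illustration; it keeps only one column of the matrix at a time, matching the O(min(|x|,|y|)) space bound mentioned earlier), reporting every end position whose last cell drops to k or below.

```python
def search(pattern, text, k):
    """Column-wise approximate search (slide 19): C[0][j] = 0 lets a
    match start anywhere.  Returns the 1-based end positions j where
    some substring of text ending at j is within edit distance k of
    pattern."""
    m = len(pattern)
    col = list(range(m + 1))          # column j=0: C[i][0] = i
    matches = []
    for j, tj in enumerate(text, 1):
        prev_diag = col[0]            # C[0][j-1]
        col[0] = 0                    # C[0][j] = 0: match can start here
        for i in range(1, m + 1):
            tmp = col[i]              # C[i][j-1], saved before overwrite
            if pattern[i - 1] == tj:
                col[i] = prev_diag    # characters match: copy top-left
            else:
                col[i] = 1 + min(col[i - 1],  # left (insertion)
                                 tmp,         # top (deletion)
                                 prev_diag)   # top-left (substitution)
            prev_diag = tmp
        if col[m] <= k:
            matches.append(j)
    return matches
```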

  21. Improvements • The algorithm above was only the first • Many DP-based algorithms have improved on its search time • In 1992, Chang and Lampe produced a new algorithm called “column partitioning” with an average search time of O(kn/√σ), where k = errors, n = text length, and σ = size of the alphabet

  22. Algorithms • Definitions • Dynamic Programming algorithms • Automatons • Bit-parallelism • Filters

  23. Automatons for approx. search • Model the search with a nondeterministic finite automaton (NFA) • 1985: Esko Ukkonen proposes a deterministic form • Fast: the deterministic form has O(n) worst-case search time • Large: the space complexity of the DFA grows exponentially with respect to the pattern length

  24. NFA example with k = 2 Matching the pattern “survey” on text “surgery”

  25. Improvements • In 1996, Kurtz proposed lazy construction of the DFA: states are built only as the search actually reaches them • Space requirements reduced to O(mn)

  26. Algorithms • Definitions • Dynamic Programming algorithms • Automatons • Bit-parallelism • Filters

  27. Bit-parallelism • Takes advantage of the inherent parallelism of the computer when dealing in bits • Changes an existing algorithm to operate at the bit level • The number of operations can be reduced by a factor of w, where w is the number of bits in a machine word

  28. Shift-Or • The first bit-parallel algorithm • Parallelizes the operation of an NFA that tries to match the pattern exactly • The NFA has m+1 states

  29. Builds a table B which stores a bit mask for every character c • In the mask B[c], the bit bi is set if and only if Pi = c • The search state is kept in a machine word D = dm..d1 • di is 1 when P1..i matches the end of the text scanned so far • A match is reported when dm = 1

  30. To start, D is set to 0m, since no prefix of the pattern has been matched yet (the original Shift-Or formulation complements every bit and starts from 1m) • D is updated upon reading a new text character Tj using the following formula • D′ ← ((D << 1) | 0m-11) & B[Tj] • This representation works much like a DFA in that bit i can only become set if bit i-1 was set at the previous character, and so on back to the first state.
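The update formula above (which, with set bits marking active states, is the Shift-And variant of Shift-Or) can be sketched in Python; the snippet is illustrative and reports exact occurrences only.

```python
def shift_and(pattern, text):
    """Bit-parallel exact matching per slides 29-30 (Shift-And
    formulation: set bits mark active NFA states).  Returns the
    1-based end positions of exact occurrences of pattern in text."""
    m = len(pattern)
    # Table B: in B[c], bit i-1 is set iff pattern[i-1] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    D = 0                       # no pattern prefix matched yet
    matches = []
    for j, c in enumerate(text, 1):
        # D' <- ((D << 1) | 0^(m-1)1) & B[T_j]
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (m - 1)):  # d_m = 1: the whole pattern matched
            matches.append(j)
    return matches
```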

  31. Algorithms • Definitions • Dynamic Programming algorithms • Automatons • Bit-parallelism • Filters

  32. Filters • Originating in the 1990s • Filter algorithms discard large sections of the text based on the fact that the given pattern cannot occur there • A different kind of algorithm is needed to check the portions of text that are not filtered out

  33. Conceptually • Filter algorithms are really exact-match pattern searchers • Exact pattern matching is much quicker • The original pattern is broken into parts, and the text is searched for those exact parts: with k errors allowed, splitting the pattern into k+1 pieces guarantees that at least one piece must appear unaltered • Example from Navarro: if neither “sur” nor “vey” appears in a section, then “survey” cannot appear there with one error
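A minimal sketch of this filtering phase, assuming the pattern is cut into k+1 roughly equal pieces (the function and variable names are illustrative, not from the deck): it returns only the text positions a verifier would still need to inspect.

```python
def candidate_positions(pattern, text, k):
    """Partition filter (slide 33): split the pattern into k+1 pieces;
    if the pattern occurs with at most k errors, at least one piece
    occurs exactly.  Returns the 0-based start positions of exact
    piece occurrences, i.e. the only areas a verifier must check."""
    m = len(pattern)
    step = m // (k + 1)
    pieces = [pattern[i * step:(i + 1) * step] for i in range(k)]
    pieces.append(pattern[k * step:])   # last piece absorbs the remainder
    hits = []
    for piece in pieces:
        start = text.find(piece)        # exact search for each piece
        while start != -1:
            hits.append(start)
            start = text.find(piece, start + 1)
    return sorted(set(hits))
```

For "survey" with k = 1 the pieces are exactly the "sur" and "vey" of Navarro's example: any text section containing neither is filtered out.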

  34. Filters • Must be paired with a non-filter algorithm, such as one of the dynamic programming algorithms, to verify the candidate positions • Performance is dependent upon the number of errors allowed • The fastest of the algorithms surveyed • Best theoretical average cost: O(n(k + logσm)/m)

  35. Hierarchical verification method • Created by Navarro and Baeza-Yates in 1998 • The pattern is recursively split in half, with each half searched with k/2 errors • In the example: searching the text “xxxbbxxxxxx”, the leaf “bbb” returns a match with one error • Checking the parent subdivision shows that there is no match, so the candidate is discarded early
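A single level of this scheme can be sketched in Python, under the assumption that the column-wise DP search serves as the verifier (all names here are illustrative): each half is searched with ⌊k/2⌋ errors, and the full pattern is verified with k errors only around those hits. The pigeonhole argument is that k errors split across two halves leave at least one half with at most ⌊k/2⌋ of them.

```python
def approx_ends(pattern, text, k):
    """Column-wise DP: 1-based end positions matching with <= k errors."""
    m = len(pattern)
    col = list(range(m + 1))
    out = []
    for j, tj in enumerate(text, 1):
        prev = col[0]
        col[0] = 0
        for i in range(1, m + 1):
            tmp = col[i]
            col[i] = prev if pattern[i - 1] == tj else 1 + min(col[i - 1], tmp, prev)
            prev = tmp
        if col[m] <= k:
            out.append(j)
    return out

def hierarchical_search(pattern, text, k):
    """One split level of hierarchical verification (slide 35):
    search each half with k//2 errors, then verify the whole pattern
    with k errors only in a window around each half hit."""
    m = len(pattern)
    halves = ((pattern[:m // 2], 0), (pattern[m // 2:], m // 2))
    matches = set()
    for piece, _offset in halves:
        for j in approx_ends(piece, text, k // 2):
            lo = max(0, j - m - k)            # verification window
            hi = min(len(text), j + m + k)
            for e in approx_ends(pattern, text[lo:hi], k):
                matches.add(lo + e)
    return sorted(matches)
```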

  36. Outline: • What is approximate string matching (ASM)? • What are the applications of ASM? • Algorithms • Conclusion

  37. Conclusion • Generally, a combination of a fast filter and a fast verifying algorithm is the fastest overall • For non-filtering algorithms, an NFA bit-parallelized by diagonals is the fastest • Approximate string matching has greatly influenced the field of computer science and will play an important role in future technology.

  38. References • “A Guided Tour to Approximate String Matching”, Gonzalo Navarro • “Implementation of a Bit-parallel Approximate String Matching Algorithm”, Mikael Onsjö and Osamu Watanabe • “A Partial Deterministic Automaton for Approximate String Matching”, Gonzalo Navarro • http://en.wikipedia.org/wiki/Approximate_string_matching

