
CS5263 Bioinformatics


courtney

Presentation Transcript


  1. CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms

  2. Boyer – Moore algorithm • Three ideas: • Right-to-left comparison • Bad character rule • Good suffix rule

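  3. Boyer – Moore algorithm • Right-to-left comparison: P is compared with T starting from the right end of P • After a mismatch (x against y), shift P to skip some chars without missing any occurrence • [Figure: T with char x above P with char y; P shifted to the right]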

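  4. Extended bad character rule • T: xpbctbxabpqqaabpqz • P: tpabxab, compared right to left: two matches (^^), then a mismatch (*) at position i = 5 of P against the char T(k) = a • Find the occurrence of T(k) in P that is closest to the left of i, and shift P to align T(k) with that position • Here the closest a to the left of i = 5 is at position 3, and 5 – 3 = 2, so shift P by 2 • Restart the comparison at the right end of the shifted P • Preprocessing: O(n)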
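
The shift computation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function names `build_positions` and `bad_char_shift` are made up here, positions are 1-based to match the slide, and a binary search stands in for scanning a per-character position list.

```python
from bisect import bisect_left
from collections import defaultdict

def build_positions(pattern):
    """Map each char to the sorted list of its 1-based positions in pattern."""
    positions = defaultdict(list)
    for idx, ch in enumerate(pattern, start=1):
        positions[ch].append(idx)
    return positions

def bad_char_shift(positions, i, text_char):
    """Shift after a mismatch at 1-based pattern position i against text_char."""
    occ = positions.get(text_char, [])
    k = bisect_left(occ, i)        # number of occurrences strictly left of i
    if k == 0:
        return i                   # no occurrence left of i: move P past the mismatch
    return i - occ[k - 1]          # align the closest occurrence with the mismatch

positions = build_positions("tpabxab")
print(bad_char_shift(positions, 5, "a"))  # 2, as on the slide
```

Building the position lists takes time linear in the pattern length, which matches the O(n) preprocessing stated on the slide.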

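  5. (Strong) good suffix rule • In preprocessing: for any suffix t of P, find the rightmost copy of t, denoted t’, such that the char to the left of t ≠ the char to the left of t’ (z ≠ y) • [Figure: T with mismatched char x followed by the matched suffix t; P with char y before t and char z before t’; P shifted so that t’ aligns with t in T]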

  6. Example preprocessing • P = qcabdabdab • Position: 1 2 3 4 5 6 7 8 9 10 • Char: q c a b d a b d a b • Values shown: 0 0 0 0 2 0 0 2 0 0 • Bad char rule: where to shift depends on T • Good suffix rule: does not depend on T • Example strings on the slide: dabcab, dabdabcabdab

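  7. Tricky case • Pattern: abcab • T: x y a a b c a b • Right-to-left comparison: two matches (^ ^), then a mismatch (*) • Values shown for a b c a b: 0 0 0 1 0 • Values shown for a b c a b: N N 0 N N • shift = i – L = 4 – 1 = 3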

  8. Example preprocessing • P = qcabdabdab • Position: 1 2 3 4 5 6 7 8 9 10 • Char: q c a b d a b d a b • Values shown: 0 0 0 0 0 3 0 0 3 0 • Bad char rule: where to shift depends on T • Good suffix rule: does not depend on T • Example strings on the slide: dabcab, dabdabcabdab

  9. Example preprocessing • P = qcabdabdab • Position: 1 2 3 4 5 6 7 8 9 10 • Char: q c a b d a b d a b • Values shown: N N N N 2 N N 2 N N • Bad char rule: where to shift depends on T • Good suffix rule: does not depend on T • Example strings on the slide: dabcab, dabdabcabdab

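  10. Algorithm KMP: Basic idea • In pre-processing: for any position i in P, find the longest proper suffix t of P[1..i] that matches a prefix t’ of P (t = t’), such that the chars following t and t’ differ (y ≠ z) • For each i, let Sp’(i) = length(t) • [Figure: T with the matched part t and mismatched char x; P before and after the shift that aligns t’ with t]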

  11. Failure link • P: aataac • Char: a a t a a c • Sp’(i): 0 1 0 0 2 0 • If a char in T fails to match at pos 6, re-compare it with the char at pos 3, since Sp’(5) = 2 • [Figure: strings aaat and aataac]
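
The Sp’ values on this slide can be reproduced with a short Python sketch. It assumes a non-empty pattern and uses the classical prefix-function computation rather than the Z-based method introduced later in the lecture; the names `sp_prime` and `kmp_search` are illustrative only.

```python
def sp_prime(pattern):
    """Return a list where entry i is Sp'(i) for 1-based i (entry 0 is unused)."""
    n = len(pattern)
    sp = [0] * (n + 1)             # sp[i]: longest proper suffix of P[1..i] that is a prefix
    k = 0
    for i in range(2, n + 1):
        while k > 0 and pattern[k] != pattern[i - 1]:
            k = sp[k]
        if pattern[k] == pattern[i - 1]:
            k += 1
        sp[i] = k
    sp_p = [0] * (n + 1)
    for i in range(1, n + 1):
        k = sp[i]
        # strong condition: the char after the suffix must differ from the char after the prefix
        if i < n and k > 0 and pattern[k] == pattern[i]:
            sp_p[i] = sp_p[k]
        else:
            sp_p[i] = k
    return sp_p

def kmp_search(text, pattern):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    sp_p = sp_prime(pattern)
    n = len(pattern)
    hits, j = [], 0                # j: number of pattern chars currently matched
    for pos, ch in enumerate(text):
        while j > 0 and pattern[j] != ch:
            j = sp_p[j]            # follow the failure link
        if pattern[j] == ch:
            j += 1
        if j == n:
            hits.append(pos - n + 1)
            j = sp_p[n]
    return hits

print(sp_prime("aataac")[1:])            # [0, 1, 0, 0, 2, 0]
print(kmp_search("aataataac", "aataac")) # [3]
```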

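  12. FSA • P: aataac • States: 0 1 2 3 4 5 6, with the chars a a t a a c labeling the transitions between consecutive states • Sp’(i): 0 1 0 0 2 0 • If the next char in T is t, we go to state 3 • All other input goes to state 0 • [Figure: state diagram; strings aaat and aataac]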

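  13. Tricky case • Pattern: abcab • Failure link: a dummy state followed by a b c a b, with values 0 0 0 0 2 • FSA: [Figure: state diagram for abcab with extra transitions labeled a, b, c]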

  14. How to actually do pre-processing? • Similar pre-processing for KMP and B-M • Find matches between a suffix and a prefix • Both can be done in linear time • P is usually short, so even a more expensive pre-processing may result in an overall gain • KMP: for each i, find a j; similar to DP; start from i = 2 • [Figures: P with copies t and t’ followed by chars x and y, positions i and j, for KMP and for B-M]

  15. Fundamental pre-processing • Zi: length of the longest substring starting at i that matches a prefix of P • i.e. t = t’, x ≠ y, Zi = |t| • With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time • Example: P = aabcaabxaaz, Z = 0 1 0 0 3 1 0 0 2 1 0 • How to compute Z-values in linear time? • [Figure: P with prefix t’ starting at 1 and its copy t spanning positions i to i+Zi-1, followed by differing chars y and x]

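  16. Computing Z in Linear time • We have already computed all Z-values up to k-1 and need to compute Zk • We also know the starting and ending points, l and r, of the previous match that extends furthest to the right • We know that t = t’, therefore the Z-value at k-l+1 may be helpful to us • [Figure: P with the prefix t’ starting at 1 and its copy t spanning l to r; position k inside t corresponds to position k-l+1 inside t’]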

  17. Computing Z in Linear time • Case 1: the previous r is smaller than k, i.e., no previous match extends beyond k. Do explicit comparison. • Case 2: Z(k-l+1) < r-k+1. Then Z(k) = Z(k-l+1). No comparison is needed. • Case 3: Z(k-l+1) >= r-k+1. Then Z(k) >= r-k+1, and the comparison starts from r+1. • No char inside the box is compared twice. At most one mismatch per iteration. • Therefore, O(n).
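
The three cases can be written out as follows. This is a sketch under the slide's conventions (1-based positions, Z(1) set to 0); the function name `z_values` and the padding trick are choices made here for readability.

```python
def z_values(p):
    """Return a list where entry i is Z(i) for 1-based i (entry 0 is unused, Z(1) = 0)."""
    n = len(p)
    s = " " + p                    # pad so that s[i] is the i-th char (1-based)
    z = [0] * (n + 1)
    l = r = 0                      # [l, r]: previous match that extends furthest right
    for k in range(2, n + 1):
        if k > r:                  # case 1: explicit comparison from scratch
            length = 0
            while k + length <= n and s[1 + length] == s[k + length]:
                length += 1
            z[k] = length
            if length > 0:
                l, r = k, k + length - 1
        else:
            kp = k - l + 1         # position corresponding to k inside the prefix
            beta = r - k + 1       # length of the part of the box from k to r
            if z[kp] < beta:       # case 2: the answer is copied, no comparison
                z[k] = z[kp]
            else:                  # case 3: match reaches r, continue from r + 1
                q = r + 1
                while q <= n and s[q] == s[q - k + 1]:
                    q += 1
                z[k] = q - k
                l, r = k, q - 1
    return z

print("".join(map(str, z_values("aabcaabxaaz")[1:])))  # 01003100210
```

Every successful comparison in cases 1 and 3 moves r to the right, which is why the total work stays linear.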

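  18. Z-preprocessing for B-M and KMP • Both KMP and B-M preprocessing can be done in O(n) • Z: a match of length Z(i) starting at i ends at j = i+Z(i)-1 • KMP: for each j, sp’(j+Z(j)-1) = Z(j); process j from n down to 2 so that the longest match ending at a position is the one kept • B-M: use Z backwards, i.e. on the reversed pattern • [Figures: P with prefix t’ and copy t, followed by differing chars x and y, for Z, KMP and B-M]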
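
A small Python sketch of both conversions follows. To keep the block self-contained, `z_naive` is a quadratic stand-in for any linear-time Z routine, so only the two conversion loops are linear here. The name `n_values` and the quantity N(j) (length of the longest suffix of P[1..j] that is also a suffix of P) are an assumption about what "use Z backwards" computes for B-M.

```python
def z_naive(p):
    """Quadratic stand-in for a linear-time Z routine; entry i is Z(i), 1-based."""
    n = len(p)
    z = [0] * (n + 1)
    for k in range(2, n + 1):
        while k + z[k] <= n and p[z[k]] == p[k - 1 + z[k]]:
            z[k] += 1
    return z

def sp_prime_from_z(p):
    """sp'(j + Z(j) - 1) = Z(j), filled for j = n down to 2."""
    n = len(p)
    z = z_naive(p)
    sp_p = [0] * (n + 1)
    for j in range(n, 1, -1):      # smaller j is written later, so the longest match wins
        if z[j] > 0:
            sp_p[j + z[j] - 1] = z[j]
    return sp_p

def n_values(p):
    """N(j) for j < n, read off the Z-values of the reversed pattern (entry n is left at 0)."""
    n = len(p)
    z_rev = z_naive(p[::-1])
    return [0] + [z_rev[n - j + 1] for j in range(1, n + 1)]

print(sp_prime_from_z("aataac")[1:])  # [0, 1, 0, 0, 2, 0]
```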

  19. Keyword tree for spell checking • O(n) time to construct. n: total length of patterns. • Search time: O(m). m: length of word • A common prefix only needs to be compared once. • [Figure: keyword tree for five words, with leaves numbered 1 to 5]
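
A keyword tree can be sketched with nested dictionaries. The word list below is only an illustration and is not taken from the slide; the sketch also assumes that "$" never appears inside a word, since it is used as the end marker.

```python
def build_keyword_tree(words):
    """Build a keyword tree (trie); construction time is linear in the total word length."""
    root = {}
    for number, word in enumerate(words, start=1):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # shared prefixes reuse existing nodes
        node["$"] = number                   # end marker stores the word number
    return root

def lookup(root, word):
    """Return the word number if word is in the tree, else None; O(len(word))."""
    node = root
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
print(lookup(tree, "pottery"))  # 3
print(lookup(tree, "pot"))      # None: a prefix of a word, but not a word itself
```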

  20. Aho-Corasick algorithm • Generalizing KMP • Create failure links • Basis of the fgrep algorithm • Given the following patterns: • potato • tattoo • theater • other

  21. Failure link • T: potterisapersonwhomakespottery • [Figure: keyword tree with root 0 and leaves 1 to 4, with failure links]

  22. Failure link • O(n) preprocessing and O(m+k) searching, where k is the number of occurrences • Can create an FSA similarly; it requires more space, and its preprocessing time depends on the alphabet size • [Figure: keyword tree with root 0 and leaves 1 to 4, with failure links]

  23. A problem with failure link • Patterns: {potato, other, pot} • [Figure: keyword tree for the patterns, with root 0 and nodes numbered 1 to 3]

  24. A problem with failure link for multiple patterns • Patterns: {potato, other, pot, the, he, era} • T: potherarac • [Figure: keyword tree for the patterns, with failure links]

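  25. Output link • Patterns: {potato, other, pot, the} • Failure link: taken when a mismatch occurs. • Output link: always taken to report a pattern that ends at the current position, after which the search returns to the current node. • T: potherarac • [Figure: keyword tree with failure links and output links]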
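
Failure links and output links can be sketched together in Python. This is an illustrative implementation, not the one from the lecture: the names `build_automaton` and `search` are made up, nodes are stored in parallel lists, and -1 marks a missing output link. The example uses the six patterns and the text from the preceding slide.

```python
from collections import deque

def build_automaton(patterns):
    """Keyword tree plus failure links (fail) and output links (out)."""
    goto, fail, out, word = [{}], [0], [-1], [None]
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(-1)
                word.append(None)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        word[node] = p
    queue = deque(goto[0].values())          # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            fc = fail[child]
            # output link: nearest node on the failure chain that ends a pattern
            out[child] = fc if word[fc] is not None else out[fc]
            queue.append(child)
    return goto, fail, out, word

def search(text, patterns):
    """Return (0-based start, pattern) for every occurrence of every pattern."""
    goto, fail, out, word = build_automaton(patterns)
    hits, node = [], 0
    for pos, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]                # failure link: taken only on a mismatch
        node = goto[node].get(ch, 0)
        v = node if word[node] is not None else out[node]
        while v != -1:                       # output links: always followed, node is kept
            hits.append((pos - len(word[v]) + 1, word[v]))
            v = out[v]
    return hits

print(sorted(search("potherarac", ["potato", "other", "pot", "the", "he", "era"])))
# [(0, 'pot'), (1, 'other'), (2, 'the'), (3, 'he'), (4, 'era')]
```

Without the `out` chain, the occurrences of "the" and "he" inside "other" would go unreported, because the search never leaves the path for "other" at those positions.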

  26. Suffix Tree • All algorithms we talked about so far preprocess pattern(s) • Karp-Rabin: small pattern, small alphabet • Boyer-Moore: fastest in practice. O(m) worst case. • KMP: O(m) • Aho-Corasick: O(m) • In some cases we may prefer to pre-process T • Fixed T, varying P • Suffix tree: basically a keyword tree of all suffixes

  27. Suffix tree • T: xabxac • Suffixes: • xabxac • abxac • bxac • xac • ac • c • Naïve construction: O(m²) using Aho-Corasick. • Smarter: O(m). Very technical, with a big constant factor. • Create an internal node only when there is a branch • [Figure: suffix tree for xabxac with leaves 1 to 6]

  28. Suffix tree implementation • Explicitly label the end of the sequence • T: xabxa becomes T: xabxa$ • [Figure: suffix trees for xabxa and for xabxa$]

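  29. Suffix tree implementation • Implicitly labeling edges: an edge stores a range of positions in T instead of the substring • T: xabxa$ • e.g. 1:2 stands for xa, 2:2 for a, and 3:$ for bxa$ • [Figure: suffix tree for xabxa$ with explicit labels next to the same tree with position-range labels]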

  30. Suffix links • Similar to failure link in a keyword tree • Only link internal nodes having branches • [Figure: suffix tree fragment with suffix links between internal nodes]

  31. Suffix tree construction • T: acatgacatt... (positions 1234567890...) • [Figure: tree after inserting suffix 1: a single edge 1:$ to leaf 1]

  32. Suffix tree construction • T: acatgacatt... • [Figure: tree after inserting suffixes 1 to 2]

  33. Suffix tree construction • T: acatgacatt... • [Figure: tree after inserting suffixes 1 to 3; suffixes 1 and 3 share an edge labeled a]

  34. Suffix tree construction • T: acatgacatt... • [Figure: tree after inserting suffixes 1 to 4]

  35. Suffix tree construction • T: acatgacatt... • [Figure: tree after inserting suffixes 1 to 5]

  36. Suffix tree construction • T: acatgacatt... • [Figure: tree after inserting suffixes 1 to 6]

  37. Suffix tree construction • T: acatgacatt... • With this suffix link, when we later need to add another suffix, say acaty, we can use the link to avoid going back to the root and re-comparing “cat” • [Figure: tree after inserting suffixes 1 to 7, with a suffix link]

  38. Suffix tree construction • T: acatgacatt... • [Figure: tree after inserting suffixes 1 to 8]

  39. Suffix tree construction • T: acatgacatt... • [Figure: tree after inserting suffixes 1 to 9]

  40. Suffix tree construction • T: acatgacatt... • [Figure: tree after inserting suffixes 1 to 10]

  41. ST Application: pattern matching • Find all occurrences of P = xa in T • Find the node v in the ST that matches P • Traverse the subtree rooted at v to get the locations • T: xabxac • O(m) to construct the ST (large constant factor) • O(n) to find v: linear in the length of P instead of T! • O(k) to get all leaves, where k is the number of occurrences • [Figure: suffix tree for xabxac with leaves 1 to 6]
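
The two steps (find v, then collect the leaves below it) can be sketched on a naive, uncompressed suffix trie. This is a simplification of the slide's suffix tree: construction here is O(m²) rather than O(m), and because unbranched paths are not compressed, the leaf collection is not O(k). The function names and the "leaf" key are choices made for this sketch.

```python
def build_suffix_trie(text):
    """Naive keyword tree of all suffixes of text + '$' (uncompressed, O(m^2))."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1          # 1-based suffix number
    return root

def occurrences(root, pattern):
    """Return the sorted 1-based start positions of pattern in the indexed text."""
    node = root
    for ch in pattern:                    # walk down to node v, O(len(pattern))
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                          # collect every leaf below v
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("xabxac")
print(occurrences(trie, "xa"))  # [1, 4]
```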

  42. ST application: repeats finding • Genome contains many repeated DNA sequences • Repeat sequence length: Varies from 1 nucleotide to whole gene • Highly repetitive DNA in some non-coding regions • 6 to 10bp x 100,000 to 1,000,000 times • Genes may have multiple copies (50 to 10,000)

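  43. Find longest repeated substring • Do a tree traversal, computing at each node the total length L of the labels on the path from the root • The internal node with the largest L spells out the longest repeated substring • O(m) • [Figure: edge 2:5 leading to a node with L = 4, followed by edge 6:10 to a node with L = 9 and edge 15:18 to a node with L = 8]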

  44. Repeats finding • Find all repeats that are at least k residues long and appear at least p times in the seq • Phase 1: top-down, compute the length L of the labels at each node • Phase 2: bottom-up, count the number N of leaves descended from each internal node • For each node with L >= k and N >= p, print all leaves • O(m) to traverse the tree • [Figure: node annotated with (L, N)]

  45. Repeats finding • Find repeats with at least 3 bases and 2 occurrences • cat • acat • aca • [Figure: suffix tree for T = acatgacatt (positions 1234567890) with leaves 1 to 10]
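
The (L, N) idea can be checked on this example with a small Python sketch. It is a simplification, not the two-phase suffix tree traversal from the lecture: it uses an uncompressed suffix trie, counts N while inserting suffixes instead of bottom-up, and therefore needs O(m²) time and space. The function name `find_repeats` is made up.

```python
def find_repeats(text, k, p):
    """Substrings of length >= k that occur >= p times, via an uncompressed suffix trie."""
    root = {"count": 0, "children": {}}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1            # N: number of suffixes passing through this node
    repeats = []
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()         # len(label) plays the role of L
        for ch, child in node["children"].items():
            if child["count"] >= p:       # N never grows with depth, so prune here
                if len(label) + 1 >= k:
                    repeats.append(label + ch)
                stack.append((child, label + ch))
    return sorted(repeats)

print(find_repeats("acatgacatt", 3, 2))  # ['aca', 'acat', 'cat']
```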

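  46. Repeats finding • Left-maximal repeat • S[i+1..i+k] = S[j+1..j+k] • S[i] != S[j] • Right-maximal repeat • S[i+1..i+k] = S[j+1..j+k], • S[i+k+1] != S[j+k+1] • Maximal repeat • S[i+1..i+k] = S[j+1..j+k] • S[i] != S[j], and S[i+k+1] != S[j+k+1] • Example in acatgacatt: • aca is left-maximal (the left chars differ, both copies are followed by t) • cat is right-maximal (both copies are preceded by a, the right chars g and t differ) • acat is maximal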

  47. Repeats finding • How to find maximal repeats? • A maximal repeat is a right-maximal repeat whose occurrences have different left chars • [Figure: suffix tree for T = acatgacatt (positions 1234567890), with leaves annotated with their left chars: [] g c c a a]

  48. ST application: word enumeration • Find all k-mers that occur at least p times • Compute (L, N) for each node • Find nodes v with L >= k, L(parent) < k, and N >= p • Traverse the sub-tree rooted at v to get the locations • This can be used in many applications, for example to find words that appear frequently in a genome or a document • [Figure: a path on which L < k above a node with L >= k and N >= p]

  49. Joint Suffix Tree • Build a ST for two or more strings • Two strings S1 and S2 • S* = S1 & S2 • Build a suffix tree for S* in time O(|S1| + |S2|) • The separator will only appear in edges ending in a leaf

  50. S1 = abcd • S2 = abca • S* = abcd&abca$ • [Figure: joint suffix tree for S*; leaves are labeled (string, position): 1,1 1,2 1,3 1,4 2,1 2,2 2,3 2,4]
