CS5263 Bioinformatics

CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms

Boyer – Moore algorithm • Three ideas: • Right-to-left comparison • Bad character rule • Good suffix rule

Boyer – Moore algorithm • Right to left comparison x y Skip some chars without missing any occurrence. y

Extended bad character rule k T: xpbctbxabpqqaabpqz P: tpabxab *^^ Find T(k) in P that is immediately left to i, shift P to align T(k) with that position i = 5 5 – 3 = 2. so shift 2 P: tpabxab Restart the comparison here. Preprocessing O(n)

(Strong) good suffix rule x t T In preprocessing: For any suffix t of P, find the rightmost copy of t, t’, such that the char left to t ≠ the char left to t’ z y P t’ t z ≠ y z y P t t’ x t T z z y P t’ t’ t z z y P t’ t’ t

Example preprocessing qcabdabdab Bad char rule Good suffix rule 1 2 3 4 5 6 7 8 9 10 q c a b d a b d a b 0 0 0 0 2 0 0 2 0 0 dabcab dabdabcabdab Where to shift depends on T Does not depend on T

Tricky case Pattern: abcab T: x y a a b c a b a b c a b0 0 0 1 0 * ^ ^ a b c a bN N 0 N N i-L shift = 4 – 1 = 3 b b c c

Example preprocessing qcabdabdab Bad char rule Good suffix rule 1 2 3 4 5 6 7 8 9 10 q c a b d a b d a b 0 0 0 0 0 3 0 0 3 0 dabcab dabdabcabdab Where to shift depends on T Does not depend on T

Example preprocessing qcabdabdab Bad char rule Good suffix rule 1 2 3 4 5 6 7 8 9 10 q c a b d a b d a b N N N N 2 N N 2 N N dabcab dabdabcabdab Where to shift depends on T Does not depend on T

Algorithm KMP: Basic idea x t T z y P t t’ j i z y P t t’ In pre-processing: for any position i in P, find the longest suffix t, such that t = t’, and y ≠ z. For each i, let Sp’(i) = length(t)

Failure link P: aataac If a char in T fails to match at pos 6, re-compare it with the char at pos 3 a a t a a c Sp’(i) 0 1 0 0 2 0 aaat aataac

FSA If the next char in T is t, we go to state 3 P: aataac t a a t a a c a 0 1 2 3 4 5 6 Sp’(i) 0 1 0 0 2 0 aaat aataac All other input goes to state 0

Tricky case Pattern: abcab a b c a b Failure link dummy 0 0 0 0 2 c b c a b a FSA

How to actually do pre-processing? • Similar pre-processing for KMP and B-M • Find matches between a suffix and a prefix • Both can be done in linear time • P is usually short, even a more expensive pre-processing may result in a gain overall x y KMP P t t’ i j For each i, find a j. similar to DP. Start from i = 2 x y B-M P t t’ i j

Fundamental pre-processing • Zi: length of longest substring starting at i that matches a prefix of P • i.e. t = t’, x ≠ y, Zi = |t| • With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time. aabcaabxaaz Z = 01003100210 • How to compute Z-values in linear time? y x P t t’ zi i+zi-1 i 1

Computing Z in Linear time We already computed all Z-values up to k-1. need to compute Zk. We also know the starting and ending points of the previous match, l and r. t t’ y x P k r l t t’ y x P We know that t = t’, therefore the Z-value at k-l+1 may be helpful to us. k r l 1 k-l+1

Computing Z in Linear time • No char inside the box is compared twice. At most one mismatch per iteration. • Therefore, O(n). The previous r is smaller than k. i.e., no previous match extends beyond k. do explicit comparison. Case 1: P k Case 2: y x P Zk-l+1 <= r-k+1. Zk = Zk-l+1No comparison is needed. k r l 1 k-l+1 Zk-l+1 > r-k+1. Zk = Zk-l+1 Comparison start from r Case 3: P k r l 1 k-l+1

Z-preprocessing for B-M and KMP • Both KMP and B-M preprocessing can be done in O(n) y x t t’ Z zi i j 1 j = i+zi-1 x y KMP t For each j sp’(j+zj-1) = z(j) t’ j i x y B-M t t’ Use Z backwards i j

Keyword tree for spell checking • O(n) time to construct. n: total length of patterns. • Search time: O(m). m: length of word • Common prefix only need to be compared once. p s o c l h o o 5 e i t e t a t r n t e y c o r e y 3 1 4 2

Aho-Corasick algorithm • Generalizing KMP • Create failure links • Basis of the fgrep algorithm • Given the following patterns: • potato • tattoo • theater • other

Failure link 0 p t t h o e h a t r e t a a 4 t t t e o o r 1 o 3 2 potterisapersonwhomakespottery

Failure link 0 p t t h o e h a t r e t a 4 a t t t e o o r 1 o 3 2 O(n) preprocessing, and O(m+k) searching. k is # of occurrence. Can create a FSA similarly. Requires more space, and preprocessing time depends on alphabet size.

A problem with failure link • Patterns: {potato, other, pot} 0 p t h o e t r 3 a 2 t o 1

A problem with failure link for multiple patterns • Patterns: {potato, other, pot, the, he, era} h e e 5 0 r p a t t h o h e t r e 2 a 3 4 t o potherarac 1

Output link • Patterns: {potato, other, pot, the} Failure link: taken when a mismatch occurs. Output link: always taken. (but will return). h e e 5 0 r p a t t h o h e t r e 2 a 3 4 t o potherarac 1

Suffix Tree • All algorithms we talked about so far preprocess pattern(s) • Karp-Rabin: small pattern, small alphabet • Boyer-Moore: fastest in practice. O(m) worst case. • KMP: O(m) • Aho-Corasick: O(m) • In some cases we may prefer to pre-process T • Fixed T, varying P • Suffix tree: basically a keyword tree of all suffixes

Suffix tree • T: xabxac • Suffixes: • xabxac • abxac • bxac • xac • ac • c x a b x a a c c 1 c b b x x c 4 6 a a c c 5 2 3 Naïve construction: O(m2) using Aho-Corasick. Smarter: O(m). Very technical. big constant factor Create an internal node only when there is a branch

Suffix tree implementation • Explicitly labeling seq end • T: xabxa T: xabxa$ x a x a b x b a a x a a $ 1 1 $ b b b b x $ x x x 4 a a a a $ 5 $ 2 2 3 3

Suffix tree implementation • Implicitly labeling edges • T: xabxa$ 1:2 x a 3:$ b x 2:2 a a $ 1 1 $ $ b b $ $ x x 3:$ 3:$ 4 4 a a 5 $ 5 $ 2 2 3 3

Suffix links • Similar to failure link in a keyword tree • Only link internal nodes having branches x a b xabcf a b c f c d d e e f f g g h h i i j j

Suffix tree construction 1234567890...acatgacatt... 1:$ 1

Suffix tree construction 1234567890...acatgacatt... 2:$ 1:$ 1 2

Suffix tree construction 1234567890...acatgacatt... a 2:$ 2:$ 4:$ 3 1 2

Suffix tree construction 1234567890...acatgacatt... a 4:$ 2:$ 2:$ 4:$ 4 3 1 2

Suffix tree construction 5:$ 1234567890...acatgacatt... 5 a 4:$ 2:$ 2:$ 4:$ 4 3 1 2

Suffix tree construction 5:$ 1234567890...acatgacatt... 5 a 4:$ c a 2:$ 4:$ t 4 t 5:$ $ 3 6 1 2

Suffix tree construction • With this suffix link, when we later need to add another suffix, say acaty, we can use the link to avoid going back to the root and re-compare “cat” 5:$ 1234567890...acatgacatt... 5 a c 4:$ a c t a 4:$ t 4 t 5:$ 5:$ t $ 7 3 6 1 2

Suffix tree construction 5:$ 1234567890...acatgacatt... 5 a c 4:$ a c t t a t 4 t 5:$ 5:$ 5:$ t t $ 7 3 6 8 1 2

Suffix tree construction 5:$ 1234567890...acatgacatt... 5 t a c a 5:$ t c t t a 9 t 4 t 5:$ 5:$ 5:$ t t $ 7 3 6 8 1 2

Suffix tree construction 5:$ 1234567890...acatgacatt... 5 t a $ c 10 a 5:$ t c t t a 9 t 4 t 5:$ 5:$ 5:$ t t $ 7 3 6 8 1 2

ST Application: pattern matching • Find all occurrence of P=xa in T • Find node v in the ST that matches to P • Traverse the subtree rooted at v to get the locations x a b x a a c c 1 c b b x x c 4 6 a a c c 5 2 3 T: xabxac O(m) to construct ST (large constant factor) O(n) to find v – linear to length of P instead of T! O(k) to get all leaves, k is the number of occurrence.

ST application: repeats finding • Genome contains many repeated DNA sequences • Repeat sequence length: Varies from 1 nucleotide to whole gene • Highly repetitive DNA in some non-coding regions • 6 to 10bp x 100,000 to 1,000,000 times • Genes may have multiple copies (50 to 10,000)

Find longest repeated substring • Do a tree traversal, compute the lengths of labels at each node • O(m) 2:5 L = 4 15:18 6:10 L = 9 L = 8

Repeats finding • Find all repeats that are at least k-residue long and appear at least p times in the seq • Phase 1: top-down, count lengths of labels at each node • Phase 2: bottom-up: count # of leaves descended from each internal node For each node with L >= k, and N >= p, print all leaves O(m) to traverse tree (L, N)

Repeats finding • Find repeats with at least 3 bases and 2 occurrence • cat • acat • aca 5:e 1234567890acatgacatt 5 t a $ c 10 a 5:e t c t t a 9 t 4 t 5:e 5:e 5:e t t 7 3 6 8 1 2

Repeats finding • Left-maximal repeat • S[i+1..i+k] = S[j+1..j+k] • S[i] != S[j] • Right-maximal repeat • S[i+1..i+k] = S[j+1..j+k], • S[i+k+1] != S[j+k+1] • Maximal repeat • S[i+1..i+k] = S[j+1..j+k] • S[i] != S[j], and S[i+k+1] != S[j+k+1] acatgacatt • aca • cat • acat

Repeats finding • How to find maximal repeat? • A right-maximal repeats with different left chars 5:e 1234567890acatgacatt 5 t a $ c 10 a 5:e t c t t a 9 t 4 t 5:e 5:e 5:e t t 7 3 6 8 1 2 Left char = [] g c c a a

ST application: word enumeration • Find all k-mers that occur at least p times • Compute (L, N) for each node • Find nodes v with L>=k, and L(parent)<k, and N>=y • Traverse sub-tree rooted at v to get the locations L<k L=k L = K L>=k, N>=p This can be used in many applications. For example, to find words that appeared frequently in a genome or a document

Joint Suffix Tree • Build a ST for many than two strings • Two strings S1 and S2 • S* = S1& S2 • Build a suffix tree for S* in time O(|S1| + |S2|) • The separator will only appear in the edge ending in a leaf

S1 = abcd • S2 = abca • S* = abcd&abca$ & a b c d a d & c b c d & a b c a a b b c c d $ d d & a & a a 2,4 a b 1,4 a c 2,3 a b 2,1 c 2,2 d 1,1 1,3 1,2

CS5263 Bioinformatics