Bioinformatics Algorithms and Data Structures

Bioinformatics Algorithms and Data Structures Chapter 1: Exact String Matching Instructor: Dr. Rose January 14, 2003

Exact String Matching • Types of abstract problems: • Pattern matching, i.e., find pattern P in string S. • Similarity comparison, i.e., what is the longest common substring of S1 and S2? • Can we find P´ ~ P in S? We can think of P´ as a mutation of P. • What are the regions of similarity in S1 and S2? We can also ask this while allowing mutations.

Exact String Matching • Q: What is the underlying theme common to these abstract problems? • A: Correlation, i.e., correlation between two signals, strings, etc.

Exact String Matching • Q: What is the simplest way to compare two strings? • A: Look for a mapping of one string into the other.

Exact String Matching Given two strings S1 and S2, where length(S1) <= length(S2): • Start at the beginning of S1 and S2. • Compare corresponding characters, i.e., S1[1] & S2[1], S1[2] & S2[2], etc. Continue until either: • All the characters in S1 have been matched, or • A mismatch is found.

Exact String Matching If there is a mismatch, shift S1 one character position along S2 and start over, e.g., compare S1[1] & S2[2], S1[2] & S2[3], etc. Keep doing this until a match is found or the possible starting positions in S2 are exhausted.
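The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name naive_find and the comparison counter are assumptions added here so the cost of the method can be observed directly.

```python
def naive_find(pattern, text):
    """Naive exact matching.

    Returns (position, comparisons): the 1-based position of the first
    occurrence of pattern in text (None if there is none) and the number
    of character comparisons performed.
    """
    len_p, len_t = len(pattern), len(text)
    comparisons = 0
    for start in range(len_t - len_p + 1):   # every possible starting position
        j = 0
        while j < len_p:
            comparisons += 1
            if pattern[j] != text[start + j]:
                break                        # mismatch: shift by one and start over
            j += 1
        if j == len_p:                       # all characters of pattern matched
            return start + 1, comparisons
    return None, comparisons


print(naive_find("adab", "abaracadabara"))   # (7, 13)
print(naive_find("aaaaab", "aaaaaaaaaaab"))  # (7, 42)
```

The two calls use S1 = adab with S2 = abaracadabara, and S1 = aaaaab with S2 = aaaaaaaaaaab; the second pair shows how quickly the comparison count grows when most alignments almost match.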

Exact String Matching Example 1: S1 = adab, S2 = abaracadabara
S2: a b a r a c a d a b a r a
1:  a d                          d != b
2:    a                          a != b
3:      a d                      d != r
4:        a                      a != r
5:          a d                  d != c
6:            a                  a != c
7:              a d a b          Finally!
Q: How many comparisons? A: 13 (2 + 1 + 2 + 1 + 2 + 1 + 4). Looks OK?

Exact String Matching Example 2: S1 = aaaaab, length(S1) = 6; S2 = aaaaaaaaaaab, length(S2) = 12
S2: a a a a a a a a a a a b
1:  a a a a a b                  b != a
2:    a a a a a b                b != a
3:      a a a a a b              b != a
4:        a a a a a b            b != a
5:          a a a a a b          b != a
6:            a a a a a b        b != a
7:              a a a a a b      Finally!
Q: How many comparisons were made? A: 42 = 7 × 6 = (12 – 6 + 1) × 6 = (N – M + 1) × M, where length(S2) = N and length(S1) = M. Q: Where did this come from? A: There are N – M + 1 possible match positions in S2, and in this example each one costs M comparisons.

Exact String Matching Bottom line: the worst-case time complexity is Θ(NM). Observation: Notice that many of the characters in S2 are involved in multiple comparisons. Why? A: Because the naïve approach doesn’t learn from previous matches. In example 2, by the time the first mismatch occurs, we know what the first 6 characters of S1 and S2 are.

Exact String Matching Note: A smarter approach would not involve the first 6 characters of S2 in subsequent comparisons. Fast matching algorithms take advantage of this insight. Q: Where does this insight come from? A: Preprocessing either S1 or S2.

Exact String Matching Insight: if a match fails, 1) don’t make redundant comparisons, and 2) skip to the next possible match position. Note: the next possible match position may be more than 1 character shift away. Let’s consider both of these ideas with respect to examples 1 and 2.

Exact String Matching Let’s review example 2:
S2: a a a a a a a a a a a b
1:  a a a a a b                  b != a; we have now seen the first 6 characters
2:    a a a a a b                b != a; we already know the a’s match, so we only need to try to match the ‘b’
3:      a a a a a b              b != a; ditto
4:        a a a a a b            b != a; ditto
5:          a a a a a b          b != a; ditto
6:            a a a a a b        b != a; ditto
7:              a a a a a b      Finally!
The number of comparisons is 12 (6 + 1 + 1 + 1 + 1 + 1 + 1) instead of the previous 42.

Exact String Matching Let’s review example 1: S1 = adab, S2 = abaracadabara
S2: a b a r a c a d a b a r a
1:  a d                          d != b; we have seen the first 2 characters of S2. The next possible match must be at least two positions away.
2:      a d                      d != r; we have seen the first 4 characters of S2. The next possible match must be at least two positions away.
3:          a d                  d != c; we have seen the first 6 characters of S2. The next possible match must be at least two positions away.
4:              a d a b          Finally!
Q: How many comparisons? A: 10 (2 + 2 + 2 + 4). The previous approach took 13 comparisons.

Preprocessing a String Core Idea: For each position i > 1 in string S, identify the maximal substring starting at i that matches a prefix of S. Q: Why do we want to do this? A: We will use this information in two ways: 1) It tells us how far to skip for the next possible match (recall example 1). 2) Knowledge of prefix matches allows us to avoid redundant comparisons (recall example 2). Do we need to go back and review examples 1 and 2?

Preprocessing a String Let M(Si) denote the maximal substring starting at position i > 1 that matches a prefix of S. Example: S = aabcaabxaaz (from textbook) M(S2) = a, M(S3) = Ø, M(S4) = Ø, M(S5) = aab

Preprocessing a String Let Z(Si) denote the length of the maximal substring M(Si) starting from position i>1 that matches a prefix of S Example: S = aabcaabxaaz (from textbook) Z(S2) = 1, since M(S2) = a Z(S3) = 0, since M(S3) = Ø Z(S4) = 0, since M(S4) = Ø Z(S5) = 3, since M(S5) = aab
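The definition of Z(Si) can be checked directly by brute force. The sketch below is only an illustration of the definition (the name z_naive is an assumption introduced here); it compares characters explicitly at every position, so it can take quadratic time and is not the linear-time method.

```python
def z_naive(s):
    """Z values straight from the definition.

    Returns a dict mapping each 1-based position i > 1 to Zi, the length of
    the longest substring starting at i that matches a prefix of s.
    """
    n = len(s)
    z = {}
    for i in range(2, n + 1):
        length = 0
        # s[length] walks the prefix, s[i - 1 + length] walks the substring at i
        while i - 1 + length < n and s[length] == s[i - 1 + length]:
            length += 1
        z[i] = length
    return z


z = z_naive("aabcaabxaaz")
print(z[2], z[3], z[4], z[5])  # 1 0 0 3
```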

Preprocessing a String Consider the figure above, depicting string S and two maximal substrings α and β, starting at positions j and k, respectively, that match prefixes of S. Zj is the length of α, and Zk is the length of β. Gusfield refers to these boxes as Z-boxes.

Preprocessing a String Let’s look at a concrete instance of this abstraction

Preprocessing a String For all i > 1, ri denotes the right-most endpoint of the Z-boxes that begin at or before position i. Note that while i is in both α and β, the rightmost endpoint of these Z-boxes is the endpoint of α.

Preprocessing a String Let’s compare the abstract depiction with our concrete example.

Preprocessing a String li is the left end of the Z-box ending at ri.
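As a small illustration, ri and li can be derived from already-known Z values with one left-to-right pass. This sketch assumes Gusfield's convention that ri is the right-most endpoint among Z-boxes beginning at or before i, and uses 0 for both values when no Z-box has been found yet; the function name z_box_ends and the dict-based input are hypothetical choices made for the example.

```python
def z_box_ends(z):
    """Given z, a dict mapping 1-based positions i > 1 to Zi,
    return a dict mapping each i to the pair (li, ri)."""
    l = r = 0                  # no Z-box found yet
    ends = {}
    for i in sorted(z):
        end = i + z[i] - 1     # right end of the Z-box starting at i (if Zi > 0)
        if z[i] > 0 and end > r:
            l, r = i, end      # this box reaches further right than any earlier one
        ends[i] = (l, r)
    return ends


# Z values for S = aabcaabxaaz, positions 2..11
z = {2: 1, 3: 0, 4: 0, 5: 3, 6: 1, 7: 0, 8: 0, 9: 2, 10: 1, 11: 0}
ends = z_box_ends(z)
print(ends[4])   # (2, 2)
print(ends[6])   # (5, 7)
print(ends[10])  # (9, 10)
```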

Preprocessing a String Again, compare the abstract with the concrete.

Preprocessing a String • We will now consider how to find the Z-boxes in linear time, O(|S|). • We can use this to find exact matches in linear time.

Preprocessing a String • We start by computing Z2, explicitly comparing characters S[1] & S[2], S[2] & S[3], etc., until a mismatch is found. • If Z2 > 0, then let r = r2 = Z2 + 1 and l = l2 = 2; otherwise let r = 0 and l = 0.

Preprocessing a String • Iterate to compute all subsequent Zk. • When Zk is computed, all previous Zi, 1< i <= k-1 are already known.

The Z Algorithm • If k > r, then k is not in any Z-box that has yet been found. • We must find Zk by comparing characters starting at position k with characters starting at position 1 in S.

The Z Algorithm • If k <= r, then k is contained in a previously found Z-box, say α = S[l..r]. • Because α matches the prefix S[1..Zl], the substring β from k to r matches the substring from k´ to Zl of that prefix, where k´ = k – l + 1.

The Z Algorithm Here is a concrete example where k <= r. We see that k is contained in a previously found Z-box α. The substring β from k to r matches the substring from k´ to Zl at the start of S.

The Z Algorithm • We need to check if the value of Zk´ is nonzero. • Why? • If Zk´ is nonzero, then a substring starting at k´ matches a prefix of S. This means that a substring starting at k must also match a prefix of S.

The Z Algorithm • Here is a concrete example where the value of Zk´ is zero. • The substring starting at k´ does not match a prefix of S, nor does the substring at k.

The Z Algorithm If Zk´ is nonzero, how long is the prefix match starting at k? • It is at least as long as the smaller of Zk´ and |β|. • Of course it may be longer.

The Z Algorithm • The prefix match starting at k is at least as long as the smaller of Zk´ and |β|. Case 2a: If Zk´ < |β|, then its length is exactly Zk´, as depicted in the figure below. In this case, r and l remain unchanged.

The Z Algorithm Here is a concrete example where Zk´ < |β|. In this case, 3 < 6. The length of the prefix match starting at k is exactly Zk´, i.e., 3. In this case, r and l remain unchanged.

The Z Algorithm Case 2b: If Zk´ >= |β|, then β is a prefix of S, as depicted in the figure below. • It could be the case that Zk > |β|. This can only be determined by extending the match past r.

The Z Algorithm Here is a concrete example where Zk´ = |β|, i.e., 3 = 3. We can see that β is a prefix of S.

The Z Algorithm Here is a concrete example where Zk´ > |β|. We can see that β is a prefix of S and so is this longer substring starting at k. Only by extending the match past r are we able to distinguish between Zk = |β| and Zk > |β|.

The Z Algorithm • In extending the match past r, say a mismatch occurs at q, q >= r + 1. • Set Zk = q – k, r = q – 1, and l = k, as shown in the figure below.

The Z Algorithm • Using our concrete example: In extending the match past r, a mismatch occurs at q, q = r + 2. • Set Zk = q – k, r = q – 1, and l = k.

The Z Algorithm • Continue to iterate through the entire string. • Computing subsequent Zk will entail only the cases we discussed:
• Case 1: k > r, k is not in a known Z-box. Find Zk by explicitly matching with the start of S. Set r & l accordingly.
• Case 2a: k <= r and Zk´ < |β|, the prefix match at k is wholly contained in β. Zk = Zk´; r & l are not changed.
• Case 2b: k <= r and Zk´ >= |β|, β is a prefix of S. Try to extend the match past r. Set Zk = q – k, l = k, and r = q – 1, where q is the position of the first mismatch.
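The three cases can be put together as follows. This is a sketch written for these notes rather than code from the text: it keeps the 1-based positions k, l, r, q and k´ used on the slides (Python's 0-based indexing is handled inside the comparisons), and the names z_algorithm, k_prime and beta_len are assumptions introduced here.

```python
def z_algorithm(s):
    """Linear-time Z values.

    Returns a list z where z[k] is Zk for 1-based k >= 2;
    z[0] and z[1] are unused and left at 0.
    """
    n = len(s)
    z = [0] * (n + 1)
    l = r = 0                              # no Z-box found yet
    for k in range(2, n + 1):
        if k > r:
            # Case 1: k is not in a known Z-box; compare explicitly with the start of S.
            q = k
            while q <= n and s[q - 1] == s[q - k]:
                q += 1                     # q stops at the first mismatch (or n + 1)
            z[k] = q - k
            if z[k] > 0:
                l, r = k, q - 1
        else:
            k_prime = k - l + 1            # position matching k inside the prefix
            beta_len = r - k + 1           # length of the substring from k to r
            if z[k_prime] < beta_len:
                # Case 2a: the match at k ends inside the known Z-box.
                z[k] = z[k_prime]
            else:
                # Case 2b: the substring from k to r is a prefix of S; extend past r.
                q = r + 1
                while q <= n and s[q - 1] == s[q - k]:
                    q += 1
                z[k] = q - k
                l, r = k, q - 1
    return z


print(z_algorithm("aabcaabxaaz")[2:])  # [1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

Every explicit comparison that succeeds moves r to the right, and each iteration ends with at most one mismatch, which is the intuition behind the linear running time.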

The Z Algorithm • Theorem 1.4.1 Using algorithm Z, value Zk is correctly computed and variables r and l are correctly updated. Proof on page 9 of text. • Theorem 1.4.2 All the Zk(S) values are computed by algorithm Z in O(|S|), i.e., linear time. Proof on page 10 of text.

A Simple Linear-Time Exact String Matching Algorithm • We can use algorithm Z by itself as a simple linear-time string matching algorithm. • Let S = P$T where: • T is the target string, |T| = m • P is the pattern string, |P| = n, n <= m • $ is a character not appearing in either P or T. • Apply algorithm Z to S.

A Simple Linear-Time Exact String Matching Algorithm • Since $ does not appear in P or T, no substring starting at a position i > 1 can match a prefix of S longer than n = |P|, i.e., Zi(S) <= n. • We only need to consider Zi(S) for i in T, i.e., i > n + 1. • Any value of i such that i > n + 1 and Zi(S) = n indicates a match of P starting at position i – (n + 1) in T. • All Zi(S) are computed in O(m + n) = O(m) time, since n <= m.
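A compact sketch of this matcher is given below. It is an illustration under stated assumptions: the helper z_values uses 0-based indices and the common "min, then extend" formulation of the Z computation rather than the explicit three-case layout, the function names are hypothetical, and the separator defaults to "$" with a check that it really is absent from both strings.

```python
def z_values(s):
    """0-based Z values: z[i] is the length of the longest substring
    starting at index i that matches a prefix of s (z[0] is left at 0)."""
    n = len(s)
    z = [0] * n
    l = r = 0                              # current Z-box is s[l:r], r exclusive
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])    # reuse what the Z-box already tells us
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                      # extend by explicit comparison
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def find_all(pattern, text, sep="$"):
    """Return the 1-based positions in text where pattern occurs."""
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    n = len(pattern)
    z = z_values(pattern + sep + text)
    # 0-based index i in S = P$T corresponds to 1-based position i - n in T.
    return [i - n for i in range(n + 1, len(z)) if z[i] == n]


print(find_all("adab", "abaracadabara"))  # [7]
print(find_all("aa", "aaaa"))             # [1, 2, 3]
```

Unlike the naive method, this reports every occurrence of the pattern, including overlapping ones, in a single pass over S.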
