
Bioinformatics Algorithms and Data Structures




  1. Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods Lecturer: Dr. Rose Slides by: Dr. Rose February 20, 2003

  2. Overview • Exclusion methods: fast expected time O(m) • Partition approaches: • BYP algorithm • Aho-Corasick exact matching algorithm • Keyword trees • Back to Aho-Corasick exact matching algorithm • Algorithm for computing failure links • Back to BYP algorithm

  3. Exclusion Methods Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference? A: On average, yes. (Are we quibbling?) We adopt an algorithm whose expected time beats Θ(km), although its worst case may be no better than Θ(km).

  4. Exclusion Methods Partition idea: exclude much of T from the search Preliminaries: Let s = |Σ|, where Σ is the alphabet used in P and T. Let n = |P|, and m = |T|. Defn. An approximate occurrence of P is an occurrence with at most k mismatches or differences. General partition algorithm: three phases • Partition phase • Search phase • Check phase

  5. Exclusion Methods • Partition phase • Partition either T or P into r-length regions (which one depends on the particular algorithm) • Search phase • Use exact matching to search T for r-length intervals • These are potential targets for approximate occurrences of P. • Eliminate as many intervals as possible. • Check phase • Use approximate matching to check for an approximate occurrence of P around each interval surviving the search phase.

  6. BYP Method The BYP method has O(m) expected running time. Partition P into r-length regions, r = n/(k+1) Q: How many r-length regions of P are there? A: k+1; there may be an additional short region. Suppose there is a match of P and T with at most k differences. Q: What can we deduce about the corresponding r-length regions? A: There must be at least one r-length interval that matches exactly.
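The partition phase is easy to make concrete. A minimal Python sketch (the function name and return convention are illustrative, not from the textbook):

```python
# Partition P into k+1 regions of length r = n // (k+1); any characters
# left over form the additional short region mentioned above.
def partition(P, k):
    n = len(P)
    r = n // (k + 1)
    regions = [P[j * r:(j + 1) * r] for j in range(k + 1)]
    leftover = P[(k + 1) * r:]   # possibly empty short region
    return regions, leftover
```

For example, partition("potatoes", 2) gives the regions ["po", "ta", "to"] plus the short region "es".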

  7. BYP Method BYP Algorithm: • Let P be the set of the first k+1 substrings of P’s partitioning. • Build a keyword tree for the set of patterns P. • Use Aho-Corasick to find I, the set of starting locations in T where a pattern in P occurs exactly. • ….. Oops! We haven’t talked about keyword trees or Aho-Corasick. Sooooo let’s do that now.

  8. Keyword Trees (section 3.4) Defn. The keyword tree for set P is a rooted directed tree K satisfying: • Each edge is labeled with one character • Any two edges out of the same node have distinct labels. • Every pattern Pi in P maps to some node v of K s.t. the path from the root to v spells out Pi • Every leaf in K is mapped by some pattern in P.
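The definition above can be sketched directly as code. A minimal Python version, with the tree stored as a list of per-node edge dictionaries (build_keyword_tree, edges, and pattern_id are my own names, not the textbook's):

```python
def build_keyword_tree(patterns):
    """edges[v] maps a character to a child of node v; pattern_id[v] is the
    index of the pattern spelled out on the root-to-v path, or None."""
    edges, pattern_id = [{}], [None]          # node 0 is the root
    for i, p in enumerate(patterns):
        v = 0
        for ch in p:
            if ch not in edges[v]:            # edges out of a node have distinct labels
                edges.append({})
                pattern_id.append(None)
                edges[v][ch] = len(edges) - 1
            v = edges[v][ch]
        pattern_id[v] = i                     # the node that spells out Pi
    return edges, pattern_id

edges, pid = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
```

With the textbook's example set, the root has exactly two outgoing edges (p and s), reflecting the shared prefixes.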

  9. Keyword Trees Example: From textbook P = {potato, poetry, pottery, science, school}

  10. Keyword Trees (section 3.4) Observation: there is an isomorphic mapping between distinct prefixes of patterns in P and nodes in K. • Every node corresponds to a prefix of a pattern in P. • Conversely, every prefix of a pattern maps to a node in K.

  11. Keyword Trees (section 3.4) • If n is the total length of all patterns in P, then we can construct K in O(n) time, assuming a fixed alphabet Σ. • Let Ki denote the partial keyword tree that encodes patterns P1, …, Pi of P.

  12. Keyword Trees (section 3.4) • Consider partial keyword tree K1 • comprised of a single path of |P1| edges out of root r. • Each edge is labeled with one character of P1 • Reading from the root to the leaf spells out P1 • The leaf is labeled 1

  13. Keyword Trees (section 3.4) Creating K2 from K1: • Find the longest path from the root of K1 that matches a prefix of P2. • This path ends by • (a) either exhausting the characters of P2, or • (b) ending at some existing node v in K1 where no extending match is possible. In case (a), label the node where the path ends 2. In case (b), create a new path out of v, labeled by the remaining characters of P2.

  14. Keyword Trees (section 3.4) Example: P1 is potato • P2 = pot ends by exhausting its characters at an existing node • P2 = potty branches off with a new path at the node spelling pot

  15. Keyword Trees (section 3.4) Use of keyword trees for matching • Finding occurrences of patterns in P that occur starting at position l in T: • Starting at the root r in K, follow the unique path that matches a substring of T that starts at l. • Numbered nodes along this path indicate matched patterns in P that start at position l. • This takes time proportional to min(n, m) • Traversing K for each position l in T gives O(nm) • This can be improved!
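The naïve O(nm) traversal described above might look like this in Python (a sketch; the dict-of-dicts trie and the "#" end-marker are my own conventions):

```python
def naive_search(patterns, text):
    """Walk the keyword tree from scratch at every text position l: O(nm)."""
    root = {}
    for i, p in enumerate(patterns):          # build the keyword tree
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node["#"] = i                         # pattern-numbered node
    hits = []
    for l in range(len(text)):                # restart at the root each time
        node = root
        for c in range(l, len(text)):
            if text[c] not in node:
                break
            node = node[text[c]]
            if "#" in node:                   # numbered node on this path
                hits.append((patterns[node["#"]], l))
    return hits
```

For instance, naive_search(["pot", "potato", "at"], "hotpotato") reports pot and potato starting at position 3 and at starting at position 6 (0-indexed).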

  16. Keyword Tree Speedup Observation: Our naïve keyword tree search is like the naïve approach to string comparison. • Every time we increment l, we start all over at the root of K ⇒ O(nm) Recall: KMP avoided O(nm) by shifting to get a speedup. Q: Is there an analogous operation we can perform in K? A: Of course, why else would I ask a rhetorical question?

  17. Keyword Tree Speedup First, we assume Pi is not a substring of Pj for all pairs Pi, Pj in P. (This assumption will be relaxed later in the full Aho-Corasick algorithm.) Next, each node v in K is labeled with the string formed by concatenating the letters from the root to v. Defn. Let L(v) denote the label of node v. Defn. Let lp(v) denote the length of the longest proper suffix of string L(v) that is a prefix of some pattern in P.

  18. Keyword Tree Speedup Example: L(v) = potat, lp(v) = 2; the suffix at is a prefix of P4.

  19. Keyword Tree Speedup Note: if α is the lp(v)-length suffix of L(v), then there is a unique node labeled α. Example: at is the lp(v)-length suffix of L(v); w is the unique node labeled at.

  20. Keyword Tree Speedup Defn: For node v of K let nv be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0 then nv is the root of K. Defn: The ordered pair (v,nv) is called a failure link. Example:

  21. Aho-Corasick (section 3.4.6) Algorithm AC search (we assume Pi is not a substring of Pj for all pairs Pi, Pj in P):

l = 1; c = 1; w = root of K;
repeat {
    while there is an edge (w, w´) labeled with character T(c) {
        if w´ is numbered by pattern i then
            report that Pi occurs in T starting at position l;
        w = w´ and c = c + 1;
    }
    w = nw and l = c − lp(w);
} until c > m;

Note: if matching fails at the root, increment c and enter the repeat loop again.

  22. Aho-Corasick Example: T = hotpotattach When l = 4 there is a match of pot, but the next position fails. At this point c = 9. The failure link points to the node labeled at and lp(v) = 2. l = c – lp(v) = 9 – 2 = 7

  23. Computing nv in Linear Time • Note: if v is the root r or 1 character away from r, then nv = r. • Imagine nv has been computed for every node that is exactly k or fewer edges from r. • How can we compute nv for v, a node k+1 edges from r? (We also want L(nv).)

  24. Computing nv in Linear Time • We are looking for nv and L(nv). • Let v´ be the parent of v in K and x the character on the edge connecting them. • nv´ is known, since v´ is k edges from r. • Clearly, L(nv) must be a suffix of L(nv´) followed by x. • First check if there is an edge (nv´, w´) with label x. • If so, then nv = w´. • O/w, L(nv) is a proper suffix of L(nv´) followed by x. • Examine nnv´ for an outgoing edge labeled x. • If no joy, keep repeating, finally setting nv = r if we run out of edges.

  25. Computing nv in Linear Time Algorithm nv:

/* Initialization */
v´ = parent(v) in K, and x = the character on the edge (v´, v)
w = nv´
/* Computation */
while (there is no edge labeled x out of node w) and (w ≠ r)
    w = nw
if there is an edge (w, w´) labeled x then
    nv = w´
else
    nv = r
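Implementations usually run Algorithm nv over the whole tree in breadth-first order, so that the parent's failure link is always available when a child is processed. A Python sketch under that assumption (build_tree and compute_failure_links are illustrative names; nodes are integers and the root is 0):

```python
from collections import deque

def build_tree(patterns):
    edges = [{}]                             # edges[v]: character -> child of v
    for p in patterns:
        v = 0
        for ch in p:
            if ch not in edges[v]:
                edges.append({})
                edges[v][ch] = len(edges) - 1
            v = edges[v][ch]
    return edges

def compute_failure_links(edges):
    fail = [0] * len(edges)                  # root and depth-1 nodes: nv = root
    queue = deque(edges[0].values())
    while queue:                             # BFS: parents before children
        v = queue.popleft()
        for x, child in edges[v].items():
            queue.append(child)
            w = fail[v]                      # start from the parent's link nv'
            while w and x not in edges[w]:
                w = fail[w]                  # keep following failure links
            fail[child] = edges[w][x] if x in edges[w] else 0
    return fail
```

On the slides' example, the node spelling potat gets a failure link to the node spelling at (so lp = 2).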

  26. Computing nv in Linear Time Thm. Algorithm nv takes O(n) time when applied to all nodes in K, where n is the total length of all patterns in K. Q: How can we demonstrate this? Consider pattern P in P, where t is the length of P. • Since lp(v) ≤ lp(v´) + 1, lp() increases by at most 1 per node, hence by at most t along the path for P. • But each assignment (w = nw) decreases lp(): if w is assigned k times while processing v, then lp(v) ≤ lp(v´) − k + 1. ⇒ Since lp() is never negative, the total number of assignments (w = nw) is bounded by t.

  27. Computing nv in Linear Time Thm. cont. • The total time is proportional to the loop: while (there is no edge labeled x out of node w) and (w ≠ r) w = nw • Since the total number of loop iterations is bounded by t, the length of P, all failure links on the path for P are set in O(t) time. We can apply the same logic to the other patterns in P to yield a linear computation for all failure links.

  28. Aho-Corasick (section 3.4.6) Relaxing the substring assumption, i.e., that Pi is not a substring of Pj for all pairs Pi, Pj in P. Consider: P = {acatt, ca}, T = acatg. T is matched along K until the letter g is reached. For the current node v, L(v) = acat. There is no edge labeled g out of v. No proper suffix of acat is a prefix of acatt or ca • Therefore nv is the root. • Return to the root and set l = 5. • At this point the algorithm will fail to match g. • It does not find the occurrence of ca in T. Q: What went wrong?

  29. Aho-Corasick (section 3.4.6) A: The problem is that the algorithm increases l (shifts) to match the longest suffix of L(v) with a prefix of some P in P. A pattern embedded in the matched portion of T need not be a suffix of L(v), so the shift can skip right over it. Soln: Report fully embedded patterns as they appear.

  30. Aho-Corasick (section 3.4.6) Q: How do we find the fully embedded patterns as they appear? Lemmas 3.4.2 & 3.4.3 Lemma 3.4.2 Pattern Pi must occur in T ending at position c if node v is reached and there is a directed path of failure links in K from v to a node numbered with pattern i. Lemma 3.4.3 If node v is reached, then pattern Pi occurs in T ending at position c only if v is numbered i or there is a directed path of failure links from v to a node numbered i.

  31. Aho-Corasick (section 3.4.6) Algorithm full AC search (no assumption that Pi is not a substring of Pj; embedded patterns are reported as they appear):

l = 1; c = 1; w = root of K;
repeat {
    while there is an edge (w, w´) labeled with character T(c) {
        if w´ is numbered by pattern i, or there is a directed path of failure links from w´ to a node numbered with i, then
            report that Pi occurs in T ending at position c;
        w = w´ and c = c + 1;
    }
    w = nw and l = c − lp(w);
} until c > m;
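Putting the pieces together, here is a compact, self-contained Python sketch of the full search: keyword tree, failure links, the output links introduced on the next slides, and the scan of T. It reports 0-indexed end positions rather than the slides' 1-indexed c, and all names are illustrative:

```python
from collections import deque

class AhoCorasick:
    def __init__(self, patterns):
        self.edges, self.pid = [{}], [None]       # keyword tree; node 0 = root
        for i, p in enumerate(patterns):
            v = 0
            for ch in p:
                if ch not in self.edges[v]:
                    self.edges.append({})
                    self.pid.append(None)
                    self.edges[v][ch] = len(self.edges) - 1
                v = self.edges[v][ch]
            self.pid[v] = i
        self.fail = [0] * len(self.edges)         # failure links (nv)
        self.out = [None] * len(self.edges)       # output links
        queue = deque(self.edges[0].values())
        while queue:                              # BFS: parents before children
            v = queue.popleft()
            for x, child in self.edges[v].items():
                queue.append(child)
                w = self.fail[v]
                while w and x not in self.edges[w]:
                    w = self.fail[w]
                f = self.edges[w][x] if x in self.edges[w] else 0
                self.fail[child] = f
                self.out[child] = f if self.pid[f] is not None else self.out[f]

    def search(self, text):
        hits, v = [], 0                           # (pattern index, end position)
        for c, ch in enumerate(text):
            while v and ch not in self.edges[v]:
                v = self.fail[v]                  # mismatch: shift via nv
            v = self.edges[v].get(ch, 0)
            if self.pid[v] is not None:           # v itself is numbered
                hits.append((self.pid[v], c))
            u = self.out[v]                       # report embedded patterns
            while u is not None:
                hits.append((self.pid[u], c))
                u = self.out[u]
        return hits
```

On the slides' problem case this now works: AhoCorasick(["acatt", "ca"]).search("acatg") reports pattern 1 (ca) ending at index 2.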

  32. Aho-Corasick (section 3.4.6) Q: How do we recognize directed failure-link paths to pattern-numbered nodes? Idea: associate with each node its pattern-numbered node, if there is one. Q: Where should this be done? • Algorithm nv must be extended. Caveat: the time cannot be more than linear!

  33. Aho-Corasick (section 3.4.6) Idea: associate with each node its pattern-numbered node, if there is one. Mechanism: create an output link from the node to its pattern-numbered node. How: Compute the failure link nv for node v. If nv is a pattern-numbered node, set v’s output link to nv. O/w, if nv has an output link to w, set v’s output link to w. O/w, v has no output link. This takes O(n) time.

  34. Aho-Corasick (section 3.4.6) Analysis: • Preprocessing of the patterns in P can be done in O(n) time. • All occurrences in T of patterns P in P can be found in time O(m + k), where • m is the length of T • k is the number of occurrences of patterns from P. Here we are counting the time to output each occurrence. Overall, we get O(n + m + k) time.

  35. BYP Method The BYP method has O(m) expected running time. Partition P into r-length regions, r = n/(k+1) Q: How many r-length regions of P are there? A: k+1; there may be an additional short region. Lemma 12.3.1 Suppose there is a match of P and T with at most k differences. Q: What can we deduce about the corresponding r-length regions? A: There must be at least one r-length interval that matches exactly.

  36. BYP Method BYP Algorithm: • Let P be the set of the first k+1 substrings of P’s partitioning. • Build a keyword tree for the set of patterns P. • Use Aho-Corasick to find I, the set of starting locations in T where a pattern in P occurs exactly. • For each i ∈ I, use approximate matching to locate the end points of approximate occurrences of P in T[i−n−k..i+n+k].
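An end-to-end sketch of the three phases in Python. This is illustrative only: the search phase uses str.find as a stand-in for the Aho-Corasick scan (asymptotically worse, but the phase structure is identical), and the check phase uses the standard semi-global edit-distance DP:

```python
def approx_occurs(P, S, k):
    """Does P occur somewhere in S with at most k differences?
    Semi-global DP: first row is 0, so an occurrence may start anywhere."""
    prev = [0] * (len(S) + 1)
    for i, cp in enumerate(P, 1):
        cur = [i]
        for j, cs in enumerate(S, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[-1] + 1,                # insertion
                           prev[j - 1] + (cp != cs)))  # match / substitution
        prev = cur
    return min(prev) <= k

def byp(P, T, k):
    n = len(P)
    r = n // (k + 1)
    regions = [P[j * r:(j + 1) * r] for j in range(k + 1)]   # partition phase
    I = set()
    for region in regions:                                   # search phase
        i = T.find(region)                                   # stand-in for AC
        while i != -1:
            I.add(i)
            i = T.find(region, i + 1)
    return sorted(i for i in I                               # check phase
                  if approx_occurs(P, T[max(0, i - n - k): i + n + k], k))
```

For example, byp("potato", "xxpotxtoy", 1) keeps the seed at index 2: the region pot matches exactly, and potxto is within 1 difference of potato.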

  37. BYP Method Q: Why do we choose the range i−n−k..i+n+k in T, i.e., T[i−n−k..i+n+k]? • What is n? • What is k? • Why the offsets −(n + k) and +(n + k)?

  38. BYP Method BYP considers all possible places for an occurrence of P in T. (Lemma 12.3.1) Step b: Building the keyword tree takes O(n). Step c: Aho-Corasick takes O(m) (since O(m + k) is O(m)). Note: there are other approaches for steps b & c (pg 272). Step d: takes time O(n) for each index in I. Recall that we are interested in expected running time O(m). The worst case may be larger.

  39. BYP: Expected Time From the previous slide: steps b & c are already O(m) worst case. (Why is this true?) Analysis of step d: assume • The size of our alphabet is s (s = |Σ|) • T has a uniform distribution of characters • P can be any arbitrary string • All p ∈ P have the same length, r. • Q: What is the expected number of occurrences of an arbitrary length-r string p in T if |T| = r? • A: 1/s^r (see next slide for explanation)

  40. BYP: Expected Time A: 1/s^r, because we assume a uniform distribution of characters in T. • The probability of any specific character is 1/s. • The probability of any specific sequence of k characters is 1/s^k. However, |T| ≠ r; in fact, |T| ≫ r. Q: If there are m substrings of length r in T, what is the expected number of exact occurrences of p in T? A: m/s^r Q: What is the expected number of occurrences in T of patterns in P? A: m(k+1)/s^r (recall there are k+1 patterns in P)
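Plugging in some concrete (assumed, not from the slides) numbers shows how quickly the expected number of seed hits shrinks:

```python
# Illustrative figures: DNA alphabet, a 1 Mb text, a 60-character pattern,
# and up to 5 differences.
s, m = 4, 1_000_000      # alphabet size, |T|
n, k = 60, 5             # |P|, difference bound
r = n // (k + 1)         # region length: 60 // 6 = 10
expected_hits = m * (k + 1) / s ** r
# m(k+1)/s^r = 6,000,000 / 4^10, about 5.7 expected seed hits to check
```

So even on a megabase of text, only a handful of intervals survive to the check phase on average.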

  41. BYP: Expected Time Q: How long does it take to check for a single approximate occurrence of P in T in step d? A: Dynamic programming gives O(n^2) (global alignment). The expected checking time in step d is then n^2·m(k+1)/s^r. For O(m) time, we need to choose k s.t. n^2(k+1)/s^r ≤ c for some constant c.

  42. BYP: Expected Time To simplify, substitute n − 1 for k (so k + 1 = n), and solve for r in n^3/s^r = c. Then s^r = n^3/c, so r = log_s(n^3) − log_s(c). But we are given r = n/(k+1).

  43. BYP: Expected Time So for k = n/(log n), BYP has an expected running time of O(m). Q: Where did n/(log n) come from? A: It is a simple expression for k in terms of n that satisfies our inequality. To see this, substitute n/(log n) for k into n^2(k+1)/s^r ≤ c.

  44. BYP: Expected Time Letting s^r = n^3/c, you get n^2(k+1)·c/n^3 = c(k+1)/n, which simplifies, with k + 1 ≈ n/(log n), to c/(log n) ≤ c.
