Applications

Applications • Exact string and substring matching • Longest common substrings • Finding and representing repeated substrings efficiently • Applications that lead to alternative, space efficient implementations • Matching statistics • Suffix Arrays

String and substrings • Exact String matching: • Input • Pattern P of length n • Text T of length m • Output • Position of all occurrences of P in T • Solution method • Preprocess to create suffix tree for T • O(m) time, O(m) space • Maximally match P in suffix tree • Output all leaf positions below match point • O(n+k) time where k is number of matches
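The match-then-report idea above can be sketched with an uncompressed suffix trie. This is a stand-in, not the linear-time suffix tree the slide assumes: the trie takes O(m^2) time and space, and the names `build_suffix_trie` and `occurrences` are hypothetical. Each node stores the start positions of the suffixes passing through it, which plays the role of "all leaf positions below the match point".

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie (assumption: a simple stand-in for a suffix tree).

    Each node is a dict mapping a single character to a child node. The extra
    key "starts" (multi-character, so it cannot collide with an edge label)
    lists the 1-based start positions of all suffixes passing through the node.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
            node.setdefault("starts", []).append(start + 1)
    return root


def occurrences(trie, pattern):
    """Walk the (non-empty) pattern down from the root and report positions below."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []          # pattern falls off the trie: no occurrence
        node = node[ch]
    return node.get("starts", [])


trie = build_suffix_trie("mississippi")
print(occurrences(trie, "issi"))
```

The same trie answers exact set matching by calling `occurrences` once per pattern.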

String and substrings • Exact set matching: • Input • Set of patterns {Pi} of total length n • Text T of length m • Output • Position of all occurrences of each pattern Pi in T • Solution method • Preprocess to create suffix tree for T • O(m) time, O(m) space • Maximally match each Pi in suffix tree • Output all leaf positions below match point • O(n+k) time where k is number of total matches

Comparison with Aho-Corasick • Aho-Corasick • O(n) preprocess time and space • to build keyword tree of set of patterns P • O(m+k) search time • Suffix Tree Approach • O(m) preprocess time and space • to build suffix tree of T • O(n+k) search time • Using matching statistics to be defined, can make this tradeoff similar to that of Aho-Corasick

String and substrings • Substring problem: • Input • Set of patterns {Pi} of total length n • Text T of length m (m < n now) • Output • Position of all occurrences of T in each pattern Pi • Solution method • Preprocess to create generalized suffix tree for {Pi} • O(n) time, O(n) space • Maximally match T in generalized suffix tree • Output all leaf positions below match point • O(m+k) time where k is number of total matches

Common Substrings • Longest Common Substring problem: • Input • Strings S and T • Output • longest common substring of S and T (and position in S and T) • Solution method • Preprocess to create generalized suffix tree for {S,T} • Mark each node by whether its subtree contains a leaf of S, a leaf of T, or both • A simple postorder tree traversal does this • Among the nodes marked "both", the path label of the node with greatest string depth is the longest common substring of S and T

Common Substrings • Common substrings of length k problem: • Input • Strings S and T • Integer k • Output • all common substrings of S and T (and position in S and T) of length at least k • Solution method • Same as previous problem • Look for all nodes marked with both S and T that have string depth at least k

Longest Common Substrings of >2 Strings • Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings • Example: {sanddollar, sandlot, handler, grand, pantry} • Table columns: j, l(j), one such substring • j = 2: l(2) = 4 (sand) • j = 3: l(3) = 3 (and) • j = 4: l(4) = 3 (and) • j = 5: l(5) = 2 (an)
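To make the definition of l(j) concrete, here is a brute-force sketch that enumerates every substring and counts how many of the K strings contain it. It is only a reference for checking the example, not the suffix-tree solution discussed in these slides; the function name `l_values` is hypothetical.

```python
from collections import defaultdict


def l_values(strings):
    """Return {j: l(j)} for 2 <= j <= K by brute force."""
    count = defaultdict(int)   # substring -> number of distinct strings containing it
    for s in strings:
        # A set, so that a substring repeated inside one string is counted once.
        subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        for sub in subs:
            count[sub] += 1
    K = len(strings)
    return {
        j: max((len(sub) for sub, c in count.items() if c >= j), default=0)
        for j in range(2, K + 1)
    }


print(l_values(["sanddollar", "sandlot", "handler", "grand", "pantry"]))
```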

Problem definition and solution • Longest common substrings of >2 strings: • Input • Strings S1, …, SK (total length n) • Output • l(j) (and pointers to substrings) for 2 <= j <= K • Solution • Build a generalized suffix tree for the K strings • each string has a unique end character, so each leaf shows up only once

Solution continued • Build a generalized suffix tree for the K strings • each string has a unique end character, so each leaf shows up only once • C(v): number of distinct leaf labels (string identifiers) in subtree rooted at node v • Given C(v) values and string-depth values, do a simple traversal of tree to find these K-1 values and pointers to locations in substrings • Computing C(v) efficiently • The number of leaves below v is not the answer, as several leaves may carry the same label • Keep a length-K bit vector at each node, 1 bit per string in set • OR your way up the tree • Each OR op takes O(K) time, giving O(Kn) running time • Can be improved to O(n) later

Repeated substrings • Given a single string S • Definitions • A maximal pair in S is a pair of identical substrings a and b in S such that the characters to the immediate left of a and b differ, and the characters to the immediate right of a and b differ • Add unique characters to front and end of S to include prefixes and suffixes. • Representation: (p1, p2, n’) • starting positions and length of the maximal pair • R(S) is the set of all triples representing maximal pairs in S

Example • S = xabcyiiizabcqabcyrxar • 123456789012345678901 • (2, 10, 3) is a maximal pair • (10, 14, 3) is a maximal pair • (2, 14, 3) is not a maximal pair • (2, 14, 4) is a maximal pair • Note positions 2 and 14 are the start positions of two distinct maximal pairs
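The example can be checked with a brute-force sketch of the definition. It assumes 1-based positions as on the slide, and uses `None` as a stand-in for the unique characters added before and after S; the name `maximal_pairs` is hypothetical. For a fixed pair of start positions whose left characters differ, extending the match as far right as possible yields the only maximal pair at those positions.

```python
def maximal_pairs(s):
    """Return the set of triples (p1, p2, length), 1-based, of maximal pairs in s."""
    n = len(s)

    def char_at(k):
        # None plays the role of the unique sentinel outside the string.
        return s[k] if 0 <= k < n else None

    pairs = set()
    for i in range(n):
        for j in range(i + 1, n):
            if char_at(i - 1) == char_at(j - 1):
                continue   # same left character: not maximal for any length
            length = 0
            while j + length < n and s[i + length] == s[j + length]:
                length += 1
            # The loop stopped at a mismatch or at the end of s, so the
            # right characters differ as well.
            if length > 0:
                pairs.add((i + 1, j + 1, length))
    return pairs


R = maximal_pairs("xabcyiiizabcqabcyrxar")
print((2, 10, 3) in R, (2, 14, 3) in R, (2, 14, 4) in R)
```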

More definitions • A maximal repeat a is a substring of S that is the substring defined by a maximal pair of S • R’(S) is the set of maximal repeats • Previous example • abc and abcy are maximal repeats of S • abc is represented only once • |R’(S)| is at most |R(S)|, and usually smaller: abc shows up twice in R(S), as (2, 10, 3) and (10, 14, 3), but only once in R’(S)

Even more definitions • A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S • Previous example • abcy is a supermaximal repeat of S • abc is NOT a supermaximal repeat of S

Problem definition • Maximal repeats • Input • String S (length n) • Output • R’(S)

Properties of maximal repeats • Construct suffix tree T for S • Lemma • If a is a maximal repeat in S, then a is the path-label of an internal node v in T • a does not end in the middle of an edge • (captures that the next character after a is distinct) • Corollary • There are at most n maximal repeats • n leaves • all internal nodes except the root have at least two children • therefore, at most n internal nodes

More properties of maximal repeats • Definitions • Character S(i-1) is the left character of position i • The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf • A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters • Theorem • String a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse • Captures that the character before a is different

Identifying left diverse nodes • Bottom up procedure • All nodes will have a left character label • Leaf node: • Label leaves with their left character • Internal node v: • If any child is left diverse, so is v • If two children have different left character labels, v is left diverse • Otherwise, take on left character value of children • Compact representation • There is a compact tree T that consists only of left diverse nodes that represents all maximal repeats

Problem definition • Supermaximal repeats • Input • String S (length n) • Output • The set of supermaximal repeats of S • Key property • A left diverse node v represents a supermaximal repeat if and only if all of v’s children are leaves, and each has a distinct left character • Prove this
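As a cross-check on the definitions, the sketch below finds maximal and supermaximal repeats by brute force, without a suffix tree. It mirrors the left-diverse theorem: a substring is reported when it occurs at least twice, its occurrences have at least two different left characters, and at least two different right characters (the analogue of being an internal node). `None` stands in for the unique end characters, and the function names are hypothetical.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Brute-force R'(S): O(n^2) substrings, so only suitable for short strings."""
    n = len(s)
    occ = defaultdict(list)    # substring -> list of 0-based start positions
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ[s[i:j]].append(i)
    result = set()
    for sub, starts in occ.items():
        if len(starts) < 2:
            continue
        lefts = {s[p - 1] if p > 0 else None for p in starts}
        rights = {s[p + len(sub)] if p + len(sub) < n else None for p in starts}
        if len(lefts) > 1 and len(rights) > 1:
            result.add(sub)
    return result


def supermaximal_repeats(s):
    """Maximal repeats that are not a substring of any other maximal repeat."""
    reps = maximal_repeats(s)
    return {a for a in reps if not any(a != b and a in b for b in reps)}


S = "xabcyiiizabcqabcyrxar"
print("abc" in maximal_repeats(S), "abcy" in supermaximal_repeats(S))
```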

Matching Statistics • Setting • Text T of length m • Pattern P of length n • Definition • For 1 <= i <=m, matching statistic ms(i) is the length of the longest substring beginning at T(i) that matches a substring somewhere in P • With matching statistics, one can solve several problems with less space than a suffix tree • Exact matching example: P occurs at i in T if and only if ms(i) = |P|

Why study matching statistics • With matching statistics, one can solve several problems with less space than a suffix tree • Exact matching example • We’ll show an O(n) preprocessing time and O(m) search time solution matching the traditional methods • Key observation: P matches substring beginning at i in T if and only if ms(i) = |P|

Construction Problem • Input • Text T of length m • Pattern P • Output • Compute ms(i) for 1 <=i <= m
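Before looking at the suffix-link construction, it may help to have a direct, definition-driven computation of ms(i) to test against. The sketch below is deliberately naive (it relies on Python's substring test rather than a suffix tree of P), and the name `matching_statistics` is hypothetical. The returned list is 0-based, so `ms[i - 1]` corresponds to ms(i) on the slides.

```python
def matching_statistics(text, pattern):
    """ms[i] = length of the longest substring starting at text[i] that occurs in pattern."""
    m = len(text)
    ms = []
    for i in range(m):
        length = 0
        # If text[i:i+k] occurs in pattern, so does every shorter prefix of it,
        # so growing one character at a time finds the longest match.
        while i + length < m and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("xabcabd", "abc")
# Exact matching: P occurs at position i (1-based) iff ms(i) = |P|.
print([i + 1 for i, v in enumerate(ms) if v == len("abc")])
```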

Solution • Compute suffix tree of P retaining suffix links • ms(1): match T against tree • ms(i+1) given ms(i) • we are at some node v in the tree • If v is internal, follow suffix link to s(v) • Else if v is a leaf, go up one level to parent w • If w is an internal node, follow suffix link to s(w) • Traverse downwards using skip/count trick until we have matched all the characters in edge label (w,v) • Now match against T character by character until we have a mismatch and can output ms(i+1)

Adding location of substring in P • p(i): a location in P such that the substring at p(i) matches substring starting at T(i) for exactly ms(i) positions • Before computing ms(i) values, mark each node in the suffix tree of P with the leaf number of one of its leaves • Simply output this value when outputting ms(i) values

Applying matching statistics to LCS problem • Input • strings S and T • Output • longest common substring of S and T • Solution method • Compute suffix tree for the shorter string, say S • Compute ms(i) values for T • Maximal ms(i) value identifies LCS
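The reduction above can be sketched as follows. The matching statistics are computed naively here (an assumption made to keep the example short; the slides use a suffix tree of the shorter string), and the name `longest_common_substring` is hypothetical. The point illustrated is only that the largest ms(i) identifies the answer.

```python
def longest_common_substring(s, t):
    """Return one longest common substring of s and t via matching statistics."""
    if len(s) > len(t):
        s, t = t, s            # s is now the shorter string, used as the "pattern"
    best_len, best_pos = 0, 0
    for i in range(len(t)):
        length = 0             # this becomes ms(i) of t against s
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        if length > best_len:
            best_len, best_pos = length, i
    return t[best_pos:best_pos + best_len]


print(longest_common_substring("sanddollar", "sandlot"))
```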

Suffix Arrays • Setting • Text T of length m • Definition • A suffix array for T, called Pos, is an array of the integers 1 to m specifying the lexicographic order of the m suffixes of string T • Add terminating character $ which is lexically smallest • Example • T = m i s s i s s i p p i • i: 1 2 3 4 5 6 7 8 9 10 11 • Pos(i): 11 8 5 2 1 10 9 7 4 6 3 • Rank of suffix i (the inverse of Pos): 5 4 11 9 3 10 8 2 7 6 1

Computing Suffix Arrays • Input • Text T of length m • Output • Pos array • Solution • Compute suffix tree of T • Do a lexical depth-first traversal of T labeling Pos(i) with leafs in order of encountering them • Edge (v,u) is lexically smaller than edge (v,w) iff first character of (v,u) is lexically smaller than first character of (v,w)
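For small inputs, Pos can also be obtained by sorting the suffixes directly, which is handy for checking a suffix-tree-based construction. This is a swapped-in naive method, not the traversal described above, and `suffix_array` is a hypothetical name. Python's string ordering treats a proper prefix as smaller, which matches the effect of a lexically smallest terminator $.

```python
def suffix_array(text):
    """Return Pos as a list of 1-based suffix start positions in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


print(suffix_array("mississippi"))
```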

Using Suffix Arrays • Input • Text T of length m • Pattern P of length n • Output • All occurrences of P in T • Solution • Compute suffix array Pos for T

Properties of Suffix Arrays • If P occurs in T, then all of its occurrences are grouped consecutively in Pos • O(n log m) solution to matching problem • Using binary search, find smallest index i’ such that P exactly matches the first n characters of suffix Pos(i’) • Similarly, find largest index i such that P exactly matches the first n characters of suffix Pos(i) • The occurrences of P are Pos(i’), ..., Pos(i)
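A minimal sketch of the O(n log m) search: two binary searches over the suffix array, each comparing P against the first n characters of a suffix. The suffix array is built by naive sorting here (an assumption for brevity), and `find_occurrences` is a hypothetical name.

```python
def find_occurrences(text, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])   # 0-based suffix array
    n = len(pattern)

    def first_index(strictly_greater):
        # Smallest index whose n-character prefix is >= pattern
        # (or > pattern when strictly_greater is True).
        lo, hi = 0, len(pos)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[pos[mid]:pos[mid] + n]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start = first_index(False)
    end = first_index(True)
    # All matches sit in the consecutive block pos[start:end].
    return sorted(p + 1 for p in pos[start:end])


print(find_occurrences("mississippi", "issi"))
```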

Speeding up binary search • Let L and R denote current left and right boundaries of current search interval • Initialization: L = 1, R = m • Let l and r denote the lengths of the longest prefixes of suffixes Pos(L) and Pos(R) that match a prefix of P, respectively • Define M = ceiling((L+R)/2) • Define mlr = min(l,r) • Can begin comparison of P against suffix Pos(M) at position mlr+1 • In practice, this is sufficient to achieve O(n + log m) search time, but worst case is Ω(n log m)
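The mlr trick can be sketched as below for the lower-bound search (the smallest index whose suffix is not lexically smaller than P). Indices are 0-based here, unlike the slides, and the helper names `lcp_from`, `is_less`, and `lower_bound` are hypothetical. Skipping the first min(l, r) characters is safe because suffix Pos(M) lies lexically between Pos(L) and Pos(R), which both agree with P on that many characters.

```python
def lower_bound(text, pos, pattern):
    """Smallest 0-based index k with text[pos[k]:] >= pattern; len(pos) if none."""
    n = len(pattern)
    if not pos:
        return 0

    def lcp_from(start, k):
        # Extend a known match of k characters between text[start:] and pattern.
        while k < n and start + k < len(text) and text[start + k] == pattern[k]:
            k += 1
        return k

    def is_less(start, k):
        # Is text[start:] < pattern, given that exactly k characters match?
        if k == n:
            return False                     # pattern is a prefix of the suffix
        if start + k == len(text):
            return True                      # suffix is a proper prefix of pattern
        return text[start + k] < pattern[k]

    L, R = 0, len(pos) - 1
    l = lcp_from(pos[L], 0)
    if not is_less(pos[L], l):
        return 0
    r = lcp_from(pos[R], 0)
    if is_less(pos[R], r):
        return len(pos)
    # Invariant: suffix pos[L] < pattern <= suffix pos[R].
    while R - L > 1:
        M = (L + R + 1) // 2                 # ceiling((L+R)/2)
        k = lcp_from(pos[M], min(l, r))      # start comparing at position mlr+1
        if is_less(pos[M], k):
            L, l = M, k
        else:
            R, r = M, k
    return R


text = "mississippi"
pos = sorted(range(len(text)), key=lambda i: text[i:])
print(lower_bound(text, pos, "issi"))
```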

Longest common prefixes • Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes beginning at Pos(i) and Pos(j). • Mississippi Example • Pos(3) = 5 (issippi) • Pos(4) = 2 (ississippi) • Lcp(3,4) = 4

Getting to max(l,r) with Lcp’s • L, R, M, l, r defined as before • If l=r, compare P against suffix Pos(M) starting at position l+1 = r+1 • Suppose l > r • If Lcp(L,M) > l, the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and Pos(L) • Therefore, P agrees with suffix Pos(M) up through position l but disagrees in position l+1 • Furthermore, Pos(M) suffix is lexically smaller than P • Update: L = M, l and r unchanged

Getting to max(l,r) with Lcp’s • Suppose l > r • If Lcp(L,M) < l, the common prefix of suffix Pos(L) and suffix Pos(M) is shorter than the common prefix of P and Pos(L) • Therefore, P agrees with suffix Pos(M) up through position Lcp(L,M). • Character Lcp(L,M)+1 of P and of suffix Pos(L) is lexically smaller than the corresponding character of suffix Pos(M) • Update: R = M, r = Lcp(L,M)

Getting to max(l,r) with Lcp’s • Suppose l > r • If Lcp(L,M) = l, the common prefix of suffix Pos(L) and suffix Pos(M) is equal to the common prefix of P and Pos(L) • Therefore, P agrees with suffix Pos(M) up through position l and maybe even further • Need to compare P(l+1) to corresponding position in Pos(M) • Update: Will update R or L according to final determination of comparisons

O(n + log m) bound • Since we begin at max(l,r), we never re-examine an already matched position in P, apart from at most one redundant comparison per binary search iteration • With O(log m) iterations, this gives O(n + log m)

Computing Lcp values quickly • We want to get them in O(m) time • However, there are potentially O(m^2) different (i,j) pairs for which Lcp(i,j) could be asked • Crucial point • Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure • See Figure 7.7 for an example

Process for needed Lcp values • Lcp(i,i+1): string depth of lowest common ancestor encountered during lexical depth-first traversal of suffix tree from Pos(i) leaf to Pos(i+1) leaf • Other Lcp values • Lcp(i,j) = min over k from i to j-1 of Lcp(k,k+1) • Take min of Lcp values of children in the binary tree of needed Lcp values (not the suffix tree)
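The min rule can be checked on the mississippi example with a small sketch. The adjacent values Lcp(k,k+1) are computed here by direct character comparison rather than from a suffix tree (an assumption to keep the example self-contained), and the function names are hypothetical. Indices follow the slides: `pos[k - 1]` is Pos(k) and `adj[k - 1]` is Lcp(k,k+1).

```python
def common_prefix_len(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def build_pos_and_adjacent_lcp(text):
    """Return (Pos, adj), where adj[k-1] = Lcp(k, k+1) for 1 <= k < m."""
    pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
    adj = [
        common_prefix_len(text[pos[k] - 1:], text[pos[k + 1] - 1:])
        for k in range(len(pos) - 1)
    ]
    return pos, adj


def lcp(adj, i, j):
    """Lcp(i, j) for 1-based i < j, as the minimum of the adjacent values between them."""
    return min(adj[i - 1:j - 1])


pos, adj = build_pos_and_adjacent_lcp("mississippi")
print(pos[2], pos[3], lcp(adj, 3, 4))
```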

Lowest common ancestor • 1-time input • Tree T (not necessarily a suffix tree) • Later input • 2 nodes, v and w, of T • Output • lowest common ancestor of v,w in T • Goal • linear preprocess time • O(1) query time

Longest Common Extension • 1-time input • Strings S1 and S2 • Later input • index positions i and j • Output • length of longest substring of S1 beginning at i that matches substring of S2 beginning at j • Goal • linear preprocess time • O(1) query time

Illustration • (Figure: S1 with position i marked, S2 with position j marked) • Relationship to longest common substring • Similar, but now start positions are fixed

Solution • Linear Preprocessing • Create generalized suffix tree for S1 and S2 • Compute string depth at each node • Process tree to allow for constant time LCA queries • Establish pointers to all leaf nodes in tree • Constant time query processing for (i, j) • Find u = lca(v,w), where v is the leaf for suffix i of S1 and w is the leaf for suffix j of S2 • Output string depth of u

More space-efficient solution • Linear Preprocessing (Assume |S2| < |S1|) • Create suffix tree for S2 • Compute matching statistics ms(i) and p(i) for each position i in S1 • ms(i): length of longest match of substring starting at i in S1 with some substring in S2 • p(i): starting position of one such match in S2 • Process tree to allow for constant time LCA queries • Establish pointers to all leaf nodes in tree • Constant time query processing for (i, j) • Find u = lca(v, w) in suffix tree for S2, where v is the leaf for suffix p(i) and w is the leaf for suffix j • Output min(ms(i), string depth of u) • Why is this correct?

Related Problem • Maximal Palindromes • Input • String S • Output • Location of all maximal palindromes in S • Solution • Longest common extensions of specific pairs of positions in S and Sr (the reverse of S) • O(|S|) solution
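The reduction to longest common extension can be sketched as follows. The `lce` helper here extends character by character, so each query costs time proportional to its answer; with the constant-time LCE queries described earlier the whole loop would run in O(|S|). Indices are 0-based and both function names are hypothetical.

```python
def lce(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:] (naive extension)."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def maximal_palindromes(s):
    """Return (start, length) of the maximal palindrome at every center of s."""
    n = len(s)
    rev = s[::-1]
    result = []
    for q in range(n):
        # Odd length, centered on s[q]: rev[n - q + t] is s[q - 1 - t],
        # so this compares s[q + 1 + t] with s[q - 1 - t].
        k = lce(s, q + 1, rev, n - q)
        result.append((q - k, 2 * k + 1))
        if q > 0:
            # Even length, centered between s[q - 1] and s[q].
            k = lce(s, q, rev, n - q)
            if k > 0:
                result.append((q - k, 2 * k))
    return result


print(maximal_palindromes("xabbay"))
```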

Common substrings revisited • Longest common substrings of >2 strings: • Input • Strings S1, …, SK (total length n) • Output • l(j) (and pointers to substrings) for 2 <= j <= K • Problem with previous solution • O(Kn) time to compute C(v) values • C(v): number of distinct leaf labels in subtree rooted at node v

Definitions • S(v): total number of leaves in v’s subtree • U(v): number of “duplicate suffixes” from same string that occur in v’s subtree • C(v) = S(v) - U(v) • ni(v) = number of leaves with identifier i in the subtree rooted at node v • ni = total number of leaves with identifier i

Key Concepts • Definitions • S(v): total number of leaves in v’s subtree • U(v): number of “duplicate suffixes” from same string that occur in v’s subtree • ni(v) = number of leaves with identifier i in the subtree rooted at node v • ni = total number of leaves with identifier i • Observations • U(v) = Σ over i of max(ni(v) - 1, 0) • C(v) = S(v) - U(v)

Solution • Computing U(v) values • DF traversal of tree numbering leaves in order that they are encountered • For each string label i • Let Li be the list of leaves with identifier i, in increasing order of their dfs numbers • Compute the lca of each pair of consecutive leaves in Li • For each node v, let h(v) denote the number of times it is an lca computed in the step above • Key property • For each i with ni(v) > 0, exactly ni(v) - 1 of the lcas computed for Li lie in v’s subtree, so U(v) = Σ h(w) over all nodes w in v’s subtree

Solution • Computing U(v) values • DF traversal of tree numbering leaves in order that they are encountered • Set h(v) to 0 for all nodes v • For each string label i • Compute lca v of consecutive pair of leaves in Li for all pairs of consecutive leaves in Li • Increment h(v) by 1 • Propagate h(v) values up the tree by addition to set U(v) • Set C(v) = S(v) - U(v)
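The counting scheme works on any rooted tree whose leaves carry string identifiers, so it can be sketched without building a generalized suffix tree. In the sketch below the tree is given as a hypothetical `children` dict, leaf identifiers as a `label` dict, and the lca is found by naive parent walking rather than a constant-time structure; all of these are assumptions for illustration.

```python
def distinct_leaf_label_counts(children, label, root):
    """Return {v: C(v)} where C(v) = S(v) - U(v)."""
    parent, depth, order = {root: None}, {root: 0}, []
    stack = [root]
    while stack:                                  # iterative DFS in preorder
        v = stack.pop()
        order.append(v)
        for c in reversed(children.get(v, [])):   # leftmost child is visited first
            parent[c], depth[c] = v, depth[v] + 1
            stack.append(c)

    def lca(a, b):                                # naive: walk the deeper node up
        while a != b:
            if depth[a] >= depth[b]:
                a = parent[a]
            else:
                b = parent[b]
        return a

    S = {v: 0 for v in order}
    h = {v: 0 for v in order}
    last_leaf = {}                                # identifier -> previous leaf in DFS order
    for v in order:
        if not children.get(v):                   # v is a leaf
            S[v] = 1
            i = label[v]
            if i in last_leaf:                    # consecutive pair in the list Li
                h[lca(last_leaf[i], v)] += 1
            last_leaf[i] = v

    U = dict(h)
    for v in reversed(order):                     # descendants before ancestors
        if parent[v] is not None:
            S[parent[v]] += S[v]
            U[parent[v]] += U[v]
    return {v: S[v] - U[v] for v in order}


children = {"r": ["u", "w"], "u": ["a", "b", "c"], "w": ["d", "e"]}
label = {"a": 1, "b": 1, "c": 2, "d": 1, "e": 2}
print(distinct_leaf_label_counts(children, label, "r"))
```

In this toy tree, node u has three leaves but only two distinct identifiers, so S(u) = 3, U(u) = 1, and C(u) = 2.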
