
Applications

Albert_Lan




  1. Applications • Exact string and substring matching • Longest common substrings • Finding and representing repeated substrings efficiently • Applications that lead to alternative, space efficient implementations • Matching statistics • Suffix Arrays

  2. String and substrings • Exact String matching: • Input • Pattern P of length n • Text T of length m • Output • Position of all occurrences of P in T • Solution method • Preprocess to create suffix tree for T • O(m) time, O(m) space • Maximally match P in suffix tree • Output all leaf positions below match point • O(n+k) time where k is number of matches
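
The solution method on this slide (match P from the root, then report every leaf below the match point) can be sketched on an uncompressed suffix trie. This is only an illustration: the trie takes O(m^2) space rather than the O(m) of a real suffix tree, and the names `build_suffix_trie` and `occurrences` and the use of "$" as an end marker are assumptions of this sketch, not part of the slides.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie.

    The key "$" marks the end of a suffix and stores its 1-based start
    position, so `text` and the pattern are assumed not to contain "$".
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start + 1
    return root


def occurrences(root, pattern):
    """Match `pattern` from the root, then collect every position below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # mismatch: the pattern does not occur in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:  # walk the subtree below the match point
        current = stack.pop()
        for key, child in current.items():
            if key == "$":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("mississippi")
print(occurrences(trie, "issi"))
```

The matching loop touches one node per pattern character and the collection step visits only the subtree below the match point, which mirrors the two phases of the suffix tree method.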

  3. String and substrings • Exact set matching: • Input • Set of patterns {Pi} of total length n • Text T of length m • Output • Position of all occurrences of each pattern Pi in T • Solution method • Preprocess to create suffix tree for T • O(m) time, O(m) space • Maximally match each Pi in suffix tree • Output all leaf positions below match point • O(n+k) time where k is number of total matches

  4. Comparison with Aho-Corasick • Aho-Corasick • O(n) preprocess time and space • to build keyword tree of set of patterns P • O(m+k) search time • Suffix Tree Approach • O(m) preprocess time and space • to build suffix tree of T • O(n+k) search time • Using matching statistics (defined later), the suffix tree approach can achieve a time/space tradeoff similar to that of Aho-Corasick

  5. String and substrings • Substring problem: • Input • Set of patterns {Pi} of total length n • Text T of length m (m < n now) • Output • Position of all occurrences of T in each pattern Pi • Solution method • Preprocess to create generalized suffix tree for {Pi} • O(n) time, O(n) space • Maximally match T in generalized suffix tree • Output all leaf positions below match point • O(m+k) time where k is number of total matches

  6. Common Substrings • Longest Common Substring problem: • Input • Strings S and T • Output • longest common substring of S and T (and position in S and T) • Solution method • Preprocess to create generalized suffix tree for {S,T} • Mark each node by whether its subtree contains a leaf of S, of T, or of both • A simple postorder tree traversal does this • The path label of the node with greatest string depth among those marked with both S and T is the longest common substring of S and T

  7. Common Substrings • Common substrings of length k problem: • Input • Strings S and T • Integer k • Output • all common substrings of S and T (and their positions in S and T) of length at least k • Solution method • Same as previous problem • Look for all nodes marked with both S and T whose string depth is at least k

  8. Longest Common Substrings of >2 Strings • Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings • Example: {sanddollar, sandlot, handler, grand, pantry} • Table columns: j, l(j), one such string • j = 2: l(j) = 4, sand • j = 3: l(j) = 3, and • j = 4: l(j) = 3, and • j = 5: l(j) = 2, an

  9. Problem definition and solution • Longest common substrings of >2 strings: • Input • Strings S1, …, SK (total length n) • Output • l(j) (and pointers to substrings) for 2 <= j <= K • Solution • Build a generalized suffix tree for the K strings • each string has a unique end character, so each leaf shows up only once

  10. Solution continued • Build a generalized suffix tree for the K strings • each string has a unique end character, so each leaf shows up only once • C(v): number of distinct leaf labels in subtree rooted at node v • Given C(v) values and string-depth values, do a simple traversal of tree to find these K-1 values and pointers to locations in substrings • Computing C(v) efficiently • # of leaves is not correct as some leaves may have same label • length K bit vector, 1 bit per string in set • OR your way up the tree • Each OR op takes O(K) time, which gives O(Kn) running time • This is improved to O(n) later
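
The bit-vector idea can be tried out on a small scale. The sketch below is an assumption-laden stand-in: it uses an uncompressed suffix trie instead of a generalized suffix tree, stores the length-K bit vector as a Python integer, and the names `Node`, `build_suffix_trie`, and `longest_common_lengths` are invented for illustration. It ORs masks from the leaves upward and records, for each j, the deepest node with C(v) >= j.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.mask = 0  # bit i is set if a suffix of string i ends in this subtree


def build_suffix_trie(strings):
    root = Node()
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
            node.mask |= 1 << idx  # mark only where the suffix ends
    return root


def longest_common_lengths(strings):
    """Return {j: l(j)} for 2 <= j <= K."""
    k_strings = len(strings)
    best = [0] * (k_strings + 1)

    def visit(node, depth):
        for child in node.children.values():
            node.mask |= visit(child, depth + 1)  # OR the way up the tree
        c_v = bin(node.mask).count("1")           # C(v): distinct strings below v
        for j in range(1, c_v + 1):
            best[j] = max(best[j], depth)
        return node.mask

    visit(build_suffix_trie(strings), 0)
    return {j: best[j] for j in range(2, k_strings + 1)}


print(longest_common_lengths(["sanddollar", "sandlot", "handler", "grand", "pantry"]))
```

Run on the five-string example from slide 8, this reproduces the l(j) column of that table.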

  11. Repeated substrings • Given a single string S • Definitions • maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different than the character to the immediate left (right) of b. • Add unique characters to front and end of S to include prefixes and suffixes. • Representation: (p1, p2, n’) • starting positions and length of the maximal pair • R(S) is the set of all triples representing maximal pairs in S

  12. Example • S = xabcyiiizabcqabcyrxar • 123456789012345678901 • (2, 10, 3) is a maximal pair • (10, 14, 3) is a maximal pair • (2, 14, 3) is not a maximal pair • (2, 14, 4) is a maximal pair • Note positions 2 and 14 are the start positions of two distinct maximal pairs
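
The four claims in this example can be checked directly from the definition. The helper below is a sketch: `is_maximal_pair` is a name chosen here, positions are 1-based as on the slide, and the two sentinel characters stand in for the "unique characters added to the front and end of S" (they are assumed not to occur in S).

```python
def is_maximal_pair(s, p1, p2, length):
    """Check the triple (p1, p2, length) against the maximal-pair definition."""
    padded = "\x00" + s + "\x01"  # unique sentinels; padded[p] is position p of s
    a = padded[p1:p1 + length]
    b = padded[p2:p2 + length]
    if p1 == p2 or a != b:
        return False
    left_differs = padded[p1 - 1] != padded[p2 - 1]
    right_differs = padded[p1 + length] != padded[p2 + length]
    return left_differs and right_differs


S = "xabcyiiizabcqabcyrxar"
for triple in [(2, 10, 3), (10, 14, 3), (2, 14, 3), (2, 14, 4)]:
    print(triple, is_maximal_pair(S, *triple))
```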

  13. More definitions • A maximal repeat a is a substring in S that is the substring defined by a maximal pair of S • R’(S) is the set of maximal repeats • Previous example • abc and abcy are maximal repeats of S • abc is represented only once • |R’(S)| is at most |R(S)|: abc appears in two triples of R(S) but only once in R’(S)

  14. Even more definitions • A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S • Previous example • abcy is a supermaximal repeat of S • abc is NOT a supermaximal repeat of S
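
The definitions of maximal and supermaximal repeats can be exercised by brute force on the running example. This is a definitional check only (cubic-time enumeration of pairs, not the suffix tree method developed on the following slides), and the function names and sentinel characters are assumptions of the sketch.

```python
def maximal_repeats(s):
    """R'(S): every substring that is defined by at least one maximal pair."""
    padded = "\x00" + s + "\x01"  # sentinels assumed absent from s
    n = len(s)
    repeats = set()
    for length in range(1, n):
        last_start = n - length + 1  # last 1-based start position
        for p1 in range(1, last_start + 1):
            for p2 in range(p1 + 1, last_start + 1):
                if (padded[p1:p1 + length] == padded[p2:p2 + length]
                        and padded[p1 - 1] != padded[p2 - 1]
                        and padded[p1 + length] != padded[p2 + length]):
                    repeats.add(padded[p1:p1 + length])
    return repeats


def supermaximal_repeats(s):
    """Maximal repeats that are not substrings of another maximal repeat."""
    repeats = maximal_repeats(s)
    return {r for r in repeats
            if not any(r != other and r in other for other in repeats)}


S = "xabcyiiizabcqabcyrxar"
print(sorted(maximal_repeats(S)))
print(sorted(supermaximal_repeats(S)))
```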

  15. Problem definition • Maximal repeats • Input • String S (length n) • Output • R’(S)

  16. Properties of maximal repeats • Construct suffix tree T for S • Lemma • If a is a maximal repeat in S, then a is the path-label of an internal node v in T • a does not end in the middle of an edge • (captures that the next character after a is distinct) • Corollary • There are at most n maximal repeats • n leaves • all internal nodes except the root have at least two children • therefore, at most n internal nodes

  17. More properties of maximal repeats • Definitions • Character S(i-1) is the left character of i • The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf • A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters • Theorem • String a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse • Captures that the character before a is different

  18. Identifying left diverse nodes • Bottom up procedure • All nodes will have a left character label • Leaf node: • Label leaves with their left character • Internal node v: • If any child is left diverse, so is v • If two children have different left character labels, v is left diverse • Otherwise, take on left character value of children • Compact representation • There is a compact tree T that consists only of left diverse nodes that represents all maximal repeats
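
A minimal sketch of the bottom-up procedure, under stated assumptions: it works on an uncompressed suffix trie, where the branching nodes (two or more outgoing continuations, counting "the suffix ends here" as one) play the role of the suffix tree's internal nodes. For simplicity it carries the full set of left characters up the tree rather than a single label, and `None` serves as the unique left character of position 1. All names are invented for illustration.

```python
def new_node():
    return {"children": {}, "ends": 0, "left": set()}


def maximal_repeats_by_left_diversity(s):
    root = new_node()
    for start in range(len(s)):
        left_char = s[start - 1] if start > 0 else None
        node = root
        for ch in s[start:]:
            node = node["children"].setdefault(ch, new_node())
        node["ends"] += 1            # acts like a leaf hanging below this node
        node["left"].add(left_char)  # label that leaf with its left character

    found = set()

    def visit(node, label):
        for ch, child in node["children"].items():
            node["left"] |= visit(child, label + ch)  # merge labels bottom-up
        branching = len(node["children"]) + node["ends"] >= 2
        left_diverse = len(node["left"]) >= 2
        if label and branching and left_diverse:
            found.add(label)
        return node["left"]

    visit(root, "")
    return found


print(sorted(maximal_repeats_by_left_diversity("xabcyiiizabcqabcyrxar")))
```

On the running example this reports abc and abcy along with the shorter maximal repeats a, i, ii, r, and xa, while substrings such as bc are rejected because every occurrence has the same left character.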

  19. Problem definition • Supermaximal repeats • Input • String S (length n) • Output • The set of supermaximal repeats of S • Key property • A left diverse node v represents a supermaximal repeat if and only if all of v’s children are leaves, and each has a distinct left character • Prove this

  20. Matching Statistics • Setting • Text T of length m • Pattern P of length n • Definition • For 1 <= i <=m, matching statistic ms(i) is the length of the longest substring beginning at T(i) that matches a substring somewhere in P • With matching statistics, one can solve several problems with less space than a suffix tree • Exact matching example: P occurs at i in T if and only if ms(i) = |P|

  21. Why study matching statistics • With matching statistics, one can solve several problems with less space than a suffix tree • Exact matching example • We’ll show an O(n) preprocessing time and O(m) search time solution matching the traditional methods • Key observation: P matches substring beginning at i in T if and only if ms(i) = |P|

  22. Construction Problem • Input • Text T of length m • Pattern P • Output • Compute ms(i) for 1 <=i <= m
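
As a reference point for the construction problem, ms(i) can be computed straight from its definition. The sketch below is quadratic or worse and uses no suffix tree, so it is useful only for checking a faster implementation; `matching_statistics` is a name assumed here, and the returned list is 0-based (entry i corresponds to ms(i+1) on the slides).

```python
def matching_statistics(text, pattern):
    """ms[i] = length of the longest prefix of text[i:] that occurs in pattern."""
    ms = []
    for i in range(len(text)):
        length = 0
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


T, P = "xabcab", "abc"
ms = matching_statistics(T, P)
print(ms)
# Exact matching: P occurs at position i exactly when ms(i) = |P|.
print([i + 1 for i, value in enumerate(ms) if value == len(P)])
```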

  23. Solution • Compute suffix tree of P retaining suffix links • ms(1): match T against tree • ms(i+1) given ms(i) • we are at some node v in the tree • If it is internal, follow suffix link to s(v) • Else if it is a leaf, go up one level to parent w • If w is an internal node, follow suffix link to s(w) • Traverse downwards using skip/count trick until we have matched all the characters in edge label (w,v) • Now match against T character by character till we have a mismatch and can output ms(i+1)

  24. Adding location of substring in P • p(i): a location in P such that the substring at p(i) matches substring starting at T(i) for exactly ms(i) positions • Before computing ms(i) values, mark each node of the suffix tree of P with the leaf number of one of its leaves • Simply output this value when outputting ms(i) values

  25. Applying matching statistics to LCS problem • Input • strings S and T • Output • longest common substring of S and T • Solution method • Compute suffix tree for the shorter string, say S • Compute ms(i) values for T • The maximum ms(i) value identifies the LCS
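
The last step (the maximum ms(i) identifies the LCS) can be seen in a few lines. This sketch computes each ms(i) by brute force instead of with a suffix tree of S, so only the final selection step reflects the slide; `longest_common_substring` is an assumed name.

```python
def longest_common_substring(s, t):
    """Scan t, keeping the position whose matching statistic against s is largest."""
    best_len, best_start = 0, 0
    for i in range(len(t)):
        length = 0  # ms(i): longest prefix of t[i:] that occurs somewhere in s
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        if length > best_len:
            best_len, best_start = length, i
    return t[best_start:best_start + best_len]


print(longest_common_substring("sanddollar", "sandlot"))
```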

  26. Suffix Arrays • Setting • Text T of length m • Definition • A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T • Add terminating character $ which is lexically smallest • Example • T = m i s s i s s i p p i • i: 1 2 3 4 5 6 7 8 9 10 11 • Pos(i): 11 8 5 2 1 10 9 7 4 6 3 • Rank of suffix i (the inverse of Pos): 5 4 11 9 3 10 8 2 7 6 1

  27. Computing Suffix Arrays • Input • Text T of length m • Output • Pos array • Solution • Compute suffix tree of T • Do a lexical depth-first traversal of the suffix tree, filling in Pos(i) with the leaves in the order they are encountered • Edge (v,u) is lexically smaller than edge (v,w) iff first character of (v,u) is lexically smaller than first character of (v,w)
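
For small inputs, Pos can be obtained without a suffix tree by sorting the suffixes directly. This swaps in a different technique from the lexical depth-first traversal on the slide (a comparison sort costing roughly O(m^2 log m) in the worst case), and `suffix_array` is an assumed name. Python's string comparison treats a proper prefix as smaller, which matches a terminating $ that is lexically smallest.

```python
def suffix_array(text):
    """Return Pos as a list of 1-based suffix start positions in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


pos = suffix_array("mississippi")
for rank, start in enumerate(pos, start=1):
    print(rank, start, "mississippi"[start - 1:])
```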

  28. Using Suffix Arrays • Input • Text T of length m • Pattern P of length n • Output • All occurrences of P in T • Solution • Compute suffix array Pos for T

  29. Properties of Suffix Arrays • If P is in T, then all these locations will be grouped consecutively in Pos • O(n log m) solution to matching problem • Using binary search, find smallest index i’ such that P exactly matches the n characters of suffix Pos(i’) • Similarly, find largest index i such that P exactly matches the n characters of suffix Pos(i)
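
The two binary searches can be sketched as follows. The suffix array is built by plain sorting to keep the example self-contained, positions inside the code are 0-based, and `find_occurrences` is an assumed name; each probe compares P with the first n characters of one suffix, giving the O(n log m) bound.

```python
def find_occurrences(text, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # 0-based suffix array
    n = len(pattern)

    def prefix(k):
        return text[pos[k]:pos[k] + n]  # first n characters of suffix pos[k]

    lo, hi = 0, len(pos)
    while lo < hi:  # smallest index whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:  # smallest index whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(p + 1 for p in pos[first:lo])


print(find_occurrences("mississippi", "issi"))
```

All matches occupy the contiguous block `pos[first:lo]`, which is the grouping property stated on the slide.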

  30. Speeding up binary search • Let L and R denote current left and right boundaries of current search interval • Initialization: L = 1, R = m • Let l and r denote the lengths of the longest prefixes of suffix Pos(L) and suffix Pos(R) that match a prefix of P, respectively • Define M = ceiling((L+R)/2) • Define mlr = min(l,r) • Can begin comparison of suffix Pos(M) at position mlr+1 • In practice, this is sufficient to achieve O(n + log m) search time, but worst case is Ω(n log m)
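
A sketch of the mlr trick, assuming 0-based indices, a floor midpoint instead of the ceiling used on the slide, and invented names (`lower_bound_mlr`, `extend`, `suffix_is_smaller`). It returns the smallest index into Pos whose suffix is not lexically smaller than P, skipping the first min(l, r) characters at every probe.

```python
def lower_bound_mlr(text, pos, pattern):
    n = len(pattern)

    def extend(start, k):
        """Grow a known match of length k between pattern and text[start:]."""
        while k < n and start + k < len(text) and pattern[k] == text[start + k]:
            k += 1
        return k

    def suffix_is_smaller(start, k):
        """Given match length k, is text[start:] lexically smaller than pattern?"""
        return k < n and (start + k == len(text) or text[start + k] < pattern[k])

    if not pos:
        return 0
    left, right = 0, len(pos) - 1
    l = extend(pos[left], 0)
    if not suffix_is_smaller(pos[left], l):
        return 0                      # even the first suffix is >= pattern
    r = extend(pos[right], 0)
    if suffix_is_smaller(pos[right], r):
        return len(pos)               # every suffix is smaller than pattern
    while right - left > 1:           # invariant: suffix[left] < pattern <= suffix[right]
        mid = (left + right) // 2
        k = extend(pos[mid], min(l, r))  # the first mlr characters already agree
        if suffix_is_smaller(pos[mid], k):
            left, l = mid, k
        else:
            right, r = mid, k
    return right


T = "mississippi"
Pos = sorted(range(len(T)), key=lambda i: T[i:])
print(lower_bound_mlr(T, Pos, "issi"))
```

Skipping min(l, r) characters is safe because suffix Pos(M) lies lexically between suffix Pos(L) and suffix Pos(R), both of which share that many leading characters with P.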

  31. Longest common prefixes • Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes beginning at Pos(i) and Pos(j). • Mississippi Example • Pos(3) = 5 (issippi) • Pos(4) = 2 (ississippi) • Lcp(3,4) = 4

  32. Getting to max(l,r) with Lcp’s • L, R, M, l, r defined as before • If l=r, compare P against suffix Pos(M) starting at position l+1 = r+1 • Suppose l > r • If Lcp(L,M) > l, the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and Pos(L) • Therefore, P agrees with suffix Pos(M) up through position l but disagrees in position l+1 • Furthermore, Pos(M) suffix is lexically smaller than P • Update: L = M, l and r unchanged

  33. Getting to max(l,r) with Lcp’s • Suppose l > r • If Lcp(L,M) < l, the common prefix of suffix Pos(L) and suffix Pos(M) is shorter than the common prefix of P and Pos(L) • Therefore, P agrees with suffix Pos(M) up through position Lcp(L,M) • Character Lcp(L,M)+1 of P and of suffix Pos(L) is lexically smaller than the corresponding character of suffix Pos(M) • Update: R = M, r = Lcp(L,M)

  34. Getting to max(l,r) with Lcp’s • Suppose l > r • If Lcp(L,M) = l, the common prefix of suffix Pos(L) and suffix Pos(M) is equal to the common prefix of P and Pos(L) • Therefore, P agrees with suffix Pos(M) up through position l and maybe even further • Need to compare P(l+1) to corresponding position in Pos(M) • Update: Will update R or L according to final determination of comparisons

  35. O(n + log m) bound • Since we begin at max(l,r), we never compare a matched position in P more than once • Redundant comparisons of characters of P are limited to at most one per binary search iteration, giving us O(n + log m)

  36. Computing Lcp values quickly • We want to get them in O(m) time • However, there are potentially O(m^2) different pairs (i,j) with an Lcp value • Crucial point • Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure • See Figure 7.7 for an example

  37. Process for needed Lcp values • Lcp(i,i+1): string depth of lowest common ancestor encountered during lexical depth-first traversal of suffix tree from Pos(i) leaf to Pos(i+1) leaf • Other Lcp values • Lcp(i,j): the minimum of Lcp(k,k+1) over k from i to j-1 • Take min of Lcp values of children in the binary tree of needed Lcp values (not the suffix tree)
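
The identity Lcp(i,j) = min of the adjacent values Lcp(k,k+1) for i <= k < j can be checked numerically on the mississippi example. The sketch computes every Lcp by direct character comparison rather than from the suffix tree, keeps the slides' 1-based indexing, and uses names assumed for illustration.

```python
text = "mississippi"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])  # 1-based starts


def lcp_of_suffixes(a, b):
    """Length of the longest common prefix of the suffixes starting at a and b."""
    k = 0
    while (a - 1 + k < len(text) and b - 1 + k < len(text)
           and text[a - 1 + k] == text[b - 1 + k]):
        k += 1
    return k


def Lcp(i, j):
    """Lcp of the suffixes at Pos(i) and Pos(j), with 1-based i and j."""
    return lcp_of_suffixes(pos[i - 1], pos[j - 1])


# adjacent[i - 1] holds Lcp(i, i+1)
adjacent = [Lcp(i, i + 1) for i in range(1, len(pos))]


def Lcp_from_adjacent(i, j):
    """Lcp(i, j) for i < j, as the minimum over the adjacent values."""
    return min(adjacent[i - 1:j - 1])


print(Lcp(3, 4), adjacent)
```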

  38. Lowest common ancestor • 1-time input • Tree T (not necessarily a suffix tree) • Later input • 2 nodes, v and w, of T • Output • lowest common ancestor of v,w in T • Goal • linear preprocess time • O(1) query time

  39. Longest Common Extension • 1-time input • Strings S1 and S2 • Later input • index positions i and j • Output • length of longest substring of S1 beginning at i that matches substring of S2 beginning at j • Goal • linear preprocess time • O(1) query time

  40. Illustration S1 i • Relationship to longest common substring • Similar, but now start positions are fixed S2 j

  41. Solution • Linear Preprocessing • Create general suffix tree for S1 and S2 • Compute string depth at each node • Process tree to allow for constant time LCA queries • Establish pointers to all leaf nodes in tree • Constant time query processing • Find u = lca(v,w) • Output string depth of u

  42. More space-efficient solution • Linear Preprocessing (Assume |S2| < |S1|) • Create general suffix tree for S2 • Compute matching statistic ms(i) and p(i) for S1 • length of longest match of substring starting at i in S1 with some substring in S2 • p(i) is the starting point of a location in S2 that matches • Process tree to allow for constant time LCA queries • Establish pointers to all leaf nodes in tree • Constant time query processing • Find u = lca(p(v), w) in suffix tree for S2 • Output min(ms(v), string depth of u) • why is this correct?

  43. Related Problem • Maximal Palindromes • Input • String S • Output • Location of all maximal palindromes in S • Solution • Longest common extensions of specific pairs of positions in S and S^r (the reverse of S) • O(|S|) solution
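
Which "specific pairs of positions" are meant can be made concrete. In the sketch below the longest common extension is computed by direct comparison, so the whole procedure is quadratic in the worst case rather than O(|S|); replacing `lce` with a constant-time query would give the bound on the slide. Indices are 0-based and all names are assumptions.

```python
def lce(a, i, b, j):
    """Longest common extension of a[i:] and b[j:], by direct comparison."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def maximal_palindromes(s):
    """Return (start, length) of the maximal palindrome at every center."""
    n, rev = len(s), s[::-1]
    result = []
    for q in range(n):        # odd length, centered on s[q]
        k = lce(s, q + 1, rev, n - q)
        result.append((q - k, 2 * k + 1))
    for q in range(1, n):     # even length, centered between s[q-1] and s[q]
        k = lce(s, q, rev, n - q)
        if k > 0:
            result.append((q - k, 2 * k))
    return result


def longest_palindrome(s):
    start, length = max(maximal_palindromes(s), key=lambda item: item[1])
    return s[start:start + length]


print(longest_palindrome("xabbay"))
```

Position n - q of the reversed string corresponds to s[q-1], so each query extends rightward from the center in S while reading leftward from the center through S^r.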

  44. Common substrings revisited • Longest common substrings of >2 strings: • Input • Strings S1, …, SK (total length n) • Output • l(j) (and pointers to substrings) for 2 <= j <= K • Problem with previous solution • O(Kn) time to compute C(v) values • C(v): number of distinct leaf labels in subtree rooted at node v

  45. Definitions • S(v): total number of leaves in v’s subtree • U(v): number of “duplicate suffixes” from same string that occur in v’s subtree • C(v) = S(v) - U(v) • ni(v) = number of leaves with identifier i in the subtree rooted at node v • ni = total number of leaves with identifier i

  46. Key Concepts • Definitions • S(v): total number of leaves in v’s subtree • U(v): number of “duplicate suffixes” from same string that occur in v’s subtree • ni(v) = number of leaves with identifier i in the subtree rooted at node v • ni = total number of leaves with identifier i • Observations • U(v) = Σ_i max(ni(v) - 1, 0) • C(v) = S(v) - U(v)
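
The second observation follows from the first in one line, since each identifier present below v contributes exactly 1 to C(v). Written out (with n_i(v) for ni(v), and the sum over i ranging over the K string identifiers):

```latex
\begin{aligned}
C(v) &= \sum_{i \,:\, n_i(v) > 0} 1
      = \sum_{i} \bigl( n_i(v) - \max(n_i(v) - 1,\, 0) \bigr) \\
     &= \sum_{i} n_i(v) \;-\; \sum_{i} \max(n_i(v) - 1,\, 0)
      = S(v) - U(v).
\end{aligned}
```

As a made-up numeric check: if n_1(v) = 3, n_2(v) = 1, and n_3(v) = 0, then S(v) = 4, U(v) = 2 + 0 + 0 = 2, and C(v) = 2, matching the two identifiers that actually appear below v.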

  47. Solution • Computing U(v) values • DF traversal of tree numbering leaves in order that they are encountered • For each string label i • Let Li be the list of leaves with identifier i, in increasing order of their dfs numbers • Compute lca of consecutive pair of leaves in Li for all pairs of consecutive leaves in Li • For each node v, let h(v) denote the number of times it is the lca computed from step above • Key property • For each identifier i with ni(v) > 0, exactly ni(v) - 1 of the lcas computed from Li lie in v’s subtree • Hence U(v) = Σ h(w), summed over all nodes w in v’s subtree

  48. Solution • Computing U(v) values • DF traversal of tree numbering leaves in order that they are encountered • Set h(v) to 0 for all nodes v • For each string label i • Compute lca v of consecutive pair of leaves in Li for all pairs of consecutive leaves in Li • Increment h(v) by 1 • Propagate h(v) values up the tree by addition to set U(v) • Set C(v) = S(v) - U(v)
