1 / 48

# Applications - PowerPoint PPT Presentation

Applications. Exact string and substring matching Longest common substrings Finding and representing repeated substrings efficiently Applications that lead to alternative, space efficient implementations Matching statistics Suffix Arrays. String and substrings. Exact String matching:

Related searches for Applications

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Applications' - Albert_Lan

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

• Exact string and substring matching

• Longest common substrings

• Finding and representing repeated substrings efficiently

• Applications that lead to alternative, space efficient implementations

• Matching statistics

• Suffix Arrays

• Exact String matching:

• Input

• Pattern P of length n

• Text T of length m

• Output

• Position of all occurrences of P in T

• Solution method

• Preprocess to create suffix tree for T

• O(m) time, O(m) space

• Maximally match P in suffix tree

• Output all leaf positions below match point

• O(n+k) time where k is number of matches

• Exact set matching:

• Input

• Set of patterns {Pi} of total length n

• Text T of length m

• Output

• Position of all occurrences of each pattern Pi in T

• Solution method

• Preprocess to create suffix tree for T

• O(m) time, O(m) space

• Maximally match each Pi in suffix tree

• Output all leaf positions below match point

• O(n+k) time where k is number of total matches

• Aho-Corasick

• O(n) preprocess time and space

• to build keyword tree of set of patterns P

• O(m+k) search time

• Suffix Tree Approach

• O(m) preprocess time and space

• to build suffix tree of T

• O(n+k) search time

• Using matching statistics to be defined, can make this tradeoff similar to that of Aho-Corasick

• Substring problem:

• Input

• Set of patterns {Pi} of total length n

• Text T of length m (m < n now)

• Output

• Position of all occurrences of T in each pattern Pi

• Solution method

• Preprocess to create generalized suffix tree for {Pi}

• O(n) time, O(n) space

• Maximally match T in generalized suffix tree

• Output all leaf positions below match point

• O(m+k) time where k is number of total matches

• Longest Common Substring problem:

• Input

• Strings S and T

• Output

• longest common substring of S and T (and position in S and T)

• Solution method

• Preprocess to create generalized suffix tree for {S,T}

• Mark each node by whether or not its subtree contains a leaf node of S, T, or both

• Simple postfix tree traversal algorithm to do this

• Path label of node with greatest string depth is the longest common substring of S and T

• Common substrings of length k problem:

• Input

• Strings S and T

• Integer k

• Output

• all substrings of S and T (and position in S and T) of length at least k

• Solution method

• Same as previous problem

• Look for all nodes with 2 leaf labels of string depth at least k

• Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings

• Example: {sanddollar, sandlot, handler, grand, pantry}

• j l(j) one string

• 2 4 sand

• 3 3 and

• 4 3 and

• 5 2 an

• Longest common substrings of >2 strings:

• Input

• Strings S1, …, SK (total length n)

• Output

• l(j) (and pointers to substrings) for 2 <= j <= K

• Solution

• Build a generalized suffix tree for the K strings

• each string has a unique end character, so each leaf shows up only once

• Build a generalized suffix tree for the K strings

• each string has a unique end character, so each leaf shows up only once

• C(v): number of distinct leaf labels in subtree rooted at node v

• Given C(v) values and string-depth values, do a simple traversal of tree to find these K-1 values and pointers to locations in substrings

• Computing C(v) efficiently

• # of leaves is not correct as some leaves may have same label

• length K bit vector, 1 bit per string in set

• OR your way up the tree

• Each OR op takes O(K) time which give O(Kn) running time

• Can be improved to be O(n) later

• Given a single string S

• Definitions

• maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different than the character to the immediate left (right) of b.

• Add unique characters to front and end of S to include prefixes and suffixes.

• Representation: (p1, p2, n’)

• starting positions and length of the maximal pair

• R(S) is the set of all triples representing maximal pairs in S

• S = xabcyiiizabcqabcyrxar

• 123456789012345678901

• (2, 10, 3) is a maximal pair

• (10, 14, 3) is a maximal pair

• (2, 14, 3) is not a maximal pair

• (2, 14, 4) is a maximal pair

• Note positions 2 and 14 are the start positions of two distinct maximal pairs

• A maximal repeat a is a substring in S that is the substring defined by a maximal pair of S

• R’(S) is the set of maximal repeats

• Previous example

• abc and abcy are maximal repeats of S

• abc is represented only once

• |R’(S)| is smaller than R(S) as abc shows up twice in the second set but only once in the first set

• A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S

• Previous example

• abcy is a supermaximal repeat of S

• abc is NOT a supermaximal repeat of S

• Maximal repeats

• Input

• String S (length n)

• Output

• R’(S)

• Construct suffix tree for S

• Lemma

• If a is a maximal repeat in S, then a is the path-label of an internal node v in T

• a does not end in the middle of an edge

• (captures next character aftera is distinct)

• Corollary

• There are at most n maximal repeats

• n leaves

• all internal nodes except the root have at least two children

• therefore, at most n internal nodes

• Definitions

• Character S(i-1) is the left character of i

• The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf

• A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters

• Theorem

• String a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse

• Capture that character beforea is different

• Bottom up procedure

• All nodes will have a left character label

• Leaf node:

• Label leaves with their left character

• Internal node v:

• If any child is left diverse, so is v

• If two children have different left character labels, v is left diverse

• Otherwise, take on left character value of children

• Compact representation

• There is a compact tree T that consists only of left diverse nodes that represents all maximal repeats

• Supermaximal repeats

• Input

• String S (length n)

• Output

• The set of supermaximal repeats of S

• Key property

• A left diverse node v represents a supermaximal repeat if and only if all of v’s children are leaves, and each has a distinct left character

• Prove this

• Setting

• Text T of length m

• Pattern P of length n

• Definition

• For 1 <= i <=m, matching statistic ms(i) is the length of the longest substring beginning at T(i) that matches a substring somewhere in P

• With matching statistics, one can solve several problems with less space than a suffix tree

• Exact matching example: P occurs at i in T if and only if ms(i) = |P|

• With matching statistics, one can solve several problems with less space than a suffix tree

• Exact matching example

• We’ll show an O(n) preprocessing time and O(m) search time solution matching the traditional methods

• Key observation: P matches substring beginning at i in T if and only if ms(i) = |P|

• Input

• Text T of length m

• Pattern P

• Output

• Compute ms(i) for 1 <=i <= m

• Compute suffix tree of P retaining suffix links

• ms(1): match T against tree

• ms(i+1) given ms(i)

• we are at some node v in the tree

• Else if it is a leaf, go up one level to parent w

• If we is an internal node, follow suffix link to s(w)

• Traverse downwards using skip/count trick until we have matched all the characters in edge label (w,v)

• Now match against T character by character till we have a mismatch and can output ms(i+1)

• p(i): a location in P such that the substring at p(i) matches substring starting at T(i) for exactly ms(i) positions

• Before computing ms(i) values, mark each node in T with the leaf number of one of its leaves

• Simply output this value when outputting ms(i) values

• Input

• strings S and T

• Output

• longest common substring of S and T

• Solution method

• Compute suffix tree for shortest string, say S

• Compute ms(i) values for T

• Maximal ms(i) value identifies LCS

• Setting

• Text T of length m

• Definition

• A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T

• Add terminating character \$ which is lexically smallest

• Example

• T = m i s s i s s i p p i

• i 1 2 3 4 5 6 7 8 9 0 1

• Pos(i) 5 4 119 3 108 2 7 6 1

• Input

• Text T of length m

• Output

• Pos array

• Solution

• Compute suffix tree of T

• Do a lexical depth-first traversal of T labeling Pos(i) with leafs in order of encountering them

• Edge (v,u) is lexically smaller than edge (v,w) iff first character of (v,u) is lexically smaller than first character of (v,w)

• Input

• Text T of length m

• Pattern P of length n

• Output

• All occurrences of P in T

• Solution

• Compute suffix array Pos for T

• If P is in T, then all these locations will be grouped consecutively in Pos

• O(n log m) solution to matching problem

• Using binary search, find smallest index i’ such that P exactly matches the n characters of suffix Pos(i’)

• Similarly, find largest index i such that P exactly matches the n characters of suffix Pos(i)

• Let L and R denote current left and right boundaries of current search interval

• Initialization: L= 1, R = m

• Let l and r denote length of longest prefix of Pos(L) and Pos(R) that match a prefix of P, respectively

• Define M = ceiling((L+R)/2)

• Define mlr = min(l,r)

• Can begin comparison of Pos(M) at position mlr+1

• In practice, this is sufficient to achieve O(n + log m) search time, but worst case is W(n log m)

• Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes beginning at Pos(i) and Pos(j).

• Mississippi Example

• Pos(3) = 5 (issippi)

• Pos(4) = 2 (ississippi)

• Lcp(3,4) = 4

• L, R, M, l, r defined as before

• If l=r, compare P against Pos(m) starting at position l+1 = r+1

• Suppose l > r

• If Lcp(L,M) > l, the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and Pos(L)

• Therefore, P agrees with suffix Pos(M) up through position l but disagrees in position l+1

• Furthermore, Pos(M) suffix is lexically smaller than P

• Update: L = M, l and r unchanged

• Suppose l > r

• If Lcp(L,M) < l, the common prefix of suffix Pos(L) and suffix Pos(M) is shorter than the common prefix of P and Pos(L)

• Therefore, P agrees with suffix Pos(M) up through position Lcp(L,M).

• The Lcp(L,M)+1 characters of P and L are lexically smaller than the corresponding character of Pos(M)

• Update: R = M, r = Lcp(L,M)

• Suppose l > r

• If Lcp(L,M) = l, the common prefix of suffix Pos(L) and suffix Pos(M) is equal to the common prefix of P and Pos(L)

• Therefore, P agrees with suffix Pos(M) up through position l and maybe even further

• Need to compare P(l+1) to corresponding position in Pos(M)

• Update: Will update R or L according to final determination of comparisons

• Since we begin at max(l,r), we never compare a matched position in P more than once

• Redundant comparisons of P are eliminated to at most once per binary search phase giving us O(n + log m)

• We want to get them in O(m) time

• However, there are potentially O(m2) different possible pairs of Lcp values

• Crucial point

• Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure

• See Figure 7.7 for an example

• Lcp(i,i+1): string depth of lowest common ancestor encountered during lexical depth-first traversal of suffix tree from Pos(i) leaf to Pos(i+1) leaf

• Other Lcp values

• Lcp(i,j): mink in 1 to j-1 Lcp(k,k+1)

• Take min of Lcp values of children in the binary tree of needed Lcp values (not the suffix tree)

• 1-time input

• Tree T (not necessarily a suffix tree)

• Later input

• 2 nodes, v and w, of T

• Output

• lowest common ancestor of v,w in T

• Goal

• linear preprocess time

• O(1) query time

• 1-time input

• Strings S1 and S2

• Later input

• index positions i and j

• Output

• length of longest substring of S1 beginning at i that matches substring of S2 beginning at j

• Goal

• linear preprocess time

• O(1) query time

S1

i

• Relationship to longest common substring

• Similar, but now start positions are fixed

S2

j

• Linear Preprocessing

• Create general suffix tree for S1 and S2

• Compute string depth at each node

• Process tree to allow for constant time LCA queries

• Establish pointers to all leaf nodes in tree

• Constant time query processing

• Find u = lca(v,w)

• Output string depth of u

• Linear Preprocessing (Assume |S2| < |S1|)

• Create general suffix tree for S2

• Compute matching statistic ms(i) and p(i) for S1

• length of longest match of substring starting at i in S1 with some substring in S2

• p(i) is the starting point of a location in S2 that matches

• Process tree to allow for constant time LCA queries

• Establish pointers to all leaf nodes in tree

• Constant time query processing

• Find u = lca(p(v), w) in suffix tree for S2

• Output min(ms(v), string depth of u)

• why is this correct?

• Maximal Palindromes

• Input

• String S

• Output

• Location of all maximal palindromes in S

• Solution

• Longest common extensions of specific pairs of positions in S and Sr

• O(S) solution

• Longest common substrings of >2 strings:

• Input

• Strings S1, …, SK (total length n)

• Output

• l(j) (and pointers to substrings) for 2 <= j <= K

• Problem with previous solution

• O(kn) time to compute C(v) values

• C(v): number of distinct leaf labels in subtree rooted at node v

• S(v): total number of leaves in v’s subtree

• U(v): number of “duplicate suffixes” from same string that occur in v’s subtree

• C(v) = S(v) - U(v)

• ni(v) = number of leaves with identifier i in the subtree rooted at node v

• ni = total number of leaves with identifier i

• Definitions

• S(v): total number of leaves in v’s subtree

• U(v): number of “duplicate suffixes” from same string that occur in v’s subtree

• ni(v) = number of leaves with identifier i in the subtree rooted at node v

• ni = total number of leaves with identifier i

• Observations

• U(v) = S max((ni(v) - 1), 0)

• C(v) = S(v) - U(v)

• Computing U(v) values

• DF traversal of tree numbering leaves in order that they are encountered

• For each string label i

• Let Li be the list of leaves with identifier i, in increasing order of their dfs numbers

• Compute lca of consecutive pair of leaves in Li for all pairs of consecutive leaves in Li

• For each node v, let h(v) denote the number of times it is the lca computed from step above

• Key property

• ni(v) = Si h(w) where w is in v’s subtree

• Computing U(v) values

• DF traversal of tree numbering leaves in order that they are encountered

• Set h(v) to 0 for all nodes v

• For each string label i

• Compute lca v of consecutive pair of leaves in Li for all pairs of consecutive leaves in Li

• Increment h(v) by 1

• Propagate h(v) values up the tree by addition to set U(v)

• Set C(v) = S(v) - U(v)