## PowerPoint Slideshow about 'Applications' - Albert_Lan


Applications

- Exact string and substring matching
- Longest common substrings
- Finding and representing repeated substrings efficiently
- Applications that lead to alternative, space efficient implementations
- Matching statistics
- Suffix Arrays

String and substrings

- Exact String matching:
- Input
- Pattern P of length n
- Text T of length m

- Output
- Position of all occurrences of P in T

- Solution method
- Preprocess to create suffix tree for T
- O(m) time, O(m) space

- Maximally match P in suffix tree
- Output all leaf positions below match point
- O(n+k) time where k is number of matches
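The matching step can be sketched in a few lines. This is only an illustration: it builds an uncompressed suffix trie in O(m^2) time and space rather than the linear-time suffix tree the slide assumes, and it stores the start positions at every node instead of collecting leaves below the match point. The names `build_suffix_trie` and `find_occurrences` are hypothetical, and the pattern is assumed to be non-empty.

```python
def build_suffix_trie(text):
    """Build an uncompressed suffix trie (assumption: a stand-in for a suffix tree).

    Each node records the 1-based start positions of all suffixes that pass
    through it, so a match point directly yields every occurrence.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find_occurrences(trie, pattern):
    """Walk the pattern down from the root; fail as soon as a character is missing."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return sorted(node["starts"])


trie = build_suffix_trie("mississippi")
print(find_occurrences(trie, "issi"))  # [2, 5]
```

Once the index exists, each query costs time proportional to the pattern length plus the number of reported positions, which is the O(n+k) behaviour described above.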


String and substrings

- Exact set matching:
- Input
- Set of patterns {Pi} of total length n
- Text T of length m

- Output
- Position of all occurrences of each pattern Pi in T

- Solution method
- Preprocess to create suffix tree for T
- O(m) time, O(m) space

- Maximally match each Pi in suffix tree
- Output all leaf positions below match point
- O(n+k) time where k is number of total matches


Comparison with Aho-Corasick

- Aho-Corasick
- O(n) preprocess time and space
- to build keyword tree of set of patterns P

- O(m+k) search time

- Suffix Tree Approach
- O(m) preprocess time and space
- to build suffix tree of T

- O(n+k) search time
- Using matching statistics (defined later), the suffix tree approach can achieve a time/space tradeoff similar to that of Aho-Corasick


String and substrings

- Substring problem:
- Input
- Set of patterns {Pi} of total length n
- Text T of length m (m < n now)

- Output
- Position of all occurrences of T in each pattern Pi

- Solution method
- Preprocess to create generalized suffix tree for {Pi}
- O(n) time, O(n) space

- Maximally match T in generalized suffix tree
- Output all leaf positions below match point
- O(m+k) time where k is number of total matches


Common Substrings

- Longest Common Substring problem:
- Input
- Strings S and T

- Output
- longest common substring of S and T (and position in S and T)

- Solution method
- Preprocess to create generalized suffix tree for {S,T}
- Mark each internal node by whether its subtree contains a leaf for a suffix of S, of T, or of both
- A simple post-order tree traversal does this

- Among the nodes marked with both S and T, the path label of the node with greatest string depth is the longest common substring of S and T

Common Substrings

- Common substrings of length k problem:
- Input
- Strings S and T
- Integer k

- Output
- all common substrings of S and T (and their positions in S and T) of length at least k

- Solution method
- Same as previous problem
- Look for all nodes marked with both S and T whose string depth is at least k

Longest Common Substrings of >2 Strings

- Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings
- Example: {sanddollar, sandlot, handler, grand, pantry}

| j | l(j) | one string |
|---|------|------------|
| 2 | 4 | sand |
| 3 | 3 | and |
| 4 | 3 | and |
| 5 | 2 | an |
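The l(j) values in this example can be checked by brute force. The sketch below enumerates every substring of every string, so it is only suitable for small inputs and is not the generalized suffix tree method of the following slides. The function name `longest_common_lengths` is hypothetical.

```python
def longest_common_lengths(strings):
    """Return {j: (l(j), one witness substring)} for 2 <= j <= K by brute force."""
    K = len(strings)
    # Map each distinct substring to the set of string indices that contain it.
    owners = {}
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                owners.setdefault(s[i:j], set()).add(idx)
    best = {j: (0, "") for j in range(2, K + 1)}
    for sub, who in owners.items():
        # A substring shared by len(who) strings counts for every j up to len(who).
        for j in range(2, len(who) + 1):
            if len(sub) > best[j][0]:
                best[j] = (len(sub), sub)
    return best


words = ["sanddollar", "sandlot", "handler", "grand", "pantry"]
for j, (length, witness) in sorted(longest_common_lengths(words).items()):
    print(j, length, witness)
```

The witness for a given j need not be unique: for j = 2, both "sand" and "andl" have length 4 and occur in two of the strings.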

Problem definition and solution

- Longest common substrings of >2 strings:
- Input
- Strings S1, …, SK (total length n)

- Output
- l(j) (and pointers to substrings) for 2 <= j <= K

- Solution
- Build a generalized suffix tree for the K strings
- each string has a unique end character, so each leaf shows up only once


Solution continued

- Build a generalized suffix tree for the K strings
- each string has a unique end character, so each leaf shows up only once

- C(v): number of distinct leaf labels in subtree rooted at node v
- Given C(v) values and string-depth values, do a simple traversal of tree to find these K-1 values and pointers to locations in substrings
- Computing C(v) efficiently
- # of leaves is not correct as some leaves may have same label
- length K bit vector, 1 bit per string in set
- OR your way up the tree
- Each OR op takes O(K) time, which gives O(Kn) running time

- This can be improved to O(n), as shown later

Repeated substrings

- Given a single string S
- Definitions
- maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different than the character to the immediate left (right) of b.
- Add unique characters to front and end of S to include prefixes and suffixes.

- Representation: (p1, p2, n’)
- starting positions and length of the maximal pair

- R(S) is the set of all triples representing maximal pairs in S


Example

- S = xabcyiiizabcqabcyrxar
- 123456789012345678901
- (2, 10, 3) is a maximal pair
- (10, 14, 3) is a maximal pair
- (2, 14, 3) is not a maximal pair
- (2, 14, 4) is a maximal pair

- Note positions 2 and 14 are the start positions of two distinct maximal pairs
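The four triples above can be checked mechanically from the definition. This is a minimal sketch, assuming the two sentinel characters `\x00` and `\x01` do not occur in S; `is_maximal_pair` is a hypothetical helper name.

```python
def is_maximal_pair(s, p1, p2, length):
    """Check whether (p1, p2, length) is a maximal pair of s (1-based positions)."""
    # Assumed sentinels: unique characters added to the front and end of s.
    padded = "\x00" + s + "\x01"
    # With one character of padding, 1-based position p of s sits at padded[p].
    a = padded[p1:p1 + length]
    b = padded[p2:p2 + length]
    if p1 == p2 or a != b:
        return False
    left_differs = padded[p1 - 1] != padded[p2 - 1]
    right_differs = padded[p1 + length] != padded[p2 + length]
    return left_differs and right_differs


S = "xabcyiiizabcqabcyrxar"
print(is_maximal_pair(S, 2, 10, 3))  # True
print(is_maximal_pair(S, 2, 14, 3))  # False: both occurrences are followed by y
print(is_maximal_pair(S, 2, 14, 4))  # True
```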

More definitions

- A maximal repeat a is a substring in S that is the substring defined by a maximal pair of S
- R’(S) is the set of maximal repeats
- Previous example
- abc and abcy are maximal repeats of S
- abc is represented only once
- |R’(S)| is smaller than |R(S)|: abc appears in two triples of R(S), (2, 10, 3) and (10, 14, 3), but only once in R’(S)

Even more definitions

- A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S
- Previous example
- abcy is a supermaximal repeat of S
- abc is NOT a supermaximal repeat of S
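Both definitions can be applied directly to the running example. The sketch below is a brute-force check, not the suffix tree method developed in the following slides; it assumes the sentinels `\x00` and `\x01` are absent from S, and the function names are hypothetical.

```python
from itertools import combinations


def maximal_repeats(s):
    """Return the set of substrings of s that are defined by some maximal pair."""
    padded = "\x00" + s + "\x01"  # assumed sentinels, absent from s
    occurrences = {}
    for i in range(1, len(s) + 1):
        for j in range(i + 1, len(s) + 2):
            occurrences.setdefault(padded[i:j], []).append(i)
    repeats = set()
    for sub, starts in occurrences.items():
        n = len(sub)
        for p1, p2 in combinations(starts, 2):
            # Maximal pair: left characters differ and right characters differ.
            if padded[p1 - 1] != padded[p2 - 1] and padded[p1 + n] != padded[p2 + n]:
                repeats.add(sub)
                break
    return repeats


def supermaximal_repeats(s):
    """Keep only the maximal repeats that are not substrings of another one."""
    reps = maximal_repeats(s)
    return {a for a in reps if not any(a != b and a in b for b in reps)}


S = "xabcyiiizabcqabcyrxar"
print(sorted(maximal_repeats(S)))
print(sorted(supermaximal_repeats(S)))
```

On this string the output includes abc and abcy among the maximal repeats, and abcy but not abc among the supermaximal repeats, in line with the slide.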

Problem definition

- Maximal repeats
- Input
- String S (length n)

- Output
- R’(S)


Properties of maximal repeats

- Construct suffix tree for S
- Lemma
- If a is a maximal repeat in S, then a is the path-label of an internal node v in T, the suffix tree for S
- a does not end in the middle of an edge
- (this captures that the characters following the occurrences of a are distinct)

- Corollary
- There are at most n maximal repeats
- n leaves
- all internal nodes except the root have at least two children
- therefore, at most n internal nodes


More properties of maximal repeats

- Definitions
- Character S(i-1) is the left character of i
- The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
- A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters

- Theorem
- String a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
- This captures that the characters before the occurrences of a are different


Identifying left diverse nodes

- Bottom up procedure
- All nodes will have a left character label
- Leaf node:
- Label leaves with their left character

- Internal node v:
- If any child is left diverse, so is v
- If two children have different left character labels, v is left diverse
- Otherwise, take on left character value of children

- Compact representation
- There is a compact tree T that consists only of left diverse nodes that represents all maximal repeats

Problem definition

- Supermaximal repeats
- Input
- String S (length n)

- Output
- The set of supermaximal repeats of S

- Key property
- A left diverse node v represents a supermaximal repeat if and only if all of v’s children are leaves, and each has a distinct left character
- Prove this

Matching Statistics

- Setting
- Text T of length m
- Pattern P of length n

- Definition
- For 1 <= i <=m, matching statistic ms(i) is the length of the longest substring beginning at T(i) that matches a substring somewhere in P

- With matching statistics, one can solve several problems with less space than a suffix tree
- Exact matching example: P occurs at i in T if and only if ms(i) = |P|
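The definition and the exact-matching observation can be illustrated with a naive computation. This sketch takes quadratic time or worse and uses Python's substring test instead of a suffix tree of P, so it shows what ms(i) means rather than how to compute it efficiently. The function names are hypothetical, P is assumed non-empty, and the returned list is 0-based (entry i holds ms(i+1)).

```python
def matching_statistics(text, pattern):
    """ms[i] = length of the longest prefix of text[i:] that occurs somewhere in pattern."""
    ms = []
    for i in range(len(text)):
        length = 0
        # Substrings of a substring also occur in pattern, so extend greedily.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


def occurrences_via_ms(text, pattern):
    """P occurs at position i of T exactly when ms(i) = |P| (1-based positions)."""
    ms = matching_statistics(text, pattern)
    return [i + 1 for i, value in enumerate(ms) if value == len(pattern)]


print(matching_statistics("abcxabcd", "abcd"))   # [3, 2, 1, 0, 4, 3, 2, 1]
print(occurrences_via_ms("mississippi", "issi"))  # [2, 5]
```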

Why study matching statistics

- With matching statistics, one can solve several problems with less space than a suffix tree
- Exact matching example
- We’ll show an O(n) preprocessing time and O(m) search time solution matching the traditional methods
- Key observation: P matches substring beginning at i in T if and only if ms(i) = |P|


Construction Problem

- Input
- Text T of length m
- Pattern P

- Output
- Compute ms(i) for 1 <=i <= m

Solution

- Compute suffix tree of P retaining suffix links
- ms(1): match T against tree
- ms(i+1) given ms(i)
- we are at some node v in the tree
- If it is internal, follow suffix link to s(v)
- Else if it is a leaf, go up one level to parent w
- If w is an internal node, follow suffix link to s(w)
- Traverse downwards using skip/count trick until we have matched all the characters in edge label (w,v)

- Now match against T character by character till we have a mismatch and can output ms(i+1)


Adding location of substring in P

- p(i): a location in P such that the substring at p(i) matches substring starting at T(i) for exactly ms(i) positions
- Before computing ms(i) values, mark each node in the suffix tree of P with the leaf number of one of the leaves in its subtree
- Simply output this value when outputting ms(i) values

Applying matching statistics to LCS problem

- Input
- strings S and T

- Output
- longest common substring of S and T

- Solution method
- Compute suffix tree for the shorter string, say S
- Compute ms(i) values for T
- Maximal ms(i) value identifies LCS
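A minimal sketch of this reduction follows. It indexes the shorter string and scans the longer one, but it computes each ms(i) naively with Python's substring test instead of a suffix tree with suffix links, so it does not achieve the linear bound. The function name is hypothetical.

```python
def longest_common_substring(s, t):
    """Return one longest common substring via the maximum matching statistic."""
    if len(t) < len(s):
        s, t = t, s  # s is the shorter string, the one that would be indexed
    best_len, best_start = 0, 0
    for i in range(len(t)):
        length = 0
        # ms(i): longest prefix of t[i:] that occurs somewhere in s
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        if length > best_len:
            best_len, best_start = length, i
    return t[best_start:best_start + best_len]


print(longest_common_substring("sanddollar", "sandlot"))  # sand
```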

Suffix Arrays

- Setting
- Text T of length m

- Definition
- A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
- Add terminating character $ which is lexically smallest

- Example
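- T = mississippi

| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|----|----|
| T(i) | m | i | s | s | i | s | s | i | p | p | i |
| Pos(i) | 11 | 8 | 5 | 2 | 1 | 10 | 9 | 7 | 4 | 6 | 3 |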
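For a small text the array can be produced by sorting the suffixes directly. This is a sketch for checking examples, not the suffix tree traversal described next; `suffix_array` is a hypothetical name, and the worst-case cost is well above linear. Python compares a proper prefix as smaller than its extension, which has the same effect as appending a lexically smallest terminator $.

```python
def suffix_array(text):
    """Return Pos: 1-based suffix start positions in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


pos = suffix_array("mississippi")
print(pos)  # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```

Here Pos(3) = 5 (issippi) and Pos(4) = 2 (ississippi), the two entries used in the Lcp example later in the deck.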

Computing Suffix Arrays

- Input
- Text T of length m

- Output
- Pos array

- Solution
- Compute suffix tree of T
- Do a lexical depth-first traversal of the suffix tree, filling in Pos with the leaf numbers in the order they are encountered
- Edge (v,u) is lexically smaller than edge (v,w) iff first character of (v,u) is lexically smaller than first character of (v,w)

Using Suffix Arrays

- Input
- Text T of length m
- Pattern P of length n

- Output
- All occurrences of P in T

- Solution
- Compute suffix array Pos for T

Properties of Suffix Arrays

- If P is in T, then all these locations will be grouped consecutively in Pos
- O(n log m) solution to matching problem
- Using binary search, find smallest index i’ such that P exactly matches the n characters of suffix Pos(i’)
- Similarly, find largest index i such that P exactly matches the n characters of suffix Pos(i)
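The two binary searches can be sketched as follows, comparing P against the first n characters of each suffix. The suffix array is built by plain sorting here purely for self-containment, and the names `suffix_array` and `suffix_array_search` are hypothetical.

```python
def suffix_array(text):
    """Pos as 1-based start positions, built by direct sorting (illustration only)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def suffix_array_search(text, pos, pattern):
    """Return all 1-based positions where pattern occurs in text."""
    n = len(pattern)

    def prefix(k):
        start = pos[k] - 1
        return text[start:start + n]  # first n characters of suffix Pos(k+1)

    lo, hi = 0, len(pos)
    while lo < hi:  # leftmost index whose n-character prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:  # leftmost index whose n-character prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])


text = "mississippi"
print(suffix_array_search(text, suffix_array(text), "issi"))  # [2, 5]
```

Each comparison costs O(n) and there are O(log m) of them per search, which gives the O(n log m) bound stated above.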

Speeding up binary search

- Let L and R denote current left and right boundaries of current search interval
- Initialization: L= 1, R = m

- Let l and r denote length of longest prefix of Pos(L) and Pos(R) that match a prefix of P, respectively
- Define M = ceiling((L+R)/2)
- Define mlr = min(l,r)
- Can begin comparison of Pos(M) at position mlr+1

- In practice, this is sufficient to achieve O(n + log m) search time, but worst case is Ω(n log m)
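The mlr idea can be sketched as a lower-bound search that resumes each comparison at position min(l, r) + 1. A few details here are assumptions rather than the slide's exact formulation: the boundaries start as virtual sentinels just outside the array (instead of L = 1, R = m), the midpoint is rounded down, and indices into `pos` are 0-based. The function names are hypothetical.

```python
def suffix_array(text):
    """Pos as 1-based start positions, built by direct sorting (illustration only)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def match_from(text, start, pattern, k):
    """Extend a match of pattern against text[start:], given k characters already agree."""
    while k < len(pattern) and start + k < len(text) and text[start + k] == pattern[k]:
        k += 1
    return k


def lower_bound_mlr(text, pos, pattern):
    """0-based index in pos of the first suffix that is >= pattern."""
    n = len(pattern)
    lo, hi = -1, len(pos)  # virtual boundaries: suffix[lo] < pattern <= suffix[hi]
    l = r = 0              # characters of pattern known to match at lo and at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        start = pos[mid] - 1
        # Every suffix between lo and hi shares min(l, r) characters with pattern.
        k = match_from(text, start, pattern, min(l, r))
        if k < n and (start + k == len(text) or text[start + k] < pattern[k]):
            lo, l = mid, k  # suffix is lexically smaller than pattern
        else:
            hi, r = mid, k  # suffix is >= pattern, or has pattern as a prefix
    return hi


text = "mississippi"
pos = suffix_array(text)
print(lower_bound_mlr(text, pos, "issi"))  # 2, i.e. Pos(3) = 5 in 1-based terms
```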

Longest common prefixes

- Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes beginning at Pos(i) and Pos(j).
- Mississippi Example
- Pos(3) = 5 (issippi)
- Pos(4) = 2 (ississippi)
- Lcp(3,4) = 4

Getting to max(l,r) with Lcp’s

- L, R, M, l, r defined as before
- If l=r, compare P against Pos(M) starting at position l+1 = r+1
- Suppose l > r
- If Lcp(L,M) > l, the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and Pos(L)
- Therefore, P agrees with suffix Pos(M) up through position l but disagrees in position l+1
- Furthermore, Pos(M) suffix is lexically smaller than P
- Update: L = M, l and r unchanged

Getting to max(l,r) with Lcp’s

- Suppose l > r
- If Lcp(L,M) < l, the common prefix of suffix Pos(L) and suffix Pos(M) is shorter than the common prefix of P and Pos(L)
- Therefore, P agrees with suffix Pos(M) up through position Lcp(L,M).
- Character Lcp(L,M)+1 of P, which equals that of suffix Pos(L), is lexically smaller than the corresponding character of suffix Pos(M)
- Update: R = M, r = Lcp(L,M)

Getting to max(l,r) with Lcp’s

- Suppose l > r
- If Lcp(L,M) = l, the common prefix of suffix Pos(L) and suffix Pos(M) is equal to the common prefix of P and Pos(L)
- Therefore, P agrees with suffix Pos(M) up through position l and maybe even further
- Need to compare P(l+1) to corresponding position in Pos(M)
- Update: Will update R or L according to final determination of comparisons

O(n + log m) bound

- Since we begin at max(l,r), we never compare a matched position in P more than once
- Redundant character comparisons are limited to at most one per binary search iteration, giving O(n + log m)

Computing Lcp values quickly

- We want to get them in O(m) time
- However, there are potentially O(m^2) different possible pairs of Lcp values
- Crucial point
- Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure
- See Figure 7.7 for an example

Process for needed Lcp values

- Lcp(i,i+1): string depth of lowest common ancestor encountered during lexical depth-first traversal of suffix tree from Pos(i) leaf to Pos(i+1) leaf
- Other Lcp values
- $\mathrm{Lcp}(i,j) = \min_{i \le k \le j-1} \mathrm{Lcp}(k,k+1)$
- Take min of Lcp values of children in the binary tree of needed Lcp values (not the suffix tree)
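The relation between adjacent and general Lcp values can be checked on the mississippi example. In this sketch the Lcp values are computed by direct character comparison rather than from a suffix tree traversal, indices i and j are 1-based as in the slides, and all function names are hypothetical.

```python
def suffix_array(text):
    """Pos as 1-based start positions, built by direct sorting (illustration only)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def lcp(text, pos, i, j):
    """Lcp(i, j): common prefix length of suffixes Pos(i) and Pos(j), 1-based i and j."""
    a, b = text[pos[i - 1] - 1:], text[pos[j - 1] - 1:]
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


def lcp_via_adjacent(adjacent, i, j):
    """Lcp(i, j) for i < j as the minimum of Lcp(k, k+1) over i <= k <= j-1."""
    return min(adjacent[k] for k in range(i, j))


text = "mississippi"
pos = suffix_array(text)
adjacent = {k: lcp(text, pos, k, k + 1) for k in range(1, len(pos))}
print(adjacent[3])                        # 4, the Lcp(3,4) value from the example
print(lcp_via_adjacent(adjacent, 3, 5))   # 0: issippi and mississippi share nothing
```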

Lowest common ancestor

- 1-time input
- Tree T (not necessarily a suffix tree)

- Later input
- 2 nodes, v and w, of T

- Output
- lowest common ancestor of v,w in T

- Goal
- linear preprocess time
- O(1) query time

Longest Common Extension

- 1-time input
- Strings S1 and S2

- Later input
- index positions i and j

- Output
- length of longest substring of S1 beginning at i that matches substring of S2 beginning at j

- Goal
- linear preprocess time
- O(1) query time

Illustration

[Figure: S1 with start position i marked; S2 with start position j marked]

- Relationship to longest common substring
- Similar, but now start positions are fixed

Solution

- Linear Preprocessing
- Create generalized suffix tree for S1 and S2
- Compute string depth at each node
- Process tree to allow for constant time LCA queries
- Establish pointers to all leaf nodes in tree

- Constant time query processing
- Find u = lca(v,w), where v and w are the leaves for suffix i of S1 and suffix j of S2
- Output string depth of u

More space-efficient solution

- Linear Preprocessing (Assume |S2| < |S1|)
- Create suffix tree for S2
- Compute matching statistic ms(i) and p(i) for S1
- length of longest match of substring starting at i in S1 with some substring in S2
- p(i) is the starting point of a location in S2 that matches

- Process tree to allow for constant time LCA queries
- Establish pointers to all leaf nodes in tree

- Constant time query processing
- Find u = lca of the leaves for suffixes p(i) and j in the suffix tree for S2
- Output min(ms(i), string depth of u)
- why is this correct?

Related Problem

- Maximal Palindromes
- Input
- String S

- Output
- Location of all maximal palindromes in S

- Solution
- Longest common extensions of specific pairs of positions in S and Sr, the reverse of S
- O(|S|) solution
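The reduction from palindromes to longest common extensions can be sketched as follows. The `lce` helper here compares characters one by one, so the whole routine is quadratic in the worst case; the linear bound needs the constant-time queries from the preceding slides. Positions are 0-based and the function names are hypothetical.

```python
def lce(s1, i, s2, j):
    """Naive longest common extension of s1 from i and s2 from j (0-based)."""
    k = 0
    while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
        k += 1
    return k


def maximal_palindromes(s):
    """Return (start, length) of the maximal palindrome around every center."""
    n = len(s)
    sr = s[::-1]  # sr[n - q + t] is s[q - 1 - t], the characters to the left of q
    result = []
    for q in range(n):
        # Odd length, centered on s[q]: compare s[q+1:] with the reversed left side.
        radius = lce(s, q + 1, sr, n - q)
        result.append((q - radius, 2 * radius + 1))
        # Even length, centered between s[q-1] and s[q].
        if q > 0:
            radius = lce(s, q, sr, n - q)
            if radius > 0:
                result.append((q - radius, 2 * radius))
    return result


print(max(maximal_palindromes("abacaba"), key=lambda item: item[1]))  # (0, 7)
```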

Common substrings revisited

- Longest common substrings of >2 strings:
- Input
- Strings S1, …, SK (total length n)

- Output
- l(j) (and pointers to substrings) for 2 <= j <= K

- Input
- Problem with previous solution
- O(Kn) time to compute C(v) values
- C(v): number of distinct leaf labels in subtree rooted at node v

Definitions

- S(v): total number of leaves in v’s subtree
- U(v): number of “duplicate suffixes” from same string that occur in v’s subtree
- C(v) = S(v) - U(v)
- ni(v) = number of leaves with identifier i in the subtree rooted at node v
- ni = total number of leaves with identifier i

Key Concepts

- Definitions
- S(v): total number of leaves in v’s subtree
- U(v): number of “duplicate suffixes” from same string that occur in v’s subtree
- ni(v) = number of leaves with identifier i in the subtree rooted at node v
- ni = total number of leaves with identifier i

- Observations
- $U(v) = \sum_{i} \max(n_i(v) - 1,\ 0)$
- C(v) = S(v) - U(v)

Solution

- Computing U(v) values
- Depth-first traversal of tree, numbering leaves in the order they are encountered
- For each string label i
- Let Li be the list of leaves with identifier i, in increasing order of their dfs numbers
- Compute lca of consecutive pair of leaves in Li for all pairs of consecutive leaves in Li
- For each node v, let h(v) denote the number of times it is the lca computed from step above

- Key property
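- $U(v) = \sum_{w} h(w)$, summed over all nodes w in v’s subtree, since for each identifier i exactly max(ni(v) - 1, 0) of the computed lcas lie in v’s subtree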

Solution

- Computing U(v) values
- Depth-first traversal of tree, numbering leaves in the order they are encountered
- Set h(v) to 0 for all nodes v
- For each string label i
- Compute lca v of consecutive pair of leaves in Li for all pairs of consecutive leaves in Li
- Increment h(v) by 1

- Propagate h(v) values up the tree by addition to set U(v)
- Set C(v) = S(v) - U(v)
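The steps above can be sketched on a small explicit tree. This illustration makes several assumptions: the tree is given as a plain dictionary of child lists rather than a generalized suffix tree, the lca is found by climbing parent pointers instead of a constant-time structure, and all names (`distinct_leaf_counts`, `children`, `leaf_id`) are hypothetical.

```python
def distinct_leaf_counts(children, leaf_id, root):
    """Return C(v), the number of distinct leaf identifiers below each node v."""
    parent, depth, order = {root: None}, {root: 0}, []
    stack = [root]
    while stack:  # iterative preorder depth-first traversal
        v = stack.pop()
        order.append(v)
        for c in reversed(children.get(v, [])):
            parent[c], depth[c] = v, depth[v] + 1
            stack.append(c)

    def lca(a, b):  # naive lca by climbing; a real solution uses O(1) queries
        while depth[a] > depth[b]:
            a = parent[a]
        while depth[b] > depth[a]:
            b = parent[b]
        while a != b:
            a, b = parent[a], parent[b]
        return a

    h = {v: 0 for v in order}
    last_leaf = {}  # most recent leaf seen for each identifier, in DFS order
    for v in order:
        if v in leaf_id:
            i = leaf_id[v]
            if i in last_leaf:
                h[lca(last_leaf[i], v)] += 1  # consecutive leaves of list Li
            last_leaf[i] = v

    S = {v: (1 if v in leaf_id else 0) for v in order}
    U = dict(h)
    for v in reversed(order):  # reversed preorder visits descendants first
        if parent[v] is not None:
            S[parent[v]] += S[v]
            U[parent[v]] += U[v]
    return {v: S[v] - U[v] for v in order}


children = {"root": ["a", "b"], "a": ["l1", "l2", "l3"], "b": ["l4", "l5"]}
leaf_id = {"l1": 1, "l2": 1, "l3": 2, "l4": 2, "l5": 3}
counts = distinct_leaf_counts(children, leaf_id, "root")
print(counts["a"], counts["b"], counts["root"])  # 2 2 3
```

In the example, node a has three leaves but only two distinct identifiers, so S(a) = 3, U(a) = 1 and C(a) = 2.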
