
Dynamic Text and Static Pattern Matching

Amihood Amir, Gad M. Landau, Moshe Lewenstein, Dina Sokol. Bar-Ilan University.

Presentation Transcript


  1. Dynamic Text and Static Pattern Matching. Amihood Amir, Gad M. Landau, Moshe Lewenstein, Dina Sokol. Bar-Ilan University.

  2. Classical Pattern Matching • Input: pattern P = p1 p2 … pm and text T = t1 t2 t3 … tn over alphabet Σ. • m is the pattern size. • n is the text size. • Output: all locations of T where P appears.

  3. Pattern Matching (e.g.) • Input: T = aaagcattagctagcagcat, P = agca, Σ = {a,g,c,t}

  4. Pattern Matching (e.g.) • Input: T = aaagcattagctagcagcat, P = agca, Σ = {a,g,c,t} • Output: 3, 13, 16, … (starting locations in T, counted from 1)

  5. “Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern. • B. Dynamic Text and Dynamic Pattern. • C. Dynamic Text and Static Pattern.

  6. “Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern, a.k.a. the indexing problem. Solution: preprocess the text, then answer pattern queries. Data structure: suffix trees [Wei73,McC75,Ukk95,Far97]. Time: O(n) preprocessing, O(m) per query.

  7. “Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query time. • B. Dynamic Text and Dynamic Pattern, a.k.a. the dynamic indexing problem. Solution: sophisticated data structures [SV96,ABR00]. Time: query - O(m + log2n), change - O(log2n).

  8. “Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query time. • B. Dynamic Text and Dynamic Pattern. Time: query - O(m + log2n), change - O(log2n). • C. Dynamic Text and Static Pattern?

  9. Dynamic Text and Static Pattern Matching • Pattern is non-changing • Text changes over time • Goal: report new occurrences of the pattern without performing a new search.

  10. Motivation • 1. Intrusion detection systems • 2. Info alerts • 3. Two-dimensional run-length compressed matching problem [ALS03], e.g. a FAX image whose rows are run-length encoded as a14, a4b2c3d5, c8a6.

  11. Problem Definition • Input: T and P over Σ = {1, …, m}. • Output: 1. At the start: all occurrences of P in T. 2. After each change operation: a. report all new occurrences of P in T; b. discard all old occurrences of P in T that no longer match. • Change operation: replace one character of the text, e.g. change location 5 from a to b.

  12. Example • Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}, T = g a g a g c t a g c g a g c a t

  13. Example • Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}, T = g a g a g c t a g a g a g c a t (location 10 has changed from c to a)

  14. Example • Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}, T = g a g a g c t a g a g a g c a t (location 10 has changed from c to a) • Output: {8}, the new occurrence that starts at location 8.
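The example on slides 12 to 14 can be reproduced with a brute-force baseline. This is only a sketch for checking the expected output, not the algorithm of the talk: it re-tests every start position whose window contains the changed location, which costs far more than O(log log m) per replacement. The function names `occurrences` and `replace_and_report` are hypothetical.

```python
def occurrences(text, pattern):
    """All 1-indexed start locations of pattern in text (naive scan)."""
    m = len(pattern)
    return {i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern}


def replace_and_report(text, pattern, matches, pos, ch):
    """Replace the character at 1-indexed location pos by ch.

    Returns (new_text, new_matches, discarded, added). Only start locations
    whose length-m window contains pos are re-tested, since no other
    occurrence can appear or disappear.
    """
    m = len(pattern)
    new_text = text[:pos - 1] + ch + text[pos:]
    lo = max(1, pos - m + 1)
    hi = min(pos, len(text) - m + 1)
    discarded, added = set(), set()
    for s in range(lo, hi + 1):
        is_match = new_text[s - 1:s - 1 + m] == pattern
        if s in matches and not is_match:
            discarded.add(s)
        elif s not in matches and is_match:
            added.add(s)
    return new_text, (matches - discarded) | added, discarded, added


text, pattern = "gagagctagcgagcat", "agagagc"
matches = occurrences(text, pattern)
text, matches, discarded, added = replace_and_report(text, pattern, matches, 10, "a")
print(added)  # the new occurrence reported after changing location 10
```

The window bounds `lo` and `hi` encode the fact that an occurrence is affected only if it starts fewer than m positions before the change.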

  15. Results After O(n log log m + ) preprocessing time, O(log log m) time per replacement.

  16. “Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query time. • B. Dynamic Text and Dynamic Pattern. Time: query - O(m + log2n), change - O(log2n). • C. Dynamic Text and Static Pattern. Time: change and announce - O(log log m).

  17. Static Stage • To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt '77]. • All pattern occurrences in a text of length 2m can be stored in O(1) space.
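For reference, a minimal sketch of the KMP search used in the static stage. It is the textbook algorithm rather than code from the talk; the names `failure_function` and `kmp_search` are illustrative, and the pattern is assumed to be non-empty.

```python
def failure_function(p):
    """fail[i] = length of the longest proper border of p[:i + 1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(t, p):
    """Return the 1-indexed start locations of p in t in O(n + m) time."""
    fail = failure_function(p)
    hits, k = [], 0
    for i, c in enumerate(t):
        while k > 0 and c != p[k]:
            k = fail[k - 1]
        if c == p[k]:
            k += 1
        if k == len(p):
            hits.append(i - len(p) + 2)  # convert 0-indexed end to 1-indexed start
            k = fail[k - 1]              # keep going to find overlapping matches
    return hits


print(kmp_search("aaagcattagctagcagcat", "agca"))
```

Run on the text and pattern of slides 3 and 4, this reports the start locations listed there.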

  18. Succinct Output • Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.) (Figure: T with marks at 1m, 2m, 3m, 4m, and P drawn beneath it.)
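One way to read the splitting step is sketched below, under the assumption that a block starts at every multiple of m; the slide does not spell out the exact boundaries. With that choice, every occurrence of P lies entirely inside the block in which it starts. The helper name `blocks` is hypothetical and positions are 0-indexed.

```python
def blocks(n, m):
    """Overlapping blocks (start, end_exclusive) of length at most 2m - 1.

    Assumption: one block starts at every multiple of m, so consecutive
    blocks overlap in m - 1 characters.
    """
    return [(s, min(s + 2 * m - 1, n)) for s in range(0, n, m)]


n, m = 20, 4
bs = blocks(n, m)
print(bs)

# Every possible occurrence start falls inside the block with index start // m.
for start in range(n - m + 1):
    lo, hi = bs[start // m]
    assert lo <= start and start + m <= hi
```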

  19. Succinct Output (cont.) • P is periodic: a string P is periodic if it matches itself before position |P|/2, e.g. P = abcabcabca matches itself when shifted by 3 positions. Store the output as a ‘chain’ of pattern occurrences. • P is non-periodic: by definition, no more than two occurrences in a text of size 2m.
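A small sketch of the periodic case. The smallest period is read off the KMP failure function, and occurrences spaced exactly one period apart are stored as a chain (first, count). The boundary convention in `is_periodic` (period at most |P|/2) and all function names are assumptions made for illustration.

```python
def smallest_period(p):
    """Smallest shift at which p matches itself, via the KMP failure function."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return len(p) - fail[-1]


def is_periodic(p):
    # Assumed convention: periodic means the smallest period is at most |p| / 2.
    return 2 * smallest_period(p) <= len(p)


def to_chains(starts, period):
    """Compress sorted start locations into runs (first, count) with step `period`."""
    chains = []
    for s in sorted(starts):
        if chains and s == chains[-1][0] + chains[-1][1] * period:
            chains[-1] = (chains[-1][0], chains[-1][1] + 1)
        else:
            chains.append((s, 1))
    return chains


p = "abcabcabca"
print(smallest_period(p), is_periodic(p))
print(to_chains([1, 4, 7, 10], smallest_period(p)))
```

Truncating a chain, as needed when old matches are deleted, then amounts to lowering its count.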

  20. On-line Algorithm Following each replacement: • Delete old matches that are no longer pattern occurrences. • Find new matches.

  21. Delete Old Matches Deleting is trivial since we store the matches in constant space: • P is periodic: truncate the chain of pattern occurrences. • P is non-periodic: discard all matches that are within distance m of the replacement.

  22. Find New Matches • Challenge: How can we locate occurrences of P, following each replacement, without actually searching for P?

  23. Main Idea - Text Covers We ‘cover’ the text with substrings of the pattern, i.e. store the text in terms of P. Pattern = a g a g a g c (positions 1 to 7). Text = g a g a g c t a g c g a g c a t. Pieces: gagagc, agc, gagc, a (the character t does not occur in P and stands alone). Cover: [2,7] [5,7] [4,7] [1,1]

  24. Text Cover (cont.) The text cover must satisfy two properties: • Substring Property: each element of the cover is a substring of P, or a character not included in P. • Maximality Property: no two adjacent elements can concatenate to form a substring of P.
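A naive way to build a cover with both properties is to scan left to right and always take the longest piece that is still a substring of P. Because each piece cannot be extended by even one character, no two adjacent pieces can concatenate into a substring of P. This is a quadratic-time sketch for illustration only; the talk's construction is not shown on the slides, and the name `text_cover` is hypothetical.

```python
def text_cover(text, pattern):
    """Greedy cover of text by maximal substrings of pattern.

    Pieces are (i, j) pairs, 1-indexed into pattern, pointing at the first
    place the piece occurs in pattern; a character that does not occur in
    pattern is kept as a single-character string.
    """
    cover, i = [], 0
    while i < len(text):
        if text[i] not in pattern:
            cover.append(text[i])
            i += 1
            continue
        length = 1
        # Extend the piece while it is still a substring of the pattern.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        start = pattern.find(text[i:i + length])
        cover.append((start + 1, start + length))
        i += length
    return cover


print(text_cover("gagagctagcgagcat", "agagagc"))
```

On the text of slide 23 this yields the pieces [2,7], [5,7], [4,7], [1,1] with the two t characters in between.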

  25. Text Cover (cont.) • Initially, in the static stage, we construct a text cover for T. • We ensure that the cover satisfies both the substring and maximality properties. • How does a replacement in the text affect the text cover?

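  26. Text Cover following replacement Pattern = a g a g a g c (positions 1 to 7). Text = g a g a g c t a g c g a g c a t, with location 10 changed from c to a. Cover before the replacement: (2,7) - (5,7) (4,7) (1,1) -, where - marks a character that is not in P. Cover right after the replacement: (2,7) - (5,6) (1,1) (4,7) (1,1) -. Restoring maximality: (5,6) (1,1) merge into (1,3), and (1,3) (4,7) merge into (1,7).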

  27. Updating the Text Cover At most 5 pieces can violate the maximality property.

  28. Substring Concatenation Query • Query: given two substrings of P, P[i,j] and P[k,l], is their concatenation also a substring of P? • Query time: O(log log m). • Preprocessing time: (also uses - [BG00]) • Hence, in O(log log m) time we can update the cover so that it satisfies both properties.
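To make the query concrete, here is a naive stand-in that answers it by direct string search. It takes time polynomial in m per query, so it only illustrates what the O(log log m) structure computes, not how. The name `concat_query` and the 1-indexed pair convention are assumptions.

```python
def concat_query(pattern, first, second):
    """Is P[i,j] followed by P[k,l] a substring of P?

    first = (i, j) and second = (k, l) are 1-indexed, inclusive ranges.
    Returns the merged piece as a 1-indexed range, or None if the
    concatenation does not occur in the pattern.
    """
    i, j = first
    k, l = second
    joined = pattern[i - 1:j] + pattern[k - 1:l]
    at = pattern.find(joined)
    return None if at < 0 else (at + 1, at + len(joined))


P = "agagagc"
print(concat_query(P, (5, 6), (1, 1)))  # "ag" + "a"
print(concat_query(P, (1, 3), (4, 7)))  # "aga" + "gagc"
print(concat_query(P, (5, 7), (4, 7)))  # "agc" + "gagc"
```

The first two calls replay the merges from the cover update on slide 26; the third shows two adjacent pieces that must stay separate.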

  29. Find New Matches • Given: a text cover which satisfies both the substring and maximality properties. • Find: all new locations of the pattern in the text.

  30. Key Observations • A new match must begin within distance m of the change. • A new match can include at most one entire piece of the cover. • It can span at most three pieces of the cover.

  31. Furthermore A new match can begin in one of at most three pieces of the cover: • the piece with the change • the previous piece • the one previous to that

  32. Simplified Problem • The search starts within a single piece X of the cover. • Simple O(m) time algorithm: check each location in X for a pattern start. • Use suffix trees and LCA queries to compare substrings in constant time.

  33. Improved Algorithm • Really, we only have to check each suffix of X that is a pattern prefix, e.g. X = a g a g a. • The KMP automaton can give the necessary information. • However, the time is still O(m)!

  34. Improved Algorithm • We can group the prefixes of P by their periods. • Each group of prefixes can be checked in constant time! • There are at most O(log m) groups.

  35. Groups (e.g.) Pattern = a g a g a g c (positions 1 to 7), X = a g a g a. There are three suffixes of X that are also pattern prefixes, in two groups: {agaga, aga} and {a}. Prefixes with the same period fall into a single group.
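The grouping on this slide can be computed with the KMP failure function: run the automaton of P over X to get the longest suffix of X that is a pattern prefix, then follow failure links to enumerate the shorter ones, bucketing each prefix by its smallest period. This is an illustrative sketch with hypothetical names; it visits every candidate and so does not achieve the constant time per group described in the talk.

```python
def failure_function(p):
    """fail[i] = length of the longest proper border of p[:i + 1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def prefix_suffix_groups(pattern, x):
    """Map period -> lengths of suffixes of x that are prefixes of pattern."""
    fail = failure_function(pattern)
    k = 0
    for c in x:
        # Fall back on a mismatch, or when the whole pattern was just matched.
        while k > 0 and (k == len(pattern) or c != pattern[k]):
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
    groups = {}
    while k > 0:
        period = k - fail[k - 1]  # smallest period of the prefix of length k
        groups.setdefault(period, []).append(k)
        k = fail[k - 1]
    return groups


print(prefix_suffix_groups("agagagc", "agaga"))
```

For the slide's example the prefixes of lengths 5 and 3 (agaga, aga) share period 2, and the prefix of length 1 (a) has period 1.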

  36. Checking a group in Constant Time Pattern = a g a g a g c (positions 1 to 7), X = a g a g a. Idea: match the period ‘ag’ as far as possible. As soon as (ag)* doesn’t match, check for a ‘c’.

  37. Groups • A string cannot have more than O(log m) border groups. • Hence, the time of the algorithm is O(log m). [Intuition: each new group has a new period which has to be at least double the size of the old period, e.g. aagaagaa.]

  38. Even Better... • We check only a constant number of groups. • Choosing these O(1) groups takes O(log log m) time. • Hence, our algorithm takes O(log log m) time per replacement.

  39. Open Problems • Allowing insertions and deletions to the text. • Searching for a set of multiple static patterns.
