Text Searching

john

Presentation Transcript


  1. Text Searching
     • Classical problem: given two strings p and t, find within t a match for the string p
     • t is called the text and p is called the pattern
     • Brute-force solution: worst-case time O(|p|·|t|)
     • Rabin-Karp: average-case time O(|p|+|t|)
     • Knuth-Morris-Pratt: worst-case time O(|p|+|t|)
     • Basic Terminology
       • Alphabet: a finite set Σ of symbols
       • String over Σ: a finite sequence of symbols from Σ, written one after another
       • Length of a string s: |s| = length of the sequence s (number of symbol occurrences)
       • s[i]: the symbol at position i in the string s
       • s[i..j]: the substring of s consisting of the symbols from positions i to j if j ≥ i, and the empty string ε if i > j
       • st: the concatenation of string s with string t

  2. Simple Text Search
     • In the ensuing discussion, we use m for the length |p| of the pattern string p and n for the length |t| of the text string t.

       simple_text_search(p,t) {
         // returns the starting index of the first substring of t
         // that matches p if one exists; returns -1 otherwise
         m = p.length
         n = t.length
         i = 0
         while (i+m ≤ n) {
           // there are enough symbols left for a match with p
           j = 0
           while (t[i+j] == p[j]) {
             j = j+1
             if (j ≥ m)
               return i
           }
           i = i+1
         }
         return -1
       }

     • Running time: O(m(n-m+1))
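The pseudocode above translates almost line for line into Python. The following is a minimal sketch, not the slides' own code; the explicit `j < m` bound in the inner loop is an addition that keeps the index in range, so the match test can sit after the loop.

```python
def simple_text_search(p: str, t: str) -> int:
    """Return the index of the first occurrence of p in t, or -1 if none."""
    m, n = len(p), len(t)
    i = 0
    while i + m <= n:  # enough symbols left for a match with p
        j = 0
        while j < m and t[i + j] == p[j]:
            j += 1
        if j == m:  # all m pattern symbols matched
            return i
        i += 1
    return -1


print(simple_text_search("001", "010001"))  # 3
```

Each of the n-m+1 starting positions costs at most m symbol comparisons, which is where the O(m(n-m+1)) bound comes from.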

  3. Example
     • simple_text_search searching for pattern “001” in text “010001”

       i=0, j=0: p[0] = 0, t[0] = 0  match
       i=0, j=1: p[1] = 0, t[1] = 1  mismatch
       i=1, j=0: p[0] = 0, t[1] = 1  mismatch
       i=2, j=0: p[0] = 0, t[2] = 0  match
       i=2, j=1: p[1] = 0, t[3] = 0  match
       i=2, j=2: p[2] = 1, t[4] = 0  mismatch
       i=3, j=0: p[0] = 0, t[3] = 0  match
       i=3, j=1: p[1] = 0, t[4] = 0  match
       i=3, j=2: p[2] = 1, t[5] = 1  match
       return 3

  4. Rabin-Karp Algorithm
     • The next two text-searching algorithms are based on finding ways to reduce the number of indices i of t for which we must compare t[i..i+m-1] with p
     • The version of the Rabin-Karp algorithm presented here applies to binary strings
     • It eliminates indices by means of an easily computed “fingerprint” of the next m characters of t at each index
     • If that fingerprint does not match the fingerprint of p, the index is skipped.
     • One possible fingerprint is parity(p), the parity of the bit string p, which is the sum of its bits mod 2
     • We initially compute the parity of the first m bits of t (at a preprocessing cost of Θ(m))
     • Then, if we know the parity of the substring t[i..i+m-1], we can easily compute the parity of t[i+1..i+m]:
       parity(t[i+1..i+m]) = (parity(t[i..i+m-1]) + t[i] + t[i+m]) mod 2
     • That computation has time cost Θ(1)
     • So, for each index i of t, if parity(t[i..i+m-1]) ≠ parity(p), we skip the m bit-comparisons and set i to i+1
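As an illustration of the rolling parity update, here is a small Python sketch. The function name `parity_candidates` is made up for this example; it returns the indices that survive the parity filter and would still need a full comparison.

```python
def parity_candidates(p: str, t: str) -> list[int]:
    """Indices i where parity(t[i..i+m-1]) equals parity(p), for bit strings."""
    m, n = len(p), len(t)
    if m > n:
        return []
    bits_t = [int(c) for c in t]
    target = sum(int(c) for c in p) % 2
    window = sum(bits_t[:m]) % 2  # Theta(m) preprocessing
    candidates = []
    for i in range(n - m + 1):
        if window == target:
            candidates.append(i)
        if i + m < n:
            # Drop t[i] and add t[i+m]; mod 2, subtracting equals adding.
            window = (window + bits_t[i] + bits_t[i + m]) % 2
    return candidates


print(parity_candidates("001", "010001"))  # [0, 1, 3]
```

Only index 2 is filtered out here, which matches the observation that parity removes roughly half of the indices at best.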

  5. Hashing
     • Parity only eliminates about ½ of the indices on average, which does not affect the asymptotic running time.
     • We want to eliminate all but 1/m of the comparison indices
     • If you want a speedup by a factor of q, you need a fingerprint function that
       • maps m-bit strings to q different values
       • distributes the m-bit strings evenly across the q values
       • is easy to compute sequentially
     • Rabin and Karp’s suggestion: view the m-bit string as the binary expansion of an integer and take that integer’s remainder after division by q
     • $f(i) = \left(\sum_{j=0}^{m-1} t[i+j] \cdot 2^{m-1-j}\right) \bmod q$
     • The choice of the value q is critical
     • Typically, a prime greater than m works well for distributing the string evaluations over the q values.

  6. Hashing
     • Rabin and Karp’s suggestion: view the m-bit string as the binary expansion of an integer and take that integer’s remainder after division by q
     • $f(i) = \left(\sum_{j=0}^{m-1} t[i+j] \cdot 2^{m-1-j}\right) \bmod q$
     • Example: pattern “001” in text “010001”, q = 3 (note: m = 3)
     • Fingerprint of pattern: (0·2^2 + 0·2^1 + 1·2^0) mod 3 = 1
     • Fingerprints for various values of i:
       i = 0: window 010, fingerprint: (0·2^2 + 1·2^1 + 0·2^0) mod 3 = 2
       i = 1: window 100, fingerprint: (1·2^2 + 0·2^1 + 0·2^0) mod 3 = 1
       i = 2: window 000, fingerprint: (0·2^2 + 0·2^1 + 0·2^0) mod 3 = 0
       i = 3: window 001, fingerprint: (0·2^2 + 0·2^1 + 1·2^0) mod 3 = 1
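The fingerprints in this example can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic only; the helper name `fingerprint` is an assumption, and it recomputes each window from scratch instead of using the sequential update.

```python
def fingerprint(bits: str, q: int) -> int:
    """Remainder mod q of the integer whose binary expansion is `bits`."""
    value = 0
    for b in bits:
        # Horner's rule: shift left by one bit, add the next bit, reduce mod q.
        value = (2 * value + int(b)) % q
    return value


p, t, q = "001", "010001", 3
m = len(p)
print(fingerprint(p, q))  # 1
print([fingerprint(t[i:i + m], q) for i in range(len(t) - m + 1)])  # [2, 1, 0, 1]
```

Indices 1 and 3 share the pattern's fingerprint, so both would be compared in full; only index 3 is a real match.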

  7. Hashing
     • View the m-bit string as the binary expansion of an integer and take that integer’s remainder after division by q
     • $f(i) = \left(\sum_{j=0}^{m-1} t[i+j] \cdot 2^{m-1-j}\right) \bmod q$
     • Example: m = 3, i = 3, q = 3
     • The choice of the value q is critical
     • Typically, a prime no smaller than m works well for distributing the string evaluations over the q values.

       i:    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14
       t[i]: 1  0  1  1  0  1  0  0  1  0  1  0  0  1  1

       j = 0: m-1-j = m-1 = 2-0 = 2; t[i+j] = t[3+0] = t[3] = 1, so the first term is 1·2^2
       j = 1: m-1-j = m-2 = 2-1 = 1; t[i+j] = t[3+1] = t[4] = 0, so the second term is 0·2^1
       j = 2: m-1-j = m-3 = 2-2 = 0; t[i+j] = t[3+2] = t[5] = 1, so the third term is 1·2^0
       f(i) = f(3) = (1·2^2 + 0·2^1 + 1·2^0) mod 3 = (1·4 + 0·2 + 1·1) mod 3 = 5 mod 3 = 2

  8. Example
     • p = 010111, t = 0010110101001010011
     • Parity fingerprint:
       • Number of string comparisons needed: 6
     • Rabin-Karp with q = 7, m = 6; f(p) = 2
       • Number of string comparisons needed: 1
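The two comparison counts can be checked by brute force. The sketch below counts how many windows of t share the pattern's fingerprint under each scheme; the helper names are hypothetical and the windows are recomputed directly, which is fine for a check but is not how the algorithm itself proceeds.

```python
def count_full_comparisons(p: str, t: str, fingerprint) -> int:
    """Number of indices whose window has the same fingerprint as p."""
    m = len(p)
    target = fingerprint(p)
    return sum(
        1 for i in range(len(t) - m + 1) if fingerprint(t[i:i + m]) == target
    )


def parity(bits: str) -> int:
    return bits.count("1") % 2


def mod_7(bits: str) -> int:
    return int(bits, 2) % 7


p, t = "010111", "0010110101001010011"
print(count_full_comparisons(p, t, parity))  # 6
print(count_full_comparisons(p, t, mod_7))   # 1
```

In this particular text the pattern never occurs, so the single comparison left under q = 7 (at index 9, window 100101) is a false positive that the full comparison rejects.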

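  9. Sequential Computation
     • Suppose we have a sequence s_0 s_1 … s_m of bits.
     • Let $a = f(s_0 s_1 \dots s_{m-1}) = \left(\sum_{j=0}^{m-1} s_j \cdot 2^{m-1-j}\right) \bmod q$
     • Then $f(s_1 s_2 \dots s_m) = \left(2\,(a - 2^{m-1} s_0) + s_m\right) \bmod q$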
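The sequential update can be justified by writing out the integers behind the two fingerprints. The derivation below is a sketch; the names N, N' and r are introduced here for illustration (r matches the constant 2^(m-1) mod q used in the algorithm on the next slide).

```latex
\begin{align*}
N  &= \sum_{j=0}^{m-1} s_j \, 2^{m-1-j}
      && \text{(integer for } s_0 s_1 \dots s_{m-1}\text{)} \\
N' &= \sum_{j=1}^{m} s_j \, 2^{m-j}
      && \text{(integer for } s_1 s_2 \dots s_m\text{)} \\
   &= 2 \sum_{j=1}^{m-1} s_j \, 2^{m-1-j} + s_m \\
   &= 2 \left( N - s_0 \, 2^{m-1} \right) + s_m \\
N' \bmod q &= \left( 2\,(a - r\, s_0) + s_m \right) \bmod q,
      \qquad a = N \bmod q, \quad r = 2^{m-1} \bmod q
\end{align*}
```

Dropping the leading bit, doubling, and adding the new bit are all constant-time operations, so each new fingerprint costs Θ(1).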

  10. Rabin-Karp Algorithm

       rabin_karp_search(p,t) {
         m = p.length
         n = t.length
         q = prime number larger than m
         r = 2^(m-1) mod q
         f[0] = 0
         pfinger = 0
         for j = 0 to m-1 {                  // Θ(m) preprocessing cost
           f[0] = (2*f[0] + t[j]) mod q
           pfinger = (2*pfinger + p[j]) mod q
         }
         i = 0
         while (i+m ≤ n) {
           if (f[i] == pfinger)              // Θ(1) fingerprint test
             if (t[i..i+m-1] == p)           // Θ(m) full comparison
               return i
           if (i+m < n)                      // t[i+m] exists only while i+m < n
             f[i+1] = (2*(f[i] - r*t[i]) + t[i+m]) mod q
           i = i+1
         }
         return -1
       }

     • Assuming a speedup factor of q > m, the average number of times the fingerprint test is true is O((n-m+1)/q) ≤ O((n-m+1)/m)
     • Each full comparison costs Θ(m) ⇒ the expected running time of the while loop is O(n-m+1)
     • ⇒ the expected running time of the algorithm is O(n+m)
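A runnable Python sketch of the algorithm above, for bit strings given as text such as "0101". Two details are assumptions made for this example: the helper `next_prime_above` picks q by trial division, and a single rolling value `f` replaces the array f[0..], since only the current fingerprint is ever read.

```python
def next_prime_above(m: int) -> int:
    """Smallest prime strictly greater than m (trial division, small m only)."""
    candidate = max(m + 1, 2)
    while any(candidate % d == 0 for d in range(2, int(candidate ** 0.5) + 1)):
        candidate += 1
    return candidate


def rabin_karp_search(p: str, t: str) -> int:
    """Index of the first occurrence of bit string p in bit string t, or -1."""
    m, n = len(p), len(t)
    if m == 0:
        return 0
    if m > n:
        return -1
    q = next_prime_above(m)
    r = pow(2, m - 1, q)  # 2^(m-1) mod q
    f = pfinger = 0
    for j in range(m):  # Theta(m) preprocessing
        f = (2 * f + int(t[j])) % q
        pfinger = (2 * pfinger + int(p[j])) % q
    for i in range(n - m + 1):
        # Full comparison only when the fingerprints agree.
        if f == pfinger and t[i:i + m] == p:
            return i
        if i + m < n:
            # Drop t[i], shift left, append t[i+m]; Python's % stays non-negative.
            f = (2 * (f - r * int(t[i])) + int(t[i + m])) % q
    return -1


print(rabin_karp_search("001", "010001"))  # 3
```

The full comparison after a fingerprint hit is what keeps the result correct; the fingerprint only decides which indices are worth comparing.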

  11. Knuth-Morris-Pratt Algorithm
     • Based on the structure of the pattern string
     • Suppose you have matched the pattern symbols from position 0 to k with a segment of the text string and failed on the symbol at pattern position k+1
     • Depending on the pattern string, you may know that you can skip some of the positions in the text string before starting your next matching attempt

  12. Knuth-Morris-Pratt Algorithm
     • For example: pattern = “Tweedledum”
     • Suppose you are trying to find a match starting at character position 10 of the text string

       position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
       text:     x x x x x x x x x x T  w  e  e  d  X

     • Suppose you have matched the first 5 characters of the pattern but not the 6th character (so the text character X is not the letter ‘l’).
     • Your next starting position for a possible match should be position 15 of the text string
     • Why?
       • The starting text position for a possible match must contain T
       • The letter T only appears at the beginning of the pattern
       • Since you have matched the first 5 characters of the pattern with the text, none of the characters in text positions 11 to 14 is a T

  13. Example

       pattern p: Tweedledum
       text t:    Tweedledee and Tweedledum

     • Starting at the beginning, we see that we get a match with the first 8 characters and a mismatch at the ninth character

       position:  0       8
       pattern p: Tweedledum
       text t:    Tweedledee and Tweedledum

     • From the structure of the pattern p, we know that no match can occur starting at the 2nd to 8th characters of the text string (positions 1 to 7), since none of them can contain a T.
     • Thus our next comparison should start at the 9th character in the text, i.e., at position 8

       pattern p:         Tweedledum
       text t:    Tweedledee and Tweedledum

     • We then get a mismatch with a single comparison, so we check text position 9, etc., until we reach text position 15, where we find a match
     • Note that we eliminated 8 comparisons from the simple method

  14. Example
     • In general, when searching for Tweedledum in a text, we are examining characters starting at some position i of the text
     • Note that the length of the string Tweedledum is 10
     • For each integer k with 0 ≤ k < 9, if
       • the first k+1 characters of Tweedledum (pattern positions 0 to k) have been matched; and
       • the character at pattern position k+1 does not match
       then we restart the matching at text position i+(k+1)
     • This is because T does not occur anywhere in the string but at the beginning
     • Thus we can skip over any characters that have matched any part of “weedledum”
     • The value we add to the current text position before starting the new comparison is called the shift amount
     • Thus the shift amount when the first mismatch is at position k+1 of Tweedledum is shift[k] = k+1.

  15. Structure of a Pattern
     • Suppose the pattern is the string p = pappar
     • Let's write the text t first as a string of ? marks indicating we don’t yet know the characters. We will write an x if there is a mismatch with the pattern
     • Let k be the pattern position such that p[0..k] matches and p[k+1] does not
     • When we make a shift s, we start our comparisons at the first ? in the text. That is, we shift the pattern s positions to the right and start comparing at p[k+1-s]
     • Look at the possibilities:

       k = -1 (shift s = 1)
         before:  pappar
                  x??????????????
         after:    pappar
                  x??????????????

       k = 0 (shift s = 1; the mismatched character x is not a, but could be p)
         before:  pappar
                  px?????????????
         after:    pappar
                  p??????????????

  16. Structure of a Pattern
     • Let k be the pattern position such that p[0..k] matches and p[k+1] does not
     • When we make a shift s, we start our comparisons at the first ? in the text. That is, we shift the pattern s positions to the right and start comparing at p[k+1-s]
     • Look at the possibilities:

       k = -1 (shift s = 1)
         before:  pappar
                  x??????????????
         after:    pappar
                  x??????????????

       k = 0 (shift s = 1)
         before:  pappar
                  px?????????????
         after:    pappar
                  p??????????????

       k = 1 (shift s = 2)
         before:  pappar
                  pax????????????
         after:     pappar
                  pa?????????????

       k = 2 (shift s = 2)
         before:  pappar
                  papx???????????
         after:     pappar
                  pap????????????

     • Note that, unlike with the pattern Tweedledum, when the mismatch occurs at the 4th character we cannot restart at that character. This is because the 3rd character is a p and could be the start of a match to pappar

  17. Structure of a Pattern
     • Continuing the discussion of pattern p = pappar

       k = 3 (shift s = 3)
         before:  pappar
                  pappx??????????
         after:      pappar
                  papp???????????

       k = 4 (shift s = 3)
         before:  pappar
                  pappax?????????
         after:      pappar
                  pappa??????????

  18. The Shift Table
     • We summarize the shift information we derived above for the strings “pappar” and “Tweedledum”

       pappar
         k:        -1  0  1  2  3  4
         shift[k]:  1  1  2  2  3  3

       Tweedledum
         k:        -1  0  1  2  3  4  5  6  7  8
         shift[k]:  1  1  2  3  4  5  6  7  8  9

     • The shift tables are the key to the efficiency of the Knuth-Morris-Pratt algorithm
     • We will give the algorithm for constructing the shift table after presenting the text matching algorithm

  19. Structure of a Pattern
     • Recall the discussion of pattern p = pappar

       k = 3 (shift s = 3)
         before:  pappar
                  pappx??????????
         after:      pappar
                  papp???????????

       k = 4 (shift s = 3)
         before:  pappar
                  pappax?????????
         after:      pappar
                  pappa??????????

     • Notice that, in the case k = 4, we do not have to start our character comparisons from the beginning of the pattern
     • Why? Because we know there is a match with the first two characters!
     • In general, when we do a shift, we continue the character comparisons from the pattern position immediately after the known internal matching in the pattern.
     • The shift amount should be the smallest integer s such that, if p[0..k] = t[i..(i+k)] and p[k+1] ≠ t[i+k+1], then we can:
       • move the next starting point in t forward by s positions (to i+s)
       • start the character comparisons by comparing p[(k+1)-s] with t[i+(k+1)]
       • not miss any possible matchings

  20. The Shift Table
     • What properties lead to the choice of the shift amount?
     • Suppose p[0..k] = t[i..(i+k)] and p[k+1] ≠ t[i+k+1]
     • Consider shifting the pattern p by s positions with the goal of continuing our comparison by comparing p[k+1-s] with t[i+k+1]
     • For the shift to be successful, the first k-s+1 characters of p must match in the new position:

       p (before shift):             0    1    …  s    s+1    …  k    k+1
       t:                0  1  2  …  i    i+1  …  i+s  i+s+1  …  i+k  i+(k+1)  …
       p (after shift):                           0    1      …  k-s  (k+1)-s

     • From the diagram, it is clear that we need p[0..(k-s)] = t[(i+s)..(i+k)]
     • Recall that we assumed that p[0..k] = t[i..i+k]
     • It then follows that p[s..k] = t[(i+s)..(i+k)] and hence p[0..(k-s)] = p[s..k]
     • In order to not miss any possible matchings, we should choose the smallest positive s that satisfies this condition, so:
       shift[k] = min { s > 0 | p[0..k-s] = p[s..k] }

  21. The Shift Table
       shift[k] = min { s > 0 | p[0..k-s] = p[s..k] }
     • Note that, for k ≥ 0, the condition always holds for s = k+1, since p[0..(k-(k+1))] = p[0..-1] = ε and p[(k+1)..k] = ε.
     • If the condition does not hold for any value s < k+1, then shift[k] = k+1

  22. Shifts for k = -1 and k = 0
       shift[k] = min { s > 0 | p[0..k-s] = p[s..k] }
     • Since we will always shift at least one position to the right, we require s > 0
     • If the mismatch occurs at the very first character, then k+1 = 0, hence k = -1.
     • For k = -1, s = 1 satisfies the condition for being the shift, since p[s..k] = p[1..-1] = ε and p[0..(k-s)] = p[0..(-2)] = ε ⇒ shift[-1] = 1 for all patterns
     • If the mismatch occurs at the second character, then k+1 = 1, so k = 0
     • For k = 0, s = 1 also satisfies the condition: p[s..k] = p[1..0] = ε and p[0..(k-s)] = p[0..(0-1)] = p[0..(-1)] = ε ⇒ shift[0] = 1 for all patterns

  23. Shifts for “pappar”: Another Look
     • Looking again at the string “pappar”:
     • For k = 1 (mismatch at index k+1 = 2), check s = 1.
       • Compare p[s..k] = p[1..1] = p[1] = a with p[0..(k-s)] = p[0..(1-1)] = p[0..0] = p[0] = p
       • Since equality does not hold, shift[1] cannot be 1.
       • ⇒ shift[1] = 2 (= k+1)
       • This means we shift the pattern by 2 and start comparisons at k+1-s = 1+1-2 = 0
     • For k = 2, check s = 1.
       • p[s..k] = p[1..2] = “ap” and p[0..k-s] = p[0..1] = “pa”: not equal
     • For k = 2, check s = 2.
       • p[s..k] = p[2..2] = p and p[0..k-s] = p[0..0] = p: equality
       • ⇒ shift[2] = 2
       • This means we shift the pattern by 2 and start comparisons at k+1-s = 2+1-2 = 1
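The definition shift[k] = min { s > 0 | p[0..k-s] = p[s..k] } can be evaluated directly by trying s = 1, 2, … in turn. The following Python sketch does exactly that; it is a slow check of the definition (the function name is made up here), not the efficient table construction given later in the slides.

```python
def shift_by_definition(p: str) -> dict[int, int]:
    """shift[k] = min { s > 0 | p[0..k-s] == p[s..k] } for k = -1 .. len(p)-1."""
    shift = {-1: 1}  # mismatch on the very first character: always shift by 1
    for k in range(len(p)):
        for s in range(1, k + 2):
            # p[0..k-s] is the slice p[:k-s+1]; p[s..k] is the slice p[s:k+1].
            if p[:k - s + 1] == p[s:k + 1]:
                shift[k] = s
                break  # s = k+1 always succeeds, so this is always reached
    return shift


print(shift_by_definition("pappar"))
# {-1: 1, 0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 6}
```

For "Tweedledum" the same function returns shift[k] = k+1 for every k ≥ 0, because no proper prefix of the pattern reappears later in it.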

  24. Shifts for “pappar”: Another Look
     • In general, we will advance i to i+s and start by comparing p[k+1-s] with t[i+k+1]
     • Thus we will do i = i+s and j = k+1-s and start by comparing p[j] = p[k+1-s] with t[i+j], which is t[(i+s)+(k+1-s)] = t[i+k+1] in terms of the old value of i
     • Thus if we have been comparing p[j] with t[i+j] and j reaches the point of the first mismatch, then j is the k+1 in the above discussion.
     • Thus we set i = i+shift[j-1] and j = j-shift[j-1]
     • We must be careful about the case where j-shift[j-1] < 0, which happens when j = 0 (since shift[-1] = 1)
     • Clearly, we must start at j = 0 in those cases

  25. Knuth-Morris-Pratt Search
     • We assume we have available an algorithm for computing the shift table

       knuth_morris_pratt_search(p,t) {
         m = p.length
         n = t.length
         knuth_morris_pratt_shift(p,shift)   // compute the shift table
         i = j = 0
         while (i+m ≤ n) {                   // there are at least m more characters in t
           while (t[i+j] == p[j]) {          // while text and pattern characters match
             j = j+1
             if (j ≥ m)                      // pattern found in the text
               return i
           }
           // mismatch found at position j
           s = shift[j-1]
           i = i + s
           j = max(j - s, 0)
         }
         // ran out of characters and no match found
         return -1
       }
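A Python sketch of the search loop follows. To keep the example self-contained without repeating the table construction, the shift table is passed in as a dictionary keyed from -1 to m-1; that parameter is a choice made for this sketch, whereas the pseudocode computes the table inside the function.

```python
def knuth_morris_pratt_search(p: str, t: str, shift: dict[int, int]) -> int:
    """Index of the first occurrence of p in t, or -1, given p's shift table."""
    m, n = len(p), len(t)
    if m == 0:
        return 0
    i = j = 0
    while i + m <= n:  # at least m more characters remain in t
        while j < m and t[i + j] == p[j]:
            j += 1
        if j == m:  # pattern found in the text
            return i
        s = shift[j - 1]   # mismatch at pattern position j
        i += s             # move the starting point forward by the shift
        j = max(j - s, 0)  # keep the part of the match that is still valid
    return -1


# Shift table for "pappar"; the last entry is shift[5] = 6.
pappar_shift = {-1: 1, 0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 6}
print(knuth_morris_pratt_search("pappar", "pappappar", pappar_shift))  # 3
```

In this run the first attempt matches "pappa" and fails on the sixth character; the shift of 3 keeps j = 2, so the text position that caused the mismatch is compared next and no earlier text character is re-read.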

  26. Computing the Shift Table
     • To compute the shift table for a pattern, we essentially run the KMP algorithm with t = p
     • Whenever we increase j in the inner loop, it is because we have found a partial match: p[0..j] = p[i..i+j]
     • This implies that shift[i+j] ≤ i because shift[i+j] = min { s > 0 | p[0..i+j-s] = p[s..i+j] }
     • We are also making sure we are not missing any earlier matches, so we have shift[i+j] = i and we can set the value in the shift table
     • Note that we are using values in the shift table to construct new values
     • No problem: we only ever access values in the shift table that have already been computed.

  27. Computing the Shift Table

       knuth_morris_pratt_shift(p,shift) {
         m = p.length
         shift[-1] = 1   // p[0] ≠ t[i], so shift one position
         shift[0] = 1    // p[0..0-s] = p[0..-1] = ε: smallest possible value of s is 1;
                         // p[s..k] = p[1..0] = ε and the two strings are equal
         i = 1
         j = 0
         while (i + j < m)
           if (p[i+j] == p[j]) {
             shift[i+j] = i
             j = j+1
           } else {
             if (j == 0)
               shift[i] = i+1
             i = i+shift[j-1]
             j = max(j-shift[j-1],0)
           }
       }
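The table construction can be sketched in Python as below. A dictionary stands in for the shift array so that the index -1 can be used as a key (an assumption of this sketch; Python lists would treat -1 as the last element), and the table is returned instead of being filled in place.

```python
def knuth_morris_pratt_shift(p: str) -> dict[int, int]:
    """Shift table for p, with keys -1 .. len(p)-1."""
    m = len(p)
    shift = {-1: 1}
    if m > 0:
        shift[0] = 1
    i, j = 1, 0
    while i + j < m:
        if p[i + j] == p[j]:
            # p[0..j] == p[i..i+j], and no smaller shift was possible.
            shift[i + j] = i
            j += 1
        else:
            if j == 0:
                shift[i] = i + 1  # no prefix of p ends at position i
            s = shift[j - 1]      # already computed, since j-1 < i+j
            i += s
            j = max(j - s, 0)
    return shift


print(knuth_morris_pratt_shift("pappar"))
# {-1: 1, 0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 6}
```

Reading `shift[j - 1]` once into `s` before updating `i` and `j` makes it explicit that both updates use the same table entry.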

  28. Computing the Shift Table

       shift[-1] = 1
       shift[0] = 1
       i = 1
       j = 0
       while (i + j < m)
         if (p[i+j] == p[j]) {
           shift[i+j] = i
           j = j+1
         } else {
           if (j == 0)
             shift[i] = i+1
           i = i+shift[j-1]
           j = max(j-shift[j-1],0)
         }

     • Pattern: p a p p a r, m = 6
     • shift[-1] = 1
     • shift[0] = 1
     • i = 1, j = 0: p[1] ≠ p[0], j == 0 ⇒ shift[i] = i+1 ⇒ shift[1] = 2; i = 1 + shift[-1] = 1+1 = 2, j = 0 (the max)
     • i = 2, j = 0: p[2] == p[0] ⇒ shift[i+j] = i ⇒ shift[2] = 2; j = j + 1 = 0 + 1 = 1
     • i = 2, j = 1: p[3] ≠ p[1], j ≠ 0; i = i + shift[j-1] = 2 + shift[0] = 2+1 = 3; j = max(j-shift[j-1],0) = max(0,0) = 0
     • i = 3, j = 0: p[3] == p[0] ⇒ shift[i+j] = i ⇒ shift[3] = 3; j = j + 1 = 0 + 1 = 1

  29. Computing the Shift Table

       shift[-1] = 1
       shift[0] = 1
       i = 1
       j = 0
       while (i + j < m)
         if (p[i+j] == p[j]) {
           shift[i+j] = i
           j = j+1
         } else {
           if (j == 0)
             shift[i] = i+1
           i = i+shift[j-1]
           j = max(j-shift[j-1],0)
         }

     • Pattern: p a p p a r, m = 6
     • i = 3, j = 0: p[3] == p[0] ⇒ shift[i+j] = i ⇒ shift[3] = 3; j = j + 1 = 0 + 1 = 1
     • i = 3, j = 1: p[4] == p[1] ⇒ shift[i+j] = i ⇒ shift[4] = 3; j = j + 1 = 1 + 1 = 2
     • i = 3, j = 2: p[5] ≠ p[2], j ≠ 0; i = i + shift[j-1] = 3 + shift[1] = 3+2 = 5; j = max(j-shift[j-1],0) = max(0,0) = 0
     • i = 5, j = 0: p[5] ≠ p[0], j == 0 ⇒ shift[i] = i+1 ⇒ shift[5] = 6

  30. Homework
     • Page 378, numbers 2 and 3
     • Page 391, numbers 2, 3, 5, 7, and 8
