String Matching

PowerPoint Slideshow about 'String Matching' - cathy


Detecting the occurrence of a particular substring (the pattern) in another string (the text). Approaches covered:
  • A straightforward solution
  • The Knuth-Morris-Pratt algorithm
  • The Boyer-Moore algorithm

Straightforward solution
  • Algorithm: Simple string matching
  • Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty.
  • Output: The return value is the index in T where a copy of P begins, or -1 if no match for P is found.
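int simpleScan(char[] P, char[] T, int m) {
    // Indices into P and T are 1-based; endText(T, j) is true once j is past the end of T.
    int match;    // value to return
    int i, j, k;  // i: current guess at where P begins in T; j: text index; k: pattern index
    match = -1;
    j = 1; k = 1; i = j;
    while (endText(T, j) == false) {
        if (k > m) {
            match = i;  // match found
            break;
        }
        if (T[j] == P[k]) {
            j++; k++;
        } else {
            // Back up over matched characters.
            int backup = k - 1;
            j = j - backup;
            k = k - backup;
            // Slide pattern forward, start over.
            j++; i = j;
        }
    }
    if (k > m)
        match = i;  // the last pattern character matched on the last text character
    return match;
}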
  • Worst-case complexity is in Θ(mn), where n is the length of T.
  • Needs to back up in the text after a mismatch.
  • Works quite well on average for natural language.
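To make the worst case concrete, here is a minimal Java sketch that counts character comparisons. It is a 0-based, for-loop variant of the straightforward scan rather than the slide's exact code; the class name `SimpleScanDemo`, the `comparisons` counter, and the `repeat` helper are illustrative assumptions.

```java
final class SimpleScanDemo {
    /** Character comparisons made by the most recent call to simpleScan. */
    static long comparisons;

    /** Returns the 0-based index of the first occurrence of p in t, or -1 if there is none. */
    static int simpleScan(String p, String t) {
        comparisons = 0;
        int m = p.length(), n = t.length();
        for (int i = 0; i + m <= n; i++) {   // i: candidate start of the match
            int k = 0;                        // k: pattern characters matched so far
            while (k < m) {
                comparisons++;
                if (t.charAt(i + k) != p.charAt(k)) break;
                k++;
            }
            if (k == m) return i;             // all m characters matched
        }
        return -1;
    }

    /** Hypothetical helper: a string consisting of count copies of c. */
    static String repeat(char c, int count) {
        char[] buf = new char[count];
        java.util.Arrays.fill(buf, c);
        return new String(buf);
    }

    public static void main(String[] args) {
        System.out.println(simpleScan("the", "match the pattern"));
        String t = repeat('a', 1000);         // text: 1000 copies of 'a'
        String p = repeat('a', 9) + "b";      // pattern: nine 'a's, then 'b'
        int result = simpleScan(p, t);
        System.out.println(result + " after " + comparisons + " comparisons");
    }
}
```

With this adversarial input every one of the 991 candidate positions matches nine characters before failing on the tenth, so the scan makes 991 × 10 = 9910 comparisons, which is roughly mn. On ordinary natural-language text most candidate positions fail on the first or second character.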
Finite Automata
  • Terminology
    • Σ: the alphabet
    • Σ*: the set of all finite-length strings formed using characters from Σ.
    • xy: concatenation of two strings x and y.
    • Prefix: a string w is a prefix of a string x if x = wy for some string y ∈ Σ*.
    • Suffix: a string w is a suffix of a string x if x = yw for some string y ∈ Σ*.
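The prefix and suffix definitions map directly onto two standard `java.lang.String` methods. The short sketch below is only an illustration; the class name `PrefixSuffixDemo` and the sample strings are assumptions.

```java
final class PrefixSuffixDemo {
    /** True if w is a prefix of x, i.e. x = wy for some (possibly empty) string y. */
    static boolean isPrefix(String w, String x) {
        return x.startsWith(w);
    }

    /** True if w is a suffix of x, i.e. x = yw for some (possibly empty) string y. */
    static boolean isSuffix(String w, String x) {
        return x.endsWith(w);
    }

    public static void main(String[] args) {
        String x = "ab" + "cab";                  // concatenation of "ab" and "cab"
        System.out.println(isPrefix("ab", x));    // true:  x = "ab" + "cab"
        System.out.println(isSuffix("ab", x));    // true:  x = "abc" + "ab"
        System.out.println(isSuffix("bc", x));    // false
        System.out.println(isPrefix("", x));      // true:  the empty string is a prefix of every string
    }
}
```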
The Knuth-Morris-Pratt algorithm
  • Skip outer iteration i = 3
  • Skip the first inner iteration, testing “n” vs “n”, at outer iteration i = 4
  • In general, if there is a partial match of j characters starting at i, then we know what is in positions T[i]…T[i+j-1]. So we can save work in two ways:
    • Skip outer iterations (for which no match is possible)
    • Skip inner iterations (when there is no need to test known matches)
  • When a mismatch occurs, we want to slide P forward, but maintain the longest overlap of a prefix of P with a suffix of the part of the text that has matched the pattern so far.
  • The KMP algorithm achieves linear-time performance by capitalizing on the observation above, building a simplified finite automaton in which each node has only two links, success and fail.
The Knuth-Morris-Pratt Flowchart
  • Character labels are inside the nodes
  • Each node has two arrows out to other nodes: success link, or fail link
  • The next character is read only after a success link
  • A special node, node 0, called “get next char”, reads in the next text character.
      • e.g. P = “ABABCB”
Construction of the KMP Flowchart
  • Definition: Fail links
    • We define fail[k] as the largest r (with r < k) such that p_1 ... p_{r-1} matches p_{k-r+1} ... p_{k-1}. That is, the (r-1)-character prefix of P is identical to the (r-1)-character substring ending at index k-1. Thus the fail links are determined by repetition within P itself.
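The definition can be checked by brute force. The sketch below is a deliberately slow illustration, not the construction algorithm: for each k it tries r = k-1, k-2, ..., 1 and keeps the first r whose (r-1)-character prefix equals the (r-1)-character substring ending at index k-1. The class name `FailByDefinition` is an assumption, and fail[1] is simply set to 0, the value the construction algorithm assigns.

```java
final class FailByDefinition {
    /**
     * Returns fail[1..m] computed directly from the definition.
     * Index 0 of the result is unused so that indices match the 1-based slides.
     */
    static int[] failLinks(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m + 1];          // fail[1] stays 0
        for (int k = 2; k <= m; k++) {
            for (int r = k - 1; r >= 1; r--) {
                // 1-based p_1..p_{r-1} is the 0-based substring [0, r-1).
                String prefix = pattern.substring(0, r - 1);
                // 1-based p_{k-r+1}..p_{k-1} is the 0-based substring [k-r, k-1).
                String window = pattern.substring(k - r, k - 1);
                if (prefix.equals(window)) {  // largest r wins, since r counts down
                    fail[k] = r;
                    break;
                }
            }
        }
        return fail;
    }

    public static void main(String[] args) {
        // Prints [0, 0, 1, 1, 2, 3, 1]: the leading 0 is the unused slot,
        // followed by fail[1..6] for P = "ABABCB".
        System.out.println(java.util.Arrays.toString(failLinks("ABABCB")));
    }
}
```

For P = “ABABCB”, fail[5] = 3 because the two-character prefix “AB” equals the two characters “AB” ending at index 4, while fail[6] = 1 because no nonempty prefix ends at the “C” in index 5.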
Algorithm: KMP flowchart construction
  • Input: P, a string of characters; m, the length of P.
  • Output: fail, the array of failure links, defined for indexes 1,...,m. The array is passed in and the algorithm fills it.
  • Steps:
void kmpSetup(char[] P, int m, int[] fail) {
    // Indices into P and fail are 1-based.
    int k, s;
    fail[1] = 0;
    for (k = 2; k <= m; k++) {
        s = fail[k-1];
        while (s >= 1) {
            if (P[s] == P[k-1])
                break;
            s = fail[s];  // follow the chain of fail links
        }
        fail[k] = s + 1;
    }
}
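A runnable Java rendering of the construction might look like the following. It is a sketch under a few assumptions: the pattern is passed as a `String`, a padding character is prepended so that p[1..m] lines up with the 1-based pseudocode, and the class name `KmpSetupDemo` is made up for the example.

```java
final class KmpSetupDemo {
    /** Returns fail[1..m]; fail[0] is unused padding so indices match the slides. */
    static int[] kmpSetup(String pattern) {
        int m = pattern.length();
        char[] p = (" " + pattern).toCharArray();  // p[1..m] holds the pattern
        int[] fail = new int[m + 1];
        if (m == 0) return fail;                    // the slides assume a nonempty pattern
        fail[1] = 0;
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1) {
                if (p[s] == p[k - 1]) break;        // the prefix ending at s can be extended
                s = fail[s];                        // otherwise try the next shorter prefix
            }
            fail[k] = s + 1;
        }
        return fail;
    }

    public static void main(String[] args) {
        // Prints [0, 0, 1, 1, 2, 3, 1] for P = "ABABCB" (first entry is the unused slot).
        System.out.println(java.util.Arrays.toString(kmpSetup("ABABCB")));
    }
}
```

The inner loop reuses fail links that were already computed for shorter prefixes, which is what keeps the total number of comparisons linear in m.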
The Knuth-Morris-Pratt Scan Algorithm
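int kmpScan(char[] P, char[] T, int m, int[] fail) {
    // Indices into P, T and fail are 1-based; the text index j never moves backward.
    int match, j, k;
    match = -1;
    j = 1; k = 1;
    while (endText(T, j) == false) {
        if (k > m) {
            match = j - m;  // match found
            break;
        }
        if (k == 0) {
            j++; k = 1;     // node 0: get the next text character
        } else if (T[j] == P[k]) {
            j++; k++;       // success link
        } else {
            // Follow fail arrow.
            k = fail[k];
        }
        // Continue loop.
    }
    if (k > m)
        match = j - m;      // the last pattern character matched on the last text character
    return match;
}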
  • KMP flowchart construction requires 2m - 3 character comparisons in the worst case.
  • The scan algorithm requires 2n character comparisons in the worst case.
  • Overall: worst-case complexity is in Θ(n+m).
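The scan can be tried out with a small Java sketch that also counts character comparisons. Several details are assumptions made for the example: strings are padded so that indices 1..m and 1..n match the pseudocode, the result is converted to a 0-based index as is usual in Java, and the names `KmpScanDemo` and `comparisons` are hypothetical.

```java
final class KmpScanDemo {
    /** Text/pattern character comparisons made by the most recent kmpScan call. */
    static long comparisons;

    /** Fills fail[1..m] for the padded pattern p[1..m]. */
    static int[] kmpSetup(char[] p, int m) {
        int[] fail = new int[m + 1];
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1 && p[s] != p[k - 1]) s = fail[s];
            fail[k] = s + 1;
        }
        return fail;
    }

    /** Returns the 0-based index of the first occurrence of pattern in text, or -1. */
    static int kmpScan(String pattern, String text) {
        if (pattern.isEmpty()) throw new IllegalArgumentException("pattern must be nonempty");
        int m = pattern.length(), n = text.length();
        char[] p = (" " + pattern).toCharArray();   // p[1..m]
        char[] t = (" " + text).toCharArray();      // t[1..n]
        int[] fail = kmpSetup(p, m);
        comparisons = 0;
        int j = 1, k = 1;
        while (j <= n && k <= m) {
            if (k == 0) {                // node 0: read the next text character
                j++; k = 1;
            } else {
                comparisons++;
                if (t[j] == p[k]) {      // success link
                    j++; k++;
                } else {                 // fail link: j stays where it is
                    k = fail[k];
                }
            }
        }
        // k > m means all m characters matched; j - m is the 1-based start.
        return (k > m) ? j - m - 1 : -1;
    }

    public static void main(String[] args) {
        String text = "ABABABCBAB";
        System.out.println(kmpScan("ABABCB", text));              // 2
        System.out.println(comparisons <= 2L * text.length());    // true
    }
}
```

Note that the text index j only ever increases, which is why the scan does not need random access to the text.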
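Algorithm: Computing Jumps for the Boyer-Moore Algorithm
  • Input: Pattern string P; m, the length of P; alphabet size alpha = |Σ|.
  • Output: Array charJump, defined on indexes 0,...,alpha-1. The array is passed in and the algorithm fills it.
void computeJumps(char[] P, int m, int alpha, int[] charJump) {
    // Indices into P are 1-based; characters are used as indexes 0,...,alpha-1.
    char ch; int k;
    // A character that does not occur in P allows a jump of the full pattern length.
    for (ch = 0; ch < alpha; ch++)
        charJump[ch] = m;
    // Later occurrences overwrite earlier ones, so the rightmost occurrence wins.
    for (k = 1; k <= m; k++)
        charJump[P[k]] = m - k;
}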
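As a quick check of what the table contains, the sketch below fills charJump for a sample pattern. It assumes an alphabet of the first 128 character codes (ASCII) and takes the pattern as a 0-based `String`; the class name `ComputeJumpsDemo` is illustrative only.

```java
final class ComputeJumpsDemo {
    /** Returns charJump[0..alpha-1] for the given pattern. */
    static int[] computeJumps(String pattern, int alpha) {
        int m = pattern.length();
        int[] charJump = new int[alpha];
        java.util.Arrays.fill(charJump, m);        // default: character not in the pattern
        for (int k = 1; k <= m; k++) {             // k is the 1-based position in the pattern
            char c = pattern.charAt(k - 1);
            if (c >= alpha) throw new IllegalArgumentException("character outside alphabet: " + c);
            charJump[c] = m - k;                   // distance from position k to the pattern's end
        }
        return charJump;
    }

    public static void main(String[] args) {
        int[] jump = computeJumps("ABABCB", 128);  // 128 = assumed ASCII alphabet size
        System.out.println("A: " + jump['A']);     // 3 (rightmost A is at position 3 of 6)
        System.out.println("B: " + jump['B']);     // 0 (B is the last character)
        System.out.println("C: " + jump['C']);     // 1
        System.out.println("Z: " + jump['Z']);     // 6 (Z does not occur in the pattern)
    }
}
```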
  • Straightforward algorithm: O(nm)
  • Finite-automata algorithm: O(n)
  • KMP algorithm: O(n+m)
    • Relatively easy to implement
    • Does not require random access to the text
  • BM algorithm: O(n+m) in the worst case, sublinear on average
    • Fewer character comparisons
    • The algorithm of choice in practice for string matching