
Pattern Matching


Presentation Transcript


  1. Pattern Matching

  2. Strings • A string is a sequence of characters • Examples of strings: • Java program • HTML document • DNA sequence • Digitized image • An alphabet Σ is the set of possible characters for a family of strings • Examples of alphabets: • ASCII • Unicode • {0, 1} • {A, C, G, T} • Let P be a string of size m • A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j • A prefix of P is a substring of the type P[0 .. i] • A suffix of P is a substring of the type P[i .. m - 1] • Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P • Applications: • Text editors • Search engines • Biological research Pattern Matching
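In Python, where the slide's P[i .. j] corresponds to the slice P[i:j+1], these definitions can be illustrated directly (the sample pattern is my own choice):

```python
P = "abacab"      # a sample pattern of size m = 6
m = len(P)

print(P[1:4])     # "bac": the substring P[1 .. 3]
print(P[:3])      # "aba": a prefix, P[0 .. 2]
print(P[3:])      # "cab": a suffix, P[3 .. m - 1]
```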

  3. Brute-Force Algorithm • The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either • a match is found, or • all placements of the pattern have been tried • Brute-force pattern matching runs in time O(nm) • Example of worst case: • T = aaa … ah • P = aaah • may occur in images and DNA sequences • unlikely in English text

  Algorithm BruteForceMatch(T, P)
    Input: text T of size n and pattern P of size m
    Output: starting index of a substring of T equal to P, or -1 if no such substring exists
    for i ← 0 to n - m { test shift i of the pattern }
      j ← 0
      while j < m ∧ T[i + j] = P[j]
        j ← j + 1
      if j = m
        return i { match at i }
      else
        break while loop { mismatch }
    return -1 { no match anywhere }

  Pattern Matching
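A direct Python rendering of the BruteForceMatch pseudocode above (a sketch; the sample strings are my own):

```python
def brute_force_match(T, P):
    """Return the first index where pattern P occurs in text T, or -1."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):          # each possible shift of P
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:
            return i                    # match at shift i
    return -1                           # no match anywhere

print(brute_force_match("abacaabaccabacabaabb", "abacab"))  # 10
```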

  4. Boyer-Moore Heuristics • The Boyer-Moore pattern matching algorithm is based on two heuristics • Start at the end: compare P with a subsequence of T moving backwards • Character-jump heuristic: when a mismatch occurs at T[i] = c • If P contains c, shift P to align the last occurrence of c in P with T[i] • Else, shift P to align P[0] with T[i + 1] • Example Pattern Matching

  5. Last-Occurrence Function • Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as • the largest index i such that P[i] = c, or • -1 if no such index exists • Example: • P = abacab • Σ = {a, b, c, d} • L(a) = 4, L(b) = 5, L(c) = 3, L(d) = -1 • The last-occurrence function can be represented by an array indexed by the numeric codes of the characters • The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ Pattern Matching
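A sketch of the last-occurrence computation in Python, using a dictionary in place of the code-indexed array (still O(m + s)):

```python
def last_occurrence(P, alphabet):
    """Map each character of the alphabet to its last index in P, or -1."""
    L = {c: -1 for c in alphabet}       # default: character not in P
    for i, c in enumerate(P):
        L[c] = i                        # later occurrences overwrite earlier
    return L

print(last_occurrence("abacab", "abcd"))
# {'a': 4, 'b': 5, 'c': 3, 'd': -1}
```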

  6. The Boyer-Moore Algorithm (last is abbreviated "l" in the figures; Case 1: j ≤ 1 + l, Case 2: 1 + l < j)

  Algorithm BoyerMooreMatch(T, P, Σ)
    L ← lastOccurrenceFunction(P, Σ)
    i ← m - 1
    j ← m - 1
    repeat
      if T[i] = P[j]
        if j = 0
          return i { match at i }
        else
          i ← i - 1
          j ← j - 1
      else { character-jump }
        last ← L[T[i]]
        i ← i + m - min(j, last + 1)
        j ← m - 1
    until i > n - 1 { beyond text length }
    return -1 { no match }

  Pattern Matching
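The BoyerMooreMatch pseudocode above can be sketched as runnable Python (identifiers are my own; the alphabet is any iterable of characters):

```python
def boyer_moore_match(T, P, alphabet):
    """Boyer-Moore with the character-jump heuristic; first match or -1."""
    n, m = len(T), len(P)
    L = {c: -1 for c in alphabet}       # last-occurrence function
    for k, c in enumerate(P):
        L[c] = k
    i = j = m - 1                       # start comparing at the end of P
    while i <= n - 1:
        if T[i] == P[j]:
            if j == 0:
                return i                # match at i
            i -= 1
            j -= 1
        else:                           # character jump
            last = L[T[i]]
            i += m - min(j, last + 1)
            j = m - 1
    return -1                           # no match

print(boyer_moore_match("aaabcd", "bcd", "abcd"))  # 3
```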

  7. Update function? • i ← i + m - min(j, last + 1) • Why the min? • If the last occurrence of the character is to the left of where you are looking, you just shift so that the characters align. The amount added to i is m - (last + 1). • If the last occurrence is to the right of where you are looking, aligning them would require a NEGATIVE shift. That is not good. In that case, the whole pattern just shifts by 1. HOWEVER, the code to do that is NOT obvious. • Recall j starts out at m - 1 and then decreases. i also starts out at the end of the pattern and decreases. Thus, if you add m - j to i, you are really moving the starting point to just one higher than the starting value of i for the current pass. Try it with some numbers to see. Pattern Matching

  8. Example • i = 4, j = 3, last(a) = 4, m = 6 • j < last(a) + 1, so i ← 4 + 6 - 3 = 7 • The last a is to the right of where we are currently, so the pattern just shifts right by 1 Pattern Matching

  9. Analysis • Boyer-Moore's algorithm runs in time O(nm + |Σ|) • The |Σ| comes from initializing the last-occurrence function. We expect |Σ| to be less than nm, but if it weren't we add it to be safe. • Example of worst case: • T = aaa … a • P = baaa • The worst case may occur in images and DNA sequences but is unlikely in English text • Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text Pattern Matching

  10. The KMP Algorithm - Motivation • Knuth-Morris-Pratt's algorithm compares the pattern to the text left to right, but shifts the pattern more intelligently than the brute-force algorithm. • It takes advantage of the fact that we KNOW what we have already seen while matching the pattern. • When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons? • Answer: the largest prefix of P[0..j] that is a suffix of P[1..j] • Example: text ... a b a a b x ..., pattern a b a a b a, mismatch at position j (the x). At this point the prefix ab of the pattern matches the suffix ab of the PARTIAL match, so there is no need to repeat those comparisons; resume comparing here by setting j to 2. Pattern Matching

  11. KMP Failure Function • Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself • The failure function F(j) is defined as the size of the largest prefix of the pattern that is also a suffix of the partial pattern P[1..j] • Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j - 1), which resets where we are in the pattern. • Notice we don't reset i, but just continue from this point on. Pattern Matching

  12. The KMP Algorithm • At each iteration of the while-loop, either • i increases by one, or • the shift amount i - j increases by at least one (observe that F(j - 1) < j) • One might worry about being stuck in the branch where i doesn't increase. • Amortized analysis: while sometimes we just shift the pattern without moving i, we can't do that forever, as we must have moved forward in i and j before we can just shift the pattern. • Hence, there are no more than 2n iterations of the while-loop • Thus, KMP's algorithm runs in optimal time O(n) plus the cost of computing the failure function.

  Algorithm KMPMatch(T, P)
    F ← failureFunction(P)
    i ← 0
    j ← 0
    while i < n
      if T[i] = P[j]
        if j = m - 1
          return i - j { match }
        else { keep going }
          i ← i + 1
          j ← j + 1
      else if j > 0
        j ← F[j - 1]
      else { at the first position we can't use the failure function }
        i ← i + 1
    return -1 { no match }

  Pattern Matching
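The KMPMatch pseudocode, combined with the failure-function construction from the next slide so that it is self-contained, can be sketched in Python (names are illustrative):

```python
def kmp_match(T, P):
    """KMP matching: first index of P in T, or -1. Runs in O(n + m)."""
    n, m = len(T), len(P)
    # Failure function: F[i] = size of the largest prefix of P that is
    # also a suffix of the partial pattern P[1..i].
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1
            i, j = i + 1, j + 1
        elif j > 0:
            j = F[j - 1]
        else:
            F[i] = 0
            i += 1
    # The search itself: i never moves backward through the text.
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j            # match
            i, j = i + 1, j + 1
        elif j > 0:
            j = F[j - 1]                # reuse what we already matched
        else:
            i += 1
    return -1                           # no match

print(kmp_match("abacaabaccabacabaabb", "abacab"))  # 10
```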

  13. Computing the Failure Function • The failure function can be represented by an array and can be computed in O(m) time • The construction is similar to the KMP algorithm itself • At each iteration of the while-loop, either • i increases by one, or • the shift amount i - j increases by at least one (observe that F(j - 1) < j) • Hence, there are no more than 2m iterations of the while-loop • So the total complexity of KMP is O(m + n)

  Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
      if P[i] = P[j] { we have matched j + 1 chars }
        F[i] ← j + 1
        i ← i + 1
        j ← j + 1
      else if j > 0 { use failure function to shift P }
        j ← F[j - 1]
      else { no match }
        F[i] ← 0
        i ← i + 1

  Pattern Matching
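The failureFunction pseudocode translates almost line for line into Python; the sample patterns below are from earlier slides:

```python
def failure_function(P):
    """Compute the KMP failure function as an array, in O(m) time."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:                  # at most 2m iterations: each one either
        if P[i] == P[j]:          # advances i or grows the shift i - j
            F[i] = j + 1          # we have matched j + 1 characters
            i, j = i + 1, j + 1
        elif j > 0:
            j = F[j - 1]          # use the failure function to shift P
        else:
            F[i] = 0              # no match
            i += 1
    return F

print(failure_function("abacab"))  # [0, 0, 1, 0, 1, 2]
print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
```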

  14. Example • Note, we start comparing from the left. • Fail at 6, reset j to 1 (F of previous location) • Fail at 7, reset j to 0 (F of previous location) • Fail at 12, reset j to 0 (F of previous location) • Fail at 13, reset j to 0 Pattern Matching

  15. Binary Failure Function • Your programming assignment (due 10/29) extends the idea of KMP string matching. • If the input is binary in nature (only two symbols are used, such as x/y or 0/1), when you fail to match an x, you know you are looking at a y. • Normally, when you fail, you ask "How much of the PREVIOUS pattern matches?" and then check the current location again. • With binary input, you can ask, "How much of the pattern, including the known value of the character I failed to match, can be shifted on top of itself?" Pattern Matching

  16. Example Pattern Matching

  17. Can you find the Binary Failure Function (given the regular failure function?) for the TWO pattern strings below? Pattern Matching

  18. Can you find the Binary Failure Function (given the regular failure function?) Pattern Matching

  19. Tries

  20. Preprocessing Strings • Preprocessing the pattern speeds up pattern matching queries • After preprocessing the pattern, KMP's algorithm performs pattern matching in time proportional to the text size • If the text is large, immutable, and searched often (e.g., works by Shakespeare), we may want to preprocess the text instead of the pattern • A trie (pronounced TRY) is a compact data structure for representing a set of strings, such as all the words in a text • A trie supports pattern matching queries in time proportional to the pattern size Tries

  21. Standard Trie • The standard trie for a set of strings S is an ordered tree such that: • Each node but the root is labeled with a character • The children of a node are alphabetically ordered • The paths from the root to the external nodes yield the strings of S • Example: standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop } Tries

  22. Standard Trie (cont.) • A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where: • n = total size of the strings in S • m = size of the string parameter of the operation • d = size of the alphabet Tries
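As an illustrative sketch (class names are my own), a standard trie with dictionary child maps supports the operations above; each lookup walks one node per pattern character:

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode (at most d entries)
        self.is_word = False    # marks an external/word-ending node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):     # O(m) node steps for a word of size m
        node = self.root
        for c in word:
            node = node.children.setdefault(c, TrieNode())
        node.is_word = True

    def search(self, word):     # True only for complete inserted words
        node = self.root
        for c in word:
            if c not in node.children:
                return False
            node = node.children[c]
        return node.is_word

t = Trie()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    t.insert(w)
print(t.search("bull"), t.search("bul"))  # True False
```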

  23. Word Matching with a Trie • We insert the words of the text into a trie • Each leaf stores the occurrences of the associated word in the text Tries

  24. Compressed Trie • A compressed trie has internal nodes of degree at least two • It is obtained from standard trie by compressing chains of “redundant” nodes Tries

  25. Compact Representation • Compact representation of a compressed trie for an array of strings: • Stores at the nodes ranges of indices instead of substrings, in order to make nodes a fixed size • Uses O(s) space, where s is the number of strings in the array • Serves as an auxiliary index structure Tries

  26. Suffix Trie • The compressed trie doesn't work if you don't start at the beginning of the word. Suppose you were allowed to start ANYWHERE in the word. • The suffix trie of a string X is the compressed trie of all the suffixes of X Tries

  27. Suffix Trie (cont.) - shown as numbers • Compact representation of the suffix trie for a string X of size n from an alphabet of size d: • Uses O(n) space • Supports arbitrary pattern matching queries in X in O(dm) time, where m is the size of the pattern Tries

  28. Encoding Trie • A code is a mapping of each character of an alphabet to a binary code-word • A prefix code is a binary code such that no code-word is the prefix of another code-word • An encoding trie represents a prefix code • Each leaf stores a character • The code word of a character is given by the path from the root to the leaf storing the character (0 for a left child and 1 for a right child) Tries

  29. Encoding Trie (cont.) • Given a text string X, we want to find a prefix code for the characters of X that yields a small encoding for X • Frequent characters should have short code-words • Rare characters should have long code-words • Example • X = abracadabra • T1 encodes X into 29 bits • T2 encodes X into 24 bits Tries

  30. Huffman's Algorithm • Given a string X, Huffman's algorithm constructs a prefix code that minimizes the size of the encoding of the string. • It runs in time O(n + d log d), where n is the size of the string and d is the number of distinct characters of the string • A heap-based priority queue is used as an auxiliary structure

  Algorithm HuffmanEncoding(X)
    Input: string X of size n
    Output: optimal encoding trie for X
    C ← distinctCharacters(X)
    computeFrequencies(C, X)
    Q ← new empty heap
    for all c ∈ C
      T ← new single-node tree storing c
      Q.insert(getFrequency(c), T)
    while Q.size() > 1
      f1 ← Q.minKey()
      T1 ← Q.removeMin()
      f2 ← Q.minKey()
      T2 ← Q.removeMin()
      T ← join(T1, T2)
      Q.insert(f1 + f2, T)
    return Q.removeMin()

  Tries
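A sketch of Huffman's algorithm in Python using the standard-library heap as the priority queue (tree representation and names are my own; for abracadabra the optimal encoding works out to 23 bits, slightly better than the 24-bit trie T2 shown two slides back):

```python
import heapq
from collections import Counter

def huffman_codes(X):
    """Build an optimal prefix code for string X via Huffman's algorithm."""
    freq = Counter(X)
    # Heap entries are (frequency, tie-breaker, tree); a tree is either a
    # single character or a (left, right) pair. The unique tie-breaker
    # keeps the heap from ever comparing two trees directly.
    heap = [(f, i, c) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # the two least frequent trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (t1, t2)))  # join them
        next_id += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, tuple):
            walk(tree[0], path + "0")       # 0 for a left child
            walk(tree[1], path + "1")       # 1 for a right child
        else:
            codes[tree] = path or "0"       # lone-character edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("abracadabra")
print(sum(len(codes[c]) for c in "abracadabra"))  # 23
```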

  31. Example • X = abracadabra • Frequencies: a = 5, b = 2, c = 1, d = 1, r = 2 • The choice of which two trees to join at each step is not unique Tries

  32. At your seats • X = catinthehat • Frequencies in English text • Create the tree • Use it to decode

  33. Text Similarity • Detect similarity to focus on, or ignore, slight differences • a. DNA analysis • b. Web crawlers omit duplicate pages, distinguish between similar ones • c. Updated files, archiving, delta files, and editing distance Pattern Matching

  34. Longest Common Subsequence • One measure of similarity is the length of the longest common subsequence between two texts. This is NOT a contiguous substring, so it loses a great deal of structure. I doubt that it is an effective metric for all types of similarity, unless the subsequence is a substantial part of the whole text. Pattern Matching

  35. The LCS algorithm uses the dynamic programming approach • Recall: the first step is to find the recursion. How do we write LCS in terms of other LCS problems? • The parameters for the smaller problems being composed to solve a larger problem are the lengths of a prefix of X and a prefix of Y. Pattern Matching

  36. Find the recursion • Let L(i, j) be the length of the LCS between two strings X(0..i) and Y(0..j). • Suppose we know L(i, j), L(i+1, j) and L(i, j+1) and want to know L(i+1, j+1). • a. If X[i+1] = Y[j+1], then the best we can do is to get an LCS of L(i, j) + 1. • b. If X[i+1] ≠ Y[j+1], then it is max(L(i, j+1), L(i+1, j)) Pattern Matching

  37. Longest Common Subsequence • One measure of similarity is the length of the longest common subsequence between two texts. Pattern Matching

  38. This algorithm initializes the array or table for L by putting 0's along the borders, then runs a simple nested loop filling in values row by row. Thus it runs in O(nm). • While the algorithm only tells the length of the LCS, the actual string can easily be found by working backward through the table (and strings), noting points at which the two characters are equal Pattern Matching
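The table-filling loop and the backward walk described above can be sketched as follows (a hedged translation; the sample strings are my own, and which length-4 subsequence is recovered depends on tie-breaking):

```python
def lcs(X, Y):
    """LCS by dynamic programming; returns (length, one such subsequence)."""
    n, m = len(X), len(Y)
    # L[i][j] = length of the LCS of X[:i] and Y[:j]; borders stay 0
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Work backward through the table, noting where the characters are equal
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return L[n][m], "".join(reversed(out))

length, s = lcs("abcbdab", "bdcaba")
print(length)  # 4
```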

  39. Longest Common Subsequence • Mark the table with info to generate the string • Every diagonal move shows what is part of the LCS Pattern Matching

  40. Try this one… Pattern Matching

  41. The rest of the material in these notes is not in your text (except as exercises) • Sequence Comparisons • Problems in molecular biology involve finding the minimum number of edit steps required to change one string into another. • Three types of edit steps: insert, delete, replace. • (replace may cost extra, as it is like a delete plus an insert) • The non-edit step is "match", costing zero. • Example: abbc → babb • abbc → bbc → babc → babb (3 steps) • abbc → babbc → babb (2 steps) • We are trying to minimize the number of steps. Pattern Matching

  42. Idea: look at making just one position right. Find all the ways you could do it. • Count how long each would take (using recursion) and figure the best cost. • Then use dynamic programming: an orderly way of limiting the exponential number of combinations to think about. • For ease in coding, we make the last character correct (rather than any other). Pattern Matching

  43. First steps to dynamic programming • Think of the problem recursively. Find your prototype – what comes in and what comes out. • Int C(n,m) returns the cost of turning the first n characters of the source string (A) into the first m characters of the destination string (B). • Now, find the recursion. You have a helper who will do ANY smaller sub-problem of the same variety. What will you have them do? Be lazy. Let the helper do MOST of the work. Pattern Matching

  44. Types of edit steps: insert, delete, replace, match. Consider match to be "free" but the others to cost 1. There are four possibilities (pick the cheapest): • 1. If we delete a_n, we need to change A(0..n-1) to B(0..m). The cost is C(n, m) = C(n-1, m) + 1. (C(n, m) is the cost of changing the first n characters of A to the first m characters of B.) • 2. If we insert a new value at the end of A(n) to match b_m, we would still have to change A(n) to B(m-1). The cost is C(n, m) = C(n, m-1) + 1. • 3. If we replace a_n with b_m, we still have to change A(n-1) to B(m-1). The cost is C(n, m) = C(n-1, m-1) + 1. • 4. If we match a_n with b_m, we still have to change A(n-1) to B(m-1). The cost is C(n, m) = C(n-1, m-1). Pattern Matching
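The four cases above fill a table bottom-up; a sketch in Python (cost function and names are my own, using the slide's costs: match free, insert/delete/replace cost 1):

```python
def edit_distance(A, B):
    """Minimum insert/delete/replace steps to turn A into B."""
    n, m = len(A), len(B)
    # C[i][j] = cost of turning the first i chars of A into the first j of B
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        C[i][0] = i                 # delete everything
    for j in range(m + 1):
        C[0][j] = j                 # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if A[i - 1] == B[j - 1] else 1     # case 4 or case 3
            C[i][j] = min(C[i - 1][j] + 1,             # case 1: delete
                          C[i][j - 1] + 1,             # case 2: insert
                          C[i - 1][j - 1] + sub)       # replace / match
    return C[n][m]

print(edit_distance("abbc", "babb"))  # 2
print(edit_distance("do", "redo"))    # 2
```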

  45. We have turned one problem into three problems - just slightly smaller. • Bad situation - unless we can reuse results. Dynamic Programming. • We store the results of C(i,j) for i = 1,n and j = 1,m. • If we need to reconstruct how we would achieve the change, we store both the cost and an indication of which set of subproblems was used. Pattern Matching

  46. We also store M(i, j), which indicates which of the four decisions leads to the best result. • Complexity: O(mn), but it needs O(mn) space as well. • Consider changing do to redo: • Consider changing mane to mean: Pattern Matching

  47. At your seats try Changing “mane” to “mean” Pattern Matching

  48. Changing “mane” to “mean” Pattern Matching

  49. Changing "do" to "redo" • Assume: match is free; the others cost 1. • I show the choices as I- or R-, etc., but could have shown them with an arrow as well Pattern Matching

  50. Another problem: Longest Increasing Subsequence of a single list • Find the longest increasing subsequence in a sequence of distinct integers. • Example: 5 1 10 2 20 30 40 • Why do we care? Classic problem: • 1. Computational biology: related to the MUMmer system for aligning genomes. • 2. Card games • 3. Airline boarding problem • 4. Maximization problems in a random environment. • How do we solve it? Pattern Matching
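One standard answer (not given on the slide itself, so this is a sketch of a well-known technique, not the course's intended solution): a patience-sorting-style method with binary search runs in O(n log n).

```python
import bisect

def lis_length(seq):
    """Length of the longest increasing subsequence, in O(n log n).

    tails[k] holds the smallest possible tail value of an increasing
    subsequence of length k + 1 found so far.
    """
    tails = []
    for x in seq:
        k = bisect.bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)      # x extends the longest subsequence so far
        else:
            tails[k] = x         # x gives a smaller tail for length k + 1
    return len(tails)

print(lis_length([5, 1, 10, 2, 20, 30, 40]))  # 5
```

For the slide's example 5 1 10 2 20 30 40, one longest increasing subsequence is 1 2 20 30 40.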
