
CS4413 Matching Algorithms (These materials are used in the classroom only)


moya

Presentation Transcript


  1. CS4413 Matching Algorithms (These materials are used in the classroom only)

  2. Two important concepts • Finite automata. • Character strings.

  3. String Matching with Finite Automata • Finite Automata • M = (Q, q0, A, ∑, δ) • Q: a finite set of states • q0 ∈ Q: the initial state • A ⊆ Q: the accepting states • ∑: the input alphabet • δ: the transition function from Q × ∑ into Q • M accepts (or rejects) an input string • M induces a final-state function φ from ∑* to Q • M scans the string w and ends in state φ(w); it accepts w if and only if φ(w) ∈ A

  4. Simple Automata A simple two state finite automaton with state set Q = {0,1}, start state q0 = 0, and input alphabet ∑= {a, b}.
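An automaton of this kind can be simulated in a few lines. The sketch below is only an illustration: the slide's diagram is not reproduced in this transcript, so the transition table and the accepting set {1} are assumptions (with them, the automaton accepts exactly the strings that end in an odd number of a's), and the names `DELTA`, `final_state`, and `accepts` are hypothetical.

```python
# Two-state automaton: Q = {0, 1}, q0 = 0, input alphabet {a, b}.
# ASSUMPTION: the transitions and the accepting set A = {1} are illustrative;
# the slide's own diagram is not available in this transcript.
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 0, (1, "b"): 0,
}
START = 0
ACCEPTING = {1}


def final_state(w):
    """Return the state reached after scanning w from the start state."""
    state = START
    for ch in w:
        state = DELTA[(state, ch)]
    return state


def accepts(w):
    """M accepts w exactly when the final state is an accepting state."""
    return final_state(w) in ACCEPTING
```

Under these assumed transitions, `accepts("abaaa")` holds (the string ends in three a's) while `accepts("abbaa")` does not.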

  5. Automata Example

  6. String Matching • PROBLEM: find the occurrences of a given substring, called the pattern, in another string, called the text. • Applications • Text processing of character strings. • Matching a string of bytes containing graphical data or machine code. • Scanning files for computer-virus signatures. • Searching for particular patterns in DNA sequences.

  7. Notation • T – the text in which we search for a pattern. • n – the length of the text T. • P – the pattern being searched for. • m – the length of the pattern P. • P[i], T[i] – the i-th character of P and T, respectively.

  8. String Matching • We formalize the string-matching problem as follows: • We assume that the text is an array T[ 1...n] of length n and that the pattern is an array P[1...m] of length m. • We further assume that the elements of P and T are characters drawn from a finite alphabet ∑. For example, we may have ∑ = {0, 1} or ∑ = { a, b, ..., z }. The character arrays P and T are often called strings of characters.

  9. String Matching • We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s + 1 in text T) if 0 ≤ s ≤ n - m and T[s + 1..s + m] = P[1..m] (that is, if T[s + j] = P[j], for 1 ≤ j ≤ m). • If P occurs with shift s in T, then we call s a valid shift; otherwise, we call s an invalid shift. The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.

  10. String Matching • Finite Automata • A finite automaton M is a 5-tuple (Q, q0, A, ∑, δ), where • Q is a finite set of states • q0 ∈ Q is the start state • A ⊆ Q is a distinguished set of accepting states • ∑ is a finite input alphabet • δ is a function from Q × ∑ into Q, called the transition function of M.

  11. Simple String Matching • INPUT: P of length m and T of length n. • PRECONDITION: P is nonempty. • OUTPUT: The index in T where a copy of P begins or -1 if no match for P is found.

  12. Naïve String-Matching Algorithm • The naïve algorithm finds all valid shifts using a loop that checks the condition P[1..m] = T[s + 1..s + m] for each of the n - m + 1 possible values of s.

  13. Naïve String-Matching Algorithm Naïve-String-Matcher (T, P) • n = length[T] • m = length[P] • for s = 0 to n - m • compare T[s+1] T[s+2] … T[s+m] to P[1] P[2] … P[m] • if all m characters match, print “pattern occurs with shift” s (or return s if only the first match is needed).
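The loop above can be sketched in Python as follows. This is a minimal illustration, not the course's reference code; the function name `naive_string_matcher` is hypothetical, and because Python strings are 0-based, shift s compares `text[s:s + m]`, which corresponds to T[s+1..s+m] in the slide's 1-based notation.

```python
def naive_string_matcher(text, pattern):
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):           # the n - m + 1 candidate shifts
        if text[s:s + m] == pattern:     # T[s+1..s+m] = P[1..m] in 1-based terms
            shifts.append(s)
    return shifts
```

Returning a list of all shifts matches the formal problem statement; a variant that returns only the first shift (or -1) would match the "Simple String Matching" specification instead.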

  14. Examples Example: How many comparisons (both successful and unsuccessful) will be made by the brute-force string-matching algorithm in searching for each of the following patterns in the binary text of 1000 zeros? • 00001 • 10000 • 01010

  15. Worst Case • The worst case happens when, at every shift, the first m-1 characters match and the last one does not. • Example: pattern a a a b, text a a a a a a a a a a a a a a a a a a a a a a b • The running time is Θ((n - m + 1) m) in the worst case.
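The worst-case bound can be checked by counting comparisons directly. The sketch below is an illustration under stated assumptions: characters are compared left to right, each shift stops at its first mismatch, and the search runs over all shifts rather than stopping at the first match. The name `count_comparisons` is hypothetical.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by the brute-force matcher.

    ASSUMPTION: left-to-right comparison, stopping each shift at the
    first mismatch, and continuing through all n - m + 1 shifts.
    """
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                    # mismatch: move on to the next shift
    return comparisons
```

For the pattern aaab and a text of ten a's, every one of the 7 shifts costs 4 comparisons, so the count is 28 = (n - m + 1) m.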

  16. Analysis • The worst case does not occur often in natural-language text. • Empirical studies show that the algorithm performs only about 1.1 comparisons per character of T (up to the point where a match is found).

  17. Analysis … • Naïve string-matcher is inefficient because information gained about the text for one value of s is totally ignored in considering other values of s. • Such information can be very valuable, however. • For example, if P = aaab and we find that s = 0 is valid, then none of the shifts 1, 2, or 3 are valid, since T[4] = b.

  18. Input Enhancement in String Matching • The Knuth-Morris-Pratt algorithm – compares pattern characters left to right. • The Boyer-Moore algorithm – compares pattern characters right to left; a simplified version of it is Horspool’s algorithm.

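  19. Horspool’s Algorithm Example: the pattern B A R B E R is aligned against the text s0 … c … sn-1 so that its last character lies under the text character c. • Case 1: if there are no c’s in the pattern – e.g., c is the letter S in our example – we can shift the pattern by its entire length. Text: s0 … S … sn-1; pattern before the shift: B A R B E R (final R under S); after the shift: B A R B E R, moved 6 positions to the right.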

  20. Horspool’s Algorithm (contd.) • Case 2: if there are occurrences of character c in the pattern but it is not the last one there – e.g., c is the letter B in our example – the shift should align the rightmost occurrence of c in the pattern with the c in the text. Text: s0 … B … sn-1; pattern before the shift: B A R B E R (final R under B); after the shift: B A R B E R, moved 2 positions to the right so that its second B lies under the text’s B.

  21. Horspool’s Algorithm (contd.) • Case 3: if c happens to be the last character in the pattern but there are no c’s among its other m-1 characters, the shift should be the entire pattern length m. Text: s0 … M E R … sn-1; pattern before the shift: L E A D E R (final R under R); after the shift: L E A D E R, moved 6 positions to the right.

  22. Horspool’s Algorithm (contd.) • Case 4: finally, if c happens to be the last character in the pattern and there are other c’s among its first m-1 characters, the shift should be such that the rightmost occurrence of c among the first m-1 characters is aligned with the text’s c. Text: s0 … O R … sn-1; pattern before the shift: R E O R D E R (final R under R); after the shift: R E O R D E R, moved 3 positions to the right so that its middle R lies under the text’s R.

  23. Horspool’s Algorithm (contd.) • Compute the shift value t(c) as follows: t(c) = the pattern length m, if c is not among the first m-1 characters of the pattern; otherwise, t(c) = the distance from the rightmost c among the first m-1 characters of the pattern to its last character. •
      ALGORITHM ShiftTable(P[0..m-1])
      // Fills the shift table used by Horspool’s and Boyer-Moore algorithms
      // Input: Pattern P[0..m-1] and an alphabet of possible characters
      // Output: Table[0..size-1] indexed by the alphabet’s characters and
      //         filled with shift sizes computed by the formula above
      initialize all the elements of Table with m
      for j ← 0 to m-2 do Table[P[j]] ← m-1-j
      return Table
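ShiftTable translates almost line for line into Python. The sketch below is a minimal version that uses a dictionary in place of an array indexed by character; the function name `shift_table` and the explicit `alphabet` parameter are illustrative choices, not part of the slides.

```python
def shift_table(pattern, alphabet):
    """Build Horspool's shift table for pattern over the given alphabet."""
    m = len(pattern)
    table = {c: m for c in alphabet}     # default shift: the whole pattern length
    for j in range(m - 1):               # first m-1 characters, left to right
        table[pattern[j]] = m - 1 - j    # later (rightmost) occurrences overwrite
    return table
```

For the pattern BARBER this gives shifts of 4 for A, 2 for B, 1 for E, 3 for R, and 6 for every other character. Scanning left to right is what makes the rightmost occurrence win, since each later assignment overwrites the earlier one.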

  24. Horspool’s Algorithm (contd.) Horspool’s algorithm: • Step 1: For a given pattern of length m and the alphabet used in both the pattern and the text, construct the shift table as described above. • Step 2: Align the pattern against the beginning of the text. • Step 3: Repeat the following until either a matching substring is found or the pattern reaches beyond the last character of the text. Starting with the last character in the pattern, compare the corresponding characters in the pattern and the text until either all m characters are matched (then stop) or a mismatching pair is encountered. In the latter case, retrieve the entry t(c) from the shift table, where c is the text character currently aligned against the last character of the pattern, and shift the pattern by t(c) characters to the right along the text.

  25. Horspool’s Algorithm (contd.)
      ALGORITHM HorspoolMatching(P[0..m-1], T[0..n-1])
      // Implements Horspool’s algorithm for string matching
      // Input: Pattern P[0..m-1] and text T[0..n-1]
      // Output: The index of the left end of the first matching
      //         substring or -1 if there are no matches
      ShiftTable(P[0..m-1])        // generate Table of shifts
      i ← m-1                      // position of the pattern’s right end
      while i ≤ n-1 do
          k ← 0                    // number of matched characters
          while k ≤ m-1 and P[m-1-k] = T[i-k] do
              k ← k+1
          if k = m
              return i-m+1
          else i ← i + Table[T[i]]
      return -1
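A runnable Python sketch of HorspoolMatching follows. It is an illustration rather than a definitive implementation: the function name `horspool_matching` is hypothetical, the shift table is stored sparsely in a dictionary (characters missing from it shift by m), and the handling of an empty pattern is an assumption that the pseudocode does not cover.

```python
def horspool_matching(pattern, text):
    """Return the index of the left end of the first match, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0                         # ASSUMPTION: empty pattern matches at 0
    # Sparse shift table: only the first m-1 pattern characters are stored.
    table = {pattern[j]: m - 1 - j for j in range(m - 1)}
    i = m - 1                            # text index under the pattern's right end
    while i <= n - 1:
        k = 0                            # number of matched characters
        while k <= m - 1 and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1
        i += table.get(text[i], m)       # shift by t(c), c = text[i]
    return -1
```

The result should agree with Python's built-in `str.find` on any input, which gives a convenient way to check the sketch.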

  26. Horspool’s Algorithm • Exercise: Apply Horspool’s algorithm to search for the pattern BAOBAB in the text BESS_KNEW_ABOUT_BAOBABS

  27. Horspool’s Algorithm • Exercise: Consider the problem of searching for genes in DNA sequences using Horspool’s algorithm. A DNA sequence is represented by a text on the alphabet {A, C, G, T}, and the gene or gene segment is the pattern. • (a) Construct the shift table for the following gene segment of your chromosome 10: TCCTATTCTT • (b) Apply Horspool’s algorithm to locate the pattern in the following DNA sequence: • TTATAGATCTCGTATTCTTTTATAGATCTCCTATTCTT

  28. Horspool’s Algorithm • Exercise: How many character comparisons will be made by Horspool’s algorithm in searching for the following patterns in the binary text of 1000 zeros? • 00001 • 10000 • 01010

  29. Prestructuring • Hashing and B-trees are examples of prestructuring. • In general, a hash function needs to satisfy two somewhat conflicting requirements: • 1) A hash function needs to distribute keys among the cells of the hash table as evenly as possible. • 2) A hash function has to be easy to compute.

  30. Hashing … • Hashing • Hash Table • Hash Function • Hash Address • Collisions • Open Hashing (Separate Chaining) • Closed Hashing (Open Addressing) (example: Linear Probing – checks the cell following the one where the collision occurs) – implies that the table size m must be at least as large as the number of keys n.

  31. Hashing • Letter codes: A: 1, B: 2, C: 3, D: 4, …, Z: 26 • Hash function: key mod 13

  32. Open Hashing
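Open hashing (separate chaining) keeps a list of keys in each cell. The sketch below is a minimal illustration that assumes integer keys and the hash function h(K) = K mod m; the class name `ChainedHashTable` and its methods are hypothetical.

```python
class ChainedHashTable:
    """Open hashing: each cell holds a list (chain) of the keys hashed to it."""

    def __init__(self, m):
        self.m = m
        self.cells = [[] for _ in range(m)]

    def _hash(self, key):
        return key % self.m              # ASSUMPTION: integer keys, h(K) = K mod m

    def insert(self, key):
        chain = self.cells[self._hash(key)]
        if key not in chain:
            chain.append(key)            # new keys go to the end of the chain

    def search(self, key):
        return key in self.cells[self._hash(key)]

    def load_factor(self):
        n = sum(len(chain) for chain in self.cells)
        return n / self.m                # alpha = n / m
```

With m = 13, the keys 1, 14, and 27 all hash to cell 1 and end up in the same chain, so a search for 27 has to walk past the other two.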

  33. Closed Hashing
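Closed hashing stores every key in the table itself and resolves collisions by probing. The sketch below illustrates linear probing under the same assumptions as before (integer keys, h(K) = K mod m); the class name `LinearProbingTable` is hypothetical, and deletion is deliberately left out.

```python
class LinearProbingTable:
    """Closed hashing with linear probing: all keys live in the table itself."""

    def __init__(self, m):
        self.m = m
        self.cells = [None] * m

    def _hash(self, key):
        return key % self.m              # ASSUMPTION: integer keys, h(K) = K mod m

    def insert(self, key):
        i = self._hash(key)
        for _ in range(self.m):          # at most m probes
            if self.cells[i] is None:
                self.cells[i] = key
                return
            if self.cells[i] == key:
                return                   # key already present
            i = (i + 1) % self.m         # check the next cell, wrapping around
        raise OverflowError("table is full")

    def search(self, key):
        i = self._hash(key)
        for _ in range(self.m):
            if self.cells[i] is None:
                return False             # an empty cell ends an unsuccessful search
            if self.cells[i] == key:
                return True
            i = (i + 1) % self.m
        return False
```

Because every key occupies its own cell, the table can hold at most m keys, which is why the table size must be at least as large as the number of keys.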

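  34. Hashing • If the hash function distributes n keys among the m cells of the hash table evenly, each list will be about n/m keys long. Load factor: α = n/m • Efficiency of hashing (open hashing): the standard estimates for the average search cost are S ≈ 1 + α/2 for a successful search and U = α for an unsuccessful one. • Efficiency of hashing (closed hashing with linear probing): S ≈ ½(1 + 1/(1 - α)) and U ≈ ½(1 + 1/(1 - α)²).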

  35. Hashing • Exercise: For the input 30, 20, 56, 75, 31, 19 and hash function h(K) = K mod 11 • (a) Construct the open hash table. • (b) Find the largest number of key comparisons in a successful search in this table. • (c) Find the average number of key comparisons in a successful search in this table.

  36. Hashing • Exercise: For the input 30, 20, 56, 75, 31, 19 and hash function h(K) = K mod 11 • (a) Construct the closed hash table. • (b) Find the largest number of key comparisons in a successful search in this table. • (c) Find the average number of key comparisons in a successful search in this table.

  37. END • End of Chapter 5.
