CS4413 Matching Algorithms (These materials are used in the classroom only)

Two important concepts • Finite automata. • Character strings.

String Matching with Finite Automata • Finite Automata • M = (Q, q0, A, ∑, δ) • Q: a finite set of states • q0 ∈ Q: the initial state • A ⊆ Q: the set of accepting states • ∑: the input alphabet • δ: the transition function from Q × ∑ to Q • M accepts (or rejects) an input string • M induces a final-state function φ from ∑* to Q • M scans the string w and ends in the state φ(w); it accepts w if and only if φ(w) ∈ A

Simple Automata A simple two state finite automaton with state set Q = {0,1}, start state q0 = 0, and input alphabet ∑= {a, b}.

Automata Example

String Matching • PROBLEM: find an occurrence of a given substring, called the pattern, in another string, called the text. • Applications • Text processing of character strings. • Matching a string of bytes containing graphical data or machine code. • Scanning files for computer-virus signatures. • Searching for particular patterns in DNA sequences.

Notation • T – the text in which we search for a pattern. • n – the length of the text T. • P – the pattern being searched for. • m – the length of the pattern P. • P_i, T_i – the i-th characters of P and T, respectively.

String Matching • We formalize the string-matching problem as follows: • We assume that the text is an array T[ 1...n] of length n and that the pattern is an array P[1...m] of length m. • We further assume that the elements of P and T are characters drawn from a finite alphabet ∑. For example, we may have ∑ = {0, 1} or ∑ = { a, b, ..., z }. The character arrays P and T are often called strings of characters.

String Matching • We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s + 1 in text T) if 0 ≤ s ≤ n - m and T[s + 1 .. s + m] = P[1...m] (that is, if T[s + j] = P[j], for 1 ≤ j ≤ m). • If P occurs with shift s in T, then we call s a valid shift; otherwise, we call s an invalid shift. The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.

String Matching • Finite Automata • A finite automaton M is a 5-tuple (Q, q0, A, ∑, δ), where • Q is a finite set of states • q0 ∈ Q is the start state • A ⊆ Q is a distinguished set of accepting states • ∑ is a finite input alphabet • δ is a function from Q × ∑ into Q, called the transition function of M.
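The 5-tuple definition can be sketched in Python as follows. This is only an illustration: the transition table `DELTA`, the accepting set `A = {1}`, and the helper names `final_state` and `accepts` are assumptions and are not taken from the slides' figures.

```python
# A hypothetical two-state automaton over the alphabet {a, b}.
Q = {0, 1}            # finite set of states
Q0 = 0                # start state
A = {1}               # accepting states (assumed for this sketch)
SIGMA = {"a", "b"}    # input alphabet
# Transition function delta: Q x SIGMA -> Q (assumed for this sketch).
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 0}


def final_state(w):
    """Return the state the automaton ends in after scanning w."""
    state = Q0
    for ch in w:
        if ch not in SIGMA:
            raise ValueError(f"symbol {ch!r} is not in the alphabet")
        state = DELTA[(state, ch)]
    return state


def accepts(w):
    """The automaton accepts w exactly when the final state is in A."""
    return final_state(w) in A
```

With this assumed table, the automaton accepts exactly the strings that end in an odd number of a's.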

Simple String Matching • INPUT: P of length m and T of length n. • PRECONDITION: P is nonempty. • OUTPUT: The index in T where a copy of P begins or -1 if no match for P is found.

Naïve String-Matching Algorithm • The naïve algorithm finds all valid shifts using a loop that checks the condition P[1 … m] = T[s+1 … s+m] for each of the n – m + 1 possible values of s.

Naïve String-Matching Algorithm
Naïve-String-Matcher(T, P)
  n = length[T]
  m = length[P]
  for s = 0 to n – m
    compare T[s+1 … s+m] with P[1 … m]
    if all m characters match, report s (print “pattern occurs with shift” s)
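A minimal Python sketch of the naïve matcher follows. The function name `naive_string_matcher` is an assumption. Python strings are 0-indexed, so a shift s corresponds directly to the starting index s in the text.

```python
def naive_string_matcher(text, pattern):
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 possible shifts in turn.
    for s in range(n - m + 1):
        # The slice text[s:s+m] plays the role of T[s+1 .. s+m].
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts
```

For example, `naive_string_matcher("abababa", "aba")` reports the overlapping occurrences at shifts 0, 2, and 4.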

Examples Example: How many comparisons (both successful and unsuccessful) will be made by the brute-force string-matching algorithm in searching for each of the following patterns in the binary text of 1000 zeros? • 00001 • 10000 • 01010

Worst Case • The worst case happens when, at each shift, the first m-1 characters match and the last one does not. • Example: pattern a a a b against text a a a a a a a a a a a a a a a a a a a a a a b. • The running time is Θ((n – m + 1) m) in the worst case.
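The worst-case bound can be checked empirically with a small counting sketch. The function name `count_comparisons` is an assumption. This version tries every shift without stopping at a match, and compares left to right within each shift.

```python
def count_comparisons(text, pattern):
    """Count the character comparisons made by the brute-force matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # mismatch: move on to the next shift
    return comparisons
```

With a text of 20 a's and the pattern aaab, each of the 17 shifts costs 4 comparisons, giving (n – m + 1) m = 68. The pattern baaa fails on the first character of every shift and costs only 17.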

Analysis • The worst case rarely occurs in natural-language text. • Empirical studies show that the algorithm makes only about 1.1 comparisons per character of T (up to the point where a match is found).

Analysis … • Naïve string-matcher is inefficient because information gained about the text for one value of s is totally ignored in considering other values of s. • Such information can be very valuable, however. • For example, if P = aaab and we find that s = 0 is valid, then none of the shifts 1, 2, or 3 are valid, since T[4] = b.

Input Enhancement in String Matching • The Knuth-Morris-Pratt algorithm compares pattern characters left to right. • The Boyer-Moore algorithm compares them right to left; this approach leads to simpler algorithms, such as Horspool’s algorithm.

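Horspool’s Algorithm • Example: the pattern BARBER is aligned against the text s0 … sn-1, and c is the text character aligned with the pattern’s last character.
s0 …           c … sn-1
     B A R B E R
• Case 1: if there are no c’s in the pattern (e.g., c is the letter S in our example), we can shift the pattern by its entire length.
s0 …           S … sn-1
     B A R B E R
                 B A R B E R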

Horspool’s Algorithm (contd.) • Case 2: if there are occurrences of character c in the pattern but it is not the last one there (e.g., c is the letter B in our example), the shift should align the rightmost occurrence of c in the pattern with the c in the text.
s0 …           B … sn-1
     B A R B E R
         B A R B E R

Horspool’s Algorithm (contd.) • Case 3: if c happens to be the last character in the pattern but there are no c’s among its other m-1 characters, the shift should be the entire pattern length m:
s0 …       M E R … sn-1
     L E A D E R
                 L E A D E R

Horspool’s Algorithm (contd.) • Case 4: finally, if c happens to be the last character in the pattern and there are other c’s among its first m-1 characters, the shift should align the rightmost occurrence of c among the first m-1 characters with the text’s c:
s0 …         O R … sn-1
   R E O R D E R
         R E O R D E R

Horspool’s Algorithm (contd.) • Compute the shift value t(c) as follows:
  t(c) = the pattern length m, if c is not among the first m-1 characters of the pattern;
  t(c) = the distance from the rightmost c among the first m-1 characters of the pattern to its last character, otherwise.
ALGORITHM ShiftTable(P[0..m-1])
  // Fills the shift table used by Horspool’s and Boyer-Moore algorithms
  // Input: Pattern P[0..m-1] and an alphabet of possible characters
  // Output: Table[0..size-1] indexed by the alphabet’s characters and
  //         filled with shift sizes computed by the formula for t(c) above
  initialize all the elements of Table with m
  for j ← 0 to m-2 do Table[P[j]] ← m-1-j
  return Table
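The shift-table computation can be sketched in Python as below. The names `shift_table` and `shift_for` are assumptions. Instead of an array over the whole alphabet, this sketch stores only the characters among the first m-1 pattern positions in a dictionary and falls back to m for every other character.

```python
def shift_table(pattern):
    """Map each character among the first m-1 pattern characters to its shift."""
    m = len(pattern)
    table = {}
    # Scanning left to right lets a later (more rightward) occurrence
    # overwrite an earlier one, so the rightmost occurrence wins.
    for j in range(m - 1):
        table[pattern[j]] = m - 1 - j
    return table


def shift_for(table, c, m):
    """Return t(c): the stored shift, or m if c has no entry in the table."""
    return table.get(c, m)
```

For the pattern BARBER this gives shifts of 4 for A, 2 for B, 1 for E, 3 for R, and 6 for every other character.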

Horspool’s Algorithm (contd.) Horspool’s algorithm: • Step 1: For a given pattern of length m and the alphabet used in both the pattern and text, construct the shift table as described above. • Step 2: Align the pattern against the beginning of the text. • Step 3: Repeat the following until either a matching substring is found or the pattern reaches beyond the last character of the text. Starting with the last character in the pattern, compare the corresponding characters in the pattern and text until either all m characters are matched (then stop) or a mismatching pair is encountered. In the latter case, retrieve the shift t(c) from the table, where c is the text character currently aligned against the last character of the pattern, and shift the pattern t(c) characters to the right along the text.

Horspool’s Algorithm (contd.)
ALGORITHM HorspoolMatching(P[0..m-1], T[0..n-1])
  // Implements Horspool’s algorithm for string matching
  // Input: Pattern P[0..m-1] and text T[0..n-1]
  // Output: The index of the left end of the first matching
  //         substring or -1 if there are no matches
  ShiftTable(P[0..m-1])    // generate Table of shifts
  i ← m-1                  // position of the pattern’s right end
  while i ≤ n-1 do
    k ← 0                  // number of matched characters
    while k ≤ m-1 and P[m-1-k] = T[i-k] do
      k ← k+1
    if k = m
      return i-m+1
    else i ← i + Table[T[i]]
  return -1
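The pseudocode translates almost line for line into Python. The following is a sketch rather than a reference implementation: the function name `horspool_matching` is an assumption, the shift table is built inline as a dictionary, and a nonempty pattern is assumed. The sample text in the usage note is made up for illustration.

```python
def horspool_matching(pattern, text):
    """Return the index of the first match of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    # Shift table over the first m-1 pattern characters; the rightmost
    # occurrence wins. Characters with no entry shift by m.
    table = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    i = m - 1  # text position aligned with the pattern's right end
    while i <= n - 1:
        k = 0  # number of matched characters, counted from the right
        while k <= m - 1 and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1
        # Shift by t(c), where c is the text character under the
        # pattern's last position.
        i += table.get(text[i], m)
    return -1
```

For example, `horspool_matching("BARBER", "JIM_SAW_ME_IN_A_BARBERSHOP")` returns 16, the index at which BARBER begins.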

Horspool’s Algorithm • Exercise: Apply Horspool’s algorithm to search for the pattern BAOBAB in the text BESS_KNEW_ABOUT_BAOBABS

Horspool’s Algorithm • Exercise: Consider the problem of searching for genes in DNA sequences using Horspool’s algorithm. A DNA sequence is represented by a text on the alphabet {A, C, G, T}, and the gene or gene segment is the pattern. • (a) Construct the shift table for the following gene segment of your chromosome 10: TCCTATTCTT • (b) Apply Horspool’s algorithm to locate the pattern in the following DNA sequence: • TTATAGATCTCGTATTCTTTTATAGATCTCCTATTCTT

Horspool’s Algorithm • Exercise: How many character comparisons will be made by Horspool’s algorithm in searching for the following patterns in the binary text of 1000 zeros? • 00001 • 10000 • 01010

Prestructuring • Hashing and B-trees are examples of prestructuring. • In general, a hash function needs to satisfy two somewhat conflicting requirements: • 1) A hash function needs to distribute keys among the cells of the hash table as evenly as possible. • 2) A hash function has to be easy to compute.

Hashing … • Hashing • Hash Table • Hash Function • Hash Address • Collisions • Open Hashing (Separate Chaining) • Closed Hashing (Open Addressing) (example: Linear Probing – checks the cell following the one where the collision occurs) – implies that the table size m must be at least as large as the number of keys n.
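The two collision-resolution strategies can be contrasted with a short Python sketch. The function names, the table size 7, and the sample keys in the usage note are assumptions chosen for illustration. The hash function is h(K) = K mod m.

```python
def build_open_table(keys, m):
    """Open hashing (separate chaining): each cell holds a list of keys."""
    table = [[] for _ in range(m)]
    for key in keys:
        table[key % m].append(key)
    return table


def build_closed_table(keys, m):
    """Closed hashing (open addressing) with linear probing."""
    if len(keys) > m:
        raise ValueError("table size m must be at least the number of keys")
    table = [None] * m
    for key in keys:
        i = key % m
        # On a collision, check the following cell, wrapping around
        # to the start of the table when the end is reached.
        while table[i] is not None:
            i = (i + 1) % m
        table[i] = key
    return table
```

With keys 15, 8, 22, 5, 13, 20 and m = 7, the keys 15, 8, and 22 all hash to cell 1. Open hashing chains them in one list. Closed hashing places them in cells 1, 2, and 3, and the key 20, which collides with 13 at cell 6, wraps around to cell 0.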

Hashing • Letter codes: A: 1, B: 2, C: 3, D: 4, …, Z: 26 • Hash function: key mod 13

Open Hashing

Closed Hashing

Hashing • If the hash function distributes n keys among m cells of the hash table evenly, each list will be about n/m keys long. • This ratio is the load factor: α = n/m • Efficiency of hashing (Open Hashing): • Efficiency of hashing (Closed Hashing):
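For reference, the standard textbook estimates of the average number of probes can be written in terms of the load factor α. The symbols S (successful search) and U (unsuccessful search) are names assumed here. The estimates assume that keys hash uniformly. The closed-hashing figures are approximations for linear probing with α < 1.

```latex
\begin{align*}
\text{Open hashing (separate chaining):} \quad & S \approx 1 + \frac{\alpha}{2}, \qquad U = \alpha \\
\text{Closed hashing (linear probing):} \quad & S \approx \frac{1}{2}\Bigl(1 + \frac{1}{1-\alpha}\Bigr), \qquad U \approx \frac{1}{2}\Bigl(1 + \frac{1}{(1-\alpha)^{2}}\Bigr)
\end{align*}
```

The closed-hashing estimates grow without bound as α approaches 1, while the open-hashing estimates grow only linearly in α.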

Hashing • Exercise: For the input 30, 20, 56, 75, 31, 19 and hash function h(K) = K mod 11 • (a) Construct the open hash table. • (b) Find the largest number of key comparisons in a successful search in this table. • (c) Find the average number of key comparisons in a successful search in this table.

Hashing • Exercise: For the input 30, 20, 56, 75, 31, 19 and hash function h(K) = K mod 11 • (a) Construct the closed hash table. • (b) Find the largest number of key comparisons in a successful search in this table. • (c) Find the average number of key comparisons in a successful search in this table.

END • End of Chapter 5.
