
CSC 213


idania

Presentation Transcript


  1. CSC 213 Lecture 16: Strings and Pattern Matching

  2. Announcements • Last quiz results were not good • Good news: neighbors did not read book either • Scores were universally poor • Bad news: neighbors are also unlucky • Average score was 4.2 • Flipping a coin would produce an average of 5 • Best news: Another daily quiz!

  3. Strings (§ 11.1) • Algorithmically, any sequence of concatenated data is a string: • “CSC213 STUDENTS RAWK” • “I can’t believe this is a String.” • Java programs • HTML documents • Digitized image • DNA sequences

  4. String Terminology • A string is made up of elements from an alphabet – the characters usable within a family of strings • ASCII • Unicode • Bits • Pixels • DNA bases • Substring P[i ... j] of a string P has the characters of P at ranks i to j • Any substring starting at rank 0 is called a prefix • Any substring ending at the string’s last rank is called a suffix
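The prefix and suffix definitions above can be sketched in Java as follows. The class and method names are hypothetical, and the sketch assumes only non-empty substrings count (whether the empty string is also a prefix or suffix is a convention the slides do not settle).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the lecture: lists the non-empty
// prefixes and suffixes of a string using the rank-based definitions.
final class PrefixSuffixSketch {

    // Prefixes are the substrings P[0 ... j] for every rank j.
    static List<String> prefixes(String p) {
        List<String> result = new ArrayList<>();
        for (int j = 0; j < p.length(); j++) {
            result.add(p.substring(0, j + 1)); // substring's end index is exclusive
        }
        return result;
    }

    // Suffixes are the substrings P[i ... last rank] for every rank i.
    static List<String> suffixes(String p) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < p.length(); i++) {
            result.add(p.substring(i));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("abc")); // prints [a, ab, abc]
        System.out.println(suffixes("abc")); // prints [abc, bc, c]
    }
}
```

A string of length m has m non-empty prefixes and m non-empty suffixes, and the whole string appears in both lists.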

  5. Pattern Matching Problem (§ 11.2) • Given two strings T and P, find the first substring of T that matches P • T is the “text” and P is the “pattern” • This has many, many applications • Search engines • Database queries • Biological research

  6. Brute-Force Approach • Common method of solving problems • Easy to develop, requires little coding, and needs little brain power • Instead, uses the computer’s raw speed to consider and analyze all possible options • This can be painfully slow and use lots of memory • Generally good for only small problems

  7. Brute-Force Pattern Matching • Compare P with the substring starting at each rank of T, until we either • find a substring of T equal to P, or • reject all possible substrings of T • If P has size m and T has size n, this takes time O(nm) • Worst-case: • T = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa • P = aaag • Common in images, DNA, & biological data

  8. Brute-Force Pattern Matching
      Algorithm BruteForceMatch(String T, String P)
        // For each rank of T, see if it starts a matching substring
        for i ← 0 to T.length() – P.length()
          // Compare each character in the substring with its counterpart in P
          j ← 0
          while j < P.length() && T.charAt(i + j) == P.charAt(j)
            j ← j + 1
          if j == P.length()
            return i // Return 1st place in T where we find P
        return -1 // No matching substring exists
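A minimal Java sketch of the brute-force pseudocode above. The class name is hypothetical, and the text and pattern in `main` are illustrative values that do not come from the slides.

```java
// Hypothetical class name; the method mirrors the BruteForceMatch pseudocode.
final class BruteForceMatchSketch {

    // Returns the first rank in text where pattern occurs, or -1 if there is none.
    static int bruteForceMatch(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        // Try every rank of text at which a full copy of pattern could start.
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            // Walk forward while the characters keep matching.
            while (j < m && text.charAt(i + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i; // every character of pattern matched
            }
        }
        return -1; // no matching substring exists
    }

    public static void main(String[] args) {
        System.out.println(bruteForceMatch("abacaabaccabacabaabb", "abacab")); // prints 10
        System.out.println(bruteForceMatch("aaaa", "aaag"));                   // prints -1
    }
}
```

The outer loop runs at most n – m + 1 times and the inner loop at most m times per iteration, which is where the O(nm) bound comes from.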

  9. Your Turn • What are all of the prefixes and suffixes of the string: I am the Lizard King! • How many character comparisons does brute-force do to find a substring of: ccagcctccgcc that matches this pattern: ccgcc

  10. My Turn I am the Lizard King!

  11. Boyer-Moore Heuristics (§ 11.2.2) • Looking-glass heuristic: When comparing P with a substring of T, start from the end of P and work backward to P’s start • Character-jump heuristic: When finding a mismatch at T[i] = c • If P contains c, shift P so that the last occurrence of c in P is aligned with T[i] • Else, shift P completely past T[i], so that P[0] is aligned with T[i+1]

  12. Last-Occurrence Function • Boyer-Moore precomputes the last-occurrence function • Stores last-occurrence function, L, for P in array • L(c) = largest i where P.charAt(i) = c, or -1 if c is not in P • Example: • Consider alphabet S = {a, b, c, d, e, f} • P = badfeed
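The last-occurrence function for this example can be computed with a short Java sketch. The class and method names are hypothetical, and a `HashMap` stands in for the array the slide mentions, as an assumption made to keep the sketch independent of character encoding.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: builds L(c) for every character c of the alphabet.
final class LastOccurrenceSketch {

    static Map<Character, Integer> lastOccurrence(String pattern, char[] alphabet) {
        Map<Character, Integer> last = new HashMap<>();
        // Every character starts at -1, meaning "not in the pattern".
        for (char c : alphabet) {
            last.put(c, -1);
        }
        // Scanning left to right, later ranks overwrite earlier ones,
        // so each character ends up mapped to its largest rank.
        for (int i = 0; i < pattern.length(); i++) {
            last.put(pattern.charAt(i), i);
        }
        return last;
    }

    public static void main(String[] args) {
        Map<Character, Integer> last = lastOccurrence("badfeed", "abcdef".toCharArray());
        for (char c : "abcdef".toCharArray()) {
            System.out.println(c + " -> " + last.get(c));
        }
        // prints a -> 1, b -> 0, c -> -1, d -> 6, e -> 5, f -> 3 (one per line)
    }
}
```

Building the map takes time proportional to the pattern length plus the alphabet size.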

  13. The Boyer-Moore Algorithm
      Algorithm BoyerMooreMatch(String T, String P, Alphabet S)
        L ← lastOccurrence(P, S)
        i ← P.length() – 1
        j ← P.length() – 1
        repeat
          if T.charAt(i) = P.charAt(j)
            if j = 0
              return i // We have a match starting at i
            else
              i ← i – 1
              j ← j – 1
          else
            // We do not have a match at character i, so we can skip letters
            l ← L[T.charAt(i)]
            i ← i + P.length() – min(j, 1 + l)
            j ← P.length() – 1
        until i > T.length() – 1
        return -1
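The pseudocode above translates to Java roughly as follows. This is a sketch under stated assumptions, not the course's reference solution: the class name is hypothetical, the alphabet parameter is dropped in favor of a `HashMap` that treats absent characters as -1, and a `while` loop replaces `repeat ... until` so that a pattern longer than the text returns -1 instead of reading past the end of T.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name; follows the BoyerMooreMatch pseudocode
// (looking-glass and character-jump heuristics only).
final class BoyerMooreSketch {

    static int boyerMooreMatch(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return 0; // assumption: the empty pattern matches at rank 0
        }
        // Last-occurrence function: characters missing from the map count as -1.
        Map<Character, Integer> last = new HashMap<>();
        for (int k = 0; k < m; k++) {
            last.put(pattern.charAt(k), k);
        }
        int i = m - 1; // current rank in text
        int j = m - 1; // current rank in pattern
        while (i <= n - 1) {
            if (text.charAt(i) == pattern.charAt(j)) {
                if (j == 0) {
                    return i; // match starts at rank i
                }
                i--; // looking-glass heuristic: compare right to left
                j--;
            } else {
                // Character-jump heuristic: realign using the last occurrence.
                int l = last.getOrDefault(text.charAt(i), -1);
                i = i + m - Math.min(j, 1 + l);
                j = m - 1;
            }
        }
        return -1; // no matching substring exists
    }

    public static void main(String[] args) {
        System.out.println(boyerMooreMatch("abacaabaccabacabaabb", "abacab")); // prints 10
        System.out.println(boyerMooreMatch("aaaa", "baaa"));                   // prints -1
    }
}
```

The `min(j, 1 + l)` term keeps the pattern moving forward: when the last occurrence of the mismatched character lies to the right of rank j, the pattern shifts by one position instead of sliding backward.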

  14. Example

  15. Boyer-Moore’s Algorithm • Runs in time O(nm + |S|), where |S| is the size of the alphabet S • Worst-case: • T = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa • P = baaa • May occur in images, DNA sequences • Unlikely on larger alphabets, like English • Significantly faster than brute-force algorithm on English text

  16. Your Turn • How many character comparisons does Boyer-Moore do to find a substring of: ccagcctccgcc matching this pattern: ccgcc, with S = {a, c, g, t} • Compute the Boyer-Moore algorithm’s last-occurrence function for the string: the quick brown fox jumped over the lazy dog

  17. Your Turn • Write brute-force pattern matching method: public int bfMatch(String text, String pattern) { • Suppose we are using a non-ASCII alphabet that is stored in an array: public static int[] lastFn(Sequence pattern, Object[] alphabet) { • Hint: You cannot use the value at each rank in pattern as an index into last. You may want to write a method that, given an Object and alphabet, examines each location in alphabet to return the index where the Object is found.
