
String processing algorithms

String processing algorithms. David Kauchak cs161 Summer 2009. Administrative. Check your scores on coursework SCPD Final exam: e-mail me with proctor information Office hours next week? Reminder: HW6 due Wed. 8/12 before class and no late homework.



  1. String processing algorithms David Kauchak cs161 Summer 2009

  2. Administrative • Check your scores on coursework • SCPD Final exam: e-mail me with proctor information • Office hours next week? • Reminder: HW6 due Wed. 8/12 before class and no late homework

  3. Where did “dynamic programming” come from? • Stuart Dreyfus, “Richard Bellman on the Birth of Dynamic Programming” • http://www.eng.tau.ac.il/~ami/cd/or50/1526-5463-2002-50-01-0048.pdf

  4. Strings • Let Σ be an alphabet, e.g. Σ = {‘ ’, a, b, c, …, z} • A string is any member of Σ*, i.e. any sequence of 0 or more members of Σ • ‘this is a string’ ∈ Σ* • ‘this is also a string’ ∈ Σ* • ‘1234’ ∉ Σ* (digits are not in this Σ)

  5. String operations • Given strings s1 of length n and s2 of length m • Equality: is s1 = s2? (case sensitive or insensitive) • Running time • O(min(n, m)), i.e. at most the length of the shorter string • ‘this is a string’ = ‘this is a string’ • ‘this is a string’ ≠ ‘this is another string’ • ‘this is a string’ =? ‘THIS IS A STRING’ (depends on case sensitivity)

  6. String operations • Concatenate (append): create string s1s2 • Running time • Θ(n+m) ‘this is a’ . ‘ string’ → ‘this is a string’

  7. String operations • Substitute: Exchange all occurrences of a particular character with another character • Running time • Θ(n) • Substitute(‘this is a string’, ‘i’, ‘x’) → ‘thxs xs a strxng’ • Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’

  8. String operations • Length: return the number of characters/symbols in the string • Running time • O(1) or Θ(n) depending on implementation • Length(‘this is a string’) → 16 • Length(‘this is another string’) → 22
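The operations on slides 5 to 8 map onto built-in Python string operations. The following is a minimal sketch; the helper names `equal` and `substitute` are hypothetical and chosen only to mirror the slide wording.

```python
def equal(s1: str, s2: str, ignore_case: bool = False) -> bool:
    """Equality check; stops at the first mismatch, so at most min(n, m) comparisons."""
    if ignore_case:
        # casefold() is Python's aggressive lowercasing for caseless comparison.
        return s1.casefold() == s2.casefold()
    return s1 == s2


def substitute(s: str, old: str, new: str) -> str:
    """Exchange every occurrence of the character `old` with `new` (one pass, Theta(n))."""
    return "".join(new if ch == old else ch for ch in s)


print(equal("this is a string", "THIS IS A STRING"))                    # False
print(equal("this is a string", "THIS IS A STRING", ignore_case=True))  # True
print("this is a" + " string")                                          # this is a string
print(substitute("banana", "a", "o"))                                   # bonono
print(len("this is a string"))                                          # 16
```

In CPython `len` is O(1) because the length is stored with the string; a C-style null-terminated string would need the Θ(n) scan mentioned on the slide.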

  9. String operations • Prefix: Get the first j characters in the string • Running time • Θ(j) • Suffix: Get the last j characters in the string • Running time • Θ(j) • Prefix(‘this is a string’, 4) → ‘this’ • Suffix(‘this is a string’, 6) → ‘string’

  10. String operations • Substring: Get the characters between positions i and j inclusive (1-based) • Running time • Θ(j - i + 1) • Prefix? • Prefix(S, i) = Substring(S, 1, i) • Suffix? • Suffix(S, i) = Substring(S, n - i + 1, n), where n = Length(S) • Substring(‘this is a string’, 4, 8) → ‘s is ’
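As a quick check of how prefix and suffix reduce to substring, here is a small sketch using Python slicing. It assumes 1-based, inclusive positions as in the slide examples, and takes Suffix(S, j) to mean the last j characters, as defined on slide 9. The function names are illustrative only.

```python
def substring(s: str, i: int, j: int) -> str:
    """Characters at positions i..j of s, inclusive, with 1-based positions."""
    return s[i - 1:j]


def prefix(s: str, j: int) -> str:
    """First j characters of s, expressed through substring."""
    return substring(s, 1, j)


def suffix(s: str, j: int) -> str:
    """Last j characters of s, expressed through substring."""
    n = len(s)
    return substring(s, n - j + 1, n)


text = "this is a string"
print(prefix(text, 4))              # this
print(suffix(text, 6))              # string
print(repr(substring(text, 4, 8)))  # 's is '
```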

  11. Edit distance (aka Levenshtein distance) • Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 • Insertion: ABACED → ABACCED (insert ‘C’) → DABACCED (insert ‘D’)

  12. Edit distance (aka Levenshtein distance) • Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 • Deletion: ABACED

  13. Edit distance (aka Levenshtein distance) • Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 • Deletion: ABACED → BACED (delete ‘A’)

  14. Edit distance (aka Levenshtein distance) • Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 • Deletion: ABACED → BACED (delete ‘A’) → BACE (delete ‘D’)

  15. Edit distance (aka Levenshtein distance) • Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 • Substitution: ABACED → ABADED (sub ‘D’ for ‘C’) → ABADES (sub ‘S’ for ‘D’)

  16. Edit distance examples • Edit(Kitten, Mitten) = 1 • Operations: • Sub ‘M’ for ‘K’ → Mitten

  17. Edit distance examples • Edit(Happy, Hilly) = 3 • Operations: • Sub ‘i’ for ‘a’ → Hippy • Sub ‘l’ for ‘p’ → Hilpy • Sub ‘l’ for ‘p’ → Hilly

  18. Edit distance examples • Edit(Banana, Car) = 5 • Operations: • Delete ‘B’ → anana • Delete ‘a’ → nana • Delete ‘n’ → naa • Sub ‘C’ for ‘n’ → Caa • Sub ‘r’ for ‘a’ → Car

  19. Edit distance examples • Edit(Simple, Apple) = 3 • Operations: • Delete ‘S’ → imple • Sub ‘A’ for ‘i’ → Ample • Sub ‘p’ for ‘m’ → Apple

  20. Is edit distance symmetric? • That is, is Edit(s1, s2) = Edit(s2, s1)? • Edit(Simple, Apple) =? Edit(Apple, Simple) • Yes. Why? Every operation can be reversed by a single operation: • sub ‘i’ for ‘j’ ↔ sub ‘j’ for ‘i’ • delete ‘i’ ↔ insert ‘i’ • insert ‘i’ ↔ delete ‘i’

  21. Calculating edit distance X = A B C B D A B Y = B D C A B A Ideas?

  22. Calculating edit distance X = A B C B D A ? Y = B D C A B ? After all of the operations, X needs to equal Y

  23. Calculating edit distance X = A B C B D A ? Y = B D C A B ? Operations: Insert Delete Substitute

  24. Insert X = A B C B D A ? Y = B D C A B ?

  25. Insert X = A B C B D A ? Edit Y = B D C A B ?

  26. Delete X = A B C B D A ? Y = B D C A B ?

  27. Delete X = A B C B D A ? Edit Y = B D C A B ?

  28. Substitution X = A B C B D A ? Y = B D C A B ?

  29. Substitution X = A B C B D A ? Edit Y = B D C A B ?

  30. Anything else? X = A B C B D A ? Y = B D C A B ?

  31. Equal X = A B C B D A ? Y = B D C A B ?

  32. Equal X = A B C B D A ? Edit Y = B D C A B ?

  33. Combining results Insert: Delete: Substitute: Equal:

  34. Combining results
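The formulas from the "Combining results" slides did not survive in this transcript. The sketch below uses the standard Levenshtein dynamic program, which matches the four cases named above (insert, delete, substitute, equal); it is an assumption that this is the formulation the slides showed. The name `edit_distance` and the table name `d` are illustrative.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    n, m = len(x), len(y)
    # d[i][j] = edit distance between the first i characters of x
    # and the first j characters of y.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of x
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Equal: last characters match, no edit needed.
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(
                    d[i][j - 1],      # Insert y[j-1] at the end of x
                    d[i - 1][j],      # Delete x[i-1]
                    d[i - 1][j - 1],  # Substitute y[j-1] for x[i-1]
                )
    return d[n][m]


print(edit_distance("Kitten", "Mitten"))  # 1
print(edit_distance("Happy", "Hilly"))    # 3
print(edit_distance("Banana", "Car"))     # 5
print(edit_distance("Simple", "Apple"))   # 3
```

The table has (n+1)(m+1) entries and each is filled in constant time, which gives the Θ(nm) running time.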

  35. Running time Θ(nm)

  36. Variants • Only include insertions and deletions • What does this do to substitutions? • Include swaps, i.e. swapping two adjacent characters counts as one edit • Weight insertion, deletion and substitution differently • Weight specific character insertion, deletion and substitutions differently • Length normalize the edit distance
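Several of these variants can be expressed by weighting the operations. The following is a hedged sketch, not the course's implementation: the parameter names `ins`, `dele` and `sub` are hypothetical. Setting `sub=2` makes a substitution cost the same as a deletion plus an insertion, which is one way to model the "insertions and deletions only" variant.

```python
def weighted_edit_distance(x: str, y: str,
                           ins: float = 1.0,
                           dele: float = 1.0,
                           sub: float = 1.0) -> float:
    """Edit distance where each operation type has its own cost."""
    n, m = len(x), len(y)
    # prev[j] = cost of turning the processed prefix of x into y[:j];
    # only two rows of the table are kept at a time.
    prev = [j * ins for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * dele] + [0.0] * m
        for j in range(1, m + 1):
            diagonal = prev[j - 1] + (0.0 if x[i - 1] == y[j - 1] else sub)
            cur[j] = min(cur[j - 1] + ins,   # insert y[j-1]
                         prev[j] + dele,     # delete x[i-1]
                         diagonal)           # match or substitute
        prev = cur
    return prev[m]


def normalized_edit_distance(x: str, y: str) -> float:
    """Length-normalized distance: divide by the longer length (one possible choice)."""
    longest = max(len(x), len(y))
    return weighted_edit_distance(x, y) / longest if longest else 0.0


print(weighted_edit_distance("Happy", "Hilly"))         # 3.0
print(weighted_edit_distance("Happy", "Hilly", sub=2))  # 6.0
print(normalized_edit_distance("Kitten", "Mitten"))
```

With `sub=2`, each of the three substitutions in Happy → Hilly is replaced by a delete and an insert, so the distance doubles from 3 to 6.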

  37. String matching • Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S P = ABA S = DCABABBABABA

  38. String matching • Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S P = ABA S = DCABABBABABA

  39. Uses • grep/egrep • search • find • java.lang.String.contains()

  40. Naive implementation
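The slide's pseudocode is not preserved in this transcript. A minimal sketch of the usual naive approach, assumed here, tries every alignment of P against S and runs an equality check at each one. The function name `naive_match` is illustrative, and positions are reported 0-based.

```python
def naive_match(p: str, s: str) -> list[int]:
    """Return every 0-based position in s where the pattern p occurs."""
    n, m = len(s), len(p)
    matches = []
    # There are n - m + 1 possible alignments of p against s.
    for i in range(n - m + 1):
        # Equality check on a length-m window: O(1) best case, O(m) worst case.
        if s[i:i + m] == p:
            matches.append(i)
    return matches


print(naive_match("ABA", "DCABABBABABA"))  # [2, 7, 9]
```

Note that the matches at positions 7 and 9 overlap; the naive scan finds both because it never skips an alignment.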

  41. Is it correct?

  42. Running time? • What is the cost of the equality check? • Best case: O(1) • Worst case: O(m)

  43. Running time? • Best case • Θ(n) – when the first character of the pattern does not occur in the string • Worst case • O((n-m+1)m)

  44. Worst case P = AAAA S = AAAAAAAAAAAAA

  45. Worst case P = AAAA S = AAAAAAAAAAAAA

  46. Worst case P = AAAA S = AAAAAAAAAAAAA

  47. Worst case P = AAAA S = AAAAAAAAAAAAA repeated work!

  48. Worst case P = AAAA S = AAAAAAAAAAAAA Ideally, after the first match, we’d know to just check the next character to see if it is an ‘A’
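To make the repeated work concrete, the sketch below counts character comparisons made by a naive matcher with an explicit inner loop. This is an illustrative instrumented version, not the slide's code; `count_comparisons` is a hypothetical name.

```python
def count_comparisons(p: str, s: str) -> int:
    """Number of character comparisons a naive matcher makes on pattern p and text s."""
    n, m = len(s), len(p)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if s[i + j] != p[j]:
                break  # mismatch: move on to the next alignment
    return comparisons


P = "AAAA"
S = "AAAAAAAAAAAAA"
n, m = len(S), len(P)
# Every alignment matches completely, so all m characters are compared each time.
print(count_comparisons(P, S), (n - m + 1) * m)
# If the first pattern character never occurs, each alignment costs one comparison.
print(count_comparisons("XAAA", S), n - m + 1)
```

Both printed pairs agree, matching the O((n-m+1)m) worst case and the Θ(n) best case.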

  49. Patterns • Which of these patterns will have that problem? P = ABAB P = ABDC P = BAA P = ABBCDDCAABB

  50. Patterns • Which of these patterns will have that problem? • P = ABAB • P = ABDC • P = BAA • P = ABBCDDCAABB • If the pattern has a proper suffix that is also a prefix, then we will have this problem
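The suffix-that-is-also-a-prefix condition can be checked directly. Below is a brute-force sketch that is quadratic in the pattern length and meant only for illustration; the name `longest_border` is hypothetical. It returns the longest proper prefix of P that is also a suffix of P.

```python
def longest_border(p: str) -> str:
    """Longest proper prefix of p that is also a suffix of p ('' if there is none)."""
    # Try candidate lengths from longest to shortest, excluding p itself.
    for k in range(len(p) - 1, 0, -1):
        if p[:k] == p[-k:]:
            return p[:k]
    return ""


for pattern in ["ABAB", "ABDC", "BAA", "ABBCDDCAABB", "AAAA"]:
    print(pattern, repr(longest_border(pattern)))
# ABAB 'AB'
# ABDC ''
# BAA ''
# ABBCDDCAABB 'ABB'
# AAAA 'AAA'
```

By this test, ABAB and ABBCDDCAABB have the problem, while ABDC and BAA do not.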
