
Kleene Would Be Shocked: Redrawing the Link Between Theory and Modern Regex Engines

A presentation by Ian Graham, Carnegie Mellon University, August 2, 2002.


Presentation Transcript


  1. Kleene Would Be Shocked: Redrawing the Link Between Theory and Modern Regex Engines • A Presentation by Ian Graham • Carnegie Mellon University • August 2, 2002

  2. The March of Progress • 1. Literal string search (exact substring) • 2. Extended string search (character classes) • 3. Regular expression matching • 4. Approximate matching • 5. “Extended” regular expression matching

  3. Begin at the Beginning • The simplest case of a regular expression is a literal string search • Literal—any symbol in the alphabet • Literal string—a concatenation of literals • Literal string search—the problem of finding all occurrences of one literal string within another literal string (find “cad” in “abracadabra”)

  4. Quick Review • Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM) • Two classical literal string search algorithms • About 25 years old • Used to achieve O(m+n) search performance, where m is the length of the search pattern and n is the length of the text to be searched

  5. Quick Review • KMP scans from left to right, shifting by aligning the longest prefix of the search pattern which matches a suffix of the text scanned • BM scans from right to left along a window that shifts from left to right by choosing the largest shift amount from multiple shift rules
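As a rough illustration of the KMP idea described above, here is a minimal Python sketch. The function name `kmp_search` and the `fail` table are illustrative choices, not taken from the presentation; the table records, for each pattern prefix, the longest proper prefix that is also a suffix, which is what lets the scan move left to right without backing up in the text.

```python
def kmp_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # fail[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of pattern[:i + 1].
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits = []
    k = 0  # number of pattern characters currently matched
    for j, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # realign the longest matching prefix
        if ch == pattern[k]:
            k += 1
        if k == m:
            hits.append(j - m + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits
```

With the example from slide 3, `kmp_search("cad", "abracadabra")` reports the single occurrence starting at index 4.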

  6. Practical Developments • For any alphabet size, there is always an algorithm which achieves better experimental results than KMP or BM. • The Horspool algorithm (1980) simplifies BM, using only the bad character shift rule instead of calculating multiple shift amounts and using the best

  7. Practical Developments • Horspool is O(m+n) in the average case (assuming equal probability of all alphabet characters), O(mn) in the worst case • BM is O(m+n) average, O(m+n) worst • Evaluating multiple shift rules for BM greatly increases its runtime constant • Horspool is much faster in practice, and is extremely hard for any algorithm to beat over large alphabets
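The Horspool simplification is short enough to sketch in full. The version below is a minimal Python illustration under the usual textbook formulation (a single bad-character table keyed on the last character of the current window); the name `horspool_search` is an assumption made for this example.

```python
def horspool_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: for each character in pattern[:-1], the distance
    # from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the table entry for the last character of the window;
        # characters that do not occur in pattern[:-1] allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

Because the only shift rule is a dictionary lookup, the inner loop stays very small, which is one plausible reading of why the slide calls Horspool hard to beat over large alphabets.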

  8. Bit-Parallelism • Recent algorithms (1992~) create nondeterministic automata to keep track of each possible match along the length of the pattern • States of these NFAs are mapped to bits in a word, and transitions are simulated utilizing the parallelism of bitwise operations

  9. Bit-Parallelism • Possible matches may be represented by “1”s, and proceed in parallel along the pattern until they reach the end, indicating a match

  10. Bit-Parallelism • Savings due to parallelism depends on the word size • Bit-parallel algorithms often only perform well for patterns of size near to or less than the word size

  11. Bit-Parallelism • Most analysis assumes constant word size, either 32 or 64 bits • Savings under this assumption are constant, but result in extremely good performance for practical applications
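One well-known member of this family is the Shift-And algorithm, sketched below in Python as an illustration of the idea on slides 8 to 11. The function name and variable names are assumptions for this example. Bit i of `state` is 1 when the first i + 1 pattern characters match the text ending at the current position. Python integers are unbounded, so the word-size limit discussed above does not show up here; in C the pattern would need to fit in one machine word for the same code shape to apply.

```python
def shift_and_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that signals a complete match
    state = 0
    hits = []
    for j, ch in enumerate(text):
        # Advance every partial match by one position and start a new one,
        # then keep only those consistent with the current character.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits
```

Each text character costs one shift, one OR, and one AND regardless of how many partial matches are alive, which is the parallelism the slides refer to.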

  12. A Wrench • Let a “character class” be an item which matches a single character from a range or explicit list. • Examples • [0-9] matches any digit • [Aa] matches A or a • [A-Za-z] matches any English letter

  13. A Wrench • Let an “extended string” be a literal string with the additional property that it may contain character classes in place of literals. • Examples: • “abc[de]f” matches “abcdf” or “abcef” • “[Aa][Nn][Ee][Uu][Rr][Ii][Ss][Mm]” matches “aneurism”, “ANEURISM”, “aNeUrIsM”…

  14. A Wrench • Moving from literal string searches to extended string searches confounds many algorithms • Horspool may be extended, but its performance suffers greatly • Boyer-Moore may also be extended, and performs better than other well-known extensions
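To see why some approaches absorb character classes more gracefully than others, consider a bit-parallel Shift-And matcher: a class only changes how the per-character masks are built, not the scanning loop. The sketch below is an illustration only and is not the algorithm cited on the next slide. The representation of an extended string as a list of strings, one per position, each listing the allowed characters, is an assumption made for this example.

```python
def shift_and_classes(pattern, text):
    """Search for an extended string.

    pattern is a list of strings; position i matches any single
    character contained in pattern[i].
    """
    m = len(pattern)
    if m == 0:
        return []
    # masks[ch] has bit i set when ch is allowed at position i.
    masks = {}
    for i, allowed in enumerate(pattern):
        for ch in allowed:
            masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0
    hits = []
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits
```

For instance, the extended string “abc[de]f” becomes `["a", "b", "c", "de", "f"]`, and searching “xabcdf abcef abcgf” finds the occurrences starting at indices 1 and 7.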

  15. Bit-Parallelism on Top? • A recent (Navarro and Raffinot, 1998) bit-parallel algorithm claims to be 10-40% faster than any known variant of BM • Appears to be the fastest algorithm given: • moderate-sized alphabet (e.g. English) • moderate pattern sizes (5-110 characters)

  16. What is a Regular Expression? • Says Stephen Kleene: • “A notation to describe regular languages.” • “A description of the behavior of a finite state machine.” • “Regular.”

  17. A Familiar Definition • 1. a for some a in the alphabet Σ • 2. ε • 3. the null language ∅ • 4. R1 ∪ R2 (R1, R2 regular languages) • 5. R1 ◦ R2 (R1, R2 regular languages) • 6. R1* (R1 regular language)
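The six cases of this definition map directly onto a small recursive matcher. The Python sketch below is an illustration under an assumed representation: expressions are nested tuples tagged `"lit"`, `"eps"`, `"null"`, `"union"`, `"cat"`, and `"star"`, and the helper names `ends` and `matches` are made up for this example. It tracks the set of positions a match can end at, which is close in spirit to simulating an NFA, and it makes no attempt at efficiency.

```python
def ends(r, s, i):
    """Return the set of positions j such that r matches s[i:j]."""
    kind = r[0]
    if kind == "lit":    # case 1: a single alphabet symbol
        return {i + 1} if i < len(s) and s[i] == r[1] else set()
    if kind == "eps":    # case 2: the empty string
        return {i}
    if kind == "null":   # case 3: the null language matches nothing
        return set()
    if kind == "union":  # case 4: R1 union R2
        return ends(r[1], s, i) | ends(r[2], s, i)
    if kind == "cat":    # case 5: R1 followed by R2
        return {k for j in ends(r[1], s, i) for k in ends(r[2], s, j)}
    if kind == "star":   # case 6: zero or more repetitions of R1
        seen = {i}
        frontier = {i}
        while frontier:
            reached = set()
            for j in frontier:
                reached |= ends(r[1], s, j)
            frontier = reached - seen  # only explore new positions
            seen |= frontier
        return seen
    raise ValueError(f"unknown node type: {kind!r}")


def matches(r, s):
    """True when r matches the whole string s."""
    return len(s) in ends(r, s, 0)


# (a|b)*c written in the tuple representation
A_OR_B_STAR_C = ("cat",
                 ("star", ("union", ("lit", "a"), ("lit", "b"))),
                 ("lit", "c"))
```

With this encoding, `matches(A_OR_B_STAR_C, "abbac")` is true and `matches(A_OR_B_STAR_C, "abca")` is false.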

  18. Efficiently Matching Regular Expressions • Attempts to extend classical literal search algorithms to process regular expressions have largely been fruitless • Efficient algorithms involve clever ways of simulating an NFA equivalent of the regular expression

  19. Efficiently Matching Regular Expressions • For small to moderate pattern sizes, optimizations using bit-parallelism appear to result in the fastest algorithms (Navarro, Raffinot) • For large pattern sizes (greater than about 4 times the word size), partial conversion from NFA to DFA results in good performance

  20. Where can we go from here? • Approximate matching—match a literal string to within some “difference” • Edit distance is commonly used • Rules much more complex for computational biology applications • Extensions to regular expressions • Used by most languages and applications
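For reference, the edit distance mentioned above can be computed with the standard dynamic-programming recurrence. This minimal Python sketch assumes unit costs for insertion, deletion, and substitution (Levenshtein distance); the function name is chosen for this example, and the more complex scoring rules used in computational biology are not modeled.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, with unit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]
```

Approximate matching then asks for text substrings whose distance to the pattern is at most some threshold k; for example, “aneurism” and “aneurysm” are at distance 1.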

  21. Where can we go from here? • Efficiently handling regular expressions and approximate matching are problems in much of today’s research • Flexible Pattern Matching in Strings, by Navarro and Raffinot, referenced here, was published June 15, 2002

  22. What is a Regular Expression? • Say modern developers: • A pattern that can be matched against a string • Not necessarily a model of any particular machine • Not necessarily (and not usually) regular • A very powerful tool for solving text-based problems

  23. Who uses regular expressions? • Where to find built-in “regular expression” support today? • awk, grep, sed, vi, emacs, find, more, less, lex, Perl, Ruby, Tcl, MySQL, Javascript, PHP, Python, Java, Microsoft .NET, and many, many more • Built-in support has become more frequent and more advanced in the past few years

  24. Irregular Regular Expressions? • The matching problem for the patterns accepted by most popular “regular expression” engines is NP-hard • Construction of a “regular expression” in Perl which matches representations of 3-colorable graphs is fairly straightforward

  25. Irregular Regular Expressions? • Perl “regular expression” construction which decides whether a graph is 3-colorable, given a number of vertices $V and an edge list @E: • $string = (join "\n", (("rgb") x $V)) . "\n:\n" . join "\n", (("rgbrbgr") x @E); • $regex = '^' . (join "\\n", (".*(.).*") x $V) . "\\n:\\n" . (join "\\n", map {".*\\$_->[0]\\$_->[1].*"} @E) . '$'; • The graph is 3-colorable iff $regex matches $string (http://perl.plover.com/NPC/NPC-3COL.html)
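The same construction can be tried out in Python, whose `re` module also supports numbered backreferences. The sketch below is a port written for this transcript, not code from the presentation; the function name `three_colorable` and the convention that vertices are numbered from 1 are assumptions. Each vertex line captures one color from “rgb”, and each edge line “rgbrbgr” contains every ordered pair of distinct colors, so the edge line matches only when its two endpoints captured different colors.

```python
import re


def three_colorable(num_vertices, edges):
    """Decide 3-colorability by regex matching with backreferences.

    edges is a list of (u, v) pairs with vertices numbered from 1.
    """
    # One "rgb" line per vertex, a separator, one "rgbrbgr" line per edge.
    string = ("\n".join(["rgb"] * num_vertices)
              + "\n:\n"
              + "\n".join(["rgbrbgr"] * len(edges)))
    # Group k captures the color chosen for vertex k; each edge line must
    # contain the colors of its endpoints as adjacent characters.
    pattern = ("^"
               + r"\n".join([".*(.).*"] * num_vertices)
               + r"\n:\n"
               + r"\n".join(".*\\%d\\%d.*" % (u, v) for u, v in edges)
               + "$")
    return re.search(pattern, string) is not None
```

A triangle is accepted, while the complete graph on four vertices is rejected after the engine has backtracked through every color assignment. Running time grows quickly with graph size, which is the point of the slide.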

  26. Irregular Regular Expressions? • Usage of the term “regular expression” in modern development conflicts with its theoretical definition • Many are unaware of or ignore this conflict, while others choose different terminology: • “Extended regular expression” • “Regex”

  27. Clear Definitions • Regular expression—a description of a regular language, as defined by Kleene • Regex—any pattern matched against a string, not necessarily regular

  28. The Main Culprit • Backreferences • Ability to refer to text that has been matched in a previous part of the regex • Typically expressed as \n, where n is a number; refers to the text matched by the regex inside the nth set of parentheses • “(.*)\1\1” matches “abcabcabc”, “abaabaaba”... • “\b(\w+)\b\s+\b\1\b” matches “the the”, “a a”… (double words)
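Both example patterns can be checked with Python's `re` module, which uses the same `\n` backreference syntax. This is a small sketch; the helper names `repeated_unit` and `find_double_words` are made up for the illustration.

```python
import re

TRIPLE = re.compile(r"(.*)\1\1")                   # some text repeated three times
DOUBLE_WORD = re.compile(r"\b(\w+)\b\s+\b\1\b")    # the same word twice in a row


def repeated_unit(s):
    """Return the unit u if s is exactly u + u + u, otherwise None."""
    m = TRIPLE.fullmatch(s)
    return m.group(1) if m else None


def find_double_words(text):
    """Return every doubled word found in text, e.g. 'the the'."""
    return [m.group(0) for m in DOUBLE_WORD.finditer(text)]
```

Here `repeated_unit("abcabcabc")` yields `"abc"`, and `find_double_words("Paris in the the spring")` yields `["the the"]`. The language of strings of the form uuu is not regular, which is why this feature takes regexes beyond Kleene's definition.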

  29. Backreferences • Supported in limited number by vi, sed, grep, emacs, Ruby, Python, PHP • POSIX standard for Basic Regular Expressions includes capability to process nine backreferences • Bounding the number available places a bound on performance

  30. Backreferences • Supported without quantity bounds by Perl 5 and later, Tcl, Java 1.4, .NET • Number of backreferences limited only by physical memory restrictions

  31. Backreferences • Slow: matching a regex becomes NP-hard (for an unbounded number of backreferences) • Extremely useful: they add a great deal of expressive power to a regex • Largely untouched by theoretical analysis • No real bounds on efficiency

  32. Lookahead • Also known as “zero-width matching” • Ability to check text ahead without consuming it in a match • Typically expressed as (?=text) • Example • “abc(?=def)” will match “abc”, but only if followed by “def”
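In Perl-style syntax, which Python's `re` module shares, positive lookahead is written `(?=...)`. The short sketch below shows that the lookahead text is checked but not included in the match; the variable names are illustrative.

```python
import re

LOOKAHEAD = re.compile(r"abc(?=def)")

followed = LOOKAHEAD.search("xxabcdef")
matched_text = followed.group(0)   # only "abc" is consumed
matched_span = followed.span()     # the span stops before "def"

not_followed = LOOKAHEAD.search("xxabcxyz")  # None: "abc" is not followed by "def"
```

Because the assertion consumes nothing, a later part of the pattern can still match the same “def” characters.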

  33. Thank Larry Wall • Perl 5 regexes offer the ability to embed code within a regex • Perl 6 will support recursive regexes

  34. Why the divide? • Very little theory has touched on extended regular expressions. • Backreferences are indispensable for many programmers, and often even in non-development use of *NIX systems

  35. Why the divide? • Developers implemented regular expression processors shortly after Kleene created regular expressions in the 50’s

  36. Why the divide? • New and more powerful features were quickly added to practical “regular expressions” so that users and programmers could express more languages • Regexes soon left theory in the dust

  37. Moral of the Story • It’s much easier to hack than to make a good proof

  38. The Future • Unbounded backreferences are becoming a standard feature in regex libraries and languages • The idea of implementing regexes in a common module and sharing it among different languages and platforms is growing in popularity • PCRE (Perl Compatible Regular Expressions) is used by Python, PHP, Apache, KDE…

  39. The Future • Regex implementations seem to be moving towards more standardization • Meanwhile, a solid theoretical foundation has been laid down for regular expressions and modest extensions • Practice may not come to theory, but theory may soon come to practice
