
Pattern Matching on Strings using Regular Expressions


  1. Pattern Matching on Strings using Regular Expressions • Num = 0 | [1-9][0-9]* • Email = [a-z]+ "@" [a-z]+ ("." [a-z]+ )* • Claus Brabrand [ brabrand@itu.dk ], IT University of Copenhagen • Jakob G. Thomsen [ gedefar@cs.au.dk ], Aarhus University

  2. Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion

  3. Introduction & Motivation • Pattern matching is an indispensable problem • Many applications need to "parse" dynamic input • 1) URLs: http://first.dk/index.php?id=141&view=details (protocol, host, path, query-string; the query-string is a list of key-value pairs) • 2) Log Files: 13/02/2010 66.249.65.107 get /support.html; 20/02/2010 42.116.32.64 post /search.html • 3) DBLP: <article> <title>Three Models for the...</title> <author>Noam Chomsky</author> <year>1956</year> </article>

  4. Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion

  5. The Chomsky Hierarchy (1956) • Language classes (+formalisms): • Type-3 regular expressions "enough" for: • URLs, log files, DBLP, ... • "Trade" (excess) expressivity for: • declarativity, simplicity, and static safety!

  6. Type-0: java.net.URL • Turing-Complete programming (e.g., Java) • [ "unrestricted grammars" (e.g., rewriting systems) ] • Cyclomatic complexity (of official "java.net.URL"): • 88 bug reports on Sun's Bug Repository ! • Bug reports span more than a decade !

  7. Type-1: Context-Sensitivity • Not widely used (or studied?) formalism • Presumably because: • Restricts expressivity w/o offering extra safety?

  8. Type-2: Context-Free Grammars • Conceptually harder than regexps • Essentially (Type-3) Regular Expressions + recursion • The ultimate end-all scientific argument: • We d: (conjecture!) regexps 12 times more popular !

  9. Type-?: Regexp Capture Groups • Capturing groups (Perl, PHP, Java regex, ...): • Syntax: (R) (i.e., in parentheses) • Back-references: • Syntax: \7 (i.e., "index of" capturing group) • Beyond regularity!: • (a*)b\1 is non-regular: { aⁿbaⁿ | n ≥ 0 } • In fact, not even context-free!!!: • (.*).\1 is non-context-free: { ωcω | c ∈ Σ, ω ∈ Σ* }
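The two back-reference patterns on this slide can be tried directly in java.util.regex, which supports numbered back-references. The following is a small sketch; the class name BackReferenceDemo, the method names, and the sample strings are illustrative choices and do not come from the talk.

```java
import java.util.regex.Pattern;

// Illustrative class and method names (assumptions, not from the talk).
class BackReferenceDemo {
    // (a*)b\1 : whatever group 1 captured must be repeated after the 'b',
    // so the whole-string match accepts exactly strings of the form a^n b a^n.
    static boolean isAnBAn(String s) {
        return Pattern.matches("(a*)b\\1", s);
    }

    // (.*).\1 : some string w, one arbitrary character, then w again.
    static boolean isWcW(String s) {
        return Pattern.matches("(.*).\\1", s);
    }

    public static void main(String[] args) {
        System.out.println(isAnBAn("aabaa"));   // true
        System.out.println(isAnBAn("aabaaa"));  // false
        System.out.println(isWcW("abc-abc"));   // true
        System.out.println(isWcW("abc-abd"));   // false
    }
}
```

Neither language can be described without the back-reference, which is why such patterns fall outside the Type-3 class the talk argues for.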

  10. Type-?: Regexp Capture Groups • Interpretation with back-tracking: • NP-complete (exponential worst-case) :-( • regexp "a?ⁿaⁿ" vs. string "aⁿ": 1 minute vs. 0.02 msecs (3,000,000:1) on strings of length 29!!!
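The pathological case on this slide can be reproduced in miniature with java.util.regex, which is a back-tracking matcher. The sketch below builds the regexp a?ⁿaⁿ and matches it against aⁿ for a few small n; the class name BacktrackingDemo and the chosen values of n are assumptions, the timings depend on the machine and JVM, and `String.repeat` requires Java 11 or later.

```java
import java.util.regex.Pattern;

// Illustrative class name (an assumption, not from the talk).
class BacktrackingDemo {
    // Builds the regexp a?^n a^n: n optional 'a's followed by n mandatory 'a's.
    static Pattern pathological(int n) {
        return Pattern.compile("a?".repeat(n) + "a".repeat(n));
    }

    // Matches the regexp for a given n against the string a^n.
    // The match succeeds only when every optional 'a' matches the empty string,
    // which a greedy back-tracking matcher tries last.
    static boolean matchesAn(int n) {
        return pathological(n).matcher("a".repeat(n)).matches();
    }

    public static void main(String[] args) {
        // n is kept small on purpose; the running time tends to grow quickly with n.
        for (int n = 5; n <= 20; n += 5) {
            long start = System.nanoTime();
            boolean ok = matchesAn(n);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("n=" + n + " matched=" + ok + " time=" + micros + "us");
        }
    }
}
```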

  11. Type-3: Regular Expressions • Closure properties: Union, Concatenation, Iteration, Restriction, Intersection, Complement, ... • Decidability properties: Containment: L(R) ⊆ L(R'), Ambiguity, ... • Simple! Declarative! Safe!

  12. Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion

  13. Regular Expressions • Syntax: • Semantics: where: • L1 L2 is concatenation (i.e., { ω1 ω2 | ω1 ∈ L1, ω2 ∈ L2 }) • L* = ∪_{i ≥ 0} Lⁱ where L⁰ = { ε } and Lⁱ = L Lⁱ⁻¹

  14. Common Extensions (sugar) • Any character (aka, dot): • "." as c1|c2|...|cn, ci ∈ Σ • Character ranges: • "[a-z]" as a|b|...|z • One-or-more regexps: • "R+" as RR* • Optional regexp: • "R?" as ε|R • Various repetitions; e.g.: • "R{2,3}" as RRR?
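The desugarings above can be spot-checked with java.util.regex by comparing a sugared regexp and its expansion on a handful of strings. This is only a sanity check on samples, not an equivalence proof; the class name SugarDemo, the helper agreeOn, and the sample list are assumptions made for illustration (`List.of` requires Java 9 or later).

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative class and helper names (assumptions, not from the talk).
class SugarDemo {
    // True if both regexps accept, or both reject, every sample string.
    static boolean agreeOn(String sugared, String desugared, List<String> samples) {
        Pattern p = Pattern.compile(sugared);
        Pattern q = Pattern.compile(desugared);
        for (String s : samples) {
            if (p.matcher(s).matches() != q.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> samples = List.of("", "a", "aa", "aaa", "aaaa", "b", "ab");
        System.out.println(agreeOn("a+", "aa*", samples));       // R+ as RR*
        System.out.println(agreeOn("a?", "(|a)", samples));      // R? as (empty)|R
        System.out.println(agreeOn("a{2,3}", "aaa?", samples));  // R{2,3} as RRR?
        System.out.println(agreeOn("[a-c]", "a|b|c", samples));  // range as choice
    }
}
```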

  15. Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion

  16. Recording • Syntax: • "x" is a recording identifier • (it "remembers" the substring it matches) • Semantics: • Example (simplified emails): <user= [a-z]+ > "@" <domain= [a-z]+ ("." [a-z]+ )* > • Matching against string: "obama@whitehouse.gov" • yields: user = "obama" & domain = "whitehouse.gov" • NB: cannot use DFAs / NFAs! - only recognition (yes / no) - not how (i.e., "the structure")
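For readers without access to the authors' tool, the effect of the two recordings can be approximated with named capturing groups in java.util.regex. This is a stand-in under stated assumptions, not the talk's recording construction: the class name EmailRecordingDemo and the array-based return value are illustrative, and java.util.regex performs no ambiguity checking.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name (an assumption, not from the talk).
class EmailRecordingDemo {
    // Named groups play the role of the recordings <user= ... > and <domain= ... >.
    static final Pattern EMAIL =
        Pattern.compile("(?<user>[a-z]+)@(?<domain>[a-z]+(?:\\.[a-z]+)*)");

    // Returns {user, domain}, or null if s is not a (simplified) email address.
    static String[] match(String s) {
        Matcher m = EMAIL.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("user"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] r = match("obama@whitehouse.gov");
        System.out.println("user = " + r[0]);    // user = obama
        System.out.println("domain = " + r[1]);  // domain = whitehouse.gov
    }
}
```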

  17. Recording (structured) • Another example (with nested recordings): <date= <day= [0-9]{2} > "/" <month= [0-9]{2} > "/" <year= [0-9]{4} > > • Matching against string: "26/06/1992" • yields: date = 26/06/1992, date.day = 26, date.month = 06, date.year = 1992

  18. Recording (structured, lists) • Yet another example (yielding lists): <name= [a-z]+ > " & " <name= [a-z]+ > ; ( <name= [a-z]+ > "\n" )* ; <name= [a-z]+ > (" & " <name= [a-z]+ > )* • Matching against string: "obama & bush" • yields a list structure: name = [obama,bush]
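In java.util.regex a capturing group inside a repetition keeps only its last match, so a list-valued recording has to be emulated by hand. The sketch below first validates the whole string and then collects every name with `find()`; the class name NameListDemo and the two-pattern approach are assumptions for illustration, not how the authors' tool works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name (an assumption, not from the talk).
class NameListDemo {
    // Shape of the whole input: name (" & " name)*
    static final Pattern WHOLE = Pattern.compile("[a-z]+(?: & [a-z]+)*");
    // A single recorded name.
    static final Pattern NAME = Pattern.compile("[a-z]+");

    // Returns all names in order, or an empty list if the input has the wrong shape.
    static List<String> names(String s) {
        List<String> result = new ArrayList<>();
        if (!WHOLE.matcher(s).matches()) {
            return result;
        }
        Matcher m = NAME.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("name = " + names("obama & bush"));  // name = [obama, bush]
    }
}
```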

  19. Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion

  20. Abstract Syntax Trees (ASTs)

  21. Ambiguity • Definition: • R ambiguous iff ∃ T,T' ∈ AST_R: T ≠ T' ∧ ||T|| = ||T'|| • where ||·||: AST → Σ* is the flattening

  22. Characterization of Ambiguity • Theorem: • R unambiguous iff • NB: sound & complete! • R* = ε | RR*

  23. Examples • Ambiguous: • a|a: L(a) ∩ L(a) = { a } ≠ Ø • a*a*: overlap of L(a*) and L(a*) = { aⁿ } ≠ Ø • Unambiguous: • a|aa: L(a) ∩ L(aa) = Ø • a*ba*: overlap of L(a*) and L(ba*) = Ø

  24. Ambiguity Examples • a?b+|(ab)*: *** ambiguous choice: a?b+ <-|-> (ab)* shortest ambiguous string: "ab" • (a|ab)(ba|a): *** ambiguous concatenation: (a|ab) <--> (ba|a) shortest ambiguous string: "aba" • (aa|aaa)*: *** ambiguous star: (aa|aaa)* shortest ambiguous string: "aaaaa"
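The three witness strings above can be confirmed by brute force: count how many ways a given string can be read by each construct. The sketch below does this per string with java.util.regex; it is not the authors' static ambiguity check, which works on the regexp itself. The class name AmbiguityDemo and its method names are assumptions, and the counts are only meaningful when the sub-expressions are themselves unambiguous.

```java
import java.util.regex.Pattern;

// Illustrative class and method names (assumptions, not from the talk).
class AmbiguityDemo {
    // Choice: s is read ambiguously if both alternatives accept it.
    static boolean choiceAmbiguous(String r1, String r2, String s) {
        return Pattern.matches(r1, s) && Pattern.matches(r2, s);
    }

    // Concatenation: number of split points i such that
    // s[0,i) matches left and s[i,end) matches right.
    static int concatSplits(String left, String right, String s) {
        int count = 0;
        for (int i = 0; i <= s.length(); i++) {
            if (Pattern.matches(left, s.substring(0, i))
                    && Pattern.matches(right, s.substring(i))) {
                count++;
            }
        }
        return count;
    }

    // Star: number of ways to cut s into consecutive non-empty pieces
    // that each match piece (dynamic programming over suffixes).
    static int starParses(String piece, String s) {
        int n = s.length();
        int[] ways = new int[n + 1];
        ways[n] = 1; // the empty suffix has one parse: zero iterations
        for (int i = n - 1; i >= 0; i--) {
            for (int j = i + 1; j <= n; j++) {
                if (Pattern.matches(piece, s.substring(i, j))) {
                    ways[i] += ways[j];
                }
            }
        }
        return ways[0];
    }

    public static void main(String[] args) {
        System.out.println(choiceAmbiguous("a?b+", "(ab)*", "ab"));  // true
        System.out.println(concatSplits("a|ab", "ba|a", "aba"));     // 2
        System.out.println(starParses("aa|aaa", "aaaaa"));           // 2
    }
}
```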

  25. Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion

  26. Disambiguation • 1) Manual rewriting: Always possible :-) Tedious :-( Error-prone :-( Not structure-preserving :-( • 2) Restriction: R1 - R2; and then encode: R^C as Σ* - R; R1 & R2 as (R1^C | R2^C)^C • 3) Disambiguators: from characterization: concat: 'L','R'; choice: '|L','|R'; star: '*L','*R' (partial-order on ASTs) • 4) Default disambiguation: concat, choice, and star are all left-biased (by default)! (Our tool does this)

  27. Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion

  28. Type Inference • Type Inference: • R:(L,S)

  29. Examples (Type Inference) • Regexp: Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")" • compile (our tool): class Person { // auto-generated String name; int age; static Person match(String s) { ... } public String toString() { ... } } • Usage: String s = "obama (48)"; Person p = Person.match(s); print(p.name + " is " + p.age + "y old");
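To make the generated interface concrete, here is a hand-written approximation of what such a class could look like when backed by java.util.regex. It is a sketch under assumptions, not the output of the authors' tool: the name PersonSketch, the final fields, the null return on mismatch, and the toString format are all illustrative choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written stand-in for the auto-generated Person class (name is hypothetical).
class PersonSketch {
    final String name;
    final int age; // [0-9]+ is inferred as int on the slide; very long digit runs would overflow

    // <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    private static final Pattern PERSON =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    private PersonSketch(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Returns the parsed person, or null if s does not match (an assumed convention).
    static PersonSketch match(String s) {
        Matcher m = PERSON.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new PersonSketch(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        PersonSketch p = PersonSketch.match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old"); // prints "obama is 48y old"
    }
}
```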

  30. Examples (Type Inference) • Regexp: Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")" ; People = ( $Person "\n" )* • compile (our tool): class People { // auto-generated String[] name; int[] age; static People match(String s) { ... } public String toString() { ... } } • Usage: String s = "obama (48)\nbush (63)\n"; People p = People.match(s); println("Second name is " + p.name[1]);

  31. Examples (Type Inference) • Regexp: Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")" ; People = ( <person= $Person > "\n" )* ; • compile (our tool): class People { // auto-generated Person[] person; class Person { // nested class String name; int age; } ... } • Usage: String s = "obama (48)\nbush (63)\n"; People people = People.match(s); for (Person p : people.person) println(p.name);

  32. Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion

  33. URLs • URLs: "http://www.google.com/search?q=record&hl=en" (protocol, host, path, query-string) • Regexp: Host = <host = [a-z]+ ("." [a-z]+ )* > ; Path = <path = [a-z/.]* > ; Query = <query= [a-z&=]* > ; URL = "http://" $Host "/" $Path "?" $Query ; • Query string further structured (list of key-value pairs): KeyVal = <key= [a-z]* > "=" <val= [a-z]* > ; Query = $KeyVal ("&" $KeyVal)* ;

  34. URLs (Usage Example) • Regexp: Host = <host = [a-z]+ ("." [a-z]+ )* > ; Path = <path = [a-z/.]* > ; KeyVal = <key= [a-z]* > "=" <val= [a-z]* > ; Query = $KeyVal ("&" $KeyVal)* ; URL = "http://" $Host "/" $Path "?" $Query ; • Usage (example): String s = "http://www.google.com/search?q=record"; URL url = URL.match(s); print("Host is: " + url.host); if (url.key.length>0) print("1st key: " + url.key[0]); for (String val : url.val) println("value = " + val);
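The same URL structure can be approximated by hand with java.util.regex: one pattern for the overall shape, and a second pass over the query string to build the key and value lists. This is a sketch, not the generated URL class from the talk; the name UrlSketch, the List-typed fields, and the null return on mismatch are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written stand-in for the generated URL class (name and field types are hypothetical).
class UrlSketch {
    final String host;
    final String path;
    final List<String> key = new ArrayList<>();
    final List<String> val = new ArrayList<>();

    // "http://" $Host "/" $Path "?" $Query, with Query = $KeyVal ("&" $KeyVal)*
    private static final Pattern URL = Pattern.compile(
        "http://(?<host>[a-z]+(?:\\.[a-z]+)*)"
        + "/(?<path>[a-z/.]*)"
        + "\\?(?<query>[a-z]*=[a-z]*(?:&[a-z]*=[a-z]*)*)");

    // KeyVal = <key= [a-z]* > "=" <val= [a-z]* >
    private static final Pattern KEYVAL = Pattern.compile("(?<key>[a-z]*)=(?<val>[a-z]*)");

    private UrlSketch(String host, String path) {
        this.host = host;
        this.path = path;
    }

    // Returns the parsed URL, or null if s does not match (an assumed convention).
    static UrlSketch match(String s) {
        Matcher m = URL.matcher(s);
        if (!m.matches()) {
            return null;
        }
        UrlSketch url = new UrlSketch(m.group("host"), m.group("path"));
        // Second pass: collect every key-value pair of the already validated query string.
        Matcher kv = KEYVAL.matcher(m.group("query"));
        while (kv.find()) {
            url.key.add(kv.group("key"));
            url.val.add(kv.group("val"));
        }
        return url;
    }

    public static void main(String[] args) {
        UrlSketch url = match("http://www.google.com/search?q=record&hl=en");
        System.out.println("Host is: " + url.host);
        for (int i = 0; i < url.key.size(); i++) {
            System.out.println(url.key.get(i) + " = " + url.val.get(i));
        }
    }
}
```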

  35. Log Files • Format: 13/02/2010 66.249.65.107 /support.html; 20/02/2010 42.116.32.64 /search.html; ... • Regexp: Date = <date= <day= $Day > "/" <month= $Month > "/" <year= [0-9]{4} > > ; IP = <ip= [0-9]{1,3} ("." [0-9]{1,3} ){3} > ; Entry = <entry= $Date " " $IP " " $Path "\n" > ; Log = $Entry * ; • Usage: Log log = Log.match(log_file); for (Entry e : log.entry) if (e.date.month == 02 && e.date.day == 29) print("Access on LEAP YEAR from IP# " + e.ip);

  36. Log Files (cont'd, ambiguity) • Assume we forgot "/" (between day & month): • Regexp: Day = 0?[1-9] | [1-2][0-9] | 30 | 31 ; Month = 0?[1-9] | 10 | 11 | 12 ; Date = <date= <day= $Day > (no slash!) <month= $Month > "/" <year= [0-9]{4} > > ; • Error: *** ambiguous concatenation: <day> <--> <month> shortest ambiguous string: "101" • Ambiguity: i.e. "1/01" (January 1) vs. "10/1" (January 10) :-)
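The reported witness "101" can be checked by enumerating every way to split a digit string into a Day part followed by a Month part. The sketch below reuses the slide's Day and Month regexps with java.util.regex; the class name DateAmbiguityDemo and the method readings are illustrative assumptions, and this per-string enumeration is not the static check the tool performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative class and method names (assumptions, not from the talk).
class DateAmbiguityDemo {
    static final String DAY = "0?[1-9]|[1-2][0-9]|30|31";
    static final String MONTH = "0?[1-9]|10|11|12";

    // All ways to read s as Day immediately followed by Month, shown as "day/month".
    static List<String> readings(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i <= s.length(); i++) {
            String day = s.substring(0, i);
            String month = s.substring(i);
            if (Pattern.matches(DAY, day) && Pattern.matches(MONTH, month)) {
                out.add(day + "/" + month);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readings("101"));   // [1/01, 10/1]
        System.out.println(readings("3112"));  // [31/12]
    }
}
```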

  37. DBLP (Format) • DBLP (XML) Format: <article> <author>Noam Chomsky</author> <title>Three Models for the Description of Language</title> <year>1956</year> <journal>IRE Transactions on Information Theory</journal> </article> <article> <author>Claus Brabrand</author> <author>Jakob G Thomsen</author> <title>Typed and Unambiguous Pattern Matching on Strings using Regular Expressions</title> <year>2010</year> <note>Submitted</note> </article> ...

  38. DBLP (Regexp) • DBLP Regexp: Author = "<author>" <author= [a-z]* > "</author>" ; Title = "<title>" <title= [a-z]* > "</title>" ; Article = "<article>" $Author* $Title .* "</article>" ; DBLP = <pub= $Article > * ; • Ambiguity!: *** ambiguous star: <pub>* shortest ambiguous string: "<article><title></title></article> <article><title></title></article>" • EITHER 2 publications (.* = "") • OR 1 publication (.* = gray part)!!!

  39. DBLP (Disambiguated) • DBLP Regexp: Author = "<author>" <author= [a-z]* > "</author>" ; Title = "<title>" <title= [a-z]* > "</title>" ; Article = "<article>" $Author* $Title .* "</article>" ; DBLP = <pub= $Article > * ; • Disambiguated (using "(R1-R2)"): Article = "<article>" $Author* $Title (.*-(.* "</article>" .*)) "</article>" ; • Unambiguous! :-)
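java.util.regex has no restriction operator, but the effect of `.* - (.* "</article>" .*)` can be imitated with a negative lookahead that refuses to step over a closing tag. The sketch below contrasts this with the unrestricted `.*`; the class name ArticleSplitDemo is an assumption, the article body is simplified to arbitrary text, and unlike the talk's tool java.util.regex silently picks one reading instead of reporting the ambiguity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name (an assumption, not from the talk).
class ArticleSplitDemo {
    // Unrestricted body: ".*" may run across "</article>" boundaries.
    static final Pattern GREEDY =
        Pattern.compile("<article>.*</article>", Pattern.DOTALL);

    // Restricted body: a character is consumed only if "</article>" does not
    // start at that position, so the body can never contain a closing tag.
    static final Pattern RESTRICTED =
        Pattern.compile("<article>(?:(?!</article>).)*</article>", Pattern.DOTALL);

    // Number of successive, non-overlapping matches of p in s.
    static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String two = "<article><title></title></article>"
                   + "<article><title></title></article>";
        System.out.println(count(GREEDY, two));      // 1 (both articles swallowed as one)
        System.out.println(count(RESTRICTED, two));  // 2
    }
}
```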

  40. DBLP (Usage Example) • DBLP Regexp: Author = "<author>" <author= [a-z]* > "</author>" ; Title = "<title>" <title= [a-z]* > "</title>" ; Article = "<article>" $Author* $Title .* "</article>" ; DBLP = <article= $Article > * ; • Usage (example): DBLP dblp = DBLP.match(readXMLfile("DBLP.xml")); for (Article a : dblp.article) print("Title: " + a.title);

  41. Outline • Pattern Matching (intro & motiv): • The Chomsky Hierarchy (1956) • Regular Expressions: • The Recording Construction • Ambiguity: • Disambiguation • Type Inference • Usage and Examples • Evaluation and Conclusion

  42. Evaluation • Evaluation summary: • Also, (Type-3) regexps expressive "enough" • for: URLs, Log files, DBLP, ... [ Frisch&Cardelli'04 ] [ NP-Complete ] [ MatMult ]

  43. Type-3 vs. Type-0 (URLs) • Regexps vs. Java: Regexps are 8 times more concise !

  44. java.util.regex vs. Our approach • Efficiency (on DBLP): • java.util.regex: • Exponential O(2^|ω|): 2,500 chars in 2 mins! • In contrast; ours: • Linear (on DBLP): 1,200,000 chars in 6 secs! • 2 mins vs. 10 msecs

  45. Related Work • Recording (with lists in general): • "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP • Ambiguity: • [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFAs, not directly (syntax-directed) • Disambiguation: • [Vansummeren'06], but with global, not local disambiguation • Type inference: • Exact type inference in XDuce & CDuce (soundness+completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java)

  46. Conclusion • For string pattern matching, it is possible to: "trade (excess) expressivity for safety+simplicity" • i.e., ambiguity checking and type inference! • + stand-alone & non-intrusive language integration (Java)! • In conclusion: We conclude that if regular expressions are sufficiently expressive, they provide a simple, declarative, and safe means for pattern matching on strings, capable of extracting highly structural information in a statically type-safe and unambiguous manner.

  47. </Talk> [ http://www.cs.au.dk/~gedefar/reg-exp-rec/ ] Questions ? Complaints ?
