
ICS 482 Natural Language Processing Regular Expression and Finite Automata

Muhammed Al-Mulhem. March 1, 2009.

Presentation Transcript


  1. ICS 482 Natural Language Processing: Regular Expression and Finite Automata. Dr. Muhammed Al-Mulhem, March 1, 2009

  2. Regular Expressions. Regular expression (RE): a formula for specifying a set of strings. String: a sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation).

  3. Regular Expression Patterns

  4. Specify Options and Range using [ ] and -

  5. RE Operators

  6. Sidebar: Errors. Find all instances of the word “the” in a text. /the/: what about ‘The’? /[tT]he/: what about ‘Theater’, ‘Another’?

  7. Sidebar: Errors. The process we just went through was based on fixing two kinds of errors: matching strings that we should not have matched (there, then, other), which are false positives; and not matching things that we should have matched (The), which are false negatives.

  8. Sidebar: Errors. Reducing the error rate for an application often involves two efforts: increasing accuracy (minimizing false positives) and increasing coverage (minimizing false negatives).
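The two error types from the sidebar can be counted on a small example. This is a minimal sketch using Python's `re` module rather than Perl; the sample sentence is hypothetical, and treating `\b[tT]he\b` as the reference answer for "real" occurrences of the article is an assumption made only for this illustration.

```python
import re

# Hypothetical sample text containing the cases named on the slides.
TEXT = "The theater is on the other side. Then another show starts there."

def count_matches(pattern: str, text: str) -> int:
    """Return the number of non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))

# Assumed reference: the article as a whole word, either capitalization.
true_hits = count_matches(r"\b[tT]he\b", TEXT)

# /the/ misses "The" (a false negative) and also fires inside
# "theater", "other", "another" and "there" (false positives).
naive_hits = count_matches(r"the", TEXT)

# /[tT]he/ recovers "The" but adds "Then" as one more false positive.
cased_hits = count_matches(r"[tT]he", TEXT)

print(true_hits, naive_hits, cased_hits)
```

Widening the pattern removed the false negative but raised the number of false positives, which is the trade-off between coverage and accuracy described above.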

  9. Regular expressions: basic regular expression patterns in Perl-based syntax (slightly different from other notations for regular expressions).
     Disjunctions: [abc]
     Ranges: [A-Z]
     Negations: [^Ss]
     Optional and repeated characters: ?, + and *
     Wild cards: .
     Anchors: \b and \B
     Disjunction, grouping, and precedence: |
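Each pattern type listed above can be tried on a short string. The sketch below uses Python's `re` module, whose syntax agrees with the Perl-style notation for these particular operators; the test strings are illustrative choices, not taken from the slides.

```python
import re

# (description, pattern, text, should_match)
CASES = [
    ("disjunction [ ]", r"[wW]oodchuck", "Woodchuck", True),
    ("range [A-Z]", r"[A-Z]", "we call it Drenched Blossoms", True),
    ("negation [^Ss]", r"[^Ss]", "Ss!", True),
    ("negation, only S/s present", r"[^Ss]", "sS", False),
    ("optional ?", r"colou?r", "color", True),
    ("zero or more *", r"baa*!", "ba!", True),
    ("one or more +", r"baa+!", "ba!", False),
    ("wildcard .", r"beg.n", "begun", True),
    ("word boundary", r"\bthe\b", "other", False),
    ("non-boundary", r"\Bthe\B", "other", True),
    ("disjunction |", r"gupp(y|ies)", "guppies", True),
]

def found(pattern: str, text: str) -> bool:
    """True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

RESULTS = [(desc, found(pat, txt), expected) for desc, pat, txt, expected in CASES]

for desc, got, _ in RESULTS:
    print(f"{desc:30} {'match' if got else 'no match'}")
```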

  10. Preceding character or nothing using ?

  11. Wildcard

  12. Negation using ^

  13. Writing correct expressions. Exercise: write a regular expression to match the English article “the”:
     /the/ misses ‘The’
     /[tT]he/ also matches the ‘the’ in ‘others’
     /\b[tT]he\b/ misses ‘the’ in ‘the25’ and ‘the_’
     /[^a-zA-Z][tT]he[^a-zA-Z]/ misses ‘The’ at the beginning of a line
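The refinement steps in this exercise can be checked one by one. The sketch below uses Python's `re` module. The last pattern, which also allows the start or end of a line, is not on the slide; it is the usual textbook completion of the exercise and is included here as an assumption.

```python
import re

def hit(pattern: str, text: str) -> bool:
    """True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# Step 1: /the/ misses the capitalized form.
step1_misses_The = not hit(r"the", "The cat")

# Step 2: /[tT]he/ fires inside a longer word.
step2_false_positive = hit(r"[tT]he", "others")

# Step 3: word boundaries fix "others" but treat digits and "_" as
# word characters, so "the25" and "the_" are not matched.
step3_fixes_others = not hit(r"\b[tT]he\b", "others")
step3_misses_the25 = not hit(r"\b[tT]he\b", "the25")

# Step 4: require a non-letter on both sides; this fails at line start.
step4_finds_the25 = hit(r"[^a-zA-Z][tT]he[^a-zA-Z]", "buy the25 now")
step4_misses_line_start = not hit(r"[^a-zA-Z][tT]he[^a-zA-Z]", "The cat")

# Assumed final step (not on the slide): allow line start and line end.
FINAL = r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"
final_finds_line_start = hit(FINAL, "The cat")
final_rejects_others = not hit(FINAL, "others")
```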

  14. A more complex example. Exercise: write a Perl regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”.

  15. Example
     Price (the literal dollar sign is escaped as \$ so that it is not read as the end-of-line anchor):
     /\$[0-9]+/  # whole dollars
     /\$[0-9]+\.[0-9][0-9]/  # dollars and cents
     /\$[0-9]+(\.[0-9][0-9])?/  # cents optional
     /(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/  # delimited; \b cannot precede $, which is not a word character
     Specifications for processor speed:
     /\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
     Memory size:
     /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
     /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
     Vendors:
     /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
     /\b(Mac|Macintosh|Apple)\b/
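Patterns like these only locate mentions of a price or a specification; deciding "more than" or "less than" still requires extracting the numbers and comparing them. A minimal sketch in Python follows. The capture groups, the `qualifies` helper, the sample ads, and the reading of "32 Gb" as "at least 32" are all assumptions for illustration, and only MHz and Gb units are handled.

```python
import re

# Same shapes as the slide patterns, with a capture group around the number.
PRICE = re.compile(r"(?:^|\W)\$([0-9]+(?:\.[0-9][0-9])?)\b")
SPEED = re.compile(r"\b([0-9]+) *(?:MHz|[Mm]egahertz)\b")
DISK = re.compile(r"\b([0-9]+(?:\.[0-9]+)?) *(?:Gb|[Gg]igabytes?)\b")

def qualifies(ad: str) -> bool:
    """True if the ad mentions >500 MHz, >=32 Gb and a price under $1000."""
    price = PRICE.search(ad)
    speed = SPEED.search(ad)
    disk = DISK.search(ad)
    if not (price and speed and disk):
        return False  # a required field is missing from the ad
    return (
        float(speed.group(1)) > 500
        and float(disk.group(1)) >= 32
        and float(price.group(1)) < 1000
    )

# Hypothetical ads.
good_ad = "Dell PC, 600 MHz, 40 Gb disk, only $899.99"
slow_ad = "Acme PC, 400 MHz, 40 Gb disk, $799"
pricey_ad = "Acme PC, 600 MHz, 40 Gb disk, $1299.00"

print(qualifies(good_ad), qualifies(slow_ad), qualifies(pricey_ad))
```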

  16. Advanced Operators – Aliases for common ranges. Underscore: correct Figure 2.6.

  17. \ to reference special characters

  18. Operators for counting
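The three operator groups named on these slides (aliases, backslash escapes, and counting) can be exercised together. The tables behind the slide titles were not recovered, so the cases below are illustrative choices rather than the slides' own examples, written for Python's `re` module. Note that `\w` includes the underscore.

```python
import re

# (pattern, text, should_match_the_whole_text)
CASES = [
    # Aliases for common ranges
    (r"\d+", "2009", True),        # \d: any digit
    (r"\w+", "the_25", True),      # \w: letters, digits and underscore
    (r"\s+", " \t", True),         # \s: whitespace
    (r"\W", "_", False),           # underscore is a word character
    (r"\D+", "MHz", True),         # \D: any non-digit
    # Backslash to reference special characters literally
    (r"a.c", "abc", True),         # unescaped . is a wildcard
    (r"a\.c", "abc", False),       # escaped . is a literal period
    (r"a\.c", "a.c", True),
    (r"\$5\.00", "$5.00", True),
    # Operators for counting
    (r"ba{2,}!", "baaa!", True),   # at least two a's
    (r"ba{2,}!", "ba!", False),
    (r"[0-9]{3}", "482", True),    # exactly three digits
    (r"a{1,3}", "aaaa", False),    # between one and three a's
]

RESULTS = [
    (pat, txt, re.fullmatch(pat, txt) is not None, expected)
    for pat, txt, expected in CASES
]

for pat, txt, got, _ in RESULTS:
    print(f"{pat!r:14} {txt!r:10} {'match' if got else 'no match'}")
```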

  19. Finite State Automata. An FSA recognizes the regular languages represented by regular expressions. SheepTalk: /baa+!/
     [Figure: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with a self-loop labeled a on q3]
     • Directed graph with labeled nodes and arc transitions
     • Five states: q0 the start state, q4 the final state, 5 transitions

  20. Formally, an FSA is a 5-tuple consisting of:
     Q: a set of states {q0,q1,q2,q3,q4}
     Σ: an alphabet of symbols {a,b,!}
     q0: a start state
     F: a set of final states in Q {q4}
     δ(q,i): a transition function mapping Q × Σ to Q

  21. An FSA recognizes (accepts) the strings of a regular language: baa! baaa! baaaa! … [Figure: a rejected input, run on the automaton q0 to q4]

  22. State Transition Table. An FSA can be represented with a state transition table.
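A state transition table maps directly onto a dictionary. The sketch below encodes the SheepTalk automaton for /baa+!/ with states q0 to q4 written as the integers 0 to 4; the table contents follow from the five transitions of that automaton, while the names `TRANSITIONS` and `recognize` are illustrative.

```python
# State transition table for the SheepTalk automaton /baa+!/.
# A missing (state, symbol) entry means there is no legal transition.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # self-loop: any number of further a's
    (3, "!"): 4,
}
START = 0
FINAL = {4}

def recognize(tape: str) -> bool:
    """Deterministically run the automaton over tape; True means accept."""
    state = START
    for symbol in tape:
        next_state = TRANSITIONS.get((state, symbol))
        if next_state is None:
            return False  # no transition for this symbol: reject
        state = next_state
    # Accept only if the input ends in a final state.
    return state in FINAL

for s in ["baa!", "baaaa!", "ba!", "baa", "abc"]:
    print(f"{s!r:10} {'accept' if recognize(s) else 'reject'}")
```

Because every (state, symbol) pair has at most one successor, the recognizer never has to choose between alternatives, which is what makes it deterministic.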

  23. Non-Deterministic FSAs for SheepTalk. [Figures: two automata over states q0 to q4, the first with arcs labeled b, a, a, ! and an additional a arc, the second with arcs labeled b, a, a, ! and an ε arc]
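One way to run a non-deterministic automaton is to track the set of all states it could be in after each symbol. The sketch below applies that state-set simulation (a different technique from backtracking search) to two SheepTalk variants. The figures were not recovered, so the arc placement is an assumption: a self-loop on q2 in the first machine and an ε arc from q3 back to q2 in the second.

```python
EPS = ""  # label used in this sketch for ε-transitions

# Variant 1 (assumed): in q2, reading "a" may stay in q2 or move to q3.
NFA_LOOP = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}

# Variant 2 (assumed): an ε arc lets the machine return from q3 to q2.
NFA_EPS = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, "!"): {4},
    (3, EPS): {2},
}

def eps_closure(states, delta):
    """All states reachable from `states` using only ε arcs."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

def nd_recognize(tape, delta, start=0, final=frozenset({4})):
    """Accept if some path through the automaton consumes all of tape."""
    current = eps_closure({start}, delta)
    for symbol in tape:
        moved = set()
        for q in current:
            moved |= delta.get((q, symbol), set())
        current = eps_closure(moved, delta)
        if not current:
            return False  # every possible path has died
    return bool(current & final)

for s in ["baa!", "baaa!", "ba!"]:
    print(s, nd_recognize(s, NFA_LOOP), nd_recognize(s, NFA_EPS))
```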

  24. Languages. A language is a set of strings. String: a sequence of letters.

  25. Tracing FSA - Initial Configuration: Input String

  26. Reading the Input

  27-29. [Figures only]

  30. Output: “accept”

  31. Rejection

  32-34. [Figures only]

  35. Output: “reject”

  36. Another Example

  37-39. [Figures only]

  40. Output: “accept”

  41. Rejection

  42-44. [Figures only]

  45. Output: “reject”
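The tracing slides step an automaton through an input until it outputs "accept" or "reject". The automaton and inputs used in those figures were not recovered, so the sketch below traces the SheepTalk machine /baa+!/ instead, as an assumption. Each configuration is a pair of the current state and the unread input; the names `DELTA` and `trace` are illustrative.

```python
# Assumed automaton: SheepTalk /baa+!/, states 0..4, final state 4.
DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
FINAL_STATES = {4}

def trace(tape: str):
    """Return (configurations, verdict) for a run over tape.

    Each configuration is (state, remaining input), starting with the
    initial configuration (0, tape).
    """
    state = 0
    steps = [(state, tape)]
    for i, symbol in enumerate(tape):
        next_state = DELTA.get((state, symbol))
        if next_state is None:
            return steps, "reject"  # stuck: no transition on this symbol
        state = next_state
        steps.append((state, tape[i + 1:]))
    verdict = "accept" if state in FINAL_STATES else "reject"
    return steps, verdict

for tape in ["baa!", "bab"]:
    steps, verdict = trace(tape)
    for state, rest in steps:
        print(f"q{state}  remaining: {rest!r}")
    print(f'Output: "{verdict}"')
```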

  46. Formalities. Deterministic Finite Accepter (DFA): M = (Q, Σ, δ, q0, F)
     Q: set of states
     Σ: input alphabet
     δ: transition function
     q0: initial state
     F: set of final states

  47. About Alphabets. An alphabet means we need a finite set of symbols in the input. These symbols can and will stand for bigger objects that can have internal structure.

  48. Input Alphabet

  49. Set of States

  50. Initial State
