
CSA3050: Natural Language Algorithms

Presentation Transcript


  1. CSA3050: Natural Language Algorithms Words and Finite State Machinery Natural Language Processing

  2. Acknowledgement
     Material derived from/copied from
     • Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
     • Richard Sproat, Lecture notes

  3. Outline
     • Words
     • Regular Languages
     • Regular Expressions
     • Finite State Automata

  4. What is a Word?
     • A series of speech sounds that symbolizes meaning without being divisible into smaller units
     • Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
     • A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
     • The smallest meaningful element of language. When written it stands alone with a space on either side of it.

  5. Information Associated with Words
     • Spelling
       • orthographic
       • phonological
     • Syntax
       • POS
       • Valency
     • Semantics
       • Meaning
       • Relationship to other words

  6. Properties of Words
     • Sequence
       • characters (e.g. "pollution")
       • phonemes
     • Delimitation
       • whitespace
       • other?
     • Structure
       • simple ("atomic") words
       • complex ("molecular") words

  7. Complex Words
     • Complex words have subparts:
       • e.g. "enlargement": en + large + ment
     • Some subparts are valid words: large
     • Others are prefixes and suffixes: en, ment
     • N.B. The complex word can be built in different ways: (en + large) + ment, or en + (large + ment)

  8. Morphological Processes
     • affixation
       • prefix
       • suffix
       • circumfix: għandi - mgħandix
       • infix: phenidine - phenetidine
     • other morphological processes
       • redoubling (mexa; mexxa)
       • vowel change (swim; swam)

  9. Complex Words Formed by Concatenation
     prefixes + roots + suffixes
     prefixes: dis, re, un, en
     roots:    large, charge, infect, code, decide
     suffixes: ed, ing, ee, er, ly
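The concatenation scheme on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the affix and root lists are the ones on the slide, the empty string is used here to stand for "no prefix" or "no suffix", and the function name `concatenate_all` is made up for the example.

```python
from itertools import product

# Affix and root lists taken from the slide; "" stands for "no affix"
# (an assumption of this sketch, not something the slide states).
PREFIXES = ["", "dis", "re", "un", "en"]
ROOTS = ["large", "charge", "infect", "code", "decide"]
SUFFIXES = ["", "ed", "ing", "ee", "er", "ly"]

def concatenate_all(prefixes, roots, suffixes):
    """Return every prefix + root + suffix string (pure concatenation)."""
    return [p + r + s for p, r, s in product(prefixes, roots, suffixes)]

candidates = concatenate_all(PREFIXES, ROOTS, SUFFIXES)
print(len(candidates))             # 5 * 5 * 6 = 150 candidate strings
print("enlarge" in candidates)     # True
print("enlargeed" in candidates)   # True: spelling changes are not modelled
```

Pure concatenation overgenerates (it also yields strings such as "dislargeee") and ignores spelling adjustments, which is part of why richer finite-state machinery is needed for real morphology.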

  10. The Language of Words
     • What kind of formal language is the language of words?
     • One which can be constructed out of
       • A characteristic set of basic symbols (alphabet)
       • A characteristic set of combining operations
         • Union (disjunction)
         • Concatenation
         • Iteration
     • Regular Language; Regular Sets

  11. Outline
     • Words
     • Regular Languages
     • Regular Expressions
     • Finite State Automata

  12. Regular Languages
     • A regular language is a language over a finite alphabet that can be constructed from finite sets of strings using one or more of the following operations:
       • Set union
       • Concatenation
       • Iteration (Kleene star: zero or more repetitions)
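The three operations can be tried out on small finite sets of strings. The sketch below is an illustration only: the function names are invented here, and because the true Kleene star is an infinite set, `star` takes a hypothetical `max_reps` bound and returns a finite approximation.

```python
def union(A, B):
    """Set union of two languages."""
    return A | B

def concat(A, B):
    """All strings xy with x in A and y in B."""
    return {x + y for x in A for y in B}

def star(A, max_reps):
    """Finite approximation of A*: at most max_reps repetitions."""
    result = {""}          # zero repetitions gives the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, A)
        result |= layer
    return result

# "Zero or more a's followed by zero or more b's", truncated at 3 of each.
a_star_b_star = concat(star({"a"}, 3), star({"b"}, 3))
print("aab" in a_star_b_star)   # True
print("ba" in a_star_b_star)    # False
```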

  13. Some things that are regular languages
     • Zero or more a’s followed by zero or more b’s
     • The set of words in an English dictionary
     • Dates
     • URLs
     • English?

  14. Some things that are not regular languages
     • Zero or more a’s followed by exactly the same number of b’s
     • The set of all English palindromes (e.g. Madam I'm Adam)
     • The set that includes all noun phrases of the form
       • the cat slept
       • the cat the dog bit slept
       • the cat the dog the man fed bit slept
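The contrast between the first example on each of the two preceding slides can be made concrete. In this sketch (function names are illustrative), the regular language a*b* is checked with an ordinary regular expression from Python's `re` module, while the same-number-of-b's language is checked by explicit counting, which is exactly the unbounded memory a finite-state device lacks.

```python
import re

A_STAR_B_STAR = re.compile(r"a*b*")

def in_a_star_b_star(s):
    """Regular: zero or more a's followed by zero or more b's."""
    return A_STAR_B_STAR.fullmatch(s) is not None

def in_an_bn(s):
    """Not regular: n a's followed by exactly n b's (needs counting)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

print(in_a_star_b_star("aaab"))  # True
print(in_an_bn("aaab"))          # False
print(in_an_bn("aabb"))          # True
```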

  15. Some special regular languages
     • The universal language (Σ*)
     • The empty language (Ø)
     Note: the empty language is not the same as the empty string: Ø contains no strings at all, whereas the language {ε} contains exactly one string, the empty one.

  16. Some closure properties of regular languages
     • Intersection
     • Complementation
     • Difference
     • Reversal
     • Power
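Two of these closure properties have short constructive arguments in terms of the deterministic automata introduced later in the deck. The sketch below assumes a DFA is written as a tuple (states, alphabet, start, finals, delta) with a total transition dictionary; that representation and the example machines `even_a` and `ends_b` are assumptions made for illustration. Complementation swaps final and non-final states, and intersection runs two machines in lockstep (the product construction).

```python
from itertools import product

def complement(dfa):
    """Swap final and non-final states (delta must be total)."""
    states, alphabet, start, finals, delta = dfa
    return states, alphabet, start, states - finals, delta

def intersection(d1, d2):
    """Product construction: run both DFAs in lockstep."""
    q1, sigma, s1, f1, t1 = d1
    q2, _, s2, f2, t2 = d2
    states = set(product(q1, q2))
    delta = {((p, q), a): (t1[(p, a)], t2[(q, a)])
             for (p, q) in states for a in sigma}
    finals = {(p, q) for (p, q) in states if p in f1 and q in f2}
    return states, sigma, (s1, s2), finals, delta

def accepts(dfa, text):
    _, _, state, finals, delta = dfa
    for a in text:
        state = delta[(state, a)]
    return state in finals

# Hypothetical example machines over the alphabet {a, b}.
even_a = ({0, 1}, {"a", "b"}, 0, {0},
          {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1})
ends_b = ({0, 1}, {"a", "b"}, 0, {1},
          {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1})

both = intersection(even_a, ends_b)
print(accepts(both, "aab"))              # True: two a's and ends in b
print(accepts(complement(even_a), "a"))  # True: odd number of a's
```

Difference then follows from these two, since A minus B is the intersection of A with the complement of B.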

  17. Characterising Classes of Set
     [Diagram linking: MACHINE, CLASS OF SETS or LANGUAGES, NOTATION]

  18. Outline
     • Words
     • Regular Languages
     • Regular Expressions
     • Finite Automata

  19. Regular Expressions
     • Notation for describing regular sets
     • Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
     • Xerox Finite State tools use a somewhat different notation, but with similar function.

  20. Regular Expressions
     a        a simple symbol
     A B      concatenation
     A | B    alternation operator
     A & B    intersection operator
     A*       Kleene star
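Most of this notation carries over directly to Python's `re` module, which is used below purely as a convenient stand-in for the tools the slides mention. One difference worth flagging: `re` has no intersection operator, so the hypothetical helper `matches_both` emulates A & B by testing both patterns separately.

```python
import re

def matches(pattern, text):
    """True if the whole of text is described by pattern."""
    return re.fullmatch(pattern, text) is not None

def matches_both(p1, p2, text):
    """Emulated intersection: text must match both patterns."""
    return matches(p1, text) and matches(p2, text)

print(matches("ab", "ab"))                     # concatenation: True
print(matches("a|b", "b"))                     # alternation: True
print(matches("(a|b)*", "abba"))               # Kleene star: True
print(matches_both("a*b*", "(ab)*", "ab"))     # in both languages: True
print(matches_both("a*b*", "(ab)*", "aabb"))   # not in (ab)*: False
```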

  21. Characterising Classes of Set
     [Diagram linking: MACHINE, CLASS OF SETS or LANGUAGES, NOTATION]

  22. Outline
     • Words
     • Regular Languages
     • Regular Expressions
     • Finite Automata

  23. Finite Automaton
     • A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
       • Q is a finite set of states
       • Σ is an alphabet of symbols
       • q0 ∈ Q is a start state
       • F ⊆ Q is the set of final states
       • δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
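The quintuple translates almost literally into a data structure. The following is a minimal sketch, not the course's own implementation: the class name `FSA`, the field names and the example machine are assumptions, and δ is stored as a set of (q, σ, q') triples so that it can be a relation rather than a function.

```python
from typing import NamedTuple

class FSA(NamedTuple):
    states: frozenset     # Q
    alphabet: frozenset   # Σ
    start: str            # q0
    finals: frozenset     # F
    delta: frozenset      # transition relation: (q, σ, q') triples

def accepts(m, text):
    """Track every state the machine could be in after each symbol."""
    current = {m.start}
    for symbol in text:
        current = {q2 for (q1, s, q2) in m.delta
                   if q1 in current and s == symbol}
    return bool(current & m.finals)

# Hypothetical example: strings over {a, b} that end in "ab".
ends_in_ab = FSA(
    states=frozenset({"q0", "q1", "q2"}),
    alphabet=frozenset("ab"),
    start="q0",
    finals=frozenset({"q2"}),
    delta=frozenset({("q0", "a", "q0"), ("q0", "b", "q0"),
                     ("q0", "a", "q1"), ("q1", "b", "q2")}),
)

print(accepts(ends_in_ab, "aab"))  # True
print(accepts(ends_in_ab, "aba"))  # False
```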

  24. Representation of FSAs: State Diagram

  25. State Table

  26. Mr. Kleene

  27. Kleene’s theorem
     • The languages accepted by NFAs are exactly the languages described by regular expressions.
     • Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.
     • Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.
     http://www.cs.may.ie/~jpower/Courses/parsing/node6.html

  28. Converting a Regular Expression to an NFA
     • The NFA representing the empty string is: 1 --ε--> 2
     • The NFA representing a single character is: 1 --a--> 2

  29. Regular Expression to NFA (diagram from Leonidas Fegaras, Univ. Texas)
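Since the diagram itself did not survive in the transcript, here is one common way to carry out the conversion, usually called Thompson's construction; whether the slide's diagram used exactly these gadgets is an assumption. The sketch works on a small hand-built syntax tree rather than a parsed string, and the tuple shapes ("eps", "sym", "cat", "alt", "star"), the `EPS` marker and the function names are all invented for the example.

```python
from itertools import count

EPS = None           # label for ε-arcs (a convention of this sketch)
_fresh = count(1)    # supplies new state numbers

def build(node):
    """Return (start, final, arcs) for a regex syntax-tree node."""
    kind = node[0]
    if kind in ("eps", "sym"):
        start, final = next(_fresh), next(_fresh)
        label = EPS if kind == "eps" else node[1]
        return start, final, {(start, label, final)}
    if kind == "cat":
        s1, f1, a1 = build(node[1])
        s2, f2, a2 = build(node[2])
        return s1, f2, a1 | a2 | {(f1, EPS, s2)}
    if kind == "alt":
        s1, f1, a1 = build(node[1])
        s2, f2, a2 = build(node[2])
        start, final = next(_fresh), next(_fresh)
        new = {(start, EPS, s1), (start, EPS, s2),
               (f1, EPS, final), (f2, EPS, final)}
        return start, final, a1 | a2 | new
    if kind == "star":
        s1, f1, a1 = build(node[1])
        start, final = next(_fresh), next(_fresh)
        new = {(start, EPS, s1), (f1, EPS, final),
               (start, EPS, final), (f1, EPS, s1)}
        return start, final, a1 | new
    raise ValueError(f"unknown node kind: {kind!r}")

def eps_closure(states, arcs):
    """All states reachable from `states` using only ε-arcs."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for (src, label, dst) in arcs:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, text):
    start, final, arcs = nfa
    current = eps_closure({start}, arcs)
    for ch in text:
        moved = {dst for (src, label, dst) in arcs
                 if src in current and label == ch}
        current = eps_closure(moved, arcs)
    return final in current

# (a|b)*c as a hand-built syntax tree
nfa = build(("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "c")))
print(accepts(nfa, "abac"))  # True
print(accepts(nfa, "ab"))    # False
```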

  30. Deterministic Finite Automata
     • In deterministic finite automata (DFA), every state/symbol pair maps to a unique state
     • In other words, δ is a function
     • Why do we care about DFAs?

  31. Deterministic Finite Automata
     • In deterministic finite automata (DFA), every state/symbol pair maps to a unique state
     • In other words, δ is a function
     • Why do we care about DFAs?
     • EFFICIENCY!!
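The efficiency point is easy to see in code: when δ is a function, recognition needs a single current state and one table lookup per input symbol. The sketch below assumes δ is stored as a Python dictionary keyed by (state, symbol) pairs; the example machine for a*b*, including its explicit dead state 2, is made up for illustration.

```python
def run_dfa(delta, start, finals, text):
    """One dictionary lookup per symbol: time linear in len(text)."""
    state = start
    for symbol in text:
        state = delta[(state, symbol)]
    return state in finals

# Hypothetical DFA for a*b*: state 0 reads a's, state 1 reads b's,
# state 2 is a dead state entered when an a follows a b.
DELTA = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "a"): 2, (1, "b"): 1,
    (2, "a"): 2, (2, "b"): 2,
}

print(run_dfa(DELTA, 0, {0, 1}, "aabbb"))  # True
print(run_dfa(DELTA, 0, {0, 1}, "aba"))    # False
```

By contrast, simulating a nondeterministic machine means tracking a whole set of possible states at every step.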

  32. Equivalence of NFAs and DFAs

  33. Subset Construction for Determinisation
     • States which are connected by an ε-transition will be represented by the same state in the DFA.
     • If there are multiple transitions on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).
     • Thus these states will be combined into a single DFA state.
     • More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
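The bullets above can be sketched as a short worklist algorithm. This is an illustration under assumptions rather than the construction as drawn on the following slides: the NFA's δ is a dictionary from (state, symbol) to a set of states, `EPS` marks ε-transitions, each DFA state is a frozenset of NFA states, and the example NFA (strings ending in "ab", with an extra ε-arc from a new start state 3) is invented here.

```python
from collections import deque

EPS = None  # key used for ε-transitions (a convention of this sketch)

def eps_closure(states, delta):
    """States reachable from `states` through ε-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in delta.get((q, EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)

def determinise(start, finals, delta, alphabet):
    """Subset construction: each DFA state is a set of NFA states."""
    d_start = eps_closure({start}, delta)
    d_delta, d_finals = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        S = queue.popleft()
        if S & finals:
            d_finals.add(S)
        for a in alphabet:
            moved = set()
            for q in S:
                moved |= delta.get((q, a), set())
            T = eps_closure(moved, delta)
            if not T:
                continue          # no arc: implicit dead state
            d_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return d_start, d_finals, d_delta

def dfa_accepts(dfa, text):
    state, finals, delta = dfa
    for ch in text:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in finals

# Hypothetical NFA: start state 3, ε-arc to 0, accepts strings ending in "ab".
NFA_DELTA = {(3, EPS): {0}, (0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa = determinise(3, {2}, NFA_DELTA, "ab")

print(sorted(dfa[0]))            # [0, 3]: ε-connected states merged
print(dfa_accepts(dfa, "aab"))   # True
print(dfa_accepts(dfa, "aba"))   # False
```

In the worst case the number of DFA states is exponential in the number of NFA states, although only the reachable subsets are ever built here.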

  34. Subset construction for determinization

  35. Subset construction for determinization

  36. Subset construction for determinization

  37. Subset construction for determinization

  38. Subset construction for determinization

  39. Subset construction for determinization
