Introduction to Computational Linguistics - PowerPoint PPT Presentation

Presentation Transcript

  1. Introduction to Computational Linguistics: Words and Finite State Machinery • CLINT-CS Finite State

  2. Acknowledgement • Material derived from/copied from: • Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000 • Richard Sproat, Lecture notes

  3. Finite State Methods • Word-Oriented Application Areas • Tokenization • Sentence breaking • Spelling correction • Morphology (analysis/generation) • Phonological disambiguation (Speech Recognition) • Morphological disambiguation (“Tagging”) • Pattern matching (“Named Entity Recognition”) • Shallow Parsing

  4. Outline • Words • Regular Languages • Regular Expressions • Finite State Automata

  5. What is a Word? Some Distinctions • Written • Spoken • Word Type • Word Token
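
The type/token distinction can be made concrete with a small count. This is only a sketch: the sample sentence and the whitespace tokenization are assumptions made for illustration, not part of the slides.

```python
from collections import Counter

# Sample text invented for this illustration.
text = "the cat sat on the mat because the mat was warm"

# Naive whitespace tokenization (an assumption; real tokenizers do more).
tokens = text.split()

# Each distinct word form is one type; every occurrence is one token.
type_counts = Counter(tokens)

num_tokens = len(tokens)
num_types = len(type_counts)
print(num_tokens, num_types)
```

Here "the" is a single word type that occurs as three word tokens.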

  6. Information Associated with Words • Spelling • orthographic • phonological • Syntax • POS • Valency • Semantics • Meaning • Relationship to other words

  7. Properties of Words • Sequence • characters (e.g. "pollution") • phonemes • Delimitation • whitespace • other? • Structure • simple ("atomic") words • complex ("molecular") words

  8. Complex Words • Complex words have subparts: • e.g. "enlargement": en + large + ment • Some subparts are valid words: large • Others are prefixes and suffixes: en, ment • N.B. The complex word can be built in different ways: (en + large) + ment, or en + (large + ment)

  9. Morphological Processes • affixation • prefix • suffix • circumfix: għandi - mgħandix • infix: phenidine; phenetidine • other morphological processes • redoubling (mexa; mexxa) • vowel change (swim; swam)

  10. Affixation uses Concatenation • prefixes: dis, re, un, en • roots: large, charge, infect, code, decide • suffixes: ed, ing, ee, er, ly • prefix + root + suffix
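
As a rough sketch of affixation as plain concatenation, the lists from the slide can be combined mechanically. The empty string standing for "no affix" is an assumption added here, and the result deliberately over-generates: nothing in pure concatenation rules out forms such as "unlargeee" or handles spelling changes such as "decide" + "ed".

```python
from itertools import product

# Affix and root lists taken from the slide; "" (no affix) is an added assumption.
prefixes = ["", "dis", "re", "un", "en"]
roots = ["large", "charge", "infect", "code", "decide"]
suffixes = ["", "ed", "ing", "ee", "er", "ly"]

# Every prefix + root + suffix combination, built by concatenation alone.
candidates = {p + r + s for p, r, s in product(prefixes, roots, suffixes)}

for word in ("enlarge", "recharge", "disinfected", "unlargeee"):
    print(word, word in candidates)
```

The over-generation is the point of the exercise: concatenation gives the raw combinatorics, and a real morphological analyser needs further constraints on top.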

  11. The Language of Words • What kind of formal language is the language of words? • One which can be constructed out of • A characteristic set of basic symbols (alphabet) • A characteristic set of combining operations • Union (disjunction) • Concatenation • Iteration • Regular Language; Regular Sets

  12. Characterising Classes of Set • [Diagram relating three notions: MACHINE, CLASS OF SETS or LANGUAGES, NOTATION]

  13. Outline • Words • Regular Languages • Regular Expressions • Finite State Automata

  14. Regular Languages • A regular language is a language over a finite alphabet that can be built up from finite sets of strings using the following operations: • Set union • Concatenation • Reflexive-transitive closure (Kleene star)

  15. Some things that are regular languages • Zero or more a’s followed by zero or more b’s • The set of words in an English dictionary • Dates • URLs • English?
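
The first example, zero or more a's followed by zero or more b's, can be checked with an ordinary regular expression. The sketch below uses Python's `re` module; the function name `in_language` is hypothetical.

```python
import re

# a*b*: zero or more a's followed by zero or more b's.
A_STAR_B_STAR = re.compile(r"a*b*")

def in_language(s: str) -> bool:
    """Return True if the whole string s belongs to the language a*b*."""
    return A_STAR_B_STAR.fullmatch(s) is not None

for s in ("", "aab", "bbb", "aba", "ba"):
    print(repr(s), in_language(s))
```

`fullmatch` is used rather than `search` so that the entire string, not just a substring, has to fit the pattern.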

  16. Some things that are not regular languages • Zero or more a’s followed by exactly the same number of b’s • The set of all English palindromes (e.g. Madam I'm Adam) • The set that includes all noun phrases of the form • the cat slept • the cat the dog bit slept • the cat the dog the man fed bit slept
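
To see why the first example falls outside the regular languages, it helps to look at what a recogniser needs. The sketch below (the function name `is_anbn` is hypothetical) keeps an integer counter that can grow without bound, which is exactly the kind of memory a machine with a fixed, finite set of states does not have.

```python
def is_anbn(s: str) -> bool:
    """Recognise { a^n b^n : n >= 0 } using an explicit, unbounded counter."""
    count = 0
    i = 0
    # Count the leading a's.
    while i < len(s) and s[i] == "a":
        count += 1
        i += 1
    # Cancel one a for each following b.
    while i < len(s) and s[i] == "b":
        count -= 1
        i += 1
    # Accept only if the input is exhausted and the counts balance.
    return i == len(s) and count == 0

for s in ("", "ab", "aabb", "aab", "abab"):
    print(repr(s), is_anbn(s))
```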

  17. Some special regular languages • The universal language (Σ*) • The empty language (Ø) Note: the empty language is not the same as the empty string

  18. Some closure properties of regular languages • Intersection • Complementation • Difference • Reversal • Power
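
Closure under intersection can be illustrated with the standard product construction on deterministic automata. This is a minimal sketch under stated assumptions: the tuple encoding `(start, finals, delta)`, the function names, and the two example machines (even number of a's; strings ending in b) are all invented for illustration.

```python
def intersect(dfa1, dfa2):
    """Product construction: accepts exactly the strings both DFAs accept."""
    start1, finals1, delta1 = dfa1
    start2, finals2, delta2 = dfa2
    start = (start1, start2)
    delta, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        q1, q2 = state = todo.pop()
        if q1 in finals1 and q2 in finals2:
            finals.add(state)
        for (p, sym), r1 in delta1.items():
            # Follow only symbols on which both machines can move.
            if p != q1 or (q2, sym) not in delta2:
                continue
            target = (r1, delta2[(q2, sym)])
            delta[(state, sym)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, finals, delta

def accepts(dfa, s):
    """Run a DFA given as (start, finals, delta) over the string s."""
    state, finals, delta = dfa
    for sym in s:
        if (state, sym) not in delta:
            return False
        state = delta[(state, sym)]
    return state in finals

# Example machines over {a, b}, invented for this sketch.
even_a = (0, {0}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1})
ends_b = (0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1})
both = intersect(even_a, ends_b)

print(accepts(both, "aab"), accepts(both, "ab"))
```

Since the product of two finite automata is itself a finite automaton, the intersection of two regular languages is again regular.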

  19. Characterising Classes of Set • [Diagram relating three notions: MACHINE, CLASS OF SETS or LANGUAGES, NOTATION]

  20. Outline • Words • Regular Languages • Regular Expressions • Finite Automata

  21. Regular Expressions • Notation for describing regular sets • Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word) • Xerox Finite State tools use a somewhat different notation, with similar function.

  22. Regular Expressions • a : a simple symbol • A B : concatenation • A | B : alternation operator • A & B : intersection operator • A* : Kleene star

  23. Caveats • Perl and other languages (see J&M, Chapter 2) have lots of stuff in their “regular expression” syntax. Strictly speaking, not all of these correspond to regular expressions in the formal sense since they don’t describe regular languages. • For example, arbitrary substring copying is not expressible as a regular language, though one can do this in Perl (or Python …) /(…+)\1/
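
A small Python illustration of the copying caveat follows. The pattern `(.+)\1` is an assumption chosen for this sketch (it is not necessarily the exact pattern on the slide): the backreference `\1` must repeat whatever the group matched, so the pattern describes strings of the form ww, which no formal regular expression can describe.

```python
import re

# (.+)\1 : some non-empty substring, immediately followed by a copy of itself.
COPY = re.compile(r"(.+)\1")

def is_doubled(s: str) -> bool:
    """Return True if s consists of a non-empty string written twice."""
    return COPY.fullmatch(s) is not None

for s in ("abab", "abcabc", "aba", "aa"):
    print(s, is_doubled(s))
```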

  24. Characterising Classes of Set • [Diagram relating three notions: MACHINE, CLASS OF SETS or LANGUAGES, NOTATION]

  25. Outline • Words • Regular Languages • Regular Expressions • Finite Automata

  26. Finite Automaton • A finite automaton is a quintuple (Q, Σ, q0, F, δ) where: • Q is a finite set of states • Σ is an alphabet of symbols • q0 ∈ Q is a start state • F ⊆ Q is the set of final states • δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
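
The quintuple can be written down almost literally as a data structure. The sketch below is one possible encoding, not a definitive one: the class name, field names and the example machine (strings over {a, b} ending in "ab") are assumptions, and ε-transitions are left out for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FiniteAutomaton:
    states: frozenset    # Q
    alphabet: frozenset  # Σ
    start: str           # q0, a member of Q
    finals: frozenset    # F, a subset of Q
    delta: frozenset     # transition relation: a set of (q, σ, q') triples

    def accepts(self, s: str) -> bool:
        """Track every state reachable on the input read so far."""
        current = {self.start}
        for sym in s:
            current = {q2 for (q1, a, q2) in self.delta
                       if q1 in current and a == sym}
        return bool(current & self.finals)

# Example machine invented for this sketch: strings over {a, b} ending in "ab".
ends_in_ab = FiniteAutomaton(
    states=frozenset({"q0", "q1", "q2"}),
    alphabet=frozenset({"a", "b"}),
    start="q0",
    finals=frozenset({"q2"}),
    delta=frozenset({("q0", "a", "q0"), ("q0", "b", "q0"),
                     ("q0", "a", "q1"), ("q1", "b", "q2")}),
)

print(ends_in_ab.accepts("aab"), ends_in_ab.accepts("ba"))
```

Because δ is stored as a relation rather than a function, state q0 may have two outgoing a-transitions, and `accepts` has to carry a set of current states.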

  27. Representation of FSAs: State Diagram

  28. State Table

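  29. Prolog • [State diagram: 1 (initial) -h-> 2 -a-> 3 -!-> 4 (final), with an arc 3 -h-> 2 back to state 2]
      % State 1 is the initial state and state 4 is the final state.
      initial(1).
      final(4).
      % arc(From, To, Symbol): one fact per transition.
      arc(1,2,h).
      arc(2,3,a).
      arc(3,4,!).
      arc(3,2,h).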
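
For readers more at home in Python, the same facts can be transcribed and walked recursively, much as a Prolog recogniser would do it. This is a sketch: the names `INITIAL`, `FINAL`, `ARCS` and `recognise` are hypothetical, and only the four arcs come from the slide.

```python
# The Prolog facts from the slide, transcribed as Python data.
INITIAL = 1
FINAL = {4}
ARCS = [(1, 2, "h"), (2, 3, "a"), (3, 4, "!"), (3, 2, "h")]  # (from, to, symbol)

def recognise(state, rest):
    """Accept if the remaining input leads from state to a final state."""
    if not rest:
        return state in FINAL
    # Try every arc that leaves this state on the next input symbol.
    return any(recognise(dst, rest[1:])
               for src, dst, sym in ARCS
               if src == state and sym == rest[0])

for s in ("ha!", "haha!", "ha", "hah!"):
    print(s, recognise(INITIAL, s))
```

The machine accepts "ha!", "haha!", "hahaha!" and so on, because the arc from state 3 back to state 2 allows the "ha" portion to repeat.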

  30. Mr. S.K.

  31. Kleene’s theorem • The languages accepted by NFAs are exactly the languages described by regular expressions. • Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA. • Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.

  32. Converting a Regular Expression to an NFA • The NFA representing the empty string is: 1 -ε-> 2 • The NFA representing a single character a is: 1 -a-> 2

  33. Converting a Regular Expression to an NFA • The union operator is represented by a choice of paths from a node, e.g. a|b is represented by two arcs from state 1 to state 2, one labelled a and one labelled b

  34. Converting a Regular Expression to an NFA • Concatenation simply involves connecting one NFA to the other, so that ab is represented by 1 -a-> 2 -b-> 3

  35. Converting a Regular Expression to an NFA • The Kleene star must allow for zero or more occurrences. So a* is represented by [diagram: an NFA with one arc labelled a and four ε-arcs]
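
The four constructions on these slides can be sketched as functions that build NFA fragments. This is an illustration under stated assumptions, not the slides' exact diagrams: it uses the ε-heavy Thompson-style variant, a fragment is encoded as `(start, end, edges)`, `None` stands for ε, and all function names are hypothetical.

```python
from itertools import count

_fresh = count(1)  # supply of new state numbers
EPS = None         # label used for ε-transitions

def symbol(a):
    s, e = next(_fresh), next(_fresh)
    return s, e, [(s, a, e)]

def union(f, g):
    # New start and end states, with ε-arcs offering a choice of paths.
    s, e = next(_fresh), next(_fresh)
    return s, e, f[2] + g[2] + [(s, EPS, f[0]), (s, EPS, g[0]),
                                (f[1], EPS, e), (g[1], EPS, e)]

def concat(f, g):
    # Connect the end of f to the start of g.
    return f[0], g[1], f[2] + g[2] + [(f[1], EPS, g[0])]

def star(f):
    # Allow skipping f entirely (zero occurrences) or looping back (more).
    s, e = next(_fresh), next(_fresh)
    return s, e, f[2] + [(s, EPS, f[0]), (f[1], EPS, e),
                         (s, EPS, e), (f[1], EPS, f[0])]

def eps_closure(states, edges):
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(frag, s):
    start, end, edges = frag
    current = eps_closure({start}, edges)
    for ch in s:
        moved = {dst for src, label, dst in edges
                 if src in current and label == ch}
        current = eps_closure(moved, edges)
    return end in current

# (a|b)* a b, assembled from the pieces above.
frag = concat(star(union(symbol("a"), symbol("b"))),
              concat(symbol("a"), symbol("b")))
print(accepts(frag, "bbab"), accepts(frag, "ba"))
```

Each constructor returns a fragment with exactly one start and one end state, which is what lets the pieces be composed freely.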

  36. Deterministic versus non-deterministic finite automata • The definition of finite-state automata given above was for non-deterministic finite automata (NFA): • δ is a relation, meaning that from any state and given any symbol, one can in principle transition to any number of states. • In deterministic finite automata (DFA), every state/symbol pair maps to a unique state • In other words, δ is a function
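
Whether a given transition relation happens to be a function can be checked mechanically. The sketch below assumes δ is given as a collection of (state, symbol, target) triples without ε-transitions, and it tests only for "at most one target per state/symbol pair" (a partial function); the function name is hypothetical.

```python
def is_deterministic(delta):
    """Return True if no (state, symbol) pair has two different targets."""
    seen = set()
    for q, sym, _target in set(delta):  # set() drops repeated identical triples
        if (q, sym) in seen:
            return False
        seen.add((q, sym))
    return True

# Two example relations, invented for this sketch.
dfa_like = {(1, "h", 2), (2, "a", 3), (3, "!", 4), (3, "h", 2)}
nfa_like = {(0, "a", 0), (0, "a", 1), (1, "b", 2)}

print(is_deterministic(dfa_like), is_deterministic(nfa_like))
```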

  37. A deterministic automaton

  38. NFAs vs DFAs • NFAs are typically smaller and simpler than their equivalent DFAs • Why do we care about DFAs?

  39. NFAs vs DFAs • NFAs are typically smaller and simpler than their equivalent DFAs • Why do we care about DFAs? • EFFICIENCY!

  40. Equivalence of NFAs and DFAs

  41. Subset Construction for Determinisation • A state reached from another by an ε-transition may as well be grouped with it, since we can move from one to the other without consuming any character. • Thus states which are connected by an ε-transition will be represented by the same state in the DFA. • If there are multiple transitions based on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol). • Thus these states will be combined into a single DFA state. • More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
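
The two ideas above (ε-closure, and sets of NFA states acting as single DFA states) can be sketched as follows. The encoding is an assumption made for illustration: edges are (source, label, target) triples, `None` stands for ε, each DFA state is a frozenset of NFA states, and the function names are hypothetical.

```python
def eps_closure(states, edges):
    """All states reachable from the given states via ε-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def determinise(start, finals, edges):
    """Subset construction; returns (start, finals, delta) of a DFA."""
    finals = set(finals)
    alphabet = {label for _, label, _ in edges if label is not None}
    d_start = frozenset(eps_closure({start}, edges))
    d_delta, d_finals = {}, set()
    todo, seen = [d_start], {d_start}
    while todo:
        S = todo.pop()
        if S & finals:
            d_finals.add(S)
        for a in alphabet:
            # Union of all states reachable from S on symbol a.
            moved = {dst for src, label, dst in edges
                     if src in S and label == a}
            if not moved:
                continue
            T = frozenset(eps_closure(moved, edges))
            d_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return d_start, d_finals, d_delta

def run(dfa, s):
    state, finals, delta = dfa
    for ch in s:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in finals

# Example NFA invented for this sketch: strings over {a, b} ending in "ab".
nfa_edges = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2)]
dfa = determinise(0, {2}, nfa_edges)
print(run(dfa, "aab"), run(dfa, "abb"))
```

Only subsets that are actually reachable from the start set are built, which in practice keeps the resulting DFA far smaller than the worst-case 2^|Q| states.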

  42. Xerox Tools: Finite State Machinery

  43. The Xerox Approach • Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi. • Meta-languages for describing regular languages and regular relations. • Compiler for mapping meta-language "programs" into efficient FS machinery • Several tools and applications

  44. Xerox Tools • xfst: Xerox Finite-State Tool • lexc: Finite-State Lexicon Compiler • twolc: Two-Level Rule Compiler

  45. xfst • xfst is a general tool for creating and manipulating finite state networks, both simple automata and transducers. • xfst and other Xerox tools employ a special "xfst notation" (more powerful than that used in Unix, Perl, C# etc.)

  46. Simple Regular Expressions • Atomic Expressions • Simple Symbols • Multicharacter Symbols • Complex Expressions • Union • Intersection • Concatenation

  47. xfst Notation Examples • A|B : Union • A&B : Intersection • A B : Concatenation • A* : Closure (Kleene Star) • (A) : Optional Element • ? : Any symbol • \b : Any symbol other than b • ~A : Complement (= [?* - A ]) • 0 : Empty string language • $A : Contains A (= [ ?* A ?* ])

  48. Concatenation over Reg. Expression and Language • Regular Expression: E1 = [a|b]; E2 = [c|d]; E1 E2 = [a|b] [c|d] • Language: L1 = {"a", "b"}; L2 = {"c", "d"}; L1 L2 = {"ac", "ad", "bc", "bd"}
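
On the language side, concatenation is just pairwise string concatenation over two sets. A minimal sketch, with a hypothetical helper name `concat`:

```python
L1 = {"a", "b"}
L2 = {"c", "d"}

def concat(x, y):
    """Concatenation of two languages: every u from x followed by every v from y."""
    return {u + v for u in x for v in y}

print(sorted(concat(L1, L2)))
```

The same helper shows the difference between the empty language and the language containing only the empty string: `concat(L1, set())` is empty, whereas `concat(L1, {""})` gives back `L1`.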

  49. Concatenation over FS Automata • [Diagram: an automaton with arcs a and b, plus an automaton with arcs c and d, equals an automaton in which a or b is followed by c or d]

  50. Simple Commands • In addition to the notation there are also commands, e.g. • define: give a name to an RE • print: print information • read: read information • various stack operations • file interaction • various command line options