1 / 42

Computational Linguistics Introduction

Computational Linguistics Introduction. Finite State Machinery and Language Description. Acknowledgement. The material for this lecture is derived from a series of talks given by Dr. Ken Beesley (Xerox European Research Centre, Grenoble) in Malta, 2001. Today’s Topics.

levana
Download Presentation

Computational Linguistics Introduction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Computational Linguistics Introduction Finite State Machinery and Language Description CLINT-LIN: Finite State Machinery

  2. Acknowledgement • The material for this lecture is derived from a series of talks given by • Dr. Ken Beesley (Xerox European Research Centre, Grenoble) • in Malta, 2001. CLINT-LIN: Finite State Machinery

  3. Today’s Topics • Finite State Technology • Regular Languages and Relations • Review of Set Theory • Understand the mathematical operations that can be performed on such Languages. • Understand how Languages, Relations, Regular Expressions, and Networks are interrelated. CLINT-LIN: Finite State Machinery

  4. What is Finite State Technology? • Finite State Technology refers to a collection of techniques for application of Finite State Automata (FSA) to a range of linguistically motivated problems. • Such Techniques include • Design of user languages for specifying FSA • Compilation of such languages into efficient transition networks. • Development environments and runtime systems CLINT-LIN: Finite State Machinery

  5. What is Finite-State Technology Good For? • Finite-state techniques cannot handle central embedding • the man the dog the cat bit followed ate. • They are well suited to “lower-level” natural language processing such as • Tokenization – what is the next word? • Spelling error detection: does the next word belong to a list? • Morphological/phonological analysis/generation • Shallow syntactic parsing and “chunking” CLINT-LIN: Finite State Machinery

  6. Tokenisation Problems • VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday.(example from Mary Dalrymple, University of London) • VfB Stuttgart, Manchester United • succession • 2-1 • Wednesday • Finite state techniques provide a means to specify the language of words, thus defining what it means to be the next token. • There are three ways to specify such languages CLINT-LIN: Finite State Machinery

  7. MACHINE Languages,Notations and Machines LANGUAGE (set of strings) NOTATION CLINT-LIN: Finite State Machinery

  8. FINITE STATE AUTOMATON Languages,Notations and Machines FINITE STATE LANGUAGE FINITE STATE NOTATION CLINT-LIN: Finite State Machinery

  9. FINITE STATE AUTOMATA:preliminary definition • A finite state automaton includes: • A finite set of states • A finite set of labelled transitions between states CLINT-LIN: Finite State Machinery

  10. Physical Machines with Finite States • The Lightswitch Machine UP OFF ON DOWN CLINT-LIN: Finite State Machinery

  11. Physical Machines with Finite States • The Lightswitch Toggle Machine PUSH OFF ON PUSH CLINT-LIN: Finite State Machinery

  12. The Five Cent Machine • Problem: • Assume you have one, two, and five cent pieces • Design a finite state automaton which accepts exactly 5 cents. CLINT-LIN: Finite State Machinery

  13. The Cola Machine • Need to enter 25 cents (USA) to get a drink • Accepts the following coins: • Nickel = 5 cents • Dime = 10 cents • Quarter = 25 cents • For simplicity, our machine needs exact change • We will model only the coin-accepting mechanism CLINT-LIN: Finite State Machinery

  14. Physical Machines with Finite States • The Cola Machine Start State Final State N N N N N 5 10 15 20 25 0 D D D D Q CLINT-LIN: Finite State Machinery

  15. The Cola Machine Language • List of all the sequences of coins accepted: • { Q, DDN, DND, NDD, DNNN, NDNN, NNDNNNND, NNNNN } • Think of the coins as SYMBOLS or CHARACTERS • The set of symbols accepted is the ALPHABET of the machine • Think of sequences of coins as WORDS or “strings” • The set of words accepted by the machine is its LANGUAGE CLINT-LIN: Finite State Machinery

  16. FINITE STATE AUTOMATA:better definition • A finite state automaton includes: • A finite set of states • Initial State • Final State (s) • A finite set of labelled transitions beween states • Labels are symbols from an alphabet • Recognises a language • Generates a language as well! CLINT-LIN: Finite State Machinery

  17. A Network that Accepts aOne Word Language c n a t o CLINT-LIN: Finite State Machinery

  18. A Network that Accepts aThree Word Language a n t o c t g i r e m a e s CLINT-LIN: Finite State Machinery

  19. Scaling Up the Network • Imagine the same network expanded to handle three million words, all of them corresponding to valid words of a given language. • We supply a word and ‘apply’ it to the network. If it is accepted by the network, then it is a valid word. Otherwise it does not belong to the language • This is the basis for a Spanish spelling error detector. CLINT-LIN: Finite State Machinery

  20. Looking Up a Word a n t o c t g i r e m a e s “Apply” m e s a CLINT-LIN: Finite State Machinery

  21. Lookup Failure • Lookup succeeds when all input is consumed and final state is reached. Lookup can fail because: • Not all input is consumed ("libro", "tigra") • Input is fully consumed but state is not final ("cant") • Final state is reached but there is still unconsumed output ("mesas") CLINT-LIN: Finite State Machinery

  22. Shared Structure c l e a r v e e CLINT-LIN: Finite State Machinery

  23. Transducers “Lookdown” mesa+Noun+Fem+Pl m e s a +Noun +Fem +Pl m e s a 0 0 s “Lookup” m e s a s CLINT-LIN: Finite State Machinery

  24. A Morphological Analyzer dog +n +pl Transducer dogs CLINT-LIN: Finite State Machinery

  25. A Morphological Analyzer Lexical Language Transducer Surface Language CLINT-LIN: Finite State Machinery

  26. A Quick Review of Set Theory • A set is a collection of objects. B A E D We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }. There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc. CLINT-LIN: Finite State Machinery

  27. Uniqueness of Elements • You cannot have two or more • ‘A’ elements in the same set B A D E { A, A, D, B, E} is just a redundant specification of the set { A, D, B, E }. CLINT-LIN: Finite State Machinery

  28. Cardinality of Sets • The Empty Set: • A Finite Set: • An Infinite Set: e.g. The Set of all Positive Integers Norway Denmark Sweden CLINT-LIN: Finite State Machinery

  29. Simple Operations on Sets: Union A B D E C Set 1 Set 2 B C A D E Union of Set1 and Set 2 CLINT-LIN: Finite State Machinery

  30. Simple Operations on Sets (2): Union A B C D C Set 1 Set 2 B C A D Union of Set1 and Set 2 CLINT-LIN: Finite State Machinery

  31. Simple Operations on Sets (3): Intersection A B C D C Set 1 Set 2 C Intersection of Set1 and Set 2 CLINT-LIN: Finite State Machinery

  32. Simple Operations on Sets (4): Subtraction A B C D C Set 1 Set 2 A B Set 1 minus Set 2 CLINT-LIN: Finite State Machinery

  33. Formal Languages Very Important Concept in Formal Language Theory: A Language is just a Set of Words. • We use the terms “word” and “string” interchangeably. • A Language can be empty, have finite cardinality, or be infinite in size. • You can union, intersect and subtract languages, just like any other sets. CLINT-LIN: Finite State Machinery

  34. Union of Languages (Sets) dog cat rat elephant mouse Language 1 Language 2 dog cat rat elephant mouse Union of Language 1 and Language 2 CLINT-LIN: Finite State Machinery

  35. Intersection of Languages (Sets) dog cat rat elephant mouse Language 1 Language 2 Intersection of Language 1 and Language 2 CLINT-LIN: Finite State Machinery

  36. Intersection of Languages (Sets) dog cat rat rat mouse Language 1 Language 2 rat Intersection of Language 1 and Language 2 CLINT-LIN: Finite State Machinery

  37. Subtraction of Languages (Sets) dog cat rat rat mouse Language 1 Language 2 dog cat Language 1 minus Language 2 CLINT-LIN: Finite State Machinery

  38. Languages • A language is a set of words (=strings). • Words (strings) are composed of symbols (letters) that are “concatenated” together. • At another level, words are composed of “morphemes”. • In most natural languages, we concatenate morphemes together to form whole words. For sets consisting of words (i.e. for Languages), the operation of concatenation is very important. CLINT-LIN: Finite State Machinery

  39. Concatenation of Languages work talk walk 0 ing ed s Root Language Suffix Language The concatenation of the Suffix language after the Root language. work working worked works talk talking talked talks walk walking walked walks CLINT-LIN: Finite State Machinery

  40. Languages and Networks 0 t a s w a l k s s o i r n g e Network/Language 1 d Network/Language 2 0 a t s w a l k The concatenation of Network 1 and Network 2 s i o n g r e d CLINT-LIN: Finite State Machinery

  41. Why is “Finite State” Computing so Interesting? • Finite-state systems are mathematically elegant, easily manipulated and modifiable. • Computationally efficient. Usually very compact. • The programming we linguists do is declarative. We describe the facts of our natural language; i.e. we write grammars. We do not hack ad hoc code. • The runtime code, which applies our systems to linguistic input, is already written and it is completely language-independent. • Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate. CLINT-LIN: Finite State Machinery

  42. FINITESTATE MACHINE Languages,Notations and Machines FINITE STATE LANGUAGE FINITE STATE NOTATION CLINT-LIN: Finite State Machinery

More Related