
Natural Language Processing


Presentation Transcript


  1. Natural Language Processing Finite State Transducers

  2. Acknowledgement • Material in this lecture derived/copied from • Richard Sproat CL46 Lectures • Lauri Karttunen LSA lectures 2005 CSA3180 NLP

  3. Three Key Concepts • Regular Relations • Finite State Transducers • Computational Morphology

  4. Three Key Concepts • Regular Relations • Finite State Transducers • Computational Morphology

  5. A Regular Set • L1 = { ab, abab, ababab, abababab, ababababab, ... }

  6. Two Regular Sets • L1 = { ab, abab, ababab, abababab, ababababab, ... } • L2 = { ba, baba, bababa, babababa, bababababa, ... }

  7. A Regular Relation ⊆ L1 × L2 • pairs strings of L1 = { ab, abab, ababab, ... } with strings of L2 = { ba, baba, bababa, ... } • or {("ab","ba"), ("abab","baba"), ...}
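The pairing on the slide above can be sketched in Python (illustrative code, not from the lecture; the function name is ours): each string (ab)^k in L1 is paired with the corresponding (ba)^k in L2.

```python
# Illustrative sketch: enumerate pairs of the regular relation
# {("ab","ba"), ("abab","baba"), ...} described on the slide.
def relation_pairs(n):
    """Return the first n pairs, mapping (ab)^k to (ba)^k for k = 1..n."""
    return [("ab" * k, "ba" * k) for k in range(1, n + 1)]
```

A regular relation is just a (possibly infinite) set of such string pairs; here we only ever materialize a finite prefix of it.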

  8. Some closure properties for regular relations • Concatenation [R1 R2] • Power (R^n) • Reversal • Inversion (R^-1) • Composition: R1 ○ R2

  9. Concatenation and Power • Concatenation: R1 = {("a","b")}, R2 = {("c","d")}, [R1 R2] = {("ac","bd")} • Power: R1^+ = {("a","b"), ("aa","bb"), ("aaa","bbb"), ...}
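For finite relations, concatenation and power are easy to compute directly. A minimal sketch (function names are ours, not xfst operators):

```python
def concat(r1, r2):
    """Concatenation [R1 R2]: concatenate string pairs componentwise."""
    return {(x1 + x2, y1 + y2) for (x1, y1) in r1 for (x2, y2) in r2}

def power(r, n):
    """R^n: R concatenated with itself n times."""
    result = {("", "")}  # R^0 is the identity on the empty string
    for _ in range(n):
        result = concat(result, r)
    return result
```

With R1 = {("a","b")} and R2 = {("c","d")}, concat gives {("ac","bd")} as on the slide, and power(R1, 3) gives {("aaa","bbb")}.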

  10. Composition • R1 ○ R2 denotes the composition of relations R1 and R2. • Definition: if R1 contains <x,y> and R2 contains <y,z>, then R1 ○ R2 contains <x,z>. • R1 and R2 must be relations; if either is just a language, it is assumed to abbreviate the identity relation. • R1 ○ R2 is written [R1 .o. R2] in xfst
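The definition above translates directly into code for finite relations (an illustrative sketch; the function name is ours):

```python
def compose(r1, r2):
    """R1 o R2: contains (x, z) whenever R1 has (x, y) and R2 has (y, z)."""
    return {(x, z) for (x, y) in r1 for (y2, z) in r2 if y == y2}
```

For example, composing {("x","y")} with {("y","z")} yields {("x","z")}; pairs whose middle strings do not match contribute nothing.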

  11. Closure Properties of Regular Languages and Relations
  Operation        Regular Languages   Regular Relations
  Union            yes                 yes
  Concatenation    yes                 yes
  Iteration        yes                 yes
  Intersection     yes                 no
  Subtraction      yes                 no
  Complementation  yes                 no
  Composition      n/a                 yes

  12. Three Key Concepts • Regular Relations • Finite State Transducers • Computational Morphology

  13. Morphology • Study of word structure and formation in terms of morphemes • Morpheme: smallest independent unit in a word that bears some meaning, such as rabbit and -s in rabbits • Two kinds of morphology • Inflectional • Derivational

  14. Inflectional/Derivational Morphology • Inflectional: +s (plural), +ed (past) • category preserving • productive: always applies (esp. to new words, e.g. fax) • systematic: same semantic effect • Derivational: +ment (category changing, e.g. allot+ment) • not completely productive: *detractment • not completely systematic: department

  15. Example: English Noun Inflections

  16. Computational Morphology • Analysis maps surface forms to lexical forms; generation maps lexical forms to surface forms: • leaf+N+Pl, leave+N+Pl, leave+V+Sg3 ↔ leaves • hang+V+Past ↔ hanged, hung

  17. Two Challenges for Computational Morphology • Handling Morphotactics • Words are composed of smaller elements that must be combined in a certain order: • piti-less-ness is English • piti-ness-less is not English • Handling Phonological Alternations • The shape of an element may vary depending on the context • pity is realized as piti in pitilessness • die becomes dy in dying
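The morphotactic constraint can be illustrated with a toy ordering check (all names here are ours, and the two-suffix ordering is a deliberate simplification of real English morphotactics):

```python
# Toy morphotactics: derivational suffixes must appear in this order.
ORDER = ["less", "ness"]

def well_ordered(morphemes):
    """True if the suffixes from ORDER appear in the allowed order."""
    positions = [ORDER.index(m) for m in morphemes if m in ORDER]
    return positions == sorted(positions)
```

So piti-less-ness passes the check while piti-ness-less fails, mirroring the slide's contrast.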

  18. Morphological Parsing • The goal of morphological parsing is to find out what morphemes a given word is built from. • cats → cat +N +PL • mice → mouse +N +PL • foxes → fox +N +PL • foxes → fox +V +3SING +PRES • Two kinds of information • Lexeme (which dictionary word) • Morphosyntactic information
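A minimal dictionary-based sketch of such a parser (illustrative only; the lexicon and names are ours, and a real system would compute analyses with a transducer rather than list them):

```python
# Toy lexicon mapping surface words to analyses; a word may be
# ambiguous, as "foxes" is on the slide.
LEXICON = {
    "cats": ["cat +N +PL"],
    "mice": ["mouse +N +PL"],
    "foxes": ["fox +N +PL", "fox +V +3SING +PRES"],
}

def parse(word):
    """Return all analyses of a surface word (empty list if unknown)."""
    return LEXICON.get(word, [])
```

The list-valued result makes the non-determinism of parsing explicit: "foxes" has two analyses.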

  19. Morphological Parsing • Input word: cats → Morphological Parser → Output analysis: cat +N +PL • Issues • Non-determinism • Arbitrary choice of morpheme symbols

  20. Morphological Generation • Input analysis: cat +N +PL → Morphological Generator → Output word: cats • Issues • Reversibility: how much can be shared regardless of direction?

  21. Morphology as a Regular Relation • lexical language: cat, cat+N+PL, mouse+N+PL, life+N+PL, live+V+3SING, ... • surface language: cat, cats, mice, lives, ... • or {("cat","cat"), ("cat+N+PL","cats"), ...}

  22. Three Key Concepts • Regular Relations • Finite State Transducers • Computational Morphology

  23. FSA • (figure: a one-arc automaton labelled a) • Used for • Recognition • Generation

  24. Finite State Transducers • A finite state transducer (FST) is essentially a finite state automaton (FSA) that works on two (or more) tapes. • The most common way to think about transducers is as a kind of translating machine which works by reading from one tape and writing onto the other.

  25. FST Definition • A 2-way FST is a quintuple (K, s, F, Σi × Σo, δ) where • Σi, Σo are the input and output alphabets • K is a finite set of states • s ∈ K is an initial state • F ⊆ K is the set of final states • δ is a transition relation of type K × Σi × Σo × K
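The quintuple can be sketched as a small Python class (an illustrative simplification, with names of our choosing: transitions are stored as a dict, and only a deterministic left-to-right translation mode is implemented, taking the first matching arc):

```python
class FST:
    """Minimal FST sketch: start state s, final states F, and
    transitions delta as {(state, in_sym, out_sym): next_state}."""

    def __init__(self, s, finals, delta):
        self.s, self.finals, self.delta = s, finals, delta

    def translate(self, word):
        """Left-to-right translation; None if the input is rejected."""
        state, out = self.s, []
        for sym in word:
            arcs = [(o, q) for (p, i, o), q in self.delta.items()
                    if p == state and i == sym]
            if not arcs:
                return None  # no transition for this input symbol
            out.append(arcs[0][0])
            state = arcs[0][1]
        return "".join(out) if state in self.finals else None
```

For instance, the one-state transducer with the single looping arc a:b translates "aaa" into "bbb" and rejects any word containing another symbol.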

  26. FST • (figure: a on the upper tape, b on the lower tape) • Used for • Recognition • Generation • Translation

  27. A Very Simple Transducer • (figure: a single arc with a above, b below) • Relation { ("a","b") } • Notation: a:b encodes the transition

  28. A Very Simple Transducer • (figure: the arc a over b) • also written as a:b

  29. Symbol Pairs • Symbols vs. symbol pairs • In general, no distinction is made in xfst between a, the language {"a"}, and a:a, the identity relation {("a","a")}

  30. A Very Simple Transducer • upper side: a • lower side: b • written a:b

  31. A (more interesting) Transducer • (figure: a looping a:b arc) • Relation { ("a","b"), ("aa","bb"), ... } • Notation: a:b* • N.B. with this notation a and b must be single symbols

  32. Transducers have Several Modes of Operation • generation mode: it writes on both tapes, a string of a's on one tape and a string of b's on the other; both strings have the same length. • recognition mode: it accepts when the word on the first tape consists of exactly as many a's as the word on the second tape consists of b's. • translation mode (left to right): it reads a's from the first tape and, for every a that it reads, writes a b onto the second tape. • translation mode (right to left): it reads b's from the second tape and, for every b that it reads, writes an a onto the first tape.
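The two translation modes are inverses of each other, which can be shown with a tiny sketch (illustrative code; the relation and function names are ours):

```python
RELATION = {("a", "b")}  # the one-arc transducer a:b

def translate_down(word, rel):
    """Left-to-right mode: rewrite each input symbol by its pair."""
    table = dict(rel)
    return "".join(table[c] for c in word)

def translate_up(word, rel):
    """Right-to-left mode: apply the inverted relation."""
    inverse = {(b, a) for (a, b) in rel}
    return translate_down(word, inverse)
```

Running the transducer one way maps "aaa" to "bbb"; inverting the symbol pairs and running it again recovers "aaa", so nothing about the machine itself fixes a direction.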

  33. The Basic Idea • Morphology is regular • Morphology is finite state

  34. Morphology is Regular • The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation, e.g. { ("leaf+N+Pl","leaves"), ("hang+V+Past","hung"), ... } • Regular relations are closed under operations such as concatenation, iteration, union, and composition. • Complex regular relations can be derived from simpler relations.

  35. Morphology is Finite-State • A regular relation can be defined using the metalanguage of regular expressions, e.g. • [{talk} | {walk} | {work}] • [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; • A regular expression can be compiled into a finite-state transducer that implements the relation computationally.

  36. Compilation • The regular expression • [{talk} | {walk} | {work}] • [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; • compiles into a finite-state transducer (figure: a network with stem paths t-a-l-k, w-a-l-k, w-o-r-k from the initial state, followed by suffix arcs +Base:0, +3rdSg:s, +Progr:i 0:n 0:g, +Past:e 0:d into the final state)

  37. Generation • work+3rdSg --> works • (figure: the path w:w o:o r:r k:k +3rdSg:s through the transducer)

  38. Analysis • talked --> talk+Past • (figure: the path t:t a:a l:l k:k +Past:e 0:d read from the surface side)

  39. XFST Demo
  % xfst                                      (start xfst)
  xfst[0]: regex [{talk} | {walk} | {work}]
           [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
                                              (compile a regular expression)
  xfst[1]: apply up walked                    (apply the result)
  walk+Past
  xfst[1]: apply down talk+SgGen3
  talks
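The behaviour of the demo can be mimicked with a toy Python stand-in (this is not xfst; the dictionaries are ours, and the function names merely echo xfst's apply up / apply down commands):

```python
# Toy stand-in for the xfst demo: a finite relation between lexical
# and surface forms for talk/walk/work, using the slide's tag names.
SUFFIXES = {"+Base": "", "+SgGen3": "s", "+Progr": "ing", "+Past": "ed"}
STEMS = ["talk", "walk", "work"]

def apply_down(lexical):
    """Generation: e.g. talk+SgGen3 -> talks."""
    for stem in STEMS:
        if lexical.startswith(stem):
            tag = lexical[len(stem):]
            if tag in SUFFIXES:
                return stem + SUFFIXES[tag]
    return None

def apply_up(surface):
    """Analysis: e.g. walked -> walk+Past (all analyses)."""
    return [stem + tag for stem in STEMS
            for tag, suffix in SUFFIXES.items() if surface == stem + suffix]
```

Unlike this lookup table, a compiled transducer represents the same relation compactly and runs in both directions from one network.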

  40. Lexical Transducer • Maps the citation form plus inflection codes to the inflected form: vouloir+IndP+SG+P3 ↔ veut • Bidirectional: generation or analysis • Compact and fast • Comprehensive systems have been built for over 40 languages: • English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …

  41. How Lexical Transducers are Made • Morphotactics: lexicon → regular expression → lexicon FST • Alternations: rules → regular expressions → rule FSTs • Composing the lexicon FST with the rule FSTs yields the lexical transducer (a single FST), e.g. mapping f a t +Adj +Comp to f a t t e r

  42. Sequential Model • An ordered sequence of rewrite rules (Chomsky & Halle '68) can be modeled by a cascade of finite-state transducers (Johnson '72, Kaplan & Kay '81) • lexical form → fst 1 → intermediate form → fst 2 → ... → fst n → surface form

  43. Sequential Application • N -> m / _ p : kaNpan → kampan • p -> m / m _ : kampan → kamman
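The cascade can be sketched with two ordered regex substitutions (an illustrative stand-in for the rule transducers; the function name is ours):

```python
import re

def apply_rules(form):
    """Cascade the two rewrite rules from the slide, in order."""
    form = re.sub(r"N(?=p)", "m", form)   # N -> m / _ p
    form = re.sub(r"(?<=m)p", "m", form)  # p -> m / m _
    return form
```

The order matters: the first rule creates the m that the second rule's left context requires, so kaNpan passes through the intermediate form kampan before surfacing as kamman.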

  44. Sequential Application in Detail • kaNpan → kampan → kamman • (figure: the two rule transducers, with arcs N:m before p and p:m after m, traversed state by state on each form)

  45. Composition • Composing the two rule transducers yields a single transducer that maps kaNpan directly to kamman • (figure: the composed network with arcs N:m and p:m)

  46. Parallel Model • A set of parallel two-level rules (constraints), compiled into finite-state automata interpreted as transducers (Koskenniemi '83) • lexical form → fst 1, fst 2, ..., fst n applied in parallel → surface form

  47. Sequential vs. Parallel Rules • Chomsky & Halle 1968: lexical form → rule 1 → intermediate form → ... → rule n → surface form; the rules compose into a single FST • Koskenniemi 1983: lexical form → rules 1 ... n applied in parallel → surface form; the rules intersect into a single FST

  48. Rewrite Rules vs. Constraints • Two different ways of decomposing the complex relation between lexical and surface forms into a set of simpler relations that can be more easily understood and manipulated. • One approach may be more convenient than the other for particular applications.

  49. Crossproduct • A .x. B The relation that maps every string in A to every string in B, and vice versa • A:B Same as [A .x. B]. • Example: a b c .x. x y, also written [a b c]:[x y] or {abc}:{xy} • (figure: arcs a:x, b:y, c:0)
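For finite string sets, the crossproduct is a plain Cartesian product of strings (an illustrative sketch; the function name is ours):

```python
def crossproduct(a, b):
    """A .x. B: map every string in A to every string in B."""
    return {(x, y) for x in a for y in b}
```

So {"abc"} .x. {"xy"} is the single pair ("abc", "xy"); with larger sets every string of A is paired with every string of B.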

  50. Composition • A .o. B The relation C such that if A maps x to y and B maps y to z, C maps x to z. • Example: {abc} .o. [a:A | b:B | c:C | d:D]* maps abc to ABC • (figure: the composed network with arcs a:A, b:B, c:C)
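The slide's example can be sketched for a single word (illustrative code, not xfst; the starred symbol-pair relation is applied symbol by symbol, and the names are ours):

```python
# The symbol pairs of [a:A | b:B | c:C | d:D].
PAIRS = {("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")}

def apply_star(pairs, word):
    """Apply [pairs]* to one word: rewrite each symbol by its pair."""
    table = dict(pairs)
    return "".join(table[c] for c in word)
```

Composing the identity on {abc} with this starred relation then amounts to applying it to "abc", yielding "ABC" as on the slide; the d:D arc is simply never used.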
