
Natural Language Processing

Natural Language Processing. Finite State Transducers. Acknowledgement. Material in this lecture derived/copied from Richard Sproat CL46 Lectures Lauri Karttunen LSA lectures 2005. Three Key Concepts. Regular Relations. Finite State Transducers. Computational Morphology.


Presentation Transcript


  1. Natural Language Processing Finite State Transducers

  2. Acknowledgement • Material in this lecture derived/copied from • Richard Sproat CL46 Lectures • Lauri Karttunen LSA lectures 2005 CLINT-CL

  3. Three Key Concepts Regular Relations Finite State Transducers Computational Morphology

  4. Three Key Concepts Regular Relations Finite State Transducers Computational Morphology

  5. A Regular Set • L1 = {ab, abab, ababab, abababab, ababababab, ...}

  6. Two Regular Sets • L1 = {ab, abab, ababab, abababab, ababababab, ...} • L2 = {ba, baba, bababa, babababa, bababababa, ...}

  7. A Regular Relation ⊆ L1 × L2 • Each string of L1 is paired with a string of L2: ab ↔ ba, abab ↔ baba, ababab ↔ bababa, ... • or {("ab","ba"), ("abab","baba"), ...}

  8. Concatenation and Power • Concatenation: R1 = {("a","b")}, R2 = {("c","d")}, [R1 R2] = {("ac","bd")} • Power: R1+ = {("a","b"), ("aa","bb"), ("aaa","bbb"), ...}
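The two operations on this slide can be tried out on small finite relations. The following is only a sketch: relations are modelled as Python sets of string pairs, the helper names `concat` and `power_plus` are made up for illustration, and `power_plus` can only produce a finite prefix of the infinite relation R1+.

```python
def concat(r1, r2):
    """Concatenate two relations pairwise: (x1 x2, y1 y2)."""
    return {(x1 + x2, y1 + y2) for (x1, y1) in r1 for (x2, y2) in r2}


def power_plus(r, max_n):
    """Finite approximation of R+: the union of R, RR, ..., up to max_n copies."""
    result = set(r)
    current = set(r)
    for _ in range(max_n - 1):
        current = concat(current, r)
        result |= current
    return result


R1 = {("a", "b")}
R2 = {("c", "d")}

print(concat(R1, R2))             # {('ac', 'bd')}
print(sorted(power_plus(R1, 3)))  # [('a', 'b'), ('aa', 'bb'), ('aaa', 'bbb')]
```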

  9. Composition • R1 ○ R2 denotes the composition of relations R1 and R2. • Definition: if R1 contains <x,y> and R2 contains <y,z>, then R1 ○ R2 contains <x,z>. • R1 and R2 must be relations. If either is just a language, it is assumed to abbreviate the identity relation. • R1 ○ R2 is written [R1 .o. R2] in xfst.
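The definition of composition can be checked on finite relations. This is a sketch under the assumption that relations are Python sets of pairs; `compose` and `identity` are illustrative names, not xfst functions, and the example pairs are invented.

```python
def identity(language):
    """Treat a language as the identity relation over its strings."""
    return {(w, w) for w in language}


def compose(r1, r2):
    """<x,z> is in the result whenever r1 has <x,y> and r2 has <y,z>."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}


R1 = {("a", "b"), ("c", "d")}
R2 = {("b", "x"), ("d", "y"), ("e", "z")}

print(sorted(compose(R1, R2)))        # [('a', 'x'), ('c', 'y')]

# A plain language on one side abbreviates its identity relation.
print(compose(identity({"a"}), R1))   # {('a', 'b')}
```

The pair ("e","z") in R2 contributes nothing, because no pair in R1 has "e" as its second component.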

  10. Closure Properties of Regular Languages and Relations

      Operation         Regular Languages   Regular Relations
      Union             yes                 yes
      Concatenation     yes                 yes
      Iteration         yes                 yes
      Intersection      yes                 no
      Subtraction       yes                 no
      Complementation   yes                 no
      Composition       n/a                 yes

  11. Three Key Concepts Regular Relations Finite State Transducers Computational Morphology

  12. Morphology • Study of word structure and formation in terms of morphemes • Morpheme: smallest independent unit in a word that bears some meaning, such as rabbit and s. • Two kinds of morphology • Inflectional • Derivational

  13. Inflectional/Derivational Morphology
      • Inflectional: +s plural, +ed past
        • category preserving
        • productive: always applies (esp. new words, e.g. fax)
        • systematic: same semantic effect
      • Derivational: +ment
        • category changing: allot+ment
        • not completely productive: detractment*
        • not completely systematic: department

  14. Inflectional Morphology Example: English Noun Inflections

  15. Computational Morphology: Concrete Processes • Analysis: leaves → leaf N Pl, leave N Pl, leave V Sg3 • Generation: hang V Past → hanged, hung

  16. Two Challenges for Computational Morphology • Handling Morphotactics • Words are composed of smaller elements that must be combined in a certain order: • piti-less-ness is English • piti-ness-less is not English • Handling Phonological Alternations • The shape of an element may vary depending on the context • pity is realized as piti in pitilessness • die becomes dy in dying

  17. Morphological Parsing • The goal of morphological parsing is to find out what morphemes a given word is built from. • cats → cat Noun PL • mice → mouse Noun PL • foxes → fox N PL • foxes → fox V 3SING PRES • Two kinds of information • Lexeme (which dictionary word) • Morphosyntactic information

  18. Morphological Parsing • Input Word (cats) → Morphological Parser → Output Analysis (cat N PL) • Issues • Non-determinism • Arbitrary choice of morpheme symbols

  19. Morphological Generation • Input Analysis (cat N PL) → Morphological Generator → Output Word (cats) • Issues • Reversibility: how much can be shared regardless of direction?

  20. Morphology as a Regular Relation • surface language: cat, cats, mice, lives, ... • lexical language: cat, cat+N+PL, mouse+N+PL, life+N+PL, live+V+3SING, ... • or {("cat","cat"), ("cats","cat+N+PL"), ...}

  21. Three Key Concepts Regular Relations Finite State Transducers Computational Morphology

  22. FSA [Figure: automaton with an arc labeled a] • Used for • Recognition • Generation

  23. Finite State Transducers • A finite state transducer (FST) is essentially a finite state automaton (FSA) that works on two (or more) tapes. • The most common way to think about transducers is as a kind of translating machine which works by reading from one tape and writing onto the other.

  24. FST Definition • A 2 way FST is a quintuple (K, s, F, Σi × Σo, δ) where • Σi, Σo are input and output alphabets • K is a finite set of states • s ∈ K is an initial state • F ⊆ K are final states • δ is a transition relation of type K × Σi × Σo × K
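One possible way to write this definition down as a data structure is sketched below. The class name `FST`, its field names, and the helper `is_well_formed` are illustrative choices, not part of the lecture. The input and output alphabets are stored as two separate fields for convenience, and epsilon transitions are not modelled.

```python
from itertools import product
from typing import NamedTuple


class FST(NamedTuple):
    states: frozenset        # K, a finite set of states
    start: str               # s, the initial state
    finals: frozenset        # F, the final states
    in_alphabet: frozenset   # input alphabet
    out_alphabet: frozenset  # output alphabet
    delta: frozenset         # transitions (state, in_sym, out_sym, next_state)


def is_well_formed(fst):
    """Check s in K, F subset of K, and delta subset of K x in x out x K."""
    universe = set(product(fst.states, fst.in_alphabet,
                           fst.out_alphabet, fst.states))
    return (fst.start in fst.states
            and fst.finals <= fst.states
            and fst.delta <= universe)


# One state with a single looping transition that reads a and writes b.
A_TO_B = FST(
    states=frozenset({"q0"}),
    start="q0",
    finals=frozenset({"q0"}),
    in_alphabet=frozenset({"a"}),
    out_alphabet=frozenset({"b"}),
    delta=frozenset({("q0", "a", "b", "q0")}),
)

print(is_well_formed(A_TO_B))  # True
```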

  25. FST [Figure: transducer with a on the upper tape and b on the lower tape] • Used for • Recognition • Generation • Translation

  26. A Very Simple Transducer [Figure: one arc with a above and b below] • Relation: { ("a","b") } • Notation: a:b encodes the transition

  27. A Very Simple Transducer [Figure: one arc with a above and b below] • also written as a:b

  28. Symbol Pairs • Symbols vs. symbol pairs • In general, no distinction is made in xfst between • a, the language {"a"} • a:a, the identity relation {("a","a")}

  29. A Very Simple Transducer • a:b, with a on the upper side and b on the lower side

  30. A (More Interesting) Transducer [Figure: a looping arc labeled a:b] • Relation: { ("a","b"), ("aa","bb"), ...} • Notation: a:b* • N.B. with this notation a and b must be single symbols

  31. Transducers Have Several Modes of Operation • generation mode: It writes on both tapes: a string of a's on one tape and a string of b's on the other tape. Both strings have the same length. • recognition mode: It accepts when the word on the first tape consists of exactly as many a's as the word on the second tape consists of b's. • translation mode (left to right): It reads a's from the first tape and writes a b onto the second tape for every a that it reads. • translation mode (right to left): It reads b's from the second tape and writes an a onto the first tape for every b that it reads.
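The four modes can be illustrated with a small simulation of the a:b* transducer. This is a sketch, not how xfst works internally: the transition table `DELTA` and the functions `translate`, `recognize`, and `generate` are hypothetical names, and epsilon transitions are not handled.

```python
from itertools import islice

# One state (0), both initial and final, with a single loop a:b.
DELTA = {(0, "a", "b", 0)}
START, FINALS = 0, {0}


def translate(word, direction="down"):
    """Translation mode. 'down' reads the upper tape and writes the lower
    tape; 'up' reads the lower tape and writes the upper tape."""
    configs = {(START, "")}
    for sym in word:
        nxt = set()
        for (q, out) in configs:
            for (q1, upper, lower, q2) in DELTA:
                read, write = (upper, lower) if direction == "down" else (lower, upper)
                if q1 == q and read == sym:
                    nxt.add((q2, out + write))
        configs = nxt
    return {out for (q, out) in configs if q in FINALS}


def recognize(upper, lower):
    """Recognition mode: is the pair (upper, lower) in the relation?"""
    return lower in translate(upper, "down")


def generate():
    """Generation mode: enumerate pairs of the relation, shortest first."""
    frontier = [(START, "", "")]
    while frontier:
        nxt = []
        for (q, up, low) in frontier:
            if q in FINALS:
                yield (up, low)
            for (q1, a, b, q2) in DELTA:
                if q1 == q:
                    nxt.append((q2, up + a, low + b))
        frontier = nxt


print(translate("aaa"))             # {'bbb'}
print(translate("bb", "up"))        # {'aa'}
print(recognize("aa", "bb"))        # True
print(list(islice(generate(), 3)))  # [('', ''), ('a', 'b'), ('aa', 'bb')]
```

Because the simulation follows the a:b* notation, generation also yields the empty pair before ("a","b").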

  32. The Basic Idea • Morphology is regular • Morphology is finite state

  33. Morphology is Regular • The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation, e.g. { ("leaf+N+Pl","leaves"), ("hang+V+Past","hung"), ...} • Regular relations are closed under operations such as concatenation, iteration, union, and composition. • Complex regular relations can be derived from simpler relations.

  34. Morphology is finite-state • A regular relation can be defined using the metalanguage of regular expressions: [{talk} | {walk} | {work}] [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; • A regular expression can be compiled into a finite-state transducer that implements the relation computationally.

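  35. Compilation: Regular expression → Finite-state transducer • Regular expression: [{talk} | {walk} | {work}] [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; • [Figure: the compiled transducer, running from the initial state through the letters of talk, walk, and work to the suffix arcs +Base:0, +3rdSg:s, +Progr:i 0:n 0:g, and +Past:e 0:d, ending in the final state]

  36. Generation • work+3rdSg --> works • [Figure: the same transducer with identity arcs written out (t:t, a:a, l:l, k:k, w:w, o:o, r:r) and suffix arcs +Base:0, +3rdSg:s, +Progr:i 0:n 0:g, +Past:e 0:d]

  37. Analysis • talked --> talk+Past • [Figure: the same transducer with identity arcs written out (t:t, a:a, l:l, k:k, w:w, o:o, r:r) and suffix arcs +Base:0, +3rdSg:s, +Progr:i 0:n 0:g, +Past:e 0:d]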

  38. XFST Demo 2
      • Start xfst:
          % xfst
          xfst[0]:
      • Compile a regular expression:
          xfst[0]: regex [{talk} | {walk} | {work}] [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
      • Apply the result:
          xfst[1]: apply up walked
          walk+Past
          xfst[1]: apply down talk+SgGen3
          talks
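For readers without access to xfst, the session above can be imitated in plain Python. This sketch lists the finite relation explicitly instead of compiling a transducer. `STEMS`, `SUFFIXES`, `apply_up`, and `apply_down` are illustrative names that only mirror the xfst commands.

```python
STEMS = {"talk", "walk", "work"}

# (lexical tag, surface ending), following the regular expression above.
SUFFIXES = {("+Base", ""), ("+SgGen3", "s"), ("+Progr", "ing"), ("+Past", "ed")}

# Concatenate every stem (mapped to itself) with every suffix pair.
RELATION = {(stem + tag, stem + ending)
            for stem in STEMS
            for (tag, ending) in SUFFIXES}


def apply_up(surface):
    """Analysis: surface form -> all matching lexical forms."""
    return sorted(lex for (lex, surf) in RELATION if surf == surface)


def apply_down(lexical):
    """Generation: lexical form -> all matching surface forms."""
    return sorted(surf for (lex, surf) in RELATION if lex == lexical)


print(apply_up("walked"))         # ['walk+Past']
print(apply_down("talk+SgGen3"))  # ['talks']
```

A real lexical transducer does not enumerate the pairs like this. The enumeration only works here because the relation has twelve members.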

  39. Lexical Transducer • A finite-state transducer that relates a citation form plus inflection codes (vouloir +IndP +SG +P3) to the inflected form (veut) • Bidirectional: generation or analysis • Compact and fast • Comprehensive systems have been built for over 40 languages: English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …

  40. How Lexical Transducers Are Made • Morphotactics: Lexicon Regular Expression → Compiler → Lexicon FST • Alternations: Rules Regular Expressions → Compiler → Rule FSTs • Lexicon FST and Rule FSTs → composition → Lexical Transducer (a single FST) • Example: f a t +Adj +Comp ↔ f a t t e r

  41. Sequential Model • An ordered sequence of rewrite rules (Chomsky & Halle '68) can be modeled by a cascade of finite-state transducers (Johnson '72, Kaplan & Kay '81) • Lexical form → fst 1 → Intermediate form → fst 2 → ... → fst n → Surface form

  42. Sequential application • Rule 1: N -> m / _ p • Rule 2: p -> m / m _ • k a N p a n → (rule 1) → k a m p a n → (rule 2) → k a m m a n
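The cascade on this slide can be imitated with ordinary regular-expression substitution. This only sketches the input and output behaviour of the two rules, not their finite-state implementation, and the function names are invented for illustration.

```python
import re


def rule_n_to_m(s):
    """N -> m / _ p : rewrite N as m when followed by p."""
    return re.sub(r"N(?=p)", "m", s)


def rule_p_to_m(s):
    """p -> m / m _ : rewrite p as m when preceded by m."""
    return re.sub(r"(?<=m)p", "m", s)


def cascade(form, rules):
    """Apply the rules in order and return every intermediate form."""
    forms = [form]
    for rule in rules:
        forms.append(rule(forms[-1]))
    return forms


print(cascade("kaNpan", [rule_n_to_m, rule_p_to_m]))
# ['kaNpan', 'kampan', 'kamman']

# The rules are ordered: applied the other way round, the second rule
# finds no m before p, so the derivation stops at kampan.
print(cascade("kaNpan", [rule_p_to_m, rule_n_to_m]))
# ['kaNpan', 'kaNpan', 'kampan']
```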

  43. Sequential application in detail • k a N p a n → k a m p a n → k a m m a n • [Figure: the two rule transducers, with arcs labeled N:m, p:m, N, m, p, and ?; the state sequences traced over the two steps are 0 0 0 2 0 0 0 and 0 0 0 1 0 0 0]

  44. Composition of Two Stages • k a N p a n → k a m m a n • [Figure: the composed transducer with states 0 to 3 and arcs labeled N:m, p:m, N, m, p, and ?; the state sequence traced is 0 0 0 3 0 0 0]

  45. Composition Example • A .o. B: the relation C such that if A maps x to y and B maps y to z, C maps x to z. • {abc} .o. [a:A | b:B | c:C | d:D]* = [a:A b:B c:C] • [Figure: the networks for {abc}, for [a:A | b:B | c:C | d:D]*, and for the result a:A b:B c:C]

  46. Crossproduct • A .x. B: the relation that maps every string in A to every string in B, and vice versa • A:B: same as [A .x. B] • Equivalent notations: a b c .x. x y, [a b c] : [x y], {abc}:{xy} • [Figure: network with arcs a:x, b:y, c:0]

  47. Xerox RE Operators • $ containment • => restriction • -> replacement • @-> longest match replacement • Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

  48. Containment • $a • Equivalent expression: [?* a ?*] • [Figure: two-state network with arcs labeled a and ?]

  49. Restriction • a => b _ c • "Any a must be preceded by b and followed by c." • Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]] • [Figure: network with arcs labeled a, b, c, and ?]

  50. Replacement • a b -> b a • "Replace 'ab' by 'ba'." • Equivalent expression: [[~$[a b] [[a b] .x. [b a]]]* ~$[a b]] • [Figure: network with arcs labeled a:b, b:a, a, b, and ?]
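The behaviour of these three operators on single strings can be approximated in Python. This is a sketch of what each expression accepts or produces, not of the networks xfst compiles, and the function names are hypothetical.

```python
import re


def contains_a(s):
    """$a, i.e. [?* a ?*]: any string with at least one a."""
    return re.fullmatch(r".*a.*", s, flags=re.DOTALL) is not None


def satisfies_restriction(s):
    """a => b _ c: every a is immediately preceded by b and followed by c."""
    return all(
        i > 0 and s[i - 1] == "b" and i + 1 < len(s) and s[i + 1] == "c"
        for i, ch in enumerate(s)
        if ch == "a"
    )


def replace_ab(s):
    """a b -> b a: rewrite every occurrence of 'ab' as 'ba'."""
    return s.replace("ab", "ba")


print(contains_a("xay"), contains_a("xyz"))   # True False
print(satisfies_restriction("xbacx"))         # True
print(satisfies_restriction("bab"))           # False
print(replace_ab("abab"), replace_ab("aab"))  # baba aba
```

A string with no a at all satisfies the restriction, since the constraint only applies where an a occurs.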
