Finite-State Methods in Natural Language Processing

Lauri Karttunen, LSA 2005 Summer Institute, July 20, 2005.



  1. Finite-State Methods in Natural Language Processing. Lauri Karttunen, LSA 2005 Summer Institute, July 20, 2005

  2. Course Outline
     • July 18:
       • Intro to computational morphology
       • XFST
       • Readings
         • Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
         • Karttunen and Beesley, “25 Years of Finite-State Morphology”
         • Chapter 1: “Gentle Introduction” (B&K)
     • July 20:
       • Regular expressions
       • More on XFST
       • Readings
         • Chapter 2: “Systematic Introduction”
         • Chapter 3: “The XFST interface”

  3. • July 25
       • Concatenative morphotactics
       • Constraining non-local dependencies
       • Readings
         • Chapter 4. “The LEXC Language”
         • Chapter 5. “Flag Diacritics”
     • July 27
       • Non-concatenative morphotactics
       • Reduplication, interdigitation
       • Readings
         • Chapter 8. “Non-Concatenative Morphotactics”

  4. • August 1
       • Realizational morphology
       • Readings
         • Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
         • Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.
     • August 3
       • Optimality theory
       • Readings
         • Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
         • Nine Elenbaas and René Kager. “Ternary rhythm and the lapse constraint”. Phonology 16. 273-329.

  5. Scripting xfst
     xfst -l myscript
         Start xfst, execute myscript, and wait for more commands from the command line.
     xfst -f myscript
         Execute myscript and exit.
     xfst -e "echo Welcome" \
          -e "regex a b c;" \
          -e "save foo" \
          -stop
         Execute the commands in the given order. The commands must be on the same line. The -stop at the end is required to make xfst quit.

  6. Numeral Script
     # This script constructs the language of English
     # numerals from "one" to "ninety-nine".
     # This is a comment.
     # From "one" through "nine":
     define OneToNine [{one} | {two} | {three} | {four} |
                       {five} | {six} | {seven} | {eight} |
                       {nine}];
     # It is convenient to define a set of prefixes that
     # can be followed either by "teen" or by "ty".
     define TeenTyStem [{thir} | {fif} | {six} |
                        {seven} | {eigh} | {nine}] ;

  7. Numeral Script (Continued)
     # From "ten" to "nineteen"
     define Teens [{ten} | {eleven} | {twelve} |
                   [TeenTyStem | {four}] {teen}];
     # Let's define stems that can be followed by "ty".
     define TyStem [TeenTyStem | {twen} | {for}];
     # TyStem is followed either by "ty" or by "ty-"
     # and a number from OneToNine.
     define Tens [TyStem [{ty} | {ty-} OneToNine]];
     define OneToNinetyNine [ OneToNine | Teens | Tens ];
     push OneToNinetyNine
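
     The language defined by the numeral script above can be enumerated in plain Python as a sanity check. This is only a sketch: Python sets stand in for xfst union and concatenation, the variable names are illustrative, and nothing here is xfst syntax.

```python
# Python sets stand in for the xfst definitions; this is not xfst code.
one_to_nine = {"one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine"}
teen_ty_stem = {"thir", "fif", "six", "seven", "eigh", "nine"}

# Teens: three irregular forms plus stem + "teen".
teens = {"ten", "eleven", "twelve"} | {s + "teen" for s in teen_ty_stem | {"four"}}

# Tens: stem + "ty", optionally followed by "-" and a unit.
ty_stem = teen_ty_stem | {"twen", "for"}
tens = ({s + "ty" for s in ty_stem}
        | {s + "ty-" + n for s in ty_stem for n in one_to_nine})

one_to_ninety_nine = one_to_nine | teens | tens

print(len(one_to_ninety_nine))            # 99
print("forty-two" in one_to_ninety_nine)  # True
```

     The count comes out at 9 + 10 + 80 = 99 strings, which is what the script's name promises.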

  8. Number to Numeral
     [Diagram: analysis maps the numeral “hundred five” to the number 105; generation maps 105 to the numerals “hundred five”, “hundred and five” and “one hundred and five”.]

  9. NumberToNumeral script
     # This script constructs a transducer that relates the
     # English numerals "one", "two", ..., "ninety-nine"
     # to the corresponding numbers "1", "2", ..., "99".
     define OneToNine [1:{one} | 2:{two} | 3:{three} |
                       4:{four} | 5:{five} | 6:{six} |
                       7:{seven} | 8:{eight} | 9:{nine}];
     define TeenTyStem [3:{thir} | 5:{fif} | 6:{six} |
                        7:{seven} | 8:{eigh} | 9:{nine}];
     define Teens [1:0 [{0}:{ten} | 1:{eleven} | 2:{twelve} |
                        [TeenTyStem | 4:{four}] 0:{teen}]];

  10. NumberToNumeral (Continued)
      define TyStem [2:{twen} | TeenTyStem | 4:{for}];
      # TyStem is followed either by "ty" paired with a zero
      # or by "ty-" mapped to an epsilon and followed by a
      # number. Note that {0} means zero and not epsilon.
      define Tens [TyStem [{0}:{ty} | 0:{ty-} OneToNine]];
      define OneToNinetyNine [ OneToNine | Teens | Tens ];
      push OneToNinetyNine
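
      The relation built by the NumberToNumeral script can be mimicked with Python dictionaries. This is a hedged sketch, not xfst: each dictionary pairs the digit side with the numeral side, and the names `number_to_numeral` and `numeral_to_number` are illustrative stand-ins for applying the transducer in the two directions.

```python
# Dictionaries pair the number side (keys) with the numeral side (values).
one_to_nine = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
               "6": "six", "7": "seven", "8": "eight", "9": "nine"}
teen_ty_stem = {"3": "thir", "5": "fif", "6": "six",
                "7": "seven", "8": "eigh", "9": "nine"}

# Teens: the leading "1" has no counterpart on the numeral side (1:0),
# and "teen" has no counterpart on the number side (0:{teen}).
teens = {"10": "ten", "11": "eleven", "12": "twelve"}
for digit, stem in {**teen_ty_stem, "4": "four"}.items():
    teens["1" + digit] = stem + "teen"

# Tens: "ty" is paired with the digit zero, "ty-" with nothing.
ty_stem = {"2": "twen", **teen_ty_stem, "4": "for"}
tens = {}
for digit, stem in ty_stem.items():
    tens[digit + "0"] = stem + "ty"
    for unit, word in one_to_nine.items():
        tens[digit + unit] = stem + "ty-" + word

number_to_numeral = {**one_to_nine, **teens, **tens}
numeral_to_number = {word: num for num, word in number_to_numeral.items()}

print(number_to_numeral["42"])        # forty-two
print(numeral_to_number["eighteen"])  # 18
```

      Because the relation is one-to-one over this range, inverting the dictionary plays the role of running the same transducer in the opposite direction.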

  11. Xerox RE Operators
      • $ containment
      • => restriction
      • -> @-> replacement
      These operators make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

  12. Containment
      $a
      Equivalent expression: [?* a ?*]
      [Diagram: network for $a, with arcs labeled a and ?.]

  13. Restriction
      a => b _ c
      “Any a must be preceded by b and followed by c.”
      Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
      [Diagram: network for the restriction, with arcs labeled a, b, c and ?.]
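
      The equivalence on this slide can be checked by brute force over short strings. The sketch below is plain Python, not xfst: `restricted` reads the restriction directly, `restricted_by_complement` follows the two complemented terms of the equivalent expression, and the four-letter alphabet is an assumption made only to keep the search small.

```python
from itertools import product

def restricted(s):
    """Direct reading: every a has b right before it and c right after it."""
    return all(
        s[i - 1:i] == "b" and s[i + 1:i + 2] == "c"
        for i, ch in enumerate(s) if ch == "a"
    )

def restricted_by_complement(s):
    """Reading of ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]."""
    # [~[?* b] a ?*]: some a whose left side does not end in b
    bad_left = any(ch == "a" and not s[:i].endswith("b")
                   for i, ch in enumerate(s))
    # [?* a ~[c ?*]]: some a whose right side does not start with c
    bad_right = any(ch == "a" and not s[i + 1:].startswith("c")
                    for i, ch in enumerate(s))
    return not bad_left and not bad_right

# Compare the two readings on every string over {a, b, c, d} up to length 5.
strings = ["".join(p) for n in range(6) for p in product("abcd", repeat=n)]
agree = all(restricted(s) == restricted_by_complement(s) for s in strings)

print(agree)                                   # True
print(restricted("dbacd"), restricted("bad"))  # True False
```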

  14. Replacement
      a b -> b a
      “Replace ‘ab’ by ‘ba’.”
      Equivalent expression: [[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]
      [Diagram: transducer for the replacement, with arcs labeled a, b, ?, a:b and b:a.]

  15. Marking
      a|e|i|o|u -> %[ ... %]
      p o t a t o
      p[o]t[a]t[o]
      [Diagram: transducer for the marking rule, with arcs labeled 0:[, a, e, i, o, u, ?, [, ] and 0:].]
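
      For comparison, the effect of the marking rule on “potato” can be reproduced with Python's `re.sub`. This is a sketch of the input/output behavior only; it does not build a transducer, and the function name `mark_vowels` is illustrative.

```python
import re

def mark_vowels(word):
    # Wrap each vowel in brackets, leaving everything else unchanged.
    return re.sub(r"[aeiou]", lambda m: "[" + m.group(0) + "]", word)

print(mark_vowels("potato"))  # p[o]t[a]t[o]
```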

  16. Multiple Results
      a b | b | b a | a b a -> x
      (a) b (a) -> x
      applied to “aba”
      Four factorizations of the input string:
        replacing “b”:     a x a
        replacing “b a”:   a x
        replacing “a b”:   x a
        replacing “a b a”: x

  17. Directed Replace Operators
      • guarantee a unique result by constraining the factorization of the input string by
        • Direction of the match (rightward or leftward)
        • Length (longest or shortest)

  18. @-> Left-to-right, Longest-match Replacement
      (a) b (a) @-> x applied to “aba”
      Of the four factorizations (a x a, a x, x a, x), only the leftmost, longest match is kept, so the unique result is x.
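
      The contrast between -> and @-> on “aba” can be sketched in Python. This is a simplified model, not the xfst algorithm: `all_replacements` enumerates every factorization in which no unreplaced stretch still contains a match, and `longest_match_replace` scans left to right and always takes the longest match. Both function names are illustrative, and the regular expression `a?ba?` stands in for (a) b (a).

```python
import re

PATTERN = re.compile(r"a?ba?")  # stands in for (a) b (a)

def all_replacements(s, repl="x"):
    """All outputs of the undirected replacement (simplified model)."""
    results = set()

    def walk(i, out, gap):
        # gap is the unreplaced text since the last replacement; the
        # replacement is obligatory, so a gap must not contain a match.
        if PATTERN.search(gap):
            return
        if i == len(s):
            results.add(out)
            return
        walk(i + 1, out + s[i], gap + s[i])   # copy one symbol
        for j in range(i + 1, len(s) + 1):    # or replace s[i:j]
            if PATTERN.fullmatch(s, i, j):
                walk(j, out + repl, "")

    walk(0, "", "")
    return results

def longest_match_replace(s, repl="x"):
    """Left-to-right, longest-match replacement."""
    out, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):        # longest candidate first
            if PATTERN.fullmatch(s, i, j):
                out.append(repl)
                i = j
                break
        else:
            out.append(s[i])                  # no match starts here
            i += 1
    return "".join(out)

print(sorted(all_replacements("aba")))  # ['ax', 'axa', 'x', 'xa']
print(longest_match_replace("aba"))     # x
```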

  19. Conditional Replacement
      A -> B   L _ R
      (A -> B is the replacement; L _ R is the context.)
      The relation that replaces A by B between L and R, leaving everything else unchanged.
      Sources of complexity:
      • Replacements and contexts may overlap
      • Alternative ways of interpreting “between left and right.”
        A -> B || L _ R   both contexts on the input
        A -> B // L _ R   left context on the output
        A -> B \\ L _ R   right context on the output

  20. Vowel shortening after a long vowel
      Left context on the input side: V %: -> V || V %: C* _
        Slovak
        v o l + a: v + a: m e:
        v o l + a: v + a m e
        'we call often'
      Left context on the output side: V%: -> V // V%: C* _
        Gidabal
        g u n u: m + b a: + d a: ng + b e: +
        g u n u: m + b a + d a: ng + b e +
        'is certainly right on the stump'

  21. Shortening script
      define V [ a | e | i | o | u | a ];
      define C [ b | c | d | f | g | h | j | k | l | m | n |
                 p | q | r | s | t | v | x | y | z ];
      define SlovakShortening %: -> 0 || V %: C* V _ ;
      define GidabalShortening %: -> 0 // V %: C* V _ ;
      push SlovakShortening
      down vola:va:me:
      vola:vame
      push GidabalShortening
      down gunu:mba:da:ngbe:
      gunu:mbada:ngbe
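
      The difference between the two shortening rules comes down to which side the left context is read from. The Python sketch below makes that explicit; it is an illustration rather than a translation of the xfst script, the function names are made up, and it assumes that every non-vowel symbol counts as a consonant.

```python
VOWELS = set("aeiou")

def segments(word):
    """Split a word into (letter, is_long) pairs; ':' marks a long vowel."""
    segs = []
    for ch in word:
        if ch == ":" and segs:
            segs[-1] = (segs[-1][0], True)
        else:
            segs.append((ch, False))
    return segs

def shorten(word, left_context_on_output):
    """Shorten a long vowel when the preceding vowel is long.

    left_context_on_output=False models '||' (Slovak): the preceding vowel
    is inspected as it was in the input.
    left_context_on_output=True models '//' (Gidabal): the preceding vowel
    is inspected as it came out after shortening.
    """
    out = []
    prev_long_in = prev_long_out = False
    for letter, is_long in segments(word):
        if letter in VOWELS:
            prev = prev_long_out if left_context_on_output else prev_long_in
            keep_long = is_long and not prev
            out.append(letter + (":" if keep_long else ""))
            prev_long_in, prev_long_out = is_long, keep_long
        else:
            out.append(letter)  # assumption: anything else is a consonant
    return "".join(out)

print(shorten("vola:va:me:", left_context_on_output=False))       # vola:vame
print(shorten("gunu:mba:da:ngbe:", left_context_on_output=True))  # gunu:mbada:ngbe
```

      In the Slovak case every long vowel after a long vowel shortens at once; in the Gidabal case a vowel that has just been shortened no longer triggers shortening of the next one, which gives the alternating pattern.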

  22. Palatalization and Vowel Raising
      • Palatalization
        • tim --> cim
      • Vowel Raising
        • memi --> mimi
      • Interaction
        • temi --> cimi
        • tememi --> cimimi

  23. Vowel Raising & Palatalization
      define C [ b | c | d | f | g | h | j | k | l | m | n |
                 p | q | r | s | t | v | x | y | z ];
      define Raising e -> i \\ _ C* i ;
      define Palatalization t -> c || _ i;
      regex Raising .o. Palatalization;
      down memi
      mimi
      down tim
      cim
      down temi
      cimi
      down tememi
      cimimi
      t e m e m i --> t i m i m i --> c i m i m i
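
      The cascade can be imitated in Python to see why the right context of Raising has to be on the output side. This is a sketch under stated assumptions, not xfst: `raising` scans right to left so that a freshly raised i can trigger raising further left, `palatalization` uses a lookahead on its own input, and any non-vowel is treated as a consonant. The function names are illustrative.

```python
import re

VOWELS = "aeiou"

def raising(word):
    """e -> i \\ _ C* i : raise e when the next vowel of the output is i."""
    chars = list(word)
    next_vowel = None  # nearest vowel to the right, as it appears in the output
    for i in reversed(range(len(chars))):
        if chars[i] in VOWELS:
            if chars[i] == "e" and next_vowel == "i":
                chars[i] = "i"
            next_vowel = chars[i]
    return "".join(chars)

def palatalization(word):
    """t -> c || _ i : the context is checked on the rule's own input."""
    return re.sub(r"t(?=i)", "c", word)

def down(word):
    # Composition: the output of Raising is the input of Palatalization.
    return palatalization(raising(word))

for w in ["memi", "tim", "temi", "tememi"]:
    print(w, "-->", down(w))
# memi --> mimi
# tim --> cim
# temi --> cimi
# tememi --> cimimi
```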

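  24. Making a lexical transducer
      [Diagram: morphotactics are written as a lexicon regular expression, which the compiler turns into a lexicon FST; alternations are written as rule regular expressions, which the compiler turns into rule FSTs; composition of the lexicon FST with the rule FSTs yields the lexical transducer (a single FST).]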

  25. Finnish Gradation Script
      define Stems [ {tukka} | {kakku} | {pappi} | {tippa} |
                     {katto} | {juttu} | {tikka} | {huppu} |
                     {rotta} | {nahka} | {lika} | {maku} |
                     {rako} | {tuke} | {halko} | {jalka} |
                     {virka} | {lanka} | {linko} | {puku} |
                     {suku} | {tiuku} | {raaka} | {ripa} |
                     {sopu} | {tapa} | {kampa} | {rumpu} |
                     {sampe} | {sota} | {pata} | {kita} |
                     {rinta} | {kanto} | {ranta} | {ilta} |
                     {kulta} | {parta} | {kerta} ];
      define Case [ "+Part":a | "+Gen":n ];
      define Finnish [Stems Case];

  26. Auxiliary definitions
      define V [a | e | i | o | u | y | ä | ö];
      define C [b | c | d | f | g | h | j | k | l | m | n |
                p | q | r | s | t | v | w | x | z];
      define Coda [ C [C | .#.] ];
      define ClosedSyll [V Coda] ;

  27. Weak form of k
      define WeakK k -> ' || V a _ a Coda, V u _ u Coda
                   .o.
                   k -> j || r _ e Coda
                   .o.
                   k -> v || u _ u Coda
                   .o.
                   k -> g || n _ V Coda
                   .o.
                   k -> 0 || \[s|h] _ V Coda ;   # kiskon 'rail',
                                                 # nahkan 'skin'

  28. Weak form of p
      define WeakP p -> m || m _ V Coda
                   .o.
                   p -> v || \[s|p] _ V Coda     # piispan 'bishop'
                   .o.
                   p -> 0 || p _ V Coda;
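
      The WeakP cascade can be approximated with chained `re.sub` calls. This is only a sketch of the behavior on surface strings, not a transducer: each call checks its lookbehind and lookahead against its own input string, which mirrors contexts on the input side (||), and chaining the calls mirrors composition (.o.). The name `weak_p` is illustrative; the vowel and consonant sets are copied from the auxiliary definitions.

```python
import re

V = "aeiouyäö"
C = "bcdfghjklmnpqrstvwxz"
# Coda: a consonant followed by another consonant or by the end of the word.
CODA = f"[{C}](?:[{C}]|$)"

def weak_p(word):
    # p -> m || m _ V Coda
    word = re.sub(f"(?<=m)p(?=[{V}]{CODA})", "m", word)
    # p -> v || \[s|p] _ V Coda  (some symbol other than s or p must precede)
    word = re.sub(f"(?<=[^sp])p(?=[{V}]{CODA})", "v", word)
    # p -> 0 || p _ V Coda
    word = re.sub(f"(?<=p)p(?=[{V}]{CODA})", "", word)
    return word

for w in ["kampan", "tapan", "pappin", "piispan", "tapaa"]:
    print(w, "-->", weak_p(w))
# kampan --> kamman
# tapan --> tavan
# pappin --> papin
# piispan --> piispan
# tapaa --> tapaa
```

      The genitive forms end in a closed syllable and so show the weak grade, while the partitive tapaa has no coda after the vowel and keeps its p.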

  29. Weak form of t
      define WeakT t -> n || n _ V Coda
                   .o.
                   t -> l || l _ V Coda
                   .o.
                   t -> r || r _ V Coda
                   .o.
                   t -> d || \[s|t] _ V Coda     # koston 'revenge'
                   .o.
                   t -> 0 || t _ V Coda ;

  30. Putting it all together
      define Gradation WeakK .o. WeakP .o. WeakT;
      regex Finnish .o. Gradation;
      print lower-words
      echo *** Size of Finnish .o. Gradation
      print size
      echo *** Size of Finnish
      push Finnish
      print size
      echo *** Size of Gradation
      push Gradation
      print size

  31. Syllabification
      define C [ b | c | d | f ...
      define V [ a | e | i | o | u ];
      [C* V+ C*] @-> ... "-" || _ [C V]
      “Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
      s t r u k t u r a l i s m i
      s t r u k - t u - r a - l i s - m i
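
      The syllabification rule has a close analogue in Python's `re` module. The sketch below is an approximation, not xfst: the lookahead plays the role of the right context `_ [C V]`, and because the vowel and consonant classes are disjoint, greedy matching picks the longest C* V+ C* at each position. The slide elides the full definition of C, so the consonant list here is an assumption, and `syllabify` is an illustrative name.

```python
import re

V = "aeiou"
C = "bcdfghjklmnpqrstvwxyz"  # assumption: the slide truncates this list

# C* V+ C* followed (without consuming it) by C V.
PATTERN = re.compile(f"[{C}]*[{V}]+[{C}]*(?=[{C}][{V}])")

def syllabify(word):
    # Insert a hyphen after each leftmost, longest match.
    return PATTERN.sub(lambda m: m.group(0) + "-", word)

print(syllabify("strukturalismi"))  # struk-tu-ra-lis-mi
```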
