Finite-State Methods in Natural Language Processing

Lauri Karttunen
LSA 2005 Summer Institute
July 20, 2005

Course Outline
• July 18
  • Intro to computational morphology
  • XFST
  • Readings
    • Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule, J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
    • Karttunen and Beesley, “25 Years of Finite-State Morphology”
    • Chapter 1: “Gentle Introduction” (B&K)
• July 20
  • Regular expressions
  • More on XFST
  • Readings
    • Chapter 2: “Systematic Introduction”
    • Chapter 3: “The XFST interface”

• July 25
  • Concatenative morphotactics
  • Constraining non-local dependencies
  • Readings
    • Chapter 4: “The LEXC Language”
    • Chapter 5: “Flag Diacritics”
• July 27
  • Non-concatenative morphotactics
  • Reduplication, interdigitation
  • Readings
    • Chapter 8: “Non-Concatenative Morphotactics”

• August 1
  • Realizational morphology
  • Readings
    • Gregory T. Stump, Inflectional Morphology: A Theory of Paradigm Structure, Cambridge University Press, 2001. (An excerpt)
    • Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), pages 205-216, Springer Verlag, 2003.
• August 3
  • Optimality theory
  • Readings
    • Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pages 109-161, CSLI Publications, 2003.
    • Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, pages 273-329.

Scripting xfst
xfst -l myscript
  Start xfst, execute myscript, and wait for more commands from the command line.
xfst -f myscript
  Execute myscript and exit.
xfst -e "echo Welcome" \
     -e "regex a b c;" \
     -e "save foo" \
     -stop
  Execute the commands in the given order. The commands must be on the same line. The -stop at the end is required to make xfst quit.

Numeral Script
# This script constructs the language of English
# numerals from "one" to "ninety-nine".
# This is a comment.

# From "one" through "nine":
define OneToNine [{one} | {two} | {three} | {four} |
                  {five} | {six} | {seven} | {eight} |
                  {nine}];

# It is convenient to define a set of prefixes that
# can be followed either by "teen" or by "ty".
define TeenTyStem [{thir} | {fif} | {six} |
                   {seven} | {eigh} | {nine}];

Numeral Script (Continued)
# From "ten" to "nineteen"
define Teens [{ten} | {eleven} | {twelve} |
              [TeenTyStem | {four}] {teen}];

# Let's define stems that can be followed by "ty".
define TyStem [TeenTyStem | {twen} | {for}];

# TyStem is followed either by "ty" or by "ty-"
# and a number from OneToNine.
define Tens [TyStem [{ty} | {ty-} OneToNine]];

define OneToNinetyNine [ OneToNine | Teens | Tens ];
push OneToNinetyNine
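The script builds the numeral language purely by union and concatenation. As a rough cross-check outside xfst, the same construction can be sketched in Python with sets of strings. The helper `concat` is a hypothetical name introduced here for illustration; it is not an xfst command.

```python
def concat(*languages):
    """Concatenate languages (sets of strings), like juxtaposition in an xfst regex."""
    result = {""}
    for language in languages:
        result = {prefix + word for prefix in result for word in language}
    return result

one_to_nine = {"one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine"}
teen_ty_stem = {"thir", "fif", "six", "seven", "eigh", "nine"}

# [{ten} | {eleven} | {twelve} | [TeenTyStem | {four}] {teen}]
teens = {"ten", "eleven", "twelve"} | concat(teen_ty_stem | {"four"}, {"teen"})

# [TeenTyStem | {twen} | {for}]
ty_stem = teen_ty_stem | {"twen", "for"}

# [TyStem [{ty} | {ty-} OneToNine]]
tens = concat(ty_stem, {"ty"}) | concat(ty_stem, {"ty-"}, one_to_nine)

one_to_ninety_nine = one_to_nine | teens | tens
print(len(one_to_ninety_nine))
```

Sharing `TeenTyStem` between the teens and the tens keeps irregular spellings such as "forty" (not "fourty") confined to a single definition.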

Number to Numeral
[Figure: analysis and generation between the number 105 and the numerals "hundred five", "hundred and five", and "one hundred and five".]

NumberToNumeral script
# This script constructs a transducer that relates the
# English numerals "one", "two", ..., "ninety-nine"
# to the corresponding numbers "1", "2", ..., "99".
define OneToNine [1:{one} | 2:{two} | 3:{three} |
                  4:{four} | 5:{five} | 6:{six} |
                  7:{seven} | 8:{eight} | 9:{nine}];

define TeenTyStem [3:{thir} | 5:{fif} | 6:{six} |
                   7:{seven} | 8:{eigh} | 9:{nine}];

define Teens [1:0 [{0}:{ten} | 1:{eleven} | 2:{twelve} |
              [TeenTyStem | 4:{four}] 0:{teen}]];

NumberToNumeral (Continued)
define TyStem [2:{twen} | TeenTyStem | 4:{for}];

# TyStem is followed either by "ty" paired with a zero
# or by "ty-" mapped to an epsilon and followed by a
# number. Note that {0} means zero and not epsilon.
define Tens [TyStem [{0}:{ty} | 0:{ty-} OneToNine]];

define OneToNinetyNine [ OneToNine | Teens | Tens ];
push OneToNinetyNine
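To see how the symbol pairs line up, the transducer can be approximated in Python as a finite set of (number, numeral) pairs. This is only a sketch: `concat` is a hypothetical helper, and the empty string plays the role of the xfst epsilon `0`, while the string "0" plays the role of the literal zero `{0}`.

```python
def concat(*relations):
    """Concatenate relations (sets of (upper, lower) pairs) side by side."""
    result = {("", "")}
    for relation in relations:
        result = {(u1 + u2, l1 + l2)
                  for (u1, l1) in result for (u2, l2) in relation}
    return result

words = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
one_to_nine = {(str(i + 1), word) for i, word in enumerate(words)}
teen_ty_stem = {("3", "thir"), ("5", "fif"), ("6", "six"),
                ("7", "seven"), ("8", "eigh"), ("9", "nine")}

# 1:0 [{0}:{ten} | 1:{eleven} | 2:{twelve} | [TeenTyStem | 4:{four}] 0:{teen}]
teens = concat(
    {("1", "")},
    {("0", "ten"), ("1", "eleven"), ("2", "twelve")}
    | concat(teen_ty_stem | {("4", "four")}, {("", "teen")}),
)

ty_stem = {("2", "twen")} | teen_ty_stem | {("4", "for")}

# TyStem [{0}:{ty} | 0:{ty-} OneToNine]
tens = concat(ty_stem, {("0", "ty")} | concat({("", "ty-")}, one_to_nine))

relation = one_to_nine | teens | tens
number_to_numeral = dict(relation)                               # generation
numeral_to_number = {lower: upper for upper, lower in relation}  # analysis

print(number_to_numeral["13"], numeral_to_number["ninety-nine"])
```

The same set of pairs serves both directions, which mirrors how a single transducer is used for analysis and for generation.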

Xerox RE Operators
• $ containment
• => restriction
• -> @-> replacement
• These operators make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

Containment
$a
Equivalent expression: [?* a ?*]
[Figure: the corresponding network]

Restriction
a => b _ c
“Any a must be preceded by b and followed by c.”
Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
[Figure: the corresponding network]
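The equivalent expression for restriction reads as two prohibitions: no `a` whose left side fails to end in `b`, and no `a` whose right side fails to start with `c`. A small Python sketch can check by brute force that this reading agrees with the prose definition on all short strings over {a, b, c}. The function names are illustrative only.

```python
from itertools import product

def contains_a(s):
    """$a, i.e. [?* a ?*]: the string contains at least one a."""
    return "a" in s

def restricted(s):
    """a => b _ c: every a is immediately preceded by b and followed by c."""
    return all(s[:i].endswith("b") and s[i + 1:].startswith("c")
               for i, symbol in enumerate(s) if symbol == "a")

def equivalent_expression(s):
    """~[~[?* b] a ?*] & ~[?* a ~[c ?*]] spelled out as two prohibitions."""
    bad_left = any(symbol == "a" and not s[:i].endswith("b")
                   for i, symbol in enumerate(s))
    bad_right = any(symbol == "a" and not s[i + 1:].startswith("c")
                    for i, symbol in enumerate(s))
    return not bad_left and not bad_right

strings = ["".join(p) for n in range(6) for p in product("abc", repeat=n)]
agree = all(restricted(s) == equivalent_expression(s) for s in strings)
print(agree, restricted("bac"), restricted("ac"))
```

Strings without any `a`, such as "bcb", satisfy the restriction vacuously.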

Replacement
a b -> b a
“Replace ‘ab’ by ‘ba’.”
Equivalent expression: [[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]
[Figure: the corresponding network]

Marking
a|e|i|o|u -> %[ ... %]
p o t a t o --> p[o]t[a]t[o]
[Figure: the corresponding network]
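For these two simple rules, ordinary string substitution gives the same results, because the patterns "ab" and a single vowel cannot overlap with themselves and so each input has only one factorization. A minimal Python sketch, with function names chosen here for illustration:

```python
import re

def swap_ab(text):
    """a b -> b a: replace every 'ab' in the input by 'ba'."""
    return text.replace("ab", "ba")

def mark_vowels(text):
    """a|e|i|o|u -> %[ ... %]: wrap each vowel in square brackets."""
    return re.sub(r"[aeiou]", r"[\g<0>]", text)

print(swap_ab("aab"))         # aba
print(mark_vowels("potato"))  # p[o]t[a]t[o]
```

Note that `swap_ab("aab")` yields "aba", which again contains "ab": the replacement is applied to the input once, not repeated on its own output.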

Multiple Results
a b | b | b a | a b a -> x, equivalently (a) b (a) -> x, applied to “aba”
Four factorizations of the input string:
a [b] a --> a x a
a [b a] --> a x
[a b] a --> x a
[a b a] --> x
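The four results can be reproduced by enumerating factorizations directly. The sketch below follows the general shape of the unconditional replace expression, [[~$UPPER [UPPER .x. LOWER]]* ~$UPPER]: stretches that are copied unchanged must not contain a match, and stretches that are replaced must be a match. It is an illustration in Python, not how xfst compiles the rule.

```python
import re

UPPER = re.compile(r"a?ba?")  # (a) b (a)

def outputs(text, lower="x"):
    """All outputs of the obligatory, undirected rule  (a) b (a) -> x."""
    results = set()

    def go(start, out):
        for gap_end in range(start, len(text) + 1):
            gap = text[start:gap_end]        # copied unchanged
            if UPPER.search(gap):            # a gap may not contain a match
                break
            if gap_end == len(text):
                results.add(out + gap)
            for match_end in range(gap_end + 1, len(text) + 1):
                if UPPER.fullmatch(text[gap_end:match_end]):
                    go(match_end, out + gap + lower)

    go(0, "")
    return results

print(sorted(outputs("aba")))
```

Because `b` alone is a match, every result must replace the `b`; the ambiguity lies only in how many neighbouring `a` symbols the replaced stretch absorbs.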

Directed Replace Operators
• Guarantee a unique result by constraining the factorization of the input string by
  • Direction of the match (rightward or leftward)
  • Length (longest or shortest)

@-> Left-to-right, Longest-match Replacement
(a) b (a) @-> x applied to “aba”
Of the four factorizations, only the one that starts leftmost and matches the longest stretch is kept:
[a b a] --> x
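The left-to-right, longest-match strategy can be sketched as a simple scan: at each position, try the longest candidate first, replace it if it matches, otherwise copy one symbol and move on. This Python version is an illustration under that reading of `@->`; the function name is hypothetical.

```python
import re

UPPER = re.compile(r"a?ba?")  # (a) b (a)

def longest_match_replace(text, lower="x"):
    """Emulate  (a) b (a) @-> x  by scanning left to right, longest match first."""
    out = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):      # longest candidate first
            if UPPER.fullmatch(text[i:end]):
                out.append(lower)
                i = end
                break
        else:                                    # no match starts here
            out.append(text[i])
            i += 1
    return "".join(out)

print(longest_match_replace("aba"))  # x
```

Fixing both the direction and the length leaves exactly one factorization for every input, so the relation becomes a function.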

Conditional Replacement
A -> B || L _ R   (replacement A -> B, context L _ R)
The relation that replaces A by B between L and R, leaving everything else unchanged.
Sources of complexity:
• Replacements and contexts may overlap
• Alternative ways of interpreting “between left and right”:
  A -> B || L _ R   both contexts on the input
  A -> B // L _ R   left context on the output
  A -> B \\ L _ R   right context on the output

Vowel shortening after a long vowel
Left context on the input side (Slovak)
V %: -> V || V %: C* _
v o l + a: v + a: m e: --> v o l + a: v + a m e   'we call often'
Left context on the output side (Gidabal)
V %: -> V // V %: C* _
g u n u: m + b a: + d a: ng + b e: + --> g u n u: m + b a + d a: ng + b e +   'is certainly right on the stump'

Shortening script
define V [ a | e | i | o | u ];
define C [ b | c | d | f | g | h | j | k | l | m | n |
           p | q | r | s | t | v | x | y | z ];
define SlovakShortening %: -> 0 || V %: C* V _ ;
define GidabalShortening %: -> 0 // V %: C* V _ ;

push SlovakShortening
down vola:va:me:
vola:vame

push GidabalShortening
down gunu:mba:da:ngbe:
gunu:mbada:ngbe
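The only difference between the two rules is where the left context is checked. A short Python sketch makes this concrete: the Slovak variant inspects the input seen so far, while the Gidabal variant inspects the output produced so far. The function and its `context_side` parameter are hypothetical names for this illustration.

```python
import re

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvxyz"
# Left context  V %: C* V  anchored at the position of the length mark.
LEFT_CONTEXT = re.compile(f"[{VOWELS}]:[{CONSONANTS}]*[{VOWELS}]$")

def shorten(word, context_side):
    """Delete ':' after a long-vowel syllable; context_side is 'input' or 'output'."""
    out = []
    for i, symbol in enumerate(word):
        seen = word[:i] if context_side == "input" else "".join(out)
        if symbol == ":" and LEFT_CONTEXT.search(seen):
            continue                  # %: -> 0
        out.append(symbol)
    return "".join(out)

print(shorten("vola:va:me:", "input"))         # Slovak:  ||
print(shorten("gunu:mba:da:ngbe:", "output"))  # Gidabal: //
```

With the context on the input side, every long vowel after a long vowel shortens. With the context on the output side, a vowel that has just been shortened no longer triggers shortening of the next one, which produces the alternating pattern.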

Palatalization and Vowel Raising
• Palatalization
  • tim --> cim
• Vowel Raising
  • memi --> mimi
• Interaction
  • temi --> cimi
  • tememi --> cimimi

Vowel Raising & Palatalization
define C [ b | c | d | f | g | h | j | k | l | m | n |
           p | q | r | s | t | v | x | y | z ];
define Raising e -> i \\ _ C* i ;
define Palatalization t -> c || _ i ;
regex Raising .o. Palatalization;

down memi
mimi
down tim
cim
down temi
cimi
down tememi
cimimi

Derivation: t e m e m i --> t i m i m i (Raising) --> c i m i m i (Palatalization)
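Because the right context of Raising is matched on the output side, raising spreads from right to left: an `e` raises if what follows it, after raising, is `C* i`. The sketch below emulates this in Python by building the output from the end of the word, and then applies palatalization to the result, mirroring the composition `Raising .o. Palatalization`. Function names are illustrative.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvxyz"
RIGHT_CONTEXT = re.compile(f"[{CONSONANTS}]*i")  # C* i at the start of the output suffix

def raise_vowels(word):
    """e -> i \\\\ _ C* i : right context checked on the output side."""
    out = ""
    for symbol in reversed(word):
        if symbol == "e" and RIGHT_CONTEXT.match(out):
            symbol = "i"
        out = symbol + out
    return out

def palatalize(word):
    """t -> c || _ i"""
    return re.sub(r"t(?=i)", "c", word)

def down(word):
    """Raising .o. Palatalization: raising applies first, then palatalization."""
    return palatalize(raise_vowels(word))

for word in ["memi", "tim", "temi", "tememi"]:
    print(word, down(word))
```

Raising feeds palatalization: in "temi" the `t` is palatalized only because the following `e` has already been raised to `i`.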

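Making a lexical transducer
[Figure: Morphotactics are described by a lexicon regular expression, which the compiler turns into a lexicon FST. Alternations are described by rules written as regular expressions, which the compiler turns into rule FSTs. Composition of the lexicon FST with the rule FSTs yields the lexical transducer (a single FST).]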

Finnish Gradation Script
define Stems [ {tukka} | {kakku} | {pappi} | {tippa} |
               {katto} | {juttu} | {tikka} | {huppu} |
               {rotta} | {nahka} | {lika} | {maku} |
               {rako} | {tuke} | {halko} | {jalka} |
               {virka} | {lanka} | {linko} | {puku} |
               {suku} | {tiuku} | {raaka} | {ripa} |
               {sopu} | {tapa} | {kampa} | {rumpu} |
               {sampe} | {sota} | {pata} | {kita} |
               {rinta} | {kanto} | {ranta} | {ilta} |
               {kulta} | {parta} | {kerta} ];
define Case [ "+Part":a | "+Gen":n ];
define Finnish [Stems Case];

Auxiliary definitions
define V [a | e | i | o | u | y | ä | ö];
define C [b | c | d | f | g | h | j | k | l | m | n |
          p | q | r | s | t | v | w | x | z];
define Coda [ C [C | .#.] ];
define ClosedSyll [V Coda];

Weak form of k
define WeakK k -> ' || V a _ a Coda, V u _ u Coda
             .o.
             k -> j || r _ e Coda
             .o.
             k -> v || u _ u Coda
             .o.
             k -> g || n _ V Coda
             .o.
             k -> 0 || \[s|h] _ V Coda ;   # kiskon 'rail',
                                           # nahkan 'skin'

Weak form of p
define WeakP p -> m || m _ V Coda
             .o.
             p -> v || \[s|p] _ V Coda   # piispan 'bishop'
             .o.
             p -> 0 || p _ V Coda ;

Weak form of t
define WeakT t -> n || n _ V Coda
             .o.
             t -> l || l _ V Coda
             .o.
             t -> r || r _ V Coda
             .o.
             t -> d || \[s|t] _ V Coda   # koston 'revenge'
             .o.
             t -> 0 || t _ V Coda ;

Putting it all together
define Gradation WeakK .o. WeakP .o. WeakT;
regex Finnish .o. Gradation;
print lower-words
echo *** Size of Finnish .o. Gradation
print size
echo *** Size of Finnish
push Finnish
print size
echo *** Size of Gradation
push Gradation
print size
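The gradation cascade can be approximated in Python as a sequence of regular-expression substitutions, one per rule, applied in the order of the composition. This is a rough sketch, not a reimplementation of xfst: contexts are written as lookbehind and lookahead assertions, and it assumes that a word-initial consonant has no left neighbour and therefore never matches a context such as `\[s|h] _`. The helper names `cascade` and `generate` are hypothetical.

```python
import re

V = "aeiouyäö"
C = "bcdfghjklmnpqrstvwxz"
CODA = f"[{C}](?:[{C}]|$)"  # Coda: a consonant, then a consonant or the word end

def cascade(*rules):
    """Compose (pattern, replacement) rules; each rule sees the previous output."""
    compiled = [(re.compile(pattern), repl) for pattern, repl in rules]
    def apply(word):
        for pattern, repl in compiled:
            word = pattern.sub(repl, word)
        return word
    return apply

weak_k = cascade(
    (f"(?<=[{V}]a)k(?=a{CODA})|(?<=[{V}]u)k(?=u{CODA})", "'"),
    (f"(?<=r)k(?=e{CODA})", "j"),
    (f"(?<=u)k(?=u{CODA})", "v"),
    (f"(?<=n)k(?=[{V}]{CODA})", "g"),
    (f"(?<=[^sh])k(?=[{V}]{CODA})", ""),
)
weak_p = cascade(
    (f"(?<=m)p(?=[{V}]{CODA})", "m"),
    (f"(?<=[^sp])p(?=[{V}]{CODA})", "v"),
    (f"(?<=p)p(?=[{V}]{CODA})", ""),
)
weak_t = cascade(
    (f"(?<=n)t(?=[{V}]{CODA})", "n"),
    (f"(?<=l)t(?=[{V}]{CODA})", "l"),
    (f"(?<=r)t(?=[{V}]{CODA})", "r"),
    (f"(?<=[^st])t(?=[{V}]{CODA})", "d"),
    (f"(?<=t)t(?=[{V}]{CODA})", ""),
)

CASE = {"+Part": "a", "+Gen": "n"}

def generate(stem, case):
    """Lexical form (stem plus case tag) to surface form."""
    return weak_t(weak_p(weak_k(stem + CASE[case])))

for stem in ["tukka", "lanka", "puku", "kampa", "ranta", "sota"]:
    print(stem, generate(stem, "+Gen"), generate(stem, "+Part"))
```

The genitive `n` closes the final syllable and triggers the weak grade, while the partitive `a` leaves the syllable open and the stem unchanged. Rule order matters inside each cascade: the specific contexts (for example `n _` for `k -> g`) must apply before the general deletion or lenition rule.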

Syllabification
define C [ b | c | d | f ...
define V [ a | e | i | o | u ];
[C* V+ C*] @-> ... "-" || _ [C V]
“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
s t r u k t u r a l i s m i --> s t r u k - t u - r a - l i s - m i
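The syllabification rule translates fairly directly into a Python regular expression with a lookahead for the `C V` context. Two assumptions are made in this sketch: the consonant set, which is elided in the definition above, is taken to be every lowercase ASCII letter that is not one of the five vowels, and Python's greedy leftmost matching is relied on to coincide with longest match, which holds for this particular pattern but not for regular expressions in general.

```python
import re

VOWELS = "aeiou"
# Assumption: all remaining lowercase ASCII letters count as consonants.
CONSONANTS = "bcdfghjklmnpqrstvwxyz"

# [C* V+ C*] in front of [C V]; the lookahead does not consume the context.
SYLLABLE = re.compile(
    f"([{CONSONANTS}]*[{VOWELS}]+[{CONSONANTS}]*)(?=[{CONSONANTS}][{VOWELS}])"
)

def syllabify(word):
    """Insert '-' after each longest C* V+ C* stretch that precedes C V."""
    return SYLLABLE.sub(r"\1-", word)

print(syllabify("strukturalismi"))
```

The context requirement is what makes the final consonant cluster split: in "strukt" the match has to give back the `t` so that a `C V` sequence follows the hyphen.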
