
Computational Morphology



Presentation Transcript


  1. Computational Morphology Lauri Karttunen

  2. Computational morphology • The big questions • Efficient generation and recognition • Common data format • Common "runtime" algorithm for all languages • Established results • Lexical representations are regular languages • Morphological alternations are regular relations • Regular relations can be compiled into finite-state transducers • Burning issues • Nonconcatenative phenomena: reduplication (Malay), interdigitation (Arabic) • Nonlocal dependencies • Syntax/morphology interface

  3. Overview • Computational morphology • A success story … • Realizational Morphology (is finite-state) • Lexical representations • Realization rules • Morphophonological rules • Rules of referral • Elsewhere principle (Panini's principle) • Challenges

  4. Computational morphology • Analysis: leaves → leaf N Pl, leave N Pl, leave V Sg3 • Generation: hang V Past → hanged, hung

  5. Two challenges • Morphotactics • Words are composed of smaller elements that must be combined in a certain order: • piti-less-ness is English • piti-ness-less is not English • Phonological alternations • The shape of an element may vary depending on the context • pity is realized as piti in pitilessness • die becomes dy in dying

  6. Morphology is regular (=rational) • The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation. • A regular relation consists of ordered pairs of strings. • leaf+N+Pl : leaves hang+V+Past : hung • Any finite collection of such pairs is a regular relation. • Regular relations are closed under operations such as concatenation, iteration, union, and composition. • Complex regular relations can be derived from simple relations.
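
As a rough illustration of slide 6, a finite regular relation can be modeled in Python as a set of (lexical, surface) string pairs. The pairs are taken from slides 4 and 6 (written here with "+" before each tag); the helper names `generate`, `analyze` and `compose`, and the toy `UPPERCASE` relation, are assumptions made for this sketch and do not come from any finite-state toolkit.

```python
# A finite regular relation modeled as a set of (lexical, surface) pairs.
LEXICON = {
    ("leaf+N+Pl", "leaves"),
    ("leave+N+Pl", "leaves"),
    ("leave+V+Sg3", "leaves"),
    ("hang+V+Past", "hung"),
    ("hang+V+Past", "hanged"),
}


def generate(relation, lexical):
    """Return every surface form paired with the given lexical form."""
    return {s for (l, s) in relation if l == lexical}


def analyze(relation, surface):
    """Return every lexical form paired with the given surface form."""
    return {l for (l, s) in relation if s == surface}


def compose(r1, r2):
    """Composition: (x, z) whenever (x, y) is in r1 and (y, z) is in r2."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}


# Hypothetical second relation: map each surface form to upper case.
UPPERCASE = {(s, s.upper()) for (_, s) in LEXICON}

print(sorted(analyze(LEXICON, "leaves")))
print(sorted(generate(LEXICON, "hang+V+Past")))
print(sorted(compose(LEXICON, UPPERCASE)))
```

The same set of pairs serves both directions, which is the property a transducer provides for infinite relations as well.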

  7. Morphology is finite-state • A regular relation can be defined using the metalanguage of regular expressions. • A regular expression can be compiled into a finite-state transducer that implements the relation computationally. • [Figure: a path from the initial state to the final state that maps l e a f +N +Pl to l e a v e s.]

  8. Regular morphotactics • Principles of word-formation in most languages can be defined as a regular language or relation using operators such as concatenation and union. • Toy example: | = union, () = optionality • Noun: ear | father • Adj: clear | clever | fat • Adv: ever • NPref: anti • AdjSuff : er | est • NSuf : s • English: (NPref) Noun (NSuf) | Adj (AdjSuff) | Adv
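
The toy grammar on slide 8 can be tried out with Python's `re` module, which is used here only as a stand-in for a finite-state compiler. The sublexicon names follow the slide; the function name `is_word` is an assumption of this sketch.

```python
import re

# Sublexicons from the toy example; "|" is union, "?" is optionality.
NOUN = r"(?:ear|father)"
ADJ = r"(?:clear|clever|fat)"
ADV = r"(?:ever)"
NPREF = r"(?:anti)"
ADJSUFF = r"(?:er|est)"
NSUF = r"(?:s)"

# English: (NPref) Noun (NSuf) | Adj (AdjSuff) | Adv
ENGLISH = re.compile(rf"(?:{NPREF}?{NOUN}{NSUF}?|{ADJ}{ADJSUFF}?|{ADV})")


def is_word(s):
    """True if the whole string is licensed by the toy morphotactics."""
    return ENGLISH.fullmatch(s) is not None


for candidate in ["antifathers", "clearest", "ever", "fathersanti", "fater"]:
    print(candidate, is_word(candidate))
```

Note that the morphotactics alone accepts the lexical string "fater"; producing the surface form "fatter" is the job of an alternation rule.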

  9. Simple lexicon • [Figure: finite-state network for the toy lexicon (NPref) Noun (NSuf) | Adj (AdjSuff) | Adv]

  10. Regular alternations • Phonological alternations can be represented as regular relations using special regular expression operators • Ordered rewrite systems (Panini 500 BC, Chomsky & Halle 1968) • Parallel “two-level” systems (Koskenniemi 1983) • Simple gemination rule for English • t -> t t || .#. C* V _ e [ r | s t ] • “Geminate t at the end of a monosyllabic stem with a single vowel that is followed by er or est.” (fat+er -> fatter vs. great+er -> greater).
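
The gemination rule can be approximated with a Python regular-expression substitution. This is a sketch under several assumptions: the morpheme boundary "+" has already been removed, vowels are the five letters a, e, i, o, u, and every other character counts as a consonant. The name `geminate` is made up for the example.

```python
import re

VOWEL = "aeiou"

# t -> t t || .#. C* V _ e [ r | s t ]
# "^" plays the role of .#., the group is C* V, and the lookahead is the
# right context e [ r | s t ].
GEMINATE_T = re.compile(rf"^([^{VOWEL}]*[{VOWEL}])t(?=e(?:r|st))")


def geminate(lexical):
    """Double a stem-final t after a single vowel before er or est."""
    return GEMINATE_T.sub(r"\1tt", lexical)


for form in ["fater", "fatest", "greater", "fat"]:
    print(form, "->", geminate(form))
```

In "greater" the first vowel is followed by another vowel rather than by t, so the left context C* V fails and the form is left unchanged.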

  11. Transducer lexicon • [Figure: the toy lexicon network after composition with the gemination rule; the adjective path for fat now contains a 0:t arc.] • [(NPref) Noun (NSuf) | Adj (AdjSuff) | Adv] .o. [t -> t t || _ e [r | s t]]
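
Because the toy lexicon is finite, the effect of composing it with the rule can be imitated by enumerating every lexical string and applying the rule to each one. The sketch below does that in Python; it is an enumeration, not a transducer construction, and the names `lexicon`, `geminate`, `LEXICAL_TO_SURFACE` and `CHANGED` are assumptions of the example.

```python
import re
from itertools import product

# Sublexicons; the empty string stands for an optional element left out.
NPREF, NOUN, NSUF = ["", "anti"], ["ear", "father"], ["", "s"]
ADJ, ADJSUFF = ["clear", "clever", "fat"], ["", "er", "est"]
ADV = ["ever"]


def lexicon():
    """Enumerate (NPref) Noun (NSuf) | Adj (AdjSuff) | Adv."""
    for parts in product(NPREF, NOUN, NSUF):
        yield "".join(parts)
    for parts in product(ADJ, ADJSUFF):
        yield "".join(parts)
    yield from ADV


def geminate(word):
    """t -> t t || _ e [r | s t], in the short form used on slide 11."""
    return re.sub(r"t(?=e(?:r|st))", "tt", word)


# The lexical-to-surface relation obtained by applying the rule to each word.
LEXICAL_TO_SURFACE = {word: geminate(word) for word in lexicon()}
CHANGED = {l: s for l, s in LEXICAL_TO_SURFACE.items() if l != s}
print(CHANGED)
```

Only the two suffixed forms of fat are affected; every other word in the toy lexicon maps to itself.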

  12. Lexical transducer • [Figure: a finite-state transducer maps the citation form with inflection codes, v o u l o i r +IndP +SG +P3, to the inflected form v e u t.] • Bidirectional: generation or analysis • Compact and fast • Comprehensive systems have been built for over 20 languages: • English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Basque, Greek, Arabic, Bulgarian, …

  13. Morphology is a solved problem • [Figure: the compiler turns the lexicon regular expression into a lexicon FST and the rule regular expressions into rule FSTs; composition yields the lexical transducer (a single FST).]

  14. Who cares? • The success of computational morphology has not made any impact within linguistics. • Computational concerns: • completeness of coverage, physical size, speed of application, formal power, … • Academic concerns: • explanation, universal principles, generalizations, theoretical predictions, elegant formalism, … • Let's try to build a bridge …

  15. Realizational Morphology • Gregory Stump, Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. • A rich set of notational conventions designed to capture important linguistic generalizations. • Interpretable, precise formalism. • Computational implementation in DATR (Finkel & Stump 2002). • The good news: Realizational morphology is a finite-state model.

  16. Finite-state advantage • Casting Stump's system into a regular expression formalism that has a compiler has a fundamental advantage over implementation in systems such as DATR. • DATR can be used to generate an inflected surface form from its lexical representation but it is not directly usable for recognition. In contrast, finite-state transducers are bidirectional generator/recognizers. • Issues to be addressed: • Lexical representations • Realization rules (= rules of exponence) • Morphophonological rules • Rules of referral • Rule ordering by general principles

  17. Lexical representation • <Stem, Features>: a phonological representation paired with a set of morphological properties • Lingala verb nakobeti 'I hit you': <bet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}>

  18. Realization rule • General form: RRn,t,C(<X,s>) =def <Y', s>, where n is the rule block, t the features realized by the rule, C the category, X the phonological input, s the features, and Y' the phonological output. • Example: RR3, Obj:[Per:2, Num:Sg], V(<X,s>) =def <koX, s> • The singular second person object agreement features are realized by prefixing "ko" to the beginning of the current phonological form. The rule appears in Block 3 and applies to verbs.

  19. Rule application • Realization rules are ordered into blocks by the linguist. • Within blocks, the ordering is determined by specificity (Elsewhere rule, Panini's principle). • The final output of a realization rule may depend on morphophonological rules. • X → Y → Y'

  20. Cascade of rule applications • Input: <bet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}> • RR3, Obj:[Per:2, Num:Sg], V(<X,s>) =def <koX, s> yields <kobet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}> • RR1, Sub:[Per:1, Num:Sg], V(<X,s>) =def <naX, s> yields <nakobet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}> • RR5, Tns:Past:Rec, V(<X,s>) =def <Xi, s> yields <nakobeti, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}> • [Figure labels: 1, 3, 5]
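
The cascade on slide 20 can be mimicked in Python by treating each realization rule as a function from a (form, features) pair to a new pair. The dictionary encoding of the feature set, the function names `rr1`, `rr3`, `rr5` and `realize`, and the order of the `rules` tuple (taken from the order in which the slide lists the rules) are all assumptions of this sketch, not Stump's or Karttunen's implementation.

```python
# Feature set of the Lingala verb nakobeti 'I hit you', as a nested dict.
FEATURES = {
    "Sub": {"Per": 1, "Num": "Sg"},
    "Obj": {"Per": 2, "Num": "Sg"},
    "Tns": "Past:Rec",
}


def rr3(form, feats):
    """Block 3: Obj:[Per:2, Num:Sg] is realized by prefixing 'ko'."""
    if feats.get("Obj") == {"Per": 2, "Num": "Sg"}:
        form = "ko" + form
    return form, feats


def rr1(form, feats):
    """Block 1: Sub:[Per:1, Num:Sg] is realized by prefixing 'na'."""
    if feats.get("Sub") == {"Per": 1, "Num": "Sg"}:
        form = "na" + form
    return form, feats


def rr5(form, feats):
    """Block 5: Tns:Past:Rec is realized by suffixing 'i'."""
    if feats.get("Tns") == "Past:Rec":
        form = form + "i"
    return form, feats


def realize(stem, feats, rules=(rr3, rr1, rr5)):
    """Apply the rules in sequence; the features pass through unchanged."""
    form = stem
    for rule in rules:
        form, feats = rule(form, feats)
    return form


print(realize("bet", FEATURES))
```

Each rule leaves the feature set untouched and only rewrites the phonological form, which is what makes the whole cascade expressible as a composition of relations.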

  21. Observations • The lexical representations of Realizational Morphology constitute a regular language. • They can be described by a regular expression. • All examples of realization rules given in Stump's book represent regular relations. • They can be compiled into finite-state transducers. • Because regular relations are closed under composition, the cascade of rule applications yields a single transducer. • We can eliminate the features from the surface side once the composition has been done.

  22. Literal example • [Figure: a path in the lexical transducer for Lingala mapping the surface form nakobeti directly into its lexical representation <bet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}>, and vice versa.] • In a real application, one would prefer a more parsimonious encoding of the feature structure.

  23. Realization rules • Stump's realization rules can easily be expressed in the Parc/XRCE regular expression formalism. • Example: • RR3, Obj:[Per:2, Num:Sg], V(<X,s>) =def <koX, s> • define R301 [. .] -> {ko} || "<" _ $[ObjAgr & $2 & $Sg] • "Rule R301: Insert (= rewrite the empty string as) "ko" to the beginning of a phonological form whose object agreement features contain the values 2 and Sg."
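
The effect of R301 on the literal string encoding can be imitated in Python. This sketch checks for one exact spelling of the object agreement features, whereas the finite-state rule tests containment of the values 2 and Sg more generally; the function name `r301` and the constant `OBJ_2SG` are assumptions of the example.

```python
import re

# Exact spelling of the object agreement features that triggers the rule.
OBJ_2SG = "Obj:[Per:2, Num:Sg]"


def r301(representation):
    """Insert 'ko' right after '<' when the object is 2nd person singular."""
    if OBJ_2SG in representation:
        return re.sub(r"^<", "<ko", representation, count=1)
    return representation


lexical = "<bet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}>"
print(r301(lexical))
```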

  24. Morphophonological rules • The output of a realization rule may be subject to a morphophonological rule. • Stump's morphophonemic rules are simple rewrite rules, easily expressed in the Parc/XRCE regular expression formalism. • If X=W[vowel1] and Y=X[vowel2]Z, then the indicated [vowel2] is absent from Y'. • Vowel -> 0 || Vowel "+" _ ; • where "+" marks the place where the suffix is inserted.
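
The vowel-deletion rule can be sketched as a Python substitution with a lookbehind for the left context. The vowel inventory and the input strings below are hypothetical test data chosen for illustration, not attested Lingala forms, and the function name is an assumption.

```python
import re

VOWELS = "aeiou"  # assumed vowel inventory for the sketch


def delete_vowel_after_boundary(form):
    """Vowel -> 0 || Vowel "+" _ : drop a suffix-initial vowel after a vowel."""
    return re.sub(rf"(?<=[{VOWELS}]\+)[{VOWELS}]", "", form)


# Hypothetical inputs: the first meets the context, the second does not.
print(delete_vowel_after_boundary("ba+ako"))
print(delete_vowel_after_boundary("bet+i"))
```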

  25. Rules of referral • Realization rules may be defined in terms of other realization rules. • The same affix can express more than one bundle of morphological features (syncretism). • In Lingala, mo expresses class 4 singular 3rd person agreement for subjects and objects. • In the Parc/XRCE regular expression formalism, a rule of referral corresponds to a substitution operation. • If R305 is the object agreement rule, the corresponding subject agreement rule is • `[R305, Obj, Sub] • It yields a transducer identical to R305 except that the insertion of mo is controlled by subject agreement features.

  26. Elsewhere principle • While the rule blocks are ordered by the linguist, the realization rules within each block and the morphophonological rules are ordered by specificity. • A specific rule takes precedence over a more general rule in cases where both are applicable. • This principle is very important for Stump, but he gives no precise definition for it within his formalism. • The Elsewhere Principle is an extremely simple notion for realization rules and for symbol-to-symbol morphophonological rules in a finite-state model.

  27. Specific vs. General • Rule A (general): k -> 0 || Vowel _ Vowel • Rule B (specific): k -> v || u _ u

  28. Input/Output languages • Rule A and Rule B have the same input language: the universal (= "sigma star") language. • Both rules can be applied without failure to any string. If the context is not met, the output is the same as the input. • The output languages are not the same. • A "successful" application of an obligatory rule removes from the output language the strings to which it has applied. • Every string missing from the output language of Rule B is missing from the output language of Rule A, but not vice versa. • The output language of Rule A is a proper subset of the output language of Rule B.

  29. Output language of Rule A • Rule A: k -> 0 || Vowel _ Vowel • Rule B: k -> v || u _ u

  30. Output language of Rule B • Rule A: k -> 0 || Vowel _ Vowel • Rule B: k -> v || u _ u

  31. Principled rule ordering • The relationship of any two rules A and B that insert a string or replace a particular symbol can be determined by the following method: • Extract the output languages (a finite-state operation). • Check whether one is the proper subset of the other (a finite-state operation). • This determination can be done efficiently and without any knowledge of how the rules were expressed.
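
The subset test can be approximated by brute force in Python. The slides describe an exact finite-state operation on the rule transducers; the sketch below instead enumerates all strings up to a small length over an assumed four-symbol alphabet, so it is a bounded illustration only. The names `rule_a`, `rule_b`, `output_language`, `OUT_A` and `OUT_B` are made up for the example.

```python
import re
from itertools import product

ALPHABET = "akuv"  # assumed toy alphabet
VOWELS = "au"      # the vowels within that alphabet


def rule_a(s):
    """Rule A: k -> 0 || Vowel _ Vowel"""
    return re.sub(rf"(?<=[{VOWELS}])k(?=[{VOWELS}])", "", s)


def rule_b(s):
    """Rule B: k -> v || u _ u"""
    return re.sub(r"(?<=u)k(?=u)", "v", s)


def output_language(rule, max_len=4):
    """Outputs of `rule` over every input string of length <= max_len."""
    return {
        rule("".join(chars))
        for n in range(max_len + 1)
        for chars in product(ALPHABET, repeat=n)
    }


OUT_A = output_language(rule_a)
OUT_B = output_language(rule_b)

# A proper subset means Rule A removes more strings, so Rule B is the more
# specific rule and takes precedence where both apply.
print(OUT_A < OUT_B)
print("aka" in OUT_B, "aka" in OUT_A)
```

The string "aka" survives Rule B untouched but can never be an output of Rule A, which is what makes the inclusion proper.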

  32. Discussion • It is evident that Realizational Morphology is yet another variant of finite-state morphology. • Stump could say: "Your theory is a notational variant of mine but mine is better." • There are many examples where notation matters: • B => A _ C "B must occur between A and C." • ~[ [~[?* A] B ?*] | [?* B ~[C ?*] ] ] • Stump's convoluted and cumbersome notation takes no advantage of the nice formal and computational properties that it in fact has. It is a finite-state model that does not know its name.
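
To see that the two notations on slide 32 describe the same language, the restriction can be written twice in Python and compared over all short strings. The sketch assumes that A, B and C are the single symbols a, b and c and that the alphabet is "abcx"; the function names are hypothetical.

```python
from itertools import product


def restricted_direct(s, a="a", b="b", c="c"):
    """B => A _ C: every b is immediately preceded by a and followed by c."""
    return all(
        s[:i].endswith(a) and s[i + 1:].startswith(c)
        for i, ch in enumerate(s)
        if ch == b
    )


def restricted_by_complement(s, a="a", b="b", c="c"):
    """~[ [~[?* A] B ?*] | [?* B ~[C ?*]] ], checked split by split."""
    splits = [(s[:i], s[i + 1:]) for i, ch in enumerate(s) if ch == b]
    bad_left = any(not left.endswith(a) for left, _ in splits)
    bad_right = any(not right.startswith(c) for _, right in splits)
    return not (bad_left or bad_right)


# Compare the two formulations on every string of length 0 to 5.
SAME = all(
    restricted_direct("".join(w)) == restricted_by_complement("".join(w))
    for n in range(6)
    for w in product("abcx", repeat=n)
)
print(SAME)
```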

  33. Morphotactic challenges • Most languages build words by concatenation • un+think+ing+ly • paris+mut+nngau+niraq+lauq+sima+nngit+junga (Inuktitut) • (parimunngauniralauqsimanngittunga = I never said I wanted to go to Paris) • Some languages also have nonconcatenative processes of word formation • Arabic interdigitation • Malay reduplication

  34. Interdigitation in Arabic • Non-concatenative stem: ktb (“root”), CVVCVC (“template”) and ui (“vocalization”) yield kuutib • Concatenative: kuutib (“stem”) + a (“suffix”) • The root, template and vocalization morphemes “interdigitate” into a stem.

  35. Full-stem reduplication in Malay • In Malay, the overt plural of bagi (“suitcase”) is bagibagi (orthographically bagi-bagi); the plural of peraturan (“rule”) is peraturanperaturan, etc. • To model such pluralization, you need to copy the stem, no matter what it is and no matter how long it is. • Such “full-stem reduplication” appears to be far beyond finite-state power. • The copy language, {ww | w ∈ L}, is context-sensitive.

  36. Compile-replace: a new algorithm • Define networks using concatenation, as before, but in such a way that the paths in the network may themselves contain regular expressions. • Reapply the compiler to its own output, compiling the regular expression substrings and replacing them with the result of the compilation.

  37. A non-linguistic example: before compile-replace • Network containing a regular expression, a*, delimited with ^[ and ^]. • [Figure: a network whose single path spells ^[ a * ^]]

  38. Non-linguistic example: after compile-replace • Maps every string in the infinite a* language to the regular expression from which the language was compiled. • [Figure: the resulting transducer network.]

  39. Iteration operator • ^n • A^2 denotes two concatenations of the language A with itself, equivalent to [A A]. • A = {bagi, pelabuhan}, • A^2 = {bagibagi, bagipelabuhan, pelabuhanbagi, pelabuhanpelabuhan}. • Finite-state languages and relations are closed under n-ary concatenation.
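
For a finite language, the iteration operator can be reproduced with `itertools.product`. The sketch uses the stems bagi and pelabuhan; the function name `power` is an assumption.

```python
from itertools import product


def power(language, n):
    """A^n: every concatenation of n strings drawn from the language."""
    return {"".join(parts) for parts in product(language, repeat=n)}


A = {"bagi", "pelabuhan"}
print(sorted(power(A, 2)))
```

The result contains mixed strings as well as true reduplicates, which is why applying ^2 to the whole stem lexicon would over-generate.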

  40. Compile-replace in Malay • Before • Lemma: b a g i +Noun +Plural • Underlying form: ^[ { b a g i } ^2 ^] • After • Lemma: b a g i +Noun +Plural • Surface string: b a g i b a g i • The compile-replace operation does not create any ill-formed reduplicates such as pelabuhanbagi.
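
A heavily simplified imitation of the compile-replace step for the Malay case is sketched below. It works on a dictionary of paths rather than on a network, and it understands only one expression shape, a braced stem followed by ^n; the real algorithm recompiles arbitrary regular expressions found on a path. The names `PATHS`, `DELIMITED`, `compile_replace` and `SURFACE` are assumptions.

```python
import re

# Each path pairs a lemma with a lower side that still contains an
# uncompiled regular expression delimited by ^[ and ^].
PATHS = {
    "bagi+Noun+Plural": "^[ {bagi} ^2 ^]",
    "pelabuhan+Noun+Plural": "^[ {pelabuhan} ^2 ^]",
}

# Recognizes only the shape ^[ {stem} ^n ^].
DELIMITED = re.compile(r"\^\[ \{(\w+)\} \^(\d+) \^\]")


def compile_replace(lower):
    """Replace each delimited expression by the stem repeated n times."""
    return DELIMITED.sub(lambda m: m.group(1) * int(m.group(2)), lower)


SURFACE = {lemma: compile_replace(lower) for lemma, lower in PATHS.items()}
print(SURFACE)
```

Because the ^2 operator is evaluated separately on each path, each stem is copied only with itself and no mixed forms arise.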

  41. Merge operators for Arabic • Merge a Filler into a Template • .m>. is the “merge to the right” operator and .<m. is the “merge to the left” operator. • k t b .m>. C V V C V C yields k V V t V b • k V V t V b .<m. u* i yields k u u t i b
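
The two merge steps can be imitated on plain symbol lists in Python. This is a sketch of the observable behavior only: the actual operators work on networks and on symbol classes, and the encoding of a repeatable filler symbol (u* or a+) as a `(symbol, True)` pair is an assumption, as are the function names.

```python
def merge_right(filler, template, slot="C"):
    """.m>. sketch: fill `slot` symbols left to right from `filler`."""
    symbols = iter(filler)
    return [next(symbols) if sym == slot else sym for sym in template]


def merge_left(template, filler, slot="V"):
    """.<m. sketch: fill `slot` symbols right to left.

    `filler` is a list of (symbol, repeatable) pairs; a repeatable symbol
    keeps filling slots until no slot is left.
    """
    out = list(template)
    pending = list(filler)
    for i in range(len(out) - 1, -1, -1):
        if out[i] != slot:
            continue
        symbol, repeatable = pending[-1]
        out[i] = symbol
        if not repeatable:
            pending.pop()
    return out


skeleton = merge_right("ktb", "CVVCVC")
stem = merge_left(skeleton, [("u", True), ("i", False)])
print(" ".join(skeleton))
print(" ".join(stem))

# The root with the template C V C V C and the vocalization a+ (slide 42).
katab = merge_left(merge_right("ktb", "CVCVC"), [("a", True)])
print(" ".join(katab))
```

Filling from the right is what places i in the last vowel slot and lets u spread over the remaining ones.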

  42. Compile-replace in Arabic: before and after • Before • Lemma: k t b =Root C V C V C =Template a + =Voc • Underlying: ^[ k t b .m>. C V C V C .<m. a + ^] • After • Lemma: k t b =Root C V C V C =Template a + =Voc • Surface: k a t a b • Alternation rules apply to the interdigitated stems to produce the real surface strings.

  43. XRCE Arabic • Lexicon • 4930 roots • 400 phonologically distinct patterns • 90,000 stems • 72 million words • Rules • 66 alternation rules for deletion, assimilation, etc. • Construction • compile-replace algorithm merges roots and patterns to form stems • composition with alternation rules creates the final transducer with optional vowels • time required: a few minutes

  44. Remaining issues • Efficient treatment of non-local dependencies • prefix … stem … suffix … • Syntax-morphology interface • Conclusion • Computationally, morphology is a solved problem.

  45. References Lauri Karttunen, "Computing with Realizational Morphology" in CICLing-2003, A. Gelbukh (ed.), Lecture Notes in Computer Science 2588, pages 205-216. Springer Verlag. 2003. • For a copy write to karttune@parc.com • This PowerPoint presentation will be available at a local web site. Kenneth R. Beesley & Lauri Karttunen, Finite-State Morphology, CSLI Publications. February 2003. (Software included).
