Computational Morphology


Outline
• What is morphology?
• Word structure
• Types of morphological operation
• Levels of affixation
• Computational approaches to morphology
• Finite-state transducers
• Two-level morphology
• Koskenniemi’s rule formalism
Morphology S.Ananiadou

References
• Bauer, L. (1988) Introducing Linguistic Morphology. EUP.
• Spencer, A. (1991) Morphological Theory. Blackwells.
• Jurafsky, D. & Martin, J. (2000) Speech and Language Processing, Chapter 3.
• Koskenniemi, K. & Church, K. (1988) “Complexity, two-level morphology, and Finnish”, in COLING-88, Budapest, pp. 335-339.
• Ananiadou, S. & McNaught, J. (1986) A Review of Two-level Morphology. Eurotra Research Paper, September 1986.

What is morphology?
• Morphology is the study of the way words are built up from smaller meaning-bearing units, called morphemes.
• ‘antiintellectualism’: anti- intellect -al -ism
• Free and bound morphemes
  • intellect (free)
  • anti-, -al, -ism (bound)
• Stems and affixes
  • Complex words contain a central morpheme (the stem), which contributes the basic meaning, and a collection of other morphemes (the affixes), which modify this meaning in different ways.

‘disagreements’
• agree (stem); dis-, -ment, -s (affixes)
  • dis- is a prefix
  • -ment and -s are suffixes
• English rarely stacks more than 4-5 affixes; Turkish, an agglutinative language, can stack around 10.
• There are two broad classes of ways to form words from morphemes: inflection and derivation.
• Inflection: the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class
  • cat + -s → cats; play + -ed → played
• Derivation: the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class
  • agree + -ment → agreement

English Inflectional Morphology
• English nouns have two kinds of inflection: plural and possessive
  • plural: cat / cats, ibis / ibises, finch / finches, box / boxes
  • possessive: llama’s, children’s, llamas’
• English verbal inflection is more complicated
  • main verbs (eat, sleep)
  • modal verbs (can, will, should)
  • primary verbs (be, have, do) (see Quirk et al: Grammar of English Language)
  • regular verbs (walk / walks / walking / walked)
  • irregular verbs (eat / eats / eating / ate / eaten)

Derivational Morphology
• Derivation often changes the syntactic category, e.g. nominalization: computerize → computerization
Suffix   Base (noun/verb/adjective)   Derived noun
-ation   computerize                  computerization
-ee      appoint                      appointee
-er      kill                         killer
-ness    fuzzy                        fuzziness
Suffix   Base (noun/verb)             Derived adjective
-al      computation                  computational
-able    like                         likeable
-less    clue                         clueless

Derivation is less productive than inflection
• Affixes attach to stems and to each other according to certain constraints
• Level ordering in derivation
  • In English we distinguish two types of affixation
    • class I affixation (boundary +)
    • class II affixation (boundary #)
  • Class I affixation occurs before class II affixation
    • Class I: -ion, -ity, -ate, -ive, -ic ...
    • Class II: -y, -ly, -like, -ful, -ness, -less, -hood …
  • danger-ous(I)-ness(II)
  • *fear-less(II)-ity(I)
  • *tender-ness(II)-ous(I)

• Members of the same class may appear in any order with respect to each other: fear-less-ness, tender-ness-less
• Ordering hypothesis about the order in which morphological processes occur (morphotactics):
  • Class I affixation
  • Class II affixation
  • Inflection
  • Compounding
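
The level-ordering constraint can be sketched as a small check over a word's suffix sequence. This is only an illustration: the function name is hypothetical, the affix sets contain just the suffixes listed on the slides (with -ous added because the slides mark it as class I), and the check tests ordering only, not whether the resulting word actually exists.

```python
# Hypothetical affix inventories, taken from the suffixes named on the slides.
CLASS_I = {"ion", "ity", "ate", "ive", "ic", "ous"}
CLASS_II = {"y", "ly", "like", "ful", "ness", "less", "hood"}


def respects_level_ordering(suffixes):
    """Return True if no class I suffix follows a class II suffix.

    `suffixes` lists the suffixes in the order they attach to the stem,
    e.g. ["ous", "ness"] for danger-ous-ness.
    """
    seen_class_ii = False
    for suffix in suffixes:
        if suffix in CLASS_II:
            seen_class_ii = True
        elif suffix in CLASS_I:
            if seen_class_ii:
                return False  # class I after class II violates the ordering
        else:
            raise ValueError(f"unknown suffix: {suffix}")
    return True


print(respects_level_ordering(["ous", "ness"]))   # danger-ous-ness
print(respects_level_ordering(["less", "ity"]))   # *fear-less-ity
print(respects_level_ordering(["less", "ness"]))  # fear-less-ness
```

Suffixes of the same class pass in either order, which matches the observation that fear-less-ness and tender-ness-less are both well-formed with respect to level ordering.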

Finite State Morphological Parsing
• Take an input like ‘cats’ and produce an output like ‘cat +N +PL’ (stem plus morphological features)
• To build a morphological parser we need:
  • a lexicon
  • morphotactics
  • orthographic (spelling) rules, which model the changes that occur when two morphemes combine, e.g. city → cities
• Topics that follow:
  • how to use FSAs to model morphotactic information
  • FSTs as a way of modeling morphological features in the lexicon
  • how to use FSTs to model orthographic rules

Lexicon and Morphotactics
• A lexicon is a repository of words
• Since we cannot list every word in the language, computational lexicons are structured as a list of stems and affixes with a representation of the morphotactics.
• One way to model morphotactics is the finite-state automaton (FSA)
[Figure: FSA for English nominal inflection, with states q0, q1, q2 and arcs labelled reg-noun, plural, irreg-pl-noun and irreg-sg-noun]

Sub-lexicons for English nouns
• reg-noun: fox, cat, dog, aardvark
• irreg-pl-noun: geese, sheep, mice
• irreg-sg-noun: goose, sheep, mouse
• plural: -s
Sub-lexicons for English verbs
• reg-verb-stem: walk, fry, talk, impeach
• irreg-verb-stem: cut, speak, sing
• irreg-past-verb: caught, ate, eaten, sang, spoken
• past: -ed
• past-part: -ed
• pres-part: -ing
• 3sg: -s
• English derivational morphology is more complex than inflectional morphology, so the automata that model it are correspondingly more complex

Morphotactics for English adjectives
• big, bigger, biggest
• cool, cooler, coolest, coolly
• red, redder, reddest
• clear, clearer, clearest, clearly, unclear, unclearly
• happy, happier, happiest, happily
• unhappy, unhappier, unhappiest, unhappily
• real, unreal, really
• We need to set up classes of roots and specify which can occur with which affixes
  • Adj-root1 includes adjectives that can occur with un- and -ly (clear, happy, real)
  • Adj-root2 includes adjectives that can’t (big, cool, red)

An FSA for a fragment of English adjective morphology
[Figure: FSA with states q0 to q5; arcs labelled un-, adj-root1, adj-root2, ε, {-er, -ly, -est} after adj-root1, and {-er, -est} after adj-root2]
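
A minimal sketch of such an automaton as a transition table over pre-segmented morphemes is shown below. The arc layout is an assumption chosen to accept the pattern described for Adj-root1 and Adj-root2 (it is not read off the original figure), and all names are hypothetical. It ignores spelling, so it works on morpheme sequences rather than on written words.

```python
ADJ_ROOT1 = {"clear", "happy", "real"}  # can take un- and -ly
ADJ_ROOT2 = {"big", "cool", "red"}      # cannot

# Assumed arcs: (state, morpheme class or affix) -> next state
TRANSITIONS = {
    ("q0", "un-"): "q1",
    ("q0", "adj-root1"): "q2",
    ("q1", "adj-root1"): "q2",
    ("q2", "-er"): "q5",
    ("q2", "-ly"): "q5",
    ("q2", "-est"): "q5",
    ("q0", "adj-root2"): "q3",
    ("q3", "-er"): "q4",
    ("q3", "-est"): "q4",
}
FINAL_STATES = {"q2", "q3", "q4", "q5"}


def arc_label(morpheme):
    """Map a root to its class name; affixes label arcs directly."""
    if morpheme in ADJ_ROOT1:
        return "adj-root1"
    if morpheme in ADJ_ROOT2:
        return "adj-root2"
    return morpheme


def accepts(morphemes):
    """Run the FSA over a list of morphemes such as ["un-", "happy", "-ly"]."""
    state = "q0"
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, arc_label(morpheme)))
        if state is None:
            return False  # no arc for this morpheme: the automaton blocks
    return state in FINAL_STATES


print(accepts(["un-", "happy", "-ly"]))  # unhappily
print(accepts(["big", "-er"]))           # bigger
print(accepts(["un-", "big"]))           # *unbig
```

Splitting the roots into two classes is what lets the automaton reject *unbig and *bigly while still accepting unhappily.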

• We can use FSAs to solve the problem of morphological recognition: determining whether an input string of letters makes up a legitimate English word or not.
• We do this by taking the morphotactic FSAs and plugging each sub-lexicon into the FSA
• That is, we expand each arc (e.g. the reg-noun-stem arc) with all the morphemes that make up the set of reg-noun-stem.
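
The arc expansion described above can be sketched for the noun automaton as follows. Each sub-lexicon entry is expanded into a chain of single-letter arcs, and recognition simulates the resulting nondeterministic automaton. The choice of arc endpoints (regular nouns lead to q1, irregular forms lead directly to q2, plural -s leads from q1 to q2, and both q1 and q2 are final) is an assumption, as are all names in the code.

```python
from itertools import count

REG_NOUN = ["fox", "cat", "dog", "aardvark"]
IRREG_PL_NOUN = ["geese", "sheep", "mice"]
IRREG_SG_NOUN = ["goose", "sheep", "mouse"]


def build_letter_fsa():
    """Expand each morpheme-labelled arc into a chain of letter arcs."""
    transitions = {}   # (state, letter) -> set of next states
    fresh = count(3)   # states 0, 1, 2 play the roles of q0, q1, q2

    def add_word(src, word, dst):
        state = src
        for i, letter in enumerate(word):
            nxt = dst if i == len(word) - 1 else next(fresh)
            transitions.setdefault((state, letter), set()).add(nxt)
            state = nxt

    for word in REG_NOUN:
        add_word(0, word, 1)
    for word in IRREG_PL_NOUN + IRREG_SG_NOUN:
        add_word(0, word, 2)
    add_word(1, "s", 2)          # plural -s
    return transitions, {1, 2}   # assumed final states


TRANSITIONS, FINAL_STATES = build_letter_fsa()


def recognizes(word):
    """Simulate the letter-level automaton on a string of letters."""
    states = {0}
    for letter in word:
        states = {n for s in states for n in TRANSITIONS.get((s, letter), ())}
        if not states:
            return False
    return bool(states & FINAL_STATES)


for w in ["cats", "geese", "gooses", "foxs", "foxes"]:
    print(w, recognizes(w))
```

Because no orthographic rules are involved yet, this recognizer accepts the misspelling foxs and rejects foxes; handling that case is the job of the spelling rules.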

Morphological parsing with FSTs
• Given the input cats, we want the output cat +N +PL, telling us that cats is the plural of the noun cat
• We do this via two-level morphology (TLM)
• TLM represents a word as a correspondence between a lexical level, which represents a simple concatenation of the morphemes making up a word, and the surface level, which represents the actual spelling of the final word.
• Morphological parsing is implemented by building mapping rules that map letter sequences like cats on the surface level into morpheme and feature sequences like cat +N +PL on the lexical level
• The automaton used for this mapping is the finite-state transducer (FST)

FST
• An FST maps between two sets of symbols via a finite automaton
• We visualize an FST as a two-tape automaton which recognizes or generates pairs of strings
• An FST defines a relation between sets of strings; equivalently, it is a machine that reads one string and generates another
  lexical:  c a t +N +PL
  surface:  c a t s

• An FST accepts a language over pairs of symbols, as in: Σ = { a:a, b:b, !:!, a:!, a:ε, ε:! }
• For TLM we view an FST as having two tapes: the upper or lexical tape is composed of characters from the left side of the a:b pairs; the lower or surface tape is composed of characters from the right side of the a:b pairs.
• Dictionary and text each consist of a sequence of items
  • items of the dictionary are expressed in an alphabet consisting of {a…z}, 0 (the empty character), + (the morpheme boundary character), and a set of archiphonemes, e.g. S for {s, z}
  • items of text are expressed in a subset of this alphabet: {a…z, 0}

• We can build an FST morphological parser out of a morphotactic FSA and lexica by adding an extra lexical tape and the appropriate morphological features
[Figure: FST for English nominal inflection, running from q0 to q7 along three paths]
  q0 → Reg-noun-stem → q1 → +N:ε → q4 → +PL:^s# → q7
  q0 → Irreg-sg-noun-f → q2 → +N:ε → q5 → +SG:# → q7
  q0 → Irreg-pl-noun-f → q3 → +N:ε → q6 → +PL:# → q7
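
The following is a minimal sketch of such a transducer, not a full implementation. To keep it short, arcs are labelled with whole strings rather than single letters, the boundary symbols ^ and # are dropped so that the lower tape is the final spelling, and a +SG arc for regular nouns is assumed. The state names follow the slide; the function names are hypothetical.

```python
EPS = ""  # the empty string stands in for ε


def build_noun_fst():
    """Return arcs as state -> list of (lexical, surface, next_state)."""
    arcs = {}

    def add(src, lexical, surface, dst):
        arcs.setdefault(src, []).append((lexical, surface, dst))

    for stem in ["fox", "cat", "dog", "aardvark"]:
        add("q0", stem, stem, "q1")
    for sg in ["goose", "sheep", "mouse"]:
        add("q0", sg, sg, "q2")
    for sg, pl in [("goose", "geese"), ("sheep", "sheep"), ("mouse", "mice")]:
        add("q0", sg, pl, "q3")
    for src, dst in [("q1", "q4"), ("q2", "q5"), ("q3", "q6")]:
        add(src, " +N", EPS, dst)
    add("q4", " +PL", "s", "q7")
    add("q4", " +SG", EPS, "q7")  # assumed arc for regular singulars
    add("q5", " +SG", EPS, "q7")
    add("q6", " +PL", EPS, "q7")
    return arcs


ARCS = build_noun_fst()
FINAL_STATES = {"q7"}


def transduce(text, read_surface):
    """Follow every path that consumes `text` on one tape; collect the other."""
    results = []
    stack = [("q0", 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text) and state in FINAL_STATES:
            results.append(out)
        for lexical, surface, nxt in ARCS.get(state, []):
            read, write = (surface, lexical) if read_surface else (lexical, surface)
            if text.startswith(read, pos):
                stack.append((nxt, pos + len(read), out + write))
    return sorted(results)


def parse(surface):
    return transduce(surface, read_surface=True)


def generate(lexical):
    return transduce(lexical, read_surface=False)


print(parse("cats"))
print(parse("sheep"))
print(generate("goose +N +PL"))
```

The same arc table serves both directions: parsing reads the surface side and writes the lexical side, and generation does the reverse. An ambiguous form such as sheep yields both a singular and a plural analysis.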

Koskenniemi’s work
• In this model, all the FSTs treating individual phenomena operate in parallel
• so rule ordering and interactions between rules are not a necessary part of the morphological description
• all FSTs share the same two heads but otherwise operate completely independently
• the heads move at the same time
• for an overall correspondence between the lexical and surface strings, the two heads must have reached the end of the two strings, and all the FSTs must be in a final state
• when all FSTs agree, a correspondence is reached
• if even one FST blocks while scanning the two strings, the proposed correspondence is rejected

• All the FSTs sit in parallel between the surface string and the lexical string:
  surface:  … t r i e s …
            FST1  FST2  FST3  …  FSTn
  lexical:  … t r Y + s …
• Example correspondence:
  text (surface tape):        d o g 0 s
  dictionary (lexical tape):  d o g + S
  sequence of mappings (surface,lexical): d,d o,o g,g 0,+ s,S
• The morpheme boundary + corresponds to nothing on the surface; the S archiphoneme / grapheme corresponds to surface s.
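
The parallel operation described above can be sketched as several automata stepping in lockstep over the same sequence of symbol pairs. The two automata below are hypothetical toy rules invented for illustration, not Koskenniemi's actual rule automata, and the pairs are written lexical:surface (the reverse of the surface,lexical order used in the mapping above).

```python
# Toy automaton 1: a single state that only admits feasible pairs.
FEASIBLE = {(c, c) for c in "abcdefghijklmnopqrstuvwxyz"}
FEASIBLE |= {("+", "0"), ("+", "e"), ("S", "s"), ("S", "z")}


def feasible_step(state, pair):
    return "ok" if pair in FEASIBLE else None  # None means the automaton blocks


# Toy automaton 2: after the pair +:e, the next surface symbol must be s.
def epenthesis_step(state, pair):
    lexical, surface = pair
    if state == "need_s" and surface != "s":
        return None
    return "need_s" if pair == ("+", "e") else "ok"


AUTOMATA = [
    {"start": "ok", "finals": {"ok"}, "step": feasible_step},
    {"start": "ok", "finals": {"ok"}, "step": epenthesis_step},
]


def correspondence_ok(lexical, surface):
    """Accept only if every automaton reads all pairs and ends in a final state."""
    if len(lexical) != len(surface):
        return False  # the two heads must reach the ends of both strings together
    states = [a["start"] for a in AUTOMATA]
    for pair in zip(lexical, surface):
        next_states = []
        for automaton, state in zip(AUTOMATA, states):
            new_state = automaton["step"](state, pair)
            if new_state is None:
                return False  # one automaton blocking rejects the correspondence
            next_states.append(new_state)
        states = next_states
    return all(s in a["finals"] for a, s in zip(AUTOMATA, states))


print(correspondence_ok("dog+S", "dog0s"))
print(correspondence_ok("fox+S", "foxes"))
print(correspondence_ok("fox+", "foxe"))
```

Since each automaton sees every pair and none depends on another's state, rules can be added or removed without reordering the rest.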

Koskenniemi’s rule formalism
• The general form of a rule is: CP op LC --- RC
  • CP = correspondence part; this is a concrete or abstract character pair whose occurrence is restricted by the rule
  • op = an operator; there are four operators, giving four types of rule
  • LC, RC = left context, right context
The Rules
• Exclusion rule: a:b /⇐ LC --- RC
  • a may not be realised as b in the context LC --- RC; a:b is not allowed in the given context

• Context restriction rule: a:b ⇒ LC --- RC
  • a may be realised as b only in the given context, and nowhere else; a:b is allowed only in the given context
• Surface coercion rule: a:b ⇐ LC --- RC
  • a must be realised as b in the given context; a:b is required in the given context
• Composite rule: a:b ⇔ LC --- RC
  • this rule is a combination of context restriction and surface coercion; a lexical a must correspond to surface b in the given context, and this correspondence is licit only in that context; a:b is required in the given context and allowed nowhere else
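
The meaning of the four operators can be sketched as a checker over a sequence of lexical:surface pairs. This is an illustrative reading of the definitions above, not Koskenniemi's compilation of rules into automata. The function names, the ASCII operator spellings ("/<=", "=>", "<=", "<=>") and the representation of contexts as predicates are all assumptions.

```python
def rule_holds(op, pairs, cp, left, right):
    """Check one rule `cp op LC --- RC` against a list of (lexical, surface) pairs.

    `left` and `right` are predicates over the pairs before and after a position.
    """
    lex_a, surf_b = cp
    for i, (lexical, surface) in enumerate(pairs):
        in_context = left(pairs[:i]) and right(pairs[i + 1:])
        is_cp = (lexical, surface) == cp
        # a:b occurs where it is forbidden
        excluded = is_cp and in_context
        # a:b occurs outside its licensing context
        unlicensed = is_cp and not in_context
        # lexical a in context is realised as something other than b
        uncoerced = lexical == lex_a and surface != surf_b and in_context
        if op == "/<=" and excluded:
            return False
        if op == "=>" and unlicensed:
            return False
        if op == "<=" and uncoerced:
            return False
        if op == "<=>" and (unlicensed or uncoerced):
            return False
    if op not in {"/<=", "=>", "<=", "<=>"}:
        raise ValueError(f"unknown operator: {op}")
    return True


def pairs_of(lexical, surface):
    """Pair up two equal-length strings symbol by symbol."""
    return list(zip(lexical, surface))


# Toy rule a:b op c --- d, with contexts tested on the lexical side.
CP = ("a", "b")
after_c = lambda before: bool(before) and before[-1][0] == "c"
before_d = lambda after: bool(after) and after[0][0] == "d"

print(rule_holds("<=>", pairs_of("cad", "cbd"), CP, after_c, before_d))
print(rule_holds("<=", pairs_of("cad", "cad"), CP, after_c, before_d))
print(rule_holds("=>", pairs_of("xad", "xbd"), CP, after_c, before_d))
```

The composite operator is checked as the conjunction of the context restriction and surface coercion conditions, mirroring its definition.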

Example of Koskenniemi’s rule formalism
• The rule treats epenthesis in English
• Epenthesis: a morpheme boundary + is realised as an ‘e’ on the surface when it follows ‘ch’, ‘sh’, ‘s’, ‘x’, ‘z’ or ‘y/i’ and occurs before an ‘s’. Otherwise the lexical character + corresponds to 0 on the surface (the empty string)
• Examples: foxes, churches, spies (+:e)
• The rule: +/e ⇔ { { c | s (h) } | S | y/i } --- s
  • CP = +/e, op = ⇔, LC = { { c | s (h) } | S | y/i }, RC = s
• CP, LC, RC consist of sequences of pairs, the first member of a pair drawn from the lexical alphabet, the second from the surface alphabet
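
Applied in the generation direction, the epenthesis rule can be sketched as a function that walks a lexical string and emits lexical:surface pairs. The function names are hypothetical. The y:i correspondence is handled by an assumed companion rule (y is realised as i after a consonant and before +s), since the slide uses y/i in the context but does not state the rule that produces it.

```python
SIBILANT_ENDINGS = ("ch", "sh", "s", "x", "z")
VOWELS = "aeiou"


def realise(lexical):
    """Return the lexical:surface pairs for a lexical string such as 'fox+s'."""
    pairs = []
    for i, symbol in enumerate(lexical):
        before, after = lexical[:i], lexical[i + 1:]
        if symbol == "y" and after.startswith("+s") and before and before[-1] not in VOWELS:
            # assumed companion rule: y:i after a consonant, before +s
            pairs.append(("y", "i"))
        elif symbol == "+":
            after_y_i = bool(pairs) and pairs[-1] == ("y", "i")
            left_ok = before.endswith(SIBILANT_ENDINGS) or after_y_i
            # +:e in the epenthesis context, +:0 everywhere else
            surface = "e" if left_ok and after.startswith("s") else "0"
            pairs.append(("+", surface))
        else:
            pairs.append((symbol, symbol))
    return pairs


def surface_form(lexical):
    """Spell out the surface tape, dropping the empty character 0."""
    return "".join(surface for _, surface in realise(lexical) if surface != "0")


for word in ["fox+s", "church+s", "spy+s", "cat+s"]:
    print(word, "->", surface_form(word))
```

Keeping the pairs explicit, rather than rewriting the string directly, preserves the two-level view: every surface symbol remains aligned with the lexical symbol it realises.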
