Understanding Morphology: Word Structures, Parsing, and Inflectional Systems in Linguistics

Lecture 3 Morphology: Parsing Words CS 4705

What is morphology? • The study of how words are composed from smaller, meaning-bearing units (morphemes) • Stems: child in children, doubt in undoubtedly • Affixes (prefixes, suffixes, circumfixes, infixes) • Immaterial (prefix) • Trying (suffix) • Gesagt (circumfix) • Absobl**dylutely (infix) • Concatenative vs. non-concatenative (e.g. Arabic root-and-pattern) morphological systems

Agglutinative (e.g. Turkish, Japanese) vs. analytic (e.g. Mandarin) vs. inflectional systems (e.g. English, Latin, Russian)

(English) Inflectional Morphology • Word stem + grammatical morpheme • Usually produces word of same class • Usually serves a syntactic function (e.g. agreement): like → likes or liked; bird → birds • Nominal morphology • Plural forms • -s or -es • Irregular forms (goose/geese) • Mass vs. count nouns (fish/fish; email or emails?) • Possessives (cat’s, cats’)

Verbal inflection • Main verbs (sleep, like, fear) relatively regular • -s, -ing, -ed • And productive: emailed, instant-messaged, faxed, homered • But some are not regular: eat/ate/eaten, catch/caught/caught • Primary (be, have, do) and modal verbs (can, will, must) often irregular and not productive • Be: am/is/are/were/was/been/being • Irregular verbs few (~250) but frequently occurring • So... English inflectional morphology is fairly easy to model, with some special cases

(English) Derivational Morphology • Word stem + grammatical morpheme • Usually produces word of different class • More complicated than inflectional • E.g. verbs → nouns • -ize verbs → -ation nouns • generalize, realize → generalization, realization • E.g. verbs, nouns → adjectives • embrace, pity → embraceable, pitiable • care, wit → careless, witless

Example: adjective → adverb • happy → happily • But “rules” have many exceptions • Less productive: *evidence-less, *concern-less, *go-able, *sleep-able • Meanings of derived terms harder to predict by rule • clueless, careless, nerveless

Parsing • Taking a surface input and identifying its components and underlying structure • Morphological parsing: parsing a word into stem and affixes, identifying its parts and their relationships • Stem and features: • goose → goose +N +SG or goose +V • geese → goose +N +PL • gooses → goose +V +3SG • Bracketing: indecipherable → [in [[de [cipher]] able]]

Why parse words? • For spell-checking • Is muncheble a legal word? • To identify a word’s part-of-speech (POS) • For sentence parsing, for machine translation, … • To identify a word’s stem • For information retrieval • Why not just list all word forms in a lexicon?

What do we need to build a morphological parser? • Lexicon: list of stems and affixes (with corresponding POS) • Morphotactics of the language: model of how and which morphemes can be affixed to a stem • Orthographic rules: spelling modifications that may occur when affixation occurs • in- → il- in the context of l (in- + legal → illegal)
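The orthographic rule on this slide can be written down in a few lines. The sketch below covers only that single rule (in- becomes il- before l); the function name `attach_negative_prefix` is a hypothetical helper, and other assimilated forms of the prefix are deliberately left out.

```python
def attach_negative_prefix(stem: str) -> str:
    """Attach the negative prefix in-, spelled il- when the stem starts with l.

    Toy rule only: it models the single spelling change named on the slide.
    """
    prefix = "il" if stem.startswith("l") else "in"
    return prefix + stem


print(attach_negative_prefix("legal"))   # illegal
print(attach_negative_prefix("decent"))  # indecent
```

The lexicon and morphotactics decide whether in- may attach to a stem at all; a rule like this one only fixes the spelling once that decision has been made.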

Using FSAs to Represent Morphotactic Models (given a lexicon) • English nominal inflection • [Figure: FSA with states q0, q1, q2 and arcs labeled reg-n, plural (-s), irreg-pl-n, irreg-sg-n] • Inputs: cats, goose, geese
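A minimal sketch of this morphotactic FSA as a recognizer is given below. The arc topology (reg-n from q0 to q1, plural -s from q1 to q2, both irregular classes from q0 to q2) and the choice of q1 and q2 as final states are assumptions, since the figure itself did not survive in this transcript. The lexicon entries and the names `LEXICON`, `TRANSITIONS`, `arc_labels`, and `recognizes` are illustrative.

```python
# Hypothetical toy lexicon; the class names follow the arc labels on the slide.
LEXICON = {
    "reg-n": {"cat", "dog"},
    "irreg-sg-n": {"goose", "mouse"},
    "irreg-pl-n": {"geese", "mice"},
}

# Assumed topology of the nominal-inflection FSA.
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "plural-s"): "q2",
}
FINAL_STATES = {"q1", "q2"}  # assumption: a bare regular noun is also accepted


def arc_labels(morpheme):
    """Return every arc label that this morpheme can match."""
    labels = {cls for cls, words in LEXICON.items() if morpheme in words}
    if morpheme == "s":
        labels.add("plural-s")
    return labels


def recognizes(word):
    """Try the whole word and every stem+suffix split against the FSA."""
    splits = [[word]] + [[word[:i], word[i:]] for i in range(1, len(word))]
    for morphemes in splits:
        states = {"q0"}
        for morpheme in morphemes:
            states = {
                TRANSITIONS[(state, label)]
                for state in states
                for label in arc_labels(morpheme)
                if (state, label) in TRANSITIONS
            }
        if states & FINAL_STATES:
            return True
    return False


for w in ["cats", "goose", "geese", "gooses"]:
    print(w, recognizes(w))
```

On the slide's inputs, cats, goose, and geese are accepted, while gooses is rejected because no plural arc leaves the state reached by an irregular singular noun.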

Derivational morphology: adjective fragment • [Figure: FSA with states q0–q5 and arcs labeled un-, adj-root1, adj-root2, -er, -ly, -est] • Adj-root1: clear, happy, real • Adj-root2: big, red

FSAs can also represent the Lexicon • Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj_root2 = {big, red}) and then expand each of these stems into its letters (e.g. red → r e d) to get a recognizer for adjectives • [Figure: letter-level FSA over states q0–q7 with arcs labeled un-, r, e, d, b, i, g, -er, -est]
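The expansion into letters can be sketched as follows. This example assumes the morphotactics (un-)? adj-root1 (-er | -ly | -est)? and adj-root2 (-er | -est)?, which is a reading of the adjective fragment rather than something the transcript states outright. It builds a trie, which is a deterministic but unminimized letter-level FSA; `surface_strings`, `build_letter_fsa`, and `recognizes` are hypothetical names.

```python
ADJ_ROOT1 = ["clear", "happy", "real"]
ADJ_ROOT2 = ["big", "red"]


def surface_strings():
    """Enumerate every morpheme sequence allowed by the assumed morphotactics."""
    for root in ADJ_ROOT1:
        for prefix in ("", "un"):
            for suffix in ("", "er", "ly", "est"):
                yield prefix + root + suffix
    for root in ADJ_ROOT2:
        for suffix in ("", "er", "est"):
            yield root + suffix


def build_letter_fsa(words):
    """Build a trie: transitions[state] maps a letter to the next state."""
    transitions = [{}]  # state 0 is the start state
    finals = set()
    for word in words:
        state = 0
        for letter in word:
            if letter not in transitions[state]:
                transitions.append({})
                transitions[state][letter] = len(transitions) - 1
            state = transitions[state][letter]
        finals.add(state)
    return transitions, finals


def recognizes(fsa, word):
    """Follow one letter arc per character; accept only in a final state."""
    transitions, finals = fsa
    state = 0
    for letter in word:
        if letter not in transitions[state]:
            return False
        state = transitions[state][letter]
    return state in finals


ADJ_FSA = build_letter_fsa(surface_strings())
print(recognizes(ADJ_FSA, "unclearly"))  # True
print(recognizes(ADJ_FSA, "unbig"))      # False
```

Because no orthographic rules are applied, this recognizer accepts the unspelled concatenation bigest and rejects biggest.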

But... • Covering the whole lexicon this way will require very large FSAs with consequent search and maintenance problems • Adding new items to the lexicon means recomputing the whole FSA • Non-determinism • FSAs tell us whether a word is in the language or not, but usually we want to know more: • What is the stem? • What are the affixes and what sort are they? • We used this information to recognize the word: can we get it back?

Parsing with Finite State Transducers • cats → cat +N +PL • Koskenniemi’s two-level morphology • Idea: word is a relationship between lexical level (its morphemes) and surface level (its orthography) • Morphological parsing: find the mapping (transduction) between lexical and surface levels

Finite State Transducers can represent this mapping • FSTs map between one set of symbols and another using an FSA whose alphabet Σ is composed of pairs of symbols from input and output alphabets • In general, FSTs can be used for • Translators (Hello:Ciao) • Parsers/generators (Hello:How may I help you?) • As well as Kimmo-style morphological parsing

FST is a 5-tuple consisting of • Q: set of states {q0,q1,q2,q3,q4} • Σ: an alphabet of complex symbols, each an i/o pair s.t. i ∈ I (an input alphabet) and o ∈ O (an output alphabet), so Σ ⊆ I × O • q0: a start state • F: a set of final states in Q {q4} • δ(q,i:o): a transition function mapping Q × Σ to Q • Emphatic Sheep → Quizzical Cow • [Figure: arcs q0→q1 on b:m, q1→q2 on a:o, q2→q3 on a:o, q3→q3 on a:o, q3→q4 on !:?]
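The 5-tuple can be written out directly as a small data structure. The sketch below encodes the sheep-to-cow transducer under the assumption that the arcs run q0→q1 on b:m, q1→q2 on a:o, q2→q3 on a:o, with an a:o self-loop on q3 and !:? into q4; the class name `FST` and method `transduce` are hypothetical, and the traversal assumes the machine is deterministic on its input side.

```python
from dataclasses import dataclass


@dataclass
class FST:
    states: set    # Q
    alphabet: set  # Sigma: a set of (input, output) pairs
    start: str     # q0
    finals: set    # F
    delta: dict    # (state, (input, output)) -> next state

    def transduce(self, text):
        """Return the output string if text is accepted, otherwise None."""
        state, output = self.start, []
        for ch in text:
            matches = [
                (o, nxt)
                for (q, (i, o)), nxt in self.delta.items()
                if q == state and i == ch
            ]
            if not matches:
                return None
            o, state = matches[0]  # deterministic on the input side
            output.append(o)
        return "".join(output) if state in self.finals else None


DELTA = {
    ("q0", ("b", "m")): "q1",
    ("q1", ("a", "o")): "q2",
    ("q2", ("a", "o")): "q3",
    ("q3", ("a", "o")): "q3",  # self-loop: any number of extra a's
    ("q3", ("!", "?")): "q4",
}

SHEEP_TO_COW = FST(
    states={"q0", "q1", "q2", "q3", "q4"},
    alphabet={pair for (_, pair) in DELTA},
    start="q0",
    finals={"q4"},
    delta=DELTA,
)

print(SHEEP_TO_COW.transduce("baaa!"))  # mooo?
```

An input such as ba! returns None because the machine is not in a state with a !:? arc when the exclamation mark arrives.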

FST for a 2-level Lexicon • E.g. [Figure: FST over states q0–q7; one path spells cat with arcs c:c, a:a, t:t; a second path carries arcs g, e:o, e:o, s, e (goose/geese)]

Lexical: c a t +N +PL • Surface: c a t s • FST for English Nominal Inflection • [Figure: FST over states q0–q7 with stem arcs reg-n, irreg-n-sg, irreg-n-pl; +N:ε arcs; and number arcs +PL:^s#, +SG:-#, +PL:-s#]
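A rough sketch of parsing with such a transducer follows. It makes two simplifying assumptions: the stem arcs are already expanded with a toy lexicon (cat, goose/geese), and the intermediate symbols ^ and # are collapsed so that each arc pairs a lexical symbol directly with a surface string. The arc list `ARCS`, the final state set, and the function `parse` are hypothetical.

```python
# Each arc: (source state, lexical symbol, surface string, target state).
# An empty surface string plays the role of epsilon on the surface side.
ARCS = [
    ("q0", "cat", "cat", "q1"),      # regular noun stem
    ("q1", "+N", "", "q4"),
    ("q4", "+SG", "", "q7"),
    ("q4", "+PL", "s", "q7"),
    ("q0", "goose", "goose", "q2"),  # irregular singular stem
    ("q2", "+N", "", "q5"),
    ("q5", "+SG", "", "q7"),
    ("q0", "goose", "geese", "q3"),  # irregular plural: lexical goose, surface geese
    ("q3", "+N", "", "q6"),
    ("q6", "+PL", "", "q7"),
]
FINALS = {"q7"}


def parse(surface):
    """Return every lexical analysis whose surface side spells the input."""
    results = []

    def walk(state, rest, lexical):
        if state in FINALS and not rest:
            results.append(" ".join(lexical))
        for src, lex, surf, dst in ARCS:
            if src == state and rest.startswith(surf):
                walk(dst, rest[len(surf):], lexical + [lex])

    walk("q0", surface, [])
    return results


print(parse("cats"))   # ['cat +N +PL']
print(parse("geese"))  # ['goose +N +PL']
```

The search is a plain depth-first walk, which terminates here because the arc graph has no cycles. Unlike the bare FSA, the transducer gives back the stem and features it used to recognize the word.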

Combining (via cascade or composition) this FST with FSTs for each noun type replaces e.g. reg-n with every regular noun representation in the lexicon (cf. J&M p.76) • e.g. Reg-noun-stem: cat • [Figure: expanded FST running from q0 to q7]

Orthographic Rules and FSTs • Define additional FSTs to implement rules such as consonant doubling (beg → begging), ‘e’ deletion (make → making), ‘e’ insertion (watch → watches), etc.
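The three rules named above can be approximated with regular-expression rewrites over an intermediate form such as watch^s#, where ^ marks a morpheme boundary and # the word end. This is a rough sketch, not a set of two-level rule transducers: the contexts are simplified assumptions, and `apply_orthographic_rules` is a hypothetical name.

```python
import re


def apply_orthographic_rules(intermediate: str) -> str:
    """Map an intermediate form like 'watch^s#' to a surface spelling."""
    form = intermediate
    # e-insertion: add e between a stem ending in s, x, z, ch, sh and -s.
    form = re.sub(r"(s|x|z|ch|sh)\^s#", r"\1es#", form)
    # e-deletion: drop a stem-final e (after vowel + consonants) before -ing/-ed.
    form = re.sub(r"([aeiou][^aeiou]+)e\^(ing|ed)#", r"\1\2#", form)
    # consonant doubling: only for stems with a single vowel followed by
    # one final consonant, before -ing/-ed.
    form = re.sub(
        r"^([^aeiou]*)([aeiou])([bdgmnprt])\^(ing|ed)#", r"\1\2\3\3\4#", form
    )
    # Erase any boundary symbols that no rule consumed.
    return form.replace("^", "").replace("#", "")


for item in ["watch^s#", "make^ing#", "beg^ing#", "cat^s#"]:
    print(item, "->", apply_orthographic_rules(item))
```

Each substitution only fires in its own context, so a form like cat^s# passes through unchanged apart from boundary removal. A full system would state each rule as its own FST and run them in parallel or in cascade.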

Note: These FSTs can be used for generation or recognition by simply exchanging the input and output alphabets
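As an illustration of that note, the sketch below stores a tiny transducer as a list of arcs and obtains the parser from the generator by swapping the two labels on every arc. The arc list, the state names, and the functions `invert` and `transduce` are hypothetical, and the lexical form is written without spaces (cat+N+PL) to keep the traversal simple.

```python
# Each arc: (source state, input string, output string, target state).
GENERATOR = [
    ("q0", "cat", "cat", "q1"),
    ("q1", "+N", "", "q2"),
    ("q2", "+SG", "", "q3"),
    ("q2", "+PL", "s", "q3"),
]
FINALS = {"q3"}


def invert(arcs):
    """Swap the input and output label of every arc."""
    return [(src, out, inp, dst) for src, inp, out, dst in arcs]


def transduce(arcs, text, start="q0"):
    """Return every output string the arcs produce for the input text."""
    results = []

    def walk(state, rest, out):
        if state in FINALS and not rest:
            results.append(out)
        for src, inp, o, dst in arcs:
            if src == state and rest.startswith(inp):
                walk(dst, rest[len(inp):], out + o)

    walk(start, text, "")
    return results


print(transduce(GENERATOR, "cat+N+PL"))      # ['cats']
print(transduce(invert(GENERATOR), "cats"))  # ['cat+N+PL']
```

The same traversal code serves both directions; only the arc labels change.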

Summing Up • FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology • Key is to provide an FST for each of multiple levels of representation and then to combine those FSTs using a variety of operators (cf. AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley) • Other (older) approaches are still widely used, e.g. the rule-based Porter Stemmer described in J&M appendix B • Next time: Read Ch 4
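For contrast with the FST approach, here is a sketch of a single step of the Porter stemmer, Step 1a, which strips plural-like endings with ordered suffix rules and no lexicon. It covers only that one step, not the full algorithm, and the function name `porter_step_1a` is a hypothetical label.

```python
def porter_step_1a(word: str) -> str:
    """Apply Porter Step 1a: the first matching suffix rule wins."""
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss
    if word.endswith("ies"):
        return word[:-2]  # ies -> i
    if word.endswith("ss"):
        return word       # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]  # s -> (deleted)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The output poni for ponies shows the trade-off: a stemmer produces a normalized key for retrieval rather than a true stem with features.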

Word Classes • AKA morphological classes, parts-of-speech • Closed vs. open (function vs. content) class words • Pronoun, preposition, conjunction, determiner,… • Noun, verb, adverb, adjective,…

How do people represent words? • Hypotheses: • Full listing hypothesis: words listed • Minimum redundancy hypothesis: morphemes listed • Experimental evidence: • Priming experiments (Does seeing/hearing one word facilitate recognition of another?) suggest neither • Regularly inflected forms prime stem but not derived forms • But spoken derived words can prime stems if they are semantically close (e.g. government/govern but not department/depart)

Speech errors suggest affixes must be represented separately in the mental lexicon • easy enoughly
