
Lecture 3


monicak

Presentation Transcript


  1. Lecture 3 Morphology: Parsing Words CS 4705

  2. What is morphology? • The study of how words are composed from smaller, meaning-bearing units (morphemes) • Stems: child-ren, un-doubt-edly • Affixes (prefixes, suffixes, circumfixes, infixes) • Im-material • Try-ing • Ge-sag-t • Abso-bl**dy-lutely • Concatenative vs. non-concatenative (e.g. Arabic root-and-pattern) morphological systems

  3. Morphology Helps Define Word Classes • AKA morphological classes, parts-of-speech • Closed vs. open (function vs. content) class words • Pronoun, preposition, conjunction, determiner,… • Noun, verb, adverb, adjective,…

  4. (English) Inflectional Morphology • Word stem + grammatical morpheme • Usually produces word of same class • Usually serves a syntactic function (e.g. agreement) • like --> likes or liked • bird --> birds • Nominal morphology • Plural forms • -s or -es • Irregular forms (goose/geese) • Mass vs. count nouns (fish/fish, email or emails?) • Possessives (cat’s, cats’)

  5. Verbal inflection • Main verbs (sleep, like, fear) are relatively regular • -s, -ing, -ed • And productive: emailed, instant-messaged, faxed, homered • But some are not regular: eat/ate/eaten, catch/caught/caught • Primary (be, have, do) and modal verbs (can, will, must) are often irregular and not productive • Be: am/is/are/were/was/been/being • Irregular verbs are few (~250) but frequently occurring • So English inflectional morphology is fairly easy to model, with some special cases

  6. (English) Derivational Morphology • Word stem + grammatical morpheme • Usually produces word of different class • More complicated than inflectional • E.g. verbs --> nouns • -ize verbs --> -ation nouns • generalize, realize --> generalization, realization • E.g.: verbs, nouns --> adjectives • embrace, pity --> embraceable, pitiable • care, wit --> careless, witless

  7. E.g.: adjective --> adverb • happy --> happily • But “rules” have many exceptions • Less productive: *evidence-less, *concern-less, *go-able, *sleep-able • Meanings of derived terms harder to predict by rule • clueless, careless, nerveless

  8. Parsing • Taking a surface input and identifying its components and underlying structure • Morphological parsing: parsing a word into stem and affixes, identifying its parts and their relationships • Stem and features: • goose --> goose +N +SG or goose +V • geese --> goose +N +PL • gooses --> goose +V +3SG • Bracketing: indecipherable --> [in [[de [cipher]] able]]

  9. Why parse words? • For spell-checking • Is muncheble a legal word? • To identify a word’s part-of-speech (pos) • For sentence parsing, for machine translation, … • To identify a word’s stem • For information retrieval • Why not just list all word forms in a lexicon?

  10. How do people represent words? • Hypotheses: • Full listing hypothesis: words listed • Minimum redundancy hypothesis: morphemes listed • Experimental evidence: • Priming experiments (Does seeing/hearing one word facilitate recognition of another?) suggest neither • Regularly inflected forms prime stem but not derived forms • But spoken derived words can prime stems if they are semantically close (e.g. government/govern but not department/depart)

  11. Speech errors suggest affixes must be represented separately in the mental lexicon • easy enoughly

  12. What do we need to build a morphological parser? • Lexicon: list of stems and affixes (w/ corresponding pos) • Morphotactics of the language: model of how and which morphemes can be affixed to a stem • Orthographic rules: spelling modifications that may occur when affixation occurs • in- --> il- in the context of l (in- + legal --> illegal)

  13. Using FSAs to Represent English Plural Nouns • English nominal inflection • Arcs: q0 --reg-n--> q1 --plural (-s)--> q2; q0 --irreg-pl-n--> q2; q0 --irreg-sg-n--> q2 • Inputs: cats, geese, goose
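
The plural-noun FSA on this slide can be sketched as a small recognizer over morpheme sequences. This is only an illustration under stated assumptions: the sub-lexicon contents, the choice of q1 and q2 as accepting states, and the names `TRANSITIONS`, `matches`, and `accepts` are not from the lecture, and spelling changes are ignored.

```python
# Tiny illustrative sub-lexicons (assumed, not taken from the lecture).
REG_N = {"cat", "dog"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese", "mice"}

# Transition table: state -> list of (arc label, next state).
TRANSITIONS = {
    "q0": [("reg-n", "q1"), ("irreg-sg-n", "q2"), ("irreg-pl-n", "q2")],
    "q1": [("plural-s", "q2")],
}
FINAL = {"q1", "q2"}  # assumed accepting states


def matches(label, morpheme):
    """Check whether a morpheme belongs to the class named on an arc."""
    if label == "reg-n":
        return morpheme in REG_N
    if label == "irreg-sg-n":
        return morpheme in IRREG_SG_N
    if label == "irreg-pl-n":
        return morpheme in IRREG_PL_N
    if label == "plural-s":
        return morpheme == "s"
    return False


def accepts(morphemes):
    """Return True if the morpheme sequence drives the FSA to a final state."""
    states = {"q0"}
    for morpheme in morphemes:
        states = {
            nxt
            for state in states
            for label, nxt in TRANSITIONS.get(state, [])
            if matches(label, morpheme)
        }
        if not states:
            return False
    return bool(states & FINAL)
```

Under these assumptions, `accepts(["cat", "s"])`, `accepts(["geese"])`, and `accepts(["goose"])` succeed, while `accepts(["goose", "s"])` is rejected because no arc leaves q2.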

  14. Derivational morphology: adjective fragment • [FSA diagram: states q0 to q5; arcs labeled un-, adj-root1, adj-root2, -er, -ly, -est] • Adj-root1: clear, happy, real (clearly); takes un-, -er, -ly, -est • Adj-root2: big, red (but not *bigly); takes -er, -est

  15. FSAs can also represent the Lexicon • Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj_root2 = {big, red}) and then expand each of these stems into its letters (e.g. red --> r e d) to get a recognizer for adjectives • [FSA diagram: states q0 to q7; arcs un-, r e d, b i g, -er, -est]
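
Expanding stems into letters amounts to compiling a word list into a letter-level automaton. The sketch below builds one as a trie; the function names `build_letter_fsa` and `accepts` and the integer state numbering are assumptions made for illustration, and only the stem sub-lexicon is expanded (prefix and suffix arcs are left out).

```python
def build_letter_fsa(words):
    """Compile a word list into a letter-level FSA (a trie).

    Returns (transitions, finals), where transitions maps
    (state, letter) -> next state and state 0 is the start state.
    """
    transitions = {}
    finals = set()
    next_state = 1
    for word in words:
        state = 0
        for letter in word:
            if (state, letter) not in transitions:
                transitions[(state, letter)] = next_state
                next_state += 1
            state = transitions[(state, letter)]
        finals.add(state)  # the state reached after the last letter accepts
    return transitions, finals


def accepts(fsa, word):
    """Follow one arc per letter; accept only if we end in a final state."""
    transitions, finals = fsa
    state = 0
    for letter in word:
        if (state, letter) not in transitions:
            return False
        state = transitions[(state, letter)]
    return state in finals


ADJ_ROOT2 = build_letter_fsa(["big", "red"])
```

Here `accepts(ADJ_ROOT2, "red")` holds, while prefixes such as "re" and unlisted words such as "bed" are rejected.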

  16. But….. • Covering the whole lexicon this way will require very large FSAs with consequent search and maintenance problems • Adding new items to the lexicon means recomputing the whole FSA • Non-determinism • FSAs tell us whether a word is in the language or not – but usually we want to know more: • What is the stem? • What are the affixes and what sort are they? • We used this information to recognize the word: can we get it back?

  17. Parsing with Finite State Transducers • cats --> cat +N +PL (a plural NP) • Koskenniemi’s two-level morphology • Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography) • Morphological parsing: find the mapping (transduction) between lexical and surface levels

  18. Finite State Transducers can represent this mapping • FSTs map between one set of symbols and another using an FSA whose alphabet Σ is composed of pairs of symbols from input and output alphabets • In general, FSTs can be used for • Translators (Hello:Ciao) • Parsers/generators (Hello:How may I help you?) • As well as Kimmo-style morphological parsing

  19. FST is a 5-tuple consisting of • Q: set of states {q0,q1,q2,q3,q4} • Σ: an alphabet of complex symbols, each an i/o pair i:o s.t. i ∈ I (an input alphabet) and o ∈ O (an output alphabet), so Σ ⊆ I × O • q0: a start state • F: a set of final states in Q {q4} • δ(q,i:o): a transition function mapping Q × Σ to Q • Emphatic Sheep --> Quizzical Cow: q0 --b:m--> q1 --a:o--> q2 --a:o--> q3 --!:?--> q4, with a loop q3 --a:o--> q3
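
The 5-tuple can be written down almost literally as a table. The sketch below encodes the sheep-to-cow transducer using the arc labels on this slide; the arc topology (the usual "baa+!" sheep-talk machine with a loop on q3) and the names `DELTA` and `transduce` are assumptions, since the diagram itself did not survive in the transcript.

```python
# Transition function: (state, input symbol) -> (output symbol, next state).
# Each entry corresponds to one i:o pair in the alphabet.
DELTA = {
    ("q0", "b"): ("m", "q1"),
    ("q1", "a"): ("o", "q2"),
    ("q2", "a"): ("o", "q3"),
    ("q3", "a"): ("o", "q3"),  # assumed loop: any number of extra a's
    ("q3", "!"): ("?", "q4"),
}
START = "q0"
FINAL = {"q4"}


def transduce(text):
    """Map an input string to its output string, or None if it is rejected."""
    state = START
    output = []
    for symbol in text:
        if (state, symbol) not in DELTA:
            return None  # no arc for this symbol: input not in the relation
        out, state = DELTA[(state, symbol)]
        output.append(out)
    return "".join(output) if state in FINAL else None
```

With this table, `transduce("baa!")` gives `"moo?"`, and strings such as `"ba!"` or `"baa"` are rejected with `None`.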

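  20. FST for a 2-level Lexicon • E.g. [FST diagram: states q0 to q7; one path q0 --c:c--> q1 --a:a--> q2 --t:t--> q3 spelling cat; a second path through q4 to q7 with arcs g, e:o, e:o, s, e relating geese and goose]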

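  21. FST for English Nominal Inflection • Example: lexical c a t +N +PL, surface c a t s • Arcs: q0 --reg-n--> q1 --+N:ε--> q4, then q4 --+PL:^s#--> q7 or q4 --+SG:-#--> q7 • q0 --irreg-n-sg--> q2 --+N:ε--> q5 --+SG:-#--> q7 • q0 --irreg-n-pl--> q3 --+N:ε--> q6 --+PL:-s#--> q7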
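
The mapping this transducer computes, from a surface noun to its stem plus features, can be imitated with plain lexicon lookups. The following is a lookup-based stand-in rather than a real FST; the lexicon entries and the function name `parse` are assumptions for illustration, and spelling rules are ignored.

```python
# Illustrative lexicon (assumed entries, not from the lecture).
REG_N = {"cat", "dog"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese": "goose", "mice": "mouse"}  # surface plural -> stem


def parse(surface):
    """Return every lexical-level analysis of a surface noun form."""
    analyses = []
    # Singular: the surface form is itself a listed stem.
    if surface in REG_N or surface in IRREG_SG_N:
        analyses.append(f"{surface} +N +SG")
    # Regular plural: a listed regular stem followed by -s.
    if surface.endswith("s") and surface[:-1] in REG_N:
        analyses.append(f"{surface[:-1]} +N +PL")
    # Irregular plural: listed explicitly with its stem.
    if surface in IRREG_PL_N:
        analyses.append(f"{IRREG_PL_N[surface]} +N +PL")
    return analyses
```

Returning a list leaves room for ambiguous forms; with this toy lexicon `parse("cats")` yields `["cat +N +PL"]` and `parse("gooses")` yields an empty list, since only noun analyses are modeled.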

  22. Useful Operations on Transducers • Cascade: running 2+ FSTs in sequence • Intersection: represent the common transitions in FST1 and FST2 (ASR: finding pronunciations) • Composition: apply FST2 transition function to result of FST1 transition function • Inversion: exchanging the input and output alphabets (recognize and generate with same FST) • cf AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley
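
Two of these operations, inversion and cascading, are easy to sketch when a transducer is stored as a list of arcs. The example reuses a sheep-to-cow style machine; its topology and all names (`ARCS`, `invert`, `run`, `cascade`) are assumptions for illustration, the machine is assumed deterministic, and true composition into a single machine is not shown.

```python
# Each arc is (source state, input symbol, output symbol, destination state).
ARCS = [
    ("q0", "b", "m", "q1"),
    ("q1", "a", "o", "q2"),
    ("q2", "a", "o", "q3"),
    ("q3", "a", "o", "q3"),
    ("q3", "!", "?", "q4"),
]
START, FINALS = "q0", {"q4"}


def invert(arcs):
    """Swap the input and output symbol on every arc."""
    return [(src, out, inp, dst) for src, inp, out, dst in arcs]


def run(arcs, text):
    """Apply a deterministic FST; return the output string or None."""
    table = {(src, inp): (out, dst) for src, inp, out, dst in arcs}
    state, output = START, []
    for symbol in text:
        if (state, symbol) not in table:
            return None
        out, state = table[(state, symbol)]
        output.append(out)
    return "".join(output) if state in FINALS else None


def cascade(machines, text):
    """Feed the output of each machine into the next one in sequence."""
    for arcs in machines:
        if text is None:
            return None
        text = run(arcs, text)
    return text


COW_TO_SHEEP = invert(ARCS)
```

Running the machine and then its inverse, `cascade([ARCS, COW_TO_SHEEP], "baaa!")`, returns the original string, which is the sense in which one FST serves for both recognition and generation.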

  23. Orthographic Rules and FSTs • Define additional FSTs to implement rules such as consonant doubling (beg --> begging), ‘e’ deletion (make --> making), ‘e’ insertion (watch --> watches), etc.
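
The three spelling rules named here can be approximated with regular-expression rewrites over an intermediate form, where "^" marks a morpheme boundary and "#" a word boundary. This is a rough sketch using regexes rather than FSTs; the rule contexts, the consonant set, and the function name `to_surface` are simplifying assumptions, and the rules over-apply in cases they do not model (for example stress in "open" + "ing", or the double vowel in "see" + "ing").

```python
import re


def to_surface(intermediate):
    """Rewrite an intermediate form such as 'watch^s#' into a surface word."""
    form = intermediate
    # 'e' insertion: add e before -s after x, s, z, ch, sh (watch^s# -> watches#).
    form = re.sub(r"(x|s|z|ch|sh)\^s#", r"\1es#", form)
    # Consonant doubling: consonant + vowel + final consonant before a
    # vowel-initial suffix (beg^ing# -> begg^ing#).
    form = re.sub(r"([^aeiou][aeiou])([bdgmnprt])\^(?=[aeiou])", r"\1\2\2^", form)
    # 'e' deletion: drop a stem-final e before a vowel-initial suffix
    # (make^ing# -> mak^ing#). Applied after doubling so hope^ing# gives hoping.
    form = re.sub(r"e\^(?=[aeiou])", "^", form)
    # Finally erase the boundary symbols.
    return form.replace("^", "").replace("#", "")
```

The order of the rewrites matters: applying 'e' deletion before consonant doubling would turn "hope^ing#" into "hopping".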

  24. Porter Stemmer • Used for tasks in which you only care about the stem • IR, modeling given/new distinction, topic detection, document similarity • Rewrite rules (e.g. misunderstanding --> misunderstand --> understand --> …) • Not perfect …. But sometimes it doesn’t matter too much • Fast and easy
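
To give a feel for stemming by rewrite rules, here is a sketch of a few suffix rules in the style of the Porter stemmer. It covers only the plural rules (sses, ies, ss, s) and a simplified -ing/-ed rule; it is not the full algorithm, it treats only a, e, i, o, u as vowels, and the function names are assumptions for illustration.

```python
import re


def has_vowel(stem):
    """Simplified vowel test (the real algorithm also treats some y's as vowels)."""
    return re.search(r"[aeiou]", stem) is not None


def strip_plural(word):
    """Plural-style suffix rules: sses -> ss, ies -> i, ss -> ss, s -> ''."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def strip_ing_ed(word):
    """Remove -ing or -ed, but only if the remaining stem contains a vowel."""
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and has_vowel(stem):
            return stem
    return word


def stem(word):
    """Apply the two rule groups in sequence to a lowercased word."""
    return strip_ing_ed(strip_plural(word.lower()))
```

The vowel condition is what keeps "sing" intact while "misunderstanding" loses its -ing. Outputs such as "poni" for "ponies" show why the result is a stem for matching purposes, not necessarily a dictionary word.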

  25. Summing Up • FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology • But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer • Next time: • Read Ch 4 • Read over HW1 and ask questions now
