
Corpora and Statistical Methods – Lecture 3




Presentation Transcript

  1. Corpora and Statistical Methods – Lecture 3 Albert Gatt

  2. Part 2 Morphology and productivity

  3. Morphology • Many languages have multiple word forms related to a single base form (root form) • Lexeme = base form from which related forms are produced • Three classes of productive morphological processes: • Inflection • Derivation • Compounding

  4. Inflection • Addition of prefixes and suffixes that • leave core meaning intact • leave grammatical category intact • add/alter some features of meaning (especially relevant to syntax) • Examples: • -s to form plural nouns • -ed to form past tense

  5. Derivation • Addition of prefixes and suffixes which: • result in a more radical change in meaning • often result in a change of syntactic category • Examples: • English -ly (ADJ → ADV): wide-ly • English -en (ADJ → V): weak-en • English -able (V → ADJ): accept-able

  6. Compounding • Combination of two independent words into a new word • NB new word can be orthographically one or several words • can cause recognisable changes in phonology • new compound has a new meaning (not necessarily 100% compositional) • Example: English N-N compounds • disk drive, mad cow disease, credit crunch

  7. Regular vs. irregular • Inflectional and derivational rules often have exceptions. • E.g. past tense in English: • regular: -ed suffix • irregular: bring – brought, ring – rang, etc. • Sub-regularities are observable: • English verbs ending in -ing/-ink (ring, sink) seem to follow a pattern of their own: • rang, sank, …
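
One common way to picture the regular/irregular split is a table of listed exceptions backed by a default rule. The sketch below is only an illustration, not a model proposed in the lecture: `IRREGULAR` and `past_tense` are hypothetical names, and the spelling rule is deliberately naive (it ignores consonant doubling, as in stop/stopped, and y/ied alternations).

```python
# Hypothetical exception table: irregular forms are listed, not derived.
IRREGULAR = {"bring": "brought", "ring": "rang", "sink": "sank"}


def past_tense(verb):
    """Return a past-tense form: listed exception first, default -ed rule otherwise."""
    if verb in IRREGULAR:
        return IRREGULAR[verb]
    # Naive regular rule (assumption): add -d after a final e, else -ed.
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"


for v in ["walk", "bake", "ring", "bring"]:
    print(v, "->", past_tense(v))
```

A flat table like this cannot express the sub-regularity shared by ring/rang and sink/sank: each irregular form has to be listed separately.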

  8. Productive vs non-productive • Some morphological processes or categories seem to have greater potential to form new words than others • e.g. English -able, -ness • compare to English -th: warmth, strength… (much less productive)

  9. Classical approaches to productivity • Jackendoff (1975): • unproductive rules are called redundancy rules: • e.g. warmth is listed in the English speaker’s (mental) lexicon as a single word • the redundancy rule captures the knowledge that it can be split into warm+th • rule as such isn’t really “active”, i.e. forms not produced online • contrast with productive rules: • e.g. many adjectives with -able are produced “online”, not stored

  10. Features of classical approaches • Rely on a binary distinction (productive vs. unproductive) • Productive rules are typically regular; sub-regularities are not considered much (Dressler 2003) • Most of these approaches do not look at corpus data • Related psycholinguistic model: Pinker’s (1997) dual-route model of morphological processing

  11. Corpus-based approaches • View productivity as a gradable phenomenon: • some forms become ingrained through frequent usage • category can still be productive to some extent • productivity estimated in terms of a category’s potential to produce new forms • can account for sub-regularities: productivity of a category is due to a lot of factors, including analogy to existing words

  12. The continuum • [Diagram: a scale running from “productive morphological process” to “lexicalised word”, with ADJ+ness → Noun and ADJ+th → Noun placed along it] • Productive processes tend to: • be compositional • result in a lot of new words

  13. Practical application (I) • No finite lexicon can contain all words of a language at a certain time • productive processes can be exploited to parse new/unseen lexical items • this is helped by the compositionality of productive processes • can also help to distinguish creative neologism from systematic rule-application. Compare: • well-defined, well-intentioned, well-specified: lots of adjectives with a well- prefix • YouTube: a one-off
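
As a rough illustration of how a compositional, productive process can be used to analyse a word missing from the lexicon, the sketch below strips a known suffix and checks that the remaining base has the category the rule expects. Everything here is an assumption made for the example: the tiny `LEXICON`, the `RULES` list and the `analyse` function are hypothetical, and spelling changes (e.g. happy/happiness) are not handled.

```python
# Hypothetical mini-lexicon: base form -> syntactic category.
LEXICON = {"weird": "ADJ", "kind": "ADJ", "accept": "V"}

# Hypothetical productive rules: (suffix, category of base, category of result).
RULES = [("ness", "ADJ", "N"), ("able", "V", "ADJ")]


def analyse(word):
    """Return (base, suffix, category) for a word, or None if no analysis is found."""
    if word in LEXICON:
        return (word, None, LEXICON[word])
    for suffix, base_cat, out_cat in RULES:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            # Accept the analysis only if the base is known with the right category.
            if LEXICON.get(base) == base_cat:
                return (base, "-" + suffix, out_cat)
    return None


print(analyse("weirdness"))   # ('weird', '-ness', 'N')
print(analyse("acceptable"))  # ('accept', '-able', 'ADJ')
print(analyse("youtube"))     # None
```

A one-off coinage such as YouTube gets no analysis here, since no listed rule applies to it.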

  14. Practical application (II) • Polarity/sentiment analysis: • aim is to identify the overall positive/negative slant of a text concerning a topic • Moilanen and Pulman (2008) obtain improvements by considering adjectives formed with well- vs. -infested, etc.

  15. Theoretical implications • raises interesting questions about the relationship between corpus-based measures and psycholinguistic data • likelihood of a morphological process being applied depends on style, genre, speech community… • can give an indication of language change over time (some processes are fossilised, others become more productive)

  16. Statistical measures of productivity (Baayen 2006)

  17. What we need • A measure of productivity of a process/category C should reflect: • our intuitions about how frequently we encounter C • how easily native speakers can form new words using C • Is it easier to produce a noun with –th (like warmth) or one with –ness (like goodness)?

  18. Realised productivity (RP) • Given a morphological category C, RP gives a rough indication of the past utility of C in forming new words. • Measured as the number of distinct types formed using C in a corpus of size N. • E.g. regular past tense -ed displays many more types than sub-regular forms such as keep-kept/sleep-slept

  19. Realised productivity cont/d • Why types, not tokens? • Productive processes have lots of types which are hapaxes, or are very infrequent. • Words formed from irregular processes tend to be very frequent. • Some limitations: • a high RP for a category does not imply that it will keep forming lots of new words • RP is heavily dependent on corpus size
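
A minimal sketch of the type/token contrast behind RP, assuming a toy token list and a naive "ends with the suffix" test for category membership. Both are hypothetical stand-ins for a real corpus and a real morphological analysis.

```python
def realised_productivity(tokens, in_category):
    """RP: number of distinct types in the corpus that belong to the category."""
    return len({t for t in tokens if in_category(t)})


def token_count(tokens, in_category):
    """Number of running tokens that belong to the category."""
    return sum(1 for t in tokens if in_category(t))


# Hypothetical toy corpus.
tokens = ["she", "kept", "walking", "he", "kept", "it", "they",
          "walked", "talked", "jumped", "kept", "slept"]


def is_ed(w):
    # Naive membership test (assumption): regular past tense in -ed.
    return w.endswith("ed")


def is_ept(w):
    # Naive membership test (assumption): sub-regular forms in -ept.
    return w.endswith("ept")


print(realised_productivity(tokens, is_ed), token_count(tokens, is_ed))    # 3 3
print(realised_productivity(tokens, is_ept), token_count(tokens, is_ept))  # 2 4
```

In this toy sample the -ept forms have more tokens but fewer types than the -ed forms, which is why RP counts types.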

  20. Expanding productivity (P*) • P* gives a rough indication of the rate of expansion of C. • Focuses on the number of hapaxes produced using C in the corpus. • aka hapax-conditioned productivity • NB: P* is still heavily dependent on corpus size!
      $$P^{*} = \frac{\text{no. of types formed using } C \text{ which occur only once in } N \text{ tokens}}{\text{no. of hapaxes in the corpus}}$$
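
A minimal sketch of P*, taken here as the share of all hapaxes in the corpus that belong to C. The token list and the suffix-based membership tests are hypothetical and only serve to make the ratio concrete.

```python
from collections import Counter


def expanding_productivity(tokens, in_category):
    """P*: hapaxes belonging to the category divided by all hapaxes in the corpus."""
    counts = Counter(tokens)
    hapaxes = [w for w, c in counts.items() if c == 1]
    if not hapaxes:
        return 0.0  # no hapaxes at all: the ratio is undefined, report 0.0
    return sum(1 for w in hapaxes if in_category(w)) / len(hapaxes)


# Hypothetical toy corpus: the hapaxes are goodness, kindness, happiness, cat.
tokens = ["goodness", "warmth", "warmth", "kindness", "the", "the",
          "strength", "strength", "happiness", "cat"]

p_star_ness = expanding_productivity(tokens, lambda w: w.endswith("ness"))
p_star_th = expanding_productivity(tokens, lambda w: w.endswith("th"))
print(p_star_ness)  # 0.75
print(p_star_th)    # 0.0
```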

  21. Potential productivity (P) • Gives an indication of how likely a category C is to form new words in future. • I.e. indicates whether C may already be saturated • aka category-conditioned productivity
      $$P = \frac{\text{no. of types in } C \text{ which occur only once in a corpus of } N \text{ tokens}}{\text{no. of tokens of category } C}$$

  22. Some more on P • Unlike RP and P*, P is not very sensitive to corpus size as such • However, very sensitive to frequency of the category. • e.g. if C is realised only once in a corpus of size N, then P = 1! • Recent empirical work has shown that RP and P* correlate very strongly, but both exhibit a weak correlation with P (Vegnaduzzo 2009) • pattern non-X has high RP and P*, but low P • pattern X-ish has low RP and P*, but high P
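
A minimal sketch of P, taken here as the number of hapaxes in C divided by the number of tokens of C. It also reproduces the edge case noted above, where a category attested by a single token gets P = 1. The corpus and the suffix tests are hypothetical.

```python
from collections import Counter


def potential_productivity(tokens, in_category):
    """P: hapaxes in the category divided by the number of tokens of the category."""
    counts = Counter(t for t in tokens if in_category(t))
    n_tokens = sum(counts.values())
    if n_tokens == 0:
        return 0.0  # category not attested: the ratio is undefined, report 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / n_tokens


# Hypothetical toy corpus.
tokens = ["goodness", "goodness", "goodness", "kindness", "greyish", "the", "the"]

p_ness = potential_productivity(tokens, lambda w: w.endswith("ness"))
p_ish = potential_productivity(tokens, lambda w: w.endswith("ish"))
print(p_ness)  # 0.25 (1 hapax out of 4 tokens)
print(p_ish)   # 1.0  (a single token, which is necessarily a hapax)
```

The value of 1.0 for -ish says little about the category itself: it comes from the category's very low frequency in this sample.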

  23. In graphics (after Baayen 2006) • [Figure: growth curve for a specific category; x-axis: corpus size; y-axis: no. of types; the slope of the tangent represents the growth rate]
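
A growth curve of this kind can be traced by reading the corpus token by token and recording how many distinct types of the category have been seen so far. The sketch below assumes a hypothetical token list and approximates the slope by a finite difference between two points on the curve rather than by a true tangent.

```python
def growth_curve(tokens, in_category):
    """Return [(corpus size so far, no. of category types seen so far), ...]."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        if in_category(tok):
            seen.add(tok)
        curve.append((n, len(seen)))
    return curve


def slope(curve, i, j):
    """Finite-difference growth rate between two points on the curve."""
    (n1, v1), (n2, v2) = curve[i], curve[j]
    return (v2 - v1) / (n2 - n1)


# Hypothetical toy corpus.
tokens = ["kindness", "the", "kindness", "goodness", "a", "sadness", "the", "kindness"]
curve = growth_curve(tokens, lambda w: w.endswith("ness"))
print(curve[-1])           # (8, 3)
print(slope(curve, 0, 7))  # 2/7, about 0.286 new types per token
```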

  24. P vs. RP and P* • A category can have low RP and P*, but high P. • Corresponds to the “ease” with which new words can be formed using the category. • Even though a category has high RP, it may have reached “saturation”, so have low P.

  25. The psycholinguistic connection • Rule vs. direct access: • To produce a word (e.g. illegal), you can either retrieve it directly as a stored form, or apply the rule on the fly. • Evidence suggests that the frequency of the base form relative to the derived form is related to which of the two alternatives applies.

  26. The psycholinguistic connection • Complexity-based affix ordering: • Corpus research: more productive affixes follow less productive ones in word formation • It seems that more highly predictable (low productivity) affixes are processed first. • High productivity may also imply less likelihood of entering into further derivational processes.

  27. Works cited • S. Vegnaduzzo (2009). Morphological productivity rankings of complex adjectives. Proc. NAACL-HLT Workshop on Computational Approaches to Linguistic Creativity. • K. Moilanen and S. Pulman (2008). The good, the bad and the unknown: Morphosyllabic sentiment tagging of unseen words. Proc. ACL 2008. • Baayen 2006 linked from web page
