Corpora and Statistical Methods

Corpora and Statistical Methods – Lecture 3 Albert Gatt

Part 2 Morphology and productivity

Morphology • Many languages have multiple word forms related to a single base form (root form) • Lexeme = base form from which related forms are produced • Three classes of productive morphological processes: • Inflection • Derivation • Compounding

Inflection • Addition of prefixes and suffixes that • leave core meaning intact • leave grammatical category intact • add/alter some features of meaning (especially relevant to syntax) • Examples: • -s to form plural nouns • -ed to form past tense

Derivation • Addition of prefixes and suffixes which: • result in a more radical change in meaning • often result in change of syntactic category • Examples: • English -ly (ADJ → ADV): wide-ly • English -en (ADJ → V): weak-en • English -able (V → ADJ): accept-able

Compounding • Combination of two independent words into a new word • NB new word can be orthographically one or several words • can cause recognisable changes in phonology • new compound has a new meaning (not necessarily 100% compositional) • Example: English N-N compounds • disk drive, mad cow disease, credit crunch

Regular vs. irregular • Inflectional and derivational rules often have exceptions. • E.g. Past tense in English: • regular: -ed suffix • irregular: bring – brought, ring – rang etc. • Sub-regularities observable: • English verbs ending in -ing/-ink (e.g. ring, sink) seem to display a particular pattern: • rang, sank, …
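
The split between a regular rule and listed exceptions can be sketched in a few lines. This is a toy illustration only: the exception table holds just the forms mentioned on the slide, the function name past_tense is made up for this example, and English spelling adjustments (stop/stopped, try/tried) are deliberately ignored.

```python
# Hand-picked exceptions taken from the slide; not a real lexicon.
IRREGULAR_PAST = {"bring": "brought", "ring": "rang", "sink": "sank"}


def past_tense(verb: str) -> str:
    """Look the verb up among the exceptions, else apply the regular -ed rule."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    # Regular rule; only the final-e case is handled, other spelling
    # changes (consonant doubling, y -> ied) are out of scope here.
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"


for v in ("bring", "sink", "walk", "bake"):
    print(v, "->", past_tense(v))
```

Any verb missing from the table falls through to the regular rule, so an unlisted irregular such as "sing" would wrongly come out as "singed". That is the practical cost of an incomplete exception list.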

Productive vs non-productive • Some morphological processes or categories seem to have greater potential to form new words than others • e.g. English -able, -ness • compare to English –th: warmth, strength… (much less productive)

Classical approaches to productivity • Jackendoff (1975): • unproductive rules are called redundancy rules: • e.g. warmth is listed in the English speaker’s (mental) lexicon as a single word • the redundancy rule captures the knowledge that it can be split into warm+th • rule as such isn’t really “active”, i.e. forms not produced online • contrast with productive rules: • e.g. Many adjectives with -able are produced “online”, not stored

Features of classical approaches • Rely on a binary distinction (un/productive) • Productive rules are typically regular, and sub-regularities are not considered much (Dressler 2003) • Most of these approaches do not look at corpus data • Related psycholinguistic model: Pinker’s (1997) dual-route model of morphological processing

Corpus-based approaches • View productivity as a gradable phenomenon: • some forms become ingrained through frequent usage • category can still be productive to some extent • productivity estimated in terms of a category’s potential to produce new forms • can account for sub-regularities: productivity of a category is due to a lot of factors, including analogy to existing words

The continuum • [Figure: a continuum between “lexicalised word” and “productive morphological process”, with ADJ+th → Noun and ADJ+ness → Noun placed along it] • Productive processes tend to: • be compositional • result in a lot of new words

Practical application (I) • No finite lexicon can contain all words of a language at a certain time • productive processes can be exploited to parse new/unseen lexical items • this is helped by the compositionality of productive processes • can also help to distinguish creative neologism from systematic rule-application. compare: • well-defined, well-intentioned, well-specified • lots of adjectives with a well- prefix • YouTube • a one-off
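
A minimal sketch of how a productive process might be exploited to analyse an unseen item. The base list, the affix lists, and the function name analyse are all hypothetical and chosen only to mirror the slide's examples; a real system would need a proper lexicon and morphological analyser.

```python
# Hypothetical toy resources, for illustration only.
KNOWN_BASES = {"defined", "intentioned", "specified", "accept", "good"}
PREFIXES = ("well-",)
SUFFIXES = ("able", "ness")


def analyse(word):
    """Return (prefix, base, suffix) if word = known base + one listed affix, else None."""
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest in KNOWN_BASES:
            return (prefix, rest, "")
    for suffix in SUFFIXES:
        rest = word[:-len(suffix)]
        if word.endswith(suffix) and rest in KNOWN_BASES:
            return ("", rest, suffix)
    return None


print(analyse("well-specified"))  # decomposes via the well- prefix
print(analyse("acceptable"))      # decomposes via the -able suffix
print(analyse("YouTube"))         # no analysis: treated as a one-off
```

Words that decompose are candidates for systematic rule application. Words that return None are either one-off coinages or simply missing from the toy base list, and the sketch cannot tell these two cases apart.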

Practical application (II) • Polarity/sentiment analysis: • aim is to identify the overall positive/negative slant of a text concerning a topic • Moilanen and Pulman (2008) obtain improvements by considering adjectives formed with well- vs –infested etc

Theoretical implications • raises interesting questions about the relationship between corpus-based measures and psycholinguistic data • likelihood of a morphological process being applied depends on style, genre, speech community… • can give an indication of language change over time (some processes are fossilised, others become more productive)

Statistical measures of productivity (Baayen 2006)

What we need • A measure of productivity of a process/category C should reflect: • our intuitions about how frequently we encounter C • how easily native speakers can form new words using C • Is it easier to produce a noun with –th (like warmth) or one with –ness (like goodness)?

Realised productivity (RP) • Given a morphological category C, RP gives a rough indication of the past utility of C in forming new words. • Measured as the number of distinct types formed using C in a corpus of size N. • E.g. regular past tense -ed displays many more types than sub-regular forms such as keep-kept/sleep-slept
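
As a rough sketch, RP is a type count over a tokenised corpus. The corpus below is invented, and category membership is decided by crude tests (a suffix match for -ness, a hand-written set for -th, since matching on the string "th" alone would also catch words like "with"); both are assumptions made for illustration.

```python
def realised_productivity(tokens, in_category):
    """Number of distinct word types in the corpus that belong to the category."""
    return len({t for t in tokens if in_category(t)})


def is_ness(word):
    # Crude proxy: any word ending in "ness" counts as ADJ+ness.
    return word.endswith("ness")


def is_th(word):
    # Hand-listed members of the -th category.
    return word in {"warmth", "strength"}


# Invented toy corpus of 8 tokens.
tokens = ["goodness", "kindness", "goodness", "sadness",
          "warmth", "warmth", "strength", "warmth"]

print(realised_productivity(tokens, is_ness))  # 3 types
print(realised_productivity(tokens, is_th))    # 2 types
```

Repeated tokens (goodness twice, warmth three times) do not raise the score, because RP counts types rather than tokens.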

Realised productivity cont/d • Why types, not tokens? • Productive processes have lots of types which are hapaxes, or are very infrequent. • Words formed from irregular processes tend to be very frequent. • Some limitations: • a high RP for a category does not imply that it will keep forming lots of new words • RP is heavily dependent on corpus size

Expanding productivity (P*) • P* gives a rough indication of the rate of expansion of C. • Focuses on the number of hapaxes produced using C in the corpus. • aka hapax-conditioned productivity • NB: P* is still heavily dependent on corpus size! • $P^{*} = \frac{\text{no. of types formed using } C \text{ which occur only once in } N \text{ tokens}}{\text{no. of hapaxes in the corpus}}$
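
A small sketch of P* under the definition on this slide: the share of all corpus hapaxes that belong to C. The corpus and the membership tests are invented for illustration, and returning 0.0 for a corpus with no hapaxes is a convention chosen here, not something the slides specify.

```python
from collections import Counter


def expanding_productivity(tokens, in_category):
    """Hapaxes belonging to the category, divided by all hapaxes in the corpus."""
    counts = Counter(tokens)
    hapaxes = [w for w, c in counts.items() if c == 1]
    if not hapaxes:
        return 0.0  # convention chosen for this sketch
    return sum(1 for w in hapaxes if in_category(w)) / len(hapaxes)


def is_ness(word):
    return word.endswith("ness")


def is_th(word):
    return word in {"warmth", "strength"}


# Invented corpus: "the" and "warmth" occur twice, everything else once.
tokens = ["the", "the", "warmth", "warmth",
          "goodness", "kindness", "cat", "dog"]

print(expanding_productivity(tokens, is_ness))  # 2 of the 4 hapaxes -> 0.5
print(expanding_productivity(tokens, is_th))    # 0 of the 4 hapaxes -> 0.0
```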

Potential productivity (P) • Gives an indication of how likely a category C is to form new words in future. • I.e. a low value suggests that C may already be saturated • aka category-conditioned productivity • $P = \frac{\text{no. of types in } C \text{ which occur only once in corpus of } N \text{ tokens}}{\text{no. of tokens of category } C}$
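
P can be sketched in the same style: hapaxes of C divided by the total number of tokens of C. Again the corpus and the membership tests are made up, and the 0.0 returned when C is absent from the corpus is a choice made for this sketch.

```python
from collections import Counter


def potential_productivity(tokens, in_category):
    """Hapaxes belonging to the category, divided by the category's token count."""
    category_tokens = [t for t in tokens if in_category(t)]
    if not category_tokens:
        return 0.0  # convention chosen for this sketch
    counts = Counter(category_tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(category_tokens)


def is_ness(word):
    return word.endswith("ness")


def is_th(word):
    return word in {"warmth", "strength"}


# Invented corpus: 4 tokens of each category.
tokens = ["goodness", "goodness", "kindness", "sadness",
          "warmth", "warmth", "warmth", "strength"]

print(potential_productivity(tokens, is_ness))  # 2 hapaxes / 4 tokens = 0.5
print(potential_productivity(tokens, is_th))    # 1 hapax  / 4 tokens = 0.25
```

Both categories have the same number of tokens here, so the difference in P comes entirely from how many of those tokens are one-off types.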

Some more on P • Unlike RP and P*, P is not very sensitive to corpus size as such • However, very sensitive to frequency of the category. • e.g. if C is realised only once in a corpus of size N, then P = 1! • Recent empirical work has shown that RP and P* correlate very strongly, but both exhibit a weak correlation with P (Vegnaduzzo 2009) • pattern non-X has high RP and P*, but low P • pattern X-ish has low RP and P*, but high P
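
The P = 1 edge case can be worked through by hand. The symbols V_1(C) and N_C are notation assumed here for "hapaxes in C" and "tokens of C"; they do not appear on the slides, and the contrasting counts are invented.

```latex
% Assumed notation: V_1(C) = number of hapaxes in C, N_C = number of tokens of C.
\[
P = \frac{V_1(C)}{N_C}
\]
% C is realised exactly once in the corpus: that single token is a hapax.
\[
V_1(C) = 1,\quad N_C = 1 \;\Rightarrow\; P = \frac{1}{1} = 1
\]
% Contrast with invented counts for a frequent category:
\[
V_1(C) = 40,\quad N_C = 2000 \;\Rightarrow\; P = \frac{40}{2000} = 0.02
\]
```

The maximal score in the first case reflects how rare the category is in the sample rather than any real ease of forming new words, which is why P should be read alongside the category's frequency.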

In graphics (after Baayen 2006) • [Figure: growth curve for a specific category; x-axis: corpus size; y-axis: no. of types; the slope of the tangent represents the growth rate]
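
A growth curve of this kind can be approximated by reading the corpus token by token and recording how many distinct types of the category have been seen so far. The sketch below uses an invented corpus, and the finite-difference growth_rate helper is only a stand-in for the tangent slope in the figure.

```python
def growth_curve(tokens, in_category):
    """Entry i is the number of distinct category types among the first i+1 tokens."""
    seen = set()
    curve = []
    for token in tokens:
        if in_category(token):
            seen.add(token)
        curve.append(len(seen))
    return curve


def growth_rate(curve, start, end):
    """Average number of new types per token between two corpus positions."""
    return (curve[end] - curve[start]) / (end - start)


def is_ness(word):
    return word.endswith("ness")


# Invented corpus of 5 tokens.
tokens = ["goodness", "the", "kindness", "goodness", "sadness"]
curve = growth_curve(tokens, is_ness)

print(curve)                     # [1, 1, 2, 2, 3]
print(growth_rate(curve, 0, 4))  # 2 new types over 4 tokens = 0.5
```

A curve that flattens out as the corpus grows corresponds to a category that has stopped yielding new types in the sample.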

P vs. RP and P* • A category can have low RP and P*, but high P. • Corresponds to the “ease” with which new words can be formed using the category. • Even though a category has high RP, it may have reached “saturation”, so have low P.

The psycholinguistic connection • Rule vs. direct access: • To produce a word (e.g. illegal), you can either retrieve it directly from storage, or apply the rule on the fly. • Evidence suggests that the relative frequency of the base form and the derived form is related to which of the two alternatives applies.

The psycholinguistic connection • Complexity-based affix ordering: • Corpus research: more productive affixes follow less productive ones in word formation • It seems that more highly predictable (low productivity) affixes are processed first. • High productivity may also imply less likelihood of entering into further derivational processes.

Works cited • S. Vegnaduzzo (2009). Morphological productivity rankings of complex adjectives. Proc. NAACL-HLT Workshop on Computational Approaches to Linguistic Creativity. • K. Moilanen and S. Pulman (2008). The good, the bad and the unknown: Morphosyllabic sentiment tagging of unseen words. Proc. ACL 2008. • Baayen 2006 linked from web page
