LING / C SC 439/539 Statistical Natural Language Processing

Presentation Transcript

  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 24 • 4/15/2013

  2. Recommended reading • John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics 27(2). • John Goldsmith. 2007. Towards a new empiricism. MS. • E. Chan and C. Lignos. 2011. Investigating the relationship between linguistic representation and computation through an unsupervised model of human morphology learning. Research in Language and Computation.

  3. Outline • Model selection • Unsupervised learning of morphology • Goldsmith (2001) • Zipf’s Law and morphology • Cognitive models of morphology learning • Chan & Lignos (2011)

  4. Underlying issues in machine learning • Learning bias: • Set of assumptions that are made by a model • The representation of the data • The learning algorithm • Function being optimized • Procedure for the optimization • Model selection: • Want to choose the most appropriate model for data • We usually optimize the configuration of the model by making use of training data

  5. Examples of learning bias • We have a data set. • In a generative model, we assume the data was generated by: • PCFG • Markov model • Bayes net • In discriminative/predictive models, we assume the data is: • Linearly separable (Perceptron) • Linearly separable in a higher-dimensional space (SVM) • Separable by boundaries parallel to the feature axes (Decision Tree)

  6. Model selection • Given a particular structural model for the data, how do we find the best instantiation of the model for the data? • i.e., want to provide probabilities and other numerical parameters over components of model (vocabulary, grammar, states, transitions, etc.) • In other words, we want to learn a model for the data • Very difficult problem • Want to find a mathematical formula that will give us the best model for any data set

  7. Learn the model through some optimization criterion • Use annotated training data to configure the parameters, etc. of the model • Set parameters to maximize p(Data | Model) • Generative: Markov models, Naive Bayes, PCFG, etc. • Maximum Likelihood Estimation (MLE) • MLE + smoothing • Discriminative/predictive: • Perceptron: find a separating hyperplane (the first one encountered) • SVM: find the max-margin hyperplane • Decision tree: maximize information gain at each node • MaxEnt, CRF: maximize conditional likelihood p(Y|X)
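As a concrete illustration of "set parameters to maximize p(Data | Model)", here is a minimal sketch of maximum likelihood estimation for a unigram model, with add-alpha smoothing as the simple variant mentioned above. The corpus and function names are invented for illustration:

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum likelihood estimate: p(w) = count(w) / N."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def smoothed_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothing: reserves probability mass for unseen words."""
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

corpus = "the dog chased the cat".split()
p = mle_unigram(corpus)
# p["the"] == 0.4, and the probabilities sum to 1
```

MLE simply sets each probability to the relative frequency in the training data; smoothing trades a little training-set likelihood for nonzero probability on unseen words.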

  8. Model selection through a test set • Problem of overfitting • By iteratively optimizing against the training data, one may produce a model that is too specific to the training set • But the training set is only a sample of the data • An overfitted model may perform poorly on other data from the same distribution • Prevent overfitting by computing performance on a test set • Error rate on the test set increases if the model overfits • Pick the model for which error rate on the test set is minimized • (Of course, if you then optimize against the test set, you can overfit on the test set, too)

  9. What’s the most appropriate model for a data set? • Want to determine the best overall model, in terms of both: • Choice of model structure (learning bias) • Find a specific configuration of model parameters through parameter optimization on training set • Need to also search over the space of models • In addition to searching over the space of parameters for a particular model structure

  10. Occam’s Razor • Occam’s Razor: • Given two theories that are equally good at explaining the data, the simpler theory is better • i.e., given two models that account for the data equally well, the simpler model is to be preferred

  11. Occam’s Razor: example • Suppose our data is the infinite language L = { aⁿ | n ≥ 0 } • Model with a regular expression • regexp 1: a* • regexp 2: a*a* • Both generate the same language: • L(regexp 1) = L(regexp 2) = L • Since both account for the data equally well, and regexp 1 is simpler than regexp 2, regexp 1 should be preferred
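The extensional equivalence of the two expressions is easy to check mechanically. A small sketch using Python's re module (the test strings are arbitrary samples):

```python
import re

# Both patterns denote the same language L = { a^n | n >= 0 }:
# they accept exactly the strings consisting of zero or more a's.
r1 = re.compile(r"a*$")    # the simpler expression
r2 = re.compile(r"a*a*$")  # redundant, but extensionally identical

same = all(bool(r1.match(s)) == bool(r2.match(s))
           for s in ["", "a", "aa", "aaa", "b", "ab", "ba"])
```

Since the two patterns agree on every string, only their size distinguishes them, and Occam's Razor picks the shorter one.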

  12. Minimum description length principle (MDL) • Now, suppose we have competing theories, but they don’t explain the data equally well • Theory #1 explains the data better but is more complicated • Theory #2 explains the data worse, but is simpler • How do we pick the better theory? • Describe a model by a sequence of bits • Minimum Description Length principle: • The best model for a set of data is the one that minimizes the size (in bits) of: the description of the model + the description of the data according to the model

  13. MDL and probability • Minimize # of bits for description of model + description of data according to model = minimize −log2 p(Model) − log2 p(Data|Model) • p(Data|Model) is the likelihood of the data • So, learning algorithms that optimize only for maximum likelihood are ignoring p(Model) • Model comparison in terms of performance on test sets is just comparing p(Data|Model)
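A numeric sketch of the trade-off: the prior and likelihood values below are invented, purely to show how the two bit counts combine. Model A is simpler (higher prior) but fits worse; model B fits better but is more complex:

```python
import math

def description_length(p_model, p_data_given_model):
    """Total description length in bits: -log2 p(Model) - log2 p(Data|Model)."""
    return -math.log2(p_model) - math.log2(p_data_given_model)

# Hypothetical numbers for illustration only.
dl_a = description_length(p_model=0.5,  p_data_given_model=0.001)  # simple, poor fit
dl_b = description_length(p_model=0.01, p_data_given_model=0.01)   # complex, better fit

# Maximum likelihood alone would pick B (0.01 > 0.001);
# MDL compares the total bit counts, and here prefers A.
```

With these numbers, dl_a ≈ 11.0 bits and dl_b ≈ 13.3 bits, so MDL overrules the likelihood-only comparison.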

  14. Model selection in unsupervised learning • Unsupervised learning: • Have a data set without labels • Therefore cannot assess performance of the model during training/learning, in contrast to supervised learning • To choose a model, we could: • Apply MDL: minimize −log2 p(Model) − log2 p(Data|Model) • Or just maximize likelihood: log2 p(Data|Model)

  15. Outline • Model selection • Unsupervised learning of morphology • Goldsmith (2001) • Zipf’s Law and morphology • Cognitive models of morphology learning • Chan & Lignos (2011)

  16. Morphology • Words occur in morphological forms • For simplicity, we’ll only consider inflectional morphology • English nouns:

  17. English verbs

  18. Spanish verbs

  19. Explaining word relationships • Data: • “bake” means: BAKE • “baked” means: BAKE + PAST-TENSE • Variety of possible explanations: • 1. “bake” is transformed into “baked” through the application of a rule that introduces PAST-TENSE by adding -ed. • 2. BAKE is realized as “bake”, and PAST-TENSE is realized as “+ed”, and the combination of BAKE and PAST-TENSE produces “baked”. • 3. Two separate words “bake” and “baked” with individual meanings. • 4. Two separate strings “bake” and “baked”. “baked” is formed from “bake” through concatenation of “d” (or deletion of “e” and concatenation of “ed”).

  20. Unsupervised learning of morphology • Corpus → (data structures + algorithms) → morphological grammar

  21. Unsupervised learning of morphology • Input: corpus of text • No annotations for morphological structure • Goal: learn a way to generate strings for morphologically related words • e.g., learn that “bake” and “baked” are morphologically related, and how to generate them • In text, we don’t have access to semantic elements such as PAST-TENSE, PLURAL, NOMINATIVE, NOUN, VERB, etc. • Learning algorithm • Encodes a formal grammar • Simple models of morphology from linguistics: • Paradigms, Base + rules, Analogical network • Which type of grammar should we choose? • Output: instantiated morphological grammar

  22. Morphological grammar #1: paradigms • Tables of word forms • Words are cross-categorized by different combinations of inflectional features (tense, person, number, case, etc.)

  23. Morphological grammar #2: Rules • Rule: A → B / C _ D • Lexicon: base1: C1, base2: C1, base3: C2, base4: C2, base5: C1, … • Lexical categories: C1, C2

  24. Generate word forms through rules • Take the base form, apply rules • Base: amo • Rules: o → as, o → at, o → amus, o → atis, o → ant • Paradigms are epiphenomenal • Apply rules to a base form to generate a paradigm
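The rule application above can be sketched as string rewriting, assuming each rule replaces the final -o of the base (the Latin paradigm of amare, as on the slide):

```python
def apply_rule(base, lhs, rhs):
    """Rewrite the base by replacing its final `lhs` with `rhs`."""
    assert base.endswith(lhs), "rule does not apply to this base"
    return base[: len(base) - len(lhs)] + rhs

base = "amo"
rules = [("o", "as"), ("o", "at"), ("o", "amus"), ("o", "atis"), ("o", "ant")]

# The paradigm is just the base plus the output of each rule.
paradigm = [base] + [apply_rule(base, lhs, rhs) for lhs, rhs in rules]
# -> ["amo", "amas", "amat", "amamus", "amatis", "amant"]
```

This is the sense in which paradigms are epiphenomenal: the table is generated, not stored.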

  25. Morphological grammar #3: analogical networks • Butterworth 1983, Plaut & Gonnerman 2000, Pierrehumbert 2001, Burzio 2005, Albright 2005, Hay & Baayen 2005, Ninio 2006 • Mental representation of words: large network of all words • Words are related through analogy • Syntactic, phonological, semantic, frequency similarity • Words do not have decompositional structure • Item-based learning: store every word you encounter • No concrete proposals about generalization…

  26. Hay & Baayen 2005

  27. Choosing a theory • Linguistic theories are often used for the purpose of describing data • Account for unusual constructions in languages • But many linguists are also interested in the possible mental reality of theories • Formulate a cognitive, computational model of morphology learning to choose a theory • Not all linguistic theories will work in such a model • Cognitive = proposed theory is constrained by limitations of human learning and processing • Computational = implemented on computer and actually works

  28. Outline • Model selection • Unsupervised learning of morphology • Goldsmith (2001) • I borrowed some of Goldsmith’s slides • Zipf’s Law and morphology • Cognitive models of morphology learning • Chan & Lignos (2011)

  29. Goldsmith (2007)“Towards a New Empiricism” • Inspired by Zellig Harris • Empiricist, Structuralist, pre-Chomsky • Looking for automatic ways for a Linguist to discover a grammar from a sample of that language • As an alternative to other methodologies in linguistics: grammaticality judgments, etc. • Given two different grammars that account for the data to different degrees, the better grammar can be selected through MDL

  30. Goldsmith 2001 • Unsupervised learning of morphology • First widely-cited paper on the subject • Download software: • http://linguistica.uchicago.edu/

  31. Goldsmith 2001 • Learns a morphological grammar from the set of words in a corpus • Doesn’t use word frequency • Grammar is represented as a set of “signatures” • Similar to paradigms • Iterative learning procedure • Search through the space of possible grammars • Choose the “best” grammar through MDL

  32. Signature grammar • (Figure: the first 8 stems in the largest signature from a 500,000-word corpus of English, with the set of suffixes that appears with all of these stems)

  33. How should words be segmented? • (Figure: one segmentation vs. another segmentation of the same words)

  34. Find a balance between extremes in representation of the data • Given a vocabulary, • We could have a grammar with a single signature, and a stem for every word in the vocabulary • Signature would contain a null suffix • We could have a grammar with one signature for each word in the vocabulary, where each signature contains exactly one stem • We could have an intermediate number of signatures, with a moderate number of stems and suffixes

  35. Grammar #1: one signature containing many stems and one suffix • Vocabulary: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking } • Signature #1 • Stems: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking } • Suffixes: { NULL }

  36. Grammar #1.5: one signature containing one stem and many suffixes • Vocabulary: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking } • Signature #1 • Stems: { NULL } • Suffixes: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking }

  37. Grammar #2: many signatures, each with one stem and one suffix • Vocabulary: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking } • Signature #1: { dog }, { NULL } • Signature #2: { dogs }, { NULL } • Signature #3: { cat }, { NULL } • Signature #4: { cats }, { NULL } • etc.

  38. Grammar #3: balanced grammar with multiple signatures (best) • Vocabulary: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking } • Signature #1: • Stems: { dog, cat } • Suffixes: { NULL, s } • Signature #2: • Stems: { jump, walk } • Suffixes: { NULL, s, ed, ing }
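A quick way to see that the balanced grammar is both adequate and compact: represent each signature as a (stems, suffixes) pair, generate the cross product, and count letters (letter counting as a crude stand-in for description length; "" plays the role of the NULL suffix):

```python
# Grammar #1 from the slides: one signature, every word a stem, NULL suffix.
grammar1 = [({"dog", "dogs", "cat", "cats", "jump", "jumps", "jumped",
              "jumping", "walk", "walks", "walked", "walking"}, {""})]

# Grammar #3 from the slides: two balanced signatures.
grammar3 = [
    ({"dog", "cat"}, {"", "s"}),
    ({"jump", "walk"}, {"", "s", "ed", "ing"}),
]

def generate(grammar):
    """A signature generates the cross product of its stems and suffixes."""
    return {stem + suf for stems, sufs in grammar for stem in stems for suf in sufs}

def letters(grammar):
    """Total letters used to write down all stems and suffixes."""
    return sum(sum(map(len, stems)) + sum(map(len, sufs))
               for stems, sufs in grammar)

vocab = {"dog", "dogs", "cat", "cats", "jump", "jumps", "jumped",
         "jumping", "walk", "walks", "walked", "walking"}
# Both grammars generate exactly the vocabulary, but #3 is far cheaper.
```

Here grammar #1 costs 58 letters while grammar #3 costs 21, even though both generate the same twelve words.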

  39. Goldsmith’s algorithm • Pick a large corpus from a language: 5,000 to 1,000,000 words.

  40. Goldsmith’s algorithm • Feed the corpus into the “bootstrapping” heuristic (Harris’ segmentation algorithm).
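Harris' segmentation idea can be sketched via successor variety: count how many distinct letters can follow each prefix of a word across the vocabulary; a spike in variety suggests a morpheme boundary. This is a simplified toy version of the heuristic, on an invented vocabulary:

```python
def successor_variety(prefix, vocab):
    """Number of distinct characters that follow `prefix` in the vocabulary.
    "#" marks the end of a word."""
    nexts = set()
    for w in vocab:
        if w.startswith(prefix):
            nexts.add(w[len(prefix)] if len(w) > len(prefix) else "#")
    return len(nexts)

vocab = ["jump", "jumps", "jumped", "jumping"]
# Inside the stem, only one continuation is possible...
print(successor_variety("ju", vocab))    # only "m" can follow
# ...but right after "jump" the variety spikes (#, s, e, i),
# suggesting a morpheme boundary at that position.
print(successor_variety("jump", vocab))
```

The real bootstrap heuristic is more elaborate, but this captures the core signal it exploits.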

  41. Goldsmith’s algorithm • Out of which comes a preliminary morphology, which need not be superb.

  42. Goldsmith’s algorithm • Feed the preliminary morphology to the incremental heuristics.

  43. Goldsmith’s algorithm • Out comes a modified morphology.

  44. Goldsmith’s algorithm • Is the modification an improvement? Ask MDL!

  45. Goldsmith’s algorithm • If it is an improvement, replace the old morphology with the modified one.

  46. Goldsmith’s algorithm • Send it back to the incremental heuristics again.

  47. Goldsmith’s algorithm • Continue until there are no improvements to try.

  48. Minimum Description Length: want to balance the costs of the different components of the grammar • Toy example: the cost of a grammar is the total number of letters • Corpus: jump, jumps, jumping, laugh, laughed, laughing, sing, sang, singing, the, dog, dogs (total: 61 letters) • Analysis: • Stems: jump, laugh, sing, sang, dog (20 letters) • Suffixes: s, ing, ed (6 letters) • Unanalyzed: the (3 letters) • Total: 29 letters
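The toy letter counts can be checked directly (the raw word list actually sums to 61 letters, and the analyzed grammar to 29):

```python
corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]

# The analysis from the slide: shared stems and suffixes, one leftover word.
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

def letters(words):
    return sum(len(w) for w in words)

corpus_cost = letters(corpus)                                      # 61
analysis_cost = letters(stems) + letters(suffixes) + letters(unanalyzed)  # 29
```

The analyzed grammar describes the same twelve words in less than half the letters, which is exactly the compression MDL rewards.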

  49. Different segmentations of the same data incur different costs • G1: 3 stems, 12 suffixes, 2 signatures • G2: 2 stems, 6 suffixes, 2 signatures

  50. Choosing a morphological grammar • Apply MDL to assess the quality of a grammar • Come up with a formula for the quality of a grammar, according to the components of the grammar • In each iteration: • From the current grammar, generate a series of alternative grammars • Calculate description length of each of these grammars • Choose the grammar with lowest description length
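The iteration on this slide can be sketched as a toy greedy search. This is not Goldsmith's actual algorithm: it uses letter-count cost instead of his bit-based description length, and a greedy single-split search instead of his signature heuristics, but the propose-score-accept loop is the same shape:

```python
def cost(analysis):
    """Letter-count description length: letters in the distinct stems
    plus letters in the distinct suffixes (a crude stand-in for bits)."""
    stems = {s for s, _ in analysis.values()}
    sufs = {suf for _, suf in analysis.values()}
    return sum(map(len, stems)) + sum(map(len, sufs))

def greedy_mdl(vocab, max_rounds=10):
    # Start with every word unanalyzed: whole word as stem, empty suffix.
    analysis = {w: (w, "") for w in vocab}
    for _ in range(max_rounds):
        improved = False
        for w in vocab:
            # Propose every stem+suffix split of w; keep the cheapest grammar.
            best_split, best_cost = analysis[w], cost(analysis)
            for i in range(1, len(w) + 1):
                analysis[w] = (w[:i], w[i:])
                if cost(analysis) < best_cost:
                    best_split, best_cost, improved = analysis[w], cost(analysis), True
            analysis[w] = best_split
        if not improved:   # no alternative grammar was cheaper: stop
            break
    return analysis

vocab = ["dog", "dogs", "cat", "cats", "jump", "jumps", "jumped",
         "jumping", "walk", "walks", "walked", "walking"]
result = greedy_mdl(vocab)
# e.g. result["jumping"] == ("jump", "ing"); total cost drops from 58 to 20
```

Even this crude cost function recovers the intuitive stems (dog, cat, jump, walk) and suffixes (s, ed, ing), because sharing a stem or suffix across words pays for itself in description length.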