## LING / C SC 439/539 Statistical Natural Language Processing

Lecture 24, 4/15/2013

## Recommended reading

- John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics 27(2).
- John Goldsmith. 2007. Towards a new empiricism. MS.
- E. Chan and C. Lignos. 2011. Investigating the relationship between linguistic representation and computation through an unsupervised model of human morphology learning. Research in Language and Computation.

## Outline

- Model selection
- Unsupervised learning of morphology
- Goldsmith (2001)
- Zipf's Law and morphology
- Cognitive models of morphology learning
- Chan & Lignos (2011)

## Underlying issues in machine learning

- Learning bias: the set of assumptions made by a model
  - The representation of the data
  - The learning algorithm
  - The function being optimized
  - The procedure for the optimization
- Model selection: we want to choose the most appropriate model for the data
  - We usually optimize the configuration of the model by making use of training data

## Examples of learning bias

- We have a data set.
- In a generative model, we assume the data was generated by:
  - A PCFG
  - A Markov model
  - A Bayes net
- In discriminative/predictive models, we assume the data is:
  - Linearly separable (Perceptron)
  - Linearly separable in a higher-dimensional space (SVM)
  - Separable by boundaries parallel to the feature axes (Decision Tree)

## Model selection

- Given a particular structural model for the data, how do we find the best instantiation of that model for the data?
  - i.e., we want to assign probabilities and other numerical parameters to the components of the model (vocabulary, grammar, states, transitions, etc.)
- In other words, we want to learn a model for the data
- This is a very difficult problem: we would like a mathematical formula that gives us the best model for any data set

## Learn the model through some optimization criterion

- Use annotated training data to configure the parameters of the model
- Generative models (Markov models, Naive Bayes, PCFG, etc.): set the parameters to maximize p(Data | Model)
  - Maximum Likelihood Estimation
  - Maximum Likelihood Estimation + smoothing
- Discriminative/predictive models:
  - Perceptron: find a separating hyperplane (the first one found)
  - SVM: find the maximum-margin hyperplane
  - Decision tree: maximize the information gain at each node
  - MaxEnt, CRF: maximize the conditional likelihood p(Y | X)
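To make the generative case concrete, here is a minimal sketch of "set parameters to maximize p(Data | Model)" for the simplest generative model, a unigram distribution over words, with optional add-one smoothing. The function name and toy corpus are invented for illustration.

```python
from collections import Counter

def unigram_estimates(tokens, alpha=0.0):
    """Estimate p(w) from a token list by (smoothed) maximum likelihood.

    alpha=0 gives the pure MLE, which maximizes p(Data | Model);
    alpha=1 gives add-one (Laplace) smoothing.
    """
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(counts)
    return {w: (c + alpha) / denom for w, c in counts.items()}

corpus = "the cat sat on the mat".split()
print(unigram_estimates(corpus))             # pure MLE: p(the) = 2/6, ...
print(unigram_estimates(corpus, alpha=1.0))  # smoothed: p(the) = 3/11, ...
```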
## Model selection through a test set

- The problem of overfitting:
  - By iteratively optimizing against the training data, we may produce a model that is too specific to the training set
  - But the training set is only a sample of the data
  - An overfitted model may perform poorly on other data from the same distribution
- Prevent overfitting by computing performance on a test set:
  - The error rate on the test set increases if the model overfits
  - Pick the model for which the error rate on the test set is minimized
- (Of course, if you then optimize against the test set, you can overfit the test set, too)

## What's the most appropriate model for a data set?

- We want to determine the best overall model, in terms of both:
  - The choice of model structure (learning bias)
  - A specific configuration of model parameters, found through parameter optimization on the training set
- So we also need to search over the space of models, in addition to searching over the space of parameters for a particular model structure

## Occam's Razor

- Occam's Razor: given two theories that explain the data equally well, the simpler theory is better
- i.e., given two models that account for the data equally well, the simpler model is to be preferred

## Occam's Razor: example

- Suppose our data is the infinite language L = { aⁿ | n ≥ 0 }
- Model it with a regular expression:
  - regexp 1: a*
  - regexp 2: a*a*
- Both generate the same language: L(regexp 1) = L(regexp 2) = L
- Since both account for the data equally well, and regexp 1 is simpler than regexp 2, regexp 1 should be preferred

## Minimum Description Length principle (MDL)

- Now suppose we have competing theories that do not explain the data equally well:
  - Theory #1 explains the data better but is more complicated
  - Theory #2 explains the data worse but is simpler
- How do we pick the better theory?
- Describe a model by a sequence of bits
- Minimum Description Length principle: the best model for a set of data is the one that minimizes the size (in bits) of: the description of the model + the description of the data according to the model

## MDL and probability

- Minimize (# of bits for the description of the model) + (# of bits for the description of the data according to the model)
  = minimize −log₂ p(Model) − log₂ p(Data | Model)
- p(Data | Model) is the likelihood of the data
- So learning algorithms that optimize only for maximum likelihood are ignoring p(Model)
- Model comparison in terms of performance on test sets is likewise just comparing p(Data | Model)

## Model selection in unsupervised learning

- Unsupervised learning:
  - We have a data set without labels
  - Therefore, in contrast to supervised learning, we cannot assess the performance of the model during training/learning
- To choose a model, we could:
  - Apply MDL: minimize −log₂ p(Model) − log₂ p(Data | Model)
  - Or maximize the likelihood: log₂ p(Data | Model)
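A minimal sketch of MDL-based model comparison, directly implementing the bit-count above. The log-probabilities are hypothetical numbers chosen for illustration: model A is more complex (lower prior probability) but fits the data better; model B is simpler but fits worse.

```python
def description_length(log2_p_model, log2_p_data_given_model):
    """Total code length in bits: -log2 p(Model) - log2 p(Data | Model)."""
    return -log2_p_model - log2_p_data_given_model

# Hypothetical log2-probabilities for two candidate models of the same data.
dl_a = description_length(log2_p_model=-50.0, log2_p_data_given_model=-900.0)
dl_b = description_length(log2_p_model=-10.0, log2_p_data_given_model=-950.0)

print(f"DL(A) = {dl_a} bits, DL(B) = {dl_b} bits")
print("MDL prefers model", "A" if dl_a < dl_b else "B")  # here: A (950 < 960)
```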
## Outline

- Model selection
- Unsupervised learning of morphology
- Goldsmith (2001)
- Zipf's Law and morphology
- Cognitive models of morphology learning
- Chan & Lignos (2011)

## Morphology

- Words occur in morphological forms
- For simplicity, we'll consider only inflectional morphology
- English nouns inflect for number: e.g., singular "dog" vs. plural "dogs"

## Explaining word relationships

- Data:
  - "bake" means: BAKE
  - "baked" means: BAKE + PAST-TENSE
- A variety of possible explanations:
  1. "bake" is transformed into "baked" through the application of a rule that introduces PAST-TENSE by adding -ed.
  2. BAKE is realized as "bake", PAST-TENSE is realized as "+ed", and the combination of BAKE and PAST-TENSE produces "baked".
  3. There are two separate words, "bake" and "baked", with individual meanings.
  4. There are two separate strings, "bake" and "baked". "baked" is formed from "bake" through concatenation of "d" (or deletion of "e" and concatenation of "ed").

## Unsupervised learning of morphology

(Diagram: corpus → data structures + algorithms → morphological grammar)

- Input: a corpus of text
  - No annotations for morphological structure
  - In text, we don't have access to semantic elements such as PAST-TENSE, PLURAL, NOMINATIVE, NOUN, VERB, etc.
- Goal: learn a way to generate the strings for morphologically related words
  - e.g., learn that "bake" and "baked" are morphologically related, and how to generate them
- Learning algorithm: encodes a formal grammar
  - Simple models of morphology from linguistics: paradigms, base + rules, analogical networks
  - Which type of grammar should we choose?
- Output: an instantiated morphological grammar

## Morphological grammar #1: paradigms

- Tables of word forms
- Words are cross-categorized by different combinations of inflectional features (tense, person, number, case, etc.)

## Morphological grammar #2: base + rules

- Rules of the form A → B / C _ D
- A lexicon of base forms, each assigned a lexical category (e.g., base1: C1, base2: C1, base3: C2, base4: C2, base5: C1, …)

## Generate word forms through rules

- Take a base form and apply rules
- e.g., base am- plus the rules adding -o, -as, -at, -amus, -atis, -ant yields amo, amas, amat, amamus, amatis, amant
- Paradigms are epiphenomenal: applying the rules to a base form generates its paradigm
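As a concrete illustration of the base-plus-rules view, here is a minimal sketch that generates the Latin paradigm above. It makes the simplifying assumption that every rule is plain suffix concatenation; real rules of the form A → B / C _ D can also modify the base, which this toy version ignores.

```python
def generate_paradigm(base, suffix_rules):
    """Apply suffix rules to a base form; the paradigm is just the output."""
    return [base + suffix for suffix in suffix_rules]

# Latin first conjugation, present active: base am- plus person/number endings
print(generate_paradigm("am", ["o", "as", "at", "amus", "atis", "ant"]))
# -> ['amo', 'amas', 'amat', 'amamus', 'amatis', 'amant']
```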
## Morphological grammar #3: analogical networks

- Butterworth 1983, Plaut & Gonnerman 2000, Pierrehumbert 2001, Burzio 2005, Albright 2005, Hay & Baayen 2005, Ninio 2006
- Mental representation of words: a large network of all words
- Words are related through analogy: syntactic, phonological, semantic, and frequency similarity
- Words do not have decompositional structure
- Item-based learning: store every word you encounter
- No concrete proposals about generalization…

## Choosing a theory

- Linguistic theories are often used for the purpose of describing data
  - Accounting for unusual constructions in languages
- But many linguists are also interested in the possible mental reality of theories
- Formulate a cognitive, computational model of morphology learning to choose a theory
  - Not all linguistic theories will work in such a model
  - Cognitive = the proposed theory is constrained by the limitations of human learning and processing
  - Computational = implemented on a computer, and actually works

## Outline

- Model selection
- Unsupervised learning of morphology
- Goldsmith (2001)
  - I borrowed some of Goldsmith's slides
- Zipf's Law and morphology
- Cognitive models of morphology learning
- Chan & Lignos (2011)

## Goldsmith (2007), "Towards a New Empiricism"

- Inspired by Zellig Harris
  - Empiricist, structuralist, pre-Chomsky
  - Looked for automatic ways for a linguist to discover a grammar from a sample of a language
  - As an alternative to other methodologies in linguistics: grammaticality judgments, etc.
- Given two different grammars that account for the data to different degrees, the better grammar can be selected through MDL

## Goldsmith (2001)

- Unsupervised learning of morphology
- The first widely cited paper on the subject
- Download the software: http://linguistica.uchicago.edu/

## Goldsmith (2001)

- Learns a morphological grammar from the set of words in a corpus
  - Doesn't use word frequency
- The grammar is represented as a set of "signatures", similar to paradigms
- Iterative learning procedure:
  - Search through the space of possible grammars
  - Choose the "best" grammar through MDL

## Signature grammar

(Figure: the first 8 stems in the largest signature in a 500,000-word corpus of English, together with the set of suffixes that appears with all of these stems)

## How should words be segmented?

(Figure: two competing segmentations of the same words)

## Find a balance between extremes in the representation of the data

- Given a vocabulary:
  - We could have a grammar with a single signature, and a stem for every word in the vocabulary; the signature would contain a null suffix
  - We could have a grammar with one signature for each word in the vocabulary, where each signature contains exactly one stem
  - Or we could have an intermediate number of signatures, with a moderate number of stems and suffixes

## Grammar #1: one signature containing many stems and one suffix

- Vocabulary: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking }
- Signature #1
  - Stems: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking }
  - Suffixes: { NULL }

## Grammar #1.5: one signature containing one stem and many suffixes

- Vocabulary: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking }
- Signature #1
  - Stems: { NULL }
  - Suffixes: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking }

## Grammar #2: many signatures, each with one stem and one suffix

- Vocabulary: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking }
- Signature #1: { dog }, { NULL }
- Signature #2: { dogs }, { NULL }
- Signature #3: { cat }, { NULL }
- Signature #4: { cats }, { NULL }
- etc.

## Grammar #3: a balanced grammar with multiple signatures (best)

- Vocabulary: { dog, dogs, cat, cats, jump, jumps, jumped, jumping, walk, walks, walked, walking }
- Signature #1:
  - Stems: { dog, cat }
  - Suffixes: { NULL, s }
- Signature #2:
  - Stems: { jump, walk }
  - Suffixes: { NULL, s, ed, ing }
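A minimal sketch of the signature data structure: given stem+suffix splits proposed by some earlier heuristic (the splits below are hand-supplied for illustration), stems are grouped by the exact set of suffixes they occur with. On this vocabulary the grouping reproduces Grammar #3 above.

```python
from collections import defaultdict

def build_signatures(segmentations):
    """Group stems by the exact set of suffixes they appear with.

    `segmentations` maps each word to a proposed (stem, suffix) split;
    the empty string "" stands for the NULL suffix.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in segmentations.values():
        suffixes_of[stem].add(suffix)
    signatures = defaultdict(set)
    for stem, sufs in suffixes_of.items():
        signatures[frozenset(sufs)].add(stem)
    return signatures

segs = {"dog": ("dog", ""), "dogs": ("dog", "s"),
        "cat": ("cat", ""), "cats": ("cat", "s"),
        "jump": ("jump", ""), "jumps": ("jump", "s"),
        "jumped": ("jump", "ed"), "jumping": ("jump", "ing"),
        "walk": ("walk", ""), "walks": ("walk", "s"),
        "walked": ("walk", "ed"), "walking": ("walk", "ing")}

for sufs, stems in build_signatures(segs).items():
    print(sorted(stems), "+ suffixes", sorted(sufs))
# ['cat', 'dog'] + suffixes ['', 's']
# ['jump', 'walk'] + suffixes ['', 'ed', 'ing', 's']
```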
## Goldsmith's algorithm

1. Pick a large corpus from a language: 5,000 to 1,000,000 words.
2. Feed it into the "bootstrapping" heuristic (Harris' segmentation algorithm), out of which comes a preliminary morphology, which need not be superb.
3. Feed the morphology to the incremental heuristics, out of which comes a modified morphology.
4. Is the modification an improvement? Ask MDL!
5. If it is an improvement, replace the morphology with the modified one (and discard the old one).
6. Send the new morphology back to the incremental heuristics, and continue until there are no improvements left to try.

## Minimum Description Length: balancing the costs of the components of the grammar

Toy example: the cost of a grammar is the total number of letters.

- Corpus: jump, jumps, jumping, laugh, laughed, laughing, sing, sang, singing, the, dog, dogs (total: 61 letters)
- Analysis:
  - Stems: jump, laugh, sing, sang, dog (20 letters)
  - Suffixes: s, ing, ed (6 letters)
  - Unanalyzed: the (3 letters)
  - Total: 29 letters

## Different segmentations of the same data incur different costs

- G1: 3 stems, 12 suffixes, 2 signatures
- G2: 2 stems, 6 suffixes, 2 signatures

## Choosing a morphological grammar

- Apply MDL to assess the quality of a grammar
  - Come up with a formula for the quality of a grammar, in terms of the components of the grammar
- In each iteration:
  - From the current grammar, generate a series of alternative grammars
  - Calculate the description length of each of these grammars
  - Choose the grammar with the lowest description length
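Tying the toy letter-cost back to the selection loop: a minimal sketch that scores two candidate grammars for the toy corpus above, counting letters as a stand-in for bits, and keeps the cheaper one.

```python
def letter_cost(strings):
    """Toy description length: the total number of letters."""
    return sum(len(s) for s in strings)

corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]

# Candidate grammar 1: every word is stored whole, unanalyzed.
cost_unanalyzed = letter_cost(corpus)

# Candidate grammar 2: shared stems and suffixes, "the" left unanalyzed.
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]
cost_analyzed = letter_cost(stems + suffixes + unanalyzed)

print(cost_unanalyzed, "letters vs.", cost_analyzed, "letters")  # 61 vs. 29
better = "analyzed" if cost_analyzed < cost_unanalyzed else "unanalyzed"
print("MDL keeps the", better, "grammar")
```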