Unsupervised Learning of Natural Language Morphology using MDL

John Goldsmith, November 9, 2001

Today’s presentation • The task: unsupervised learning • Overview of program and output • Overview of Minimum Description Length framework • Application of MDL to iterative search of morphology-space, with successively finer-grained descriptions • Mathematical model • Current capabilities • Current challenges

Unsupervised learning • Input: untagged text in orthographic or phonetic form • with spaces (or punctuation) separating words. • But no tagging or text preparation.

Overview of program and output • Linguistica: a C++ Windows-based program available for download at http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000 • Technical discussion in Computational Linguistics (June 2001) • Good results with 5,000 words, very fine-grained results with 500,000 words (corpus length, not lexicon count).

Output • List of stems, suffixes, and prefixes • List of signatures. • A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. • Hence, a stem in a corpus has a unique signature. • A signature has a unique set of stems associated with it • …

(example of signature in English) • NULL.ed.ing.s ask call point = ask asked asking asks call called calling calls point pointed pointing points
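The grouping on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Linguistica's code: the function name `signatures` and the input format (a list of stem/suffix pairs) are assumptions made for the example.

```python
from collections import defaultdict

def signatures(analyses):
    """Group stems by the set of suffixes they occur with.

    `analyses` is an iterable of (stem, suffix) pairs; the empty suffix is
    written "NULL", as on the slide. (Hypothetical input format.)
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix)

    stems_of = defaultdict(list)
    for stem, suffixes in suffixes_of.items():
        # A signature is named by its suffixes joined with dots.
        stems_of[".".join(sorted(suffixes))].append(stem)
    return {sig: sorted(stems) for sig, stems in stems_of.items()}

pairs = [(stem, suffix)
         for stem in ("ask", "call", "point")
         for suffix in ("NULL", "ed", "ing", "s")]
result = signatures(pairs)
print(result)
```

Each stem ends up in exactly one signature, and each signature carries its own set of stems, which is the uniqueness property stated on the Output slide.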

…output • Roots (“stems of stems”) and the inner structure of stems • Regular allomorphy of stems: e.g., learn “delete stem-final –e in English before –ing and –ed”

Minimum Description Length (MDL) • Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989) • Work by Michael Brent and Carl de Marcken on word-discovery using MDL

Essence of MDL We are given • a corpus, and • a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes. (“Given”? Given by who? We’ll get back to that.) (Remember: a distribution is a set of non-negative numbers summing to 1.0.)

The higher the probability is that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data. • Better said: -1 * log probability (corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data. This is known as the optimal compressed length of the data, given the model. Using base 2 logs, this number is a measure in information theoretic bits.

Essence of MDL… • The goodness of the morphology is also measured by how compact the morphology is. • We can measure the compactness of a morphology in information theoretic bits.

How can we measure the compactness of a morphology? • Let’s consider a naïve version of description length: count the number of letters. • This naïve version is nonetheless helpful in seeing the intuition involved.

Naive Minimum Description Length • Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs (total: 61 letters) • Analysis: Stems: jump, laugh, sing, sang, dog (20 letters); Suffixes: s, ing, ed (6 letters); Unanalyzed: the (3 letters) (total: 29 letters) • Notice that the description length goes UP if we analyze sing into s+ing
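The letter counting on this slide is easy to check mechanically. The sketch below only reproduces the naive count (letters, not bits); the helper name `letters` is an assumption for the example. Counting the twelve words as listed gives 61 letters for the raw corpus.

```python
def letters(items):
    """Naive description length: total number of letters in a list of strings."""
    return sum(len(item) for item in items)

corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]

stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

corpus_length = letters(corpus)
analysis_length = letters(stems) + letters(suffixes) + letters(unanalyzed)
print(corpus_length, analysis_length)
```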

Essence of MDL… The best overall theory of a corpus is the one for which the sum of • −log prob (corpus) + • length of the morphology (that’s the description length) is the smallest.

Essence of MDL…

Overall logic • Search through morphology space for the morphology which provides the smallest description length.

Pick a large corpus from a language: 5,000 to 1,000,000 words. [Diagram: Corpus]

Feed it into the “bootstrapping” heuristic... [Diagram: Corpus, Bootstrap heuristic]

Out of which comes a preliminary morphology, which need not be superb. [Diagram: Corpus, Bootstrap heuristic, Morphology]

Feed it to the incremental heuristics... [Diagram: Corpus, Bootstrap heuristic, Morphology, Incremental heuristics]

Out comes a modified morphology. [Diagram: Corpus, Bootstrap heuristic, Morphology, Incremental heuristics, Modified morphology]

Is the modification an improvement? Ask MDL! [Diagram: Corpus, Bootstrap heuristic, Morphology, Incremental heuristics, Modified morphology]

If it is an improvement, replace the morphology... [Diagram: the modified morphology takes the place of Morphology; the old Morphology goes to Garbage]

Send it back to the incremental heuristics again... [Diagram: Corpus, Bootstrap heuristic, Modified morphology, Incremental heuristics]

Continue until there are no improvements to try. [Diagram: Morphology, Modified morphology, Incremental heuristics]
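The loop walked through on these slides can be sketched as a greedy search. Everything named here is an assumption for illustration: `mdl_search`, `naive_length`, and `propose` are hypothetical, a "morphology" is reduced to a set of suffixes, and the description length is the naive letter count rather than the full sum of corpus and morphology lengths in bits.

```python
def mdl_search(corpus, bootstrap, heuristics, description_length):
    """Greedy search: keep a modified morphology only if it is shorter."""
    morphology = bootstrap(corpus)
    best = description_length(morphology, corpus)
    improved = True
    while improved:                       # stop when no heuristic helps
        improved = False
        for heuristic in heuristics:
            candidate = heuristic(morphology, corpus)
            cost = description_length(candidate, corpus)
            if cost < best:               # "Ask MDL!"
                morphology, best = candidate, cost
                improved = True
    return morphology, best

def naive_length(suffixes, corpus):
    """Toy description length: letters in distinct stems plus letters in suffixes."""
    stems = set()
    for word in corpus:
        # Strip the longest known suffix that leaves a non-empty stem.
        match = max((s for s in suffixes
                     if word.endswith(s) and len(word) > len(s)),
                    key=len, default="")
        stems.add(word[:len(word) - len(match)])
    return sum(map(len, stems)) + sum(map(len, suffixes))

def propose(suffix):
    """Hypothetical incremental heuristic: try adding one suffix."""
    return lambda morphology, corpus: morphology | {suffix}

words = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing"]
final_morphology, final_length = mdl_search(
    words,
    bootstrap=lambda corpus: frozenset(),   # deliberately poor starting point
    heuristics=[propose(s) for s in ("s", "ing", "ed", "g")],
    description_length=naive_length,
)
print(sorted(final_morphology), final_length)
```

In this toy run the spurious suffix "g" is rejected because it adds a letter to the suffix list without shortening any stem, which is the same accept-or-discard decision the diagram assigns to MDL.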

1. Bootstrap heuristic • A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes. • In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so the search space has at least $\prod_{i} |w_i|$ members, where the product runs over the words $w_i$ of the corpus.

Better bootstrap heuristics Heuristic, not perfection! Several good heuristics. Best is a modification of a good idea of Zellig Harris (1955): Current variant: Cut words at certain peaks of successor frequency. Problems: can over-cut; can under-cut; and can put cuts too far to the right (“aborti-” problem). [Not a problem!]

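Successor frequency • Empirically, only one letter follows “gover”: “n”. [Diagram: g o v e r, followed by n]

Successor frequency • Empirically, 6 letters follow “govern”: “e”, “i”, “m”, “o”, “s”, and “#” (end of word). [Diagram: g o v e r n, followed by e, i, m, o, s, #]

Successor frequency • Empirically, 1 letter follows “governm”: “e”. • Successor counts: gover 1, govern 6, governm 1, so there is a peak of successor frequency after “govern”. [Diagram: g o v e r (1) n (6) m (1) e]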
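Successor frequency is straightforward to compute from a word list. The sketch below is an illustration under stated assumptions: the six-word list is a hypothetical stand-in for a real corpus (chosen so the counts match the slides), the function names are invented for the example, and the peak test is a plain strict local maximum rather than the stricter conditions the program applies.

```python
from collections import defaultdict

def successor_frequencies(words):
    """Map each word-initial string to the number of distinct symbols that follow it.

    "#" marks the end of a word, as on the slide.
    """
    successors = defaultdict(set)
    for word in words:
        marked = word + "#"
        for i in range(1, len(marked)):
            successors[marked[:i]].add(marked[i])
    return {prefix: len(symbols) for prefix, symbols in successors.items()}

def peak_cuts(word, freq):
    """Positions where successor frequency is a strict local maximum."""
    cuts = []
    for i in range(2, len(word)):
        before = freq.get(word[:i - 1], 0)
        here = freq.get(word[:i], 0)
        after = freq.get(word[:i + 1], 0)
        if here > before and here > after:
            cuts.append(i)
    return cuts

# Hypothetical toy word list, not the corpus behind the slides.
toy_words = ["govern", "governed", "governing",
             "government", "governor", "governs"]
freq = successor_frequencies(toy_words)
cuts = peak_cuts("government", freq)
print(freq["gover"], freq["govern"], freq["governm"], cuts)
```

A cut at position 6 splits "government" into "govern" + "ment".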

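Lots of errors… • Successor frequencies after each initial string of “conservatives”: c 9, co 18, con 11, cons 6, conse 4, conser 1, conserv 2, conserva 1, conservat 1, conservati 2, conservativ 1, conservative 1. • Peaks: after “co” (wrong), after “conserv” (right), after “conservati” (wrong).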

Even so… We set conditions: • Accept cuts with stems at least 5 letters in length; • Demand that successor frequency be a clear peak: 1 … N … 1 (e.g. govern-ment). • Then for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems.

2. Incremental heuristics: coarse-grained to fine-grained • 1. Stems and suffixes to split: Accept any analysis of a word if it consists of a known stem and a known suffix. • 2. Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis. We’ll return to this in a moment.

Incremental heuristic • 3. Slide stem-suffix boundary to the left: Again, use MDL to decide. • How do we use MDL to decide?

Using MDL to judge a potential stem act, acted, action, acts. We have the suffixes NULL, ed, ion, and s, but no signature NULL.ed.ion.s Let’s compute cost versus savings of signature NULL.ed.ion.s Savings: Stem savings: 3 copies of the stem act: that’s 3 x 4 = 12 letters = almost 60 bits.

Cost of NULL.ed.ion.s • A pointer to each suffix. To give a feel for this: total cost of suffix list: about 30 bits. • Cost of pointer to signature: all the stems using it chip in to pay for its cost, though.

Cost of signature: about 45 bits • Savings: about 60 bits so MDL says: Do it! Analyze the words as stem + suffix. Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
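The decision on this slide is a comparison of two bit counts. The sketch below takes the slides' own figures as given (12 letters saved, about 45 bits for the signature) and assumes the flat cost of log2(26) bits per letter mentioned later in the talk; the function name `worth_analyzing` is hypothetical.

```python
import math

BITS_PER_LETTER = math.log2(26)  # flat per-letter cost, about 4.7 bits

def worth_analyzing(letters_saved, signature_cost_bits):
    """Return (decision, savings in bits): analyze only if savings exceed cost."""
    savings_bits = letters_saved * BITS_PER_LETTER
    return savings_bits > signature_cost_bits, savings_bits

# Figures as stated on the slides: 12 letters saved, about 45 bits of cost.
decision, savings = worth_analyzing(12, 45.0)
print(decision, round(savings, 1))
```

Twelve letters at this rate come to roughly 56 bits, which is the "almost 60 bits" of savings that outweighs the signature's cost.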

Model • A model to give us a probability of each word in the corpus (hence, its optimal compressed length); and • A morphology whose length we can measure.

Frequency of analyzed word • W is analyzed as belonging to signature s, stem T, and suffix F. • [x] means the count of x’s in the corpus (token count), and [W] is the total number of words. • Actually what we care about is the log of this frequency.
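The formula on this slide did not survive in the transcript. One plausible reconstruction, assuming the three-factor decomposition (signature, stem given signature, suffix given signature) described in the Computational Linguistics (2001) article, is sketched below. Here $\sigma$ stands for the slide's signature s, and "$F$ in $\sigma$" denotes occurrences of suffix $F$ within that signature; treat this as an assumption rather than a quotation of the slide.

```latex
\mathrm{freq}(W) \;=\;
  \frac{[\sigma]}{[W]} \cdot
  \frac{[T]}{[\sigma]} \cdot
  \frac{[F \text{ in } \sigma]}{[\sigma]}

-\log_2 \mathrm{freq}(W) \;=\;
  \log_2 \frac{[W]}{[\sigma]} +
  \log_2 \frac{[\sigma]}{[T]} +
  \log_2 \frac{[\sigma]}{[F \text{ in } \sigma]}
```

The second line is the word's optimal compressed length in bits under this model, which is the quantity the description length sums over the corpus.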

Next, let’s see how to measure the length of a morphology • A morphology is a set of 3 things: • A list of stems; • A list of suffixes; • A list of signatures with the associated stems. • We’ll make an effort to make our grammars consist primarily of lists, whose length is conceptually simple.

Length of a list • A header telling us how long the list is, of length (roughly) log2 N, where N is the length. • N entries. What’s in an entry? • Raw lists: a list of strings of letters, where the length of each letter is log2 (26) – the information content of a letter (we can use a more accurate conditional probability). • Pointer lists:
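The length of a raw list as described here can be computed directly. This is a minimal sketch under the slide's simplest assumptions: a header of log2 N bits and a flat log2(26) bits per letter. It ignores any cost for marking where one entry ends, and the function name `raw_list_length_bits` is invented for the example.

```python
import math

def raw_list_length_bits(strings, alphabet_size=26):
    """Header of log2(N) bits plus a flat per-letter cost for every entry."""
    header = math.log2(len(strings))
    body = sum(len(s) for s in strings) * math.log2(alphabet_size)
    return header + body

suffix_bits = raw_list_length_bits(["ed", "s", "ing", "ion", "able"])
print(round(suffix_bits, 1))
```

Replacing the flat per-letter cost with a conditional probability, as the slide suggests, would lower the body term for predictable letter sequences.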

Lists • Raw suffix list: ed, s, ing, ion, able, … • Signature 1: Suffixes: pointer to “ing”, pointer to “ed” • Signature 2: Suffixes: pointer to “ing”, pointer to “ion” • The length of each pointer is usually smaller than that of the letters themselves.

The fact that a pointer to a symbol gets shorter as the symbol gets more frequent (its length is the negative log of the symbol’s frequency) is the key: • We want the shortest overall grammar; so • That means maximizing the re-use of units (stems, affixes, signatures, etc.)
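To see why re-use pays, compare the cost of a pointer with the cost of spelling a suffix out. The sketch below assumes a pointer costs log2(total / count) bits, the standard information-theoretic length for an item of that relative frequency; the suffix counts are hypothetical, and normalizing by the total of these counts is a simplification chosen for the example.

```python
import math

def pointer_length_bits(count, total):
    """Bits needed to point at an item that accounts for count/total of all uses."""
    return math.log2(total / count)

# Hypothetical token counts, for illustration only.
suffix_counts = {"s": 4000, "ed": 2000, "ing": 1500, "ion": 400, "able": 100}
total = sum(suffix_counts.values())

pointers = {}
for suffix, count in suffix_counts.items():
    pointers[suffix] = pointer_length_bits(count, total)
    spelled_out = len(suffix) * math.log2(26)   # flat per-letter cost
    print(f"{suffix:>5}: pointer {pointers[suffix]:.2f} bits, "
          f"letters {spelled_out:.2f} bits")
```

With these counts every pointer is cheaper than the letters it replaces, and the most frequent suffix gets the shortest pointer.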
