Linguistica: Unsupervised Learning of Natural Language Morphology Using MDL
John Goldsmith
Department of Linguistics
The University of Chicago
The Goal:
To develop a program that learns the structure of words in any human language on the basis of a raw text.
If you give the program a computer file containing Tom Sawyer, it should tell you that the language has a category of words that take the suffixes ing, s, ed, and NULL; another category that takes the suffixes 's, s, and NULL.
If you give it Jules Verne, it tells you there's a category with suffixes:
a aient ait ant (chanta, chantaient, chantait, chantant)
Why using MDL is closely related to measuring the (log of the) size of the space of possible vocabularies.
A good analysis of a set of data is one that (1) extracts the structure found in the data, and (2) does so without overfitting the data.
If you have a set of pointers to a bunch of objects, and a probability distribution over those pointers, then you may act as if the information-length of each pointer = -1 * log prob(that pointer).
So for our entire corpus: the compressed length of each piece w is -log prob(w); so...
Total compressed length of the corpus is: the sum of -log prob(w) over every piece w in the corpus.
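The pointer-length idea can be sketched in a few lines of Python. This is only an illustration: it assumes logs are base 2 (so lengths are in bits) and that prob(w) is simply the relative frequency of w in the corpus; the function names and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def pointer_lengths(tokens):
    """Map each word to -log2 prob(word), with prob = relative frequency (assumption)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: -math.log2(c / total) for w, c in counts.items()}

def compressed_length(tokens):
    """Total compressed length in bits: one pointer per token in the corpus."""
    lengths = pointer_lengths(tokens)
    return sum(lengths[w] for w in tokens)

# Toy corpus, invented for illustration.
corpus = "the dog saw the cat and the dog".split()
lengths = pointer_lengths(corpus)
total_bits = compressed_length(corpus)
```

Frequent words get short pointers and rare words get long ones, which is why the total shrinks when the model assigns high probability to what actually occurs.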
For a given set of data D, choose the analysis Ai to minimize the function:
Length(Ai) + Length(Compression of D using Ai)
The data is the corpus.
The compressed length of the corpus is just (summing over the words) the negative log of each word's probability, where that probability is the product of:
1. The frequency of the suffixal pattern in which the word is found (dog-s, dog-’s, dog-NULL);
2. The frequency of the stem (dog);
3. The frequency of the suffix (-s) within that pattern (-s, -’s, -NULL)
The pattern of suffixes that a stem takes is its signature:
W is analyzed as belonging to signature σ, stem T, and suffix F.
[x] means the count of x’s in the corpus, where [W] is the total number of words.
Actually what we care about is the log of this:
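A small sketch of the per-word cost may help here. It assumes the three frequencies are combined as a product of relative frequencies: the signature's share of all words, the stem's share of its signature, and the suffix's share of its signature. That conditioning is an assumption on my part, as are the base-2 logs and all names and counts below.

```python
import math

def word_cost_bits(n_words, n_signature, n_stem, n_suffix_in_signature):
    """-log2 of the word's probability, split into three pointer lengths.

    Assumed factorization (not confirmed by the slides):
      prob(signature)        = [signature] / [W]
      prob(stem | signature) = [stem] / [signature]
      prob(suffix | sig.)    = [suffix in signature] / [signature]
    """
    sig_bits = math.log2(n_words / n_signature)
    stem_bits = math.log2(n_signature / n_stem)
    suffix_bits = math.log2(n_signature / n_suffix_in_signature)
    return sig_bits + stem_bits + suffix_bits

# Hypothetical counts: 1000 words, 100 in the signature, 20 with this stem,
# 25 uses of this suffix inside the signature.
cost = word_cost_bits(1000, 100, 20, 25)
```

Working with the log turns the product into a sum of three pointer lengths, which is the quantity that gets added up over the corpus.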
A morphology is a set of 3 things: a list of stems, a list of suffixes, and a list of signatures.
A list of suffixes consists of:
Punctuation (~ of length log(4));
A pointer to each suffix (for ed: of length 3, because p(ed) = 1/8);
The letters of each suffix (for ed: of length 2, because 2 letters long).
Indication of size of the list (of length log (size));
List of pointers to each stem, where each pointer is of length - log freq (stem);
Concatenation of stems (sum of lengths of stems in letters)
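The three parts of this list cost can be computed directly. The sketch below assumes base-2 logs and takes freq(stem) to be the stem's share of all stem tokens; it keeps the letter count in letters rather than converting it to bits, since the conversion rate is not given. The function name and counts are illustrative only.

```python
import math

def stem_list_cost(stem_counts):
    """Return the three components of the stem list's description length.

    stem_counts: {stem: corpus count}. freq(stem) is taken to be
    count / total (an assumption).
    """
    total = sum(stem_counts.values())
    size_bits = math.log2(len(stem_counts))            # indication of list size
    pointer_bits = sum(-math.log2(c / total)           # one pointer per stem
                       for c in stem_counts.values())
    letters = sum(len(stem) for stem in stem_counts)   # concatenated stems
    return {"size_bits": size_bits, "pointer_bits": pointer_bits, "letters": letters}

# Hypothetical counts for three stems.
parts = stem_list_cost({"dog": 4, "cat": 2, "glove": 2})
```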
What is the size of an individual signature? It consists of two subparts:
For the words dog, dogs, cat, cats, glove, gloves:
1. Sum of the lengths of the pointers to the stems
2. Sum of the lengths of the pointers to the suffixes
on distribution of
over all the words.)
(iv) Signature component: list of pointers to signatures
<X> indicates the number of distinct elements in X
1. Take top 100 ngrams based on weighted mutual information as candidate morphemes of the language:
If a word ends in a candidate morpheme, split it thusly, to form a candidate stem thereby:
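These two steps can be sketched as follows. The exact weighted mutual information formula is not reproduced here, so the sketch assumes one common form: the n-gram's probability times the log of the ratio between that probability and the product of its letters' probabilities. The n-gram length range, the function names, and the word list are likewise assumptions for illustration.

```python
import math
from collections import Counter

def weighted_mi_ngrams(words, max_n=4, top_k=100):
    """Rank letter n-grams (2..max_n) by an assumed weighted MI score:
    p(gram) * log2(p(gram) / product of p(letter))."""
    letters = Counter(ch for w in words for ch in w)
    n_letters = sum(letters.values())
    scores = {}
    for n in range(2, max_n + 1):
        grams = Counter(w[i:i + n] for w in words for i in range(len(w) - n + 1))
        total = sum(grams.values())
        for gram, count in grams.items():
            p_gram = count / total
            p_indep = math.prod(letters[ch] / n_letters for ch in gram)
            scores[gram] = p_gram * math.log2(p_gram / p_indep)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def split_on_candidates(words, candidates):
    """Split each word that ends in a candidate morpheme into (stem, suffix)."""
    splits = []
    for w in words:
        for suffix in candidates:
            if w.endswith(suffix) and len(w) > len(suffix):
                splits.append((w[:-len(suffix)], suffix))
    return splits

words = ["jumping", "walking", "talking", "jumped", "walked", "talked"]
candidates = weighted_mi_ngrams(words, top_k=5)
splits = split_on_candidates(words, candidates)
```

A word may end in several candidates at once, so this returns every possible split; choosing among them is a separate step.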
This turns out to be a lot harder than you’d think, given what I’ve said so far.
Short answer is a heuristic: maximize the objective function
There’s no good short explanation for this, except this: the frequency of a single letter is a very bad first approximation of its likelihood to be a morpheme.
Now eliminate all signatures that appear only once.
This gives us an excellent first guess for the morphology.
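A minimal sketch of this filtering step, assuming that "appear only once" means a signature that is associated with just one stem, and that a stem which is itself a word takes the NULL suffix. The names and the toy data are hypothetical.

```python
from collections import defaultdict

def build_signatures(splits, vocabulary):
    """Group stems by their set of suffixes; keep signatures with 2+ stems.

    splits: iterable of (stem, suffix) pairs.
    vocabulary: set of attested words, used to detect the NULL suffix.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in splits:
        suffixes_of[stem].add(suffix)
    stems_of = defaultdict(list)
    for stem, suffixes in suffixes_of.items():
        if stem in vocabulary:          # the bare stem occurs as a word
            suffixes = suffixes | {"NULL"}
        stems_of[".".join(sorted(suffixes))].append(stem)
    # Drop signatures supported by a single stem (assumed reading of the step).
    return {sig: sorted(stems) for sig, stems in stems_of.items() if len(stems) > 1}

splits = [("dog", "s"), ("cat", "s"), ("walk", "ed"), ("walk", "ing"),
          ("jump", "ed"), ("jump", "ing"), ("abs", "ence")]
vocabulary = {"dog", "dogs", "cat", "cats", "walk", "walked", "walking",
              "jump", "jumped", "jumping", "absence"}
signatures = build_signatures(splits, vocabulary)
```

Here the one-stem signature for abs-ence is discarded, while NULL.s and NULL.ed.ing survive because two stems support each.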
abrupt NULL ly ness.
abs ence ent.
absent -minded NULL ia ly.
absent-minded NULL ly
absentee NULL ism
absolu NULL e ment.
absorb ait ant e er é ée
abus ait er
abîm e es ée.
Top 10, 100K words
1 .NULL.ed.ing. 65 1214
2 .NULL.ed.ing.s. 27 1464
3 .NULL.s. 290 8184
4 .'s.NULL.s. 27 2645
5 .NULL.ed.s. 26 541
6 .NULL.ly. 128 2124
7 .NULL.ed. 87 767
8 .'s.NULL. 75 3655
9 .NULL.d.s. 14 510
10 .NULL.ing. 62 983
heap check revolt
plunder look obtain
escort proclaim arrest
gain destroy stay
suspect kill consent
knock track succeed
answer frighten glitter …
Only one stem for this signature:
Just the same, in mirror-image style. Perform either on stems or on words.
Problems that arise:
1. “ments” problem: a suffix may really be two suffixes.
2. ted.ting.ts: a letter which occurs stem-finally with high frequency may get wrongly parsed (e.g., shou-ted, shou-ting, shou-ts).
3. Spurious signatures
4. Misplaced word-breaks
We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compare the two lengths.
+ Compressed data
Then the size of the punctuation for the 3 lists is:
Then the change of the size of the punctuation in the lists:
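As a rough illustration, if the punctuation of each list is taken to be the indication of its size, of length log(size), then the two quantities can be computed as below. That reading of "punctuation", the base-2 logs, and the names are assumptions.

```python
import math

def punctuation_bits(n_stems, n_suffixes, n_signatures):
    """Size of the punctuation for the 3 lists: log2 of each list's size, summed."""
    return math.log2(n_stems) + math.log2(n_suffixes) + math.log2(n_signatures)

def punctuation_change(state1, state2):
    """Change in punctuation size going from state1 to state2.

    Each state is a tuple (n_stems, n_suffixes, n_signatures).
    """
    return punctuation_bits(*state2) - punctuation_bits(*state1)

# Hypothetical states: the proposed change doubles the number of stems.
delta = punctuation_change((8, 4, 2), (16, 4, 2))
```

Only the lists whose sizes change contribute, so the difference can be computed without revisiting the rest of the morphology.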
Change in its size when we consider a modification to the morphology:
1. Global effects of change of number of suffixes;
2. Effects on change of size of suffixes in both states;
3. Suffixes present only in state 1;
4. Suffixes present only in state 2;
Global effect of change on all suffixes
Contribution of suffixes that appear only in State 1
Contribution of suffixes that appear only in State 2
Entropy, MDL, and morphology
Why using MDL is closely related to measuring the complexity of the space of possible vocabularies
Consider the space of all words of length L, built from an alphabet of size b.
How many ways are there to build a vocabulary of size N? Call that U(b,L,N).
Compare that operation (choosing a set of N words of length L, alphabet size b) with the operation of choosing a set of T stems (of length t) and a set of F suffixes (of length f), where t + f = L.
If we take the complexity of each task to be measured by the log of its size, then we’re asking about the size of:
is easy to approximate, however.
The number of bits needed to list all the words:
The length of all the pointers to all the words:
the compressed corpus
Thus the log of the number of vocabularies = description length of that vocabulary, in the terms we’ve been using.
That means that the difference in the sizes of the spaces of possible vocabularies is equal to the difference in the description length in the two cases:
Difference between the complexity of the “simplex word” analysis and the complexity of the analyzed-word analysis =
log U(b,L,N) - log U(b,t,T) - log U(b,f,F)
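To get a feel for the sizes involved, the comparison can be computed numerically. The sketch assumes U(b,L,N) counts the N-element sets of distinct strings of length L over an alphabet of size b, that is, the binomial coefficient of b^L and N, and it takes the log of each of the three terms. The parameter values are invented for illustration.

```python
import math

def log2_U(b, L, N):
    """log2 of the number of N-word vocabularies, assuming U = C(b**L, N)."""
    return math.log2(math.comb(b ** L, N))

def complexity_difference(b, L, N, t, T, f, F):
    """log U(b,L,N) - log U(b,t,T) - log U(b,f,F), with t + f = L."""
    if t + f != L:
        raise ValueError("stem and suffix lengths must sum to the word length")
    return log2_U(b, L, N) - log2_U(b, t, T) - log2_U(b, f, F)

# Hypothetical setting: 1000 six-letter words versus 100 four-letter stems
# combined with 10 two-letter suffixes.
diff_bits = complexity_difference(26, 6, 1000, 4, 100, 2, 10)
```

With these numbers the difference is large and positive: picking stems and suffixes separately selects from a far smaller space than picking whole words.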
Difference in size of
Difference in size
of compressed data
But we’ve (over)simplified in this case by ignoring the frequencies inherent in real corpora. What’s of great interest in real life is the fact that some suffixes are used often, others rarely, and similarly for stems.
We know something about the distribution of words, but nothing about the distribution of stems, and especially of suffixes.
But suppose we wanted to think about the statistics of vocabulary choice in which words could be selected more than once….
We want to select N words of length L, and the same word can be selected more than once. How many ways of doing this are there?
These are like bosons: you can have any number of occurrences of a word, and 2 sets of the same number of them are indistinguishable. How many such vocabularies are there, then?
where Z(i) is the number of words of frequency i.
(‘Z’ stands for “Zipf”).
We don’t know much about frequencies of suffixes, but Zipf’s law says that
hence for a morpheme set that obeyed the Zipf distribution: