Automatic learning of morphology
John Goldsmith, July 2003, University of Chicago
Language learning: unsupervised learning. Not “theoretical” – but based on a theory with solid foundations. Practical, real data.
There are explanations and other downloads available there.
Technical description in Computational Linguistics (June 2001): “Unsupervised Learning of the Morphology of a Natural Language”
[Screenshot of the Linguistica interface, with callouts: lists of stems and affixes; messages from the analyst; actions and outlines of information.]
List of stems
A stem’s signature is the list of suffixes it appears with in the corpus,
in alphabetical order.
Stem       Signature   Words
abilit     ies.y       abilities, ability
absolute   NULL.ly     absolute, absolutely
List of signatures
Signature: NULL ed ing s
account accounted accounting accounts
add added adding adds
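The grouping of stems by signature can be sketched in a few lines of Python. This is a minimal illustration only: the suffix inventory is assumed up front here, whereas in the real system the suffixes are themselves discovered.

```python
from collections import defaultdict

# Hypothetical suffix inventory, assumed for illustration; "" plays the
# role of the NULL suffix.
SUFFIXES = ["ed", "ing", "s", ""]

def signatures(words):
    """Group stems by the alphabetized list of suffixes they occur with."""
    words = set(words)
    stem_suffixes = defaultdict(set)
    for w in words:
        for suf in SUFFIXES:
            if suf == "":
                stem_suffixes[w].add("NULL")
            elif w.endswith(suf) and len(w) > len(suf):
                stem_suffixes[w[: -len(suf)]].add(suf)
    sigs = defaultdict(list)
    for stem, sufs in stem_suffixes.items():
        if len(sufs) > 1:  # keep only stems seen with 2+ suffixes
            sigs[".".join(sorted(sufs))].append(stem)
    return dict(sigs)

print(signatures(["account", "accounted", "accounting", "accounts"]))
# → {'NULL.ed.ing.s': ['account']}
```

Requiring at least two suffixes per stem filters out the spurious analysis in which every word is its own stem with the NULL suffix.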
More sophisticated signature…
Signature: <e>ion . NULL
composite  concentrate  corporate  détente
What is this?
composite → composit; composit + ion = composition
It infers that -ion deletes a stem-final ‘e’ before attaching.
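That inference can be mimicked by a simple check over the word list: a word ending in e is a candidate for the <e>ion signature if replacing its final e with ion also yields a word. A minimal sketch; the function name is invented here.

```python
def e_deletion_candidates(lexicon):
    """Words ending in 'e' whose final 'e' can be replaced by 'ion'
    to give another word in the lexicon -- the <e>ion pattern."""
    lexicon = set(lexicon)
    return sorted(w for w in lexicon
                  if w.endswith("e") and w[:-1] + "ion" in lexicon)

words = ["composite", "composition", "concentrate", "concentration", "détente"]
print(e_deletion_candidates(words))  # → ['composite', 'concentrate']
```

détente is correctly left out, since no corresponding -ion form appears in the list.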
In French, we find that the outermost layer of morphology is
not so interesting: it’s mostly é, e, and s. But we can get inside
the morphology of the resulting stems, and get the roots:
(i) whether the task is still extremely difficult, and
(ii) what kind of language acquisition device could be capable of dealing with the problem.
(this has nothing to do with Optimality theory, which does not optimize any function! Optimization means finding a maximum or minimum – remember calculus?)
Minimum Description Length provides us with a means for understanding grammar selection as minimizing a function. (We’ll get to MDL in a moment)
Naive Minimum Description Length
jump, jumps, jumping
laugh, laughed, laughing
sing, sang, singing
the, dog, dogs
total: 61 letters
Stems: jump laugh sing sang dog (20 letters)
Suffixes: s ing ed (6 letters)
Unanalyzed: the (3 letters)
total: 29 letters.
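The two letter counts can be checked mechanically:

```python
corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]

# Cost of listing every word letter-by-letter:
unanalyzed = sum(len(w) for w in corpus)

# Cost of the analyzed version: stems + suffixes + unanalyzed residue.
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
residue = ["the"]
analyzed = (sum(len(w) for w in stems)
            + sum(len(w) for w in suffixes)
            + sum(len(w) for w in residue))

print(unanalyzed, analyzed)  # → 61 29
```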
Notice that the description length
goes UP if we analyze sing into s+ing
Sum all the letters, plus all the structure inherent in the description, using information theory.
The essence of what you need to know from information theory is this:
that mentioning an object can be modeled by a pointer to that object,
whose length (complexity) is equal to -1 times the log of its frequency.
But why you should care about -log(freq(x)) is much less obvious.
[Slide diagram: the formula for the length of the model, with annotations:]
Suffix list: the number of letters in each suffix, times l, where l = number of bits/letter (< 5).
Stem list: the number of letters in each stem, times l, plus a list of pointers to signatures.
The cost of setting up an entity is the length of its pointer, in bits.
<X> indicates the number of distinct elements in X.
Probabilistic morphology: the measure is -log prob(corpus | morphology),
where the morphology assigns a probability to any data set.
This is known in information theory as the optimal compressed length of the data (given the model).
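A minimal illustration of that compressed length, using a unigram model over word tokens (each token costs -log2 of its empirical frequency); the model here is an assumption for the sake of the example:

```python
import math
from collections import Counter

def compressed_length_bits(corpus):
    """Optimal code length of the corpus, in bits, under the model
    that assigns each word its empirical unigram probability."""
    counts = Counter(corpus)
    n = len(corpus)
    return sum(-math.log2(counts[w] / n) for w in corpus)

# "the" has probability 1/2 (1 bit); "dog" and "cat" 1/4 (2 bits each)
print(compressed_length_bits(["the", "dog", "the", "cat"]))  # → 6.0
```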
A grammar can be used not (just) to specify what is grammatical and what is not, but to assign a probability to each string (or structure).
If we have two grammars that assign different probabilities, then the one that assigns a higher probability to the observed data is the better one.
This follows from the basic principle of rationality in the Universe:
Maximize the probability of the observed data.
There is an objective answer to the question: which of two analyses of a given set of data is better? (modulo the differences between different universal Turing machines)
However, there is no general, practical guarantee of being able to find the best analysis of a given set of data.
Hence, we need to think of (this sort of) linguistics as being divided into two parts: a way of evaluating analyses, and heuristics for finding good ones.
(Remember, these “things” are mathematical things: algorithms.)
How do we start?
Successor frequency of jum: 1
jum + p  (jump, jumping, jumps, jumped, jumpy)
Successor frequency of jump: 5
Zellig Harris: Successor Frequency
19  9  6  3  1  3  1  1
 a  c  c  e  p  t  i  n  g
(the successor frequency after each prefix: a, ac, acc, …)
Zellig Harris: Successor frequency
b debate, debuting
c decade, december, decide
d dedicate, deduce, deduct
f defeat, defend, defer
Zellig Harris: Successor frequencies
 9 18 11  6  4  1  2  1  1  2  1  1
 c  o  n  s  e  r  v  a  t  i  v  e  s
it cannot distinguish between
But that’s the problem it’s supposed to solve.
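Successor frequency itself is easy to compute; a sketch, where word-final position counts as a successor (written '#'):

```python
def successor_frequencies(word, lexicon):
    """For each proper prefix of `word`, count the distinct letters
    (or end-of-word, '#') that follow that prefix in the lexicon."""
    result = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] if len(w) > i else "#"
                      for w in lexicon if w.startswith(prefix)}
        result.append((prefix, len(successors)))
    return result

lexicon = ["jump", "jumps", "jumping", "jumped", "jumpy", "just"]
for prefix, sf in successor_frequencies("jumping", lexicon):
    print(prefix, sf)   # the spike at "jump" suggests a morpheme boundary
```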
Analysis based on successor frequency
Pick a large corpus from a language --
5,000 to 1,000,000 words.
Feed it into the bootstrapping heuristic.
Out of which comes a morphology,
which need not be superb.
Feed it to the incremental heuristics.
Out comes a modified morphology.
Is the modification an improvement?
If it is an improvement,
replace the morphology...
Send it back to the incremental heuristics.
Continue until there
are no improvements.
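The control flow of this loop is greedy search on description length. A generic sketch, with the morphology-specific pieces (the bootstrap, the incremental heuristics, the real description length) abstracted into the three arguments; the toy usage below merely stands in for them:

```python
def greedy_mdl(initial, propose, description_length):
    """Repeatedly propose a modified analysis; accept it only if it
    lowers the description length, and stop at the first non-improvement."""
    state = initial
    best = description_length(state)
    while True:
        candidate = propose(state)
        dl = description_length(candidate)
        if dl < best:
            state, best = candidate, dl  # replace the morphology
        else:
            return state, best           # no more improvements

# Toy stand-in: states are integers, "description length" is (s - 7)^2,
# and the heuristic proposes s + 1.
state, dl = greedy_mdl(0, lambda s: s + 1, lambda s: (s - 7) ** 2)
print(state, dl)  # → 7 0
```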
We need to compute the Description Length of the analysis
as it stands versus
as it would be if we shifted varying parts of the stems to the suffixes.
Current description length is roughly:
The total length of the letters in the stems, converted to bits (by a factor of how many bits per letter) PLUS
The sum of the pointer-lengths to the suffixes – each pointer is of length -log(frequency).
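That comparison can be sketched concretely. The description length below follows the rough recipe above – letters in the distinct stems and suffixes at a fixed number of bits per letter, plus a pointer of length -log2(frequency) for each token's suffix; the segmentation rule is a toy assumption:

```python
import math
from collections import Counter

BITS_PER_LETTER = 5  # a flat code for a 26-letter alphabet fits in < 5 bits

def description_length(analyses):
    """analyses: list of (stem, suffix) pairs, one per word token."""
    stems = {stem for stem, _ in analyses}
    suffix_counts = Counter(suf for _, suf in analyses)
    n = sum(suffix_counts.values())
    letters = sum(map(len, stems)) + sum(map(len, suffix_counts))
    pointers = sum(-math.log2(suffix_counts[suf] / n) for _, suf in analyses)
    return BITS_PER_LETTER * letters + pointers

def analyze(word):
    # Toy segmentation: peel off -ing or -s when present, else no suffix.
    for suf in ("ing", "s"):
        if word.endswith(suf):
            return word[: -len(suf)], suf
    return word, ""

tokens = ["jump", "jumps", "jumping", "walk", "walks", "walking"]
with_cuts = [analyze(w) for w in tokens]
no_cuts = [(w, "") for w in tokens]
print(description_length(with_cuts) < description_length(no_cuts))  # → True
```

Shifting the shared -s and -ing material out of the stems shortens the stem list enough to pay for the suffix pointers.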
We can justify a rule like “delete stem-final e before –ing” on the grounds that it simplifies the collection of Signatures:
Compare the signatures
?: and, to, in, that, for, he, as, with,
on, by, at, or, from…
finite verbs: was, had,
has, would, said,
could, did, might,
told, knew, took,
nouns: world, way, same, united,
right, system, city, case,
church, problem, company,
past, field, cost, department,
university, rate, door,
non-finite verbs: be, do, go, make,
see, get, take, say, put,
find, give, provide, keep, run…
Prepositions: of in for on by at from
into after through under since
during against among within along
across including near
adjectives: social national white local political
personal private strong medical final
black French technical nuclear british