Linguistica unsupervised learning of natural language morphology using mdl


Linguistica : Unsupervised Learning of Natural Language Morphology Using MDL. John Goldsmith Department of Linguistics The University of Chicago. The Goal:. To develop a program that learns the structure of words in any human language on the basis of a raw text.


Presentation Transcript


Linguistica: Unsupervised Learning of Natural Language Morphology Using MDL

John Goldsmith

Department of Linguistics

The University of Chicago


The goal

The Goal:

  • To develop a program that learns the structure of words in any human language on the basis of a raw text.

  • No human supervision, except for the naïve creation of the text.


Value

Value

  • To linguistic theory: reconstruct linguistic theory in a quantitative fashion

  • Practical value:

    • Information retrieval on data bases of unrestricted languages

    • develop stochastic morphologies rapidly: necessary for automatic speech recognition

  • A step towards syntax


The product

The product

  • Currently a C++ program that functions as a Windows-based tool for corpus-based linguistics.

  • Available in beta version on web site.


What do we want

What do we want?

If you give the program a computer file containing Tom Sawyer, it should tell you that the language has a category of words that take the suffixes ing, s, ed, and NULL; another category that takes the suffixes 's, s, and NULL.

If you give it Jules Verne, it tells you there's a category with suffixes:

a aient ait ant (chanta, chantaient, chantait, chantant)


Immediate queries

Immediate queries:

  • Do you tell it what language to expect? No.

  • Does it have access to meaning? No.

  • Does that matter? No.

  • How much data does it need? ...


How much data do you need?

  • You get reasonable results fast, with 5,000 words, but results are much better with 50,000 words, and better still with 500,000 words (length of corpus).

  • 100,000 word tokens ~ 12,000 distinct words.


Game plan

Game plan

  • Overview of MDL = Minimum Description Length, where

  • Description Length = Length of Analysis + Length of Compressed Data

  • Length of data as optimal compressed length of the corpus, given probabilities derived from morphology

  • Length of morphology in information theoretic terms

  • MDL is dead without heuristics…(then again, heuristics without MDL lack all finesse.)


Game plan continued

Game plan (continued)

  • Heuristic 1: discover basic candidate suffixes of the language using weighted mutual information

  • Heuristic 2: use these to find regular signatures;

  • Now use MDL to correct errors generated by heuristics


Game plan end

Game plan (end)

Why using MDL is closely related to measuring the (log of the) size of the space of possible vocabularies.


Turning to the problem of learning morphology...


  • For the purposes of Linguistica 1, I will restrict myself to Indo-European languages, and in general to languages in which the average number of suffixes per word is not greater than 2. (We drop this requirement in Linguistica 2.)


Minimum description length rissanen 1989

Minimum Description Length (Rissanen 1989)

Basic idea:

A good analysis of a set of data is one that (1) extracts the structure found in the data, and (2) does so without overfitting the data.


If you have a set of pointers to a bunch of objects, and a probability distribution over those pointers, then you may act as if the information-length of each pointer is

$-\log_2 \mathrm{prob}(\text{that pointer})$.


So for our entire corpus: the compressed length of each piece w is -log prob(w), so the total compressed length of the corpus is the sum of that quantity over every word token w in the corpus:

$\sum_{w \in \text{corpus}} -\log_2 \mathrm{prob}(w)$
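A minimal Python sketch of this total, assuming every word is unanalyzed so that prob(w) is just its relative frequency in the corpus, and assuming logs are taken base 2 (so lengths are in bits). The function name `compressed_length_bits` is hypothetical, not part of Linguistica.

```python
import math
from collections import Counter

def compressed_length_bits(tokens):
    """Sum -log2 prob(w) over every word token, with prob(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return sum(-math.log2(counts[w] / total) for w in tokens)

# Toy corpus (illustrative only).
tokens = "the dog saw the cat and the dogs".split()
print(compressed_length_bits(tokens))
```

Frequent words get short pointers and rare words long ones, which is why the total rewards a probability model that matches the corpus.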


Overfitting the data

Overfitting the data:

  • The Gettysburg Address can be compressed to 2 bits if you choose an eccentric encoding scheme.

  • But that encoding scheme (1) will be long, and (2) will do more poorly than an encoding scheme that does not waste its probability mass on the Gettysburg Address.


Even scientific theories bow to the exigencies of mdl

Even scientific theories bow to the exigencies of MDL...

  • in a sense.

  • A theory is penalized if it does not capture generalizations within the observational data (e.g., predicting future observations on the basis of the initial conditions);

  • It is penalized if it is more complex than it needs to be (Ockham’s Razor).


Minimum description length

Minimum Description Length:

For a given set of data D, choose the analysis Ai to minimize the function:

Length(Compression of D using Ai) + Length(Ai)


Compressed length of the data using a i

Compressed length of the data using Ai?

The data is the corpus.

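The compressed length of the corpus is just (summing over the words w of the corpus, with the probability that Ai assigns to each)

$\sum_{w \in \text{corpus}} -\log_2 \mathrm{prob}_{A_i}(w)$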


Our morphology has two necessary properties

Our morphology has two necessary properties:

  • It must assign a probability to every word of the language (so that we can speak of its ability to compress the corpus) -- we’ll return to this immediately;

  • And it must have a well-defined length.


Morphology assigns a frequency

Morphology assigns a frequency:

  • If the morphology assigns no internal structure to a word (John, the, …), it assigns the observed frequency to the word.

  • If the morphology analyzes a word (dog+s), it assigns a frequency to that word on the basis of 3 things:



1. The frequency of the suffixal pattern in which the word is found (dog-s, dog-’s, dog-NULL);

2. The frequency of the stem (dog);

3. The frequency of the suffix (-s) within that pattern (-s, -’s, -NULL)


Terminology

Terminology:

The pattern of suffixes that a stem takes is its signature:

  • NULL.ed.ing.s

  • NULL.er.est.ness


Frequency of analyzed word

Frequency of analyzed word

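A word w is analyzed as belonging to signature σ, with stem T and suffix F. Its frequency is the product of the three factors just listed:

$\mathrm{freq}(w) = \dfrac{[\sigma]}{[W]} \cdot \dfrac{[T]}{[\sigma]} \cdot \dfrac{[F \text{ in } \sigma]}{[\sigma]}$

Here [x] means the count of x’s in the corpus (token count), and [W] is the total number of words.

Actually what we care about is the log of this:

$\log \mathrm{freq}(w) = \log \dfrac{[\sigma]}{[W]} + \log \dfrac{[T]}{[\sigma]} + \log \dfrac{[F \text{ in } \sigma]}{[\sigma]}$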
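As a rough illustration, the sketch below multiplies the three relative frequencies named above (signature, stem within the signature, suffix within the signature). It assumes the three factors combine as a simple product; the function name and all counts are hypothetical.

```python
import math

def analyzed_word_prob(sig_count, stem_count, suffix_in_sig_count, total_words):
    """Probability of stem+suffix given token counts for its signature, stem and suffix."""
    p_signature = sig_count / total_words          # [signature] / [W]
    p_stem = stem_count / sig_count                # [stem] / [signature]
    p_suffix = suffix_in_sig_count / sig_count     # [suffix in signature] / [signature]
    return p_signature * p_stem * p_suffix

# Hypothetical counts: 1,000 tokens in all, 100 of them in the signature NULL.s,
# 20 of those built on the stem "dog", 40 of those ending in the suffix "s".
p = analyzed_word_prob(sig_count=100, stem_count=20, suffix_in_sig_count=40, total_words=1000)
print(p, -math.log2(p))  # probability of "dogs" and its pointer length in bits
```

Note that an analyzed word can receive a probability even if that exact stem-plus-suffix combination was rare, because the estimate pools counts across the whole signature.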


So far

So far:

  • The behavior we demand of our morphology is that it assign a frequency to any given word; we need this so that we can evaluate the particular morphology’s goodness as an analysis, i.e., as a compressor.


Next let s see how to measure the length of a morphology

Next, let’s see how to measure the length of a morphology

A morphology is a set of 3 things:

  • A list of stems;

  • A list of suffixes;

  • A list of signatures with the associated stems.


Let s measure the list of suffixes

Let’s measure the list of suffixes

A list of suffixes consists of:

  • A piece of punctuation telling us how long the list is (of length log(size));

  • A list of pointers to the suffixes (each pointer of size -log(freq(suffix)));

  • A concatenation of the letters of the suffixes (we could compress this, too, or just count the number of letters).


Example: a list of four suffixes

  • Punctuation: 4 (of length ~ log(4))

  • Pointers: pointer(ed), pointer(s), pointer(NULL), pointer(ing); the pointer to ed is of length 3, because p(ed) = 1/8

  • Letters: ed, s, NULL, ing; ed is of length 2, because it is 2 letters long
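The three parts of the suffix list can be tallied with a short sketch. Only p(ed) = 1/8 comes from the slide; the other probabilities are made up for illustration, the function name is hypothetical, and treating NULL as zero letters is an assumption.

```python
import math

def suffix_list_cost(suffix_probs):
    """Return (punctuation bits, total pointer bits, letter count) for a suffix list."""
    punctuation = math.log2(len(suffix_probs))                  # log(size of list)
    pointers = sum(-math.log2(p) for p in suffix_probs.values())  # -log freq per suffix
    # Assumption: NULL is the empty suffix, so it contributes no letters.
    letters = sum(0 if s == "NULL" else len(s) for s in suffix_probs)
    return punctuation, pointers, letters

# p(ed) = 1/8 as on the slide; the remaining values are hypothetical.
probs = {"ed": 1 / 8, "s": 1 / 2, "NULL": 1 / 4, "ing": 1 / 8}
print(suffix_list_cost(probs))
```

The same function applies unchanged to a stem list, since it is measured the same way.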


Same for stem list

Same for stem list:

Indication of size of the list (of length log (size));

List of pointers to each stem, where each pointer is of length - log freq (stem);

Concatenation of stems (sum of lengths of stems in letters)


Size of the signature list

Size of the signature list

What is the size of an individual signature? It consists of two subparts:

  • a list of pointers to stems, and a list of pointers to suffixes.

  • And we already know how to measure the size of a list of pointers.


An individual signature

An individual signature

for the words dog, dogs, cat, cats, glove, gloves: the stems cat, dog and glove, each taking the suffixes NULL and s (the signature NULL.s).


Length of a signature

Length of a signature

(Sum of the lengths of the pointers to the stems) + (Sum of the lengths of the pointers to the suffixes)


I m glossing over an important natural language complexity recursive structure

I’m glossing over an important natural language complexity: recursive structure.

[word [word find + ing] + s]

(Significant effects on distribution of probability mass over all the words.)


So the total length of the morphology is

So the total length of the morphology is...


(iv) Signature component:

  • list of pointers to signatures

<X> indicates the number of distinct elements in X.


Mdl needs heuristics

MDL needs heuristics

  • MDL does only one thing: it tells you which of two analyses is better.

  • It doesn’t tell you how to find those analyses.


Overall strategy

Overall strategy

  • Use initial heuristic to establish sets of signatures and sets of stems.

  • Use heuristics to propose various corrections.

  • Use MDL to decide on whether proposed corrections are to be accepted or refused.
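The strategy above can be sketched as an accept-or-refuse loop. This is only an outline under stated assumptions: `refine`, `corrections` and `description_length` are hypothetical names, and the toy example uses a plain number as a stand-in for a morphology.

```python
def refine(morphology, corrections, description_length):
    """Try each proposed correction in turn; keep it only if total description length drops."""
    best, best_bits = morphology, description_length(morphology)
    for correct in corrections:
        candidate = correct(best)                 # heuristic proposes a modified morphology
        candidate_bits = description_length(candidate)
        if candidate_bits < best_bits:            # MDL accepts the correction
            best, best_bits = candidate, candidate_bits
        # otherwise the correction is refused and the current morphology is kept
    return best, best_bits

# Toy stand-in: the "morphology" is an integer and its description length is (m - 3)^2.
toy_corrections = [lambda m: m + 1, lambda m: m - 5, lambda m: m + 2]
print(refine(0, toy_corrections, lambda m: (m - 3) ** 2))
```

The heuristics decide what to try; the description length only ever arbitrates between the current state and the proposed one.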


Initial heuristic

Initial Heuristic

1. Take top 100 ngrams based on weighted mutual information as candidate morphemes of the language:


If a word ends in a candidate morpheme, split it there to form a candidate stem:

sanity:

  • sanit + y

  • sanity

  • san + ity
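A small sketch of this splitting step, assuming the candidate morphemes have already been chosen. The function name `candidate_splits` and the candidate set are hypothetical; the unsplit word is represented here with the suffix NULL.

```python
def candidate_splits(word, candidate_suffixes):
    """List every way to cut `word` so that the right-hand piece is a candidate suffix."""
    splits = [(word, "NULL")]                 # the unanalyzed word is always an option
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        if suffix in candidate_suffixes:
            splits.append((stem, suffix))
    return splits

print(candidate_splits("sanity", {"y", "ity", "ing"}))
# [('sanity', 'NULL'), ('san', 'ity'), ('sanit', 'y')]
```

A word with more than one split is exactly the ambiguous case that has to be resolved before stems can be collected.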


How to choose in ambiguous cases

How to choose in ambiguous cases?

This turns out to be a lot harder than you’d think, given what I’ve said so far.

Short answer is a heuristic: maximize the objective function

There’s no good short explanation for this, except this: the frequency of a single letter is a very bad first approximation of its likelihood to be a morpheme.


For each stem find the suffixes it appears with

For each stem, find the suffixes it appears with

  • This forms its signature:

  • NULL.ed.ing.s, for example.

    Now eliminate all signatures that appear only once.

    This gives us an excellent first guess for the morphology.
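One way to sketch this step: group the suffixes seen with each stem into a signature string, then drop any signature that belongs to only one stem. The function name and the input pairs are hypothetical, and sorting suffixes by character code (which puts NULL first) is a convenience of this sketch rather than a documented convention.

```python
from collections import defaultdict

def build_signatures(analyses):
    """analyses: iterable of (stem, suffix) pairs. Returns {signature: set of stems}."""
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix)

    stems_of = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        signature = ".".join(sorted(suffixes))   # e.g. "NULL.ed.ing.s"
        stems_of[signature].add(stem)

    # Eliminate signatures that appear only once (that is, on a single stem).
    return {sig: stems for sig, stems in stems_of.items() if len(stems) > 1}

pairs = [(stem, suf) for stem in ("jump", "walk") for suf in ("NULL", "ed", "ing", "s")]
pairs.append(("sanit", "y"))
print(build_signatures(pairs))
```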


Stems with their signatures

Stems with their signatures

abrupt NULL ly ness.

abs ence ent.

absent -minded NULL ia ly.

absent-minded NULL ly

absentee NULL ism

(French:)

absolu NULL e ment.

absorb ait ant e er é ée

abus ait er

abîm e es ée.


Now build up signature collection

Now build up signature collection...

Top 10, 100K words

 1  .NULL.ed.ing.      65   1214
 2  .NULL.ed.ing.s.    27   1464
 3  .NULL.s.          290   8184
 4  .'s.NULL.s.        27   2645
 5  .NULL.ed.s.        26    541
 6  .NULL.ly.         128   2124
 7  .NULL.ed.          87    767
 8  .'s.NULL.          75   3655
 9  .NULL.d.s.         14    510
10  .NULL.ing.         62    983


Verbose signature

Verbose signature...

.NULL.ed.ing. 58

heap      check     revolt
plunder   look      obtain
escort    proclaim  arrest
gain      destroy   stay
suspect   kill      consent
knock     track     succeed
answer    frighten  glitter ...


Find strictly regular signatures

Find strictly regular signatures:

  • A signature is strictly regular if it contains more than one suffix, and is found on more than one stem.

  • A suffix found in a strictly regular signature is a regular suffix.

  • Keep only signatures composed of regular suffixes (=regular signatures).
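The three definitions above translate fairly directly into a filter. This is a sketch that assumes signatures are stored as dot-joined suffix strings mapped to their stems; the function name and the example data are hypothetical.

```python
def regular_signatures(stems_of):
    """stems_of: {signature string like 'NULL.ed.ing': set of stems}."""
    # Strictly regular: more than one suffix and more than one stem.
    strictly_regular = {
        sig for sig, stems in stems_of.items()
        if len(sig.split(".")) > 1 and len(stems) > 1
    }
    # Regular suffix: any suffix occurring in a strictly regular signature.
    regular_suffixes = {suf for sig in strictly_regular for suf in sig.split(".")}
    # Regular signature: composed entirely of regular suffixes.
    return {
        sig: stems for sig, stems in stems_of.items()
        if set(sig.split(".")) <= regular_suffixes
    }

example = {
    "NULL.ed.ing": {"jump", "walk"},
    "NULL.ed": {"plunder"},
    "ch.e.erial": {"mat"},
}
print(sorted(regular_signatures(example)))
```

In this example the one-stem signature NULL.ed survives because both of its suffixes are regular, while ch.e.erial is dropped.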


Examples of non regular signatures

Examples of non-regular signatures

Only one stem for this signature:

  • ch.e.erial.erials.rimony.rons.uring

  • el.ezed.nce.reupon.ther


Prefixes

Prefixes

Just the same, in mirror-image style. Perform either on stems or on words.


English prefixes

English prefixes

.NULL.re.       8
.NULL.dis.      7
.NULL.de.       4
.NULL.un.       4
.NULL.con.      3
.NULL.en.       3
.NULL.al.       3
.NULL.t.        3
.NULL.con.ex.   2


French prefix signatures

French prefix signatures

NULL.d'.l'.     NULL.d'.
NULL.l'.        NULL.dé.
NULL.re.        d'.l'.
NULL.qu'.       NULL.par.
NULL.en.        NULL.in.
NULL.di.        NULL.com.
NULL.s'.        NULL.l'en.
NULL.n'.        NULL.cou.
NULL.pro.       NULL.ent.
NULL.ré.        NULL.d'.s'.


Now use mdl to fix problems repair heuristics

Now use MDL to fix problems: Repair heuristics

Problems that arise:

1. “ments” problem: a suffix may really be two suffixes.

2. ted.ting.ts: a letter which occurs stem finally with high frequency may get wrongly parsed

(e.g., shou-ted, shou-ting, shou-ts).



3. Spurious signatures

4. Misplaced word-breaks


Repair heuristics using mdl

Repair heuristics: using MDL

We could compute the entire description length in one state of the morphology; make a change; compute the whole description length in the proposed (modified) state; and compare the two lengths:

(Original morphology + compressed data) < or > (Revised morphology + compressed data)?



But it’s better to have a more thoughtful approach.

Let’s define

Then the size of the punctuation for the 3 lists is:

Then the change of the size of the punctuation in the lists:


Size of the suffix component remember

Size of the suffix component, remember:

Change in its size when we

consider a modification to the morphology:

1. Global effects of change of number of suffixes;

2. Effects of the change on the size of suffixes present in both states;

3. Suffixes present only in state 1;

4. Suffixes present only in state 2;


Suffix component change

Suffix component change:

  • Global effect of change on all suffixes

  • Suffixes whose counts change

  • Contribution of suffixes that appear only in State 1

  • Contribution of suffixes that appear only in State 2



Entropy, MDL, and morphology

Why using MDL is closely related to measuring the complexity of the space of possible vocabularies


Consider the space of all words of length L, built from an alphabet of size b.

How many ways are there to build a vocabulary of size N? Call that U(b,L,N).

Clearly, since there are $b^L$ possible words and we choose N distinct ones,

$U(b,L,N) = \binom{b^L}{N}$


Compare that operation (choosing a set of N words of length L, alphabet size b) with the operation of choosing a set of T stems (of length t) and a set of F suffixes (of length f), where t + f = L.

If we take the complexity of each task to be measured by the log of its size, then we’re asking the size of:



is easy to approximate, however.

remember:


The number of bits needed to list all the words: the analysis.

The length of all the pointers to all the words: the compressed corpus.

Thus the log of the number of vocabularies = description length of that vocabulary, in the terms we’ve been using.


That means that the difference in the sizes of the spaces of possible vocabularies is equal to the difference in the description length in the two cases; hence,

Difference between the complexity of the “simplex word” analysis and the complexity of the analyzed-word analysis

= log U(b,L,N) - log U(b,t,T) - log U(b,f,F)

= (Difference in size of morphologies) + (Difference in size of compressed data)
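To get a feel for the sizes involved, the sketch below evaluates log U in bits, taking U(b,L,N) to be the number of ways of choosing N distinct strings out of the b^L possible ones (a binomial coefficient). The function names and the example numbers are hypothetical, and `math.lgamma` makes the result a floating-point approximation.

```python
import math

def log2_choose(n, k):
    """log2 of the binomial coefficient C(n, k), computed via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)

def log2_U(b, L, N):
    """log2 of the number of vocabularies of N distinct words of length L over b letters."""
    return log2_choose(b ** L, N)

# Hypothetical sizes: 10,000 words of length 8, versus 2,000 stems of length 5
# combined with 5 suffixes of length 3 (so t + f = L and T * F = N).
simplex = log2_U(26, 8, 10_000)
analyzed = log2_U(26, 5, 2_000) + log2_U(26, 3, 5)
print(simplex, analyzed, simplex - analyzed)
```

With these made-up numbers the unanalyzed vocabulary is far more expensive to specify than the stem set and suffix set together, which is the saving the morphological analysis is meant to capture.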



But we’ve (over)simplified in this case by ignoring the frequencies inherent in real corpora. What’s of great interest in real life is the fact that some suffixes are used often, others rarely, and similarly for stems.



We know something about the distribution of words, but nothing about distribution of stems and especially suffixes.

But suppose we wanted to think about the statistics of vocabulary choice in which words could be selected more than once….


We want to select N words of length L, and the same word can be selected more than once. How many ways of doing this are there?

These are like bosons: you can have any number of occurrences of a word, and 2 sets of the same number of them are indistinguishable. How many such vocabularies are there, then?


where Z(i) is the number of words of frequency i. (‘Z’ stands for “Zipf”.)

We don’t know much about frequencies of suffixes, but Zipf’s law says that

hence for a morpheme set that obeyed the Zipf distribution:



End

