Automatic learning of morphology

John Goldsmith

July 2003

University of Chicago


Language learning: unsupervised learning

  • Not “theoretical” – but based on a theory with solid foundations.

  • Practical, real data.

  • Don’t wait till your grammars are written to start worrying about language learning. You don’t know what language learning is till you’ve tried it. (Like waiting till your French pronunciation is perfect before you start writing a phonology of the language.)


Automatic learning of morphology

  • What you need (to write a language-learning device) does not look like the stuff you codified in your grammar: segmentation and classification.


Maximize the probability of the data.

  • This leads to Minimum Description Length theory, which says:

    • Minimize the sum of:

      • Positive log probability of the data +

      • Length of the grammar

  • It thus leads to a non-cognitive foundation for a science of linguistics – if you happen to be interested in that. You do not need to be. I am.


Automatic learning of morphology

!!

  • Discovery of structure in the data is always equivalent to an increase in the probability that the model assigns to that data.

  • The devil is in the details.


Classes to come:

  • Tuesday: the basics of probability theory, the treatment and learning of phonotactics, and their role in nativization and alternations.

  • Thursday: MDL and the discovery of “chunks” in an unbroken string of data.


Looking ahead

  • Probability involves a set of non-negative numbers that sum to 1.

  • Logarithms of numbers between 0 and 1 are negative, so we shift our attention to -1 times the log (the positive log: plog).


Automatic learning of morphology

  • 2^3 = 8, so log2(8) = 3.

  • 2^4 = 16, so log2(16) = 4.

  • 2^10 = 1024, so log2(1024) = 10.

  • 2^-1 = 1/2, so log2(1/2) = -1.

  • 2^-2 = 1/4, so log2(1/4) = -2.

  • 2^-10 = 1/1024, so log2(1/1024) = -10.


Plog (positive logs)

  • These numbers get bigger when the fraction gets smaller (closer to zero).

  • They get smaller when the fraction gets bigger (closer to 1). Since we want big fractions (high probability), we want small plogs.

  • The plog is also the length of the compressed form of a word. When you use WinZip, the length of the file is the sum of a lot of plogs for all the words (not exactly words, really, but close); a sketch follows.
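To make the arithmetic concrete, here is a minimal Python sketch (mine, not from the talk; the toy corpus is invented) that computes plogs from word counts and sums them into the ideal compressed length of the corpus:

    import math
    from collections import Counter

    corpus = "the dog jumps the dog jumped the cat sings".split()
    counts = Counter(corpus)
    total = sum(counts.values())

    def plog(word):
        # plog = -log2(relative frequency): small for frequent words, large for rare ones
        return -math.log2(counts[word] / total)

    for w in sorted(counts):
        print(w, round(plog(w), 2))

    # ideal compressed length of the whole corpus: the sum of the plogs of its tokens
    print("total:", round(sum(plog(w) for w in corpus), 1), "bits")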


Evolution of computational applications involving language


Learning

  • The relationship between data and grammar.

  • The goal is to create a device that learns aspects of language, given data: a little linguist in a tin box.

  • Today: morphological structure.


Linguistica

  • A C++ program that runs under Windows, available from my homepage:

    http://humanities.uchicago.edu/faculty/goldsmith/

    There are explanations and other downloads available there.


Automatic learning of morphology

Technical description in Computational Linguistics (June 2001): “Unsupervised Learning of the Morphology of a Natural Language”.


Overview

  • Look at Linguistica in action:

    English, French

  • Why do this?

  • What is the theory behind it?

  • What are the heuristics that make it work?

  • Where do we go from here?


Linguistica

  • A program that takes in a text in an “unknown” language…

  • and produces a morphological analysis:

  • a list of stems, prefixes, suffixes;

  • more deeply embedded morphological structure;

  • regular allomorphy


Automatic learning of morphology

[Diagram: raw data → Linguistica → analyzed data]


Automatic learning of morphology

[Screenshot of Linguistica: one pane lists stems, affixes, signatures, etc.; another shows messages from the analyst to the user; menus give actions and outlines of information.]


Read a corpus

  • Brown corpus: 1,200,000 words of typical English

  • French Encarta

  • or anything else you like, in a text file.

  • First set the number of words you want read, then select the file.


Automatic learning of morphology

List of stems

A stem’s signature is the list of suffixes it appears with in the corpus, in alphabetical order.

stem        signature    words
abilit      ies.y        abilities, ability
abolition                abolition
absen       ce.t         absence, absent
absolute    NULL.ly      absolute, absolutely


Automatic learning of morphology

List of signatures


Automatic learning of morphology

Signature: NULL.ed.ing.s

for example:

account   accounted   accounting   accounts
add       added       adding       adds
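Here is a minimal sketch of reading signatures off a word list once the stem/suffix cuts are known (the hard part, which the rest of the talk addresses); the cut list below is an invented example:

    from collections import defaultdict

    # (stem, suffix) cuts assumed already found; NULL marks the bare stem
    cuts = [("account", "NULL"), ("account", "ed"), ("account", "ing"), ("account", "s"),
            ("add", "NULL"), ("add", "ed"), ("add", "ing"), ("add", "s")]

    suffixes_of = defaultdict(set)
    for stem, suffix in cuts:
        suffixes_of[stem].add(suffix)

    # a stem's signature: its suffixes, alphabetized and joined with '.'
    signatures = defaultdict(list)
    for stem, sufs in suffixes_of.items():
        signatures[".".join(sorted(sufs))].append(stem)

    for sig, stems in signatures.items():
        print(sig, "->", stems)   # NULL.ed.ing.s -> ['account', 'add']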


Automatic learning of morphology

More sophisticated signature…

Signature <e>ion.NULL

composite     concentrate   corporate    détente
discriminate  evacuate      inflate      opposite
participate   probate       prosecute    tense

What is this?

composite and composition:

composite → composit → composit + ion

It infers that ion deletes a stem-final ‘e’ before attaching.


Top signatures in English


Top signatures in French

In French, we find that the outermost layer of morphology is not so interesting: it’s mostly é, e, and s. But we can get inside the morphology of the resulting stems, and get the roots:


French roots


Let’s look at the program.


Why do this?

  • (It is a lot of fun.)

  • It can be of practical use: stemming for information retrieval, analysis for statistically-based machine translation.

  • It clarifies what the task of language acquisition is.


Language acquisition

  • It’s been suggested that (since language acquisition seems to be dauntingly, impossibly hard) it must require prior (innate) knowledge.

  • Let’s choose a task where innate knowledge cannot plausibly be appealed to, and see

    (i) if the task is still extremely difficult, and

    (ii) what kind of language acquisition device could be capable of dealing with the problem.


Learning of morphology

  • The nature of morphology acquisition becomes clearer not by reducing the number of possible analyses of the data, but by better understanding the formal character of knowledge and learning.


Over-arching theory

  • The selection of a grammar, given the data, is an optimization problem.

    (this has nothing to do with Optimality theory, which does not optimize any function! Optimization means finding a maximum or minimum – remember calculus?)

    Minimum Description Length provides us with a means for understanding grammar selection as minimizing a function. (We’ll get to MDL in a moment)


What’s being minimized by writing a good morphology?

  • Number of letters, for one:

  • compare:


Automatic learning of morphology

Naive Minimum Description Length

Corpus:

jump, jumps, jumping
laugh, laughed, laughing
sing, sang, singing
the, dog, dogs

total: 61 letters

Analysis:

Stems: jump laugh sing sang dog (20 letters)
Suffixes: s ing ed (6 letters)
Unanalyzed: the (3 letters)

total: 29 letters

Notice that the description length goes UP if we analyze sing into s+ing.
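A sketch of the naive letter count, using the corpus and analysis from the slide (the code itself is mine):

    corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
              "sing", "sang", "singing", "the", "dog", "dogs"]
    stems = ["jump", "laugh", "sing", "sang", "dog"]
    suffixes = ["s", "ing", "ed"]
    unanalyzed = ["the"]

    def letters(words):
        # naive description length: just count the letters
        return sum(len(w) for w in words)

    print("corpus:  ", letters(corpus), "letters")    # 61
    print("analysis:", letters(stems) + letters(suffixes) + letters(unanalyzed),
          "letters")                                  # 29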


Minimum Description Length (MDL)

  • Jorma Rissanen 1989

  • The best “theory” of a set of data is the one which is simultaneously:

    • 1. most compact or concise, and

    • 2. provides the best modeling of the data

  • “Most compact” can be measured in bits, using information theory

  • “Best modeling” can also be measured in bits…


Essence of MDL


Description Length =

  • Conciseness: Length of the morphology. It’s almost as if you count up the number of symbols in the morphology (in the stems, the affixes, and the rules).

  • Length of the modeling of the data. We want a measure which gets bigger as the morphology is a worse description of the data.

  • Add these two lengths together = Description Length


Conciseness of the morphology

Conciseness of the morphology

Sum all the letters, plus all the structure inherent in the description, using information theory.

The essence of what you need to know from information theory is this: mentioning an object can be modeled by a pointer to that object, whose length (complexity) is equal to -1 times the log of its frequency.

But why you should care about -log(freq(x)) is much less obvious.


Conciseness of stem list and suffix list

Each entry in the stem list or suffix list costs:

  • its letters: (number of letters in the stem or suffix) × l, where l = number of bits per letter (< 5 for English); plus

  • the cost of setting up the entity: the length of its pointer, in bits.


Signature list length

The signature list is a list of pointers to signatures; its length is the sum of the lengths of those pointers. <X> indicates the number of distinct elements in X.


Length of the modeling of the data

Probabilistic morphology: the measure:

  • -1 * log probability(data)

    where the morphology assigns a probability to any data set.

    This is known in information theory as the optimal compressed length of the data (given the model).


Probability of a data set?

A grammar can be used not (just) to specify what is grammatical and what is not, but to assign a probability to each string (or structure).

If we have two grammars that assign different probabilities, then the one that assigns a higher probability to the observed data is the better one.


Automatic learning of morphology

This follows from the basic principle of rationality in the Universe:

Maximize the probability of the observed data.


From all this, it follows:

There is an objective answer to the question: which of two analyses of a given set of data is better? (modulo the differences between different universal Turing machines)

However, there is no general, practical guarantee of being able to find the best analysis of a given set of data.

Hence, we need to think of (this sort of) linguistics as being divided into two parts:


Automatic learning of morphology

  • An evaluator (which computes the Description Length); and

  • A set of heuristics, which create grammars from data, and which propose modifications of grammars, in the hopes of improving the grammar.

    (Remember, these “things” are mathematical things: algorithms.)


Let’s get back down to Earth

  • Why is this problem so hard at first?

  • Because figuring out the best analysis of any given word generally requires having figured out the rough outlines of the whole overall morphology. (The same is true for other parts of the grammar!)

    How do we start?


Automatic learning of morphology

  • We’ll modify a suggestion made by Zellig Harris (1955, 1967, 1979[1968]). Harris always believed this would work.

  • It doesn’t, but it’s clever and it’s a good start – but only that.


Zellig Harris: successor frequency

Successor frequency of “jum”: 2

    p (jump, jumping, jumps, jumped, jumpy)
    b (jumble)

Successor frequency of “jump”: 5

    e (jumped)
    i (jumping)
    s (jumps)
    y (jumpy)
    # (jump)
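A sketch of the successor-frequency computation over the jump-family lexicon from the slide ('#' marks end-of-word):

    lexicon = ["jump", "jumps", "jumping", "jumped", "jumpy", "jumble"]

    def successors(prefix, words):
        # the distinct letters (or '#' for end-of-word) that follow prefix
        return {w[len(prefix)] if len(w) > len(prefix) else "#"
                for w in words if w.startswith(prefix)}

    for prefix in ["jum", "jump"]:
        s = successors(prefix, lexicon)
        print(prefix, len(s), sorted(s))
    # jum 2 ['b', 'p']
    # jump 5 ['#', 'e', 'i', 's', 'y']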


Automatic learning of morphology

Zellig Harris: successor frequency

Successor frequencies at each prefix of “accepting” (the nth number is the successor frequency of the first n letters):

 a   c   c   e   p   t   i   n   g
19   9   6   3   1   3   1   1

predicted break: accept + ing (after “accept” the successor frequency rises again to 3: able, ing, …)

The continuations behind these counts include acceptable, accepting; accelerate, accented, accident, acclaim, accommodate, accredited, accused.


Automatic learning of morphology

Zellig Harris: successor frequency

After “de” the successor frequency is high (18: debate, debuting, decade, december, decide, dedicate, deduce, deduct, deep, defeat, dead, …), so a break is predicted after de-.

Bad predictions (successor frequency 5 after “dea”):

    dea + d, f, l, n, t (dead, deaf, deal, dean, death)

Good predictions (successor frequency 3 after “def”: e, i, r):

    de + feat, fend, fer (defeat, defend, defer); de + ficit, ficiency (deficit, deficiency); de + fraud (defraud)


Automatic learning of morphology

Zellig Harris: successor frequencies

 c   o   n   s   e   r   v   a   t   i   v   e   s
 9  18  11   6   4   1   2   1   1   2   1   1

(the nth number is the successor frequency of the first n letters)

Of the cuts these peaks predict, one is right and two are wrong.


The problem with Harris’ approach

It cannot distinguish between

  • phonological freedom due to phonological patterns (C after V, V after C), and

  • phonological freedom due to morphological patterns (…any morpheme after a +…).

    But that is exactly the problem it is supposed to solve.


(problems…)

  • It can’t deal with cases where several suffixes begin with the same letter(s).

  • E.g. the French forms donna, donnais, donnait:

    Analysis based on successor frequency: donna + NULL, is, it

    Correct analysis: donn + a, ais, ait


But as a boot-strapping method to construct a first approximation of the signatures:

  • Harris’ method is pretty good.

  • We accept only stems of 5 letters or more;

  • only cuts where the SuccFreq is > 1 and where the neighboring SuccFreqs are 1 (sketched below).
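A sketch of that bootstrap criterion (the laugh-family word list is an invented example; the minimum stem length and the peak test follow the slide):

    def succ_freq(prefix, words):
        # number of distinct letters (or end-of-word '#') that follow prefix
        return len({w[len(prefix)] if len(w) > len(prefix) else "#"
                    for w in words if w.startswith(prefix)})

    def propose_cuts(word, words, min_stem=5):
        cuts = []
        for i in range(min_stem, len(word)):
            # accept a cut only at a peak: SuccFreq > 1 here, 1 on both neighbors
            if (succ_freq(word[:i], words) > 1
                    and succ_freq(word[:i - 1], words) == 1
                    and succ_freq(word[:i + 1], words) == 1):
                cuts.append((word[:i], word[i:]))
        return cuts

    words = ["laugh", "laughed", "laughing", "laughs", "laughter"]
    print(propose_cuts("laughing", words))   # [('laugh', 'ing')]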


Let’s look at how the work is done (in the abstract), step by step...


Automatic learning of morphology

[Flowchart, built up across several slides:]

  • Pick a large corpus from a language -- 5,000 to 1,000,000 words.

  • Feed it into the “bootstrapping” heuristic, out of which comes a preliminary morphology, which need not be superb.

  • Feed that morphology to the incremental heuristics; out comes a modified morphology.

  • Is the modification an improvement? Ask MDL!

  • If it is an improvement, replace the morphology (the old one goes to the garbage).

  • Send it back to the incremental heuristics again.

  • Continue until there are no improvements to try. (A sketch of this loop follows.)
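In code, the loop just pictured is a simple greedy search; here is a sketch in which bootstrap, heuristics, and description_length are stand-ins for the real Linguistica machinery, not its API:

    def learn(corpus, bootstrap, heuristics, description_length):
        # start from the bootstrap heuristic's preliminary morphology
        morphology = bootstrap(corpus)
        improved = True
        while improved:
            improved = False
            for heuristic in heuristics:
                candidate = heuristic(morphology, corpus)
                # MDL is the arbiter: keep a modification only if it
                # shortens the total description length
                if (description_length(candidate, corpus)
                        < description_length(morphology, corpus)):
                    morphology = candidate
                    improved = True
        return morphology

Greedy hill-climbing of this kind finds a local minimum of the description length; as noted earlier, nothing guarantees finding the best analysis.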


The details of learning morphology

  • There is nothing sacred about the particular choice of heuristic steps I have chosen…


Steps

  • Successor Frequency: strict

  • Extend signatures to cases where a word is composed of a known stem and a known suffix.

  • Loose fit: using 1st order MDL for new signatures

  • Check signatures: Using MDL to find best stem/suffix cut. (More on this…)

  • Smooth stems


100,000 tokens, 12,208 types


Check signatures

  • on/ve → ion/ive

  • an/en → man/men

  • l/tion → al/ation

  • m/t → alism/alist, etc.

    How?


Check signatures

  • Signature l/tion with stems:

    federa, inaugura, orienta, substantia

    We need to compute the Description Length of the analysis as it stands versus as it would be if we shifted varying parts of the stems to the suffixes.


Automatic learning of morphology

Current description length is roughly:

the total length of the letters in the stems, converted to bits (by a factor of how many bits per letter), PLUS

the sum of the pointer-lengths to the suffixes – each pointer is of length -log(frequency).
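A sketch of that comparison for the l/tion versus al/ation choice; the 5 bits-per-letter figure comes from the earlier slide, while the simplified frequency model (pointer length = -log2 of the suffix's share of occurrences) is my assumption:

    import math

    def description_length(stems, suffix_counts, bits_per_letter=5):
        stem_bits = bits_per_letter * sum(len(t) for t in stems)
        suffix_bits = bits_per_letter * sum(len(f) for f in suffix_counts)
        total = sum(suffix_counts.values())
        # one pointer per suffix occurrence, each of length -log2(frequency)
        pointer_bits = sum(n * -math.log2(n / total)
                           for n in suffix_counts.values())
        return stem_bits + suffix_bits + pointer_bits

    a = description_length(["federa", "inaugura", "orienta", "substantia"],
                           {"l": 4, "tion": 4})
    b = description_length(["feder", "inaugur", "orient", "substanti"],
                           {"al": 4, "ation": 4})
    print(round(a), round(b))   # 188 183: shifting 'a' to the suffixes wins here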


Allomorphy

  • Find relations among stems: find principles of allomorphy, like

    “delete stem-final e before –ing” on the grounds that this simplifies the collection of Signatures:

    Compare the signatures

    • NULL.ing, and

    • e.ing.


NULL.ing and e.ing

  • NULL.ing: its stems do not end in –e

  • ing almost never appears after stem-final e.

  • So e.ing and NULL.ing can both be subsumed under:

  • <e>ing.NULL, where <e>ing means a suffix ing which deletes a preceding e.


More precisely:

  • Find a signature of the form L.X, where L is a letter. Check that no stems end with L.

  • See if another signature NULL.X exists, none of whose stems end in L.

  • Clean up and extend. (A sketch of the check follows.)
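A sketch of that check; the signature representation (a frozenset of suffixes mapped to its stems) and the toy data are mine:

    def can_merge(signatures, letter, x):
        # can signatures L.X and NULL.X be subsumed under <L>X.NULL?
        lx_stems = signatures.get(frozenset({letter, x}), [])      # e.g. e.ing
        nullx_stems = signatures.get(frozenset({"NULL", x}), [])   # e.g. NULL.ing
        stems = list(lx_stems) + list(nullx_stems)
        # both signatures must exist, and no stem may end in the letter
        return (bool(lx_stems) and bool(nullx_stems)
                and not any(t.endswith(letter) for t in stems))

    sigs = {frozenset({"e", "ing"}): ["lov", "mov"],
            frozenset({"NULL", "ing"}): ["walk", "jump"]}
    print(can_merge(sigs, "e", "ing"))   # True: propose <e>ing.NULL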


Find layers of affixation

  • Find roots (from among the Stem collection)


Where do we go from here?

  • Identifying suffixes through syntactic behavior (→ syntax)

  • Better allomorphy (→ phonology)

  • Languages with more morphemes per word (“rich” morphology)


Automatic learning of morphology

  • “Using eigenvectors of the bigram graph to infer grammatical features and categories” (Belkin & Goldsmith 2002)


Method

  • Build a graph in which “similar” words are adjacent;

  • Compute the normalized Laplacian of that graph;

  • Compute the eigenvectors with the lowest non-zero eigenvalues;

  • Plot them. (Sketched below.)
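A sketch with numpy; the four-node graph stands in for the real word-adjacency graph:

    import numpy as np

    # adjacency matrix of a tiny undirected "similar words" graph (a stand-in)
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    # normalized Laplacian: L = I - D^(-1/2) A D^(-1/2)
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

    # eigh returns eigenvalues in ascending order; the eigenvectors with the
    # smallest non-zero eigenvalues supply the coordinates to plot
    vals, vecs = np.linalg.eigh(L)
    x, y = vecs[:, 1], vecs[:, 2]
    print(np.round(vals, 3))
    print(np.round(np.column_stack([x, y]), 3))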


Map 1,000 English words by left-hand neighbors

[Scatter plot of word clusters:]

  • ?: and, to, in, that, for, he, as, with, on, by, at, or, from…

  • finite verbs: was, had, has, would, said, could, did, might, went, thought, told, knew, took, asked…

  • world, way, same, united, right, system, city, case, church, problem, company, past, field, cost, department, university, rate, door, …

  • non-finite verbs: be, do, go, make, see, get, take, say, put, find, give, provide, keep, run…


Map 1,000 English words by right-hand neighbors

[Scatter plot of word clusters:]

  • prepositions: of, in, for, on, by, at, from, into, after, through, under, since, during, against, among, within, along, across, including, near

  • adjectives: social, national, white, local, political, personal, private, strong, medical, final, black, French, technical, nuclear, British


The End