# Variant definitions of pointer length in MDL


Aris Xanthos, Yu Hu, and John Goldsmith

University of Chicago

Degrees of freedom in MDL modeling

• MDL does not specify the form of the grammar being inferred.

• Carl de Marcken (1996)

• There are alternatives to pointers for representing connections.

• Different representations may lead to different grammars.

Linguistica (Goldsmith 2001)

A sample signature: { walk, jump, ... } { ed, ing, ... }

• Website: linguistica.uchicago.edu

• Data: corpus segmented into words

• Model:

• List of stems

• List of suffixes

• List of signatures

• Corpus C

• 2 or more competing models describing C

• Model M assigns a probability to C : pr(C | M)

• Compressed length of C given M :

L(C | M) = -log₂ pr(C | M)

• Length of model M : L( M )

• Description length of C given M :

DL(C | M) = L(C | M) + L( M )
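The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the numbers for the two competing models are hypothetical, not taken from the slides. A real corpus probability would underflow a float, so a practical version would work with log probabilities directly.

```python
import math

def compressed_length(prob_corpus: float) -> float:
    """L(C | M) = -log2 pr(C | M), in bits."""
    return -math.log2(prob_corpus)

def description_length(prob_corpus: float, model_length: float) -> float:
    """DL(C | M) = L(C | M) + L(M), in bits."""
    return compressed_length(prob_corpus) + model_length

# Hypothetical models A and B describing the same corpus C.
# A is a smaller model that fits the corpus less well; B is larger and fits better.
dl_a = description_length(prob_corpus=2.0 ** -1000, model_length=300.0)
dl_b = description_length(prob_corpus=2.0 ** -900, model_length=450.0)

# MDL prefers the model with the smaller total description length.
best = "A" if dl_a < dl_b else "B"
print(dl_a, dl_b, best)
```

In this made-up case the better fit of B does not pay for its larger model, so A is preferred.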

• Bootstrapping heuristic: word = stem + suffix

• Successive heuristics propose modifications.

• MDL sanctions modifications.

• Compute L( corpus | model ) + L( model ) before and after modification.

• If the modification decreases DL, retain it; otherwise discard it.
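The accept-or-discard loop described above can be sketched as follows. This is a generic greedy search under stated assumptions, not Linguistica's actual code: `mdl_search`, the proposal functions, and the toy description length are all hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Model = TypeVar("Model")

def mdl_search(
    model: Model,
    proposals: Iterable[Callable[[Model], Model]],
    description_length: Callable[[Model], float],
) -> Tuple[Model, float]:
    """Apply each proposed modification; keep it only if DL decreases."""
    best_dl = description_length(model)
    for propose in proposals:
        candidate = propose(model)
        candidate_dl = description_length(candidate)
        if candidate_dl < best_dl:
            # The modification shortens the description: retain it.
            model, best_dl = candidate, candidate_dl
        # Otherwise the modification is discarded and the model is unchanged.
    return model, best_dl

# Toy stand-in: the "model" is an integer and DL is minimal at 3.
def toy_dl(x: int) -> float:
    return float((x - 3) ** 2)

toy_proposals = [
    lambda x: x + 1,  # accepted: DL 9 -> 4
    lambda x: x + 1,  # accepted: DL 4 -> 1
    lambda x: x + 5,  # rejected: DL 1 -> 16
    lambda x: x - 1,  # rejected: DL 1 -> 4
    lambda x: x + 1,  # accepted: DL 1 -> 0
]

final = mdl_search(0, toy_proposals, toy_dl)
print(final)
```

In a morphology learner, the model would be the lists of stems, suffixes and signatures, and each proposal would be one heuristic's suggested change.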

• L( morphology ) = sum of the lengths of lists (stems, suffixes, signatures)

• Length of a list = sum of the lengths of elements in it + small cost for list structure

• Length of a stem / suffix is proportional to the number of symbols in it.
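A minimal sketch of the list lengths just described, assuming a uniform cost of log₂ 26 bits per symbol and a flat 1-bit overhead per list. Both constants are placeholders for illustration; the slides only say "proportional" and "small cost". The signature list is left out here because its elements are pointers rather than symbol strings.

```python
import math

# Assumption: every symbol costs the same, as if drawn from a 26-letter alphabet.
BITS_PER_SYMBOL = math.log2(26)
# Assumption: placeholder for the "small cost for list structure".
LIST_OVERHEAD_BITS = 1.0

def morpheme_length(morpheme: str) -> float:
    """Length of a stem or suffix, proportional to its number of symbols."""
    return len(morpheme) * BITS_PER_SYMBOL

def list_length(morphemes: list) -> float:
    """Sum of the lengths of the elements plus the list-structure cost."""
    return sum(morpheme_length(m) for m in morphemes) + LIST_OVERHEAD_BITS

stems = ["walk", "jump", "great"]
suffixes = ["ed", "ing", "est"]

stems_bits = list_length(stems)
suffixes_bits = list_length(suffixes)
print(stems_bits, suffixes_bits)
```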

Length of the morphology (2)

[Figure: a signature { walk, jump, ... } { ed, ing, ... } shown alongside the list of stems { walk, jump, great, ... } and the list of suffixes { ed, ing, est, ... }.]

• A signature specifies that a set of stems is associated with a set of suffixes.

• A pointer is a symbol that stands for a given morpheme.

• The information content of a pointer to a morpheme m is -log₂ pr(m)

• The more probable the morpheme, the smaller the cost of a pointer to it.

• Length of signature = sum of lengths of 2 lists of pointers (to stems and to suffixes)

• Length of each list = sum of information cost of pointers in it + small cost for list structure
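The pointer cost and signature length can be sketched as below. The slides do not say how pr(m) is estimated, so this example assumes it is a relative frequency (count divided by a total); the counts, the total of 64, and the 1-bit list overhead are all hypothetical.

```python
import math

# Assumption: placeholder for the "small cost for list structure".
LIST_OVERHEAD_BITS = 1.0

def pointer_length(count: int, total: int) -> float:
    """Information content of a pointer: -log2 pr(m), with pr(m) = count / total."""
    return -math.log2(count / total)

def pointer_list_length(counts: list, total: int) -> float:
    """Sum of the pointer costs in one list plus the list-structure cost."""
    return sum(pointer_length(c, total) for c in counts) + LIST_OVERHEAD_BITS

def signature_length(stem_counts: list, suffix_counts: list, total: int) -> float:
    """A signature is two lists of pointers: one to stems, one to suffixes."""
    return (pointer_list_length(stem_counts, total)
            + pointer_list_length(suffix_counts, total))

# Hypothetical counts for the signature { walk, jump } { ed, ing }.
stem_counts = [8, 4]       # walk, jump
suffix_counts = [16, 32]   # ed, ing
sig_bits = signature_length(stem_counts, suffix_counts, total=64)
print(sig_bits)
```

The frequent suffix "ing" costs 1 bit to point to while the rarer stem "jump" costs 4 bits, which is the sense in which more probable morphemes get cheaper pointers.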

Compressed length of the corpus

[Figure: the corpus fragment "walking in the..." shown next to the morphology, with its signatures { }{ }, stems { walk, jump, great, ... } and suffixes { ed, ing, est, ... }.]

• Compressed length of a word w =

information content of pointer to signature σ

+ information content of pointer to stem t given σ

+ information content of pointer to suffix f given σ

= -log₂ pr(σ) - log₂ pr(t|σ) - log₂ pr(f|σ)

• L( corpus | morphology ) = sum of lengths of each individual word
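As a rough illustration of the word and corpus lengths above, the sketch below takes the three probabilities as given inputs. The function names and the probability values are hypothetical; how the probabilities are estimated is not specified on the slide.

```python
import math

def word_length(pr_sig: float, pr_stem_given_sig: float,
                pr_suffix_given_sig: float) -> float:
    """-log2 pr(sigma) - log2 pr(t | sigma) - log2 pr(f | sigma), in bits."""
    return (-math.log2(pr_sig)
            - math.log2(pr_stem_given_sig)
            - math.log2(pr_suffix_given_sig))

def corpus_length(words: list) -> float:
    """L(corpus | morphology): sum of the lengths of the individual words."""
    return sum(word_length(*w) for w in words)

# Hypothetical corpus of two word tokens, each given as
# (pr(sigma), pr(t | sigma), pr(f | sigma)).
corpus = [
    (0.25, 0.5, 0.5),    # 2 + 1 + 1 = 4 bits
    (0.25, 0.5, 0.125),  # 2 + 1 + 3 = 6 bits
]
total_bits = corpus_length(corpus)
print(total_bits)
```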

Alternatives to pointers

[Figure: a signature σ represented as a binary string (1 0 0 ...) over the list of (all) stems { walk, jump, great, ..., chin }.]

• There are alternatives to pointers for representing connections in the morphology.

• The number of symbols in a binary string is constant and equal to the total number of stems.

• The information content of the string depends on the distribution of 0's and 1's in it: it equals the total number of stems times the entropy of the string.

• Theoretical inference (see details in paper):

• Binary strings are shorter when:

• the distribution of stems tends to be uniform

• the distribution of the number of stems being pointed to tends to be uniform

• Lists of pointers are shorter when:

• the distribution of stems departs from uniformity

• the average number of stems being pointed to is small
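The comparison between the two encodings can be tried out numerically. The sketch below assumes 8 stems with uniform probabilities, ignores list-structure costs, and uses hypothetical function names; the general result is in the paper, and this only shows the two formulas side by side.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a string whose proportion of 1's is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def binary_string_length(n_stems: int, n_pointed: int) -> float:
    """Total number of stems times the entropy of the string."""
    return n_stems * binary_entropy(n_pointed / n_stems)

def pointer_list_length(stem_probs: list) -> float:
    """Sum of -log2 pr(stem) over the stems being pointed to."""
    return sum(-math.log2(p) for p in stem_probs)

N_STEMS = 8  # assumption: 8 stems, each with probability 1/8

# Signature pointing to half of the stems.
half_binary = binary_string_length(N_STEMS, 4)
half_pointers = pointer_list_length([1 / 8] * 4)

# Signature pointing to a single stem.
one_binary = binary_string_length(N_STEMS, 1)
one_pointer = pointer_list_length([1 / 8])

print(half_binary, half_pointers, one_binary, one_pointer)
```

With these numbers the binary string is shorter when half of the stems are pointed to, and the pointer list is shorter when only one stem is pointed to, in line with the last bullet above.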

A specific example

• Current state of the morphology:

[Figure: signature { walk, jump, ... } { ed, ing }, next to the list walks, broke, ...]

• Proposed modification: walks = walk + s

A specific example (2)

• State of the morphology after modification:

[Figure: signatures { ..., jump, ... } { ed, ing } and { walk } { ed, ing, s }, next to the list walks, broke, ...]

• Cost: pointers to ed, ing and s

• Savings: the string walks and a pointer to it

• The compressed length of binary strings is independent of the frequency of the items being pointed to.

• This encoding does not favor the creation of pointers to frequent items (or the deletion of pointers to rare items).
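A small numerical illustration of this point, assuming 4 stems and a signature that points only to the first one. The two stem distributions and the function names are hypothetical. The binary-string cost never takes stem probabilities as input, whereas the pointer cost does.

```python
import math

def binary_string_bits(n_stems: int, n_pointed: int) -> float:
    """Number of stems times the entropy of the 0/1 string (no stem frequencies)."""
    p = n_pointed / n_stems
    if p in (0.0, 1.0):
        return 0.0
    return n_stems * -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pointer_bits(prob: float) -> float:
    """Cost of one pointer: -log2 pr(m)."""
    return -math.log2(prob)

# Two hypothetical distributions over the same 4 stems.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.125, 0.125]

# The binary string has the same cost under either distribution.
binary_cost = binary_string_bits(n_stems=4, n_pointed=1)

# The pointer becomes cheaper when the stem it targets is more frequent.
pointer_uniform = pointer_bits(uniform[0])
pointer_skewed = pointer_bits(skewed[0])

print(binary_cost, pointer_uniform, pointer_skewed)
```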

• There is more than one way of representing the connections between items in a grammar.

• The choice of a representation can have important consequences for the grammar being induced.

• Mathematical details can be found in the paper.