
Variant definitions of pointer length in MDL

Aris Xanthos, Yu Hu, and John Goldsmith

University of Chicago


Degrees of freedom in MDL modeling

  • MDL does not specify the form of the grammar being inferred.

  • Carl de Marcken (1996)

  • There are alternatives to pointers for representing connections.

  • Different representations may lead to different grammars.


Linguistica (Goldsmith 2001)

  • A sample signature: { walk, jump, ... } { ed, ing, ... }

  • Website: linguistica.uchicago.edu

  • Data: corpus segmented into words

  • Model:

    • List of stems

    • List of suffixes

    • List of signatures


Reminder: MDL analysis

  • Corpus C

  • 2 or more competing models describing C

  • Model M assigns a probability to C : pr(C | M)

  • Compressed length of C given M :

    L(C | M) = - log₂ pr(C | M)

  • Length of model M : L( M )

  • Description length of C given M :

    DL(C | M) = L(C | M) + L( M )
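The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the probabilities and model lengths for the two competing models are hypothetical, chosen so that the arithmetic is easy to follow.

```python
import math


def compressed_length(prob_of_corpus: float) -> float:
    """L(C | M) = -log2 pr(C | M), in bits."""
    return -math.log2(prob_of_corpus)


def description_length(prob_of_corpus: float, model_length: float) -> float:
    """DL(C | M) = L(C | M) + L(M), in bits."""
    return compressed_length(prob_of_corpus) + model_length


# Hypothetical numbers: model A is short but fits the corpus poorly,
# model B is longer but assigns the corpus a higher probability.
dl_a = description_length(prob_of_corpus=2 ** -40, model_length=10.0)
dl_b = description_length(prob_of_corpus=2 ** -20, model_length=25.0)

# MDL prefers the model with the smaller total description length.
best = "A" if dl_a < dl_b else "B"
print(dl_a, dl_b, best)
```

With these made-up values, the longer model B wins because its gain in compressed corpus length outweighs its extra model length.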


Learning process

  • Bootstrapping heuristic: word = stem + suffix

  • Successive heuristics propose modifications.

  • MDL sanctions modifications.

  • Compute L( corpus | model ) + L( model ) before and after modification.

  • If it results in a decrease in DL, retain modification, otherwise discard it.
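The accept-or-discard rule described above could be sketched as a greedy loop. The sketch below is not Linguistica's code: `mdl_search`, the toy integer "model", and the toy description-length function are all hypothetical stand-ins that only show the control flow.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Model = TypeVar("Model")


def mdl_search(
    model: Model,
    proposals: Iterable[Callable[[Model], Model]],
    description_length: Callable[[Model], float],
) -> Tuple[Model, float]:
    """Apply each proposed modification; keep it only if DL decreases."""
    best_dl = description_length(model)
    for propose in proposals:
        candidate = propose(model)
        candidate_dl = description_length(candidate)
        if candidate_dl < best_dl:
            # The modification shortens the description: retain it.
            model, best_dl = candidate, candidate_dl
        # Otherwise the modification is discarded and the model is unchanged.
    return model, best_dl


# Toy stand-ins (assumptions): the "model" is an integer and its
# description length is its distance from 3.
toy_proposals = [lambda m: m + 1, lambda m: m + 1, lambda m: m + 5, lambda m: m + 1]
final_model, final_dl = mdl_search(0, toy_proposals, lambda m: abs(m - 3))
print(final_model, final_dl)
```

In the toy run, the third proposal increases the description length and is discarded; the others are retained.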


Length of the morphology

  • L( morphology ) = sum of the lengths of lists (stems, suffixes, signatures)

  • Length of a list = sum of the lengths of elements in it + small cost for list structure

  • Length of a stem / suffix is proportional to the number of symbols in it.
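As a rough illustration of how the lengths of the stem and suffix lists add up, the sketch below makes two assumptions that are not stated on the slide: each symbol costs log₂ 26 bits (a uniform code over a 26-letter alphabet), and the "small cost for list structure" is a fixed placeholder of 1 bit. The signature list is left out here because its elements are pointers rather than strings.

```python
import math

BITS_PER_SYMBOL = math.log2(26)  # assumption: uniform cost over 26 letters
LIST_OVERHEAD = 1.0              # assumption: placeholder for the list-structure cost


def morpheme_length(morpheme: str) -> float:
    """Length of a stem or suffix, proportional to its number of symbols."""
    return BITS_PER_SYMBOL * len(morpheme)


def list_length(morphemes) -> float:
    """Sum of the lengths of the elements plus a small cost for the list."""
    return sum(morpheme_length(m) for m in morphemes) + LIST_OVERHEAD


stems = ["walk", "jump", "great"]
suffixes = ["ed", "ing", "est"]
stems_and_suffixes_length = list_length(stems) + list_length(suffixes)
print(round(stems_and_suffixes_length, 2))
```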


Length of the morphology (2)

  • A signature specifies that a set of stems associates with a set of suffixes:

    [Figure: a signature { walk, jump, ... } { ed, ing, ... } whose entries point into the list of stems { walk, jump, great, ... } and the list of suffixes { ed, ing, est, ... }]


Length of the morphology (3)

  • A pointer is a symbol that stands for a given morpheme.

  • The information content of a pointer to a morpheme m is - log₂ pr(m).

  • The more probable the morpheme, the smaller the cost of a pointer to it.
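A small sketch of the pointer cost, assuming that pr(m) is estimated by relative frequency (the slides do not say how the probability is estimated) and using made-up suffix counts:

```python
import math
from collections import Counter


def pointer_costs(tokens):
    """Return -log2 pr(m) in bits for each morpheme m,
    with pr(m) estimated as relative frequency (an assumption)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {m: -math.log2(c / total) for m, c in counts.items()}


# Hypothetical suffix tokens: "ed" occurs twice as often as "ing" or "est".
costs = pointer_costs(["ed"] * 4 + ["ing"] * 2 + ["est"] * 2)
print(costs)
```

Here the most frequent suffix gets a 1-bit pointer, while each of the rarer ones costs 2 bits.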


Length of the morphology (4)

  • Length of signature = sum of lengths of 2 lists of pointers (to stems and to suffixes)

  • Length of each list = sum of information cost of pointers in it + small cost for list structure
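The length of a signature under the pointer encoding might be computed as follows. The probabilities are hypothetical, and the list-structure cost is again an assumed placeholder of 1 bit per list.

```python
import math

LIST_OVERHEAD = 1.0  # assumption: placeholder for the list-structure cost


def pointer_list_length(morphemes, prob) -> float:
    """Sum of the information cost of the pointers plus the list overhead."""
    return sum(-math.log2(prob[m]) for m in morphemes) + LIST_OVERHEAD


def signature_length(stems, suffixes, stem_prob, suffix_prob) -> float:
    """A signature is two lists of pointers: one to stems, one to suffixes."""
    return (pointer_list_length(stems, stem_prob)
            + pointer_list_length(suffixes, suffix_prob))


# Hypothetical morpheme probabilities.
stem_prob = {"walk": 0.5, "jump": 0.25, "great": 0.25}
suffix_prob = {"ed": 0.5, "ing": 0.25, "est": 0.25}

length = signature_length(["walk", "jump"], ["ed", "ing"], stem_prob, suffix_prob)
print(length)  # (1 + 2 + 1) + (1 + 2 + 1) bits
```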


Compressed length of the corpus

[Figure: the corpus ("walking in the...") shown next to the morphology: signatures { } { } pointing into the list of stems { walk, jump, great, ... } and the list of suffixes { ed, ing, est, ... }]


Compressed length of the corpus (2)

  • Compressed length of a word w =

    information content of pointer to signature σ

    + information content of pointer to stem t given σ

    + information content of pointer to suffix f given σ

    = - log₂ pr(σ) - log₂ pr(t | σ) - log₂ pr(f | σ)

  • L( corpus | morphology ) = sum of the compressed lengths of the individual words
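The word and corpus lengths above translate directly into code. In this sketch each word is represented by a hypothetical triple (pr(σ), pr(t | σ), pr(f | σ)); the values are invented for illustration.

```python
import math


def word_length(pr_sig: float, pr_stem_given_sig: float, pr_suffix_given_sig: float) -> float:
    """-log2 pr(sigma) - log2 pr(t | sigma) - log2 pr(f | sigma), in bits."""
    return (-math.log2(pr_sig)
            - math.log2(pr_stem_given_sig)
            - math.log2(pr_suffix_given_sig))


def corpus_length(analyses) -> float:
    """L(corpus | morphology): sum of the compressed lengths of the words."""
    return sum(word_length(*a) for a in analyses)


# Hypothetical analyses of two word tokens.
analyses = [
    (0.5, 0.25, 0.5),  # 1 + 2 + 1 = 4 bits
    (0.5, 0.5, 0.5),   # 1 + 1 + 1 = 3 bits
]
total_bits = corpus_length(analyses)
print(total_bits)
```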


Alternatives to pointers

  • There are alternatives to pointers for representing connections in the morphology.

    [Figure: a signature σ represented as a binary string aligned with the list of (all) stems { walk, jump, great, ..., chin }; the string shown is 1, 1, 0, ..., 0]


List of pointers vs. binary strings

  • The number of symbols in a binary string is constant and equal to the total number of stems.

  • The information content of the string depends on the distribution of 0's and 1's in it:

    total number of stems × entropy of string
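A minimal sketch of this quantity, assuming that "entropy of string" means the entropy of the empirical distribution of 0's and 1's in the string (the slide does not spell this out):

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a 0/1 distribution with pr(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def binary_string_length(bits) -> float:
    """Total number of stems times the entropy of the string."""
    n = len(bits)
    if n == 0:
        return 0.0
    return n * binary_entropy(sum(bits) / n)


print(binary_string_length([1, 1, 0, 0]))  # balanced string: 4 symbols x 1 bit
print(binary_string_length([1, 0, 0, 0]))  # skewed string: fewer bits
print(binary_string_length([0, 0, 0, 0]))  # constant string: no information
```

A balanced string is the most expensive case; the more lopsided the mix of 0's and 1's, the cheaper the string.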


Expected difference in DL

  • Theoretical inference (see details in paper):

  • Binary strings are shorter when:

    • the distribution of stems tends to be uniform

    • the distribution of the number of stems being pointed to tends to be uniform

  • Lists of pointers are shorter when:

    • the distribution of stems departs from uniformity

    • the average number of stems being pointed to is small
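Two of these conditions can be illustrated numerically. The sketch below is not the derivation from the paper: it compares the two encodings of a single signature's stem set on invented stem distributions, ignores the list-structure cost, and assumes the binary string is costed as the number of stems times the entropy of its 0/1 distribution.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a 0/1 distribution with pr(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def pointer_list_bits(selected, stem_prob) -> float:
    """Cost of a list of pointers: sum of -log2 pr(stem)."""
    return sum(-math.log2(stem_prob[s]) for s in selected)


def binary_string_bits(selected, stem_prob) -> float:
    """Cost of a binary string over all stems: n times the entropy of the string."""
    n = len(stem_prob)
    return n * binary_entropy(len(selected) / n)


# Case 1 (hypothetical): 8 equally probable stems, signature points to 4 of them.
uniform = {f"stem{i}": 1 / 8 for i in range(8)}
half = [f"stem{i}" for i in range(4)]

# Case 2 (hypothetical): one very frequent stem, signature points only to it.
skewed = {"walk": 0.5, **{f"rare{i}": 0.5 / 7 for i in range(7)}}
one = ["walk"]

print(pointer_list_bits(half, uniform), binary_string_bits(half, uniform))
print(pointer_list_bits(one, skewed), binary_string_bits(one, skewed))
```

In the first case the binary string is shorter (8 bits against 12); in the second the single pointer to a frequent stem is shorter (1 bit against roughly 4.3).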


A specific example

  • Current state of the morphology:

    [Figure: { walk, jump, ... } { ed, ing } and { walks, broke, ... } { }]

  • Proposed modification: walks = walk + s


A specific example (2)

  • State of the morphology after modification:

    [Figure: { ..., jump, ... } { ed, ing }, { walk } { ed, ing, s }, and { walks, broke, ... } { }]

  • Cost: pointers to ed, ing and s

  • Savings: the string walks, a pointer to it
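To make the trade-off concrete, the sketch below tallies only the two items listed above under the pointer encoding. Every number in it is hypothetical: the morpheme probabilities are invented, and each symbol is assumed to cost log₂ 26 bits. A full evaluation would also include the change in the compressed length of the corpus.

```python
import math

BITS_PER_SYMBOL = math.log2(26)  # assumption: uniform cost over 26 letters


def pointer_bits(prob: float) -> float:
    """Information content of a pointer: -log2 pr(m)."""
    return -math.log2(prob)


# Hypothetical probabilities of the items involved.
pr = {"ed": 0.25, "ing": 0.25, "s": 0.125, "walks": 0.0625}

# Cost: pointers to ed, ing and s.
cost = sum(pointer_bits(pr[m]) for m in ("ed", "ing", "s"))

# Savings: the string "walks" and a pointer to it.
savings = BITS_PER_SYMBOL * len("walks") + pointer_bits(pr["walks"])

change_in_model_length = cost - savings
accept = change_in_model_length < 0
print(cost, round(savings, 2), accept)
```

With these made-up values the modification shortens the morphology, so this part of the description length favors it.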


Crucial difference

  • The compressed length of binary strings is independent of the frequency of the items being pointed to.

  • This encoding does not favor the creation of pointers to frequent items (or the deletion of pointers to rare items).
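The difference can be seen on a toy stem list. In this sketch the stem probabilities are invented, list overhead is ignored, and the binary string is assumed to cost the number of stems times the entropy of its 0/1 distribution.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a 0/1 distribution with pr(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def binary_string_bits(n_stems: int, n_selected: int) -> float:
    """Depends only on how many stems are selected, not on which ones."""
    return n_stems * binary_entropy(n_selected / n_stems)


def pointer_list_bits(selected, stem_prob) -> float:
    """Depends on the probability of each stem being pointed to."""
    return sum(-math.log2(stem_prob[s]) for s in selected)


# Hypothetical stem probabilities.
stem_prob = {"walk": 0.5, "jump": 0.25, "great": 0.125, "chin": 0.125}
frequent = ["walk", "jump"]
rare = ["great", "chin"]

print(pointer_list_bits(frequent, stem_prob))  # cheap: frequent stems
print(pointer_list_bits(rare, stem_prob))      # expensive: rare stems
print(binary_string_bits(len(stem_prob), 2))   # same for either pair
```

The pointer list is twice as expensive for the rare pair as for the frequent pair, whereas the binary string costs the same in both cases.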


Conclusion

  • There is more than one way of representing the connections between items in a grammar.

  • The choice of a representation can have important consequences for the grammar being induced.

  • Mathematical details can be found in the paper.