Induction of a simple morphology for highly inflecting languages


  • nyky+ratkaisu+i+sta+mme

  • kahvi+n+juo+ja+lle+kin

  • tietä+isi+mme+kö+hän

  • open+mind+ed+ness

  • un+believ+able

Induction of a Simple Morphology for Highly-Inflecting Languages

{Mathias.Creutz, [email protected]

Current Themes in Computational Phonology and Morphology, 7th Meeting of the ACL Special Interest Group in Computational Phonology, ACL-2004.

Barcelona, 26 July 2004


Goals and challenges

  • Learn representations of

    • the smallest meaningful units of language (morphemes)

    • and their interaction

    • in an unsupervised manner from raw text

    • making as general and language-independent assumptions as possible.

  • Evaluate

    • against a given gold-standard morphological analysis of word forms

      • first step: learn and evaluate a morpheme segmentation of word forms

    • integrated in NLP applications (speech recognition)

Mathias Creutz


Focus: Agglutinative morphology

  • Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):

    • kahvi + n + juo + ja + lle + kin

    • (coffee + of + drink + -er + for + also)

    • nyky + ratkaisu + i + sta + mme

    • (current + solution + -s + from + our)

    • tietä + isi + mme + kö + hän

    • (know + would + we + INTERR + indeed)

  • Huge number of different possible word forms

  • Important to know the inner structure of words in NLP

  • The number of morphemes per word varies greatly



1. MDL model (Creutz & Lagus, 2002) (inspired by work of, e.g., J. Goldsmith)

Learning from data

  • Morph lexicon: ”invent” a set of distinct strings = morphs

    • α = tä, β = ssä, γ = pala, δ = peli, ε = on, θ = tuhat, ζ = a

  • Corpus / word list: pick morphs from the lexicon and place them in a sequence

    • α β γ δ β ε θ γ ζ

    • tä ssä pala peli ssä on tuhat pala a

  • Aim at the most concise representation possible
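The two-part idea on this slide (a morph lexicon plus a corpus written as references to it) can be sketched as a toy cost function. This is a minimal illustration, not the coding scheme of the cited model: the function name `description_length`, the fixed `bits_per_char` cost, and the end-of-morph marker are all assumptions made for the example.

```python
import math
from collections import Counter

def description_length(segmented_corpus, bits_per_char=5.0):
    """Toy two-part cost in bits: lexicon cost + corpus cost.

    segmented_corpus: list of words, each a list of morph strings.
    bits_per_char is an assumed flat cost per letter (hypothetical value).
    """
    morph_counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(morph_counts.values())
    # Lexicon: spell out each distinct morph once, plus one end-of-morph symbol.
    lexicon_bits = sum((len(m) + 1) * bits_per_char for m in morph_counts)
    # Corpus: each morph token costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in morph_counts.values())
    return lexicon_bits + corpus_bits

whole = [["openminded"], ["reopened"], ["opening"],
         ["reopen"], ["opened"], ["minded"]]
split = [["open", "mind", "ed"], ["re", "open", "ed"], ["open", "ing"],
         ["re", "open"], ["open", "ed"], ["mind", "ed"]]
print(description_length(whole), description_length(split))
```

On this toy word list the segmented version is cheaper, because a few short morphs are reused across many words and the smaller lexicon outweighs the longer morph sequence.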


2. Probabilistic formulation (Creutz, 2003) (inspired by work of, e.g., M. R. Brent and M. G. Snover)

  • Morph lexicon: ”invent” a set of distinct strings = morphs

    • Length prior

    • Frequency prior

    • α = tä, β = ssä, γ = pala, δ = peli, ε = on, θ = tuhat, ζ = a

  • Corpus / word list: pick morphs from the lexicon and place them in a sequence

    • α β γ δ β ε θ γ ζ

    • tä ssä pala peli ssä on tuhat pala a
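One way to read the probabilistic formulation is as a maximum a posteriori search over lexicons. The sketch below is only an assumption based on the slide's labels: it treats the lexicon prior as a product of a length term and a frequency term per morph, and the symbols (the lexicon \mathcal{L}, a morph \mu with frequency f_\mu) are introduced here for illustration. The exact prior distributions are not given on the slide.

```latex
\hat{\mathcal{L}}
  = \arg\max_{\mathcal{L}} P(\mathcal{L} \mid \text{corpus})
  = \arg\max_{\mathcal{L}} P(\text{corpus} \mid \mathcal{L}) \, P(\mathcal{L}),
\qquad
P(\mathcal{L}) \propto \prod_{\mu \in \mathcal{L}}
  P_{\text{length}}(|\mu|) \, P_{\text{freq}}(f_\mu)
```

Taking the negative base-2 logarithm of the right-hand side turns the product into a sum of code lengths, which is how this formulation relates to the MDL model of solution 1.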


Reflections on solutions 1 and 2

  • ”Dumb” text compression algorithms

    • Common substrings of words appear as one segment, even when they have compositional structure, e.g.:

      • keskustelussa (keskustel + u + ssa; ”discuss+ion in”)

      • biggest (bigg + est)

    • Rare substrings of words are split, even when they have no compositional structure, e.g.:

      • a + den + auer (Adenauer; German politician)

      • in + s + an + e (in + sane)

    • Too weak structural constraints, e.g., suffixes recognized at the beginning of words:

      • s + can (scan)



3. Category-learning probabilistic model

  • Word structure captured by a regular expression:

    • word = ( prefix* stem suffix* )+

  • Morph sequences (words) are generated by a Hidden Markov model:

    • # nyky ratkaisu i sta mme #

    • Transition probs, e.g., p(STM | PRE), p(SUF | SUF)

    • Emission probs, e.g., p(’nyky’ | PRE), p(’mme’ | SUF)
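The word-structure constraint and the HMM scoring can be sketched together as follows. This is an illustrative sketch only: the single-letter tag encoding, the function names `is_valid` and `log_prob`, and the dictionary layout of the probability tables are assumptions for the example, and any probability values passed in are placeholders, not figures from the presentation.

```python
import math
import re

# Assumed encoding: one letter per category so a tag sequence becomes a string.
TAG_CODE = {"PRE": "p", "STM": "s", "SUF": "x"}
# word = ( prefix* stem suffix* )+
WORD_STRUCTURE = re.compile(r"(?:p*sx*)+")

def is_valid(tags):
    """True if the category sequence matches ( prefix* stem suffix* )+."""
    return WORD_STRUCTURE.fullmatch("".join(TAG_CODE[t] for t in tags)) is not None

def log_prob(morphs, tags, transition, emission):
    """Log-probability of a tagged morph sequence under a first-order HMM.

    transition[(a, b)] = p(b | a), with '#' as the word boundary;
    emission[(morph, tag)] = p(morph | tag). Both tables are supplied by the caller.
    """
    if not is_valid(tags):
        return float("-inf")
    path = ["#"] + list(tags) + ["#"]
    logp = sum(math.log(transition[(a, b)]) for a, b in zip(path, path[1:]))
    logp += sum(math.log(emission[(m, t)]) for m, t in zip(morphs, tags))
    return logp

print(is_valid(["PRE", "STM", "SUF", "SUF", "SUF"]))  # nyky ratkaisu i sta mme
print(is_valid(["SUF", "STM"]))                       # suffix at the start of a word
```

Rejecting sequences that violate the regular expression is what rules out analyses such as s + can, where a suffix would begin the word.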


Category algorithm

1. Start with an existing baseline morph segmentation (Creutz, 2003):

nyky + rat + kaisu + ista + mme

2. Initialize category membership probs for each morph, e.g., p(PRE | ’nyky’). Assume asymmetries between the categories:


Initialization of category membership probs

  • Introduce a noise category for cases where none of the proper classes is likely:

  • Distribute remaining probability mass proportionally, e.g.,



Category algorithm (continued)

1. Start with an existing baseline morph segmentation:

nyky + rat + kaisu + ista + mme

2. Initialize category membership probs for each morph.

3. Tag morphs as prefix, stem, suffix, or noise. Then run EM on taggings:

nyky + rat + kaisu + ista + mme

4. Split morphs that consist of other known morphs. Then EM:

nyky + rat + kaisu + i + sta + mme

5. Join noise morphs with their neighbours. Then EM:

nyky + ratkaisu + i + sta + mme
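Step 4 can be illustrated with a small helper that checks whether a morph is the concatenation of two other morphs already in the lexicon. This is a simplified sketch: the name `split_into_known` is hypothetical, and it returns the first split it finds, whereas the slides suggest the model evaluates such splits probabilistically before re-running EM.

```python
def split_into_known(morph, lexicon):
    """Return (left, right) if `morph` is two known morphs glued together, else None.

    lexicon: a set of morph strings. The first matching split point wins,
    which is an assumption made to keep the example short.
    """
    for i in range(1, len(morph)):
        left, right = morph[:i], morph[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None

lexicon = {"nyky", "rat", "kaisu", "ista", "i", "sta", "mme"}
print(split_into_known("ista", lexicon))
print(split_into_known("nyky", lexicon))
```

With this toy lexicon, ista is recognised as i + sta, matching the change between steps 3 and 4 above, while nyky is left intact.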


Experiments

[Figure: stems believ, hop, liv, mov, us paired with suffixes e, ed, es, ing]

  • Algorithms

    • Baseline model (Bayesian formulation)

    • Category-Learning model

    • Goldsmith’s ”Linguistica” (MDL formulation)

  • Data

    • Finnish data sets (CSC + STT)

      • 10 000 words, 50 000 words, 250 000 words, 16 million words

    • English data sets (Brown corpus)

      • 10 000 words, 50 000 words, 250 000 words



”Gold standard” used in evaluation

  • Morpheme segmentation obtained for Finnish and English words

    • by processing the output of Two-level morphology analyzers (FINTWOL and ENGTWOL by Lingsoft, Inc.)

  • Some ”fuzzy morpheme boundaries” allowed

    • mainly stem-final alternation, treated as a seam (joint) that may belong to either the stem or the suffix, e.g.,

      • Windsori + n or Windsor + in; Windsore + i + lla or Windsor + ei + lla (cf. Windsor)

      • invite + s or invit + es; invite or invit + e (cf. invit + ing)

  • Compute precision and recall of correctly discovered morpheme boundaries
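Boundary precision and recall can be computed as in the sketch below. It is a minimal illustration under stated assumptions: the function names are hypothetical, each word has exactly one gold segmentation, and the ”fuzzy morpheme boundaries” described above are not handled.

```python
def boundaries(morphs):
    """Character offsets of the morph boundaries inside one word."""
    positions, offset = set(), 0
    for m in morphs[:-1]:
        offset += len(m)
        positions.add(offset)
    return positions

def precision_recall(proposed, gold):
    """proposed, gold: lists of segmentations (lists of morphs), aligned word by word."""
    hits = n_proposed = n_gold = 0
    for p, g in zip(proposed, gold):
        bp, bg = boundaries(p), boundaries(g)
        hits += len(bp & bg)       # boundaries found in both analyses
        n_proposed += len(bp)
        n_gold += len(bg)
    precision = hits / n_proposed if n_proposed else 1.0
    recall = hits / n_gold if n_gold else 1.0
    return precision, recall

proposed = [["un", "believable"], ["open", "mind", "ed", "ness"]]
gold = [["un", "believ", "able"], ["open", "mind", "ed", "ness"]]
print(precision_recall(proposed, gold))
```

Here every proposed boundary is correct, but the boundary in believ + able is missed, so precision is perfect while recall is lower.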



Results (evaluated against the gold-standard)

[Figure: result plots comparing Baseline, Categories and Linguistica at data set sizes from 10k to 16M words]


Discussion

  • The Category algorithm

    • overcomes many of the shortcomings of the Baseline algorithm

      • excessive or too little segmentation

      • suffixes in beginning of words

    • generalizes more than Linguistica, e.g.,

      • allus + ion + s (Categories) vs. allusions (Linguistica)

      • Dem+i (Categories) vs. Demi (Linguistica)

    • invents its own solutions

      • aihe+e+sta vs. aihe+i+sta (”about [the] topic/-s”)

      • phrase, phrase+s, phrase+d



Future directions

  • The Category algorithm could be expressed more elegantly

    • not as a post-processing procedure making use of a baseline segmentation

  • Segmentation into morphs is useful

    • e.g., n-gram language modeling in speech recognition

  • Detection of allomorphy, i.e., segmentation into morphemes would be even more useful

    • e.g., information retrieval (?)



Public demo

  • A demo of the baseline and category-learning algorithm is available on the Internet at http://www.cis.hut.fi/projects/morpho/.

  • Test it on your own Finnish or English input!



Search for the optimal segmentation of the words in a corpus

  • Randomly shuffle words

  • Recursive binary splitting

    • words: opening, openminded, reopened, conferences

    • intermediate splits: reopen, minded

    • morphs: mind, open, re, ed

  • Convergence of description length?

    • no: repeat

    • yes: Done
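The search loop on this slide (shuffle, recursive binary splitting, check convergence of the description length) can be sketched as below. This is a toy version under several assumptions: the cost function is a simplified two-part code with an assumed flat `bits_per_char`, all function names are hypothetical, each distinct word counts once, and the cost is recomputed from scratch for every candidate split, which would be far too slow for real corpora.

```python
import math
import random
from collections import Counter

def morphs_of(word, cuts):
    """Turn a set of cut positions into the list of morphs of `word`."""
    edges = [0] + sorted(cuts) + [len(word)]
    return [word[a:b] for a, b in zip(edges, edges[1:])]

def cost(analyses, bits_per_char=5.0):
    """Toy description length in bits of an iterable of morph lists."""
    counts = Counter(m for morphs in analyses for m in morphs)
    total = sum(counts.values())
    lexicon_bits = sum((len(m) + 1) * bits_per_char for m in counts)
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits

def total_cost(cuts_by_word):
    return cost(morphs_of(w, c) for w, c in cuts_by_word.items())

def split_span(word, start, end, cuts_by_word):
    """Recursive binary splitting of word[start:end] under the current model."""
    cuts = cuts_by_word[word]
    best_cost, best_cut = total_cost(cuts_by_word), None
    for pos in range(start + 1, end):
        cuts.add(pos)                      # try a cut, score it, undo it
        candidate = total_cost(cuts_by_word)
        cuts.remove(pos)
        if candidate < best_cost:
            best_cost, best_cut = candidate, pos
    if best_cut is not None:               # keep the best cut, recurse on both halves
        cuts.add(best_cut)
        split_span(word, start, best_cut, cuts_by_word)
        split_span(word, best_cut, end, cuts_by_word)

def segment(words, max_epochs=20, seed=0):
    rng = random.Random(seed)
    words = sorted(set(words))
    cuts_by_word = {w: set() for w in words}
    previous = total_cost(cuts_by_word)
    for _ in range(max_epochs):
        rng.shuffle(words)
        for w in words:
            old_cuts, before = cuts_by_word[w], total_cost(cuts_by_word)
            cuts_by_word[w] = set()        # re-analyse the word from scratch
            split_span(w, 0, len(w), cuts_by_word)
            if total_cost(cuts_by_word) > before:
                cuts_by_word[w] = old_cuts  # keep the earlier analysis if it was better
        current = total_cost(cuts_by_word)
        if previous - current < 1e-9:      # description length has converged
            break
        previous = current
    return {w: morphs_of(w, cuts_by_word[w]) for w in words}

words = ["opening", "openminded", "reopened", "conferences",
         "reopen", "minded", "opened", "open"]
for word, morphs in sorted(segment(words).items()):
    print(word, "->", " + ".join(morphs))
```

Because a word is only re-segmented when that does not raise the total cost, the description length never increases from one pass to the next, so the convergence test terminates the loop.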

