Induction of a simple morphology for highly inflecting languages

  • nyky+ratkaisu+i+sta+mme

  • kahvi+n+juo+ja+lle+kin

  • tietä+isi+mme+kö+hän

  • open+mind+ed+ness

  • un+believ+able

Induction of a Simple Morphology for Highly-Inflecting Languages

{Mathias.Creutz, [email protected]

Current Themes in Computational Phonology and Morphology, 7th Meeting of the ACL Special Interest Group in Computational Phonology, ACL-2004.

Barcelona, 26 July 2004


Goals and challenges

  • Learn representations of

    • the smallest meaningful units of language (morphemes)

    • and their interaction

    • in an unsupervised manner from raw text

    • making as general and language-independent assumptions as possible.

  • Evaluate

    • against a given gold-standard morphological analysis of word forms

      • first step: learn and evaluate a morpheme segmentation of word forms

    • integrated in NLP applications (speech recognition)

Mathias Creutz


Focus: Agglutinative morphology

  • Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):

    • kahvi + n + juo + ja + lle + kin

    • (coffee + of + drink + -er + for + also)

    • nyky + ratkaisu + i + sta + mme

    • (current + solution + -s + from + our)

    • tietä + isi + mme + kö + hän

    • (know + would + we + INTERR + indeed)

  • Huge number of different possible word forms

  • Important to know the inner structure of words in NLP

  • The number of morphemes per word varies greatly



1. MDL model (Creutz & Lagus, 2002) (inspired by work of, e.g., J. Goldsmith)

  • Learning from data: aim at the most concise representation possible

  • Morph lexicon: ”invent” a set of distinct strings (morphs)

    • α = tä, β = ssä, γ = pala, δ = peli, ε = on, θ = tuhat, ζ = a

  • Corpus / word list: pick morphs from the lexicon and place them in a sequence

    • α β γ δ β ε θ γ ζ

    • tä ssä pala peli ssä on tuhat pala a
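
The idea of a concise two-part representation (lexicon plus corpus) can be illustrated with a toy description length computation. This is a simplified sketch, not the cost function of Creutz & Lagus (2002): it assumes a flat number of bits per letter in the lexicon (the hypothetical parameter `bits_per_char`) and codes each morph token in the corpus with the negative base-2 logarithm of its relative frequency.

```python
import math
from collections import Counter


def description_length(segmented_corpus, bits_per_char=5.0):
    """Toy two-part code length in bits: lexicon cost + corpus cost.

    segmented_corpus: list of words, each given as a list of morph strings.
    bits_per_char: assumed flat cost of spelling one letter in the lexicon.
    """
    tokens = [morph for word in segmented_corpus for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon: spell out every distinct morph once, plus one end marker each.
    lexicon_bits = sum((len(morph) + 1) * bits_per_char for morph in counts)
    # Corpus: each token points to a lexicon entry at a cost of -log2 p(morph).
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


# The example sentence from the slide, as whole words and as morphs.
whole_words = [["tässä"], ["palapelissä"], ["on"], ["tuhatpalaa"]]
morphs = [["tä", "ssä"], ["pala", "peli", "ssä"], ["on"], ["tuhat", "pala", "a"]]

dl_words = description_length(whole_words)
dl_morphs = description_length(morphs)
```

Under these assumptions the morph-based representation comes out slightly shorter than the whole-word one, because the reused morphs `ssä` and `pala` are spelled out only once in the lexicon.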



2. Probabilistic formulation (Creutz, 2003) (inspired by work of, e.g., M. R. Brent and M. G. Snover)

  • Morph lexicon: ”invent” a set of distinct strings (morphs), with a length prior and a frequency prior

    • α = tä, β = ssä, γ = pala, δ = peli, ε = on, θ = tuhat, ζ = a

  • Corpus / word list: pick morphs from the lexicon and place them in a sequence

    • α β γ δ β ε θ γ ζ

    • tä ssä pala peli ssä on tuhat pala a


Reflections on solutions 1 and 2

  • ”Dumb” text compression algorithms

    • Common substrings of words appear as one segment, even when they have compositional structure, e.g.,

      • keskustelussa (keskustel + u + ssa; ”discuss+ion in”)

      • biggest (bigg + est)

    • Rare substrings of words are split, even when there is no compositional structure, e.g.,

      • a + den + auer (Adenauer; German politician)

      • in + s + an + e (in + sane)

    • Structural constraints are too weak, e.g., suffixes are recognized at the beginning of words:

      • s + can (scan)



3. Category-learning probabilistic model

  • Word structure captured by a regular expression:

    • word = ( prefix* stem suffix* )+

  • Morph sequences (words) are generated by a Hidden Markov model:

    • Example morph sequence: # nyky ratkaisu i sta mme #

    • Transition probs, e.g., p(STM | PRE), p(SUF | SUF)

    • Emission probs, e.g., p(’nyky’ | PRE), p(’mme’ | SUF)
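
The word-structure expression and the HMM factorization into transition and emission probabilities can be sketched as follows. The one-letter tag encoding, the function names, and every probability value are assumptions made up for this illustration; they are not estimates from the presentation.

```python
import math
import re

# One letter per category tag: P = prefix, S = stem, F = suffix (assumed encoding).
TAG_CODE = {"PRE": "P", "STM": "S", "SUF": "F"}
WORD_PATTERN = re.compile(r"(?:P*SF*)+")  # word = ( prefix* stem suffix* )+


def is_valid_word(tags):
    """Check a category sequence against the word-structure expression."""
    encoded = "".join(TAG_CODE[tag] for tag in tags)
    return WORD_PATTERN.fullmatch(encoded) is not None


def log_prob(morphs, tags, transition, emission):
    """Log-probability of a tagged morph sequence under a first-order HMM.

    '#' serves as the word-boundary state at both ends of the word.
    """
    logp = 0.0
    prev = "#"
    for morph, tag in zip(morphs, tags):
        logp += math.log(transition[(prev, tag)])  # e.g. p(STM | PRE)
        logp += math.log(emission[(tag, morph)])   # e.g. p('nyky' | PRE)
        prev = tag
    return logp + math.log(transition[(prev, "#")])


morphs = ["nyky", "ratkaisu", "i", "sta", "mme"]
tags = ["PRE", "STM", "SUF", "SUF", "SUF"]

# Placeholder probabilities, invented for this example only.
transition = {
    ("#", "PRE"): 0.2, ("PRE", "STM"): 0.9, ("STM", "SUF"): 0.6,
    ("SUF", "SUF"): 0.5, ("SUF", "#"): 0.5,
}
emission = {
    ("PRE", "nyky"): 0.01, ("STM", "ratkaisu"): 0.001, ("SUF", "i"): 0.05,
    ("SUF", "sta"): 0.02, ("SUF", "mme"): 0.01,
}

valid = is_valid_word(tags)
score = log_prob(morphs, tags, transition, emission)
```

A tagging that starts with a suffix, such as `["SUF", "STM"]` for s + can, is rejected by `is_valid_word`, which is the kind of structural constraint the earlier models lacked.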



Category algorithm

1. Start with an existing baseline morph segmentation (Creutz, 2003):

nyky + rat + kaisu + ista + mme

2. Initialize category membership probs for each morph, e.g., p(PRE | ’nyky’). Assume asymmetries between the categories:


Initialization of category membership probs

  • Introduce a noise category for cases where none of the proper classes is likely:

  • Distribute remaining probability mass proportionally, e.g.,



Category algorithm (continued)

1. Start with an existing baseline morph segmentation:

nyky + rat + kaisu + ista + mme

2. Initialize category membership probs for each morph.

3. Tag morphs as prefix, stem, suffix, or noise. Then run EM on taggings:

nyky + rat + kaisu + ista + mme

4. Split morphs that consist of other known morphs. Then EM:

nyky + rat + kaisu + i + sta + mme

5. Join noise morphs with their neighbours. Then EM:

nyky + ratkaisu + i + sta + mme
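
Step 4 (splitting morphs that consist of other known morphs) can be sketched as below. This is a deliberately naive version: it takes the first two-way split whose parts are both in the lexicon, whereas the presentation does not say how competing splits are ranked. The function name `split_into_known` and the example lexicon are hypothetical.

```python
def split_into_known(morph, lexicon):
    """Return the first two-way split of `morph` into known morphs, else [morph]."""
    for i in range(1, len(morph)):
        left, right = morph[:i], morph[i:]
        if left in lexicon and right in lexicon:
            return [left, right]
    return [morph]


# Assumed lexicon: the baseline morphs plus 'i' and 'sta' seen in other words.
lexicon = {"nyky", "rat", "kaisu", "ista", "mme", "i", "sta"}
word = ["nyky", "rat", "kaisu", "ista", "mme"]

resplit = [part for morph in word for part in split_into_known(morph, lexicon)]
```

With this lexicon, `ista` is split into `i` and `sta` while the other morphs are left unchanged, matching the segmentation shown for step 4.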



Experiments

[Figure: the stems believ, hop, liv, mov, us paired with the suffixes e, ed, es, ing]

  • Algorithms

    • Baseline model (Bayesian formulation)

    • Category-Learning model

    • Goldsmith’s ”Linguistica” (MDL formulation)

  • Data

    • Finnish data sets (CSC + STT)

      • 10 000 words, 50 000 words, 250 000 words, 16 million words

    • English data sets (Brown corpus)

      • 10 000 words, 50 000 words, 250 000 words


”Gold standard” used in evaluation

  • Morpheme segmentation obtained for Finnish and English words

    • by processing the output of Two-level morphology analyzers (FINTWOL and ENGTWOL by Lingsoft, Inc.)

  • Some ”fuzzy morpheme boundaries” allowed

    • mainly stem-final alternation, treated as a seam or joint that may belong to either the stem or the suffix, e.g.,

      • Windsori + n or Windsor + in; Windsore + i + lla or Windsor + ei + lla (cf. Windsor)

      • invite + s or invit + es; invite or invit + e (cf. invit + ing)

  • Compute precision and recall of correctly discovered morpheme boundaries
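
Boundary precision and recall can be computed along the following lines. This is a minimal sketch that compares exact boundary positions only; it does not model the ”fuzzy morpheme boundaries” allowed in the gold standard, and the function names are illustrative.

```python
def boundaries(segmentation):
    """Return the set of character offsets at which morph boundaries fall."""
    positions, offset = set(), 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def precision_recall(proposed, gold):
    """Boundary precision and recall over parallel lists of segmentations."""
    correct = n_proposed = n_gold = 0
    for prop, ref in zip(proposed, gold):
        p, g = boundaries(prop), boundaries(ref)
        correct += len(p & g)
        n_proposed += len(p)
        n_gold += len(g)
    # Convention assumed here: an empty denominator counts as a perfect score.
    precision = correct / n_proposed if n_proposed else 1.0
    recall = correct / n_gold if n_gold else 1.0
    return precision, recall


proposed = [["nyky", "rat", "kaisu", "ista", "mme"]]
gold = [["nyky", "ratkaisu", "i", "sta", "mme"]]
precision, recall = precision_recall(proposed, gold)  # (0.75, 0.75)
```

In this example three of the four proposed boundaries are correct and three of the four gold boundaries are found, so both measures are 0.75.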



Results (evaluated against the gold standard)

[Figure: results for the Baseline, Categories, and Linguistica models, with the data set sizes 10k, 250k, and 16M marked]


Discussion

  • The Category algorithm

    • overcomes many of the shortcomings of the Baseline algorithm

      • excessive or too little segmentation

      • suffixes in beginning of words

    • generalizes more than Linguistica, e.g.,

      • allus + ion + s (Categories) vs. allusions (Linguistica)

      • Dem + i (Categories) vs. Demi (Linguistica)

    • invents its own solutions

      • aihe+e+sta vs. aihe+i+sta (”about [the] topic/-s”)

      • phrase, phrase+s, phrase+d


Future directions

  • The Category algorithm could be expressed more elegantly

    • not as a post-processing procedure making use of a baseline segmentation

  • Segmentation into morphs is useful

    • e.g., n-gram language modeling in speech recognition

  • Detection of allomorphy, i.e., segmentation into morphemes, would be even more useful

    • e.g., information retrieval (?)


Public demo

  • A demo of the baseline and category-learning algorithms is available on the Internet at http://www.cis.hut.fi/projects/morpho/.

  • Test it on your own Finnish or English input!



Search for the optimal segmentation of the words in a corpus

  • Randomly shuffle words

  • Recursive binary splitting

    • words: opening, openminded, reopened, conferences

    • intermediate splits: reopen, minded

    • morphs: mind, open, re, ed

  • Convergence of descr. length?

    • yes: Done

    • no: repeat
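
The recursive binary splitting step can be sketched as below. This is a simplified illustration under stated assumptions: the cost function `toy_cost` and its constants are invented here, and the cost is held fixed, whereas the search described on the slide is repeated over shuffled words until the description length converges.

```python
def split_recursively(word, cost):
    """Recursively split `word` in two wherever that lowers the total cost."""
    best_parts, best_cost = [word], cost(word)
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        candidate = cost(left) + cost(right)
        if candidate < best_cost:
            best_parts, best_cost = [left, right], candidate
    if len(best_parts) == 1:
        return best_parts  # no split is cheaper than keeping the string whole
    return (split_recursively(best_parts[0], cost)
            + split_recursively(best_parts[1], cost))


# Assumed toy cost: a known morph costs 1 unit (a pointer to the lexicon);
# an unknown string additionally pays 2 units per letter to be spelled out.
known_morphs = {"open", "mind", "ed", "re"}


def toy_cost(morph):
    return 1 if morph in known_morphs else 1 + 2 * len(morph)


segmentations = {
    word: split_recursively(word, toy_cost)
    for word in ["openminded", "reopened", "conferences"]
}
```

With this toy cost, `openminded` is first split into `open` and `minded`, and `minded` is then split into `mind` and `ed`; `conferences` stays whole because none of its parts is a known morph.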
