
loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Induction of a Simple Morphology for Highly-Inflecting Languages' - lauren


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
induction of a simple morphology for highly inflecting languages

Induction of a Simple Morphology for Highly-Inflecting Languages

  • nyky+ratkaisu+i+sta+mme
  • kahvi+n+juo+ja+lle+kin
  • tietä+isi+mme+kö+hän
  • open+mind+ed+ness
  • un+believ+able

{Mathias.Creutz, Krista.Lagus}@hut.fi

Current Themes in Computational Phonology and Morphology, 7th Meeting of the ACL Special Interest Group in Computational Phonology, ACL-2004.

Barcelona, 26 July 2004

Goals and challenges
  • Learn representations of
    • the smallest meaningful units of language (morphemes)
    • and their interaction
    • in an unsupervised manner from raw text
    • making as general and language-independent assumptions as possible.
  • Evaluate
    • against a given gold-standard morphological analysis of word forms
      • first step: learn and evaluate a morpheme segmentation of word forms

    • integrated in NLP applications (speech recognition)

Mathias Creutz


Focus: Agglutinative morphology

  • Finnish words often consist of lengthy sequences of morphemes (stems, suffixes, and prefixes):
    • kahvi + n + juo + ja + lle + kin
    • (coffee + of + drink + -er + for + also)
    • nyky + ratkaisu + i + sta + mme
    • (current + solution + -s + from + our)
    • tietä + isi + mme + kö + hän
    • (know + would + we + INTERR + indeed)
  • Huge number of different possible word forms
  • Important to know the inner structure of words in NLP
  • The number of morphemes per word varies widely


1. MDL model (Creutz & Lagus, 2002) (inspired by work of, e.g., J. Goldsmith)

  • Learning from data: aim at the most concise representation possible.
  • Morph lexicon: "invent" a set of distinct strings (morphs).
  • Corpus / word list: pick morphs from the lexicon and place them in a sequence.

[Figure: a morph lexicon with the entries α = tä, β = ssä, γ = pala, δ = peli, ε = on, θ = tuhat, ζ = a, and a corpus encoded as the sequence α β γ δ β ε θ γ ζ, i.e., tä ssä pala peli ssä on tuhat pala a.]
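To make the "most concise representation" idea concrete, here is a minimal sketch of a two-part description length: the cost of spelling out the morph lexicon plus the cost of encoding the corpus as a sequence of morph tokens. The flat per-letter cost (`BITS_PER_CHAR`) and the function name are assumptions made for illustration; the slides do not specify the coding scheme actually used.

```python
import math
from collections import Counter

BITS_PER_CHAR = 5.0  # assumption: flat cost of spelling one letter in the lexicon


def description_length(segmented_corpus):
    """Two-part cost in bits: morph lexicon + corpus encoded as morph tokens."""
    tokens = [morph for word in segmented_corpus for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon cost: every distinct morph is spelled out once.
    lexicon_cost = sum(len(morph) * BITS_PER_CHAR for morph in counts)
    # Corpus cost: each token costs -log2 of its relative frequency.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_cost + corpus_cost


# The example corpus from the slide, unsegmented and segmented.
unsplit = [["tässä"], ["palapelissä"], ["on"], ["tuhat"], ["palaa"]]
split = [["tä", "ssä"], ["pala", "peli", "ssä"], ["on"], ["tuhat"], ["pala", "a"]]

unsplit_cost = description_length(unsplit)
split_cost = description_length(split)
print(f"unsplit: {unsplit_cost:.1f} bits, split: {split_cost:.1f} bits")
```

Under these assumptions the segmented corpus is cheaper, because the reused morphs ssä and pala are stored in the lexicon only once.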


2. Probabilistic formulation (Creutz, 2003) (inspired by work of, e.g., M. R. Brent and M. G. Snover)

  • Morph lexicon: "invent" a set of distinct strings (morphs).
  • Priors on the lexicon: a length prior and a frequency prior.
  • Corpus / word list: pick morphs from the lexicon and place them in a sequence.

[Figure: a morph lexicon with the entries α = tä, β = ssä, γ = pala, δ = peli, ε = on, θ = tuhat, ζ = a, annotated with the length prior and the frequency prior, and a corpus encoded as the sequence α β γ δ β ε θ γ ζ, i.e., tä ssä pala peli ssä on tuhat pala a.]
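The role of a length prior can be sketched as follows: the score of a segmentation is the log prior of the lexicon plus the log likelihood of the corpus. The slide only names a "length prior" and a "frequency prior"; the gamma-shaped length prior, its parameter values, and the omission of the frequency prior below are all simplifying assumptions, not the formulation from the cited work.

```python
import math
from collections import Counter


def log_length_prior(length, shape=4.0, scale=2.0):
    """Log density of a gamma distribution at the morph length (assumed prior)."""
    return ((shape - 1.0) * math.log(length) - length / scale
            - shape * math.log(scale) - math.lgamma(shape))


def log_posterior(segmented_corpus):
    """Unnormalized log posterior: length prior on the lexicon + corpus likelihood."""
    tokens = [morph for word in segmented_corpus for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Prior: one term per distinct morph (the frequency prior is left out here).
    log_prior = sum(log_length_prior(len(morph)) for morph in counts)
    # Likelihood: morph tokens drawn independently with their relative frequencies.
    log_likelihood = sum(c * math.log(c / total) for c in counts.values())
    return log_prior + log_likelihood


split = [["tä", "ssä"], ["pala", "peli", "ssä"], ["on"], ["tuhat"], ["pala", "a"]]
print(log_posterior(split))
```

With these illustrative parameters the prior peaks at a length of six letters, so very short and very long morphs are both penalized.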

Reflections on solutions 1 and 2
  • "Dumb" text compression algorithms
    • Common substrings of words appear as one segment, even when they have compositional structure, e.g.,
      • keskustelussa (keskustel + u + ssa; "discuss + ion in")
      • biggest (bigg + est)
    • Rare substrings of words are split, even when they have no compositional structure, e.g.,
      • a + den + auer (Adenauer; German politician)
      • in + s + an + e (in + sane)
    • Structural constraints are too weak, e.g., suffixes are recognized at the beginning of words:
      • s + can (scan)


3. Category-learning probabilistic model

  • Word structure captured by a regular expression:
    • word = ( prefix* stem suffix* )+
  • Morph sequences (words) are generated by a hidden Markov model (HMM) with
    • transition probabilities, e.g., p(STM | PRE) and p(SUF | SUF)
    • emission probabilities, e.g., p('nyky' | PRE) and p('mme' | SUF)

[Figure: the morph sequence # nyky ratkaisu i sta mme #, where # marks a word boundary.]
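The two ingredients of this slide can be sketched in a few lines: a check of the word-structure pattern over category tags, and the log probability of a tagged morph sequence under an HMM. All probability values below are invented for illustration, and the helper names (`is_valid_word`, `log_prob`) are hypothetical.

```python
import math
import re

# One letter per category so the slide's pattern can be checked with a regex.
TAG_LETTER = {"PRE": "P", "STM": "S", "SUF": "F"}
WORD_PATTERN = re.compile(r"(?:P*SF*)+")  # word = ( prefix* stem suffix* )+


def is_valid_word(tags):
    """True if the tag sequence matches ( prefix* stem suffix* )+."""
    letters = "".join(TAG_LETTER[tag] for tag in tags)
    return WORD_PATTERN.fullmatch(letters) is not None


def log_prob(morphs, tags, transitions, emissions):
    """Log probability of a tagged morph sequence; '#' is the word boundary."""
    states = ["#"] + list(tags) + ["#"]
    logp = sum(math.log(transitions[(a, b)]) for a, b in zip(states, states[1:]))
    logp += sum(math.log(emissions[(t, m)]) for m, t in zip(morphs, tags))
    return logp


# Toy parameters (assumed values, not estimates from any data).
TRANSITIONS = {("#", "PRE"): 0.2, ("PRE", "STM"): 0.9, ("STM", "SUF"): 0.6,
               ("SUF", "SUF"): 0.5, ("SUF", "#"): 0.4}
EMISSIONS = {("PRE", "nyky"): 0.01, ("STM", "ratkaisu"): 0.001,
             ("SUF", "i"): 0.05, ("SUF", "sta"): 0.02, ("SUF", "mme"): 0.03}

morphs = ["nyky", "ratkaisu", "i", "sta", "mme"]
tags = ["PRE", "STM", "SUF", "SUF", "SUF"]
print(is_valid_word(tags), log_prob(morphs, tags, TRANSITIONS, EMISSIONS))
```

A tagging such as suffix followed by stem, as in s + can, is rejected by the pattern, which is the kind of structural constraint the earlier models lacked.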


Category algorithm

1. Start with an existing baseline morph segmentation (Creutz, 2003):

nyky + rat + kaisu + ista + mme

2. Initialize category membership probs for each morph, e.g., p(PRE | 'nyky'). Assume asymmetries between the categories.

Initialization of category membership probs
  • Introduce a noise category for cases where none of the proper classes is likely.
  • Distribute the remaining probability mass proportionally.


Category algorithm (continued)

1. Start with an existing baseline morph segmentation:

nyky + rat + kaisu + ista + mme

2. Initialize category membership probs for each morph.

3. Tag morphs as prefix, stem, suffix, or noise. Then run EM on taggings:

nyky + rat + kaisu + ista + mme

4. Split morphs that consist of other known morphs. Then run EM:

nyky + rat + kaisu + i + sta + mme

5. Join noise morphs with their neighbours. Then run EM:

nyky + ratkaisu + i + sta + mme


Experiments
  • Algorithms
    • Baseline model (Bayesian formulation)
    • Category-Learning model
    • Goldsmith's "Linguistica" (MDL formulation)
  • Data
    • Finnish data sets (CSC + STT)
      • 10 000 words, 50 000 words, 250 000 words, 16 million words
    • English data sets (Brown corpus)
      • 10 000 words, 50 000 words, 250 000 words

[Figure: the stems believ, hop, liv, mov, us shown together with the suffixes e, ed, es, ing.]

"Gold standard" used in evaluation
  • Morpheme segmentation obtained for Finnish and English words
    • by processing the output of two-level morphology analyzers (FINTWOL and ENGTWOL by Lingsoft, Inc.)
  • Some "fuzzy morpheme boundaries" allowed
    • mainly for stem-final alternation, which is treated as a seam (joint) that may belong to either the stem or the suffix, e.g.,
      • Windsori + n or Windsor + in; Windsore + i + lla or Windsor + ei + lla (cf. Windsor)
      • invite + s or invit + es; invite or invit + e (cf. invit + ing)
  • Compute precision and recall of correctly discovered morpheme boundaries
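The boundary-based evaluation can be sketched as below: each segmentation is turned into a set of character offsets, and precision and recall are computed over those offsets. This is a simplified illustration that does not handle the fuzzy boundaries described above; the function names and the example pairs are chosen for illustration only.

```python
def boundaries(segmentation):
    """Return the character offsets at which morph boundaries fall."""
    positions, offset = set(), 0
    for morph in segmentation.split("+")[:-1]:
        offset += len(morph.strip())
        positions.add(offset)
    return positions


def precision_recall(pairs):
    """pairs: iterable of (proposed segmentation, gold segmentation) strings."""
    correct = proposed_total = gold_total = 0
    for proposed, gold in pairs:
        p, g = boundaries(proposed), boundaries(gold)
        correct += len(p & g)
        proposed_total += len(p)
        gold_total += len(g)
    # Guard against empty denominators when no boundaries exist at all.
    precision = correct / proposed_total if proposed_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return precision, recall


pairs = [("un + believable", "un + believ + able"),  # one boundary missed
         ("s + can", "scan")]                        # one spurious boundary
print(precision_recall(pairs))
```

In this toy example one of the two proposed boundaries is correct and one of the two gold boundaries is found, so both precision and recall are 0.5.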

Results (evaluated against the gold standard)

[Figure: results for the Baseline, Categories, and Linguistica algorithms at data set sizes of 10k, 250k, and 16M words.]

Discussion
  • The Category algorithm
    • overcomes many of the shortcomings of the Baseline algorithm
      • excessive or too little segmentation
      • suffixes at the beginning of words
    • generalizes more than Linguistica, e.g.,
      • allus + ion + s (Categories) vs. allusions (Linguistica)
      • Dem + i (Categories) vs. Demi (Linguistica)
    • invents its own solutions
      • aihe + e + sta vs. aihe + i + sta ("about [the] topic/-s")
      • phrase, phrase+s, phrase+d

Future directions
  • The Category algorithm could be expressed more elegantly
    • not as a post-processing procedure making use of a baseline segmentation
  • Segmentation into morphs is useful
    • e.g., n-gram language modeling in speech recognition
  • Detection of allomorphy, i.e., segmentation into morphemes, would be even more useful
    • e.g., information retrieval (?)


Public demo

  • A demo of the baseline and category-learning algorithms is available on the Internet at http://www.cis.hut.fi/projects/morpho/.
  • Test it on your own Finnish or English input!


Search for the optimal segmentation of the words in a corpus

  • Randomly shuffle the words.
  • Apply recursive binary splitting to each word.
  • Check whether the description length has converged: if yes, done; if no, shuffle and split again.

[Figure: recursive binary splitting of the words opening, openminded, reopened, and conferences, with intermediate parts such as reopen and minded, into morphs such as mind, open, re, and ed.]
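The search loop on this slide can be sketched as a greedy procedure: shuffle the words, re-segment each word by recursive binary splitting, and stop once the description length no longer improves. The cost function, the flat per-letter cost, the stopping tolerance, and all function names are assumptions for illustration; the sketch also simplifies the bookkeeping by evaluating the left part of a split before the right part has been added back to the counts.

```python
import math
import random
from collections import Counter

BITS_PER_CHAR = 5.0  # assumption: flat cost of spelling one letter in the lexicon


def cost(counts):
    """Description length in bits of the current morph counts (assumed coding)."""
    total = sum(counts.values())
    lexicon = sum(len(m) * BITS_PER_CHAR for m in counts)
    corpus = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon + corpus


def remove(counts, morph):
    """Decrement a morph count and drop the entry when it reaches zero."""
    counts[morph] -= 1
    if counts[morph] == 0:
        del counts[morph]


def split_recursively(chunk, counts):
    """Greedy recursive binary splitting; adds the chosen morphs to counts."""
    counts[chunk] += 1
    best_cost, best_split = cost(counts), None  # option: keep the chunk whole
    remove(counts, chunk)
    for i in range(1, len(chunk)):
        left, right = chunk[:i], chunk[i:]
        counts[left] += 1
        counts[right] += 1
        candidate = cost(counts)
        if candidate < best_cost:
            best_cost, best_split = candidate, (left, right)
        remove(counts, left)
        remove(counts, right)
    if best_split is None:
        counts[chunk] += 1
        return [chunk]
    left, right = best_split
    return split_recursively(left, counts) + split_recursively(right, counts)


def segment_corpus(words, seed=0, tolerance=1e-3, max_epochs=20):
    """Shuffle, re-split every word, and repeat until the cost stops improving."""
    rng = random.Random(seed)
    words = sorted(set(words))
    segmentation = {w: [w] for w in words}
    counts = Counter(words)
    previous = cost(counts)
    for _ in range(max_epochs):
        rng.shuffle(words)
        for word in words:
            for morph in segmentation[word]:
                remove(counts, morph)
            segmentation[word] = split_recursively(word, counts)
        current = cost(counts)
        if previous - current < tolerance:
            break
        previous = current
    return segmentation


words = ["opening", "openminded", "reopened", "conferences"]
for word, morphs in segment_corpus(words).items():
    print(word, "->", " + ".join(morphs))
```

On a word list this small the resulting splits depend heavily on the assumed costs, so the output should be read as a demonstration of the control flow rather than as a plausible morphological analysis.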
