Inducing the Morphological Lexicon of a Natural Language from Unannotated Text



  • nyky+ratkaisu+i+sta+mme
  • kahvi+n+juo+ja+lle+kin
  • tietä+isi+mme+kö+hän
  • open+mind+ed+ness
  • un+believ+able

{ Mathias.Creutz, Krista.Lagus }@hut.fi

International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05)

Espoo, 17 June 2005

Challenge for NLP: too many words
  • E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):
    • kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
    • nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
    • tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)

  • Huge number of different possible word forms
  • Important to know the inner structure of words
  • The number of morphemes per word varies greatly

Mathias Creutz

Goal

Morfessor

  • Learn representations of
    • the smallest individually meaningful units of language (morphemes)
    • and their interaction
    • in an unsupervised and data-driven manner from raw text
    • making as general and language-independent assumptions as possible.

State of the art
  • Rule-based systems
    • accurate, language-dependent, adaptivity issues
  • Unsupervised word segmentation
    • sentences can be of different length
    • context-insensitive → poor modeling of syntax:
      • undersegmentation of frequent strings (“forthepurposeof”)
      • oversegmentation of rare strings (“in + s + an + e”)
      • no syntactic / morphotactic constraints (“s + can”)

Morfessor Baseline

State of the art (cont’d)
  • Morphology learning
    • Beyond segmentation: allomorphy (“foot – feet, goose – geese”)
    • Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”)
    • Learning of paradigms (e.g., John Goldsmith’s Linguistica)

[Figure: a paradigm in which the stems believ, hop, liv, mov, us combine with the suffixes e, ed, es, ing]

Very restricted syntax / morphotactics in terms of number of morphemes per word form!

Morfessor with morpheme categories
  • Lexicon / Grammar dualism
    • Word structure captured by a regular expression: word = ( prefix* stem suffix* )+
    • Morph sequences (words) are generated by a Hidden Markov model:

[Figure: HMM generating the morph sequence # over simpl ific ation s #, annotated with transition probabilities such as P(STM | PRE) and P(SUF | SUF) and emission probabilities such as P(’over’ | PRE) and P(’s’ | SUF)]
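The word structure and the HMM generation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Morfessor implementation: every probability value below is hypothetical, and the names `TRANSITION`, `EMISSION`, `is_valid_word` and `word_probability` are made up for this example. Only the regular expression and the example word come from the slide.

```python
import re

# Hypothetical transition probabilities P(next category | previous category).
# "#" marks the word boundary. The values are illustrative only.
TRANSITION = {
    ("#", "PRE"): 0.2, ("#", "STM"): 0.8,
    ("PRE", "PRE"): 0.1, ("PRE", "STM"): 0.9,
    ("STM", "STM"): 0.1, ("STM", "PRE"): 0.1, ("STM", "SUF"): 0.5, ("STM", "#"): 0.3,
    ("SUF", "SUF"): 0.4, ("SUF", "STM"): 0.05, ("SUF", "PRE"): 0.05, ("SUF", "#"): 0.5,
}

# Hypothetical emission probabilities P(morph | category).
EMISSION = {
    ("PRE", "over"): 0.05,
    ("STM", "simpl"): 0.001,
    ("SUF", "ific"): 0.01,
    ("SUF", "ation"): 0.02,
    ("SUF", "s"): 0.1,
}

# word = ( prefix* stem suffix* )+ written over one-letter category codes.
CODE = {"PRE": "P", "STM": "S", "SUF": "X"}
WORD_PATTERN = re.compile(r"(?:P*SX*)+")


def is_valid_word(categories):
    """Check a category sequence against ( prefix* stem suffix* )+."""
    encoded = "".join(CODE[c] for c in categories)
    return WORD_PATTERN.fullmatch(encoded) is not None


def word_probability(tagged_morphs):
    """Probability of a tagged morph sequence under the toy HMM."""
    if not is_valid_word([category for _, category in tagged_morphs]):
        return 0.0
    probability = 1.0
    previous = "#"
    for morph, category in tagged_morphs:
        probability *= TRANSITION.get((previous, category), 0.0)
        probability *= EMISSION.get((category, morph), 0.0)
        previous = category
    # Close the word with a transition back to the boundary symbol.
    return probability * TRANSITION.get((previous, "#"), 0.0)


word = [("over", "PRE"), ("simpl", "STM"), ("ific", "SUF"),
        ("ation", "SUF"), ("s", "SUF")]
p = word_probability(word)
```

A sequence that violates the word structure, such as a suffix followed by a prefix with no stem, gets probability zero in this sketch.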

Lexicon

The lexicon stores the “form” and the “meaning” of each morph through the following properties:

Morph    Length   Frequency   Right perplexity   Left perplexity
over     4        14029       136                1
simpl    5        41          4                  1
s        1        17259       1                  4618
...

How meaning affects morphotactic role
  • Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’)
  • Assume asymmetries between the categories:

How meaning affects role (cont’d)

  • There is an additional non-morpheme category for cases where none of the proper classes is likely:
  • Distribute remaining probability mass proportionally, e.g.,
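One way to picture these category priors is the following sketch. The formulas on the slide did not survive in this transcript, so everything here is an assumption: the sigmoid shape, the thresholds, the squared weighting, and the names `sigmoid` and `category_priors` are hypothetical. The sketch only illustrates the stated idea: a morph that is followed by many different morphs (high right perplexity) looks prefix-like, one that is preceded by many different morphs (high left perplexity) looks suffix-like, a long morph looks stem-like, a non-morpheme category absorbs cases where none of these is likely, and the remaining probability mass is shared proportionally.

```python
import math


def sigmoid(x, threshold, slope):
    """Graded 0..1 response that switches on around the threshold."""
    return 1.0 / (1.0 + math.exp(-slope * (x - threshold)))


def category_priors(length, right_perplexity, left_perplexity,
                    perplexity_threshold=10.0, length_threshold=3.0,
                    slope=1.0, power=2.0):
    """Toy prior P(category | morph); all parameters are hypothetical."""
    likeness = {
        "PRE": sigmoid(right_perplexity, perplexity_threshold, slope),
        "STM": sigmoid(length, length_threshold, slope),
        "SUF": sigmoid(left_perplexity, perplexity_threshold, slope),
    }
    # Non-morpheme: none of the proper categories is likely.
    p_non = 1.0
    for value in likeness.values():
        p_non *= 1.0 - value
    # Share the remaining mass in proportion to the (powered) likeness values.
    weights = {c: v ** power for c, v in likeness.items()}
    total = sum(weights.values())
    priors = {c: (1.0 - p_non) * w / total for c, w in weights.items()}
    priors["NON"] = p_non
    return priors


# Property values taken from the lexicon slide.
over = category_priors(length=4, right_perplexity=136, left_perplexity=1)
simpl = category_priors(length=5, right_perplexity=4, left_perplexity=1)
s = category_priors(length=1, right_perplexity=1, left_perplexity=4618)
```

With these hypothetical settings, the most probable category is prefix for "over", stem for "simpl" and suffix for "s".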

Maximum a posteriori optimization

Morfessor Categories-MAP:

Balance accuracy of representation of data against size of lexicon

[Figure: the lexicon (entries such as over, simpl, s with their properties) alongside the HMM generating # over simpl ific ation s #, with transition probabilities such as P(STM | PRE), P(SUF | SUF) and emission probabilities such as P(’over’ | PRE), P(’s’ | SUF)]

Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically)
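The balance between lexicon size and accuracy of data representation can be illustrated with a deliberately simplified two-part cost. This is not the Categories-MAP model: it uses a plain unigram model over morphs and a flat per-letter lexicon cost, and the function name `map_cost`, the alphabet size and the toy corpus are assumptions made for this example. The total cost corresponds to -log P(lexicon) - log P(corpus | lexicon), so a lower cost means a higher posterior probability.

```python
import math
from collections import Counter


def map_cost(segmented_corpus, alphabet_size=27):
    """Toy two-part cost in bits for a corpus given as lists of morphs."""
    tokens = [morph for word in segmented_corpus for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon cost: spell out each distinct morph letter by letter,
    # plus one end-of-morph symbol.
    lexicon_cost = sum((len(m) + 1) * math.log2(alphabet_size) for m in counts)
    # Corpus cost: code each morph token with its maximum-likelihood probability.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_cost + corpus_cost


unsplit = [["hand"], ["hands"], ["car"], ["cars"], ["dog"], ["dogs"]]
split = [["hand"], ["hand", "s"], ["car"], ["car", "s"], ["dog"], ["dog", "s"]]

cost_unsplit = map_cost(unsplit)
cost_split = map_cost(split)
```

In this toy corpus the split analysis is cheaper overall: reusing "s" shrinks the lexicon by more than the extra morph tokens cost in the corpus.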

Over- and undersegmentation still a problem?
  • Rare strings are split into smaller parts (e.g., morgan + a)
  • Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands)
  • Probability of adding an entry to the lexicon:
  • Probability of sequences in the corpus:

[Figure: # hands # vs. # hand s #]

Solution: Hierarchical structures in lexicon

[Figure: tree for oppositio + kansanedustaja, with nodes marked as non-morpheme, stem or suffix]

oppositio + kansanedustaja
  oppositio
    op + positio
  kansanedustaja
    kansan
      kansa + n
    edustaja
      edusta + ja

  • Make morphs consist of submorphs.
  • Expand the tree when performing morpheme segmentation.
  • Do not expand morphs consisting of non-morphemes.
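The three rules above can be sketched as a small recursive expansion. The `Morph` class and the `expand` function are hypothetical names for this illustration, and the category assigned to each node of the example tree is an assumption, since the figure's colour coding is not recoverable from this transcript.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Morph:
    string: str
    category: str  # "PRE", "STM", "SUF" or "NON" (non-morpheme)
    children: Optional[Tuple["Morph", "Morph"]] = None


def expand(morph: Morph) -> List[str]:
    """Expand a morph into its submorphs, stopping above non-morphemes."""
    if morph.children is None:
        return [morph.string]
    # Do not expand morphs that consist of non-morphemes.
    if any(child.category == "NON" for child in morph.children):
        return [morph.string]
    result: List[str] = []
    for child in morph.children:
        result.extend(expand(child))
    return result


# Example tree from the slide; the node categories are assumed.
oppositio = Morph("oppositio", "STM",
                  (Morph("op", "NON"), Morph("positio", "NON")))
kansan = Morph("kansan", "STM", (Morph("kansa", "STM"), Morph("n", "SUF")))
edustaja = Morph("edustaja", "STM", (Morph("edusta", "STM"), Morph("ja", "SUF")))
kansanedustaja = Morph("kansanedustaja", "STM", (kansan, edustaja))

segmentation = expand(oppositio) + expand(kansanedustaja)
```

Under these assumed categories, "oppositio" stays whole because its parts are non-morphemes, while "kansanedustaja" is expanded down to its leaves.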

Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)
  • Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation (Hutmegs)
  • Covers
    • 1.4 million Finnish word forms
    • 120 000 English word forms
  • Publicly available and described in the technical report:

M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.
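A comparison against a gold-standard segmentation could be scored as in the sketch below. The slides do not state which metric was used, so precision and recall over morph boundary positions are an assumption here, and the function names `boundaries` and `boundary_scores` are made up for this example.

```python
def boundaries(morphs):
    """Character offsets of the internal morph boundaries of one word."""
    positions = set()
    offset = 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_scores(proposed, gold):
    """Boundary precision and recall of a proposed segmentation."""
    proposed_set = boundaries(proposed)
    gold_set = boundaries(gold)
    correct = len(proposed_set & gold_set)
    # By convention here, an empty set counts as a perfect score.
    precision = correct / len(proposed_set) if proposed_set else 1.0
    recall = correct / len(gold_set) if gold_set else 1.0
    return precision, recall


# Undersegmented proposal versus the segmentation shown on the title slide.
precision, recall = boundary_scores(["un", "believable"],
                                    ["un", "believ", "able"])
```

In this example the single proposed boundary is correct (precision 1.0), but one of the two gold boundaries is missed (recall 0.5).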

Evaluation against the Hutmegs Gold Standard

[Figure: evaluation results for Finnish and English, comparing Categories-MAP, Heuristic (Categories-ML), Context-insensitive (Baseline) and Paradigms (Linguistica)]

Example segmentations

Discussion
  • Possibility to extend the model
    • rudimentary features used for “meaning”
    • more fine-grained categories
    • beyond concatenative phenomena (e.g., goose – geese)
    • allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)
  • Already useful in applications
    • automatic speech recognition (Finnish, Turkish)

Morpho project page

http://www.cis.hut.fi/projects/morpho/

Demo 7
