Unsupervised and Knowledge-free Morpheme Segmentation and Analysis

Stefan Bordag

University of Leipzig

  • Components
  • Detailing
    • Compound splitting
    • Iterated LSV
    • Split trie training
  • Morpheme Analysis
  • Results
  • Discussion
1. Components
  • The main components of the current LSV-based segmentation algorithm:
    • Compound splitter (new)
    • LSV component (new: iterated)
    • Trie classifier (new: split into two phases)
  • Morpheme analysis (entirely new) is based on:
    • Morpheme segmentation (see above)
    • Clustering of morphs into morphemes
    • Contextual similarity of morphemes
  • The main focus is on modularity, so that each module has a specific function and could be replaced by a better algorithm by someone else
2.1. Compound Splitter
  • Based on the observation that especially long words pose a problem for LSV
  • Simple heuristic: a word is split whenever it is decomposable into several words which have
    • a minimum length of 4
    • a minimum frequency of 10 (or some other arbitrary figures)

This results in many missed but at least some correct divisions (Precision being more important than Recall at this point)

    • P=88% R=10% F=18%
  • Where several decompositions are possible, those with more words and higher frequencies win
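The heuristic above can be sketched as follows. This is a minimal illustration, not the original implementation; `freq` stands for an assumed word-to-frequency dictionary, and the tie-breaking (more parts win, then higher total frequency) follows the last bullet:

```python
def split_compound(word, freq, min_len=4, min_freq=10):
    """Best decomposition of `word` into known words, or [word] if none."""
    best = [word]

    def score(parts):
        # More parts win; ties are broken by total part frequency.
        return (len(parts), sum(freq.get(p, 0) for p in parts))

    def search(rest, parts):
        nonlocal best
        if not rest:
            if score(parts) > score(best):
                best = parts[:]
            return
        for i in range(min_len, len(rest) + 1):
            head = rest[:i]
            # Each part must be a known word of sufficient length/frequency.
            if freq.get(head, 0) >= min_freq:
                search(rest[i:], parts + [head])

    search(word, [])
    return best
```

For example, with `freq = {"boot": 50, "haus": 40, "boothaus": 5}` (hypothetical counts), `split_compound("boothaus", freq)` yields `["boot", "haus"]`, while words without a licensed decomposition are returned unsplit.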
2.2. Original solution in two parts

[Figure: the two-part pipeline. From sentences ("The talk was very informative"), co-occurrences are counted (the–talk 1, talk–was 1) and contextually similar words derived (talk–speech 20, was–is 15). LSV is computed on the similar-word lists and scored as s = LSV * freq * multiletter * bigram, yielding a first set of segmentations (clear-ly, lately, early). These train a trie classifier (root with branches such as clear, late, ear and -ly, with ¤ marking word ends), which is then applied to all words, producing clear-ly, late-ly, early.]

2.3. Original letter successor variety
  • Letter successor variety: Harris (1955)

Word-splitting occurs where the number of distinct letters that follow a given sequence of characters surpasses a threshold.

  • Input: the 150 contextually most similar words
  • Observing how many different letters occur after a part of the string:
    • after #cle- only 1 letter
    • but reversed, before -ly# there are 16 different letters (16 different stems preceding the suffix -ly#)

     #  c  l  e  a  r  l  y  #
    28  5  3  1  1  1  1  1      from left (thus after #cl, 5 different letters)
     1  1  2  1  3 16 10 14      from right (thus before -y#, 10 different letters)
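The counting scheme above can be sketched as follows. This is an illustrative reimplementation, not the original code; the similar-word list and the threshold test are left to the caller:

```python
def lsv(word, words, reverse=False):
    """Distinct-successor counts after each prefix of `word` (length 1..n).

    With reverse=True, counts distinct letters *preceding* each suffix
    instead, by running the same computation on reversed strings.
    """
    if reverse:
        word = word[::-1]
        words = [w[::-1] for w in words]
    counts = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Letters observed right after this prefix in the similar words.
        successors = {w[i] for w in words
                      if w.startswith(prefix) and len(w) > i}
        counts.append(len(successors))
    return counts[::-1] if reverse else counts
```

With a toy similar-word list `["clearly", "lately", "early", "clean", "clever"]`, the forward counts for "clearly" peak where several continuations exist (after "cle-"), and the reversed counts peak before the shared suffix "-ly", mirroring the table above.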

2.4. Balancing factors
  • The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise:
    • freq: frequency differences between the beginning and the middle of a word
    • multiletter: representation of single phonemes by several letters
    • bigram: certain fixed combinations of letters
  • The final score s for each possible boundary is then:

s = LSV * freq * multiletter * bigram
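A minimal sketch of how the weighted score could be applied (hypothetical interface; the three weight lists are assumed to be precomputed per candidate boundary, which the slides do not spell out):

```python
def boundary_scores(lsv_scores, freq_w, multiletter_w, bigram_w):
    """s = LSV * freq * multiletter * bigram, per candidate boundary."""
    return [l * f * m * b for l, f, m, b in
            zip(lsv_scores, freq_w, multiletter_w, bigram_w)]

def segment(word, scores, threshold):
    """Insert '-' after position i wherever scores[i] exceeds the threshold."""
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        if i < len(scores) and scores[i] > threshold:
            out.append("-")
    return "".join(out).rstrip("-")
```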

2.5. Iterated LSV
  • Iterating LSV uses previously found information
  • For example, when computing
    • ignited, with the most similar words already analysed into:
      • caus-ed, struck, injur-ed, blazed, fire, …
  • there is more evidence for ignit-ed, because most words ending in -ed were found to have -ed as a morpheme
  • Implemented in the form of a weight iterLSV

iterLSV = #wordsEndingIsMorph / #wordsSameEnding

  • hence:

s = LSV * freq * multiletter * bigram * iterLSV
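The iterLSV weight can be sketched as follows (illustrative only; `analyses` is an assumed map from already-analysed words to their morph lists, e.g. `"caused" -> ["caus", "ed"]`):

```python
def iter_lsv(word, suffix, analyses):
    """#wordsEndingIsMorph / #wordsSameEnding for the proposed suffix."""
    same_ending = [w for w in analyses if w.endswith(suffix) and w != word]
    if not same_ending:
        return 1.0  # no evidence either way: leave the LSV score unchanged
    # How often was this ending actually confirmed as a morph?
    confirmed = sum(1 for w in same_ending if analyses[w][-1] == suffix)
    return confirmed / len(same_ending)
```

In the ignited example, if caus-ed and injur-ed were analysed with -ed but blazed was left whole, the weight for the -ed boundary is 2/3, boosting ignit-ed relative to competing boundaries.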

2.6. Pat. Comp. Trie as Classifier

[Figure: the known segmentations clear-ly, late-ly, early and the words clear, late are added into patricia-compressed tries whose nodes carry class counts (e.g. ly=2, ¤=1, where ¤ marks a word end without a split). To classify amazing?ly and dear?ly, the deepest matching node is found and its stored information retrieved, yielding amazing-ly but dearly.]

2.7. Splitting trie application
  • The trie classifier could decide for ignit-ed based on the top node in the trie from the back
    • –d with classes –ed:50; -d:10; -ted:5; …
    • hence not taking any context within the word into account
  • The new version save_trie (as opposed to rec_trie) trains one trie from the LSV data and decides only if
    • at least one more letter, in addition to the letters of the proposed morpheme, matches in the word
  • save_trie and rec_trie are then trained and applied consecutively

[Figure: reversed trie trained on caus-ed and injur-ed, with a top node ed=2 and, one letter deeper, r→ed=1 and s→ed=1. For ignited, save_trie finds no matching letter beyond -ed and leaves ignited unsplit, while rec_trie uses the top node and yields ignit-ed.]
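The two trie phases can be sketched with a plain dict-based trie. This is a simplification: the original uses a patricia-compressed trie, and the class inventory here is reduced to one suffix per training pair:

```python
from collections import Counter

def train(segmentations):
    """Build a reversed trie from (stem, suffix) pairs, e.g. ("caus", "ed")."""
    trie = {}
    for stem, suffix in segmentations:
        node = trie
        for ch in reversed(stem + suffix):
            node = node.setdefault(ch, {})
            # Each node along the path counts the suffix classes seen there.
            node.setdefault("#", Counter())[suffix] += 1
    return trie

def classify(word, trie, save=False):
    """Return a stem-suffix split, or the word unchanged if undecided."""
    node, depth, best = trie, 0, None
    for ch in reversed(word):
        if ch not in node:
            break
        node = node[ch]
        depth += 1
        if "#" in node:
            best = (depth, node["#"].most_common(1)[0][0])
    if best is None:
        return word
    depth, suffix = best
    # save_trie: require at least one matched letter beyond the suffix itself.
    if save and depth <= len(suffix):
        return word
    return word[:-len(suffix)] + "-" + suffix
```

On the slide's example, training on caus-ed and injur-ed lets the save variant refuse to split ignited (only "-ed" matches), while the rec variant commits to ignit-ed from the top node.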

2.8. Effect of the improvements
  • compounds
    • P=88% R=10% F=18%
  • compounds + recTrie
    • P=66% R=28% F=39%
  • compounds + lsv_0 + recTrie
    • P=71% R=58% F=64%
  • compounds + lsv_2 + recTrie
    • P=69% R=63% F=66%
  • compounds + lsv_2 + saveTrie + recTrie
    • P=69% R=66% F=67%
  • Most notably, these changes reach the same performance level (F=70) as the original lsv_0 + recTrie on a corpus three times smaller
  • However, applying it to a three times bigger corpus only increases the number of split words, not the quality of the splits!
3. Morpheme Analysis
  • Assumes visible morphs (i.e. the output of a segmentation algorithm)
    • This makes it possible to compute co-occurrences of morphs
    • which enables computing the contextual similarity of morphs
    • which enables clustering morphs into morphemes
  • Traditional representation of morphemes:
    • barefooted BARE FOOT +PAST
    • flying FLY_V +PCP1
    • footprints FOOT PRINT +PL
  • For processing, an equivalent representation of morphemes:
    • barefooted bare 5foot.6foot.foot ed
    • flying fly inag.ing.ingu.iong
    • footprints 5foot.6foot.foot prints
3.1. Computing alternation

for each morph m
  for each contextually similar morph s of m
    if LD_Similar(s, m)
      r = makeRule(s, m)
      store(r -> s, m)

for each word w
  for each morph m of w
    if in_store(m)
      sig = createSignature(m)
      write sig
    else
      write m

Example: m = foot, s = {feet, 5foot, …}; LD(foot, 5foot) = 1 yields the rule _-5 -> foot, 5foot. For barefooted = {bare, foot, ed}: foot has _-5 and _-6, hence sig: foot.5foot.6foot.
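The pseudocode above can be made concrete as follows. This is an illustrative sketch: the rule machinery is simplified to collecting Levenshtein-distance-1 alternants into dot-joined signatures, and `similar` is an assumed map from a morph to its contextually similar morphs:

```python
def ld1(a, b):
    """True if a and b are at Levenshtein distance exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    for i in range(len(b)):
        # Deleting one letter of the longer string must give the shorter one.
        if b[:i] + b[i + 1:] == a:
            return True
    # Equal length: exactly one substitution.
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def make_signatures(similar):
    """morph -> dot-joined, sorted set of its alternants (including itself)."""
    store = {}
    for m, sims in similar.items():
        for s in sims:
            if ld1(m, s):
                store.setdefault(m, {m}).add(s)
    return {m: ".".join(sorted(forms)) for m, forms in store.items()}
```

With `similar = {"foot": ["feet", "5foot", "6foot", "print"]}`, the signature for foot comes out as `5foot.6foot.foot`, matching the example above (feet is at distance 2 and is not captured by this simplified rule).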

3.2. Real examples

Rules:

  • m-s : 49.0 barem,bares blum,blus erem,eres estem,estes etem,etes eurem,eures ifm,ifs igem,iges ihrem,ihres jedem,jedes lme,lse losem,loses mache,sache mai,sai
  • _-u : 46.0 bahn,ubahn bdi,bdiu boot,uboot bootes,ubootes cor,coru dejan,dejuan dem,demu dem,deum die,dieu em,eum en,eun en,uen erin,eurin
  • m-r : 44.0 barem,barer dem,der demselb,derselb einem,einer ertem,erter estem,ester eurem,eurer igem,iger ihm,ihr ihme,ihre ihrem,ihrer jedem,jeder

Signatures:

  • muessen muess.muesst.muss en
  • ihrer ihre.ihrem.ihren.ihrer.ihres
  • werde werd.wird.wuerd e
  • Ihren ihre.ihrem.ihren.ihrer.ihres.ihrn
3.3. More examples

kabinettsaufteilung kabinet.kabinett.kabinetts aauf.aeuf.auf.aufs.dauf.hauf tail.teil.teile.teils.teilt bung.dung.kung.rung.tung.ung.ungs

entwaffnungsbericht enkt.ent.entf.entp waff.waffn.waffne.waffnet lungs.rungs.tungs.ung.ungn.ungs berich.bericht

grundstuecksverwaltung gruend.grund stuecks nver.sver.veer.ver walt bung.dung.kung.rung.tung.ung.ungs

grundt gruend.grund t

4. Results (competition 1)
  • GERMAN

AUTHOR     METHOD             PRECISION  RECALL  F-MEASURE
Bernhard   1                  63.20%     37.69%  47.22%
Bernhard   2                  49.08%     57.35%  52.89%
Bordag     5                  60.71%     40.58%  48.64%
Bordag     5a                 60.45%     41.57%  49.27%
McNamee    3                  45.78%      9.28%  15.43%
Zeman      -                  52.79%     28.46%  36.98%
Monson&co  Morfessor          67.16%     36.83%  47.57%
Monson&co  ParaMor            59.05%     32.81%  42.19%
Monson&co  ParaMor&Morfessor  51.45%     55.55%  53.42%
Morfessor  MAP                67.56%     36.92%  47.75%

  • ENGLISH

AUTHOR     METHOD             PRECISION  RECALL  F-MEASURE
Bernhard   1                  72.05%     52.47%  60.72%
Bernhard   2                  61.63%     60.01%  60.81%
Bordag     5                  59.80%     31.50%  41.27%
Bordag     5a                 59.69%     32.12%  41.77%
McNamee    3                  43.47%     17.55%  25.01%
Zeman      -                  52.98%     42.07%  46.90%
Monson&co  Morfessor          77.22%     33.95%  47.16%
Monson&co  ParaMor            48.46%     52.95%  50.61%
Monson&co  ParaMor&Morfessor  41.58%     65.08%  50.74%
Morfessor  MAP                82.17%     33.08%  47.17%

4.1. Results (competition 1)
  • TURKISH

AUTHOR     METHOD  PRECISION  RECALL  F-MEASURE
Bernhard   1       78.22%     10.93%  19.18%
Bernhard   2       73.69%     14.80%  24.65%
Bordag     5       81.44%     17.45%  28.75%
Bordag     5a      81.31%     17.58%  28.91%
McNamee    3       65.00%     10.83%  18.57%
McNamee    4       85.49%      6.59%  12.24%
McNamee    5       94.80%      3.31%   6.39%
Zeman      -       65.81%     18.79%  29.23%
Morfessor  MAP     76.36%     24.50%  37.10%

  • FINNISH

AUTHOR     METHOD  PRECISION  RECALL  F-MEASURE
Bernhard   1       75.99%     25.01%  37.63%
Bernhard   2       59.65%     40.44%  48.20%
Bordag     5       71.72%     23.61%  35.52%
Bordag     5a      71.32%     24.40%  36.36%
McNamee    3       45.53%      8.56%  14.41%
McNamee    4       68.09%      5.68%  10.49%
McNamee    5       86.69%      3.35%   6.45%
Zeman      -       58.84%     20.92%  30.87%
Morfessor  MAP     76.83%     27.54%  40.55%

5.1. Problems of Morpheme Analysis
  • Surprise #1: nearly no effect on the evaluation results! Possible reasons:
    • rules: type frequency is not taken into account (hence errors are overvalued)
    • rules: context is not taken into account (instead of _-5, better _5f-_fo)
    • segmentation: produces many errors, so the analysis has to put up with a lot of noise
5.2. Problems of Segmentation
  • Surprise #2: the size of the corpus has no large influence on the quality of the segmentations
    • it influences only how many nearly perfect segmentations are found by LSV
    • but that is by far outweighed by the errors of the trie
  • The strength of LSV is to segment irregular words properly
    • because they have high frequency and are usually short
  • The strength of most other proposed methods lies in segmenting long and infrequent words
    • A combination is evidently desirable
5.3. Further avenues?
  • The most notable problem currently is the assumption that the phonemes representing a morph/morpheme stay contiguous, that is, AAA + BBB usually becomes AAABBB, not ABABAB
  • For languages that interleave morphemes this is inappropriate
  • A better solution might be similar to U-DOP by Rens Bod?
    • that means generating all possible parse trees for each token
    • then collating them for the type and generating possible optimal parses
    • possibly generating trees not just for the type, but also for some context, for example with the relevant context highlighted: Yesterday we arrived by plane.