
Unsupervised and Knowledge-free Morpheme Segmentation and Analysis

Stefan Bordag

University of Leipzig

  • Components

  • Detailing

    • Compound splitting

    • Iterated LSV

    • Split trie training

  • Morpheme Analysis

  • Results

  • Discussion



1. Components

  • The main components of the current LSV-based segmentation algorithm

    • Compound splitter (new)

    • LSV component (new: iterated)

    • Trie classifier (new: split into two phases)

  • Morpheme analysis (entirely new) is based on

    • Morpheme segmentation (see above)

    • Clustering of morphs to morphemes

    • Contextual similarity of morphemes

  • The main focus is on modularity, so that each module has a specific function and could be replaced by a better algorithm by someone else



2.1. Compound Splitter

  • Based on the observation that especially long words pose a problem for LSV

  • Simple heuristic: whenever a word is decomposable into several words which have

    • minimum length of 4

    • minimum frequency of 10 (or some other arbitrary figures)

      the word is split; this results in many missed but at least some correct divisions (precision at this point being more important than recall)

    • P=88% R=10% F=18%

  • In cases where several decompositions are possible, the decomposition with more words and higher frequencies wins
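A minimal sketch of this compound-splitting heuristic, assuming a word-frequency dictionary freqs is available; the function name, the exact tie-breaking between decompositions, and the example values are illustrative, not the original implementation:

# Minimal sketch of the compound-splitting heuristic (illustrative, not Bordag's code).
MIN_LEN, MIN_FREQ = 4, 10

def split_compound(word, freqs):
    """Return the best decomposition of `word` into known words, or None."""
    best = None

    def search(rest, parts):
        nonlocal best
        if not rest:
            if len(parts) < 2:                      # a single part is not a compound split
                return
            # prefer decompositions with more parts and higher summed frequency
            key = (len(parts), sum(freqs[p] for p in parts))
            if best is None or key > best[0]:
                best = (key, list(parts))
            return
        for i in range(MIN_LEN, len(rest) + 1):
            prefix = rest[:i]
            if freqs.get(prefix, 0) >= MIN_FREQ:    # candidate part: long and frequent enough
                search(rest[i:], parts + [prefix])

    search(word, [])
    return best[1] if best else None

# split_compound("bookshelf", {"book": 120, "shelf": 40, "bookshelf": 15}) -> ["book", "shelf"]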



2.2. Original solution in two parts

[Figure: pipeline of the original two-part solution. From sentences (e.g. "The talk was very informative"), word co-occurrences are collected (the talk 1, talk was 1, talk speech 20, was is 15), which yield the contextually most similar words for a target word. Part one computes LSV over these similar words with the score s = LSV * freq * multiletter * bigram and segments clear-ly, late-ly, early. Part two stores the segmentations in a trie (branches for clear, late and ear sharing the suffix ly), trains it as a classifier, and applies it to new words.]



2.3. Original Letter successor variety

  • Letter successor variety: Harris (1955)

    where word-splitting occurs if the number of distinct letters that follow a given sequence of characters surpasses a threshold.

  • Input: the 150 contextually most similar words

  • Observing how many different letters occur after a part of the string:

    • #cle- is followed by only 1 letter

    • reading from the right, 16 different letters occur before -ly# (16 different stems preceding the suffix -ly#)

      # c l e a r l y #
      28 5 3 1 1 1 1 1    from left (thus after #cl 5 various letters)
      1 1 2 1 3 16 10 14  from right (thus before -y# 10 various letters)
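A minimal sketch of the letter successor variety count itself, assuming the contextually most similar words are already available; the helper name and the toy word list are illustrative, not the original implementation:

# Count, for every prefix, how many distinct letters follow it (illustrative sketch).
from collections import defaultdict

def successor_variety(words):
    followers = defaultdict(set)
    for w in words:
        w = "#" + w + "#"                # mark word boundaries
        for i in range(1, len(w)):
            followers[w[:i]].add(w[i])   # letter following the prefix w[:i]
    return {prefix: len(s) for prefix, s in followers.items()}

similar = ["clearly", "lately", "early", "clear", "late"]    # toy stand-in for the 150 similar words
left = successor_variety(similar)                            # e.g. left["#cl"] = distinct letters after #cl
right = successor_variety([w[::-1] for w in similar])        # the same counts computed from reversed words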



2.4. Balancing factors

  • LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise:

    • freq: Frequency differences between beginning and middle of word

    • multiletter: Representation of single phonemes with several letters

    • bigram: Certain fixed combinations of letters

    • Final score s for each possible boundary is then:

      s = LSV * freq * multiletter * bigram



2.5. Iterated LSV

  • The iteration of LSV is based on previously found information

  • For example, when computing

    • ignited, with the most similar words already analysed into:

      • caus-ed, struck, injur-ed, blazed, fire, …

  • Then there is more evidence for ignit-ed because most words ending with -ed were found to have -ed as a morpheme

  • Implementation in the form of a weight iterLSV

    iterLSV = #wordsEndingIsMorph / #wordsSameEnding

  • hence:

    s = LSV * freq * multiletter * bigram * iterLSV
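A minimal sketch of how the iterLSV weight could be computed, assuming the contextually similar words and their previous-iteration analyses are available as a dictionary; the names and the neutral default of 1.0 are my assumptions, not the original code:

# iterLSV weight: among similar words that share the candidate ending, how many were
# already segmented with that ending as a morph? (illustrative sketch)
def iter_lsv_weight(ending, similar_words, analysed):
    same_ending = [w for w in similar_words if w.endswith(ending)]
    if not same_ending:
        return 1.0                                   # no evidence either way: neutral weight
    ending_is_morph = [w for w in same_ending
                       if analysed.get(w, [])[-1:] == [ending]]
    return len(ending_is_morph) / len(same_ending)

# e.g. for "ignited" and the candidate ending "ed":
# analysed = {"caused": ["caus", "ed"], "injured": ["injur", "ed"], "blazed": ["blazed"]}
# iter_lsv_weight("ed", ["caused", "injured", "blazed", "struck", "fire"], analysed) -> 2/3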



2.6. Pat. Comp. Trie as Classifier

[Figure: a trie built from the back of the segmented training words clear-ly, late-ly, early, clear, late. Each node stores counts of the known endings (e.g. ly=2 on the shared -ly branch, ¤=1 for word ends without a split). To classify a new word such as amazing?ly or dear?ly, the deepest matching node is found, its known information is retrieved, and the stored segmentation is applied: amazing-ly, but dearly (no split).]



2.7. Splitting trie application

  • The trie classifier could decide for ignit-ed based on the top node of the trie built from the back of the word

    • -d with classes -ed:50; -d:10; -ted:5; …

    • hence not taking any context within the word into account

  • The new version save_trie (as opposed to rec_trie) trains one trie from the LSV data and decides only if

    • at least one more letter, in addition to the letters of the proposed morpheme, matches in the word

  • save_trie and rec_trie are then trained and applied consecutively (see the sketch below)

[Figure: a trie fragment trained from caus-ed and injur-ed: node -ed with count ed=2 and children r (ed=1) and s (ed=1). save_trie leaves ignited unsegmented, while rec_trie segments it as ignit-ed.]


2.8. Effect of the improvements

  • compounds

    • P=88% R=10% F=18%

  • compounds + recTrie

    • P=66% R=28% F=39%

  • compounds + lsv_0 + recTrie

    • P=71% R=58% F=64%

  • compounds + lsv_2 + recTrie

    • P=69% R=63% F=66%

  • compounds + lsv_2 + saveTrie + recTrie

    • P=69% R=66% F=67%

  • Most notably, these changes reach the same performance level as the original lsv_0 + recTrie (F=70%) on a corpus three times smaller

  • However, applying it to a three times bigger corpus only increases the number of split words, not the quality of those splits!



3. Morpheme Analysis

  • Assumes visible morphs (i.e. output of a segmentation algorithm)

    • This makes it possible to compute co-occurrences of morphs

    • which enables computing the contextual similarity of morphs

    • which enables clustering morphs to morphemes

  • Traditional representation of morphemes

    • barefooted BARE FOOT +PAST

    • flying FLY_V +PCP1

    • footprints FOOT PRINT +PL

  • For processing, an equivalent representation of morphemes is used:

    • barefooted → bare 5foot.6foot.foot ed

    • flying → fly inag.ing.ingu.iong

    • footprints → 5foot.6foot.foot prints



3.1. Computing alternation

for each morph m
    for each contextually similar morph s of m
        if LD_Similar(s, m)
            r = makeRule(s, m)
            store(r -> s, m)

for each word w
    for each morph m of w
        if in_store(m)
            sig = createSignature(m)
            write sig
        else
            write m

Example: m = foot, s = {feet, 5foot, …}
LD(foot, 5foot) = 1, hence the rule _-5 -> foot, 5foot
barefooted = {bare, foot, ed}
foot has _-5 and _-6 -> signature: foot.5foot.6foot
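A minimal runnable sketch of the procedure above, assuming `similar` maps each morph to its contextually most similar morphs; the rule bookkeeping (makeRule / store) is collapsed into storing alternation partners directly, LD_Similar is a plain Levenshtein check, and all names are illustrative rather than the original code:

from collections import defaultdict

def edit_distance(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def collect_alternations(similar, max_dist=1):
    """First loop: for each morph, remember the contextually similar, LD-similar morphs."""
    store = defaultdict(set)
    for m, sims in similar.items():
        for s in sims:
            if edit_distance(m, s) <= max_dist:
                store[m].add(s)
    return store

def create_signature(m, store):
    """Second loop body: replace a stored morph by the sorted cluster of its variants."""
    return ".".join(sorted({m} | store[m])) if m in store else m

def analyse_word(morphs, store):
    return [create_signature(m, store) for m in morphs]

# similar = {"foot": ["feet", "5foot", "6foot"]}
# analyse_word(["bare", "foot", "ed"], collect_alternations(similar))
#   -> ["bare", "5foot.6foot.foot", "ed"]   (feet is excluded: edit distance 2)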



3.2. Real examples

Rules:

  • m-s : 49.0 barem,bares blum,blus erem,eres estem,estes etem,etes eurem,eures ifm,ifs igem,iges ihrem,ihres jedem,jedes lme,lse losem,loses mache,sache mai,sai

  • _-u : 46.0 bahn,ubahn bdi,bdiu boot,uboot bootes,ubootes cor,coru dejan,dejuan dem,demu dem,deum die,dieu em,eum en,eun en,uen erin,eurin

  • m-r : 44.0 barem,barer dem,der demselb,derselb einem,einer ertem,erter estem,ester eurem,eurer igem,iger ihm,ihr ihme,ihre ihrem,ihrer jedem,jeder

    Signatures:

  • muessen → muess.muesst.muss en

  • ihrer → ihre.ihrem.ihren.ihrer.ihres

  • werde → werd.wird.wuerd e

  • Ihren → ihre.ihrem.ihren.ihrer.ihres.ihrn



3.3. More examples

kabinettsaufteilung → kabinet.kabinett.kabinetts aauf.aeuf.auf.aufs.dauf.hauf tail.teil.teile.teils.teilt bung.dung.kung.rung.tung.ung.ungs

entwaffnungsberichten → kt.ent.entf.entp waff.waffn.waffne.waffnet lungs.rungs.tungs.ung.ungn.ungs berich.bericht

grundstuecksverwaltung → gruend.grund stuecks nver.sver.veer.ver walt bung.dung.kung.rung.tung.ung.ungs

grundt → gruend.grund t



4. Results (competition 1)

  • GERMAN

    AUTHOR METHOD PRECISION RECALL F-MEASURE

    Bernhard 1 63.20% 37.69% 47.22%

    Bernhard 2 49.08% 57.35% 52.89%

    Bordag 5 60.71% 40.58% 48.64%

    Bordag 5a 60.45% 41.57% 49.27%

    McNamee 3 45.78% 9.28% 15.43%

    Zeman - 52.79% 28.46% 36.98%

    Monson&co Morfessor 67.16% 36.83% 47.57%

    Monson&co ParaMor 59.05% 32.81% 42.19%

    Monson&co Paramor&Morfessor 51.45% 55.55% 53.42%

    Morfessor MAP 67.56% 36.92% 47.75%

  • ENGLISH

    AUTHOR METHOD PRECISION RECALL F-MEASURE

    Bernhard 1 72.05% 52.47% 60.72%

    Bernhard 2 61.63% 60.01% 60.81%

    Bordag 5 59.80% 31.50% 41.27%

    Bordag 5a 59.69% 32.12% 41.77%

    McNamee 3 43.47% 17.55% 25.01%

    Zeman - 52.98% 42.07% 46.90%

    Monson&co Morfessor 77.22% 33.95% 47.16%

    Monson&co ParaMor 48.46% 52.95% 50.61%

    Monson&co Paramor&Morfessor 41.58% 65.08% 50.74%

    Morfessor MAP 82.17% 33.08% 47.17%



4.1. Results (competition 1)

  • TURKISH

    AUTHOR METHOD PRECISION RECALL F-MEASURE

    Bernhard 1 78.22% 10.93% 19.18%

    Bernhard 2 73.69% 14.80% 24.65%

    Bordag 5 81.44% 17.45% 28.75%

    Bordag 5a 81.31% 17.58% 28.91%

    McNamee 3 65.00% 10.83% 18.57%

    McNamee 4 85.49% 6.59% 12.24%

    McNamee 5 94.80% 3.31% 6.39%

    Zeman - 65.81% 18.79% 29.23%

    Morfessor MAP 76.36% 24.50% 37.10%

  • FINNISH

    AUTHOR METHOD PRECISION RECALL F-MEASURE

    Bernhard 1 75.99% 25.01% 37.63%

    Bernhard 2 59.65% 40.44% 48.20%

    Bordag 5 71.72% 23.61% 35.52%

    Bordag 5a 71.32% 24.40% 36.36%

    McNamee 3 45.53% 8.56% 14.41%

    McNamee 4 68.09% 5.68% 10.49%

    McNamee 5 86.69% 3.35% 6.45%

    Zeman - 58.84% 20.92% 30.87%

    Morfessor MAP 76.83% 27.54% 40.55%



5.1. Problems of Morpheme Analysis

  • Surprise #1: nearly no effect on evaluation results! Possible reasons:

    • rules: not taking type frequency into account (hence overvaluing errors)

    • rules: not taking context into account (instead of _-5 better _5f-_fo)

    • segmentation: produces many errors, analysis has to put up with a lot of noise



5.2. Problems of Segmentation

  • Surprise #2: the size of the corpus has little influence on the quality of the segmentations

    • it influences only how many nearly perfect segmentations are found by LSV

    • but that is by far outweighed by the errors of the trie

  • The strength of LSV is to segment irregular words properly

    • because they have high frequency and are usually short

  • The strength of most other proposed methods lies in segmenting long and infrequent words

    • A combination is evidently desirable



5.3. Further avenues?

  • The most notable problem currently is the assumption that the phonemes representing a morph / morpheme cluster together, i.e. AAA + BBB usually becomes AAABBB, not ABABAB

  • For languages that merge morphemes this is inappropriate

  • Better solution perhaps similar to U-DOP by Rens Bod?

    • that means generating all possible parsing trees for each token

    • then collating them for the type and generating possible optimal parses

    • possibly generating tries not just for the type, but also for some context, for example with the relevant context highlighted: Yesterday we arrived by plane.



THANK YOU!

