
Unsupervised and Knowledge-free Morpheme Segmentation and Analysis




  1. Unsupervised and Knowledge-free Morpheme Segmentation and Analysis Stefan Bordag University of Leipzig • Components • Detailing • Compound splitting • Iterated LSV • Split trie training • Morpheme Analysis • Results • Discussion

  2. 1. Components • The main components of the current LSV-based segmentation algorithm: • Compound splitter (new) • LSV component (new: iterated) • Trie classifier (new: split into two phases) • Morpheme analysis (entirely new), based on: • morpheme segmentation (see above) • clustering of morphs into morphemes • contextual similarity of morphemes • The main focus is on modularity, so that each module has a specific function and could be replaced by a better algorithm by someone else

  3. 2.1. Compound Splitter • Based on the observation that especially long words pose a problem for LSV • Simple heuristic: split a word whenever it is decomposable into several words that have • a minimum length of 4 • a minimum frequency of 10 (or some other arbitrary figures) • This misses many divisions but finds at least some correct ones (Precision being more important than Recall at this point) • P=88% R=10% F=18% • Where several decompositions are possible, decompositions with more words of higher frequency win
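The heuristic above can be sketched as follows (a minimal illustration, not the author's implementation; the tie-break "more words with higher frequencies win" is read here as "more parts first, then higher total part frequency"):

```python
# Sketch of the compound-splitting heuristic (illustrative names and layout).
MIN_LEN, MIN_FREQ = 4, 10

def split_compound(word, freq):
    """Return the best decomposition of `word` into known words, or None."""
    def decompose(rest):
        if not rest:
            return [[]]
        results = []
        for i in range(MIN_LEN, len(rest) + 1):
            part = rest[:i]
            # a part must be long enough and frequent enough in the corpus
            if freq.get(part, 0) >= MIN_FREQ:
                results += [[part] + tail for tail in decompose(rest[i:])]
        return results

    candidates = [d for d in decompose(word) if len(d) > 1]
    if not candidates:
        return None
    # assumed tie-break: more parts first, then higher total part frequency
    return max(candidates, key=lambda d: (len(d), sum(freq[p] for p in d)))

freq = {"foot": 120, "ball": 300, "football": 40}
print(split_compound("football", freq))  # ['foot', 'ball']
```

Returning None for undecomposable words matches the precision-over-recall stance: the splitter stays silent rather than guess.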

  4. 2.2. Original solution in two parts • Pipeline: sentences → co-occurrences → contextually similar words → compute LSV → score s = LSV * freq * multiletter * bigram → train classifier → apply classifier • Example: from sentences such as "The talk was very informative", co-occurrence counts (the/talk: 1, talk/was: 1, …) yield contextually similar words (talk/speech: 20, was/is: 15, …); LSV segments clear-ly, late-ly, early, …, which train a trie classifier that is then applied to produce clear-ly, late-ly, early, … (figure: a trie over clearly, lately, early)

  5. 2.3. Original Letter successor variety • Letter successor variety: Harris (1955); a word split is posited where the number of distinct letters that follow a given sequence of characters surpasses a threshold • Input: the 150 contextually most similar words • Observe how many different letters occur after a part of the string: • #cle- is followed by only 1 letter • but reversed, -ly# is preceded by 16 different letters (16 different stems preceding the suffix -ly#)

   #   c   l   e   a   r   l   y   #
  28   5   3   1   1   1   1   1       from left  (thus 5 different letters after #cl)
   1   1   2   1   3  16  10  14       from right (thus 10 different letters before -y#)
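A minimal sketch of the successor-variety counts (illustrative only; in the talk the input is the 150 contextually most similar words of the target, not a small raw vocabulary as here):

```python
# Count letter successor varieties for every boundary of `word`,
# forward (left-to-right) and backward (right-to-left), over `vocab`.
# Word boundaries are marked with '#', as on the slide.
def lsv_counts(word, vocab):
    words = ["#" + v + "#" for v in vocab]
    w = "#" + word + "#"
    left, right = [], []
    for i in range(1, len(w)):
        prefix = w[:i]
        # distinct letters directly after this prefix
        left.append(len({v[i] for v in words if len(v) > i and v.startswith(prefix)}))
        suffix = w[-i:]
        # distinct letters directly before this suffix
        right.append(len({v[-i - 1] for v in words if len(v) > i and v.endswith(suffix)}))
    return left, right[::-1]

vocab = ["clearly", "lately", "early", "clear", "late"]
left, right = lsv_counts("clearly", vocab)
```

With this toy vocabulary both count vectors rise at the clear|ly boundary, which is exactly the signal the threshold test looks for.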

  6. 2.4. Balancing factors • The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that would otherwise add noise: • freq: frequency differences between the beginning and the middle of a word • multiletter: representation of single phonemes by several letters • bigram: certain fixed combinations of letters • The final score s for each possible boundary is then: s = LSV * freq * multiletter * bigram

  7. 2.5. Iterated LSV • The iteration of LSV is based on previously found information • For example, when computing ignited with the most similar words already analysed into: • caus-ed, struck, injur-ed, blazed, fire, … • there is more evidence for ignit-ed, because most words ending in -ed were found to have -ed as a morpheme • Implemented in the form of a weight iterLSV: iterLSV = #wordsEndingIsMorph / #wordsSameEnding • hence: s = LSV * freq * multiletter * bigram * iterLSV
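The iterLSV weight is a plain ratio; a small sketch (the `analyses` data structure is an assumption, mapping each word to its current morph list from earlier iterations):

```python
# Sketch of the iterLSV weight: the fraction of words with a given ending
# that have already been analysed as having that ending as a morpheme.
def iter_lsv(ending, analyses):
    """#wordsEndingIsMorph / #wordsSameEnding for the proposed `ending`."""
    same = [w for w in analyses if w.endswith(ending)]
    if not same:
        return 0.0
    as_morph = [w for w in same if analyses[w][-1] == ending]
    return len(as_morph) / len(same)

analyses = {"caused": ["caus", "ed"], "injured": ["injur", "ed"],
            "blazed": ["blazed"], "struck": ["struck"], "fire": ["fire"]}
# caus-ed and injur-ed support -ed; blazed ends in -ed but was not split -> 2/3
print(iter_lsv("ed", analyses))
```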

  8. 2.6. Patricia Compact Trie as Classifier • Training: from the segmented words clear-ly, late-ly, early and the unsegmented clear, late, build a trie over the word endings, storing at each node how often each known split was seen (e.g. the node for -ly stores ly=2) • Application: for a new word (amazing?ly, dear?ly), descend to the deepest found node and retrieve the known information stored there: amazing-ly, but dearly unsplit (figure: tries over clearly, lately, early with per-node split counts)

  9. 2.7. Splitting trie application • The trie classifier could decide for ignit-ed based on the top node in the trie from the back: • -d with classes -ed:50; -d:10; -ted:5; … • hence not taking any context in the word into account • The new version save_trie (as opposed to rec_trie) trains one trie from the LSV data and decides only if • at least one more letter matches in the word, in addition to the letters of the proposed morpheme • save_trie and rec_trie are then trained and applied consecutively • Example: save_trie => ignited (no decision), rec_trie => ignit-ed (from caus-ed, injur-ed)
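A toy reconstruction of the two decision modes (the flat ending-dictionary stands in for the Patricia trie, and all names are assumptions; only the "at least one extra matching letter" rule is taken from the slide):

```python
# Sketch of a suffix-trie-style classifier in the spirit of rec_trie/save_trie.
from collections import defaultdict

def train(pairs):
    """pairs: (word, suffix_morph) from the LSV step, e.g. ('caused', 'ed').
    For every ending of every word, count which suffix was split off."""
    trie = defaultdict(lambda: defaultdict(int))
    for word, suffix in pairs:
        for i in range(1, len(word) + 1):
            trie[word[-i:]][suffix] += 1
    return trie

def classify(word, trie, safe=True):
    """Walk to the deepest known ending of `word` and return the majority
    suffix. In 'safe' mode, only decide if the matched ending is at least
    one letter longer than the proposed suffix."""
    for i in range(len(word), 0, -1):          # deepest matching node first
        ending = word[-i:]
        if ending in trie:
            suffix = max(trie[ending], key=trie[ending].get)
            if safe and i <= len(suffix):      # no extra context matched
                continue
            return suffix
    return None

trie = train([("caused", "ed"), ("injured", "ed")])
print(classify("ignited", trie, safe=True))    # -> None (save_trie stays silent)
print(classify("ignited", trie, safe=False))   # -> 'ed' (rec_trie splits ignit-ed)
```

For "ignited" only the ending "-ed" itself is known from training, so the safe mode refuses to decide, mirroring the save_trie => ignited, rec_trie => ignit-ed example above.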

  10. 2.8. Effect of the improvements • compounds: P=88% R=10% F=18% • compounds + recTrie: P=66% R=28% F=39% • compounds + lsv_0 + recTrie: P=71% R=58% F=64% • compounds + lsv_2 + recTrie: P=69% R=63% F=66% • compounds + lsv_2 + saveTrie + recTrie: P=69% R=66% F=67% • Most notably, these changes reach the same performance level as the original lsv_0 + recTrie (F=70) on a corpus three times smaller • However, applying the method to a corpus three times bigger only increases the number of words split, not the quality of the splits!

  11. 3. Morpheme Analysis • Assumes visible morphs (i.e. the output of a segmentation algorithm) • This makes it possible to compute co-occurrences of morphs • which enables computing the contextual similarity of morphs • which enables clustering morphs into morphemes • Traditional representation of morphemes: • barefooted BARE FOOT +PAST • flying FLY_V +PCP1 • footprints FOOT PRINT +PL • Equivalent representation used for processing: • barefooted bare 5foot.6foot.foot ed • flying fly inag.ing.ingu.iong • footprints 5foot.6foot.foot prints

  12. 3.1. Computing alternations

      for each morph m:
          for each contextually similar morph s of m:
              if LD_similar(s, m):
                  r = makeRule(s, m)
                  store(r -> s, m)
      for each word w:
          for each morph m of w:
              if in_store(m):
                  sig = createSignature(m)
                  write sig
              else:
                  write m

  Example: m = foot, s = {feet, 5foot, …}; LD(foot, 5foot) = 1 yields the rule _-5 -> foot,5foot. For barefooted = {bare, foot, ed}, foot has the rules _-5 and _-6, hence sig: foot.5foot.6foot
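The pseudocode above can be fleshed out; make_rule and signature below are reconstructions inferred from the foot/5foot/_-5 example, not the author's code:

```python
# Reconstructed sketch of makeRule / createSignature from the slide's example.
def levenshtein(a, b):
    """Plain dynamic-programming edit distance (LD)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def make_rule(m, s):
    """Name the alternation between two morphs as 'x-y' ('_' = empty string),
    e.g. make_rule('foot', '5foot') -> '_-5'."""
    i = 0  # strip the common suffix ...
    while i < min(len(m), len(s)) and m[len(m) - 1 - i] == s[len(s) - 1 - i]:
        i += 1
    a, b = m[:len(m) - i], s[:len(s) - i]
    j = 0  # ... then the common prefix
    while j < min(len(a), len(b)) and a[j] == b[j]:
        j += 1
    return (a[j:] or "_") + "-" + (b[j:] or "_")

def signature(m, similar):
    """Dot-joined, sorted set of LD-similar variants of morph m."""
    variants = {m} | {s for s in similar if levenshtein(m, s) <= 1}
    return ".".join(sorted(variants))

print(make_rule("foot", "5foot"))                     # _-5
print(signature("foot", ["feet", "5foot", "6foot"]))  # 5foot.6foot.foot
```

Note that "feet" is dropped from the signature because LD(foot, feet) = 2, which matches the LD_Similar gate in the pseudocode.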

  13. 3.2. Real examples

  Rules:
  • m-s : 49.0 barem,bares blum,blus erem,eres estem,estes etem,etes eurem,eures ifm,ifs igem,iges ihrem,ihres jedem,jedes lme,lse losem,loses mache,sache mai,sai
  • _-u : 46.0 bahn,ubahn bdi,bdiu boot,uboot bootes,ubootes cor,coru dejan,dejuan dem,demu dem,deum die,dieu em,eum en,eun en,uen erin,eurin
  • m-r : 44.0 barem,barer dem,der demselb,derselb einem,einer ertem,erter estem,ester eurem,eurer igem,iger ihm,ihr ihme,ihre ihrem,ihrer jedem,jeder

  Signatures:
  • muessen muess.muesst.muss en
  • ihrer ihre.ihrem.ihren.ihrer.ihres
  • werde werd.wird.wuerd e
  • Ihren ihre.ihrem.ihren.ihrer.ihres.ihrn

  14. 3.3. More examples

  • kabinettsaufteilung kabinet.kabinett.kabinetts aauf.aeuf.auf.aufs.dauf.hauf tail.teil.teile.teils.teilt bung.dung.kung.rung.tung.ung.ungs
  • entwaffnungsbericht enkt.ent.entf.entp waff.waffn.waffne.waffnet lungs.rungs.tungs.ung.ungn.ungs berich.bericht
  • grundstuecksverwaltung gruend.grund stuecks nver.sver.veer.ver walt bung.dung.kung.rung.tung.ung.ungs
  • grundt gruend.grund t

  15. 4. Results (competition 1)

  GERMAN
  AUTHOR     METHOD              PRECISION  RECALL  F-MEASURE
  Bernhard   1                   63.20%     37.69%  47.22%
  Bernhard   2                   49.08%     57.35%  52.89%
  Bordag     5                   60.71%     40.58%  48.64%
  Bordag     5a                  60.45%     41.57%  49.27%
  McNamee    3                   45.78%      9.28%  15.43%
  Zeman      -                   52.79%     28.46%  36.98%
  Monson&co  Morfessor           67.16%     36.83%  47.57%
  Monson&co  ParaMor             59.05%     32.81%  42.19%
  Monson&co  ParaMor&Morfessor   51.45%     55.55%  53.42%
  Morfessor  MAP                 67.56%     36.92%  47.75%

  ENGLISH
  AUTHOR     METHOD              PRECISION  RECALL  F-MEASURE
  Bernhard   1                   72.05%     52.47%  60.72%
  Bernhard   2                   61.63%     60.01%  60.81%
  Bordag     5                   59.80%     31.50%  41.27%
  Bordag     5a                  59.69%     32.12%  41.77%
  McNamee    3                   43.47%     17.55%  25.01%
  Zeman      -                   52.98%     42.07%  46.90%
  Monson&co  Morfessor           77.22%     33.95%  47.16%
  Monson&co  ParaMor             48.46%     52.95%  50.61%
  Monson&co  ParaMor&Morfessor   41.58%     65.08%  50.74%
  Morfessor  MAP                 82.17%     33.08%  47.17%

  16. 4.1. Results (competition 1)

  TURKISH
  AUTHOR     METHOD   PRECISION  RECALL  F-MEASURE
  Bernhard   1        78.22%     10.93%  19.18%
  Bernhard   2        73.69%     14.80%  24.65%
  Bordag     5        81.44%     17.45%  28.75%
  Bordag     5a       81.31%     17.58%  28.91%
  McNamee    3        65.00%     10.83%  18.57%
  McNamee    4        85.49%      6.59%  12.24%
  McNamee    5        94.80%      3.31%   6.39%
  Zeman      -        65.81%     18.79%  29.23%
  Morfessor  MAP      76.36%     24.50%  37.10%

  FINNISH
  AUTHOR     METHOD   PRECISION  RECALL  F-MEASURE
  Bernhard   1        75.99%     25.01%  37.63%
  Bernhard   2        59.65%     40.44%  48.20%
  Bordag     5        71.72%     23.61%  35.52%
  Bordag     5a       71.32%     24.40%  36.36%
  McNamee    3        45.53%      8.56%  14.41%
  McNamee    4        68.09%      5.68%  10.49%
  McNamee    5        86.69%      3.35%   6.45%
  Zeman      -        58.84%     20.92%  30.87%
  Morfessor  MAP      76.83%     27.54%  40.55%

  17. 5.1. Problems of Morpheme Analysis • Surprise #1: nearly no effect on the evaluation results! Possible reasons: • rules: type frequency is not taken into account (hence errors are overvalued) • rules: context is not taken into account (instead of _-5, better _5f-_fo) • segmentation: produces many errors, so the analysis has to put up with a lot of noise

  18. 5.2. Problems of Segmentation • Surprise #2: the size of the corpus has no large influence on the quality of the segmentations • it only influences how many nearly perfect segmentations are found by LSV • but that is by far outweighed by the errors of the trie • The strength of LSV is segmenting irregular words properly • because they have high frequency and are usually short • The strength of most other proposed methods lies in segmenting long and infrequent words • A combination is evidently desirable

  19. 5.3. Further avenues? • The most notable problem currently is the assumption that the phonemes representing a morph / morpheme are contiguous, that is, AAA + BBB usually becomes AAABBB, not ABABAB • For languages that merge morphemes this is inappropriate • A better solution might be similar to U-DOP by Rens Bod: • generate all possible parse trees for each token • then collate them for the type and generate possible optimal parses • possibly generating trees not just for the type, but also for some context, for example with the relevant context highlighted: Yesterday we arrived by plane.

  20. THANK YOU!
