Unsupervised Knowledge-Free Morpheme Boundary Detection

Unsupervised Knowledge-Free Morpheme Boundary Detection Stefan Bordag University of Leipzig Example • Related work • Part One: Generating training data • Part Two: Training and Applying a Classificator • Preliminary results • Further research

Example: clearly early The examples used throughout this presentation are clearly and early • In one case, the stem is clear and in the other early • Other word forms of same lemmas: • clearly: clearest, clear, clearer, clearing • early: earlier, erliest • Semantically related words: • clearly: logically, really, totally, weakly, … • early: morning, noon, day, month, time, … • Correct morpheme boundaries analysis: • clearly → clear-ly but not *clearl-y or *clea-rly • early → early or earl-y but not *ear-ly

1. Three approaches to morpheme boundary detection Three kinds of approaches: • Genetic Algorithms and the Minimum Description Length model • (Kazakov 97 & 01), (Goldsmith 01), (Creutz 03 & 05) • This approach utilizes only word list, not the context information for each word from corpus. • This possibly results in an upper limit on achievable performance (especially with regards to irregularities). • One advantage is that smaller corpora sufficient • Semantics based • (Schone & Jurafsky 01), (Baroni 03) • General problem of this approach with examples like deeply and deepness where semantic similarity is unlikely • Letter Successor Variety (LSV) based • (Harris 55), (Hafer & Weiss 74) first application, but low performance • Also applied only to a word list • Further hampered by noise in the data

root ly clear late late ear ¤ ¤ ¤ cl ¤ ¤ 2. New solution in two parts clear-ly lately early … compute LSV s = LSV * freq * multiletter * bigram The talk 1 Talk was 1 … Talk speech 20 Was is 15 … The talk wasvery informative similar words cooccurrences sentences train classifier clear-ly late-ly early … apply classifier

2.1. First part: Generating training data with LSV and distributed Semantics Overview: • Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information • Frequency of word A and B is nA and nB • Frequency of cooccurrence of A with B is nAB • Corpus size is n • Significance computation is Poisson approximation of log-likelihood (Dunning 93) (Quasthoff & Wolff 02)

Most significant left neighbors very quite so It‘s most it‘s shows results that‘s stated Quite Most significant right neighbors defined written labeled marked visible demonstrated superior stated shows demonstrates understood Neighbors of “clearly“ It’s clearly labeled clearly very clearly shows

2.2. New solution as combination of two existing approaches Overview: • Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information • Use these neighbor cooccurrences to find words that have similar cooccurrence profiles→ those that are surrounded by the same cooccurrences bear mostly the same grammatical marker

Most significant right neighbors defined written labeled marked visible demonstrated superior stated shows demonstrates understood Similar words to “clearly“ Most significant left neighbors very quite so It‘s most it‘s shows results that‘s stated Quite … weakly legally closely clearly greatly linearly really …

2.3. New solution as combination of two existing approaches Overview: • Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information • Use these neighbor cooccurrences to find words that have similar cooccurrence profiles→ those that are surrounded by the same cooccurrences bear mostly the same grammatical marker • Sort those words by edit distance and keep 150 most similar → since further words only add random noise

Most significant left neighbors very quite so It‘s most it‘s shows results that‘s stated Quite Most significant right neighbors defined written labeled marked visible demonstrated superior stated shows demonstrates understood Similar words to “clearly“ sorted by edit distance Sorted List clearly closely greatly legally linearly really weakly …

2.4. New solution as combination of two existing approaches Overview: • Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information • Use these neighbor cooccurrences to find words that have similar cooccurrence profiles→ those that are surrounded by the same cooccurrences bear mostly the same grammatical marker • Sort those words by edit distance and keep 150 most similar → since further words only add random noise • Compute letter successor variety for each transition between two characters of the input word Report boundaries where the LSV is above threshold

2.5. Letter successor variety • Letter successor variety: Harris (55) where word-splitting occurs if the number of distinct letters that follows a given sequence of characters surpasses the threshold. • Input are the 150 most similar words • Observing how many different letters occur after a part of the string: • #c- In the given list after #c- 5 letters • #cl- only 3 letters • #cle- only 1 letter • … • -ly# but reversed before –ly# 16 different letters (16 different stems preceding the suffix –ly#) # c l e a r l y # 28 5 3 1 1 1 1 1 f. left (thus after #cl 5 various letters) 1 1 2 1 3 16 10 14 f. right (thus before -y# 10 var. letters)

2.5.1. Balancing factors • LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise: • freq: Frequency differences between beginning and middle of word • multiletter: Representation of single phonemes with several letters • bigram: Certain fixed combinations of letters • Final score s for each possible boundary is then: s = LSV * freq * multiletter * bigram

2.5.2. Balancing factors: Frequency LSV is not normalized against frequency • 28 different first letters within 150 words • 5 different second letters within 11 words, beginning with c • 3 different third letters within 4 words, beginning with cl Computing frequency weight freq: • 4 out of 11 begin with #cl- then weight is 4/11 # cl e a r l y # 150 114 1 1 1 1 1 of 11 4 begin with cl 0.1 0.4 0.3 1 1 1 1 1 from left

2.5.3. Balancing factors: Multiletter Phonemes • Problem: Two or more letters which together represent one phoneme “carry away” the nominator for the overlap factor quotient: Letter split variety: # s c h l i m m e 7 1 7 2 1 1 2 2 1 1 1 2 4 15 Computing overlap factor: 150 27 1818 6 5 5 5 2 2 2 2 3 7 105 150 ^ thus at this point the LSV 7 is weighted 1 (18/18), but since sch is one phoneme, it should have been 18/150 ! • Solution: Ranking of bi- and trigrams, highest receives weight of 1.0 • Overlap factor is recomputed as weighted average: • In this case that means 1.0 * 27/150, since ‘sch’ is the highest trigram and has a weight of 1.0.

2.5.4. Balancing factors: Bigrams • It is obvious that –th– in English is almost never to be divided • Computation of bigram ranking over all words in word list and give 0.1 weight to highest ranked and 1.0 to lowest ranked. • LSV score then multiplied with resulting weight. • Thus, the German –ch- which is the highest ranked bigram receives a penalty of 0.1 and thus it is nearly impossible that it becomes a morpheme boundary

2.5.5. Sample computation Compute letter successor variety: # c l e a r - l y # # e a r l y # 28 5 3 1 1 1 1 1 40 5 1 1 2 1 1 1 2 1 3 16 10 10 1 2 1 4 6 19 Balancing: Frequencies: 150 11 4 1 1 1 1 1 150 9 2 2 2 1 1 1 2 2 5 7690150 1 2 2 6 19 150 Balancing: Multiletter weights: Bi l 0.4 0.1 0.5 0.2 0.5 0.0 0.2 0.2 0.5 0.0 Tri r 0.1 0.1 0.1 0.1 0.0 0.0 0.1 0.0 Bi l 0.5 0.2 0.5 0.0 0.1 0.3 0.5 0.0 0.1 0.3 Tri r 0.1 0.1 0.0 0.0 0.2 0.0 0.0 0.2 Balancing: Bigram weight: 0.1 0.5 0.2 0.5 0.0 0.1 0.2 0.5 0.0 0.1 Left and Right LSV scores: 0.1 0.3 0.0 0.4 1.0 0.9 0.0 0.0 0.5 1.7 0.3 0.9 0.1 0.012.4 3.7 1.0 0.0 0.7 0.2 Computing right score for clear-ly: 16*(76/90+0.1*76/150)/(1.0+0.1)*(1-0.0)=12.4 Sum scores for left and right: 0.4 1.2 0.1 0.4 13.4 4.6 1.0 0.1 1.2 2.0 threshold: 5 clear-ly early

3. Second Part: Training and Applying classifier • Any word list can be stored in a trie (Fredkin:60) or in a more efficient version of a trie, a PATRICIA compact tree (PCT) (Morrison:68) • Example: clearly early lately clear late root y r e l a a e r e a t a l l a e c ¤ l l ¤ ¤ ¤ = End or beginning of word ¤ c ¤

3.1. PCT as a Classificator root ly clear late root late ear ¤ ¤ ly ly=2 clear ¤=1 late ¤=1 ¤ cl ¤ late ly=1 ear ly=1 ¤ ¤ ¤=1 ¤=1 ¤ ¤ ly=1 cl ly=1 ¤ ¤=1 Apply deepest found node retrieve known information ¤ ly=1 Amazing?ly add known information dear?ly amazing-ly clear-ly, late-ly, early, Clear, late dearly

4. Evaluation Boundary measuring: each boundary detected can be correct or wrong (precision) or boundaries can be not detected (recall) • First evaluation is global LSV with the proposed improvements

Evaluating LSV Precision vs. Recall

Evaluating LSV F-measure

Evaluating combination Precision vs. Recall

Evaluating combination F-measure

Comparing combination with global LSV

4.1. Results German newspaper corpus with 35 million sentences English newspaper corpus with 13 million sentences

4.2. Statistics

4.3. Assessing true error rate • Typical sample list of words considered as wrong due to CELEX: • Tau-sende Tausend-e • senegales-isch-e senegalesisch-e • sensibelst-en sens-ibel-sten • separat-ist-isch-e separ-at-istisch-e • tris-t trist • triump-hal triumph-al • trock-en trocken • unueber-troff-en un-uebertroffen • trop-f-en tropf-en • trotz-t-en trotz-ten • ver-traeum-t-e vertraeumt-e • Reasons: • Gender –e (in (Creutz & Lagus 05) for example counted as correct) • compounds (sometimes separated, sometimes not) • -t-en Error • With proper names –isch often not analyzed • Connecting elements

4.4. Real example Orien-tal Orien-tal-ische Orien-tal-ist Orien-tal-ist-en Orien-tal-ist-ik Orien-tal-ist-in Orient-ier-ung Orient-ier-ungen Orient-ier-ungs-hilf-e Orient-ier-ungs-hilf-en Orient-ier-ungs-los-igkeit Orient-ier-ungs-punkt Orient-ier-ungs-punkt-e Orient-ier-ungs-stuf-e Ver-trau-enskrise Ver-trau-ensleute Ver-trau-ens-mann Ver-trau-ens-sache Ver-trau-ensvorschuß Ver-trau-ensvo-tum Ver-trau-ens-würd-igkeit Ver-traut-es Ver-trieb-en Ver-trieb-spartn-er Ver-triebene Ver-triebenenverbände Ver-triebs-beleg-e

5. Further research • Examine quality on various language types • Improve trie-based classificator • Possibly combine with other existing algorithms • Find out how to acquire morphology of non-concatenative languages • Deeper analysis: • find deletions • alternations • insertions • morpheme classes etc.

6. References • (Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Effcient unsupervized recursive word segmentation using minimun desctiption length. In Proceedings of Coling 2004, Geneva, Switzerland, 2004. GLDV-Tagung, pages 93-99, Leipzig, March 1998. Deutscher Universitätsverlag. • (Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213-248, 2003. France, http://www.sle.sharp.co.uk/senseval2/, 5-6 July 2001. • (Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using morfessor 1.0. In Publications in Computer and Information Science, Report A81. Helsinki University of Technology, March 2005. • (Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295-299, Adelaide, January 1998. • (Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.

6. References II • (Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198, 2001. • (Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974. • (Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190-222, 1955. • (Kazakov 97) Dimitar Kazakov. Unsupervised learning of na¨ive morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, Czech Republic, April 1997. • (Quasthoff & Wolff 02) Uwe Quasthoff and Christian Wolff. The poisson collocation measure and its applications. In Second International Workshop on Computational Approaches to Collocations. 2002. • (Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language-independent induction of part of speech class labels using only language universals. In Workshop at IJCAI-2001, Seattle, WA., August 2001. Machine Learning: Beyond Supervision.

E. Gender-e vs. Frequency-e vs. other-e andere 8.4 keine 6.8 rote 11.6 stolze 8.0 drehte 10.8 winzige 9.7 lustige 13.2 rufe 4.4 Dumme 12.6 vs. Gender-e Schule 8.4 Devise 7.8 Sonne 4.5 Abendsonne 5.3 Abende 5.5 Liste 6.5 Frequency-e Affe 2.7 Junge 5.3 Knabe 4.6 Bursche 2.4 Backstage 3.0

Unsupervised Knowledge-Free Morpheme Boundary Detection