ageneralizedapproachtowordsegmentationusingmaximumlengthdescendingfrequencyandentropyrate

A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate
Aminul Islam, Diana Inkpen and Iluju Kiringa

Outline
• Word Segmentation
• Applications of Word Segmentation
• Categories of Word Segmentation
• A Walk-Through Example
• Evaluation and Experimental Results
• Conclusion and Future Work
• References

Word Segmentation
• We differentiate the terms word breaking and word segmentation
• Word breaking refers to the process of segmenting known words predefined in a lexicon
• Word segmentation refers to the process of both lexicon word segmentation and unknown-word (new-word) detection
• The greatest barrier to word segmentation is recognizing unknown words, that is, words not in the segmenter's lexicon
• Word segmentation is an important problem in many NLP tasks

Applications of Word Segmentation
• Solving semantic heterogeneity in databases [Madhavan et al. 2005] (e.g., custid, maxprice)
• Speech recognition (where no explicit word boundary information is given)
• Interpreting oriental written languages
• Extracting words from a scanned document page

Categories of Word Segmentation
• Dictionary-based: only words that are stored in the dictionary can be identified
• Corpus-based: covers a huge collection of both domain-dependent and domain-independent words
• Hybrid approaches

A Walk-Through Example
S = themostfavouritemusicofalltime
S' = {the, most, favourite, music, of, all, time}

A Walk-Through Example
Mf = {α1, α2, …, αn}, where αx is the minimum frequency required to be a valid word of length x
Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}
• S = themostfavouritemusicofalltime; candidate: favourite
• S = themost musicofalltime; S' = {favourite}; candidates: alltime, favour, musico, music
• S = themost ofalltime; S' = {favourite, music}; candidates: vouri, them
• S = ost ofalltime; S' = {them, favourite, music}; candidate: time
• S = ost ofall; S' = {them, favourite, music, time}; candidates: most, fall
• S = ost o; S' = {them, favourite, music, fall, time}; candidates: ost, o
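The first phase of the walk-through can be approximated in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name, the iteration order (longest length first, then descending frequency within a length) and every count in `FREQ` are assumptions made for illustration. Only the thresholds in `MF` come from the slides.

```python
# Minimum frequency for a valid word of length x, stored at index x - 1
# (values taken from the slides).
MF = [1000, 500, 50, 16, 15, 12, 10, 3, 2, 2]

# Hypothetical type frequencies, invented for this illustration only.
FREQ = {
    "favourite": 3000, "alltime": 1, "musico": 2, "music": 15000,
    "them": 170000, "time": 150000, "most": 98000, "fall": 9000,
    "the": 6000000, "of": 3000000, "all": 280000, "ost": 20, "o": 900,
}

def extract_by_length_and_frequency(text, freq, mf):
    """Pull words out of text, longest types first, most frequent first."""
    segments = [text]   # pieces of the input that are still unsegmented
    words = []          # accepted words, in order of acceptance
    for length in range(len(mf), 0, -1):
        # Types of this length that meet the frequency threshold,
        # ordered by descending frequency.
        candidates = sorted(
            (w for w, f in freq.items()
             if len(w) == length and f >= mf[length - 1]),
            key=lambda w: -freq[w],
        )
        for word in candidates:
            remaining = []
            for seg in segments:
                parts = seg.split(word)
                # Every split point is one occurrence of the word.
                words.extend([word] * (len(parts) - 1))
                remaining.extend(p for p in parts if p)
            segments = remaining
    return words, segments

words, residue = extract_by_length_and_frequency(
    "themostfavouritemusicofalltime", FREQ, MF)
print(words)    # ['favourite', 'music', 'them', 'time', 'fall']
print(residue)  # ['ost', 'o']
```

With these invented counts the sketch reproduces the state reached in the walk-through: "them" and "fall" are taken before the shorter "the", "of" and "all" are ever tried, which leaves the residues "ost" and "o" for the matching phase.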

A Walk-Through Example
S = ost o
S' = {them, favourite, music, fall, time}
• {} ← leftMaxMatching(themostfavourite); candidate: them
• {them} ← leftMaxMatching(ostfavourite); candidates: ost, os
• {them, os} ← leftMaxMatching(tfavourite); candidates: tf, t
• {them, os, t} ← leftMaxMatching(favourite); candidate: favourite
• {them, os, t, favourite} ← leftMaxMatching()
• {} ← rightMaxMatching(themostfavourite); candidate: favourite
• {favourite} ← rightMaxMatching(themost); candidate: most
• {most, favourite} ← rightMaxMatching(the); candidate: the
• {the, most, favourite} ← rightMaxMatching()
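The two matching passes traced above can be sketched as plain forward and backward maximum matching. The names `left_max_matching` and `right_max_matching` mirror the slide's leftMaxMatching and rightMaxMatching, but the bodies are assumptions: word validity is reduced to membership in a small hypothetical `LEXICON` (the slides use frequency thresholds instead), and a single character is accepted as a fallback when nothing longer matches.

```python
# Hypothetical lexicon, chosen so that the sketch follows the walk-through.
LEXICON = {
    "the", "them", "most", "os", "t", "favourite",
    "music", "of", "all", "mus", "ico", "fall",
}

def left_max_matching(text, lexicon):
    """Scan left to right, always taking the longest valid prefix."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # Fall back to a single character if nothing longer matches.
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def right_max_matching(text, lexicon):
    """Scan right to left, always taking the longest valid suffix."""
    words, j = [], len(text)
    while j > 0:
        for i in range(0, j):
            # Fall back to a single character if nothing longer matches.
            if text[i:j] in lexicon or i == j - 1:
                words.insert(0, text[i:j])
                j = i
                break
    return words

print(left_max_matching("themostfavourite", LEXICON))
# ['them', 'os', 't', 'favourite']
print(right_max_matching("themostfavourite", LEXICON))
# ['the', 'most', 'favourite']
print(left_max_matching("musicofall", LEXICON))
# ['music', 'of', 'all']
print(right_max_matching("musicofall", LEXICON))
# ['mus', 'ico', 'fall']
```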

A Walk-Through Example
S = ost o
S' = {them, favourite, music, fall, time}
{them, os, t, favourite} ← leftMaxMatching()
{the, most, favourite} ← rightMaxMatching()
{music, of, all} ← leftMaxMatching(musicofall)
{mus, ico, fall} ← rightMaxMatching(musicofall)
[comparison lost in extraction: as … > …]
• S = o; S' = {the, most, favourite, music, fall, time}
• S = (empty); S' = {the, most, favourite, music, of, all, time}

A Walk-Through Example
S = themostfavouritemusicofalltime
S' = {the, most, favourite, music, of, all, time}

Proposed Word Segmentation Method
• Uses corpus type frequency to choose the type with maximum length and frequency from the ‘desegmented’ text
• Uses a modified forward-backward matching, based on maximum length frequency, for the residue
• If forward and backward matching return the same number of words, the entropy rate decides which set of words to accept
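The selection between the two matching results might look like the sketch below. Several points are assumptions rather than statements from the slides: preferring the result with fewer words is inferred from the walk-through (three words were kept over four), the slides do not give the entropy-rate formula, so the score here is a stand-in (average negative log2 relative frequency per word), and treating the lower score as better is likewise a guess. The counts and corpus size are invented.

```python
import math

# Hypothetical counts and corpus size, invented for illustration.
COUNTS = {"music": 15000, "of": 3000000, "all": 280000,
          "mus": 5, "ico": 3, "fall": 9000}
TOTAL = 100_000_000

def avg_neg_log_prob(words, counts, total):
    """Stand-in score: mean of -log2(relative frequency) over the words.

    Unseen words are given a count of 1 so the logarithm is defined.
    """
    return -sum(math.log2(counts.get(w, 1) / total) for w in words) / len(words)

def choose_segmentation(left, right, score):
    """Prefer fewer words; on a tie, prefer the lower score (assumed direction)."""
    if len(left) != len(right):
        return left if len(left) < len(right) else right
    return left if score(left) <= score(right) else right

score = lambda ws: avg_neg_log_prob(ws, COUNTS, TOTAL)

print(choose_segmentation(["them", "os", "t", "favourite"],
                          ["the", "most", "favourite"], score))
# ['the', 'most', 'favourite']
print(choose_segmentation(["music", "of", "all"],
                          ["mus", "ico", "fall"], score))
# ['music', 'of', 'all']
```

Passing the score in as a function keeps the tie-break replaceable, so a proper entropy-rate estimate can be substituted without touching the selection logic.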

Evaluation and Experimental Results
• We used type frequency from the BNC
• We tested on the Brown corpus as well as on the BNC
• We removed all spaces from the corpus
• Performance is measured using Precision, Recall and F-measure
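Precision, Recall and F-measure for a segmenter are commonly computed over word spans, as in the sketch below. The slides do not say how matches were counted, so the span-based counting and the function names are assumptions. A predicted word counts as correct only if it starts and ends at the same character offsets as a word in the reference segmentation.

```python
def spans(words):
    """Map a word sequence to the set of (start, end) character offsets."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def precision_recall_f(gold, predicted):
    """Span-level precision, recall and F-measure (harmonic mean)."""
    g, p = spans(gold), spans(predicted)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f

gold = ["the", "most", "favourite", "music", "of", "all", "time"]
pred = ["them", "os", "t", "favourite", "music", "of", "all", "time"]
p, r, f = precision_recall_f(gold, pred)
print(p)            # 0.625
print(round(r, 3))  # 0.714
print(round(f, 3))  # 0.667
```

Here five of the eight predicted words line up with the seven reference words, giving precision 5/8, recall 5/7 and an F-measure of 2/3.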

Evaluation and Experimental Results
Figure: Test result on the Brown corpus

Conclusion and Future Work
• The method can effectively distill special terms and proper nouns
• The proposed method segments words with high precision and recall
• Top-choice search engines segment the ‘desegmented’ part of a search text only if that part contains two to three words
• Future directions involve integrating the current algorithm into a larger system for comprehensive and context-based word analysis

References
• Brent, M.: An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning 34 (1999) 71–106
• Dale, R., Moisl, H. and Somers, H.: Handbook of Natural Language Processing. Marcel Dekker, Inc., New York (2000) 22–26
• Do, H.H. and Rahm, E.: COMA – A System for Flexible Combination of Schema Matching Approaches. In VLDB (2002)
• Gao, J., Li, M., Wu, A. and Huang, C.-N.: Chinese word segmentation and named entity recognition: a pragmatic approach. Computational Linguistics 31(4) (2005)
• Hua, Y.: Unsupervised word induction using MDL criterion. In Proceedings ISCSL2000, Beijing (2000)
• Madhavan, J., Bernstein, P., Doan, A. and Halevy, A.: Corpus-based Schema Matching. In International Conference on Data Engineering (ICDE-05) (2005)
• Peng, F. and Schuurmans, D.: A Hierarchical EM Approach to Word Segmentation. In Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS 2001), Tokyo, Japan (2001) 475–480

Thanks
