
A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate

Aminul Islam, Diana Inkpen and Iluju Kiringa


Presentation Transcript


  1. ageneralizedapproachtowordsegmentationusingmaximumlengthdescendingfrequencyandentropyrate Aminul Islam, Diana Inkpen and Iluju Kiringa

  2. a generalized approach to word segmentation using maximum length descending frequency and entropy rate Aminul Islam, Diana Inkpen and Iluju Kiringa

  3. Outline • Word Segmentation • Applications of Word Segmentation • Categories of Word Segmentation • A Walk-Through Example • Evaluation and Experimental Results • Conclusion and Future Work • References

  4. Word Segmentation • We differentiate the terms word breaking and word segmentation • Word breaking refers to the process of segmenting known words predefined in a lexicon • Word segmentation refers to the process of both lexicon word segmentation and unknown-word (new-word) detection • The greatest barrier to word segmentation is recognizing unknown words, that is, words not in the segmenter's lexicon • Word segmentation is an important problem in many NLP tasks

  5. Applications of Word Segmentation • Solving semantic heterogeneity in databases [Madhavan et al. 2005] (e.g., custid, maxprice) • Speech recognition (where no explicit word boundary information is given) • Interpreting oriental written languages • Extracting words from a scanned document page

  6. Categories of Word Segmentation • Dictionary-based - only words that are stored in the dictionary can be identified • Corpus-based - covers a huge collection of both domain-dependent and domain-independent words • Hybrid approaches

  7. A Walk-Through Example • S = themostfavouritemusicofalltime • S' = {the, most, favourite, music, of, all, time}

  8. A Walk-Through Example • S = themostfavouritemusicofalltime • Candidate: favourite • Mf = {α1, α2, …, αn}, where αx is the minimum frequency required to be a valid word of length x • Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}
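
The threshold table on slide 8 can be read as a per-length acceptance test. The following is a minimal Python sketch, not the authors' code. It assumes the table is indexed by word length and that lengths beyond ten reuse the last value, which the slides do not state. The name `is_valid_word` and the counts in `toy_freq` are hypothetical and are not BNC figures.

```python
# Minimum-frequency thresholds copied from slide 8: MF[x - 1] is the
# minimum corpus frequency a candidate of length x needs.
MF = [1000, 500, 50, 16, 15, 12, 10, 3, 2, 2]


def is_valid_word(candidate, freq, mf=MF):
    """Return True if `candidate` is frequent enough for its length."""
    n = len(candidate)
    if n == 0:
        return False
    # Assumption: candidates longer than the table reuse its last entry.
    threshold = mf[min(n, len(mf)) - 1]
    return freq.get(candidate, 0) >= threshold


# Hypothetical counts, invented for illustration only.
toy_freq = {"favourite": 40, "alltime": 4, "music": 900, "o": 300}

for word in ("favourite", "alltime", "music", "o"):
    print(word, is_valid_word(word, toy_freq))
```

Under these toy counts a long word such as "favourite" passes with a small count, while the single letter "o" fails because one-letter words need a count of at least 1000.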

  9. A Walk-Through Example • S = themost musicofalltime • S' = {favourite} • Candidate: alltime • Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}

  10. A Walk-Through Example • S = themost musicofalltime • S' = {favourite} • Candidate: favour

  11. A Walk-Through Example • S = themost musicofalltime • S' = {favourite} • Candidate: musico • Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}

  12. A Walk-Through Example • S = themost musicofalltime • S' = {favourite} • Candidate: music • Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}

  13. A Walk-Through Example • S = themost ofalltime • S' = {favourite, music} • Candidate: vouri

  14. A Walk-Through Example • S = themost ofalltime • S' = {favourite, music} • Candidate: them • Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}

  15. A Walk-Through Example • S = ost ofalltime • S' = {them, favourite, music} • Candidate: time • Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}

  16. A Walk-Through Example • S = ost ofall • S' = {them, favourite, music, time} • Candidate: most

  17. A Walk-Through Example • S = ost ofall • S' = {them, favourite, music, time} • Candidate: fall • Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}

  18. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • Candidate: ost • Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}

  19. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • Candidate: o • Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2}
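
Slides 8 to 19 can be read as a greedy loop: take the longest sufficiently frequent string still present in S, move it to S', and repeat until nothing qualifies. The sketch below is one possible Python reading, not the authors' implementation. It assumes that ties between equally long candidates go to the more frequent one (inferred from "descending frequency" and from the order them, time, fall on the slides). The function names and the counts in `TOY_FREQ` are hypothetical.

```python
MF = [1000, 500, 50, 16, 15, 12, 10, 3, 2, 2]  # thresholds from slide 8

# Hypothetical counts, invented so the loop reproduces the slides.
TOY_FREQ = {"favourite": 40, "music": 900, "them": 1700, "time": 1500,
            "most": 900, "fall": 100, "the": 60000, "of": 30000,
            "all": 2800, "o": 300}


def threshold(length, mf=MF):
    # Assumption: lengths beyond the table reuse its last entry.
    return mf[min(length, len(mf)) - 1]


def best_candidate(gap, freq, mf=MF):
    """Longest valid substring of `gap` as (length, count, -start), or None."""
    best = None
    for length in range(len(gap), 0, -1):
        for start in range(len(gap) - length + 1):
            count = freq.get(gap[start:start + length], 0)
            if count >= threshold(length, mf):
                cand = (length, count, -start)
                if best is None or cand > best:
                    best = cand
        if best is not None:
            break  # a longer match always beats shorter ones
    return best


def extract_words(text, freq, mf=MF):
    """Return ordered (string, is_word) pieces; non-words are the residue."""
    pieces = [(text, False)]
    while True:
        choice = None  # (candidate, index of the piece it came from)
        for i, (s, is_word) in enumerate(pieces):
            if is_word:
                continue
            cand = best_candidate(s, freq, mf)
            if cand is not None and (choice is None or cand > choice[0]):
                choice = (cand, i)
        if choice is None:
            return pieces
        (length, _, neg_start), i = choice
        start = -neg_start
        s = pieces[i][0]
        split = [(s[:start], False), (s[start:start + length], True),
                 (s[start + length:], False)]
        pieces[i:i + 1] = [p for p in split if p[0]]


pieces = extract_words("themostfavouritemusicofalltime", TOY_FREQ)
print([s for s, is_word in pieces if is_word])
print([s for s, is_word in pieces if not is_word])
```

With these toy counts the loop accepts favourite, music, them, time and fall in that order and leaves "ost" and "o" as residue, which matches the state shown on slide 19.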

  20. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {} ← leftMaxMatching(themostfavourite) • Candidate: them

  21. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {} ← leftMaxMatching(themostfavourite)

  22. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them} ← leftMaxMatching(ostfavourite)

  23. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them} ← leftMaxMatching(ostfavourite) • Candidate: ost

  24. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them} ← leftMaxMatching(ostfavourite) • Candidate: os

  25. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os} ← leftMaxMatching(tfavourite) • Candidate: tf

  26. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os} ← leftMaxMatching(tfavourite) • Candidate: t

  27. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t} ← leftMaxMatching(favourite)

  28. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t} ← leftMaxMatching(favourite) • Candidate: favourite

  29. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching()

  30. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching() • {} ← rightMaxMatching(themostfavourite) • Candidate: favourite

  31. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching() • {favourite} ← rightMaxMatching(themost)

  32. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching() • {favourite} ← rightMaxMatching(themost) • Candidate: most

  33. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching() • {most, favourite} ← rightMaxMatching(the)

  34. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching() • {most, favourite} ← rightMaxMatching(the) • Candidate: the

  35. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching() • {the, most, favourite} ← rightMaxMatching()
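
The leftMaxMatching and rightMaxMatching traces on slides 20 to 35 behave like greedy longest-match segmentation from the left and from the right. The sketch below shows that general technique in Python. It is an approximation: the slides call the authors' version "modified" without giving details, so the plain lexicon lookup and the single-character fallback are assumptions. The function names and `TOY_LEXICON` are hypothetical.

```python
# Hypothetical lexicon, chosen so the output mirrors the slides.
TOY_LEXICON = {"the", "them", "most", "os", "favourite",
               "music", "of", "all", "fall", "mus", "ico"}


def left_max_matching(text, lexicon):
    """Greedy longest-prefix matching, scanning left to right."""
    words = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # Assumption: an unmatched single character becomes its own token.
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def right_max_matching(text, lexicon):
    """Greedy longest-suffix matching, scanning right to left."""
    words = []
    j = len(text)
    while j > 0:
        for i in range(0, j):
            # Same single-character fallback as above.
            if text[i:j] in lexicon or i == j - 1:
                words.append(text[i:j])
                j = i
                break
    words.reverse()  # restore reading order
    return words


print(left_max_matching("themostfavourite", TOY_LEXICON))
print(right_max_matching("themostfavourite", TOY_LEXICON))
```

With this toy lexicon the left scan commits to "them" first and is then forced into "os" and "t", while the right scan recovers "the", "most" and "favourite", as on slides 29 and 35.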

  36. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching() • {the, most, favourite} ← rightMaxMatching() • {music, of, all} ← leftMaxMatching(musicofall) • {mus, ico, fall} ← rightMaxMatching(musicofall) • [inequality lost in extraction]

  37. A Walk-Through Example • S = ost o • S' = {them, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching() • {the, most, favourite} ← rightMaxMatching() • {music, of, all} ← leftMaxMatching(musicofall) • {mus, ico, fall} ← rightMaxMatching(musicofall)

  38. A Walk-Through Example • S = o • S' = {the, most, favourite, music, fall, time} • {them, os, t, favourite} ← leftMaxMatching() • {the, most, favourite} ← rightMaxMatching() • {music, of, all} ← leftMaxMatching(musicofall) • {mus, ico, fall} ← rightMaxMatching(musicofall)

  39. A Walk-Through Example • S = (empty) • S' = {the, most, favourite, music, of, all, time} • {them, os, t, favourite} ← leftMaxMatching() • {the, most, favourite} ← rightMaxMatching() • {music, of, all} ← leftMaxMatching(musicofall) • {mus, ico, fall} ← rightMaxMatching(musicofall)

  40. A Walk-Through Example • S = (empty) • S' = {the, most, favourite, music, of, all, time} • {them, os, t, favourite} ← leftMaxMatching() • {the, most, favourite} ← rightMaxMatching() • {music, of, all} ← leftMaxMatching(musicofall) • {mus, ico, fall} ← rightMaxMatching(musicofall)

  41. A Walk-Through Example • S = themostfavouritemusicofalltime • S' = {the, most, favourite, music, of, all, time}

  42. Proposed Word Segmentation Method • Uses corpus type frequency to choose the type with maximum length and frequency from the ‘desegmented’ text • Uses a modified forward-backward matching based on maximum length and frequency for the residue • If forward and backward matching return the same number of words, the entropy rate decides which set of words is accepted
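
The selection rule on slide 42 can be sketched as a small Python function. Several parts are assumptions, since the slides give neither the entropy-rate formula nor the direction of the comparison: here the entropy rate is estimated as the average negative log2 unigram probability with add-one smoothing, the lower value wins, and when the word counts differ the shorter list wins (inferred from the walk-through, where the three-word result replaces the four-word one). All names and counts are hypothetical.

```python
import math

# Hypothetical counts, invented for illustration only.
TOY_FREQ = {"the": 60000, "most": 900, "favourite": 40, "them": 1700,
            "os": 5, "music": 900, "of": 30000, "all": 2800,
            "fall": 100, "mus": 1, "ico": 1}
TOTAL = sum(TOY_FREQ.values())


def entropy_rate(words, freq, total):
    """Average bits per word under a smoothed unigram model (an assumption)."""
    if not words:
        return 0.0
    vocab = len(freq)
    bits = 0.0
    for w in words:
        p = (freq.get(w, 0) + 1) / (total + vocab + 1)  # add-one smoothing
        bits -= math.log2(p)
    return bits / len(words)


def choose_segmentation(left, right, freq, total):
    """Pick between the left-to-right and right-to-left results."""
    if len(left) != len(right):
        # Assumption: fewer words is better when the counts differ.
        return left if len(left) < len(right) else right
    # Tie on word count: compare entropy rates (lower wins, an assumption).
    left_rate = entropy_rate(left, freq, total)
    right_rate = entropy_rate(right, freq, total)
    return left if left_rate <= right_rate else right


print(choose_segmentation(["them", "os", "t", "favourite"],
                          ["the", "most", "favourite"], TOY_FREQ, TOTAL))
print(choose_segmentation(["music", "of", "all"],
                          ["mus", "ico", "fall"], TOY_FREQ, TOTAL))
```

Under these toy counts the function keeps {the, most, favourite} and {music, of, all}, the same choices the walk-through ends with.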

  43. Evaluation and Experimental Results • We used type frequency from the BNC • We tested on the Brown corpus as well as on the BNC • We removed all spaces from the corpus • The performance is measured using Precision, Recall and F-measure
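
Precision, Recall and F-measure for a segmenter can be computed by comparing predicted words with reference words at matching character positions. The Python sketch below assumes word-level scoring over character spans; the slides do not say whether words or boundaries were scored. The function names are hypothetical.

```python
def spans(words):
    """Map a word sequence to the set of (start, end) character spans."""
    out = set()
    pos = 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def precision_recall_f(predicted, gold):
    """Word-level precision, recall and balanced F-measure."""
    p_spans, g_spans = spans(predicted), spans(gold)
    correct = len(p_spans & g_spans)  # words with identical start and end
    precision = correct / len(p_spans) if p_spans else 0.0
    recall = correct / len(g_spans) if g_spans else 0.0
    if correct == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


gold = ["the", "most", "favourite", "music", "of", "all", "time"]
predicted = ["them", "os", "t", "favourite", "music", "of", "all", "time"]
print(precision_recall_f(predicted, gold))
```

In this example five of the eight predicted words line up with the seven reference words, so precision is 5/8, recall is 5/7 and the F-measure is 2/3.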

  44. Evaluation and Experimental Results Figure: Test result on the Brown corpus

  45. Conclusion and Future Work • The method can effectively distill special terms and proper nouns • The proposed method segments words with high precision and recall • Top-choice search engines segment the ‘desegmented’ part of a search text only if that part contains two to three words • Future directions involve integrating the current algorithm into a larger system for comprehensive and context-based word analysis

  46. References • Brent, M.: An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning 34 (1999) 71–106 • Dale, R., Moisl, H. and Somers, H.: Handbook of Natural Language Processing. Marcel Dekker, Inc., New York (2000) 22–26 • Do, H.H. and Rahm, E.: COMA – A System for Flexible Combination of Schema Matching Approaches. In VLDB (2002) • Gao, J., Li, M., Wu, A. and Huang, C.-N.: Chinese word segmentation and named entity recognition: a pragmatic approach. Computational Linguistics 31(4) (2005) • Hua, Y.: Unsupervised word induction using MDL criterion. In Proceedings ISCSL2000, Beijing (2000) • Madhavan, J., Bernstein, P., Doan, A. and Halevy, A.: Corpus-based Schema Matching. In International Conference on Data Engineering (ICDE-05) (2005) • Peng, F. and Schuurmans, D.: A Hierarchical EM Approach to Word Segmentation. In Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS 2001), Tokyo, Japan (2001) 475–480

  47. Thanks
