
Presentation Transcript


  1. Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification 黃居仁 Chu-Ren Huang Academia Sinica http://cwn.ling.sinica.edu.tw/huang/huang.htm April 11, 2007, Hong Kong Polytechnic University

  2. Citation • Please note that this is our ongoing work that will be presented later as Chu-Ren Huang, Petr Šimon, Shu-Kai Hsieh and Laurent Prévot. 2007. Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification. To appear in the proceedings of the 2007 ACL Annual Meeting.

  3. Outline • Introduction: modeling and theoretical challenges • Previous Models • Segmentation as Tokenization • Character classification model • A radical model • Implementation and experiment • Conclusion/Implications

  4. Introduction: modeling and theoretical challenges • Back to the basics: The goal of Chinese word segmentation is to identify wordbreaks • Such that these segmented units can be used as processing units (i.e. words) • Crucially • Words are not identified before segmentation • Wordbreaks in Chinese fall at character-breaks only, and at no other places

  5. Challenge I Segmentation is the pre-requisite task for all Chinese processing applications, hence a realistic solution to segmentation must be • Robust: perform consistently regardless of language variations • Scalable: be applicable to all variants of Chinese and require minimal training • Portable: applicable for real-time processing of all kinds of texts, all the time

  6. Challenge II Chinese speakers perform segmentation subconsciously without mistakes, hence if we simulate human segmentation, it must: • Be Robust, Sharable, Portable • Not assume prior lexical knowledge • Be equally sensitive to known and unknown words

  7. So Far Not so Good • All existing algorithms perform reasonably well but require • A large set of training data • Long training time • A comprehensive lexicon • And the training process must be repeated with every new variant (topic/style/genre) But Why?

  8. Previous Models I: Segmentation as Tokenization The Classical Model (Chen and Liu 1992 etc.) • Segmentation is interpreted as the identification of tokens (e.g. words) in a text, hence it contains two steps • Dictionary Lookup • Unknown Word (or OOV) Resolution

  9. Segmentation as Tokenization 2 • Find all sequences Ci, …, Ci+m such that [Ci, …, Ci+m] is a token iff • it is an entry in the lexicon, or • it is not a lexical entry but is predicted to be one by an unknown word resolution algorithm • Ambiguity Resolution: needed when there is a Cj such that both [x, Cj] and [Cj, y] are entries in the lexicon (overlapping ambiguity)
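To make the dictionary-lookup step concrete, here is a minimal Python sketch using forward maximum matching, one common instantiation of lexicon-based tokenization; the lexicon, function name, and maximum word length are illustrative assumptions, not the algorithm of Chen and Liu (1992).

```python
# Minimal sketch of lexicon-based tokenization via forward maximum matching.
# The lexicon and max_len are toy assumptions for illustration only.
def max_match(text, lexicon, max_len=7):
    """Greedily take the longest lexicon entry at each position;
    unmatched characters fall through as single-character tokens (OOV)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"中文", "分詞", "研究"}
print(max_match("中文分詞研究", lexicon))  # ['中文', '分詞', '研究']
```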

  10. Segmentation as Tokenization 3 • High Complexity: • mapping tens of thousands of lexical entries to even more possible matching strings • Overlapping ambiguity estimated to be up to 20% depending on texts and lexica • Not Robust • Dependent on a lexicon (and lexica are notoriously changeable and expensive to build) • OOV?

  11. Previous Models II: Character Classification The Currently Popular Model (Xue 2003, Gao et al. 2004) • Segmentation is re-interpreted as classification of character positions. • Classify and tag each character according to its position in a word (initial, final, middle, etc.) • Learn the distribution of such classifications from a corpus • Predict segmentation based on the positional classification of a character in a string

  12. Character Classification 2 • Character Classification: • Each character Ci is associated with a 3-tuple Ci: <Ini_i, Mid_i, Fin_i>, where Ini_i, Mid_i, Fin_i are the probabilities for Ci to be in initial, middle, or final position respectively. • Ambiguity Resolution: • Multiple classifications of a character: a character does not occur exclusively as initial or final, etc. • Conflicting classifications of neighboring characters.
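As a toy illustration of the 3-tuple above, the sketch below estimates per-character positional distributions by simple counting over a tiny segmented corpus; this is an assumption about one way to obtain such numbers, not the implementation of Xue (2003) or Gao et al. (2004), who train sequence classifiers.

```python
# Count the positions a character occupies inside words of a segmented corpus
# and normalize to a per-character positional distribution (toy illustration).
from collections import Counter, defaultdict

def position_distributions(segmented_sentences):
    counts = defaultdict(Counter)
    for words in segmented_sentences:
        for w in words:
            for k, ch in enumerate(w):
                if len(w) == 1:
                    pos = "single"
                elif k == 0:
                    pos = "initial"
                elif k == len(w) - 1:
                    pos = "final"
                else:
                    pos = "middle"
                counts[ch][pos] += 1
    return {ch: {p: n / sum(c.values()) for p, n in c.items()}
            for ch, c in counts.items()}

corpus = [["中文", "分詞"], ["中", "文字"]]
print(position_distributions(corpus)["中"])  # {'initial': 0.5, 'single': 0.5}
```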

  13. Character Classification 3 • Lower Complexity: • ~6,000 characters × 3 to 10 positional classes • Higher Performance: 97% F-score on the SigHAN bakeoff (Huang and Zhao 2006)

  14. Character Classification 4 Inherent Modeling Problems • Segmentation becomes a second-order decision dependent on the first-order decision of character classification • Unnecessary complexity is involved • An inherent ceiling is set (segmentation cannot outperform character classification) • Still highly dependent on the lexicon • Character positions must be defined with prior lexical knowledge of a word

  15. Our New Proposal Naïve but Radical • Segmentation is nothing but segmentation • Possible segmentation sites are well-defined and unambiguous: they are simply the character-breaks clearly marked in any text. • The task is simply to identify all CB's which also function as wordbreaks (WB's) • Based on distributional information extracted from the contexts surrounding CB's (i.e. characters)

  16. Simple Formalization • Any Chinese text is envisioned as a sequence of character-breaks (CB's) evenly interleaved with a sequence of characters (c's): CB0 c1 CB1 c2 ... CBi-1 ci CBi ... CBn-1 cn CBn • NB: Psycholinguistic eye-tracking experiments show that the eyes can fixate on the edges of a character when reading Chinese. (J.L. Tsai, p.c.)
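A small illustration (not from the paper's code) of this formalization: a text is viewed as the alternating sequence CB0 c1 CB1 ... cn CBn, so every character-break is an explicitly available candidate segmentation site.

```python
# Represent a text as the interleaved sequence of character-breaks and characters.
def interleave(text):
    seq = ["CB0"]
    for i, ch in enumerate(text, start=1):
        seq.extend([ch, f"CB{i}"])
    return seq

print(interleave("中文分詞"))
# ['CB0', '中', 'CB1', '文', 'CB2', '分', 'CB3', '詞', 'CB4']
```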

  17. How to Model the Distributional Information of Blanks? • There is no overt difference between CB's and WB's. Unlike English, where the CB spaces are small but the WB spaces are BIG. • Hence distributional information must come from the context. • CB0 c1 CB1 c2 ... CBi-1 ci CBi ... CBn-1 cn CBn • Overtly, CB's carry no distributional information. • However, c's do carry information about the status of a CB/WB in their neighborhood (based on a tagged corpus, or human experience)

  18. Range of Relevant Context CBi-2 CBi-1 ci CBi+1 CBi+2 • Recall that CB's carry no overt information, while c's do. • Linguistically, it is attested that initial, final, second, and penultimate positions are morphologically significant. • In other words, a linguistic element can carry explicit information about the immediately adjacent CB's as well as the CB's immediately adjacent to those two • 2CB-Model: taking only the immediate ones • 4CB-Model: taking two more

  19. Collecting Distributional Information CBi-2 CBi-1 ci CBi+1 CBi+2 • Adopt either 2CBM or 4CBM • Collect a 2-tuple or 4-tuple for each character token from a segmented corpus • Sum up the n-tuple values for all tokens belonging to the same character type to form a distributional vector [Table 2: Character table for 4CBM]
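Here is a sketch of the 4CBM collection step: for each character token we record whether the boundaries at the four relevant positions around it are wordbreaks, then sum these tuples over all tokens of the same character type. The function name, the exact feature encoding, and the treatment of text edges as wordbreaks are assumptions, not the authors' implementation.

```python
# Collect 4CBM distributional vectors from a segmented corpus (sketch).
from collections import defaultdict

def collect_4cbm(segmented_sentences):
    vectors = defaultdict(lambda: [0, 0, 0, 0])
    for words in segmented_sentences:
        chars = "".join(words)
        # wb[i] == 1 iff the character-break before chars[i] is a wordbreak
        wb, pos = [0] * (len(chars) + 1), 0
        wb[0] = wb[len(chars)] = 1
        for w in words:
            pos += len(w)
            wb[pos] = 1
        for i, ch in enumerate(chars):
            # boundaries at offsets -2, -1, +1, +2 around chars[i];
            # out-of-range boundaries are treated as wordbreaks (an assumption)
            ctx = [wb[j] if 0 <= j <= len(chars) else 1
                   for j in (i - 1, i, i + 1, i + 2)]
            for k, v in enumerate(ctx):
                vectors[ch][k] += v
    return vectors

print(dict(collect_4cbm([["中文", "分詞"]])))
```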

  20. Estimating Distributional Features of CB's c-2 c-1 CB c+1 c+2 • For each CB, distributional information is contributed by the 2 or 4 adjacent characters • Each character carries the four-element vector given above; align the vector positions and then sum them up • Note that no knowledge from a lexicon is involved (whereas the character classification model makes an explicit decision about the position of that character in a word)

  21. Aligning Vector Positions c-2 c-1 CB c+1 c+2 • c-2: <V1, V2, V3, V4> • c-1: <V1, V2, V3, V4> • c+1: <V1, V2, V3, V4> • c+2: <V1, V2, V3, V4>
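A sketch of the alignment step: for a given character-break, each of the four surrounding characters contributes the one component of its distributional vector that refers to this very boundary. The indexing convention (vector components = boundaries at offsets -2, -1, +1, +2 from the character) follows the collect_4cbm sketch above and is an assumption, not the paper's exact feature layout.

```python
# Align vector positions: gather the four components that all refer to the
# same character-break (the break before chars[b], 0-based).
def cb_features(chars, b, vectors, default=(0, 0, 0, 0)):
    def comp(idx, k):
        # component k of the vector of chars[idx]; 0 if out of range or unseen
        return vectors.get(chars[idx], default)[k] if 0 <= idx < len(chars) else 0
    return [comp(b - 2, 3),   # this break lies two boundaries right of chars[b-2]
            comp(b - 1, 2),   # ... immediately right of chars[b-1]
            comp(b, 1),       # ... immediately left of chars[b]
            comp(b + 1, 0)]   # ... two boundaries left of chars[b+1]
```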

  22. Theoretical Issues in Modeling • Do we look beyond WB's (in 4CBM)? • No, characters cannot contribute to boundary conditions beyond an existing boundary. • Yes, we cannot assume lexical knowledge a priori (and the model is more elegant) • One or two features (in 4CBM)? • No, positive information (that there is a WB) and negative information (that there is no WB) should be complementary • Yes (especially when the answer to the above question is no), since there are under-specified cases

  23. Size of Distributional Info • The Sinica Corpus 5.0 contains 6,820 types of c's (characters, numbers, punctuation, Latin alphabet, etc.) • The 10-million-word corpus is converted into 14.4 million labeled CB vectors. • In this first study we implement a CB-only model, without any preprocessing of punctuation marks.

  24. How to Model Decision I • Assume that each character represents an independent event, hence all relevant vectors can be summed up and evaluated • Simple heuristic by sum and threshold • Decision tree trained on a segmented corpus • Machine learning trained on a segmented corpus?

  25. Simple Sum and Threshold Heuristic • Means of the sums of CB vectors for each class S and -S (mean sum for S = 2.90445651112, for -S = 1.89855870063) • A one-standard-deviation difference between each CB vector sum and the threshold values was used as the segmentation heuristic • 88% accuracy • Error analysis: CB vectors are not linearly separable
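A minimal sketch of the sum-and-threshold heuristic: sum the four aligned contributions for a character-break and call it a wordbreak if the sum exceeds a threshold. The threshold value below is a placeholder assumption; the slide derives its cutoff from the class means (around 2.90 for WB, 1.90 for non-WB) and one standard deviation.

```python
# Classify a character-break by comparing the sum of its aligned
# contributions against a threshold (placeholder value).
def is_wordbreak(cb_vector, threshold=2.4):
    """cb_vector: the four aligned contributions (e.g. from cb_features)."""
    return sum(cb_vector) > threshold

print(is_wordbreak([0.9, 0.8, 0.7, 0.6]))  # True: sum 3.0 exceeds 2.4
```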

  26. Decision Tree • A decision tree classifier (YaDT, Ruggieri 2004) is adopted • trained on a sample of 900,000 CB vectors, with 100,000 boundary vectors held out for the testing phase • Achieves up to 97% accuracy in the inside test, including numbers, punctuation and foreign words.
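For illustration, here is a sketch of the decision-tree step. The paper uses YaDT (Ruggieri 2004); scikit-learn's DecisionTreeClassifier is used below purely as a stand-in, and the variable names are illustrative.

```python
# Train a decision tree on labeled CB vectors (stand-in for YaDT).
from sklearn.tree import DecisionTreeClassifier

def train_cb_classifier(X_train, y_train):
    """X_train: one CB vector per row (e.g. from cb_features);
    y_train: 0/1 wordbreak labels."""
    return DecisionTreeClassifier().fit(X_train, y_train)

# predictions = train_cb_classifier(X_train, y_train).predict(X_test)
```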

  27. Evaluation: SigHAN Bakeoff • Note that our method is NOT designed for the SigHAN bakeoff, where resources are devoted to fine-tuning for a small extra edge in scoring • This radical model aims to be robust in real-world situations, where it can perform reliably without extra tuning when encountering different texts • No manual pre-processing; texts are input as seen

  28. Evaluation • Closed test, but without any lexical knowledge

  29. Discussion • The method is basically sound • We still need to develop an effective algorithm for adaptation to new variants • Automatic pre-processing of punctuation marks and foreign symbols should improve the performance • What role should lexical knowledge play? • The assumption that characters are independent events may be incorrect

  30. How to Model Decision II • Assume that a string of characters does not consist of independent events, hence certain combinations (as well as single characters) can contribute to the WB decision. • One possible implementation: c's as committee members, decision by vote • Five voting blocks deciding by simple majority: [c-2 c-1], [c-1], [c-1 c+1], [c+1], [c+1 c+2], over the context c-2 c-1 CB c+1 c+2
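A sketch of one assumed reading of this committee idea, not the paper's implementation: each of the five blocks around a character-break casts a 0/1 vote looked up in a table learned from a segmented corpus, and the break is a wordbreak on a strict majority. All names and the voting table are illustrative.

```python
# Committee-by-blocks sketch: five blocks around a character-break vote.
def blocks(chars, b):
    """Five voting blocks around the break before chars[b]: left bigram,
    left character, straddling bigram, right character, right bigram."""
    return [chars[max(b - 2, 0):b],
            chars[max(b - 1, 0):b],
            chars[max(b - 1, 0):b + 1],
            chars[b:b + 1],
            chars[b:b + 2]]

def vote_wordbreak(chars, b, block_votes):
    """block_votes maps a block string to a 0/1 vote; unseen blocks vote 0."""
    votes = [block_votes.get(blk, 0) for blk in blocks(chars, b)]
    return int(sum(votes) * 2 > len(votes))

print(vote_wordbreak("中文分詞", 2,
                     {"中文": 1, "文": 1, "文分": 0, "分": 0, "分詞": 1}))  # 1
```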

  31. Conclusion I • We propose a radical but elegant model for Chinese word segmentation • The task is reduced to binary classification of CB's into WB's and non-WB's • The model does not presuppose any lexical knowledge and relies only on distributional information of characters as the context for CB's

  32. Conclusion II • In principle, this model should be robust and scalable across all different variants of texts • Preliminary experimental results are promising yet leave room for improvement • Work is still on-going • You are welcome to adopt this model and experiment with your favorite algorithm!
