
Re-organization of IR/CSC team




  1. Re-organization of IR/CSC team • Hongchao He • Conf. follow up TREC-10, NTCIR • Paper follow up ICCLP, SIGIR paper • Guihong Cao • MSKK-III – Clustering for technique transfer • Yang Wen • MSKK-III – Distance word dependency • Min Zhang • MSKK/CSC – Entropy based pruning for applications of (Pinyin/Hiragana) input system

  2. Chinese Spelling Checking (or, the Big CSC) • Jianfeng Gao • NLC Group, MSRCN

  3. Outline • Introduction • Chinese spelling checking • Our approach • Key techniques and experiments • Milestones

  4. Introduction • Goal: Automatically correct Chinese spelling errors using MS-Pinyin (MSPY) input system • Chinese spelling errors using MS-Pinyin input system • Chinese spelling error patterns • English spelling checking • Why CSC is difficult?

  5. Chinese spelling errors using MSPY • Text in the brain → (Pinyin/phonetic errors) → Syllable → (Typographic errors) → Key stroke (typing) → (System errors) → Converted text

  6. Chinese spelling error patterns • Substitution errors • Pinyin error • System error (includes Pinyin error in some systems) • Non-substitution errors → word segmentation errors • Typographic errors – insertion/deletion/transposition

  7. English spelling checking • Non-word error detection (“the” → “hte”) • N-gram (letter) analysis • Dictionary lookup • Real-word error detection (“from” → “form”) • NLP – parser driven • Statistical approach – data/error driven • Local – n-gram language model, depends on a pre-defined confusion set • Global – Winnow, Bayesian, TBL, etc. • Problem – lack of error detection
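
The two English cases on this slide can be sketched in a few lines: dictionary lookup for non-word errors, and a local n-gram check over a pre-defined confusion set for real-word errors. The toy corpus, the bigram (rather than trigram) counts, and all function names are assumptions made for illustration; they are not taken from the talk.

```python
from collections import Counter

# Toy corpus standing in for real training text (an assumption).
CORPUS = ("i got a letter from my friend . please fill in the form . "
          "a letter from home").split()
VOCAB = set(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
CONFUSION_SETS = [{"from", "form"}]  # pre-defined confusion set


def non_word_errors(tokens):
    """Dictionary lookup: flag tokens that are not in the vocabulary."""
    return [t for t in tokens if t not in VOCAB]


def bigram_score(tokens, i, candidate):
    """Add-one smoothed count score of `candidate` at position i,
    using its left and right neighbours."""
    left = BIGRAMS[(tokens[i - 1], candidate)] if i > 0 else 0
    right = BIGRAMS[(candidate, tokens[i + 1])] if i + 1 < len(tokens) else 0
    return (left + 1) * (right + 1)


def correct_real_words(tokens):
    """Replace a confusable word only when another member of its
    confusion set scores strictly higher in the local context."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        for cset in CONFUSION_SETS:
            if tok not in cset:
                continue
            best = tok
            for cand in sorted(cset - {tok}):
                if bigram_score(tokens, i, cand) > bigram_score(tokens, i, best):
                    best = cand
            out[i] = best
    return out


print(non_word_errors("hte letter".split()))              # ['hte']
print(correct_real_words("a letter form my friend".split()))
```

Note that the real-word step can only fire on words listed in a confusion set, which is the coverage limitation the slide points at.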

  8. Why CSC is difficult? • Word segmentation • Ambiguous • OOV – Proper noun detection (personal name, location, organization, etc.) • Segmentation error propagation • Non-word errors (in the English sense) do not exist • MSPY already makes good use of a word trigram language model

  9. Chinese spelling checking • CSC – related works • Template matching – long distance, e.g. <之所以> <是因为> • Pattern matching – long words (n>=3), e.g. 一文不明 → 一文不名, 忠耿耿 → 忠心耿耿 • N-gram models – substitution errors • CSC – challenges • Long distance, coverage issue of template/pattern set • High-frequency confusion sets, e.g. {像,象} {在,再} • OOV, especially proper nouns • N-gram models are already fully used by MSPY

  10. Chinese spelling error patterns in MSPY • Proper noun • Personal name • Location • Organization • Non-word errors: context independent • Insertion/deletion/transposition/substitution • E.g. 一文不明 → 一文不名, 忠耿耿 → 忠心耿耿 • Real-word errors: context sensitive • E.g. 像 → 象, 在 → 再, 实施 → 事实

  11. Flowchart of our approach • Text with errors → Proper noun detection → Word segmentation → Word fuzzy matching (trigger: single-char string, low prob) → Non-word error correction → Context sensitive disambiguation → Real-word error correction

  12. Word segmentation and proper noun detection • Language model based word segmentation • Class-based language model • P(W) = P_outside(W) · P_inside(W|<PN>)^a, a = ? • Outside probability – PN tagged training data • Using NLPWIN to tag the corpus • Filtering, rule base • EM? • Inside probability – PN list training data • Using cache (or, dynamic dictionary)
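
A minimal sketch of how the class-based score on this slide could drive segmentation, under simplifying assumptions: a unigram model stands in for the outside language model, every probability below is invented toy data, and the exponent `ALPHA` plays the role of the open parameter a. None of the names or numbers come from the talk.

```python
import math

# Toy probabilities, all hypothetical.
WORD_PROB = {"在": 0.05, "北京": 0.01, "工作": 0.01,
             "北": 0.001, "京": 0.0005, "工": 0.001, "作": 0.001}
PN_CLASS_PROB = 0.02                             # outside: P(<PN>) as a class token
PN_INSIDE_PROB = {"李明": 0.001, "王芳": 0.001}  # inside: P(string | <PN>)
ALPHA = 1.0                                      # the open exponent "a" on the slide
OOV_CHAR_PROB = 1e-8                             # floor for unknown single characters
MAX_LEN = 4                                      # longest span considered


def candidates(span):
    """Yield (tag, log_prob) analyses for a candidate span."""
    if span in WORD_PROB:
        yield "W", math.log(WORD_PROB[span])
    if span in PN_INSIDE_PROB:
        yield "PN", math.log(PN_CLASS_PROB) + ALPHA * math.log(PN_INSIDE_PROB[span])
    if len(span) == 1 and span not in WORD_PROB:
        yield "W", math.log(OOV_CHAR_PROB)


def segment(text):
    """Dynamic-programming search for the highest-probability segmentation."""
    # best[i] = (log_prob, start, span, tag) for the best analysis of text[:i]
    best = [(0.0, None, None, None)] + [(-math.inf, None, None, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            span = text[start:end]
            for tag, log_prob in candidates(span):
                score = best[start][0] + log_prob
                if score > best[end][0]:
                    best[end] = (score, start, span, tag)
    out, i = [], len(text)
    while i > 0:
        _, start, span, tag = best[i]
        out.append((span, tag))
        i = start
    return out[::-1]


print(segment("李明在北京工作"))
# [('李明', 'PN'), ('在', 'W'), ('北京', 'W'), ('工作', 'W')]
```

Raising `ALPHA` above 1 penalizes proper-noun analyses, lowering it favors them, which is one way to read the "a = ?" question on the slide.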

  13. Experiments and Findings • Measure: precision/recall – definition • Training data – People's Daily • Tag tool – NLPWIN • Test data – spec. • Results and Findings
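
The slide leaves the precision/recall definition open. A common formulation, given here as an assumption rather than the talk's own definition, compares the set of items the system proposes (for example, detected proper-noun spans or corrections) against a gold-standard set. The function name and the example spans are hypothetical.

```python
def precision_recall(proposed, gold):
    """Precision = correct / proposed, recall = correct / gold,
    where 'correct' counts items present in both sets."""
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (start, end) spans: 8 proposed, 4 in the gold standard, 2 shared.
gold = [(0, 2), (5, 7), (9, 12), (20, 22)]
proposed = [(0, 2), (5, 7), (3, 4), (8, 9), (13, 15), (16, 18), (23, 25), (30, 31)]
print(precision_recall(proposed, gold))  # (0.25, 0.5)
```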

  14. Long word fuzzy matching • Definition of Distance(s1, s2) • Long word, n>=3 • Sum of character deletions/insertions/substitutions • Fast fuzzy matching • Global – Lei Zhang’s ACL • Local – trigger (single char, or low n-gram probability) • Search – error detection/correction • Viterbi • Simplified version • Long word + Local matching
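
Distance(s1, s2) as described here, the number of single-character deletions, insertions, and substitutions, is the standard Levenshtein distance. The sketch below computes it and does a brute-force lookup against a tiny long-word lexicon; the lexicon, the `max_dist` threshold, and the function names are assumptions, and the fast global/local matching mentioned on the slide is not reproduced.

```python
def distance(s1, s2):
    """Levenshtein distance: minimum number of single-character
    deletions, insertions, and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete c1
                           cur[j - 1] + 1,              # insert c2
                           prev[j - 1] + (c1 != c2)))   # substitute (or match)
        prev = cur
    return prev[-1]


# Hypothetical lexicon of long words (n >= 3).
LONG_WORDS = ["一文不名", "忠心耿耿"]


def fuzzy_match(s, lexicon=LONG_WORDS, max_dist=1):
    """Return lexicon words of length >= 3 within max_dist edits of s,
    nearest first; an exact match (distance 0) needs no correction."""
    hits = [(distance(s, w), w) for w in lexicon if len(w) >= 3]
    return [w for d, w in sorted(hits) if 0 < d <= max_dist]


print(fuzzy_match("一文不明"))  # ['一文不名']
print(fuzzy_match("忠耿耿"))    # ['忠心耿耿']
```

Scanning the whole lexicon is linear in its size, which is presumably why the slide restricts matching to triggered positions.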

  15. Experiments and Findings • Contact: 100 people, 3000 -- 5000 characters/person • Error analysis • Algorithm … • Measure: precision/recall • Large lexicon, acquisition • Trigger/threshold? • Results and Findings

  16. Context sensitive disambiguation • Building confusion set – specific to MSPY • Feature selection – Context vector • Collocation – contiguous POS or words/characters • Context words – words/characters within a K-size window • Triple? • Weighting scheme and Classifier • Context Vector, TFIDF • Winnow, Bayesian, TBL, etc. • Scaling up • Enlarge confusion set • Feature pruning • Adaptation
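
One way to picture the "context words within a K-size window" features combined with a Bayesian classifier is the naive Bayes sketch below, trained on character-tokenized toy sentences for the confusion set {在, 再}. The class name, the training sentences, and the add-one smoothing are assumptions; the talk does not say which classifier or weighting was finally used.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, i, k=2):
    """Context words: tokens within k positions of index i, excluding i."""
    lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
    return [t for j, t in enumerate(tokens[lo:hi], lo) if j != i]


class NaiveBayesDisambiguator:
    """Picks the most likely member of a confusion set from its context."""

    def __init__(self, confusion_set, k=2):
        self.confusion_set = sorted(confusion_set)
        self.k = k
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences):
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in self.confusion_set:
                    self.label_counts[tok] += 1
                    for f in context_features(tokens, i, self.k):
                        self.feature_counts[tok][f] += 1
                        self.vocab.add(f)

    def predict(self, tokens, i):
        feats = context_features(tokens, i, self.k)
        total = sum(self.label_counts.values())

        def score(label):
            n = sum(self.feature_counts[label].values())
            # Add-one smoothed prior and feature likelihoods, in log space.
            lp = math.log((self.label_counts[label] + 1)
                          / (total + len(self.confusion_set)))
            for f in feats:
                lp += math.log((self.feature_counts[label][f] + 1)
                               / (n + len(self.vocab) + 1))
            return lp

        return max(self.confusion_set, key=score)


nb = NaiveBayesDisambiguator({"在", "再"})
nb.train([list("我在北京"), list("他在家里"), list("请再说一遍"), list("明天再见")])
print(nb.predict(list("请在说一遍"), 1))  # 再
```

Here the typed 在 in 请在说一遍 is flagged because the classifier prefers 再 in that context; a real system would also need a confidence threshold before overriding the user's text.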

  17. Experiments and Findings • Measure: precision/recall • Training data • Test data (XXX confusion set) • Results and Findings

  18. Experiments and Findings • Current Work • Pseudo-training set based on MSPY IME • Preliminary data processing (400M PD) • Unigram error model (10,000 words useful) • 使 是/69484 市/10289 诗/2394 …… • Trigram error pattern (980,000 useful) • 共[度]难关=>渡 / 不够[英],=>硬 • Experiments based on basic approaches • Pseudo-test set from 南方周末 • Continuous pair (Recall = 50%, Precision = 25%) • Pattern Matching (??) • Future Work • Hybrid approaches • Pattern Clustering + Continuous pair • Function-word error detection
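
A unigram error model of the kind listed on this slide can be pictured as a table of substitution counts gathered from aligned text pairs. The sketch below assumes each pair holds a reference string and the same string after round-tripping through an input method, aligned character by character, and that entries are keyed by the reference character; both the pairing scheme and the direction of the mapping are assumptions, and the toy pairs are invented.

```python
from collections import Counter, defaultdict


def build_unigram_error_model(pairs):
    """Count character substitutions between reference and converted text.
    model[ref_char][converted_char] = number of times seen."""
    model = defaultdict(Counter)
    for reference, converted in pairs:
        if len(reference) != len(converted):
            continue  # skip pairs that cannot be aligned character by character
        for ref_ch, out_ch in zip(reference, converted):
            if ref_ch != out_ch:
                model[ref_ch][out_ch] += 1
    return model


def format_entry(model, ch):
    """Render one entry in the 'char other/count ...' style used on the slide."""
    return ch + " " + " ".join(f"{c}/{n}" for c, n in model[ch].most_common())


# Invented toy pairs (reference, converted).
pairs = [("使用", "是用"), ("使者", "是者"), ("大使", "大市")]
model = build_unigram_error_model(pairs)
print(format_entry(model, "使"))  # 使 是/2 市/1
```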

  19. System evaluation – put it all together • Evaluation toolset • Measure: precision/recall • Training data • Test data • Results and Findings

  20. Prototype • Demo … • Online & offline CSC • Right click • Spelling error detection/correction • Proper noun detection/correction

  21. Assignment • Jianfeng Gao – overall, fuzzy matching • Mu Li – context sensitive disambiguation • Jian Sun – PN detection • Yang Wen – system evaluation • Yulin Kang – demo • Lei Zhang – senior consultant

  22. Milestones • Oct. 2001, Ming says “Yes” (TAB demo) • Dec. 2001, Dong says “Yes” (Transfer) • Aug. 2002, HJ says “Yes” (Party)

  23. Information • Access at \\msrcn4p3\rootD\gaojf\spell • Contact me if any problems • Jianfeng Gao, Tel: 86-10-62617711-5778, Email: jfgao@microsoft.com
