
The Third Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition



  1. The Third Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition. Gina-Anne Levow, Fifth SIGHAN Workshop, July 22, 2006

  2. Roadmap • Bakeoff Task Motivation • Bakeoff Structure: • Materials and annotations • Tasks and conditions • Participants and timeline • Results & Discussion: • Word Segmentation • Named Entity Recognition • Observations & Conclusions • Thanks

  3. Bakeoff Task Motivation • Core enabling technologies for Chinese language processing • Word segmentation (WS) • Crucial tokenization in the absence of whitespace • Supports POS tagging, parsing, reference resolution, etc. • Fundamental challenges: • “Word” not well or consistently defined; humans disagree • Unknown words impede performance • Named Entity Recognition (NER) • Essential for reference resolution, IR, etc. • A common class of new, unknown words

  4. Data Source Characterization • Five corpora from five providers • Annotation guidelines available, but varied across providers • Simplified and traditional characters • Range of encodings, all available in Unicode (UTF-8) • Provided in common XML, converted to train/test form (LDC)

  5. Tasks and Tracks • Tasks: • Word Segmentation: • Training and truth: whitespace delimited • End-of-word tags replaced with a space, no others • Named Entity Recognition: • Training and truth: similar to CoNLL 2-column format • NAMEX only: LOC, PER, ORG (LDC: +GPE) • Tracks: • Closed: only provided materials may be used • Open: any materials may be used, but must be documented
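A minimal Python sketch of how the two data formats just described can be read. This is not the official bakeoff tooling: the helper names are mine, and details such as the column separator and exact tag inventory should be taken from the released data and guidelines.

```python
def read_ws_line(line):
    """Word segmentation training/truth: one sentence per line,
    words delimited by whitespace."""
    return line.split()

def read_ner_lines(lines):
    """NER training/truth: CoNLL-style two-column format, one token
    per line followed by its entity tag (LOC/PER/ORG; the LDC data
    adds GPE); a blank line marks a sentence boundary."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                    # sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split()[:2]   # token column, tag column
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences
```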

  6. Structure: Participants & Timeline • Participants: • 29 sites submitted runs for evaluation (36 initially registered) • 144 runs submitted: ~2/3 WS, ~1/3 NER • Diverse groups: 11 PRC, 7 Taiwan, 5 US, 2 Japan, 1 each: Singapore, Korea, Hong Kong, Canada • Mix of commercial (MSRA, Yahoo!, Alias-I, FR Telecom, etc.) and academic sites • Timeline: • March 15: registration opened • April 17: training data released • May 15: test data released • May 17: results due

  7. Word Segmentation: Results • Contrasts: Left-to-right maximal match • Baseline: Uses only training vocabulary • Topline: Uses only testing vocabulary
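The baseline and topline contrasts are greedy left-to-right maximal-match segmenters that differ only in the vocabulary they consult (training words for the baseline, test words for the topline). A minimal sketch, with an invented function name and word-length cap:

```python
def max_match(text, vocab, max_word_len=10):
    """Left-to-right maximal match: at each position take the longest
    vocabulary item starting there; back off to a single character if
    nothing matches. `vocab` is the training vocabulary for the
    baseline and the test vocabulary for the topline."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Classic greedy error: with 研究生 ("graduate student") in the vocabulary,
# 研究 / 生命 / 起源 ("research the origin of life") comes out as
# 研究生 / 命 / 起源.
print(max_match("研究生命起源", {"研究", "研究生", "生命", "起源"}))
```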

  8. Word Segmentation: CityU • [Results charts: CityU Closed and CityU Open]

  9. Word Segmentation: CKIP • [Results charts: CKIP Closed and CKIP Open]

  10. Word Segmentation: MSRA • [Results charts: MSRA Closed and MSRA Open]

  11. Word Segmentation: UPUC • [Results charts: UPUC Closed and UPUC Open]

  12. Word Segmentation: Overview • F-scores: 0.481-0.797 • Best score: MSRA Open Task (FR Telecom) • Best relative to topline: CityU Open: >99% • Most frequent top rank: MSRA • Both F-scores and OOV recall higher in Open • Overall good results: Most outperform baseline
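For reference, the word-level scores summarized here (precision, recall, F-score, and OOV recall) can be computed by aligning words via their character offsets. The sketch below is a simplified stand-in for the official scoring script, with my own function names, and it assumes both segmentations cover the same character string.

```python
def word_spans(words):
    """Pair each word with its (start, end) character span."""
    pairs, pos = [], 0
    for w in words:
        pairs.append((w, (pos, pos + len(w))))
        pos += len(w)
    return pairs

def ws_scores(gold_words, sys_words, train_vocab):
    """Word-level precision, recall, F-score, and OOV recall."""
    gold = word_spans(gold_words)
    gold_spans = {s for _, s in gold}
    sys_spans = {s for _, s in word_spans(sys_words)}
    correct = gold_spans & sys_spans
    p = len(correct) / len(sys_spans) if sys_spans else 0.0
    r = len(correct) / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    # OOV recall: recall restricted to gold words absent from training
    oov = {s for w, s in gold if w not in train_vocab}
    oov_recall = len(oov & correct) / len(oov) if oov else 0.0
    return p, r, f, oov_recall
```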

  13. Word Segmentation: Discussion • Continuing OOV challenges • Highest F-scores on MSRA • Also highest topline and baseline • Lowest OOV rate • Lowest F-scores on UPUC • Also lowest topline and baseline • Highest OOV rate (more than double that of any other corpus) • Smallest corpus (~1/3 the size of MSRA) • Best scores on the most consistent corpus • Vocabulary, annotation • UPUC also varies in genre: train: CTB; test: CTB, NW, BN

  14. NER Results • Contrast: Baseline • Label a token as a named entity if it had a unique tag in training
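A sketch of that baseline under one reading of the slide: remember the tags each training token received, and at test time label a token only if it carried exactly one entity tag in training, falling back to "O" otherwise. Function names and details are mine, not the evaluation's.

```python
from collections import defaultdict

def learn_unique_tags(train_pairs):
    """Collect the tags seen for each training token and keep only the
    tokens whose tag was unique (a single, non-O entity tag)."""
    seen = defaultdict(set)
    for token, tag in train_pairs:
        seen[token].add(tag)
    return {tok: next(iter(tags)) for tok, tags in seen.items()
            if len(tags) == 1 and "O" not in tags}

def baseline_tag(tokens, unique_tags):
    """Label a token as a named entity only if it had a unique entity
    tag in training; everything else gets 'O'."""
    return [(tok, unique_tags.get(tok, "O")) for tok in tokens]
```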

  15. NER Results: CityU • [Results charts: CityU Closed and CityU Open]

  16. NER Results: LDC • [Results charts: LDC Closed and LDC Open]

  17. NER Results: MSRA • [Results charts: MSRA Closed and MSRA Open]

  18. NER: Overview • Overall results: • Best F-score: MSRA Open Track: 0.91 • Strong overall performance: • Only two results below baseline • Direct comparison of NER Open vs Closed • Difficult: only two sites performed both tracks • Only MSRA had large numbers of runs • Here Open outperformed Closed: top 3 Open > Closed

  19. NER Observations • Named Entity Recognition challenges • Tagsets, variation, and corpus size • Results on MSRA/CityU much better than on LDC • LDC corpus substantially smaller • Also larger tagset: GPE • GPE easily confused with ORG or LOC • NER results sensitive to corpus size, tagset, genre

  20. Conclusions & Future Challenges • Strong, diverse participation in WS & NER • Many effective, competitive results • Cross-task, cross-evaluation comparisons • Still difficult • Scores sensitive to corpus size, annotation consistency, tagset, genre, etc. • Need a corpus- and configuration-independent measure of progress • Encourage submissions that support comparisons • Extrinsic, task-oriented evaluation of WS/NER • Continuing challenges: OOV, annotation consistency, encoding combinations and variation, code-switching

  21. Thanks • Data providers: • Chinese Knowledge Information Processing Group, Academia Sinica, Taiwan: • Keh-Jiann Chen, Henning Chiu • City University of Hong Kong: • Benjamin K. Tsou, Olivia Oi Yee Kwong • Linguistic Data Consortium: Stephanie Strassel • Microsoft Research Asia: Mu Li • University of Pennsylvania / University of Colorado: • Martha Palmer, Nianwen Xue • Workshop co-chairs: • Hwee Tou Ng and Olivia Oi Yee Kwong • All participants!
