
Breaking the Resource Bottleneck for Multilingual Parsing

This research paper explores how to quickly and automatically induce a non-English language treebank using available English resources. The framework includes a direct projection algorithm, post-projection transformation, filtering techniques, and evaluation experiments.

Presentation Transcript


  1. Breaking the Resource Bottleneck for Multilingual Parsing
     Rebecca Hwa, Philip Resnik and Amy Weinberg
     University of Maryland

  2. The Treebank Bottleneck
     • High-quality parsers need training examples with hand-annotated syntactic information
     • Annotation is labor intensive and time consuming
     • There is no sizable treebank for most languages other than English
     [ [S [NP-SBJ Ford Motor Co.] [VP acquired [NP [NP 5 %] [PP of [NP [NP the shares] [PP-LOC in [NP Jaguar PLC]]]]]]] . ]

  3. State of the Art Parsing

  4. Research Questions
     • How can we induce a non-English language treebank quickly and automatically?
     • Bootstrap from available English resources
     • Project syntactic dependency relationships across bilingual sentences
     • How good is the resulting treebank?
     • Can we use it to train a new parser?
     • How can we improve its quality?

  5. Roadmap
     • Overview of the framework
     • Direct projection algorithm
     • Problematic cases
     • Post projection transformation
     • Remaining challenges
     • Filtering
     • Experiment
     • Direct evaluation of the projected trees
     • Evaluation of a Chinese parser trained on the induced treebank
     • Future Work

  6. Overview of Our Framework
     [Diagram] A bilingual (English, Chinese) corpus is processed by an English dependency parser and a word alignment model. Projection, Transformation, and Filtering then yield a projected Chinese dependency treebank, which is used to train a dependency parser. The trained parser takes unseen Chinese sentences and produces dependency trees for them.

  7. Necessary Resources: 1. Bilingual Sentences
     English: The Chinese side expressed satisfaction regarding this subject
     Chinese: 中国 方面 对 此 表示 满意

  8. Necessary Resources: 2. English (Dependency) Parser
     English dependency labels: det, mod, mod, adj, subj, obj, det
     English: The Chinese side expressed satisfaction regarding this subject
     Chinese: 中国 方面 对 此 表示 满意

  9. Necessary Resources: 3. Word Alignment
     English dependency labels: det, mod, mod, adj, subj, obj, det
     English: The Chinese side expressed satisfaction regarding this subject
     Chinese: 中国 方面 对 此 表示 满意

  10. Projected Chinese Dependency Tree
      English dependency labels: det, mod, mod, adj, subj, obj, det
      English: The Chinese side expressed satisfaction regarding this subject
      Chinese: 中国 方面 对 此 表示 满意
      Projected dependency labels: mod, mod, obj, adj, subj

  11. Direct Projection Algorithm • If there is a syntactic relationship between two English words, then the same syntactic relationship also exists between their corresponding Chinese words
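The projection rule on this slide can be sketched for the simplest, one-to-one case. This is a minimal illustration, not the authors' implementation: the function name `project_one_to_one`, the tuple representation of a dependency, and the toy alignment are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# A dependency is (head_index, dependent_index, label).
# This representation is an assumption made for illustration.
Dependency = Tuple[int, int, str]


def project_one_to_one(
    english_deps: List[Dependency],
    alignment: Dict[int, int],
) -> List[Dependency]:
    """Copy each English dependency onto the aligned foreign words.

    Only the one-to-one case is handled: a dependency is projected when
    both its head and its dependent are aligned to a foreign word.
    Dependencies touching an unaligned English word are skipped here.
    """
    projected = []
    for head, dep, label in english_deps:
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep], label))
    return projected


# Hypothetical toy input: three English words, word 0 heads words 1 and 2.
english_deps = [(0, 1, "subj"), (0, 2, "obj")]
# Hypothetical alignment: English index -> foreign index.
alignment = {0: 2, 1: 0, 2: 1}

projected = project_one_to_one(english_deps, alignment)
print(projected)
```

The unaligned and many-to-one or one-to-many cases shown on the following slides need extra handling that this sketch deliberately leaves out.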

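  12. Problematic Case: Unaligned English
      English dependency labels: mod, det
      English: regarding this subject
      Chinese: 对 此

  13. Problematic Case: Unaligned English
      English dependency labels: mod, det
      English: regarding this subject
      Chinese: 对 此 *e*
      Projected dependency labels: det, mod

  14. Problematic Case: many-to-1
      English dependency labels: mod, det
      English: regarding this subject
      Chinese: 对 此

  15. Problematic Case: many-to-1
      English dependency labels: mod, det
      English: regarding this subject
      Chinese: 对 此
      Projected dependency labels: mod

  16. Problematic Case: Unaligned Chinese
      English dependency labels: det, subj
      English: The Chinese *e* expressed
      Chinese: 中国 方面 表示 *e*

  17. Problematic Case: Unaligned Chinese
      English dependency labels: det, subj
      English: The Chinese *e* expressed
      Chinese: 中国 方面 表示 *e*
      Projected dependency labels: det, subj

  18. Problematic Case: 1-to-many
      English dependency labels: det, subj
      English: The Chinese expressed
      Chinese: 中国 方面 表示 *e*

  19. Problematic Case: 1-to-many
      English dependency labels: det, subj
      English: The Chinese expressed
      Chinese: 中国 方面 表示 *e* *M*
      Projected dependency labels: mac, det, mac, subj

  20. Output of the Direct Projection Algorithm
      English dependency labels: mod, mod, det, subj, obj, det
      English: expressed satisfaction regarding this subject The Chinese
      Chinese: 中国 方面 对 此 表示 满意 *e* *M*
      Projected dependency labels: mod, mac, mod, det, obj, mac, subj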

  21. Post Projection Transformation
      • Handles One-to-Many mapping
      • Selects the head based on (projected) part-of-speech categories
      • Handles some Unaligned-Chinese cases
      • Only addresses closed-class words
      • Function words (e.g., aspectual, measure words)
      • Easily enumerable lexical categories (e.g., $, RMB, yen)
      • Removes empty nodes introduced by the Unaligned-English cases by promoting each empty node's head child
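The last transformation, removing an empty node by promoting its head child, might look like the sketch below. The slide does not say how the head child is chosen, so this example simply takes the first child; that choice, the `heads` dictionary representation, and the function name `remove_empty_nodes` are all assumptions.

```python
from typing import Callable, Dict, List, Optional

# heads maps each node to its parent (None for the root).
# This tree representation is an assumption made for illustration.
Heads = Dict[str, Optional[str]]


def remove_empty_nodes(
    heads: Heads,
    is_empty: Callable[[str], bool],
    pick_head_child: Callable[[List[str]], str] = lambda children: children[0],
) -> Heads:
    """Delete empty nodes, promoting one child into each empty node's place.

    The promoted child takes over the empty node's parent, and any other
    children of the empty node are re-attached to the promoted child.
    An empty node without children is simply dropped.
    """
    heads = dict(heads)  # work on a copy
    for node in [n for n in heads if is_empty(n)]:
        children = [c for c, h in heads.items() if h == node]
        parent = heads.pop(node)
        if not children:
            continue
        promoted = pick_head_child(children)
        heads[promoted] = parent
        for child in children:
            if child != promoted:
                heads[child] = promoted
    return heads


# Hypothetical toy tree loosely modeled on the Unaligned-English slides:
# 此 depends on the empty node *e*, which in turn depends on 对.
tree = {"对": None, "*e*": "对", "此": "*e*"}
cleaned = remove_empty_nodes(tree, is_empty=lambda n: n == "*e*")
print(cleaned)
```

Passing a different `pick_head_child` function would be the place to plug in a part-of-speech based choice.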

  22. Remaining Challenges
      • Handling divergences
      • Incorporating unaligned foreign words into the projected tree
      • Removing crossing dependencies
      [Figure: English words A B C D aligned to foreign words that appear in the order a d b c]
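To make the crossing-dependency problem concrete, the following sketch detects crossing arcs after projection. The word order A B C D versus a d b c comes from the slide; the two arcs (A to B, C to D) are hypothetical, since the slide text does not list them, and `crossing_pairs` is a name chosen for this example.

```python
from itertools import combinations
from typing import List, Tuple

Arc = Tuple[int, int]  # (head position, dependent position)


def crossing_pairs(arcs: List[Arc]) -> List[Tuple[Arc, Arc]]:
    """Return every pair of arcs whose spans interleave (cross)."""
    spans = [tuple(sorted(arc)) for arc in arcs]
    crossings = []
    for (a1, b1), (a2, b2) in combinations(spans, 2):
        # Two arcs cross when exactly one endpoint of one arc lies
        # strictly inside the span of the other.
        if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
            crossings.append(((a1, b1), (a2, b2)))
    return crossings


# English order: A=0, B=1, C=2, D=3. Hypothetical arcs: A->B and C->D.
english_arcs = [(0, 1), (2, 3)]
# Foreign order from the slide is a d b c, so a=0, d=1, b=2, c=3.
alignment = {0: 0, 1: 2, 2: 3, 3: 1}
projected_arcs = [(alignment[h], alignment[d]) for h, d in english_arcs]

print(crossing_pairs(english_arcs))    # no crossings on the English side
print(crossing_pairs(projected_arcs))  # the projected arcs cross
```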

  23. Filtering
      • Projected treebank is noisy
      • Mistakes introduced by the projection algorithm
      • Mistakes introduced by component errors
      • Use aggressive filtering techniques to remove the worst projected trees
      • Filter out a sentence pair if many English words are unaligned
      • Filter out a sentence pair if many Chinese words are aligned to the same English word
      • Filter out a sentence pair if many of the projected links cause crossing dependencies
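The three filtering criteria above could be combined roughly as follows. The slide only says "many", so the threshold values (`max_unaligned`, `max_fan`, `max_cross`) are invented for illustration, as are the function names and the input representation.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

Link = Tuple[int, int]  # (English index, Chinese index), assumed unique
Arc = Tuple[int, int]   # (head position, dependent position)


def unaligned_english_ratio(n_english: int, links: List[Link]) -> float:
    """Fraction of English words with no alignment link."""
    if n_english == 0:
        return 0.0
    aligned = {e for e, _ in links}
    return (n_english - len(aligned)) / n_english


def max_fanout(links: List[Link]) -> int:
    """Largest number of Chinese words aligned to one English word."""
    counts = Counter(e for e, _ in links)
    return max(counts.values(), default=0)


def crossing_ratio(arcs: List[Arc]) -> float:
    """Fraction of projected arcs that cross at least one other arc."""
    spans = [tuple(sorted(arc)) for arc in arcs]
    crossed = set()
    for i, j in combinations(range(len(spans)), 2):
        (a1, b1), (a2, b2) = spans[i], spans[j]
        if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
            crossed.update((i, j))
    return len(crossed) / len(spans) if spans else 0.0


def keep_pair(n_english, links, arcs,
              max_unaligned=0.3, max_fan=3, max_cross=0.2) -> bool:
    """Keep a sentence pair only if it passes all three checks.

    The default thresholds are hypothetical, not values from the slides.
    """
    return (unaligned_english_ratio(n_english, links) <= max_unaligned
            and max_fanout(links) <= max_fan
            and crossing_ratio(arcs) <= max_cross)


# A cleanly aligned toy pair versus one where every link hits one word.
clean = keep_pair(4, [(0, 0), (1, 1), (2, 2), (3, 3)], [(1, 0), (1, 2), (2, 3)])
noisy = keep_pair(4, [(0, 0), (0, 1), (0, 2), (0, 3)], [(0, 1), (0, 2), (0, 3)])
print(clean, noisy)
```

In practice the thresholds would have to be tuned against held-out data, trading treebank size against noise.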

  24. Experiments
      • Direct evaluation of the projection framework
      • Compare the (pre-filtered) projected trees against human annotated gold standard
      • Evaluation of the projected treebank
      • Use the (post-filtered) treebank to train a Chinese parser
      • Test the parser on unseen sentences and compare the output to human annotated gold standard

  25. Direct Evaluation
      • Bilingual data: 88 Chinese Treebank sentences with their English translations
      • Apply projection and transformation under idealized conditions
      • Given human-corrected English parse trees and hand-drawn word-alignments
      • Apply projection and transformation under realistic conditions
      • English parse trees generated from Collins parser (trained on Penn Treebank)
      • Word-alignments generated from IBM MT Model (trained on ~56K Hong Kong News bilingual sentences)

  26. Direct Evaluation Results
      * Accuracy = F-score based on unlabeled precision and recall
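The accuracy measure named on this slide can be computed as in the sketch below. The slide does not say which F-score is meant, so the balanced F1 (harmonic mean of precision and recall) is assumed; the edge representation and the toy data are also illustrative only.

```python
from typing import Iterable, Tuple

Edge = Tuple[int, int]  # (head position, dependent position), label ignored


def unlabeled_f_score(
    predicted: Iterable[Edge], gold: Iterable[Edge]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over unlabeled dependency edges."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy trees: 2 of the 3 predicted edges appear in the gold tree.
gold_edges = [(1, 0), (4, 1), (4, 3), (3, 2)]
predicted_edges = [(1, 0), (4, 1), (4, 2)]

precision, recall, f1 = unlabeled_f_score(predicted_edges, gold_edges)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Precision and recall can differ here because a projected tree may leave some words unattached, so the number of predicted edges need not match the gold tree.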

  27. Evaluating Trained Parser
      • Bilingual data: 56K sentence pairs from the Hong Kong News parallel corpus
      • Apply the DPA (using the Collins Parser and IBM MT Model) to create a projected Chinese treebank
      • Filter out badly-aligned sentence pairs to reduce noise
      • Train a Chinese parser with the (filtered) projected treebank
      • Test the Chinese parser on unseen test set (88 Chinese Treebank sentences)

  28. Parser Evaluation Results

  29. Conclusion
      • We have presented a framework for acquiring Chinese dependency treebanks by bootstrapping from existing linguistic resources
      • Although the projected trees may have an accuracy rate of nearly 70% in principle, reducing noise caused by word-alignment errors is still a major challenge
      • A parser trained on the induced treebank can outperform some baselines

  30. Future Work
      • Obtain larger parallel corpus
      • Reduce error rates of the word-alignment models
      • Develop more sophisticated techniques to filter out noise in the induced treebank
      • Improve the projection algorithm to handle unaligned words and inconsistent trees

  31. Reserve slides

  32. DPA Case 1: One-to-One
      English: A B
      Foreign: b a

  33. DPA Case 2: Many-to-One
      English: C A1 A2 A3 B
      Foreign: c a b

  34. DPA Case 3: One-to-Many
      English: A B
      Foreign: b *a* a1 a2 a3

  35. DPA Case 4: Many-to-Many
      English: C A1 A2 A3 B
      Foreign: c *a* a1 a2 b

  36. DPA Case 5: Unaligned English Word
      English: A B C
      Foreign: a *b* c

  37. DPA Case 6: Unaligned Foreign Word
      English: A *B* C
      Foreign: a b c
