
Collocation Extraction Using Monolingual Word Alignment Method

Zhanyi Liu, Haifeng Wang, Hua Wu, Sheng Li. EMNLP 2009.


Presentation Transcript


  1. Collocation Extraction Using Monolingual Word Alignment Method Zhanyi Liu, Haifeng Wang, Hua Wu, Sheng Li EMNLP 2009

  2. Collocation • Two words • Consecutive ("by accident") • Interrupted ("take ... advice") • Other examples • Proper noun ("New York") • Compound nouns ("ice cream") • Correlative conjunction ("either ... or")

  3. Previous Work • Co-occurring word pairs • Word pairs within a given window size • Association measures • Frequency, log-likelihood, mutual information ... • Disadvantages • Long-span collocations are missed • "either ... or", "because ... so" • Limited by window size • False collocations • Any word pair within the window becomes a candidate
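
A minimal sketch of the window-based baseline described on this slide, using pointwise mutual information as the association measure. The function names, the tiny example sentence, and the exact PMI normalisation are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def window_pairs(tokens, window):
    """Yield unordered word pairs whose positions differ by at most `window`."""
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            yield tuple(sorted((left, tokens[j])))


def pmi_scores(sentences, window=3):
    """Return (PMI score per pair, co-occurrence count per pair)."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(window_pairs(tokens, window))
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores, pair_counts


# Hypothetical sentence: "either" and "or" are four positions apart.
tokens = "either you pay now or you leave".split()
_, narrow = pmi_scores([tokens], window=3)
_, wide = pmi_scores([tokens], window=4)
print(("either", "or") in narrow, ("either", "or") in wide)
```

With a window of 3 the pair ("either", "or") is never counted; widening the window to 4 recovers it but also admits every other pair in range, which is the trade-off the slide lists as a disadvantage.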

  4. Monolingual Word Alignment • Bilingual word alignment (BWA) • Source-target sentence pairs • Monolingual Word Alignment (MWA) • Source-source sentence pairs • Replicate the corpus

  5. Monolingual Word Alignment (2) • [Figure: example alignments, bilingual vs. monolingual] • A word never collocates with itself

  6. MWA Model • Sentence with l words: S = {w_1, ..., w_l} • Alignment: A = {(i, a_i) | i ∈ [1, l]} • Example: A = {(2,3), (3,2), (4,7), (6,7)...}
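
The alignment notation on this slide can be sketched as a mapping from position i to position a_i. The helper names and the seven-word placeholder sentence below are hypothetical; the only constraints encoded are the ones stated on the slides (positions lie in [1, l], and a word never aligns with itself).

```python
def is_valid_alignment(alignment, length):
    """Check a monolingual alignment given as {i: a_i} with 1-based positions."""
    for i, a_i in alignment.items():
        if not (1 <= i <= length and 1 <= a_i <= length):
            return False  # position outside the sentence
        if a_i == i:
            return False  # a word never collocates with itself
    return True


def aligned_word_pairs(sentence, alignment):
    """Turn position pairs (i, a_i) into word pairs (w_i, w_{a_i})."""
    return [(sentence[i - 1], sentence[a - 1]) for i, a in sorted(alignment.items())]


# The partial example from the slide, applied to a placeholder sentence.
sentence = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
alignment = {2: 3, 3: 2, 4: 7, 6: 7}
print(is_valid_alignment(alignment, len(sentence)))
print(aligned_word_pairs(sentence, alignment))
```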

  7. MWA Model (2) • Adapt IBM Model 3 to MWA • EM training produces three probabilities • Word collocation probability • Position collocation probability • d(4|7,12) • Probability that the 4th word collocates with the 7th word in a 12-word sentence • Fertility probability • Probability that w_i collocates with Φ_i words
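
One way the three probabilities on this slide might combine, assuming the usual IBM Model 3 factorisation carries over with only the no-self-alignment constraint added. The symbols t (word collocation), d (position collocation) and n (fertility) follow common IBM-model notation and are an assumption here; the slide itself only shows d(4|7,12).

```latex
p(A \mid S) \;\propto\; \prod_{i=1}^{l} n(\phi_i \mid w_i)\;
  \prod_{j=1}^{l} t(w_j \mid w_{a_j})\, d(j \mid a_j, l),
\qquad a_j \neq j
```

Under this reading, the slide's d(4|7,12) is the term with j = 4, a_j = 7 and l = 12.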

  8. Collocation Extraction • Extract and rank candidate pairs; filter out pairs with freq(w_i, w_j) < 5 • Symmetric assumption • (w_i, w_j) = (w_j, w_i)
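
A rough sketch of the extract, filter and rank step. How the two alignment directions are merged under the symmetric assumption is not stated on the slide; averaging them is an assumption made here, and all names and numbers are illustrative.

```python
def unordered(a, b):
    """Canonical key for a pair, so (w_i, w_j) and (w_j, w_i) coincide."""
    return tuple(sorted((a, b)))


def extract_collocations(align_prob, pair_freq, min_freq=5):
    """Rank unordered pairs by averaged directional alignment probability.

    align_prob: {(w_i, w_j): probability of aligning w_i to w_j}
    pair_freq:  {unordered pair: how often the pair was aligned}
    """
    merged = {}
    for (a, b), prob in align_prob.items():
        merged.setdefault(unordered(a, b), []).append(prob)
    ranked = []
    for pair, probs in merged.items():
        if pair_freq.get(pair, 0) < min_freq:
            continue  # drop pairs seen fewer than min_freq times
        # Assumption: average both directions; a missing direction counts as 0.
        ranked.append((pair, sum(probs) / 2))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up probabilities and frequencies, for illustration only.
align_prob = {
    ("take", "advice"): 0.4, ("advice", "take"): 0.2,
    ("by", "accident"): 0.5, ("accident", "by"): 0.5,
    ("the", "of"): 0.9,
}
pair_freq = {("advice", "take"): 12, ("accident", "by"): 30, ("of", "the"): 3}
for pair, score in extract_collocations(align_prob, pair_freq):
    print(pair, round(score, 3))
```

In this toy input the pair ("of", "the") has the highest probability but is removed by the frequency filter.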

  9. Initial Experiment • Chinese • Training data • LDC2007T03 Tagged Chinese Gigaword • Xinhua portion, 28M words • Gold set • Handcrafted collocation dictionaries • 56,888 collocations

  10. Initial Experiment (2) • Precision • Baseline • Frequency, log-likelihood, mutual information • Log-likelihood achieves the best performance
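
The precision formula itself is not preserved in this transcript. A minimal sketch, assuming precision is the fraction of the top-N extracted pairs that appear in the gold set; the function name and the toy data are hypothetical.

```python
def precision_at_n(ranked_pairs, gold, n):
    """Fraction of the top-n ranked pairs found in the gold set.

    ranked_pairs: list of (w_i, w_j) tuples, best first
    gold:         set of frozensets, so word order inside a pair is ignored
    """
    top = ranked_pairs[:n]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if frozenset(pair) in gold)
    return hits / len(top)


# Toy ranking and gold set, for illustration only.
ranked = [("by", "accident"), ("he", "says"), ("take", "advice"), ("this", "is")]
gold = {frozenset(("by", "accident")), frozenset(("advice", "take"))}
print(precision_at_n(ranked, gold, 2))
```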

  11. Initial Experiment (3) • Observation • Precision is low • Small gold set (57K/200K = 28%) • Low precision when N < 20K

  12. Observation • Frequency vs. Probability vs. Precision • Precision curve • Lower freq --> lower precision • Alignment probability curve • Lower freq --> higher probability

  13. Observation (2) • Conclusion • What causes the lower precision in the top 20K? • Collocations with low frequency but high alignment probability

  14. Improved MWA Method • Add a penalization function y = f(x), x = freq(w_1, w_2) • When x is small, y approaches 0 (penalize) • When x is large, y approaches 1 (do not penalize) • y = e^{-b/x} (b is tuned to 25) • New ranking score
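
A small sketch of the penalization function y = e^{-b/x} with b = 25. The new ranking score is not reproduced in this transcript, so multiplying the alignment probability by the penalty is an assumption made for illustration, as are the function names and example values.

```python
import math


def frequency_penalty(freq, b=25.0):
    """y = exp(-b / x): near 0 for rare pairs, near 1 for frequent ones."""
    return math.exp(-b / freq)


def penalized_score(align_prob, freq, b=25.0):
    """Assumed combination: alignment probability scaled by the penalty."""
    return align_prob * frequency_penalty(freq, b)


for freq in (5, 25, 250, 2500):
    print(freq, round(frequency_penalty(freq), 4))

# A rare pair with high probability vs. a frequent pair with lower probability.
print(penalized_score(0.9, 5) < penalized_score(0.3, 500))
```

At x = 25 the penalty is e^{-1}, roughly 0.37, so pairs near the tuned value of b are already discounted substantially, and pairs at the filtering threshold of 5 are pushed close to zero.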

  15. Further Evaluation • Automatic evaluation • Greatly outperforms the best baseline • For top 1K, 20.6% vs. 11.7% • Exponential function plays a key role

  16. Further Evaluation (2) • Human Evaluation • Top 1K collocations • For each collocation, tag "True" or "False" • 4 "False" cases • A: two semantically related words • (醫生 "doctor", 護士 "nurse") • B: part of a multi-word collocation (>= 3 words) • (自我 "self", 機制 "mechanism") in (自我 "self", 約束 "restraint", 機制 "mechanism") • C: high-frequency bigram • (他 "he", 說 "says"), (這 "this", 是 "is"), (很 "very", 好 "good") • D: two words that co-occur frequently • (北京 "Beijing", 月 "month"), (和 "and", 為 "for")

  17. Further Evaluation (3) • MWA extracts far more true collocations than the baseline • False collocations • A: semantically related, not distinguishable by MWA • B: only two-word collocations are extracted • Few collocations have >= 3 words • C: frequent bigrams, not distinguishable by MWA • D: far fewer than the baseline

  18. Further Evaluation (3) cont. • MWA is able to produce long-span collocations • 48 extracted collocations with span > 6 • 33 are tagged "True" • ("處於", "狀態") "be in ... state", ("由於", "因此") "because ... therefore" • 69% precision

  19. Fertility vs. Precision • Manually label 100 sentences and observe fertility • 78% of words collocate with 1 word • 17% of words collocate with 2 words • 95% of words have fertility <= 2 • Limit Φmax
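
A minimal sketch of how fertility could be read off an alignment, assuming the fertility of position i is the number of positions j with a_j = i (the usual IBM Model 3 sense). The helper names and the Φmax check are illustrative; the alignment is the partial example from slide 6.

```python
from collections import Counter


def fertilities(alignment, length):
    """Fertility per 1-based position: how many positions align to it."""
    counts = Counter(alignment.values())
    return [counts[i] for i in range(1, length + 1)]


def within_fertility_cap(alignment, length, phi_max=2):
    """True if no word exceeds the assumed fertility limit phi_max."""
    return max(fertilities(alignment, length), default=0) <= phi_max


alignment = {2: 3, 3: 2, 4: 7, 6: 7}  # {i: a_i}
print(fertilities(alignment, 7))
print(within_fertility_cap(alignment, 7, phi_max=2))
```

Here position 7 has fertility 2 because both position 4 and position 6 align to it, so the alignment still satisfies a cap of Φmax = 2.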

  20. Conclusion • Main contributions • Successfully adapt BWA to MWA • Propose a ranking method • Alignment probability + exponential penalty function • The initial failures are analyzed in detail • Future work • Improving Statistical Machine Translation with Monolingual Collocation, ACL 2010 • Improve alignment, phrase table
