

國立雲林科技大學 National Yunlin University of Science and Technology. Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs. Advisor: Dr. Hsu. Graduate: Chien-Shing Chen. Authors: Eric Brill, Gary Kacmarcik, Chris Brockett.


Presentation Transcript


  1. 國立雲林科技大學 National Yunlin University of Science and Technology • Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs • Advisor: Dr. Hsu • Graduate: Chien-Shing Chen • Authors: Eric Brill, Gary Kacmarcik, Chris Brockett (Microsoft Research) • NLPRS, 2001

  2. Outline • N.Y.U.S.T. • I.M. • Motivation • Objective • Introduction • Noisy Channel Error Model • Experiments • Conclusions • Opinion

  3. Motivation • Out-of-vocabulary (OOV) words are a notorious headache for human translators • They are a stumbling block for quality machine translation (MT) and multilingual information retrieval (IR), e.g. <本拉登, Bin Laden>

  4. Objective • Back-transliterate from katakana to English • Find <katakana, English> pairs • Apply them at runtime in machine translation (MT) and multilingual IR

  5. Introduction • Katakana script is used to represent foreign loanwords • Search engines must deal with the flood of OOV words

  6. Noisy Channel Error Model • Non-matching (noisy) segments between English and romanized katakana, with |a|, |b| ≤ 4
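
For orientation, the textbook noisy channel formulation of back-transliteration can be written as below. This is a sketch under assumptions, not necessarily the exact scoring used in the paper: k stands for a romanized katakana string, e for a candidate English string, and a_i → b_i for substring edits whose two sides are at most N characters long (N = 4 on this slide). The prior P(e) and the max-over-segmentations approximation are standard choices assumed here, not taken from the slides.

```latex
\hat{e} = \arg\max_{e} \; P(e)\, P(k \mid e),
\qquad
P(k \mid e) \approx \max_{\substack{e = a_1 \cdots a_n \\ k = b_1 \cdots b_n}} \; \prod_{i=1}^{n} P(a_i \rightarrow b_i),
\quad |a_i|, |b_i| \le N
```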

  7. Noisy Channel Error Model • Example alignment: akgsual / actual • Non-matching segments between English and romanized katakana, with |a|, |b| ≤ 2
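
One way to collect substring edits from the "actual" / "akgsual" example on this slide is to align the two strings character by character and then join adjacent alignment steps into edits a → b with |a|, |b| ≤ 2. The following is a minimal sketch of that idea; the function names `align` and `substring_edits` are hypothetical, and the unit-cost Levenshtein alignment is an assumption, since the slides do not say how the alignment is produced.

```python
from collections import Counter


def align(src, tgt):
    """Return a character-level Levenshtein alignment as (a, b) pairs.

    An empty string on either side marks an insertion or a deletion.
    """
    n, m = len(src), len(tgt)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dist[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            dist[i][j] = min(diag, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            ops.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((src[i - 1], ""))
            i -= 1
        else:
            ops.append(("", tgt[j - 1]))
            j -= 1
    return ops[::-1]


def substring_edits(ops, max_len):
    """Count edits a -> b built from adjacent alignment steps, with |a|, |b| <= max_len."""
    counts = Counter()
    for start in range(len(ops)):
        a = b = ""
        has_mismatch = False
        for x, y in ops[start:]:
            a, b = a + x, b + y
            has_mismatch = has_mismatch or x != y
            if len(a) > max_len or len(b) > max_len:
                break
            if has_mismatch:  # pure copies such as u -> u are not counted
                counts[(a, b)] += 1
    return counts


ops = align("actual", "akgsual")
edits = substring_edits(ops, max_len=2)
```

Several optimal alignments exist for this pair, so which single-character edits appear depends on how ties are broken in the traceback. Counting such edits over many training pairs and normalizing the counts would be one way to estimate edit probabilities.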

  8. Noisy Channel Error Model • Some high-probability edits learned for mapping English to romanized katakana
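
Given a table of learned edits, an English string can be scored against a romanized katakana string by dynamic programming over substring edits. The sketch below illustrates this. The table `TOY_EDITS`, the constants `MATCH_P` and `FLOOR_P`, and the function names are all assumptions made for illustration; the probabilities are invented and are not the learned values from the slide.

```python
import math

# Hypothetical edit probabilities P(a -> b). These numbers are invented for
# illustration; they are NOT the learned values shown on the slide.
TOY_EDITS = {
    ("l", "r"): 0.6,
    ("le", "uru"): 0.3,
    ("si", "shi"): 0.4,
}
MATCH_P = 0.9   # assumed probability of copying one character unchanged
FLOOR_P = 1e-4  # assumed probability of any other single-character edit


def edit_prob(a, b, edits):
    """Look up P(a -> b); return None when the edit is not allowed."""
    if (a, b) in edits:
        return edits[(a, b)]
    if len(a) == 1 and a == b:
        return MATCH_P
    if len(a) <= 1 and len(b) <= 1:
        return FLOOR_P
    return None


def log_prob(src, tgt, edits=TOY_EDITS, max_len=4):
    """Best log-probability of rewriting src as tgt with edits of bounded length."""
    n, m = len(src), len(tgt)
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == -math.inf:
                continue
            # Try every edit src[i:i+di] -> tgt[j:j+dj] that starts at this cell.
            for di in range(min(max_len, n - i) + 1):
                for dj in range(min(max_len, m - j) + 1):
                    if di == 0 and dj == 0:
                        continue
                    p = edit_prob(src[i:i + di], tgt[j:j + dj], edits)
                    if p is not None:
                        score = best[i][j] + math.log(p)
                        if score > best[i + di][j + dj]:
                            best[i + di][j + dj] = score
    return best[n][m]
```

Under this toy table, `log_prob("shingle", "shinguru")` is far higher than `log_prob("post", "shinguru")`, which is the kind of separation a learned model would be expected to provide.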

  9. Harvesting Training Data • Extract a database of English and Japanese queries from the MSN Search query logs 1. The set of katakana strings in an encyclopedia is small and does not grow over time. 2. Encyclopedias are static and do not contain new names and phrases.

  10. Harvesting Training Data • 461,567 sentences contain only 40,127 unique katakana strings • 10,000 new katakana strings are acquired each day
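
Counting unique katakana strings in a body of text can be sketched with a regular expression over the Unicode Katakana block (U+30A0 to U+30FF). The function name and the sample queries are made up for illustration, and half-width katakana are ignored in this sketch.

```python
import re

# Runs of characters from the Unicode Katakana block, which also contains the
# long-vowel mark (U+30FC) and the middle dot (U+30FB).
KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")


def unique_katakana(queries):
    """Return the set of distinct katakana runs found in an iterable of strings."""
    found = set()
    for query in queries:
        found.update(KATAKANA_RUN.findall(query))
    return found


# Made-up sample queries, not taken from any real log.
sample = ["シングル 歌詞", "single", "シングル", "ファイバー cable"]
strings = unique_katakana(sample)
```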

  11. Harvesting from Non-Aligned Query Databases
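
Harvesting from non-aligned data means that no katakana query is known in advance to correspond to any English query, so candidates have to be found by string similarity. A minimal sketch follows, assuming the katakana strings have already been romanized. Plain unit-cost edit distance stands in for the learned error model, and `harvest_candidates` and its `max_ratio` cutoff are hypothetical.

```python
def levenshtein(s, t):
    """Plain edit distance with unit costs, computed one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def harvest_candidates(romanized_katakana, english_queries, max_ratio=0.5):
    """Map each romanized katakana string to English candidates, closest first."""
    lexicon = {}
    for kana in romanized_katakana:
        scored = []
        for eng in english_queries:
            dist = levenshtein(kana, eng)
            # Normalize by the longer string so short and long terms are comparable.
            if dist / max(len(kana), len(eng), 1) <= max_ratio:
                scored.append((dist, eng))
        if scored:
            lexicon[kana] = [eng for _, eng in sorted(scored)]
    return lexicon


pairs = harvest_candidates(["faibaa"], ["fiver", "fiber", "post"], max_ratio=0.7)
```

Comparing every katakana string with every English query is quadratic in the number of queries, so a real system would need some form of candidate pruning. That part is left out here.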

  12. Growing a Bilingual Lexicon • shinguru: shingle & single • faibaa: fiver & fiber • pakkuman: packman & pac-man • maimu: maim -> mime • rainingu: raining -> lining • purizumu: purism -> prism • posuto: past & post • retasu: retrace & lettuce • bangaroo: kangaroo & bungalow

  13. Improving the Noisy Channel Error Model • Extract katakana-English word pairs culled from: 1. Kenkyusha New College Japanese-English Dictionary 2. Terms extracted from in-house localization databases 3. Iwanami Kokugojiten pocket dictionary • These consist of a rather general collection of terms and proper names, mostly geographical names

  14. Improving the Noisy Channel Error Model • The baseline plot uses a fairly conservative filter • More aggressive filters allow more “noisy” data through

  15. Improving the Noisy Channel Error Model • Occurs at least 100 times (threshold of 100)
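
A frequency threshold of this kind can be sketched with a counter. The slide does not say exactly what is being counted, so the sketch assumes it is whole query strings; the normalization (trimming and lower-casing) and the function name are also assumptions.

```python
from collections import Counter


def frequent_strings(queries, threshold=100):
    """Keep only strings that occur at least `threshold` times, with their counts."""
    counts = Counter(q.strip().lower() for q in queries)
    return {q: c for q, c in counts.items() if c >= threshold}


# Made-up data: one string just at the threshold, one just below it.
kept = frequent_strings(["prism"] * 100 + ["purism"] * 99)
```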

  16. Conclusions • Robust utility in acquiring <katakana, English> pairs • Useful for IR and MT

  17. Opinion • Can assist with our research • Example: 本拉登 • Pinyin romanization: ben la deng • Loanword source: Bin Laden
