Mining Translations for Key Phrases from Web Corpora. Ying Zhang (Joy), Fei Huang, Stephan Vogel. CMU/LTI MT Lunch Presentation, April 19, 2005.

The Soul Lost in the Blue Bridge

soul decided blue bridge .

Mining Key Phrase Translations from Web Corpora

Mining Translations for Key Phrases from Web Corpora

Ying Zhang (Joy), Fei Huang

Stephan Vogel

CMU/LTI MT Lunch Presentation

April 19 2005

Outline
  • Motivation
  • Crosslingual query expansion
  • Key phrase translation extraction
  • Experiments
  • Conclusion and future work

Key Phrase
  • Definition
    • Named-entities:
      • person, organization and location
    • Book/movie titles
    • Terminology (Medical, Sci&Tech, Military, …)
  • Most of them are compound nouns
    • Their meaning cannot be derived directly from their components
    • Translating them requires more world knowledge
  • Important for NLP applications:
    • Machine Translation (MT)
    • Cross-lingual Information Retrieval (CLIR)
    • Question-Answering (QA)
  • Most of them are out-of-vocabulary (OOV)

Searching the web for the translation?
  • Searching the parallel data on the web (e.g. STRAND: Resnik 2003)

Bilingual Information on the Web
  • Searching the parallel data on the web (Resnik 2003)
  • Searching the comparable corpus on the web (Fung 1998)
  • Anchor texts pointing to the same page (Lu 2004)

Bilingual Information on the Web
  • Limited bilingual resources are available as parallel or comparable corpora on the web
    • STRAND: 3,500 English-Chinese document pairs, and fewer than 2,500 for English-French (Resnik 2003)
    • Comparable corpora: from 10 years of Xinhua Chinese and English stories (2GB), only 110K sentence pairs (44MB) were found to be “parallel” (Zhao & Vogel 2002)
    • Anchor text mining: from 2M web pages, 2.8MB of Chinese text and 3.1MB of English text were found as potential translations
  • More bilingual information is on the web in the form of mixed-language web pages
    • Parallel texts are not needed in most cases
    • Chinese authors usually include the original English for key phrases
      • For consistency
      • To give the readers more information
      • If they are not sure about the translation in Chinese

Web pages of mixed languages

Mining translations from mixed-lang. pages
  • Crawling the Chinese web pages that contain English text. (Zhang and Vines, SIGIR 2004)
    • Use Google to locate the webpages containing the Chinese terms
    • English expressions that occur next to the Chinese terms are taken as their translations
    • Crawled 2GB of web data and found 1,168 distinct English terms, 61% of which are correct translations
  • Searching the Chinese terms among the English pages. (Cheng et al. SIGIR 2004)
    • Use Google to retrieve “English” pages containing the Chinese terms
    • Extract translations from the snippets
    • LiveTrans system

Pros and cons of these approaches

Our approach: cross-lingual query expansion
  • Query expansion: expanding the original query to better represent user’s “information need”
    • E.g. expand the query “cmu” to “cmu pittsburgh” if the user wants to find pages about “Carnegie Mellon University” instead of “Central Michigan University”.
  • Cross-lingual query expansion
    • Information need: not pages relevant to query Q, but pages containing the translation of Q
    • How to represent this “need”?
    • Observation: assuming that Q and Q’ are two relevant Chinese terms, when a webpage contains Q and its translation E, Q’ and its translation E’ are very likely to appear on the same page.
    • If we know Q’ and its translation E’, expand query Q with E’
    • Q’ and E’ are called hint words

Cross-lingual query expansion
  • A good Chinese hint word should be:
    • Relevant to the term to be translated
    • Easy to translate given the current resources
  • How to find this Chinese hint word?
    • Search Google
    • Select Chinese words Q’ with high frequency
    • Use only those having translations in the LDC lexicon
  • Search Google again with Q+E’ pairs
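The hint-word selection above can be sketched in a few lines. This is only an illustration under stated assumptions: the first-pass snippets are assumed to be already segmented into word lists, the `lexicon` dictionary is a stand-in for the LDC lexicon, and the function names `select_hint_words` and `expand_query` are hypothetical. No real search API is called.

```python
from collections import Counter

def select_hint_words(query, segmented_snippets, lexicon, top_k=3):
    """Pick frequent Chinese words Q' that co-occur with the query Q
    and have a known translation E' in the lexicon (hypothetical helper)."""
    counts = Counter(
        word
        for snippet in segmented_snippets
        for word in snippet
        if word != query and word in lexicon
    )
    return [(word, lexicon[word]) for word, _ in counts.most_common(top_k)]

def expand_query(query, hints):
    """Build one expanded query Q + E' per hint word."""
    return [f"{query} {english}" for _, english in hints]

# Toy first-pass results for Q; the real system would obtain these from a web search.
snippets = [
    ["廊桥遗梦", "电影", "小说"],
    ["廊桥遗梦", "电影", "导演"],
    ["电影", "廊桥遗梦", "影评"],
]
# Toy stand-in for the bilingual lexicon (assumption, not the LDC data).
lexicon = {"电影": "movie", "小说": "novel"}

hints = select_hint_words("廊桥遗梦", snippets, lexicon, top_k=2)
queries = expand_query("廊桥遗梦", hints)
print(queries)
```

Words without a lexicon entry (here 导演 and 影评) are skipped, which mirrors the "easy to translate given the current resources" criterion.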

Comparing with other approaches

Next step …

Outline
  • Motivation
  • Crosslingual query expansion
  • Key phrase translation extraction
    • Preprocessing
    • Multiple features
      • Transliteration model
      • Translation model
      • Frequency-distance model
    • Feature combination
  • Experiments
  • Conclusion and future work

Preprocessing
  • HTML tag filtering
  • Chinese word segmentation
  • Character replacement
    • Replacing punctuation with separator “|”
    • Replacing non-query Chinese words with “+”
  • Grouping continuous English words into a phrase

廊桥遗梦》(the bridges of madison county) [review]. 发布者:anjing | 发布时间:2004-01-25 星期日02:13 | 最新更新时间

廊 桥 遗 梦》(the bridges of madison county) [review]. 发布 者:anjing | 发布 时间:2004-01-25 星期日 02:13 | 最新 更新 时间

| 廊 桥 遗 梦 | the bridges of madison county | review | ++ +| anjing | ++ ++ | 2004-01-25 +++ 02:13 | ++ ++ ++

| 廊桥遗梦 | the_bridges_of_madison_county | review | ++ + | anjing | ++ ++ | 2004-01-25 +++ 02 13 | + + ++ ++
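The character replacement and phrase grouping steps can be approximated as follows. This is a rough sketch, not the authors' implementation: it assumes HTML filtering and word segmentation have already been done, it treats every non-word character as punctuation, and it groups numerals together with English words, so dates and times come out differently from the slide example. The function name `preprocess` is hypothetical.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")

def preprocess(segmented_text, query):
    """Apply character replacement and English-phrase grouping to segmented text."""
    # Re-join the query term in case the segmenter split it into pieces.
    query_pattern = r"\s*".join(re.escape(ch) for ch in query)
    text = re.sub(query_pattern, lambda m: f" {query} ", segmented_text)
    # Every punctuation character becomes the separator "|".
    text = re.sub(r"[^\w\s]", " | ", text)

    out, kinds = [], []
    for token in text.split():
        if token == "|":
            kind = "sep"
        elif token == query:
            kind = "query"
        elif CJK.search(token):
            # Non-query Chinese word: one "+" per character.
            kind, token = "other", "+" * len(token)
        else:
            kind = "english"
        if kind == "sep" and (not kinds or kinds[-1] == "sep"):
            continue  # drop leading and repeated separators
        if kind == "english" and kinds and kinds[-1] == "english":
            out[-1] += "_" + token  # group adjacent English words into a phrase
            continue
        out.append(token)
        kinds.append(kind)
    if kinds and kinds[-1] == "sep":
        out.pop()  # drop a trailing separator
    return " ".join(out)

line = "廊 桥 遗 梦》(the bridges of madison county) [review]. 发布 者:anjing"
print(preprocess(line, "廊桥遗梦"))
```

Keeping one “+” per replaced character preserves a rough notion of distance between the query and each English phrase, which a distance-based feature can use later.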

Phrase Alignment Features
  • Transliteration model
    • Capture phonetic similarity
      • Person, location and brand names
    • Probabilistic surface string alignment
      • Romanized source phrases vs. target phrase
      • Letters are aligned according to their pronunciation similarity (not their orthographic forms)
      • Letter pronunciation similarities are automatically learned from bilingual NE lists using EM

Key phrase alignment path

(雅诗兰黛 vs. Estee Lauder)

Huang, Vogel and Waibel, Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization, ACL 03 Multilingual NE Recognition Workshop
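One way to picture the surface string alignment is a weighted edit distance between the romanized source phrase and the English candidate. The sketch below is an assumption-laden stand-in: the substitution costs in `similar` are made up by hand, whereas the slide says the letter similarities are learned with EM from bilingual NE lists, and `alignment_cost` is a hypothetical name.

```python
def alignment_cost(source, target, sub_cost, gap_cost=1.0):
    """Weighted edit distance between two letter strings.

    sub_cost maps (source_letter, target_letter) to a cost in [0, 1];
    identical letters cost 0 and unlisted pairs cost 1.
    """
    m, n = len(source), len(target)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap_cost
    for j in range(1, n + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = source[i - 1], target[j - 1]
            sub = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub,   # align letter a with letter b
                dp[i - 1][j] + gap_cost,  # leave a unaligned
                dp[i][j - 1] + gap_cost,  # leave b unaligned
            )
    return dp[m][n]

# Hand-set, purely illustrative similarities (not learned values).
similar = {("y", "e"): 0.5, ("i", "e"): 0.3, ("n", "u"): 0.6, ("a", "e"): 0.4}

# Pinyin of 雅诗兰黛 against the English candidate, spaces removed.
cost = alignment_cost("yashilandai", "esteelauder", similar)
print(cost)
```

A lower cost means the two strings are phonetically closer under the given table; dividing by the string lengths would make costs comparable across candidates of different length.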

Phrase Alignment Features
  • Translation model
    • Capture semantic similarity
      • Organization names
      • Science and technical terms
      • Military terms
    • Calculate phrase translation probability using IBM models

简 氏 防务 周刊

Jane’s defense weekly
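The slide does not say which IBM model is used or how it is applied, so the following assumes the simplest case, IBM Model 1, where each target word is generated by any source word (or NULL) with uniform alignment probability. The lexical table `t` holds invented toy values, and the constant factor of Model 1 is omitted because it does not affect the ranking of candidates of equal length.

```python
import math

def model1_log_prob(target_words, source_words, t, floor=1e-6):
    """Log of prod_j [ 1/(l+1) * sum_i t(e_j | f_i) ], with f_0 = NULL.

    t maps (target_word, source_word) to a lexical translation probability;
    unseen pairs fall back to a small floor value.
    """
    src = ["NULL"] + list(source_words)
    log_p = 0.0
    for e in target_words:
        total = sum(t.get((e, f), floor) for f in src)
        log_p += math.log(total / len(src))
    return log_p

# Toy lexical probabilities (illustrative assumptions, not trained values).
t = {
    ("jane's", "简"): 0.4,
    ("jane's", "氏"): 0.3,
    ("defense", "防务"): 0.8,
    ("weekly", "周刊"): 0.9,
}
source = ["简", "氏", "防务", "周刊"]

good = model1_log_prob(["jane's", "defense", "weekly"], source, t)
bad = model1_log_prob(["the", "horse", "whisperer"], source, t)
print(good > bad)
```

Working in log space avoids underflow when phrases get longer.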

Phrase Alignment Features
  • Frequency-distance model
    • Frequency of co-occurrence
    • Distance within a snippet
      • s_i: returned snippet containing (f, e)
      • f_i: the i-th occurrence of the source phrase in s_i
      • d: distance, i.e., how many words are in between

[Figure labels: distances d1 and d2 marked on the snippet below]

马 语 者 | the_horse_whisperer | the_review | ++ ++ ++ ++ + | 马 语 者 | horse_whisperer | ac3 | cd1_verycd_com |+ ++ | peter_hewitt
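The exact scoring formula is not in the transcript, so the sketch below uses an assumed form: every occurrence of the source phrase contributes 1/(1+d), where d is the number of words between it and the nearest occurrence of the candidate in the same snippet. Separators are ignored when counting words, and the source phrase is assumed to be re-joined into a single token; both choices, like the name `frequency_distance_score`, are illustrative assumptions.

```python
def frequency_distance_score(snippets, source, candidate):
    """Sum 1/(1+d) over every occurrence of the source phrase, where d is the
    number of words between it and the nearest candidate occurrence."""
    score = 0.0
    for tokens in snippets:
        words = [tok for tok in tokens if tok != "|"]  # separators are not words
        cand_positions = [i for i, tok in enumerate(words) if tok == candidate]
        if not cand_positions:
            continue  # the pair (f, e) does not co-occur in this snippet
        for i, tok in enumerate(words):
            if tok == source:
                d = min(abs(i - j) for j in cand_positions) - 1
                score += 1.0 / (1 + d)
    return score

snippet = ("马语者 | the_horse_whisperer | the_review | ++ ++ ++ ++ + | "
           "马语者 | horse_whisperer | ac3").split()

for cand in ["the_horse_whisperer", "horse_whisperer", "ac3"]:
    print(cand, round(frequency_distance_score([snippet], "马语者", cand), 3))
```

Candidates that appear often and close to the source phrase accumulate a higher score than ones that merely share the snippet.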

Phrase Alignment Features
  • Feature combination
    • Confidence measure of the transliteration model
    • Confidence measure of the translation model
    • Overall combined feature cost
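The combination formula itself is not in the transcript. As a placeholder, the sketch below assumes a plain weighted sum of per-feature costs, with the weights playing the role of the confidence measures, and picks the candidate with the lowest combined cost. The feature names, weights and cost values are all invented for illustration.

```python
def combined_cost(costs, weights):
    """Weighted sum of feature costs (assumed form of the overall cost)."""
    return sum(weights[name] * costs[name] for name in weights)

def best_candidate(candidates, weights):
    """Return the candidate translation with the lowest combined cost."""
    return min(candidates, key=lambda c: combined_cost(candidates[c], weights))

# Hypothetical confidence weights for the three features.
weights = {"translit": 0.5, "trans": 0.3, "freq_dist": 0.2}

# Hypothetical per-feature costs for two candidates of 雅诗兰黛.
candidates = {
    "estee_lauder": {"translit": 0.2, "trans": 0.4, "freq_dist": 0.1},
    "the_review": {"translit": 0.9, "trans": 0.8, "freq_dist": 0.5},
}

print(best_candidate(candidates, weights))
```

In practice the weights would be set per phrase, for example trusting the transliteration cost more for person and brand names and the translation cost more for organization names.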

Outline
  • Motivation
  • Crosslingual query expansion
  • Key phrase translation extraction
  • Experiments
  • Conclusion and future work

Experiment
  • Test set
    • 310 key phrases manually selected from 12 domains
    • Manual translation as reference
    • One phrase may have several correct translations

Inclusion Rate

[Chart: inclusion rate for three search settings: With Hint, Whole Web; No Hint, English Pages; No Hint, Whole Web]

Define inclusion rate as:

inclusion rate = (# of phrases whose translations are included in the returned snippets) / (total # of phrases)

Alignment Accuracy

Define alignment accuracy as:

alignment accuracy = (# of correct phrase translations) / (# of phrases whose translation can be retrieved in snippets)
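Both measures are simple ratios, and a small sketch makes the relationship between them explicit. The data below is a toy example, the substring test for "included in the snippets" is a simplification, and treating overall accuracy as correct translations over all phrases (hence the product of the two ratios) is an assumption of this sketch.

```python
def evaluate(references, snippets, predictions):
    """Return (inclusion rate, alignment accuracy, overall accuracy).

    references:  phrase -> set of acceptable translations
    snippets:    phrase -> list of returned snippet strings
    predictions: phrase -> translation extracted by the system
    """
    included = [
        p for p, refs in references.items()
        if any(ref in s for s in snippets.get(p, []) for ref in refs)
    ]
    correct = [p for p in included if predictions.get(p) in references[p]]
    inclusion_rate = len(included) / len(references)
    alignment_accuracy = len(correct) / len(included) if included else 0.0
    return inclusion_rate, alignment_accuracy, inclusion_rate * alignment_accuracy

references = {
    "廊桥遗梦": {"the bridges of madison county"},
    "马语者": {"the horse whisperer", "horse whisperer"},
    "雅诗兰黛": {"estee lauder"},
    "简氏防务周刊": {"jane's defense weekly"},
}
snippets = {
    "廊桥遗梦": ["廊桥遗梦 the bridges of madison county review"],
    "马语者": ["马语者 the horse whisperer the review"],
    "雅诗兰黛": ["雅诗兰黛 estee lauder"],
    "简氏防务周刊": ["简氏防务周刊 周刊"],
}
predictions = {
    "廊桥遗梦": "the bridges of madison county",
    "马语者": "the review",
    "雅诗兰黛": "estee lauder",
    "简氏防务周刊": "weekly",
}

inclusion, alignment, overall = evaluate(references, snippets, predictions)
print(inclusion, alignment, overall)
```

Separating the two ratios shows where errors come from: a low inclusion rate points at the search step, a low alignment accuracy at the extraction features.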

Overall Translation Accuracy

LiveTrans: an OOV translator using web corpora

http://livetrans.iis.sinica.edu.tw/lt.html

Sample Translation Results

Conclusion and Future Work
  • Find key phrase translation via Web mining
    • Cross-lingual query expansion finds more relevant webpage snippets
    • Transliteration, translation and frequency-distance features extract the correct translation
    • Significant improvements over several existing systems
  • Future work
    • Experiment on other language pairs, e.g. with Arabic
    • Select effective hint words based on richer features
    • Flexible phrase boundary detection
    • Apply to MT tasks

References
  • P. Fung and L. Y. Yee. An IR approach for translating new words from nonparallel, comparable texts. In Proc. of COLING-ACL, pp. 414-420, 1998.
  • F. Huang, S. Vogel and A. Waibel. Automatic extraction of named entity translingual equivalence based on multi-feature cost minimization. In Proceedings of the 41st ACL, Workshop on Multilingual and Mixed-Language Named Entity Recognition, Sapporo, Japan, July 2003.
  • W.-H. Lu, L.-F. Chien and H.-J. Lee. Anchor text mining for translation of Web queries: a transitive translation approach. ACM Transactions on Information Systems, 22(2), pp. 242-269, 2004.
  • P. Resnik and N. A. Smith. The Web as a parallel corpus. Computational Linguistics, 29(3), pp. 349-380, 2003.
  • Y. Zhang and P. Vines. Detection and translation of OOV terms prior to query time. In SIGIR '04, pp. 524-525. ACM Press, 2004.
  • Y. Zhang, F. Huang and S. Vogel. Mining translations of OOV terms from the Web through cross-lingual query expansion. In SIGIR ’05.
