
Combining Query Translation and Document Translation in Cross-Language Retrieval

Combining Query Translation and Document Translation in Cross-Language Retrieval. Aitao Chen & Fredric C. Gey* School of Information Management and Systems *UC Data Archive & Technical Assistance University of California at Berkeley. CLEF 2003 Workshop: 21-22 August, 2003, Trondheim, Norway.

Presentation Transcript


  1. Combining Query Translation and Document Translation in Cross-Language Retrieval Aitao Chen & Fredric C. Gey* School of Information Management and Systems *UC Data Archive & Technical Assistance University of California at Berkeley CLEF 2003 Workshop: 21-22 August, 2003, Trondheim, Norway

  2. Talk Outline • Development of new resources • Fast approximate document translation • Combining query translation and document translation • Conclusions

  3. New Resources • Finnish and Swedish stoplists • Base Finnish and Swedish lexicons for decompounding • Statistical translation lexicons derived from parallel texts • Finnish and Swedish statistical stemmers automatically generated from parallel texts • English spelling normalizer

  4. Development of Swedish Stoplist (by someone who doesn't know Swedish) Using Swedish textbooks (e.g., grammars) written in English, look for Swedish words whose English translations are English stopwords. • en park (a park) • ett piano (a piano) • Jag vet inte mycket om honom (I don't know much about him) • efter skolan (after school) • Hans och Greta (Hans and Greta) (Source: Swedish: A Comprehensive Grammar by P. Holmes & I. Hinchliffe)

  5. Development of Swedish Base Lexicon A base lexicon should contain all and only the words and their variants that are not compounds. • Compile a list of Swedish words (e.g., from the Swedish document collection). • Remove the words that are 4 or fewer characters long. • Remove the long words that can be decomposed into short words in the initial wordlist. Example wordlist: animation, animationen, dator, datoranimation, datorgrafik, datorteknologi, datorvirus, grafik, teknologi, virus. Remove the compounds that are decomposed: dator + animation, dator + grafik, dator + teknologi, dator + virus.
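The three steps on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `can_split` and `build_base_lexicon` are hypothetical, components must themselves survive the length filter, and linking elements between compound parts are not handled.

```python
def can_split(word, vocab):
    """Return True if word is a concatenation of two or more vocab entries."""
    n = len(word)
    # reachable[i] is True when word[:i] can be written as a sequence of vocab words.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for end in range(1, n + 1):
        for start in range(end):
            if not reachable[start]:
                continue
            piece = word[start:end]
            # Skip the whole word so that a word never "decomposes" into itself.
            if piece != word and piece in vocab:
                reachable[end] = True
                break
    return reachable[n]


def build_base_lexicon(words, min_len=5):
    """Drop words of 4 or fewer characters, then drop decomposable compounds."""
    vocab = {w.lower() for w in words if len(w) >= min_len}
    return sorted(w for w in vocab if not can_split(w, vocab))


wordlist = [
    "en", "ett", "animation", "animationen", "dator", "datoranimation",
    "datorgrafik", "datorteknologi", "datorvirus", "grafik", "teknologi", "virus",
]
base_lexicon = build_base_lexicon(wordlist)
print(base_lexicon)
```

On the slide's example wordlist, the four dator- compounds are removed while the inflected form animationen stays, because "en" is filtered out by the length rule and so cannot serve as a component.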

  6. Development of Statistical Translation Lexicons from Parallel Texts Pipeline: parallel texts (EU Official Journal) → PDF-to-text conversion → paragraph & sentence alignment → statistical association, statistical MT toolkit → statistical translation lexicons. Language pairs: • Italian → Spanish • German → Italian • Finnish → German • English → Dutch • English → Finnish • English → Swedish • Dutch → English • Finnish → English • Swedish → English

  7. Development of Statistical Stemmers Swedish words → statistical English translations → clusters. "computer" cluster (stem: dator): dator, datorn, datorer, datorersom, datornät, datornernä, informatik; English translations: computer, computers. "diamond" cluster (stem: diamant): diamant, diamanten, diamanterna, diamanter; English translations: diamond, diamonds.
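One way to read this slide is that Swedish word forms are grouped whenever their statistical English translations reduce to the same English stem. The sketch below illustrates that reading only. The word-to-translation pairings in `toy_lexicon`, the function names, and the toy `english_stem` (which strips a plural "s" and nothing else) are all assumptions made for illustration, not the authors' data or method.

```python
from collections import defaultdict


def english_stem(word):
    # Toy stand-in for a real English stemmer: strips a trailing plural "s" only.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def cluster_by_translation(lexicon):
    """Group source-language words by the stem of their English translation."""
    clusters = defaultdict(list)
    for source_word, english in lexicon.items():
        clusters[english_stem(english)].append(source_word)
    return {stem: sorted(members) for stem, members in clusters.items()}


# Hypothetical most-likely translations, as a translation lexicon might supply them.
toy_lexicon = {
    "dator": "computer",
    "datorn": "computer",
    "datorer": "computers",
    "diamant": "diamond",
    "diamanten": "diamond",
    "diamanter": "diamonds",
    "diamanterna": "diamonds",
}
clusters = cluster_by_translation(toy_lexicon)
print(clusters)
```

Each cluster can then be mapped to a single representative form, which is what makes the clustering usable as a stemmer.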

  8. Fast Approximate Document Translation Spanish documents → list of Spanish words → Spanish-English MT → list of English words → bilingual Spanish-English wordlist → word-by-word translation of the Spanish documents → English translations.
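The idea on this slide appears to be that the MT system translates each distinct word once, and the documents are then translated by table lookup. A minimal sketch of that reading follows. The function names are hypothetical, and `toy_mt` is a stand-in dictionary with made-up entries used in place of a real Spanish-English MT system.

```python
def build_wordlist(documents):
    """Collect the distinct (lowercased, whitespace-separated) words of the collection."""
    return sorted({tok for doc in documents for tok in doc.lower().split()})


def build_bilingual_table(wordlist, translate_word):
    """Translate each distinct word once; translate_word stands in for the MT system."""
    return {word: translate_word(word) for word in wordlist}


def translate_document(doc, table):
    """Word-by-word translation; words missing from the table are kept unchanged."""
    return " ".join(table.get(tok, tok) for tok in doc.lower().split())


# Toy stand-in for the MT system (assumed entries, for illustration only).
toy_mt = {"las": "the", "dietas": "diets", "para": "for", "celíacos": "celiacs"}

documents = ["Las dietas para celíacos", "dietas para celíacos"]
table = build_bilingual_table(build_wordlist(documents), lambda w: toy_mt.get(w, w))
translated = [translate_document(doc, table) for doc in documents]
print(translated)
```

The cost of the MT step grows with the vocabulary size rather than with the collection size, which is what makes the approach fast, at the price of ignoring word order and context.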

  9. Query Translation-based Multilingual Retrieval The query is translated with L&H into each document language. Each version of the query (English, French, German, Spanish) is run by IR against the documents in that language (English docs, French docs, German docs, Spanish docs). A merger then produces a combined ranked list of documents.
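The slide does not say how the merger combines the per-language result lists. The sketch below assumes the simplest strategy, merging by raw retrieval score, purely for illustration; the function name, document identifiers, and scores are hypothetical.

```python
def merge_ranked_lists(runs, k=1000):
    """Merge per-language result lists into one list ordered by descending score.

    runs maps a language name to a list of (doc_id, score) pairs.
    Assumes scores are comparable across languages, which may not hold in practice.
    """
    merged = [hit for hits in runs.values() for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:k]


runs = {
    "English": [("en-01", 0.91), ("en-02", 0.40)],
    "French": [("fr-07", 0.75)],
    "German": [("de-03", 0.82), ("de-09", 0.10)],
    "Spanish": [("es-05", 0.55)],
}
combined = merge_ranked_lists(runs, k=4)
print(combined)
```

Raw-score merging is only reasonable when every language is searched with the same ranking function; otherwise scores usually need normalizing before they are merged.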

  10. Document Translation-based Multilingual Retrieval Documents: English; French → English; German → English; Spanish → English. The English query is run by IR against the English and English-translated documents, producing a unified ranked list of documents.

  11. Evaluation of Multilingual Retrieval Multilingual-4: English, TD Multilingual-8: English, TD

  12. Query Translation vs. Document Translation Topic 161, "Diets for Celiacs". Spanish doc words: celíacos, dietas. German doc words: diät, zöliakie. Query translation: "Las Dietas para Celiacs" (Spanish), "Nahrungen für Celiacs" (German). Document translation (word-by-word): "celiacs diets" (Spanish), "diets coeliac diseases" (German). Average precision: 0.0003 (mul4en1), 0.6750 (mul4en2). Topic 186. English words in topic: Dutch, Netherlands. French document words: Néerlandais, Pays-Bas. Query translation (French): Hollandais, Hollande. Document translation (word-by-word, English): Dutch, Netherlands. Diagram values: 0.0, 1.0. Average precision: 0.2213 (mul4en1), 0.6167 (mul4en2).

  13. Manual vs. Automatic Stemming CLEF 2003 (topic fields: TD; no decompounding or query expansion). CLEF 2001-2002 (topic fields: TD; no query expansion).

  14. Evaluation of Decompounding, Stemming and Query Expansion in Monolingual Retrieval Topics (TD); columns: Dutch, German, Finnish, Swedish. decomp+stem+expan: .5304 (22.16%), .5678 (52.35%), .5633 (48.20%), .5465 (50.55%), where the percentages are gains over the baseline. baseline: .4342, .3727, .3801, .3630. Two-technique runs (decomp+expan, stem+expan, decomp+stem): .4962 .4804 .5541 .4838 .5126 .5473 .4469 .4880 .4955 .5111 .4972 .4727. Single-technique runs (stem, expan, decomp): .4744 .4294 .4204 .4331 .4673 .4867 .4071 .4224 .4480 .4220 .4974 .4121.
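The percentages attached to the decomp+stem+expan figures are consistent with relative gain over the baseline figures at the end of the slide, taking both sets in the order Dutch, German, Finnish, Swedish. A short check, with `relative_gain` as a hypothetical helper name:

```python
def relative_gain(score, baseline):
    """Percentage improvement of score over baseline, rounded to two decimals."""
    return round((score - baseline) / baseline * 100, 2)


languages = ["Dutch", "German", "Finnish", "Swedish"]
combined_scores = [0.5304, 0.5678, 0.5633, 0.5465]   # decomp+stem+expan
baseline_scores = [0.4342, 0.3727, 0.3801, 0.3630]   # baseline

gains = {
    lang: relative_gain(score, base)
    for lang, score, base in zip(languages, combined_scores, baseline_scores)
}
print(gains)
```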

  15. Conclusions • Fast approximate document-translation worked well. Combining document-translation with query-translation was even better. • Decompounding with stemming and query expansion worked well for languages with rich compounds. • Statistical stemmers derived from parallel texts were not as effective as manually built stemmers for Finnish and Swedish. But there is still room for improving statistical stemmers.

  16. Software The Berkeley Text Retrieval System is available for research purposes. Send requests to aitao@sims.berkeley.edu

  17. THANK YOU
