XRCE Participation to TEL

XRCE Participation to TEL. Jean-Michel Renders Stephane Clinchant. Xerox Research Center Europe 6 chemin de Maupertuis 38240 Meylan, France. Outline. Multilinguism Bilingual Dictionary Adaptation Some Experiments. Multilinguism and TEL Collections. Text in German. Text in English.

Presentation Transcript


  1. XRCE Participation to TEL
     Jean-Michel Renders, Stephane Clinchant
     Xerox Research Center Europe, 6 chemin de Maupertuis, 38240 Meylan, France

  2. Outline
     • Multilinguism
     • Bilingual Dictionary Adaptation
     • Some Experiments

  3. Multilinguism and TEL Collections
     [Figure: a multilingual collection containing text in German, English, and French; for example, a document in French and a document in German and English]

  4. MultiLinguism
     • Multilingual documents:
       • Different languages between documents and within a document
       • Relevant documents could be in any language!
     • The monolingual task is not monolingual
       • Monolingual means: query language = main language of the collection
       • Bilingual means: query language != main language of the collection
       • Queries need to be translated even in the « monolingual » case
     • A possible approach to multilinguism:
       • Index each language separately
       • Late fusion of results

  5. Our approach to Multilinguism for the TEL corpus
     • Merge all the languages into a single meta-language:
       • Words = (French words, English words, German words)
       • But « Gauguin » is not the same word in French as in German (different inverted lists)
     • Build a single index per collection (1 for BNF, 1 for BL, 1 for ONB)
     • Needs a global multilingual translation of queries
       • (!= several cross-lingual translations)
     • Requires merging of bilingual dictionaries
     • Late fusion of results → early fusion of dictionaries
     • Prior weights for merging resources
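
The "early fusion of dictionaries" with prior weights could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the function name `merge_dictionaries`, the language-tagged target words (such as `fr:peintre`), the resource names, and the weight values are all hypothetical.

```python
from collections import defaultdict


def merge_dictionaries(resources, prior_weights):
    """Merge several probabilistic dictionaries P(wt | ws) into one.

    resources:     {resource_name: {source_word: {target_word: probability}}}
    prior_weights: {resource_name: non-negative prior weight}
    Target words carry a language tag (an assumption) so that the same
    spelling in two languages stays two distinct meta-language words.
    """
    merged = defaultdict(lambda: defaultdict(float))
    for name, dictionary in resources.items():
        weight = prior_weights[name]
        for ws, translations in dictionary.items():
            for wt, p in translations.items():
                merged[ws][wt] += weight * p

    # Renormalize so that each source word has a proper distribution.
    result = {}
    for ws, translations in merged.items():
        total = sum(translations.values())
        if total > 0:
            result[ws] = {wt: p / total for wt, p in translations.items()}
    return result


# Hypothetical toy resources for an English query word.
resources = {
    "thesaurus_en": {"painter": {"en:painter": 0.7, "en:artist": 0.3}},
    "dict_en_fr": {"painter": {"fr:peintre": 1.0}},
    "dict_en_de": {"painter": {"de:maler": 1.0}},
}
prior_weights = {"thesaurus_en": 0.5, "dict_en_fr": 0.3, "dict_en_de": 0.2}

merged = merge_dictionaries(resources, prior_weights)
print(merged["painter"])
```

The prior weights decide how much probability mass each language receives in the translated query, which is why setting them well matters for this approach.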

  6. Our strategy
     [Diagram: resources (thesaurus English to English, dictionary French to English, dictionary German to English) give P(wt|ws); query → first translation of query → retrieve from the collection index → adapted dictionary P’(wt|ws,q) → new translation of query → retrieve and PRF]

  7. Dictionary-based CLIR
     • Translate the query
       • using a probabilistic bilingual dictionary P(wt | ws)
       • β controlling the amount of translation
     • And a monolingual language model (cross-entropy)
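
The slide's formulas did not survive in this transcript, so the following is only a generic sketch of dictionary-based query translation followed by cross-entropy ranking. It assumes a maximum-likelihood query model, Jelinek-Mercer smoothing of the document model, and hypothetical names (`translate_query`, `cross_entropy_score`, `smoothing`). The role of β is not recoverable from the transcript and is deliberately left out.

```python
import math
from collections import Counter


def translate_query(query_terms, dictionary):
    """Translated query model: p(wt | q) = sum over ws of P(wt | ws) * p(ws | q)."""
    counts = Counter(query_terms)
    total = sum(counts.values())
    translated = {}
    for ws, c in counts.items():
        p_ws = c / total
        for wt, p in dictionary.get(ws, {}).items():
            translated[wt] = translated.get(wt, 0.0) + p_ws * p
    return translated


def cross_entropy_score(query_model, doc_tokens, corpus_model, smoothing=0.5):
    """Score = sum over w of p(w | q) * log p(w | d); higher is better.

    p(w | d) is smoothed with the corpus model (Jelinek-Mercer, an assumption).
    """
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    score = 0.0
    for w, p_q in query_model.items():
        p_ml = counts[w] / length if length else 0.0
        p_d = (1 - smoothing) * p_ml + smoothing * corpus_model.get(w, 1e-9)
        score += p_q * math.log(p_d)
    return score


# Hypothetical toy data: an English query against French documents.
dictionary = {"painter": {"peintre": 0.8, "artiste": 0.2}}
corpus_model = {"peintre": 0.01, "artiste": 0.02, "bateau": 0.01}
query_model = translate_query(["painter"], dictionary)

doc_a = ["peintre", "gauguin", "tahiti"]
doc_b = ["bateau", "port", "mer"]
score_a = cross_entropy_score(query_model, doc_a, corpus_model)
score_b = cross_entropy_score(query_model, doc_b, corpus_model)
print(score_a > score_b)
```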

  8. Dictionary Adaptation
     • A similar idea is found in Hiemstra (CLEF 2000)
     • We introduced our version for the Domain Specific Track 07
     • Main idea:
       • Retrieval is a disambiguating process
       • Relevant documents contain the context of the query terms' translations: they are implicitly coherent
       • So do pseudo-relevant documents!

  9. Monolingual Pseudo Feedback with LM
     • Mixture model (C. Zhai, 2001)
     • For all documents in the pseudo-feedback set F:
       • For i = 1 to document length:
         • Choose
           • the distribution of the relevant topic model (θ)
           • or the distribution of the corpus language model P(w|C)
         • Sample a word from that distribution.
     • Selected words = the most probable words in θ
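
The generative story above is usually fitted with EM. The sketch below is a minimal version under stated assumptions: a fixed mixing weight `lam` for the corpus model, a fixed number of iterations, and a hypothetical function name `estimate_topic_model`. The toy documents and corpus probabilities are made up for illustration.

```python
from collections import Counter


def estimate_topic_model(feedback_docs, corpus_model, lam=0.5, iterations=30):
    """EM for the two-component mixture (1 - lam) * theta(w) + lam * P(w | C)."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    # Start from the maximum-likelihood word distribution of the feedback set.
    theta = {w: c / total for w, c in counts.items()}

    for _ in range(iterations):
        expected = {}
        for w, c in counts.items():
            topic = (1 - lam) * theta[w]
            background = lam * corpus_model.get(w, 1e-9)
            # E-step: expected number of occurrences of w drawn from theta.
            expected[w] = c * topic / (topic + background)
        # M-step: renormalize the expected counts.
        norm = sum(expected.values())
        theta = {w: e / norm for w, e in expected.items()}
    return theta


# Hypothetical toy feedback set and corpus model.
feedback_docs = [
    ["gauguin", "the", "painting"],
    ["the", "gauguin", "tahiti", "the"],
]
corpus_model = {"the": 0.5, "gauguin": 0.001, "painting": 0.01, "tahiti": 0.001}

theta = estimate_topic_model(feedback_docs, corpus_model)
top_words = sorted(theta, key=theta.get, reverse=True)[:2]
print(top_words)
```

Words that are frequent in the whole corpus are explained by P(w|C), so they lose weight in θ even when they are frequent in the feedback set.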

  10. Bilingual Dictionary Adaptation
     • For all documents in the pseudo-feedback set F:
       • For i = 1 to document length:
         • Choose
           • a query term and its associated distribution from a dictionary, P(wt | qs), denoted θst
           • or the distribution of the corpus language model P(w|C)
         • Sample a word from that distribution.

  11. How to estimate st • The estimation of st is done by EM initializing it by the intial dictionary • New Translation Filtering Effect: - words not in the top F are filtered • |F| =50,100 - weights are reestimated • Idem for monolingual and thesaurus

  12. Our official runs and our mistakes …
     • Lost relevant documents at indexing:
       • Kept only English, French, German: 240 (BL), 108 (BNF), 69 (ONB)
     • Dictionaries were not biased toward the target collection but toward the source language
     • Bad translation of queries:
       • β parameter (amount of translation) identical for bilingual and monolingual runs

  13. Some Pure Bilingual Experiments
     • Resource: JRC corpus to extract dictionaries
     • 3% average improvement

  14. Post Analysis of Multilinguism

  15. Conclusion
     • Multilinguism: theory vs. practice
       • In theory, it seems a good idea
       • In practice, most of the best runs are “pure” monolingual or “pure” bilingual
     • Dictionary adaptation:
       • Partial solution to the problem of setting prior weights
       • Got some improvements on sparse data this year
     • Partly Financed by

  16. Thank you for your attention!
