
XRCE at CLEF 07 Domain-specific Track

Strategies for mono- and cross-lingual information retrieval, from pseudo-relevance feedback (PRF) to lexical entailment: how to adapt query language models effectively for improved search, and the impact of machine translation on retrieval accuracy.





  1. XRCE at CLEF 07 Domain-specific Track • Stephane Clinchant and Jean-Michel Renders (presented by Gabriela Csurka) • Xerox Research Centre Europe, France

  2. Outline • Introduction • Mono-lingual Domain-Specific Information Retrieval • Query Language Model Refinement with PRF • Lexical Entailment • Results • Cross-lingual Domain-Specific Information Retrieval • Machine translation (Matrax) • Dictionary Adaptation • Results • Conclusion

  3. CLEF domain-specific track • Domain-Specific Information Retrieval • Leveraging the structure of the collections (i.e. controlled vocabularies and other metadata) to improve search. • Multilingual database in the social-science domain • German Social Science Information Centre's databases • Social science research-project databases • Tasks • Mono-lingual retrieval: queries and documents are in the same language • Cross-lingual retrieval: queries in one language are used with a collection in a different language.

  4. Information retrieval • Simple keyword matching is not enough to retrieve the best documents for a query. [Diagram: an information need, expressed as a query, is matched against documents d1, d2, … dn] Courtesy: drawing borrowed from C. Manning and P. Raghavan's lectures at Stanford University

  5. IR with Language Modeling • Treat each document as a multinomial distribution (language model) θd. [Diagram: an information need, expressed as a query, is generated from document models d1, d2, … dn] • The documents d in the corpus can then be ranked • either by the query likelihood P(q | θd) • or by computing a cross-entropy similarity between the language models of d and q

  6. How we estimate d • A simple language model could be obtained (Maximul Likelihood) by considering the frequency of words in • The probabilities are smoothed by the corpus language model by: • We used Jelinek-Mercer interpolation • The role of smoothing the language model is: • LM more accurate (the query word can be absent in a document) • The has an IDF effect (renormalize the frequency of words with respect to its occurence in the corpus C )

  7. Query LM Refinement with PRF • Aim: adapt (refine) the LM of a particular query • How: detecting the "implicit relevant concepts" present in the retrieved documents using pseudo-relevance feedback (PRF) • Pseudo-relevance feedback: top-N ranked texts based on query similarity • Refined query model: θq' = (1 − α) θq + α θF • Final rank: re-ranked documents based on the refined-query similarity

  8. How to estimate F • Let F(q)={d1,d2, ..dN} be the N most relevant document for query q • Draw di following: • With F assumed to be multinomial (peaked at relevant terms). • Then With F is estimated by EM algorithm from the global likelihood : where P(w| C ) is word probability built upon the Corpus and  (=0.5) a fix parameter Zhai and Lafferty, SIGIR 2001.

  9. Lexical Entailment • Lexical entailment: a thesaurus built automatically on a given corpus • Given by the probabilities P(u | v) that one term v entails another term u, estimated on the corpus • which are filtered using the information gain; an additional parameter enables us to increase the weight given to the self-entailment P(u | u). • Applied to IR: words from the document are "translated" into the different query terms • If we add a background smoothing, we obtain a smoothed entailment-based document model. • Pros: finds relations between terms that feedback cannot. • Cons: heavier to compute, and queries get longer. S. Clinchant, C. Goutte, E. Gaussier, "Lexical Entailment for Information Retrieval", ECIR 2006.
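A minimal sketch of entailment-based scoring, assuming document words are expanded into query terms through P(u | v) and then smoothed with the background model. The entailment table below is invented for the example; in the paper it is learnt from the corpus, filtered by information gain, and has a boosted self-entailment P(u | u).

```python
import math

entail = {  # P(query_term | doc_word), toy values
    ("poverty", "poverty"): 0.8,       # boosted self-entailment
    ("poverty", "deprivation"): 0.3,
    ("poverty", "income"): 0.1,
}

def le_score(query_terms, doc_model, corpus_model, lam=0.5):
    """Score a document by 'translating' its words into the query terms:
    P(u|d) = sum_v P(u|v) P(v|d), then background smoothing."""
    s = 0.0
    for u in query_terms:
        p = sum(entail.get((u, v), 0.0) * pv for v, pv in doc_model.items())
        p = (1 - lam) * p + lam * corpus_model.get(u, 1e-9)
        s += math.log(p)
    return s

# A document with no literal occurrence of the query word "poverty"
doc_model = {"deprivation": 0.5, "income": 0.5}
corpus_model = {"poverty": 0.001}
score = le_score(["poverty"], doc_model, corpus_model)
```

This shows the stated advantage: the document matches the query through the entailment relations "deprivation" → "poverty" and "income" → "poverty", even though a plain LM would see no shared term.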

  10. Other results for comparison

  11. DS 07 - Monolingual Official Runs • PRF Lexical Entailment is a double lexical-entailment model: a first LE model provides the system with an initial set of top-N documents, a mixture model for pseudo-feedback is built from them, and a second retrieval is performed, once again with the LE model applied to the enriched query.

  12. Cross-Lingual IR • "Translation" bridges the gap between the query (in the source language) and the documents (in the target language) • Two options: query translation or document translation [Diagram: an information need, expressed as a query in the source language, is matched against documents d1, d2, … dn in the target language]

  13. What to translate? • Document translation - translate the documents into the query language • Pros: the translation may be (theoretically) more precise, and the documents become "readable" by the user • Cons: a huge volume to be translated • Query translation - translate the query into the document language • Pros: flexibility (translation on demand) and less text to translate • Cons: less precise, and the retrieved documents still need to be translated to be readable.

  14. How to translate? • Statistical Machine Translation: Matrax • Alignment model: learnt on a parallel corpus (JRC-AC, with Giza++ word alignment) • Language model (N-gram): learnt on the GIRT corpus • Translate the source sentence into the K best target sentences • Use them as a "mono-lingual query" • Dictionary-Based Approach, with or without adaptation • Extract a probabilistic bilingual dictionary from different resources (standard dictionary, domain-specific thesaurus, JRC-AC) • Use the translated query with mono-lingual retrieval approaches • Adapt the dictionary to a particular (query (feedback), target corpus) pair • Use the adapted query with mono-lingual retrieval approaches

  15. Dictionary-based CLIR without Adaptation • Idea: rank the documents • according to the "cross-entropy similarity" between the language model of the query and the language model of the document • using a probabilistic bilingual dictionary given by P(wt | ws), the probability that the source word ws is translated into the target word wt

  16. Dictionary Adaptation • Aim: adapt the dictionary to a particular query • How: detecting the "implicit coherence" present in the relevant documents using PRF • The first retrieval (CL-LM) can also be seen as a dictionary-disambiguation process. [Diagram: the source query qs is translated with the dictionary P(wt | ws) and run through CL-LM; PRF on the top-N ranked texts yields an adapted dictionary; a second CL-LM pass produces the final rank of re-ranked target documents]

  17. How to estimate st = • Let F(q) be the relevant documents retrieved by the translated query using (CL-LM). • The global model likelihood becomes • The estimation of st is done by EM initializing it by the general dictionary P(wt | ws). • Finally, we apply (CL-LM)again, but with the adapted query language model: Note: • This is an extension of the ”Query LM Refinement with PRF” to multi-lingual case • This algorithm realizes both the queryenrichment and dictionary adaptation

  18. Other results for comparison

  19. DS 07- Bilingual Official Runs

  20. Conclusion • Mono-lingual Domain-Specific Information Retrieval • The query LM refinement with PRF (LM+PRF) gives better performance than the Lexical Entailment (Table 7), but unlike the latter it requires two retrieval steps. • Both outperform the non-adapted LM case (Table 9) • Combining them allowed for further improvements (Table 7) • Cross-lingual Domain-Specific Information Retrieval • Combining the query translation with further adaptation was beneficial (Table 9) • Matrax is better than dictionary-based IR without adaptation (Table 9) • The dictionary-adaptation method gave better results than query translation with Matrax, independently of the further mono-language adaptation (Table 8) • Combining the Lexical Entailment with LM+PRF was beneficial in the cross-lingual case too (Table 8)

  21. Thank you for your attention! Not satisfied with an answer? You can always ask the authors directly: • Stephane.Clinchant@xrce.xerox.com or • Jean-Michel.Renders@xrce.xerox.com

  22. Back-up

  23. Matrax architecture • JRC-AC → pre-processing → bi-phrase library construction → bi-phrase library • GIRT → language modeling (SRI-LM toolkit) → language model • Training and development sets → model-parameter optimization → model parameters • Decoder: combines the bi-phrase library, the language model and the model parameters
