XRCE at CLEF 07, Domain-specific Track. Stephane Clinchant and Jean-Michel Renders (presented by Gabriela Csurka), Xerox Research Centre Europe, France
Outline • Introduction • Mono-lingual Domain-Specific Information Retrieval • Query Language Model Refinement with PRF • Lexical Entailment • Results • Cross-lingual Domain-Specific Information Retrieval • Machine translation (Matrax) • Dictionary Adaptation • Results • Conclusion
CLEF domain-specific track • Domain-Specific Information Retrieval • Leveraging the structure of data in collections (i.e. controlled vocabularies and other metadata) to improve search. • Multilingual database in the social science domain • German Social Science Information Centre's databases • Social Science Research Projects databases • Tasks • Mono-lingual retrieval: queries and documents are in the same language • Cross-lingual retrieval: queries in one language are used with a collection in a different language.
Information retrieval • Simple keyword matching is not enough to retrieve the best documents for a query. Information need d1 matching d2 query … dn Courtesy: drawing borrowed from C. Manning and P. Raghavan lectures at Stanford University
IR with Language Modeling • Treat each document d as a multinomial distribution (language model) θd Information need d1 generation d2 query … dn • Then the documents d in the corpus can be ranked • either by the query likelihood P(q | θd) • or by computing a cross-entropy similarity between the language models of d and q Courtesy: drawing borrowed from C. Manning and P. Raghavan lectures at Stanford University
How we estimate θd • A simple language model can be obtained by Maximum Likelihood, using the frequency of words in d: PML(w | d) = tf(w, d) / |d| • The probabilities are smoothed by the corpus language model P(w | C) • We used Jelinek-Mercer interpolation: P(w | d) = (1 − λ) PML(w | d) + λ P(w | C) • The role of smoothing the language model is: • to make the LM more accurate (a query word can be absent from a document) • P(w | C) has an IDF effect (it renormalizes the frequency of a word with respect to its occurrence in the corpus C)
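The query-likelihood ranking with Jelinek-Mercer smoothing described above can be sketched as follows. This is a minimal illustration, not the XRCE system; the function name and toy data are hypothetical.

```python
from collections import Counter
from math import log

def jm_score(query, doc, corpus_freq, corpus_len, lam=0.5):
    """Score a document by query log-likelihood under a
    Jelinek-Mercer smoothed unigram language model:
    P(w|d) = (1-lam)*P_ML(w|d) + lam*P(w|C)."""
    tf = Counter(doc)
    dlen = len(doc)
    score = 0.0
    for w in query:
        p_ml = tf[w] / dlen                # maximum-likelihood P(w|d)
        p_c = corpus_freq[w] / corpus_len  # corpus model P(w|C), the IDF-like term
        score += log((1 - lam) * p_ml + lam * p_c)
    return score
```

Smoothing is what lets a document score non-zero even when one query word is absent from it, as the slide notes.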
Query LM Refinement with PRF • Aim: adapt (refine) the LM of a particular query • How: detecting the "implicit relevant concepts" present in the retrieved documents using pseudo-relevance feedback (PRF) Pseudo-Relevance Feedback: top N ranked texts based on query similarity Final rank: re-ranked documents based on the refined query similarity Query … θq′ = α θF + (1 − α) θq …
How to estimate θF • Let F(q) = {d1, d2, …, dN} be the N most relevant documents for query q • Draw each di following the mixture (1 − λ) P(w | θF) + λ P(w | C) • with θF assumed to be multinomial (peaked at relevant terms). • Then θF is estimated by the EM algorithm from the global likelihood: log P(F | θF) = Σd Σw c(w, d) log[(1 − λ) P(w | θF) + λ P(w | C)], where P(w | C) is the word probability built upon the corpus and λ (= 0.5) is a fixed parameter. Zhai and Lafferty, SIGIR 2001.
Lexical Entailment • Lexical entailment: a thesaurus built automatically on a given corpus • Given by the probabilities P(u | v) that one term entails another term, estimated on the corpus, • which are filtered using the information gain; an additional parameter enables us to increase the weight given to the self-entailment P(u | u). • Applied to IR: words from the document are "translated" into the different query terms • If we add a background smoothing, we obtain: P(w | d) = (1 − λ) Σu P(w | u) PML(u | d) + λ P(w | C) • Pros: finds relations between terms that feedback cannot. • Cons: heavier to compute, and queries get longer. S. Clinchant, C. Goutte, E. Gaussier, Lexical Entailment for Information Retrieval, ECIR 06
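The entailment-based scoring above can be sketched as follows: each document word u contributes probability mass to a query term q through the entailment table P(q | u). This is a toy sketch, not the ECIR 06 implementation; the entailment table, names, and smoothing weight are illustrative assumptions.

```python
from math import log

def le_score(query, doc_tf, doc_len, entail, corpus_prob, lam=0.5):
    """Score a document for a query where document words are
    'translated' into query terms via entail[(u, q)] = P(q|u).
    Self-entailment pairs (w, w) should be included in the table."""
    score = 0.0
    for q in query:
        # probability that the document entails query term q
        p_le = sum((tf / doc_len) * entail.get((u, q), 0.0)
                   for u, tf in doc_tf.items())
        # background smoothing with the corpus model, as on the slide
        score += log((1 - lam) * p_le + lam * corpus_prob[q])
    return score
```

A document about "automobile" can thus match the query term "car" even though feedback-based expansion would never relate them unless they co-occur in the top-ranked documents.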
DS 07 - Monolingual Official Runs • PRF Lexical Entailment: a double Lexical Entailment model, where a first LE model provides the system with an initial set of top-N documents, from which a mixture model for pseudo-feedback is built, and a second retrieval is performed, based once again on the LE model applied to the enriched query.
Cross-Lingual IR Documents (in target language) Information need • "Translation" d1 Query (in source language) ??? d2 … Query translation dn Document translation
What to translate? • Document translation - translate the documents into the query language • Pros: the translation may (theoretically) be more precise, and the documents become "readable" by the user • Cons: a huge volume to be translated • Query translation - translate the query into the document language • Pros: flexibility (translation on demand) and less text to translate • Cons: less precise, and the retrieved documents still need to be translated to be readable.
How to translate? • Statistical Machine Translation: Matrax • Alignment model: learnt on a parallel corpus (JRC-AC, with Giza++ word alignment) • Language model (N-gram): learnt on the GIRT corpus • Translate the source sentence into the K best target sentences • Use them as a "mono-lingual query" • Dictionary-based approach, with or without adaptation • Extract a probabilistic bilingual dictionary from different resources (standard, domain-specific thesaurus, JRC-AC) • Use the translated query with mono-lingual retrieval approaches • Adapt the dictionary to a particular (query (feedback), target corpus) pair • Use the adapted query with mono-lingual retrieval approaches
Dictionary-based CLIR without Adaptation • Idea: rank the documents • according to the "cross-entropy similarity" between the language model of the query and the language model of the document, • using a probabilistic bilingual dictionary given by P(wt | ws), the probability that the source word ws is translated into the target word wt
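The dictionary-based ranking can be sketched in two steps: project the source query model into the target language through P(wt | ws), then score target documents by cross-entropy. A minimal sketch with hypothetical names and toy data:

```python
from collections import Counter
from math import log

def translate_query_model(source_query, dictionary):
    """Map a source-language query onto a target-language model:
    P(wt|q) = sum_ws P(wt|ws) * P(ws|q),
    with dictionary[ws] = {wt: P(wt|ws)}."""
    p_s = Counter(source_query)
    n = len(source_query)
    p_t = {}
    for ws, c in p_s.items():
        for wt, p in dictionary.get(ws, {}).items():
            p_t[wt] = p_t.get(wt, 0.0) + (c / n) * p
    return p_t

def cross_entropy_score(p_t, doc_model):
    """Negative cross-entropy between the translated query model
    and a (smoothed) document language model; higher is better."""
    return sum(pw * log(doc_model[wt]) for wt, pw in p_t.items() if pw > 0)
```

Because the translation probabilities are folded directly into the query model, no hard word-by-word translation choice is made at this stage; disambiguation is left to the retrieval model.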
Dictionary Adaptation • Aim: adapt the dictionary to a particular query • How: detecting the "implicit coherence" present in the relevant documents using PRF • The first retrieval (CL-LM) can also be seen as a dictionary disambiguation process. Source query qs → dictionary P(wt | ws) → CL-LM → PRF: top N ranked texts → adapted dictionary θst → CL-LM → final rank: re-ranked target documents
How to estimate θst • Let F(q) be the relevant documents retrieved by the translated query using CL-LM. • The global model likelihood becomes: log P(F | θst) = Σd Σwt c(wt, d) log[(1 − λ) Σws P(wt | ws, θst) P(ws | q) + λ P(wt | C)] • The estimation of θst is done by EM, initializing it with the general dictionary P(wt | ws). • Finally, we apply CL-LM again, but with the adapted query language model. Note: • This is an extension of the "Query LM Refinement with PRF" to the multi-lingual case • This algorithm realizes both the query enrichment and the dictionary adaptation
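A simplified sketch of the adaptation idea: re-estimate the translation probabilities on the pseudo-feedback documents by EM, starting from the general dictionary. This is a per-source-word simplification of the mixture above, not the exact XRCE formulation; all names and data are illustrative.

```python
from collections import Counter

def adapt_dictionary(feedback_docs, source_query, dictionary, corpus_prob,
                     lam=0.5, iters=20):
    """EM re-estimation of translation probabilities P(wt|ws) on the
    pseudo-feedback documents, initialised from the general dictionary.
    Only target words the dictionary proposes for this query keep mass."""
    counts = Counter(w for d in feedback_docs for w in d)
    adapted = {ws: dict(dictionary.get(ws, {})) for ws in set(source_query)}
    for _ in range(iters):
        for ws, trans in adapted.items():
            new = {}
            for wt, p in trans.items():
                # E-step: posterior that an occurrence of wt came from the
                # query translation model rather than the corpus background
                t = (1 - lam) * p / ((1 - lam) * p + lam * corpus_prob.get(wt, 1e-9))
                new[wt] = counts.get(wt, 0) * t
            z = sum(new.values())
            if z > 0:
                adapted[ws] = {wt: v / z for wt, v in new.items()}  # M-step
    return adapted
```

Translation candidates that actually occur in the feedback documents gain probability, so the adapted dictionary both disambiguates (e.g. picks the domain-appropriate sense of an ambiguous source word) and enriches the query, as the slide states.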
Conclusion • Mono-lingual Domain-Specific Information Retrieval • The Query LM refinement with PRF (LM+PRF) gives better performance than the Lexical Entailment (Table 7), but unlike the latter it requires 2 retrieval steps. • Both outperform the non-adapted LM case (Table 9) • Combining them allowed for further improvements (Table 7) • Cross-lingual Domain-Specific Information Retrieval • Combining the query translation with further adaptation was beneficial (Table 9) • Matrax is better than dictionary-based IR without adaptation (Table 9) • The dictionary adaptation method gave better results than the query translation with Matrax, independently of the further mono-lingual adaptation (Table 8) • Combining the Lexical Entailment with LM+PRF was beneficial in the cross-lingual case too (Table 8)
Thank you for your attention! Not satisfied with the answers? You can always get an answer directly from the authors: • Stephane.Clinchant@xrce.xerox.com or • Jean-Michel.Renders@xrce.xerox.com
Matrax pipeline • Pre-processing of JRC-AC • Bi-phrase library construction (on the training set) → bi-phrase library • Language modeling (SRI LM toolkit) on GIRT → language model • Model parameter optimization on the development set → model parameters • Decoder (uses the bi-phrase library, language model, and model parameters)