XRCE at CLEF 07, Domain-specific Track. Stephane Clinchant and Jean-Michel Renders (presented by Gabriela Csurka), Xerox Research Centre Europe, France
Outline • Introduction • Mono-lingual Domain-Specific Information Retrieval • Query Language Model Refinement with PRF • Lexical Entailment • Results • Cross-lingual Domain-Specific Information Retrieval • Machine translation (Matrax) • Dictionary Adaptation • Results • Conclusion
CLEF domain-specific track • Domain-Specific Information Retrieval • Leveraging the structure of data in collections (i.e. controlled vocabularies and other metadata) to improve search. • Multilingual database in the social science domain • German Social Science Information Centre's databases • Social Science Research Projects databases • Tasks • Mono-lingual retrieval: queries and documents are in the same language • Cross-lingual retrieval: queries in one language are used with a collection in a different language.
Information retrieval • Simple keyword matching is not enough to retrieve the best documents for a query. Information need d1 matching d2 query … dn Courtesy: drawing borrowed from C. Manning and P. Raghavan lectures at Stanford University
IR with Language Modeling • Treat each document d as a multinomial distribution (language model) θd Information need d1 generation d2 query … dn • Then the documents d in the corpus can be ranked • either by the query likelihood P(q | θd) • or by computing a cross-entropy similarity between the language models of d and q Courtesy: drawing borrowed from C. Manning and P. Raghavan lectures at Stanford University
How we estimate θd • A simple language model can be obtained by Maximum Likelihood, using the frequency of words in d: PML(w | d) = tf(w, d) / |d| • The probabilities are smoothed by the corpus language model P(w | C) • We used Jelinek-Mercer interpolation: P(w | d) = (1 − λ) PML(w | d) + λ P(w | C) • The role of smoothing the language model is: • to make the LM more accurate (a query word can be absent from a document) • P(w | C) has an IDF effect (it renormalizes the frequency of a word with respect to its occurrence in the corpus C)
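The query-likelihood ranking with Jelinek-Mercer smoothing described above can be sketched as follows. This is a minimal illustration, not the XRCE system; the function name and toy data are hypothetical.

```python
from collections import Counter
from math import log

def jm_score(query, doc, corpus_freq, corpus_len, lam=0.5):
    """Score a document by query log-likelihood under a
    Jelinek-Mercer smoothed unigram language model:
    P(w|d) = (1-lam)*P_ML(w|d) + lam*P(w|C)."""
    tf = Counter(doc)
    dlen = len(doc)
    score = 0.0
    for w in query:
        p_ml = tf[w] / dlen                # maximum-likelihood P(w|d)
        p_c = corpus_freq[w] / corpus_len  # corpus model P(w|C), the IDF-like term
        score += log((1 - lam) * p_ml + lam * p_c)
    return score
```

Smoothing is what lets a document score non-zero even when one query word is absent from it, as the slide notes.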
Query LM Refinement with PRF • Aim: adapt (refine) the LM of a particular query • How: detecting the "implicit relevant concepts" present in the retrieved documents using pseudo-relevance feedback (PRF) Pseudo-Relevance Feedback: top N ranked texts based on query similarity Final rank: re-ranked documents based on the refined query similarity Query … θq′ = α θF + (1 − α) θq …
How to estimate θF • Let F(q) = {d1, d2, …, dN} be the N most relevant documents for query q • Draw each di following the mixture (1 − λ) P(w | θF) + λ P(w | C) • with θF assumed to be multinomial (peaked at relevant terms). • Then θF is estimated by the EM algorithm from the global likelihood: log P(F | θF) = Σd Σw c(w, d) log[(1 − λ) P(w | θF) + λ P(w | C)], where P(w | C) is the word probability built upon the corpus and λ (= 0.5) is a fixed parameter. Zhai and Lafferty, SIGIR 2001.
Lexical Entailment • Lexical entailment: a thesaurus built automatically on a given corpus • Given by the probabilities P(u | v) that one term entails another term, estimated on the corpus, • which are filtered using the information gain; an additional parameter enables us to increase the weight given to the self-entailment P(u | u). • Applied to IR: words from the document are "translated" into the different query terms • If we add a background smoothing, we obtain: P(w | d) = (1 − λ) Σu P(w | u) PML(u | d) + λ P(w | C) • Pros: finds relations between terms that feedback cannot. • Cons: heavier to compute, and queries get longer. S. Clinchant, C. Goutte, E. Gaussier, Lexical Entailment for Information Retrieval, ECIR 06
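The entailment-based scoring above can be sketched as follows: each document word u contributes probability mass to a query term q through the entailment table P(q | u). This is a toy sketch, not the ECIR 06 implementation; the entailment table, names, and smoothing weight are illustrative assumptions.

```python
from math import log

def le_score(query, doc_tf, doc_len, entail, corpus_prob, lam=0.5):
    """Score a document for a query where document words are
    'translated' into query terms via entail[(u, q)] = P(q|u).
    Self-entailment pairs (w, w) should be included in the table."""
    score = 0.0
    for q in query:
        # probability that the document entails query term q
        p_le = sum((tf / doc_len) * entail.get((u, q), 0.0)
                   for u, tf in doc_tf.items())
        # background smoothing with the corpus model, as on the slide
        score += log((1 - lam) * p_le + lam * corpus_prob[q])
    return score
```

A document about "automobile" can thus match the query term "car" even though feedback-based expansion would never relate them unless they co-occur in the top-ranked documents.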
DS 07 - Monolingual Official Runs • PRF Lexical Entailment: a double Lexical Entailment model, where a first LE model provides the system with an initial set of top-N documents, from which a mixture model for pseudo-feedback is built, and a second retrieval is performed, based once again on the LE model applied to the enriched query.
Cross-Lingual IR Documents (in target language) Information need • "Translation" d1 Query (in source language) ??? d2 … Query translation dn Document translation
What to translate? • Document translation - translate the documents into the query language • Pros: the translation may (theoretically) be more precise, and the documents become "readable" by the user • Cons: a huge volume to be translated • Query translation - translate the query into the document language • Pros: flexibility (translation on demand) and less text to translate • Cons: less precise, and the retrieved documents still need to be translated to be readable.
How to translate? • Statistical Machine Translation: Matrax • Alignment model: learnt on a parallel corpus (JRC-AC, with Giza++ word alignment) • Language model (N-gram): learnt on the GIRT corpus • Translate the source sentence into the K best target sentences • Use them as a "mono-lingual query" • Dictionary-based approach, with or without adaptation • Extract a probabilistic bilingual dictionary from different resources (standard, domain-specific thesaurus, JRC-AC) • Use the translated query with mono-lingual retrieval approaches • Adapt the dictionary to a particular (query (feedback), target corpus) pair • Use the adapted query with mono-lingual retrieval approaches
Dictionary-based CLIR without Adaptation • Idea: rank the documents • according to the "cross-entropy similarity" between the language model of the query and the language model of the document, • using a probabilistic bilingual dictionary given by P(wt | ws), the probability that the source word ws is translated into the target word wt
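The dictionary-based ranking can be sketched in two steps: project the source query model into the target language through P(wt | ws), then score target documents by cross-entropy. A minimal sketch with hypothetical names and toy data:

```python
from collections import Counter
from math import log

def translate_query_model(source_query, dictionary):
    """Map a source-language query onto a target-language model:
    P(wt|q) = sum_ws P(wt|ws) * P(ws|q),
    with dictionary[ws] = {wt: P(wt|ws)}."""
    p_s = Counter(source_query)
    n = len(source_query)
    p_t = {}
    for ws, c in p_s.items():
        for wt, p in dictionary.get(ws, {}).items():
            p_t[wt] = p_t.get(wt, 0.0) + (c / n) * p
    return p_t

def cross_entropy_score(p_t, doc_model):
    """Negative cross-entropy between the translated query model
    and a (smoothed) document language model; higher is better."""
    return sum(pw * log(doc_model[wt]) for wt, pw in p_t.items() if pw > 0)
```

Because the translation probabilities are folded directly into the query model, no hard word-by-word translation choice is made at this stage; disambiguation is left to the retrieval model.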
Dictionary Adaptation • Aim: adapt the dictionary to a particular query • How: detecting the "implicit coherence" present in the relevant documents using PRF • The first retrieval (CL-LM) can also be seen as a dictionary disambiguation process. Source query qs → dictionary P(wt | ws) → CL-LM → PRF: top N ranked texts → adapted dictionary θst → CL-LM → final rank: re-ranked target documents
How to estimate θst • Let F(q) be the relevant documents retrieved by the translated query using CL-LM. • The global model likelihood becomes: log P(F | θst) = Σd Σwt c(wt, d) log[(1 − λ) Σws P(wt | ws, θst) P(ws | q) + λ P(wt | C)] • The estimation of θst is done by EM, initializing it with the general dictionary P(wt | ws). • Finally, we apply CL-LM again, but with the adapted query language model. Note: • This is an extension of the "Query LM Refinement with PRF" to the multi-lingual case • This algorithm realizes both the query enrichment and the dictionary adaptation
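A simplified sketch of the adaptation idea: re-estimate the translation probabilities on the pseudo-feedback documents by EM, starting from the general dictionary. This is a per-source-word simplification of the mixture above, not the exact XRCE formulation; all names and data are illustrative.

```python
from collections import Counter

def adapt_dictionary(feedback_docs, source_query, dictionary, corpus_prob,
                     lam=0.5, iters=20):
    """EM re-estimation of translation probabilities P(wt|ws) on the
    pseudo-feedback documents, initialised from the general dictionary.
    Only target words the dictionary proposes for this query keep mass."""
    counts = Counter(w for d in feedback_docs for w in d)
    adapted = {ws: dict(dictionary.get(ws, {})) for ws in set(source_query)}
    for _ in range(iters):
        for ws, trans in adapted.items():
            new = {}
            for wt, p in trans.items():
                # E-step: posterior that an occurrence of wt came from the
                # query translation model rather than the corpus background
                t = (1 - lam) * p / ((1 - lam) * p + lam * corpus_prob.get(wt, 1e-9))
                new[wt] = counts.get(wt, 0) * t
            z = sum(new.values())
            if z > 0:
                adapted[ws] = {wt: v / z for wt, v in new.items()}  # M-step
    return adapted
```

Translation candidates that actually occur in the feedback documents gain probability, so the adapted dictionary both disambiguates (e.g. picks the domain-appropriate sense of an ambiguous source word) and enriches the query, as the slide states.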
Conclusion • Mono-lingual Domain-Specific Information Retrieval • The Query LM refinement with PRF (LM+PRF) gives better performance than the Lexical Entailment (Table 7), but unlike the latter it requires 2 retrieval steps. • Both outperform the non-adapted LM case (Table 9) • Combining them allowed for further improvements (Table 7) • Cross-lingual Domain-Specific Information Retrieval • Combining the query translation with further adaptation was beneficial (Table 9) • Matrax is better than dictionary-based IR without adaptation (Table 9) • The dictionary adaptation method gave better results than the query translation with Matrax, independently of the further mono-lingual adaptation (Table 8) • Combining the Lexical Entailment with LM+PRF was beneficial in the cross-lingual case too (Table 8)
Thank you for your attention! Not satisfied with the answers? You can always get an answer directly from the authors: • Stephane.Clinchant@xrce.xerox.com or • Jean-Michel.Renders@xrce.xerox.com
Matrax pipeline • Pre-processing of JRC-AC • Bi-phrase library construction (on the training set) → bi-phrase library • Language modeling (SRI LM toolkit) on GIRT → language model • Model parameter optimization on the development set → model parameters • Decoder (uses the bi-phrase library, language model, and model parameters)