
Cross-Language Information Retrieval (CLIR)


Presentation Transcript


  1. Cross-Language Information Retrieval (CLIR) Natural Language Processing/Language Technology for the Web

  2. Cross-Language Information Retrieval (CLIR) “A subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query.” E.g., using Hindi queries to retrieve English documents. Also called multi-lingual, cross-lingual, or trans-lingual IR.

  3. Why CLIR? E.g., On the web, we have: • Documents in different languages • Multilingual documents • Images with captions in different languages A single query should retrieve all such resources.

  4. Approaches to CLIR • Query translation: most efficient; commonly used • Document translation: infeasible for large collections • Translation into an intermediate representation. The most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.

  5. Dictionary-based Query Translation • phrase identification • words to be transliterated. Example: the Hindi query आयरलैंड शांति वार्ता is looked up in Hindi-English dictionaries, yielding “Ireland peace talks”, which is then used to search the English collection.
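
To make the pipeline concrete, here is a minimal Python sketch of dictionary-based query translation. The toy dictionary, the transliterate stub, and the choice to keep every sense are illustrative assumptions, not the actual system on the slide.

```python
# Minimal sketch of the lookup-and-transliterate pipeline (illustrative only).

# Toy Hindi-English dictionary; a real system would use a large bilingual lexicon.
HI_EN_DICT = {
    "आयरलैंड": ["Ireland"],                       # named entity
    "शांति": ["peace", "calm"],
    "वार्ता": ["talks", "negotiation", "dialogue"],
}

def transliterate(word):
    """Placeholder for Devanagari-to-Roman transliteration of OOV words."""
    return word  # assumed stub: a real system maps the script here

def translate_query(hindi_query):
    """Look each query word up in the dictionary; transliterate misses."""
    english_terms = []
    for word in hindi_query.split():
        translations = HI_EN_DICT.get(word)
        if translations:
            english_terms.extend(translations)  # keep all senses for now
        else:
            english_terms.append(transliterate(word))
    return english_terms

print(translate_query("आयरलैंड शांति वार्ता"))
# ['Ireland', 'peace', 'calm', 'talks', 'negotiation', 'dialogue']
```

Keeping every dictionary sense is what creates the ambiguity problem on the next slides; disambiguation prunes this list down to one translation per source word.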

  6. The problem with dictionary-based CLIR: ambiguity

  7. … filtering/disambiguation is required after query translation.

  8. Disambiguation using co-occurrence statistics. Hypothesis: the correct translations of query terms will co-occur, while incorrect translations will tend not to co-occur.

  9. Problem with counting co-occurrences: data sparsity. freq(Marathi Shallow Parsing CRFs), freq(Marathi Shallow Structuring CRFs), and freq(Marathi Shallow Analyzing CRFs) are all zero. How do we choose between parsing, structuring, and analyzing?

  10. Pair-wise co-occurrence. Hindi query: अंतरिक्षीय घटना (roughly, “cosmic event”). Candidate translations: अंतरिक्षीय → cosmic, outer-space; घटना → incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce. Pair-wise counts: freq(cosmic incident) ≈ 70,800; freq(cosmic event) ≈ 269,000; freq(cosmic lessen) ≈ 7,130; freq(cosmic subside) ≈ 3,120; freq(outer-space incident) ≈ 26,100; freq(outer-space event) ≈ 104,000; freq(outer-space lessen) ≈ 2,600; freq(outer-space subside) ≈ 980.

  11. Shallow Parsing, Structuring or Analyzing? shallow parsing ≈ 166,000; shallow structuring ≈ 180,000; shallow analyzing ≈ 1,230,000. CRFs parsing ≈ 540; CRFs structuring ≈ 125; CRFs analyzing ≈ 765. Marathi parsing ≈ 17,100; Marathi structuring ≈ 511; Marathi analyzing ≈ 12,200. Exact phrases: “shallow parsing” ≈ 40,700; “shallow structuring” ≈ 11; “shallow analyzing” ≈ 2. But the unigram counts differ widely: analyzing ≈ 74,100,000; parsing ≈ 40,400,000; structuring ≈ 17,400,000; shallow ≈ 33,300,000. So raw pair counts favor the globally frequent analyzing, while the exact-phrase counts show that “shallow parsing” is the true collocation; pair counts must be normalized by the individual term frequencies.

  12. Ranking senses using co-occurrence statistics • Use co-occurrence scores to calculate similarity between two words: sim(x, y) • Point-wise mutual information (PMI) • Dice coefficient • PMI-IR
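
The formulas themselves did not survive the transcript; the standard definitions, with f(·) a co-occurrence or unigram count (e.g. a web hit count) and N the corpus size, are:

```latex
\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}
  \;\approx\; \log_2 \frac{N \cdot f(x, y)}{f(x)\, f(y)}
\qquad
\mathrm{Dice}(x, y) = \frac{2\, f(x, y)}{f(x) + f(y)}
```

PMI-IR (Turney, 2001) is PMI with the probabilities estimated from search-engine hit counts rather than a local corpus. Dividing by f(x)·f(y) supplies exactly the normalization slide 11 calls for: it stops a globally frequent word such as analyzing from winning on raw pair counts alone.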

  13. Disambiguation algorithm

  14. Example अंतरिक्षीय घटना. Candidates: cosmic, outer-space (for अंतरिक्षीय); incident, event, lessen, subside, decrease, lower, diminish, ebb, decline, reduce (for घटना). score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside) + …
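
The slide-13 algorithm itself is in a figure that is not in the transcript, so the following greedy sketch is an assumption, written to be consistent with the score(·) expansion in this example; sim stands in for any of the slide-12 measures (e.g. PMI-IR).

```python
# Greedy disambiguation sketch consistent with the score(.) expansion above.

def disambiguate(candidates, sim):
    """candidates: source term -> list of dictionary translations.
    Score each candidate by its summed similarity to the candidates of
    every *other* source term; keep the argmax per source term."""
    chosen = {}
    for term, options in candidates.items():
        others = [c for t, opts in candidates.items() if t != term for c in opts]
        scores = {opt: sum(sim(opt, o) for o in others) for opt in options}
        chosen[term] = max(scores, key=scores.get)
    return chosen

# Toy run on this slide's query (the similarity values are made up):
toy_sim = lambda x, y: 5.0 if {x, y} == {"cosmic", "event"} else 0.1
query = {"अंतरिक्षीय": ["cosmic", "outer-space"],
         "घटना": ["incident", "event", "lessen", "subside"]}
print(disambiguate(query, toy_sim))
# {'अंतरिक्षीय': 'cosmic', 'घटना': 'event'}
```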

  15. Disambiguation algorithm: sample outputs

  16. Results on TREC8 (disks 4 and 5) • English topics (401-450) manually translated to Hindi • Assumption: relevance judgments for English topics hold for the translated queries • Results (all runs use TF-IDF ranking)

  17. Pseudo-Relevance Feedback for CLIR

  18. (User) Relevance Feedback (mono-lingual) • Retrieve documents using the user’s query • The user marks relevant documents • Choose the top N terms from these documents (IDF is one option for scoring terms) • Add these N terms to the user’s query to form a new query • Use this new query to retrieve a new set of documents

  19. Pseudo-Relevance Feedback (PRF) (mono-lingual) • Retrieve documents using the user’s query • Assume that the top M documents retrieved are relevant • Choose the top N terms from these M documents • Add these N terms to the user’s query to form a new query • Use this new query to retrieve a new set of documents
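
A compact sketch of the five steps above. search and top_terms are hypothetical helpers: any retrieval engine that returns a ranked document list, and any term scorer over a set of documents (e.g. IDF, as slide 18 suggests).

```python
# Pseudo-relevance feedback: retrieve, assume top M relevant, expand, retrieve.

def prf_search(query, search, top_terms, M=10, N=5):
    initial_docs = search(query)                    # 1. initial retrieval
    feedback_docs = initial_docs[:M]                # 2. assume top M are relevant
    expansion_terms = top_terms(feedback_docs, N)   # 3. pick top N terms
    expanded_query = list(query) + expansion_terms  # 4. expand the query
    return search(expanded_query)                   # 5. final retrieval
```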

  20. PRF for CLIR: Corpus-based Query Translation • Uses a parallel corpus of aligned document pairs (H1, E1), (H2, E2), …, (Hm, Em), where the Hi form the Hindi collection H and the Ei form the English collection E.

  21. PRF for CLIR • Retrieve documents in H using the user’s query • Assume that the top M documents retrieved are relevant • Select the M documents in E that are aligned to the top M retrieved documents • Choose the top N terms from these documents • These N terms are the translated query • Use this query to retrieve from the target collection (which is in the same language as E)
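
A matching sketch for the cross-lingual case: feedback documents are retrieved on the Hindi side of the parallel corpus, and the top terms of their aligned English counterparts become the translated query. search_hindi, aligned_english, and top_terms are again hypothetical helpers.

```python
# Cross-lingual PRF: the expansion terms from aligned docs ARE the translation.

def clir_prf_translate(hindi_query, search_hindi, aligned_english, top_terms,
                       M=10, N=10):
    top_hindi = search_hindi(hindi_query)[:M]               # top M docs in H
    english_docs = [aligned_english(d) for d in top_hindi]  # aligned docs in E
    return top_terms(english_docs, N)   # N English terms = translated query

# The returned terms are then run against the English target collection.
```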

  22. Cross-Lingual Relevance Models - Estimate relevance models using a parallel corpus

  23. Ranking with Relevance Models • Relevance model or query model (a distribution that encodes the information need): the probability of word occurrence in a relevant document • Document model: the probability of word occurrence in the candidate document • Ranking function: relative entropy (KL divergence) between the two
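
The equations referenced here did not survive the transcript; a reconstruction following Relevance-Based Language Models (Lavrenko and Croft, 2001, cited on slide 30):

```latex
% Rank a document model D against the relevance/query model R by negative
% relative entropy; H(R) does not depend on D, so it does not affect ranking.
\mathrm{score}(D) \;=\; -\,\mathrm{KL}(R \,\|\, D)
  \;=\; \sum_{w} P(w \mid R)\, \log P(w \mid D) \;+\; H(R)
```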

  24. Estimating Mono-Lingual Relevance Models
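
The estimation formula is likewise missing; the standard construction from the cited paper estimates the joint probability of observing a word w together with the query terms, averaging over the document models of the collection (here 𝒞), with P(·|D) a smoothed document language model:

```latex
P(w \mid R) \;\approx\; P(w \mid q_1, \ldots, q_k)
  \;=\; \frac{P(w, q_1, \ldots, q_k)}{P(q_1, \ldots, q_k)},
\qquad
P(w, q_1, \ldots, q_k) \;=\; \sum_{D \in \mathcal{C}}
  P(D)\, P(w \mid D) \prod_{i=1}^{k} P(q_i \mid D)
```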

  25. Estimating Cross-Lingual Relevance Models
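
Assuming the construction of Cross-Lingual Relevance Models (Lavrenko, Choquette, and Croft, 2002, cited on slide 30), a Hindi query h1 … hk induces an English-side relevance model by summing over the aligned pairs of the parallel corpus:

```latex
% (H_j, E_j) are the aligned Hindi-English document pairs of slide 20.
P(e \mid h_1, \ldots, h_k) \;\propto\;
  \sum_{j=1}^{m} P(H_j, E_j)\; P(e \mid E_j) \prod_{i=1}^{k} P(h_i \mid H_j)
```

The resulting English model P(e|R) is then ranked against English document models with the KL function of slide 23.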

  26. CLIR Evaluation – TREC (Text REtrieval Conference) • TREC CLIR track (2001 and 2002) • Retrieval of Arabic language newswire documents from topics in English • 383,872 Arabic documents (896 MB) with SGML markup • 50 topics • Use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability http://trec.nist.gov/

  27. CLIR Evaluation – CLEF (Cross Language Evaluation Forum) • Major CLIR evaluation forum • Tracks include • Multilingual retrieval on news collections • topics provided in many languages, including Hindi • Multiple language Question Answering • ImageCLEF • Cross Language Speech Retrieval • WebCLEF http://www.clef-campaign.org/

  28. Summary • CLIR techniques • Query Translation-based • Document Translation-based • Intermediate Representation-based • Query translation using dictionaries, followed by disambiguation, is a simple and effective technique for CLIR • PRF uses a parallel corpus for query translation • Parallel corpora can also be used to estimate cross-lingual relevance models • CLEF and TREC: important CLIR evaluation conferences

  29. References (1) • Phrasal Translation and Query Expansion Techniques for Cross-language Information Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1995. • Resolving Ambiguity for Cross-Language Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1998. • A Maximum Coherence Model for Dictionary-Based Cross-Language Information Retrieval, Yi Liu, Rong Jin, and Joyce Y. Chai, ACM SIGIR, 2005. • A Comparative Study of Knowledge-Based Approaches for Cross-Language Information Retrieval, Douglas W. Oard, Bonnie J. Dorr, Paul G. Hackett, and Maria Katsova, Technical Report CS-TR-3897, University of Maryland, 1998.

  30. References (2) • Translingual Information Retrieval: A Comparative Evaluation, Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D. Brown, Yibing Geng, and Danny Lee, International Joint Conference on Artificial Intelligence, 1997. • A Multistage Search Strategy for Cross Lingual Information Retrieval, Satish Kagathara, Manish Deodalkar, and Pushpak Bhattacharyya, Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur, February 2005. • Relevance-Based Language Models, Victor Lavrenko and W. Bruce Croft, Research and Development in Information Retrieval, 2001. • Cross-Lingual Relevance Models, V. Lavrenko, M. Choquette, and W. Croft, ACM SIGIR, 2002.

  31. Thank You
