
Cross-Language Information Retrieval (CLIR)


Presentation Transcript


  1. Cross-Language Information Retrieval (CLIR) Natural Language Processing/Language Technology for the Web

  2. Cross-Language Information Retrieval (CLIR) “A subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query.” E.g., using Hindi queries to retrieve English documents. Also called multi-lingual, cross-lingual, or trans-lingual IR.

  3. Why CLIR? E.g., On the web, we have: • Documents in different languages • Multilingual documents • Images with captions in different languages A single query should retrieve all such resources.

  4. Approaches to CLIR • Query translation: most efficient; commonly used • Document translation: infeasible for large collections • Translation into an intermediate representation. The most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.

  5. Dictionary-based Query Translation • phrase identification • words to be transliterated. Example: the Hindi query आयरलैंड शांति वार्ता is looked up in Hindi-English dictionaries, yielding “Ireland peace talks”, which is then used to search the English collection.
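
To make the pipeline concrete, here is a minimal Python sketch of dictionary-based query translation. The toy dictionary, the transliterate stub, and the choice to keep every sense are illustrative assumptions, not the actual system on the slide.

```python
# Minimal sketch of the lookup-and-transliterate pipeline (illustrative only).

# Toy Hindi-English dictionary; a real system would use a large bilingual lexicon.
HI_EN_DICT = {
    "आयरलैंड": ["Ireland"],                       # named entity
    "शांति": ["peace", "calm"],
    "वार्ता": ["talks", "negotiation", "dialogue"],
}

def transliterate(word):
    """Placeholder for Devanagari-to-Roman transliteration of OOV words."""
    return word  # assumed stub: a real system maps the script here

def translate_query(hindi_query):
    """Look each query word up in the dictionary; transliterate misses."""
    english_terms = []
    for word in hindi_query.split():
        translations = HI_EN_DICT.get(word)
        if translations:
            english_terms.extend(translations)  # keep all senses for now
        else:
            english_terms.append(transliterate(word))
    return english_terms

print(translate_query("आयरलैंड शांति वार्ता"))
# ['Ireland', 'peace', 'calm', 'talks', 'negotiation', 'dialogue']
```

Keeping every dictionary sense is what creates the ambiguity problem on the next slides; disambiguation prunes this list down to one translation per source word.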

  6. The problem with dictionary-based CLIR: ambiguity

  7. … filtering/disambiguation is required after query translation.

  8. Disambiguation using co-occurrence statistics. Hypothesis: the correct translations of query terms will co-occur, while incorrect translations will tend not to co-occur.

  9. Problem with counting co-occurrences: data sparsity. freq(Marathi Shallow Parsing CRFs), freq(Marathi Shallow Structuring CRFs), and freq(Marathi Shallow Analyzing CRFs) are all zero. How do we choose between parsing, structuring, and analyzing?

  10. Pair-wise co-occurrence. Hindi query: अंतरिक्षीय घटना (roughly, “cosmic event”). Candidate translations: अंतरिक्षीय → cosmic, outer-space; घटना → incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce. Pair-wise counts: freq(cosmic incident) ≈ 70,800; freq(cosmic event) ≈ 269,000; freq(cosmic lessen) ≈ 7,130; freq(cosmic subside) ≈ 3,120; freq(outer-space incident) ≈ 26,100; freq(outer-space event) ≈ 104,000; freq(outer-space lessen) ≈ 2,600; freq(outer-space subside) ≈ 980.

  11. Shallow Parsing, Structuring or Analyzing? shallow parsing ≈ 166,000; shallow structuring ≈ 180,000; shallow analyzing ≈ 1,230,000. CRFs parsing ≈ 540; CRFs structuring ≈ 125; CRFs analyzing ≈ 765. Marathi parsing ≈ 17,100; Marathi structuring ≈ 511; Marathi analyzing ≈ 12,200. Exact phrases: “shallow parsing” ≈ 40,700; “shallow structuring” ≈ 11; “shallow analyzing” ≈ 2. But the unigram counts differ widely: analyzing ≈ 74,100,000; parsing ≈ 40,400,000; structuring ≈ 17,400,000; shallow ≈ 33,300,000. So raw pair counts favor the globally frequent analyzing, while the exact-phrase counts show that “shallow parsing” is the true collocation; pair counts must be normalized by the individual term frequencies.

  12. Ranking senses using co-occurrence statistics • Use co-occurrence scores to calculate similarity between two words: sim(x, y) • Point-wise mutual information (PMI) • Dice coefficient • PMI-IR
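
The formulas themselves did not survive the transcript; the standard definitions, with f(·) a co-occurrence or unigram count (e.g. a web hit count) and N the corpus size, are:

```latex
\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}
  \;\approx\; \log_2 \frac{N \cdot f(x, y)}{f(x)\, f(y)}
\qquad
\mathrm{Dice}(x, y) = \frac{2\, f(x, y)}{f(x) + f(y)}
```

PMI-IR (Turney, 2001) is PMI with the probabilities estimated from search-engine hit counts rather than a local corpus. Dividing by f(x)·f(y) supplies exactly the normalization slide 11 calls for: it stops a globally frequent word such as analyzing from winning on raw pair counts alone.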

  13. Disambiguation algorithm

  14. Example अंतरिक्षीय घटना. Candidates: cosmic, outer-space (for अंतरिक्षीय); incident, event, lessen, subside, decrease, lower, diminish, ebb, decline, reduce (for घटना). score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside) + …
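
The slide-13 algorithm itself is in a figure that is not in the transcript, so the following greedy sketch is an assumption, written to be consistent with the score(·) expansion in this example; sim stands in for any of the slide-12 measures (e.g. PMI-IR).

```python
# Greedy disambiguation sketch consistent with the score(.) expansion above.

def disambiguate(candidates, sim):
    """candidates: source term -> list of dictionary translations.
    Score each candidate by its summed similarity to the candidates of
    every *other* source term; keep the argmax per source term."""
    chosen = {}
    for term, options in candidates.items():
        others = [c for t, opts in candidates.items() if t != term for c in opts]
        scores = {opt: sum(sim(opt, o) for o in others) for opt in options}
        chosen[term] = max(scores, key=scores.get)
    return chosen

# Toy run on this slide's query (the similarity values are made up):
toy_sim = lambda x, y: 5.0 if {x, y} == {"cosmic", "event"} else 0.1
query = {"अंतरिक्षीय": ["cosmic", "outer-space"],
         "घटना": ["incident", "event", "lessen", "subside"]}
print(disambiguate(query, toy_sim))
# {'अंतरिक्षीय': 'cosmic', 'घटना': 'event'}
```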

  15. Disambiguation algorithm: sample outputs

  16. Results on TREC8 (disks 4 and 5) • English topics (401-450) manually translated to Hindi • Assumption: relevance judgments for English topics hold for the translated queries • Results (all runs use TF-IDF ranking)

  17. Pseudo-Relevance Feedback for CLIR

  18. (User) Relevance Feedback (mono-lingual) • Retrieve documents using the user’s query • The user marks relevant documents • Choose the top N terms from these documents (IDF is one option for scoring terms) • Add these N terms to the user’s query to form a new query • Use this new query to retrieve a new set of documents

  19. Pseudo-Relevance Feedback (PRF) (mono-lingual) • Retrieve documents using the user’s query • Assume that the top M documents retrieved are relevant • Choose the top N terms from these M documents • Add these N terms to the user’s query to form a new query • Use this new query to retrieve a new set of documents
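
A compact sketch of the five steps above. search and top_terms are hypothetical helpers: any retrieval engine that returns a ranked document list, and any term scorer over a set of documents (e.g. IDF, as slide 18 suggests).

```python
# Pseudo-relevance feedback: retrieve, assume top M relevant, expand, retrieve.

def prf_search(query, search, top_terms, M=10, N=5):
    initial_docs = search(query)                    # 1. initial retrieval
    feedback_docs = initial_docs[:M]                # 2. assume top M are relevant
    expansion_terms = top_terms(feedback_docs, N)   # 3. pick top N terms
    expanded_query = list(query) + expansion_terms  # 4. expand the query
    return search(expanded_query)                   # 5. final retrieval
```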

  20. PRF for CLIR: Corpus-based Query Translation • Uses a parallel corpus of aligned document pairs (H1, E1), (H2, E2), …, (Hm, Em), where the Hi form the Hindi collection H and the Ei form the English collection E.

  21. PRF for CLIR • Retrieve documents in H using the user’s query • Assume that the top M documents retrieved are relevant • Select the M documents in E that are aligned to the top M retrieved documents • Choose the top N terms from these documents • These N terms are the translated query • Use this query to retrieve from the target collection (which is in the same language as E)
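
A matching sketch for the cross-lingual case: feedback documents are retrieved on the Hindi side of the parallel corpus, and the top terms of their aligned English counterparts become the translated query. search_hindi, aligned_english, and top_terms are again hypothetical helpers.

```python
# Cross-lingual PRF: the expansion terms from aligned docs ARE the translation.

def clir_prf_translate(hindi_query, search_hindi, aligned_english, top_terms,
                       M=10, N=10):
    top_hindi = search_hindi(hindi_query)[:M]               # top M docs in H
    english_docs = [aligned_english(d) for d in top_hindi]  # aligned docs in E
    return top_terms(english_docs, N)   # N English terms = translated query

# The returned terms are then run against the English target collection.
```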

  22. Cross-Lingual Relevance Models - Estimate relevance models using a parallel corpus

  23. Ranking with Relevance Models • Relevance model or query model (a distribution that encodes the information need): the probability of word occurrence in a relevant document • Document model: the probability of word occurrence in the candidate document • Ranking function: relative entropy (KL divergence) between the two
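
The equations referenced here did not survive the transcript; a reconstruction following Relevance-Based Language Models (Lavrenko and Croft, 2001, cited on slide 30):

```latex
% Rank a document model D against the relevance/query model R by negative
% relative entropy; H(R) does not depend on D, so it does not affect ranking.
\mathrm{score}(D) \;=\; -\,\mathrm{KL}(R \,\|\, D)
  \;=\; \sum_{w} P(w \mid R)\, \log P(w \mid D) \;+\; H(R)
```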

  24. Estimating Mono-Lingual Relevance Models
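
The estimation formula is likewise missing; the standard construction from the cited paper estimates the joint probability of observing a word w together with the query terms, averaging over the document models of the collection (here 𝒞), with P(·|D) a smoothed document language model:

```latex
P(w \mid R) \;\approx\; P(w \mid q_1, \ldots, q_k)
  \;=\; \frac{P(w, q_1, \ldots, q_k)}{P(q_1, \ldots, q_k)},
\qquad
P(w, q_1, \ldots, q_k) \;=\; \sum_{D \in \mathcal{C}}
  P(D)\, P(w \mid D) \prod_{i=1}^{k} P(q_i \mid D)
```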

  25. Estimating Cross-Lingual Relevance Models
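
Assuming the construction of Cross-Lingual Relevance Models (Lavrenko, Choquette, and Croft, 2002, cited on slide 30), a Hindi query h1 … hk induces an English-side relevance model by summing over the aligned pairs of the parallel corpus:

```latex
% (H_j, E_j) are the aligned Hindi-English document pairs of slide 20.
P(e \mid h_1, \ldots, h_k) \;\propto\;
  \sum_{j=1}^{m} P(H_j, E_j)\; P(e \mid E_j) \prod_{i=1}^{k} P(h_i \mid H_j)
```

The resulting English model P(e|R) is then ranked against English document models with the KL function of slide 23.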

  26. CLIR Evaluation – TREC (Text REtrieval Conference) • TREC CLIR track (2001 and 2002) • Retrieval of Arabic language newswire documents from topics in English • 383,872 Arabic documents (896 MB) with SGML markup • 50 topics • Use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability http://trec.nist.gov/

  27. CLIR Evaluation – CLEF (Cross Language Evaluation Forum) • Major CLIR evaluation forum • Tracks include • Multilingual retrieval on news collections • topics provided in many languages, including Hindi • Multiple language Question Answering • ImageCLEF • Cross Language Speech Retrieval • WebCLEF http://www.clef-campaign.org/

  28. Summary • CLIR techniques • Query Translation-based • Document Translation-based • Intermediate Representation-based • Query translation using dictionaries, followed by disambiguation, is a simple and effective technique for CLIR • PRF uses a parallel corpus for query translation • Parallel corpora can also be used to estimate cross-lingual relevance models • CLEF and TREC: important CLIR evaluation conferences

  29. References (1) • Phrasal Translation and Query Expansion Techniques for Cross-language Information Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1995. • Resolving Ambiguity for Cross-Language Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1998. • A Maximum Coherence Model for Dictionary-Based Cross-Language Information Retrieval, Yi Liu, Rong Jin, and Joyce Y. Chai, ACM SIGIR, 2005. • A Comparative Study of Knowledge-Based Approaches for Cross-Language Information Retrieval, Douglas W. Oard, Bonnie J. Dorr, Paul G. Hackett, and Maria Katsova, Technical Report CS-TR-3897, University of Maryland, 1998.

  30. References (2) • Translingual Information Retrieval: A Comparative Evaluation, Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D. Brown, Yibing Geng, and Danny Lee, International Joint Conference on Artificial Intelligence, 1997. • A Multistage Search Strategy for Cross Lingual Information Retrieval, Satish Kagathara, Manish Deodalkar, and Pushpak Bhattacharyya, Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur, February 2005. • Relevance-Based Language Models, Victor Lavrenko and W. Bruce Croft, Research and Development in Information Retrieval, 2001. • Cross-Lingual Relevance Models, V. Lavrenko, M. Choquette, and W. Croft, ACM SIGIR, 2002.

  31. Thank You
