IIIT Hyderabad’s CLIR experiments for FIRE-2008

Presentation Transcript

  1. IIIT Hyderabad’s CLIR experiments for FIRE-2008 • Sethuramalingam S & Vasudeva Varma, IIIT Hyderabad, India

  2. Outline • Introduction • Related Work in Indian Language IR • Our CLIR experiments • Evaluation & Analysis • Future Work IIIT-H @ FIRE-2008

  3. Introduction • Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (courtesy: Wikipedia) • Information: text, audio, video, speech, geographical information, etc.

  4. CLIR – Indian languages (IL) scenario • To retrieve documents written in any IL when the user queries in one language • मराठी हिन्दी తెలుగు தமிழ் বাংলা • Modified from source: D. Oard’s Cross-Language IR presentation

  5. Why CLIR for IL?

  6. Why CLIR for IL?

  7. Why CLIR for IL? • Internet user growth in India between 2000 and 2008: 1,100% (Source: www.internetworldstats.com) • Growth in Indian-language content on the web between 2000 and 2007: 700% • So, CLIR for IL becomes mandatory!

  8. Related Work in Indian Language IR

  9. Related Work in ILIR • ACM TALIP, 2003: The surprise language exercises • The task was to build CLIR systems for English to Hindi and Cebuano • “The surprise language exercises”, Douglas W. Oard. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84, 2003

  10. Related Work in ILIR • CLEF 2006: Ad-hoc bilingual track including two Indian languages, Hindi and Telugu • Our team from IIIT-H participated in the Hindi and Telugu to English CLIR task • “Hindi and Telugu to English Cross Language Information Retrieval”, Prasad Pingali and Vasudeva Varma. CLEF 2006.

  11. Related Work in ILIR • CLEF 2007: Indian language subtask consisting of Hindi, Bengali, Marathi, Telugu and Tamil • Five teams including ours participated • Hindi and Telugu to English CLIR • “IIIT Hyderabad at CLEF 2007 - Adhoc Indian Language CLIR task”, Prasad Pingali and Vasudeva Varma. CLEF 2007.

  12. Related Work in ILIR • Google’s CLIR system for 34 languages including Hindi

  13. Our CLIR experiments

  14. Our CLIR experiments • Ad-hoc cross-lingual runs: Hindi to English and English to Hindi • Ad-hoc monolingual runs in Hindi and English • 12 runs in total were submitted for the above 4 tasks

  15. Problem statement • The CLIR system should take a set of 50 topics in the source language and return the top 1000 documents for each topic in the target language • Example topic:
      <top lang="hi">
      <num>28</num>
      <title>ईरान का परमाणु कार्यक्रम</title>
      <desc>ईरान का कार्यक्रम और उसकी परमाणु नीति के बारे में विश्व की राय।</desc>
      <narr>ईरान की परमाणु नीति और ऐसे कार्यक्रम के विरुद्ध ईरान पर यूएसए का निरंतर दबाव और धमकी के बारे में सूचना संबंधित प्रलेख में रहनी चाहिए। परमाणु नीति के समझौते के लिए ईरान और यूरोपीय संघ के बीच वार्ता और विश्व दृष्टि भी रुचिकर होगी</narr>
      </top>
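A minimal sketch of reading topics in the format shown on this slide, using Python's standard XML parser. The function name `parse_topics` and the synthetic `<topics>` wrapper are assumptions made for illustration; the slides do not say how the system read its topic files, and the sketch assumes the input has no XML declaration.

```python
import xml.etree.ElementTree as ET


def parse_topics(xml_text):
    """Parse a run of <top> elements into a list of dicts.

    The input is wrapped in a synthetic <topics> root (an assumption)
    so that several sibling <top> elements parse as one document.
    """
    root = ET.fromstring("<topics>" + xml_text + "</topics>")
    topics = []
    for top in root.iter("top"):
        topics.append({
            "lang": top.get("lang"),
            "num": int(top.findtext("num")),
            # findtext returns None for a missing field, hence the `or ""`.
            "title": (top.findtext("title") or "").strip(),
            "desc": (top.findtext("desc") or "").strip(),
            "narr": (top.findtext("narr") or "").strip(),
        })
    return topics


# Title and description copied from the slide; the narrative is omitted here.
SAMPLE = (
    '<top lang="hi"> <num>28</num> '
    "<title>ईरान का परमाणु कार्यक्रम</title> "
    "<desc>ईरान का कार्यक्रम और उसकी परमाणु नीति के बारे में विश्व की राय।</desc> "
    "</top>"
)

topics = parse_topics(SAMPLE)
print(topics[0]["num"], topics[0]["lang"], topics[0]["title"])
```

Each resulting dict keeps the title, description and narrative separate, which is convenient if later steps weight words by where they occur in the topic.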

  16. CLIR System architecture • Query Processing module • Named Entities identification • Query translation using lexicons • Transliteration • Query Scoring • Indexing module • Stop-word remover • A typical Indexer using Lucene

  17. CLIR System architecture • Query Processing module • Named Entities identification • Query translation using lexicons • Transliteration • Query Scoring • Indexing module • Stop-word remover • A typical Indexer using Lucene

  18. Named entities identification • Used for identifying the named entities present in the queries for transliteration • We used • Our CRF-based NER system (as a binary classifier) for Hindi queries • Stanford English NER system for English queries • Identifies Person, Organization and Location names • “Experiments in Telugu NER: A Conditional Random Field Approach”, Praneeth M Shishtla, Prasad Pingali, Vasudeva Varma. NERSSEAL-08, IJCNLP-08, Hyderabad, 2008.

  19. CLIR System architecture • Query Processing module • Named Entities identification • Query translation using lexicons • Transliteration (mapping-based) • Query Scoring • Indexing module • Stop-word remover • A typical Indexer using Lucene

  20. Query translation • Using bilingual lexicons • “Shabdanjali”, an English-Hindi dictionary containing 26,633 entries • IIT Bombay Hindi Wordnet • Manually collected Hindi-English dictionary with 6,685 entries • Shabdanjali: http://www.shabdkosh.com/shabdanjali • Hindi Wordnet: http://www.cfilt.iitb.ac.in/wordnet/webhwn/
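Lexicon-based query translation can be sketched as a word-by-word dictionary lookup. Everything below is illustrative: `TOY_LEXICON` is a hypothetical three-entry stand-in for the dictionaries listed on the slide, and keeping every listed translation of a word is an assumption, since the slides do not describe how ambiguous entries were handled.

```python
# Hypothetical toy English-Hindi lexicon; the real system used
# Shabdanjali, the IIT Bombay Hindi Wordnet and a manual dictionary.
TOY_LEXICON = {
    "nuclear": ["परमाणु"],
    "programme": ["कार्यक्रम"],
    "policy": ["नीति"],
}


def translate_query(tokens, lexicon):
    """Look up each token and keep every listed translation.

    Returns (translated_terms, untranslated_tokens). Tokens missing
    from the lexicon are returned separately so a later step, such as
    named-entity transliteration, can handle them.
    """
    translated, untranslated = [], []
    for token in tokens:
        entries = lexicon.get(token.lower())
        if entries:
            translated.extend(entries)
        else:
            untranslated.append(token)
    return translated, untranslated


terms, leftovers = translate_query(["Iran", "nuclear", "programme"], TOY_LEXICON)
print(terms, leftovers)
```

Here "Iran" has no dictionary entry, so it ends up in the leftover list, which matches the role the slides give to named-entity identification and transliteration.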

  21. CLIR System architecture • Query Processing module • Named Entities identification • Query translation using lexicons • Transliteration (mapping-based) • Query Scoring • Indexing module • Stop-word remover • A typical Indexer using Lucene

  22. Transliteration • Mapping-based approach • For a given named entity in the source language • Derive the Compressed Word Format (CWF), e.g. academia – cdm, abullah – bll • Generate the list of named entities and their CWFs on the target-language side • Search for and map the CWF of the source-language NE to the CWF of the right target-language equivalent within the minimum modified edit distance

  23. Transliteration • Implementation • Named entities present in the Hindi and English corpora are identified and listed • Their CWFs are generated using a set of heuristic, rewrite and remove rules • CWFs are added to the list of NEs • “Named Entity Transliteration for Cross-Language Information Retrieval using Compressed Word Format Mapping algorithm”, Srinivasan C Janarthanam, Sethuramalingam S, Udhyakumar Nallasamy. iNEWS-08, CIKM-2008.
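The mapping idea on these two slides can be sketched as follows. This is a rough illustration under stated assumptions, not the algorithm from the cited paper: the `cwf` rule here (lowercase, then drop vowels and "h") is a stand-in chosen only because it reproduces the two examples on the slide, plain Levenshtein distance replaces the "modified edit distance", and both sides use Latin script to keep the example short.

```python
def cwf(word):
    """Stand-in Compressed Word Format: drop vowels and 'h' (assumed rule)."""
    return "".join(ch for ch in word.lower() if ch not in "aeiouh")


def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def map_entity(source_ne, target_nes, max_distance=1):
    """Return the target NE whose CWF is closest to the source NE's CWF.

    Returns None when there are no candidates or when the best
    candidate is further away than max_distance (a hypothetical cutoff).
    """
    if not target_nes:
        return None
    source_cwf = cwf(source_ne)
    best = min(target_nes, key=lambda ne: edit_distance(source_cwf, cwf(ne)))
    if edit_distance(source_cwf, cwf(best)) > max_distance:
        return None
    return best


# Hypothetical target-side entity list.
candidates = ["economy", "abdullah", "academy"]
print(cwf("academia"), cwf("abullah"))
print(map_entity("academia", candidates))
```

Comparing compressed forms rather than full spellings makes the match tolerant of vowel differences between the two languages' renderings of the same name.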

  24. CLIR System architecture • Query Processing module • Named Entities identification • Query translation using lexicons • Transliteration (mapping-based) • Query Scoring • Indexing module • Stop-word remover • A typical Indexer using Lucene

  25. Query Scoring • We generate a Boolean OR query with scored query words • Query scoring is based on • Position of occurrence of the word in the topic • Number of occurrences of the word • Numbers and years are given greater weights
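One way to picture the scored Boolean OR query is the sketch below, which emits a string in Lucene's classic query syntax (`term^boost` joined by `OR`). The weights are invented for illustration: the slides name the three scoring signals but give no values, so `FIELD_WEIGHTS`, `NUMBER_BONUS` and the additive formula are all assumptions. Whitespace tokenization is used so that Hindi words are not split at vowel signs.

```python
from collections import defaultdict

# Assumed weights: words in the title count more than words in the
# description, which count more than words in the narrative.
FIELD_WEIGHTS = {"title": 3.0, "desc": 2.0, "narr": 1.0}
NUMBER_BONUS = 2.0  # assumed multiplier for numbers and years


def score_terms(topic):
    """Score each word by where and how often it occurs in the topic."""
    scores = defaultdict(float)
    for field, weight in FIELD_WEIGHTS.items():
        for token in topic.get(field, "").lower().split():
            scores[token] += weight  # repeated words accumulate weight
    for token in scores:
        if token.isdigit():
            scores[token] *= NUMBER_BONUS
    return dict(scores)


def build_or_query(scores):
    """Join boosted terms with OR, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return " OR ".join(f"{term}^{score:.1f}" for term, score in ranked)


# Hypothetical English topic used only to show the output shape.
topic = {"title": "Iran nuclear programme", "desc": "Iran nuclear policy 2006"}
query = build_or_query(score_terms(topic))
print(query)
```

Because the terms are combined with OR, a document matching only some of them can still be retrieved, and the boosts decide how much each match contributes to its rank.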

  26. CLIR System architecture • Query Processing module • Named Entities identification • Query translation using lexicons • Transliteration (mapping-based) • Query Scoring • Indexing & Ranking module • Stop-word remover • A typical Indexer using Lucene

  27. Indexing module • For the English corpus, stop words are removed and the remaining words are stemmed using Lucene • For the Hindi corpus, a list of 246 stop words is generated from the given corpus based on frequency • Documents are indexed using the Lucene indexer and ranked using the BM25 algorithm in Lucene
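A frequency-based stop-word list like the one described for the Hindi corpus can be sketched in a few lines. The slide only says the list came from corpus frequency; taking the most frequent whitespace-separated tokens, with no manual filtering, is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def build_stopword_list(documents, size=246):
    """Return the `size` most frequent whitespace-separated tokens."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return [word for word, _ in counts.most_common(size)]


def remove_stopwords(text, stopwords):
    """Drop stop words from a text before indexing."""
    blocked = set(stopwords)
    return [tok for tok in text.split() if tok not in blocked]


# Tiny English corpus used only to make the behaviour easy to check.
docs = ["the cat and the dog", "the bird and the fish"]
stopwords = build_stopword_list(docs, size=2)
print(stopwords)
print(remove_stopwords("the cat and the bird", stopwords))
```

On a real corpus the cutoff would be far larger than 2; the default of 246 mirrors the list size given on the slide.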

  28. Evaluation & Analysis

  29. Evaluation • English-Hindi cross-lingual run

  30. Evaluation • Hindi-English cross-lingual run

  31. Evaluation • Hindi-Hindi monolingual run

  32. Evaluation • English-English monolingual run

  33. English-Hindi vs. Hindi-Hindi

  34. Hindi-English vs. English-English

  35. Evaluation • Summary • Our English-Hindi CLIR performance was 58% of the monolingual run • Our Hindi-English CLIR performance was 25% of the monolingual run • Our Hindi-Hindi monolingual run retrieved 52% of total relevant documents • Our English-English monolingual run retrieved 91% of total relevant documents

  36. Analysis • Our English-Hindi CLIR performance can be attributed to factors like • Exact matching of English named entities • Good coverage of English words in our lexicons • The relatively lower performance on Hindi-English CLIR is due to • Low dictionary coverage • Query formulation that was not complex enough

  37. Future work

  38. Future Work • Error analysis on a per-topic basis • Work on more complex query formulations • Work on other possible query translation techniques like • Building dictionaries from parallel corpora • Using the web • Using Wikipedia

  39. Thank you!!!