AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil- English

Pattabhi R.K Rao and Sobha L, AU-KBC Research Centre, MIT Campus, Chennai

FIRE 2008 – Tamil – English CLIR • Problem Definition • Ad-hoc cross-lingual document retrieval task of FIRE. • The task is to retrieve relevant documents in English for a given Indian language query • We worked on a Tamil – English cross-lingual information retrieval system

Our Approach • The main components in our CLIR system are • Query Language Analyser • Named Entity recognizer • Query Translation engine • Query Expansion • Ranking

Query Language Analyser – Tamil Morphological Analyser • The morphological analyser analyses each word to give the morphs of the word • E.g.: patiwwAnY -> pati(V) + ww(Past) + AnY(3SM) • For nouns, the inflections mark case, such as dative or accusative • For verbs, the inflections carry person, number, gender, tense, aspect and modality information • Uses a paradigm-based approach • Implemented as a finite state machine
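The segmentation in the example can be illustrated with a toy suffix-stripping sketch that walks from the end of the word, loosely in the spirit of a finite-state analyser. It is not the authors' analyser: the tables hold only the morphs from the slide example, and the names `TENSE_SUFFIXES`, `PNG_SUFFIXES`, `VERB_ROOTS`, `analyse` and `format_analysis` are hypothetical.

```python
# Toy lexicon: only the entries from the slide example are grounded in the source.
TENSE_SUFFIXES = {"ww": "Past"}
PNG_SUFFIXES = {"AnY": "3SM"}  # person-number-gender marker
VERB_ROOTS = {"pati"}


def analyse(word):
    """Return a list of (morph, tag) pairs, or None if no analysis is found."""
    for png, png_tag in PNG_SUFFIXES.items():
        if not word.endswith(png):
            continue
        stem = word[: -len(png)]  # strip the person-number-gender suffix
        for tense, tense_tag in TENSE_SUFFIXES.items():
            if not stem.endswith(tense):
                continue
            root = stem[: -len(tense)]  # strip the tense suffix
            if root in VERB_ROOTS:
                return [(root, "V"), (tense, tense_tag), (png, png_tag)]
    return None


def format_analysis(morphs):
    """Render an analysis in the slide's notation."""
    return " + ".join(f"{morph}({tag})" for morph, tag in morphs)


print(format_analysis(analyse("patiwwAnY")))
```

A real paradigm-based analyser would group roots by inflection class and compile the suffix sequences into states and transitions; the nested loops here only mimic that traversal for a single verb pattern.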

Named Entity Recognizer (NER) • Generic engine uses Conditional Random Fields (CRFs) • Trained on a 100,000-word corpus from various domains • Uses a hierarchical tagset • Achieves 80% recall and 89% precision

Query Translation • Uses a bilingual dictionary-based approach • The Tamil – English bilingual dictionary is 150K in size • For named entities that require transliteration, a transliteration engine is used • Tamil to English transliteration is a tough task • Tamil has few consonants, so a single Tamil letter can map to several English letters • Transliteration is done using a statistical system based on an n-gram approach • The statistical system works with an accuracy of 81%
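The routing between dictionary lookup and transliteration might be sketched as below. This is an illustration under stated assumptions, not the system's code: `translate_query` and its arguments are hypothetical names, the query terms are placeholders rather than real Tamil words, the transliteration engine is passed in as a plain function, and dropping out-of-dictionary terms is a choice made only for this sketch.

```python
def translate_query(terms, dictionary, named_entities, transliterate):
    """Translate source-language query terms into target-language terms.

    dictionary      maps a source term to a list of target-language equivalents
    named_entities  is the set of terms tagged as named entities
    transliterate   is a stand-in for the statistical transliteration engine
    """
    translated = []
    for term in terms:
        if term in named_entities:
            # Named entities bypass the dictionary and are transliterated.
            translated.append(transliterate(term))
        elif term in dictionary:
            # Keep every dictionary equivalent; ranking sorts them out later.
            translated.extend(dictionary[term])
        # Assumption: terms with no dictionary entry are dropped here.
    return translated


# Placeholder data, not real Tamil vocabulary.
toy_dictionary = {"src_help": ["assistance", "help"]}
toy_entities = {"src_name"}
print(translate_query(["src_help", "src_name"], toy_dictionary, toy_entities, str.upper))
```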

Query Expansion • The query terms are expanded using • Thesaurus • Ontology • Query Expansion is done at two places • Before Query translation • After Query translation • Synonyms are obtained using WordNet

Query Expansion (2) • Ontology is used to obtain more world knowledge • [Figure: ontology fragment. Festivals: Christian (Christmas), Muslim (Ramazan), Hindu (Holi, Diwali, Dussera)]

What is there in the Ontology • Descriptions of each entity • Ex: Holi - Festival of colours, Good over Evil • Deepavali - Festival of Lights, crackers, etc. • We have an ontology of this type for 100 entities • Festivals, Sports, Countries, Natural Calamities, Person Names, etc.
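Ontology-driven expansion of this kind can be sketched as a lookup that appends an entity's descriptions to the query. The two entries below come from the slide; the dictionary layout and the name `expand_query` are assumptions for illustration, and the real ontology presumably carries more structure than a flat list of phrases.

```python
# Entity -> descriptive phrases, taken from the slide's two examples.
ONTOLOGY = {
    "Holi": ["festival of colours", "good over evil"],
    "Deepavali": ["festival of lights", "crackers"],
}


def expand_query(terms, ontology):
    """Return the original terms followed by their ontology descriptions."""
    expanded = []
    for term in terms:
        # Original term first, then its descriptions; skip duplicates.
        for candidate in [term] + ontology.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["Holi", "celebration"], ONTOLOGY))
```

Terms without an ontology entry pass through unchanged, which mirrors the weak expansion reported for queries whose concepts the ontology does not cover.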

Ranking • The standard Okapi (BM25) ranking algorithm is used, customized to suit our needs • A parameter called the boost factor is added to the standard algorithm for calculating the score • The NEs in the query are given a boost factor of 1.5 and original query terms are given a boost factor of 1.25

Ranking (2) • The boost factor sets the weight given to particular terms in the query • NEs get more weight than other terms: they are given 0.5 times more weight • Original query terms are given 0.25 times more weight, to retain the importance of the terms the user supplied
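One plausible reading of the boosted score is sketched below. The slides do not say where the boost enters the formula, so this sketch assumes it multiplies each term's BM25 contribution; the values `k1=1.2` and `b=0.75` are common defaults rather than values from the source, expanded terms are assumed to carry a boost of 1.0, and `bm25_boosted` is a hypothetical name.

```python
import math
from collections import Counter


def bm25_boosted(query_boosts, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document against a query whose terms carry boost factors.

    query_boosts maps each query term to its boost (e.g. 1.5 for NEs,
    1.25 for original query terms, 1.0 assumed for expansion terms).
    corpus is a list of tokenised documents used for IDF and average length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term, boost in query_boosts.items():
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        term_score = tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        # Assumption: the boost scales the term's whole contribution.
        score += boost * idf * term_score
    return score


corpus = [["tsunami", "relief", "fund"], ["iraq", "war"], ["tsunami", "waves"]]
boosts = {"tsunami": 1.5, "relief": 1.25, "fund": 1.0}
for doc in corpus:
    print(doc, round(bm25_boosted(boosts, doc, corpus), 3))
```

With a multiplicative boost, a named entity that matches contributes 1.5 times what the same match would contribute unboosted, which is consistent with the slide's "0.5 times more weight".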

Experiments – Results (1) • We submitted two runs • For query 29, “assistance after Tsunami”, expanding the terms “assistance” and “Tsunami” gives “financial assistance, relief material, manpower help, rebuilding infrastructure, government assistance, non-governmental organizations assistance, relief fund, natural calamity, Tsunami, high sea waves” • This expansion helped increase recall; the MAP score for this query is 0.46 • For query ids 27 and 59 the system did not perform well

Experiments – Results (2) • Query 27, “Sino Indian relationship”, is too broad, and the query expansion is poor because the ontology lacks the needed knowledge; what constitutes a relationship needs to be defined • Query 59, “American citizens fight against Iraq war”, is too specific: the document collection has many more documents on the Iraq war in general than on this particular topic, and the terms “Iraq War” get more weight than the terms “fight against”

Experiments – Results (3) Overall Results of the Tamil – English cross lingual information retrieval system.

Conclusion • A query language analyser is used • The two runs obtained MAP scores of 0.3921 and 0.4821 • The query expansion module helps increase recall • The results obtained are encouraging • MAP – 0.4821 • P@10 – 0.6960 • Recall – 0.8912
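For readers unfamiliar with the reported measures, the standard definitions of P@10, recall and MAP can be sketched as follows. The function names and the toy ranking are illustrative only; the figures above come from the official FIRE evaluation, not from this code.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall(ranked, relevant):
    """Fraction of all relevant documents that were retrieved."""
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy ranking for one query; document ids are placeholders.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
print(precision_at_k(ranked, relevant, k=2), recall(ranked, relevant))
```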

References • Mohammad Afraz and Sobha L (2008), “English to Dravidian Language Machine Transliteration: A Statistical Approach Based on N-grams”, In the Proceedings of International Seminar on Malayalam and Globalization, Thiruvananthapuram, India. • Genesereth, M. R. and Nilsson, N. (1987), Logical Foundations of Artificial Intelligence, Morgan Kaufmann Publishers: San Mateo, CA. • Vijayakrishna R and Sobha L (2008), “Domain focused Named Entity Recognizer for Tamil using Conditional Random Fields”, In Proceedings of International Joint Conference on Natural Language Processing Workshop on NER for South and South East Asian Languages, Hyderabad, India, pp. 59 – 66. • S. Viswanathan, S. Ramesh Kumar, B. Kumara Shanmugam, S. Arulmozi and K. Vijay Shanker (2003), “A Tamil Morphological Analyser”, In the Proceedings of International Conference on Natural Language Processing-2003, Mysore.

Thank you!
