
AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil- English

AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil- English. Pattabhi R.K Rao and Sobha. L AU-KBC Research Centre, MIT Campus, Chennai. FIRE 2008 – Tamil – English CLIR. Problem Definition Ad-hoc cross-lingual document retrieval task of FIRE.


Presentation Transcript


  1. AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil-English
     Pattabhi R.K Rao and Sobha L.
     AU-KBC Research Centre, MIT Campus, Chennai

  2. FIRE 2008 – Tamil–English CLIR
     • Problem Definition
     • Ad-hoc cross-lingual document retrieval task of FIRE.
     • The task is to retrieve relevant documents in English for a given Indian-language query.
     • We worked on a Tamil–English cross-lingual information retrieval system.

  3. Our Approach
     • The main components of our CLIR system are:
       • Query Language Analyser
       • Named Entity Recognizer
       • Query Translation engine
       • Query Expansion
       • Ranking

  4. Query Language Analyser – Tamil Morphological Analyser
     • The morphological analyser analyses each word and returns its morphs.
     • E.g.: patiwwAnY -> pati (V) + ww (Past) + AnY (3SM)
     • For nouns, the inflections mark case, such as dative or accusative.
     • For verbs, the inflections carry information on person, number, gender, tense, aspect and modality.
     • Uses a paradigm-based approach.
     • Implemented as a finite state machine.
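
The suffix analysis in the example above can be sketched roughly as follows. This is a minimal illustration, assuming a toy suffix table that covers only the slide's example; the table contents, tag names, and the `analyse` function are hypothetical and are not the authors' analyser, which the slide describes only as paradigm-based and implemented as a finite state machine.

```python
# Toy suffix tables (hypothetical; they cover only the slide's example).
PNG_SUFFIXES = {"AnY": "3SM"}      # person-number-gender markers
TENSE_SUFFIXES = {"ww": "Past"}    # tense markers
VERB_ROOTS = {"pati"}              # known verb roots


def analyse(word):
    """Strip the PNG suffix, then the tense suffix, then check the root."""
    for png, png_tag in PNG_SUFFIXES.items():
        if not word.endswith(png):
            continue
        stem = word[: -len(png)]
        for tense, tense_tag in TENSE_SUFFIXES.items():
            if stem.endswith(tense):
                root = stem[: -len(tense)]
                if root in VERB_ROOTS:
                    return [(root, "V"), (tense, tense_tag), (png, png_tag)]
    return None  # no analysis under this toy table


print(analyse("patiwwAnY"))
```

A real paradigm-based analyser would hold one suffix table per inflection class and compile them into a single automaton; the nested loops here only mimic that right-to-left traversal.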

  5. Named Entity Recognizer (NER)
     • The generic engine uses Conditional Random Fields (CRFs).
     • Trained on a 100,000-word corpus drawn from various domains.
     • Uses a hierarchical tagset.
     • Achieves 80% recall and 89% precision.
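
The slide reports precision and recall separately and does not give a combined figure. As a worked check, the balanced F-measure implied by those two numbers can be computed as below; the `f1_score` helper is an illustrative name, and the result is derived arithmetic rather than a figure reported by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
print(round(f1_score(0.89, 0.80), 4))
```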

  6. Query Translation
     • Uses a bilingual-dictionary-based approach.
     • The Tamil–English bilingual dictionary is of size 150K.
     • Named entities that require transliteration are passed to a transliteration engine.
     • Tamil-to-English transliteration is a difficult task:
       • Tamil has few consonants.
     • Transliteration is done by a statistical system based on an n-gram approach.
     • The statistical system achieves an accuracy of 81%.
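
The division of labour between dictionary lookup and transliteration described above might look like the following sketch. Everything here is assumed for illustration: the `translate_query` function, the romanized keys (placeholders, not verified dictionary entries), the stub transliterator standing in for the statistical n-gram engine, and the choice to fall back to transliteration for out-of-dictionary terms, which the slide does not state.

```python
def translate_query(terms, named_entities, dictionary, transliterate):
    """Translate source-language terms to English.

    Named entities go to the transliterator; other terms are looked up
    in the bilingual dictionary. Falling back to transliteration for
    terms missing from the dictionary is an assumption of this sketch.
    """
    translated = []
    for term in terms:
        if term in named_entities:
            translated.append(transliterate(term))
        elif term in dictionary:
            translated.extend(dictionary[term])  # keep all senses
        else:
            translated.append(transliterate(term))
    return translated


# Placeholder data: hypothetical romanized keys, not real dictionary entries.
toy_dictionary = {"uwavi": ["assistance", "help"]}
toy_transliterations = {"cunYAmi": "Tsunami"}


def stub_transliterate(term):
    """Stand-in for the statistical n-gram transliteration engine."""
    return toy_transliterations.get(term, term)


print(translate_query(["uwavi", "cunYAmi"], {"cunYAmi"},
                      toy_dictionary, stub_transliterate))
```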

  7. Query Expansion
     • The query terms are expanded using:
       • Thesaurus
       • Ontology
     • Query expansion is done at two places:
       • Before query translation
       • After query translation
     • Synonyms are obtained using WordNet.

  8. Query Expansion (2)
     • Ontology is used to obtain more world knowledge.
     • [Figure: ontology fragment rooted at Festivals, with branches Christian (Christmas), Muslim (Ramazan) and Hindu (Holi, Diwali, Dussera)]

  9. What is there in the Ontology
     • Descriptions of each entity
     • Ex: Holi – Festival of colours, Good over Evil
     • Deepavali – Festival of Lights, crackers, etc.
     • We have an ontology of this type for 100 entities.
     • Festivals, Sports, Countries, Natural Calamities, Person Names, etc.
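
Ontology-driven expansion of the kind described in the query expansion slides could be sketched as below. The dictionary layout, the `expand_query` function, and the rule of appending every description phrase to the query are assumptions made for illustration; the slides do not specify how ontology entries are stored or selected.

```python
# Hypothetical ontology fragment built from the examples on the slides.
ONTOLOGY = {
    "Holi": {
        "category": "Festivals",
        "descriptions": ["festival of colours", "good over evil"],
    },
    "Deepavali": {
        "category": "Festivals",
        "descriptions": ["festival of lights", "crackers"],
    },
}


def expand_query(terms, ontology):
    """Append ontology descriptions after each matching query term."""
    expanded = []
    for term in terms:
        candidates = [term] + ontology.get(term, {}).get("descriptions", [])
        for candidate in candidates:
            if candidate not in expanded:  # avoid duplicate terms
                expanded.append(candidate)
    return expanded


print(expand_query(["Holi", "celebration"], ONTOLOGY))
```

Terms with no ontology entry pass through unchanged, so the same function can run both before and after translation with different ontologies or thesauri.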

  10. Ranking
     • The standard Okapi (BM25) ranking algorithm is used, customized to suit our needs.
     • A parameter called the boost factor is added to the standard algorithm when calculating the score.
     • The NEs in the query are given a boost factor of 1.5, and the original query terms are given a boost factor of 1.25.

  11. Ranking (2)
     • The boost factor sets the weight given to particular terms in the query.
     • NEs get more weight than other terms: they are given 0.5 times (50%) more weight.
     • Original query terms are given 0.25 times (25%) more weight, to retain the importance of the terms the user supplied.
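
One way to add a boost factor to BM25 is sketched below. The slides do not say where the factor enters the formula, so this sketch assumes it multiplies each term's BM25 contribution, and that terms without an explicit boost (for example, expansion terms) get 1.0. The `bm25_score` function, the `k1` and `b` defaults, the non-negative IDF variant, and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, corpus, boosts, k1=1.2, b=0.75):
    """BM25 score of one document, with a per-term multiplicative boost."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        # Non-negative IDF variant (assumed; other variants exist).
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += boosts.get(term, 1.0) * idf * tf[term] * (k1 + 1) / norm
    return score


# Toy corpus of tokenized documents (illustrative only).
corpus = [
    ["tsunami", "relief", "fund"],
    ["iraq", "war", "protest"],
    ["tsunami", "warning", "system"],
]
# Boost 1.5 for a named entity, 1.25 for an original query term.
boosts = {"tsunami": 1.5, "relief": 1.25}
query = ["tsunami", "relief"]

scores = [bm25_score(query, doc, corpus, boosts) for doc in corpus]
print(scores)
```

Because the boost is a plain multiplier, setting every boost to 1.0 recovers unmodified BM25, which makes the effect of the customization easy to isolate.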

  12. Experiments – Results (1)
     • We submitted two runs.
     • For query 29, “assistance after Tsunami”, expanding the terms “assistance” and “Tsunami” yields “financial assistance, relief material, manpower help, rebuilding infrastructure, government assistance, non-governmental organizations assistance, relief fund, natural calamity, Tsunami, high sea waves”.
     • This expansion helped increase recall; the MAP score for this query is 0.46.
     • For query ids 27 and 59 the system did not perform well.

  13. Experiments – Results (2)
     • Query 27, “Sino Indian relationship”, is too broad, and the query expansion is poor because the ontology lacks the necessary knowledge: what constitutes a “relationship” needs to be defined.
     • Query 59, “American citizens fight against Iraq war”, is too specific; the collection contains many more documents on the Iraq war in general than on this particular topic. The terms “Iraq War” get more weight than the terms “fight against”.

  14. Experiments – Results (3) Overall Results of the Tamil – English cross lingual information retrieval system.

  15. Conclusion
     • A query language analyser is used.
     • The two runs obtained MAP scores of 0.3921 and 0.4821.
     • The query expansion module helps increase recall.
     • The results obtained are encouraging:
       • MAP – 0.4821
       • P@10 – 0.6960
       • Recall – 0.8912
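
For readers unfamiliar with the measures quoted above, the following sketch shows how P@k and (mean) average precision are usually computed from a ranked list and a set of relevance judgements. The function names and the toy ranking are illustrative assumptions; the figures on the slide come from the official evaluation, not from this code.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant doc."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over (ranked_list, relevant_set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: two of four retrieved documents are relevant.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, k=2))
print(average_precision(ranked, relevant))
```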

  17. Thank you!
