
IITB @ FIRE 2010: Discriminative Models for IR

Vishal Vachhani, Shalini Gupta. Joint work with Manoj Chinnakotla, Karthik Raman, Pushpak Bhattacharyya. 20th February, 2010.


Presentation Transcript


  1. IITB @ FIRE 2010: Discriminative Models for IR • Vishal Vachhani, Shalini Gupta • Joint work with Manoj Chinnakotla, Karthik Raman, Pushpak Bhattacharyya • 20th February, 2010

  2. Roadmap • Introduction • Ranking Models for IR • Discriminative Models for IR • Features for RankSVM • Training & PRF • Cross Lingual IR • Experiments & Results • Conclusions

  3. The Central Problem of IR • User has an information need • Expresses it as a short and often ambiguous query • Objective of an IR system • Match the user query with relevant documents • Rank the documents in order of relevance

  4. Ranking Models for IR • Vector Space Models • Probabilistic Models • Language Models

  5. Why Discriminative Models? • Arbitrary Features • Optimization for specific Evaluation Metrics • Hence, Discriminative Models for IR.

  6. Features for RankSVM • Statistical Features • Term Frequency • Normalized Term Frequency • Inverse Document Frequency

  7. Features for RankSVM (contd.) • Normalized Cumulative Frequency • Normalized Term Frequency, weighted by IDF • Normalized Term Frequency, weighted by cumulative frequency
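
The slides name six statistical features but do not give their formulas. The sketch below is one plausible reading, loosely following the feature definitions in Nallapati [6]: each feature is a log-scaled sum over the query terms that occur in the document. The function name, its arguments, and the exact smoothing constants are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def statistical_features(query_terms, doc_terms, doc_freq, coll_freq,
                         num_docs, coll_size):
    """Return six query-document features (hypothetical formulation).

    doc_freq[t]  : number of documents containing term t
    coll_freq[t] : total occurrences of term t in the collection
    num_docs     : number of documents in the collection
    coll_size    : total number of term occurrences in the collection
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    feats = [0.0] * 6
    for term in set(query_terms):
        c = tf[term]
        if c == 0:
            continue  # only query terms present in the document contribute
        ntf = c / doc_len                      # length-normalized term frequency
        idf = num_docs / doc_freq[term]        # inverse document frequency ratio
        icf = coll_size / coll_freq[term]      # inverse cumulative frequency ratio
        feats[0] += math.log(c + 1)            # term frequency
        feats[1] += math.log(ntf + 1)          # normalized term frequency
        feats[2] += math.log(idf)              # inverse document frequency
        feats[3] += math.log(icf + 1)          # normalized cumulative frequency
        feats[4] += math.log(ntf * idf + 1)    # normalized TF weighted by IDF
        feats[5] += math.log(ntf * icf + 1)    # normalized TF weighted by cumulative frequency
    return feats
```

Each query-document pair then yields one fixed-length vector, which is the form a ranking SVM expects as input.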

  8. Features for RankSVM (contd.) • Co-ord Factor • Ratio of the number of overlapping words to the total number of query words • Named Entity features • Capture the importance of named entities (NE) in the query • Pivoted length normalization • LSI similarity • Captures concept-based similarity
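
As a small illustration of the co-ord factor defined on this slide, the sketch below computes the fraction of query words that also appear in the document. Counting distinct words rather than repeated occurrences is an assumption; the slide does not say which is used, and `coord_factor` is a hypothetical name.

```python
def coord_factor(query_terms, doc_terms):
    """Fraction of distinct query words that occur in the document."""
    query_vocab = set(query_terms)
    if not query_vocab:
        return 0.0  # empty query: no overlap is possible
    overlap = query_vocab & set(doc_terms)
    return len(overlap) / len(query_vocab)
```

For the query "indian cricket team" and a document containing only "cricket" and "team" from that query, the factor is 2/3.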

  9. Training • Training data is in the form of ranked documents • Top preference is assigned to relevant documents from the relevance judgements of the training queries • Add k1 random negative documents from the relevance judgements • Add close irrelevant documents from an initial ranking • Get the initial ranking (top 1K) using simple query likelihood ranking • Top k2 irrelevant documents of the initial ranking • Bottom k3 irrelevant documents of the initial ranking
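
The sampling procedure on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: documents in the initial ranking that are not judged relevant are treated as irrelevant, the function names are hypothetical, and the values of k1, k2, and k3 are not given in the slides.

```python
import random

def build_training_docs(relevant, judged_nonrelevant, initial_ranking,
                        k1, k2, k3, seed=0):
    """Return (doc_id, label) examples for one training query.

    label 1 = preferred (relevant), label 0 = negative.
    """
    rng = random.Random(seed)
    relevant = set(relevant)
    # k1 random negatives drawn from the judged non-relevant documents
    pool = sorted(judged_nonrelevant)
    random_negs = rng.sample(pool, min(k1, len(pool)))
    # Irrelevant documents of the initial ranking, kept in rank order
    ranked_irrelevant = [d for d in initial_ranking if d not in relevant]
    top_negs = ranked_irrelevant[:k2]                       # hard negatives near the top
    bottom_negs = ranked_irrelevant[-k3:] if k3 > 0 else [] # easy negatives from the tail
    negatives = set(random_negs) | set(top_negs) | set(bottom_negs)
    examples = [(d, 1) for d in sorted(relevant)]
    examples += [(d, 0) for d in sorted(negatives)]
    return examples

def preference_pairs(examples):
    """Expand labeled examples into (preferred, other) pairs for pairwise training."""
    pos = [d for d, y in examples if y == 1]
    neg = [d for d, y in examples if y == 0]
    return [(p, n) for p in pos for n in neg]
```

A pairwise learner such as RankSVM is then trained so that the first document of every pair scores higher than the second.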

  10. Pseudo Relevance Feedback (PRF) • We use the PRF method of Zhai and Lafferty [11] in our experiments • Terms from the top 10 documents of the initial ranking were used as relevance feedback terms • Updating the feature vector vs. expanding the feature vector • Updating the feature vector gives better accuracy
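
For context, the model-based feedback of Zhai and Lafferty [11] interpolates the original query language model with a feedback model estimated from the top-ranked documents. The formula below restates that interpolation; the symbol names are chosen here for illustration, and the value of the coefficient used in these runs is not stated in the slides.

```latex
% \hat{\theta}_Q    : original query language model
% \hat{\theta}_F    : feedback model estimated from the top-ranked documents
% \alpha \in [0, 1] : interpolation coefficient (value not given in the slides)
\hat{\theta}_{Q'} = (1 - \alpha)\,\hat{\theta}_Q + \alpha\,\hat{\theta}_F
```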

  11. Cross Lingual IR • Hindi-English and Marathi-English query translation based CLIR • Transliteration • Segment-based transliteration approach for Devanagari to English • Accuracy: 71.8% at rank 5 • Resources • Hindi-English Dictionary: 1,31,750 • Marathi-English Dictionary: 31,845

  12. Query Disambiguation • Disambiguation algorithm • PageRank-style iterative disambiguation (Monz and Dorr [5]) • FIRE-2008 • Link weight: Dice coefficient • 1 candidate translation per query word • FIRE-2010 • Link weight: similarity in LSI space • LSI establishes associations between terms that occur in similar contexts • 2 candidate translations per query word
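
The iterative disambiguation idea can be sketched as below, in the spirit of Monz and Dorr [5]: every candidate translation starts with a uniform weight, repeatedly gains weight from the candidates of the other query words in proportion to a link weight, and is renormalized per query word. The function names, the iteration count, and the non-negative link-weight interface are assumptions; the Dice helper corresponds to the FIRE-2008 setting, and an LSI-space similarity could be passed in its place.

```python
def dice(df_a, df_b, df_ab):
    """Dice coefficient from document frequencies of a, b, and their co-occurrence."""
    total = df_a + df_b
    return 2.0 * df_ab / total if total else 0.0

def disambiguate(candidates, link_weight, iterations=10, top_n=1):
    """Pick the top_n translations per source word.

    candidates  : {source_word: [candidate translations]}
    link_weight : function (t, t2) -> non-negative association score
    """
    # Start with uniform weights over each word's candidates
    w = {s: {t: 1.0 / len(ts) for t in ts}
         for s, ts in candidates.items() if ts}
    for _ in range(iterations):
        new_w = {}
        for s, weights in w.items():
            updated = {}
            for t, wt in weights.items():
                # Support from candidates of all *other* source words
                support = sum(link_weight(t, t2) * w2
                              for s2, others in w.items() if s2 != s
                              for t2, w2 in others.items())
                updated[t] = wt + support
            norm = sum(updated.values())
            new_w[s] = {t: v / norm for t, v in updated.items()}
        w = new_w
    return {s: sorted(ws, key=ws.get, reverse=True)[:top_n]
            for s, ws in w.items()}
```

Setting `top_n=1` mirrors the FIRE-2008 configuration on the slide, and `top_n=2` mirrors FIRE-2010.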

  13. Experiments • Monolingual: English, Hindi, and Marathi • CLIR: Hindi-English and Marathi-English • Training • 50 test queries of FIRE-2008 • SVM, Rank-SVM • Runs • Monolingual and cross-lingual without feedback • Monolingual and cross-lingual with feedback • Runs with Title, Title + Desc

  14. Monolingual Results

  15. Cross-Lingual Results

  16. Conclusions • Improvement in cross-lingual IR due to improvements in transliteration and query disambiguation • About 88% of monolingual retrieval performance • Further investigation is needed on • Monolingual baselines: Hindi, Marathi and English • Performance of PRF in the discriminative IR model • Effect of various features like NE, LSI-based similarity, and document length normalization

  17. Acknowledgements • The first author is supported by the Infosys Fellowship Award • Project linguists at CFILT, IIT Bombay

  18. References [1] M. K. Chinnakotla and O. P. Damani. Character sequence modeling for transliteration. In ICON 2009, IITB, India, 2009. [2] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990. [3] T. Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133-142, New York, NY, USA, 2002. ACM. [4] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008. [5] C. Monz and B. J. Dorr. Iterative translation disambiguation for cross-language information retrieval. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 520-527, New York, USA, 2005. ACM. [6] R. Nallapati. Discriminative models for information retrieval, 2004.

  19. References (contd.) [7] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), 2006. [8] N. Padariya, M. Chinnakotla, A. Nagesh, and O. P. Damani. Evaluation of Hindi to English, Marathi to English and English to Hindi CLIR at FIRE 2008. In FIRE 2008, IITB, India, 2008. [9] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. pages 21-29. [10] D. Widdows and K. Ferraro. Semantic vectors: a scalable open source package and online technology management application. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/. [11] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, pages 403-410, New York, NY, USA, 2001. ACM.

  20. Thank You
