
ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task

ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task. Avinash Yadav Robins Yadav Sukomal Pal Department of Computer Science & Engineering Indian School of Mines Dhanbad , India. Contents. Introduction Adhoc retrieval task participation


Presentation Transcript


  1. ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task • Avinash Yadav, Robins Yadav, Sukomal Pal • Department of Computer Science & Engineering, Indian School of Mines, Dhanbad, India

  2. Contents • Introduction • Adhoc retrieval task participation • Morpheme Extraction Task participation • Conclusion

  3. Introduction • Stemmer • ISMstemmer • Evaluation

  4. Stemmer • Attempts to reduce word variants to their stem or root form • Example: education, educating and educative all reduce to educat • Approaches to stemming: • Language-based approach • Statistical approach

  5. ISMstemmer • statistical stemmer • based on suffix extraction • suffix frequency • algorithm

  6. Data Preprocessing • Convert the corpus (File 1, File 2, …, File n) into a single file • Clean the data (remove punctuation) • Remove stop words • Convert the file into a single column (one word per line) • Example: John asked a girl with an apple of Kashmir, “do you have the time”. She said, “yes”. • After cleaning: John asked a girl with an apple of Kashmir do you have the time she said yes • After removing stop words: John asked girl with apple Kashmir you time she said yes
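
The preprocessing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the ASCII-only tokenizer (suitable for the English example only; Hindi would need a script-aware tokenizer) and the stop-word list are all assumptions, with the stop words chosen simply so that the slide's example sentence is reproduced.

```python
import re

# Hypothetical stop-word list: the list actually used is not given in the slides.
# These six words are the ones removed in the slide's example sentence.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}


def preprocess(documents):
    """Merge documents, drop punctuation and stop words, return one word per entry."""
    merged = " ".join(documents)                # corpus -> single file
    tokens = re.findall(r"[A-Za-z]+", merged)   # cleaning: keep runs of ASCII letters only
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def unique_words(tokens):
    """Sorted list of distinct words, i.e. the single-column word list."""
    return sorted(set(tokens))


if __name__ == "__main__":
    docs = ['John asked a girl with an apple of Kashmir, "do you have the time".',
            'She said, "yes".']
    for word in preprocess(docs):
        print(word)
```

Unlike the slide's example, the sketch keeps the original casing of every word.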

  7. Data Preprocessing (contd.) • Unique words extracted: • Hindi: 4,90,391 • English: 7,95,144

  8. Find valid suffixes • Reverse the words of the single-column file • Sort the reversed list • Find suffixes according to the threshold • Word list: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling • Reversed: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna • Reversed and sorted: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba • Frequent reversed endings marked in the figure: de (17%), gni (40%), noit
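
One possible reading of the suffix-finding step is sketched below. The slides do not spell out how suffix frequency is computed, so everything here is an assumption: the function name, the suffix length limits, and the rule that a suffix is valid when the fraction of unique words ending in it reaches the threshold. Counting prefixes of the reversed, sorted words is equivalent to counting endings of the original words.

```python
from collections import Counter


def find_valid_suffixes(words, threshold, min_len=2, max_len=4):
    """Return suffixes whose share of the unique words is at least `threshold`.

    `threshold` is a fraction (0.2 means 20%). `min_len` and `max_len` are
    hypothetical limits, not taken from the slides.
    """
    # Reverse every unique word and sort, as on the slide.
    reversed_sorted = sorted(w[::-1] for w in set(words))
    counts = Counter()
    for rw in reversed_sorted:
        # Each prefix of a reversed word is a candidate (reversed) suffix;
        # always leave at least one character for the stem.
        for n in range(min_len, min(max_len, len(rw) - 1) + 1):
            counts[rw[:n]] += 1
    total = len(reversed_sorted)
    return {rs[::-1] for rs, c in counts.items() if c / total >= threshold}


if __name__ == "__main__":
    sample = ("aborning absolution absorption abuilding acquisition activation "
              "added addition admiration admitted admitting agreed agreeing "
              "allotted allotting ambling angling").split()
    print(sorted(find_valid_suffixes(sample, threshold=0.2)))
```

The 0.2 threshold suits only this 17-word toy list; the thresholds reported for the real corpora are far smaller because the vocabularies contain hundreds of thousands of words.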

  9. Threshold used • English: 0.01–0.1% • Hindi: 0.1–1.0%

  10. Stemming of corpus • Strip the reversed valid suffixes from the reversed words • Reverse the stemmed words to restore the original letter order • Reversed and sorted words: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba • After stripping the reversed suffixes: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba • Reversed back (final stems): add agre admitt allott abuild agree ambl angl aborn admitt allott admira activa addi acquisi absorp absolu

  11. Note: If the stem left after removing the suffix would be shorter than 3 letters, the word is not stemmed • Example: aging and king stay unchanged, because stemming would leave only ag and k
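
The stemming step and the minimum-length rule can be sketched as below. This is an illustration under stated assumptions, not the ISMstemmer implementation: the suffix set holds only the three endings stripped in the slide's example, the longest-match rule is assumed, and the sketch works on the forward word with `endswith`, which gives the same result as stripping reversed suffixes from reversed words.

```python
# Hypothetical suffix set: the three endings stripped in the slide's example.
VALID_SUFFIXES = {"ed", "ing", "tion"}


def stem(word, suffixes=VALID_SUFFIXES, min_stem_len=3):
    """Strip the longest matching suffix unless the stem would be under 3 letters."""
    # Try longer suffixes first (assumed longest-match rule).
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            # Too-short stems are rejected and the word is left as it is.
            return candidate if len(candidate) >= min_stem_len else word
    return word


if __name__ == "__main__":
    for w in ["agreed", "admiration", "angling", "aging", "king"]:
        print(w, "->", stem(w))
```

With this suffix set, agreed, admiration and angling become agre, admira and angl as on the slide, while aging and king are left untouched.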

  12. Evaluation of ISMstemmer • For evaluation of ISMstemmer we have participated in: • Monolingual Adhoc retrieval task in English and Hindi Languages • Morpheme Extraction Task (MET) of FIRE-2012

  13. Adhoc Retrieval Task(ART) Participation • Monolingual task • Languages chosen: • English • Approach • Results • Hindi • Approach • Results

  14. ART: English • Approach: • Indexing: search engine used: Indri (IndriBuildIndex) • Retrieval: search engine used: Lemur (RetEval) • Data provided: • Corpus from The Telegraph and BD News • 50-query set

  15. ART: English (contd.) • Results:

  16. ART: Hindi • Approach: • Indexing: search engine used: Indri (IndriBuildIndex) • Retrieval: search engine used: Indri (IndriRunQuery) • Data provided: • Corpus from Navbharat Times and Amar Ujala • 50-query set

  17. ART: Hindi (contd.) • Results:

  18. Morpheme Extraction Task Participation • Tool submitted • Results

  19. MET Tool Submission • ISMstemmer submitted • Evaluated at IR Labs, DAIICT, Gujarat • Tested on 6 languages of South Asian origin • Improved on the baseline MAP for 3 of them (Bengali, Gujarati and Marathi)

  20. MET Results: 1. BENGALI
      Institute     Language   MAP Obtained
      Baseline      Bengali    0.2740
      JU            Bengali    0.3307
      DCU           Bengali    0.3300
      IIT-KGP       Bengali    0.3225
      CVPR-Team1    Bengali    0.3159
      ISM           Bengali    0.3103
      CVPR-Team2+   Bengali    NA

  21. MET Results (contd.) 2. GUJARATI
      Institute     Language   MAP Obtained
      Baseline      Gujarati   0.2677
      ISM           Gujarati   0.2824
      3. MARATHI
      Institute     Language   MAP Obtained
      Baseline      Marathi    0.2320
      ISM           Marathi    0.2797
      IIT-B         Marathi    0.2684

  22. MET Results (contd.) 4. ODIA
      Institute     Language   MAP Obtained
      Baseline      Odia       0.1537
      IIIT-Bh       Odia       0.1537
      ISM           Odia       0.1537
      5. HINDI
      Institute     Language   MAP Obtained
      Baseline      Hindi      0.2821
      DCU           Hindi      0.2963
      ISM           Hindi      0.2793

  23. MET Results (contd.) 6. TAMIL
      Institute     Language   MAP Obtained
      Baseline      Tamil      NA
      AUCEG         Tamil      NA
      ISM           Tamil      NA
      NA: results are not available due to the non-availability of qrels
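
The MET results are reported as MAP (mean average precision), which the slides do not define. The sketch below shows the standard definition computed from ranked results and relevance judgments (qrels); the function names and the dictionary layout of `runs` and `qrels` are assumptions made for illustration and do not reflect the evaluation scripts actually used for the task.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against the set of relevant documents."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-query average precision over all queries that have qrels."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5"]}
    qrels = {"q1": {"d1", "d3"}, "q2": {"d9"}}
    print(mean_average_precision(runs, qrels))
```

Without qrels for a language there is nothing to compare the ranked lists against, so no MAP can be computed.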

  24. Reasons for Underperformance with Hindi • overstemming • undesired stemming of proper nouns

  25. Overstemming • Words that should not be grouped together by stemming, but are • Example: • accent, accentual, accentuate: correct stem accent • accept, acceptant, acceptor: correct stem accept • access, accessible, accession: correct stem access • With overstemming, all of these may end up grouped under the wrong stem acce

  26. Undesired stemming of proper nouns • Proper nouns should not be stemmed, as they are not inflected • Example: Beijing gets stemmed to Beij

  27. Conclusion • ART: • English: not satisfactory • Hindi: poor • Reasons: overstemming and undesired stemming of proper nouns • MET: • Beat the baseline with Bengali, Gujarati and Marathi • Matched the baseline with Odia • Fell slightly below the baseline with Hindi


  30. THANK YOU!!
