230 likes | 483 Views
N-gram Based Indexing for Marathi Monolingual Search. Ashish Almeida and Pushpak Bhattacharyya IIT Bombay. Introduction. Word based techniques Lexical analysis Morphological Analysis Language dependent Dictionary Spelling normalization Stop-word elimination Multi-word expressions
E N D
N-gram Based Indexing for Marathi Monolingual Search Ashish Almeida and Pushpak Bhattacharyya IIT Bombay
Introduction • Word based techniques • Lexical analysis • Morphological Analysis • Language dependent • Dictionary • Spelling normalization • Stop-word elimination • Multi-word expressions • N-grams • Language independent • Easy to develop
Related Work • Character n-gram Tokenization : McNamee • Significance of n-grams • Use overlapping n-grams for different languages • Tested on HAIRCUT system • Various aspects of n-gram Modeling • Defining generalized n-grams : K. Jarvelin • Defines n-gram and s-gram • Similarity measures for comparing n-grams
Corpus • FIRE 2008 DATA-SET • http://www.isical.ac.in/~fire/ • Documents • 99,275 news articles • Year 2004-2007 • eSakal and Maharashtra Times • queries • 50 training set (from FIRE 2008) • 50 test set (translated from English)
Document <DOC> <DOCNO>MaharashtraC06E811B.txt</DOCNO> <TEXT> घटकचाचणीतसरकारअनुत्तीर्ण सरकारनेआजशंभरदिवसपूर्णकेलेआहेत. ठराविककाळानंतरहोणाऱ्याघटकचाचणीपरीक्षांतआपलीतयारीआजमावण्याचीसंधीविद्यार्थ्यांनामिळते. सरकारच्याकामगिरीचेमूल्यमापनहीत्याचधर्तीवरकरावे, याहेतूनेहेप्रगतिपुस्तकमांडलेआहे. …. </TEXT> </DOC>
Relevance Judgment • 50 new queries translated in Marathi for FIRE 2010 • Query no. 76 – query no. 125 • Marathi • 20,600 document judged for FIRE 2010 • 621 relevant documents • 11 queries have no relevant documents
Why N-grams • Vocabulary increase with #documents in corpus • Dictionary size grows • N-gram based tokens • size is restricted by size of n • n-grams break words • Captures morphological changes • Combine parts of consecutive words • Resilient to spelling errors / spelling variations
Problems with Morph-analyzer • Accuracy of morphological analyzer limited • Can not handle • Unknown words • Unknown suffixes • Not suitable for news domain IR • Many named entities • Computationally heavy
Generating N-grams • For each document Get article text Remove punctuation marks Replace space by ‘_’ Put ‘_’ before and after each sentence Treat address, titles etc. like sentences Do while (there are more than n-1 char left) • Select first n characters as n-gram • Remove the first character from the text End of while
Example: 4-grams • “याफुटबॉलपटुंनाप्रसिद्धीचीगरजआहे.” (These football players deserves fame) • “_या_फुटबॉलपटुंना_प्रसिद्धीची_गरज_आहे_” • 4-grams generated • _या_ , या_फ , ा_फु , _फुट , फुटब , ुटबॉ , ... , गरज_ , रज_आ , ज_आह , _आहे , आहे_ • Length of word • प्रसिद्धी • प+्+र+स+ि+द+्+ध+ी • 9 characters
IR System • Terrier 2 • Open source • Modular • Easy to modify • Unicode ready * • Retrieval models Retrieval models used for evaluation ( Available in Terrier 2 )
Indexing • N-grams • Indexing • Modified tokenizer • Query processing • Index data structure sizes in terrier for • different n-grams
Experiment 1. Baseline • Only word based indexing and retrieval. • No-preprocessing
Experiment : n-grams 2. Using basic n-grams (DFR_BM25) MAP for different length N-grams
Experiment: n-grams 3. Include Small words • During Indexing and retrieval • Identify and brake n-gram overlapping • At text boundaries such as sentence ends, braces, quotation marks, commas.
Combination of N-grams • Choose 2 different length n-grams • Indexing and retrieval • N-gram 1 : >=4 • N-gram 2 : < 4
Conclusion • 4-gram performs best • Balance between precision and recall • Word based MAP : 23.94 % • 4- gram based MAP : 35.79 % • Future work • Analysis of combinations of N-grams • Use skip-grams for Marathi • Experiment based on different word length criteria
References • Paul McNamee and James Mayfield, Character N-gram Tokenization for European Language Text Retrieval, 2004 • Terrier • http://ir.dcs.gla.ac.uk/terrier/ • AnniJarvelin, AnttiJarvelin, KalervoJarvelin, s-grams: Defining generalized n-grams for information retrieval, 2006 • Paul McNamee, Textual Representations for Corpus-Based Bilingual Retrieval, PhD Thesis, 2008