N-gram Based Indexing for Marathi Monolingual Search

N-gram Based Indexing for Marathi Monolingual Search Ashish Almeida and Pushpak Bhattacharyya IIT Bombay

Introduction • Word based techniques • Lexical analysis • Morphological Analysis • Language dependent • Dictionary • Spelling normalization • Stop-word elimination • Multi-word expressions • N-grams • Language independent • Easy to develop

Related Work • Character n-gram Tokenization : McNamee • Significance of n-grams • Use overlapping n-grams for different languages • Tested on HAIRCUT system • Various aspects of n-gram Modeling • Defining generalized n-grams : K. Jarvelin • Defines n-gram and s-gram • Similarity measures for comparing n-grams

Corpus • FIRE 2008 DATA-SET • http://www.isical.ac.in/~fire/ • Documents • 99,275 news articles • Year 2004-2007 • eSakal and Maharashtra Times • queries • 50 training set (from FIRE 2008) • 50 test set (translated from English)

Document <DOC> <DOCNO>MaharashtraC06E811B.txt</DOCNO> <TEXT> घटकचाचणीतसरकारअनुत्तीर्ण सरकारनेआजशंभरदिवसपूर्णकेलेआहेत. ठराविककाळानंतरहोणाऱ्याघटकचाचणीपरीक्षांतआपलीतयारीआजमावण्याचीसंधीविद्यार्थ्यांनामिळते. सरकारच्याकामगिरीचेमूल्यमापनहीत्याचधर्तीवरकरावे, याहेतूनेहेप्रगतिपुस्तकमांडलेआहे. …. </TEXT> </DOC>

Relevance Judgment • 50 new queries translated in Marathi for FIRE 2010 • Query no. 76 – query no. 125 • Marathi • 20,600 document judged for FIRE 2010 • 621 relevant documents • 11 queries have no relevant documents

Why N-grams • Vocabulary increase with #documents in corpus • Dictionary size grows • N-gram based tokens • size is restricted by size of n • n-grams break words • Captures morphological changes • Combine parts of consecutive words • Resilient to spelling errors / spelling variations

Problems with Morph-analyzer • Accuracy of morphological analyzer limited • Can not handle • Unknown words • Unknown suffixes • Not suitable for news domain IR • Many named entities • Computationally heavy

Generating N-grams • For each document Get article text Remove punctuation marks Replace space by ‘_’ Put ‘_’ before and after each sentence Treat address, titles etc. like sentences Do while (there are more than n-1 char left) • Select first n characters as n-gram • Remove the first character from the text End of while

Example: 4-grams • “याफुटबॉलपटुंनाप्रसिद्धीचीगरजआहे.” (These football players deserves fame) • “_या_फुटबॉलपटुंना_प्रसिद्धीची_गरज_आहे_” • 4-grams generated • _या_ , या_फ , ा_फु , _फुट , फुटब , ुटबॉ , ... , गरज_ , रज_आ , ज_आह , _आहे , आहे_ • Length of word • प्रसिद्धी • प+्+र+स+ि+द+्+ध+ी • 9 characters

IR System • Terrier 2 • Open source • Modular • Easy to modify • Unicode ready * • Retrieval models Retrieval models used for evaluation ( Available in Terrier 2 )

Indexing • N-grams • Indexing • Modified tokenizer • Query processing • Index data structure sizes in terrier for • different n-grams

Experiment 1. Baseline • Only word based indexing and retrieval. • No-preprocessing

Experiment : n-grams 2. Using basic n-grams (DFR_BM25) MAP for different length N-grams

Experiment: n-grams 3. Include Small words • During Indexing and retrieval • Identify and brake n-gram overlapping • At text boundaries such as sentence ends, braces, quotation marks, commas.

Effect of length of n in n-gram on MAP (T+D+N)

Combination of N-grams • Choose 2 different length n-grams • Indexing and retrieval • N-gram 1 : >=4 • N-gram 2 : < 4

Conclusion • 4-gram performs best • Balance between precision and recall • Word based MAP : 23.94 % • 4- gram based MAP : 35.79 % • Future work • Analysis of combinations of N-grams • Use skip-grams for Marathi • Experiment based on different word length criteria

References • Paul McNamee and James Mayfield, Character N-gram Tokenization for European Language Text Retrieval, 2004 • Terrier • http://ir.dcs.gla.ac.uk/terrier/ • AnniJarvelin, AnttiJarvelin, KalervoJarvelin, s-grams: Defining generalized n-grams for information retrieval, 2006 • Paul McNamee, Textual Representations for Corpus-Based Bilingual Retrieval, PhD Thesis, 2008

Thank You !

N-gram Based Indexing for Marathi Monolingual Search

N-gram Based Indexing for Marathi Monolingual Search

Presentation Transcript

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

N-gram Models

Graph Search with Indexing

n-gram analysis

Tree-based Indexing

N-Gram Model Formulas

Topic-Dependent-Class-Based N-Gram Language Model

Tree-based Indexing

Indexing in Search Engine

N-gram Models

Search Strategies based on cluster-based indexing and retrieval sophiasearch

Graphic : Nearest Neighbor Search for Distance Based Indexing

N-gram Search Engine on Wikipedia

Marathi – Marathi Monolingual Information Retrieval

N-Gram-based Dynamic Web Page Defacement Validation

N-gram model limitations

N-gram Models

N-Gram Model Formulas

XML Indexing and Search

Taxonomy Based Indexing