Overview of RISOT: Retrieval of Indic Script OCR’d Text

Overview of RISOT:Retrieval of Indic Script OCR’d Text Utpal Garain Indian Statistical Institute, Kolkata Tamaltaru Pal Indian Statistical Institute, Kolkata Jiaul Paik Indian Statistical Institute, Kolkata Kripa Ghosh Indian Statistical Institute, Kolkata David Doermann University of Maryland, College Park, USA Douglas W. Oard University of Maryland, College Park, USA

Task • Evaluate retrieval of automatically recognized text from machine printed text • Goals • Support experimentation of retrieval from printed documents • Evaluate IR effectiveness for retrieval based on Indic script OCR • Provide venue where IR and OCR researchers can work together

RISOT 2011 • Bengali newspaper articles • About half the FIRE 2008/2010 collection • 62,875 documents • Text • Rendered image • OCR’d text • 66 topics

RISOT 2011 • Two teams participated • Techniques • OCR error modeling • Query time stemming • Best absolute OCR results resulted from stemming + error modeling • 83% the TEXT MAP for TD queries • Best same-team relative MAP 90% of TEXT • 88% for P@10

Further experiments on RISOT 2011 Data • N-gram statistics were used • Stemming beats words or n-grams • Statistically significant improvement over words for T and TD; Clean and OCR; w/ and w/o error model

CLIR • English query Bengali collection (OCR’d) • Dictionary based translation • Transliteration of OOVs • Additional resources • Stemming • OCR error modeling

CLIR Results

Addition in 2012 • Devanagari (Hindi) Dataset • 94,432 articles from two newspaper • Subset of FIRE data • Text • Rendered image • OCR’d • 28 topics • Tasks • OCR Post-processing • Retrieval from Bengali OCR’d text • Retrieval from Devanagari (Hindi) OCR’d Text

RISOT Runs • One team participated • ISI team • KripabandhuGhosh and AnirbanChakraborty • Method • Did not use previous OCR error modeling technique • Assumed that clean text is not available • Co-occurrence based synonym searching • tobacc, 1obacco, etc. are synonyms of tobacco

RISOT Results • OCR error modeling gave better improvement

RISOT Future • Next RISOT will introduce image degradation • Module of OCRopus • LAMP, UMD tool • How to attract more teams • Involvement of OCR consortium • Better OCR • Better error modeling • Summer code projects • Once in two years

Overview of RISOT: Retrieval of Indic Script OCR’d Text

Overview of RISOT: Retrieval of Indic Script OCR’d Text

Presentation Transcript

talk-ppt - PowerPoint Presentation

Overview of Presentation

Overview of Presentation

Transliteration of Indic Scripts

Overview of Presentation

Indic Script Support on the Windows Platform

An Overview of Information Retrieval

Overview of presentation:

Overview of Presentation

Overview of presentation

Imaged Document Text Retrieval without OCR

Overview of Information Retrieval

Title of PowerPoint presentation

Text-retrieval Systems

Overview of presentation

OVERVIEW OF PRESENTATION

Overview of Presentation

OCR 2888 PowerPoint Slides

Overview of Presentation

Overview of Image Retrieval

An Overview of Information Retrieval

Overview of Presentation