Recognition and Retrieval from Document Image Collections Million Meshesha (Roll No.: 200299004) Advisor: Dr. C. V. Jawahar Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, India
Introduction
• Global effort to digitize and archive large collections of multimedia data
  • Most of them are printed books
• Emergence of large digital libraries like UDL, DLI, etc.
• One-million-book archival activities at the Mega Scanning Center, IIIT-H
• Involvement of Google, Yahoo and Microsoft in massive digitization projects
• The aim of digitization is easier preservation and making documents freely accessible across the globe.
Efficient means of access to the content need to be designed.
The Direct Approach
• Recognition-based access to documents
• Easy to integrate into a standard IR framework
• Success of text image retrieval mainly depends on the performance of OCRs
[Figure: recognition-based pipeline. Document images -> Optical Character Recognition (preprocessing and segmentation, feature extraction, classification, post-processing) -> text documents -> database. A cross-lingual textual query goes to the search engine, which retrieves text documents.]
Challenges
• State-of-the-art OCR engines recognize documents printed in Latin and some Oriental scripts
  • with few errors per page for high-quality images
• Robust OCRs are unavailable for indigenous scripts of African and Indian languages.
• Developing OCRs is challenging for scripts with complex shapes and a large number of characters.
• Lack of specialized recognizers for large document image collections.
• Diversity and quantity of documents archived in digital libraries.
Alternate Approach: Recognition-Free
[Figure: recognition-free pipeline. Document images -> preprocessing and segmentation -> feature extraction -> clustering and indexing -> database. A cross-lingual textual query is rendered and sent to the search engine, which retrieves document images.]
Comparison of the Two Approaches
Recognition-based | Recognition-free
Needs recognition before retrieval | Retrieves without explicit recognition
e.g. text search engines | e.g. CBIR, CBVR
Less offline processing (excluding recognition) | High offline processing
Fast and efficient algorithms | Slow and inefficient schemes
Compact representation | Bulky representation
Content/language dependent | More content/language independent
Challenging to build (because of recognizers) | Relatively easy to build with a certain level of acceptable performance
Review of OCR Systems
• Conventional OCRs follow sequential steps:
  • Preprocessing: thresholding, normalization, skew detection/correction, noise removal algorithms
  • Document layout analysis: text/image block identification, geometric layout analysis
  • Segmentation: line segmentation, word segmentation, component analysis
  • Feature extraction: structural features like shape, contour, etc.; transformation domain features like DFT, DCT; global and local features
  • Classification: Bayesian statistical classifier, SVM classifier, neural network
  • Post-processing: lexical information, dictionary and punctuation rules, statistical information
H. Baird, "Anatomy of a Versatile Page Reader", Proc. of IEEE, Vol. 80, No. 7, July 1992.
IM. Bokser, "Omnidocument Technologies", Proc. of IEEE, Vol. 80, No. 7, July 1992.
Review of Recognition-Free
• Manmatha et al.:
  • Proposed the word spotting idea for matching word images from handwritten historical manuscripts.
  • Used dynamic time warping (DTW) for word image matching.
  • Selected profile features for matching handwritten word images.
• Chaudhury et al.:
  • Exploited the structural characteristics of Indian scripts to access them at word level.
  • Employed geometric features, and suffix trees for indexing.
• Trenkle and Vogt:
  • Experimented on word-level image matching.
  • Extracted features at the baseline, concavities, line segments, junctions, dots and stroke directions, and computed a distance metric.
• Srihari et al.:
  • Spotted words in document images of Devanagari, Arabic and Latin scripts.
  • Used Gradient, Structural and Concavity (GSC) features.
  • Implemented a correlation similarity measure for word spotting.
• A. K. Jain and Anoop M. Namboodiri:
  • Employed DTW-based word spotting for indexing and retrieval of on-line documents.
  • Extracted features such as the height of the sample point and the direction and curvature of strokes.
Santanu Chaudhury, Geetika Sethi, Anand Vyas and Gaurav Harit, "Devising Interactive Access Techniques for Indian Language Document Images", Proc. of the Seventh International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 885-889.
S. N. Srihari, H. Srinivasan, C. Huang and S. Shetty, "Spotting Words in Latin, Devanagari and Arabic Scripts", Vivek: Indian Journal of Artificial Intelligence.
A. K. Jain and Anoop M. Namboodiri, "Indexing and Retrieval of On-line Handwritten Documents", Proc. of the Seventh International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 655-659.
J. M. Trenkle and R. C. Vogt, "Word Recognition for Information Retrieval in the Image Domain", Symposium on Document Analysis and Information Retrieval, pp. 105-122, 1993.
T. Rath and R. Manmatha, "Word Image Matching Using Dynamic Time Warping", Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2, pp. 521-527, 2003.
Major Contributions
• Study indigenous African scripts for document understanding
  • First attempt to introduce the challenges in the recognition and retrieval of indigenous African scripts.
• Design an OCR for recognizing printed Amharic documents
  • Tested on real-life document images (books, magazines and newspapers).
• Propose an architecture for a self-adaptable book recognizer
  • Demonstrate its application on document images of a book.
• Propose efficient matching and feature extraction schemes
  • Performance analysis on datasets of word-form variants, degradations and printing variations in word images.
• Construct an indexing scheme by applying IR principles for efficient searching in document images.
  • Evaluate its efficiency on document images of books and newspapers.
African Scripts
• Africa is the second-largest continent in the world, next to Asia.
• There are around 2500 languages spoken in Africa, which are either:
  • Introduced by conquerors of the past, using modifications of the Latin and Arabic scripts.
  • Indigenous languages with their own scripts, e.g. Amharic (Ethiopia), Vai (West Africa), Bassa (Liberia), Mende (Sierra Leone), etc.
• Document image analysis and understanding research is very limited for indigenous African scripts.
  • A few attempts are available for the Amharic script.
  • Other indigenous scripts are not yet studied:
    • Most are not used as official languages
    • Their existence is not known by most researchers
    • Characters are complex in shape
[Figure: samples of the Bassa, Vai and Mende scripts]
Amharic Language/Script
• Large number of characters
  • More than 300 characters
  • Vowel formation
• Existence of visually similar characters
• Frequently occurring characters
• Amharic word morphology
  • Has rich word morphology
  • Amharic (like Hindi) is a verb-final language; modifiers usually precede the nouns they modify.
    • The word order in English sentences is Subject-Verb-Object
    • The word order in Amharic and Hindi is Subject-Object-Verb
Recognition from “A” Document Image
Amharic OCR is developed on top of an OCR for Indian languages.
• Preprocessing
  • Binarization: convert gray pixels into binary.
  • Skew detection and correction: ensure that the page is aligned properly.
  • Noise removal: remove artifacts in the image.
• Segmentation
  • Line segmentation: identify lines in a text.
  • Word segmentation: identify words in a text line.
  • Character segmentation: detect each character in a segmented word.
• Feature extraction
  • Consider the entire component image as a feature.
  • PCA
    • Used for dimensionality reduction.
    • Reduces to the character/connected-component sub-space.
  • LDA
    • Extracts the optimal discriminant vector and reduces to the classification sub-space.
• Classification
  • DDAG-based architecture for multi-class SVMs.
  • Support Vector Machines (SVMs) at each node.
[Figure: DDAG for four classes, with pairwise nodes (1,4), (1,3), (2,4), (1,2), (2,3), (3,4) and leaves 1 to 4]
Consider characters and
D. H. Foley and J. W. Sammon, "An optimal set of discriminant vectors", IEEE Trans. on Computers, 24:271-278, 1975.
C. V. Jawahar, MNSSK Pavan Kumar and SS Ravi Kiran, "A Bilingual OCR for Hindi-Telugu Documents and its Applications", ICDAR 2003, pp. 408-412.
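The DDAG classification step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slide: `ddag_classify`, `nearest_mean` and `CLASS_MEANS` are hypothetical names, and a nearest-mean rule on a single number stands in for the trained pairwise SVM at each node.

```python
def ddag_classify(x, classes, pairwise):
    """Walk a decision DAG over `classes`.

    `pairwise(a, b, x)` is any two-class decision function that returns
    either a or b. Each evaluation eliminates one class, so a problem
    with N classes needs N - 1 pairwise decisions.
    """
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if pairwise(first, last, x) == first:
            remaining.pop()       # `last` is eliminated
        else:
            remaining.pop(0)      # `first` is eliminated
    return remaining[0]


# Hypothetical stand-in for a trained pairwise SVM: choose the closer class mean.
CLASS_MEANS = {1: 0.0, 2: 10.0, 3: 20.0, 4: 30.0}


def nearest_mean(a, b, x):
    return a if abs(x - CLASS_MEANS[a]) <= abs(x - CLASS_MEANS[b]) else b


label = ddag_classify(11.0, [1, 2, 3, 4], nearest_mean)
print(label)
```

With four classes the first decision is class 1 against class 4, which matches the root node (1,4) of the figure on this slide.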
Experimental Results
[Figure: sample degradations labeled Blob, Cut and Merge]
Comments
• Present-day OCRs do not improve their performance over time.
  • Performance on the first and last pages of a book is statistically identical.
• OCRs are designed to convert a single document image into a textual representation.
• Omni-font OCRs are rare even for English.
  • Performance degrades with poor quality, unseen fonts, etc.
An OCR for a collection (e.g. a book) has to be different from an OCR designed for an isolated single page.
Can we design a recognizer for document image collections, say, a book recognizer?
Our Strategy
• Enable the OCR to learn from its experience through feedback that comes from the post-processor during normal operation.
  • The conventional open-loop system of a classifier followed by a post-processor is closed.
• Learns from both correctly classified and misclassified examples.
• Extends knowledge gained from one page to other pages.
• Iterates and perfects on a page (or a set of pages).
• Improves its performance over time on document image collections that vary in font, size, style and quality.
Apply machine learning procedures to build an intelligent OCR
Comparison
Conventional OCRs | Our new approach: book recognizer
Designed for a single page | Designed for multiple pages
No feedback; top-down serial process | Feedback-based flexible design
Failures are costly: any error at an intermediate level results in wrong system output | Any error at an intermediate level can be corrected by using proper feedback
Offline training | More online learning
Performance is static or declines | Performance improves over time
Self-Adaptable OCR Design
[Figure: closed-loop recognizer. Components: Document Images, Recognizer (Classifier with Base Model and Model), Post Processor, Sampler, Validator, Labeler, Sample Database, Recognized Texts. Data passed between them: samples, filtered samples, selected samples, rejected samples, labeled samples, refined samples.]
• Produces error-corrected words.
• Such words are candidates for feedback
• Incremental learning
• Pass new samples for training
• Detection of outliers
• Validation in image space
• Label unlabelled data …
• Add samples to their proper class
Learning online
• Accuracy over iterations: initial 65.24%; 2nd iteration 88.24%; further iterations 91.08% and 94.82%; final 95.26%
• Experiment on a poor-quality book
  • Initial accuracy was very low (less than 70%)
• Within a few iterations of learning, the recognition accuracy improved to close to 96%.
Further Issue • OCR is a long-term solution. • Needs some time to come up with a workable system. • But our problem is immediate. • A number of documents are already archived and ready for use. Can we access the content of document images without explicit recognition?
Word Spotting
Collection | Query | Matching Score
Professor | University | 10.38
Alexander | University | 14.44
Smith | University | 12.21
until | University | 9.32
recently | University | 16.43
head | University | 17.34
chemistry | University | 14.56
Columbia | University | 15.10
University | University | 0.51
American | University | 18.71
Chemical | University | 14.32
Society | University | 12.13
died | University | 19.11
native | University | 18.10
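The table reads as a dissimilarity ranking: the lower the matching score, the closer the collection word is to the query. A small sketch of that ranking step, using the scores from the table (the variable names are illustrative only):

```python
# (collection word, matching score against the query "University")
scores = [
    ("Professor", 10.38), ("Alexander", 14.44), ("Smith", 12.21),
    ("until", 9.32), ("recently", 16.43), ("head", 17.34),
    ("chemistry", 14.56), ("Columbia", 15.10), ("University", 0.51),
    ("American", 18.71), ("Chemical", 14.32), ("Society", 12.13),
    ("died", 19.11), ("native", 18.10),
]

# Sort ascending: the smallest matching score is the best match.
ranked = sorted(scores, key=lambda pair: pair[1])
best_word, best_score = ranked[0]
print(best_word, best_score)
```

In practice a threshold on the score, or a fixed number of top-ranked words, would decide which word images are returned as hits; the slide does not state which was used.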
Word Search by Word Spotting
[Figure: pipeline Query ("Christian") -> Render -> Feature Extraction -> Matching]
Efficient Matching Scheme
• Matching techniques:
  • Cross correlation
  • Dynamic Time Warping (DTW)
    • Aligns and finds the best match between pairs of word images of different sizes.
    • Traces back to identify the optimal warping path (OWP)
Performance analysis shows that DTW outperforms cross correlation
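A minimal sketch of DTW with trace-back is given below for orientation. It assumes each word image has already been reduced to a one-dimensional sequence with one scalar feature per column, and it uses the absolute difference as the local cost; both are simplifying assumptions, and `dtw` is a hypothetical function name.

```python
def dtw(a, b):
    """Return (total cost, optimal warping path) for two non-empty sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = cheapest cost of aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            D[i][j] = local + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Trace back from the last cell to recover the optimal warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return D[n][m], path


cost, path = dtw([1, 2, 3], [1, 2, 2, 3])
print(cost, path)
```

Because the path may repeat an index of either sequence, the two inputs can have different lengths, which is what allows word images of different widths to be compared.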
Challenges in Word Image Search • Degradation of documents • Cuts, blobs, salt and pepper, erosion of border pixels, etc. • Print variations • A word image may vary in size, style, font and quality. • Morphological variation • A word may have different variants.
“Stemming” of Word Images
• Two possible variants of a word:
  • Formed by adding a prefix and/or suffix to the root word, e.g. 'connect', 'connects', 'connecting', 'reconnect', …
  • Synonymous words, e.g. 'connect', 'join', 'attach', …
• It is observed that most word-form variations take place either at the beginning or at the end.
• Needs a matching algorithm that can tolerate mismatches at the beginning or at the end.
Propose a novel DTW-based partial matching scheme
DTW-based Morphological Matcher
Partition the OWP (with length L) into beginning, middle and end regions of length k (L/3) each
for i = 1 to k do
    if there is matching cost concentration at the beginning
        reduce extra cost from the total matching score
    else
        break
end for
for i = L down to 2k do
    if there is matching cost concentration at the end
        reduce extra cost from the total matching score
    else
        break
end for
Normalize the matching score by the length of the optimal warping path.
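The procedure above can be sketched in Python as follows. Several details are assumptions made for illustration, since the slide does not define them: "cost concentration" is taken to mean a per-step cost above a hypothetical `threshold`, "reducing the extra cost" subtracts that whole step cost, and the end region is taken as the last k steps of the path. `partial_match_score` is a hypothetical name.

```python
def partial_match_score(step_costs, threshold):
    """Discount mismatch cost at the two ends of an optimal warping path.

    `step_costs[i]` is the local matching cost of the i-th step on the path.
    """
    L = len(step_costs)
    if L == 0:
        return 0.0
    k = L // 3
    total = sum(step_costs)

    # Beginning region: drop leading high-cost steps (e.g. a prefix).
    for i in range(k):
        if step_costs[i] > threshold:
            total -= step_costs[i]
        else:
            break

    # End region: drop trailing high-cost steps (e.g. a suffix).
    for i in range(L - 1, L - k - 1, -1):
        if step_costs[i] > threshold:
            total -= step_costs[i]
        else:
            break

    # Normalize by the length of the optimal warping path.
    return total / L


print(partial_match_score([5, 4, 0, 0, 0, 0, 0, 3, 6], threshold=1))
```

A mismatch in the middle region is left untouched, so two different words are still separated, while 'connect' and 'connecting' can score close to a full match.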
Degraded Words
[Figure: examples of degraded word images: complex script, salt and pepper, blobs, cuts, historic documents]
Degradation Modeling • Cuts and breaks • Blobs • Salt and pepper • Erosion of boundary pixels We built datasets using our degradation models for English, Hindi and Amharic.
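Two of the listed degradations can be simulated with very little code. The sketch below is only an illustration and not the degradation models used to build the datasets: it assumes a binary word image stored as a list of rows with 1 for ink and 0 for background, and `salt_and_pepper` and `cut` are hypothetical helper names.

```python
import random


def salt_and_pepper(image, prob, seed=None):
    """Flip each pixel independently with probability `prob`."""
    rng = random.Random(seed)
    return [[1 - px if rng.random() < prob else px for px in row]
            for row in image]


def cut(image, start_col, width):
    """Erase a vertical strip of `width` columns, simulating a cut or break."""
    return [[0 if start_col <= x < start_col + width else px
             for x, px in enumerate(row)]
            for row in image]


word = [
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
]
noisy = salt_and_pepper(word, prob=0.1, seed=7)
broken = cut(word, start_col=2, width=1)
print(broken)
```

Both functions return a new image and leave the input unchanged, so one clean word image can be degraded repeatedly at different noise levels.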
Invariant Feature Selection
• Investigate various features:
  • Profiles (upper, lower, projection, transition)
  • Statistical moments (mean, standard deviation, skew)
  • Region-based moments (zero-order moment, first-order moment, central moment)
  • Transform domain (Fourier) representations
• Global vs. local features
  • Global features: compute a single value for the whole word image.
  • Local features: compute a 1D representation over vertical strips of a word.
• Local features perform better than global features
• For better performance, combine local features of profiles, moments and transform domain representations
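As an illustration of local profile features, the sketch below computes four values per one-pixel-wide vertical strip of a binary word image. The exact definitions are assumptions chosen for this example (the slide only names the profiles): the image is a list of rows with 1 for ink, empty columns get the image height as their upper and lower profile, and `column_features` is a hypothetical name.

```python
def column_features(image):
    """Return one (upper, lower, projection, transitions) tuple per column."""
    height = len(image)
    width = len(image[0]) if height else 0
    features = []
    for x in range(width):
        column = [image[y][x] for y in range(height)]
        ink_rows = [y for y, px in enumerate(column) if px]
        # Upper profile: background pixels above the first ink pixel.
        upper = ink_rows[0] if ink_rows else height
        # Lower profile: background pixels below the last ink pixel.
        lower = height - 1 - ink_rows[-1] if ink_rows else height
        # Projection profile: number of ink pixels in the column.
        projection = sum(column)
        # Transition profile: number of ink/background changes down the column.
        transitions = sum(1 for y in range(1, height)
                          if column[y] != column[y - 1])
        features.append((upper, lower, projection, transitions))
    return features


img = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
]
print(column_features(img))
```

The resulting sequence has one entry per column, which is the kind of 1D representation that a DTW matcher can align between two word images of different widths.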
Invariant Feature Selection • To test the performance of combined features the DTW matching algorithm is modified • Combined local features of profiles and moments are invariant to degradations and printing variations.
Information Retrieval from “Document Images”
• Users expect more than just searching for documents that contain their query word.
  • Their expectations are shaped by the popularity of text search.
• Retrieve relevant documents in ranked order.
• Remove the effects of stopwords in the retrieval process.
• Fast search and efficient delivery of documents.
• How can we meet users' requirements?
Construct an indexing scheme to organize word images following IR principles.
Indexing Document Images
[Figure: indexing pipeline. Word images -> IR measures and clustering (stopword detection, stemming, relevance measure) -> template (keywords) as index terms -> inverted indexing -> index list]
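A minimal sketch of an inverted index over clustered word images follows. It assumes clustering has already assigned every word image a cluster identifier that serves as the index term. The stopword rule (drop clusters that occur in more than `stop_fraction` of the documents) and the TF-IDF relevance measure are standard IR choices used here for illustration; the slide does not state which measures were applied. `build_index` and `search` are hypothetical names.

```python
import math
from collections import defaultdict


def build_index(occurrences, num_docs, stop_fraction=0.9):
    """Map each cluster id to {doc id: term frequency}, skipping stopwords."""
    postings = defaultdict(lambda: defaultdict(int))
    for cluster_id, doc_id in occurrences:
        postings[cluster_id][doc_id] += 1
    index = {}
    for cluster_id, docs in postings.items():
        if len(docs) / num_docs > stop_fraction:
            continue  # occurs almost everywhere: treat as a stopword
        index[cluster_id] = dict(docs)
    return index


def search(index, cluster_id, num_docs):
    """Return (doc id, score) pairs ranked by TF-IDF, best first."""
    docs = index.get(cluster_id, {})
    if not docs:
        return []
    idf = math.log(num_docs / len(docs))
    scored = [(doc_id, tf * idf) for doc_id, tf in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


occurrences = [("c1", "d1"), ("c1", "d1"), ("c1", "d2"),
               ("the", "d1"), ("the", "d2"), ("the", "d3")]
index = build_index(occurrences, num_docs=3)
results = search(index, "c1", num_docs=3)
print(results)
```

At query time the query word image is matched to its nearest cluster, and the lookup itself is a dictionary access, so the expensive image matching is confined to the offline indexing stage.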
Clustered English Words • Clustered words vary in: • Fonts • Sizes • Styles • Forms • Quality
Test results on datasets of various fonts, sizes and styles
• Fonts: PowerGeez, VisualGeez, Agafari, Alpas
• Styles: Normal, Bold, Italic
• Sizes: 10, 12, 14, 16
Performance: Precision vs. Recall graph • The graph shows effectiveness of our scheme • it increases both precision and recall by moving the entire curve up and out to the right.
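For reference, the points of a precision vs. recall curve can be computed from a ranked result list as sketched below. The function names and the toy data are illustrative only and are not results from the experiments.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def pr_curve(ranked, relevant):
    """One (precision, recall) point for each cutoff of the ranked list."""
    return [precision_recall(ranked[:k], relevant)
            for k in range(1, len(ranked) + 1)]


# Toy example: four ranked results, two of which are relevant.
curve = pr_curve(["a", "b", "c", "d"], {"a", "c"})
print(curve)
```

A scheme that ranks relevant word images earlier yields higher precision at the same recall, which is what moving the curve up and to the right means.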
Concluding Remarks
• African scripts
  • Introduce indigenous African scripts to document analysis research for the first time
  • Initial attempt to recognize Amharic documents, with good results; the work can be extended to other indigenous African scripts.
  • Engineering effort is needed to make it applicable to real-life situations
• Recognizer design
  • New attempt to propose a self-adaptable recognizer for document image collections with the help of machine learning algorithms
  • Encouraging results for developing a recognizer for large document image collections
  • Further work is needed to extend the framework to many of the complex Indian and African scripts
• Document image indexing and retrieval
  • Propose a DTW-based partial matching scheme to perform morphological matching
  • Design a feature extraction scheme invariant to degradation and printing variations
  • Apply IR principles, and construct a clustering and indexing scheme.
  • System-related issues need to be solved for practical online retrieval from a large corpus
Million Meshesha and C. V. Jawahar, "Matching Word Images for Content-based Retrieval from Printed Document Images", International Journal of Document Analysis and Recognition (IJDAR) (in press).
Million Meshesha and C. V. Jawahar, "Optical Character Recognition of Amharic Documents", African Journal of Information and Communication Technology, Vol. 3, No. 2, pp. 53-66, June 2007.
Million Meshesha and C. V. Jawahar, "Self-Adaptable Recognizer for Document Image Collections", In Proc. of Int. Conf. on Pattern Recognition and Machine Intelligence (LNCS), 2007.
Million Meshesha and C. V. Jawahar, "Indexing Word Images for Recognition-free Retrieval from Printed Document Databases", Information Sciences: An International Journal (revised & submitted).
Million Meshesha and C. V. Jawahar, "Indigenous Scripts of African Languages", African Journal of Indigenous Knowledge Systems, Vol. 6, No. 2, pp. 132-142, 2007.
Scope for Future Work
• Develop an online system for searching hundreds of books over the Web
• Recognition and retrieval of complex documents (such as camera-based, handwritten, etc.).
• Apply advanced image preprocessing techniques to enhance image quality for large collections of document images.
• Retrieval of documents in the presence of OCR errors, and scope for hybrid approaches.
Publications: Conference Papers
• Million Meshesha and C. V. Jawahar, "Self-Adaptable Recognizer for Document Image Collections", In Proc. of Int. Conf. on Pattern Recognition and Machine Intelligence (LNCS), 2007.
• A. Balasubramanian, Million Meshesha, C. V. Jawahar, "Retrieval from Document Image Collections", In Proceedings of 7th IAPR Workshop on Document Analysis Systems (DAS), Nelson, New Zealand, (LNCS 3872), 2006, pp. 1-12.
• Sachin Rawat, K. S. Sesh Kumar, Million Meshesha, Indiraneel Deb Sikdar, A. Balasubramanian and C. V. Jawahar, "Semi-automatic Adaptive OCR for Digital Libraries", In Proceedings of 7th IAPR Workshop on Document Analysis Systems (DAS), Nelson, New Zealand, (LNCS 3872), 2006, pp. 13-24.
• K. Pramod Sankar, Million Meshesha, C. V. Jawahar, "Annotation of Images and Videos based on Textual Content without OCR", In Workshop on Computation Intensive Methods for Computer Vision, Part of 9th European Conference on Computer Vision (ECCV), Austria, 2006.
• Million Meshesha and C. V. Jawahar, "Recognition of Printed Amharic Documents", In Proceedings of 8th International Conference of Document Analysis and Recognition (ICDAR), Seoul, Korea, Sep 2005, Volume 1, pp. 784-788.
• C. V. Jawahar, Million Meshesha, A. Balasubramanian, "Searching in Document Images", In Proceedings of Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), 2004, pp. 622-627.
Publications: Journal Articles
• Million Meshesha and C. V. Jawahar, "Matching Word Images for Content-based Retrieval from Printed Document Images", International Journal of Document Analysis and Recognition (IJDAR) (in press).
• C. V. Jawahar, A. Balasubramanian, Million Meshesha and Anoop Namboodiri, "Retrieval of Online Handwriting by Synthesis and Matching", Pattern Recognition (in press).
• Million Meshesha and C. V. Jawahar, "Optical Character Recognition of Amharic Documents", African Journal of Information and Communication Technology, Vol. 3, No. 2, pp. 53-66, June 2007.
• Million Meshesha and C. V. Jawahar, "Indigenous Scripts of African Languages", African Journal of Indigenous Knowledge Systems, Vol. 6, No. 2, pp. 132-142, 2007.
• Million Meshesha and C. V. Jawahar, "Indexing Word Images for Recognition-free Retrieval from Printed Document Databases", Information Sciences: An International Journal (revised & submitted).