Generating Synthetic Datasets for Evaluating OCR Performance in Text/Graphics Documents

Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR
Mathieu Delalandre, CVC, Barcelona, Spain
DAG Meeting, CVC, Barcelona, Spain, Wednesday 19th of November 2008

Introduction
• Text/graphics documents
Text/graphics documents are used in a variety of fields such as geography, engineering and the social sciences. Some examples are:
# architectural drawing
# utility map
# geographic map
A huge amount of data exists, coming from two main sources:
# web images
# digitized documents (modern and old)

Introduction
• OCR of text/graphics documents
Character recognition systems working with text/graphics documents
# First related work [Brown’1979]
# More than 50 references on this topic today [Fletcher’1988] [Zenzo’1992] [Goto’1999] [Adam’2000] …
Processing chain (each stage is marked on the slide as either general to any document or specific to text/graphics documents)
# Text/graphics separation: full image → image of text-lines
# Text-line detection: image of text-lines → images of single text-lines
# Character segmentation: image of a single text-line → images of single characters
# Character recognition: image of a single character → ASCII
Problematics
- letter segmentation
- multi-font recognition
- scale variation
- text/graphics separation
- rotation variation
- text-line detection
- no reading order
- no dictionary

Introduction
• About performance evaluation
[Figure: performance evaluation workflow linking documents, system, results, groundtruthing, groundtruth, characterisation and performance evaluation]
The case of general OCR [Kanungo’1999]
# More than 40 references on the topic [Kanungo’1999]
# Several standard databases exist (NIST, MARS, CD-ROM English, …)
# Annual evaluation reports [Rice’1992] [Rice’1993]
Black-box evaluation: the evaluation considers the OCR system as an indivisible unit and evaluates it from its final results (i.e. OCR output vs. ASCII transcription of the text, compared using string edit distances).
White-box evaluation: the evaluation aims to characterize the performance of individual sub-modules of the OCR system (deskewing, letter segmentation, block identification, character recognition, etc.).
The case of text/graphics document OCR [Wenyin’1997]
# Only 1 reference on the topic
# No standard databases
# No complete evaluation done through 20 years of research
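The black-box comparison described on this slide (OCR output vs. ASCII transcription, scored with a string edit distance) can be sketched as follows. This is a minimal illustration, not the protocol of the cited evaluations: the function names and the accuracy formula are assumptions chosen for the example, and the distance used is the standard Levenshtein distance.

```python
def edit_distance(ocr_output: str, transcription: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions turning ocr_output into transcription."""
    previous = list(range(len(transcription) + 1))
    for i, a in enumerate(ocr_output, start=1):
        current = [i]
        for j, b in enumerate(transcription, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[j] + 1,          # delete a
                               current[j - 1] + 1,       # insert b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_accuracy(ocr_output: str, transcription: str) -> float:
    """Hypothetical score: 1 - distance / groundtruth length.
    It can become negative when the OCR output is very noisy."""
    if not transcription:
        return 1.0 if not ocr_output else 0.0
    return 1.0 - edit_distance(ocr_output, transcription) / len(transcription)


if __name__ == "__main__":
    # One substitution (0 instead of O) in a 7-character label.
    print(edit_distance("R0AD 66", "ROAD 66"))
    print(character_accuracy("R0AD 66", "ROAD 66"))
```

A white-box evaluation would instead score each sub-module separately against its own groundtruth, which is why the groundtruth defined in the following slides records more than the transcription.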

Introduction
• Scope of the proposed work
Performance evaluation of text/graphics document OCR
# white-box evaluation
# groundtruthing step
# datasets for text-line detection and character recognition
# the generation algorithms are “simple”; the talk focuses mainly on the contributions concerning the generation settings

Plan
• Groundtruth definition
• Datasets for character recognition
• Datasets for text-line detection
• In progress datasets

Groundtruth definition
• Character level
# ASCII code
# font (name, size, style)
# location point
# oriented bounding box
# orientation (θ)
# scale
• Text level
# first location point
# groundtruth of characters
# characters/word positions
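The two groundtruth levels listed above map naturally onto two record types. The sketch below is one possible encoding, not the format used by the author: all class and field names are hypothetical, and representing "characters/word positions" as the index of the first character of each word is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class CharGroundtruth:
    """Character-level groundtruth (field names are illustrative)."""
    ascii_code: int
    font_name: str
    font_size: int
    font_style: str
    location: Point                                   # location point
    bounding_box: Tuple[Point, Point, Point, Point]   # oriented box corners
    orientation: float                                # theta, in degrees
    scale: float


@dataclass
class TextGroundtruth:
    """Text-level groundtruth: a first location point plus its characters."""
    first_location: Point
    characters: List[CharGroundtruth] = field(default_factory=list)
    # Assumption: word positions are stored as the index of each word's
    # first character within `characters`.
    word_starts: List[int] = field(default_factory=list)

    def text(self) -> str:
        return "".join(chr(c.ascii_code) for c in self.characters)

    def words(self) -> List[str]:
        full = self.text()
        bounds = self.word_starts + [len(full)]
        return [full[a:b] for a, b in zip(bounds, bounds[1:])]
```

Keeping the font, orientation and scale on every character record lets a white-box evaluation break recognition errors down by each of these factors.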

Datasets for character recognition (1/2)
• Problematics
How to generate single character images? How many classes? Which image resolution? What size for the datasets? Which fonts? Etc.
• Published experiments
• Main conclusions
# The real sizes of characters can only be estimated.
# The confusion problem (e.g. 6 vs 9) is still not well defined; the 62-class problem (a-z A-Z 0-9) is the main goal.
# It is not possible to fix a standard size for the training/test sets, since this information is still not well defined; several thousand images are mandatory for training.
# The impact of fonts has been little studied and should be taken into account in the evaluation.
# Invariance to rotation and scaling is the final goal; the two have rarely been studied independently.

Datasets for character recognition (2/2)
• Datasets
Geometry invariance, Font adequacy, Font scalability
15 000 + 30 000 + 45 000 + 60 000
• Generation setting
• Generation algorithm
font manager, centering, scale and rotation processes
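The centering, scale and rotation processes named above can be illustrated without any rendering library by computing the oriented bounding box that ends up in the groundtruth, and by enumerating the generation settings. This is a sketch under assumptions: the function names are hypothetical, the glyph is reduced to a width × height box centered on its location point, and the rotation is counter-clockwise in a y-up frame. The 62 classes (a-z, A-Z, 0-9) come from the previous slide.

```python
import math
import string
from itertools import product

# The 62-class problem: a-z, A-Z, 0-9.
CLASSES = string.ascii_lowercase + string.ascii_uppercase + string.digits


def oriented_box(center, width, height, theta_deg, scale):
    """Corners of a width x height box centered on `center`,
    scaled by `scale` and then rotated by `theta_deg` degrees."""
    cx, cy = center
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    half_w, half_h = scale * width / 2.0, scale * height / 2.0
    corners = []
    for dx, dy in ((-half_w, -half_h), (half_w, -half_h),
                   (half_w, half_h), (-half_w, half_h)):
        # Standard 2D rotation about the center.
        corners.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return corners


def generation_settings(fonts, angles, scales):
    """One (character, font, angle, scale) tuple per image to generate."""
    return product(CLASSES, fonts, angles, scales)


if __name__ == "__main__":
    # Hypothetical setting: 2 fonts, 3 orientations, 2 scales.
    settings = list(generation_settings(["font-a", "font-b"], [0, 45, 90], [1.0, 2.0]))
    print(len(settings))  # 62 * 2 * 3 * 2 images
    print(oriented_box((0.0, 0.0), 2.0, 4.0, 90.0, 1.0))
```

Enumerating the settings as a Cartesian product makes the dataset size explicit before any image is rendered, which helps when sizing training and test sets.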

Datasets for text-line detection (1/2)
• Problematics
How to generate single character images? How many words per image? Which image size? What size for the datasets? How many fonts? Etc.
• Main conclusions
# The use cases are heterogeneous and the sizes and resolutions of the images are rarely reported, so the text density is difficult to estimate; images with significant text content are preferred.
# Depending on the use case, not all methods work on curved text; a combination of curved and straight text is necessary.
# All the methods use context to extract the text-lines (i.e. font type, character size, line model). The size of characters can vary a lot, while the number of fonts is generally small (fewer than ten).

Datasets for text-line detection (2/2)
• Generation algorithm
• Datasets
• Setting
# Text-line density
# Font context
# Size context
[Figure: the insert algorithm, shown in two steps (step 1, step 2): box B1 ejects box B2 by (dx, dy); labels l1, l2, l3, d, θ]
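The figure only states that a box B1 "ejects" a box B2 by (dx, dy). One way such an insertion step could work is sketched below. Everything beyond that statement is an assumption made for illustration: boxes are axis-aligned, the ejection is the smallest single-axis translation that clears B1, and an insertion that leaves the page or exceeds a retry budget is rejected. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def overlap(b1: Box, b2: Box) -> bool:
    """True when the two boxes share interior area (touching edges do not count)."""
    return b1[0] < b2[2] and b2[0] < b1[2] and b1[1] < b2[3] and b2[1] < b1[3]


def ejection(b1: Box, b2: Box) -> Tuple[float, float]:
    """Smallest single-axis (dx, dy) that moves b2 out of b1."""
    if not overlap(b1, b2):
        return (0.0, 0.0)
    right = b1[2] - b2[0]  # positive: push b2 towards larger x
    left = b1[0] - b2[2]   # negative: push b2 towards smaller x
    down = b1[3] - b2[1]   # positive: push b2 towards larger y
    up = b1[1] - b2[3]     # negative: push b2 towards smaller y
    dx = right if abs(right) <= abs(left) else left
    dy = down if abs(down) <= abs(up) else up
    return (dx, 0.0) if abs(dx) <= abs(dy) else (0.0, dy)


def insert(placed: List[Box], new: Box, page: Box,
           max_tries: int = 50) -> Optional[Box]:
    """Insert `new` among `placed`, ejecting it from any box it collides with.
    Returns the final box, or None if it cannot be placed inside `page`."""
    for _ in range(max_tries):
        hit = next((b for b in placed if overlap(b, new)), None)
        if hit is None:
            inside = (page[0] <= new[0] and page[1] <= new[1]
                      and new[2] <= page[2] and new[3] <= page[3])
            if inside:
                placed.append(new)
                return new
            return None
        dx, dy = ejection(hit, new)
        new = (new[0] + dx, new[1] + dy, new[2] + dx, new[3] + dy)
    return None
```

With this scheme the text-line density of a generated image could be controlled by the number of successful insertions per page; rotated or curved text-lines would need an oriented-box overlap test instead of the axis-aligned one used here.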

In progress datasets

Conclusions
• Conclusions
# work in progress …
# character recognition datasets are ready
# bags of words are still being packaged, but will be ready soon
• Perspectives
# middle term: experiments with standard feature extraction methods [Roy’2008] [Valveny’2007]
# long term: experiments with bags of words and text/graphics documents [Delalandre’2007] [Wenyin’1997]

References (1/2)
• R. Brown; M. Lybanon & L.K. Gronmeyer. Recognition of Handprinted Characters for Automated Cartography: A Progress Report. Proceedings of the SPIE, vol (205), 1979.
• L.A. Fletcher & R. Kasturi. A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images. Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol (10), pp. 910-918, 1988.
• S.D. Zenzo; M.D. Buno; M. Meucci & A. Spirito. Optical recognition of hand-printed characters of any size, position, and orientation. IBM Journal of Research and Development, vol (36), pp. 487-501, 1992.
• H. Goto & H. Aso. Extracting curved text lines using local linearity of the text line. International Journal on Document Analysis and Recognition (IJDAR), vol (2), pp. 111-119, 1999.
• S. Adam; J.M. Ogier; C. Cariou; R. Mullot; J. Labiche & J. Gardes. Symbol and Character Recognition: Application to Engineering Drawings. International Journal on Document Analysis and Recognition (IJDAR), vol (3), pp. 89-101, 2000.
• T. Kanungo; G.A. Marton & O. Bulbu. Performance evaluation of two Arabic OCR products. Workshop on Advances in Computer-Assisted Recognition (AIPR), SPIE Proceedings, vol (3584), pp. 76-83, 1999.
• S.V. Rice; J. Kanai & T.A. Nartker. A Report on the Accuracy of OCR Devices. Information Science Research Institute, University of Nevada, USA, 1992.
• S.V. Rice; J. Kanai & T.A. Nartker. An Evaluation of OCR Accuracy. Information Science Research Institute, University of Nevada, USA, 1993.
• L. Wenyin & D. Dori. A Protocol for Performance Evaluation of Line Detection Algorithms. Machine Vision and Applications, vol (9), pp. 240-250, 1997.
• R.M. Brown. Handprinted Symbol Recognition System: A Very High Performance Approach To Pattern Analysis Of Free-form Symbols. Conference Southeastcon, pp. 5-8, 1981.
• H. Takahashi. Neural network architectures for rotated character recognition. International Conference on Pattern Recognition (ICPR), pp. 623-626, 1992.
• Q. Chen. Evaluation of OCR algorithms for images with different spatial resolutions and noises. School of Information Technology and Engineering, University of Ottawa, Canada, 2003.
• C. Choisy; H. Cecotti & A. Belaid. Character Rotation Absorption Using a Dynamic Neural Network Topology: Comparison With Invariant Features. International Conference on Enterprise Information Systems (ICEIS), pp. 90-97, 2004.

References (2/2)
• H. Hase; T. Shinokawa; S. Tokai & C.Y. Suen. A robust method of recognizing multi-font rotated characters. International Conference on Pattern Recognition (ICPR), vol (2), pp. 363-366, 2004.
• U. Pal; F. Kimura; K. Roy & T. Pal. Recognition of English Multi-oriented Characters. International Conference on Pattern Recognition (ICPR), vol (2), pp. 873-876, 2006.
• P.P. Roy; U. Pal & J. Llados. Multi-oriented character recognition from graphical documents. International Conference on Cognition and Recognition (ICCR), pp. 30-35, 2008.
• U. Pal & P.P. Roy. Multi-oriented and curved text lines extraction from Indian documents. IEEE Transactions on Systems, Man and Cybernetics - Part B, vol (34), pp. 1676-1684, 2004.
• P.K. Loo & C.L. Tan. Word and Sentence Extraction Using Irregular Pyramid. Workshop on Document Analysis System (DAS), Lecture Notes in Computer Science (LNCS), vol (2423), pp. 307-318, 2002.
• H.C. Park; S.Y. Ok; Y.J. Yu & H.G. Cho. Word Extraction in Text/Graphic Mixed Image Using 3-Dimensional Graph Model. International Journal on Document Analysis and Recognition (IJDAR), vol (4), pp. 115-130, 2001.
• C.L. Tan & P.O. Ng. Text extraction using pyramid. Pattern Recognition (PR), vol (31), pp. 63-72, 1998.
• S. He; N. Abe & C.L. Tan. A clustering-based approach to the separation of text strings from mixed text/graphics documents. International Conference on Pattern Recognition (ICPR), pp. 706-710, 1996.
• M. Burge & G. Monagan. Extracting Words and Multi Part Symbols in Graphics Rich Documents. International Conference on Image Analysis and Processing (ICIAP), 1995.
• M. Deseilligny; H. Le Men & G. Stamon. Characters string recognition on maps, a method for high level reconstruction. International Conference on Document Analysis and Recognition (ICDAR), pp. 249-252, 1995.
• E. Valveny; S. Tabbone; O. Ramos & E. Philippot. Performance Characterization of Shape Descriptors for Symbol Representation. Workshop on Graphics Recognition (GREC), 2007.
• M. Delalandre; T. Pridmore; E. Valveny; E. Trupin & H. Locteau. Building Synthetic Graphical Documents for Performance Evaluation. Workshop on Graphics Recognition (GREC), Lecture Notes in Computer Science (LNCS), vol (5046), pp. 288-298, 2008.
