Tesseract OCR Engine Svetlin Nakov and Veselin Kolev BASD (Bulgarian Association of Software Developers) www.devbg.org
Hot News! • Microsoft Corporation just announced its strategic partnership with OpenFest • OpenFest is upgrading to Windows 7 and MS SQL Server 2008
What is OCR? • Stands for Optical Character Recognition • Extracts the text from a given image
What is OCR? (2) • Invented by Gustav Tauschek • Tauschek obtained a patent on OCR • 1929 in Germany • 1935 in USA • Tauschek’s machine • Was a mechanical device • Used templates, a light source and a photodetector • When the template and the character lined up for an exact match and light was directed towards them, no light reached the photodetector
What is OCR? (3) • OCR predates electronic computers!
Project Tesseract • History of Tesseract • Open source OCR engine • Developed by HP between 1985 and 1995 • Never used in an HP product • Rated highly at The Fourth Annual Test of OCR Accuracy in 1995 • In 2005 HP transferred Tesseract to the ISRI and released it as open source • ISRI == Information Science Research Institute • The development is currently led by Google
Project Tesseract (2) • Tesseract is an OCR Engine and is NOT a complete OCR program • Originally intended to serve as a component part of other programs • Works from the command line • Has no page layout analysis (will have soon) • Has no output formatting • Has no GUI
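Because Tesseract works from the command line, another program can drive it as a subprocess. The following is a minimal Python sketch, assuming the `tesseract` binary is on the PATH and the usual `tesseract <image> <outputbase> [-l <lang>]` invocation, which writes the result to `<outputbase>.txt`. The function names and the file names `page.tif` and `page` are hypothetical, chosen only for illustration.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract <image> <outputbase> [-l <lang>]."""
    command = ["tesseract", image_path, output_base]
    if lang is not None:
        command += ["-l", lang]
    return command


def run_ocr(image_path, output_base, lang=None):
    """Run Tesseract and return the recognized text.

    Assumes the `tesseract` binary is installed and on the PATH.
    Tesseract writes its result to <output_base>.txt.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as result:
        return result.read()


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_tesseract_command("page.tif", "page", lang="eng"))
```

Keeping the command construction separate from the subprocess call makes the wrapper easy to check without having Tesseract installed.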
Tesseract Versions • Stable build – version 2.04 • Has some documentation • Can be easily trained on a new language • Has memory leaks • Development version – 3.0 (unstable) • Not documented, unstable • Language files are not compatible (need special conversion)
Demo Downloading, Compiling and Running Tesseract (Latest Version)
How Does Tesseract Work? • Adaptive thresholding on the input image • Analyze connected components in the binary image • Find text lines and words • First pass of the recognition process • Attempts to recognize each word in turn • Satisfactory words are passed to the adaptive trainer • Lessons learned are employed in a second pass • Used for words not recognized satisfactorily in the first pass • Produce the output text
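The first two steps of the pipeline above can be illustrated with a small, pure-Python sketch. This is not Tesseract's actual implementation: it assumes a simple local-mean threshold and a breadth-first search over 8-connected pixels, and the function names, the `radius` and `offset` parameters and the sample image are all hypothetical.

```python
from collections import deque


def adaptive_threshold(gray, radius=2, offset=10):
    """Binarize a grayscale image given as a list of rows (values 0..255).

    A pixel becomes foreground (1) when it is darker than the mean of its
    (2*radius+1) x (2*radius+1) neighbourhood minus `offset`, else 0.
    """
    height, width = len(gray), len(gray[0])
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = count = 0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += gray[ny][nx]
                    count += 1
            if gray[y][x] < total / count - offset:
                binary[y][x] = 1
    return binary


def connected_components(binary):
    """Return one bounding box (left, top, right, bottom) per 8-connected blob."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                # Visit the 8 neighbours (clipped to the image borders).
                for ny in range(max(0, cy - 1), min(height, cy + 2)):
                    for nx in range(max(0, cx - 1), min(width, cx + 2)):
                        if binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


# Hypothetical 5x8 sample: two dark blobs on a light background.
SAMPLE = [
    [200, 200, 200, 200, 200, 200, 200, 200],
    [200, 30, 30, 200, 200, 200, 200, 200],
    [200, 30, 30, 200, 200, 40, 200, 200],
    [200, 200, 200, 200, 200, 40, 200, 200],
    [200, 200, 200, 200, 200, 200, 200, 200],
]

if __name__ == "__main__":
    print(connected_components(adaptive_threshold(SAMPLE)))
```

A local-mean threshold only marks pixels that are darker than their surroundings, so the window has to be larger than the stroke width; a real engine has to handle this far more carefully.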
Training Tesseract • Prepare training images and .box files • Files: lang.tif and lang.box • 2.04 supports only uncompressed TIFFs • .box files contain characters with coordinates • Extract the character features: tesseract lang.tif junk nobatch box.train • This produces lang.tr • Perform character clustering: mftraining lang.tr, then cntraining lang.tr
Training Tesseract (2) • Compute the character set properties: unicharset_extractor lang.box • isLetter, isDigit, isUpper, isPunctuation, … • Unicode provides this information • Train language dictionaries • List of all words in the target language: wordlist2dawg all-words.txt lang.word-dawg • List of the most frequent words: wordlist2dawg freq-words.txt lang.freq-dawg
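To make "characters with coordinates" concrete, here is a minimal Python sketch that reads .box lines. It assumes each line has the form `<char> <left> <bottom> <right> <top>`, optionally followed by a page number as in newer versions, with the origin at the bottom-left corner of the image. The `Box` type, the function name and the sample lines are hypothetical.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines) -> List[Box]:
    """Parse .box lines of the assumed form '<char> <left> <bottom> <right> <top>'.

    Any extra trailing field (such as a page number) is ignored.
    Blank or too-short lines are skipped.
    """
    boxes = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        left, bottom, right, top = (int(value) for value in fields[1:5])
        boxes.append(Box(fields[0], left, bottom, right, top))
    return boxes


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    for box in parse_box_lines(["T 10 20 30 60", "e 32 20 48 45 0"]):
        print(box.char, box.right - box.left, box.top - box.bottom)
```

A parser like this is handy for checking a hand-edited .box file before running the training commands.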
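The claim that Unicode provides the character properties can be sketched with Python's standard `unicodedata` module. This illustrates the idea only, not what `unicharset_extractor` does internally; the function name and the mapping from Unicode general categories to the four property names are assumptions.

```python
import unicodedata


def char_properties(ch):
    """Derive the four properties from the Unicode general category."""
    category = unicodedata.category(ch)  # e.g. "Lu", "Ll", "Nd", "Po"
    return {
        "isLetter": category.startswith("L"),
        "isDigit": category == "Nd",
        "isUpper": category == "Lu",
        "isPunctuation": category.startswith("P"),
    }


if __name__ == "__main__":
    for ch in "Бя7,":
        print(ch, char_properties(ch))
```

Because the lookup is driven by Unicode, the same code covers Cyrillic and Latin letters without any language-specific tables.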
Demo Training Tesseract for Bulgarian and English (Bulgarian for IT Professionals)
Other OCR Engines • OCRopus • Open source document analysis and OCR system • Also funded by Google • Provides much of the layout analysis functionality missing from Tesseract • Can use engines other than Tesseract • http://code.google.com/p/ocropus/
Other OCR Engines (2) • ABBYY FineReader OCR • Supports a large number of features • Known for its high accuracy • Commercial • Microsoft Office Document Imaging (MODI) • Supports editing documents scanned by Microsoft Office Document Scanning • First introduced in MS Office XP • Commercial
Commercial OCR vs. Tesseract • Commercial OCR • 100+ languages • Accuracy is good now • Sophisticated app with complex UI • Works on complex magazine pages • Windows mostly • Costs $130-$500 • Tesseract • 6 languages • Accuracy was good in 1995 • No UI yet • Page layout analysis coming soon • Runs on Linux, Mac, Windows and more • Open source – Free!
Tesseract Future • Page layout analysis • More languages • Improve accuracy • Add a UI • Support for connected scripts (like Arabic)
Links • For more information see: • http://code.google.com/p/tesseract-ocr/ • http://en.wikipedia.org/wiki/Optical_character_recognition • http://tesseract-ocr.repairfaq.org/downloads/tesseract_overview.pdf • Speakers • http://nakov.com/blog • http://veskokolev.blogspot.com
Tesseract OCR Questions?