Comparison of Binarization algorithm in Indian Language OCR

1. Comparison of Binarization algorithm in Indian Language OCR, by Tushar Patnaik, Shalu Gupta, Deepak Arya

2. Content: Introduction; Binarization Algorithms; Comparison of Binarization Algorithms; Evaluation Approach and Results; Conclusion; References

3. Introduction: Binarization converts each pixel of an image into a single bit, assigning the value '1' or '0' by comparing the pixel's grey level with a threshold: if the grey level is greater than or equal to the threshold, the pixel becomes '1'; otherwise it becomes '0'. Binarization is therefore image thresholding: it turns a grey-level image into a binary image and is a simple but effective tool for separating objects from the background. For an input image $f(x, y)$ and a threshold $T$, the output is $g(x, y) = 1$ if $f(x, y) \geq T$, and $g(x, y) = 0$ otherwise. Selecting the optimum binarization algorithm has proved difficult for documents that vary in contrast, illumination, and text quality.
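
The thresholding rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `binarize`, the 3x3 sample image, and the threshold of 128 are assumptions made for the example and do not come from the presentation.

```python
def binarize(image, threshold):
    """Return a binary image: 1 where the grey level is >= threshold, else 0."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]


# Hypothetical 3x3 grey-level image (values 0-255) and an assumed threshold T = 128.
grey = [
    [12, 200, 130],
    [128, 90, 255],
    [60, 127, 180],
]
binary = binarize(grey, 128)
```

With a single fixed threshold, every pixel is judged against the same value, which is why uneven illumination or contrast makes the choice of T difficult.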

4. Binarization Algorithms Used: The following algorithms are used in the OCR project: Otsu, Adaptive, and Sauvola.

5. Otsu Algorithm: Otsu's method is simple and effective. It calculates a global threshold by assuming two classes of pixels, foreground and background, and choosing the threshold that maximizes the inter-class variance. Steps: (1) Compute the histogram and the probability of each intensity level. (2) Set up the initial class probability $\omega_1(0)$ and class mean $\mu_1(0)$. (3) Update the class probabilities and class means for each possible threshold $t = 1, \dots,$ maximum intensity. (4) The desired threshold is the $t$ that maximizes the inter-class variance $\sigma_b^2(t) = \omega_1(t)\,\omega_2(t)\,[\mu_1(t) - \mu_2(t)]^2$.
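
The steps on this slide might be sketched as follows. This is a minimal sketch, not the project's implementation. It assumes the image is supplied as a flat list of integer grey levels in the range 0 to `levels - 1`, that class 1 holds the levels below `t` and class 2 the levels at or above `t`, and that ties are resolved in favour of the smallest `t`. The function name and the sample data are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance."""
    total = len(pixels)
    # Histogram and probability of each intensity level.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    prob = [h / total for h in hist]
    global_mean = sum(i * prob[i] for i in range(levels))

    best_t, best_var = 0, -1.0
    w1 = 0.0   # class probability of class 1 (levels below t)
    cum = 0.0  # running sum of i * p(i) over the levels below t
    for t in range(1, levels):
        # Move level t - 1 into class 1, then update both classes.
        w1 += prob[t - 1]
        cum += (t - 1) * prob[t - 1]
        w2 = 1.0 - w1
        if w1 <= 0.0 or w2 <= 0.0:
            continue  # one class is empty, so the variance is undefined
        mu1 = cum / w1
        mu2 = (global_mean - cum) / w2
        var_between = w1 * w2 * (mu1 - mu2) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


# Hypothetical bimodal data: dark text pixels (10) and bright background (200).
sample = [10] * 50 + [200] * 50
t = otsu_threshold(sample)
```

Any threshold that falls between the two grey levels separates this sample perfectly, so the returned `t` lies above 10 and at or below 200.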

6. Adaptive Algorithm: The adaptive binarization technique is used to pre-process document images that contain noise and other distortions introduced during scanning. It extends Otsu's method: the threshold value is first calculated using the Otsu algorithm. The image is divided into windows of size N×N; the window size depends on the thickness of the characters, with a smaller window chosen for thinner characters. A non-linear quadratic filter is applied to each window to fine-tune the threshold value and to reduce noise. The optimum threshold is decided after filtering, and based on it each pixel in the image is assigned 0 or 1.

7. Sauvola Algorithm: Sauvola binarization converts a grey-tone image into a two-tone image. Global thresholding does not work well on poor-quality images, so Sauvola's technique uses window-based local thresholding: it calculates a local threshold for each pixel at (x, y) from the intensities of the pixels within a small window W(x, y). The threshold T(x, y) is computed with Sauvola's formula: $T(x, y) = \mathrm{Int}\left[X \cdot \left(1 + k \cdot \left(\frac{s}{R} - 1\right)\right)\right]$, where X is the mean of the grey values in the window W(x, y), s is the standard deviation of those grey values, R is the dynamic range of the standard deviation, and k is a constant (usually 0.5, but it may lie anywhere in the range 0 to 1).
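
Sauvola's formula could be applied pixel by pixel roughly as follows. This is a deliberately slow, minimal sketch rather than a production implementation. The 3x3 default window, the border clipping, and R = 128 (a value commonly used for 8-bit images) are assumptions and are not stated in the presentation. The function name and the sample image are hypothetical.

```python
import math


def sauvola_binarize(image, window=3, k=0.5, R=128):
    """Binarize with a per-pixel threshold T = int(mean * (1 + k * (std / R - 1)))."""
    rows, cols = len(image), len(image[0])
    half = window // 2
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Collect the window W(x, y), clipped at the image borders.
            vals = [image[j][i]
                    for j in range(max(0, y - half), min(rows, y + half + 1))
                    for i in range(max(0, x - half), min(cols, x + half + 1))]
            mean = sum(vals) / len(vals)
            std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            threshold = int(mean * (1 + k * (std / R - 1)))
            out[y][x] = 1 if image[y][x] >= threshold else 0
    return out


# Hypothetical 5x5 page: bright background (200) with one dark pixel (20) in the centre.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 20
result = sauvola_binarize(page)
```

Because the threshold follows the local mean and standard deviation, the dark centre pixel is mapped to 0 while its bright neighbours stay at 1.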

8. Comparison of the Binarization Methods: A two-phase method is proposed to compare these algorithms: (1) calculate the SNR; (2) calculate the OCR errors.

9. Evaluation Approach: (1) Choose smooth images. (2) Add noise to the images. (3) Binarize the images. (4) Calculate the SNR. (5) Calculate the OCR errors. (6) Conclusion.

10. Choose smooth images: Smooth images are two-tone images (pixel values 0 or 1) with no skew and no noise. One hundred smooth images were tested with the binarization algorithms. All of the images were taken from the OCR project corpus.

11. Add noise to images

13. Output of Binarization Algorithms Gaussian Noise Output

14. Output of Binarization Algorithms Poisson Noise Output

15. Output of Binarization Algorithms Speckle Noise Output

16. Output of Binarization Algorithms Localvar Noise Output

17. Calculate SNR: Ideally, an evaluation of a binarization algorithm should decide, for each pixel, whether it ended up with the right colour (black or white) after binarization. Every pixel value of the binarization output is therefore compared with the corresponding pixel in the original smooth image. Let x(i, j) be the value of the pixel in the i-th row and j-th column of the original smooth image, and y(i, j) the value of the corresponding pixel in the output (binarized) image. We first calculate the local error for the image, $e(i, j) = x(i, j) - y(i, j)$. If a pixel has the right colour, its local error is 0; otherwise the error has magnitude 255.

18. Calculate SNR (cont.): After calculating the local error, the next step is to find the SNR, defined as the ratio of average signal power to average noise power. For an M×N image: $\mathrm{SNR\,(dB)} = 10 \log_{10} \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} x(i, j)^2}{\sum_{i=1}^{M} \sum_{j=1}^{N} \left(x(i, j) - y(i, j)\right)^2}$
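
The local error and the SNR might be computed as in the sketch below. It assumes that both images are lists of rows with pixel values 0 or 255, that signal power is the sum of squared pixel values of the original image, and that the original contains at least one non-zero pixel. The function names and the 2x2 sample images are hypothetical.

```python
import math


def local_error(original, binarized):
    """Per-pixel error e(i, j) = x(i, j) - y(i, j)."""
    return [[x - y for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(original, binarized)]


def snr_db(original, binarized):
    """SNR in dB: 10 * log10(sum of x^2 / sum of (x - y)^2)."""
    signal = sum(x * x for row in original for x in row)
    noise = sum(e * e for row in local_error(original, binarized) for e in row)
    if noise == 0:
        return math.inf  # identical images: there is no noise power
    return 10 * math.log10(signal / noise)


# Hypothetical 2x2 example: one of the three white pixels is wrongly turned black.
clean = [[255, 0], [255, 255]]
output = [[255, 0], [0, 255]]
errors = local_error(clean, output)
snr = snr_db(clean, output)
```

A higher SNR means fewer pixels changed colour, and a perfect reconstruction is reported as infinite SNR rather than raising a division error.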

19. SNR comparison with different binarization algorithms

20. SNR comparison with different binarization algorithms

21. Calculate OCR Errors: SNR alone cannot establish which binarization algorithm is best. To assess each algorithm more accurately, OCR output is generated for all the binarized images and evaluated against ground truth data. The ground truth stores information about document image components at different levels, such as block/paragraph, line, and word. The error rate of the OCR output with respect to the ground truth is calculated using the Levenshtein distance, which measures the difference between two texts as the number of character-level insertions, deletions, and substitutions.
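
A character-level Levenshtein distance can be sketched with the standard dynamic-programming recurrence. Dividing the distance by the ground-truth length to obtain an error rate is an assumption made here for illustration; the presentation does not state how the rate is normalized. The function names and sample strings are hypothetical.

```python
def levenshtein(truth, output):
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(output) + 1))
    for i, a in enumerate(truth, start=1):
        curr = [i]
        for j, b in enumerate(output, start=1):
            curr.append(min(prev[j] + 1,              # delete a from truth
                            curr[j - 1] + 1,          # insert b into truth
                            prev[j - 1] + (a != b)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


def error_rate(truth, output):
    """Assumed normalization: edit distance divided by ground-truth length."""
    return levenshtein(truth, output) / len(truth)


# Hypothetical ground-truth word and OCR output differing by one substitution.
distance = levenshtein("binarize", "binarise")
rate = error_rate("binarize", "binarise")
```

Python strings are compared code point by code point, so in Indian scripts a base character and its combining marks count as separate characters in this sketch.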

22. Calculate OCR Errors (cont.): A browser window has been created in which the ground truth data and the OCR output are shown with substituted, inserted, and deleted characters highlighted in different colours.

23. Compare the OCR Results

24. Compare the OCR Results (cont.)

25. Compare the OCR Results (cont.)

26. Conclusion: A technique is proposed for evaluating binarization algorithms. Three existing binarization algorithms (Otsu, Adaptive, and Sauvola) were tested in experiments on 100 document images. Overall, the Adaptive algorithm gives the highest SNR and the fewest OCR errors. It works better on document images with Gaussian, Poisson, and localvar noise, where it produces fewer errors than Otsu and Sauvola. The Sauvola algorithm works better on document images with speckle noise.

27. Acknowledgement

28. References
Jiang Duan, Mengyang Zhang, Qing Li, "A Multi-stage Adaptive Binarization Scheme for Document Images," 2009 International Joint Conference on Computational Sciences and Optimization, vol. 1, pp. 867-869, 2009.
Liju Dong, Ge Yu, "An Optimization-Based Approach to Image Binarization," Fourth International Conference on Computer and Information Technology (CIT'04), pp. 165-170, 2004.
J. He, Q. D. M. Do, A. C. Downton, J. H. Kim, "A Comparison of Binarization Methods for Historical Archive Documents," Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pp. 538-542, 2005.
Ergina Kavallieratou, Stamatatos Stathis, "Adaptive Binarization of Historical Document Images," 18th International Conference on Pattern Recognition (ICPR'06), vol. 3, pp. 742-745, 2006.
Carlos A. B. Mello, Adriano L. I. Oliveira, Ángel Sánchez, "Historical Document Image Binarization," VISAPP (1), pp. 108-113, 2008.
B. Gatos, I. Pratikakis, S. J. Perantonis, "Efficient Binarization of Historical and Degraded Document Images," The Eighth IAPR International Workshop on Document Analysis Systems, pp. 447-454, 2008.
Pavlos Stathis, Ergina Kavallieratou, Nikos Papamarkos, "An Evaluation Survey of Binarization Algorithms on Historical Documents," 19th International Conference on Pattern Recognition, pp. 1-4, Dec. 2008.
