Farsi Handwritten Word Recognition Using Continuous Hidden Markov Models and Structural Features


M. M. Haji, CSE Department, Shiraz University

January 2005

- Introduction
- Preprocessing
  - Text Segmentation
  - Document Image Binarization
  - Skew and Slant Correction
  - Skeletonization
- Structural Feature Extraction
- Multi-CHMM Recognition
- Conclusion and Discussion

- Handwritten word recognition is one of the most challenging problems in Artificial Intelligence.
- Words are rather complex patterns, with much variability in handwriting style.
- The performance of handwriting recognition systems is still far from that of humans, in terms of both accuracy and speed.

- Previous Research:
- Dehghan et al. (2001). "Handwritten Farsi (Arabic) Word Recognition: A Holistic Approach Using Discrete HMM", Pattern Recognition, vol. 34, pp. 1057-1065.
- Dehghan et al. (2001). "Unconstrained Farsi Handwritten Word Recognition Using Fuzzy Vector Quantization and Hidden Markov Models", Pattern Recognition Letters, vol. 22, pp. 209-214.
- A maximum recognition rate of 65% for a 198-word lexicon!

- Holistic Strategies
- Analytical Strategies:
  - Explicit Segmentation
  - Implicit Segmentation

- Recognition on the whole representation of a word.
- No attempt to segment a word into its individual characters.
- Necessary to segment the text lines into words.
- Intra-word space is sometimes greater than inter-word space!

- Using a lexicon, a list of the allowed interpretations of the input word image.
- The error rate increases with the lexicon size.
- Successful for postal address recognition or bank check reading where lexicon is limited and small.

Explicit Segmentation:

- Isolating single letters, which are then recognized separately, usually by neural networks.
- Successful for English machine-printed text.
- Arabic/Farsi texts, whether machine-printed or handwritten, are cursive.
- Cursiveness and character overlapping are the main challenges.

Implicit Segmentation:

- Converting the text (line or word) image into a sequence of small size units.
- Recognition at this intermediate level, rather than at the word or character level, usually with Hidden Markov Models (HMMs).
- Each unit may be a part of a letter, so a number of successive units can belong to a single letter.

- Detecting text regions in an image (removing non-text components).
- Applications in document image analysis and understanding, image compression and content-based image retrieval.
- Document image binarization and skew correction algorithms usually require a predominantly textual area in order to estimate text characteristics accurately.
- Numerous methods have been proposed (an extensive literature).
- There is no general method to detect arbitrary text strings.
- In the most general form, detection must be:
- insensitive to noise, background model and lighting conditions and,
- invariant to text language, color, size, font and orientation, even within the same image!

- We believe that a text segmentation algorithm should have adaptation and learning capability.
- A learner usually needs much time and training data to achieve satisfactory results, which restricts its practicality.
- A simple procedure was developed for generating training data from manually segmented images.
- A Naive Bayes Classifier (NBC) was utilized, which is fast in both the training and application phases.
- Surprisingly excellent results were obtained by this simple classifier!

- DCT-18 features
- 10,000 training instances
- Naive Bayes Classification with equal priors:

P(Text) = P(Non-text) = 0.5.
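
A minimal sketch of a Naive Bayes text/non-text classifier with equal priors follows. The slides do not state which likelihood model was used, so the per-feature Gaussian likelihoods, the class name `GaussianNaiveBayes` and the 18-dimensional synthetic vectors standing in for the DCT features are all assumptions made for illustration.

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with per-feature Gaussian likelihoods (an assumed model)
    and equal class priors, e.g. P(Text) = P(Non-text) = 0.5."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Per-class mean and variance of every feature.
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        # Features are treated as independent given the class, so the
        # log-likelihood is a sum over features. Equal priors add the same
        # constant to every class and can be dropped from the argmax.
        diff = X[:, None, :] - self.mean_[None, :, :]
        log_lik = -0.5 * np.sum(
            np.log(2.0 * np.pi * self.var_)[None, :, :] + diff ** 2 / self.var_[None, :, :],
            axis=2,
        )
        return self.classes_[np.argmax(log_lik, axis=1)]

# Synthetic stand-in for 18-dimensional block features (hypothetical data).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (200, 18)), rng.normal(3.0, 1.0, (200, 18))])
y_train = np.array([0] * 200 + [1] * 200)  # 0 = non-text, 1 = text
model = GaussianNaiveBayes().fit(X_train, y_train)
```

Training only needs one pass to collect means and variances, which is consistent with the speed argument made on the slide.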

- Converting gray-scale images into two-level images.
- Many vision algorithms and operators only handle two-level images.
- Applied in primary steps of a vision algorithm.
- Selecting a proper threshold surface.
- Challenging for images with poor contrast, strong noise and variable modalities in histograms.
- Global and local (adaptive) algorithms.
- General and special-purpose algorithms.
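
As an illustration of a global, general-purpose method, the sketch below selects a single threshold by maximizing the between-class variance of the gray-level histogram, which is the criterion of Otsu's method. The function name `otsu_threshold` and the 8-bit input assumption are choices made here, not details from the slides.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t (0..255) maximizing between-class variance.
    Assumes `gray` is an integer array with values in 0..255."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # probability of class "<= t"
    mu = np.cumsum(p * np.arange(256))    # cumulative first moment
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 0                     # skip thresholds with an empty class
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))

# Usage: pixels brighter than the threshold become foreground (or background,
# depending on the polarity of the document).
image = np.full((10, 10), 50, dtype=np.uint8)
image[:, 5:] = 200
t = otsu_threshold(image)
binary = image > t
```

A local (adaptive) method would instead compute a threshold per pixel from statistics of a surrounding window.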

Four different algorithms for document image binarization were compared and contrasted:

- Otsu, N. (Jan. 1979). "A Threshold Selection Method from Gray Level Histograms", IEEE Trans. on Systems, Man and Cybernetics, vol. 9, pp. 62-66. (global, general-purpose)
- Niblack, W. (1989). An Introduction to Digital Image Processing, Prentice Hall, Englewood Cliffs, pp. 115-116. (local, general-purpose)
- Wu, V. and Manmatha, R. (Jan. 1998). "Document Image Clean-Up and Binarization", Proceedings of SPIE conference on Document Recognition. (local, special-purpose)
- Liu, Y. and Srihari, S. N. (May 1997). "Document Image Binarization Based on Texture Features", IEEE Trans. on PAMI, vol. 19(5), pp. 540-544. (global, special-purpose)

[Figure: an input image, its histogram, and the binarization results of Niblack, Otsu, Wu and Manmatha, and Liu and Srihari.]

- Quality improvement by preprocessing and postprocessing.
- Preprocessing (super-resolution):
- Taylor, M. J. and Dance, C. R. (Sep. 1998). "Enhancement of Document Images from Cameras", Proceedings of SPIE conference on Document Recognition, pp. 230-241.

- Postprocessing:
- Trier, D. and Taxt, T. (March 1995). "Evaluation of Binarization Methods for Document Images", IEEE Trans. on PAMI, vol. 17(3), pp. 312-315.

- Skew: the angle by which text lines deviate from the x-axis.
- Page decomposition techniques require properly aligned images as input.
- 3 types:
- global skew
- multiple skew
- non-uniform skew

- "Skew correction" is applied by a rotation after "skew detection".

- Categories based on the underlying techniques:
- Projection Profile
- Correlation
- Hough Transform
- Mathematical Morphology
- Fourier Transform
- Artificial Neural Networks
- Nearest-Neighbor Clustering

- The projection profile at the global skew angle of the document has narrow peaks and deep valleys.

- Projection profile technique: the skew angle is taken as the angle that maximizes a goodness measure of the projection profile.

- Limiting the range of skew angles.
- Binary search for finding the maximizer of a function.
- Computing the sum of pixels along parallel lines at an angle, instead of rotating the image by that angle.
- Reducing the size of the input image, as long as the structure of the text lines is preserved.
- MIN, MAX downsampling
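
The following is a minimal sketch of projection-profile skew detection. It sums foreground pixels along parallel lines at each candidate angle instead of rotating the image, and restricts the search to a limited angle range. The goodness measure used on the slides is not recoverable, so the sum of squared profile values is an assumption here, as are the function names and the exhaustive (rather than binary) search.

```python
import numpy as np

def projection_profile(ys, xs, angle_deg):
    """Histogram of foreground pixels along parallel lines at the given angle."""
    a = np.deg2rad(angle_deg)
    # Signed distance of each pixel from a line through the origin at angle a.
    d = ys * np.cos(a) - xs * np.sin(a)
    bins = np.round(d).astype(int)
    return np.bincount(bins - bins.min())

def goodness(profile):
    # Assumed measure: large when the profile has narrow peaks and deep valleys.
    return float(np.sum(profile.astype(float) ** 2))

def estimate_skew(binary, max_angle=15.0, step=0.5):
    """Return the candidate angle (degrees) with the best goodness score.
    Assumes `binary` is a non-empty 2D array with nonzero text pixels."""
    ys, xs = np.nonzero(binary)
    angles = np.arange(-max_angle, max_angle + step / 2, step)
    scores = [goodness(projection_profile(ys, xs, a)) for a in angles]
    return float(angles[int(np.argmax(scores))])

# Synthetic page: thin text lines skewed by 5 degrees (image y axis points down).
page = np.zeros((260, 400), dtype=np.uint8)
for y0 in range(20, 201, 30):
    for x in range(400):
        page[int(round(y0 + x * np.tan(np.deg2rad(5.0)))), x] = 1
skew = estimate_skew(page)
```

Because only foreground coordinates are touched, the cost per candidate angle is proportional to the number of text pixels, which is why downsampling the input helps further.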

- Local skew correction, after line segmentation, by robust line fitting.

[Figure: examples of uniform and non-uniform skew.]

- Slant: the deviation of the average near-vertical stroke from the vertical direction.
- Occurring in handwritten and machine-printed texts.
اراک

- Slant is non-informative.
- The average slant angle is estimated first and then a shear transformation in horizontal direction is applied to the word (or line) image to correct its slant.

- The most effective methods are based on the analysis of vertical projection profiles (histograms) at various angles.
- Identical to the projection profile based methods for skew correction, except that:
- The histograms are computed in vertical rather than horizontal direction.
- Shear transformation is used instead of rotation.

- Accurate results for handwritten words with uniform slant.
- Robust to noise.
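
A minimal sketch of slant estimation and correction by horizontal shear is given below. It mirrors the skew case but uses the vertical projection profile and a shear instead of a rotation. The goodness measure (sum of squared column counts), the function names and the sign convention for the angle are assumptions for illustration.

```python
import numpy as np

def estimate_slant(binary, max_angle=45.0, step=1.0):
    """Return the shear angle (degrees) whose vertical projection profile is sharpest.
    Assumes `binary` is a non-empty 2D array with nonzero stroke pixels."""
    ys, xs = np.nonzero(binary)
    height = (binary.shape[0] - 1) - ys  # height of each pixel above the bottom row
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step / 2, step):
        # Shear the pixel coordinates horizontally; no image is actually warped.
        shifted = np.round(xs - height * np.tan(np.deg2rad(angle))).astype(int)
        profile = np.bincount(shifted - shifted.min())
        score = float(np.sum(profile.astype(float) ** 2))  # assumed goodness measure
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle

def deslant(binary, angle_deg):
    """Apply the horizontal shear that removes a slant of `angle_deg` degrees."""
    h = binary.shape[0]
    ys, xs = np.nonzero(binary)
    new_xs = np.round(xs - (h - 1 - ys) * np.tan(np.deg2rad(angle_deg))).astype(int)
    new_xs -= new_xs.min()
    out = np.zeros((h, int(new_xs.max()) + 1), dtype=binary.dtype)
    out[ys, new_xs] = 1
    return out

# Synthetic word: three one-pixel-wide strokes leaning 20 degrees from vertical.
word = np.zeros((40, 120), dtype=np.uint8)
for x0 in (20, 50, 80):
    for y in range(40):
        word[y, int(np.round(x0 + (39 - y) * np.tan(np.deg2rad(20.0))))] = 1
slant = estimate_slant(word)
upright = deslant(word, 20.0)
```

Rounding the sheared coordinates is what produces jagged edges in real images, which motivates a smoothing step afterwards.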

- Projection profile technique: the slant angle is taken as the shear angle that maximizes a goodness measure of the vertical projection profile.

- Postprocessing:
- Smoothing jagged edges.

[Figure: a part of a slanted word, the same part after slant correction, and after smoothing.]

- Skeletonization, or the medial axis transform (MAT), of a shape has been one of the most studied problems in image processing and machine vision.
- A skeletonization (thinning) algorithm transforms a shape into arcs and curves of thickness one, which together are called the skeleton.
- An ideal skeleton has the following properties:
- retaining basic structural properties of the original shape
- well-centered
- well-connected
- precisely reconstructable
- robust

- Simplifying classification:
- Diminishing variability and distortion of instances of one class.
- Reducing the amount of data to be handled.

- Proved to be effective in pattern recognition problems:
- Character recognition
- Fingerprint recognition
- Chromosome recognition
- …

- Providing compact representations and structural analysis of objects.

Five different skeletonization algorithms were compared and contrasted with the main focus on preserving text characteristics:

- Naccache, N. J. and Shinghal, R. (1984). "SPTA: A Proposed Algorithm for Thinning Binary Patterns", IEEE Trans. on Systems, Man and Cybernetics, vol. SMC-14(3), pp. 409-418.
- Zhang, T. Y. and Suen, C. Y. (1984). "A Fast Parallel Algorithm for Thinning Digital Patterns", Comm. ACM, vol. 27(3), pp. 236-239.
- Ji, L. and Piper, J. (1992). "Fast Homotopy-Preserving Skeletons Using Mathematical Morphology", IEEE Trans. on PAMI, vol. 14(6), pp. 653-664.
- Sajjadi, M. R. (Oct. 1996). "Skeletonization of Persian Characters", M. Sc. Thesis, Computer Science and Engineering Department, Shiraz University, Iran.
- Huang, L., Wan, G. and Liu, C. (2003). "An Improved Parallel Thinning Algorithm", Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), pp. 780-783.

[Figures: skeletons of two input images produced by the Zhang-Suen, Huang et al., SPTA, DTSA and Homotopy-Preserving algorithms, followed by a comparison of SPTA, DTSA and Huang et al. for robustness to border noise.]

- Postprocessing:
- Removing spurious branches

- Modification:
- Removing 4-connectivity, and preserving 8-connectivity of the pattern.

…

- The connectivity number Cn:

[Figure: example pixel neighborhoods labeled dot, end-point, branch-point and cross-point, with connectivity numbers Cn ranging from 0 to 4.]
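
One common way to compute such a number, assumed here rather than taken from the slides, is the crossing number: the count of 0-to-1 transitions met while walking once around the 8 neighbors of a skeleton pixel. The function names and the neighbor ordering are choices made for this sketch.

```python
import numpy as np

# Offsets of the 8 neighbors in circular (clockwise) order, starting at the top.
NEIGHBOR_OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def crossing_number(skeleton, y, x):
    """Count 0->1 transitions around pixel (y, x) of a binary skeleton.
    Assumes (y, x) is not on the image border."""
    ring = [int(skeleton[y + dy, x + dx]) for dy, dx in NEIGHBOR_OFFSETS]
    return sum(1 for k in range(8) if ring[k] == 0 and ring[(k + 1) % 8] == 1)

def make_patch(active_neighbors):
    """Build a 3x3 test patch with the center set and the given neighbors set."""
    patch = np.zeros((3, 3), dtype=np.uint8)
    patch[1, 1] = 1
    for k in active_neighbors:
        dy, dx = NEIGHBOR_OFFSETS[k]
        patch[1 + dy, 1 + dx] = 1
    return patch

isolated = crossing_number(make_patch([]), 1, 1)
end_point = crossing_number(make_patch([0]), 1, 1)
ordinary = crossing_number(make_patch([0, 4]), 1, 1)
branch = crossing_number(make_patch([0, 3, 6]), 1, 1)
cross = crossing_number(make_patch([0, 2, 4, 6]), 1, 1)
```

Under this definition the count is only meaningful on a thin skeleton, because two adjacent set neighbors are counted as a single branch.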

- Capable of tolerating much variation.
- Not robust to noise.
- Hard to extract.
- 1D HMM needs 1D observation sequence.
- Converting 2D word image into a 1D signal.
- Speech recognition and online handwriting recognition: 1D signal.
- Offline handwriting recognition: 2D signal.

- Converting the word skeleton into a graph.
- Tracing the edges in a canonical order:

- Loop Extraction:
- Important distinctive features.
- Making the number of strokes smaller:
- Easier Modeling
- Lower Computational Cost

- Different types of loops:
- simple-loop
- multi-link-loop
- double-loop

- A DFS algorithm was written to find complex loops in the word graph.
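
The slides do not show the loop-finding algorithm itself. As a rough sketch of the idea, a depth-first search over the word graph can report one cycle for every back edge it meets. The adjacency-dictionary representation, the function name `find_loops` and the restriction to simple graphs (no parallel edges or self-loops) are assumptions.

```python
def find_loops(adj):
    """Return one cycle (list of nodes) per DFS back edge of an undirected graph.
    `adj` maps each node to a list of its neighbors; parallel edges and
    self-loops are not handled in this sketch."""
    parent, depth, loops = {}, {}, []

    def dfs(u, p, d):
        parent[u], depth[u] = p, d
        for v in adj[u]:
            if v == p:
                continue  # do not walk straight back along the tree edge
            if v not in depth:
                dfs(v, u, d + 1)
            elif depth[v] < depth[u]:
                # Back edge to an ancestor: climb the DFS tree to recover the cycle.
                cycle, w = [u], u
                while w != v:
                    w = parent[w]
                    cycle.append(w)
                loops.append(cycle)

    for start in adj:
        if start not in depth:
            dfs(start, None, 0)
    return loops

# A triangle a-b-c with a tail c-d, standing in for a letter with one loop.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
loops = find_loops(graph)
```

Loops that share edges (such as double-loops) show up as several back edges, so a real implementation would need an extra merging step.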

هـ

ـهـ

ـصـ

ـط

ـمـ

ـو

ـه

...

صـ

ص

ف

مـ

و

...

- Each edge is transformed into a 10D feature vector:
- Normalized length feature (f1)
- Curvature feature (f2)
- Slope feature (f3)
- Connection type feature (f4)
- Endpoint distance feature (f5)
- Number of segments feature (f6)
- Curved features (f7-f10)

- Independent of the baseline location.
- Invariance against scaling, translation and rotation.

1: [0.68, 1.00, 6, 0, 0.05, 1, 0.0, 0.0, 0.7, 0.0]

2: [0.11, 1.01, 6, 1, 0.23, 1, 0.0, 0.0, 0.0, 0.0]

3: [2.00, 3.00, 8, 10, 0.00, 0, 0.0, 0.0, 0.0, 0.0]

...

- Signal Modeling:
- Deterministic
- Stochastic:
- Characterizing the signal by a parametric random process.

- HMM is a widely used statistical (stochastic) model:
- The most widely used technique in modern ASR systems.

- Speech and handwritten text are similar:
- Symbols with ambiguous boundaries.
- Symbols with variations in appearance.

- An HMM does not model the whole pattern as a single feature vector; it explores the relationship between consecutive segments.

- Nondeterministic finite state machines:
- Probabilistic state transition.
- Each state is associated with a random function.
- Unknown state sequence.
- Some probabilistic function of the state sequence can be seen.

N: The number of states of the model

S = {s1, s2, ..., sN}: The set of states

∏ = {πi = P(si at t = 1)}: The initial state probabilities

A = {aij = P(sj at t+1 | si at t)}: The state transition probabilities

M: The number of observation symbols

V = {v1, v2, ..., vM}: The set of possible observation symbols

B = {bi(vk) = P(vk at t | si at t)}: The symbol emission probabilities

Ot: The observed symbol at time t

T: The length of the observation sequence

λ = (A, B, ∏): The compact notation to denote the HMM.

A 5-state Left-to-Right HMM

A 5-state Left-to-Right HMM with maximum relative forward jump of 2
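
To make the left-to-right topology concrete, the sketch below builds a transition matrix in which a state may stay where it is or jump forward by at most a given number of states. The uniform initial values and the function name are assumptions; training would re-estimate the nonzero entries.

```python
import numpy as np

def left_to_right_transitions(n_states, max_jump):
    """Transition matrix A with a_ij > 0 only for i <= j <= i + max_jump.
    Allowed transitions start out uniform (an assumed initialization)."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        j_max = min(i + max_jump, n_states - 1)
        A[i, i:j_max + 1] = 1.0 / (j_max - i + 1)
    return A

# The 5-state model with a maximum relative forward jump of 2.
A = left_to_right_transitions(5, 2)
```

Entries that start at zero stay at zero under Baum-Welch re-estimation, so the topology is preserved during training.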

The Three Fundamental Problems:

1. Given a model λ = (A, B, ∏), how do we compute P(O | λ), the probability of the observation sequence O = O1, O2, ..., OT? Solved by the Forward-Backward Algorithm.

2. Given the observation sequence O and a model λ, how do we choose a state sequence S = s1, s2, ..., sT so that P(O, S | λ) is maximized, i.e. a state sequence that best explains the observations? Solved by the Viterbi Algorithm.

3. Given the observation sequence O, how do we adjust the model parameters λ = (A, B, ∏) so that P(O | λ) or P(O, S | λ) is maximized, i.e. find a model that best explains the observed data? Solved by the Baum-Welch Algorithm (maximizing P(O | λ)) or the Segmental K-means Algorithm (maximizing P(O, S | λ)).
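
The first two problems can be sketched for a discrete HMM as follows. The code uses only the forward pass of the Forward-Backward Algorithm and omits the scaling or log-domain arithmetic that long sequences require; the function names and the tiny two-state example are illustrative assumptions.

```python
import numpy as np

def forward(pi, A, B, obs):
    """P(O | lambda) by the forward recursion (no scaling, short sequences only)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

def viterbi(pi, A, B, obs):
    """Most likely state sequence and its joint probability P(O, S | lambda)."""
    delta = pi * B[:, obs[0]]
    backpointers = []
    for o in obs[1:]:
        trans = delta[:, None] * A            # trans[i, j] = delta_i * a_ij
        backpointers.append(trans.argmax(axis=0))
        delta = trans.max(axis=0) * B[:, o]
    state = int(delta.argmax())
    path = [state]
    for bp in reversed(backpointers):         # trace the best path backwards
        state = int(bp[state])
        path.append(state)
    path.reverse()
    return path, float(delta.max())

# Two-state left-to-right example: state 0 emits symbol 0, state 1 emits symbol 1.
pi = np.array([1.0, 0.0])
A = np.array([[0.5, 0.5], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0]])
obs = [0, 0, 1]
likelihood = forward(pi, A, B, obs)
best_path, best_prob = viterbi(pi, A, B, obs)
```

In this example only one state sequence can produce the observations, so the total likelihood and the best-path probability coincide.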

- Discrete HMM:
- Discrete observation sequences: V = {v1, v2, ..., vM}.
- A codebook obtained by Vector Quantization (VQ).
- Codebook size?

- Distortion: information loss due to the quantization error!

- Continuous Hidden Markov Model (CHMM):
- Overcoming the distortion problem.
- Requiring more parameters → more memory
- Requiring more deliberate initialization techniques:
- Training may diverge with randomly selected initial parameters!

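Multivariate Gaussian mixture:

$$b_i(o) = \sum_{m=1}^{M} c_{im}\, \mathcal{N}(o;\, \mu_{im}, \Sigma_{im}), \qquad \mathcal{N}(o;\, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^K |\Sigma|}} \exp\!\left(-\tfrac{1}{2}(o-\mu)^{\top} \Sigma^{-1} (o-\mu)\right)$$

cim: The mth mixture gain coefficient in state i

μim: The mean of the mth mixture in state i

Σim: The covariance of the mth mixture in state i

M: The number of mixtures used

K: The dimensionality of the observation space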
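
A minimal sketch of evaluating such a mixture emission density for one state is shown below, using the standard definition of a Gaussian mixture with full covariance matrices. The function name and argument layout are assumptions, and a practical system would work with log-densities to avoid underflow.

```python
import numpy as np

def gmm_density(o, c, mu, sigma):
    """Emission density of one state: sum over m of c[m] * N(o; mu[m], sigma[m]).
    o: (K,) observation; c: M gains; mu: M means (K,); sigma: M covariances (K, K)."""
    o = np.asarray(o, dtype=float)
    K = o.shape[0]
    total = 0.0
    for c_m, mu_m, sigma_m in zip(c, mu, sigma):
        diff = o - np.asarray(mu_m, dtype=float)
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** K * np.linalg.det(sigma_m))
        # Solve a linear system instead of inverting the covariance explicitly.
        total += c_m * norm * np.exp(-0.5 * diff @ np.linalg.solve(sigma_m, diff))
    return float(total)

# One standard normal component in one dimension, evaluated at its mean.
single = gmm_density([0.0], [1.0], [np.zeros(1)], [np.eye(1)])
# Two equally weighted 2D components, evaluated at the first mean.
double = gmm_density([0.0, 0.0], [0.5, 0.5],
                     [np.zeros(2), np.full(2, 10.0)], [np.eye(2), np.eye(2)])
```

Each state carries M gains, M mean vectors and M covariance matrices, which is where the larger parameter count of a CHMM comes from.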

- Two-Stage Skew Correction
- Postponed Binarization

- The recognition system was trained and evaluated on a dataset of 100 city names of Iran.
- A pattern recognition problem with 100 classes was considered.
- Most samples in the dataset were generated automatically by a Java program that draws an input string in different fonts, sizes and orientations onto an output image.
- The dataset contains 150 samples for each word.

[Figure: sample recognition results.]

- Five samples: 1-best recognized.
- One sample: 3-best recognized: 1. زنجان (Zanjan) 2. اصفهان (Isfahan) 3. دامغان (Damghan)
- One sample: 4-best recognized: 1. قشم (Qeshm) 2. قم (Qom) 3. مرند (Marand) 4. مشهد (Mashhad)
- Three samples: not N-best recognized, for N ≤ 20.

- The first work to use CHMMs with structural features to recognize Farsi handwritten words.
- A complete offline recognition system for Farsi handwritten words.
- A new machine learning approach based on the NBC for text segmentation.
- Comparing and contrasting different algorithms for:
- Binarization
- Skew and Slant Correction
- Skeletonization

- Excellent generalization performance.
- A maximum recognition rate of 82% on our dataset of 100 city names.

Thanks for your attention

Please feel free to ask any questions

m2haji@yahoo.com