Whole-Book Recognition using Mutual-Entropy-Driven Model Adaptation


Presentation Transcript


  1. Whole-Book Recognition using Mutual-Entropy-Driven Model Adaptation
  Pingping Xiu*, Henry S. Baird
  DR&R XV, January 29, 2008

  2. Whole-Book Recognition: Motivation
  • Millions of books are being scanned cover-to-cover, with OCR results available on the web.
  • Many documents are strikingly isogenous, containing only a small fraction of all possible languages, words, typefaces, image qualities, layout styles, etc.
  • Research over the last decade or so shows that isogeny can be exploited to improve recognition fully automatically (Nagy & Baird 1994; Hong 1995; Sarkar 2000; Veeramachaneni 2002; Sarkar, Baird & Zhang 2003).
  • Given a book's images, an initial buggy OCR transcription, and a dictionary that might be incomplete, can we improve recognition accuracy substantially and fully automatically?

  3. Fully-Automatic Model Adaptation
  Two models, both initialized from the input:
  • Iconic: a character-image classifier.
  • Linguistic: a dictionary (a list of valid words).
  Look for disagreements between the two models: measure the mutual entropy between two probability distributions:
  • the a posteriori probabilities of the character classes, and
  • the a posteriori probabilities of the word classes.
  Automatically correct both models to reduce disagreement:
  • We illustrate this approach on a small test set of real book-image data.
  • We show that we can drive improvements in both models and lower the recognition error rate.

  4. Defining ‘Disagreements’
  Mutual entropy (ME) is defined at three levels (the formulas appear as images on the slide): character-level ME, word-level ME, and passage-level ME.
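The slide's formulas are not reproduced in this transcript. The following is only a plausible reconstruction, assuming a cross-entropy-style disagreement between the two models' posteriors (the notation is mine, consistent with the character- and word-class posteriors listed on slide 3, not copied from the paper):

```latex
% Notation is assumed: p_i(c) is the iconic (character-classifier) posterior at
% character position i; q_i(c) is the posterior induced by the linguistic model
% (dictionary-constrained word recognition) at the same position.
\[
  \mathrm{ME}_{\mathrm{char}}(i) \;=\; -\sum_{c} p_i(c)\,\log q_i(c)
\]
\[
  \mathrm{ME}_{\mathrm{word}}(w) \;=\; \sum_{i \in w} \mathrm{ME}_{\mathrm{char}}(i)
\]
\[
  \mathrm{ME}_{\mathrm{passage}} \;=\; \sum_{w \in \mathrm{passage}} \mathrm{ME}_{\mathrm{word}}(w)
\]
```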

  5. Model-Adaptation Policies
  • How to correct the iconic model: characters with the largest character-level disagreement first.
  • How to correct the linguistic model: words with the largest word-level disagreement first.
  • Minimize this objective function: drive the passage-level mutual entropy down.
  • Test experimentally: can this policy achieve unsupervised improvement, and does it work better the longer the passage?
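A minimal sketch of the prioritization step in Python, assuming the per-character posteriors from the two models are available as arrays; the cross-entropy form of the disagreement and all names here are my assumptions, not code from the paper:

```python
import numpy as np

def char_level_me(p, q, eps=1e-12):
    """Cross-entropy disagreement between the iconic posterior p and the
    linguistic posterior q for one character position (both length-K arrays)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))

def rank_by_disagreement(iconic_posteriors, linguistic_posteriors):
    """Order character positions so the largest character-level disagreement
    (the first candidate for an iconic-model correction) comes first."""
    scores = [char_level_me(p, q)
              for p, q in zip(iconic_posteriors, linguistic_posteriors)]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```

The same ranking idea applies one level up: word positions would be ordered by word-level ME to prioritize linguistic-model corrections.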

  6. Experimental Design – Training
  • Iconic model: trained from a page image with its hOCR transcript.*
  • Linguistic model: a dictionary containing all the true words on that page, except “the” (intentional damage).
  * Google Inc., Book Search Dataset, Version V1.0, Volume 0, Page 28, August 2007.

  7. Experimental Design – Testing
  • Character classification: highly erroneous.
  • Word classification: not too bad, but all the “the”s are missing.

  8. Interpreting Disagreements
  • What to do next? Apply a series of iconic corrections.
  • The series of model corrections: ‘t’ (word 7, character 1), ‘3’ (word 8, character 3), ‘t’ (word 9, character 4), etc.

  9. Effects of Iconic Adaptation
  • Initial state, after inferring the models from the input (shown on the slide).
  • Improved state, after the iconic-model corrections (shown on the slide).

  10. Iconic Model Improves
  • Before the sequence of iconic-model corrections: the 1st and 2nd choices aren’t very different → low confidence.
  • After the iconic-model corrections: the 1st and 2nd choices differ much more → higher confidence.
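One simple way to read this notion of confidence is as the gap between the top two posterior probabilities. This margin measure is my illustration, not necessarily the authors' exact statistic:

```python
import numpy as np

def top2_margin(posterior):
    """Confidence proxy: gap between the 1st- and 2nd-ranked a posteriori
    probabilities of a character classifier; a small gap means low confidence."""
    top2 = np.sort(np.asarray(posterior, dtype=float))[-2:]
    return float(top2[1] - top2[0])
```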

  11. Effects on the Disagreement Distribution
  • Before iconic adaptation, disagreements are scattered throughout the passage.
  • After iconic adaptation, disagreements concentrate on errors due to defects in the linguistic model.

  12. Correcting the Linguistic Model
  • First, examine the word having the highest word-level ME: cluster (3, 8, 11).
  • Voting suggests that the correct word interpretation is ‘the’, which is then inserted into the dictionary, fully automatically.
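A minimal sketch of this voting step, assuming the iconic model's readings of the cluster's word images are available and the dictionary is a set of strings; the function and variable names are hypothetical:

```python
from collections import Counter

def vote_for_missing_word(cluster_readings, dictionary):
    """Given the iconic model's readings of the word images in one cluster
    (e.g. the three occurrences in cluster (3, 8, 11)), take the majority
    interpretation and insert it into the dictionary if it is absent."""
    winner, _ = Counter(cluster_readings).most_common(1)[0]
    if winner not in dictionary:
        dictionary.add(winner)  # fully automatic linguistic-model correction
    return winner

# Hypothetical usage: vote_for_missing_word(['the', 'tha', 'the'], lexicon) -> 'the'
```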

  13. Algorithm & Results Summary
  • Unsupervised (fully automatic).
  • Greedy: multiple cycles, each cycle with two stages: iconic changes, then linguistic changes.
  • Character-level ME suggests priorities for iconic-model changes.
  • Word-level ME suggests priorities for linguistic-model changes.
  • Every model change must reduce passage-level ME.
  • Stopping rule: we’ve experimented with an adaptive-threshold heuristic.
  • In the real-data example in the paper, the recognition error rate was driven to zero (of course, this is a very short passage).
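A compact Python sketch of the greedy procedure summarized above. The ME function and the change-proposal and change-application hooks are passed in as placeholders for whatever the real implementation uses; only the control flow (two stages per cycle, accept a change only if passage-level ME drops, adaptive stopping) follows the slide:

```python
def adapt_models(iconic, linguistic, passage,
                 passage_me, propose_iconic, apply_iconic,
                 propose_linguistic, apply_linguistic,
                 max_cycles=10, min_gain=1e-3):
    """Greedy, fully automatic adaptation.  Each cycle runs two stages
    (iconic changes, then linguistic changes); a proposed change is kept
    only if it lowers the passage-level mutual entropy."""
    best = passage_me(iconic, linguistic, passage)
    for _ in range(max_cycles):
        cycle_start = best
        # Stage 1: iconic corrections, highest character-level ME first.
        for change in propose_iconic(iconic, linguistic, passage):
            candidate = apply_iconic(iconic, change)
            me = passage_me(candidate, linguistic, passage)
            if me < best:
                iconic, best = candidate, me
        # Stage 2: linguistic corrections, highest word-level ME first.
        for change in propose_linguistic(iconic, linguistic, passage):
            candidate = apply_linguistic(linguistic, change)
            me = passage_me(iconic, candidate, passage)
            if me < best:
                linguistic, best = candidate, me
        # Stopping rule: quit once a full cycle buys too little improvement
        # (a stand-in for the adaptive-threshold heuristic mentioned above).
        if cycle_start - best < min_gain:
            break
    return iconic, linguistic
```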

  14. Isogeny & Long Passages
  • If, as we expect, isogeny plays a crucial role, error rates should decline as passage length increases.
  • A recent experiment (not in the paper): a whole page and a 45,000-word dictionary (results shown on the slide).

  15. Discussion
  • Mutual entropy is only one information-theoretic framework for unsupervised model improvement.
  • Even using ME, other interesting greedy and branch-and-bound heuristics come to mind.
  • Minimizing passage-level ME may not necessarily minimize recognition error (experts disagree).
  • Proofs of global optimality seem hard, but we’re investigating.
  • Scaling up to, say, 100-page passages is urgently needed.

  16. Thanks! Pingping Xiu pix206@lehigh.edu Henry Baird baird@cse.lehigh.edu
