
Document Image Analysis Lecture 12: Word Segmentation


Presentation Transcript


  1. Document Image Analysis, Lecture 12: Word Segmentation. Richard J. Fateman (University of California – Berkeley), Henry S. Baird (Xerox Palo Alto Research Center). UC Berkeley CS294-9 Fall 2000

  2. The course, recently… • We studied symbol recognition, classifiers, and their combinations • Word recognition as distinct from character recognition

  3. A good segmentation method (or several) is handy • We cannot rely on a lexicon to contain all words (names, proper nouns, numbers, acronyms) • Insisting that words be in the lexicon does not mean they are correct. PowerPoint tries to refuse “mispell” as a spelling of “misspell”, since the former is not in the dictionary! • Good segmentation means that symbol-based recognition has a better chance of success

  4. Segmentation: naïve or clever • Numerous papers on the subject • Some without strong models (e.g. cut at thin parts) • Some with exhaustive search / template matching • Some with learning / internal comparisons

  5. Naïve connected component analysis can’t come close… • Characters like “ij:; Ξâ% are split into several components • Ligatures are not separated: ffl, Œ, Æ, œ, ffi • Vertical cuts between touching characters will not ordinarily work for italics • Font samples: THIS IS ULTRA CONDENSED ..TZ / this is times italic (other problems: X2 , )
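
To make the first two bullets of slide 5 concrete, here is a minimal sketch of naïve connected component labelling on toy bitmaps. The bitmaps, the `'#'` = black encoding, and the function name are assumptions for illustration only; nothing here comes from the lecture's own code. The dotted letter comes out as two components, while two touching blobs come out as one.

```python
from collections import deque

def connected_components(bitmap):
    """Count 8-connected black regions; bitmap is a list of equal-length
    strings in which '#' marks a black pixel (an assumed toy encoding)."""
    rows, cols = len(bitmap), len(bitmap[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != '#' or labels[r][c]:
                continue
            # New, unlabelled black pixel: flood-fill its whole region.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == '#'
                                and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return count, labels

# A crude "i": the dot and the stem are not connected.
LETTER_I = [
    ".#.",
    "...",
    ".#.",
    ".#.",
    ".#.",
]

# Two blobs joined by a one-pixel bridge, standing in for touching characters.
TOUCHING = [
    "##.##",
    "#####",
    "##.##",
]

print(connected_components(LETTER_I)[0])   # 2: one character, two components
print(connected_components(TOUCHING)[0])   # 1: two shapes, one component
```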

  6. Papers of interest on segmentation • Tsujimoto and Asada • Bayer and Kressel • Tao Hong’s PhD thesis (1995) on degraded text recognition

  7. Segmentation + Clustering (Tao Hong)

  8. Can lead to decoding!

  9. Sometimes the image itself holds a key to decoding…

  10. Visual inter-word relations

  11. An example text block showing visual inter-word relationships

  12. Pattern matching can lead to identifying a segment

  13. (no slide text)

  14. Where this fits…

  15. Example

  16. Tsujimoto & Asada: Overview

  17. Resolve the touching characters: • New metric for finding breaks (find plausible breaks) • Use knowledge about “the usual suspects”: rn/m, k/lc, d/cl, … (limits search substantially)
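
One way to read the “usual suspects” bullet on slide 17 is as a small confusion table that limits which re-segmentations are worth trying. The sketch below is an assumption about how such a table could be used, not the Tsujimoto & Asada implementation. It contains only the three pairs named on the slide, and the names `USUAL_SUSPECTS` and `alternative_readings` are hypothetical.

```python
# Merged-looking character -> the character pair it may really be.
# Only the three pairs named on the slide; the slide's "…" is left unfilled.
USUAL_SUSPECTS = {"m": "rn", "k": "lc", "d": "cl"}

def alternative_readings(word):
    """Expand a recognized word into every reading obtained by optionally
    replacing each suspect character with its two-character alternative."""
    readings = [""]
    for ch in word:
        options = [ch]
        if ch in USUAL_SUSPECTS:
            options.append(USUAL_SUSPECTS[ch])
        readings = [prefix + opt for prefix in readings for opt in options]
    return readings

print(alternative_readings("dim"))  # ['dim', 'dirn', 'clim', 'clirn']
```

Because only characters in the table branch, the number of alternatives grows with the number of suspect characters rather than with every possible cut position.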

  18. Metric, pre-processing • ANDing columns for the profile • Removing slant from italics
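
The slide text for slide 18 gives only the phrase “ANDing columns for profile”. The sketch below shows one plausible reading: a pixel contributes to the profile only if it is black in both of two adjacent columns. The toy bitmap and function names are assumptions, and slant removal is not covered. On a thin diagonal joint, the plain projection profile stays above zero while the ANDed profile drops to zero, which makes the break easier to find.

```python
def column_profile(bitmap):
    """Plain vertical projection profile: black pixels per column."""
    width = len(bitmap[0])
    return [sum(row[x] == '#' for row in bitmap) for x in range(width)]

def anded_profile(bitmap):
    """Profile of the AND of each column with its right-hand neighbour:
    a row counts only if both adjacent pixels are black."""
    width = len(bitmap[0])
    return [sum(row[x] == '#' and row[x + 1] == '#' for row in bitmap)
            for x in range(width - 1)]

# Two solid strokes joined by a thin diagonal bridge (toy data).
IMG = [
    "##..##",
    "##.###",
    "###.##",
    "##..##",
]

print(column_profile(IMG))  # [4, 4, 1, 1, 4, 4]
print(anded_profile(IMG))   # [4, 1, 0, 1, 4]
```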

  19. Choosing break candidates

  20. Decision Tree for “The”

  21. Tree search • Depth-first, looking for a solution to the string matching, in sequence • Some partitions are penalized (but not eliminated) if the segmentation point is uncertain • Segments are matched to omnifont templates (“multiple similarity method…”)
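
The tree search on slide 21 can be sketched as a depth-first walk over candidate break columns, where an uncertain break adds a penalty instead of being ruled out. Everything concrete below is made up for illustration: the break columns, the integer costs, and the lookup table `TOY_TEMPLATES`, which stands in for real omnifont template matching. Skipping branches that already cost more than the best solution found is an added convenience that assumes non-negative costs; the slide does not mention it.

```python
def segment(width, breaks, match):
    """Depth-first search for the cheapest left-to-right segmentation.

    breaks: dict mapping candidate cut column -> penalty (0 = confident cut).
    match:  function (start, end) -> (label, cost), or None if no template fits.
    Returns (labels, total_cost), or (None, inf) if nothing matches.
    """
    cuts = sorted(set(breaks) | {width})
    best = {"cost": float("inf"), "labels": None}

    def dfs(start, cost, labels):
        if cost >= best["cost"]:
            return                      # cannot beat the best solution so far
        if start == width:
            best["cost"], best["labels"] = cost, labels
            return
        for end in cuts:
            if end <= start:
                continue
            hit = match(start, end)
            if hit is None:
                continue
            label, seg_cost = hit
            # Uncertain cut points are penalized, not eliminated.
            penalty = breaks.get(end, 0)
            dfs(end, cost + seg_cost + penalty, labels + [label])

    dfs(0, 0, [])
    return best["labels"], best["cost"]

# Hypothetical matcher results for a 28-column image of the word "The".
TOY_TEMPLATES = {
    (0, 10): ("T", 1),
    (10, 20): ("h", 2),
    (20, 28): ("e", 1),
    (10, 14): ("l", 3),   # the left part of "h" read as "l"
    (14, 20): ("n", 6),   # the right part of "h" read as "n"
}
BREAKS = {10: 0, 14: 5, 20: 0}   # column 14 is an uncertain cut

labels, cost = segment(28, BREAKS, lambda s, e: TOY_TEMPLATES.get((s, e)))
print(labels, cost)  # ['T', 'h', 'e'] 4
```

The alternative reading “T l n e” is still explored, at cost 16, and loses only because of its matching costs and the penalty on the uncertain cut.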

  22. Reexamined explanations • This might be mistaken for This, etc. … 30 confusions

  23. Some tough calls…

  24. Unbelievable accuracy…

  25. A different, perhaps more general method (Bayer, Kressel) • Goal: find the column position(s) at which characters are touching • Treat as a systematic classification problem • Learn from a database containing labelled merged characters • Collect real-life data; get human breakpoints [or could be synthetic, I suppose] • Find appropriate feature set • Learn the features of touching characters • Hypothesize column breaks • Application: postal addresses, other stuff too

  26. Database of touching characters: 2158 patterns

  27. Big idea: rather than represent the breaks as low points in the projection profile, represent the breaks in the natural context of touching characters by actual example, suitably normalized for size (15-30 pixels high). These locations are manually marked.

  28. Local feature set describing cut locations / measures of similarity • Number of black pixels (= projection profile!) • Number of white pixels counting from top/bottom • Number of white-black transitions • Number of identical black or white pixels next to this column (derivative of the projection profile?)
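
The four local features on slide 28 can be computed per column roughly as follows. This is a sketch of one interpretation, not the Bayer and Kressel feature definitions: “transitions” is taken to mean white-to-black changes going down the column, and “identical pixels next to this column” is taken to mean rows where the column agrees with its right-hand neighbour. The bitmap and the dictionary keys are assumptions.

```python
def local_features(bitmap, x):
    """Features of column x in a bitmap given as strings with '#' = black."""
    col = [row[x] == '#' for row in bitmap]
    height = len(col)
    # White run length from the top and from the bottom of the column.
    white_top = next((i for i, b in enumerate(col) if b), height)
    white_bottom = next((i for i, b in enumerate(reversed(col)) if b), height)
    # White-to-black changes when scanning from top to bottom.
    transitions = sum(1 for a, b in zip(col, col[1:]) if not a and b)
    # Rows where this column has the same colour as the next column.
    if x + 1 < len(bitmap[0]):
        same_as_next = sum(row[x] == row[x + 1] for row in bitmap)
    else:
        same_as_next = height   # last column: no neighbour to differ from
    return {
        "black": sum(col),
        "white_top": white_top,
        "white_bottom": white_bottom,
        "transitions": transitions,
        "same_as_next": same_as_next,
    }

# Two solid strokes joined by a thin diagonal bridge (toy data).
IMG = [
    "##..##",
    "##.###",
    "###.##",
    "##..##",
]

print(local_features(IMG, 0))  # solid stroke column
print(local_features(IMG, 2))  # column inside the thin joint
```

A classifier would typically see these values for a window of neighbouring columns rather than for a single column.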

  29. Global feature set describing cut locations / measures of similarity • Width-to-height ratio of the full image (wider suggests touching characters) • Width-to-height ratio of the image AFTER cutting(s) • Number of white-black transitions • Number of identical black or white pixels next to this column (derivative of the projection profile?)
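
The two width-to-height features on slide 29 can be illustrated as below. Using the tight bounding box of the black pixels on each side of the cut is an assumption; the slide does not say how the pieces are measured. The function names and the toy bitmap are hypothetical.

```python
def bbox_ratio(bitmap, x0, x1):
    """Width/height of the tight bounding box of black pixels in
    columns x0 <= x < x1; returns 0.0 if that region has no black pixel."""
    ys = [y for y, row in enumerate(bitmap) if '#' in row[x0:x1]]
    xs = [x for x in range(x0, x1) if any(row[x] == '#' for row in bitmap)]
    if not ys:
        return 0.0
    return (xs[-1] - xs[0] + 1) / (ys[-1] - ys[0] + 1)

def global_features(bitmap, cut):
    """Aspect ratio of the whole pattern and of the two pieces left by a
    vertical cut placed just before column `cut`."""
    width = len(bitmap[0])
    return {
        "full": bbox_ratio(bitmap, 0, width),
        "left": bbox_ratio(bitmap, 0, cut),
        "right": bbox_ratio(bitmap, cut, width),
    }

# Two solid strokes joined by a thin diagonal bridge (toy data).
IMG = [
    "##..##",
    "##.###",
    "###.##",
    "##..##",
]

print(global_features(IMG, 3))  # {'full': 1.5, 'left': 0.75, 'right': 0.75}
```

Here the full pattern is wider than it is tall, while each piece after the cut has a more character-like ratio, which is the kind of evidence the first two bullets describe.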

  30. Illustration of the strategy

  31. How accurate, how fast? (cut location) • Finding cuts: 7.8% error on the learning set, 7.2% (!) on the test set • 22% of the no-cut regions had errors • Best results used a 50-feature classifier over a 9-column width • Cost of one image cut-analysis ≈ one character analysis • Validates statistics > heuristics…
