
Versatile Document Image Content Extraction


Presentation Transcript


  1. Versatile Document Image Content Extraction Henry S. Baird Michael A. Moll Jean Nonnemaker Matthew R. Casey Don L. Delorenzo

  2. Document Image Content Extraction Problem • Given an image of a document, find regions containing handwriting, machine-print text, graphics, line art, logos, photographs, noise, etc.

  3. Difficulties • Vast diversity of document types • Arduous data collection • How big is a representative training set? • Expense of preparing correctly labeled “ground-truthed” samples • Lack of consensus on how to evaluate performance

  4. Our Research Goals • Versatility First • Beware “brittle” or narrow approaches • Develop methods that work across the broadest possible spectrum of document and image types • Voracious Classifiers • Belief that the accuracy of a classifier has more to do with its training data than with other considerations • Want to train on extremely large (and representative) data sets (in reasonable amounts of time) • Extremely High Speed Classification • Ideally, perform nearly at I/O rates (as fast as images can be read). Too ambitious?

  5. Related Strategies (for the future) • Amplification • Real ground-truthed training samples are hard to find, expensive to generate, and difficult to check for coverage • Want to use real samples as ‘seeds’ for massive synthetic generation of pseudo-randomly perturbed samples for use in supplementary training • Confidence Before Accuracy • Confidence is perhaps more important than accuracy, since even modest accuracy (across all cases) can be useful • Near-Infinite Space • Design for best performance in the near future, when main memory will be orders of magnitude larger and faster • Data-Driven Design • Avoid arbitrary engineering decisions such as the choice of features, instead allowing training data to determine them

  6. Document Images • Range of document and image types • Color, grey-level, black and white • Any size or resolution • Lots of file formats (TIFF, JPEG, PNG, etc) • Pre-processing step of converting images into three-channel color PNG files in the HSL (Hue, Saturation, Luminance) color space • Bi-level and gray images will map primarily into the Luminance component
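A minimal sketch of the pre-processing step described on this slide: converting an arbitrary input image into a three-channel PNG in HSL space. The use of Pillow and colorsys, the file names, and the H/S/L channel ordering are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch only: convert any input image (bi-level, grey, or color; TIFF/JPEG/PNG/...)
# into a three-channel PNG in HSL space. Channel ordering is an assumption.
import colorsys
from PIL import Image

def to_hsl_png(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path).convert("RGB")   # bi-level/grey images expand to RGB
    out = Image.new("RGB", img.size)
    src, dst = img.load(), out.load()
    w, h = img.size
    for y in range(h):
        for x in range(w):
            r, g, b = (v / 255.0 for v in src[x, y])
            hue, lum, sat = colorsys.rgb_to_hls(r, g, b)
            # grey pixels have saturation 0, so their information lives
            # almost entirely in the Luminance channel, as the slide notes
            dst[x, y] = (int(hue * 255), int(sat * 255), int(lum * 255))
    out.save(dst_path, "PNG")

# to_hsl_png("page_0001.tif", "page_0001_hsl.png")   # hypothetical file names
```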

  7. Document Image Content Types • Now: Handwriting, Machine Print, Line Art, Photos, Junk/Noise, Blank • Soon: Maps, Mathematical Equations, Engineering Drawings, Chemical Diagrams • Gathering a large collection of electronic images: 7,123 page images so far • We attempt to collect samples for each content type in black and white, grey scale, and color • Avoid arbitrary image processing decisions of our own • Carefully “zoned” images are a rare commodity • Our software accepts existing ground truth in the form of rectangular zones • We have developed a ground-truthing tool for images that are not zoned

  8. Coverage of Image and Content Types

  9. Statistical Framework for Classification • Each training & test sample will be a pixel, not a region • Want to avoid the arbitrariness and restrictiveness associated with the choice of a limited class of shapes • Since each image can contain millions of pixels, it is easy to exceed 1 billion training samples • This policy was suggested by Thomas Breuel
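The per-pixel policy above means that ground truth given as rectangular zones expands into one labelled sample per covered pixel. The sketch below illustrates that expansion; the zone tuple format and class names are hypothetical, not the project's actual data format.

```python
# Hypothetical illustration of the per-pixel sampling policy: every pixel
# inside a ground-truthed rectangular zone becomes one labelled sample.
from typing import Iterator, List, Tuple

Zone = Tuple[int, int, int, int, str]   # x0, y0, x1, y1 (exclusive), class label

def pixel_samples(zones: List[Zone]) -> Iterator[Tuple[int, int, str]]:
    """Yield (x, y, label) for every pixel covered by a zone."""
    for x0, y0, x1, y1, label in zones:
        for y in range(y0, y1):
            for x in range(x0, x1):
                yield x, y, label

# A single 2000 x 3000 pixel page already yields 6 million samples, so a few
# hundred fully zoned pages easily exceed 10^9 training points.
zones = [(0, 0, 200, 100, "machine-print"), (0, 100, 200, 300, "handwriting")]
print(sum(1 for _ in pixel_samples(zones)))   # 60000 samples from two small zones
```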

  10. Example of Classifying Pixels • [Figure: input image and pixel-level classification output, with regions labeled Photo, Machine Print, and Handwriting; ~75% of pixels classified correctly]

  11. Features • Simple, local features of each pixel • Average luminosity of every pixel in 1x1, 3x3, 9x9 and 27x27 boxes around the given pixel • Average luminosity of 20 pixels on either side of the given pixel along horizontal and vertical lines and lines of slope +/- 1 and 2 • Also, for each box and line, the average change and maximum change from one pixel to a neighbor • Choice of features is merely expedient • Expect to refine indefinitely
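A sketch of the simplest of the features listed above: the average luminance in k-by-k boxes around every pixel, for the box sizes named on the slide. The integral-image (summed-area table) trick and the numpy style are my assumptions about one reasonable implementation, not the authors' code.

```python
import numpy as np

def box_mean_features(lum: np.ndarray, sizes=(1, 3, 9, 27)) -> np.ndarray:
    """Mean luminance in size x size boxes centred on every pixel.

    lum: 2-D array (H, W) of luminance values.
    Returns an (H, W, len(sizes)) feature array; each box costs O(1) per pixel
    thanks to the summed-area table.
    """
    h, w = lum.shape
    feats = np.empty((h, w, len(sizes)), dtype=np.float64)
    # integral image with a zero border so any box sum is four lookups
    ii = np.zeros((h + 1, w + 1))
    ii[1:, 1:] = lum.cumsum(0).cumsum(1)
    ys, xs = np.mgrid[0:h, 0:w]
    for k, s in enumerate(sizes):
        r = s // 2
        y0, y1 = np.clip(ys - r, 0, h), np.clip(ys + r + 1, 0, h)
        x0, x1 = np.clip(xs - r, 0, w), np.clip(xs + r + 1, 0, w)
        area = (y1 - y0) * (x1 - x0)          # boxes shrink at the image border
        box_sum = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
        feats[:, :, k] = box_sum / area
    return feats
```

The line-based averages and the change features listed on the slide could be computed in the same pass with analogous windowed sums.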

  12. A Nearly Ideal Classifier (For Our Purposes) • K Nearest Neighbor Algorithm • Asymptotic error rate is no worse than twice the Bayes error • Generalizes directly (more readily than, e.g., SVMs) to more than two classes • Competitive in accuracy with more recently developed methodologies • We have implemented a straightforward exhaustive 5-NN search • Aware of many techniques (editing, tree search, etc) for speeding up kNN that work well in practice, but they do not appear to work well in the broadest range of cases • We hope to exploit highly non-uniform data distributions • Explore using hashing techniques with geometric tree search • Intrinsic dimensionality of the data seems low
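For reference, a minimal brute-force sketch of the exhaustive 5-NN search mentioned above: one pass over all training points per query, followed by a majority vote. The Euclidean metric and vectorised numpy style are assumptions.

```python
import numpy as np
from collections import Counter

def knn_classify(train_x: np.ndarray, train_y: np.ndarray, query: np.ndarray, k: int = 5):
    """Exhaustive k-NN: compare the query against every training sample."""
    d2 = ((train_x - query) ** 2).sum(axis=1)     # squared Euclidean distances
    nearest = np.argsort(d2)[:k]                  # indices of the k closest samples
    # majority vote among the k nearest labels
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]
```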

  13. Adaptive k-d Trees • Recursively partition the set of points in stages • At each stage, divide one partition into two sub-partitions • Assume it is possible to choose cuts to achieve balance • Ensures find operations run in O(log n) worst-case time • Final partitions are generally hyperrectangles • This approximates kNN under the infinity norm • The pruning power of k-d trees (Bentley) speeds up range searches • Given a search point (a test sample), it is fast to find the enclosing hyperrectangle • Adaptive k-d trees guarantee shallow trees
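A compact sketch of the adaptive construction described above: at each stage one partition is split in two, here at the median of one coordinate so that both halves stay balanced and the tree stays shallow. Cycling through feature dimensions is an assumption; any rule for picking the cut dimension fits the same recursion.

```python
import numpy as np

def build_adaptive_kd(points: np.ndarray, depth: int = 0):
    """Recursively split at the median of one coordinate (balanced k-d tree)."""
    if len(points) <= 1:
        return {"leaf": points}
    dim = depth % points.shape[1]          # cycle through feature dimensions
    points = points[points[:, dim].argsort()]
    mid = len(points) // 2
    return {
        "dim": dim,
        "cut": points[mid, dim],           # data-dependent median cut keeps balance
        "left": build_adaptive_kd(points[:mid], depth + 1),
        "right": build_adaptive_kd(points[mid:], depth + 1),
    }
```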

  14. We Use Non-adaptive k-d Trees • Constructed in a manner similar to adaptive k-d trees, except the distribution of the data is ignored in generating cuts • Suppose upper and lower bounds are known for each feature; then cuts can be placed at the midpoints of these bounds • No balance guarantee, and the time and space optimality properties of adaptive k-d trees are lost • However, the values of the cut thresholds can be predicted, and as a result the total number of cuts, r, is known • The hyperrectangle any sample lies in can be computed in O(r) time • Computing the cuts is so fast it is negligible
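One way to read the slide is that repeated midpoint cuts within fixed feature bounds amount to taking the leading binary digits of each normalised coordinate, so the side of every predetermined cut can be computed directly, in O(r) total. The sketch below follows that reading; the binary-subdivision ordering is my assumption.

```python
import numpy as np

def cut_bits(x: np.ndarray, lo: np.ndarray, hi: np.ndarray, bits_per_dim: int) -> np.ndarray:
    """Which side of each predetermined midpoint cut a sample falls on.

    With fixed bounds [lo, hi] per feature, successive midpoint cuts in one
    dimension correspond to the binary digits of the normalised coordinate.
    Returns a (d, bits_per_dim) array; bits[:, 0] is the coarsest cut.
    """
    t = np.clip((x - lo) / (hi - lo), 0.0, 1.0 - 1e-12)    # normalise to [0, 1)
    cell = (t * (1 << bits_per_dim)).astype(int)           # integer cell index per dim
    return np.array(
        [[(int(c) >> (bits_per_dim - 1 - b)) & 1 for b in range(bits_per_dim)] for c in cell],
        dtype=np.uint8,
    )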

  15. Bit-Interleaving Addresses • Partitions can be addressed using bit-interleaving • [Figure: example in which six successive cuts, numbered 1 through 6, produce the bit-interleaved partition address 100111]
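A sketch of forming such an address: take the per-dimension cut bits (as in the previous sketch) and interleave them, coarsest cuts first, into a single integer. The fixed dimension ordering is assumed, and the example bits below are chosen only to reproduce the slide's address 100111, not taken from its diagram.

```python
import numpy as np

def bit_interleave(bits: np.ndarray) -> int:
    """Interleave a (d, levels) array of 0/1 cut bits into one integer address."""
    d, levels = bits.shape
    addr = 0
    for level in range(levels):           # coarsest cuts first
        for dim in range(d):
            addr = (addr << 1) | int(bits[dim, level])
    return addr

# e.g. 2-D, 3 levels: dim-0 bits 1,0,1 and dim-1 bits 0,1,1 interleave to 0b100111
print(bin(bit_interleave(np.array([[1, 0, 1], [0, 1, 1]]))))   # 0b100111
```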

  16. Assumption of Bit-Interleaving • Since, in this context, we expect our data to be non-uniformly distributed in feature space, only a small fraction of partitions should contain any training data • Therefore very few distinct bit-interleaved addresses should occur, making it possible to use a dictionary data structure to store them • Experiments show that the number of occupied partitions, as a function of bit-interleaved address length, grows asymptotically cubically (far better than exponentially!)
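A sketch of the dictionary idea: occupied partitions live in a hash map keyed by bit-interleaved address, and a query is answered by an exhaustive k-NN vote restricted to the samples in its own cell. The empty-cell fallback is an assumption; the slides do not say how that case is handled.

```python
from collections import defaultdict

# index: bit-interleaved address -> list of (feature_vector, label) in that cell
index = defaultdict(list)

def train(samples, address_of):
    """address_of maps a feature vector to its bit-interleaved address."""
    for x, label in samples:
        index[address_of(x)].append((x, label))

def classify(x, address_of, k=5):
    """Approximate k-NN: search only the query's own (usually small) cell."""
    cell = index.get(address_of(x), [])
    if not cell:
        return None   # assumed fallback for an empty cell; not specified on the slide
    nearest = sorted(cell, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)
```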

  17. Cubic Growth of Populated Bins

  18. Initial Results and Analysis • n = 844,525 training points, d = 15, tested on 192,405 points • Brute-force 5NN classified 70% correctly but required 163 billion distance calculations • Hashing bit-interleaved addresses of length r = 32 classified 66% correctly, with a speedup of 148 times • Of course this only approximates kNN • A test sample may hash into a cell that does not contain its k nearest neighbors
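The 163 billion figure is simply the full cross-product of test and training points, one distance per pair:

```python
# Sanity check of the distance-calculation count quoted above.
n_train, n_test = 844_525, 192_405
print(n_train * n_test)   # 162,490,832,625, i.e. ~163 billion distance calculations
```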

  19. Approximate, but Quite Good • [Figure: input image and pixel-level classification output, with regions labeled Photo, Machine Print, and Handwriting; ~75% of pixels classified correctly]

  20. Speed Tradeoff

  21. Future Work • Much larger scale experiments • Wider range of content types • More and better features • Explore refined approximate kNN methods • How does the bit-interleaved address (BIA) approach behave on vastly larger data sets? • Paging hash tables that grow too large • Exploit style consistency (isogeny) • Compare to CARTs, Locality Sensitive Hashing, etc

  22. Thank You! Henry S. Baird Michael A. Moll
