
Document Analysis and Recognition


althea

Presentation Transcript


  1. Document Analysis and Recognition CS 661

  2. What is a Document?
     • A written or printed paper that bears the original, official, or legal form of something and can be used to furnish decisive evidence or information.
     • Something, such as a recording or a photograph, that can be used to furnish evidence or information.
     • A writing that contains information.
     • Computer Science. A piece of work created with an application, as by a word processor.
     • Computer Science. A computer file that is not an executable file and contains data for use by applications.

  3. Document Image Analysis
     • DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer
     • DIA is a subfield of digital image processing
     • Digital images of natural objects (X-rays, fingerprints, faces, scenery, etc.) are NOT part of DIA
     • Digital images of symbolic objects (postal addresses, printed articles, forms, music sheets, engineering drawings, topographic maps) belong to DIA
     • Source: scanners, printers, fax machines, hand!
     • Incidental text: license plates, billboards, subtitles, in photos and video
     • WWW ??
     • DIA's grand goal is to take us to the land of the paperless office

  4. Paperless Office?
     • Traditionally, information has been transmitted and stored on paper documents
     • Documents are increasingly originating on the computer
     • Documents are printed for reading, dissemination, and markup
     • Paper in the office has increased!!
     • Goal: Deal with the flow of electronic and paper documents in an efficient and integrated manner
     • Implication: Unlike computer media, paper documents should be readable by both the computer and people

  5. Short Tour of DIA
     • The field started before digital computers could represent information that traditionally appeared on paper
     • Patents on OCR for telegraph and reading machines for the blind were filed in the 19th century, and working models were demonstrated in 1916
     • OCR on specially designed fonts was used in the 1950s
     • The first postal address reader was installed in 1965
     • OCRs to read scanned pages came into their own in the 1980s with the advent of low-cost microprocessors, bit-mapped displays, and scanners
     • Large-capacity storage devices have now ignited the field with the prospect of Digital Libraries
     • Document imaging today is a billion-dollar industry, but document interpretation is only a small part of it

  6. Document Image Analysis
     • Textual Processing
       • Optical Character Recognition: text
       • Page Layout Analysis: skew, blocks, paragraphs
     • Graphical Processing
       • Line Processing: lines, curves, corners
       • Region and Symbol Processing: filled regions

  7. Current
     • Processors are getting faster
     • Storage costs are down
     • Pictures are typically 512 x 512 pixels
     • Speech signals are typically 256 sample points
     • Business letters are typically 2550 x 3300 pixels at 300 dpi
     • Engineering drawings are typically 34000 x 44000 pixels at 1000 dpi
     • Digital libraries need a WWW interface
     • Information retrieval and search
     • OCR accuracy is on the rise
     • Contextual models have improved
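The pixel dimensions quoted on this slide follow from paper size multiplied by scan resolution. A minimal Python sketch, assuming the business letter is an 8.5 x 11 in page and the engineering drawing is 34 x 44 in (that drawing size is inferred from the pixel counts and is not stated on the slide); the function name `pixel_dimensions` is illustrative only.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a sheet scanned at `dpi` dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed paper sizes in inches; only the pixel counts and dpi come from the slide.
letter = pixel_dimensions(8.5, 11, 300)
drawing = pixel_dimensions(34, 44, 1000)

letter_pixels = letter[0] * letter[1]
drawing_pixels = drawing[0] * drawing[1]

print("business letter:", letter, "total pixels:", letter_pixels)
print("engineering drawing:", drawing, "total pixels:", drawing_pixels)
```

Under these assumptions, a single engineering drawing holds well over a hundred times as many pixels as a business letter, which is one reason storage cost and processor speed appear on the same slide.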

  8. Document page: 300 dpi, 8.5 x 11 in, 255 gray x 3 color, 2,550 x 3,300 pixels
     • Data capture: 10^7 pixels
     • Pixel-level processing: 7,500 character boxes, 15 x 20 pixels each; 500 line and curve segments, 20 to 20,000 pixels each; 10 filled regions, 20 x 20 to 200 x 200 pixels each
     • Feature-level processing: 7,500 x 10 character features; 500 x 5 line and curve features; 10 x 5 region features
     • Text analysis & recognition: 1,500 words, 10 paragraphs, 1 title, 2 subtitles, etc.
     • Graphics analysis & recognition: 2 line diagrams, 1 company logo, etc.
     • Document Description
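The numbers on this slide show how much the data shrinks between capture and feature-level processing. A rough Python sketch using only the counts from the slide, and assuming that an entry such as "7500 x 10" means 7,500 items with 10 feature values each; the variable names are illustrative, and the totals count values, not bytes.

```python
# Pixels after data capture: a 2,550 x 3,300 page (on the order of 10^7 pixels).
page_pixels = 2550 * 3300

# Feature values per category, read as (number of items) x (features per item).
# This reading of the slide's "N x M" notation is an assumption.
feature_values = {
    "character": 7500 * 10,
    "line_and_curve": 500 * 5,
    "region": 10 * 5,
}

total_features = sum(feature_values.values())
reduction = page_pixels / total_features

print("pixels captured:", page_pixels)
print("feature values:", total_features)
print("reduction factor: about", round(reduction))
```

On this reading, feature extraction reduces the page to roughly one hundredth of its original number of values before any text or graphics recognition begins.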

  9. Document Image Analysis

  10. Document Taxonomy

  11. Postal Examples
     • Meter Mark
     • Digital Post Mark
     • Sender's Address
     • Endorsement
     • In Case of Undeliverable as Addressed Return to Sender
     • Linear Code
     • Delivery Address

  12. Forms

  13. Unconstrained Text

  14. Graphics Documents

  15. Personal DL

  16. DAS 02, Princeton, NJ
     • OCR Features and Systems: degradation models, script ID, bilingual OCR, Kannada OCR, Tamil OCR, machine-printed versus handwritten checks, traffic ticket reading
     • Handwriting Recognition: stochastic models, holistic methods, Japanese OCR
     • Classifiers and Learning: multi-classifier systems
     • Layout Analysis: skew correction, geometric methods, text/graphics separation, logical labeling
     • Tables and Forms: detecting tables in HTML documents, use of graph grammars, semantics
     • Text Extraction
     • Indexing and Retrieval
     • Document Engineering
     • New Applications: CAPTCHA, tachograph chart system, accessing driving directions

  17. ICDAR 03, Edinburgh, UK
     • Multiple Classifiers
     • Postal Automation and Check Processing
     • Document Understanding
     • HMM Classifiers
     • Segmentation
     • Character Recognition
     • Graphics Recognition
     • Non-Latin Alphabets: Kanji/Chinese, Korean/Hangul, Arabic/Indian
     • Web Documents, Video
     • Word Recognition
     • Image Processing
     • Writer Identification
     • Forms and Tables

  18. CS 661 Class Schedule

  19. Grading
     • Home Assignments and Quizzes: 4 x 10 = 40 points
       • Schedule is tentative to preserve the surprise element
       • Based on class participation and paper handouts
     • Midterm project
       • Demo: 10%
       • Report: 15%
     • Final project
       • Demo: 10%
       • Report: 25%

  20. References
     • Handbook of Character Recognition and Document Image Analysis, H. Bunke and P. S. P. Wang (editors), World Scientific Press
     • Document Image Analysis, L. O'Gorman and R. Kasturi, IEEE Computer Society Press
     • International Conference on Document Analysis and Recognition proceedings
     • International Workshop on Document Analysis Systems proceedings
     • Symposium on Document Image Understanding Technology
