
Document Image Analysis Lecture 5: Metrics

This lecture explores the metrics and modeling involved in document image analysis, including the challenges of character recognition, error rates, and accuracy calculations. It also discusses the importance of correct zoning and document attribute format specifications.


Presentation Transcript


  1. Document Image Analysis, Lecture 5: Metrics
  Richard J. Fateman, University of California – Berkeley
  Henry S. Baird, Xerox Palo Alto Research Center
  UC Berkeley CS294-9, Fall 2000

  2. The course so far…
  • Reminder: all course materials are online: http://www-inst.eecs.berkeley.edu/~cs294-9/
  • Overview of the DIA research field
  • Some applications (postal addresses, checks)
  • Research objectives: more systematic modeling and design
  • Some basic engineering

  3. How well are we doing?
  • Cost to achieve a useful result
  • Compare the digital version to:
    • hand keying / digitizing
    • verification
    • correction
  • Correction cost may dominate total system cost

  4. When is a result nearly correct?
  • Character model:
    • Correct
    • Reject
    • Error
  • String model:
    • Insertion
    • Deletion
    • Rejection
    • Substitution [wrong letter identification]

  5. Using ASCII character labels
  s1 = ABCDEFGHIJKL (ground truth)
  s2 = ACD~~OIIUKL (OCR output; ~ = reject)
  To correct s2: insert B after A; substitute E and F for the two ~'s; substitute G for O, H for I, and J for U; etc.
  (Really, H was recognized as II, and IJ was recognized as U.)
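
  A minimal sketch (mine, not from the lecture) of how the slide's taxonomy applies to such an alignment: given ground truth and OCR output padded to equal length, classify each position. The "-" gap marker and the classify_alignment helper are hypothetical conventions; "~" as the reject symbol follows the slide.

      from collections import Counter

      def classify_alignment(truth: str, ocr: str) -> Counter:
          """Tally the character-model categories over a gap-padded alignment."""
          assert len(truth) == len(ocr)
          tally = Counter()
          for t, o in zip(truth, ocr):
              if t == "-":
                  tally["insertion"] += 1     # OCR emitted a character with no source
              elif o == "-":
                  tally["deletion"] += 1      # a source character was dropped
              elif o == "~":
                  tally["rejection"] += 1     # recognizer declined to decide
              elif t == o:
                  tally["correct"] += 1
              else:
                  tally["substitution"] += 1  # wrong letter identification
          return tally

      # s1 vs. s2 from the slide, padded so the missing B aligns to a gap:
      print(classify_alignment("ABCDEFGHIJKL", "A-CD~~OIIUKL"))
      # Counter({'correct': 6, 'substitution': 3, 'rejection': 2, 'deletion': 1})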

  6. ASCII labels are inadequate. A fuller description needs:
  • Unicode, plus
  • font, plus
  • point size, plus
  • tag information: <author> … </author>

  7. Simple measures may mislead
  If the "error rate" is errors divided by accepted (non-rejected) characters, then increasing the rejection rate decreases it. Reject all characters to get 0/0? Some applications (e.g. the post office) force a very low error rate, even if (low-confidence) correct results are sometimes rejected.
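
  A small illustration of the point, with made-up numbers: rejecting more low-confidence characters "improves" this naive rate.

      def naive_error_rate(errors: int, rejects: int, total: int) -> float:
          """Errors divided by accepted characters -- gameable by rejecting more."""
          accepted = total - rejects
          return errors / accepted if accepted else float("nan")  # 0/0 if all rejected

      print(naive_error_rate(errors=20, rejects=0, total=1000))    # 0.02
      print(naive_error_rate(errors=5, rejects=300, total=1000))   # ~0.0071: "better"?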

  8. Some errors are acceptable
  • Keyword search still works if the keyword occurs many times and is only occasionally misrecognized or rejected.
  • Erroneous (nonsense) words are unlikely to be found by a search.
  • Caveat: if a keyword is consistently changed to a nearby word, it may be missed (e.g. if OCR always produces durnptruck, a search for dumptruck never finds it).

  9. Example: UNLV-ISRI document collection
  • 20 million pages of scientific, legal, and official memos from DOE and its contractors:
    • rock mining
    • maps
    • safe transportation of nuclear waste
  • Average document length: 44 pages

  10. Example: UNLV-ISRI document collection
  • DOE's Licensing Support System prototype
  • 104,000 page images; 2,600 documents
  • Manually typed "correct" text
  • OCR text
  • Three methods used to determine relevance to queries:
    • geology students' rankings (0/1)
    • OCR keyword search
    • "correct"-text search

  11. Example: UNLV-ISRI document collection
  • Exact match on 71 queries:
    • 632 documents returned using the correct text
    • 617 returned using the OCR text
    • Essentially, OCR is OK for this application.
  • Probabilistic ranking / frequency:
    • excessive OCR errors affected ranking
    • on average, similar results
    • relevance feedback did not help for poor OCR
  • Benchmarking: similar relevance = good results

  12. Example: UNLV-ISRI document collection
  One surprising result: on some standard tests of precision and recall, the OCR text did better than the actual text. [Crummy OCR meant that some terms were not recognized; but those documents were irrelevant anyway.]

  13. A theory for computing accuracy
  • Consider the result of OCR to be a string.
  • Idealization: the most common errors involve mis-counting the number of spaces!
  • Ignores size, font, absolute position, etc.
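
  Under this idealization, one plausible preprocessing step (a sketch under my assumptions, not the course's prescribed method) is to collapse whitespace runs before computing edit distance, so that space mis-counts don't swamp the comparison:

      import re

      def normalize_spaces(s: str) -> str:
          """Collapse whitespace runs to single spaces before comparing strings."""
          return re.sub(r"\s+", " ", s).strip()

      print(normalize_spaces("two  words\n here"))  # "two words here"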

  14. Computing the shortest edit distance
  • The same problem arises in bioinformatics sequencing.
  • Associate a cost with each correspondence, for example:
    • match or substitute: cost 0 or 1
    • insert or delete: cost 2

  15. Aligning AUGGAA to ACUGAUGUGA
  [figure: the dynamic-programming distance matrix, rows and columns labeled by the two strings, with one optimal path drawn as a solid line connecting cells]
  Distances were calculated using the following parameters: s(a,b) = 0 when a equals b; s(a,b) = 1 when a differs from b; insert or delete cost = 2. One of the possible optimal paths is indicated by the solid line. It corresponds to the following alignment, of total cost 9 (four inserts/deletes at cost 2 each, plus one substitution):
  ACUGAUGUGA
  A-UG--G-AA
  [explain dynamic programming here?]
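
  A minimal dynamic-programming sketch using the slide's costs (match 0, substitution 1, insert or delete 2); the function name is mine. Run on the slide's strings, it returns 9, matching the alignment above.

      def edit_distance(a: str, b: str, sub_cost: int = 1, indel_cost: int = 2) -> int:
          """Shortest edit distance via dynamic programming, O(n*m) time and space."""
          n, m = len(a), len(b)
          # d[i][j] = cheapest way to turn a[:i] into b[:j]
          d = [[0] * (m + 1) for _ in range(n + 1)]
          for i in range(1, n + 1):
              d[i][0] = i * indel_cost
          for j in range(1, m + 1):
              d[0][j] = j * indel_cost
          for i in range(1, n + 1):
              for j in range(1, m + 1):
                  match = 0 if a[i - 1] == b[j - 1] else sub_cost
                  d[i][j] = min(d[i - 1][j - 1] + match,   # match / substitute
                                d[i - 1][j] + indel_cost,  # delete a[i-1]
                                d[i][j - 1] + indel_cost)  # insert b[j-1]
          return d[n][m]

      print(edit_distance("ACUGAUGUGA", "AUGGAA"))  # 9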

  16. Computing the shortest edit distance
  • Also useful for other tasks (e.g. recognizing speech).
  • There are many ways to organize the dynamic programming, but all are still O(n²).
  • Probably of more interest is word accuracy, or accuracy on non-stopwords (excluding and, the, of, etc.).
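
  A sketch of one simple, order-insensitive reading of non-stopword word accuracy: count how many ground-truth words outside a stopword list also appear in the OCR output. The ISRI accuracy tools are more careful (they respect word order); the stopword list below is illustrative only.

      from collections import Counter

      STOPWORDS = {"and", "the", "of", "a", "to", "in"}  # illustrative list only

      def word_accuracy(truth: str, ocr: str) -> float:
          """Fraction of non-stopword truth words found in the OCR output."""
          truth_words = [w for w in truth.lower().split() if w not in STOPWORDS]
          ocr_counts = Counter(ocr.lower().split())
          hits = 0
          for w in truth_words:
              if ocr_counts[w] > 0:   # each OCR word may be claimed once
                  hits += 1
                  ocr_counts[w] -= 1
          return hits / len(truth_words) if truth_words else 1.0

      print(word_accuracy("safe transport of nuclear waste",
                          "safe transp0rt of nuclear waste"))  # 0.75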

  17. Correct zoning is essential
  • Reading order in multi-column pages
  • How to compare competing programs' performance on repeated headers
  • What to do with figures and logos
  [figure: two columns of digits 1–6, apparently illustrating reading order]

  18. Document Attribute Format Specification (DAFS)
  "While many formats exist for composing a document from electronic storage onto paper, no satisfactory standard exists for the reverse process. DAFS is intended to be a standard for document decomposition. It will be used in applications such as OCR and document image understanding. There are three storage formats: DAFS-Unicode, DAFS-ASCII, and a more compact DAFS-Binary form. DAFS is a file format specification for documents with a variety of uses. It is developed under the Document Image Understanding (DIMUND) project funded by ARPA."
  www.raf.com, Illuminator, UW CD-ROMs (English and Japanese)

  19. DAFS vs. SGML
  • DAFS = SGML + Unicode + CCITT Group 4 fax images
  • SGML requires a DTD (document type definition)
  • SGML is intended for structure, not appearance (e.g. not bold, italic)
  • Images that accidentally contain an ASCII version of <tag> can be problematic
    • solved by putting images in separate files!

  20. Perfect results: how to obtain ground truth?
  • Painfully enter it by hand, or
  • painfully correct OCR results, or
  • compute some kind of average of several OCR programs' outputs

  21. Perfect ground truth: a synthetic approach
  • (Kanungo, UMD): start with TeX;
  • produce the ground truth for layout from TeX;
  • extract character positions and glyphs by analyzing the DVI files.
  • This provides essentially every bit position of each character.

  22. Ground truth
  Next, commit to paper:
  • Print the DVI files.
  • Scan a calibration page.
  • Compute the parameters of the 2D→2D transformation T imposed by the physics of printing and scanning.
  • Scan the printout.
  • Align the page.
  • Run the recognizer.
  • Compare the reported positions (composed with T⁻¹) to the correct ones.
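
  A hedged sketch of the calibration step, assuming the physical distortion is well modeled as a 2D affine transform (the actual model may need more terms): fit T from calibration-mark correspondences by least squares, then map reported positions back through T⁻¹.

      import numpy as np

      def fit_affine(truth_pts: np.ndarray, scan_pts: np.ndarray):
          """Fit scan = A @ truth + t from (N, 2) corresponding points, N >= 3."""
          N = len(truth_pts)
          X = np.hstack([truth_pts, np.ones((N, 1))])       # (N, 3): [x, y, 1]
          M, *_ = np.linalg.lstsq(X, scan_pts, rcond=None)  # (3, 2) solution
          return M[:2].T, M[2]                              # A (2, 2), t (2,)

      def apply_inverse(A, t, pts):
          """Map scanned coordinates back to ground-truth coordinates (T^-1)."""
          return (np.linalg.inv(A) @ (pts - t).T).T

      # Toy usage: recover a slight rotation + offset from 4 calibration marks.
      truth = np.array([[0, 0], [100, 0], [0, 100], [100, 100]], float)
      theta = 0.01
      R = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
      scan = truth @ R.T + np.array([3.0, -2.0])
      A, t = fit_affine(truth, scan)
      print(np.allclose(apply_inverse(A, t, scan), truth))  # True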

  23. Change of pace
  • Assignment 1
  • What does it mean to write a program?
    • Documentation
    • Demo
    • Instructions for use
    • (perhaps optional)
    • Extensions, limitations, discussion
  • Discussion questions
