
Automated Form processing for DTIC Documents

Automated Form processing for DTIC Documents. March 20, 2006. Presented by K. Maly, M. Zubair, S. Zeil. Outline: overall process for handling documents in batches, issues, results, conclusion.


Presentation Transcript


  1. Automated Form processing for DTIC Documents • March 20, 2006 • Presented by K. Maly, M. Zubair, S. Zeil

  2. Outline • Overall process for handling documents in batches • Issues • Results • Conclusion

  3. Overall process for handling documents in batches

  4. Figure: Flowchart of Processing One Document • Start: the input is an OmniPage XML document having 10 pages (first 5 and last 5). • Read the next page and match it against all form templates. • If the number of matched templates is greater than 0, add the page with its templates into the candidate set. Possibly, more than one page will match more than one template; at this time, we do not check how well they matched. • Repeat until no pages are left. • If there are no candidates, move the document to the “unresolved” folder. • Otherwise, get the best candidate (determined by the ratio of the number of fields matched over the total number of fields), extract the metadata, and store the “resolved” results. • End
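The flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Template` class, the representation of a page as a set of recognized field names, and the rule that a template "matches" when at least one of its fields is found are all hypothetical, since the slides do not define them.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Template:
    """Hypothetical form template: a name plus the field names it expects."""
    name: str
    field_names: Set[str]


def match_ratio(page_fields: Set[str], template: Template) -> float:
    """Number of template fields found on the page over the total number of fields."""
    if not template.field_names:
        return 0.0
    return len(page_fields & template.field_names) / len(template.field_names)


def process_document(
    pages: List[Set[str]], templates: List[Template]
) -> Optional[Tuple[int, str]]:
    """Return (page index, template name) of the best candidate, or None if unresolved."""
    candidates: List[Tuple[float, int, str]] = []
    for index, page_fields in enumerate(pages):      # "Read Next Page"
        for template in templates:                   # "Match against all form templates"
            ratio = match_ratio(page_fields, template)
            if ratio > 0:                            # assumption: any matched field counts
                candidates.append((ratio, index, template.name))
    if not candidates:                               # "Have candidates?" -> no
        return None                                  # caller moves the file to "unresolved"
    _, index, name = max(candidates, key=lambda c: c[0])  # "Get the best one"
    return index, name
```

A caller would extract metadata with the returned template and store the result as "resolved", or move the document to the "unresolved" folder when the function returns `None`.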

  5. Issues in form-based metadata extraction • Results of 246 Documents • Results of 100 Documents

  6. Forms are missing some obvious fields • For example, in the following document, the POINT page (first page) has the author, but the form doesn’t. • http://128.82.7.208:9090/dtic/newdocs/sf298/formdocs/pdfs/ADA425677.pdf

  7. In the following form, the caption “REPORT DOCUMENTATION PAGE” is OCRed incorrectly as “REPORT DOCUMENTA110N PAGE”. These types of OCR errors are resolved using edit distance.

  8. The following form has no caption. If the caption of a form page is missing, we recognize the page as a form if more than 10 metadata field names have been found.
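The fallback rule on this slide can be illustrated as follows. This is a minimal sketch, not the project's code: the field-name list is an illustrative subset chosen for the example, and plain substring search stands in for whatever field-name matching the real system uses.

```python
from typing import Sequence

# Illustrative subset of form field names (an assumption, not the project's template data).
FIELD_NAMES = [
    "REPORT DATE", "REPORT TYPE", "DATES COVERED", "TITLE AND SUBTITLE",
    "CONTRACT NUMBER", "GRANT NUMBER", "AUTHOR(S)", "SUPPLEMENTARY NOTES",
    "SUBJECT TERMS", "SECURITY CLASSIFICATION OF", "LIMITATION OF ABSTRACT",
    "NUMBER OF PAGES",
]


def count_field_names(page_text: str, field_names: Sequence[str] = FIELD_NAMES) -> int:
    """Count how many known field names occur in the page text."""
    text = page_text.upper()
    return sum(1 for name in field_names if name in text)


def looks_like_form(
    page_text: str,
    caption: str = "REPORT DOCUMENTATION PAGE",
    min_fields: int = 10,
) -> bool:
    """A page is a form if it has the caption, or more than `min_fields` field names."""
    if caption in page_text.upper():
        return True
    return count_field_names(page_text) > min_fields
```

The threshold of 10 comes from the slide; everything else, including the default caption argument, is a placeholder for the example.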

  9. The following form spans two pages. After finding a form page, we check the following pages, using field name matching, to see if they are part of the form.

  10. In the following form, we have word boundary detection errors for metadata field names. For example, “4. TITLE AND SUBTITLE” appears as “4 . T ITL E A ND SUB TI TL E”. (We use the following sequence for matching field names: exact match, match after removing white space, similar match using edit distance.)
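The three-step matching sequence can be sketched like this. The Levenshtein implementation is the standard dynamic-programming one; the relative threshold `max_ratio` is an assumption made for the example, since the slides do not say how close a "similar match" must be.

```python
import re
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def strip_spaces(text: str) -> str:
    """Remove all whitespace so that word boundary errors no longer matter."""
    return re.sub(r"\s+", "", text)


def match_field_name(ocr_text: str, expected: str, max_ratio: float = 0.2) -> Optional[str]:
    """Return which step matched ("exact", "no-whitespace", "similar"), or None."""
    if ocr_text == expected:
        return "exact"
    a, b = strip_spaces(ocr_text), strip_spaces(expected)
    if a == b:
        return "no-whitespace"
    # Assumed threshold: distance at most max_ratio of the expected length.
    if b and edit_distance(a, b) / len(b) <= max_ratio:
        return "similar"
    return None
```

With these settings, the garbled “4 . T ITL E A ND SUB TI TL E” is caught by the second step, and a caption misread as “REPORT DOCUMENTA110N PAGE” (three substituted characters) is caught by the third.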

  11. Following are parts of two forms, where we can see the variations for the field “17. LIMITATION OF ABSTRACT”. Here we recognize the field name by matching it part by part. If the cell boundary information is available (i.e. "17. LIMITATION OF ABSTRACT" is in one cell), we also rebuild the field name by connecting the texts in the cell (e.g. "17.", "LIMITATION", "OF", "ABSTRACT" ===> "17. LIMITATION OF ABSTRACT") and match it against the defined field name directly. It is worth noting that not all form pages have cell boundary information.
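Both ideas on this slide can be illustrated briefly. This is a sketch under assumptions: the slides do not specify how "part by part" matching works, so it is modeled here as requiring the words of the field name to appear in order among the OCR text pieces, possibly with other text in between.

```python
from typing import Sequence


def rebuild_cell_text(pieces: Sequence[str]) -> str:
    """Join the text pieces found in one cell into a single field-name string."""
    return " ".join(piece.strip() for piece in pieces if piece.strip())


def matches_part_by_part(pieces: Sequence[str], field_name: str) -> bool:
    """True if the words of `field_name` occur in order among the words of `pieces`."""
    remaining = iter(" ".join(pieces).split())
    # `part in remaining` advances the iterator, so order is enforced.
    return all(part in remaining for part in field_name.split())
```

When cell boundaries are known, the rebuilt string can be compared with the defined field name directly; without them, the in-order check is a looser fallback.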

  12. Coverage Type Missing in the Original Document • The title of the third field is missing in the PDF document; it should contain “REPORT TYPE AND DATES COVERED”.

  13. Identified as sf298_1 • The current templates identified this form but failed to extract its metadata because it was a new kind of form. We can handle this case by writing a new template.

  14. OCR Error in the Dates Covered Field • OCR has produced garbage for the third field (From - To) in the Dates Covered field.

  15. OCR Error in the Dates Covered Field • OCR has produced garbage for the third field (From - To) in the Dates Covered field.

  16. Results of 264 Documents • We currently handle six types of forms through templates: five are variations of the sf298 form (Report Documentation Page) and one is another type of form. • Templates can be written to handle any new forms. • Following are the recall and precision results based on 264 documents.
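For reference, recall and precision for one document could be computed as below. The slides do not state how the figures were derived, so this is an assumed field-level definition: a field counts as correct when the extracted value equals the hand-checked value. The dictionaries used with it would be the extractor output and a ground-truth record.

```python
from typing import Dict, Tuple


def precision_recall(extracted: Dict[str, str], truth: Dict[str, str]) -> Tuple[float, float]:
    """Field-level precision and recall for one document (assumed definition)."""
    correct = sum(1 for field, value in extracted.items() if truth.get(field) == value)
    precision = correct / len(extracted) if extracted else 0.0  # correct / fields extracted
    recall = correct / len(truth) if truth else 0.0             # correct / fields that exist
    return precision, recall
```

Under this definition, an OCR-garbled field lowers both figures, while a field the template never extracts lowers only recall.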

  17. Results of 100 Documents

  18. Conclusion • Execution time: the code took 21 hours, 58 minutes to process our testbed of 10K PDF documents. • For the 10K documents, we are getting good results for most of the form classes and relatively poor performance for sf298_3, due to OCR errors.
