NASA Feasibility Study Status Update

Presentation Transcript


  1. NASA Feasibility Study Status Update, September 25, 2006

  2. NASA Milestones
     A. Feasibility Study to identify the NASA document types - Report - May 31, 2006
     B. Form identification and template development - Template set - Aug 31, 2006
     C. Enhance classification algorithm for two specific classes - software packaged - Oct 31, 2006
     D. Process study for inter-organizational collections - configuration software - Dec 1, 2006
     E. Enhance engine to recognize two major classes - software packaged - Dec 15, 2006
     F. Evaluation of extraction process - report - Feb 28, 2006

  3. Form Identification and Template Development: August 31 Deliverable

  4. Form Identification and Template Development: August 31 Deliverable (DEMO)

  5. Active Tasks for future NASA Milestones
     • Standard Intermediate Representation of the Scanned Document (IDM)
     • Design Classification Algorithm

  6. Independent Document Model (IDM)
     • Platform-independent document model
     • Motivation
       • Dramatic XML schema change between OmniPage 14 and 15
       • Ties the template engine to a stable specification
       • Protects from linking directly to a specific OCR product
       • Allows us to include statistics for enhanced feature usage
         • Statistics (e.g. avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc.)
         • Supports Pointpage Detection, Classification
     • Use XSLT 2.0 stylesheets to transform
       • Supporting a new OCR schema only requires generation of a new XSLT stylesheet; the engine does not change
       • Chain a series of sheets to add functionality (CleanML)
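The statistics named on the IDM slide can be illustrated with a minimal sketch. The slides do not show the IDM schema, so the `document`/`page`/`word` elements and the `fontSize` attribute below are assumptions made only for this example; `avgDocWordCount` is left out because the slide does not define it.

```python
import xml.etree.ElementTree as ET

# Hypothetical IDM fragment: element and attribute names are assumptions,
# not the schema used in the study.
SAMPLE_IDM = """
<document>
  <page>
    <word fontSize="18">Validation</word>
    <word fontSize="18">Report</word>
  </page>
  <page>
    <word fontSize="10">A</word>
    <word fontSize="10">lengthy</word>
    <word fontSize="12">discussion</word>
    <word fontSize="12">follows</word>
  </page>
</document>
"""


def mean(values):
    """Arithmetic mean, or 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def document_statistics(idm_xml):
    """Compute word count and average font sizes for one document."""
    root = ET.fromstring(idm_xml)
    # One list of font sizes per page, one entry per word on that page.
    page_sizes = [
        [float(word.get("fontSize")) for word in page.findall("word")]
        for page in root.findall("page")
    ]
    all_sizes = [size for sizes in page_sizes for size in sizes]
    return {
        "wordCount": len(all_sizes),
        "avgDocFontSize": mean(all_sizes),
        "avgPageFontSize": [mean(sizes) for sizes in page_sizes],
    }


if __name__ == "__main__":
    print(document_statistics(SAMPLE_IDM))
```

A page whose average font size is well above the document average is the kind of signal such statistics could feed into page detection or classification, though the slides do not say how they are used.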

  7. IDM Usage
     • Each incoming XML schema requires specific XSLT 2.0 Stylesheet
     • Resulting IDM Doc used for “Form Based” templates
     • IDM transformed into CleanML for “Non-form” templates
     [Diagram: OmniPage 14 XML Doc -> docTreeModelOmni14.xsl -> IDM XML Doc; OmniPage 15 XML Doc -> docTreeModelOmni15.xsl -> IDM XML Doc; Other OCR Output XML Doc -> docTreeModelOther.xsl -> IDM XML Doc; IDM XML Doc -> Form Based Extraction; IDM XML Doc -> docTreeModelCleanML.xsl -> CleanML XML Doc -> Non Form Extraction]
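The routing described on the IDM Usage slide can be sketched as a lookup from OCR source to stylesheet, with the CleanML sheet chained on for non-form templates. The stylesheet file names come from the slide; the source labels (`omnipage14`, `omnipage15`, `other`) and the function name are hypothetical, and actually applying the sheets would need an XSLT 2.0 processor, which is not shown here.

```python
# Stylesheet names are taken from the slide; the dictionary keys are
# hypothetical labels for the OCR source.
STYLESHEET_FOR_SOURCE = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}
CLEANML_STYLESHEET = "docTreeModelCleanML.xsl"


def stylesheet_chain(source, form_based):
    """Return the ordered list of stylesheets to apply to one OCR output.

    The first sheet turns the OCR-specific XML into IDM. Non-form
    extraction adds a second sheet that turns IDM into CleanML.
    """
    try:
        chain = [STYLESHEET_FOR_SOURCE[source]]
    except KeyError:
        raise ValueError(f"no stylesheet registered for OCR source {source!r}") from None
    if not form_based:
        chain.append(CLEANML_STYLESHEET)
    return chain


if __name__ == "__main__":
    print(stylesheet_chain("omnipage15", form_based=True))
    print(stylesheet_chain("omnipage14", form_based=False))
```

Under this arrangement, supporting a new OCR product means adding one dictionary entry and one stylesheet, which matches the slide's claim that the engine itself does not change.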

  8. Classification Algorithm
     • Two approaches:
       • Classification (switching) based on image classification
       • Post-hoc classification via validation

  9. Post-hoc classification via validation
     • Attempt metadata extraction with all plausible templates
     • Validate each result set, assigning confidence scores
       • Field-specific validation rules, which may combine
         - statistical models derived for each field of
           - text length
           - % of words from English dictionary
           - % of phrases from knowledge base prepared for that field
         - pattern matching
     • Select the metadata set with the highest confidence score
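The post-hoc approach (extract with every plausible template, validate each result, keep the best) can be sketched as a simple loop. The two toy templates and the toy validator below are hypothetical stand-ins chosen to keep the example self-contained; they are not the templates or validation rules used in the study.

```python
def classify_post_hoc(document_text, templates, validate):
    """Try every template and return (name, metadata, score) for the best one.

    Returns None when no template produces any metadata.
    """
    best = None
    for name, extract in templates.items():
        metadata = extract(document_text)
        if metadata is None:  # the template did not apply at all
            continue
        score = validate(metadata)
        if best is None or score > best[2]:
            best = (name, metadata, score)
    return best


def _non_empty_lines(text):
    return [line.strip() for line in text.splitlines() if line.strip()]


def title_first_template(text):
    """Hypothetical template: first line is the title, second the author."""
    lines = _non_empty_lines(text)
    if len(lines) < 2:
        return None
    return {"title": lines[0], "author": lines[1]}


def author_first_template(text):
    """Hypothetical template: first line is the author, second the title."""
    lines = _non_empty_lines(text)
    if len(lines) < 2:
        return None
    return {"author": lines[0], "title": lines[1]}


def toy_validate(metadata):
    """Toy confidence: short author names and longer titles look plausible."""
    author_ok = 1.0 if len(metadata["author"].split()) <= 3 else 0.0
    title_ok = 1.0 if len(metadata["title"].split()) >= 4 else 0.0
    return (author_ok + title_ok) / 2


TEMPLATES = {
    "title_first": title_first_template,
    "author_first": author_first_template,
}
SAMPLE = "Validation of Extracted Metadata\nSteven J. Zeil\n"

if __name__ == "__main__":
    print(classify_post_hoc(SAMPLE, TEMPLATES, toy_validate))
```

The document class is never decided up front here: the winning template identifies the class as a side effect of producing the most believable metadata.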

  10. Sample set of extracted metadata bindings

      <metadata>
        <author>Steven J. Zeil</author>
        <organization>Old Dominion University Technical Report 2006-24</organization>
        <reportDate>September 12, 2006</reportDate>
        <title>Validation of Extracted Metadata</title>
        <abstract>A lengthy discussion of techniques for validating metadata is </abstract>
      </metadata>

  11. Validation template customized for the collection

      <val:validate collection="dtic"
                    xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
        <val:average>
          <val:field name="author">
            <val:min>
              <val:length/>
              <val:vocabulary/>
              <val:phrases length="2"/>
              <val:phrases length="3"/>
              <val:phrases length="4"/>
            </val:min>
          </val:field>

  12. Validation template (continued)

          <val:field name="organization">
            <val:min>
              <val:length/>
              <val:vocabulary/>
              <val:phrases length="2"/>
              <val:phrases length="3"/>
              <val:phrases length="4"/>
            </val:min>
          </val:field>
          <val:field name="reportNumber">
            <val:max>
              <val:regexp pattern="Technical Report +\d\d\d\d-\d\d"/>
            </val:max>
          </val:field>
          <val:field name="reportDate">
            <val:max>
              <val:dateFormat/>
            </val:max>
          </val:field>
          <val:field name="abstract">
            <val:min>
              <val:length/>
              <val:dictionary/>
            </val:min>
          </val:field>
        </val:average>
      </val:validate>
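One plausible reading of the template is that each `val:min` or `val:max` combines the scores of the rules inside it, and `val:average` averages the per-field results. The sketch below follows that reading; it is an interpretation, not the Jelly tag library itself. The rule implementations are crude stand-ins (the slides describe statistical models), the vocabulary set is hypothetical, and scoring a missing field as 0.0 is an assumption.

```python
import re


def length_rule(min_words, max_words):
    """Score 1.0 when the word count is in range (stand-in for a statistical model)."""
    def score(text):
        return 1.0 if min_words <= len(text.split()) <= max_words else 0.0
    return score


def vocabulary_rule(vocabulary):
    """Score the fraction of words found in a field-specific vocabulary."""
    def score(text):
        words = [word.strip(".,").lower() for word in text.split()]
        return sum(word in vocabulary for word in words) / len(words) if words else 0.0
    return score


def regexp_rule(pattern):
    """Score 1.0 when the pattern occurs anywhere in the text."""
    compiled = re.compile(pattern)
    def score(text):
        return 1.0 if compiled.search(text) else 0.0
    return score


def combine_min(*rules):
    """Analogue of val:min: a field is only as good as its weakest rule."""
    return lambda text: min(rule(text) for rule in rules)


def combine_max(*rules):
    """Analogue of val:max: a field passes if any one rule is satisfied."""
    return lambda text: max(rule(text) for rule in rules)


def validate(field_rules, metadata):
    """Analogue of val:average: per-field scores plus their mean."""
    scores = {
        name: (rule(metadata[name]) if name in metadata else 0.0)
        for name, rule in field_rules.items()
    }
    return sum(scores.values()) / len(scores), scores


FIELD_RULES = {
    "author": combine_min(length_rule(1, 5), vocabulary_rule({"steven", "zeil"})),
    "reportNumber": combine_max(regexp_rule(r"Technical Report +\d\d\d\d-\d\d")),
}

if __name__ == "__main__":
    sample = {"author": "Steven J. Zeil", "reportNumber": "Technical Report 2006-24"}
    print(validate(FIELD_RULES, sample))
```

Using the minimum inside a field makes every rule a necessary condition, while the maximum suits fields such as report numbers where matching any one accepted pattern is enough.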

  13. Annotated version of the metadata bindings

      <metadata confidence="0.59">
        <author confidence="0.85">Steven J. Zeil</author>
        <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report 2006-24</organization>
        <reportDate confidence="1.0">September 12, 2006</reportDate>
        <title confidence="1.0">Validation of Extracted Metadata</title>
        <abstract confidence="0.3" warning="Unusually short"> A lengthy discussion of techniques for validating metadata is </abstract>
      </metadata>
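The slides do not state how the overall confidence of 0.59 is derived from the per-field values. As a small arithmetic check, the sketch below reads the field confidences from the annotated bindings and averages them two ways: over the five fields present, and over six fields with the template's `reportNumber` (absent from the bindings) counted as 0.0. The second treatment is purely an assumption made for comparison.

```python
import xml.etree.ElementTree as ET

# Annotated bindings copied from the slide.
ANNOTATED = """
<metadata confidence="0.59">
  <author confidence="0.85">Steven J. Zeil</author>
  <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report 2006-24</organization>
  <reportDate confidence="1.0">September 12, 2006</reportDate>
  <title confidence="1.0">Validation of Extracted Metadata</title>
  <abstract confidence="0.3" warning="Unusually short"> A lengthy discussion of techniques for validating metadata is </abstract>
</metadata>
"""


def field_confidences(xml_text):
    """Map each child element's tag to its confidence attribute."""
    root = ET.fromstring(xml_text)
    return {child.tag: float(child.get("confidence")) for child in root}


def mean_confidence(confidences, expected_fields=()):
    """Average the confidences; expected fields that are absent count as 0.0."""
    scores = dict.fromkeys(expected_fields, 0.0)
    scores.update(confidences)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    confidences = field_confidences(ANNOTATED)
    print("present fields only:", mean_confidence(confidences))
    print("with missing reportNumber as 0.0:",
          mean_confidence(confidences, expected_fields=["reportNumber"]))
```

The five present fields average to about 0.714, while the six-field variant gives about 0.595, which is close to the 0.59 shown on the slide. That agreement is suggestive only; the actual aggregation used by the validator is not documented in the slides.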
