
Metadata Extraction



Presentation Transcript


  1. Metadata Extraction Progress Report 12/14/2006

  2. Outline • System Overview • Detailed Structure with Recent Changes • IDM representation of documents • validation & post-hoc classification • Status of Recent & Upcoming Deliverables • Future Directions

  3. System Overview

  4. Detailed Structure with Recent Changes • Input Processing • Form Processing • Post Processing • Nonform Processing

  5. Input Processing • OCR – Omnipage update radically changed XML output • Details later • Study of 10188 DTIC documents found none with POINT (Page Of INTerest) pages outside 1st and last 5 • suspended efforts at more sophisticated POINT page location
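The finding on slide 5 suggests a simple page-selection rule. The following is a minimal sketch only, assuming pages are numbered from 1 and that "1st and last 5" means the first five and the last five pages; `candidate_point_pages` and its `window` parameter are hypothetical names, not part of the system described here.

```python
def candidate_point_pages(page_count, window=5):
    """Return the 1-based page numbers worth scanning for POINT pages.

    Assumption: only the first `window` and last `window` pages are candidates.
    """
    if page_count <= 0:
        return []
    head = range(1, min(window, page_count) + 1)
    tail = range(max(page_count - window + 1, 1), page_count + 1)
    # Short documents have overlapping head and tail, so deduplicate.
    return sorted(set(head) | set(tail))


print(candidate_point_pages(12))  # [1, 2, 3, 4, 5, 8, 9, 10, 11, 12]
```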

  6. Form Processing • Bug fixes and Tuning • Omnipage XML converted to IDM • Main form template engine rewritten to work from IDM

  7. Independent Document Model (IDM) • Platform independent Document Model • Motivation • Dramatic XML Schema Change between Omnipage 14 and 15 • Tie the template engine to a stable specification • Protects from linking directly to a specific OCR product • Allows us to include statistics for enhanced feature usage • Statistics (e.g., avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc.)
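To illustrate the kind of statistics listed on slide 7, here is a rough sketch that computes three of them from a toy document. The element and attribute names (`document`, `page`, `word`, `fontSize`) are invented for this example; the real IDM schema is defined in the IDM specification and may differ.

```python
import xml.etree.ElementTree as ET

# Toy input with hypothetical element names, NOT the actual IDM schema.
SAMPLE = """<document>
  <page><word fontSize="12">Metadata</word><word fontSize="10">Extraction</word></page>
  <page><word fontSize="14">Report</word></page>
</document>"""


def font_statistics(xml_text):
    """Compute word count and average font sizes per page and per document."""
    root = ET.fromstring(xml_text)
    all_sizes = []
    page_averages = []
    for page in root.iter("page"):
        sizes = [float(word.get("fontSize")) for word in page.iter("word")]
        all_sizes.extend(sizes)
        page_averages.append(sum(sizes) / len(sizes) if sizes else 0.0)
    return {
        "wordCount": len(all_sizes),
        "avgDocFontSize": sum(all_sizes) / len(all_sizes) if all_sizes else 0.0,
        "avgPageFontSize": page_averages,
    }


stats = font_statistics(SAMPLE)
print(stats)
```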

  8. Generating IDM • Use XSLT 2.0 stylesheets to transform • Supporting a new OCR schema only requires a new XSLT stylesheet; the engine does not change • Chain a series of sheets to add functionality (CleanML) • Schema Specification Available (http://dtic.cs.odu.edu/devzone/IDM_Specification.doc)

  9. IDM Usage • Each incoming XML schema requires a specific XSLT 2.0 stylesheet • Resulting IDM Doc used for “Form Based” templates • IDM transformed into CleanML for “Non-form” templates • [Diagram: OmniPage 14, OmniPage 15, and other OCR output XML docs pass through docTreeModelOmni14.xsl, docTreeModelOmni15.xsl, and docTreeModelOther.xsl to produce the IDM XML doc used for form-based extraction; docTreeModelCleanML.xsl turns the IDM XML doc into the CleanML XML doc used for non-form extraction]
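The routing on slide 9 can be sketched as a lookup that returns the chain of stylesheets to apply. The stylesheet file names come from the slide; the pairing of each stylesheet with an input type is inferred from the names, and the keys (`"omnipage14"`, `"form"`, and so on) and the function `stylesheet_chain` are hypothetical. The sketch only selects stylesheets; it does not run an XSLT 2.0 processor.

```python
# Stylesheet names are from the slide; the dictionary keys are made up here.
TO_IDM = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}
IDM_TO_CLEANML = "docTreeModelCleanML.xsl"


def stylesheet_chain(source, extraction):
    """List the stylesheets to apply, in order, for one document."""
    if source not in TO_IDM:
        raise ValueError(f"no IDM stylesheet for OCR source {source!r}")
    chain = [TO_IDM[source]]  # every input is first converted to IDM
    if extraction == "nonform":
        chain.append(IDM_TO_CLEANML)  # non-form templates read CleanML
    elif extraction != "form":
        raise ValueError(f"unknown extraction type {extraction!r}")
    return chain


print(stylesheet_chain("omnipage15", "nonform"))
```

Adding a new OCR product then means adding one entry to the lookup and one stylesheet, which matches the point that the engine itself does not change.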

  10. IDM Tool Status • Converters completed to generate IDM from Omnipage 14 and 15 XML • Omnipage 15 proved to have numerous errors in its representation of an OCR’d document • Consequently, not recommended • Form-based extraction engine revised to work from IDM • Non-form engine still works from our older “CleanXML” • converter from IDM to CleanXML completed as stop-gap measure • direct use of IDM deferred pending review of other engine modifications

  11. Post Processing • No significant changes

  12. Nonform Processing • Bug fixes & tuning • Added new validation component • Post-hoc classification • replaces former a priori classification schemes

  13. Validation • Given a set of extracted metadata • mark each field with a confidence value indicating how trustworthy the extracted value is • mark the set with a composite confidence score • Fields and Sets with low confidence scores may be referred for additional processing • automated post-processing • human intervention and correction
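As a sketch of the referral step on slide 13, the function below lists the fields whose confidence falls under a cutoff. The cutoff of 0.5 and the name `fields_needing_review` are assumptions for illustration; the slides do not state what counts as "low" confidence.

```python
def fields_needing_review(confidences, threshold=0.5):
    """Return the names of fields whose confidence is below `threshold`.

    The default threshold is an arbitrary illustrative value.
    """
    return sorted(name for name, score in confidences.items() if score < threshold)


# Field confidences taken from the sample validator output on slide 18.
sample = {"UnclassifiedTitle": 0.943, "PersonalAuthor": 0.622, "ReportDate": 0.0}
print(fields_needing_review(sample))  # ['ReportDate']
```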

  14. Validating Extracted Metadata • Techniques must be independent of the extraction method • A validation specification is written for each collection, combining • Field-specific validation rules • statistical models derived for each field of • text length • % of words from English dictionary • % of phrases from knowledge base prepared for that field • pattern matching
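Two of the tests named on slide 14 are easy to sketch: the share of words found in a dictionary, and a pattern match. This is only an illustration of the idea. The tokenization, the tiny dictionary, and the date regular expression are all assumptions; the slides do not give the actual rules or the required ReportDate format.

```python
import re


def dictionary_fraction(text, dictionary):
    """Fraction of alphabetic words in `text` that appear in `dictionary`."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in dictionary for word in words) / len(words)


# Hypothetical date format (e.g. "18 June 2004"); the real pattern is not given.
DATE_PATTERN = re.compile(r"^\d{1,2} [A-Z][a-z]+ \d{4}$")


def pattern_score(text, pattern=DATE_PATTERN):
    """Score 1.0 when the whole field matches the pattern, else 0.0."""
    return 1.0 if pattern.match(text.strip()) else 0.0


print(pattern_score("18 June 2004"))
print(pattern_score("Accepted this 18th day of June 2004 by:"))
```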

  15. Sample Validation Specification • Combines results from multiple fields
      <val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
        <val:average>
          <val:field name="UnclassifiedTitle">...</val:field>
          <val:field name="PersonalAuthor">...</val:field>
          <val:field name="CorporateAuthor">...</val:field>
          <val:field name="ReportDate">...</val:field>
        </val:average>
      </val:validate>

  16. Validation Spec: Field Tests • Each field is subjected to one or more tests
      …
      <val:field name="PersonalAuthor">
        <val:average>
          <val:length/>
          <val:max>
            <val:phrases length="1"/>
            <val:phrases length="2"/>
            <val:phrases length="3"/>
          </val:max>
        </val:average>
      </val:field>
      <val:field name="ReportDate">
        <val:reportFormat/>
      </val:field>
      ...
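The nesting of `val:average` and `val:max` in the PersonalAuthor spec can be mimicked with small combinator functions. This sketch is not the Jelly tag library itself. The `phrases` scoring rule (fraction of n-word phrases found in a knowledge base), the stand-in length test, and the knowledge-base contents are all assumptions made for the example.

```python
def average(*tests):
    """Combine tests by averaging their scores (mirrors val:average)."""
    return lambda value: sum(test(value) for test in tests) / len(tests)


def maximum(*tests):
    """Combine tests by taking the best score (mirrors val:max)."""
    return lambda value: max(test(value) for test in tests)


def phrases(length, knowledge_base):
    """Assumed rule: fraction of `length`-word phrases found in the knowledge base."""
    def test(value):
        words = value.lower().split()
        grams = [" ".join(words[i:i + length]) for i in range(len(words) - length + 1)]
        if not grams:
            return 0.0
        return sum(gram in knowledge_base for gram in grams) / len(grams)
    return test


def length_ok(value):
    """Stand-in for val:length; always 1.0 in this sketch."""
    return 1.0


# Made-up knowledge base for the PersonalAuthor field.
KB = {"kerrigan", "kathleen a."}
personal_author = average(
    length_ok,
    maximum(phrases(1, KB), phrases(2, KB), phrases(3, KB)),
)
print(personal_author("Kathleen A. Kerrigan"))  # 0.75
```

Here the 2-word test scores highest (1 of 2 phrases matched), so the maximum is 0.5 and the average with the length score of 1.0 is 0.75.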

  17. Sample Input Metadata Set
      <metadata>
        <UnclassifiedTitle>Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
        <PersonalAuthor>Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
        <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
      </metadata>

  18. Sample Validator Output
      <metadata confidence="0.522">
        <UnclassifiedTitle confidence="0.943">Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
        <PersonalAuthor confidence="0.622">Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
        <ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
      </metadata>
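The composite score in this output is consistent with a plain average of the three field scores shown: (0.943 + 0.622 + 0.0) / 3 ≈ 0.522. The check below assumes that fields absent from the input (here CorporateAuthor) are simply left out of the average; the slides do not say how missing fields are handled, and `composite_confidence` is a hypothetical name.

```python
def composite_confidence(field_confidences):
    """Average the per-field confidences (assumes absent fields are skipped)."""
    if not field_confidences:
        return 0.0
    return sum(field_confidences.values()) / len(field_confidences)


sample_output = {"UnclassifiedTitle": 0.943, "PersonalAuthor": 0.622, "ReportDate": 0.0}
print(round(composite_confidence(sample_output), 3))  # 0.522
```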

  19. Classification (a priori) • Previously, we had attempted various schemes for a priori classification • x-y trees • bin classification • Still investigating some • visual recognition

  20. Post-Hoc Classification • Apply all templates to document • results in multiple candidate sets of metadata • Score each candidate using the validator • Select the best-scoring set
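The three steps on slide 20 amount to an argmax over templates. Here is a minimal sketch, assuming each template is a callable that returns a metadata dictionary and the validator is a callable that returns a score; `classify_post_hoc`, the toy templates, and the toy validator are all made up for illustration.

```python
def classify_post_hoc(document, templates, validate):
    """Apply every template, score each candidate, and keep the best one."""
    best = None
    for name, extract in templates.items():
        candidate = extract(document)   # one candidate metadata set per template
        score = validate(candidate)     # composite confidence from the validator
        if best is None or score > best[2]:
            best = (name, candidate, score)
    return best


# Toy templates and validator, purely illustrative.
templates = {
    "thesis": lambda doc: {"title": doc.get("line1", ""), "author": doc.get("line2", "")},
    "report": lambda doc: {"title": doc.get("header", ""), "author": ""},
}


def toy_validate(metadata):
    """Fraction of fields that are non-empty."""
    return sum(bool(value) for value in metadata.values()) / len(metadata)


doc = {"line1": "The Military Extraterritorial Jurisdiction Act", "line2": "Kathleen A. Kerrigan"}
name, metadata, score = classify_post_hoc(doc, templates, toy_validate)
print(name, score)  # thesis 1.0
```

The document class is never predicted up front; it falls out of whichever template produced the best-validated metadata.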

  21. Demo & Experimental Results • Results of 157 documents http://128.82.7.147:8080/dtic/validsum157.jsp

  22. Future Directions

  23. Status of Recent & Upcoming Deliverables • DTIC - Classifier Development (9/19/06) • NASA - Enhance classification algorithm for two specific classes (10/31/2006) • NASA - Process study for inter-organizational collections (configuration software) (12/1/2006) • NASA - Enhance engine to recognize two major classes (12/15/2006)

  24. Classifier Development • DTIC - Classifier Development (9/19/06) • NASA - Enhance classification algorithm for two specific classes   (10/31/2006) • Delayed by difficulties with a priori classification schemes • Now replaced by post hoc validation-based classification • some tuning of validation spec required • cleaning of metadata sources for statistical models • Demo posted 11/15/2006

  25. Configuration • NASA - Process study for inter-organizational collections   (12/1/2006) • extraction engines differentiate by collection-dependent template sets • validation specifications take collection name as a required attribute • used to locate distinct statistical models built for that collection • Regression test framework established • protects against changes or tuning to one collection degrading performance on others
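A regression check of the kind described on slide 25 could compare per-collection scores before and after a change. This sketch is one possible shape, not the framework itself; the function name, the tolerance, and the scores are invented for the example.

```python
def find_regressions(baseline, current, tolerance=0.01):
    """Return collections whose score dropped by more than `tolerance`.

    Both arguments map a collection name to a score; higher is better.
    """
    regressions = {}
    for collection, old_score in baseline.items():
        new_score = current.get(collection)
        # A collection missing from the new run also counts as a regression.
        if new_score is None or new_score < old_score - tolerance:
            regressions[collection] = (old_score, new_score)
    return regressions


# Made-up scores: tuning helped "dtic" but hurt "nasa".
baseline = {"dtic": 0.80, "nasa": 0.70}
after_tuning = {"dtic": 0.85, "nasa": 0.60}
print(find_regressions(baseline, after_tuning))  # {'nasa': (0.7, 0.6)}
```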

  26. Engine Enhancements • NASA - Enhance engine to recognize two major classes (12/15/2006) • in many ways, already satisfied • most planned enhancements deferred due to work on IDM • in the short term, emphasis will be on expanding the template set to exploit existing engine features and the availability of the new post-hoc classifier

  27. END • Questions?

  28. Current System (Detailed)
