
A Training and Classification System in Support of Automated Metadata Extraction



  1. A Training and Classification System in Support of Automated Metadata Extraction PhD Proposal Paul K Flynn 14 May 2009

  2. Motivation • bootstrap

  3. Overview • Background • Metadata Extraction Overview • System Description • Non-form classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule

  4. Overview • Background • Metadata Extraction Overview • System Description • Non-form classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule

  5. Metadata Extraction System • Large, heterogeneous, evolving collections consisting of documents with diverse layout and structure • Defense Technical Information Center (DTIC) • U.S. Government Printing Office (GPO) – Environmental Protection Agency (EPA) • Two types of targets • Forms – documents containing a standardized form filled out with metadata of interest • Non-forms – all others • Developing a system which can be transitioned into a production system

  6. Form Document

  7. Non-form Document

  8. Design concepts • Heterogeneity • A new document is classified, assigning it to a group of documents of similar layout – reducing the problem to multiple homogeneous collections • Associated with each class of document layouts is a template, a scripted description of how to associate blocks of text in the layout with metadata fields (see the sketch below) • Evolution • New classes of documents are accommodated by writing a new template • Templates are comparatively simple, so no lengthy retraining is required • Potentially rapid response to changes in the collection • Robustness • Use of validation techniques to detect extraction problems and to select templates
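
Since the slides do not show a template's concrete form, here is a minimal, hypothetical sketch of what such a scripted layout template might look like; the field names, `FieldRule` attributes, and coordinates are all illustrative assumptions, not the actual template language.

```python
# Hypothetical sketch of a layout template: each metadata field is mapped to a
# rule that locates a block of text in a known layout. Attribute names and
# coordinates are illustrative only.
from dataclasses import dataclass

@dataclass
class FieldRule:
    """Locates one metadata field within a known document layout."""
    page: int              # page the field appears on (0-based)
    region: tuple          # (x0, y0, x1, y1) bounding box on the page
    strip_label: str = ""  # leading label to remove, e.g. "Date:"

# One template per layout class. Supporting a new class means writing another
# small dict like this, rather than retraining a statistical model.
report_template = {
    "title":  FieldRule(page=0, region=(50, 80, 550, 140)),
    "author": FieldRule(page=0, region=(50, 150, 550, 190)),
    "date":   FieldRule(page=0, region=(400, 700, 550, 730), strip_label="Date:"),
}
```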

  9. Architecture & Implementation

  10. [Architecture diagram: Validation Scripts, Final Validation, Classification]

  11. Post-Hoc Classification • Apply all templates to the document • Results in multiple candidate sets of metadata • Score each candidate using the validator • Select the best-scoring set (see the sketch below)
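
A minimal sketch of the post-hoc loop described on this slide; `extract` and `validate` stand in for the extraction engine and validation scripts, which the slides do not specify.

```python
# Post-hoc classification: run every template over the document, score each
# resulting candidate metadata set with the validator, and keep the best one.
def post_hoc_classify(document, templates, extract, validate):
    best = (float("-inf"), None, None)           # (score, template name, metadata)
    for name, template in templates.items():
        candidate = extract(document, template)  # one candidate set per template
        score = validate(candidate)              # semantic plausibility score
        best = max(best, (score, name, candidate), key=lambda t: t[0])
    return best
```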

  12. Post hoc classification shortcomings [figure: example documents where the selected template differs from the correct one]

  13. Classification (a priori) • Replace reliance on post hoc classification alone with a new a priori classification module • Continue to use the validator to provide semantic verification of extracts

  14. Focus: Non-Form Processing • Classification – compare the document against known document layouts • Select the template written for the closest matching layout • Apply the non-form extraction engine to the document and template • Send to the validator for scoring (see the sketch below)
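
A sketch of this pipeline with a priori classification in front; `layout_similarity` is a placeholder for any of the measures discussed later (MxN overlap, MXY-tree edit distance, etc.), and the `classes` objects with `signature` and `template` attributes are assumed for illustration.

```python
# Non-form pipeline: classify first, extract with only the winning template,
# then validate. Contrast with post_hoc_classify, which extracts with every
# template before choosing.
def process_nonform(document, classes, extract, validate, layout_similarity):
    # 1. Classification: pick the closest known layout class.
    best = max(classes, key=lambda c: layout_similarity(document, c.signature))
    # 2. Extraction: apply the template written for that layout.
    metadata = extract(document, best.template)
    # 3. Validation: score the extract semantically before accepting it.
    return metadata, validate(metadata)
```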

  15. Extraction System

  16. Overview • Background • Metadata Extraction Overview • System Description • Non-form classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule

  17. Proposal • Investigate Classification methodologies and implementation • Create Training System for managing and creating templates • Specific questions we will attempt to answer: • Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates? • Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system? • Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development? • Can we significantly decrease the amount of time and manpower to tailor the system to a new collection?

  18. Overview • Background • Metadata Extraction Overview • System Description • Non-form classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule

  19. Previous Research • Extensive coverage in the literature • Model and methodology must match the purpose • No universal classifier • Many experiments use a limited number of classes • Approaches use visual similarity, logical structures, or textual features • Machine learning • Decision trees • Distance measures • Multiple classifiers

  20. Issues and Complexities • Self-imposed constraints • Must be simple to maintain • Automated process for deploying from the Training System • Avoid adding dependencies on additional third-party packages

  21. Unpredictable OCR Results

  22. Unpredictable Page Segmentation

  23. Second page relevancy [figure: page 1 and page 2 of the same document]

  24. Manual Classification • Documents may appear visually similar at thumbnail scale • Closer inspection reveals semantic differences

  25. Incorrect Manual Classification • Position of the date field differs – detectable by post hoc classification

  26. Initial experiments • Methods tested • Block distances – tries to match blocks and measures the distances between them • MxN Overlap – divides the page into bins and counts matches • Common Vocab – finds words common to the pages of a training class • Vocab1 – looks at only the 1st page • Vocab5 – looks at the first five pages • MXY tree – variant that encodes page structure as a string and uses edit distance to measure similarity • MXY + MxN – sums the two methods • Used a best 4-of-5 vote to declare a match (see the sketch below)
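
Two of the pieces above admit short sketches. The grid size, the Jaccard-style overlap measure, and the vote threshold are assumptions filled in where the slides leave the details open.

```python
# MxN overlap: divide the page into an m-by-n grid of bins, mark every bin a
# text block touches, and compare pages by the fraction of bins they share.
from collections import Counter

def mxn_signature(blocks, page_w, page_h, m=8, n=8):
    """Mark each grid bin touched by a text block's (x0, y0, x1, y1) box."""
    grid = set()
    for (x0, y0, x1, y1) in blocks:
        for i in range(int(x0 / page_w * m), min(int(x1 / page_w * m), m - 1) + 1):
            for j in range(int(y0 / page_h * n), min(int(y1 / page_h * n), n - 1) + 1):
                grid.add((i, j))
    return grid

def mxn_overlap(sig_a, sig_b):
    """Jaccard-style overlap between two page signatures."""
    return len(sig_a & sig_b) / max(len(sig_a | sig_b), 1)

# Voting across the five classifiers: declare a class only when at least
# 4 of the 5 agree, otherwise report no match.
def vote(labels, needed=4):
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= needed else None
```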

  27. Summary Sample Results [chart: MXY Tree]

  28. Summary Sample Results [charts: MxN, Block Distance, MXY Tree, MXY Tree - MxN]

  29. Experimental Results • Precision = #correct / #answers • Recall = #correct / #total in class

  30. Experimental Results • Precision = #correct / #answers • Recall = #correct / #total in class (worked example below)
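
The two measures, stated as code for concreteness; the denominators follow the standard information-retrieval definitions (precision divides by answers returned, recall by the class size).

```python
# Precision: of the answers the system returned, how many were correct.
# Recall: of all documents actually in the class, how many were found.
def precision_recall(n_correct, n_answers, n_in_class):
    precision = n_correct / n_answers if n_answers else 0.0
    recall = n_correct / n_in_class if n_in_class else 0.0
    return precision, recall

# Example: 8 correct out of 10 answers, for a class of 16 documents:
print(precision_recall(8, 10, 16))   # -> (0.8, 0.5)
```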

  31. Avenues of Investigation • Implement and test a variety of methods • Handling multi-page classification • Multiple classifiers • Methods of combining them • Weighting the best method for each class • Deriving signatures for specifying combination rules • Clustering methods for bootstrapping

  32. Overview • Background • Metadata Extraction Overview • System Description • Non-form classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule

  33. Training System • Manual classification is • Inaccurate • Time consuming • Not dynamic • Need automated verification of extractions • Developers open original documents multiple times • Correctness is open to interpretation • Regression testing only compares against previous extraction attempts • Need to allow multiple template writers to interact • Need to measure the effects of individual templates on the whole

  34. Metadata Training System [architecture diagram: Training Docs, Baseline Data Manager, Baseline Data Pool, Bootstrap Classifier, Clustered Docs, Candidates, Template Maker, Templates and Classification Signatures, Training Evaluator, Trained Pool, Production System]

  35. Persistence Layer • Manages the complete set of training documents • Tracks baseline data input by users • Allows for independent confirmation • Change tracking • Tracks the subset of documents without templates developed • Tracks trained documents • Allows for multi-user access

  36. Baseline Data Manager • GUI for establishing the baseline • Highlight and copy • Works on OCR output to account for OCR errors • Provides auditing and tracking of changes

  37. Bootstrap Classifier • Dynamic GUI allowing: • Different classifiers • Choice of clustering method • User can flag documents to ignore • User can designate a single document as the match target • Provides output to the Template Maker (see the sketch below)
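
The slides leave the clustering method open; a simple greedy, threshold-based pass is one plausible sketch of how untrained documents could be grouped into candidate classes for template writers. The `similarity` callable, the threshold, and the seed-based grouping are all assumptions.

```python
# Greedy one-pass clustering: each document joins the first cluster whose seed
# it resembles closely enough, otherwise it starts a new candidate group.
def bootstrap_clusters(docs, similarity, threshold=0.8, ignore=frozenset()):
    clusters = []                       # each cluster is a list; [0] is the seed
    for doc in docs:
        if doc in ignore:               # user-flagged documents are skipped
            continue
        for cluster in clusters:
            if similarity(doc, cluster[0]) >= threshold:
                cluster.append(doc)
                break
        else:                           # no existing seed matched
            clusters.append([doc])
    return clusters
```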

  38. Template Maker [screenshot: extraction results for a document alongside the template being developed]

  39. Template Maker • Final form depends on Engine replacement • Sample documents come from Bootstrap Classifier • Also prepares classification spec for export to production system

  40. Training Evaluator • Verifies operation of the template and classification spec • Identifies other documents in the pool which are correctly extracted • Moved to the trained pool • Removed from further bootstrapping • Supports robust regression testing (see the sketch below)
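
A sketch of the evaluator loop, under the assumption that "correctly extracted" means the extract matches the user-entered baseline data exactly; the pool and baseline structures are illustrative.

```python
# Run a new template over the untrained pool; any document whose extract
# matches its baseline record is promoted to the trained pool so it drops
# out of further bootstrapping (and becomes a regression-test case).
def evaluate_template(template, untrained_pool, trained_pool, extract, baseline):
    for doc in list(untrained_pool):    # iterate over a copy: we mutate the pool
        if extract(doc, template) == baseline.get(doc):
            untrained_pool.remove(doc)
            trained_pool.add(doc)
```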

  41. Overview • Background • Metadata Extraction Overview • System Description • Non-form classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule

  42. Testing and Evaluation • Proposed Tests • Evaluate effectiveness of the pre-classification module • Evaluate effectiveness of adding a similarity score to validation • Evaluate the effectiveness of the bootstrap classification • End-to-end evaluation • Other tests as needed

  43. Evaluate effectiveness of pre-classification module • Use a simple baseline classifier to test • Select candidate templates • Adjust the post hoc score (see the sketch below) • Question: Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates?
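
One way this test could be wired up, as a sketch: the baseline classifier ranks templates first, and post-hoc scoring runs only over the top few candidates. `rank` and `post_hoc_score` are assumed helpers, and `k` is an arbitrary cutoff.

```python
# Pre-classification narrows the field before the expensive post-hoc step:
# instead of extracting with every template, score only the k most likely.
def preclassified_post_hoc(document, templates, rank, post_hoc_score, k=3):
    candidates = rank(document, templates)[:k]    # most likely layouts first
    return max(candidates, key=lambda t: post_hoc_score(document, t))
```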

  44. Evaluate effectiveness of adding similarity score to validation • Use the baseline classifier • Determine a baseline cluster of 5 documents to serve as the “signature” targets for measuring similarity • Apply the score to final validation (see the sketch below) • Remove templates to measure the effects • Assessment: evaluate the percentage of documents which are correctly flagged as resolved • Question: Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system?
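
A sketch of one possible combination rule; the linear weighting and the acceptance cutoff are assumptions, since the slides leave the rule open. `signature_docs` is the baseline cluster of 5 documents named above.

```python
# Fold layout similarity into the final accept/reject decision by mixing it
# with the existing validator's semantic score.
def combined_score(doc, metadata, validate, similarity, signature_docs, w=0.3):
    layout = max(similarity(doc, s) for s in signature_docs)  # best of 5 targets
    semantic = validate(metadata)                             # existing validator
    return (1 - w) * semantic + w * layout

def accept(doc, metadata, validate, similarity, signature_docs, cutoff=0.7):
    return combined_score(doc, metadata, validate, similarity,
                          signature_docs) >= cutoff
```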

  45. Evaluate the effectiveness of the bootstrap classification • Measure the amount of time to run completely through the training collection • Isolate template development time by providing an appropriate template • Assessment: compare the time needed to classify the documents against the manual method • Question: Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development?

  46. End-to-end evaluation • Can we significantly decrease the amount of time and manpower needed to tailor the system to a new collection? • Create a mini-collection by downloading 100 documents from DTIC. • Assign two separate teams of trained template writers to create templates that correctly extract metadata from a minimum of 80 documents. • One team will perform the task using manual classification, a version of the Template Maker with the training-system enhancements disabled, and a production system (with no templates) for extraction. • The other team will use the complete training system. • Assessment: The teams will use logs to record work time. We will evaluate the logs to assess time usage and conduct interviews to compile observations and impressions of the system.

  47. Overview • Background • Metadata Extraction Overview • System Description • Non-form classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule

  48. Schedule
