
Metadata Extraction @ ODU for DTIC



  1. Metadata Extraction @ ODU for DTIC Presentation to Senior Management May 16, 2007 Kurt Maly, Steve Zeil, Mohammad Zubair {maly, zeil, zubair}@cs.odu.edu

  2. Outline • Metadata Extraction Project • System overview • Demo • Current status • Why ODU • Research, new technology, Inexpensive, Maintenance (Department commitment) • Why DTIC as Lead • Amortize development cost, Expand template set (helpful in future too), Consistent with DTIC strategic mission • Required enhancements

  3. ODU Metadata Extraction System • Input: pdf documents • processed through OCR (Optical Character Recognition) • Output: metadata in XML format • easily processed for uploading into DTIC databases (demo: 1st document)

  4. System Overview • Processing has two main branches: • Documents with forms (RDPs) • Documents without forms

  5. System Overview

  6. Demo (additional documents)

  7. Documents With RDP Forms • Status • Extracts high-quality metadata for 7 variants of SF-298 and 1 less common RDP form • Tested on over 9000 (unclassified) DTIC documents • Major needs: • Validation & standardization of output

  8. Documents Without Forms • Status • Extracts moderate-quality metadata for 10 common document layouts • Tested on over 600 (unclassified) DTIC documents • Major needs: • Validation & standardization of output • Extraction Engine Enhancements • Expansion of template set to cover most common document layouts

  9. Status • Completely Automated Software for: • Drop in a PDF file • Process it and produce output metadata in XML format • Easy (less than 5 minutes) installation process • Default set of templates for: • RDP-containing documents • Non-form documents • Statistical models of the DTIC collection (800,000 documents) and NASA collection (30,000 documents) • Phrase dictionaries: personal authors, corporate authors • Length and English word presence for title and abstract • Structure of dates, report numbers

  10. Status Metadata Extraction Results for 98 documents randomly selected from the DTIC Collection • * Notes • Accuracy is defined as successful completion of the extractor with reasonable metadata values extracted • “Reasonable” implies that values could be automatically processed (see required enhancements) into standard format • Accuracy for documents without RDP could be enhanced with additional templates (see required enhancements)

  11. Why - software from ODU • Research, new technology • The ODU digital library research group is world class and has made many contributions to advancing the field; $2.5M in funding over the last five years from agencies including the National Science Foundation, the Andrew Mellon Foundation, Los Alamos, Sandia National Laboratory, the Air Force Research Laboratory, NASA Langley, DTIC, and IBM • The state of the art in automated metadata extraction is good for homogeneous collections but not effective for large, evolving, heterogeneous collections (such as DTIC’s) • Need for new methods, techniques, and processes

  12. Why - software from ODU • Inexpensive (relatively) • ODU is a university with low overhead (43%) • Universities can use students and pay them assistantships rather than full-time salaries • The department adds matching tuition waivers for research assistants, a big incentive for students to apply for research work • Faculty are among the best in the field and require only partial funding

  13. Why - software from ODU • Long-term software maintenance through the department • The department commits to continuity independent of the faculty on the project • The department will find and assign faculty and students who can become conversant with the code and maintain it (not evolve it) • It is likely that other faculty would be interested in evolving the code for appropriate funding

  14. Why – DTIC as Lead Agency • Amortize Development Cost • We are working with NASA and plan to get on GPO board soon. NASA gave us partial funding to investigate the applicability of our approach for their collection.

  15. Why – DTIC as Lead Agency • Cross Fertilization • DTIC has distinctive requirements that can benefit from enhancing the metadata extraction technology for other agencies (for example, a richer template set) • Heterogeneity: DTIC collects documents… • of many different types • from an unusually large number of sources • with minimal format restrictions • Evolution: the DTIC collection spans a time frame in which • submission formats changed from typewritten to word-processed and from scanned to electronic • DTIC asserts minimal control over layouts & formats

  16. Why – DTIC as Lead Agency • Consistent with DTIC Strategic Mission • DTIC is the largest organization, with the most diverse collection, and has the stature to disseminate the technology to other government agencies

  17. Required Enhancements – Priority 1 • Enhance portability • Standardized output • Template creation (initial release) • Text PDF input • MS Word input

  18. Required Enhancements – Priority 2 • PrimeOCR input • Multipage metadata • Template Creation (enhanced release) • Template Creation Tool

  19. Required Enhancements – Priority 3 • Human intervention software

  20. Time Line • May 2007 to September 2007 • Add flexibility to the code • Enable the current product to produce standardized output • Create new templates that will cover the larger contributors • Investigate different approaches to handling text PDF documents and finalize the design

  21. Time Line • October 2007 to September 2008 • Validate the extraction against the DTIC-provided cataloging document • Build a module that allows the functional user to create a new template that integrates easily into the extraction software • Create new templates that will cover the larger contributors to DTIC • Create a module that converts PrimeOCR output into IDM • Create the code necessary to extract metadata from more than a single page of non-form documents • Implement support for text PDF as finalized in the first phase • Implement support for Word documents • Create the code necessary to display validation scoring at the document level (for workers) and the collection level (for managers)

  22. Extra slides

  23. Sample RDP

  24. Sample RDP (cont.)

  25. Metadata Extracted from Sample RDP (1/3)
  <metadata templateName="sf298_2">
    <ReportDate>18-09-2003</ReportDate>
    <DescriptiveNote>Final Report</DescriptiveNote>
    <DescriptiveNote>1 April 1996 - 31 August 2003</DescriptiveNote>
    <UnclassifiedTitle>VALIDATION OF IONOSPHERIC MODELS</UnclassifiedTitle>
    <ContractNumber>F19628-96-C-0039</ContractNumber>
    <ContractNumber></ContractNumber>
    <ProgramElementNumber>61102F</ProgramElementNumber>
    <PersonalAuthor>Patricia H. Doherty Leo F. McNamara Susan H. Delay Neil J. Grossbard</PersonalAuthor>
    <ProjectNumber>1010</ProjectNumber>
    <TaskNumber>IM</TaskNumber>
    <WorkUnitNumber>AC</WorkUnitNumber>
    <CorporateAuthor>Boston College / Institute for Scientific Research 140 Commonwealth Avenue Chestnut Hill, MA 02467-3862</CorporateAuthor>

  26. Metadata Extracted from Sample RDP (2/3)
    <ReportNumber></ReportNumber>
    <MonitorNameAndAddress>Air Force Research Laboratory 29 Randolph Road Hanscom AFB, MA 01731-3010</MonitorNameAndAddress>
    <MonitorAcronym>VSBP</MonitorAcronym>
    <MonitorSeries>AFRL-VS-TR-2003-1610</MonitorSeries>
    <DistributionStatement>Approved for public release; distribution unlimited.</DistributionStatement>
    <Abstract>This document represents the final report for work performed under the Boston College contract F I9628-96C-0039. This contract was entitled Validation of Ionospheric Models. The objective of this contract was to obtain satellite and ground-based ionospheric measurements from a wide range of geographic locations and to utilize the resulting databases to validate the theoretical ionospheric models that are the basis of the Parameterized Real-time Ionospheric Specification Model (PRISM) and the Ionospheric Forecast Model (IFM). Thus our various efforts can be categorized as either observational databases or modeling studies.</Abstract>

  27. Metadata Extracted from Sample RDP (3/3)
    <Identifier>Ionosphere, Total Electron Content (TEC), Scintillation, Electron density, Parameterized Real-time Ionospheric Specification Model (PRISM), Ionospheric Forecast Model (IFM), Paramaterized Ionosphere Model (PIM), Global Positioning System (GPS)</Identifier>
    <ResponsiblePerson>John Retterer</ResponsiblePerson>
    <Phone>781-377-3891</Phone>
    <ReportClassification>U</ReportClassification>
    <AbstractClassification>U</AbstractClassification>
    <AbstractLimitaion>SAR</AbstractLimitaion>
  </metadata>

  28. Non-Form Sample (1/2)

  29. Non-Form Sample (2/2)

  30. Metadata Extracted From the Title Page of the Sample Document
  <paper templateid="au">
    <identifier>AU/ACSC/012/1999-04</identifier>
    <CorporateAuthor>AIR COMMAND AND STAFF COLLEGE AIR UNIVERSITY</CorporateAuthor>
    <UnclassifiedTitle>INTEGRATING COMMERCIAL ELECTRONIC EQUIPMENT TO IMPROVE MILITARY CAPABILITIES</UnclassifiedTitle>
    <PersonalAuthor>Jeffrey A. Bohler LCDR, USN</PersonalAuthor>
    <advisor>Advisor: CDR Albert L. St.Clair</advisor>
    <ReportDate>April 1999</ReportDate>
  </paper>

  31. Enhanced Portability • Relax hard-coded system dependencies • Less technical documentation, particularly regarding operational procedures • Improved error logging • Priority: 1 • Duration: 2 months • Impact: easier-to-operate software

  32. Standardized Output • WYSIWYG • What You See is What You Get • WYG != WYW • What You Get is not necessarily What You Want

  33. Standardized Output (cont.) • Field values to adhere to a defined standard: • Title in ‘title’ format, e.g.: This is a Title Well Formed • Date, e.g.: 28 MAR 2007 • Personal authors, e.g.: Leo F. McNamara; Susan H. Delay; Neil J. Grossbard • Contract/grant number, corporate authors, distribution statement, … • Priority: 1 • Duration: 3 months • Impact: better template selection and metadata ready for DB insertion • Dependency: none (a normalization sketch follows below)
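  As a concrete illustration of the date standard above, a minimal Java sketch (not the project's actual code; the class name and candidate patterns are hypothetical):

  import java.text.ParseException;
  import java.text.SimpleDateFormat;
  import java.util.Date;
  import java.util.Locale;

  // Illustrative sketch only: normalize an extracted RDP date such as
  // "18-09-2003" into the standard "18 SEP 2003" form shown on the slide.
  public class DateNormalizer {
      private static final String[] INPUT_PATTERNS = {"dd-MM-yyyy", "d MMMM yyyy", "MMMM yyyy"};

      public static String normalize(String raw) {
          for (String pattern : INPUT_PATTERNS) {
              try {
                  Date parsed = new SimpleDateFormat(pattern, Locale.US).parse(raw.trim());
                  // Target format from the slide: "28 MAR 2007"
                  return new SimpleDateFormat("dd MMM yyyy", Locale.US)
                          .format(parsed).toUpperCase(Locale.US);
              } catch (ParseException ignored) {
                  // try the next candidate pattern
              }
          }
          return raw; // leave unrecognized values untouched for human review
      }

      public static void main(String[] args) {
          System.out.println(normalize("18-09-2003")); // 18 SEP 2003
          System.out.println(normalize("April 1999")); // 01 APR 1999
      }
  }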

  34. Template Creation (initial release) • For RDPs, relatively few templates suffice (5 templates cover 100% of about 9,000 of the 10,000 documents in the testbed), but more are needed • For documents without an RDP, more templates are needed (we currently have 10 templates covering 600 non-RDP documents) to cover the largest DTIC contributors • Requires acquiring and exploiting an updated testbed • from the last three years • documents as they arrived at DTIC • about 5,000 documents needed • Template set to be enhanced still further in later stages • Priority: 1 • Duration: 4 months • Impact: closer to production stage • Dependency: new testbed

  35. Text PDF Input • The current system processes all documents through OCR • allows input of documents that arrive as scanned images • time-consuming • a source of error • An increasing percentage of new DTIC documents arrive as “native” or “text” PDF • Add a processing path to accept text PDF without OCR (a sketch follows below) • Priority: 1 • Duration: 6 months
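  One way such an OCR-free path could look, as a minimal sketch assuming the Apache PDFBox library (an assumption; the presentation names no specific library):

  import java.io.File;
  import java.io.IOException;

  import org.apache.pdfbox.pdmodel.PDDocument;
  import org.apache.pdfbox.text.PDFTextStripper;

  // Illustrative sketch only: pull the embedded text layer from a "native"
  // (text) PDF so it can bypass the OCR step. Apache PDFBox is an assumption.
  public class TextPdfReader {
      public static String extractText(File pdf) throws IOException {
          try (PDDocument document = PDDocument.load(pdf)) {
              PDFTextStripper stripper = new PDFTextStripper();
              // Limit extraction to the pages most likely to carry metadata
              // (cover page / RDP form), mirroring the POINT-page observation.
              stripper.setStartPage(1);
              stripper.setEndPage(Math.min(5, document.getNumberOfPages()));
              return stripper.getText(document);
          }
      }
  }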

  36. MS Word Input • Could be handled via Word ML or by generating text PDFs from Word • Need a solution imposing minimal additional requirements on the operating platform (one possible approach is sketched below) • Priority: 1 • Duration: 2 months
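  As one possibility for the Word ML route, a minimal sketch assuming the Apache POI library (again an assumption; the presentation does not commit to a library or approach):

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.IOException;

  import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
  import org.apache.poi.xwpf.usermodel.XWPFDocument;

  // Illustrative sketch only: read the text of a Word (.docx / WordprocessingML)
  // document so it can feed the same extraction pipeline as text PDF input.
  public class WordInputReader {
      public static String extractText(File docx) throws IOException {
          try (FileInputStream in = new FileInputStream(docx);
               XWPFDocument document = new XWPFDocument(in);
               XWPFWordExtractor extractor = new XWPFWordExtractor(document)) {
              return extractor.getText();
          }
      }
  }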

  37. Required Enhancements • Desirable (Priority 2) • PrimeOCR input • Multipage metadata • Template Creation • Template Creation Tool • Optional (Priority 3) • Human intervention software

  38. Current System (Detailed)

  39. Status – Distribution of Documents • [Charts: distribution of documents with RDP; distribution of documents without RDP]

  40. Input Processing • OCR – an Omnipage update radically changed the XML output • Details later • A study of 10,188 DTIC documents found none with POINT (Page Of INTerest) pages outside the first 5 and last 5 pages (a sketch of this heuristic follows below) • suspended efforts at more sophisticated POINT page location
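  A minimal sketch of that first-and-last-five-pages heuristic (illustrative only; the class and method names are hypothetical, not from the extractor):

  import java.util.LinkedHashSet;
  import java.util.Set;

  // Illustrative sketch only: candidate POINT (Page Of INTerest) pages under the
  // heuristic from the slide -- metadata-bearing pages were always found within
  // the first 5 or last 5 pages of the documents studied.
  public class PointPages {
      public static Set<Integer> candidates(int pageCount) {
          Set<Integer> pages = new LinkedHashSet<>();
          for (int p = 1; p <= Math.min(5, pageCount); p++) {
              pages.add(p);                      // first five pages
          }
          for (int p = Math.max(1, pageCount - 4); p <= pageCount; p++) {
              pages.add(p);                      // last five pages
          }
          return pages;
      }

      public static void main(String[] args) {
          System.out.println(candidates(12)); // [1, 2, 3, 4, 5, 8, 9, 10, 11, 12]
          System.out.println(candidates(3));  // [1, 2, 3]
      }
  }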

  41. Form Processing • Bug fixes and Tuning • Omnipage XML converted to IDM • Main form template engine rewritten to work from IDM

  42. Independent Document Model (IDM) • Platform-independent document model • Motivation • Dramatic XML schema change between Omnipage 14 and 15 • Ties the template engine to a stable specification • Protects against linking directly to a specific OCR product • Allows us to include statistics for enhanced feature usage • Statistics (e.g., avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc.)

  43. Generating IDM • Use XSLT 2.0 stylesheets to transform • Supporting a new OCR schema only requires generating a new XSLT stylesheet; the engine does not change • Chain a series of stylesheets to add functionality (CleanML); a chaining sketch follows below • Schema specification available (http://dtic.cs.odu.edu/devzone/IDM_Specification.doc)
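  A minimal sketch of how such a stylesheet chain could be driven (illustrative only; the JAXP calls are standard, but the use of an XSLT 2.0 processor such as Saxon on the classpath and the input file names are assumptions, not details from the slides):

  import java.io.File;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.stream.StreamResult;
  import javax.xml.transform.stream.StreamSource;

  // Illustrative sketch only: chain the stylesheets named on the next slide so
  // that OmniPage output becomes IDM, and IDM becomes CleanML for non-form use.
  public class IdmPipeline {
      private static void transform(File stylesheet, File input, File output) throws Exception {
          Transformer t = TransformerFactory.newInstance()
                  .newTransformer(new StreamSource(stylesheet));
          t.transform(new StreamSource(input), new StreamResult(output));
      }

      public static void main(String[] args) throws Exception {
          File ocrXml = new File("report_omnipage14.xml");   // hypothetical file names
          File idmXml = new File("report_idm.xml");
          File cleanMl = new File("report_cleanml.xml");

          transform(new File("docTreeModelOmni14.xsl"), ocrXml, idmXml);     // OCR -> IDM
          transform(new File("docTreeModelCleanML.xsl"), idmXml, cleanMl);   // IDM -> CleanML
      }
  }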

  44. IDM Usage • Each incoming XML schema requires a specific XSLT 2.0 stylesheet • Resulting IDM document used for “form-based” templates • IDM transformed into CleanML for “non-form” templates • [Diagram: OmniPage 14 XML, OmniPage 15 XML, and other OCR output are converted to an IDM XML document by docTreeModelOmni14.xsl, docTreeModelOmni15.xsl, and docTreeModelOther.xsl; the IDM document feeds form-based extraction and is converted by docTreeModelCleanML.xsl into a CleanML XML document for non-form extraction]

  45. IDM Tool Status • Converters completed to generate IDM from Omnipage 14 and 15 XML • Omnipage 15 proved to have numerous errors in its representation of an OCR’d document • Consequently, it is not recommended • Form-based extraction engine revised to work from IDM • Non-form engine still works from our older “CleanXML” • a converter from IDM to CleanXML was completed as a stop-gap measure • direct use of IDM deferred pending review of other engine modifications

  46. Post Processing • No significant changes

  47. Nonform Processing • Bug fixes & tuning • Added new validation component • Post-hoc classification • replaces former a priori classification schemes

  48. Validation • Given a set of extracted metadata • mark each field with a confidence value indicating how trustworthy the extracted value is • mark the set with a composite confidence score • Fields and Sets with low confidence scores may be referred for additional processing • automated post-processing • human intervention and correction

  49. Validating Extracted Metadata • Techniques must be independent of the extraction method • A validation specification is written for each collection, combining • field-specific validation rules • statistical models, derived for each field, of • text length • % of words from an English dictionary • % of phrases from a knowledge base prepared for that field • pattern matching (a scoring sketch follows below)
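  A minimal sketch of how such field evidence and a composite score might combine (illustrative only; the weights, length model, and formula are hypothetical, not the project's specification):

  import java.util.Map;
  import java.util.Set;

  // Illustrative sketch only: score an extracted field by combining simple
  // evidence sources like those listed on the slide, then average field scores
  // into a composite, mirroring the <val:average> element in the sample spec.
  public class FieldValidator {

      // Fraction of whitespace-separated tokens found in an English word list.
      static double dictionaryFraction(String value, Set<String> englishWords) {
          String[] tokens = value.toLowerCase().split("\\s+");
          if (tokens.length == 0) return 0.0;
          long hits = 0;
          for (String token : tokens) {
              if (englishWords.contains(token)) hits++;
          }
          return (double) hits / tokens.length;
      }

      // Score length against a per-field statistical model (mean/stddev of length).
      static double lengthScore(String value, double meanLength, double stdDev) {
          double z = Math.abs(value.length() - meanLength) / stdDev;
          return Math.max(0.0, 1.0 - z / 3.0);   // 1.0 near the mean, 0.0 beyond 3 sigma
      }

      // Composite confidence for a metadata set: average of the field confidences.
      static double composite(Map<String, Double> fieldConfidences) {
          return fieldConfidences.values().stream()
                  .mapToDouble(Double::doubleValue).average().orElse(0.0);
      }

      public static void main(String[] args) {
          Set<String> words = Set.of("validation", "of", "ionospheric", "models");
          String title = "VALIDATION OF IONOSPHERIC MODELS";
          double conf = 0.5 * dictionaryFraction(title, words)
                      + 0.5 * lengthScore(title, 60.0, 25.0);  // hypothetical title-length model
          System.out.printf("UnclassifiedTitle confidence: %.2f%n", conf);
          System.out.println("Composite: "
                  + composite(Map.of("UnclassifiedTitle", conf, "ReportDate", 0.9)));
      }
  }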

  50. Sample Validation Specification • Combines results from multiple fields
  <val:validate collection="dtic"
      xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
    <val:average>
      <val:field name="UnclassifiedTitle">...</val:field>
      <val:field name="PersonalAuthor">...</val:field>
      <val:field name="CorporateAuthor">...</val:field>
      <val:field name="ReportDate">...</val:field>
    </val:average>
  </val:validate>
