Metadata Extraction @ ODU for DTIC

Metadata Extraction @ ODU for DTIC
Presentation to Senior Management
May 16, 2007
Kurt Maly, Steve Zeil, Mohammad Zubair
{maly, zeil, zubair}@cs.odu.edu

Outline
• Metadata Extraction Project
  • System overview
  • Demo
  • Current status
• Why ODU
  • Research, new technology, inexpensive, maintenance (department commitment)
• Why DTIC as Lead
  • Amortize development cost, expand template set (helpful in future too), consistent with DTIC strategic mission
• Required enhancements

System Overview

Demo

Status
• Completely automated software:
  • Drop in a PDF file
  • Process it and produce output metadata in XML format
• Easy installation process (less than 5 minutes)
• Default set of templates for:
  • Documents containing an RDP
  • Non-form documents
• Statistical models of the DTIC collection (800,000 documents) and the NASA collection (30,000 documents)
  • Phrase dictionaries: personal authors, corporate authors
  • Length and English word presence for title and abstract
  • Structure of dates, report numbers
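The statistical checks listed above (length and English word presence for titles) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the tiny word list, the length bounds, the 0.5 ratio threshold, and the function names are hypothetical and are not taken from the ODU software, which derives its statistics from the DTIC and NASA collections.

```python
import re

# Hypothetical stand-in for a real English dictionary; a production
# system would load a full word list instead of this tiny sample.
ENGLISH_WORDS = {
    "a", "and", "analysis", "extraction", "for", "metadata",
    "of", "on", "report", "technical", "the",
}

# Assumed bounds on title length in words; the slides say such limits come
# from collection statistics but do not give the actual values.
TITLE_MIN_WORDS = 2
TITLE_MAX_WORDS = 30


def english_ratio(text):
    """Return the fraction of alphabetic tokens found in the word list."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in ENGLISH_WORDS for w in words) / len(words)


def plausible_title(text, min_ratio=0.5):
    """Accept a candidate title if its length and English content look normal."""
    word_count = len(text.split())
    if not (TITLE_MIN_WORDS <= word_count <= TITLE_MAX_WORDS):
        return False
    return english_ratio(text) >= min_ratio


if __name__ == "__main__":
    for candidate in ("Technical Report on Metadata Extraction", "xq zzv 1234 qqq"):
        print(candidate, "->", plausible_title(candidate))
```

A candidate that fails such a check would be flagged rather than silently accepted, which is one plausible way extracted values could be screened before they reach a cataloger.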

Status
Metadata extraction results for 98 documents that were randomly selected from the DTIC collection
* Notes
• Accuracy is defined as successful completion of the extractor with reasonable metadata values extracted
• “Reasonable” implies that values could be automatically processed (see required enhancements) into standard format
• Accuracy for documents without RDP could be enhanced with additional templates (see required enhancements)
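The note that "reasonable" values could be automatically processed into a standard format can be made concrete with a small sketch for dates. The choice of ISO 8601 as the target, the list of accepted input layouts, and the function name `standardize_date` are all assumptions for illustration; the slides do not say which standard format DTIC requires.

```python
from datetime import datetime

# Assumed input layouts an extractor might return for a report date.
CANDIDATE_FORMATS = (
    "%B %d, %Y",   # May 16, 2007
    "%d %B %Y",    # 16 May 2007
    "%B %Y",       # May 2007 (day defaults to 1)
    "%m/%d/%Y",    # 05/16/2007
)


def standardize_date(raw):
    """Return the date as an ISO 8601 string, or None if no layout matches."""
    cleaned = " ".join(raw.split())  # collapse stray whitespace from extraction
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


if __name__ == "__main__":
    for value in ("May 16, 2007", "16 May 2007", "March  2007", "not a date"):
        print(repr(value), "->", standardize_date(value))
```

Under this reading, a value counts as "reasonable" when some normalizer of this kind returns a result, and a `None` would mark the field for review.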

Why - software from ODU
• Research, new technology
  • The ODU digital library research group is world class and has made many contributions to advancing the field, with $2.5M in funding over the last five years from various agencies: National Science Foundation, Andrew Mellon Foundation, Los Alamos, Sandia National Laboratory, Air Force Research Laboratory, NASA Langley, DTIC, and IBM
  • The state of the art in automated metadata extraction is good for homogeneous collections but not effective for large, evolving, heterogeneous collections (such as DTIC’s)
  • Need for new methods, techniques, and processes

Why - software from ODU
• Inexpensive (relatively)
  • ODU is a university with low overhead (43%)
  • Universities can use students and pay them assistantships rather than full-time salaries
  • The department adds matching tuition waivers for research assistants, which is a big incentive for students to apply for research work
  • Faculty are among the best in the field and require only partial funding

Why - software from ODU
• Long-term software maintenance through the department
  • The department commits to continuity independent of the faculty on the project
  • The department will find and assign faculty and students who can become conversant with the code and maintain it (not evolve it)
  • It is likely that other faculty would be interested in evolving the code given appropriate funding

Why – DTIC as Lead Agency
• Amortize development cost
  • We are working with NASA and plan to get on GPO board soon
  • NASA gave us partial funding to investigate the applicability of our approach for their collection

Why – DTIC as Lead Agency
• Cross fertilization
  • DTIC has distinctive requirements; meeting them enhances the metadata extraction technology in ways that can benefit other agencies (for example, a richer template set)
• Heterogeneity: DTIC collects documents…
  • of many different types
  • from an unusually large number of sources
  • with minimal format restrictions
• Evolution: the DTIC collection spans a time frame in which
  • submission formats change from typewritten to word processed, and from scanned to electronic
  • DTIC asserts minimal control over layouts and formats

Why – DTIC as Lead Agency
• Consistent with DTIC strategic mission
  • DTIC is the largest organization with the most diverse collection and has the stature to disseminate the technology to other government agencies

Required Enhancements – Priority 1
• Enhance portability
• Standardized output
• Template creation (initial release)
• Text PDF input
• MS Word input

Required Enhancements – Priority 2
• PrimeOCR input
• Multipage metadata
• Template creation (enhanced release)
• Template creation tool

Required Enhancements – Priority 3
• Human intervention software

Time Line
• May 2007 to September 2007
  • Add flexibility to the code
  • Enable the current product to produce standardized output
  • Create new templates that will cover the larger contributors
  • Investigate different approaches to handle text PDF documents and finalize the design?

Time Line
• October 2007 to September 2008
  • Validate the extraction according to the DTIC-provided cataloging document
  • Create a module that allows the functional user to create a new template that integrates easily into the extraction software
  • Create new templates that will cover the larger contributors of DTIC
  • Create a module that converts PrimeOCR output into IDM
  • Create the code necessary for non-form documents to have metadata extracted from more than one page
  • Implement the support for text PDF as finalized in the first part
  • Implement support for Word documents
  • Create the code necessary to display the validation scoring at the document level (for workers) and the collection level (for managers)
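The last item, validation scoring shown at the document level and at the collection level, could be sketched as follows. This is only an illustration under stated assumptions: the scoring rule (fraction of fields passing validation, averaged over documents), the field names, and the function names are hypothetical, since the slides do not define how scores are computed.

```python
from statistics import mean


def document_score(field_results):
    """Score one document as the fraction of its fields that passed validation.

    field_results maps a field name (e.g. "title") to True or False.
    """
    if not field_results:
        return 0.0
    return sum(field_results.values()) / len(field_results)


def collection_score(documents):
    """Score a collection as the mean of its document scores.

    documents maps a document identifier to that document's field_results.
    """
    scores = [document_score(fields) for fields in documents.values()]
    return mean(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical validation outcomes for two documents.
    results = {
        "doc-001": {"title": True, "author": True, "date": False, "report_number": True},
        "doc-002": {"title": True, "author": False, "date": False, "report_number": True},
    }
    for doc_id, fields in results.items():
        print(doc_id, document_score(fields))   # per-document view for workers
    print("collection", collection_score(results))  # aggregate view for managers
```

Keeping the per-document score separate from the aggregate lets the same validation results feed both views without recomputation.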
