Metadata Extraction for NASA Collection

June 21, 2007
Kurt Maly, Steve Zeil, Mohammad Zubair
{maly, zeil, zubair}@cs.odu.edu

Outline
• Metadata Extraction Project
• System overview
• Demo
• What can ODU do for NASA
• Current Status and Required enhancements
• Why ODU
• Cost Estimate

ODU Metadata Extraction System
• Input: PDF documents
  • Processed through OCR (Optical Character Recognition)
• Output: metadata in XML format
  • Easily processed for uploading into any database
(demo: 1st document)
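The output step above can be illustrated with a minimal sketch. It assumes the extracted metadata is available as a dictionary; the element names (`metadata`, `title`, `author`, `date`) and the sample values are hypothetical and are not the actual ODU output schema.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(fields):
    """Serialize a dict of extracted metadata fields to an XML string.

    Field names are hypothetical; list values produce one element per item.
    """
    root = ET.Element("metadata")
    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            child = ET.SubElement(root, name)
            child.text = item
    return ET.tostring(root, encoding="unicode")


# Illustrative record only; not taken from a real NASA document.
record = {
    "title": "Example Report Title",
    "author": ["A. Author", "B. Author"],
    "date": "2007-06-21",
}
xml_text = metadata_to_xml(record)
print(xml_text)
```

A flat record like this can be read back with any XML parser, which is what makes loading it into a database straightforward.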

System Overview
• Processing has two main branches:
  • Documents with forms (RDPs)
  • Documents without forms
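One way to picture the two-branch split is a simple router over the OCR text of each page. This is only a sketch: the marker phrases and the function names `has_rdp_form` and `choose_branch` are assumptions made for illustration, and the slides do not say how the ODU system actually detects a form.

```python
import re

# Assumed marker phrases for a Report Documentation Page (SF-298).
# \s+ tolerates line breaks that OCR may insert between words.
RDP_MARKERS = [
    re.compile(r"report\s+documentation\s+page", re.IGNORECASE),
    re.compile(r"standard\s+form\s+298", re.IGNORECASE),
]


def has_rdp_form(page_texts):
    """Return True if any page's OCR text contains an assumed RDP marker."""
    return any(marker.search(text) for text in page_texts for marker in RDP_MARKERS)


def choose_branch(page_texts):
    """Pick the processing branch for a document given its per-page OCR text."""
    return "form" if has_rdp_form(page_texts) else "non-form"


print(choose_branch(["REPORT DOCUMENTATION\nPAGE  Form Approved"]))
print(choose_branch(["An Analysis of Wing Loading", "Abstract ..."]))
```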


Demo (additional documents)

What Can ODU Do for NASA
• Automate processing of form-containing documents at the NASA site
• Automate document processing for 80% of the collection with a minimal set of metadata
• Provide an interface for human intervention for the remaining 20%
• Develop a general reporting tool for management on the accuracy of the process

Current Status
• Completely automated software for:
  • Drop in PDF file
  • Process and produce output metadata in XML format
• Easy (less than 5 minutes) installation process
• Default set of templates for:
  • RDP-containing documents
  • Non-form documents
• Statistical models of NASA collection (30,000 documents)
  • Phrase dictionaries: personal authors, corporate authors
  • Length and English word presence for title and abstract
  • Structure of dates, report numbers
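The statistical checks listed above (length, English word presence, date structure) can be sketched as simple plausibility tests on an extracted value. Everything specific here is an assumption for illustration: the tiny word list, the thresholds, and the date pattern are hypothetical stand-ins, not values derived from the NASA collection.

```python
import re

# Tiny stand-in word list (assumption); a real check would use a full dictionary.
ENGLISH_WORDS = {
    "a", "an", "and", "for", "in", "of", "on", "the", "to",
    "analysis", "collection", "extraction", "metadata", "report",
}

# Assumed date structure: "June 2007" or "June 21, 2007".
DATE_PATTERN = re.compile(
    r"^(January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+(\d{1,2}\s*,\s*)?\d{4}$"
)


def english_fraction(text):
    """Fraction of alphabetic tokens found in the stand-in word list."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in ENGLISH_WORDS for word in words) / len(words)


def plausible_title(text, min_len=10, max_len=300, min_english=0.5):
    """Length and English-word checks; thresholds are hypothetical."""
    return min_len <= len(text) <= max_len and english_fraction(text) >= min_english


def plausible_date(text):
    """Check that a candidate date matches the assumed structure."""
    return DATE_PATTERN.match(text.strip()) is not None


print(plausible_title("Metadata Extraction for NASA Collection"))
print(plausible_date("June 21, 2007"))
```

Values that fail such checks are natural candidates for routing to human review rather than being accepted automatically.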

Current Status
Metadata extraction results for 25 documents randomly selected from the NASA collection (results table not included in this transcript)
Notes:
• Accuracy is defined as successful completion of the extractor with reasonable metadata values extracted
• "Reasonable" implies that values could be automatically processed into a standard format (see required enhancements)
• Accuracy for documents without RDP could be enhanced with additional templates (see required enhancements)

Current Status
• Documents with RDP forms
  • Extracts high-quality metadata for 2 variants of SF-298
  • Tested on 154 NASA documents
• Documents without RDP forms
  • Extracts moderate-quality metadata for 9 common document layouts
  • Tested on 574 NASA documents

Required Enhancements
• Develop a complete template set
• Standardize output and integrate with the existing process at the NASA site
• Provide a tutorial for operation and template writing

Required Enhancements
• Develop a statistical model of the target collection
• Write a default template set to cover at least 80% of the known collection
• Provide an oracle for detection of problem cases
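The 80% coverage goal above can be read as a selection problem: given how often each layout occurs in the collection, pick the most frequent layouts until the target share is reached. The sketch below assumes layout frequencies are already known; the layout names and counts are made-up illustrative inputs, not measurements from the NASA collection.

```python
def templates_for_coverage(layout_counts, target=0.80):
    """Pick layouts in descending frequency until `target` coverage is reached.

    Returns the chosen layouts and the fraction of documents they cover.
    """
    total = sum(layout_counts.values())
    if total == 0:
        return [], 0.0
    chosen, covered = [], 0
    ranked = sorted(layout_counts.items(), key=lambda kv: kv[1], reverse=True)
    for layout, count in ranked:
        if covered / total >= target:
            break
        chosen.append(layout)
        covered += count
    return chosen, covered / total


# Hypothetical layout counts, for illustration only.
counts = {"layout_a": 50, "layout_b": 25, "layout_c": 15, "layout_d": 10}
chosen, coverage = templates_for_coverage(counts)
print(chosen, coverage)
```

Documents whose layout is not in the chosen set are the ones an oracle would need to flag as problem cases.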

Required Enhancements
• Develop an interface for showing the scoring of output and its location in the document
• Develop interactive modules for correcting metadata
• Develop a driver for creating output in the desired format

Required Enhancements
• Develop statistical description of input flow of documents
• Develop statistical descriptions of output flow of metadata records
  • Accuracy
  • Computer time to process
  • Human time to validate/correct
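The three output-flow measures named above could be summarized per batch along these lines. This is a sketch under assumptions: the record type `DocumentOutcome`, its field names, and the sample values are hypothetical and only show the shape of such a report.

```python
from statistics import mean
from typing import NamedTuple


class DocumentOutcome(NamedTuple):
    """Per-document outcome; field names are hypothetical."""
    accepted: bool        # metadata accepted without correction
    cpu_seconds: float    # computer time to process
    human_seconds: float  # human time to validate/correct


def summarize(outcomes):
    """Aggregate accuracy and mean processing times over a batch."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    return {
        "accuracy": sum(o.accepted for o in outcomes) / len(outcomes),
        "mean_cpu_seconds": mean(o.cpu_seconds for o in outcomes),
        "mean_human_seconds": mean(o.human_seconds for o in outcomes),
    }


# Illustrative values only; not measured results.
batch = [
    DocumentOutcome(True, 2.0, 0.0),
    DocumentOutcome(False, 4.0, 60.0),
]
summary = summarize(batch)
print(summary)
```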

Why - software from ODU
• Research, new technology
  • The ODU digital library research group is world class and has made many contributions to advancing the field. $2.5M in funding over the last five years from various agencies: National Science Foundation, Andrew Mellon Foundation, Los Alamos, Sandia National Laboratory, Air Force Research Laboratory, NASA Langley, DTIC, and IBM
  • The state of the art in automated metadata extraction is good for homogeneous collections but not effective for large, evolving, heterogeneous collections (such as NASA's)
  • Need for new methods, techniques, and processes

Why - software from ODU
• Inexpensive (relatively)
  • ODU is a university with low overhead (43%)
  • Universities can use students and pay them assistantships rather than full-time salaries
  • The department adds matching tuition waivers for research assistants, which is a big incentive for students to apply for research work
  • Faculty are among the best in the field and require only partial funding

Why - software from ODU
• Long-term software maintenance through the department
  • The department commits to continuity independent of the faculty on projects
  • The department will find and assign faculty and students who can become conversant with the code and maintain it (not evolve it)
  • It is likely that other faculty would be interested in evolving the code for appropriate funding

Cost of Possible Project
• For a 15-month project for a significant collection, best estimate of the cost for NASA if done in isolation: $160,000
• For the same 15-month project if done in parallel with DTIC (and possibly GPO), cost for NASA: $90,000
