Metadata Extraction for NASA Collection

June 21, 2007. Kurt Maly, Steve Zeil, Mohammad Zubair, {maly, zeil, zubair}@cs.odu.edu

Presentation Transcript


  1. Metadata Extraction for NASA Collection. June 21, 2007. Kurt Maly, Steve Zeil, Mohammad Zubair, {maly, zeil, zubair}@cs.odu.edu

  2. Outline
     • Metadata Extraction Project
     • System Overview
     • Demo
     • What Can ODU Do for NASA
     • Current Status and Required Enhancements
     • Why ODU
     • Cost Estimate

  3. ODU Metadata Extraction System
     • Input: PDF documents
       • Processed through OCR (Optical Character Recognition)
     • Output: metadata in XML format
       • Easily processed for uploading into any database
     (Demo: first document)
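A minimal sketch of what the XML output step could look like, assuming a flat record of extracted fields. The element names (`metadata`, `title`, `author`, `date`) and the sample values are hypothetical and are not taken from the ODU system's actual schema.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(fields):
    """Serialize a flat dict of extracted metadata fields to an XML string.

    A list value (e.g. several authors) becomes one element per entry.
    """
    root = ET.Element("metadata")  # hypothetical root element name
    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            child = ET.SubElement(root, name)
            child.text = item
    return ET.tostring(root, encoding="unicode")


# Illustrative record; the values are made up for this example.
record = {
    "title": "Example Technical Report",
    "author": ["A. Author", "B. Author"],
    "date": "2007-06-21",
}
xml_text = metadata_to_xml(record)
print(xml_text)
```

Because the result is plain XML, a loader for any target database only needs a standard XML parser to read the fields back.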

  4. System Overview
     • Processing has two main branches:
       • Documents with forms (RDPs, Report Documentation Pages)
       • Documents without forms
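The two-branch split can be sketched as a small dispatcher over the OCR text of each page. This is an illustration only: it assumes the form page can be recognized by the heading "REPORT DOCUMENTATION PAGE", and the function name `choose_branch` is hypothetical. The slides do not say how the actual system detects forms.

```python
import re

# Assumption: an RDP page carries the heading "REPORT DOCUMENTATION PAGE".
# OCR may vary case and spacing, so the pattern is tolerant of both.
FORM_MARKER = re.compile(r"report\s+documentation\s+page", re.IGNORECASE)


def choose_branch(page_texts):
    """Pick a processing branch from a list of per-page OCR text strings.

    Returns ("form", page_index) for the first page that looks like an RDP,
    otherwise ("non-form", None).
    """
    for index, text in enumerate(page_texts):
        if FORM_MARKER.search(text):
            return ("form", index)
    return ("non-form", None)


pages = ["Cover page text", "REPORT  DOCUMENTATION PAGE\nForm Approved"]
print(choose_branch(pages))
```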

  5. System Overview

  6. Demo (additional documents)

  7. What Can ODU Do for NASA
     • Automate processing of form-containing documents at the NASA site
     • Automate document processing for 80% of the collection with a minimal set of metadata
     • Provide an interface for human intervention for the remaining 20%
     • Develop a general reporting tool for management on the accuracy of the process

  8. Current Status
     • Completely automated software for:
       • Drop in PDF file
       • Process and produce output metadata in XML format
     • Easy (less than 5 minutes) installation process
     • Default set of templates for:
       • RDP-containing documents
       • Non-form documents
     • Statistical models of the NASA collection (30,000 documents)
       • Phrase dictionaries: personal authors, corporate authors
       • Length and English word presence for title and abstract
       • Structure of dates, report numbers
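To illustrate how statistics such as field length, English word presence, and date structure might be used to sanity-check extracted values, here is a hedged sketch. The word list, the length range, the date pattern, and the 0.5 cutoff are toy stand-ins invented for this example; the real models were derived from the 30,000-document collection and are not described in the slides.

```python
import re

# Toy stand-ins for collection statistics (all values are assumptions).
ENGLISH_WORDS = {
    "a", "an", "the", "of", "for", "and", "in", "on",
    "analysis", "metadata", "extraction", "system", "report",
}
TITLE_LENGTH_RANGE = (3, 30)  # plausible title length, in words
DATE_PATTERN = re.compile(
    r"^(January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}$"
)


def english_fraction(text):
    """Fraction of alphabetic tokens found in the English word list."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in ENGLISH_WORDS for word in words) / len(words)


def title_is_plausible(title, min_english=0.5):
    """Check title length and English word presence against the toy model."""
    low, high = TITLE_LENGTH_RANGE
    return low <= len(title.split()) <= high and english_fraction(title) >= min_english


def date_is_plausible(value):
    """Check that a date has the expected 'Month YYYY' structure."""
    return DATE_PATTERN.match(value.strip()) is not None


print(title_is_plausible("Metadata Extraction for the NASA Collection"))
print(date_is_plausible("June 2007"))
```

A similar check could be written for abstracts and report numbers by swapping in the corresponding length range or pattern.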

  9. Current Status: metadata extraction results for 25 documents randomly selected from the NASA collection
     Notes:
     • Accuracy is defined as successful completion of the extractor with reasonable metadata values extracted
     • "Reasonable" implies that the values could be automatically processed into a standard format (see Required Enhancements)
     • Accuracy for documents without an RDP could be improved with additional templates (see Required Enhancements)

  10. Current Status
      • Documents with RDP forms
        • Extracts high-quality metadata for 2 variants of SF-298
        • Tested on 154 NASA documents
      • Documents without RDP forms
        • Extracts moderate-quality metadata for 9 common document layouts
        • Tested on 574 NASA documents

  11. Required Enhancements
      • Develop a complete template set
      • Standardize output and integrate with the existing process at the NASA site
      • Provide a tutorial for operation and template writing

  12. Required Enhancements
      • Develop a statistical model of the target collection
      • Write a default template set to cover at least 80% of the known collection
      • Provide an oracle for detection of problem cases
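One way to picture the proposed oracle is as a filter that sends doubtful records to a human review queue and lets the rest pass automatically. The sketch below assumes each extracted record comes with per-field confidence scores between 0 and 1. The required field names, the 0.6 threshold, and the function names are hypothetical and do not describe the planned implementation.

```python
REQUIRED_FIELDS = ("title", "author", "date")  # hypothetical minimal metadata set


def is_problem_case(record, scores, threshold=0.6):
    """Flag a record if a required field is missing or scored below threshold."""
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            return True
        if scores.get(field, 0.0) < threshold:
            return True
    return False


def split_for_review(items, threshold=0.6):
    """Split (record, scores) pairs into accepted and needs-review lists."""
    accepted, review = [], []
    for record, scores in items:
        target = review if is_problem_case(record, scores, threshold) else accepted
        target.append(record)
    return accepted, review


items = [
    ({"title": "T1", "author": "A", "date": "June 2007"},
     {"title": 0.9, "author": 0.8, "date": 0.95}),
    ({"title": "T2", "author": "", "date": "May 2006"},
     {"title": 0.9, "author": 0.0, "date": 0.9}),
]
accepted, review = split_for_review(items)
print(len(accepted), len(review))
```

Tuning the threshold trades automation rate against the amount of human validation time.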

  13. Required Enhancements
      • Develop an interface showing the score of each output value and its location in the document
      • Develop interactive modules for correcting metadata
      • Develop a driver for creating output in the desired format

  14. Required Enhancements
      • Develop a statistical description of the input flow of documents
      • Develop statistical descriptions of the output flow of metadata records:
        • Accuracy
        • Computer time to process
        • Human time to validate/correct
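The three output-flow measures listed above could be summarized from a per-document processing log along the following lines. This is a sketch under stated assumptions: the `RunLog` record, its field names, and the sample numbers are invented for illustration, and "accuracy" here simply counts documents whose metadata was accepted.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunLog:
    """One processed document (hypothetical log record)."""
    doc_id: str
    accepted: bool        # extractor finished with reasonable metadata
    cpu_seconds: float    # computer time to process
    human_seconds: float  # human time to validate/correct


def summarize(logs):
    """Compute accuracy and mean processing and validation times."""
    if not logs:
        raise ValueError("no log records to summarize")
    return {
        "documents": len(logs),
        "accuracy": sum(log.accepted for log in logs) / len(logs),
        "mean_cpu_seconds": mean(log.cpu_seconds for log in logs),
        "mean_human_seconds": mean(log.human_seconds for log in logs),
    }


# Made-up sample data, not measurements from the project.
logs = [
    RunLog("doc-1", True, 10.0, 0.0),
    RunLog("doc-2", True, 14.0, 30.0),
    RunLog("doc-3", False, 12.0, 120.0),
    RunLog("doc-4", True, 12.0, 10.0),
]
summary = summarize(logs)
print(summary)
```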

  15. Why Software from ODU
      • Research, new technology
        • The ODU digital library research group is world class and has made many contributions to advancing the field
        • $2.5M in funding over the last five years from various agencies: National Science Foundation, Andrew Mellon Foundation, Los Alamos, Sandia National Laboratory, Air Force Research Laboratory, NASA Langley, DTIC, and IBM
        • The state of the art in automated metadata extraction works well for homogeneous collections but is not effective for large, evolving, heterogeneous collections (such as NASA's)
        • Need for new methods, techniques, and processes

  16. Why Software from ODU
      • Inexpensive (relatively)
        • ODU is a university with low overhead (43%)
        • Universities can use students and pay them assistantships rather than full-time salaries
        • The department adds matching tuition waivers for research assistants, which is a big incentive for students to apply for research work
        • Faculty are among the best in the field and require only partial funding

  17. Why Software from ODU
      • Long-term software maintenance through the department
        • The department commits to continuity independent of the faculty on the project
        • The department will find and assign faculty and students who can become conversant with the code and maintain it (not evolve it)
        • It is likely that other faculty would be interested in evolving the code given appropriate funding

  18. Cost of Possible Project
      • For a 15-month project on a significant collection, done in isolation, the best estimate of the cost for NASA is $160,000
      • For the same 15-month project done in parallel with DTIC (and possibly GPO), the cost for NASA is $90,000
