
Stephen Abrams California Digital Library Stephen.Abrams@ucop

Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization. Stephen Abrams California Digital Library Stephen.Abrams@ucop.edu. Characterization. /ker-ik-t(ə-)rə-zā ' -shən/ noun 1. The action or result of characterizing.


Presentation Transcript


  1. Preservation and Archiving Special Interest Group Spring Meeting, San Francisco, 27-29 May 2008 • Preservation Characterization • Stephen Abrams, California Digital Library, Stephen.Abrams@ucop.edu

  2. Characterization • /ker-ik-t(ə-)rə-zā'-shən/ noun 1. The action or result of characterizing. 2. Description of characteristics or essential features.

  3. Characterization • Knowing what you have, as a stable starting point for iterative preservation analysis, planning, and action. Adapted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 1:2 (June 2007).

  4. What? So what? • What do you have? • Identification • Feature extraction • Conformance • What should you do with what you have? • Assessment

  5. Ingest workflow

  6. Migration workflow

  7. Two approaches to characterization • Implicit • Custom grammars defining a single format processed by a generic engine that understands all grammars • Unix file • National Archives (UK) DROID • Open Grid Forum DFDL • Planets XCEL/XCDL • Explicit • Plug-in framework with custom modules that each understand a single format • NLNZ Metadata Extractor • JHOVE
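The contrast between the two approaches can be sketched in a few lines of Python. This is a hypothetical illustration only, not code from DROID, DFDL, XCEL, or JHOVE: the `GRAMMARS` table, `generic_identify`, `plugin_identify`, and the `GifModule` and `TiffModule` classes are names invented here. Only the magic numbers (the GIF87a/GIF89a signatures and the TIFF byte-order mark followed by 42) come from the formats themselves.

```python
from typing import Optional

# Implicit approach: each format is described by data (a tiny "grammar"),
# and one generic engine interprets every description.
GRAMMARS = {
    "GIF": [b"GIF87a", b"GIF89a"],
    "TIFF": [b"II*\x00", b"MM\x00*"],
}


def generic_identify(data: bytes) -> Optional[str]:
    """Generic engine: knows nothing about a format beyond its grammar."""
    for name, signatures in GRAMMARS.items():
        if any(data.startswith(sig) for sig in signatures):
            return name
    return None


# Explicit approach: each format gets its own hand-written module,
# plugged into a framework that only knows the common interface.
class GifModule:
    name = "GIF"

    def matches(self, data: bytes) -> bool:
        return data[:6] in (b"GIF87a", b"GIF89a")


class TiffModule:
    name = "TIFF"

    def matches(self, data: bytes) -> bool:
        # Byte-order mark ("II" or "MM") followed by the number 42,
        # read in the byte order the mark announces.
        if data[:2] == b"II":
            return int.from_bytes(data[2:4], "little") == 42
        if data[:2] == b"MM":
            return int.from_bytes(data[2:4], "big") == 42
        return False


MODULES = [GifModule(), TiffModule()]


def plugin_identify(data: bytes) -> Optional[str]:
    """Framework: asks each plugged-in module in turn."""
    for module in MODULES:
        if module.matches(data):
            return module.name
    return None
```

Adding a format to the implicit version means adding a table entry; adding one to the explicit version means writing another class, which is the trade-off the deck goes on to weigh.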

  8. Why choose one over the other? • Implicit • Pro: More sustainable in the long term • Con: Is the formal notation rich enough to capture all nuances of formats of interest? • Explicit • Pro: It’s just programming • Con: It’s more programming

  9. JHOVE • Extensible framework for format identification, validation, and characterization • Pluggable format-specific modules for: • GIF, JPEG, JPEG 2000, TIFF • AIFF, WAVE • ASCII, HTML, UTF-8, XML • PDF • GUI, command-line, and Java API • Collaborative project of Harvard University and the JSTOR Electronic-Archive Initiative • Funded by Andrew W. Mellon Foundation • GNU LGPL license

  10. JHOVE2 • A next-generation architecture for format-aware preservation processing • Three-fold goals: • Re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement • Provide significant new function • (Re-)implement modules • Collaborative project of CDL, Portico, and Stanford University • Funded by Library of Congress/NDIIPP • Open source BSD license

  11. JHOVE2 enhancements • JHOVE assumed 1 object = 1 file = 1 format • But what about… • TIFF with embedded ICC profile and XMP metadata: 1 object = 1 file = 3 formats • JPEG 2000 JPX fragmentation: 1 object = n files = 1 format • ESRI Shapefile: 1 object = 3 files = 3 formats • JHOVE2 will support 1 object = n files = m formats
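One way to picture the "1 object = n files = m formats" relationship is as a small containment model. The sketch below is an assumption made for illustration, not the JHOVE2 data model: `DigitalObject`, `SourceFile`, and `FormatInstance` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormatInstance:
    """One format found inside a file (hypothetical name)."""
    format_name: str


@dataclass
class SourceFile:
    """One file, which may carry several formats (e.g. TIFF + ICC + XMP)."""
    path: str
    formats: List[FormatInstance] = field(default_factory=list)


@dataclass
class DigitalObject:
    """One object, which may span several files."""
    label: str
    files: List[SourceFile] = field(default_factory=list)

    def file_count(self) -> int:
        return len(self.files)

    def format_names(self) -> List[str]:
        # Distinct format names, in the order they are first seen.
        seen: List[str] = []
        for source_file in self.files:
            for fmt in source_file.formats:
                if fmt.format_name not in seen:
                    seen.append(fmt.format_name)
        return seen


# The TIFF case from the slide: 1 object = 1 file = 3 formats.
tiff_object = DigitalObject(
    "scan",
    [SourceFile("scan.tif", [FormatInstance("TIFF"),
                             FormatInstance("ICC"),
                             FormatInstance("XMP")])],
)
```

The older "1 object = 1 file = 1 format" assumption is then just the special case in which both lists have length one.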

  12. JHOVE2 enhancements • Generic plug-in interface • Configurable set of modules iteratively invoked against each object • Inter-module memory structure for stateful processing • Identification de-coupled from conformance • Standardized handling of format profiles and error reporting • Configurable conformance criteria • API-level support for limited editing
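A minimal sketch of a configurable module chain with shared state follows, assuming each module is a plain function that reads and writes a common dictionary. The names `run`, `identify`, and `check_conformance` are hypothetical and do not come from JHOVE2, and the conformance test is deliberately shallow (it only looks for the `%%EOF` marker that ends a PDF file).

```python
from typing import Callable, Dict, List

State = Dict[str, object]               # the inter-module memory structure
Module = Callable[[bytes, State], None]


def identify(data: bytes, state: State) -> None:
    """Identification only: record the format, make no validity claim."""
    if data.startswith(b"%PDF-"):
        state["format"] = "PDF"


def check_conformance(data: bytes, state: State) -> None:
    """Conformance as a separate step that relies on the earlier result."""
    if state.get("format") == "PDF":
        # Illustrative check only, far short of real PDF validation.
        state["conforms"] = b"%%EOF" in data


def run(modules: List[Module], data: bytes) -> State:
    """Invoke the configured modules in order against one object."""
    state: State = {}
    for module in modules:
        module(data, state)
    return state
```

Because the chain is just a list, `run([identify], data)` identifies without validating, which is one reading of "identification de-coupled from conformance".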

  13. JHOVE2 modules • Identification • Feature extraction and conformance for: • GIF, JPEG, JPEG 2000, TIFF • AIFF, WAVE • ASCII, HTML, SGML, UTF-8, XML • PDF • Shapefile • ICC • Symbolic display of selected binary formats • Assessment based on prior characterization and locally-defined policy rules and heuristics

  14. JHOVE2 modules • Identification • Feature extraction and conformance for: • GIF, JPEG, JPEG 2000, TIFF • AIFF, WAVE • ASCII, HTML, SGML, UTF-8, XML • PDF • Shapefile • ICC • Symbolic display of selected binary formats • Assessment based on prior characterization and locally-defined policy rules and heuristics

  15. JHOVE2 modules • Identification • Feature extraction and conformance for: • JPEG 2000, TIFF • WAVE • ASCII, SGML, UTF-8, XML • PDF • Shapefile • ICC • Symbolic display of selected binary formats • Assessment based on prior characterization and locally-defined policy rules and heuristics
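Assessment against locally defined policy could look something like the sketch below: rules are predicates over properties that an earlier characterization step has already extracted. Everything here is an assumption made for illustration. The property names (`format`, `valid`, `compression`), the two rules in `POLICY`, and the `assess` function are hypothetical and are not taken from JHOVE2.

```python
from typing import Callable, Dict, List, Tuple

Properties = Dict[str, object]
Rule = Tuple[str, Callable[[Properties], bool]]

# A hypothetical local policy: each rule pairs a description with a test.
POLICY: List[Rule] = [
    ("object is valid against its format",
     lambda p: p.get("valid") is True),
    ("TIFF images are uncompressed",
     lambda p: p.get("format") != "TIFF" or p.get("compression") == "none"),
]


def assess(props: Properties, policy: List[Rule]) -> List[str]:
    """Return the descriptions of every rule the object fails."""
    return [description for description, test in policy if not test(props)]
```

An empty result means the object meets local policy; a non-empty one lists what would need attention, for example before deciding on a migration.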

  16. JHOVE2 data abstraction • Determine the “natural” conceptual structures of a format and their component attributes • Each such structure maps to a class with methods for parsing, validating, reporting, and serializing • Each such attribute maps to a field with accessor and mutator methods • UTF-8 → Character • TIFF → IFH and IFD • JPEG 2000 → Box • PDF → boolean, number, string, name, array, dictionary, stream, and null
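As a sketch of this mapping, the TIFF Image File Header (IFH) can be modeled as one class with parse, validate, report, and serialize methods, and one attribute exposed through an accessor and a mutator. The class name `TiffHeader` and its method names are assumptions for illustration, written in Python rather than JHOVE2's Java. The 8-byte layout (byte-order mark, the number 42, offset of the first IFD) is the one defined by the TIFF format.

```python
import struct
from typing import Dict, List


class TiffHeader:
    """TIFF Image File Header: byte order, magic number, first IFD offset."""

    def __init__(self, byte_order: bytes, magic: int, first_ifd_offset: int):
        self._byte_order = byte_order
        self._magic = magic
        self._first_ifd_offset = first_ifd_offset

    # Attribute mapped to a field with an accessor and a mutator.
    @property
    def first_ifd_offset(self) -> int:
        return self._first_ifd_offset

    @first_ifd_offset.setter
    def first_ifd_offset(self, value: int) -> None:
        self._first_ifd_offset = value

    def _endian(self) -> str:
        # "MM" means big-endian; anything else is read as little-endian.
        return ">" if self._byte_order == b"MM" else "<"

    @classmethod
    def parse(cls, data: bytes) -> "TiffHeader":
        if len(data) < 8:
            raise ValueError("a TIFF header needs 8 bytes")
        byte_order = data[:2]
        endian = ">" if byte_order == b"MM" else "<"
        magic, offset = struct.unpack(endian + "HI", data[2:8])
        return cls(byte_order, magic, offset)

    def validate(self) -> List[str]:
        errors = []
        if self._byte_order not in (b"II", b"MM"):
            errors.append("byte order must be II or MM")
        if self._magic != 42:
            errors.append("magic number must be 42")
        if self._first_ifd_offset < 8:
            errors.append("first IFD cannot overlap the header")
        return errors

    def report(self) -> Dict[str, object]:
        return {
            "byteOrder": self._byte_order.decode("ascii", "replace"),
            "magic": self._magic,
            "firstIFDOffset": self._first_ifd_offset,
        }

    def serialize(self) -> bytes:
        return self._byte_order + struct.pack(
            self._endian() + "HI", self._magic, self._first_ifd_offset
        )
```

Keeping a mutator plus `serialize` on the same class is one plausible way to support limited editing at the API level: parse, change a field, re-validate, and write the bytes back out.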

  17. JHOVE2 timeline • Months 1-6: Outreach, design, and prototyping • Months 7-9: Core APIs and framework • Months 10-24: Modules

  18. For more information… www.significantproperties.org.uk droid.sourceforge.net forge.gridforum.org/projects/dfdl-wg hki.uni-koeln.de/planets/ meta-extractor.sourceforge.net hul.harvard.edu/jhove www.ucop.edu:8080/display/JHOVE2Info/Home
