Assessing Digital Objects with
Hannah Frost AND TEAM*
Stanford University Libraries and Academic Information Resources
- Repositories collect lots of technical metadata, but lack tools to use it to better understand the objects in their care, and to apply it precisely in management and operations. Recent years have seen the development of a number of metadata tools designed to identify and characterize digital objects in preservation workflows: JHOVE, Metadata Extraction Tool, DROID, and most recently FITS. Each is available to the digital preservation community, and the community makes good use of them.
- Repositories dutifully collect and store the technical and structural metadata exposed and output by these tools, yet typically repositories have limited or no means to analyze and evaluate object characterization data in ways that would facilitate more effective management of the objects under their care.
- According to a recent Planets survey report, fewer than a third of organizations reported that they have “complete control over the formats that they will accept and enter into their archives.”1 Repository managers have concerns about risks associated with file formats, and file format obsolescence generally. To support preservation services for content considered to be encoded in “risky” formats, some repositories are developing policies and profiles that reflect their local concerns and operational contexts.2 They seek tools to assess the technical metadata gathered and stored in routine repository operations against those policies in order to make sense of it on local terms and inform a decision-making process, such as:
- accept / reject
- determine level of risk
- assign level of service
- take action now / later.
- Recent efforts to develop and apply assessment methodologies in digital object workflows and repository operations include:
- AONS II (Automated Obsolescence Notification System), National Library of Australia and APSR3
- CIV (Configurable Image Validator), Library of Congress
- Institutional Technology Profiles, National Library of New Zealand4.
- In JHOVE2 – a next-generation characterization tool currently in development at California Digital Library, Portico, and Stanford University – the team has designed an approach to facilitate policy-based assessment as an object is processed. The tool produces characterization data through a series of identification, feature extraction, validation, and assessment processes. It tells you what you have, as the starting point for iterative preservation planning and action.
- It is possible to assess the properties of an object against a set of “rules” configured by the user. Using logical expressions as its terms, a rule is an “assertion” about prior characterization properties. An assertion may be concerned with:
- The presence or absence of a property;
- Constraints on property values;
- Combinations of properties or values.
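The three kinds of assertion listed above can be sketched as simple predicates over a set of characterization properties. This is an illustrative sketch only, not JHOVE2 code: the dictionary layout, the helper functions, and the `Copyright` property are assumptions chosen for the example.

```python
# Hypothetical characterization properties for one object (layout is illustrative).
properties = {"isValid": True, "BitDepth": 24, "SamplingFrequency": 96000}

def has_property(props, name):
    """Presence or absence of a property."""
    return name in props

def value_at_least(props, name, minimum):
    """Constraint on a property value; an absent property fails the test."""
    return name in props and props[name] >= minimum

def all_of(*results):
    """Combination of several assertion results."""
    return all(results)

present = has_property(properties, "BitDepth")
absent = not has_property(properties, "Copyright")
constrained = value_at_least(properties, "SamplingFrequency", 44100)
combined = all_of(present, constrained, properties["isValid"])
```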
- In assessment, the evaluation of the assertion results in a new characterization property. In this sense, the process generates custom metadata that has significance in the context in which the object is being managed.
- The basic formation of a rule is shown below. The user configures the property and value to test, and selects an evaluatory phrase relating the two to form the complete assertion.
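A minimal sketch of that three-part form follows, assuming a rule's assertion is stored as a (property, evaluatory phrase, value) triple. The phrase names `isEqualTo` and `Contains` are taken from the examples in this poster; the `PHRASES` table, the `evaluate` function, the flat property dictionary, and the name of the resulting property are hypothetical and do not reflect the JHOVE2 design.

```python
import operator

# Hypothetical mapping from evaluatory phrases to comparison functions.
PHRASES = {
    "isEqualTo": operator.eq,
    "Contains": lambda actual, expected: expected in actual,
}

def evaluate(assertion, props):
    """Evaluate a (property, phrase, value) triple against extracted properties."""
    name, phrase, expected = assertion
    if name not in props:
        return False  # assumption: an assertion about a missing property is false
    return PHRASES[phrase](props[name], expected)

props = {"BitDepth": 24, "Message": "Non-wordAlignedOffset"}
result = evaluate(("BitDepth", "isEqualTo", 24), props)

# The outcome is recorded as a new characterization property (hypothetical name).
props["Assessment.BitDepthIs24"] = result
```

Keeping the phrases in a lookup table means a new comparison can be offered to users without changing how assertions are written or evaluated.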
Technical implementation of assessment within the JHOVE2 framework is in design; prototyping will begin soon. A leading requirement is that rule configuration be simple. Non-technical and technical staff alike must be able to configure rules and “run an assessment” easily. The JHOVE2 release in 2010 will include a small selection of sample rules and a thorough tutorial.
Assessment with JHOVE2 has natural applications in ingest, migration, publishing and digitization workflows. A clean, well-defined, open API will be available in order to extend it to build tools capable of more complex analyses, such as a weighted scoring system or matching of technology profiles.
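As one illustration of the kind of extension mentioned here, a weighted scoring system could combine several assessment results into a single risk score. The sketch below is hypothetical: the weights, the result names, and the `risk_score` function are assumptions, and nothing here reflects the actual JHOVE2 API.

```python
def risk_score(results, weights):
    """Return the weighted fraction of failed assertions, from 0.0 (no risk) to 1.0."""
    total = sum(weights.values())
    # An assertion with no recorded result is counted as failed.
    failed = sum(w for name, w in weights.items() if not results.get(name, False))
    return failed / total

# Hypothetical assessment results and locally chosen weights.
results = {"isValid": True, "bitDepthIs24": True, "samplingIs96k": False}
weights = {"isValid": 5, "bitDepthIs24": 2, "samplingIs96k": 3}

score = risk_score(results, weights)  # 3 / 10
```

A repository could then map score ranges to its own service levels, for example treating anything above a locally chosen threshold as "At Risk".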
JHOVE2 can be integrated with other identification tools as well as format and software registries to form robust policy engines and other rules-based systems. Such systems have great potential in supporting and enabling digital preservation activities and services both at the local level and across the community.
(1) TIFF with nonaligned byte offset
Rule Configuration
Assertion: Message [Information], Contains, Non-wordAlignedOffset
Result: True
Response If True: Acceptable
(2) PDF with malformed dictionary
Assertion: Message [Information], Contains, Malformed dictionary
Response If True: At Risk
About the JHOVE2 Project
JHOVE2 is a collaboration by the California Digital Library, Portico, and Stanford University Libraries with funding from the Library of Congress’ National Digital Information Infrastructure and Preservation Program. The two-year project will conclude, and the open source tool will be released, in September 2010.
* The JHOVE2 Team is …
CDL: Stephen Abrams, Patricia Cruse, John Kunze, Marisa Strong, Perry Willett
Portico: John Meyer, Sheila Morrissey, Evan Owens
Stanford: Richard Anderson, Tom Cramer, Hannah Frost
with Walter Henry, Nancy Hoebelheinrich, Keith Johnson, Justin Littman
(3) WAVE does not meet encoding specification
Assertion1: isValid, isEqualTo, True
Assertion2: BitDepth, isEqualTo, 24
Assertion3: SamplingFrequency, isEqualTo, 96000
Response If False: Reject
In addition, the user provides two responses for each rule: one to report if the assertion is true, and one to report if it is false.
Whichever response is reported constitutes the custom metadata that is available for subsequent processing or analysis.
Rules can be executed individually, as atomic statements, or chained together to form compound statements for more complex assessments.
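The pairing of an assertion with two responses, and the chaining of atomic assertions into a compound one, can be sketched with the WAVE example. This is a sketch under stated assumptions rather than the JHOVE2 implementation: the `Rule` class and the `equals` and `all_of` helpers are hypothetical, and the "Accept" response is invented for illustration because the example specifies only the false response.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """A rule pairs a test with the response to report for each outcome."""
    test: Callable[[dict], bool]
    if_true: str
    if_false: str

    def assess(self, props: dict) -> str:
        return self.if_true if self.test(props) else self.if_false

def equals(name, expected):
    """Atomic assertion: the named property equals the expected value."""
    return lambda props: props.get(name) == expected

def all_of(*tests):
    """Compound assertion: true only if every chained assertion is true."""
    return lambda props: all(t(props) for t in tests)

# The WAVE example: three assertions chained into one rule.
# "Accept" is an assumed response; the example gives only "Reject".
wave_rule = Rule(
    test=all_of(
        equals("isValid", True),
        equals("BitDepth", 24),
        equals("SamplingFrequency", 96000),
    ),
    if_true="Accept",
    if_false="Reject",
)

response = wave_rule.assess(
    {"isValid": True, "BitDepth": 16, "SamplingFrequency": 44100}
)
```

Because a compound test has the same shape as an atomic one, compound rules can themselves be chained into larger assessments.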
- 1. Planets (2009). Survey Analysis Report, IST-2006-033789, DT11-D1.
- 2. Rog, J. and van Wijk, C. (2008). Evaluating File Formats for Long-term Preservation. National Library of the Netherlands; The Hague, The Netherlands. http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf.
- 3. Pearson, D. and Webb, C. (2008). Defining File Format Obsolescence: A Risky Journey. International Journal of Digital Curation. Vol 3: No 1. http://www.ijdc.net/index.php/ijdc/article/view/76
- 4. De Vorsey, K. and McKinney, P. (2009). One Man’s Obsoleteness is Another Man’s Innovation: A Risk Analysis Methodology for Digital Collections. Presented at Archiving 2009, Arlington, Virginia, May 2009.