Using JHOVE2 for Policy Assessment of Files

Using JHOVE2 for Policy Assessment of Files. Richard Anderson Code4LibCon Preconference 2/7/2011 http://code4lib.org/conference/2011/schedule#preconf 13:30-16:30 : Persimmon Room. Agenda 13:30-16:30. What is JHOVE2 ? Characterization of digital objects Validation vs Assessment

Presentation Transcript


  1. Using JHOVE2 for Policy Assessment of Files Richard Anderson Code4LibCon Preconference 2/7/2011 http://code4lib.org/conference/2011/schedule#preconf 13:30-16:30 : Persimmon Room

  2. Agenda 13:30-16:30 What is JHOVE2 ? Characterization of digital objects Validation vs Assessment Examples of JHOVE2 output Source Units, Modules, Reportable Properties Implementation of Assessment Configuration of Assessment Rules

  3. JHOVE2 is … a project to develop a next-generation open source framework and application for format-aware characterization … a collaborative undertaking of the California Digital Library (CDL), Portico, and Stanford University … supported by a two-year grant from the Library of Congress as part of its National Digital Information Infrastructure and Preservation Program (NDIIPP)

  4. “What? So what?” Characterization is the automated determination of the intrinsic and extrinsic properties of a formatted object • Identification: determining the presumptive format of a digital object based on suggestive extrinsic hints and intrinsic signatures • Feature extraction: reporting the intrinsic properties of an object significant for classification, analysis, and planning • Validation • Assessment

  5. What's new in JHOVE2? Je ne sais quoi ! Processing of multi-file objects as well as embedded objects inside files Recursive processing of container objects Plug-in Format Modules Buffered I/O Internationalized output Clean APIs and modern design patterns

  6. API design idioms Separation of concerns Annotation and Reflection confluence.ucop.edu/display/JHOVE2Info/Background+Papers Inversion of Control (IOC) / Dependency Injection Martin Fowler martinfowler.com/articles/injection.html Spring Framework www.springsource.org/

  7. Project Home Domain name • http://jhove2.org/ Code Repository • https://bitbucket.org/jhove2/main/wiki/Home • Public Wiki/Documentation • Browse/Clone Source Code • Download Release Packages • Changeset History • Issue Tracking Mailing lists • JHOVE2-Announce-L@listserv.ucop.edu • JHOVE2-Techtalk-L@listserv.ucop.edu

  8. JHOVE2 Documentation Complete documentation • User’s guide • Architectural overview • Module specifications • Programmer’s guide

  9. Agenda 13:30-16:30 What is JHOVE2 ? Characterization of digital objects Validation vs Assessment Examples of JHOVE2 output Source Units, Modules, Reportable Properties Implementation of Assessment Configuration of Assessment Rules

  10. Characterization

  11. Validation vs. Assessment Validation is the determination of the level of conformance to the normative requirements of a format’s authoritative specification • To the extent that there is community consensus on these requirements, validation is an objective determination – Hard coded in JHOVE2 Modules Assessment is the determination of the level of acceptability for a specific purpose on the basis of locally-defined policy rules • Since these rules are locally configurable, assessment is a subjective determination – Scripted via config files

  12. Format Specifications

  13. Validation vs. Assessment Validation is the determination of the level of conformance to the normative requirements of a format’s authoritative specification • To the extent that there is community consensus on these requirements, validation is an objective determination – Hard coded in JHOVE2 Modules Assessment is the determination of the level of acceptability for a specific purpose on the basis of locally-defined policy rules • Since these rules are locally configurable, assessment is a subjective determination – Scripted via config files

  14. Putting it another way … Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules

  15. Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules. Source unit types: • File (UTF-8) • File with embedded ByteStream(s) (TIFF with ICC profile) • Aggregate (Directory, ZIP) • ClumpSource (Shapefile)

  16. Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules. Reportable properties: Format Identification Features Validity

  17. Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules. Questions the rules can answer: • Is the item acceptable? • Is there a preservation risk? • What level of preservation service? • Should we flag the object for future action?
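
The definition on slides 14 to 17 can be sketched in a few lines of Java. This is only an illustration, not JHOVE2's implementation: the class `PolicyAssessmentSketch`, the rule `MaxSizeRule`, its size threshold, and the "all rules must pass" policy are assumptions made for the example. The property names `isValid` and `size` and the rule name `XmlValidityRule` are borrowed from the sample output elsewhere in this deck.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: evaluates reportable properties against policy rules. */
final class PolicyAssessmentSketch {

    /** Returns one true/false outcome per named rule, in rule order. */
    static Map<String, Boolean> assess(Map<String, Object> properties,
                                       Map<String, Predicate<Map<String, Object>>> rules) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, Predicate<Map<String, Object>>> rule : rules.entrySet()) {
            results.put(rule.getKey(), rule.getValue().test(properties));
        }
        return results;
    }

    /** Assumed local policy: the item is acceptable only if every rule passes. */
    static boolean isAcceptable(Map<String, Boolean> results) {
        return !results.containsValue(Boolean.FALSE);
    }

    /** Two example rules; MaxSizeRule and its threshold are invented for illustration. */
    static Map<String, Predicate<Map<String, Object>>> exampleRules() {
        Map<String, Predicate<Map<String, Object>>> rules = new LinkedHashMap<>();
        rules.put("XmlValidityRule", p -> Boolean.TRUE.equals(p.get("isValid")));
        rules.put("MaxSizeRule",
                p -> p.get("size") instanceof Long && (Long) p.get("size") <= 1_000_000L);
        return rules;
    }

    public static void main(String[] args) {
        // Reportable properties as a module might have reported them.
        Map<String, Object> properties = new HashMap<>();
        properties.put("isValid", true);
        properties.put("size", 9516L);

        Map<String, Boolean> results = assess(properties, exampleRules());
        System.out.println(results + " acceptable=" + isAcceptable(results));
    }
}
```

Keeping the rules in a map separate from the evaluation loop mirrors the point of the slides: the properties come from the modules, while the rules are local policy that can change without touching the code that evaluates them.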

  18. Practical Applications of Assessment Ingest workflows Migration workflows Digitization workflows Publishing workflows

  19. Agenda 13:30-16:30 What is JHOVE2 ? Characterization of digital objects Validation vs Assessment Examples of JHOVE2 output Source Units, Modules, Reportable Properties Implementation of Assessment Configuration of Assessment Rules

  20. Running JHOVE2 jhove2.sh -d Text -o outfile.txt myfile.xml Display format choices are: Text (default), JSON, and XML. The file argument can be any of: • Filename • Directory name • URL • Set of space-delimited filepaths User's guide: http://bitbucket.org/jhove2/main/wiki/documents/JHOVE2-Users-Guide.pdf
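
For workflows that call JHOVE2 from Java, the command line above could be assembled as follows. This is a sketch under assumptions: `Jhove2CommandSketch` is a hypothetical helper, it uses only the `-d` and `-o` options shown on the slide, and it only builds the argument list rather than running anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of JHOVE2) that assembles a jhove2.sh argument list. */
final class Jhove2CommandSketch {

    // Display formats named on the slide.
    private static final List<String> DISPLAY_FORMATS = Arrays.asList("Text", "JSON", "XML");

    /** Builds: jhove2.sh -d <format> -o <outputFile> <input> [<input> ...] */
    static List<String> buildCommand(String displayFormat, String outputFile, List<String> inputs) {
        if (!DISPLAY_FORMATS.contains(displayFormat)) {
            throw new IllegalArgumentException("Unknown display format: " + displayFormat);
        }
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("At least one file, directory, or URL is required");
        }
        List<String> command = new ArrayList<>();
        command.add("jhove2.sh");
        command.add("-d");
        command.add(displayFormat);
        command.add("-o");
        command.add(outputFile);
        command.addAll(inputs); // space-delimited file paths become separate arguments
        return command;
    }

    public static void main(String[] args) {
        List<String> command =
                buildCommand("Text", "outfile.txt", Collections.singletonList("myfile.xml"));
        System.out.println(String.join(" ", command));
        // To launch it, pass the list to new ProcessBuilder(command);
        // that step requires jhove2.sh to be installed and on the PATH.
    }
}
```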

  21. JHOVE2 Output options • Input File • xml-schemaLocation-cannot-resolve.xml • Text • text-output.txt • XML • xml-output.xml • JSON • json-output.txt

  22. JHOVE2 Output FileSource: Path: E:\samples\xml\schema-sample.xml Size (byte): 9516 LastModified: 2010-10-12T11:55:29-06:00 SourceName: schema-sample.xml StartingOffset (byte): 0 …

  23. Format Identification PresumptiveFormats: PresumptiveFormat {FormatIdentification}: NativeIdentifier {I8R}: Namespace: PUID Value: fmt/101 (PRONOM identifier) JHOVE2Identifier {I8R}: Namespace: JHOVE2 Value: http://jhove2.org/terms/format/xml ...

  24. PRONOM Format Registry http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=638 Name Extensible Markup Language Version 1.0 Other names XML (1.0) Identifiers PUID: fmt/101 Apple Uniform Type Identifier: public.xml MIME: text/xml Classification Text (Mark-up) Description The Extensible Markup Language (XML) is a general purpose markup language for creating other, special purpose, markup languages, and is a simplified subset of SGML. …

  25. Agent used for Identification Module {DROIDIdentifier}: SignatureFile: …/DROID_SignatureFile_V20.xml Version: 2.0.0 ReleaseDate: 2010-09-10 WrappedProduct: Name: DROID Version: 4.0.0 ReleaseDate: 2009-07-23 ...

  26. DROID http://sourceforge.net/projects/droid/ DROID (Digital Record Object Identification) is an automatic file format identification tool. It is the first in a planned series of tools developed by The National Archives under the umbrella of its PRONOM technical registry service

  27. XML Module Module {XmlModule}: SaxParser: Parser: org.apache.xerces.parsers.SAXParser XmlDeclaration: Version:1.0 Encoding: UTF-8 Standalone: no RootElement: Name: mets Namespace: http://www.loc.gov/METS/

  28. XML Module (namespaces) NamespaceInformation: NamespaceCount: 2 Namespaces: Namespace: URI: http://www.loc.gov/METS/ Declarations: Prefix: [default] SchemaLocations: SchemaLocation: Location: http://www.loc.gov/standards/mets/version15/mets.xsd Namespace: URI: http://www.loc.gov/mix/v10 Declarations: Prefix: mix

  29. XML Module (cont) ValidationResults: ParserWarnings {ValidationMessageList}: ValidationMessageCount: 0 ParserErrors {ValidationMessageList}: ValidationMessageCount: 0 FatalParserErrors {ValidationMessageList}: ValidationMessageCount: 0 isWellFormed: true isValid: true

  30. Format Modules from JHOVE2 Team • ICC color profile • JPEG 2000 • PDF • SGML • Shapefile • TIFF • UTF-8 • WAVE • XML • Zip JHOVE2 can identify (by DROID) many more formats than it can validate (by modules)

  31. Other Module Development 3rd party development activities • NetCDF and GRIB modules (Wegener Institute) • Integration with DuraCloud (DuraSpace) • ARC module (Bibliothèque nationale de France) • WARC, JPEG, GIF modules (CDL, hopefully ;-) Possible development efforts • Additional format modules • Configuration GUIs • JHOVE2-as-a-service • Integration with DAITSS, DSpace, Fedora, FITS, etc. Suggestions, volunteers and funders welcome

  32. AssessmentModule Module {AssessmentModule}: AssessmentResultSets: AssessmentResultSet: RuleSetName: XmlRuleSet RuleSetDescription: RuleSet for Xml Module ObjectFilter: org.jhove2.module.format.xml.XmlModule BooleanResult: true AssessmentResults: AssessmentResult: RuleName: XmlValidityRule RuleDescription: Is the XML file acceptable? BooleanResult: true NarrativeResult: Acceptable
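
The ObjectFilter line in this output suggests that each rule set is tied to one kind of object. A minimal sketch of that selection step follows, assuming (this is not confirmed by the slides) that a rule set applies when its filter equals the fully qualified class name of the object being assessed. `RuleSetFilterSketch`, its nested `RuleSet` class, and `Utf8RuleSet` are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of choosing rule sets by object filter. */
final class RuleSetFilterSketch {

    /** Minimal stand-in for a configured rule set; not the JHOVE2 class. */
    static final class RuleSet {
        final String name;
        final String objectFilter; // fully qualified class name of the target object

        RuleSet(String name, String objectFilter) {
            this.name = name;
            this.objectFilter = objectFilter;
        }
    }

    /** Assumption: a rule set applies when its filter equals the object's class name. */
    static List<String> applicableRuleSets(List<RuleSet> configured, String objectClassName) {
        List<String> names = new ArrayList<>();
        for (RuleSet ruleSet : configured) {
            if (ruleSet.objectFilter.equals(objectClassName)) {
                names.add(ruleSet.name);
            }
        }
        return names;
    }

    /** XmlRuleSet comes from the sample output; Utf8RuleSet is invented. */
    static List<RuleSet> exampleConfiguration() {
        return Arrays.asList(
                new RuleSet("XmlRuleSet", "org.jhove2.module.format.xml.XmlModule"),
                new RuleSet("Utf8RuleSet", "org.jhove2.module.format.utf8.UTF8Module"));
    }

    public static void main(String[] args) {
        System.out.println(applicableRuleSets(
                exampleConfiguration(), "org.jhove2.module.format.xml.XmlModule"));
    }
}
```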

  33. Agenda 13:30-16:30 What is JHOVE2 ? Characterization of digital objects Validation vs Assessment Examples of JHOVE2 output Source Units, Modules, Reportable Properties Implementation of Assessment Configuration of Assessment Rules

  34. JHOVE2 Abstractions • Source Unit • Module • Reportable • Reportable Property • Message

  35. Source Unit A formatted object about which characterization information can be meaningfully reported • Unitary • File e.g. UTF-8 text file • File inside of a container e.g. TIFF inside a Zip • Byte stream inside a file e.g. ICC inside a TIFF • Aggregate • Directory • Directory inside of a container • Clump e.g. Shapefile • File set e.g. command line arguments For purposes of characterization, directories, file sets, and clumps are considered format types

  36. Source Interface (Java) public Set<FormatIdentification> getPresumptiveFormats() { return presumptiveFormatIdentifications; } public List<Module> getModules() { return this.modules; } public List<Source> getChildSources() { return this.children; }
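
Because a source exposes its child sources, a container can be walked recursively. The sketch below uses a simplified stand-in class, `SimpleSource`, rather than the real JHOVE2 `Source` type; only the `getChildSources()` accessor name is taken from the slide, and the file names are invented. The example tree follows the cases on slide 35: a Zip holding a TIFF that holds an ICC profile.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of recursing through nested source units. */
final class SourceTreeSketch {

    /** Simplified stand-in for a JHOVE2 source unit. */
    static final class SimpleSource {
        private final String name;
        private final List<SimpleSource> children = new ArrayList<>();

        SimpleSource(String name) {
            this.name = name;
        }

        SimpleSource addChild(SimpleSource child) {
            children.add(child);
            return this;
        }

        String getName() {
            return name;
        }

        List<SimpleSource> getChildSources() {
            return children;
        }
    }

    /** Depth-first walk that records each source name, indented two spaces per level. */
    static void walk(SimpleSource source, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        out.add(line.append(source.getName()).toString());
        for (SimpleSource child : source.getChildSources()) {
            walk(child, depth + 1, out);
        }
    }

    /** Builds the example tree: Zip > TIFF > ICC profile. */
    static SimpleSource example() {
        SimpleSource tiff = new SimpleSource("image.tif").addChild(new SimpleSource("ICC profile"));
        return new SimpleSource("sample.zip").addChild(tiff);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        walk(example(), 0, lines);
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```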

  37. Format Module • implements Parser • implements Validator • implements Reportable • imports org.jhove2.annotation.ReportableProperty public long parse(JHOVE2 jhove2, Source source, Input input) { /* extract features and fill in the reportable property fields */ . . . }

  38. Reportables A Reportable is a named set of properties Reportables correspond to Java classes Including classes for sources and modules Also define reportables for the major conceptual structures inherent to a format JPEG 2000: Box TIFF: IFH, IFD, IFD entry (“tag”) UTF-8: Character stream, character WAVE: Chunk

  39. Reportable Interface package org.jhove2.core; public interface Reportable { public I8R getReportableIdentifier(); public String getReportableName(); public void setReportableName(String name); } public abstract class AbstractReportable implements Reportable { protected I8R reportableIdentifier; protected String reportableName; } A reportable class implements the Reportable marker interface

  40. ReportableProperties A ReportableProperty is a named, typed value • org.jhove2.annotation.ReportableProperty • Unique formal identifier • Data type • Scalar or collection • Java types, JHOVE2 primitive types, or JHOVE2 reportables • Typed value • Description of correct semantic interpretation • Properties correspond to fields

  41. ReportableProperty Annotation Each reportable property is represented by a field and accessor and mutator methods The accessor method must be marked with the @ReportableProperty annotation public class MyReportable implements Reportable { protected String myProperty; @ReportableProperty(order=1, desc="description", ref="reference") public String getMyProperty() { return this.myProperty; } public void setMyProperty(String property) { this.myProperty = property; } }
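
Marking accessors with an annotation lets a framework discover properties by reflection instead of hard-coding them. The sketch below shows the general technique with a stand-in annotation named `Reported`; it is not `org.jhove2.annotation.ReportableProperty`, and the way JHOVE2 itself walks its annotations may differ.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of annotation-driven property reporting. */
final class ReflectionReportSketch {

    /** Stand-in annotation, retained at runtime so reflection can read it. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Reported {
        int order();
        String desc() default "";
    }

    /** Example reportable with two annotated accessors. */
    public static final class MyReportable {
        private final String myProperty = "hello";
        private final long size = 42L;

        @Reported(order = 2, desc = "Size in bytes")
        public long getSize() {
            return size;
        }

        @Reported(order = 1, desc = "Example property")
        public String getMyProperty() {
            return myProperty;
        }
    }

    /** Finds annotated getters, sorts them by order, and returns "Name: value" lines. */
    static List<String> report(Object reportable) {
        List<Method> getters = new ArrayList<>();
        for (Method method : reportable.getClass().getMethods()) {
            if (method.isAnnotationPresent(Reported.class)) {
                getters.add(method);
            }
        }
        getters.sort(Comparator.comparingInt(
                (Method m) -> m.getAnnotation(Reported.class).order()));

        List<String> lines = new ArrayList<>();
        for (Method getter : getters) {
            try {
                String name = getter.getName().substring(3); // drop the "get" prefix
                lines.add(name + ": " + getter.invoke(reportable));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + getter.getName(), e);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : report(new MyReportable())) {
            System.out.println(line);
        }
    }
}
```

The `order` element matters because reflection does not guarantee the order in which methods are returned; sorting by it keeps the output stable.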

  42. Wave Reportable Properties Module: chunks[ ], formatChunkNotBeforeDataChunkMessage, missingRequiredFormatChunkMessage, missingRequiredDataChunkMessage, missingRequiredFactChunkMessage, isValid. Each entry of chunks[ ]: childChunks[ ], hasPadByte, identifier, isValid, size

  43. UTF-8 Reportable Properties Module: byteOrderMark, c0Characters, c1Characters, codeBlocks, eOLMarkers, invalidCharacters[ ], isValid, numCharacters, numLines, numNonCharacters. Each entry of invalidCharacters[ ]: c0Control, c1Control, codeBlock, codePoint, codePointOutOfRange, coverage, invalidByteValues, isByteOrderMark, isC0Control, isC1Control, isNonCharacter, isValid, size

  44. XML Reportable Properties

  45. Fields for the reportable properties protected String saxParser = "org.apache.xerces.parsers.SAXParser"; protected XmlDeclaration xmlDeclaration = new XmlDeclaration(); protected String xmlRootElementName; protected List<XmlDTD> xmlDTDs; protected HashMap<String,XmlNamespace> xmlNamespaceMap; protected List<XmlNotation> xmlNotations; protected List<String> xmlCharacterReferences; protected List<XmlEntity> xmlEntitys; protected List<XmlProcessingInstruction> xmlProcessingInstructions; protected List<String> xmlComments; protected XmlValidationResults xmlValidationResults; protected boolean wellFormed;

  46. Getter methods for reportable properties import org.jhove2.annotation.ReportableProperty; @ReportableProperty(order = 1, value = "Java class used to parse the XML") public String getSaxParser() { return saxParser; } @ReportableProperty(order = 2, value = "XML Declaration data") public XmlDeclaration getXmlDeclaration() { return xmlDeclaration; } @ReportableProperty(order = 3, value = "Name of the document's root element") public String getXmlRootElementName() { return xmlRootElementName; }

  47. Messages if (position == start && ch.isByteOrderMark()) { Object [] messageParms= new Object [] {position}; this.bomMessage = new Message( Severity.INFO, Context.OBJECT, "org.jhove2.module.format.utf8.UTF8Module.bomMessage", messageParms); }

  48. Messages Messages are reportable properties • Unique identifier: info:jhove2/message/… • Context: Process (condition arising from the process of characterization) or Object (condition arising in the object being characterized) • Severity: Error, Warning, or Info • Internationalizable
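
One common way to make such messages internationalizable in Java is to keep a per-locale template for each message key and fill in the parameters with `java.text.MessageFormat`. The sketch below illustrates that approach only; the class `MessageSketch`, the key `example.bomMessage`, and both templates are invented, and an in-memory map stands in for the properties files a real application would more likely use.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a localizable message with severity and context. */
final class MessageSketch {

    enum Severity { ERROR, WARNING, INFO }

    enum Context { PROCESS, OBJECT }

    /** Stand-in for per-language resource bundles, keyed by language code. */
    private static final Map<String, Map<String, String>> TEMPLATES = new HashMap<>();

    static {
        Map<String, String> english = new HashMap<>();
        english.put("example.bomMessage", "Byte order mark found at offset {0}");
        Map<String, String> german = new HashMap<>();
        german.put("example.bomMessage", "Byte-Order-Mark an Position {0} gefunden");
        TEMPLATES.put("en", english);
        TEMPLATES.put("de", german);
    }

    /** Looks up the template for the locale (falling back to English) and fills it in. */
    static String format(Severity severity, Context context, String key,
                         Locale locale, Object... params) {
        Map<String, String> bundle =
                TEMPLATES.getOrDefault(locale.getLanguage(), TEMPLATES.get("en"));
        String template = bundle.getOrDefault(key, key); // unknown keys are shown as-is
        MessageFormat formatter = new MessageFormat(template, locale);
        return "[" + severity + "/" + context + "] " + formatter.format(params);
    }

    public static void main(String[] args) {
        System.out.println(format(Severity.INFO, Context.OBJECT,
                "example.bomMessage", Locale.ENGLISH, 0L));
        System.out.println(format(Severity.INFO, Context.OBJECT,
                "example.bomMessage", Locale.GERMAN, 0L));
    }
}
```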

  49. Agenda 13:30-16:30 What is JHOVE2 ? Characterization of digital objects Validation vs Assessment Examples of JHOVE2 output Source Units, Modules, Reportable Properties Implementation of Assessment Configuration of Assessment Rules http://code4lib.org/conference/2011/schedule#preconf
