
Digital Preservation

HUL University Library Council. Digital Preservation. Dale Flecker and Stephen Abrams. February 15, 2007.


Presentation Transcript


  1. HUL University Library Council Digital Preservation Dale Flecker Stephen Abrams February 15, 2007

  2. Agenda I The problem II What has Harvard been doing? III What more do we need to do?

  3. I The problem …

  4. … is twofold • Keeping the bits • Keeping the bits useful

  5. Keeping the bits • Digital things are amazingly easy to destroy! • Bad guys want to do damage • Hardware/software fails • People make mistakes • A slip of the finger, or an unnoticed consequence of a change, happens easily and is potentially catastrophic

  6. Destruction is not always apparent Data not used regularly is always at risk of unintended and unnoticed damage. (Note that archival copies can be pretty invisible…)

  7. Keeping bits useful Digital materials are fragile!!! They depend on technologies for their vitality… and those technologies age and disappear rapidly.

  8. Fragility • Using digital content requires mediation by hardware and software • Hardware and software must understand the format of the content • Hardware and software technology change continually …

  9. Fragility • Old technology will break • New technology frequently does not understand old formats

  10. II What has Harvard been doing? Internally …

  11. Digital Repository Service (DRS) • Secure, professionally managed environment • Manage data rigorously, with discipline, and in accordance with community best practices • Redundant, heterogeneous, distributed storage with periodic media migration …

  12. Digital Repository Service (DRS) • Know what data you have • What are the logical objects (“works”, not files)? • What are the technical characteristics of those objects? • Check the data continuously • Manage access to stored objects
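"Check the data continuously" is commonly done by recording a checksum when an object is stored and recomputing it on a schedule, so that silent damage to rarely used files is noticed. The following is a minimal sketch of that idea, not a description of how DRS works: the SHA-256 choice, the function names, and the in-memory manifest of path-to-digest pairs are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest):
    """Compare each file against the digest recorded for it.

    `manifest` is a hypothetical dict of {path: expected_hex_digest}.
    Returns {path: status}, where status is "ok", "missing", or "corrupt".
    """
    report = {}
    for path, expected in manifest.items():
        if not Path(path).is_file():
            report[path] = "missing"
        elif sha256_of(path) != expected:
            report[path] = "corrupt"
        else:
            report[path] = "ok"
    return report
```

Running such an audit against every stored copy, rather than only the copy users touch, is what makes damage to "invisible" archival copies detectable.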

  13. Format • Understanding formats is fundamental to preservation ffd8ffe000104a46494600010201 008300830000ffed0fb050686f74 6f73686f7020332e30003842494d 03e90a5072696e7420496e666f00 0000007800000000004800480000 000002f40240ffeeffee03060252 0347052803fc0002000000480048 0000000002d80228000100000064 000000010003030300000001270f 0001000100000000000000000000 0000600800190190000000000000 0000000000000000000000000000 0000000000000000000000003842 494d03ed0a5265736f6c7574696f 6e0000000010008313a3000200 ...

  14. Format • Understanding formats is fundamental to preservation ffd8ffe000104a46494600010201 008300830000ffed0fb050686f74 6f73686f7020332e30003842494d 03e90a5072696e7420496e666f00 0000007800000000004800480000 000002f40240ffeeffee03060252 0347052803fc0002000000480048 0000000002d80228000100000064 000000010003030300000001270f 0001000100000000000000000000 0000600800190190000000000000 0000000000000000000000000000 0000000000000000000000003842 494d03ed0a5265736f6c7574696f 6e0000000010008313a3000200 ... SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2 ...

  15. Format • Understanding formats is fundamental to preservation ffd8ffe000104a46494600010201 008300830000ffed0fb050686f74 6f73686f7020332e30003842494d 03e90a5072696e7420496e666f00 0000007800000000004800480000 000002f40240ffeeffee03060252 0347052803fc0002000000480048 0000000002d80228000100000064 000000010003030300000001270f 0001000100000000000000000000 0000600800190190000000000000 0000000000000000000000000000 0000000000000000000000003842 494d03ed0a5265736f6c7574696f 6e0000000010008313a3000200 ... SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2 ...
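The step from the raw hex to the labelled structure (SOI, APP0, JFIF 1.2, APP13, ...) can be reproduced for the first few bytes shown on the slide. This is a minimal sketch under stated assumptions: only the first 24 bytes of the dump are used, the marker table is a small subset of the JPEG marker codes, and the function names are illustrative rather than taken from any tool.

```python
import struct

# First 24 bytes of the hex dump shown on the slide.
HEADER = bytes.fromhex("ffd8ffe000104a46494600010201" "008300830000ffed0fb0")

# A small subset of JPEG marker codes; any other code is reported by number.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS",
}


def walk_markers(data):
    """Return (marker_name, segment_length) pairs for the segment headers present in `data`."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    segments = [("SOI", 0)]
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # not positioned on a marker; stop rather than guess
        code = data[pos + 1]
        # The big-endian length field counts itself but not the 2-byte marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((MARKER_NAMES.get(code, f"0x{code:02X}"), length))
        if code == 0xDA:
            break  # entropy-coded image data follows the SOS header
        pos += 2 + length
    return segments


def jfif_version(data):
    """Return (major, minor) from a JFIF APP0 segment, or None if it is absent."""
    if len(data) >= 13 and data[2:4] == b"\xff\xe0" and data[6:11] == b"JFIF\x00":
        return data[11], data[12]
    return None
```

Called on `HEADER`, `walk_markers` reports SOI, a 16-byte APP0 segment, and an APP13 segment whose length field is 0x0fb0, while `jfif_version` yields (1, 2), matching the "JFIF 1.2" label on the slide. Without knowledge of the format, the same bytes are just an opaque string of hex digits.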

  16. Format • Formats vary significantly in their “preservability” • Keeping multiple versions of a given piece of content for different purposes is frequently wise • E.g. archival master, production master, use copy

  17. Format • Some criteria for “preservability” (from LC) • Disclosure (how well documented?) • Adoption (how widely used?) • Transparency (is compression used?) • Self documenting (good!) • External dependencies (self sufficiency is good) • Patents (could limit preservation actions) • DRM/encryption (what if decryption key is not available?)

  18. Metadata • The basis of decision-making for preservation • Technical metadata • What format is this in? • What format options are used? • Structural metadata • If I change this, what else is affected? …

  19. Metadata • Administrative metadata • Who has the right to make decisions about this? • Relationship metadata • Are there other versions of this object? • How do these affect my preservation strategy? • Provenance metadata • Where did this come from? • What changes has it already undergone?
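The metadata categories on these two slides can be pictured as one record per logical object. The sketch below is illustrative only: the class and field names are hypothetical, and real repositories use standardized schemas rather than an ad hoc structure like this one.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceEvent:
    """One change the object has already undergone (hypothetical structure)."""
    date: str
    action: str


@dataclass
class PreservationRecord:
    """Illustrative grouping of the metadata categories named on the slides."""
    object_id: str
    technical: dict = field(default_factory=dict)          # format, format options
    structural: list = field(default_factory=list)         # parts affected by a change
    administrative: dict = field(default_factory=dict)     # who may make decisions
    related_versions: list = field(default_factory=list)   # other versions of the object
    provenance: list = field(default_factory=list)         # history of ProvenanceEvent

    def record_change(self, date, action):
        """Append an event so that later planners can see what was already done."""
        self.provenance.append(ProvenanceEvent(date, action))

    def unanswered(self):
        """Return the categories that are still empty for this object."""
        names = ["technical", "structural", "administrative",
                 "related_versions", "provenance"]
        return [name for name in names if not getattr(self, name)]
```

An empty category corresponds to one of the slide's questions that a preservation planner could not answer for that object.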

  20. Guidelines for “preservable” objects The least expensive, and most effective preservation measure is to think about the future when an object is created! (Guidelines on format, metadata, archival masters, etc.)

  21. JHOVE (JSTOR/Harvard Object Validation Environment) A widely used tool for format identification, validation, and characterization.

  22. JHOVE (JSTOR/Harvard Object Validation Environment) • When an object is ingested: • Determine its format • (“identify”) • Ensure that it is properly formed • (“validate”) • Extract meaningful technical metadata • (“characterize”)
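The three ingest steps can be illustrated with a deliberately small example. This sketch is not JHOVE and does not use its API; it assumes PNG as the sample format, checks only the first chunk, and uses hypothetical function names that mirror the slide's terms.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def identify(data):
    """Guess the format from leading magic bytes (only two formats are known here)."""
    if data.startswith(PNG_SIGNATURE):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    return None


def validate(data):
    """Partial check: the first chunk is a 13-byte IHDR whose stored CRC is correct."""
    if identify(data) != "image/png" or len(data) < 33:
        return False
    if data[8:16] != struct.pack(">I", 13) + b"IHDR":
        return False
    (stored_crc,) = struct.unpack(">I", data[29:33])
    # The PNG CRC covers the chunk type and the chunk data, not the length field.
    return zlib.crc32(data[12:29]) == stored_crc


def characterize(data):
    """Extract basic technical metadata from the IHDR chunk of a validated PNG."""
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {"width": width, "height": height,
            "bit_depth": bit_depth, "color_type": color_type}


def ingest(data):
    """Run the three steps in order and refuse objects that fail the first two."""
    fmt = identify(data)
    if fmt != "image/png" or not validate(data):
        raise ValueError("object rejected at ingest")
    return {"format": fmt, **characterize(data)}
```

The ordering matters: characterization reads fixed byte offsets, so it is only meaningful once identification and validation have confirmed that those offsets hold what the format says they hold.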

  23. DRS: what’s managed today As of January 2007, 5.6M files and 22 TB, excluding Google and web archiving

  24. II What has Harvard been doing? Externally…

  25. E-journal archiving • “How can we ensure that licensed e-journal content will remain usable over time?” • Mellon-funded study • Explored technical formats, content types, transactions and dataflows, validation, systems requirements, contractual requirements, business models • Harvard’s proposed model largely implemented by Portico

  26. Technical Metadata for Digital Still Images • “What are the appropriate technical metadata necessary for the preservation of images?” • Standardized as NISO Z39.87 • Expressed in the MIX schema • Maintained by LC • The basis for DRS image technical metadata

  27. METS (Metadata Encoding and Transmission Standard) • “Is there a generic packaging form for digital content?” • For example, • Digital books • Audio works • Images (archival master, production master, deliverables) • Useful for exchange of objects between repositories • Maintained by LC
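To make the packaging idea concrete, the sketch below builds a skeletal METS-style wrapper for one image held as an archival master, a production master, and a deliverable. It is a hedged illustration: the element and attribute names follow the METS schema, but the output is not validated against that schema, and the label, file names, and `build_mets` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(label, files):
    """Build a minimal METS-style package.

    `files` maps a use category to a file location; both are illustrative values.
    """
    root = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": label})
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": label})
    for index, (use, href) in enumerate(files.items(), start=1):
        file_id = f"FILE{index}"
        # One file group per use category, each holding a single file here.
        group = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": use})
        file_el = ET.SubElement(group, f"{{{METS_NS}}}file", {"ID": file_id})
        ET.SubElement(file_el, f"{{{METS_NS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
        # The structural map ties every file back to the one logical object.
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return root


package = build_mets("Sample image", {
    "archival master": "master.tif",
    "production master": "production.tif",
    "deliverable": "view.jpg",
})
xml_text = ET.tostring(package, encoding="unicode")
```

The point of the wrapper is that the three files travel together as one logical object, so a receiving repository can tell which file plays which role.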

  28. Core audio metadata • “What are the appropriate technical metadata necessary for the preservation of audio?” • Standardized as AES X-098 • Used as the basis for DRS audio technical metadata

  29. PDF/A • “PDF defines too many options; is there a ‘flavor’ that will be more ‘preservable’ over time?” • Requires, recommends, and restricts PDF functionality to enhance preservability • Standardized as ISO 19005

  30. PREMIS (PREservation Metadata: Implementation Strategies) • “What are the general metadata elements necessary to preserve digital content over time?” • OCLC/RLG-sponsored work group • Recommendations and best practices for preservation metadata • Core elements, data dictionary, implementation strategies, cooperative projects …

  31. PREMIS (PREservation Metadata: Implementation Strategies) • Report on current practices and recommended metadata elements available • Maintained by LC

  32. AIHT (Archive Ingest and Handling Test) • “What difficulties can we expect to arise during the exchange of content between heterogeneous repositories?” • LC-funded project to investigate exchange of complex data between preservation repositories • Harvard, Stanford, Johns Hopkins, Old Dominion ingest and exchange web archive data

  33. GDFR (Global Digital Format Registry) • “What will we need to know in the future about formats in use today, and how will we know it?” • Shared registry of preservation-related information about technical formats • Reduce work for repositories to create and maintain information about objects they ingest …

  34. GDFR (Global Digital Format Registry) • Enables sharing of format expertise • Directed by Harvard, implemented by OCLC • Funded by Mellon Foundation

  35. Registry of Digital Masters • “How can I find out who has accepted archival responsibility for a given piece of content?” • Initially reformatted materials; intention to expand to born-digital • DLF project • Implemented by and housed at OCLC

  36. Repository certification • “Why should a collection manager trust a digital repository?” • RLG/OCLC report on Trusted Repository Attributes • RLG/NARA Digital Repository Certification Task Force …

  37. Repository certification • Recommend structure and metrics of an international process for certifying preservation repositories • Organizational role and structure, staff size and skill, formal operations and documentation, appropriate technical infrastructure and facilities, on-going funding, and “hand-off” plan, etc. • CRL Auditing and Certification project

  38. Key activities elsewhere • ISO 14721 OAIS (Open Archival Information System) • LC NDIIPP (National Digital Information Infrastructure and Preservation Program) • Web archiving (IA, IIPC) • NARA ERA (Electronic Records Archiving) • Digital Curation Centre • PLANETS

  39. III What more do we need to do?

  40. Evolution: from projects to program • Digital preservation requires a continual, proactive program • You can’t just stop and start • Time frames are MUCH shorter than for preservation of physical collections • Need to define scope and role of our preservation efforts • Investment required in both technology and staffing

  41. Preservation lifecycle • Creation • Format and technical specification choices • Accompanying metadata • Packaging for ingest • Ingest • Validation • Normalization …

  42. Preservation lifecycle • Assumption of preservation responsibility • Monitoring • When is intervention necessary? • Changes to the technical environment • Changes to user expectations • Planning • Significant properties • All preservation decisions involve choice; how to choose what to preserve? …

  43. Preservation lifecycle • Intervention (preserving usability) • Re-acquisition • Re-generation from an archival master • Migration before necessary (“just in case”) • Migration at point of request (“just in time”) • Emulation of obsolete technology in contemporary environment • Universal Virtual Computer (UVC) • Rewrite necessary software to run on technology-agnostic “virtual” computer …

  44. Preservation lifecycle • Intervention (continued) • Save for digital archeologists • After intervention • Post-intervention quality assurance • Documenting the process of change • Succession planning • What do we do when we want to get out of the repository business?

  45. Staffing and responsibilities • Technical • Infrastructure maintenance • Monitor technological change • Integration into larger preservation environment • Preservation planning • Curatorial • Preservation intervention will involve trade-offs • What attributes need to be preserved? • Cost/benefit analysis

  46. Immediate challenges • Google • Substantial increase in scale (both number and size) • “Dark” content; no expectation of current access • Web archiving • Explosion of data types • No forethought on format selection and technical specifications • No metadata • Some failure may be inevitable

  47. Coming soon? • Institutional repository (IR) to enhance scholarly communication and preserve scholarly creations • Similar to web archiving: objects not typically created with preservation in mind, nor accompanied by metadata • “Just in case” local copies of licensed content • May necessitate increased sophistication of IPR management

  48. Longer term issues • Economics – What can we afford to preserve? • Scale – How much can we preserve? • Selection – What do we leave for others? • Federation – Can we share responsibilities for preservation? • Copies in independent environments are safest • Certification – Do we need formal certification? • Note Section 108 revision • Education – Who at Harvard needs to understand?
