This presentation explores the vital role of metadata in digital archives and preservation strategies. It discusses best practices for collecting, importing, storing, and exporting metadata to ensure accurate data retrieval and maintenance. Key topics include automation in user environments, interoperability challenges, and the need for flexible metadata schemas to accommodate future changes. Case studies from the UK National Archives and Pfizer Central Electronic Archive illustrate practical implementations. The session emphasizes the importance of learning from experiences to enhance digital preservation methodologies.
Working with metadata in digital archives • Erpanet: Metadata in Digital Preservation, Marburg, 3-5 September 2003 • Bill Roberts, bill.roberts@tessella.com • Tessella Support Services plc, 3 Vineyard Chambers, Abingdon OX14 3PX, United Kingdom • www.tessella.com
Metadata functions • Edit • Import • Search • Collect • View • Store • Export
Collect metadata (1) • Some must be manual – assist user, prevent mistakes • Avoid duplication – record hierarchies • automation in user environment (business process, workflow etc.) • automatic analysis of file properties • processing history (virus checking results etc.)
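The "automatic analysis of file properties" bullet can be illustrated with a minimal sketch. It only uses the Python standard library and guesses the type from the file extension, which is much weaker than content-based analysis; the function name and the dictionary keys are assumptions made for this example, not part of any system described in the talk.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def collect_file_metadata(path):
    """Gather metadata that can be read from the file system without user input."""
    p = Path(path)
    stat = p.stat()
    # guess_type looks at the extension only; it returns None for unknown types.
    guessed_type, _ = mimetypes.guess_type(p.name)
    return {
        "file_name": p.name,
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "guessed_mime_type": guessed_type,
    }
```

Values collected this way can pre-fill a form, so the user only has to supply the elements that genuinely need manual entry.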
Collect metadata (2) • UK National Archives Digital Archive – Stellent “OutsideIn” • analyses file to determine type • could also form part of approach to extract metadata from content
Collect metadata (3) • Pfizer Central Electronic Archive • Small metadata set • Automatic collection of metadata • Software agents on user servers • Possible to do more • Improve ease of use • Improve accuracy • Pfizer aiming to simplify provenance metadata
Import metadata (1) • Transfer format – XML • link metadata to files during transfer • virus checking, file format analysis etc. • Maintain loose coupling between components of system – agreed interfaces
Import metadata (2) • Efficiency – large transfers • XML can be expensive to process • speed • memory – DOM can be 20 times larger than XML file
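One common way to limit the memory cost of a DOM is to stream the transfer file and handle one record at a time. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library; the element names (`Transfer`, `Record`, `Title`, `FileName`) are hypothetical and are not taken from any schema mentioned in the talk.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(xml_source, record_tag="Record"):
    """Yield one dict per record element without keeping the whole tree in memory."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: (child.text or "").strip() for child in elem}
            elem.clear()  # drop the children of the record just processed


# Small in-memory example standing in for a large transfer file.
sample = io.BytesIO(
    b"<Transfer>"
    b"<Record><Title>Minutes</Title><FileName>a.doc</FileName></Record>"
    b"<Record><Title>Report</Title><FileName>b.pdf</FileName></Record>"
    b"</Transfer>"
)
records = list(iter_records(sample))
```

The trade-off is that a streaming import sees each record once, in document order, so any cross-record checks have to be done afterwards, for example in the database.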
Storage - requirements • don’t lose it! • maintain links between metadata, records and files • find what you are looking for • retrieve
Storage approaches • encapsulation vs. ease of access • volume of data • speed of searching vs. speed of import/export • typically metadata in database and files on file server
The National Archives (UK) Digital Archive approach • Relational database for metadata, file server for computer files • Metadata stored as XML documents in database • A few key elements stored in tables and indexed (unique identifier, PROCAT reference) • Links between records, files, accessions, metadata managed in database • Subset of metadata identified as searchable – values extracted into text based index • File contents not currently searchable
UK Digital Archive (2) • record and file metadata kept separately • flexible relationship between records and computer files • Unlimited depth of record hierarchy (records can contain sub-records) • metadata imported/exported as XML so easier/quicker to store as XML • designed for ease of extension to metadata (extracting metadata into database tables would make extension harder) • generic elements: <GSMElement name="Title"> rather than <Title>
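The storage pattern on the last two slides can be sketched with SQLite: the full metadata is kept as an XML document, a key identifier and the parent link sit in ordinary columns, and a chosen subset of values is copied into a search index. This is an illustration only, not the Digital Archive's actual design; the table names, the `Metadata` wrapper element and the `SEARCHABLE` set are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

SEARCHABLE = {"Title", "Creator"}  # hypothetical subset marked as searchable

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (
    record_id    TEXT PRIMARY KEY,                  -- key element in an indexed column
    parent_id    TEXT REFERENCES record(record_id), -- sub-records point at their parent
    metadata_xml TEXT NOT NULL                      -- full metadata kept as XML
);
CREATE TABLE search_index (
    record_id    TEXT NOT NULL REFERENCES record(record_id),
    element_name TEXT NOT NULL,
    value        TEXT NOT NULL
);
CREATE INDEX idx_search ON search_index (element_name, value);
""")


def store_record(record_id, metadata_xml, parent_id=None):
    """Store the XML as-is and copy searchable values into the index."""
    root = ET.fromstring(metadata_xml)
    conn.execute(
        "INSERT INTO record VALUES (?, ?, ?)", (record_id, parent_id, metadata_xml)
    )
    for elem in root.iter("GSMElement"):
        name = elem.get("name")
        if name in SEARCHABLE and elem.text:
            conn.execute(
                "INSERT INTO search_index VALUES (?, ?, ?)",
                (record_id, name, elem.text.strip()),
            )


def find_by(element_name, value):
    rows = conn.execute(
        "SELECT record_id FROM search_index WHERE element_name = ? AND value = ?",
        (element_name, value),
    )
    return [row[0] for row in rows]


store_record(
    "R1",
    '<Metadata><GSMElement name="Title">Board minutes</GSMElement>'
    '<GSMElement name="Language">en</GSMElement></Metadata>',
)
store_record(
    "R1.1",
    '<Metadata><GSMElement name="Title">Annex A</GSMElement></Metadata>',
    parent_id="R1",
)
```

Because every element has the same generic shape, a new metadata element changes neither the tables nor the import code; making it searchable only means adding its name to `SEARCHABLE`.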
Alternatives • VERS approach: metadata and content files encapsulated together within XML file • +ve: record is self-contained • +ve: well-suited to use of digital signatures on both metadata and content • -ve: more denormalisation required for access • -ve: complexity of adding to or editing metadata • -ve: if file is needed for more than one record, must be duplicated
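The encapsulation idea can be shown with a short sketch that embeds a content file, base64-encoded, in the same XML document as its metadata. The element and attribute names below are hypothetical and do not follow the VERS schema; base64 is used here simply as one common way to carry binary data inside XML.

```python
import base64
import xml.etree.ElementTree as ET


def encapsulate(metadata, file_name, content):
    """Return one XML string holding both the metadata and the file content."""
    record = ET.Element("EncapsulatedRecord")  # hypothetical name, not VERS
    meta = ET.SubElement(record, "Metadata")
    for name, value in metadata.items():
        ET.SubElement(meta, name).text = value
    doc = ET.SubElement(
        record, "Content", {"fileName": file_name, "encoding": "base64"}
    )
    doc.text = base64.b64encode(content).decode("ascii")
    return ET.tostring(record, encoding="unicode")


def extract_content(record_xml):
    """Decode the embedded file content back to bytes."""
    doc = ET.fromstring(record_xml).find("Content")
    return base64.b64decode(doc.text or "")
```

The sketch also makes the drawbacks visible: reading a single metadata value means parsing the whole document, and a file that belongs to two records is embedded twice.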
Interoperability • Not much experience in practice so far • XML helps - but not much! • Likely to be similar but not identical schemas • Different implementations of same schema • Short term: ad hoc mapping between schemas for specific systems • Longer term: various initiatives, but standardisation and semantics-based approaches are difficult
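A short-term, ad hoc mapping between two similar but not identical schemas often amounts to a lookup table plus a rule for what to do with everything else. The field names in this sketch are invented for illustration; they do not come from any real schema.

```python
# Hypothetical element names for a source schema and a target schema.
FIELD_MAP = {
    "Title": "title",
    "Creator": "author",
    "DateCreated": "created",
}


def map_record(source, field_map=FIELD_MAP):
    """Rename known fields; return unknown ones separately instead of dropping them."""
    mapped, unmapped = {}, {}
    for key, value in source.items():
        if key in field_map:
            mapped[field_map[key]] = value
        else:
            unmapped[key] = value
    return mapped, unmapped
```

Returning the unmapped fields, rather than silently discarding them, keeps the loss of information visible when two systems implement the "same" schema differently.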
Extending or changing the schema • Schema may (will!) change in future • No “one size fits all” approach • TNA plans for extensions to core metadata according to file type and according to function • Version control
Preservation metadata • Maintain ability to understand and authentically reproduce content files • PRONOM system – separate database for file formats/accessibility • KB preservation layer model approach • Technology watch
Authentication/Integrity • Digital signatures – has something changed? (also simpler hashing algorithms) • Digital signatures – who signed it? • Control access • Audit logs
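The "has something changed?" question can be answered with a plain hash, as in the sketch below, which records a SHA-256 digest at ingest and recomputes it later. It does not answer "who signed it?"; that needs a digital signature with a private key. The function names are assumptions made for this example.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest, reading the file in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_unchanged(path, recorded_digest):
    """Compare the current digest with the one recorded at ingest."""
    return file_digest(path) == recorded_digest
```

The recorded digest is itself metadata, so it needs the same protection (access control, audit logs) as the file it describes.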
Conclusions • Digital preservation is still a young discipline, so “best” approach not always clear • Do something! Learn from experience • Design for flexibility/replaceability – records must outlive any implementation