Working with metadata in digital archives
Erpanet Metadata in Digital Preservation, Marburg, 3-5 September 2003
Bill Roberts, bill.roberts@tessella.com
Tessella Support Services plc, 3 Vineyard Chambers, Abingdon OX14 3PX, United Kingdom
www.tessella.com
Metadata functions
• Edit
• Import
• Search
• Collect
• View
• Store
• Export
Collect metadata (1)
• Some must be manual – assist user, prevent mistakes
• Avoid duplication – record hierarchies
• Automation in user environment (business process, workflow etc.)
• Automatic analysis of file properties
• Processing history (virus checking results etc.)
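The "automatic analysis of file properties" bullet can be illustrated with a minimal Python sketch. It is not taken from any of the systems discussed in the talk; the function name and the returned field names are assumptions chosen for illustration.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def collect_file_metadata(path):
    """Gather technical metadata that needs no manual input (illustrative field names)."""
    p = Path(path)
    stat = p.stat()
    # guess_type only looks at the file name, so the result is a hint, not a finding
    mime_type, _ = mimetypes.guess_type(p.name)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": p.name,
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
        "mime_type_hint": mime_type,
        # a checksum recorded at collection time supports later integrity checks
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
    }
```

Values gathered this way do not have to be typed by the user, which leaves manual entry for the elements that cannot be derived automatically.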
Collect metadata (2)
• UK National Archives Digital Archive – Stellent “OutsideIn”
• Analyses file to determine type
• Could also form part of approach to extract metadata from content
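As a rough illustration of determining a file's type from its content rather than its name, the sketch below matches leading bytes against a few well-known signatures. It does not use or imitate the Stellent product's API; the table and function name are assumptions, and a real tool covers far more formats.

```python
# A few well-known leading-byte signatures ("magic numbers")
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(data):
    """Return a format name for the given leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

Passing the first few bytes of a file is enough, for example `identify_format(open(path, "rb").read(16))`.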
Collect metadata (3)
• Pfizer Central Electronic Archive
• Small metadata set
• Automatic collection of metadata
• Software agents on user servers
• Possible to do more
• Improve ease of use
• Improve accuracy
• Pfizer aiming to simplify provenance metadata
Import metadata (1)
• Transfer format – XML
• Link metadata to files during transfer
• Virus checking, file format analysis etc.
• Maintain loose coupling between components of system – agreed interfaces
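One way to link metadata to files during an XML transfer is to carry a checksum for each file inside the metadata document. The sketch below assumes a hypothetical transfer format (`Transfer`, `Record`, `Title` and `File` are invented element names, not those of any archive mentioned here).

```python
import hashlib
import xml.etree.ElementTree as ET


def build_transfer_manifest(records):
    """records: list of dicts with 'id', 'title' and 'files' ({file name: bytes})."""
    root = ET.Element("Transfer")
    for rec in records:
        rec_el = ET.SubElement(root, "Record", id=rec["id"])
        ET.SubElement(rec_el, "Title").text = rec["title"]
        for name, content in rec["files"].items():
            # the checksum ties this metadata entry to one exact file
            digest = hashlib.sha256(content).hexdigest()
            ET.SubElement(rec_el, "File", name=name, sha256=digest)
    return ET.tostring(root, encoding="unicode")


def verify_file(manifest_xml, record_id, name, content):
    """Check on import that a received file matches its manifest entry."""
    root = ET.fromstring(manifest_xml)
    for rec_el in root.findall("Record"):
        if rec_el.get("id") != record_id:
            continue
        for file_el in rec_el.findall("File"):
            if file_el.get("name") == name:
                return file_el.get("sha256") == hashlib.sha256(content).hexdigest()
    return False
```

The manifest is the agreed interface: the sending and receiving components only need to share the XML structure, not each other's internals.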
Import metadata (2)
• Efficiency – large transfers
• XML can be expensive to process
• Speed
• Memory – DOM can be 20 times larger than XML file
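A common way to limit the memory cost of a DOM is to parse large transfers incrementally. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse`; the `Record` and `Title` element names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def iter_record_titles(source):
    """Yield (record id, title) pairs from a file name or binary file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record":
            yield elem.get("id"), elem.findtext("Title")
            # drop the finished record's children so they can be garbage-collected;
            # the emptied Record element itself stays attached to the root
            elem.clear()
```

Each record is handled as soon as its closing tag is read, so memory use depends mainly on the size of one record rather than on the size of the whole transfer.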
Storage – requirements
• Don’t lose it!
• Maintain links between metadata, records and files
• Find what you are looking for
• Retrieve
Storage approaches
• Encapsulation vs. ease of access
• Volume of data
• Speed of searching vs. speed of import/export
• Typically metadata in database and files on file server
The National Archives (UK) Digital Archive approach
• Relational database for metadata, file server for computer files
• Metadata stored as XML documents in database
• A few key elements stored in tables and indexed (unique identifier, PROCAT reference)
• Links between records, files, accessions, metadata managed in database
• Subset of metadata identified as searchable – values extracted into text-based index
• File contents not currently searchable
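The storage pattern on this slide (whole XML document in the database, a few key elements in indexed columns, searchable values extracted separately) can be sketched with SQLite. This is not the Digital Archive's actual schema; all table and column names are assumptions, and a plain table stands in for the text-based index.

```python
import sqlite3


def open_store():
    """Create an in-memory store with hypothetical table names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE record (
            record_id     TEXT PRIMARY KEY,   -- unique identifier
            catalogue_ref TEXT,               -- key element kept in a column
            metadata_xml  TEXT NOT NULL       -- full metadata as an XML document
        );
        CREATE INDEX idx_record_catalogue_ref ON record (catalogue_ref);
        CREATE TABLE search_index (
            record_id TEXT NOT NULL REFERENCES record (record_id),
            element   TEXT NOT NULL,
            value     TEXT NOT NULL
        );
    """)
    return conn


def store_record(conn, record_id, catalogue_ref, metadata_xml, searchable):
    """Store the XML untouched and copy only the searchable values out of it."""
    with conn:  # commit both inserts together, or neither
        conn.execute("INSERT INTO record VALUES (?, ?, ?)",
                     (record_id, catalogue_ref, metadata_xml))
        conn.executemany("INSERT INTO search_index VALUES (?, ?, ?)",
                         [(record_id, k, v) for k, v in searchable.items()])


def search(conn, element, text):
    """Return ids of records whose searchable element contains the text."""
    rows = conn.execute(
        "SELECT record_id FROM search_index "
        "WHERE element = ? AND value LIKE ? ORDER BY record_id",
        (element, f"%{text}%"))
    return [row[0] for row in rows]


def fetch_xml(conn, catalogue_ref):
    """Look up the stored XML document by its indexed catalogue reference."""
    row = conn.execute("SELECT metadata_xml FROM record WHERE catalogue_ref = ?",
                       (catalogue_ref,)).fetchone()
    return row[0] if row else None
```

Because the XML document is stored whole, adding a new metadata element needs no table change unless that element also has to be searchable.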
UK Digital Archive (2)
• Record and file metadata kept separately
• Flexible relationship between records and computer files
• Unlimited depth of record hierarchy (records can contain sub-records)
• Metadata imported/exported as XML so easier/quicker to store as XML
• Designed for ease of extension to metadata (disadvantage of extracting metadata into database tables)
• <GSMElement name="Title"> rather than <Title>
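The benefit of a generic `<GSMElement name="Title">` form over a dedicated `<Title>` tag is that reading code does not need to know the element names in advance. A minimal sketch, in which only the `GSMElement`/`name` form comes from the slide and the `Metadata` wrapper element is an assumption:

```python
import xml.etree.ElementTree as ET


def read_elements(metadata_xml):
    """Collect every GSMElement value, keyed by its name attribute."""
    root = ET.fromstring(metadata_xml)
    values = {}
    for el in root.iter("GSMElement"):
        # repeated names are kept as a list, so multi-valued elements survive
        values.setdefault(el.get("name"), []).append(el.text or "")
    return values
```

A new metadata element then appears as one more `name` value, and neither the parser nor an XML schema built around `GSMElement` has to change.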
Alternatives
• VERS approach: metadata and content files encapsulated together within XML file
• +ve: record is self-contained
• +ve: well-suited to use of digital signatures on both metadata and content
• -ve: more denormalisation required for access
• -ve: complexity of adding to or editing metadata
• -ve: if file is needed for more than one record, must be duplicated
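The encapsulation idea can be sketched by embedding a content file, Base64-encoded, in the same XML document as its metadata. This is a simplified illustration and not the actual VERS format; the element and attribute names are assumptions.

```python
import base64
import xml.etree.ElementTree as ET


def encapsulate(title, file_name, content):
    """Wrap metadata and file content in one self-contained XML document."""
    root = ET.Element("EncapsulatedRecord")
    ET.SubElement(root, "Title").text = title
    doc = ET.SubElement(root, "Content", name=file_name, encoding="base64")
    # Base64 keeps arbitrary binary content safe inside XML text
    doc.text = base64.b64encode(content).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def extract(record_xml):
    """Recover the file name and original bytes from an encapsulated record."""
    root = ET.fromstring(record_xml)
    doc = root.find("Content")
    return doc.get("name"), base64.b64decode(doc.text)
```

The sketch also makes the last drawback visible: a file that belongs to two records has to be encoded into both documents.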
Interoperability
• Not much experience in practice so far
• XML helps – but not much!
• Likely to be similar but not identical schemas
• Different implementations of same schema
• Short term: ad hoc mapping between schemas for specific systems
• Longer term: various initiatives, but standardisation and semantics-based approaches are difficult
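The short-term "ad hoc mapping between schemas" can be as simple as a lookup table between element names. In the sketch below the source element names are invented, and the targets are Dublin Core elements chosen only as a familiar example.

```python
# Hypothetical mapping from one system's element names to Dublin Core elements
FIELD_MAP = {
    "Title": "dc:title",
    "Creator": "dc:creator",
    "DateCreated": "dc:date",
}


def map_record(source, field_map=FIELD_MAP):
    """Split a record into mapped fields and fields with no agreed equivalent."""
    mapped, unmapped = {}, {}
    for name, value in source.items():
        if name in field_map:
            mapped[field_map[name]] = value
        else:
            # keep what cannot be mapped instead of silently dropping it
            unmapped[name] = value
    return mapped, unmapped
```

Returning the unmapped fields separately makes the gaps between "similar but not identical" schemas explicit, so they can be reviewed rather than lost.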
Extending or changing the schema
• Schema may (will!) change in future
• No “one size fits all” approach
• TNA plans for extensions to core metadata according to file type and according to function
• Version control
Preservation metadata
• Maintain ability to understand and authentically reproduce content files
• PRONOM system – separate database for file formats/accessibility
• KB preservation layer model approach
• Technology watch
Authentication/Integrity
• Digital signatures – has something changed? (also simpler hashing algorithms)
• Digital signatures – who signed it?
• Control access
• Audit logs
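The "has something changed?" check can be sketched with a plain hash, which detects change but, unlike a digital signature, says nothing about who produced the content. The hash-chained audit log is an added assumption showing one way to make a log tamper-evident; it is not described in the talk.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 digest to record when a file enters the archive."""
    return hashlib.sha256(content).hexdigest()


def has_changed(content, recorded_fingerprint):
    """Compare the current bytes with the digest recorded earlier."""
    return fingerprint(content) != recorded_fingerprint


def append_audit(log, event):
    """Append an event whose hash also covers the previous entry's hash."""
    previous = log[-1]["entry_hash"] if log else ""
    entry_hash = hashlib.sha256((previous + event).encode("utf-8")).hexdigest()
    log.append({"event": event, "entry_hash": entry_hash})


def audit_log_intact(log):
    """Recompute the chain; an edited or removed entry breaks it."""
    previous = ""
    for entry in log:
        expected = hashlib.sha256((previous + entry["event"]).encode("utf-8")).hexdigest()
        if entry["entry_hash"] != expected:
            return False
        previous = entry["entry_hash"]
    return True
```

Answering "who signed it?" needs a real signature scheme with key management, which is beyond this sketch.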
Conclusions
• Digital preservation is still a young discipline, so “best” approach not always clear
• Do something! Learn from experience
• Design for flexibility/replaceability – records must outlive any implementation