Harvard’s Digital Repository Service (DRS) Architecture Harvard University Library (HUL) Andrea Goethals, Randy Stern December 10, 2009
Today’s Agenda • What is the DRS? • DRS 1 Architecture • DRS 2 Highlights • Questions
DRS Context • A core portion of HUL’s mission is to provide current and future access to research materials and resources, with recognition that preserving access to digital content requires different strategies, tools and skills • Digital Preservation projects and activities (2000-) • Digital Preservation Program (June 2008-) • Centerpiece: the Digital Repository Service (DRS)
What is the DRS? • Set of professionally managed services for preservation and access • Preservation planning & activities, administration, management tools • Creation & format guidelines, training, ingest service • Delivery services, access restrictions, persistent names • Metadata and content storage & monitoring service • (Diagram labels: creation/acquisition, use)
DRS by the numbers • 103 TB of content • 335 TB total (counting all copies) • 13 M files • 10 M image files • 21,000 audio files • 2.8 M text files • 851,000 compressed Google books • containing 672 M files • 6,300 compressed web harvests • containing 14 M web files
DRS growth • Fueled by large projects • Recent explosion – mass digitization (Google book project)
Broadening content and metadata requirements • New formats and genres, born-digital content • Email archiving, more audio, drawing, video • Descriptive metadata, linkages to catalogs • Rights management, more access restrictions • Auxiliary content • Contextual material, licenses, donor agreements, collection objects, documentation, repository agents
Metadata Storage Database • DRS-1 objects are modeled as related files • File metadata: • Administrative (owners, projects, deposit dates, owner IDs, etc.) • Technical (format MIME type & format-specific data) • Role, purpose, quality • No descriptive metadata • Access restrictions (public, Harvard-only, dark) • MD5 file digest and byte count • Relationship triples: • “is_part_of”, “is_preservation_replacement_for”, etc. • 21 relationship types • ~13M files, 12.3M relationships
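The file-plus-triples model on this slide can be illustrated with a minimal in-memory sketch. This is an assumption-laden illustration, not the DRS schema (the slides say the real store is an Oracle database): the class names, field names, and file IDs below are hypothetical, and only the relationship name "is_part_of" and the listed metadata categories come from the slide.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Per-file metadata; field names are hypothetical."""
    file_id: str
    mime_type: str
    access: str       # "public", "Harvard-only", or "dark"
    md5: str
    byte_count: int


class MetadataStore:
    """Toy store: files plus (subject, predicate, object) triples."""

    def __init__(self):
        self.files = {}
        self.triples = set()

    def add_file(self, file_id, mime_type, access, content):
        # Digest and byte count are derived from the content itself.
        record = FileRecord(file_id, mime_type, access,
                            hashlib.md5(content).hexdigest(), len(content))
        self.files[file_id] = record
        return record

    def relate(self, subject_id, predicate, object_id):
        # Both ends of a relationship must already be known files.
        if subject_id not in self.files or object_id not in self.files:
            raise KeyError("relationship refers to an unknown file")
        self.triples.add((subject_id, predicate, object_id))

    def related(self, object_id, predicate):
        """Sorted IDs of files pointing at object_id via predicate."""
        return sorted(s for s, p, o in self.triples
                      if p == predicate and o == object_id)


store = MetadataStore()
store.add_file("obj-1", "text/xml", "public", b"<mets/>")
store.add_file("page-1", "image/tiff", "public", b"first page")
store.add_file("page-2", "image/tiff", "public", b"second page")
store.relate("page-1", "is_part_of", "obj-1")
store.relate("page-2", "is_part_of", "obj-1")
parts = store.related("obj-1", "is_part_of")
```

Here an "object" exists only as the set of files that point at a common file, which is the sense in which DRS-1 objects are "modeled as related files".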
Content Storage Service: Bit preservation • Redundancy, heterogeneity, extensibility, scalability, simple file access protocol • Access demands high availability and high performance delivery • Functional requirements: • At least three copies in three physical locations • Two media types • Two on-line copies for high availability • One near-line copy, one off-line copy
Content Storage Service: Storage provider • SUN SAM/QFS Storage Archive Manager • 2 file classes: highuse and lowuse • Archiving rules • High use files • Copy 1 on disk at local server center • Copy 2 on disk at remote server center • Copy 3 on tape in library • Copy 4 on tape off line at Harvard Depository • Low use files • Copy 1 on disk at remote server center • Copy 2 on tape in library • Copy 3 on tape off line at Harvard Depository • High speed cache for access
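The archiving rules above can be written down as a small lookup table and compared with two of the stated functional requirements (at least three copies in three physical locations, two media types). This is only a sketch of the rule table, not SAM/QFS configuration; the names `ARCHIVE_RULES`, `summarize`, and `meets_copy_requirements` are hypothetical, and treating the four named sites as four distinct physical locations is an assumption.

```python
# (media, location) for each copy, in the order listed on the slide.
ARCHIVE_RULES = {
    "highuse": [
        ("disk", "local server center"),
        ("disk", "remote server center"),
        ("tape", "library"),
        ("tape", "Harvard Depository"),
    ],
    "lowuse": [
        ("disk", "remote server center"),
        ("tape", "library"),
        ("tape", "Harvard Depository"),
    ],
}


def summarize(file_class):
    """Count copies, distinct media types, and distinct locations."""
    copies = ARCHIVE_RULES[file_class]
    return {
        "copies": len(copies),
        "media_types": len({media for media, _ in copies}),
        "locations": len({location for _, location in copies}),
    }


def meets_copy_requirements(file_class):
    """Check only: >= 3 copies, >= 3 locations, >= 2 media types."""
    s = summarize(file_class)
    return s["copies"] >= 3 and s["locations"] >= 3 and s["media_types"] >= 2


high = summarize("highuse")
low = summarize("lowuse")
```

Under these assumptions both classes satisfy the copy-count, location, and media-type requirements; the on-line and near-line requirements are not modeled here because the slide does not label each copy with an availability tier.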
Consistency Validation Service • Continuous monitoring for file system and database consistency • Crawls the file system and confirms that every disk file has a DRS metadata record • Crawls the DRS metadata records table and confirms that every file referenced exists in the file system • Confirms that the MD5 checksum for each file is the same as recorded in the database • Reports errors to administrators
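The three checks on this slide (every disk file has a metadata record, every record has a file, every MD5 matches) can be sketched in a few lines. This is a minimal illustration and not the DRS implementation: the function names are hypothetical, and the metadata records are assumed to be a plain mapping from relative path to expected MD5 rather than database rows.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute a file's MD5 in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_consistency(root, records):
    """Compare files under root with records {relative_path: md5}.

    Returns a list of (error_kind, relative_path) tuples.
    """
    root = Path(root)
    on_disk = {p.relative_to(root).as_posix()
               for p in root.rglob("*") if p.is_file()}
    recorded = set(records)
    errors = []
    # Crawl 1: disk files that have no metadata record.
    for rel in sorted(on_disk - recorded):
        errors.append(("no_metadata_record", rel))
    # Crawl 2: metadata records whose file is missing on disk.
    for rel in sorted(recorded - on_disk):
        errors.append(("missing_file", rel))
    # Check 3: the stored checksum must match the file's current MD5.
    for rel in sorted(on_disk & recorded):
        if md5_of(root / rel) != records[rel]:
            errors.append(("checksum_mismatch", rel))
    return errors
```

The returned list stands in for the report sent to administrators. At the scale given earlier (about 13 M files), a production service would presumably run incrementally rather than in one pass, which this sketch does not attempt.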
Delivery and Access Services • Real time web delivery • Image delivery service • JPEG, JPEG 2000, TIF, GIF • Page turned object delivery service • METS + page images + page text • Streaming delivery service • Real Audio • File delivery service • PDFs • Web Archiving Service • Asynchronous delivery service • Archival masters
Administrative Services • DRS Web Administrator • Searching, reporting, file operations, archival master download • Page Turned Object Maintenance • METS structure editor • Name Resolution Service Maintenance • URN create/update/report
Storage Services: Implementation • Sun SAM-QFS 4.6 • Rule-based automatic archiving – no “backups” • Unified file name space • Dual Sun T2000 Solaris SAM servers • Redundant servers at site 1, DR failover at site 2 • Nightly samfsdump from site 1 - samfsrestore at site 2 • EMC CLARiiON disk storage arrays • RAID 1+0 FC cache / RAID 5 SATA disk archives • 35 TB CX3-40 at site 1, 109 TB CX3-80 at site 2 • StorageTek SL500 tape library • LTO-4 • In production since Feb 2008
Metadata Storage Service: Implementation • DRS metadata storage • Oracle 10G • Live production server – copy 1 • Dataguard failover copy – copy 2 • Legato tape backups – copy 3
Ingest Services: Implementation • Batch deposit of SIPs to SFTP drop boxes • DRS Batch Loader operates 8AM-8PM • 51 object owners – libraries, museums • ~12 depositors • 234 project codes • Daily weekday deposits average ~60 GB/day
Delivery Services: Implementation • High availability design • Redundant public access servers • Delivery, access management, name resolution • Cisco Content Switch • Load balancing, sticky sessions • MRTG monitoring • Change control – no downtime on updates • Red Hat Enterprise Linux, Java 1.5, Tomcat • Tomcat and log4j logging and statistics
Scope of work • Builds on the early 2008 storage upgrade • 2008-~2013 • Affects every part of the DRS! • Expanded data model • New and different metadata • Object descriptors • Content models • Preservation plans • Enhanced deposit tools • New management applications • New backend services • First major release: Summer 2011
Object descriptors • A METS metadata file per object on the file system alongside content files • Descriptive, administrative, preservation, technical and structural metadata • Describes the object, all its files and bitstreams and related significant events • Gives the metadata the same secure storage as the content files • Self-contained, portable objects
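To make the idea of a per-object METS descriptor concrete, here is a minimal sketch that builds a skeletal METS document listing an object's files with their sizes and MD5 digests. It uses only standard METS elements and attributes (`mets`, `fileSec`, `fileGrp`, `file`, `FLocat`, `structMap`, `div`, `fptr`); the actual DRS 2 descriptor profile is not shown in these slides, so the function name, IDs, and layout are assumptions, and the descriptive, administrative, preservation, and technical sections are omitted entirely.

```python
import hashlib
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def mets_tag(name):
    """Qualified tag name in the METS namespace."""
    return "{%s}%s" % (METS_NS, name)


def build_descriptor(object_id, files):
    """Build a skeletal METS tree (hypothetical layout).

    files: list of (file_id, relative_path, mime_type, content_bytes).
    """
    mets = ET.Element(mets_tag("mets"), {"OBJID": object_id})
    file_grp = ET.SubElement(ET.SubElement(mets, mets_tag("fileSec")),
                             mets_tag("fileGrp"))
    root_div = ET.SubElement(ET.SubElement(mets, mets_tag("structMap")),
                             mets_tag("div"))
    for file_id, path, mime_type, content in files:
        # Fixity data lives in the descriptor, next to the content files.
        file_el = ET.SubElement(file_grp, mets_tag("file"), {
            "ID": file_id,
            "MIMETYPE": mime_type,
            "SIZE": str(len(content)),
            "CHECKSUM": hashlib.md5(content).hexdigest(),
            "CHECKSUMTYPE": "MD5",
        })
        ET.SubElement(file_el, mets_tag("FLocat"), {
            "LOCTYPE": "URL",
            "{%s}href" % XLINK_NS: path,
        })
        # The structural map points back at each file by ID.
        ET.SubElement(root_div, mets_tag("fptr"), {"FILEID": file_id})
    return mets


descriptor = build_descriptor(
    "example-object-1",
    [("f1", "page1.tif", "image/tiff", b"abc")],
)
xml_text = ET.tostring(descriptor, encoding="unicode")
```

Because the descriptor is an ordinary file written beside the content files, it would receive the same storage and copy rules as the content, which is the "self-contained, portable objects" point on the slide.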
Some technical challenges • Amount of metadata to store • Bitstream description • Many elements (esp. MODS, MIX) • Efficient, scalable search implementation • Database, index, combination? • Keeping metadata in sync • Database, object descriptors on file system • Effect on system of continued growth • Consistency checks, migrations, format analysis, etc. • HRCI requirements • Email archiving
4. Questions? email@example.com firstname.lastname@example.org