Designing Storage Architectures for Preservation Collections
Library of Congress, September 17-18, 2007

Preservation and Access Repository Storage Architecture

Stephen Abrams
Harvard University Library
stephen_abrams@harvard.edu

Digital preservation at Harvard
• Obligation to ensure the ongoing usability of library digital assets over time
• Digital Repository Service (DRS)
  • Managed preservation and access repository
  • Seven years of production operation
  • 6.7 million assets (27 TB)
• Primary strategy: redundancy and heterogeneity
• Primary challenge: scaling

Storage classification
• All managed assets are assigned a storage classification
  • Public use (U): high availability, fast response
  • Archival storage (A): high capacity, low cost
• Use assets are optimized for web-friendly delivery
• Archival assets are optimized for longevity
• Asset classification is known at the point of acquisition

Architectural requirements
• Each asset is stored:
  • In at least 3 physical locations
  • On at least 2 storage media
  • With at least 2 on-line copies (U) / 1 on-line copy (A)
  • With at least 1 off-line copy
• Ongoing auditing for bit-level error detection and correction
• Virtualization layer with uniform interface to all assets, regardless of physical medium
• Application interface exposed as NFS-mountable file systems

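The slides do not say how the bit-level audit is carried out. As a minimal sketch of the idea, assuming SHA-256 digests and a simple majority vote among copies (both are illustrative assumptions, not the DRS method), an audit could detect a damaged copy and overwrite it from a good one. The function names `audit_copies` and `repair_copies` are hypothetical.

```python
import hashlib
import shutil
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_copies(copy_paths, expected_digest=None):
    """Split the copies of one asset into (good, bad) lists.

    If no digest was recorded at ingest, fall back to a majority vote
    among the copies (an assumption made for this sketch).
    """
    if not copy_paths:
        raise ValueError("no copies to audit")
    digests = {p: sha256_of(p) for p in copy_paths}
    if expected_digest is None:
        expected_digest, votes = Counter(digests.values()).most_common(1)[0]
        if votes * 2 <= len(copy_paths):
            raise ValueError("no majority digest; cannot tell which copy is correct")
    good = [p for p, d in digests.items() if d == expected_digest]
    bad = [p for p, d in digests.items() if d != expected_digest]
    return good, bad


def repair_copies(good, bad):
    """Overwrite each damaged copy with the bytes of a verified copy."""
    if not good:
        raise ValueError("no verified copy available for repair")
    for path in bad:
        shutil.copyfile(good[0], path)
```

Recording a digest at ingest and passing it as `expected_digest` is the more robust choice, since a majority vote cannot help when most copies are damaged.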
Storage architecture
• QFS cache and primary U disk archive on EMC CX3-40 (FC/SATA, RAID-1/RAID-5) at on-campus data center
  • Redundant switched FC data paths to primary/fail-over Sun T2000/Solaris file servers running SAM-QFS
• Primary A/secondary U disk archive on EMC CX3-80 (FC/SATA, RAID-1/RAID-5) at off-campus data center
  • Redundant FC data paths to T2000 file server running SAM-QFS
• Secondary A/tertiary U tape archive on StorageTek SL500 (LTO-3), FC-attached to primary on-campus T2000
• Tertiary A/quaternary U tape archive on LTO-3 media at off-campus managed storage facility
• Disk archives are UFS file systems containing Tar files; even if the SAM infrastructure is lost, they can be fully recovered (if slowly) with standard Unix/Linux tools

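The recovery claim for the disk archives can be illustrated with nothing more than a standard tar reader. The sketch below assumes a directory tree that contains plain tar files and extracts each one into a restore directory; the internal layout of a real SAM disk archive is not described in the slides, so the paths and the function name `recover_archive_tree` are hypothetical.

```python
import tarfile
from pathlib import Path


def recover_archive_tree(archive_root, restore_root):
    """Extract every tar file found under archive_root into restore_root.

    Returns the number of tar files processed. Files that are not valid
    tar archives are skipped.
    """
    archive_root = Path(archive_root)
    restore_root = Path(restore_root)
    restore_root.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(archive_root.rglob("*")):
        if path.is_file() and tarfile.is_tarfile(path):
            with tarfile.open(path) as tar:
                if hasattr(tarfile, "data_filter"):
                    # Newer Python versions: refuse unsafe member paths.
                    tar.extractall(restore_root, filter="data")
                else:
                    tar.extractall(restore_root)
            count += 1
    return count
```

The same result could be had from a shell loop over `tar -xf`; the point is only that no SAM-specific software is needed to read the archive copies.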
Storage virtualization
• SAM-QFS reader/writer on primary on-campus T2000 file server
• SAM-QFS reader on fail-over on-campus/off-campus T2000 file servers
• All U and A assets written to QFS cache on CX3-40
  • Immediate creation of all UFS disk and LTO-3 tape archive copies
  • Immediate release from cache with “stage never”
• SAM manages all copies of all assets; externally each asset appears as a single file in an NFS-mountable file system
• Application access requests are initiated by NFS reads and are fulfilled directly from primary disk archive copy without staging to cache

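As a cross-check, the set of copies created for each asset can be compared against the stated requirements: at least 3 physical locations, at least 2 storage media, at least 2 on-line copies for U or 1 for A, and at least 1 off-line copy. The sketch below models this in plain Python. The copy lists are an illustrative reading of the architecture, and treating every tape copy as off-line (including the one in the tape library) is an assumption made here, not a statement from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    location: str  # physical site holding the copy
    medium: str    # "disk" or "tape"
    online: bool   # assumption: disk copies are on-line, tape copies are not


# Minimum number of on-line copies per storage classification.
MIN_ONLINE = {"U": 2, "A": 1}


def check_placement(classification, copies):
    """Return the list of violated requirements (empty if compliant)."""
    problems = []
    if len({c.location for c in copies}) < 3:
        problems.append("fewer than 3 physical locations")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than 2 storage media")
    needed = MIN_ONLINE[classification]
    if sum(1 for c in copies if c.online) < needed:
        problems.append(f"fewer than {needed} on-line copies")
    if all(c.online for c in copies):
        problems.append("no off-line copy")
    return problems


# Hypothetical copy sets for one U asset and one A asset.
u_copies = [
    Copy("on-campus data center", "disk", True),
    Copy("off-campus data center", "disk", True),
    Copy("on-campus data center", "tape", False),
    Copy("off-campus storage facility", "tape", False),
]
a_copies = [
    Copy("off-campus data center", "disk", True),
    Copy("on-campus data center", "tape", False),
    Copy("off-campus storage facility", "tape", False),
]
```

Under these assumptions both copy sets satisfy the requirements for their own classification, while the A copy set would fall short if it were held to the U on-line minimum.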
Issues
• Disk vs tape
• LTO-3 vs LTO-4
• Tape archive media pooling
• All hardware/software installed; currently engaged in configuration and preliminary unit/integration testing
• Need to establish benchmarks for system performance
• Planning for migration from existing storage solution
• Automated data classification
• Response to an anticipated escalating rate of asset acquisition
  • Google mass digitization
  • Web archiving
  • Audio/video content
  • Scientific data sets