Harvard’s Digital Repository Service (DRS) Architecture

Harvard University Library (HUL). Andrea Goethals, Randy Stern. December 10, 2009.


Presentation Transcript


  1. Harvard’s Digital Repository Service (DRS) Architecture Harvard University Library (HUL) Andrea Goethals, Randy Stern December 10, 2009

  2. Today’s Agenda • What is the DRS? • DRS 1 Architecture • DRS 2 Highlights • Questions

  3. 1. What is the DRS?

  4. DRS Context • A core portion of HUL’s mission is to provide current and future access to research materials and resources, with recognition that preserving access to digital content requires different strategies, tools and skills • Digital Preservation projects and activities (2000-) • Digital Preservation Program (June 2008-) • Centerpiece: the Digital Repository Service (DRS)

  5. What is the DRS? • Set of professionally managed services for preservation and access, spanning the content lifecycle from creation/acquisition to use • Preservation planning & activities, administration, management tools • Creation & format guidelines, training, ingest service • Delivery services, access restrictions, persistent names • Metadata and content storage & monitoring service

  6. What’s in the DRS?

  7. What’s in the DRS?

  8. What’s in the DRS?

  9. What’s in the DRS?

  10. What’s in the DRS?

  11. What’s in the DRS?

  12. What’s in the DRS?

  13. What’s in the DRS?

  14. DRS by the numbers • 103 TB of content • 335 TB total (counting all copies) • 13 M files • 10 M image files • 21,000 audio files • 2.8 M text files • 851,000 compressed Google books • containing 672 M files • 6,300 compressed web harvests • containing 14 M web files

  15. DRS growth • Fueled by large projects • Recent explosion – mass digitization (Google book project)

  16. Broadening content and metadata requirements • New formats and genres, born-digital content • Email archiving, more audio, drawing, video • Descriptive metadata, linkages to catalogs • Rights management, more access restrictions • Auxiliary content • Contextual material, licenses, donor agreements, collection objects, documentation, repository agents

  17. 2. DRS 1 Architecture

  18. DRS System Architecture

  19. Metadata Storage Database • DRS-1 objects are modeled as related files • File metadata: • Administrative (owners, projects, deposit dates, owner IDs, etc.) • Technical (format MIME type & format-specific data) • Role, purpose, quality • No descriptive metadata • Access restrictions (public, Harvard-only, dark) • MD5 file digest and byte count • Relationship triples • “is_part_of”, “is_preservation_replacement_for”, etc. • 21 relationship types • ~13M files, 12.3M relationships
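The file-plus-triples model on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the DRS schema: the class names, field names, and the in-memory store are assumptions made for the example; only the relationship names and the access levels come from the slide.

```python
from dataclasses import dataclass
from collections import defaultdict


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical per-file metadata record (field names are assumptions)."""
    file_id: int
    owner: str
    mime_type: str
    access: str        # "public", "harvard-only", or "dark"
    md5: str
    byte_count: int


class RelationshipStore:
    """Stores (subject, predicate, object) triples between file IDs."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject_id, predicate, object_id):
        self._by_subject[subject_id].add((predicate, object_id))

    def objects(self, subject_id, predicate):
        # Look up without creating an empty entry for unknown subjects.
        triples = self._by_subject.get(subject_id, set())
        return sorted(o for p, o in triples if p == predicate)


store = RelationshipStore()
# File 2 is a page image that belongs to the object rooted at file 1.
store.add(2, "is_part_of", 1)
# File 3 is a migrated copy that replaces file 2 for preservation purposes.
store.add(3, "is_preservation_replacement_for", 2)
```

Because an object is just the set of files reachable through such triples, questions like "which file replaced this one?" become lookups by predicate rather than a fixed object schema.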

  20. Content Storage Service: Bit preservation • Redundancy, heterogeneity, extensibility, scalability, simple file access protocol • Access demands high availability and high performance delivery • Functional requirements: • At least three copies in three physical locations • Two media types • Two on-line copies for high availability • One near-line copy, one off-line copy

  21. Content Storage Service: Storage provider • Sun SAM/QFS Storage Archive Manager • 2 file classes: highuse and lowuse • Archiving rules • High use files • Copy 1 on disk at local server center • Copy 2 on disk at remote server center • Copy 3 on tape in library • Copy 4 on tape off line at Harvard Depository • Low use files • Copy 1 on disk at remote server center • Copy 2 on tape in library • Copy 3 on tape off line at Harvard Depository • High speed cache for access
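The two archiving rules above can be written down as data and checked against part of the bit-preservation requirements (at least three copies, two media types). This is a hedged sketch, not SAM/QFS configuration: the dictionary layout and function names are assumptions, and only the copy placements are transcribed from the slide.

```python
# Copy plans transcribed from the slide; the data layout is illustrative only.
ARCHIVE_RULES = {
    "highuse": [
        ("disk", "local server center"),
        ("disk", "remote server center"),
        ("tape", "library"),
        ("tape", "Harvard Depository (off line)"),
    ],
    "lowuse": [
        ("disk", "remote server center"),
        ("tape", "library"),
        ("tape", "Harvard Depository (off line)"),
    ],
}


def summarize(file_class):
    """Return the number of copies and the distinct media for a file class."""
    plan = ARCHIVE_RULES[file_class]
    return {"copies": len(plan), "media": sorted({medium for medium, _ in plan})}


def meets_copy_requirements(file_class, min_copies=3, min_media=2):
    """Check the copy-count and media-type requirements only."""
    summary = summarize(file_class)
    return summary["copies"] >= min_copies and len(summary["media"]) >= min_media


if __name__ == "__main__":
    for file_class in ARCHIVE_RULES:
        print(file_class, summarize(file_class), meets_copy_requirements(file_class))
```

The check deliberately ignores the on-line/near-line/off-line requirement, since the slides do not say how each copy is classified.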

  22. Consistency Validation Service • Continuous monitoring for file system and database consistency • Crawls the file system and confirms that every disk file has a DRS metadata record • Crawls the DRS metadata records table and confirms that every file referenced exists in the file system • Confirms that the MD5 checksum for each file is the same as recorded in the database • Reports errors to administrators
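The three checks listed on this slide can be sketched as a single pass over a directory tree and a metadata table. This is a minimal illustration under stated assumptions: the metadata "table" is a plain dictionary mapping relative paths to MD5 digests, and the function names and error labels are invented for the example rather than taken from the DRS.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate(root, records):
    """Compare files under `root` with `records` (relative path -> MD5 hex).

    Returns a list of (error_kind, relative_path) tuples.
    """
    root = Path(root)
    on_disk = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    known = set(records)
    errors = []
    # 1. Every file on disk must have a metadata record.
    for rel in sorted(on_disk - known):
        errors.append(("no_metadata_record", rel))
    # 2. Every metadata record must point at an existing file.
    for rel in sorted(known - on_disk):
        errors.append(("missing_file", rel))
    # 3. The stored checksum must match the file contents.
    for rel in sorted(on_disk & known):
        if md5_of(root / rel) != records[rel]:
            errors.append(("checksum_mismatch", rel))
    return errors
```

A production service would presumably run this continuously and incrementally over millions of files and send the resulting list to administrators; the sketch only shows the comparison logic.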

  23. Delivery and Access Services • Real time web delivery • Image delivery service • JPEG, JPEG 2000, TIF, GIF • Page turned object delivery service • METS + page images + page text • Streaming delivery service • Real Audio • File delivery service • PDFs • Web Archiving Service • Asynchronous delivery service • Archival masters

  24. Administrative Services • DRS Web Administrator • Searching, reporting, file operations, archival master download • Page Turned Object Maintenance • METS structure editor • Name Resolution Service Maintenance • URN create/update/report

  25. DRS System Architecture

  26. DRS System Architecture: Ingest Services

  27. DRS System Architecture: Delivery Services

  28. DRS System Architecture: Persistent Naming and Access Services

  29. DRS System Architecture: Storage Services

  30. Storage Services: Implementation • Sun SAM-QFS 4.6 • Rule-based automatic archiving – no “backups” • Unified file name space • Dual Sun T2000 Solaris SAM servers • Redundant servers at site 1, DR failover at site 2 • Nightly samfsdump from site 1, samfsrestore at site 2 • EMC CLARiiON disk storage arrays • RAID 1+0 FC cache / RAID 5 SATA disk archives • 35 TB CX3-40 at site 1, 109 TB CX3-80 at site 2 • StorageTek SL500 tape library • LTO-4 • In production since Feb 2008

  31. Storage Services: Redundancy

  32. Metadata Storage Service: Implementation • DRS metadata storage • Oracle 10g • Live production server – copy 1 • Data Guard failover copy – copy 2 • Legato tape backups – copy 3

  33. Ingest Services: Implementation • Batch deposit of SIPs to SFTP drop boxes • DRS Batch Loader operates 8AM-8PM • 51 object owners – libraries, museums • ~12 depositors • 234 project codes • Daily weekday deposits average ~60 GB/day

  34. Delivery Services: Implementation • High availability design • Redundant public access servers • Delivery, access management, name resolution • Cisco Content Switch • Load balancing, sticky sessions • MRTG monitoring • Change control – no downtime on updates • Red Hat Enterprise Linux, Java 1.5, Tomcat • Tomcat and log4j logging and statistics

  35. 3. DRS 2 Highlights

  36. Scope of work • Builds on the early 2008 storage upgrade • 2008-~2013 • Affects every part of the DRS! • Expanded data model • New and different metadata • Object descriptors • Content models • Preservation plans • Enhanced deposit tools • New management applications • New backend services • First major release: Summer 2011

  37. Object descriptors • A METS metadata file per object on the file system alongside content files • Descriptive, administrative, preservation, technical and structural metadata • Describes the object, all its files and bitstreams and related significant events • Gives the metadata the same secure storage as the content files • Self-contained, portable objects
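To make the idea of a per-object METS descriptor concrete, here is a minimal sketch that builds one with Python's standard library. It is not the DRS 2 METS profile: it covers only a file section and a structural map, leaves out the descriptive, administrative, and preservation sections the slide mentions, and the function name, input dictionary keys, and sample values are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(ns, tag):
    """Build a namespace-qualified tag or attribute name for ElementTree."""
    return f"{{{ns}}}{tag}"


def build_descriptor(object_id, files):
    """Return a METS document (as a string) listing an object's files.

    `files` is a list of dicts with hypothetical keys:
    id, href, mime, md5, size.
    """
    mets = ET.Element(q(METS_NS, "mets"), {"OBJID": object_id})
    file_sec = ET.SubElement(mets, q(METS_NS, "fileSec"))
    file_grp = ET.SubElement(file_sec, q(METS_NS, "fileGrp"))
    struct_map = ET.SubElement(mets, q(METS_NS, "structMap"))
    div = ET.SubElement(struct_map, q(METS_NS, "div"))
    for f in files:
        file_el = ET.SubElement(file_grp, q(METS_NS, "file"), {
            "ID": f["id"],
            "MIMETYPE": f["mime"],
            "CHECKSUM": f["md5"],
            "CHECKSUMTYPE": "MD5",
            "SIZE": str(f["size"]),
        })
        # Point at the content file stored alongside the descriptor.
        ET.SubElement(file_el, q(METS_NS, "FLocat"), {
            "LOCTYPE": "URL",
            q(XLINK_NS, "href"): f["href"],
        })
        # Reference the file from the structural map.
        ET.SubElement(div, q(METS_NS, "fptr"), {"FILEID": f["id"]})
    return ET.tostring(mets, encoding="unicode")


descriptor_xml = build_descriptor("example-object-1", [
    {"id": "FILE_1", "href": "page1.jp2", "mime": "image/jp2",
     "md5": "d41d8cd98f00b204e9800998ecf8427e", "size": 0},
])
```

Writing the resulting string to a file next to the content files is what would make the object self-contained: the checksums and structure travel with the data rather than living only in a database.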

  38. Some technical challenges • Amount of metadata to store • Bitstream description • Many elements (esp. MODS, MIX) • Efficient, scalable search implementation • Database, index, combination? • Keeping metadata in sync • Database, object descriptors on file system • Effect on system of continued growth • Consistency checks, migrations, format analysis, etc. • HRCI requirements • Email archiving

  39. 4. Questions? andrea_goethals@harvard.edu randy_stern@harvard.edu
