Harvard’s Digital Repository Service (DRS) Architecture

    Presentation Transcript
    1. Harvard’s Digital Repository Service (DRS) Architecture. Harvard University Library (HUL). Andrea Goethals, Randy Stern. December 10, 2009

    2. Today’s Agenda • What is the DRS? • DRS 1 Architecture • DRS 2 Highlights • Questions

    3. 1. What is the DRS?

    4. DRS Context
        • A core portion of HUL’s mission is to provide current and future access to research materials and resources, with recognition that preserving access to digital content requires different strategies, tools and skills
        • Digital Preservation projects and activities (2000-)
        • Digital Preservation Program (June 2008-)
        • Centerpiece: the Digital Repository Service (DRS)

    5. What is the DRS?
        • Set of professionally managed services for preservation and access
        • Diagram labels: preservation planning & activities, administration, management tools; creation & format guidelines, training, ingest service; delivery services, access restrictions, persistent names; metadata and content storage & monitoring service; creation/acquisition; use

    6-13. What’s in the DRS?

    14. DRS by the numbers
        • 103 TB of content
        • 335 TB total (counting all copies)
        • 13 M files
            • 10 M image files
            • 21,000 audio files
            • 2.8 M text files
            • 851,000 compressed Google books
                • containing 672 M files
            • 6,300 compressed web harvests
                • containing 14 M web files
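
The two storage totals on this slide imply an average number of copies per byte of content. The short calculation below is only a reading of those two figures; the variable names are chosen for illustration. An average between three and four is consistent with the archiving rules on slide 21, which keep three or four copies depending on file class.

```python
# Figures taken from slide 14; variable names are illustrative.
content_tb = 103   # unique content
total_tb = 335     # total, counting all copies

average_copies = total_tb / content_tb
print(f"{average_copies:.2f} copies on average")  # prints "3.25 copies on average"
```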

    15. DRS growth
        • Fueled by large projects
        • Recent explosion – mass digitization (Google book project)

    16. Broadening content and metadata requirements
        • New formats and genres, born-digital content
            • Email archiving, more audio, drawing, video
        • Descriptive metadata, linkages to catalogs
        • Rights management, more access restrictions
        • Auxiliary content
            • Contextual material, licenses, donor agreements, collection objects, documentation, repository agents

    17. 2. DRS 1 Architecture

    18. DRS System Architecture

    19. Metadata Storage Database
        DRS-1 objects are modeled as related files
        File metadata:
        • Administrative (owners, projects, deposit dates, owner IDs, etc.)
        • Technical (format MIME type and format-specific data)
        • Role, purpose, quality
        • No descriptive metadata
        • Access restrictions (public, Harvard-only, dark)
        • MD5 file digest and byte count
        Relationship triples:
        • “is_part_of”, “is_preservation_replacement_for”, etc.
        • 21 relationship types
        • ~13M files, 12.3M relationships
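
The file-plus-relationship model on this slide can be pictured with a small sketch. Everything named here (`FileRecord`, `Relationship`, `make_record`, `parts_of`, and the field names) is hypothetical and chosen for illustration; the slide does not show the actual DRS-1 database schema.

```python
import hashlib
from dataclasses import dataclass

ACCESS_LEVELS = {"public", "harvard-only", "dark"}  # the three restrictions named on the slide


@dataclass(frozen=True)
class FileRecord:
    """One file with the kinds of metadata the slide lists (field names are assumed)."""
    file_id: str
    owner: str       # administrative
    mime_type: str   # technical
    role: str        # role / purpose / quality, simplified to one field
    access: str
    md5: str
    byte_count: int


@dataclass(frozen=True)
class Relationship:
    """A (subject, predicate, object) triple linking two files."""
    subject: str
    predicate: str
    obj: str


def make_record(file_id, owner, mime_type, role, access, content: bytes) -> FileRecord:
    """Build a record, deriving the MD5 digest and byte count from the content."""
    if access not in ACCESS_LEVELS:
        raise ValueError(f"unknown access level: {access}")
    return FileRecord(file_id, owner, mime_type, role, access,
                      hashlib.md5(content).hexdigest(), len(content))


def parts_of(parent_id, triples):
    """Return the IDs of files related to parent_id by "is_part_of"."""
    return [t.subject for t in triples
            if t.predicate == "is_part_of" and t.obj == parent_id]


# Illustrative data only.
page = make_record("f1", "owner-x", "image/jpeg", "deliverable", "public", b"fake image bytes")
triples = [Relationship("f1", "is_part_of", "f0"),
           Relationship("f2", "is_preservation_replacement_for", "f1")]
```

Modeling objects as files linked by typed triples means an "object" is whatever set of files the relationships connect; there is no separate object record in this sketch.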

    20. Content Storage Service: Bit preservation
        • Redundancy, heterogeneity, extensibility, scalability, simple file access protocol
        • Access demands high availability and high performance delivery
        • Functional requirements:
            • At least three copies in three physical locations
            • Two media types
            • Two on-line copies for high availability
            • One near-line copy, one off-line copy
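
The functional requirements above read like a checklist that could be tested against the set of copies held for a file. The sketch below is one way to express that check; the `Copy` class, the tier labels, the `min_online` parameter and the site names are assumptions for illustration, not part of the DRS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    location: str  # physical site holding the copy
    media: str     # e.g. "disk" or "tape"
    tier: str      # "online", "nearline" or "offline"


def policy_violations(copies, min_online=2):
    """Return the unmet requirements; an empty list means the copy set complies."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.location for c in copies}) < 3:
        problems.append("fewer than three physical locations")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than two media types")
    tiers = [c.tier for c in copies]
    if tiers.count("online") < min_online:
        problems.append(f"fewer than {min_online} on-line copies")
    if "nearline" not in tiers:
        problems.append("no near-line copy")
    if "offline" not in tiers:
        problems.append("no off-line copy")
    return problems


# Hypothetical copy set for one file; site names are placeholders.
example = [
    Copy("site A", "disk", "online"),
    Copy("site B", "disk", "online"),
    Copy("site C", "tape", "nearline"),
    Copy("site D", "tape", "offline"),
]
print(policy_violations(example))  # prints []
```

The on-line minimum is a parameter because the slide ties the two on-line copies to high availability, which may not apply equally to every class of file.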

    21. Content Storage Service: Storage provider
        • Sun SAM/QFS Storage Archive Manager
        • 2 file classes: highuse and lowuse
        • Archiving rules
            • High use files
                • Copy 1 on disk at local server center
                • Copy 2 on disk at remote server center
                • Copy 3 on tape in library
                • Copy 4 on tape off line at Harvard Depository
            • Low use files
                • Copy 1 on disk at remote server center
                • Copy 2 on tape in library
                • Copy 3 on tape off line at Harvard Depository
        • High speed cache for access
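
The archiving rules above amount to a lookup from file class to a list of copy targets. The sketch below restates them as a table and uses it to estimate stored bytes; it is not SAM/QFS configuration syntax, and the names `ARCHIVING_RULES`, `planned_copies` and `total_stored_bytes` are hypothetical.

```python
# Copy targets per file class, restated from the slide as (media, location) pairs.
ARCHIVING_RULES = {
    "highuse": [
        ("disk", "local server center"),
        ("disk", "remote server center"),
        ("tape", "library"),
        ("tape, off line", "Harvard Depository"),
    ],
    "lowuse": [
        ("disk", "remote server center"),
        ("tape", "library"),
        ("tape, off line", "Harvard Depository"),
    ],
}


def planned_copies(file_class):
    """Return the copy targets for a file class, or raise ValueError if unknown."""
    try:
        return ARCHIVING_RULES[file_class]
    except KeyError:
        raise ValueError(f"unknown file class: {file_class}") from None


def total_stored_bytes(files):
    """Sum size x number of copies over (file_class, size_in_bytes) pairs."""
    return sum(size * len(planned_copies(cls)) for cls, size in files)


# One 100-byte high use file and one 10-byte low use file: 4*100 + 3*10.
print(total_stored_bytes([("highuse", 100), ("lowuse", 10)]))  # prints 430
```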

    22. Consistency Validation Service
        • Continuous monitoring for file system and database consistency
        • Crawls the file system and confirms that every disk file has a DRS metadata record
        • Crawls the DRS metadata records table and confirms that every file referenced exists in the file system
        • Confirms that the MD5 checksum for each file is the same as recorded in the database
        • Reports errors to administrators
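
The three checks on this slide can be sketched in a few lines. This is a minimal illustration, not the DRS implementation: a plain dictionary stands in for the metadata records table, and the function names `md5_of` and `check_consistency` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_consistency(root, records):
    """Compare files under root with records (relative path -> expected MD5).

    Returns a list of error strings covering the three checks on the slide.
    """
    root = Path(root)
    on_disk = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    recorded = set(records)
    errors = []
    for rel in sorted(on_disk - recorded):
        errors.append(f"no metadata record: {rel}")
    for rel in sorted(recorded - on_disk):
        errors.append(f"missing from file system: {rel}")
    for rel in sorted(on_disk & recorded):
        if md5_of(root / rel) != records[rel]:
            errors.append(f"checksum mismatch: {rel}")
    return errors
```

A real service would run this continuously and send the returned errors to administrators; the sketch only shows a single pass.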

    23. Delivery and Access Services
        • Real time web delivery
            • Image delivery service
                • JPEG, JPEG 2000, TIF, GIF
            • Page turned object delivery service
                • METS + page images + page text
            • Streaming delivery service
                • RealAudio
            • File delivery service
                • PDFs
            • Web Archiving Service
        • Asynchronous delivery service
            • Archival masters

    24. Administrative Services
        • DRS Web Administrator
            • Searching, reporting, file operations, archival master download
        • Page Turned Object Maintenance
            • METS structure editor
        • Name Resolution Service Maintenance
            • URN create/update/report

    25. DRS System Architecture

    26. DRS System Architecture: Ingest Services

    27. DRS System Architecture: Delivery Services

    28. DRS System Architecture: Persistent Naming and Access Services

    29. DRS System Architecture: Storage Services

    30. Storage Services: Implementation
        • Sun SAM-QFS 4.6
            • Rule-based automatic archiving – no “backups”
            • Unified file name space
        • Dual Sun T2000 Solaris SAM servers
            • Redundant servers at site 1, DR failover at site 2
            • Nightly samfsdump from site 1, samfsrestore at site 2
        • EMC CLARiiON disk storage arrays
            • RAID 1+0 FC cache / RAID 5 SATA disk archives
            • 35 TB CX3-40 at site 1, 109 TB CX3-80 at site 2
        • StorageTek SL500 tape library
            • LTO-4
        • In production since Feb 2008

    31. Storage Services: Redundancy

    32. Metadata Storage Service: Implementation
        DRS metadata storage
        • Oracle 10g
        • Live production server – copy 1
        • Data Guard failover copy – copy 2
        • Legato tape backups – copy 3

    33. Ingest Services: Implementation
        • Batch deposit of SIPs to SFTP drop boxes
        • DRS Batch Loader operates 8AM-8PM
        • 51 object owners – libraries, museums
        • ~12 depositors
        • 234 project codes
        • Daily weekday deposits average ~60 GB/day

    34. Delivery Services: Implementation
        • High availability design
        • Redundant public access servers
            • Delivery, access management, name resolution
        • Cisco Content Switch
            • Load balancing, sticky sessions
        • MRTG monitoring
        • Change control – no downtime on updates
        • Red Hat Enterprise Linux, Java 1.5, Tomcat
        • Tomcat and log4j logging and statistics

    35. 3. DRS 2 Highlights

    36. Scope of work
        • Builds on the early 2008 storage upgrade
        • 2008-~2013
        • Affects every part of the DRS!
        • Expanded data model
        • New and different metadata
        • Object descriptors
        • Content models
        • Preservation plans
        • Enhanced deposit tools
        • New management applications
        • New backend services
        • First major release: Summer 2011

    37. Object descriptors
        • A METS metadata file per object on the file system alongside content files
        • Descriptive, administrative, preservation, technical and structural metadata
        • Describes the object, all its files and bitstreams and related significant events
        • Gives the metadata the same secure storage as the content files
        • Self-contained, portable objects
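
To make the idea of a per-object METS descriptor concrete, the sketch below builds a bare-bones METS document listing an object's files with their sizes and MD5 checksums. It is an illustration only: it does not follow the DRS 2 descriptor profile, it omits the descriptive, preservation, technical and event metadata the slide mentions, and the function name `build_descriptor` and the sample object ID are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_descriptor(object_id, files):
    """Build a minimal METS element tree for files given as {relative_name: bytes}."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    # METS requires a structural map; a single flat division is used here.
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div")
    for index, (name, content) in enumerate(sorted(files.items()), start=1):
        file_id = f"FILE{index}"
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {
            "ID": file_id,
            "SIZE": str(len(content)),
            "CHECKSUM": hashlib.md5(content).hexdigest(),
            "CHECKSUMTYPE": "MD5",
        })
        ET.SubElement(file_el, f"{{{METS_NS}}}FLocat", {
            "LOCTYPE": "URL",
            f"{{{XLINK_NS}}}href": name,  # relative path: descriptor sits beside the content
        })
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return root


descriptor = build_descriptor("example-object-1", {"page1.txt": b"hello"})
xml_text = ET.tostring(descriptor, encoding="unicode")
```

Because the descriptor refers to content files by relative path, the directory holding both can be copied elsewhere and still describe itself, which is the portability point the slide makes.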

    38. Some technical challenges
        • Amount of metadata to store
            • Bitstream description
            • Many elements (esp. MODS, MIX)
        • Efficient, scalable search implementation
            • Database, index, combination?
        • Keeping metadata in sync
            • Database, object descriptors on file system
        • Effect on system of continued growth
            • Consistency checks, migrations, format analysis, etc.
        • HRCI requirements
        • Email archiving

    39. 4. Questions? andrea_goethals@harvard.edu randy_stern@harvard.edu