
Breakout Session One, Panel Five Content Transfer

NDIIPP Partner’s Meeting, Arlington, 8-10 July 2008. Stephen Abrams, California Digital Library, Stephen.Abrams@ucop.edu.


Presentation Transcript


  1. Breakout Session One, Panel Five: Content Transfer
     NDIIPP Partner’s Meeting, Arlington, 8-10 July 2008
     Stephen Abrams, California Digital Library, Stephen.Abrams@ucop.edu

  2. Topics
     • Submission
     • CDL Digital Preservation Repository (DPR): www.cdlib.org/inside/projects/preservation/dpr
     • Dissemination
     • Chronopolis: chronopolis.sdsc.edu
     • Mass Transit: masstransit.sdsc.edu
     • Library of Congress
     • BagIt: www.cdlib.org/inside/diglib/bagit/bagitspec.html

  3. Digital Preservation Repository (DPR)
     • The unit of submission is the object, composed of a descriptive METS file and multiple content files
     • Submission workflow
       - Initiated by client-side SOAP or REST client
       - Server-side validation: package completeness; file-level data integrity and validation; object-level conformance
       - Notification
     • Averaging 600 KB/sec (per process)
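The "package completeness" step above can be illustrated with a minimal sketch. It assumes the list of declared files has already been extracted from the METS file (that parsing is not shown), and the function name `completeness` is hypothetical, not part of the DPR.

```python
from pathlib import Path


def completeness(package_dir, declared):
    """Compare files declared for an object with files actually received.

    `declared` is an iterable of relative POSIX paths, assumed to come
    from the object's METS file. Returns (missing, unexpected).
    """
    package_dir = Path(package_dir)
    present = {
        p.relative_to(package_dir).as_posix()
        for p in package_dir.rglob("*")
        if p.is_file()
    }
    declared = set(declared)
    return sorted(declared - present), sorted(present - declared)
```

A package would pass this check only when both returned lists are empty; file-level integrity and object-level conformance would be separate steps.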

  4. Digital Preservation Repository (DPR)
     • Direct submission/ingest
       - Cultural heritage and scientific content from 5 campuses: 2.5 TB via web services
     • Indirect submission to staging area with internally-triggered ingest
       - Local history content from 48 academic and public libraries: 720 GB via HD, CD, DVD
       - Web harvested content: 40 TB (est.) via HTTP, 50 KB/sec with 4-second “politeness” policy
       - Google and OCA mass digitization content: 150 – 200 TB (est.) via HTTP, 3.8 MB/sec with 0.5% failure rate
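To get a feel for what these volumes and rates imply, the following sketch estimates transfer time. It assumes decimal units (1 TB = 10^6 MB) and treats each quoted rate as a sustained aggregate rate, which the slide does not state; the function name `transfer_days` is illustrative.

```python
def transfer_days(size_tb: float, rate_mb_per_sec: float) -> float:
    """Days needed to move size_tb terabytes at a sustained rate."""
    size_mb = size_tb * 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB
    seconds = size_mb / rate_mb_per_sec
    return seconds / 86_400  # seconds per day


# Volumes and rates taken from the slide; sustained rates are an assumption.
print(f"Web harvest, 40 TB at 50 KB/sec: {transfer_days(40, 0.05):,.0f} days")
print(f"Mass digitization, 150 TB at 3.8 MB/sec: {transfer_days(150, 3.8):,.0f} days")
```

Under these assumptions the web harvest would take roughly 9,259 days (about 25 years) at a single 50 KB/sec stream, and 150 TB of digitized content about 457 days at 3.8 MB/sec, which suggests why the deck later stresses parallelism.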

  5. Chronopolis
     • Cross-domain collection sharing for long-term preservation
     • Data replication via SRB over a three-node federated data grid
     • Project partners: UCSD/SDSC, NCAR, UMIACS
     • Data providers: CDL, ICPSR

  6. CDL web content
     • Stanford WebBase, 5 collections: 14,108 GB
       - Federal government, 2004 – 2008: 9,123 GB
       - State government, 2005 – 2007: 1,742 GB
       - County government, 2005 – 2007: 743 GB
       - City government, 2005 – 2007: 1,531 GB
       - Hurricane Rita / Katrina, 2005: 969 GB

  7. CDL web content
     • Web-at-Risk, 20 collections: 1,452 GB
       - Myanmar cyclone, 2008: 3 GB
       - Santa Cruz wildfires, 2008: 4 GB
       - Southern California wildfires, 2007: 78 GB
       - Grand jury reports, 2008: 1 GB
       - California political parties, 2007: 3 GB
       - AFL-CIO, 2007: 1 GB
       - Progressive politics, 2007 – 2008: 192 GB
       - Middle Eastern politics, 2007 – 2008: 58 GB
       - University of California, 2007 – 2008: 91 GB
       - …

  8. CDL web content
     • Transfer of ARC files and manifest to CDL via HTTP
     • Transfer of Bags to Library of Congress via HTTP: 28.7 MB/sec (16 parallel threads)
     • Transfer of Bags to UCSD/SDSC via HTTP: 5.6 MB/sec (15 parallel threads)
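The multi-threaded HTTP transfers mentioned above might be organized along the following lines. This is only a sketch using the Python standard library, not the tooling actually used by CDL; `http_fetch` and `transfer_all` are hypothetical names, and the default of 16 threads simply mirrors the figure on the slide.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def http_fetch(url, dest):
    """Download one URL to a local path (hypothetical default transport)."""
    urllib.request.urlretrieve(url, dest)


def transfer_all(jobs, fetch=http_fetch, threads=16):
    """Run (url, dest) jobs on a thread pool.

    Returns {url: exception or None}, so failed transfers can be
    retried and reported rather than silently lost.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(fetch, url, dest): url for url, dest in jobs}
        for future in as_completed(futures):
            results[futures[future]] = future.exception()
    return results
```

Passing the transport in as `fetch` keeps the sketch testable without a network; any URL whose entry is not `None` is a candidate for retry.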

  9. Mass Transit
     • CDL/SDSC investigation of critical issues in the large-scale transfer and replication of digital data for preservation
     • Initial focus on measuring and tuning network performance

  10. SDSC Network Diagnostic Tool (NDT)

  11. BagIt
     • Common need for low-overhead transfer of content between preservation partners
     • Minimally self-identifying and self-describing packages
     • Support for error detection and transfer optimization
     • Content agnostic
     • Informed by
       - NDIIPP Archive and Ingest Handling Test (AIHT), D-Lib Magazine, December 2005
       - Tabata et al., “Enclose-and-Deposit Method,” IWAW ’05
     • Documented at
       - www.ietf.org/internet-drafts/draft-kunze-bagit-01.txt
       - www.cdlib.org/inside/diglib/bagit/bagitspec.html

  12. BagIt
     • “Bag it and tag it”
     • Minimal metadata, file system structuring, and packaging rules

         abcd/
           bagit.txt
           fetch.txt
           manifest-md5.txt
           package-info.txt
           data/
             ...

     • bagit.txt – Bag signature and metadata
     • package-info.txt – Bag contents metadata
     • manifest-md5.txt – Bag contents manifest and checksums
     • fetch.txt – Bag contents included by reference, not value; i.e. “a bag of holes”
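As an illustration of how manifest-md5.txt supports error detection, here is a minimal sketch that writes and checks such a manifest. It assumes one "checksum, whitespace, relative path" entry per line for every file under data/; consult the BagIt specification for the authoritative format. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag):
    """Write manifest-md5.txt covering every file under bag/data."""
    bag = Path(bag)
    lines = [
        f"{md5_of(f)}  {f.relative_to(bag).as_posix()}"
        for f in sorted((bag / "data").rglob("*"))
        if f.is_file()
    ]
    manifest = bag / "manifest-md5.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(bag):
    """Return relative paths that are missing or fail their checksum."""
    bag = Path(bag)
    failures = []
    text = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        expected, rel = line.split(maxsplit=1)
        target = bag / rel
        if not target.is_file() or md5_of(target) != expected:
            failures.append(rel)
    return failures
```

A sender would call `write_manifest` before transfer and a receiver `verify_manifest` afterwards; an empty result means every listed file arrived intact.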

  13. GrabIt
     [Diagram: Publication, Transfer, Validation, Notification]
     • “Curb it and grab it”
     • Protocol for lightweight transfer without reliance on tedious, error-prone email-based conventions
     • Support for publication, transfer, validation, and notification
     • No dependence on BagIt, but capable of operating in an enhanced, bag-aware mode

  14. Summary
     • Transfer is still hard
     • Automate and reduce overhead
     • Transfer fewer big files, rather than many small files
     • Exploit parallelism
     • Robust transfer requires explicit verification and notification
     • Instrument and measure all phases of transfer to identify bottlenecks
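The last point, instrumenting each phase to find bottlenecks, could be approached with something as small as the sketch below. The `PhaseTimer` class and the phase names are illustrative assumptions, not part of any tool named in the presentation.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase of a transfer."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the phase raised an exception.
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def bottleneck(self):
        """Name of the phase with the largest accumulated time."""
        return max(self.seconds, key=self.seconds.get)
```

Wrapping packaging, transfer, and validation each in `with timer.phase("...")` would show which phase dominates before any tuning effort is spent.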

  15. Questions?
     [Image: sign on a Berkeley Ecology Center recycling truck]
