Trusted Datagrids: Library of Congress Projects with UCSD

Presentation Transcript


  1. Trusted Datagrids: Library of Congress Projects with UCSD Ardys Kozbial – UCSD Libraries; David Minor – SDSC

  2. Building Trust in a 3rd Party Repository: A Pilot Project David Minor, San Diego Supercomputer Center

  3. How can the LC trust someone they can’t control?

  4. Moving forward in the right direction requires more than fuzzy promises

  5. Cyberinfrastructure … it takes a combination of experts and tools.

  6. Cyberinfrastructure is the collection of … Resources (computers, data storage, networks, scientific instruments, experts, etc.) + Glue (integrating software, systems, and organizations)

  7. “Effective cyberinfrastructure for the humanities and social sciences will allow scholars to focus their intellectual and scholarly energies on the issues that engage them, and to be effective users of new media and new technologies, rather than having to invent them.” - ACLS Commission on Cyberinfrastructure for the Humanities & Social Sciences

  8. “The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of Cyberinfrastructure”

  9. SDSC …
  • Is one of the original NSF supercomputer centers
  • Supports high-performance computing systems
  • Supports data applications for science, engineering, the social sciences, and cultural heritage institutions
  • Has LARGE data capabilities: 3+ PB disk storage, 25+ PB tape storage

  10. UCSD Libraries
  • 3.5+ million volumes
  • Digital Asset Management System (in development): 250,000+ objects, 15+ TB
  • Shared collections with UC
  • California Digital Library
    • Digital Preservation Repository
    • eScholarship repository

  11. Partnerships and Collaborations
  • LC Pilot Project – Building Trust in a 3rd Party Repository
    • Using test image collections/web crawls, ingest content to the SDSC repository
    • Allow access for content audit
    • Track usage of content over time
    • Deliver content back to LC at end of project
  • Library of Congress NDIIPP Chronopolis Program
    • Build a production-capable Chronopolis grid (50 TB x 3)
    • Further define transmission packaging for archival communities
    • Investigate best network transfer models for I2 and TeraGrid networks
  • California Digital Library (CDL) Mass Transit Program
    • Enable UC system libraries to transfer high-speed mass digitization collections across CENIC/I2
    • Develop transmission packaging for CDL content
  • UCSD Libraries’ Digital Asset Management System
    • RDF system with data managed in SRB at SDSC

  12. SDSC DPI Group
  • Digital Preservation Initiatives Group
  • Charged with developing and supporting digital preservation services within the Production Systems Division of SDSC
  • http://dpi.sdsc.edu
  • Cross-organizational group: SDSC personnel / UCSD Libraries personnel
    • Libraries
    • Archives
    • Technology
    • Information Science

  13. Cyberinfrastructure Trust

  14. For Example:

  15. We worked together to set up high-speed data replication services
  • Achieved 200 Mb/s = ~2 TB/day (200 Mb/s ≈ 25 MB/s ≈ 2.1 TB per 24 hours)
  • Highly reliable: checksums, over Internet2 (a fixity-check sketch follows)
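As a rough illustration of the checksum step, here is a minimal Python sketch of post-transfer fixity checking against a sender-supplied MD5 manifest. The paths, manifest format, and file names are hypothetical, not the actual LC/SDSC tooling:

    # Minimal sketch of post-transfer fixity checking, assuming the sender
    # ships an MD5 manifest of "checksum  relative/path" lines. All paths
    # and names here are hypothetical.
    import hashlib
    from pathlib import Path

    def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
        """Stream a file through MD5 in 1 MiB chunks."""
        digest = hashlib.md5()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_transfer(local_dir: Path, manifest: Path) -> list[str]:
        """Return the relative paths that are missing or fail their checksum."""
        failures = []
        for line in manifest.read_text().splitlines():
            expected, rel_path = line.split(None, 1)
            target = local_dir / rel_path
            if not target.exists() or md5sum(target) != expected:
                failures.append(rel_path)
        return failures

    if __name__ == "__main__":
        bad = verify_transfer(Path("/data/lc_pilot"), Path("manifest-md5.txt"))
        print("all files verified" if not bad else f"failed: {bad}")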

  16. Network setup involved …
  • LC and SDSC staff working together
  • Configurations on networks and computers
  • Resolving different security environments
  • Network monitoring

  17. Lessons Learned
  • Networking is hard! It’s not magic – there’s always a reason
  • It highlights the collaborative nature of the work
  • You can’t forget it once it’s set up

  18. Trust Elements
  • Have multi-institutional issues been solved?
  • Does new infrastructure improve the process?
  • Has a long-term solution been found?
  • Is the solution useful for other organizations?

  19. SDSC created a robust storage environment for this data: multiple replications … at SDSC … and at geographically diverse locations

  20. (a process with several characteristics)
  • Needed to replicate structure exactly
  • This had to be done for 5+ replications
  • The complex environment had to be transparent
  • Data had to be available for manipulation

  21. The Storage Resource Broker provided replication services ...

  22. ... and extensive monitoring, logging and reporting functions (which led to many conversations)

  23. Logging and monitoring procedures
  • Scripts compared the files within the system against a master list, checking changes on either side … fairly straightforward (a sketch follows below)
  • But …
    • What is the master list and who maintains it?
    • Who decides what is a legitimate change?
    • Do you want a dark archive or an active remote data center?
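As a sketch of the kind of audit script described above — diffing a replica’s files against a master list and flagging changes on either side — the following Python is illustrative only; the master-list format and paths are assumptions:

    # Illustrative audit pass: compare a replica against a master list.
    # Master-list format ("checksum  relative/path" per line) and all
    # locations are assumptions for the sketch.
    import hashlib
    from pathlib import Path

    def load_master_list(path: Path) -> dict[str, str]:
        """Parse 'checksum  relative/path' lines into {path: checksum}."""
        entries = {}
        for line in path.read_text().splitlines():
            checksum, rel_path = line.split(None, 1)
            entries[rel_path] = checksum
        return entries

    def audit(replica_root: Path, master: dict[str, str]) -> None:
        on_disk = {
            str(p.relative_to(replica_root))
            for p in replica_root.rglob("*") if p.is_file()
        }
        missing = master.keys() - on_disk     # on the master list, absent here
        unexpected = on_disk - master.keys()  # present here, not on the list
        altered = [
            rel for rel in master.keys() & on_disk
            if hashlib.md5((replica_root / rel).read_bytes()).hexdigest()
            != master[rel]
        ]
        print("missing:", sorted(missing))
        print("unexpected (legitimate change, or not?):", sorted(unexpected))
        print("checksum mismatches:", altered)

    audit(Path("/replica/lc_pilot"), load_master_list(Path("master-list.txt")))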

  24. We tested a new Front-End

  25. … and explored an important issue: “Reliability” versus “Accessibility”

  26. Lessons Learned
  • Always keep expectations aligned
  • Duplication of structure is complicated
  • Don’t confuse accessibility and reliability
  • Communication highlights communication

  27. Trust Elements
  • Can remote data be accessed?
  • Can remote data be verified?
  • Can remote data be retrieved and re-used?
  • Can ownership be clearly defined?

  28. SDSC and LC explored a new approach to working with web archives
  • Parallel indexing and display system
  • Looked “default” to the user
  • 50,000 ARC files
  • 6 terabytes of data
  • Short processing time

  29. Using default tools, our initial indexing rate was about 1,000 files per day … at that rate (50,000 files ÷ ~1,000 files/day ≈ 50 days), indexing the entire collection would take more than 6 weeks of constant computing. This was over our time budget.

  30. We ran 18 parallel indexing instances, reducing processing to about a week, and modified the Wayback source code to create a new access infrastructure (see the sketch below).
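A minimal sketch of that parallelization strategy, with a hypothetical index-arcs command standing in for the actual (modified) Wayback indexer invocation; the 18-way fan-out and round-robin chunking mirror the approach described above:

    # Sketch of the 18-way parallel indexing run. "index-arcs" is a
    # hypothetical CLI, not the real Wayback indexer command.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    N_WORKERS = 18

    def index_chunk(arc_files: list) -> int:
        """Run one indexer instance over its chunk of ARC files."""
        cmd = ["index-arcs", *map(str, arc_files)]  # hypothetical CLI
        return subprocess.run(cmd, check=False).returncode

    def main() -> None:
        arcs = sorted(Path("/data/webarchive").glob("*.arc.gz"))
        # Round-robin split of ~50,000 files into 18 roughly equal chunks.
        chunks = [arcs[i::N_WORKERS] for i in range(N_WORKERS)]
        # Threads suffice here: each worker blocks on an external process.
        with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
            results = list(pool.map(index_chunk, chunks))
        print(f"{results.count(0)}/{len(chunks)} chunks indexed cleanly")

    if __name__ == "__main__":
        main()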

  31. Lessons Learned
  • The default setup isn’t always the easiest
  • Time is a wonderful motivator
  • Sometimes you need to start over
  • Experts are often interested in your work

  32. Trust Elements
  • Are the final results the same?
  • Can the results be reached in a better way?
  • Can a new organization bring new expertise?
  • Can a new organization work with your partners?

  33. Next steps … Chronopolis!

  34. Chronopolis: A Partnership
  • Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries.
  • Initial Chronopolis provider sites include:
    • SDSC and UCSD Libraries at UC San Diego
    • University of Maryland
    • National Center for Atmospheric Research (NCAR) in Boulder, CO

  35. Institutions and Roles – UCSD
  • SDSC: storage and networking services; SRB support; Transmission Packaging Modules
  • UCSD Libraries: metadata services (PREMIS); DIPs (Dissemination Information Packages); other advanced data services as needed

  36. Institutions and Roles – NCAR
  • National Center for Atmospheric Research
  • Archives: complete copy of all data
  • Storage and network support
  • Network testing

  37. Institutions and Roles – UMIACS
  • University of Maryland Institute for Advanced Computer Studies
  • Archives: complete copy of all data
  • Advanced data services:
    • PAWN: Producer–Archive Workflow Network in support of digital preservation
    • ACE: Auditing Control Environment to ensure the long-term integrity of digital archives
  • Other advanced data services as needed

  38. SDSC Chronopolis Program

  39. Chronopolis Vocabulary
  • Partners – UCSD Libraries, National Center for Atmospheric Research, and University of Maryland Institute for Advanced Computer Studies all provide grid-enabled storage nodes for Chronopolis services.
  • Clients – ICPSR and CDL contribute content to the Chronopolis preservation network.
  • SRB – Storage Resource Broker – datagrid software.
  • iRODS – integrated Rule-Oriented Data System – datagrid software.
  • ACE – Auditing Control Environment – part of the ADAPT project at UMD.
  • PAWN – Producer–Archive Workflow Network – part of the ADAPT project at UMD.
  • INCA – user-level grid monitoring; executes periodic, automated, user-level testing of grid software and services – grid middleware.
  • BagIt – transfer specification developed by CDL and the Library of Congress.
  • GridFTP – parallel transfer technology; moves large collections within a grid wide-area network.

  40. Chronopolis: Inside
  • Linked by a main staging grid where data is verified for integrity and quarantined for security purposes
  • Collections are independently pulled into each system
  • A manifest layer provides added security for database management and data integrity validation
  • Benefits: 3 independently managed copies of the collection; high availability; high reliability
  [Diagram: Chronopolis clients (CDL, ICPSR) push to the SDSC staging grid; SDSC, NCAR, and UMD each pull an independent copy (copies 1–3); MCAT databases and manifest management perform multiple hash verifications; storage spans grid brick disks and HPSS tape in the SDSC core center archive.]

  41. SDSC Leveraged Infrastructure
  • Serves both HPC and digital preservation
  • Archive: 25 PB capacity; both HPSS and SAM-QFS
  • Online disk: ~3 PB total; HPC parallel file systems; collections; databases; access tools
  Adapted from Richard Moore (SDSC)

  42. Chronopolis Demonstration Project (2006–2007)
  • Demonstration collections ingested within Chronopolis:
    • National Virtual Observatory (NVO): 3 TB of Hyperatlas images (partial collection)
    • Library of Congress PG Image Collection: 600 GB Prokudin-Gorskii image collection
    • Inter-university Consortium for Political and Social Research (ICPSR): 2 TB of web-accessible data
    • NCAR observational data: 3 TB of observational re-analysis data

  43. NDIIPP Chronopolis Project
  • Creating a 3-node federated data grid at SDSC, NCAR, and UMD – up to 50 TB of data from CDL and ICPSR
  • Installing and testing a suite of monitoring tools using ACE, PAWN, and INCA
  • Creating appropriate Transmission Information Packages
  • Generating PREMIS definitions for data
  • Writing best-practices documents for clients and partners

  44. Chronopolis Grid Framework
  [Diagram: CDL and ICPSR servers feed Chronopolis data (12–25 TB and 12 TB streams) across the UC Berkeley, ICPSR, SDSC, NCAR, and Maryland networks; each site runs an SRB MCAT with SRB D-Brokers; storage includes a Sun 6140 array (62 TB), an Apple Xsan, and Sun SAM-QFS tape silos. Adapted from Bryan Banister (SDSC)]

  45. NDIIPP Chronopolis Clients – CDL
  • California Digital Library: part of UCOP, supports the University of California libraries
  • Providing up to 25 TB of data from the Web-at-Risk project: five years of political and governmental websites; ARC files created from web crawls
  • Using the BagIt transfer structure (see the sketch below)
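For readers unfamiliar with the BagIt layout named above, here is a minimal Python sketch that assembles the core of a bag: bagit.txt, a data/ payload directory, and an MD5 payload manifest. Source and bag paths are hypothetical, and a production workflow would more likely use the Library of Congress’s bagit library, which also writes bag-info.txt and tag manifests:

    # Minimal BagIt sketch: bagit.txt, data/ payload, MD5 payload manifest.
    # All paths are hypothetical.
    import hashlib
    import shutil
    from pathlib import Path

    def make_bag(src: Path, bag_dir: Path) -> None:
        data_dir = bag_dir / "data"
        data_dir.mkdir(parents=True)
        manifest_lines = []
        for f in sorted(p for p in src.rglob("*") if p.is_file()):
            rel = f.relative_to(src)
            dest = data_dir / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, dest)  # copy payload, preserving timestamps
            md5 = hashlib.md5(dest.read_bytes()).hexdigest()
            manifest_lines.append(f"{md5}  data/{rel.as_posix()}")
        (bag_dir / "bagit.txt").write_text(
            "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
        )
        (bag_dir / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")

    make_bag(Path("/data/web_at_risk_crawl"), Path("/staging/bag-0001"))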
