The SMB Archive System: Data Backup Across the Web

Kenneth R. Sharp, Stanford Synchrotron Radiation Laboratory

Presentation Transcript


  1. The SMB Archive System: Data Backup Across the Web
     Kenneth R. Sharp
     Stanford Synchrotron Radiation Laboratory

  2. Why a high capacity, long term data archive is needed

     Need a replacement for tapes
     • Tapes age and medium formats change rapidly.
     • Storage capacity and reliability of tapes are limited.
     • Much manual book-keeping is needed to keep track of data stored on tapes.

     Need to support large-area CCD detectors
     • Three Q315 detectors will be generating 20-80 MB files at a much increased rate when the SPEAR3 upgrade is complete.
     • RAID data storage at SSRL will be 24 TB in 2004--all that data must be backed up somehow!
     • Need to archive data as rapidly as it is collected.

     Need to support high-throughput structural biology
     • Automated beam lines will generate huge amounts of data.
     • Large numbers of samples and targets require that metadata be stored and tracked systematically.
     • Data must be archived automatically and be easy to retrieve.
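To give a feel for the scale in slide 2, the short sketch below estimates how many detector files of 20 MB or 80 MB would fill the 24 TB of RAID storage. It assumes decimal units (1 TB = 1,000,000 MB), which the slide does not specify, and the function name is purely illustrative.

```python
def files_that_fit(capacity_tb, file_mb):
    """Number of files of size file_mb that fit in capacity_tb.

    Assumes decimal units (1 TB = 1,000,000 MB); the slides do not
    say which convention was intended.
    """
    return (capacity_tb * 1_000_000) // file_mb


# Figures taken from slide 2: 24 TB of RAID storage, 20-80 MB per file.
largest_files = files_that_fit(24, 80)   # fewest files (largest size)
smallest_files = files_that_fit(24, 20)  # most files (smallest size)
print(largest_files, smallest_files)
```

Under these assumptions the storage corresponds to somewhere between a few hundred thousand and about a million image files, which is the kind of volume that makes manual book-keeping impractical.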

  3. SMB Archive Uses NPACI Resources at SDSC

     National Partnership for Advanced Computational Infrastructure (NPACI)
     • Mission: advance science by creating national computational infrastructure: the Grid.
     • Maintains resources at San Diego Supercomputer Center (SDSC) including HPSS, SRB.

     High Performance Storage System (HPSS)
     • Centralized long term data storage system at SDSC.
     • Stores over 344 TB of data in 18 million files. (Jan 2002)
     • Capacity: 2000 GBytes Disk; 6000 TBytes Tape Storage.

     Storage Resource Broker (SRB)
     • Client-server middleware provides uniform interface for accessing heterogeneous resources over the network.
     • Presents data in hierarchical folders with data and access controls.
     • May be used to store and retrieve data on the HPSS at SDSC.
     • Powerful metadata querying system allows data sets to be accessed based on their attributes.
     • Data sets can be replicated over multiple resources.
     • Organizations may install and maintain their own SRB Servers. We use the SRB installation at SDSC.

  4. Organizations Using SRB
     • Digital Libraries
       • UCB, Umich, UCSB, Stanford, CDL
       • NSF NSDL - UCAR / DLESE
     • NASA Information Power Grid
     • Astronomy
       • National Virtual Observatory
       • 2MASS Project (2 Micron All Sky Survey)
     • Particle Physics
       • Particle Physics Data Grid (DOE)
       • GriPhyN
     • Medicine
       • Digital Embryo (NLM)
     • Earth Systems Sciences
       • ESIPS
       • LTER
     • Persistent Archives
       • NARA
       • LOC
     • Neuro Science & Molecular Science
       • TeleScience/NCMIR, BIRN
       • SLAC, AfCS, …

  5. InQ SRB client for Microsoft Windows

     SRB client applications
     • Users must be able to upload data, download data, and view the data in the archive.
     • Users perform these functions via SRB client applications.
     • Available clients: Command-line programs (“S Commands”), InQ, MySRB.
     • Tools for custom clients: SRB C library; Java API.

     InQ for Microsoft Windows
     • InQ is the easiest-to-use client provided by NPACI.
     • Individual files or entire folders may be uploaded or downloaded.
     • Files in the archive may be browsed either by directory structure or by data attributes.

     Limitations of InQ
     • Runs only on Microsoft Windows platforms.
     • Windows is not the major platform used at synchrotron light sources or in crystallography research labs.
     • No batch job capability for long archive jobs.
     • Exposes confusing SRB features and terminology (resources, containers, collections, etc.).

  6. MySRB web browser-based SRB client

     MySRB
     • MySRB is a powerful web-based SRB client which can be run from standard web browsers.
     • Files in the archive may be browsed either by directory structure or by data attributes.

     Limitations of MySRB
     • No way to upload or download more than one file at a time.
     • The otherwise rich functionality and powerful features are confusing to users.

     The bottom line:
     • Capabilities of HPSS and SRB far exceed the perceived needs of our beam line users.
     • Our users need a customized interface with simplified functionality.
     • Additional infrastructure had to be designed and implemented in order to make the SRB a viable storage system for crystallographic data.
     • A browser-based user interface is ideal.

  7. The SMB Archive interface for using the SRB

     Convenient web browser interface
     • Users may define archive jobs over the web from anywhere in the world using any common type of computer.
     • Users need only log in with their SMB Unix account name and password.

     Simple archive job definition
     • Users may rapidly browse their /home and /data directories at SSRL.
     • Directory contents are listed in the browser window.
     • Directories may be navigated by clicking on directory names.
     • Files to be uploaded may be filtered according to a list of wildcards.
     • Subdirectories may be archived recursively.
     • The only SRB-related information required is the name of the new data collection to create.
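The wildcard filtering and recursive archiving described in slide 7 can be illustrated with a small Python sketch. This is not the Archive System's own code (which the slides say is written in Java); the function name `select_files` and its parameters are hypothetical, and only standard-library calls are used.

```python
import fnmatch
import os


def select_files(root, patterns, recursive=True):
    """Return paths (relative to root) of files whose names match
    at least one wildcard pattern, e.g. ["*.img", "*.log"].

    Hypothetical helper for illustration only.
    """
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # fnmatchcase gives case-sensitive matching on every platform.
            if any(fnmatch.fnmatchcase(name, p) for p in patterns):
                full_path = os.path.join(dirpath, name)
                selected.append(os.path.relpath(full_path, root))
        if not recursive:
            # Emptying dirnames in place stops os.walk from descending.
            dirnames[:] = []
    return sorted(selected)
```

A call such as `select_files("/data/jdoe", ["*.img"])` (the path is an invented example) would list every matching image file below that directory, while `recursive=False` restricts the selection to the top level.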

  8. Monitoring archive jobs and downloading data

     Similar interface for data download
     • Users browse their archived data sets in exactly the same fashion.
     • Data may be downloaded from the archive to a directory at SSRL (analogous to an upload job).
     • Another option is to download selected files in one or more tar files directly to any computer on the Internet.

     Batch operation
     • Archive job runs in background once definition is confirmed.
     • Browser does not hang during archival.
     • New jobs may be started while previously defined jobs are in progress.
     • Automatically restarts jobs if HPSS is unavailable.
     • A job status page indicates definitions and status of all running jobs.
     • User may abort running jobs.
     • E-mail is sent to the user when a job is started and again when it is completed.
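The automatic restart behaviour listed under "Batch operation" in slide 8 could be approximated by a retry loop like the one below. The slides do not describe how the real system detects an HPSS outage or how long it waits, so the `StorageUnavailable` exception, the attempt limit, and the delay are all assumptions made for this sketch.

```python
import time


class StorageUnavailable(Exception):
    """Hypothetical error signalling that the remote store is unreachable."""


def run_with_restart(job, max_attempts=5, delay_seconds=60.0, sleep=time.sleep):
    """Call job() and restart it when the storage system is unavailable.

    job is any zero-argument callable. The attempt limit and delay are
    illustrative defaults, not values taken from the Archive System.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except StorageUnavailable:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            sleep(delay_seconds)  # wait before restarting the job
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting; a background job runner would simply use the default.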

  9. Archive System Infrastructure
     • But first a word about SRB Accounts:
       • An SRB account (independent of the SSRL Unix Account) is required to archive data.
       • Your SRB account permits you to upload/download any data using SRB clients.
       • Handy web page on our site to create an SRB account: https://smb.slac.stanford.edu/secure/collaboratory/archive_system/SRBAccountForm.html
     • Archive System Infrastructure: the Archive System uses the following software elements:
       • Apache Web Server (v1.3.27)
       • Apache Tomcat Servlet Container (v4.1.24)
       • Java 2 Runtime (v1.4.1)
       • SMB Authentication Gateway Server
       • SMB Impersonation Server
       • SRB JARGON Java API (v1.1)
       • Archive System Servlets (for Upload, Download, and Job Maintenance)
       • Archive System Background Applications
     • All Archive System applications and servlets are written in Java.
     • Archive System front-end is made up of Java servlets.
     • Archive System back-end is made up of Java applications.
     • All infrastructure elements are either available for free or are home-grown.

  10. Significant infrastructure is required to provide this “simple” interface--but the payoff is huge.

      Authentication Gateway Server
      • Java servlet that provides a common authentication protocol for all web-based and stand-alone applications.
      • Used to authenticate archive system users.
      • All web-based software developed at SSRL is being updated to use this single authentication server.
      • Support for the authentication server has already been integrated into Blu-Ice/DCS.
      • Allows users to navigate seamlessly between applications without authenticating multiple times.
      • Will eventually allow access to beamline systems to be controlled automatically based on the beam schedule.
      • Access to other resources (computing, data directories, etc.) available 24/7.

      Impersonation Server
      • Unix daemon that can run any non-interactive program on behalf of any Unix user.
      • Enables web applications to run background jobs for a user with the actual rights of the Unix user account.
      • Accepts commands via the HTTP protocol.
      • Verifies authentication information with the Authentication Server.
      • Used by the archive system to list directories in the web browser and run background archive jobs as the user.
      • Will allow further analyses to be automatically initiated by the beam line control system.
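The order of operations described for the Impersonation Server in slide 10 (verify the caller with the Authentication Server first, then run the command as that Unix user) might be sketched as follows. The real request format and protocol are not given in the slides, so every name here (`plan_impersonated_run`, `session_id`, `is_session_valid`) is hypothetical, and the sketch only decides what would be run; it does not execute anything or switch users.

```python
import shlex


def plan_impersonated_run(session_id, username, command_line, is_session_valid):
    """Check a request before any program is run on the user's behalf.

    is_session_valid(session_id, username) stands in for the call to the
    Authentication Server; its real interface is not described in the slides.
    Returns a dict naming the user and the argument vector to execute.
    """
    if not is_session_valid(session_id, username):
        raise PermissionError("session rejected by authentication check")
    argv = shlex.split(command_line)  # split without invoking a shell
    if not argv:
        raise ValueError("empty command")
    return {"run_as": username, "argv": argv}
```

Keeping the authentication check as an injected callable means the same logic can be exercised with a stub, which matters for a component that would otherwise need a live authentication service.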

  11. Archive System Web Architecture
      [Diagram. Labels shown: SMB; SDSC; Web Browser; Internet; Internet (Backbone); Apache; Authentication; Impersonation; Archive Servlets (Tomcat): Define Upload, Define Download, View Job Status, Job Maintenance; Archive Jobs (background): Upload Jobs, Download Jobs; SRB; MCAT; HPSS: Disk Cache, Tape Storage.]

  12. Archive Projects for the next year
      • Optimize data transfer rates between SSRL and SDSC.
      • Provide stand-alone application for users wishing to download datasets directly from the SRB.
      • Implement other functions available in InQ and MySRB for manipulating existing collections (replicate, delete, etc.).
      • Provide option for automatic data upload from Blu-Ice.
      • Provide link from Blu-Ice to automatically start browser and load Archive page without the user having to log in again. (New Authentication Server makes this possible.)
      • Provide additional options for using SRB Metadata Catalog (MCAT) to describe, index, and retrieve data files.

      The Collaboratory for Macromolecular Crystallography is supported by the NIH, NCRR as a supplement to the SSRL Synchrotron Radiation Structural Biology Resource (P41-RR-01209). The SSRL Structural Molecular Biology program is funded by DOE BER, NIH NCRR, and NIH NIGMS.
