1 / 43

Persistent Management of Distributed Data Reagan W. Moore University of California, San Diego San Diego Supercomputer Ce

Persistent Management of Distributed Data Reagan W. Moore University of California, San Diego San Diego Supercomputer Center moore@sdsc.edu http://www.npaci.edu/DICE/. Staff Reagan Moore Ilkai Altintas Chaitan Baru Sheau Yen Chen Charles Cowart Amarnath Gupta George Kremenek

zalika
Download Presentation

Persistent Management of Distributed Data Reagan W. Moore University of California, San Diego San Diego Supercomputer Ce

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Persistent Management of Distributed Data Reagan W. Moore University of California, San Diego San Diego Supercomputer Center moore@sdsc.edu http://www.npaci.edu/DICE/

  2. Staff Reagan Moore Ilkai Altintas Chaitan Baru Sheau Yen Chen Charles Cowart Amarnath Gupta George Kremenek Bertram Ludäscher Richard Marciano XuFei Qian Roman Olshanowsky Arcot Rajasekar Abe Singer Michael Wan Ilya Zaslavsky Bing Zhu Graduate Students A. Bagchi S. Bansal A. Behere R. Bharath S. Bharath M. Kulrul L. Sui Undergraduate Interns N. Cotofana M. Shumaker J. Trang L. Yin +/- NN Data and Knowledge Systems Group

  3. Information Management Projects • Digital Libraries • California Digital Library - Art Museum Image COnsortium • DARPA/USPTO Patent digital library - SAIC, NCSA, U Virginia • NLM Visible Embryo digital library - GMU, OHSI, UC, LSU, JHU • NSF Digital Library Initiative, Phase II - UCSB, Stanford • NSF NPACI Digital Sky - Caltech • NSF National Science Education Digital Library - UCAR, Cornell, UCSB, U Mass, Columbia • Data Grid Environments • DOE Data Visualization Corridor - LLNL, OSU • DOE Particle Physics Data Grid - Stanford, Caltech • NASA Information Power Grid - NASA Ames, USC/ISI, U Texas • NIH Biomedical Informatics Research Network - Duke, Harvard • NSF Grid Physics Network - U Florida, Caltech, U Wisc, USC/ISI, U Chicago • NSF National Virtual Observatory - JHU, Caltech • NSF Southern California Earthquake Center - USC/ISI • Persistent Archives • NARA Persistent Archive - UCB, U Maryland • NHPRC Archivist workbench - U Minnesota • NSF NSDL Persistent archive for curricula modules - Cornell

  4. Topics • Data management systems • Data collections, digital libraries • Distributed data management • Data grids • Persistent data management • Persistent archives • Common infrastructure for data management

  5. Data Collections • Define the context for describing a collection of digital entities • Context specified by metadata attributes • Provenance, origin of the digital entities • Administrative, location of the digital entities • Technical, purpose of the digital entities • Support organization of attributes as hierarchy of sub-collections

  6. Digital Libraries • Provide services on the data collection • Ingestion, loading of attribute values • Extensibility, definition of new attributes • Discovery, queries on attributes • Browsing, hierarchical listing • Presentation, formatting specified data models

  7. Data Grids • Manage data in a distributed environment • Logical name space, provide global identifier • Data access, storage system abstraction • Replication, disaster back up • Uniform access, common API across file systems, archives, and databases • Single sign-on, authenticate across administration domains

  8. Persistent Archives • Manage technology evolution • Storage system abstraction, support data migration across storage systems • Information repository abstraction, support catalog migration to new databases • Logical name space, support global persistent identifier

  9. Storage Resource Broker • Integration of collection-based management of digital entities, with • Remote data access through storage system abstraction • Catalog access through information repository abstraction • Automation through collection-owned data

  10. Capabilities • Support legacy systems • Integrate archives with file systems • Share distributed data • Maintain persistent collection • Control data access

  11. Uniform API • Provide common access semantics • Map from the interface preferred by your application to the interfaces required by legacy storage systems

  12. C, C++, Libraries Unix Shell Databases DB2, Oracle, Postgres Archives HPSS, ADSM, UniTree, DMF File Systems Unix, NT, Mac OSX SDSC Storage Resource Broker & Meta-data Catalog Common APIs Application Linux I/O Web WSDL Access APIs DLL / Python Java, NT Browsers GridFTP Consistency Management / Authorization-Authentication Prime Server Logical Name Space Latency Management Data Transport Metadata Transport Catalog Abstraction Storage Abstraction Databases DB2, Oracle, Sybase Servers HRM

  13. Discovery Transparencies • Naming transparency - find a data set without knowing its name • Map from attributes to a global file name • Location transparency - access a data set without knowing where it is • Map from global file name to local file name • Access transparency - access a data set without knowing the type of storage system • Federated client-server architecture

  14. C, C++, Libraries Unix Shell Databases DB2, Oracle, Postgres Archives HPSS, ADSM, UniTree, DMF File Systems Unix, NT, Mac OSX SDSC Storage Resource Broker & Meta-data Catalog Transparencies Application Linux I/O Web WSDL Access APIs DLL / Python Java, NT Browsers GridFTP Consistency Management / Authorization-Authentication Prime Server Logical Name Space Latency Management Data Transport Metadata Transport Catalog Abstraction Storage Abstraction Databases DB2, Oracle, Sybase Servers HRM

  15. Persistent Collection • Maintain authenticity • Authenticate all accesses • Assign roles for access control lists (curation, write, annotate, read) • Manage audit trails of all operations • Collection-owned data • All accesses through the data management system

  16. C, C++, Libraries Unix Shell Databases DB2, Oracle, Postgres Archives HPSS, ADSM, UniTree, DMF File Systems Unix, NT, Mac OSX SDSC Storage Resource Broker & Meta-data Catalog Persistency Application Linux I/O Web WSDL Access APIs DLL / Python Java, NT Browsers GridFTP Consistency Management / Authorization-Authentication Prime Server Logical Name Space Latency Management Data Transport Metadata Transport Catalog Abstraction Storage Abstraction Databases DB2, Oracle, Sybase Servers HRM

  17. Preservation(Similar requirements to a data grid) • Name transparency • Find a file by attributes (map from attributes to global name) • Location transparency • Access a file by a global identifier (map from global to local file name) • Access transparency • Use same API to access data in archive or file cache • Authenticity • Disaster recovery, replicate data across storage systems • Audit and process management

  18. C, C++, Libraries Unix Shell Databases DB2, Oracle, Postgres Archives HPSS, ADSM, UniTree, DMF File Systems Unix, NT, Mac OSX SDSC Storage Resource Broker & Meta-data Catalog Preservation Application Linux I/O Web WSDL Access APIs DLL / Python Java, NT Browsers GridFTP Consistency Management / Authorization-Authentication Prime Server Logical Name Space Latency Management Data Transport Metadata Transport Catalog Abstraction Storage Abstraction Databases DB2, Oracle, Sybase Servers HRM

  19. Convergence of Technologies • Data grids as basis for distributed data management • Federation of distributed resources • Creation of logical name space to automate discovery • Distributed data collections • Discovery based on attributes • Distributed data storage systems • Digital libraries • Development of services for manipulating, viewing data • Persistent archives • Management of technology evolution

  20. Digital Entities • Digital entities are “images of reality”, made of • Data, the bits (zeros and ones) put on a storage system • Information, the attributes used to assign semantic meaning to the data • Knowledge, the structural relationships described by a data model • Every digital entity requires information and knowledge to correctly interpret and display

  21. Differentiating between Data, Information, and Knowledge • Data • Digital object • Objects are streams of bits • Information • Any tagged data, which is treated as an attribute. • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object • Knowledge • Relationships between attributes • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional

  22. Knowledge Creation Roadmap • Knowledge syntax (consensus) • RDF, XMI, Topic Map • Knowledge management (recursive operations) • Oracle parallel database • Knowledge manipulation (spatial/procedural rules) • Generation of inference rules and mapping to data models • Knowledge generation (scalable inference engine) • Application of inference rules in inference engine

  23. Knowledge Based Data Grid Roadmap Ingest Services Management Access Services Relationships Between Concepts Knowledge Repository for Rules Knowledge or Topic-Based Query / Browse Knowledge XTM DTD • Rules - KQL (Model-based Access) Information Repository Attribute- based Query Attributes Semantics SDLIP XML DTD Information (Data Handling System - SRB) Data Fields Containers Folders Storage (Replicas, Persistent IDs) MCAT/HDF Grids Feature-based Query

  24. Information Management Projects • Digital Libraries • California Digital Library - Art Museum Image COnsortium • DARPA/USPTO Patent digital library - SAIC, NCSA, U Virginia • NLM Visible Embryo digital library - George Mason University • NSF Digital Library Initiative, Phase II - UCSB, Stanford • NSF NPACI Digital Sky - Caltech 2MASS sky survey • Data Grid Environments • DOE Data Visualization Corridor - LLNL • DOE Particle Physics Data Grid - Stanford, Caltech • NASA Information Power Grid - NASA Ames • NIH Biomedical Informatics Research Network - Duke, Harvard • NSF Grid Physics Network - U Florida • NSF National Virtual Observatory - Johns Hopkins University, Caltech • NSF Southern California Earthquake Center - USC/ISI • Persistent Archives • NARA Persistent Archive - UCB, U Maryland • NHPRC Archivist workbench - U Minnesota • NSF NSDL Persistent archive for curricula modules - Cornell University

  25. Additional Projects • ACS Alliance for Cell Signaling • NSF Digital Government - Fed Web • DOE - SciDAC Grid Portal • DOE - SciDAC Scientific Data Management Project • Hayden Planetarium

  26. NASA Information Power Grid • Develop digital library interface to the archives at NASA Ames - SRB • Demonstrate high performance data access across both NASA and NSF resources • Demonstrate telescience through NASA resources (link electron microscope at UCSD, with image collection at NASA Ames, and processing of data at NASA Ames)

  27. Hayden Planetarium • Provide a data collaboration environment • Share data between NCSA (simulations of the solar system evolution), SDSC (3D visualizations), and Hayden (review) • Manage 3-6 TBs of data • Provide seamless access across administration domains and storage resources

  28. The SRB is great. It has been utterly essential in this project - we could not have done this work without the SRB. We have used the SRB as a central shared repository for raw data, derived data, and visualization results. Data has been submitted by, and retrieved by each of the partners in this project on a daily basis. Email back and forth frequently includes SRB directory names into which data or results have been stored for broad review at different sites. The visualization animations we've produced are immediately placed into the SRB where they are downloaded, simultaneously, by people at NCSA and the museum in New York. The animations are reviewed, new data generated, email exchanged, and new images rendered and put back into the SRB to start the next review cycle. The whole thing has worked flawlessly. I am delighted and will gladly promote its virtues at any opportunity. As you've seen, I do have some suggestions for future functionality here and there. The "migration awkwardness" is one. But these suggestions are for added features or minor interface smoothing. They should in no way diminish the fact that it all works wonderfully! Thanks! -- Dave

  29. Visible Embryo Project • Build a digital library of images, reports for use by educators and physicians • Manage transfer of images from Armed Forces Institute of Pathology to an archive at SDSC • Provide access to the material • Use the SRB/MCAT system to assemble a digital library

  30. Visible Embryo Project Disk Cache AFIP: Collab WS Image Generation OHSU Eolas GST ATD Net NIC Disk Cache UIC Startap ASX200 BEN MSWS NT WS MSWS NT WS Oakland HSCC WRL 100 Gbit Vegas OC-3 JHU Disk Cache DS3 Los Angeles VBNS OC-12 Abilene OC-3 GMU Disk Cache DC POP OC-3 Abilene OC-3 SDSC Archive

  31. National Virtual Observatory • Federate existing sky survey image collections and catalogs • Support statistical analyses across multiple surveys • Implement services support environment • Use the SRB/MCAT to support bulk data access, replicate the major sky surveys, support large scale database record analyses

  32. National Virtual Observatory Data Grid 1. Portals and Workbenches 2.Knowledge & Resource Management Bulk Data Analysis Metadata View Data View Catalog Analysis 3. Standard APIs and Protocols Concept space 4.Grid Security Caching Replication Backup Scheduling Information Discovery Metadata delivery Data Discovery Data Delivery 5. Standard Metadata format, Data model, Wire format 6. Catalog Mediator Data mediator Catalog/Image Specific Access Compute Resources Catalogs Data Archives Derived Collections 7.

  33. Particle Physics Data Grid • Support replication of data sets for the high energy physics grid • Federate data collections for the BaBar experiment at SLAC using the SRB • Federate BaBar collections between SLAC and Lyons, France • Support web service interface, support derived data product catalog

  34. S-Commands S-Commands Wisc Client 2 SRB Server @Wisc Wisc Client 1 SRB Server @LBL SRB Server @LBL file caching esrb.driver esrb.driver IPC IPC Stage() purge() fileStatus() file purging File caching request Stage() purge() fileStatus() FC HRM HPSS esrb.server Stage() purge() fileStatus() Disk cache Particle Physics Data Grid - Replication System

  35. National STEM Education Digital Library - NSDL • Provide persistent archive for educational material indexed in the NSDL repository • Develop knowledge spaces to characterize collection holdings • Map knowledge spaces to AAAS2061 concept space, and to state mandated grade level concepts

  36. Portals & Clients Portals & Clients Portals & Clients NSDL Services NSDL Services Other NSDL Services NSDL Collections NSDL Collections NSDL Collections Core Services: annotation CI Services query transform CI Services topic-map registry referenced items & collections Core Services: metadata normalizing CI Services personalization referenced items & collections Referenced Items & Collections Core Collection- Building Services metadata harvesting CI Services discussion Core Collection- Building Services persistent storage CI Services visualization... User Interfaces NSDL Usage Enhancement Delivery Presentation Aggregation - Channels Information about collections Core NSDL Bus Meta-data delivery Data delivery Query Global Ids Security Network Metadata & data access-based services Virtual Collections & Mediators Collection Building

  37. National Archives Records Administration • Develop prototype persistent archive for NARA digital holdings • Identify pertinent research areas for long term preservation of data, information, and knowledge • Apply data grid technology for the implementation of persistent archives.

  38. Grid Physics Network • Develop infrastructure to support virtual data products • Create repository for derived data products • Automate the extraction of metadata from Virtual Data Language files • Automate extraction of administrative data through grid portals

  39. GridPort + SRB Architecture • With SRB capabilities, file access is direct, uniform • Uses same authentication as portal and other Grid services • Single SRB account access allows for more flexible data management

  40. DOE Scientific Data Management • Develop knowledge management tools to mediate between biological information resources • Integrate the tools into the DOE scientific data management environment

  41. Scientific Data Management ISIC Petabytes Petabytes Scientific Simulations & experiments • DOE Labs: ANL, LBNL, LLNL, ORNL • Universities: GTech, NCSU, NWU, SDSC Tapes Disks Tapes Disks Terabytes Terabytes • Climate Modeling • Astrophysics • Genomics and Proteomics • High Energy Physics SDM-ISIC Technology • Optimizing shared access from mass storage systems • Metadata and knowledge- based federations • API for Grid I/O • High-dimensional cluster analysis • High-dimensional indexing • Adaptive file caching • Agents … Data Manipulation: Data Manipulation: ~20% time • Using SDM-ISIC technology • Getting files from Tape archive • Extracting subset of data from files • Reformatting data • Getting data from heterogeneous, distributed systems • moving data over the network ~80% time Scientific Analysis & Discovery ~80% time Goals • Optimize and simplify: • access to very large datasets • access to distributed data • access of heterogeneous data • data mining of very large datasets Scientific Analysis & Discovery ~20% time Current Goal

  42. Further Information http://www.npaci.edu/DICE

More Related