
Digital Libraries, Data Grids, and Persistent Archives
Reagan W. Moore, San Diego Supercomputer Center, moore@sdsc.edu, http://www.npaci.edu/DICE/


Presentation Transcript


  1. Digital Libraries, Data Grids, and Persistent Archives
     Reagan W. Moore
     San Diego Supercomputer Center
     moore@sdsc.edu
     http://www.npaci.edu/DICE/

  2. Staff
     • Staff: Reagan Moore, Ilkai Altintas, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, Bertram Ludäscher, Richard Marciano, XuFei Qian, Roman Olshanowsky, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu
     • Graduate Students: A. Bagchi, S. Bansal, A. Behere, R. Bharath, S. Bharath, M. Kulrul, L. Sui
     • Undergraduate Interns: N. Cotofana, M. Shumaker, J. Trang, L. Yin, +/- NN
     Data and Knowledge Systems Group

  3. Topics
     • Application of
       • Data management systems
       • Information management systems
       • Knowledge management systems
     • to
       • Distributed data collections
       • Digital libraries
       • Data Grids
       • Persistent Archives
     • by
       • Defining levels of abstraction

  4. Information Management Projects
     • Digital Libraries
       • CDL - AMICO
       • DARPA/USPTO - patent digital library
       • NLM Visible Embryo digital library - GMU
       • NSF Digital Library Initiative, Phase II - UCSB, Stanford
       • NSF NPACI Digital Sky - Caltech 2MASS sky survey
       • NSF NSDL - UCAR / Columbia / Cornell / UCSB
     • Data Grid Environments
       • DOE Data Visualization Corridor - LLNL
       • DOE Particle Physics Data Grid - Stanford, Caltech
       • NASA Information Power Grid - NASA Ames
       • NIH Biomedical Informatics Research Network
       • NSF Grid Physics Network - U Florida
       • NSF National Virtual Observatory - Johns Hopkins University / Caltech
       • NSF Southern California Earthquake Center - ISI
     • Persistent Archives
       • NARA Persistent Archive
       • NHPRC - Archivist workbench

  5. Managing Distributed Storage
     • Separate the organization of digital objects from their physical storage
     • Logical Name Space to manage attributes about the digital objects
     • Data handling system to manage interactions with remote storage systems
     • Create storage abstraction layer
     • Storage Resource Broker (SRB) provides data management system
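The separation of logical organization from physical storage can be sketched in a few lines of Python. This is only an illustration of the idea, not the SRB interface: the `Replica` and `LogicalNameSpace` classes, the resource names, and the paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a digital object (fields are illustrative)."""
    resource: str  # hypothetical storage resource name
    path: str      # physical location inside that resource


@dataclass
class LogicalNameSpace:
    """Maps logical names to physical replicas and descriptive attributes."""
    replicas: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)

    def register(self, logical_name, replica, **attrs):
        self.replicas.setdefault(logical_name, []).append(replica)
        self.attributes.setdefault(logical_name, {}).update(attrs)

    def resolve(self, logical_name):
        """Return every physical replica recorded for a logical name."""
        return list(self.replicas[logical_name])

    def move(self, logical_name, old, new):
        """Record a physical migration; the logical name stays unchanged."""
        copies = self.replicas[logical_name]
        copies[copies.index(old)] = new


ns = LogicalNameSpace()
disk = Replica("unix-disk", "/data/a/0001.fits")
ns.register("/sky/2mass/image-0001", disk, survey="2MASS")

# Migrating the object to an archive changes only the physical mapping.
tape = Replica("hpss-archive", "/arch/77/0001.fits")
ns.move("/sky/2mass/image-0001", disk, tape)
```

Users keep referring to `/sky/2mass/image-0001` before and after the move, which is the point of the abstraction layer.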

  6. Information Management - Logical Name Space
     • Set of attributes to describe digital entities that are registered into the logical name space
       • SRB metadata - Unix file system semantics
       • Provenance metadata - Dublin Core
       • Resource metadata - User access control lists
       • Discipline metadata - User defined attributes
     • Each digital entity may have unique attributes
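One way to picture the four attribute groups is a record with one dictionary per group. The sketch below is an assumption-laden illustration: the `EntityMetadata` class, the `find` helper, and all sample values are invented here, and only the Dublin Core element names `creator` and `date` come from an external standard.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMetadata:
    """Attribute groups for one registered digital entity (layout is illustrative)."""
    system: dict = field(default_factory=dict)      # file-system-like attributes
    provenance: dict = field(default_factory=dict)  # Dublin Core elements
    resource: dict = field(default_factory=dict)    # access control lists
    discipline: dict = field(default_factory=dict)  # user-defined attributes


# Made-up sample entries; each entity carries its own set of attributes.
catalog = {
    "/embryo/section-042": EntityMetadata(
        system={"size": 1048576, "owner": "curator"},
        provenance={"creator": "Imaging Lab", "date": "1999-05-01"},
        resource={"acl": ["curator:rw", "public:r"]},
        discipline={"stain": "H&E"},
    ),
    "/embryo/section-043": EntityMetadata(
        system={"size": 2097152, "owner": "curator"},
        provenance={"creator": "Imaging Lab", "date": "1999-05-02"},
        discipline={"magnification": "40x"},
    ),
}


def find(catalog, group, key, value):
    """Return logical names whose attribute `key` in `group` equals `value`."""
    return sorted(
        name for name, md in catalog.items()
        if getattr(md, group).get(key) == value
    )


by_creator = find(catalog, "provenance", "creator", "Imaging Lab")
by_stain = find(catalog, "discipline", "stain", "H&E")
```

Because the discipline group is a free-form dictionary, the second entity can omit `stain` and add `magnification` without any change to the record definition.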

  7. Information Management
     • Abstraction layer for interacting with information repositories
     • Manage the schema and physical table structures of a database
     • Extensible schema
     • User defined attributes
     • Extensible Metadata CATalog (EMCAT) manages collections
     • mySRB.html interface supports dynamic collection creation
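An extensible schema with user-defined attributes can be approximated with an entity-attribute-value table. The sketch below uses Python's standard `sqlite3` module; the table layout, function names, and sample rows are assumptions made for illustration, and this is a stand-in technique rather than a description of how MCAT or EMCAT store their tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, logical_name TEXT UNIQUE);
    CREATE TABLE attribute (
        entity_id INTEGER REFERENCES entity(id),
        name TEXT,
        value TEXT
    );
""")


def add_entity(logical_name, **attrs):
    """Register an entity with any set of user-defined attributes."""
    cur = conn.execute(
        "INSERT INTO entity (logical_name) VALUES (?)", (logical_name,)
    )
    conn.executemany(
        "INSERT INTO attribute (entity_id, name, value) VALUES (?, ?, ?)",
        [(cur.lastrowid, key, str(val)) for key, val in attrs.items()],
    )


def find_by_attribute(name, value):
    """Return logical names of entities carrying the given attribute value."""
    rows = conn.execute(
        "SELECT e.logical_name FROM entity e "
        "JOIN attribute a ON a.entity_id = e.id "
        "WHERE a.name = ? AND a.value = ? ORDER BY e.logical_name",
        (name, value),
    )
    return [row[0] for row in rows]


# New attribute names need no ALTER TABLE: they are just new rows.
add_entity("/embryo/slide-12", stain="H&E", stage="17")
add_entity("/embryo/slide-13", stage="17", magnification="40x")
```

The trade-off of this layout is that every value is stored as text, so typed comparisons would need an extra type column or separate value tables.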

  8. Knowledge Management - Discovery across Collections
     • Characterization of relationships between attributes
       • Semantic / logical - cross-walks
       • Procedural / temporal - records management
       • Structural / spatial - GIS
     • Abstraction layer for knowledge repositories
     • Mapping from collection attributes to discipline concepts
     • Model-based Mediation supports mapping from knowledge relationships to rule-based inference engines
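A semantic cross-walk is, at its simplest, a mapping from one collection's attribute names to a shared vocabulary. The following sketch maps hypothetical local attribute names (`object_name`, `artist`, `year_made`, `gallery_room`) to Dublin Core element names; the mapping table and the sample record are invented for illustration.

```python
# Hypothetical cross-walk: local attribute name -> Dublin Core element name.
CROSSWALK = {
    "object_name": "title",
    "artist": "creator",
    "year_made": "date",
}


def to_dublin_core(record, crosswalk=CROSSWALK):
    """Split a record into mapped Dublin Core fields and unmapped leftovers."""
    mapped, unmapped = {}, {}
    for key, value in record.items():
        if key in crosswalk:
            mapped[crosswalk[key]] = value
        else:
            unmapped[key] = value
    return mapped, unmapped


local_record = {
    "object_name": "Study in Blue",
    "artist": "Unknown",
    "year_made": "1921",
    "gallery_room": "B4",
}
dc_fields, leftovers = to_dublin_core(local_record)
```

Returning the unmapped attributes separately makes gaps in the cross-walk visible instead of silently dropping discipline-specific fields.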

  9. Presentation of Digital Objects
     [Diagram labels: Application, Operating System, Storage System, Display System, Digital Object]

  10. Technology Management
      [Diagram labels: Application, Wrap Application, Operating System, Storage System, Display System, Digital Object]

  11. Technology Management
      [Diagram labels: Application, Add Operating System Call, Operating System, Storage System, Display System, Digital Object]

  12. Technology Management
      [Diagram labels: Application, Add Operating System Call, Operating System, Add Operating System Call, Storage System, Display System, Digital Object]

  13. Technology Management
      [Diagram labels: Application, Add Operating System Call, Operating System, Wrap Storage System, Wrap Display System, Storage System, Display System, Digital Object]

  14. Technology Management
      [Diagram labels: Application, Operating System, Storage System, Display System, Migrate Encoding Format, Digital Object]

  15. Specifying Levels of Abstraction
      • Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource
      • Can we abstract
        • Digital object
        • Storage

  16. Technology Management
      [Diagram labels: Application, Operating System, Storage System Abstraction, Display System Abstraction, Storage System, Display System, Digital Object Abstraction, Digital Object]

  17. Types of Digital Entity Abstractions
      • Logical representation
        • What does the digital entity represent?
        • What is the associated meaning?
      • Physical representation
        • What is the physical structure of the digital entity?

  18. Levels of Abstraction for Bits
      • Abstraction for Digital Entity
        • Logical: I-nodes
        • Physical: Track / Sector
      • Digital Entity: Bit Stream
      • Abstraction for Repository
        • Logical: File Name
        • Physical: File System (NFS/AFS/NTFS)
      • Repository: Disk

  19. Levels of Abstraction for Data
      • Abstraction for Digital Entity
        • Logical: Data Model (units, semantics)
        • Physical: Encoding Format (syntax, structure)
      • Digital Entity: Files
      • Abstraction for Repository
        • Logical: Name Space
        • Physical: Data Handling System - SRB/MCAT
      • Repository: File System, Archive

  20. Levels of Abstraction for Information
      • Abstraction for Digital Entity
        • Logical: Collection Schema
        • Physical: XML Syntax
      • Digital Entity: Metadata Attributes
      • Abstraction for Repository
        • Logical: Database Schema
        • Physical: EMCAT/CWM
      • Repository: Database

  21. Levels of Abstraction for Knowledge
      • Abstraction for Digital Entity
        • Logical: Relationship Schema
        • Physical: ER/UML/XMI/RDF syntax
      • Digital Entity: Concept Space (ontology instance)
      • Abstraction for Repository
        • Logical: Knowledge Repository Schema
        • Physical: Model-based Mediation System
      • Repository: Knowledge Repository

  22. Information Management Projects
      • Digital Libraries
        • CDL - AMICO
        • DARPA/USPTO - patent digital library
        • NLM Visible Embryo digital library - GMU
        • NSF Digital Library Initiative, Phase II - UCSB, Stanford
        • NSF NPACI Digital Sky - Caltech 2MASS sky survey
        • NSF NSDL - UCAR / Columbia / Cornell / UCSB
      • Data Grids
        • DOE Data Visualization Corridor - LLNL
        • DOE Particle Physics Data Grid - Stanford, Caltech
        • NASA Information Power Grid - NASA Ames
        • NIH Biomedical Informatics Research Network
        • NSF Grid Physics Network - U Florida
        • NSF National Virtual Observatory - Johns Hopkins University / Caltech
        • NSF Southern California Earthquake Center - ISI
      • Persistent Archives
        • NARA Persistent Archive
        • NHPRC - Archivist workbench

  23. Evolution of Data Management
      Collection-managed data
      • Use database to organize attributes about data objects
      • Separate information management from data storage
      • Support APIs for information discovery, data access
      [Diagram labels: Database A, Storage, Storage Resource Broker]
      Integration accomplished through a data handling system which characterizes the storage systems

  24. SDSC Storage Resource Broker & Meta-data Catalog
      [Architecture diagram labels:]
      • Application
      • Access interfaces: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog Predicate; Third-party copy; Web; User Defined
      • SRB, MCAT, Remote Proxies
      • Catalog content: Resource, User; Dublin Core; Application Meta-data
      • Databases: DB2, Oracle, Postgres
      • Archives: HPSS, ADSM, UniTree, DMF
      • File Systems: Unix, NT, Mac OSX
      • HRM, DataCutter

  25. Evolution of Data Management
      Distributed Data Collection
      • Same name space
      • Same schema
      • Separate administration domains
      • Heterogeneous database instances
      [Diagram labels: Database A, Database B, Storage Resource Broker]
      Integration requires the ability to characterize both the schemas and the table structures of each information repository

  26. Distributed Data Collection
      • Logical organization of distributed digital objects into a collection
      • Access through federated servers
      • Collection-owned data implies the server at each storage repository runs under a collection user-ID
      • Collection attributes define a global namespace
      • Self-consistent attribute update on all data accesses
      • Support for multiple access APIs
      • Extensible support for access to any type of storage system (archive, file system, database)
      • Extensible collection attributes

  27. Interoperability across Data and Information Repositories
      • Define a representation for storage that is independent of the implementation of the storage system
        • Unix file system semantics - Open/Close/Read/Write/Seek
      • Define a representation of a collection that is independent of the choice of database
        • schema, table structures
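A storage representation built on Open/Close/Read/Write/Seek can be sketched as an abstract driver with interchangeable back ends. Everything named below (`StorageDriver`, `MemoryDriver`, `LocalFileDriver`, `copy_object`) is hypothetical and written for this illustration; it is not the SRB driver interface.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """The five operations every storage back end must provide."""

    @abstractmethod
    def open(self, name, mode): ...
    @abstractmethod
    def close(self, handle): ...
    @abstractmethod
    def read(self, handle, size=-1): ...
    @abstractmethod
    def write(self, handle, data): ...
    @abstractmethod
    def seek(self, handle, offset): ...


class MemoryDriver(StorageDriver):
    """Keeps objects in a dictionary; a handle is (name, mode, buffer)."""

    def __init__(self):
        self.objects = {}

    def open(self, name, mode):
        return (name, mode, io.BytesIO(b"" if mode == "w" else self.objects[name]))

    def close(self, handle):
        name, mode, buf = handle
        if mode == "w":
            self.objects[name] = buf.getvalue()

    def read(self, handle, size=-1):
        return handle[2].read(size)

    def write(self, handle, data):
        return handle[2].write(data)

    def seek(self, handle, offset):
        return handle[2].seek(offset)


class LocalFileDriver(StorageDriver):
    """Stores objects as binary files under a root directory."""

    def __init__(self, root):
        self.root = root

    def open(self, name, mode):
        return open(os.path.join(self.root, name), mode + "b")

    def close(self, handle):
        handle.close()

    def read(self, handle, size=-1):
        return handle.read(size)

    def write(self, handle, data):
        return handle.write(data)

    def seek(self, handle, offset):
        return handle.seek(offset)


def copy_object(src, src_name, dst, dst_name, chunk=4096):
    """Copy between any two drivers using only the abstract operations."""
    fin, fout = src.open(src_name, "r"), dst.open(dst_name, "w")
    while block := src.read(fin, chunk):
        dst.write(fout, block)
    src.close(fin)
    dst.close(fout)


memory = MemoryDriver()
memory.objects["obs.dat"] = b"header:payload"
workdir = tempfile.mkdtemp()
disk = LocalFileDriver(workdir)
copy_object(memory, "obs.dat", disk, "obs.dat")

handle = disk.open("obs.dat", "r")
disk.seek(handle, 7)  # skip the 7-byte "header:" prefix
tail = disk.read(handle)
disk.close(handle)
```

`copy_object` never inspects which back end it is talking to, so adding an archive or database driver would not change the calling code.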

  28. Visible Embryo Project
      [Network diagram labels: AFIP (Collab WS, Image Generation, Disk Cache); OHSU; Eolas; GST; ATD Net; NIC; UIC (Disk Cache); Startap; ASX200; BEN; MSWS NT WS; Oakland; HSCC; WRL; 100 Gbit; Vegas OC-3; JHU (Disk Cache); DS3; Los Angeles; VBNS OC-12; Abilene OC-3; GMU (Disk Cache); DC POP OC-3; Abilene OC-3; SDSC Archive]

  29. Data Grids
      Data Grid - linking multiple data collections
      • Separate name spaces
      • Separate schema
      • Separate administration domains
      • Heterogeneous database instances
      [Diagram labels: Database A, Data grid, Database B]
      The data grid is itself a collection that provides mechanisms to hide latency and manage semantics

  30. National Virtual Observatory Data Grid
      [Layered architecture diagram with numbered callouts 1-7. Labels: Portals and Workbenches; Knowledge & Resource Management; Bulk Data Analysis; Metadata View; Data View; Catalog Analysis; Standard APIs and Protocols; Concept space; Grid Security, Caching, Replication, Backup, Scheduling; Information Discovery; Metadata delivery; Data Discovery; Data Delivery; Standard Metadata format, Data model, Wire format; Catalog Mediator; Data mediator; Catalog/Image Specific Access; Compute Resources; Catalogs; Data Archives; Derived Collections]

  31. Federated Digital Libraries
      Virtual Data Grid - linking multiple data collections
      • Ability to execute processes to recreate derived data
      [Diagram labels: Database A, Services, Virtual Data Grid, Database B, Services]
      The virtual data grid integrates data grid and digital library technology to manage processes

  32. NSDL
      [Architecture diagram labels:]
      • Portals & Clients; User Interfaces
      • NSDL Services; Other NSDL Services; NSDL Collections; Referenced Items & Collections
      • Core Services: annotation, metadata normalizing
      • Core Collection-Building Services: metadata harvesting, persistent storage
      • CI Services: query, transform, topic-map registry, personalization, discussion, visualization...
      • Usage Enhancement; Delivery; Presentation; Aggregation - Channels; Information about collections
      • Core NSDL Bus: Meta-data delivery, Data delivery, Query, Global Ids, Security, Network
      • Metadata & data access-based services; Virtual Collections & Mediators; Collection Building

  33. Persistent Archive
      Persistent archive
      • Describe archived data as collections
      • Describe processes used to create collections
      • Manage evolution of technology
      [Diagram labels: Database A (today), Virtual Data Grid, Database A (tomorrow)]
      The persistent archive is itself a virtual data grid that provides mechanisms to manage migration to new technology

  34. Persistent Archives
      • Storage system abstraction
        • Logical name space and data manipulations
      • Information repository abstraction
        • Logical schema and physical table structure
      • Knowledge repository abstraction
        • Topic maps and inference rules
      • Digital object abstraction
        • Data model and encoding format

  35. Persistent Collection
      • Define context for archiving data - annotate information content
      • Create archivable form - standard encoding format
      • Archive information content along with data
      • Test closure of the collection - all digital objects that can be discovered in the collection are members of the collection
      • Test completeness of the collection - inherent relationships within the collection can be cast in terms of attributes generated from the annotated information
      • Differentiate between inherent knowledge and anomalies / artifacts
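The closure test can be sketched as a check that nothing reachable from a member lies outside the collection. This reading treats "discovered" as "referenced by a member", which is an assumption; the function name and the sample identifiers are invented.

```python
def closure_violations(members, links):
    """Return objects referenced by members that are not members themselves.

    members: set of object identifiers in the collection
    links:   dict mapping an identifier to the identifiers it references
    """
    return {
        target
        for obj in members
        for target in links.get(obj, ())
        if target not in members
    }


members = {"report-1", "figure-1", "table-1"}
links = {
    "report-1": ["figure-1", "table-1"],
    "table-1": ["dataset-9"],  # points outside the collection
}
missing = closure_violations(members, links)

# Adding the missing object restores closure.
closed = closure_violations(members | {"dataset-9"}, links)
```

An empty result means every object that can be discovered through the collection is also a member of it.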

  36. Self-Instantiating Archive
      • Archive the processes that are used to control the ingestion process
        • Conversion to archivable form
        • Annotation of information content
      • When accessing the collection, retrieve the processes and the original digital objects
        • Apply the processing steps to re-create the information content
        • Query the result to discover desired digital objects
      • A self-instantiating archive is a virtual data grid
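The access path of a self-instantiating archive (retrieve processes and objects, re-apply the processes, query the result) can be sketched as follows. As a simplification, the archive below stores only the names of the processing steps and resolves them through a registry; the step functions, registry, and sample objects are all hypothetical.

```python
def to_archival_form(record):
    """Conversion step: decode the raw bytes into text."""
    return {"text": record["raw"].decode("utf-8")}


def annotate(record):
    """Annotation step: derive an attribute from the content."""
    return {**record, "word_count": len(record["text"].split())}


# Hypothetical registry that resolves archived step names to code.
STEPS = {"to_archival_form": to_archival_form, "annotate": annotate}

archive = {
    "processes": ["to_archival_form", "annotate"],  # archived with the data
    "objects": {
        "memo-1": b"budget approved",
        "memo-2": b"meeting moved to friday noon",
    },
}


def instantiate(archive, steps=STEPS):
    """Re-create the information content by replaying the archived processes."""
    collection = {}
    for name, raw in archive["objects"].items():
        record = {"raw": raw}
        for step in archive["processes"]:
            record = steps[step](record)
        collection[name] = record
    return collection


def query(collection, predicate):
    """Return names of re-created records that satisfy the predicate."""
    return sorted(name for name, rec in collection.items() if predicate(rec))


collection = instantiate(archive)
long_memos = query(collection, lambda rec: rec["word_count"] > 3)
```

Nothing derived is stored: the `word_count` attribute exists only after the archived steps have been replayed against the original objects.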

  37. ERA Concept model

  38. Data Management Systems
      • Distributed data collections
        • Single name space
        • Distributed data storage systems
      • Data Grid - integration of multiple data collections
        • Each collection has a separate name space
        • Infrastructure that interconnects the collections can use its own name space, containers, replication
      • Virtual Data Grids - federation of digital libraries
        • In addition, support interoperability between services for manipulation, presentation, discovery of digital objects
      • Persistent archive
        • In addition, manage evolution of technology components
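The difference between a single collection and a data grid can be sketched as a grid-level name space layered over collections that each keep their own local names. The `Collection` and `DataGrid` classes and the sample names are assumptions made for this illustration.

```python
class Collection:
    """A collection with its own local name space."""

    def __init__(self, name, objects):
        self.name = name
        self.objects = dict(objects)  # local name -> content


class DataGrid:
    """Interconnects collections through a separate grid-level name space."""

    def __init__(self):
        self.collections = {}
        self.names = {}  # grid name -> (collection name, local name)

    def attach(self, collection):
        self.collections[collection.name] = collection

    def publish(self, grid_name, collection_name, local_name):
        if local_name not in self.collections[collection_name].objects:
            raise KeyError(local_name)
        self.names[grid_name] = (collection_name, local_name)

    def get(self, grid_name):
        collection_name, local_name = self.names[grid_name]
        return self.collections[collection_name].objects[local_name]


# Both collections use the same local name for different content.
survey_a = Collection("survey-a", {"catalog.dat": b"A-sources"})
survey_b = Collection("survey-b", {"catalog.dat": b"B-sources"})

grid = DataGrid()
grid.attach(survey_a)
grid.attach(survey_b)
grid.publish("/grid/optical/catalog", "survey-a", "catalog.dat")
grid.publish("/grid/infrared/catalog", "survey-b", "catalog.dat")
```

The local name `catalog.dat` collides across collections, but the grid names stay unambiguous because the grid owns its own name space.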

  39. Differentiating between Data, Information, and Knowledge
      • Data
        • Digital object
        • Objects are streams of bits
      • Information
        • Any tagged data, which is treated as an attribute
        • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
      • Knowledge
        • Relationships between attributes
        • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
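The three levels can be illustrated on a single object: a bit stream, the tagged attributes extracted from it, and a relationship that gives the attributes joint meaning. The tag names, values, and helper functions below are made up for the sketch.

```python
import re

# Data: the digital object is just a stream of bits.
data = b"<obs><ra>10.68</ra><dec>41.27</dec></obs>"


def extract_attributes(bits):
    """Information: tagged data inside the object, treated as attributes."""
    text = bits.decode("ascii")
    return {
        tag: float(value)
        for tag, value in re.findall(r"<(ra|dec)>([^<]+)</\1>", text)
    }


information = extract_attributes(data)

# Knowledge: a relationship between attributes. Here, a structural/spatial
# one stating that "ra" and "dec" together locate a point.
relationships = [("ra", "dec", "structural/spatial")]


def in_region(info, ra_range, dec_range):
    """Use the spatial relationship to answer a region query."""
    return (
        ra_range[0] <= info["ra"] <= ra_range[1]
        and dec_range[0] <= info["dec"] <= dec_range[1]
    )


inside = in_region(information, (10.0, 11.0), (41.0, 42.0))
```

Neither attribute alone answers the region query; it is the relationship between them that makes the question meaningful.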

  40. Knowledge Management
      • Must manage semantic relationships between the multiple name spaces
        • Data Grid
      • Must manage procedural relationships between digital library services
        • Federated digital library
      • Must manage structural relationships between different archivable forms - encoding formats
        • Persistent archive

  41. Types of Knowledge Relationships
      • Logical / semantic
        • Digital Library cross-walks
      • Temporal / procedural
        • Workflow systems
      • Spatial / structural
        • GIS systems
      • Functional / algorithmic
        • Scientific feature analysis

  42. Knowledge Based Data Grids
      [Diagram labels:]
      • Columns: Ingest Services, Management, Access Services
      • Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse; XTM DTD; Rules - KQL (Model-based Access)
      • Information: Attributes, Semantics; Information Repository; Attribute-based Query; XML DTD; SDLIP; (Data Handling System - SRB)
      • Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF

  43. Further Information http://www.npaci.edu/DICE
