
Digital Libraries, Data Grids, and Persistent Archives

Reagan W. Moore

San Diego Supercomputer Center

moore@sdsc.edu

http://www.npaci.edu/DICE/

Staff

Reagan Moore

Ilkay Altintas

Chaitan Baru

Sheau Yen Chen

Charles Cowart

Amarnath Gupta

George Kremenek

Bertram Ludäscher

Richard Marciano

XuFei Qian

Roman Olschanowsky

Arcot Rajasekar

Abe Singer

Michael Wan

Ilya Zaslavsky

Bing Zhu

Graduate Students

A. Bagchi

S. Bansal

A. Behere

R. Bharath

S. Bharath

M. Kulrul

L. Sui

Undergraduate Interns

N. Cotofana

M. Shumaker

J. Trang

L. Yin

+/- NN

Data and Knowledge Systems Group
Topics
  • Application of
    • Data management systems
    • Information management systems
    • Knowledge management systems
  • to
    • Distributed data collections
    • Digital libraries
    • Data Grids
    • Persistent Archives
  • by
    • Defining levels of abstraction
Information Management Projects
  • Digital Libraries
    • CDL - AMICO
    • DARPA/USPTO - patent digital library
    • NLM Visible Embryo digital library - GMU
    • NSF Digital Library Initiative, Phase II - UCSB, Stanford
    • NSF NPACI Digital Sky - Caltech 2MASS sky survey
    • NSF NSDL - UCAR / Columbia / Cornell / UCSB
  • Data Grid Environments
    • DOE Data Visualization Corridor - LLNL
    • DOE Particle Physics Data Grid - Stanford, Caltech
    • NASA Information Power Grid - NASA Ames
    • NIH Biomedical Informatics Research Network
    • NSF Grid Physics Network - U Florida
    • NSF National Virtual Observatory - Johns Hopkins University / Caltech
    • NSF Southern California Earthquake Center - ISI
  • Persistent Archives
    • NARA Persistent Archive
    • NHPRC - Archivist workbench
Managing Distributed Storage
  • Separate the organization of digital objects from their physical storage
    • Logical Name Space to manage attributes about the digital objects
    • Data handling system to manage interactions with remote storage systems
  • Create storage abstraction layer
  • Storage Resource Broker (SRB) provides data management system
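The separation of logical organization from physical storage can be sketched as a small catalog that maps logical names to physical replicas. This is an illustrative sketch only, not the SRB API: the class names, resource names, and paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str   # hypothetical storage resource name
    path: str       # physical location inside that resource


@dataclass
class LogicalNameSpace:
    """Maps logical names to physical replicas (illustrative, not the SRB API)."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name: str) -> list:
        return list(self.entries.get(logical_name, []))

    def migrate(self, logical_name: str, old_resource: str, new: Replica) -> None:
        # The physical storage changes; the logical name seen by users does not.
        self.entries[logical_name] = [
            new if r.resource == old_resource else r
            for r in self.entries[logical_name]
        ]


ns = LogicalNameSpace()
ns.register("/sky/tile42", Replica("unix-disk", "/data/a/0042.fits"))
ns.register("/sky/tile42", Replica("hpss-archive", "/hpss/x/0042.fits"))
ns.migrate("/sky/tile42", "unix-disk", Replica("nt-disk", "D:/sky/0042.fits"))
print([r.resource for r in ns.resolve("/sky/tile42")])
```

Because clients only ever hold the logical name, moving a replica to a different storage system is a catalog update rather than a change to every application.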
Information Management - Logical Name Space
  • Set of attributes to describe digital entities that are registered into the logical name space
    • SRB metadata - Unix file system semantics
    • Provenance metadata - Dublin Core
    • Resource metadata - User access control lists
    • Discipline metadata - User defined attributes
  • Each digital entity may have unique attributes
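As a rough illustration of the four attribute groups listed above, each registered entity can be pictured as a record with one section per group. The group names follow the slide; every key, value, and helper function below is invented for this sketch.

```python
# Hypothetical catalog entries. The four metadata groups follow the slide;
# every key and value is invented for illustration.
catalog = [
    {
        "srb": {"name": "tile42.fits", "size": 2097152, "owner": "moore"},
        "provenance": {"creator": "survey-team", "date": "2001-06-01"},
        "resource": {"acl": {"moore": "write", "public": "read"}},
        "discipline": {"band": "J", "ra_deg": 10.68},
    },
    {
        "srb": {"name": "slice001.tif", "size": 524288, "owner": "moore"},
        "provenance": {"creator": "imaging-lab", "date": "2000-11-15"},
        "resource": {"acl": {"moore": "write"}},
        "discipline": {"stage": 17},   # different attributes from the entry above
    },
]


def find(entities, group, key, value):
    """Return the entities whose metadata group holds key == value."""
    return [e for e in entities if e.get(group, {}).get(key) == value]


def can_read(entity, user):
    """Check the access control list kept in the resource metadata."""
    return entity["resource"]["acl"].get(user) in ("read", "write")


j_band = find(catalog, "discipline", "band", "J")
print([e["srb"]["name"] for e in j_band])
print(can_read(catalog[1], "public"))
```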
Information Management
  • Abstraction layer for interacting with information repositories
    • Manage the schema and physical table structures of a database
    • Extensible schema
    • User defined attributes
  • Extensible Metadata CATalog (EMCAT) manages collections
  • mySRB.html interface supports dynamic collection creation
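One common way to support user-defined attributes without altering table structures is an entity-attribute-value layout. The sketch below uses Python's standard sqlite3 module and assumes that layout purely for illustration; the slides do not say how EMCAT stores its schema, and the table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY, logical_name TEXT UNIQUE);
CREATE TABLE attribute (
    entity_id INTEGER REFERENCES entity(id),
    name TEXT,
    value TEXT
);
""")


def add_entity(logical_name, **attrs):
    """Register an entity with any set of user-defined attributes."""
    cur = conn.execute(
        "INSERT INTO entity (logical_name) VALUES (?)", (logical_name,)
    )
    conn.executemany(
        "INSERT INTO attribute VALUES (?, ?, ?)",
        [(cur.lastrowid, k, str(v)) for k, v in attrs.items()],
    )


def find_by_attribute(name, value):
    rows = conn.execute(
        "SELECT e.logical_name FROM entity e "
        "JOIN attribute a ON a.entity_id = e.id "
        "WHERE a.name = ? AND a.value = ? ORDER BY e.logical_name",
        (name, value),
    )
    return [r[0] for r in rows]


add_entity("/embryo/slice001", stage="17", stain="HE")
add_entity("/embryo/slice002", stage="17")
add_entity("/patents/doc123", year=1999)   # a new attribute needs no ALTER TABLE
print(find_by_attribute("stage", "17"))
```

The trade-off of this layout is that all values are stored as text, so typed comparisons have to be handled by the query layer.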
Knowledge Management - Discovery across Collections
  • Characterization of relationships between attributes
    • Semantic / logical - cross-walks
    • Procedural / temporal - records management
    • Structural / spatial - GIS
  • Abstraction layer for knowledge repositories
  • Mapping from collection attributes to discipline concepts
  • Model-based Mediation supports mapping from knowledge relationships to rule-based inference engines
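A semantic cross-walk of the kind mentioned above can be sketched as a mapping from each collection's attribute names to shared concept names, which then allows one query to span collections. The collection names, attribute names, and records are hypothetical, and this sketch does not attempt rule-based inference.

```python
# Hypothetical cross-walks: each maps a collection's attribute names
# onto shared discipline concepts.
CROSSWALKS = {
    "collection_a": {"creator": "author", "date": "created"},
    "collection_b": {"inventor": "author", "filing_date": "created"},
}


def to_concepts(collection, record):
    """Rename a record's attributes to the shared concept names."""
    mapping = CROSSWALKS[collection]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def search(records, concept, value):
    """Return the collections holding a record whose concept equals value."""
    hits = []
    for collection, record in records:
        if to_concepts(collection, record).get(concept) == value:
            hits.append(collection)
    return hits


records = [
    ("collection_a", {"creator": "Moore", "date": "2001"}),
    ("collection_b", {"inventor": "Moore", "filing_date": "1999"}),
    ("collection_b", {"inventor": "Baru", "filing_date": "2000"}),
]
print(search(records, "author", "Moore"))
```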
Presentation of Digital Objects
  • Application
  • Operating System
  • Storage System
  • Display System
  • Digital Object
Technology Management
  • Application
  • Wrap Application
  • Operating System
  • Storage System
  • Display System
  • Digital Object
Technology Management
  • Application
  • Add Operating System Call
  • Operating System
  • Storage System
  • Display System
  • Digital Object
Technology Management
  • Application
  • Add Operating System Call
  • Operating System
  • Add Operating System Call
  • Storage System
  • Display System
  • Digital Object
Technology Management
  • Application
  • Add Operating System Call
  • Operating System
  • Wrap Storage System
  • Wrap Display System
  • Storage System
  • Display System
  • Digital Object
Technology Management
  • Application
  • Operating System
  • Storage System
  • Display System
  • Migrate Encoding Format
  • Digital Object
Specifying Levels of Abstraction
  • Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource
  • Can we abstract
    • Digital object
    • Storage
Technology Management
  • Application
  • Operating System
  • Storage System Abstraction
  • Display System Abstraction
  • Storage System
  • Display System
  • Digital Object Abstraction
  • Digital Object
Types of Digital Entity Abstractions
  • Logical representation
    • What does the digital entity represent?
    • What is the associated meaning?
  • Physical representation
    • What is the physical structure of the digital entity?
Levels of Abstraction for Bits
  • Abstraction for Digital Entity
    • Logical: I-nodes
    • Physical: Track / Sector
  • Digital Entity: Bit Stream
  • Abstraction for Repository
    • Logical: File Name
    • Physical: File System (NFS/AFS/NTFS)
  • Repository: Disk
Levels of Abstraction for Data
  • Abstraction for Digital Entity
    • Logical: Data Model (units, semantics)
    • Physical: Encoding Format (syntax, structure)
  • Digital Entity: Files
  • Abstraction for Repository
    • Logical: Name Space
    • Physical: Data Handling System - SRB/MCAT
  • Repository: File System, Archive
Levels of Abstraction for Information
  • Abstraction for Digital Entity
    • Logical: Collection Schema
    • Physical: XML Syntax
  • Digital Entity: Metadata Attributes
  • Abstraction for Repository
    • Logical: Database Schema
    • Physical: EMCAT/CWM
  • Repository: Database
Levels of Abstraction for Knowledge
  • Abstraction for Digital Entity
    • Logical: Relationship Schema
    • Physical: ER/UML/XMI/RDF syntax
  • Digital Entity: Concept Space (ontology instance)
  • Abstraction for Repository
    • Logical: Knowledge Repository Schema
    • Physical: Model-based Mediation System
  • Repository: Knowledge Repository
Information Management Projects
  • Digital Libraries
    • CDL - AMICO
    • DARPA/USPTO - patent digital library
    • NLM Visible Embryo digital library - GMU
    • NSF Digital Library Initiative, Phase II - UCSB, Stanford
    • NSF NPACI Digital Sky - Caltech 2MASS sky survey
    • NSF NSDL - UCAR / Columbia / Cornell / UCSB
  • Data Grids
    • DOE Data Visualization Corridor - LLNL
    • DOE Particle Physics Data Grid - Stanford, Caltech
    • NASA Information Power Grid - NASA Ames
    • NIH Biomedical Informatics Research Network
    • NSF Grid Physics Network - U Florida
    • NSF National Virtual Observatory - Johns Hopkins University / Caltech
    • NSF Southern California Earthquake Center - ISI
  • Persistent Archives
    • NARA Persistent Archive
    • NHPRC - Archivist workbench
Evolution of Data Management
  • Collection - managed data
    • Use database to organize attributes about data objects
    • Separate information management from data storage
    • Support APIs for information discovery, data access
  • Diagram: Database A, Storage, Storage Resource Broker
  • Integration accomplished through a data handling system which characterizes the storage systems
SDSC Storage Resource Broker & Meta-data Catalog
  • Application
    • C, C++, Linux I/O
    • Unix Shell
    • Java, NT Browsers
    • Prolog Predicate
    • Web
    • Third-party copy
  • SRB
    • MCAT: Dublin Core; Resource, User; User Defined; Application Meta-data
    • Remote Proxies: DataCutter
  • Databases: DB2, Oracle, Postgres
  • Archives: HPSS, ADSM, UniTree, DMF
  • File Systems: Unix, NT, Mac OSX
  • HRM
Evolution of Data Management
  • Distributed Data Collection
    • Same name space
    • Same schema
    • Separate administration domains
    • Heterogeneous database instances
  • Diagram: Database A, Database B, Storage Resource Broker
  • Integration requires the ability to characterize both the schemas and the table structures of each information repository
Distributed Data Collection
  • Logical organization of distributed digital objects into a collection
    • Access through federated servers
    • Collection-owned data implies that the server at each storage repository runs under a collection user ID
    • Collection attributes define a global namespace
    • Self-consistent attribute update on all data accesses
    • Support for multiple access APIs
    • Extensible support for access to any type of storage system (archive, file system, database)
    • Extensible collection attributes
Interoperability across Data and Information Repositories
  • Define a representation for storage that is independent of the implementation of the storage system
    • Unix file system semantics - Open/Close/Read/Write/Seek
  • Define a representation of a collection that is independent of the choice of database
    • schema, table structures
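A storage representation built on Unix file semantics can be sketched as a small driver interface with interchangeable backends. The classes below are hypothetical and are not the SRB driver interface; they only show client code using open, write, seek, read, and close without knowing which storage system is behind them.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Uniform open/read/write/seek/close over any backend (illustrative)."""

    @abstractmethod
    def open(self, name: str, mode: str):
        ...


class _MemFile:
    """File-like handle that saves its bytes back to the store on close."""

    def __init__(self, store, name, initial):
        self._store, self._name = store, name
        self._buf = io.BytesIO(initial)

    def read(self, n=-1):
        return self._buf.read(n)

    def write(self, data):
        return self._buf.write(data)

    def seek(self, pos):
        return self._buf.seek(pos)

    def close(self):
        self._store[self._name] = self._buf.getvalue()


class MemoryDriver(StorageDriver):
    def __init__(self):
        self.store = {}

    def open(self, name, mode):
        initial = b"" if "w" in mode else self.store[name]
        return _MemFile(self.store, name, initial)


class LocalFileDriver(StorageDriver):
    def __init__(self, root):
        self.root = root

    def open(self, name, mode):
        return open(os.path.join(self.root, name), mode)


def round_trip(driver: StorageDriver) -> bytes:
    """Client code sees only the file semantics, never the backend."""
    f = driver.open("obj.dat", "wb")
    f.write(b"digital object")
    f.close()
    f = driver.open("obj.dat", "rb")
    f.seek(8)
    data = f.read()
    f.close()
    return data


print(round_trip(MemoryDriver()))
print(round_trip(LocalFileDriver(tempfile.mkdtemp())))
```

Adding support for another storage system then means writing one more driver class, while `round_trip` and any other client code stay unchanged.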
Visible Embryo Project
  • Network diagram; site and equipment labels: AFIP: Collab WS, Image Generation, OHSU, Eolas, GST, NIC, UIC, BEN, Oakland, HSCC, WRL, JHU, GMU, SDSC Archive, MSWS, NT WS, ASX200, Disk Cache
  • Network and link labels: ATD Net, Startap, VBNS, Abilene, DC POP, Vegas, Los Angeles, DS3, OC-3, OC-12, 100 Gbit
Data Grids
  • Data Grid - linking multiple data collections
    • Separate name spaces
    • Separate schema
    • Separate administration domains
    • Heterogeneous database instances
  • Diagram: Database A, Data grid, Database B
  • The data grid is itself a collection that provides mechanisms to hide latency and manage semantics
National Virtual Observatory Data Grid
  • 1. Portals and Workbenches
  • 2. Knowledge & Resource Management
    • Bulk Data Analysis, Metadata View, Data View, Catalog Analysis
  • 3. Standard APIs and Protocols
    • Concept space
  • 4. Grid
    • Security, Caching, Replication, Backup, Scheduling
    • Information Discovery, Metadata Delivery, Data Discovery, Data Delivery
  • 5. Standard Metadata format, Data model, Wire format
  • 6. Catalog Mediator, Data Mediator
    • Catalog/Image Specific Access
    • Compute Resources, Catalogs, Data Archives, Derived Collections
  • 7.
Federated Digital Libraries
  • Virtual Data Grid - linking multiple data collections
    • Ability to execute processes to recreate derived data
  • Diagram: Database A (Services), Virtual Data Grid, Database B (Services)
  • The virtual data grid integrates data grid and digital library technology to manage processes

  • Portals & Clients
  • User Interfaces
  • NSDL Services, Other NSDL Services
    • Core Services: annotation, metadata normalizing
    • Core Collection-Building Services: metadata harvesting, persistent storage
    • CI Services: query transform, topic-map registry, personalization, discussion, visualization...
  • NSDL: Usage Enhancement, Delivery, Presentation, Aggregation - Channels
  • Core NSDL Bus: Meta-data delivery, Data delivery, Query, Global Ids, Security, Network
  • Metadata & data access-based services
  • Virtual Collections & Mediators
  • Collection Building
  • NSDL Collections, Referenced Items & Collections, Information about collections
Persistent Archive
  • Persistent archive
    • Describe archived data as collections
    • Describe processes used to create collections
    • Manage evolution of technology
  • Diagram: Database A (today), Virtual Data Grid, Database A (tomorrow)
  • The persistent archive is itself a virtual data grid that provides mechanisms to manage migration to new technology
Persistent Archives
  • Storage system abstraction
    • Logical name space and data manipulations
  • Information repository abstraction
    • Logical schema and physical table structure
  • Knowledge repository abstraction
    • Topic maps and inference rules
  • Digital object abstraction
    • Data model and encoding format
Persistent Collection
  • Define context for archiving data - annotate information content
  • Create archivable form - standard encoding format
  • Archive information content along with data
  • Test closure of the collection - all digital objects that can be discovered in the collection are members of the collection
  • Test completeness of the collection - inherent relationships within the collection can be cast in terms of attributes generated from the annotated information.
    • Differentiate between inherent knowledge and anomalies / artifacts
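One possible reading of the closure test is that every object reachable by following references inside the collection must itself be a member. The sketch below assumes that reading; the record layout and the "references" field are hypothetical.

```python
def closure_violations(collection):
    """Return referenced object IDs that are not members of the collection."""
    members = set(collection)
    missing = set()
    for record in collection.values():
        for ref in record.get("references", []):
            if ref not in members:
                missing.add(ref)
    return sorted(missing)


# Hypothetical collection: object ID -> annotated record.
collection = {
    "doc-1": {"title": "Survey plan", "references": ["doc-2"]},
    "doc-2": {"title": "Field notes", "references": ["doc-3", "doc-9"]},
    "doc-3": {"title": "Summary"},
}
print(closure_violations(collection))
```

An empty result means the collection is closed under its own references; any IDs returned identify objects that must be ingested, or references that must be annotated as external.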
Self-Instantiating Archive
  • Archive the processes that are used to control the ingestion process
    • Conversion to archivable form
    • Annotation of information content
  • When accessing the collection, retrieve the processes and the original digital objects
    • Apply the processing steps to re-create the information content
    • Query the result to discover desired digital objects
  • A self-instantiating archive is a virtual data grid
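The retrieve-and-reapply cycle described above can be sketched as follows. This is a minimal sketch under strong assumptions: processing steps are stored by name and looked up in a hypothetical in-memory registry, whereas a real archive would have to preserve the processes themselves. All step names and sample records are invented.

```python
import json

# Hypothetical registry of processing steps, keyed by a stable name.
STEPS = {
    "decode_utf8": lambda raw: raw.decode("utf-8"),
    "parse_csv_line": lambda text: text.strip().split(","),
    "annotate": lambda fields: {"name": fields[0], "year": int(fields[1])},
}


def ingest(raw_objects, pipeline):
    """Archive the original bits together with the ordered step names."""
    return {
        "pipeline": list(pipeline),
        "objects": [raw.hex() for raw in raw_objects],
    }


def instantiate(archive):
    """Re-create the information content by re-applying the archived steps."""
    records = []
    for hexed in archive["objects"]:
        value = bytes.fromhex(hexed)
        for step in archive["pipeline"]:
            value = STEPS[step](value)
        records.append(value)
    return records


stored = json.dumps(
    ingest(
        [b"embryo,1999\n", b"sky,2001\n"],
        ["decode_utf8", "parse_csv_line", "annotate"],
    )
)

# Later access: load the archive, rebuild the collection, then query it.
collection = instantiate(json.loads(stored))
hits = [r for r in collection if r["year"] > 2000]
print(hits)
```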
Data Management Systems
  • Distributed data collections
    • Single name space
    • Distributed data storage systems
  • Data Grid - integration of multiple data collections
    • Each collection has a separate name space
    • Infrastructure that interconnects the collections can use its own name space, containers, replication
  • Virtual Data Grids - federation of digital libraries
    • In addition, support interoperability between services for manipulation, presentation, discovery of digital objects
  • Persistent archive
    • In addition, manage evolution of technology components
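The difference between a single collection name space and a data grid that spans several can be sketched with a grid-level name space layered over independent collections. The class names, prefixes, and locations are hypothetical; containers and replication are left out.

```python
class Collection:
    """One collection with its own logical name space."""

    def __init__(self, names):
        self.names = dict(names)   # local logical name -> storage location

    def lookup(self, local_name):
        return self.names.get(local_name)


class DataGrid:
    """Grid-level name space layered over separate collection name spaces."""

    def __init__(self):
        self.collections = {}

    def attach(self, prefix, collection):
        self.collections[prefix] = collection

    def lookup(self, grid_name):
        # The first path component selects the collection; the rest is
        # resolved inside that collection's own name space.
        prefix, _, local = grid_name.lstrip("/").partition("/")
        collection = self.collections.get(prefix)
        return collection.lookup("/" + local) if collection else None


grid = DataGrid()
grid.attach("survey_a", Collection({"/images/001": "archive-1:/a/001"}))
grid.attach("survey_b", Collection({"/images/001": "disk-7:/b/001"}))

# The same local name exists in both collections without conflict.
print(grid.lookup("/survey_a/images/001"))
print(grid.lookup("/survey_b/images/001"))
```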
Differentiating between Data, Information, and Knowledge
  • Data
    • Digital object
    • Objects are streams of bits
  • Information
    • Any tagged data, which is treated as an attribute.
    • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
  • Knowledge
    • Relationships between attributes
    • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
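The three levels can be illustrated with one small example: raw bytes as data, tagged values as information, and a relationship between two attributes as knowledge. The attribute names, values, and the `Relationship` type are hypothetical.

```python
from dataclasses import dataclass

# Data: an uninterpreted stream of bits.
data = bytes([0x47, 0x49, 0x46, 0x38])

# Information: tagged data treated as attributes.
information = {"captured": "2001-03-04", "processed": "2001-03-09"}


@dataclass(frozen=True)
class Relationship:
    kind: str        # "logical", "temporal", "spatial", or "functional"
    left: str        # attribute name
    relation: str
    right: str       # attribute name


# Knowledge: relationships between attributes.
knowledge = [Relationship("temporal", "captured", "precedes", "processed")]


def holds(rel, attrs):
    """Check a 'precedes' relationship; ISO dates compare correctly as strings."""
    if rel.relation == "precedes":
        return attrs[rel.left] < attrs[rel.right]
    raise ValueError(f"unsupported relation: {rel.relation}")


print(all(holds(r, information) for r in knowledge))
```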
Knowledge Management
  • Must manage semantic relationships between the multiple name spaces
    • Data Grid
  • Must manage procedural relationships between digital library services
    • Federated digital library
  • Must manage structural relationships between different archivable forms - encoding formats
    • Persistent archive
Types of Knowledge Relationships
  • Logical / semantic
    • Digital Library cross-walks
  • Temporal / procedural
    • Workflow systems
  • Spatial / structural
    • GIS systems
  • Functional / algorithmic
    • Scientific feature analysis
Knowledge Based Data Grids
  • Columns: Ingest Services, Management, Access Services
  • Knowledge
    • Relationships Between Concepts
    • Knowledge Repository for Rules
    • Knowledge or Topic-Based Query / Browse
  • Information
    • Attributes, Semantics
    • Information Repository
    • Attribute-based Query
  • Data
    • Fields, Containers, Folders
    • Storage (Replicas, Persistent IDs)
    • Feature-based Query
  • Interface labels: XTM DTD, Rules - KQL, (Model-based Access), XML DTD, SDLIP, (Data Handling System - SRB), MCAT/HDF, Grids
Further Information

http://www.npaci.edu/DICE