Persistent Management of Distributed Data
This presentation is the property of its rightful owner.
Sponsored Links
1 / 43

Persistent Management of Distributed Data Reagan W. Moore University of California, San Diego San Diego Supercomputer Center [email protected] PowerPoint PPT Presentation


  • 74 Views
  • Uploaded on
  • Presentation posted in: General

Persistent Management of Distributed Data Reagan W. Moore University of California, San Diego San Diego Supercomputer Center [email protected] http://www.npaci.edu/DICE/. Staff Reagan Moore Ilkai Altintas Chaitan Baru Sheau Yen Chen Charles Cowart Amarnath Gupta George Kremenek

Download Presentation

Persistent Management of Distributed Data Reagan W. Moore University of California, San Diego San Diego Supercomputer Center [email protected]

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

Persistent Management of Distributed Data

Reagan W. Moore

University of California, San Diego

San Diego Supercomputer Center

[email protected]

http://www.npaci.edu/DICE/


Data and knowledge systems group

Staff

Reagan Moore

Ilkai Altintas

Chaitan Baru

Sheau Yen Chen

Charles Cowart

Amarnath Gupta

George Kremenek

Bertram Ludäscher

Richard Marciano

XuFei Qian

Roman Olshanowsky

Arcot Rajasekar

Abe Singer

Michael Wan

Ilya Zaslavsky

Bing Zhu

Graduate Students

A. Bagchi

S. Bansal

A. Behere

R. Bharath

S. Bharath

M. Kulrul

L. Sui

Undergraduate Interns

N. Cotofana

M. Shumaker

J. Trang

L. Yin

+/- NN

Data and Knowledge Systems Group


Information management projects

Information Management Projects

  • Digital Libraries

    • California Digital Library - Art Museum Image COnsortium

    • DARPA/USPTO Patent digital library - SAIC, NCSA, U Virginia

    • NLM Visible Embryo digital library - GMU, OHSI, UC, LSU, JHU

    • NSF Digital Library Initiative, Phase II - UCSB, Stanford

    • NSF NPACI Digital Sky - Caltech

    • NSF National Science Education Digital Library - UCAR, Cornell, UCSB, U Mass, Columbia

  • Data Grid Environments

    • DOE Data Visualization Corridor - LLNL, OSU

    • DOE Particle Physics Data Grid - Stanford, Caltech

    • NASA Information Power Grid - NASA Ames, USC/ISI, U Texas

    • NIH Biomedical Informatics Research Network - Duke, Harvard

    • NSF Grid Physics Network - U Florida, Caltech, U Wisc, USC/ISI, U Chicago

    • NSF National Virtual Observatory - JHU, Caltech

    • NSF Southern California Earthquake Center - USC/ISI

  • Persistent Archives

    • NARA Persistent Archive - UCB, U Maryland

    • NHPRC Archivist workbench - U Minnesota

    • NSF NSDL Persistent archive for curricula modules - Cornell


Topics

Topics

  • Data management systems

    • Data collections, digital libraries

  • Distributed data management

    • Data grids

  • Persistent data management

    • Persistent archives

  • Common infrastructure for data management


Data collections

Data Collections

  • Define the context for describing a collection of digital entities

    • Context specified by metadata attributes

    • Provenance, origin of the digital entities

    • Administrative, location of the digital entities

    • Technical, purpose of the digital entities

  • Support organization of attributes as hierarchy of sub-collections


Digital libraries

Digital Libraries

  • Provide services on the data collection

    • Ingestion, loading of attribute values

    • Extensibility, definition of new attributes

    • Discovery, queries on attributes

    • Browsing, hierarchical listing

    • Presentation, formatting specified data models


Data grids

Data Grids

  • Manage data in a distributed environment

    • Logical name space, provide global identifier

    • Data access, storage system abstraction

    • Replication, disaster back up

    • Uniform access, common API across file systems, archives, and databases

    • Single sign-on, authenticate across administration domains


Persistent archives

Persistent Archives

  • Manage technology evolution

    • Storage system abstraction, support data migration across storage systems

    • Information repository abstraction, support catalog migration to new databases

    • Logical name space, support global persistent identifier


Storage resource broker

Storage Resource Broker

  • Integration of collection-based management of digital entities, with

    • Remote data access through storage system abstraction

    • Catalog access through information repository abstraction

    • Automation through collection-owned data


Capabilities

Capabilities

  • Support legacy systems

  • Integrate archives with file systems

  • Share distributed data

  • Maintain persistent collection

  • Control data access


Uniform api

Uniform API

  • Provide common access semantics

  • Map from the interface preferred by your application to the interfaces required by legacy storage systems


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

C, C++,

Libraries

Unix

Shell

Databases

DB2, Oracle,

Postgres

Archives

HPSS, ADSM,

UniTree, DMF

File Systems

Unix, NT,

Mac OSX

SDSC Storage Resource Broker & Meta-data Catalog

Common APIs

Application

Linux

I/O

Web

WSDL

Access

APIs

DLL /

Python

Java, NT

Browsers

GridFTP

Consistency Management / Authorization-Authentication

Prime

Server

Logical Name

Space

Latency

Management

Data

Transport

Metadata

Transport

Catalog Abstraction

Storage Abstraction

Databases

DB2, Oracle, Sybase

Servers

HRM


Discovery transparencies

Discovery Transparencies

  • Naming transparency - find a data set without knowing its name

    • Map from attributes to a global file name

  • Location transparency - access a data set without knowing where it is

    • Map from global file name to local file name

  • Access transparency - access a data set without knowing the type of storage system

    • Federated client-server architecture


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

C, C++,

Libraries

Unix

Shell

Databases

DB2, Oracle,

Postgres

Archives

HPSS, ADSM,

UniTree, DMF

File Systems

Unix, NT,

Mac OSX

SDSC Storage Resource Broker & Meta-data Catalog

Transparencies

Application

Linux

I/O

Web

WSDL

Access

APIs

DLL /

Python

Java, NT

Browsers

GridFTP

Consistency Management / Authorization-Authentication

Prime

Server

Logical Name

Space

Latency

Management

Data

Transport

Metadata

Transport

Catalog Abstraction

Storage Abstraction

Databases

DB2, Oracle, Sybase

Servers

HRM


Persistent collection

Persistent Collection

  • Maintain authenticity

    • Authenticate all accesses

    • Assign roles for access control lists (curation, write, annotate, read)

    • Manage audit trails of all operations

  • Collection-owned data

    • All accesses through the data management system


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

C, C++,

Libraries

Unix

Shell

Databases

DB2, Oracle,

Postgres

Archives

HPSS, ADSM,

UniTree, DMF

File Systems

Unix, NT,

Mac OSX

SDSC Storage Resource Broker & Meta-data Catalog

Persistency

Application

Linux

I/O

Web

WSDL

Access

APIs

DLL /

Python

Java, NT

Browsers

GridFTP

Consistency Management / Authorization-Authentication

Prime

Server

Logical Name

Space

Latency

Management

Data

Transport

Metadata

Transport

Catalog Abstraction

Storage Abstraction

Databases

DB2, Oracle, Sybase

Servers

HRM


Preservation similar requirements to a data grid

Preservation(Similar requirements to a data grid)

  • Name transparency

    • Find a file by attributes (map from attributes to global name)

  • Location transparency

    • Access a file by a global identifier (map from global to local file name)

  • Access transparency

    • Use same API to access data in archive or file cache

  • Authenticity

    • Disaster recovery, replicate data across storage systems

    • Audit and process management


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

C, C++,

Libraries

Unix

Shell

Databases

DB2, Oracle,

Postgres

Archives

HPSS, ADSM,

UniTree, DMF

File Systems

Unix, NT,

Mac OSX

SDSC Storage Resource Broker & Meta-data Catalog

Preservation

Application

Linux

I/O

Web

WSDL

Access

APIs

DLL /

Python

Java, NT

Browsers

GridFTP

Consistency Management / Authorization-Authentication

Prime

Server

Logical Name

Space

Latency

Management

Data

Transport

Metadata

Transport

Catalog Abstraction

Storage Abstraction

Databases

DB2, Oracle, Sybase

Servers

HRM


Convergence of technologies

Convergence of Technologies

  • Data grids as basis for distributed data management

    • Federation of distributed resources

    • Creation of logical name space to automate discovery

  • Distributed data collections

    • Discovery based on attributes

    • Distributed data storage systems

  • Digital libraries

    • Development of services for manipulating, viewing data

  • Persistent archives

    • Management of technology evolution


Digital entities

Digital Entities

  • Digital entities are “images of reality”, made of

    • Data, the bits (zeros and ones) put on a storage system

    • Information, the attributes used to assign semantic meaning to the data

    • Knowledge, the structural relationships described by a data model

  • Every digital entity requires information and knowledge to correctly interpret and display


Differentiating between data information and knowledge

Differentiating between Data, Information, and Knowledge

  • Data

    • Digital object

    • Objects are streams of bits

  • Information

    • Any tagged data, which is treated as an attribute.

    • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object

  • Knowledge

    • Relationships between attributes

    • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional


Knowledge creation roadmap

Knowledge Creation Roadmap

  • Knowledge syntax (consensus)

    • RDF, XMI, Topic Map

  • Knowledge management (recursive operations)

    • Oracle parallel database

  • Knowledge manipulation (spatial/procedural rules)

    • Generation of inference rules and mapping to data models

  • Knowledge generation (scalable inference engine)

    • Application of inference rules in inference engine


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

Knowledge Based Data Grid Roadmap

Ingest

Services

Management

Access

Services

Relationships

Between

Concepts

Knowledge

Repository for

Rules

Knowledge or

Topic-Based

Query / Browse

Knowledge

XTM DTD

  • Rules - KQL

(Model-based Access)

Information

Repository

Attribute- based

Query

Attributes

Semantics

SDLIP

XML DTD

Information

(Data Handling System - SRB)

Data

Fields

Containers

Folders

Storage

(Replicas,

Persistent IDs)

MCAT/HDF

Grids

Feature-based

Query


Information management projects1

Information Management Projects

  • Digital Libraries

    • California Digital Library - Art Museum Image COnsortium

    • DARPA/USPTO Patent digital library - SAIC, NCSA, U Virginia

    • NLM Visible Embryo digital library - George Mason University

    • NSF Digital Library Initiative, Phase II - UCSB, Stanford

    • NSF NPACI Digital Sky - Caltech 2MASS sky survey

  • Data Grid Environments

    • DOE Data Visualization Corridor - LLNL

    • DOE Particle Physics Data Grid - Stanford, Caltech

    • NASA Information Power Grid - NASA Ames

    • NIH Biomedical Informatics Research Network - Duke, Harvard

    • NSF Grid Physics Network - U Florida

    • NSF National Virtual Observatory - Johns Hopkins University, Caltech

    • NSF Southern California Earthquake Center - USC/ISI

  • Persistent Archives

    • NARA Persistent Archive - UCB, U Maryland

    • NHPRC Archivist workbench - U Minnesota

    • NSF NSDL Persistent archive for curricula modules - Cornell University


Additional projects

Additional Projects

  • ACS Alliance for Cell Signaling

  • NSF Digital Government - Fed Web

  • DOE - SciDAC Grid Portal

  • DOE - SciDAC Scientific Data Management Project

  • Hayden Planetarium


Nasa information power grid

NASA Information Power Grid

  • Develop digital library interface to the archives at NASA Ames - SRB

  • Demonstrate high performance data access across both NASA and NSF resources

  • Demonstrate telescience through NASA resources (link electron microscope at UCSD, with image collection at NASA Ames, and processing of data at NASA Ames)


Hayden planetarium

Hayden Planetarium

  • Provide a data collaboration environment

  • Share data between NCSA (simulations of the solar system evolution), SDSC (3D visualizations), and Hayden (review)

  • Manage 3-6 TBs of data

  • Provide seamless access across administration domains and storage resources


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

The SRB is great. It has been utterly essential in this project - we

could not have done this work without the SRB. We have used the SRB

as a central shared repository for raw data, derived data, and

visualization results. Data has been submitted by, and retrieved by

each of the partners in this project on a daily basis. Email back

and forth frequently includes SRB directory names into which data or

results have been stored for broad review at different sites. The

visualization animations we've produced are immediately placed into

the SRB where they are downloaded, simultaneously, by people at NCSA

and the museum in New York. The animations are reviewed, new data

generated, email exchanged, and new images rendered and put back into

the SRB to start the next review cycle. The whole thing has worked

flawlessly. I am delighted and will gladly promote its virtues at

any opportunity.

As you've seen, I do have some suggestions for future functionality

here and there. The "migration awkwardness" is one. But these

suggestions are for added features or minor interface smoothing.

They should in no way diminish the fact that it all works wonderfully!

Thanks!

-- Dave


Visible embryo project

Visible Embryo Project

  • Build a digital library of images, reports for use by educators and physicians

  • Manage transfer of images from Armed Forces Institute of Pathology to an archive at SDSC

  • Provide access to the material

  • Use the SRB/MCAT system to assemble a digital library


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

Visible Embryo Project

Disk

Cache

AFIP:

Collab WS

Image

Generation

OHSU

Eolas

GST

ATD Net

NIC

Disk

Cache

UIC

Startap

ASX200

BEN

MSWS

NT WS

MSWS

NT WS

Oakland

HSCC

WRL

100

Gbit

Vegas

OC-3

JHU

Disk

Cache

DS3

Los

Angeles

VBNS

OC-12

Abilene

OC-3

GMU

Disk

Cache

DC POP

OC-3

Abilene

OC-3

SDSC

Archive


National virtual observatory

National Virtual Observatory

  • Federate existing sky survey image collections and catalogs

  • Support statistical analyses across multiple surveys

  • Implement services support environment

  • Use the SRB/MCAT to support bulk data access, replicate the major sky surveys, support large scale database record analyses


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

National Virtual Observatory

Data Grid

1. Portals and Workbenches

2.Knowledge & Resource

Management

Bulk Data

Analysis

Metadata

View

Data

View

Catalog

Analysis

3.

Standard APIs and Protocols

Concept space

4.Grid

Security

Caching

Replication

Backup

Scheduling

Information

Discovery

Metadata

delivery

Data

Discovery

Data

Delivery

5.

Standard Metadata format, Data model, Wire format

6.

Catalog Mediator

Data mediator

Catalog/Image Specific Access

Compute Resources

Catalogs

Data Archives

Derived Collections

7.


Particle physics data grid

Particle Physics Data Grid

  • Support replication of data sets for the high energy physics grid

  • Federate data collections for the BaBar experiment at SLAC using the SRB

  • Federate BaBar collections between SLAC and Lyons, France

  • Support web service interface, support derived data product catalog


Particle physics data grid replication system

S-Commands

S-Commands

Wisc

Client 2

SRB Server

@Wisc

Wisc

Client 1

SRB Server @LBL

SRB Server @LBL

file caching

esrb.driver

esrb.driver

IPC

IPC

Stage() purge() fileStatus()

file purging

File caching request

Stage() purge() fileStatus()

FC

HRM

HPSS

esrb.server

Stage() purge() fileStatus()

Disk cache

Particle Physics Data Grid - Replication System


National stem education digital library nsdl

National STEM Education Digital Library - NSDL

  • Provide persistent archive for educational material indexed in the NSDL repository

  • Develop knowledge spaces to characterize collection holdings

  • Map knowledge spaces to AAAS2061 concept space, and to state mandated grade level concepts


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

Portals &

Clients

Portals &

Clients

Portals &

Clients

NSDL

Services

NSDL

Services

Other NSDL

Services

NSDL

Collections

NSDL

Collections

NSDL

Collections

Core Services:

annotation

CI Services

query transform

CI Services

topic-map registry

referenced

items &

collections

Core Services:

metadata normalizing

CI Services

personalization

referenced

items &

collections

Referenced

Items &

Collections

Core Collection-

Building Services

metadata harvesting

CI Services

discussion

Core Collection-

Building Services

persistent storage

CI Services

visualization...

User Interfaces

NSDL

Usage Enhancement

Delivery

Presentation

Aggregation - Channels

Information

about collections

Core NSDL Bus

Meta-data delivery

Data delivery

Query

Global Ids

Security

Network

Metadata & data

access-based

services

Virtual

Collections &

Mediators

Collection Building


National archives records administration

National Archives Records Administration

  • Develop prototype persistent archive for NARA digital holdings

  • Identify pertinent research areas for long term preservation of data, information, and knowledge

  • Apply data grid technology for the implementation of persistent archives.


Grid physics network

Grid Physics Network

  • Develop infrastructure to support virtual data products

  • Create repository for derived data products

    • Automate the extraction of metadata from Virtual Data Language files

    • Automate extraction of administrative data through grid portals


Gridport srb architecture

GridPort + SRB Architecture

  • With SRB capabilities, file access is direct, uniform

  • Uses same authentication as portal and other Grid services

  • Single SRB account access allows for more flexible data management


Doe scientific data management

DOE Scientific Data Management

  • Develop knowledge management tools to mediate between biological information resources

  • Integrate the tools into the DOE scientific data management environment


Persistent management of distributed data reagan w moore university of california san diego san diego supercomputer

Scientific Data Management ISIC

Petabytes

Petabytes

Scientific

Simulations

& experiments

  • DOE Labs: ANL, LBNL, LLNL, ORNL

  • Universities: GTech, NCSU, NWU, SDSC

Tapes

Disks

Tapes

Disks

Terabytes

Terabytes

  • Climate Modeling

  • Astrophysics

  • Genomics and Proteomics

  • High Energy Physics

SDM-ISIC Technology

  • Optimizing shared access from mass storage systems

  • Metadata and knowledge- based federations

  • API for Grid I/O

  • High-dimensional cluster analysis

  • High-dimensional indexing

  • Adaptive file caching

  • Agents …

Data

Manipulation:

Data

Manipulation:

~20%

time

  • Using SDM-ISIC technology

  • Getting files from Tape archive

  • Extracting subset of data from files

  • Reformatting data

  • Getting data from heterogeneous, distributed systems

  • moving data over the network

~80%

time

Scientific

Analysis

& Discovery

~80%

time

Goals

  • Optimize and simplify:

  • access to very large datasets

  • access to distributed data

  • access of heterogeneous data

  • data mining of very large datasets

Scientific

Analysis

& Discovery

~20%

time

Current

Goal


Further information

Further Information

http://www.npaci.edu/DICE


  • Login