
Modern Data Management Overview

Storage Resource Broker

Reagan W. Moore

moore@sdsc.edu

http://www.sdsc.edu/srb



Topics

  • Data management evolution

    • Shared collections

    • Digital Libraries

    • Persistent Archives

  • Building shared collections

    • Project level / National level / International

  • Demonstration of shared collections

    • Access to collections at SDSC



Types of Data Management

  • File system (AFS)

    • Provides caching at remote sites, uses single authentication system

  • Backup system (Veritas)

    • Provides time-based copies of data, tools for re-loading backups

  • Database system (Oracle 10g IFS)

    • Can link metadata to files on an Internet File System

  • Archive system (HPSS)

    • Manages data stored on tape, supports parallel I/O streams

  • Persistent object environment (Avaki)

    • Provides vaults for storing objects

  • Globus toolkit

    • Provides differentiated services for building a data grid



Data Management Environments

  • Data grids

    • Manage shared collections

  • Digital libraries

    • Provide discovery, browsing, presentation services on top of collections

  • Persistent archives

    • Manage technology evolution while the authenticity and integrity of the assembled collection is preserved

  • Real-time sensor networks

    • Manage access to real-time data streams from thousands of sensors



Generic Infrastructure

  • Can a single system provide all of the features needed to implement each type of data management system, while supporting access across administrative domains and managing data stored in multiple types of storage systems?

  • Answer is data grid technology



Types of Data Management

  • File system (AFS)

    • Data grid manages replication, parallel I/O, containers

  • Backup system (Veritas)

    • Data grid supports replicas, versions, and snapshots of files and containers

  • Database system (Oracle 10g IFS)

    • Data grid virtualizes catalogs - schema extension, bulk metadata load

  • Archive system (HPSS)

    • Data grid integrates access across archives and file systems

  • Persistent object environment (Avaki)

    • Data grid manages user-defined metadata and collection hierarchy

  • Globus toolkit - set of differentiated services

    • Data grid manages consistent state information



Shared Collections

  • Purpose of SRB data grid is to enable the creation of a collection that is shared between academic institutions

    • Register digital entity into the shared collection

    • Assign owner, access controls

    • Assign descriptive, provenance metadata

    • Manage state information

      • Audit trails, versions, replicas, backups, locks

      • Size, checksum, validation date, synchronization date, …

    • Manage interactions with storage systems

      • Unix file systems, Windows file systems, tape archives, …

    • Manage interactions with preferred access mechanisms

      • Web browser, Java, WSDL, C library, …
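The registration steps above can be sketched in a few lines of Python. This is only an illustrative model, not the SRB API: the class names `DigitalEntity` and `SharedCollection`, the permission strings, and the choice of SHA-256 as the checksum are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DigitalEntity:
    """One registered item in a shared collection (hypothetical model)."""
    logical_name: str
    owner: str
    size: int
    checksum: str
    acl: dict = field(default_factory=dict)          # user -> "read" | "write"
    metadata: dict = field(default_factory=dict)     # descriptive / provenance
    audit_trail: list = field(default_factory=list)  # (time, user, action)


class SharedCollection:
    def __init__(self):
        self.entities = {}

    def _audit(self, entity, user, action):
        entity.audit_trail.append((datetime.now(timezone.utc), user, action))

    def register(self, logical_name, owner, content, metadata=None):
        """Register content under a logical name and record state information."""
        entity = DigitalEntity(
            logical_name=logical_name,
            owner=owner,
            size=len(content),
            checksum=hashlib.sha256(content).hexdigest(),
            acl={owner: "write"},
            metadata=dict(metadata or {}),
        )
        self._audit(entity, owner, "register")
        self.entities[logical_name] = entity
        return entity

    def grant(self, logical_name, user, permission):
        """Assign an access control on behalf of the owner."""
        entity = self.entities[logical_name]
        entity.acl[user] = permission
        self._audit(entity, entity.owner, f"grant {permission} to {user}")

    def validate(self, logical_name, content):
        """Return True if content still matches the recorded size and checksum."""
        entity = self.entities[logical_name]
        return (len(content) == entity.size
                and hashlib.sha256(content).hexdigest() == entity.checksum)
```

A collection built this way keeps ownership, access controls, metadata, and state information with the logical name rather than with any one storage system, which is the point the slide makes.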


Federated Server Architecture

[Figure: federated SRB servers with numbered steps 1-6. Diagram labels: Peer-to-peer Brokering; Read Application; Parallel Data Access; Logical Name or Attribute Condition; SRB server (two); SRB agent (two); Server(s) Spawning; MCAT; storage resources R1 and R2; Data Access.]

MCAT

  • Logical-to-physical mapping

  • Identification of replicas

  • Access & audit control
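The three catalog functions named on this slide (logical-to-physical mapping, identification of replicas, access and audit control) can be illustrated with a toy lookup. The `Catalog` class below is a hypothetical stand-in; it does not reproduce the MCAT schema or the SRB server protocol, and the order in which it performs the checks is an assumption.

```python
class AccessDenied(Exception):
    """Raised when a user has no read permission on a logical name."""


class Catalog:
    """Toy stand-in for a metadata catalog; not the MCAT schema."""

    def __init__(self):
        self.replicas = {}  # logical name -> list of (resource, physical path)
        self.readers = {}   # logical name -> set of users allowed to read
        self.audit = []     # (user, logical name, outcome)

    def add_replica(self, logical_name, resource, physical_path):
        self.replicas.setdefault(logical_name, []).append((resource, physical_path))

    def allow(self, logical_name, user):
        self.readers.setdefault(logical_name, set()).add(user)

    def resolve(self, user, logical_name, available):
        """Return (resource, path) for the first replica on an available resource."""
        # Access control, with every decision written to the audit trail.
        if user not in self.readers.get(logical_name, set()):
            self.audit.append((user, logical_name, "denied"))
            raise AccessDenied(logical_name)
        # Replica identification plus logical-to-physical mapping.
        for resource, path in self.replicas.get(logical_name, []):
            if resource in available:
                self.audit.append((user, logical_name, f"read from {resource}"))
                return resource, path
        self.audit.append((user, logical_name, "no replica available"))
        raise LookupError(f"no reachable replica for {logical_name}")
```

With replicas registered on R1 and R2, a request for the logical name still succeeds when R1 is unavailable, because the catalog falls through to the copy on R2.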



Generic Infrastructure

  • Digital libraries now build upon data grids to manage distributed collections

    • DSpace digital library - MIT and Hewlett-Packard

    • Fedora digital library - Cornell University and University of Virginia

  • Persistent archives build upon data grids to manage technology evolution

    • NARA research prototype persistent archive

    • California Digital Library - Digital Preservation Repository

    • NSF National Science Digital Library persistent archive


Southern California Earthquake Center

[Figure: Select Scenario (Fault Model, Source Model); Select Receiver (Lat/Lon); Output Time History Seismograms]

SCEC Community Library

  • Intuitive User Interface

    • Pull-Down Query Menus

    • Graphical Selection of Source Model

    • Clickable LA Basin Map (Olsen)

    • Seismogram/History extraction (Olsen)

  • Access SCEC Digital Library

    • Data stored in a data grid

    • Annotated by modelers

    • Standard naming convention

    • Automated extraction of selected data and metadata

    • Management of visualizations




TeraShake Data Handling

  • Simulate a magnitude 7.7 earthquake on the San Andreas fault

    • 50 Terabytes in a simulation

    • Move 10 Terabytes per day

  • Post-Processing of wave field

    • Movies of seismic wave propagation

    • Seismogram formatting for interactive on-line analysis

    • Velocity magnitude

    • Displacement vector field

    • Cumulative peak maps

    • Statistics used in visualizations

    • Register derived data products into SCEC digital library
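The two data volumes quoted at the top of this slide (50 terabytes per simulation, 10 terabytes moved per day) imply a transfer time and a sustained rate. The short calculation below works them out, assuming decimal terabytes (10^12 bytes) and a transfer that runs around the clock; neither assumption is stated in the slides.

```python
TB = 10**12  # decimal terabyte (assumption; the slides do not say which unit)

simulation_bytes = 50 * TB   # output of one simulation
daily_bytes = 10 * TB        # volume moved per day
seconds_per_day = 24 * 60 * 60

days_to_move = simulation_bytes / daily_bytes
sustained_mb_per_s = daily_bytes / seconds_per_day / 10**6

print(f"{days_to_move:.0f} days at {sustained_mb_per_s:.1f} MB/s sustained")
# prints "5 days at 115.7 MB/s sustained"
```

Under these assumptions, moving one simulation's output takes five days of continuous transfer at roughly 116 MB/s.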


ROADNet Sensor Network Data Integration

[Figure: sensor data panels. Seismic (Geophysics); Humidity (Climate, Ecological, Wireless, Oceanography); Wind Speed (Climate, Ecological, Wireless, Oceanography). Annotations: Rain start, Fire start.]

Frank Vernon - UCSD/SIO


Chile, June 13, 2005

Mw 7.9

Frank Vernon - UCSD/SIO



National Science Digital Library

  • URLs for educational material for all grade levels registered into repository at Cornell

  • SDSC crawls the URLs, registers the web pages into an SRB data grid, and builds a persistent archive

    • 750,000 URLs

    • 13 million web pages

    • About 3 TBs of data



Worldwide University Network Data Grid

  • SDSC

  • Manchester

  • Southampton

  • White Rose

  • NCSA

  • U. Bergen

  • A functioning, general purpose international Data Grid for academic collaborations

Manchester-SDSC mirror



KEK Data Grid

  • Japan

  • Taiwan

  • South Korea

  • Australia

  • Poland

  • US

  • A functioning, general purpose international Data Grid for high-energy physics




BaBar High-energy Physics

  • Stanford Linear Accelerator

  • Lyon, France

  • Rome, Italy

  • San Diego

  • RAL, UK

  • A functioning international Data Grid for high-energy physics


Moved over 100 TBs of data



Astronomy Data Grid

  • Chile

  • Tucson, Arizona

  • NCSA, Illinois

  • A functioning international Data Grid for Astronomy


Moved over 400,000 images



International Institutions (2005)


Storage Resource Broker 3.3.1

[Figure: SRB 3.3.1 layered architecture, listed here from the application layer down to the storage layer]

  • Application

    • C Library, Java

    • Unix Shell

    • DLL / Python, Perl, Windows

    • Linux I/O, C++

    • NT Browser, Kepler Actors

    • HTTP, DSpace, OpenDAP, GridFTP

    • OAI, WSDL, (WSRF)

  • Federation Management

  • Consistency & Metadata Management / Authorization, Authentication, Audit

  • Logical Name Space

  • Latency Management

  • Data Transport

  • Metadata Transport

  • Database Abstraction

    • Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix

  • Storage Repository Abstraction

    • Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix

    • File Systems - Unix, NT, Mac OSX

    • Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS

    • ORB



SRB Objectives

  • Automate all aspects of data discovery, access, management, analysis, preservation

    • Security paramount

    • Distributed data

  • Provide distributed data support for

    • Data sharing - data grids

    • Data publication - digital libraries

    • Data preservation - persistent archives

    • Data collections - real-time sensor data



SRB Developers

Reagan Moore - PI

Michael Wan - SRB Architect

Arcot Rajasekar - SRB Manager

Wayne Schroeder - SRB Productization

Charlie Cowart - inQ

Lucas Gilbert - Jargon

Bing Zhu - Perl, Python, Windows

Antoine de Torcy - mySRB web browser

Sheau-Yen Chen - SRB Administration

George Kremenek - SRB Collections

Arun Jagatheesan - Matrix workflow

Marcio Faerman - SCEC Application

Sifang Lu - ROADNet Application

Richard Marciano - SALT persistent archives

75 FTE-years of support

About 300,000 lines of C



History

  • 1995 - DARPA Massive Data Analysis Systems

  • 1997 - DARPA/USPTO Distributed Object Computation Testbed

  • 1998 - NSF National Partnership for Advanced Computational Infrastructure

  • 1998 - DOE Accelerated Strategic Computing Initiative data grid

  • 1999 - NARA persistent archive

  • 2000 - NASA Information Power Grid

  • 2001 - NLM Digital Embryo digital library

  • 2001 - DOE Particle Physics data grid

  • 2001 - NSF Grid Physics Network data grid

  • 2001 - NSF National Virtual Observatory data grid

  • 2002 - NSF National Science Digital Library persistent archive

  • 2003 - NSF Southern California Earthquake Center digital library

  • 2003 - NIH Biomedical Informatics Research Network data grid

  • 2003 - NSF Real-time Observatories, Applications, and Data management Network

  • 2004 - NSF ITR, Constraint based data systems

  • 2005 - LC Digital Preservation Lifecycle Management

  • 2005 - LC National Digital Information Infrastructure and Preservation program



Development

  • SRB 1.1.8 - December 15, 2000

    • Basic distributed data management system

    • Metadata Catalog

  • SRB 2.0 - February 18, 2003

    • Parallel I/O support

    • Bulk operations

  • SRB 3.0 - August 30, 2003

    • Federation of data grids

  • SRB 3.3.1 - April 6, 2005

    • Feature requests (extensible schema)


SRB Latency Management

[Figure: latency management mechanisms arranged along the path Source, Network, Destination]

  • Remote proxies, staging

  • Data aggregation

  • Containers

  • Prefetch

  • Caching

  • Client-initiated I/O

  • Streaming

  • Parallel I/O

  • Replication

  • Server-initiated I/O



Latency Management - Bulk Operations

  • Bulk register

    • Create a logical name for a file

    • Load context (metadata)

  • Bulk load

    • Create a copy of the file on a data grid storage repository

  • Bulk unload

    • Provide containers to hold small files and pointers to each file location

  • Bulk delete

    • Trash can

  • Sticky bits for access control, …
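The container idea mentioned above (one aggregate object holding many small files, plus a pointer to each file's location) can be sketched as follows. The `Container` class and its in-memory buffer are assumptions for illustration; the on-disk format of SRB containers is not described in these slides.

```python
import io


class Container:
    """Packs small files into one byte stream and records (offset, length) per file."""

    def __init__(self):
        self._buffer = io.BytesIO()
        self.index = {}  # logical name -> (offset, length)

    def add(self, logical_name, content):
        # Always append at the end, even if a read moved the stream position.
        offset = self._buffer.seek(0, io.SEEK_END)
        self._buffer.write(content)
        self.index[logical_name] = (offset, len(content))

    def read(self, logical_name):
        # One seek and one read, no matter how many files the container holds.
        offset, length = self.index[logical_name]
        self._buffer.seek(offset)
        return self._buffer.read(length)

    def size(self):
        return len(self._buffer.getvalue())
```

Moving one container instead of thousands of small files means paying the per-request latency once, which is why aggregation appears under latency management.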


Logical Name Spaces

Data Access Methods (C library, Unix, Web Browser)

  • Storage Repository

    • Storage location

    • User name

    • File name

    • File context (creation date,…)

    • Access constraints

Data access directly between application and storage repository using names required by the local repository


Logical Name Spaces

Data Access Methods (C library, Unix, Web Browser)

Data Collection

  • Storage Repository

    • Storage location

    • User name

    • File name

    • File context (creation date,…)

    • Access constraints

  • Data Grid

    • Logical resource name space

    • Logical user name space

    • Logical file name space

    • Logical context (metadata)

    • Control/consistency constraints

Data is organized as a shared collection


Federation Between Data Grids

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Collection A

  • Data Grid

    • Logical resource name space

    • Logical user name space

    • Logical file name space

    • Logical context (metadata)

    • Control/consistency constraints

Data Collection B

  • Data Grid

    • Logical resource name space

    • Logical user name space

    • Logical file name space

    • Logical context (metadata)

    • Control/consistency constraints

Access controls and consistency constraints on cross registration of digital entities
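Cross registration between two data grids can be pictured as one grid recording a pointer to an entity in the other grid's logical name space, subject to that grid's access controls. The sketch below is a simplified model under that assumption; the `DataGrid` class, the zone names, and the `shareable` flag are hypothetical and do not describe the SRB federation protocol.

```python
class DataGrid:
    """Minimal data grid with its own logical file name space (hypothetical)."""

    def __init__(self, zone):
        self.zone = zone
        self.files = {}         # logical path -> physical location
        self.shareable = set()  # logical paths peers are allowed to register
        self.remote = {}        # local logical path -> (zone, remote logical path)

    def put(self, path, location, shareable=False):
        self.files[path] = location
        if shareable:
            self.shareable.add(path)

    def cross_register(self, peer, peer_path, local_path):
        """Record a pointer to a peer grid's entity; no data is copied."""
        if peer_path not in peer.shareable:
            raise PermissionError(f"{peer.zone}:{peer_path} is not shared")
        self.remote[local_path] = (peer.zone, peer_path)

    def locate(self, path, peers):
        """Return (zone, physical location), following a cross registration if needed."""
        if path in self.files:
            return self.zone, self.files[path]
        zone, remote_path = self.remote[path]
        return zone, peers[zone].files[remote_path]
```

Each grid keeps control of its own name spaces: grid B sees the entity under a local logical path, while grid A decides whether the entity may be registered at all.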



Types of Risk

  • Media failure

    • Replicate data onto multiple media

  • Vendor-specific systemic errors

    • Replicate data onto multiple vendor products

  • Operational error

    • Replicate data onto a second administrative domain

  • Natural disaster

    • Replicate data to a geographically remote site

  • Malicious user

    • Replicate data to a deep archive



How Many Replicas

  • Three sites minimize risk

    • Primary site

      • Supports interactive user access to data

    • Secondary site

      • Supports interactive user access when first site is down

      • Provides 2nd media copy, located at a remote site, uses different vendor product, independent administrative procedures

    • Deep archive

      • Provides 3rd media copy, staging environment for data ingestion, no user access
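The pairing of each risk type with a replication strategy, and the three-site layout above, can be turned into a simple check. The function below is an illustrative sketch: the `Replica` fields and the rule that two distinct values are enough to cover a risk are assumptions, not a policy stated in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str
    media: str
    vendor: str
    admin_domain: str
    deep_archive: bool = False


def uncovered_risks(replicas):
    """Return the risk types that this set of replicas leaves open."""
    risks = []
    if len({r.media for r in replicas}) < 2:
        risks.append("media failure")
    if len({r.vendor for r in replicas}) < 2:
        risks.append("vendor-specific systemic error")
    if len({r.admin_domain for r in replicas}) < 2:
        risks.append("operational error")
    if len({r.site for r in replicas}) < 2:
        risks.append("natural disaster")
    if not any(r.deep_archive for r in replicas):
        risks.append("malicious user")
    return risks
```

A single copy leaves all five risks open, while a primary site, a secondary site with different media, vendor, and administration, and a deep archive together return an empty list.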



State of the Art Technology

  • Grid - workflow virtualization

    • Support execution of jobs (processes) across multiple compute servers

  • Data grid - data virtualization

    • Manage a shared collection that is distributed across multiple storage servers

  • Semantic grid - information virtualization

    • Create a common understanding of information (metadata) across multiple collections.



For More Information

Reagan W. Moore

San Diego Supercomputer Center

moore@sdsc.edu

http://www.sdsc.edu/srb/

