Scientific Data Management

Presentation Transcript


Scientific Data Management Center (ISIC)

http://sdmcenter.lbl.gov (contains extensive publication list)



Scientific Data Management Center

Participating Institutions

  • Center PI:

    • Arie Shoshani LBNL

  • DOE Laboratories co-PIs:

    • Bill Gropp, Rob Ross ANL

    • Arie Shoshani, Doron Rotem LBNL

    • Terence Critchlow, Chandrika Kamath LLNL

    • Nagiza Samatova, Andy White ORNL

  • Universities co-PIs :

    • Mladen Vouk North Carolina State

    • Alok Choudhary Northwestern

    • Reagan Moore, Bertram Ludaescher UC San Diego (SDSC)

    • Calton Pu Georgia Tech

    • Steve Parker U of Utah (future)



Phases of Scientific Exploration

  • Data Generation

    • From large scale simulations or experiments

    • Fast data growth with computational power

    • Examples

      • HENP: 100 Teraops and 10 Petabytes by 2006

      • Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); T42 yields about 1 TB per 100-year run, so finer resolutions imply a factor of ~10-20 more data

    • Problems

      • Can’t dump the data to storage fast enough – waste of compute resources

      • Can’t move terabytes of data over WAN robustly – waste of scientist’s time

      • Can’t steer the simulation – waste of time and resources

      • Need to reorganize and transform data – large data-intensive tasks slowing progress
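
As a rough illustration of why resolution drives data volume, the sketch below assumes that output size scales with the number of horizontal grid points, that is, with the inverse square of the grid spacing. That scaling rule is an assumption made here, not something stated on the slide, and it ignores vertical levels, time step, and output frequency. Under it, going from T42 to T170 gives a factor of 16, which falls inside the quoted ~10-20 range.

```python
# Grid spacings quoted on the slide, in km.
SPACING_KM = {"T42": 280, "T85": 140, "T170": 70}

# Quoted baseline: about 1 TB per 100-year run at T42.
BASELINE_TB = 1.0


def horizontal_scale(coarse_km: float, fine_km: float) -> float:
    """Growth in horizontal grid points when spacing shrinks (assumed 2-D scaling)."""
    return (coarse_km / fine_km) ** 2


def estimated_volume_tb(resolution: str) -> float:
    """Estimated output per 100-year run, assuming volume ~ grid-point count."""
    return BASELINE_TB * horizontal_scale(SPACING_KM["T42"], SPACING_KM[resolution])


if __name__ == "__main__":
    for name in SPACING_KM:
        print(f"{name}: ~{estimated_volume_tb(name):.0f} TB per 100-year run")
```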



Phases of Scientific Exploration

  • Data Analysis

    • Analysis of large data volume

    • Can’t fit all data in memory

    • Problems

      • Find the relevant data – need efficient indexing

      • Cluster analysis – need linear scaling

      • Feature selection – efficient high-dimensional analysis

      • Data heterogeneity – combine data from diverse sources

      • Streamline analysis steps – output of one step needs to match input of next



Example Data Flow in TSI

Logistical Network

Courtesy: John Blondin



Goal: Reduce the Data Management Overhead

  • Efficiency

    • Example: parallel I/O, indexing, matching storage structures to the application

  • Effectiveness

    • Example: access data by attributes, not files; facilitate massive data movement

  • New algorithms

    • Example: Specialized PCA techniques to separate signals or to achieve better spatial data compression

  • Enabling ad-hoc exploration of data

    • Example: an exploratory “run and render” capability to analyze and visualize simulation output while the code is running



Approach

SDM Framework

  • Use an integrated framework that:

    • Provides a scientific workflow capability

    • Supports data mining and analysis tools

    • Accelerates storage and access to data

  • Simplify data management tasks for the scientist

    • Hide details of underlying parallel and indexing technology

    • Permit assembly of modules using a simple graphical workflow description tool

Figure labels (SDM Framework): Scientific Process Automation Layer; Data Mining & Analysis Layer; Storage Efficient Access Layer; Scientific Application; Scientific Understanding



Technology Details by Layer


Accomplishments: Storage Efficient Access (SEA)

Figure labels: processes P0, P1, P2, P3 over netCDF over a Parallel File System; processes P0, P1, P2, P3 over Parallel netCDF over a Parallel File System

Shared memory communication

Parallel Virtual File System: enhancements and deployment

  • Developed Parallel netCDF

    • Enables high performance parallel I/O to netCDF datasets

    • Achieves up to 10-fold performance improvement over HDF5

  • Enhanced ROMIO:

    • Provides MPI access to PVFS

    • Advanced parallel file system interfaces for more efficient access

  • Developed PVFS2

    • Adds Myrinet GM and InfiniBand support

    • Improved fault tolerance

    • Asynchronous I/O

    • Offered by Dell and HP for clusters

  • Deployed an HPSS Storage Resource Manager (SRM) with PVFS

    • Automatic access of HPSS files to PVFS through MPI-IO library

    • SRM is a middleware component
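
The idea behind parallel I/O to one shared dataset can be sketched without MPI: each process owns a contiguous slab of an array and writes it at its own byte offset in a single shared file, so no process has to funnel data through another. The following is a plain-Python illustration only. It is not the Parallel netCDF, ROMIO, or PVFS interface; the names `slab_bounds`, `write_slab`, and `demo` are hypothetical, and the "ranks" are simulated by a loop rather than real processes.

```python
import os
import struct
import tempfile

ITEM = struct.Struct("<d")  # one little-endian float64 per array element


def slab_bounds(n_items: int, rank: int, nprocs: int) -> tuple[int, int]:
    """Return [start, stop) of the contiguous slab owned by `rank`."""
    base, extra = divmod(n_items, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def write_slab(path: str, start: int, values: list[float]) -> None:
    """Write `values` at element offset `start` of the shared file."""
    with open(path, "r+b") as f:
        f.seek(start * ITEM.size)
        f.write(b"".join(ITEM.pack(v) for v in values))


def read_all(path: str) -> list[float]:
    """Read the whole file back as a list of floats."""
    with open(path, "rb") as f:
        raw = f.read()
    return [v for (v,) in ITEM.iter_unpack(raw)]


def demo(n_items: int = 10, nprocs: int = 4) -> list[float]:
    """Simulate `nprocs` ranks writing disjoint slabs of one array."""
    data = [float(i) for i in range(n_items)]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Pre-size the file so every rank can write independently.
        with open(path, "wb") as f:
            f.truncate(n_items * ITEM.size)
        # Ranks may finish in any order; reverse order is used to show that.
        for rank in reversed(range(nprocs)):
            start, stop = slab_bounds(n_items, rank, nprocs)
            write_slab(path, start, data[start:stop])
        return read_all(path)
    finally:
        os.remove(path)
```

Because the slabs are disjoint, the write order does not matter and the file ends up identical to a serial write of the full array.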

Figure: FLASH I/O Benchmark Performance (8x8x8 block sizes), before and after


Robust Multi-file Replication

Figure labels (DataMover, running anywhere, replicating between NCAR and LBNL): Get list of files; SRM-COPY (thousands of files); SRM-GET (one file at a time); SRM (performs writes); SRM (performs reads); GridFTP GET (pull mode); MSS; Network transfer; archive files; stage files; Disk Cache (one at each site)

  • Problem: move thousands of files robustly

    • Takes many hours

    • Need error recovery

    • Mass storage systems failures

    • Network failures

    • Use Storage Resource Managers (SRMs)

  • Problem: too slow

    • Use parallel streams

    • Use concurrent transfers

    • Use large FTP windows

    • Pre-stage files from MSS
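
Two of the remedies listed above, error recovery and concurrent transfers, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SRM or DataMover interface: `TransientError`, `transfer_with_retry`, `replicate`, and `make_flaky_transfer` are hypothetical names, and the transfer itself is a stand-in callable. Parallel streams, large FTP windows, and pre-staging are not modeled.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class TransientError(Exception):
    """Stands in for a recoverable network or mass-storage failure."""


def transfer_with_retry(
    name: str,
    transfer: Callable[[str], None],
    max_attempts: int = 5,
    backoff_s: float = 0.0,
) -> int:
    """Run `transfer(name)`, retrying on TransientError; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            transfer(name)
            return attempt
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: the failure was not transient after all
            time.sleep(backoff_s * attempt)  # simple linear backoff
    raise ValueError("max_attempts must be at least 1")


def replicate(
    names: list[str],
    transfer: Callable[[str], None],
    concurrency: int = 4,
) -> dict[str, int]:
    """Transfer files concurrently; map each file name to its attempt count."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        attempts = pool.map(lambda n: transfer_with_retry(n, transfer), names)
        return dict(zip(names, attempts))


def make_flaky_transfer(fail_first: set[str]) -> Callable[[str], None]:
    """Fake transfer that fails exactly once for each name in `fail_first`."""
    pending = set(fail_first)
    lock = threading.Lock()

    def transfer(name: str) -> None:
        with lock:
            if name in pending:
                pending.discard(name)
                raise TransientError(name)

    return transfer
```

A usage example: `replicate([f"f{i}" for i in range(20)], make_flaky_transfer({"f3", "f7"}))` returns a dictionary in which `f3` and `f7` needed two attempts and every other file needed one. Recording attempts per file is also the kind of information a file-tracking display would need.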



File tracking helps to identify bottlenecks

Shows that archiving is the bottleneck



File tracking shows recovery from transient failures

Total: 45 GB


Accomplishments: Data Mining and Analysis (DMA)

  • Developed Parallel-VTK

    • Efficient 2D/3D Parallel Scientific Visualization for NetCDF and HDF files

    • Built on top of PnetCDF

  • Developed “region tracking” tool

    • For exploring 2D/3D scientific databases

    • Using bitmap technology to identify regions based on multi-attribute conditions

  • Implemented Independent Component Analysis (ICA) module

    • Used for accurate signal separation

    • Used for discovering key parameters that correlate with observed data

  • Developed highly effective data reduction

    • Achieves 15-fold reduction with high level of accuracy

    • Using parallel Principal Component Analysis (PCA) technology

  • Developed ASPECT

    • A framework that supports a rich set of pluggable data analysis tools

    • Including all the tools above

    • A rich suite of statistical tools based on the R package
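
The region-tracking bullet above mentions identifying regions with bitmaps over multi-attribute conditions. A minimal sketch of that idea follows: one bitmap per (attribute, bin), with range conditions answered by OR-ing bin bitmaps and multi-attribute conditions by AND-ing the results. This is an illustration only, not the center's implementation. The bitmaps here are uncompressed Python integers, and the class name, bin edges, and sample cells are all made up for the example.

```python
from bisect import bisect_right
from collections import defaultdict


class BitmapIndex:
    """One bitmap per (attribute, bin); a Python int serves as the bit vector."""

    def __init__(self, bin_edges: dict[str, list[float]]):
        self.bin_edges = bin_edges        # sorted bin edges per attribute
        self.bitmaps = defaultdict(int)   # (attribute, bin number) -> bitmap
        self.n_rows = 0

    def add(self, row: dict[str, float]) -> None:
        """Set this row's bit in the bitmap of the bin each value falls in."""
        for attr, edges in self.bin_edges.items():
            bin_no = bisect_right(edges, row[attr])
            self.bitmaps[(attr, bin_no)] |= 1 << self.n_rows
        self.n_rows += 1

    def at_least(self, attr: str, edge: float) -> int:
        """Bitmap of rows with row[attr] >= edge; `edge` must be a bin edge."""
        edges = self.bin_edges[attr]
        first_bin = edges.index(edge) + 1
        result = 0
        for bin_no in range(first_bin, len(edges) + 1):
            result |= self.bitmaps.get((attr, bin_no), 0)
        return result


def rows_of(bitmap: int) -> list[int]:
    """Row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def demo_region() -> list[int]:
    """Rows with temp >= 1500 AND pressure >= 2.0 in a made-up data set."""
    index = BitmapIndex({"temp": [1000, 1500, 2000], "pressure": [1.0, 2.0]})
    cells = [
        {"temp": 300, "pressure": 1.0},
        {"temp": 1600, "pressure": 2.5},
        {"temp": 1700, "pressure": 0.5},
        {"temp": 900, "pressure": 3.0},
        {"temp": 2100, "pressure": 2.0},
    ]
    for cell in cells:
        index.add(cell)
    hot = index.at_least("temp", 1500)
    dense = index.at_least("pressure", 2.0)
    return rows_of(hot & dense)  # bitwise AND combines the two conditions
```

The point of the design is that the query never touches the raw values: it reads a handful of bitmaps and combines them with bitwise operations.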

Combustion region tracking

El Nino signal (red) and estimation (blue) closely match


ASPECT Analysis Environment

Data Select -> Data Access -> Correlate -> Render -> Display

  • (temp, pressure) From astro-data Where (step=101) (entropy>1000);

  • Sample (temp, pressure)

  • Run R analysis

  • Run pVTK filter

  • Visualize scatter plot in QT

Figure labels:

  • Data Mining & Analysis Layer: pVTK Tool; R Analysis Tool; Select Data; Take Sample

  • Calls between components: Read Data (buffer-name); Write Data; Get variables (var-names, ranges); Use Bitmap (condition)

  • Storage Efficient Access Layer: Bitmap Index Selection; PVFS; Parallel NetCDF

  • Hardware, OS, and MSS (HPSS)
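
The select, sample, and correlate steps of the example above can be sketched in plain Python. This is an illustration under assumptions, not ASPECT itself: the function names are hypothetical, the data is synthetic, the two conditions in the query are assumed to be combined with AND, and a hand-written Pearson correlation stands in for the R analysis step. Rendering and display are omitted.

```python
import random
from math import sqrt


def select(rows, columns, where):
    """Keep `columns` of every row that satisfies the `where` predicate."""
    return [tuple(r[c] for c in columns) for r in rows if where(r)]


def take_sample(pairs, k, seed=0):
    """Uniform random sample without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def pearson(pairs):
    """Pearson correlation coefficient of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


def make_rows():
    """Synthetic stand-in for astro-data: pressure is linear in temp."""
    rows = []
    for step in (100, 101):
        for i in range(50):
            temp = 10.0 + i
            rows.append({
                "step": step,
                "entropy": 900 + 10 * i,
                "temp": temp,
                "pressure": 2 * temp + 1,
            })
    return rows


def run_pipeline():
    """Select (temp, pressure) where step=101 and entropy>1000, sample, correlate."""
    selected = select(
        make_rows(),
        ("temp", "pressure"),
        lambda r: r["step"] == 101 and r["entropy"] > 1000,
    )
    sample = take_sample(selected, k=20)
    return len(selected), len(sample), pearson(sample)
```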


Accomplishments: Scientific Process Automation (SPA)

Unique requirements of scientific workflows (WFs)

  • Moving large data volumes between modules

    • Tightly coupled, efficient data movement

  • Specification of granularity-based iteration

    • e.g., in spatio-temporal simulations, a time step is a “granule”

  • Support for data transformation

    • Complex data types (including file formats, e.g. netCDF, HDF)

  • Dynamic steering of workflow by user

    • Dynamic user examination of results

Developed a working scientific workflow system

  • Automatic microarray analysis

  • Using web-wrapping tools developed by the center

  • Using Kepler WF engine

  • Kepler is an adaptation of the UC Berkeley tool, Ptolemy

workflow steps defined graphically

workflow results presented to user
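
The workflow requirements listed on this slide (granule-based iteration, data transformation, tightly coupled data movement, and user steering) can be sketched with chained Python generators, where each module hands one time-step granule at a time to the next in memory. This is only an illustration of the dataflow idea and is not how Kepler or Ptolemy is programmed; every name below is hypothetical.

```python
from typing import Callable, Iterable, Iterator

Granule = dict  # one time step, e.g. {"step": 0, "values": [...]}


def simulate(n_steps: int) -> Iterator[Granule]:
    """Source module: emits one granule per time step."""
    for step in range(n_steps):
        yield {"step": step, "values": [float(step + i) for i in range(4)]}


def transform(granules: Iterable[Granule]) -> Iterator[Granule]:
    """Transformation module: adds a per-step mean to each granule."""
    for g in granules:
        yield {**g, "mean": sum(g["values"]) / len(g["values"])}


def steer(
    granules: Iterable[Granule],
    keep_going: Callable[[Granule], bool],
) -> Iterator[Granule]:
    """Steering module: passes granules on until the user callback says stop."""
    for g in granules:
        yield g
        if not keep_going(g):
            return  # upstream modules produce no further granules


def run_workflow(n_steps: int, keep_going: Callable[[Granule], bool]) -> list[Granule]:
    """Wire the modules together and collect what reaches the end."""
    return list(steer(transform(simulate(n_steps)), keep_going))
```

For example, `run_workflow(10, lambda g: g["mean"] < 4.0)` processes steps 0 through 3 and then stops, because the generators are lazy and the simulation is never asked for step 4.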



GUI for setting up and running workflows



Re-applying Technology

SDM technology, developed for one application, can be effectively targeted at many other applications …

Technology                   Initial Application   New Applications
Parallel NetCDF              Astrophysics          Climate
Parallel VTK                 Astrophysics          Climate
Compressed bitmaps           HENP                  Combustion, Astrophysics
Storage Resource Managers    HENP                  Astrophysics
Feature Selection            Climate               Fusion
Scientific Workflow          Biology               Astrophysics (planned)



Broad Impact of the SDM Center…

Astrophysics:

High speed storage technology, parallel NetCDF, parallel VTK, and ASPECT integration software used for Terascale Supernova Initiative (TSI) and FLASH simulations

Tony Mezzacappa – ORNL, John Blondin – NCSU, Mike Zingale – U of Chicago, Mike Papka – ANL

Climate:

High speed storage technology, Parallel NetCDF, and ICA technology used for Climate Modeling projects

Ben Santer – LLNL, John Drake – ORNL, John Michalakes – NCAR

Combustion:

Compressed Bitmap Indexing used for fast generation of flame regions and tracking their progress over time

Wendy Koegler, Jacqueline Chen – Sandia Lab

ASCI FLASH – parallel NetCDF

Dimensionality reduction

Region growing



Broad Impact (cont.)

Biology:

Kepler workflow system and web-wrapping technology used for executing complex highly repetitive workflow tasks for processing microarray data

Matt Coleman - LLNL

High Energy Physics:

Compressed Bitmap Indexing and Storage Resource Managers used for locating desired subsets of data (events) and automatically retrieving data from HPSS

Doug Olson - LBNL, Eric Hjort – LBNL, Jerome Lauret - BNL

Fusion:

A combination of PCA and ICA technology used to identify the key parameters that are relevant to the presence of edge harmonic oscillations in a tokamak

Keith Burrell - General Atomics

Building a scientific workflow

Dynamic monitoring of HPSS file transfers

Identifying key parameters for the DIII-D Tokamak



Goals for Years 4-5

  • Fully develop the integrated SDM framework

    • Implement the three-layer framework on the SDM center facility

    • Provide a way to select only components needed

    • Develop self-guiding web pages on the use of SDM components

    • Use existing successful examples as guides

  • Generalize components for reuse

    • Develop general interfaces between components in the layers

    • Support loosely-coupled WSDL interfaces

    • Support tightly-coupled components for efficient dataflow

  • Integrate operation of components in the framework

    • Hide details from the user: automate parallel access and indexing

    • Develop a reusable library of components that can be selected for use in the workflow system
