


Virtual Data Toolkit

R. Cavanaugh

GriPhyN Analysis Workshop

Caltech, June 2003


Very Early GriPhyN Data Grid Architecture

[Diagram: an Application layer sits on top of Planner, Executor, Catalog Services, Info Services, Monitoring, Replica Management, and Policy/Security components, which in turn use a Reliable Transfer Service and Compute and Storage Resources. Annotated technologies include MCAT and the GriPhyN catalogs, MDS, GDMP, DAGMan and Condor-G, GSI and CAS, Globus GRAM, GridFTP, and SRM. Legend: “= initial solution is operational”.]



Currently Evolved GriPhyN Picture


Picture Taken from Mike Wilde



Current VDT Emphasis

  • Current reality

    • Easy grid construction

      • Strikes a balance between flexibility and “easibility”

      • purposefully errs (just a little bit) on the side of “easibility”

    • Long running, high-throughput, file-based computing

    • Abstract description of complex workflows

    • Virtual Data Request Planning

    • Partial provenance tracking of workflows

  • Future directions (current research) include:

    • Policy based scheduling

      • With notions of Quality of Service (advance reservation of resources, etc.)

    • Dataset-based virtual data (arbitrary type structures)

    • Full provenance tracking of workflows

    • Several others…



Current VDT Flavors

  • Client

    • Globus Toolkit 2

      • GSI

      • globusrun

      • GridFTP Client

    • CA signing policies for DOE and EDG

    • Condor-G 6.5.1 / DAGMan

    • RLS 1.1.8 Client

    • MonALISA Client (soon)

    • Chimera 1.0.3

  • SDK

    • Globus

    • ClassAds

    • RLS 1.1.8 Client

    • Netlogger 2.0.13

  • Server

    • Globus Toolkit 2.2.4

      • GSI

      • Gatekeeper

      • job-managers and GASS Cache

      • MDS

      • GridFTP Server

    • MyProxy

    • CA signing policies for DOE and EDG

    • EDG Certificate Revocation List

    • Fault Tolerant Shell

    • GLUE Schema

    • mkgridmap

    • Condor 6.5.1 / DAGMan

    • RLS 1.1.8 Server

    • MonALISA Server (soon)




Chimera Virtual Data System

  • Virtual Data Language

    • textual

    • XML

  • Virtual Data Catalog

    • MySQL- or PostgreSQL-based

    • File based version available




Virtual Data Language

Two transformations (TR) and two derivations (DV) chain CMKIN and CMSIM together:

  TR CMKIN( out a2, in a1 )
  {
    argument file = ${a1};
    argument file = ${a2};
  }

  TR CMSIM( out a2, in a1 )
  {
    argument file = ${a1};
    argument file = ${a2};
  }

  DV x1->CMKIN( a2=@{out:file2}, a1=@{in:file1});
  DV x2->CMSIM( a2=@{out:file3}, a1=@{in:file2});

[Diagram: the resulting data-dependency chain is file1 → x1 → file2 → x2 → file3.]


Picture Taken from Mike Wilde
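As an illustration only (plain Python, not part of Chimera or the VDT; the dictionaries below simply mirror the TR/DV records above), the derivations can be expanded into the command lines their transformations describe:

  # Hypothetical sketch: expand the DV records above into the argument lists
  # described by their TR definitions.
  TRANSFORMATIONS = {
      # TR name -> formal arguments in the order the "argument" lines use them
      "CMKIN": ["a1", "a2"],
      "CMSIM": ["a1", "a2"],
  }

  DERIVATIONS = {
      # DV name -> (TR name, binding of formal argument to logical file name)
      "x1": ("CMKIN", {"a1": "file1", "a2": "file2"}),
      "x2": ("CMSIM", {"a1": "file2", "a2": "file3"}),
  }

  for dv, (tr, binding) in DERIVATIONS.items():
      args = [binding[a] for a in TRANSFORMATIONS[tr]]
      print(dv, "->", tr, *args)
  # x1 -> CMKIN file1 file2
  # x2 -> CMSIM file2 file3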



Virtual Data Request Planning

  • Abstract Planner

    • Graph traversal of (virtual) data dependencies

    • Generates the graph with maximal data dependencies

    • Somewhat analogous to Build Style

  • Concrete (Pegasus) Planner

    • Prunes execution steps for which data already exists (RLS lookup)

    • Binds all execution steps in the graph to a site

    • Adds “housekeeping” steps

      • Create environment, stage-in data, stage-out data, publish data, clean-up environment, etc

    • Generates a graph with minimal execution steps

    • Somewhat analogous to Make Style (see the sketch below)
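A rough sketch of the two planning styles for the CMKIN/CMSIM chain, assuming a toy in-memory representation invented for this illustration (this is not the Pegasus code, and the site name is made up):

  # Hypothetical sketch of build-style vs. make-style planning.
  DERIVATIONS = [
      {"name": "x1", "inputs": ["file1"], "outputs": ["file2"]},   # CMKIN step
      {"name": "x2", "inputs": ["file2"], "outputs": ["file3"]},   # CMSIM step
  ]

  def abstract_plan(derivations, target):
      """Build style: walk the data dependencies back from the target and
      keep every producing step, regardless of which files already exist."""
      producer = {out: d for d in derivations for out in d["outputs"]}
      plan, frontier = [], [target]
      while frontier:
          d = producer.get(frontier.pop())
          if d and d not in plan:
              plan.append(d)
              frontier.extend(d["inputs"])
      return list(reversed(plan))                    # execution order

  def concrete_plan(steps, existing, site_of):
      """Make style: prune steps whose outputs already exist (think RLS
      lookup), bind the rest to a site, and add housekeeping steps."""
      jobs = []
      for d in steps:
          if all(out in existing for out in d["outputs"]):
              continue                               # data exists: prune step
          site = site_of(d)                          # site-selection policy
          jobs += [("stage-in", site, d["inputs"]),
                   ("run", site, d["name"]),
                   ("stage-out/publish", site, d["outputs"])]
      return jobs

  # Example: file2 is already registered, so only x2 actually runs.
  steps = abstract_plan(DERIVATIONS, "file3")        # the x1 and x2 steps, in order
  jobs = concrete_plan(steps, existing={"file2"}, site_of=lambda d: "some-site")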



Chimera Virtual Data System: Mapping Abstract Workflows onto Concrete Environments

  • Abstract DAGs (virtual workflow)

    • Resource locations unspecified

    • File names are logical

    • Data destinations unspecified

    • build style

  • Concrete DAGs (stuff for submission)

    • Resource locations determined

    • Physical file names specified

    • Data delivered to and returned from physical locations

    • make style

[Diagram: VDL (XML) is stored in the VDC; the abstract planner produces a logical DAX, which the concrete planner, consulting the RLS, turns into a physical DAG that is handed to DAGMan.]

In general there is a full range of planning steps between abstract workflows and concrete workflows; a rough before-and-after view of a single step is sketched below


Picture Taken from Mike Wilde
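To make the abstract/concrete contrast more tangible, here is a rough, hypothetical illustration (the field names, paths, and site are invented; this is not the actual DAX or Condor DAG syntax) of how the CMKIN step might look before and after planning:

  # Hypothetical before/after view of a single workflow node (invented fields).
  abstract_node = {
      "derivation": "x1",           # logical step from the virtual data catalog
      "transformation": "CMKIN",
      "inputs": ["file1"],          # logical file names
      "outputs": ["file2"],
      "site": None,                 # resource location unspecified (build style)
  }

  concrete_node = {
      "job": "x1",
      "executable": "/opt/cms/bin/cmkin",                  # invented physical path
      "site": "some-site",                                  # bound to a site
      "inputs": ["gsiftp://se.some-site.org/store/file1"],  # physical file names
      "outputs": ["gsiftp://se.some-site.org/store/file2"],
      "pre": ["create-dir", "stage-in"],                    # housekeeping added
      "post": ["stage-out", "publish", "clean-up"],         # by the concrete planner
  }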


Supercomputing 2002

A virtual space of simulated data is generated for future use by scientists...

[Diagram: a tree of virtual data products labelled by their parameters, all rooted at mass = 200, branching by decay mode (bb, ZZ, WW) and further refined by stability, event, and plot parameters.]


Supercomputing 2002

Scientists may add new derived data branches...

[Diagram: the same tree of virtual data products, now with a new derived branch (LowPt = 20, HighPt = 10000) added under the mass = 200, decay = WW, stability = 1 node.]


Example CMS Data/Workflow

[Diagram: the CMS production and analysis chain, with Generator, Formator, Simulator, and Digitiser stages feeding writeESD, writeAOD, and writeTAG steps (repeated for reprocessing), a POOL persistency store, a calibration DB, and analysis scripts at the end.]


Data/workflow is a collaborative endeavour!

[Diagram: the same CMS data/workflow, now annotated with the teams that own its different stages: the MC Production Team, the Online Teams, the (Re)processing Team, and the Physics Groups.]


A “Concurrent Analysis Versioning System”: Complex Data Flow and Data Provenance in HEP

[Diagram: two parallel analysis chains, Raw → ESD → AOD → TAG → plots, tables, and fits, one for real data and one for simulated data, with comparisons between the two sets of results.]

  • Family History of a Data Analysis

  • Collaborative Analysis Development Environment

  • "Check-point" a Data Analysis

  • Analysis Development Environment (like CVS)

  • Audit a Data Analysis




Current Prototype GriPhyN “Architecture” (Picture)


Picture Taken from Mike Wilde



Post-talk: My wandering mind… Typical VDT Configuration

  • Single public head-node (gatekeeper)

    • VDT-server installed

  • Many private worker-nodes

    • Local scheduler software installed

    • No grid-middleware installed

  • Shared file system (e.g. NFS)

    • User area shared between head-node and worker-nodes

    • One or more RAID systems, typically shared




Default middleware configuration from the Virtual Data Toolkit

[Diagram: on the submit host, Chimera feeds DAGMan and Condor-G; Condor-G launches a gahp_server that talks to the gatekeeper on the remote host, which hands jobs to the local scheduler (Condor, PBS, etc.) for execution on the compute machines.]




EDG Configuration (for comparison)

  • CPU separate from Storage

    • CE: single gatekeeper for access to cluster

    • SE: single gatekeeper for access to storage

  • Many public worker-nodes (or at least NAT with outbound connectivity)

    • Local scheduler installed (LSF or PBS)

    • Each worker-node runs a GridFTP Client

  • No assumed shared file system

    • Data access is accomplished via globus-url-copy to local disk on the worker-node (see the sketch below)
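For example, a worker-node job in this layout might fetch its input with globus-url-copy before running; the sketch below is hypothetical, and the SE host name and paths are invented:

  # Hypothetical worker-node prologue in the EDG-style layout: copy the input
  # file from the storage element to local disk before the job reads it.
  import subprocess

  subprocess.run(
      ["globus-url-copy",
       "gsiftp://se.example.org/store/file2",   # invented SE location
       "file:///tmp/file2"],                    # local disk on the worker-node
      check=True,
  )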




Why Care?

  • Data Analyses would benefit from being fabric independent!

  • But…the devil is (still) in the details!

    • Assumptions in job descriptions/requirements currently lead to direct fabric-level consequences and vice versa.

  • Are existing middleware configurations sufficient for Data Analysis (“scheduled” and “interactive”)?

    • We really need input from groups like this one!

    • What kind of fabric layer is necessary for “interactive” data analysis using PROOF or JAS?

  • Does the VDT need multiple configuration flavors?

    • Production, batch oriented (current default)

    • Analysis, interactive oriented


