

Virtual Data Toolkit

R. Cavanaugh

GriPhyN

Analysis Workshop

Caltech, June 2003


Very Early GriPhyN Data Grid Architecture

[Figure: layered architecture diagram. An Application invokes grid services: Catalog Services (MCAT; GriPhyN catalogs), Monitoring (MDS), Planner, Info Services (MDS), Replica Management (GDMP), Executor (DAGMan, Condor-G), Policy/Security (GSI, CAS), and a Reliable Transfer Service (GridFTP), all built on Globus over the underlying Compute Resource (GRAM) and Storage Resource (SRM). A legend marks components for which an initial solution is operational.]

Caltech Analysis Workshop


Currently Evolved GriPhyN Picture


Picture Taken from Mike Wilde


Current VDT Emphasis

  • Current reality

    • Easy grid construction

      • Strikes a balance between flexibility and “easibility”

      • Purposefully errs (just a little bit) on the side of “easibility”

    • Long running, high-throughput, file-based computing

    • Abstract description of complex workflows

    • Virtual Data Request Planning

    • Partial provenance tracking of workflows

  • Future directions (current research), including:

    • Policy-based scheduling

      • With notions of Quality of Service (advanced reservation of resources, etc)

    • Dataset-based (arbitrary type structures)

    • Full provenance tracking of workflows

    • Several others…



Current VDT Flavors

Client:

  • Globus Toolkit 2

    • GSI

    • globusrun

    • GridFTP Client

  • CA signing policies for DOE and EDG

  • Condor-G 6.5.1 / DAGMan

  • RLS 1.1.8 Client

  • MonALISA Client (soon)

  • Chimera 1.0.3

SDK:

  • Globus

  • ClassAds

  • RLS 1.1.8 Client

  • Netlogger 2.0.13

Server:

  • Globus Toolkit 2.2.4

    • GSI

    • Gatekeeper, job-managers and GASS Cache

    • MDS

    • GridFTP Server

  • MyProxy

  • CA signing policies for DOE and EDG

  • EDG Certificate Revocation List

  • Fault Tolerant Shell

  • GLUE Schema

  • mkgridmap

  • Condor 6.5.1 / DAGMan

  • RLS 1.1.8 Server

  • MonALISA Server (soon)


Chimera Virtual Data System

  • Virtual Data Language

    • textual

    • XML

  • Virtual Data Catalog

    • MySQL- or PostgreSQL-based

    • File based version available



Virtual Data Language

TR CMKIN( out a2, in a1 )
{
  argument file = ${a1};
  argument file = ${a2};
}

TR CMSIM( out a2, in a1 )
{
  argument file = ${a1};
  argument file = ${a2};
}

DV x1->CMKIN( a2=@{out:file2}, a1=@{in:file1});
DV x2->CMSIM( a2=@{out:file3}, a1=@{in:file2});

Resulting data flow: file1 → x1 → file2 → x2 → file3


Picture Taken from Mike Wilde
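The two derivations above chain through file2: CMKIN writes it and CMSIM reads it. As an illustration of how a planner can recover that ordering from the declared inputs and outputs alone, here is a minimal sketch; the dictionary layout and helper name are invented for this example and are not part of Chimera:

```python
# Each derivation lists its logical input and output files, mirroring the
# DV statements above (x1 -> CMKIN, x2 -> CMSIM).
derivations = {
    "x1": {"in": ["file1"], "out": ["file2"]},   # CMKIN
    "x2": {"in": ["file2"], "out": ["file3"]},   # CMSIM
}

def topo_order(dvs):
    """Order derivations so that producers run before consumers."""
    produced_by = {f: name for name, d in dvs.items() for f in d["out"]}
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for f in dvs[name]["in"]:            # visit the producer of each input
            if f in produced_by:
                visit(produced_by[f])
        order.append(name)

    for name in dvs:
        visit(name)
    return order

print(topo_order(derivations))  # → ['x1', 'x2']
```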


Virtual Data Request Planning

  • Abstract Planner

    • Graph traversal of (virtual) data dependencies

    • Generates the graph with maximal data dependencies

    • Somewhat analogous to Build Style

  • Concrete (Pegasus) Planner

    • Prunes execution steps for which data already exists (RLS lookup)

    • Binds all execution steps in the graph to a site

    • Adds “housekeeping” steps

      • Create environment, stage-in data, stage-out data, publish data, clean-up environment, etc

    • Generates a graph with minimal execution steps

    • Somewhat analogous to Make Style

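The build-versus-make distinction above can be sketched as a pruning pass. A minimal illustration, with a plain set standing in for the RLS lookup; the step and file names are made up:

```python
# Abstract plan: every step needed to derive the requested data ("build style").
abstract_plan = ["generate", "simulate", "digitise", "analyse"]
outputs = {"generate": "f1", "simulate": "f2", "digitise": "f3", "analyse": "f4"}

# Stand-in for a Replica Location Service: logical files already materialized.
rls = {"f1", "f2"}

# Concrete plan: prune steps whose outputs already exist ("make style").
concrete_plan = [step for step in abstract_plan if outputs[step] not in rls]
print(concrete_plan)  # → ['digitise', 'analyse']
```

The real concrete planner additionally binds each surviving step to a site and wraps it with the stage-in, stage-out, publish, and clean-up housekeeping steps listed above.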


Chimera Virtual Data System:

Mapping Abstract Workflows onto Concrete Environments

  • Abstract DAGs (virtual workflow)

    • Resource locations unspecified

    • File names are logical

    • Data destinations unspecified

    • build style

  • Concrete DAGs (stuff for submission)

    • Resource locations determined

    • Physical file names specified

    • Data delivered to and returned from physical locations

    • make style

[Figure: the planning pipeline. VDL (textual or XML) populates the VDC; the abstract planner emits a logical, XML DAX; the concrete planner consults the RLS and emits a physical DAG for DAGMan.]

In general there is a full range of planning steps between abstract workflows and concrete workflows

Picture Taken from Mike Wilde


Supercomputing 2002

A virtual space of simulated data is generated for future use by scientists...

[Figure: a tree of virtual data products, each node labelled by its defining parameters: from (mass = 200) through branches such as (decay = bb), (decay = ZZ), and (decay = WW), refined by (stability = 1) or (stability = 3), down to leaves such as (event = 8) and (plot = 1).]


Supercomputing 2002

Scientists may add new derived data branches...

[Figure: the same virtual data tree, with a new branch added under (mass = 200, decay = WW, stability = 1) carrying the parameters LowPt = 20 and HighPt = 10000.]


Example CMS Data/Workflow

[Figure: the CMS processing chain Generator → Formator → Simulator → Digitiser → writeESD / writeAOD / writeTAG, writing into POOL; a Calib. DB feeds a second writeESD / writeAOD / writeTAG pass, and Analysis Scripts consume the results.]


[Figure: the same CMS data/workflow chain, annotated with the groups responsible for each stage: Online Teams, the MC Production Team, the (Re)processing Team, and the Physics Groups (Analysis Scripts).]

Data/workflow is a collaborative endeavour!


A “Concurrent Analysis Versioning System:”

Complex Data Flow and Data Provenance in HEP

  • Family History of a Data Analysis

  • Collaborative Analysis Development Environment (like CVS)

  • "Check-point" a Data Analysis

  • Audit a Data Analysis

[Figure: the analysis chain Raw → ESD → AOD → TAG → plots, tables, and fits, carried out in parallel on real and simulated data, with comparisons between the two.]
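A minimal sketch of what such a family-history lookup could look like; the record format here is invented for illustration and is not Chimera's actual schema:

```python
# Invented provenance records: each derived product remembers which
# transformation made it and from which inputs.
provenance = {
    "ESD":  {"tr": "reconstruct", "inputs": ["Raw"]},
    "AOD":  {"tr": "summarize",   "inputs": ["ESD"]},
    "TAG":  {"tr": "tag",         "inputs": ["AOD"]},
    "plot": {"tr": "analyze",     "inputs": ["AOD", "TAG"]},
}

def family_history(product):
    """Walk back through the derivation records: the audit trail."""
    history, stack, seen = [], [product], set()
    while stack:
        p = stack.pop()
        if p in provenance and p not in seen:
            seen.add(p)
            rec = provenance[p]
            history.append((p, rec["tr"], tuple(rec["inputs"])))
            stack.extend(rec["inputs"])      # keep walking toward Raw
    return history

for step in family_history("plot"):
    print(step)
```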


Current Prototype GriPhyN “Architecture” (Picture)


Picture Taken from Mike Wilde


Post-talk: My wandering mind… Typical VDT Configuration

  • Single public head-node (gatekeeper)

    • VDT-server installed

  • Many private worker-nodes

    • Local scheduler software installed

    • No grid-middleware installed

  • Shared file system (e.g. NFS)

    • User area shared between head-node and worker-nodes

    • One or many RAID systems, typically shared



Default middleware configuration from the Virtual Data Toolkit

[Figure: on the submit host, Chimera feeds DAGMan, which drives Condor-G through the gahp_server; Condor-G contacts the gatekeeper on the remote host, and the gatekeeper hands jobs to the local scheduler (Condor, PBS, etc.) for execution on the compute machines.]


EDG Configuration(for comparison)

  • CPU separate from Storage

    • CE: single gatekeeper for access to cluster

    • SE: single gatekeeper for access to storage

  • Many public worker-nodes (at least NATed)

    • Local scheduler installed (LSF or PBS)

    • Each worker-node runs a GridFTP Client

  • No assumed shared file system

    • Data access is accomplished via globus-url-copy to local disk on worker-node

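Under that EDG configuration, stage-in on a worker-node reduces to a globus-url-copy from the Storage Element to local disk. A minimal sketch of assembling that command line; the SE hostname and paths are hypothetical:

```python
# Build the globus-url-copy invocation an EDG job would run to stage a file
# from the Storage Element (SE) to the worker-node's local disk.
# se.example.org and the paths below are placeholders.
def stage_in_command(se_host, remote_path, local_path):
    src = f"gsiftp://{se_host}{remote_path}"   # GridFTP source URL on the SE
    dst = f"file://{local_path}"               # local-disk destination URL
    return ["globus-url-copy", src, dst]

cmd = stage_in_command("se.example.org", "/data/run42/events.root",
                       "/tmp/events.root")
print(" ".join(cmd))
# globus-url-copy gsiftp://se.example.org/data/run42/events.root file:///tmp/events.root
```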


Why Care?

  • Data Analyses would benefit from being fabric independent!

  • But…the devil is (still) in the details!

    • Assumptions in job descriptions/requirements currently lead to direct fabric-level consequences and vice versa.

  • Are existing middleware configurations sufficient for Data Analysis (“scheduled” and “interactive”)?

    • We really need input from groups like this one!

    • What kind of fabric layer is necessary for “interactive” data analysis using PROOF or JAS?

  • Does the VDT need multiple configuration flavors?

    • Production, batch oriented (current default)

    • Analysis, interactive oriented


