The GriPhyN Virtual Data System

The GriPhyN Virtual Data System. Ian Foster for the VDS team. Science as “Workflow”: E.g., Galaxy Cluster Search. DAG. Sloan Data. Galaxy cluster size distribution. Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago.

Presentation Transcript


  1. The GriPhyN Virtual Data System • Ian Foster for the VDS team

  2. Science as “Workflow”: E.g., Galaxy Cluster Search • DAG • Sloan Data • Galaxy cluster size distribution • Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago

  3. Requirements • Express complex multi-step “workflows” • Perhaps 100,000s of individual tasks • Operate on heterogeneous distributed data • Different formats & access protocols • Harness many computing resources • Parallel computers &/or distributed Grids • Execute workflows reliably • Despite diverse failure conditions • Enable reuse of data & workflows • Discovery & composition • Support many users, workflows, resources • Policy specification & enforcement

  4. Virtual Data System • Express complex multi-step “workflows” • Perhaps 100,000s of individual tasks • Operate on heterogeneous distributed data • Different formats & access protocols • Harness many computing resources • Parallel computers &/or distributed Grids • Execute workflows reliably & efficiently • Despite diverse failure conditions • Enable reuse of data & workflows • Discovery & composition • Support many users, workflows, resources • Policy specification & enforcement • Slide annotations: VDL, XDTM; Pegasus, DAGMan, Globus; VDC; TBD

  5. (Architecture diagram labels) Workflow spec • VDL Program • Virtual Data catalog • Virtual Data Workflow Generator • Abstract workflow • Virtual Data System • Create Execution Plan • Grid Workflow Execution • Statically Partitioned DAG • DAGMan DAG • DAGMan & Condor-G • Dynamically Planned DAG • Job Planner • Job Cleanup • Local planner
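The step from an abstract workflow to an executable DAG can be illustrated with a minimal sketch. It is not VDS, Pegasus, or DAGMan code: it uses only Python's standard `graphlib`, and the task names are hypothetical, loosely borrowed from the fMRI example later in the talk. It shows how dependency edges alone determine which tasks may run together.

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: each task maps to the set of tasks it depends on.
workflow = {
    "reorient_1": set(),
    "reorient_2": set(),
    "align_1": {"reorient_1"},
    "align_2": {"reorient_2"},
    "mean": {"align_1", "align_2"},
}

def execution_waves(dag):
    """Group tasks into waves; every task in a wave has all dependencies satisfied."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves

waves = execution_waves(workflow)
print(waves)
```

Tasks within one wave are independent of each other, which is the property a planner can exploit to spread work across many CPUs.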

  6. 600-1000+ CPUs • Genome Analysis & DB Update (GADU)

  7. The Rest of the Talk • Express complex multi-step “workflows” • Perhaps 100,000s of individual tasks • Operate on heterogeneous distributed data • Different formats & access protocols • Harness many computing resources • Parallel computers &/or distributed Grids • Execute workflows reliably & efficiently • Despite diverse failure conditions • Enable reuse of data & workflows • Discovery & composition • Support many users, workflows, resources • Policy specification & enforcement • Slide annotations: VDL, XDTM; Pegasus, DAGMan, Globus (Ewa); VDC; TBD

  8. “Messy” Scientific Data • Diverse storage formats & access protocols • Logically identical dataset can be stored in text file (e.g. CSV), binary file, spreadsheet • Data available from filesystem, database, HTTP, WebDAV, etc... • Metadata encoded in directory & file names • E.g.: “fMRI volume is composed of an image file & header file with same prefix” • Format dependency hinders program and workflow reuse

  9. But... Data is Often Logically Structured • Scientific data often maintain hierarchical structure • A common practice is to select a set of data items and apply a transformation to each individual item • A nested approach of such iterations could scale up to millions of objects

  10. Introducing a Typing System • Describe logical data structures as types … • … & physical representations as mappings • Define procedures in terms of typed datasets • … & apply procedures to different physical data • Compose workflows from typed procedures • Benefits • Type checking • Dataset selection and iteration • Discovery by types • Dynamic binding • Type conversion

  11. XDTM (Moreau, Zhao, Wilde, Foster) • XML Dataset Typing and Mapping • Separates logical structure from physical representations • Logical structure described by XML Schema • Primitive scalar types: int, float, string, date … • Complex types (structs and arrays) • Mapping descriptor • How logical elements map to physical • External parameters (e.g. location) • XPath for dataset selection

  12. Mapping • Define a common mapping interface • Initialize, read, create, write, close • Data providers implement the interface • Responsible for data access details • XView maintains cached logical datasets • Diagram labels: VDS, XViewMgr, XView, Mapper, Data Source
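A minimal sketch of what a common mapping interface with these five operations might look like follows. It is an illustration in Python, not the VDS mapper API: the class names, method signatures, and the in-memory CSV provider are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
import csv
import io

class Mapper(ABC):
    """Hypothetical interface with the five operations named on the slide."""

    @abstractmethod
    def initialize(self, **params): ...

    @abstractmethod
    def read(self): ...

    @abstractmethod
    def create(self): ...

    @abstractmethod
    def write(self, records): ...

    @abstractmethod
    def close(self): ...

class CsvTextMapper(Mapper):
    """Assumed data provider: maps a logical list of records to CSV text in memory."""

    def initialize(self, text="", fields=()):
        self.fields = list(fields)
        self.buffer = io.StringIO(text)
        self.closed = False

    def read(self):
        # Physical CSV rows become logical records (dicts keyed by field name).
        self.buffer.seek(0)
        return list(csv.DictReader(self.buffer))

    def create(self):
        # Start a fresh physical representation containing only the header row.
        self.buffer = io.StringIO()
        csv.DictWriter(self.buffer, fieldnames=self.fields).writeheader()

    def write(self, records):
        # Append logical records to the end of the physical representation.
        self.buffer.seek(0, io.SEEK_END)
        csv.DictWriter(self.buffer, fieldnames=self.fields).writerows(records)

    def close(self):
        self.closed = True

mapper = CsvTextMapper()
mapper.initialize(fields=["subject", "run"])
mapper.create()
mapper.write([{"subject": "s1", "run": "1"}])
records = mapper.read()
mapper.close()
print(records)
```

Code that consumes `records` never sees the CSV details, so a provider backed by a database or HTTP source could be swapped in behind the same five calls.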

  13. Use Case: Functional MRI
     Logical Structure:
       DBIC Archive
         Study #1
           Group #1
             Subject #1
               Anatomy
                 high-res volume
               Functional Runs
                 run #1
                   volume #001
                   ...
                   volume #275
                 ...
                 run #5
                   volume #001
                   ...
                 snrun #...
               …
           Group #5
           ...
         Study #...
     Physical Representation:
       DBIC Archive
         Study_2004.0521.hgd
           Group_1
             Subject_2004.e024
               volume_anat.img
               volume_anat.hdr
               bold1_001.img
               bold1_001.hdr
               ...
               bold1_275.img
               bold1_275.hdr
               ...
               bold5_001.img
               ...
               snrbold*_*
               air*
               ...
           Group_5
           ...
         Study ...
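The relationship between the two views can be sketched in Python. The example below assumes, based only on the file names listed on the slide, that functional volumes follow a `bold<run>_<volume>.<img|hdr>` pattern; the regular expression and the function name are hypothetical and not part of VDS.

```python
import re
from collections import defaultdict

# Assumed naming convention inferred from the slide: bold<run>_<volume>.<img|hdr>
BOLD_NAME = re.compile(r"^bold(\d+)_(\d+)\.(img|hdr)$")

def group_runs(filenames):
    """Rebuild the logical run -> volume -> {img, hdr} structure from flat file names."""
    runs = defaultdict(lambda: defaultdict(dict))
    for name in filenames:
        match = BOLD_NAME.match(name)
        if match is None:
            continue  # anatomy, snrbold*, and air* files are outside this sketch
        run, volume, ext = match.groups()
        runs[int(run)][int(volume)][ext] = name
    # Convert nested defaultdicts to plain dicts for predictable comparison and printing.
    return {run: dict(volumes) for run, volumes in runs.items()}

files = [
    "volume_anat.img", "volume_anat.hdr",
    "bold1_001.img", "bold1_001.hdr",
    "bold1_275.img", "bold1_275.hdr",
    "bold5_001.img",
]
runs = group_runs(files)
print(runs[1][275])
```

This is the kind of metadata-in-file-names logic that a mapper would hide, so that procedures can be written against runs and volumes rather than path patterns.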

  14. Type Definitions in VDL (part of the fMRI AIRSN (Spatial Normalization) workflow)
     type Image {};
     type Header {};
     type Volume { Image img; Header hdr; }
     type Anat Volume;
     type Warp {};
     type NormAnat { Anat aVol; Warp aWarp; Volume nHires; }
     type Run { Volume v [ ]; }
     type Subject { Anat anat; Run run [ ]; Run snrun [ ]; }
     type Group { Subject s[ ]; }
     type Study { Group g[ ]; }
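For readers unfamiliar with VDL, the same logical structure can be mirrored in Python dataclasses. This is only an analogy under stated assumptions: the opaque `Image` and `Header` types are represented as file-name strings, the `Warp` and `NormAnat` types are omitted, and `count_volumes` is a hypothetical helper showing the nested iteration that typed datasets make possible.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Volume:
    img: str  # stands in for the opaque Image type
    hdr: str  # stands in for the opaque Header type

Anat = Volume  # mirrors "type Anat Volume;"

@dataclass
class Run:
    v: List[Volume] = field(default_factory=list)

@dataclass
class Subject:
    anat: Anat
    run: List[Run] = field(default_factory=list)
    snrun: List[Run] = field(default_factory=list)

@dataclass
class Group:
    s: List[Subject] = field(default_factory=list)

@dataclass
class Study:
    g: List[Group] = field(default_factory=list)

def count_volumes(study):
    """Nested iteration: study -> groups -> subjects -> runs -> volumes."""
    return sum(len(r.v) for g in study.g for s in g.s for r in s.run)

subject = Subject(
    anat=Anat("volume_anat.img", "volume_anat.hdr"),
    run=[Run(v=[Volume("bold1_001.img", "bold1_001.hdr"),
                Volume("bold1_275.img", "bold1_275.hdr")])],
)
study = Study(g=[Group(s=[subject])])
total_volumes = count_volumes(study)
print(total_volumes)
```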

  15. Type Definitions in XML Schema
     <xs:schema targetNamespace="http://www.fmri.org/schema/airsn.xsd"
                xmlns="http://www.fmri.org/schema/airsn.xsd"
                xmlns:xs="http://www.w3.org/2001/XMLSchema">
       <xs:simpleType name="Image"/>
       <xs:simpleType name="Header"/>
       <xs:complexType name="Volume">
         <xs:sequence>
           <xs:element name="img" type="Image"/>
           <xs:element name="hdr" type="Header"/>
         </xs:sequence>
       </xs:complexType>
       <xs:complexType name="Run">
         <xs:sequence minOccurs="0" maxOccurs="unbounded">
           <xs:element name="v" type="Volume"/>
         </xs:sequence>
       </xs:complexType>
     </xs:schema>

  16. Procedure Definition in VDL
     (Run snr) functional( Run r, NormAnat a, Air shrink ) {
       Run yroRun = reorientRun( r , "y" );
       Run roRun = reorientRun( yroRun , "x" );
       Volume std = roRun[0];
       Run rndr = random_select( roRun, .1 ); //10% sample
       AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, [81,3,3] );
       Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
       Volume meanRand = softmean( reslicedRndr, "y", null );
       Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, [81,3,3] );
       Volume mnQA = reslice( meanRand, mnQAAir, "o", "k" );
       Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
       Run nr = reslice_warp_run( boldNormWarp, roRun );
       Volume meanAll = strictmean( nr, "y", null );
       Volume boldMask = binarize( meanAll, "y" );
       snr = gsmoothRun( nr, boldMask, 6, 6, 6 );
     }

  17. Dataset Iteration • Functional analysis expressed in typed datasets • Iterate over each volume in a run

  18. Expanded Execution Plan • Datasets dynamically instantiated from data sources by mappers

  19. Functional MRI Execution

  20. Code Size Comparison Lines of code with different workflow encodings

  21. The Rest of the Talk • Express complex multi-step “workflows” • Perhaps 100,000s of individual tasks • Operate on heterogeneous distributed data • Different formats & access protocols • Harness many computing resources • Parallel computers &/or distributed Grids • Execute workflows reliably & efficiently • Despite diverse failure conditions • Enable reuse of data & workflows • Discovery & composition • Support many users, workflows, resources • Policy specification & enforcement • Slide annotations: VDL, XDTM; Pegasus, DAGMan, Globus; VDC; TBD

  22. Virtual Data Schema

  23. fMRI Virtual Data Queries
     Which transformations can process a “subject image”?
       Q: xsearchvdc -q tr_meta dataType subject_image input
       A: fMRIDC.AIR::align_warp
     List anonymized subject-images for young subjects:
       Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
       A: 3472-4_anonymized.img
     Show files that were derived from patient image 3472-3:
       Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
       A: 3472-3_anonymized.img 3472-3_anonymized.sliced.hdr atlas.hdr atlas.img … atlas_z.jpg 3472-3_anonymized.sliced.img
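The third query asks for everything derived from one file, which amounts to a traversal of a derivation graph. The sketch below shows that idea in Python; it is not how `xsearchvdc` is implemented, and although the file names come from the slide, the edges between them are hypothetical, since the slide lists only the resulting files.

```python
from collections import deque

# Hypothetical derivation records: each file maps to the files derived directly from it.
derived_from = {
    "3472-3_anonymized.img": [
        "3472-3_anonymized.sliced.img",
        "3472-3_anonymized.sliced.hdr",
    ],
    "3472-3_anonymized.sliced.img": ["atlas.img"],
    "3472-3_anonymized.sliced.hdr": ["atlas.hdr"],
    "atlas.img": ["atlas_z.jpg"],
}

def descendants(root, edges):
    """Breadth-first walk returning every file derived, directly or indirectly, from root."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

lineage = descendants("3472-3_anonymized.img", derived_from)
print(lineage)
```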

  24. Provenance for ATLAS DC2 (High Energy Physics)
     How much compute time was delivered?
       +-------+-----+------+
       | years | mon | year |
       +-------+-----+------+
       |   .45 |   6 | 2004 |
       |    20 |   7 | 2004 |
       |    34 |   8 | 2004 |
       |    40 |   9 | 2004 |
       |    15 |  10 | 2004 |
       |    15 |  11 | 2004 |
       |   8.9 |  12 | 2004 |
       +-------+-----+------+
     Selected statistics for one of these jobs:
       start: 2004-09-30 18:33:56
       duration: 76103.33
       pid: 6123
       exitcode: 0
       args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
       utime: 75335.86
       stime: 28.88
       minflt: 862341
       majflt: 96386
     Which Linux kernel releases were used?
     How many jobs were run on a Linux 2.4.28 kernel?
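The figures on this slide support a few simple derived quantities. The sketch below assumes that the `years` column is CPU-years delivered per month and that `duration`, `utime`, and `stime` are in seconds; neither unit is stated on the slide.

```python
from decimal import Decimal

# Values copied from the slide; "CPU-years per month" is an assumed interpretation.
cpu_years_by_month = {
    6: Decimal("0.45"), 7: Decimal("20"), 8: Decimal("34"), 9: Decimal("40"),
    10: Decimal("15"), 11: Decimal("15"), 12: Decimal("8.9"),
}
total_cpu_years = sum(cpu_years_by_month.values())
peak_month = max(cpu_years_by_month, key=cpu_years_by_month.get)

# Job record from the slide; units assumed to be seconds.
duration, utime, stime = 76103.33, 75335.86, 28.88
wall_hours = duration / 3600
cpu_fraction = (utime + stime) / duration  # share of wall time spent on the CPU

print(total_cpu_years, peak_month)
print(round(wall_hours, 1), round(cpu_fraction, 3))
```

Under those assumptions, the seven months total 133.35 CPU-years with a peak in September, and the sample job ran for about 21 hours while spending roughly 99% of that time on the CPU.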

  25. LIGO Inspiral Search Application • Describe… Inspiral workflow application is the work of Duncan Brown, Caltech, Scott Koranda, UW Milwaukee, and the LSC Inspiral group

  26. FOAM: Fast Ocean/Atmosphere Model • 250-Member Ensemble Run on TeraGrid under VDS • Remote Directory Creation for Ensemble Member 1 • Remote Directory Creation for Ensemble Member 2 • Remote Directory Creation for Ensemble Member N • FOAM run for Ensemble Member 1 • FOAM run for Ensemble Member 2 • FOAM run for Ensemble Member N • Atmos Postprocessing • Atmos Postprocessing for Ensemble Member 2 • Ocean Postprocessing for Ensemble Member 2 • Coupl Postprocessing for Ensemble Member 2 • Coupl Postprocessing for Ensemble Member 2 • Results transferred to archival storage • Work of: Rob Jacob (FOAM), Veronica Nefedova (workflow design and execution)

  27. FOAM and VDS • Climate Supercomputer and Grad student: 160 ensemble members in 75 days • TeraGrid and VDS: 250 ensemble members in 4 days • Visualization courtesy Pat Behling and Yun Liu, UW Madison
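The two data points on this slide imply a throughput ratio that is easy to work out. The short Python calculation below uses only the numbers on the slide; the variable names are illustrative, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Ensemble members completed per day in each setting, from the slide's figures.
supercomputer_rate = Fraction(160, 75)  # climate supercomputer and grad student
teragrid_rate = Fraction(250, 4)        # TeraGrid and VDS

speedup = teragrid_rate / supercomputer_rate

print(float(supercomputer_rate))  # members per day
print(float(teragrid_rate))       # members per day
print(float(speedup))             # throughput ratio
```

That is about 2.1 members per day against 62.5 members per day, a throughput gain of roughly 29 times.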

  28. Summary: Science as Workflow • (Diagram labels) Executed • Executing • Query • Executable • Not yet executable • What I Did • What I Am Doing • Edit • … • What I Want to Do • Execution environment • Schedule

  29. Acknowledgements • The Virtual Data System group is: • ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi • U of Chicago: Ben Clifford, Ian Foster, Mike Wilde, Yong Zhao • GriPhyN is supported by the NSF • Many research efforts involved in this work are supported by the US Department of Energy, Office of Science
