
Swift: implicitly parallel dataflow scripting for clusters, clouds, and multicore systems

www.ci.uchicago.edu/swift

Michael Wilde - [email protected]



When do you need Swift? Typical application: protein-ligand docking for drug screening

[Figure (B): O(10) proteins implicated in a disease × O(100K) drug candidates = ~1M compute jobs, yielding tens of fruitful candidates for wet lab & APS. Work of M. Kubal, T. A. Binkowski, and B. Roux.]

www.ci.uchicago.edu/swift www.mcs.anl.gov/exm


Numerous many-task applications

  • A. Simulation of super-cooled glass materials
  • B. Protein folding using homology-free approaches
  • C. Climate model analysis and decision making in energy policy
  • D. Simulation of RNA-protein interaction
  • E. Multiscale subsurface flow modeling
  • F. Modeling of power grid for OE applications

A-E have published science results obtained using Swift.

[Figure: protein loop modeling, courtesy A. Adhikari. T0623, 25 res., 8.2 Å to 6.3 Å (excluding tail); initial, predicted, and native structures shown.]

Language-driven: Swift parallel scripting

[Diagram: a Swift script on a submit host (login node, laptop, Linux server) drives application programs across clusters, clouds (Amazon EC2, XSEDE Wispy, …), and multicore systems, staging files through a data server; the scale axis runs from 10^15 to 10^18.]

Swift runs parallel scripts on a broad range of parallel computing resources.


Example: MODIS satellite image processing

  • Input: tiles of earth land cover (forest, ice, water, urban, etc.)
  • Output: regions with maximal specific land types

[Diagram: MODIS dataset → MODIS analysis script → 5 largest forest land-cover tiles in processed region]


Goal: Run MODIS processing pipeline in cloud

[Diagram: dataflow getLandUse (×317) → analyzeLandUse → markMap, and colorMODIS (×317) → assemble.]

The MODIS script is automatically run in parallel: each loop level can process tens to thousands of image files.


MODIS script in Swift: main data flow

foreach g, i in geos {
    land[i] = getLandUse(g, 1);
}
(topSelected, selectedTiles) =
    analyzeLandUse(land, landType, nSelect);
foreach g, i in geos {
    colorImage[i] = colorMODIS(g);
}
gridMap = markMap(topSelected);
montage =
    assemble(selectedTiles, colorImage, webDir);
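The same dataflow can be sketched in ordinary Python with `concurrent.futures` — a minimal analogy to Swift's implicit parallelism, not Swift's actual runtime. The function bodies below are placeholder stubs standing in for the real MODIS tools, and 8 tiles stand in for the 317 on the slide.

```python
# Sketch of the MODIS dataflow with Python futures. The two foreach
# loops become parallel maps; the later stages consume their results.
from concurrent.futures import ThreadPoolExecutor

def getLandUse(g, detail):
    return ("landuse", g)                      # placeholder stub

def colorMODIS(g):
    return ("color", g)                        # placeholder stub

def analyzeLandUse(land, landType, nSelect):
    return land[:nSelect], land[:nSelect]      # (topSelected, selectedTiles)

def markMap(topSelected):
    return ("map", len(topSelected))

def assemble(tiles, colorImage, webDir):
    return ("montage", len(tiles), len(colorImage))

geos = [f"tile{i}" for i in range(8)]          # stand-in for 317 MODIS tiles
with ThreadPoolExecutor() as pool:
    land  = list(pool.map(lambda g: getLandUse(g, 1), geos))  # parallel loop 1
    color = list(pool.map(colorMODIS, geos))                  # parallel loop 2
    topSelected, selectedTiles = analyzeLandUse(land, "forest", 5)
    gridMap = markMap(topSelected)
    montage = assemble(selectedTiles, color, "webDir")
print(montage)
```

Unlike Swift, Python makes the parallelism explicit (`pool.map`); in Swift the independence of the two loops is inferred from the dataflow alone.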


Parallel distributed scripting for clouds

[Diagram: a Swift script on a submit host (login node, laptop, Linux server) runs on clouds (Amazon EC2, NSF FutureGrid, Wispy, …) provisioned through Nimbus and Phantom, staging files through a data server.]

Swift runs parallel scripts on cloud resources provisioned by Nimbus's Phantom service.



Example of cloud programming: Nimbus-Phantom-Swift on FutureGrid

  • User provisions 5 nodes with Phantom

    • Phantom starts 5 VMs

    • Swift worker agents in VMs contact Swift coaster service to request work

  • Start Swift application script “MODIS”

    • Swift places application jobs on free workers

    • Workers pull input data, run app, push output data

  • 3 nodes fail and shut down

    • Jobs in progress fail, Swift retries

  • User can add more nodes with Phantom

    • User asks Phantom to increase node allocation to 12

    • New Swift worker agents register; Swift picks them up and runs more jobs in parallel

  • Workload completes

    • Science results are available on output data server

    • Worker infrastructure is available for new workloads
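The failure-and-retry step above can be sketched as a toy job queue in Python. This is hypothetical stand-in logic illustrating the idea, not the coaster protocol itself: failed jobs are re-queued and retried up to a limit.

```python
# Toy sketch of Swift-style retry when workers fail mid-run.
import queue

def run_with_retries(jobs, max_retries=3):
    """Pull jobs from a queue; a failed job is re-queued and retried."""
    q = queue.Queue()
    for j in jobs:
        q.put((j, 0))                       # (job, attempts so far)
    done, failed = [], []
    while not q.empty():
        job, attempts = q.get()
        try:
            done.append(job())              # worker runs the app job
        except Exception:
            if attempts + 1 < max_retries:
                q.put((job, attempts + 1))  # retry, as Swift does
            else:
                failed.append(job)          # give up after max_retries
    return done, failed

# A job that fails twice (e.g. its node shut down), then succeeds.
flaky = {"n": 0}
def sometimes_fails():
    flaky["n"] += 1
    if flaky["n"] < 3:
        raise RuntimeError("node shut down")
    return "result"

done, failed = run_with_retries([sometimes_fails])
print(done, failed)
```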



Programming model: all execution driven by parallel data flow

  • f() and g() are computed in parallel

  • myproc() returns r when they are done

  • This parallelism is automatic

  • Works recursively throughout the program’s call graph

(int r) myproc (int i)
{
    j = f(i);
    k = g(i);
    r = j + k;
}
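The dataflow of myproc() can be mimicked in Python with futures — a rough analogy only, since Python requires explicit submission where Swift infers the parallelism. f() and g() here are arbitrary placeholder computations.

```python
# myproc()'s dataflow with futures: f(i) and g(i) run concurrently;
# r is produced only when both are done.
from concurrent.futures import ThreadPoolExecutor

def f(i):
    return i * 2          # placeholder computation

def g(i):
    return i + 10         # placeholder computation

def myproc(i):
    with ThreadPoolExecutor() as pool:
        j = pool.submit(f, i)              # starts immediately
        k = pool.submit(g, i)              # runs concurrently with f
        return j.result() + k.result()     # r waits on both

print(myproc(5))   # 5*2 + (5+10) = 25
```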



Encapsulation enables distributed parallelism

[Diagram: a Swift app() function wraps an application program behind an interface definition; typed Swift data objects map to the files expected or produced by the application program.]

Encapsulation is the key to transparent distribution, parallelization, and automatic provenance capture


app() functions specify cmd line argument passing

To run:

    psim -s 1ubq.fas -pdb p -t 100.0 -d 25.0 >log

In Swift code:

app (PDB pg, File log) predict (Protein seq, Float t, Float dt)
{
    psim "-c" "-s" @pseq.fasta "-pdb" @pg "-t" temp "-d" dt;
}

Protein p <ext; exec="Pmap", id="1ubq">;
PDB structure;
File log;
(structure, log) = predict(p, 100., 25.);

[Diagram: the Swift app function "predict()" maps its typed inputs (a FASTA sequence file, t=100.0, dt=25.0) onto the PSim command line flags (-c, -s, -t, -d), and its outputs onto the pdb file pg and the captured log.]
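What an app() function does — marshal typed arguments into an argv and capture output files — can be sketched in Python with `subprocess`. The flag layout mirrors the psim example above; `predict_argv` and the default output filenames are illustrative helpers, not part of Swift.

```python
# Sketch of an app()-style wrapper: typed args -> command line + files.
import subprocess

def predict_argv(seq_path, t, dt, pdb_out):
    # Mirrors: psim "-c" "-s" <seq> "-pdb" <pg> "-t" <t> "-d" <dt>
    return ["psim", "-c", "-s", seq_path, "-pdb", pdb_out,
            "-t", str(t), "-d", str(dt)]

def predict(seq_path, t, dt, pdb_out="out.pdb", log_out="psim.log"):
    """Run psim, redirecting stdout to the log file (>log)."""
    with open(log_out, "w") as log:
        subprocess.run(predict_argv(seq_path, t, dt, pdb_out),
                       stdout=log, check=True)
    return pdb_out, log_out    # the (structure, log) pair

print(predict_argv("1ubq.fas", 100.0, 25.0, "p.pdb"))
```

Calling `predict(...)` would actually invoke psim; only the argv construction is exercised here.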


Large scale parallelization with simple loops

foreach sim in [1:1000] {
    (structure[sim], log[sim]) = predict(p, 100., 25.);
}
result = analyze(structure);

[Diagram: 1000 runs of the "predict" application feed Analyze(); example targets T1af7, T1b72, T1r69.]
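The foreach loop above is, in effect, a parallel map over simulation indices. A hedged Python sketch (placeholder predict; real runs would invoke the application):

```python
# The 1000-run foreach loop as a parallel map over [1:1000].
from concurrent.futures import ThreadPoolExecutor

def predict(p, t, dt, sim):
    # Stand-in for the real predict app; returns (structure, log) names.
    return f"{p}-structure-{sim}", f"{p}-log-{sim}"

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda s: predict("p", 100., 25., s),
                            range(1, 1001)))

structure = [r[0] for r in results]    # inputs to analyze()
print(len(structure))   # 1000 independent runs
```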


Nested parallel prediction loops in Swift

10 proteins x 1000 simulations x 3 rounds x 2 temps x 5 deltas = 300K tasks

Sweep()
{
    int nSim = 1000;
    int maxRounds = 3;
    Protein pSet[] <ext; exec="Protein.map">;
    float startTemp[] = [ 100.0, 200.0 ];
    float delT[] = [ 1.0, 1.5, 2.0, 5.0, 10.0 ];
    foreach p, pn in pSet {
        foreach t in startTemp {
            foreach d in delT {
                ItFix(p, nSim, maxRounds, t, d);
            }
        }
    }
}
Sweep();
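The task count on the slide can be checked by flattening the nested loops with `itertools.product` — a sketch where ItFix is a placeholder and each tuple is one independent task:

```python
# Counting the nested sweep: 10 proteins x 2 temps x 5 deltas,
# each combination running nSim simulations over maxRounds rounds.
from itertools import product

proteins  = [f"protein{i}" for i in range(10)]   # stand-ins for pSet
startTemp = [100.0, 200.0]
delT      = [1.0, 1.5, 2.0, 5.0, 10.0]
nSim, maxRounds = 1000, 3

tasks = list(product(proteins, startTemp, delT))  # one ItFix call each
total = len(tasks) * nSim * maxRounds
print(len(tasks), total)   # 100 parameter combinations, 300000 tasks
```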



Spatial normalization of functional run

Dataset-level workflow

Expanded (10 volume) workflow



Complex scripts can be well-structured programming in the large: fMRI spatial normalization script example

(Run or) reorientRun (Run ir, string direction)
{
    foreach Volume iv, i in ir.v {
        or.v[i] = reorient(iv, direction);
    }
}

(Run snr) functional (Run r, NormAnat a, Air shrink)
{
    Run yroRun = reorientRun( r, "y" );
    Run roRun = reorientRun( yroRun, "x" );
    Volume std = roRun[0];
    Run rndr = random_select( roRun, 0.1 );
    AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
    Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
    Volume meanRand = softmean( reslicedRndr, "y", "null" );
    Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
    Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
    Run nr = reslice_warp_run( boldNormWarp, roRun );
    Volume meanAll = strictmean( nr, "y", "null" );
    Volume boldMask = binarize( meanAll, "y" );
    snr = gsmoothRun( nr, boldMask, "6 6 6" );
}


Dataset mapping example: fMRI datasets

type Study {
    Group g[];
}
type Group {
    Subject s[];
}
type Subject {
    Volume anat;
    Run run[];
}
type Run {
    Volume v[];
}
type Volume {
    Image img;
    Header hdr;
}

[Diagram: a mapping function or script connects the on-disk data layout to Swift's in-memory data model.]
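The lower levels of this type hierarchy, and what a mapper does, can be sketched with Python dataclasses. The pairing of `.img`/`.hdr` files into Volumes follows the types above; the filename convention in `map_run` is hypothetical.

```python
# Sketch of dataset mapping: on-disk files -> in-memory typed objects.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Volume:
    img: str          # image file
    hdr: str          # header file

@dataclass
class Run:
    v: List[Volume] = field(default_factory=list)

@dataclass
class Subject:
    anat: Volume
    run: List[Run] = field(default_factory=list)

def map_run(paths):
    """Pair *.img/*.hdr files by stem into Volumes, forming one Run."""
    stems = sorted({p.rsplit(".", 1)[0] for p in paths})
    return Run(v=[Volume(img=s + ".img", hdr=s + ".hdr") for s in stems])

r = map_run(["vol0001.img", "vol0001.hdr", "vol0002.img", "vol0002.hdr"])
print(len(r.v))
```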


Swift can send work to multiple sites

[Diagram: Swift submits jobs through Condor or GRAM providers to local schedulers on remote campus clusters and any remote host, while a remote data provider moves files against a data server (GridFTP, scp, …).]

Simple campus Swift usage: running locally on a cluster login host



Flexible worker-node agents for execution and data transport

  • Main role is to efficiently run Swift tasks on allocated compute nodes, local and remote

  • Handles short tasks efficiently

  • Runs over Grid infrastructure: Condor, GRAM

  • Also runs with few infrastructure dependencies

  • Can optionally perform data staging

  • Provisioning can be automatic or external (manually launched or externally scripted)




Coaster architecture handles diverse environments



Coasters deployment can be automatic or manual

[Diagram: in a centralized campus cluster environment, the Swift script contacts a coaster service (auto-booted on a remote host via GRAM or SSH); the service submits coaster workers through the cluster scheduler to the compute nodes, while a data provider stages files against a data server.]

Swift can deploy coasters automatically, or the user can deploy them manually or with a script. Swift provides scripts for clouds, ad-hoc clusters, etc.


Swift: implicitly parallel dataflow scripting for clusters, clouds, and multicore systems

  • Swift is a parallel scripting system for grids, clouds and clusters

    • for loosely-coupled applications - application and utility programs linked by exchanging files

  • Swift is easy to write: simple high-level C-like functional language

    • Small Swift scripts can do large-scale work

  • Swift is easy to run: contains all services for running Grid workflow - in one Java application

    • Untar and run – acts as a self-contained Grid client

  • Swift is fast: uses efficient, scalable and flexible “Karajan” execution engine.

    • Scaling close to 1M tasks – 0.5M in live science work, and growing

  • Swift usage is growing:

    • applications in neuroscience, proteomics, molecular dynamics, biochemistry, economics, statistics, and more.

  • Try Swift! www.ci.uchicago.edu/swift and www.mcs.anl.gov/exm




Parallel Computing, Sep 2011




Acknowledgments

  • Swift is supported in part by NSF grants OCI-1148443 and PHY-636265, NIH DC08638, and the UChicago SCI Program

  • ExM is supported by the DOE Office of Science, ASCR Division

  • Structure prediction supported in part by NIH

  • The Swift team:

    • Mihael Hategan, Justin Wozniak, Ketan Maheshwari, Ben Clifford, David Kelly, Jon Monette, Allan Espinosa, Ian Foster, Dan Katz, Ioan Raicu, Sarah Kenny, Mike Wilde, Zhao Zhang, Yong Zhao

  • GPSI Science portal:

    • Mark Hereld, Tom Uram, Wenjun Wu, Mike Papka

  • Java CoG Kit used by Swift developed by:

    • Mihael Hategan, Gregor von Laszewski, and many collaborators

  • ZeptoOS

    • Kamil Iskra, Kazutomo Yoshii, and Pete Beckman

  • Scientific application collaborators and usage described in this talk:

    • U. Chicago Open Protein Simulator Group (Karl Freed, Tobin Sosnick, Glen Hocky, Joe Debartolo, Aashish Adhikari)

    • U. Chicago Radiology and Human Neuroscience Lab (Dr. S. Small)

    • SEE/CIM-EARTH: Joshua Elliott, Meredith Franklin, Todd Munson

    • ParVis and FOAM: Rob Jacob, Sheri Mickelson (Argonne); John Dennis, Matthew Woitaszek (NCAR)

    • PTMap: Yingming Zhao, Yue Chen


