Presentation Transcript

Challenges in Executing Large Parameter Sweep Studies across Widely Distributed Computing Environments

Edward Walker

Chona S. Guiang


Outline

  • Two applications

    • Zeolite structure search

    • Binding energy calculations

  • Solutions

    • Workflow

    • Submission system

    • Exported file system

    • Resources aggregated

    • Example week of work


What are Zeolites?

  • Crystalline micro-porous material

    • Structures exhibit regular arrays of channels from 0.3 to 1.5 nm

    • When the channels are filled with water (or another substance), zeolites make excellent molecular sieves for industrial processes and commercial products, e.g. as a deodorizer in cat litter.

    • The acid form also has useful catalytic properties, e.g. ZSM-5 used as a co-catalyst in crude oil refinement.

  • Basic building block is a TO4 tetrahedron

    • T = Si, Al, P, etc.

  • Prior to this study, only 180 structures were known


Scientific goals

  • Goal 1: Discover as many thermodynamically feasible zeolite structures as possible.

  • Goal 2: Populate a public database of these new structures for materials scientists to synthesize and experiment with.


Computational methodology

  • General strategy: Create a potential cell structure and solve its energy function

  • Approach:

    • Group potential cell structures with a similar template structure into space groups (230 groups in total)

    • Each cell structure in the space group is further characterized by the cell variables (a, b, c, α, β, γ)

    • Solve the multi-variable energy function for each cell structure using simulated annealing
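
For concreteness, here is a minimal Python sketch of the simulated-annealing step described above; the energy callable, the starting cell parameters, and the cooling schedule are hypothetical stand-ins, not the code actually used in the study.

import math
import random

def anneal(energy, x0, n_steps=10000, t_start=5.0, t_end=0.01, step=0.05, seed=0):
    """Minimize `energy` over the cell variables (a, b, c, alpha, beta, gamma)
    by simulated annealing with a geometric cooling schedule."""
    rng = random.Random(seed)
    x, e = list(x0), energy(x0)
    best_x, best_e = list(x), e
    cool = (t_end / t_start) ** (1.0 / n_steps)   # geometric cooling factor
    t = t_start
    for _ in range(n_steps):
        # Perturb one randomly chosen cell variable.
        cand = list(x)
        i = rng.randrange(len(cand))
        cand[i] += rng.gauss(0.0, step)
        e_cand = energy(cand)
        # Metropolis acceptance criterion.
        if e_cand < e or rng.random() < math.exp(-(e_cand - e) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = list(x), e
        t *= cool
    return best_x, best_e

In the actual workflow each candidate structure is annealed from many different random seeds (100 per structure, as described later), which corresponds to calling anneal() with a different seed value per run.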


Ligand binding energy calculations

  • Binding energy is a quantitative measure of a ligand's affinity for its receptor

  • It is important in the docking of ligands to proteins

  • Ligand binding energies can be used as the basis for scoring ligand-receptor interactions (used in structure-based drug design)


Scientific goals

  • Calculate binding energies between trypsin and benzamidine at different values of the force-field parameters

  • Compare calculated binding energy with experimental values

  • Validate force-field parameters based on comparison

  • Apply the approach to different ligand-receptor systems


Computational methodology

Binding energy is calculated from:

  • molecular dynamics (MD) simulations of the ligand "disappearing" in water

  • MD simulations of ligand extinction in the solvated ligand-protein complex

  • MD calculations were performed with Amber

  • extinction is parameterized by the coupling parameter λ

  • each job is characterized by a different λ and set of force-field parameters

[Slide diagram: thermodynamic cycle E(aq) + S(aq) ⇌ E-S(aq), with the ligand S decoupled to a non-interacting state, 0(aq), both free in water and within the solvated complex]
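
As background, the "disappearing ligand" scheme in the cycle above is usually summarized by the standard double-decoupling and thermodynamic-integration relations below; this is the textbook form, and the exact expressions and sign conventions used in this study are not given on the slide.

\Delta G_{\text{bind}} \approx \Delta G_{\text{decouple}}^{\text{solvent}} - \Delta G_{\text{decouple}}^{\text{complex}},
\qquad
\Delta G_{\text{decouple}} = \int_0^1 \left\langle \frac{\partial U(\lambda)}{\partial \lambda} \right\rangle_{\lambda} \, d\lambda

Each MD job in the sweep then supplies the ensemble average at one λ value, which is why every (λ, force-field) combination can run as an independent serial task.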


Computational usage: zeolite search

  • Ran on TeraGrid

  • Allocated over two million service units

    • one million to Purdue Condor pool

    • one million to all other HPC resources on TeraGrid


Computational usage: ligand binding energy calculation

  • Running on a departmental cluster, the TACC Condor cluster, and Lonestar

  • Each 2.5 ns simulation takes more than two weeks

  • Will require additional CPU time


Challenge 1: Hundreds of thousands of simulations need to be run

  • The energy function for every potential cell structure needs to be solved.

  • Cell structures whose energy functions yield feasible solutions are candidate zeolite structures.

  • Many sites limit the number of jobs that can be submitted to a local queue.


Challenge 2: Each simulation task is intrinsically serial

  • The simulated annealing method is intrinsically serial.

  • Each MD simulation (a function of λ and the force-field parameters) is serial and independent of the others.

  • Many TeraGrid sites prioritize parallel jobs.

  • There are limited slots for serial jobs.


Challenge 3: Wide variability in execution times

  • Zeolite search

    • Pseudo-random solution method iterates over 100 seeds.

      • potential run times of 10 minutes to 10 hours

      • Some computation may never complete.

      • It is inefficient to request a CPU for 10 hours, since the computation may not need it.

    • The computation is therefore re-factored into tasks of up to 2 hours each (see the sketch below).
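
A minimal Python sketch of that re-factoring, assuming a hypothetical run_seed() helper that anneals one structure from one seed; the 2-hour budget comes from this slide, and the chunk size of 20 seeds matches the workflow description later in the talk.

import time

def make_chunks(seeds, size=20):
    """Split the full seed list (100 seeds in this study) into sub-task chunks."""
    return [seeds[i:i + size] for i in range(0, len(seeds), size)]

def run_chunk(structure_id, seeds, run_seed, budget_s=2 * 3600):
    """Run one sub-task: anneal as many seeds as fit in the 2-hour budget.
    Returns the results plus any seeds left over for the next sub-task in the chain."""
    deadline = time.time() + budget_s
    results, remaining = [], list(seeds)
    while remaining and time.time() < deadline:
        seed = remaining.pop(0)
        results.append((seed, run_seed(structure_id, seed)))
    return results, remaining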


Challenge 3: Wide variability in execution times (continued)

  • Ligand binding energy calculation

    • Each MD simulation calculates dynamics to 2.5 ns.

    • Each 2.5 ns of simulation time takes > two weeks.

    • Convergence is not assured after 2.5 ns.


Workflow: zeolite search

  • Level 1 is an ensemble of workflows evaluating a space group

    • 230 space groups evaluated

  • Level 2 evaluates a candidate structure

    • 6000 to 30000 structures per space group

    • Main task generates solution

    • Post-processing task checks sanity of result

    • Retries up to 5 times if results are wrong

  • Level 3 solves energy function for candidate structure

    • Chain of 5 sub-tasks

    • Each sub-task computes over 20 seeds, consuming at most 2 hours of compute time
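
To show how such an ensemble can be driven by Condor DAGMan, here is a hedged Python sketch that writes the DAG file for one candidate structure; JOB, VARS, SCRIPT POST, RETRY, and PARENT/CHILD are standard DAGMan keywords, while the submit-file and script names (anneal.submit, check_result.py) are hypothetical placeholders, and the level-2 retry/post-check and level-3 chain are collapsed into one file for brevity.

def write_structure_dag(path, structure_id, n_subtasks=5, retries=5):
    """Write a DAGMan input file for one candidate structure: a chain of 5
    annealing sub-tasks, each with a post-script sanity check and up to 5 retries."""
    with open(path, "w") as dag:
        prev = None
        for k in range(n_subtasks):
            name = f"anneal_{structure_id}_{k}"
            dag.write(f"JOB {name} anneal.submit\n")
            dag.write(f'VARS {name} structure="{structure_id}" subtask="{k}"\n')
            dag.write(f"SCRIPT POST {name} check_result.py {structure_id} {k}\n")
            dag.write(f"RETRY {name} {retries}\n")
            if prev is not None:
                dag.write(f"PARENT {prev} CHILD {name}\n")
            prev = name

One such DAG per structure, submitted with condor_submit_dag, reproduces the level-2/level-3 pattern; the 230 level-1 space-group ensembles then become 230 collections of these per-structure DAGs.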


Workflow: ligand binding energy calculations

  • The Condor cluster has no maximum run-time limit.

  • Lonestar has a 24-hour run-time limit.

  • MD jobs on Lonestar therefore need to be split into segments and restarted (one approach is sketched below).

  • Workflow jobs need to be submitted to Lonestar.
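
A hedged Python sketch of the restart idea: each batch-window segment resumes Amber from the most recent restart file. The directory layout and file names (md.in, complex.prmtop, md_000.rst, initial.crd) are hypothetical conventions, not something prescribed by Amber or by the talk; only the sander command-line flags are standard.

import glob
import os
import subprocess

def run_md_segment(lam_dir):
    """Continue an Amber MD run from the most recent restart file, producing a
    new segment that fits inside a 24-hour batch window."""
    restarts = sorted(glob.glob(os.path.join(lam_dir, "md_*.rst")))
    seg = len(restarts)                       # next segment index
    coords = restarts[-1] if restarts else os.path.join(lam_dir, "initial.crd")
    out_prefix = os.path.join(lam_dir, f"md_{seg:03d}")
    cmd = [
        "sander", "-O",
        "-i", os.path.join(lam_dir, "md.in"),        # MD control input
        "-p", os.path.join(lam_dir, "complex.prmtop"),
        "-c", coords,                                # start from the last restart
        "-o", out_prefix + ".out",
        "-r", out_prefix + ".rst",                   # restart file for the next segment
        "-x", out_prefix + ".crd",
    ]
    subprocess.run(cmd, check=True)

Submitting a linear chain of such segments, each well under Lonestar's 24-hour limit, as a DAG in the same style as the zeolite workflow keeps the 2.5 ns runs moving without manual resubmission.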


Challenge 4: Application is dynamically linked

  • Amber was built with Intel shared libraries.

  • These libraries are not always installed on the back-end nodes.

  • The shared libraries can be copied to the back-end, but this wastes space ($HOME on some systems is limited); see the sketch below.
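
One possible way to deal with this, sketched in Python: collect the shared libraries reported by ldd on the submit host into a single directory that is shipped with the job and added to LD_LIBRARY_PATH on the back-end. The bundle_shared_libs name and the libs directory are hypothetical; this is not necessarily the mechanism the authors used.

import os
import shutil
import subprocess

def bundle_shared_libs(binary, dest="libs"):
    """Copy the shared libraries reported by `ldd binary` into `dest`, so they
    can be transferred with the job and picked up via LD_LIBRARY_PATH."""
    os.makedirs(dest, exist_ok=True)
    out = subprocess.run(["ldd", binary], capture_output=True, text=True, check=True)
    for line in out.stdout.splitlines():
        # Lines look like: "libsvml.so => /opt/intel/lib/libsvml.so (0x...)"
        parts = line.split("=>")
        if len(parts) == 2:
            path = parts[1].strip().split()[0]
            if os.path.isfile(path):
                shutil.copy2(path, dest)

As the slide notes, copying the libraries costs $HOME space on systems with small quotas; the exported file system (XUFS) described later is one way to avoid keeping per-site copies.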


Challenge 5: Output file needs to be monitored

  • Some MD simulations do not converge.

  • It is possible to detect non-convergence by 2 ns.

  • Terminate jobs that have not converged by 2 ns (a monitoring sketch follows below).

  • No global file system exists on some systems.
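
A minimal Python sketch of the monitoring idea, assuming a hypothetical parse_progress() helper that extracts the current simulation time (in ns) and a convergence measure from the MD output file; the 2 ns cutoff comes from the slide, but the convergence test itself is a placeholder.

import time

def should_terminate(output_file, parse_progress, cutoff_ns=2.0, tol=0.1, poll_s=300):
    """Poll an MD output file and return True once the run has passed `cutoff_ns`
    without meeting the convergence tolerance, so the job can be terminated."""
    while True:
        t_ns, metric = parse_progress(output_file)
        if metric is not None and metric < tol:
            return False        # converged: let the job run to completion
        if t_ns >= cutoff_ns:
            return True         # past 2 ns and still not converged: terminate
        time.sleep(poll_s)

Because some back-ends have no global file system, the monitor must run where the output file lives, or the file must be visible from the workstation; the XUFS mount that MyCluster sets up (described later) is one way to get that visibility.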


Submission system

  • Want to run many simple jobs/workflows of serial tasks

    • Condor DAGMan is an excellent tool for this

    • but it requires a Condor pool

  • How to form a Condor pool from HPC systems?

    • form a virtual cluster managed by Condor using MyCluster

    • submit jobs/workflows to this


MyCluster overview

  • Creates a personal virtual cluster for a user

    • from one system or from pieces of several different systems

  • Schedules user jobs onto this cluster

    • User can pick one of several workload managers

      • Condor, SGE, OpenPBS

      • Condor is the option currently available on TeraGrid

    • User submits all their jobs to this workload manager

  • Deployed on TeraGrid

    • http://www.teragrid.org/userinfo/jobs/mycluster.php


Starting MyCluster

  • Log in to a system with MyCluster installed

    • majority of TeraGrid systems

    • can be installed on other systems

  • Execute vo-login to start a session

    • you’re now in a MyCluster shell

[Slide diagram: step 1, "Create MyCluster" — a MyCluster shell is started on the user's workstation]


Configuring MyCluster

  • Personal cluster is defined using a user-specified configuration file

    • Identifies which clusters can be part of the personal cluster

    • Specifies limits on the portion of those clusters to use

  • Personal workload manager is started

    • Condor in this case

[Slide diagram: step 2, "MyCluster is configured" — the MyCluster shell on the workstation starts a personal Condor workload manager (condor_schedd, condor_collector, condor_negotiator) that spans an LSF cluster and a PBS cluster]


Submitting work to MyCluster

  • Jobs submitted to personal workload manager

    • for workflows, DAGMan jobs are submitted that in turn submit individual Condor jobs

    • DAGMan configured to submit at most 380 jobs at a time

  • The personal workload manager manages the jobs just as it would on any other cluster

[Slide diagram: step 3, "User submits DAGMan jobs" — jobs are submitted to the personal Condor workload manager on the workstation, which spans the LSF and PBS clusters]


MyCluster resource management

  • MyCluster submits parallel jobs to clusters

  • These jobs start personal workload manager daemons

    • condor_startd in this case

  • These daemons contact the personal workload manager to report that resources are available

  • MyCluster grows and shrinks the size of its virtual cluster

    • based on the number of jobs it is managing

  • The file system on the workstation may be mounted on the back-end (via XUFS)

[Slide diagram: steps 4 and 5 — MyCluster submits and manages workload-manager daemons (condor_startd) on the LSF and PBS clusters, and uses XUFS to mount the workstation's file system on the remote resources]


Example MyCluster login session

% vo-login
Enter GRID passphrase:                     (GRAM or SSH login)
Spawning on lonestar.tacc.utexas.edu
Spawning on tg-login2.ncsa.teragrid.org
Setting up VO participants ......Done
Welcome to your MyCluster/Condor environment
To shutdown environment, type "gexit"
To detach from environment, type "detach"

mycluster(gtcsh.9676)% condor_status

Name                      OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
[email protected]  LINUX  INTEL  Unclaimed  Idle      0.000   2026  [?????]
[email protected]  LINUX  IA64   Unclaimed  Idle      0.000   2026  [?????]

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting
 INTEL/LINUX         2      0        0          2        0           0
  IA64/LINUX         2      0        0          2        0           0
       Total         4      0        0          4        0           0





Project conclusion

  • Allocation completely consumed in Jan 2007.

  • Over 3 million new structures have been found.

    • http://www.hypotheticalzeolites.net/DATABASE/DEEM/index.php

  • Ligand binding energy calculations are deployed on Rodeo and Lonestar

  • They will be deployed on other TeraGrid systems

  • This work is still ongoing...


References

  • J. R. Boisseau, M. Dahan, E. Roberts, and E. Walker, “TeraGrid User Portal Ensemble Manager: Automatically Provisioning Parameter Sweeps in a Web Browser”

  • E. Walker, D. J. Earl, and M. W. Deem, “How to Run a Million Jobs in Six Months on the NSF TeraGrid”

  • http://www.usenix.org/events/worlds06/tech/prelim_papers/walker/walker.pdf

  • http://www.tacc.utexas.edu/services/userguides/mycluster/

  • Please contact [email protected]

