
Challenges in Executing Large Parameter Sweep Studies across Widely Distributed Computing Environments

Edward Walker

Chona S. Guiang



Outline

  • Two applications

    • Zeolite structure search

    • Binding energy calculations

  • Solutions

    • Workflow

    • Submission system

    • Exported file system

    • Resources aggregated

    • Example week of work

What are Zeolites?

  • Crystalline microporous materials

    • Structures exhibit regular arrays of channels from 0.3 to 1.5 nm

    • When channels are filled with water (or other substance), they make excellent molecular sieves for industrial processes and commercial products, e.g. deodorant in cat litters.

    • The acid form also has useful catalytic properties, e.g. ZSM-5 used as a co-catalyst in crude oil refinement.

  • Basic building block is a TO4 tetrahedron

    • T = Si, Al, P, etc.

  • Prior to this study, only 180 structures were known

Scientific goals

  • Goal 1: Discover as many thermodynamically feasible zeolite structures as possible.

  • Goal 2: Populate a public database so that materials scientists can synthesize and experiment with these new structures.

Computational methodology

  • General strategy: Create a potential cell structure and solve its energy function

  • Approach:

    • Group potential cell structures with a similar template structure into space groups (230 groups in total)

    • Each cell structure in the space group is further characterized by the space variables (a, b, c, α, β, γ)

    • Solve the multi-variable energy function for each cell structure using simulated annealing
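The annealing step can be illustrated with a minimal sketch. The energy function below is a hypothetical quadratic stand-in, not the study's zeolite energy function, and the cooling schedule, step size, and function names are assumptions chosen only to show the general shape of a seeded simulated annealing run over six variables.

```python
import math
import random


def toy_energy(x):
    """Hypothetical stand-in for the real energy function (minimum at all ones)."""
    return sum((xi - 1.0) ** 2 for xi in x)


def simulated_annealing(energy, x0, seed, steps=5000,
                        t_start=1.0, t_end=1e-3, step_size=0.1):
    """Minimize `energy` from `x0` using a geometric cooling schedule."""
    rng = random.Random(seed)  # each seed gives an independent, repeatable run
    x = list(x0)
    e = energy(x)
    best_x, best_e = list(x), e
    for k in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (k / (steps - 1))
        # Propose a small random move in one randomly chosen variable.
        cand = list(x)
        i = rng.randrange(len(cand))
        cand[i] += rng.uniform(-step_size, step_size)
        e_cand = energy(cand)
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if e_cand <= e or rng.random() < math.exp((e - e_cand) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = list(x), e
    return best_x, best_e


if __name__ == "__main__":
    start = [0.0] * 6  # six variables, mirroring (a, b, c, α, β, γ)
    solution, value = simulated_annealing(toy_energy, start, seed=1)
    print(solution, value)
```

Because each run depends only on its starting point and seed, runs with different seeds are independent of one another, which is what makes this kind of search a natural parameter sweep.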

Ligand binding energy calculations

  • Binding energy is a quantitative measure of a ligand's affinity for its receptor.

  • It is important in the docking of a ligand to a protein.

  • Ligand energies can be used as a basis for scoring ligand-receptor interactions (used in structure-based drug design).

Scientific goals

  • Calculate binding energies between trypsin and benzamidine at different values of the force-field parameters

  • Compare calculated binding energy with experimental values

  • Validate force-field parameters based on comparison

  • Apply the approach to other ligand-receptor systems

Computational methodology

Binding energy is calculated as follows:

  • Molecular dynamics (MD) simulations of the ligand “disappearing” in water

  • MD simulations of ligand extinction in the solvated ligand-protein complex

  • MD calculations were performed with Amber

  • Extinction is parameterized by a coupling parameter, λ

  • Each job is characterized by a different λ and set of force-field parameters
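Since each job corresponds to one value of the coupling parameter combined with one set of force-field parameters, the job list is a Cartesian product. The sketch below shows one way to enumerate it; the parameter names `charge_scale` and `radius_scale`, their values, and the eleven evenly spaced coupling values are all hypothetical and are not taken from the study.

```python
import itertools


def sweep_jobs(lambdas, force_field_grid):
    """Return one job description per (lambda, force-field parameter set) pair."""
    names = sorted(force_field_grid)
    jobs = []
    for lam in lambdas:
        # Every combination of force-field values is paired with this lambda.
        for values in itertools.product(*(force_field_grid[n] for n in names)):
            job = {"lambda": lam}
            job.update(zip(names, values))
            jobs.append(job)
    return jobs


if __name__ == "__main__":
    lambdas = [i / 10 for i in range(11)]  # assumed spacing of coupling values
    grid = {
        "charge_scale": [0.9, 1.0, 1.1],   # hypothetical parameter
        "radius_scale": [1.0, 1.05],       # hypothetical parameter
    }
    jobs = sweep_jobs(lambdas, grid)
    print(len(jobs), jobs[0])
```

Each dictionary in the resulting list describes one independent serial job, so the size of the sweep grows multiplicatively with every parameter added.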





[Figure: E(aq) + S(aq)]


Computational Usage: zeolite search

  • Ran on TeraGrid

  • Allocated over two million service units

    • one million to Purdue Condor pool

    • one million to all other HPC resources on TeraGrid

Computational usage: ligand binding energy calculation

  • Running on a departmental cluster, the TACC Condor cluster, and Lonestar

  • Each 2.5 ns simulation takes more than two weeks

  • Will require additional CPU time

Challenge 1: Hundreds of thousands of simulations need to be run

  • The energy function for every potential cell structure needs to be solved.

  • A cell structure whose energy function has a feasible solution is a feasible structure.

  • Many sites have a limit to the number of jobs that can be submitted to a local queue.

Challenge 2: Each simulation task is intrinsically serial

  • Simulated annealing method is intrinsically serial.

  • Each MD simulation (a function of λ and the force-field parameters) is serial and independent.

  • Many TeraGrid sites prioritize parallel jobs

  • There are limited slots for serial jobs.

Challenge 3: Wide variability in execution times

  • Zeolite search

    • Pseudo-random solution method iterates over 100 seeds.

      • potential run times of 10 minutes to 10 hours

      • Some computation may never complete.

      • It is inefficient to request a CPU for 10 hours, since the computation may not need all of it.

    • Computation is refactored into tasks of up to 2 hours.

Challenge 3: Wide variability in execution times

  • Ligand binding energy calculation

    • Each MD simulation calculates dynamics to 2.5 ns.

    • Each 2.5 ns of simulation time takes more than two weeks.

    • Convergence is not assured after 2.5 ns.

Workflow: zeolite search

  • Level 1 is an ensemble of workflows evaluating a space group

    • 230 space groups evaluated

  • Level 2 evaluates a candidate structure

    • 6000 to 30000 structures per space group

    • Main task generates solution

    • Post-processing task checks sanity of result

    • Retries up to 5 times if results are wrong

  • Level 3 solves energy function for candidate structure

    • Chain of 5 sub-tasks

    • Each sub-task computes over 20 seeds, consuming at most 2 hours of compute time
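The Level 3 chain described above can be sketched as a small generator of DAGMan input. This is only an illustration under stated assumptions: the submit file name `anneal.sub`, the node naming scheme, and the macro names `first_seed` and `last_seed` are hypothetical, and the Level 2 sanity check and retries are not modeled. Only the `JOB`, `VARS`, and `PARENT ... CHILD` DAGMan commands are used.

```python
def level3_dag(structure_id, n_seeds=100, seeds_per_task=20,
               submit_file="anneal.sub"):
    """Return DAGMan text that chains the seed-range sub-tasks for one structure."""
    lines = []
    names = []
    for k, first in enumerate(range(0, n_seeds, seeds_per_task)):
        last = min(first + seeds_per_task, n_seeds) - 1
        name = f"s{structure_id}_t{k}"  # hypothetical node naming scheme
        names.append(name)
        lines.append(f"JOB {name} {submit_file}")
        # Pass the seed range to the submit file as DAGMan macros.
        lines.append(f'VARS {name} first_seed="{first}" last_seed="{last}"')
    # Each sub-task runs only after the previous one finishes.
    for parent, child in zip(names, names[1:]):
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(level3_dag(7))
```

With the defaults, 100 seeds split into sub-tasks of 20 seeds give the chain of 5 sub-tasks, each small enough to fit a short queue request.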

Workflow: ligand binding energy calculations

  • The Condor cluster has no maximum run time limit.

  • Lonestar has a 24-hour run time limit.

  • MD jobs on Lonestar therefore need to be restarted.

  • Workflow jobs need to be submitted to Lonestar.
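A rough sense of how many restarts the run time limit implies comes from simple arithmetic. The helper below is a hypothetical sketch: it assumes a run's total wall-clock time is known in advance and ignores queue waits and restart overhead.

```python
import math


def restart_segments(total_hours, limit_hours=24.0):
    """Number of queue submissions needed to cover total_hours of wall-clock time."""
    if total_hours <= 0 or limit_hours <= 0:
        raise ValueError("hours must be positive")
    return math.ceil(total_hours / limit_hours)


if __name__ == "__main__":
    two_weeks = 14 * 24  # a simulation that takes exactly two weeks
    print(restart_segments(two_weeks))
```

A simulation that takes two weeks of wall-clock time needs at least 14 consecutive 24-hour submissions, and anything longer needs more, which is why the restarts are worth expressing as a workflow.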

Challenge 4: Application is dynamically linked

  • Amber was built with Intel shared libraries.

  • These libraries are not installed on the backend.

  • The shared libraries can be copied to the backend, but this wastes space ($HOME on some systems is limited).

Challenge 5: Output file needs to be monitored

  • Some MD simulations do not converge.

  • Non-convergence can be detected at 2 ns.

  • Jobs that have not converged by 2 ns should be terminated.

  • No global file system exists on some systems.
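The monitoring decision can be sketched as a small function applied to whatever values have been read from the output file so far. The slides do not state the convergence criterion, so the one used here (comparing the means of the two most recent windows before the 2 ns mark) is purely an assumption, as are the function name, the window length, and the tolerance.

```python
from statistics import mean


def should_terminate(samples, check_ns=2.0, window_ns=0.5, tol=0.5):
    """Decide whether a run should be stopped as non-converged.

    samples: list of (time_ns, value) pairs read from the job's output so far.
    Returns True once the run has reached check_ns and the two most recent
    windows before check_ns differ by more than tol (assumed criterion).
    """
    if not samples or samples[-1][0] < check_ns:
        return False  # too early to judge
    recent = [v for t, v in samples
              if check_ns - window_ns <= t <= check_ns]
    earlier = [v for t, v in samples
               if check_ns - 2 * window_ns <= t < check_ns - window_ns]
    if not recent or not earlier:
        return False  # not enough data to compare
    return abs(mean(recent) - mean(earlier)) > tol


if __name__ == "__main__":
    flat = [(i / 10, 1.0) for i in range(1, 21)]        # steady value
    drifting = [(i / 10, float(i)) for i in range(1, 21)]  # still changing
    print(should_terminate(flat), should_terminate(drifting))
```

A check like this has to run somewhere that can read the job's output while the job is still running, which is why the lack of a global file system on some systems matters.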

Submission system

  • Want to run many simple jobs/workflows of serial tasks

    • Condor DAGMan is an excellent tool for this

    • requires a Condor pool

  • How to form a Condor pool from HPC systems?

    • form a virtual cluster managed by Condor using MyCluster

    • submit jobs/workflows to this

MyCluster overview

  • Creates a personal virtual cluster for a user

    • from one system or from pieces of several systems

  • Schedules user jobs onto this cluster

    • User can pick one of several workload managers

      • Condor, SGE, OpenPBS

      • Condor currently on TeraGrid

    • User submits all their jobs to this workload manager

  • Deployed on TeraGrid


Starting MyCluster

  • Log in to a system with MyCluster installed

    • majority of TeraGrid systems

    • can be installed on other systems

  • Execute vo-login to start a session

    • you’re now in a MyCluster shell

[Diagram: 1. Create MyCluster (MyCluster shell)]


Configuring MyCluster

  • Personal cluster is defined using a user-specified configuration file

    • Identifies which clusters can be part of personal cluster

    • Specifies limits on portion of those clusters to use

  • Personal workload manager is started

    • Condor in this case

[Diagram: 1. Create MyCluster (MyCluster shell); 2. MyCluster is configured]







Submitting Work to MyCluster

  • Jobs are submitted to the personal workload manager

    • for workflows, DAGMan jobs are submitted, which in turn submit individual Condor jobs

    • DAGMan is configured to submit at most 380 jobs at a time

  • The personal workload manager manages jobs as it would for any other cluster

[Diagram: 1. Create MyCluster (MyCluster shell); 2. MyCluster is configured; 3. User submits DAGMan jobs]






MyCluster Resource Management

  • MyCluster submits parallel jobs to clusters

  • These jobs start personal workload manager daemons

    • condor_startd in this case

  • These daemons contact the personal workload manager saying they have resources available

  • MyCluster grows and shrinks the size of its virtual cluster

    • based on the number of jobs it is managing

  • File system on workstation may be mounted on backend


[Diagram: 1. Create MyCluster (MyCluster shell); 2. MyCluster is configured; 3. User submits DAGMan jobs; 4. MyCluster submits and manages WM daemons; 5. MyCluster uses XUFS to mount WS file system on remote resources]

Example MyCluster login session

% vo-login
Enter GRID passphrase:    (GRAM or SSH login)
Spawning on
Spawning on
Setting up VO participants ......Done
Welcome to your MyCluster/Condor environment
To shutdown environment, type "gexit"
To detach from environment, type "detach"
mycluster(gtcsh.9676)% condor_status

Name           OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
32020@compute  LINUX  INTEL  Unclaimed  Idle      0.000   2026  [?????]
32021@tg-c383  LINUX  IA64   Unclaimed  Idle      0.000   2026  [?????]

             Machines  Owner  Claimed  Unclaimed  Matched  Preempting
INTEL/LINUX         2      0        0          2        0           0
 IA64/LINUX         2      0        0          2        0           0
      Total         4      0        0          4        0           0

[Figure: Systems aggregated with MyCluster]

[Figure: Expanding and shrinking Condor cluster created with MyCluster (1 week period)]

[Figure: Running and pending jobs in a personal cluster using MyCluster (1 week period)]

Project Conclusion

  • Allocation completely consumed in Jan 2007.

  • Over 3 million new structures have been found.


  • Ligand binding energy calculations are deployed on Rodeo and Lonestar

  • will be deployed on other TeraGrid systems

  • still ongoing...





