

1. Memory Efficient Pairwise Genome Alignment Algorithm – A Small-Scale Application with Grid Potential
Chao “Bill” Xie, Victor Bolet, Art Vandenberg
Georgia State University, Atlanta, GA 30303, USA
February 22-23, 2006, SURA, Washington DC

2. Introduction
• A small-scale application is studied in a grid environment
• Performance is compared across the shared memory, cluster, and grid environments
• A pairwise sequence alignment program is chosen as the small-scale application
• The basic algorithm is modified into a memory-efficient algorithm
• The parallel implementation of pairwise sequence alignment is studied in the different environments
• Based on work done by Nova Ahmed, NMI Integration Testbed

3. Specification of the Distributed Environments
• The shared memory environment is an SGI Origin 2000 machine with 24 CPUs
• The cluster environment at UAB was a Beowulf cluster with 8 homogeneous nodes, each node having four 550 MHz Pentium III processors and 512 MB of RAM
• The grid environment is the same Beowulf cluster with the Globus Toolkit software layer on top of it
• Summer 2005: USC HPC resources were used

4. The Basic Pairwise Sequence Alignment Algorithm
[Figure: Similarity matrix for Sequence X (GAGAAGAGAC) aligned against Sequence Y, with the zero border row and column and match scores filled in]
• A two-dimensional array, the similarity matrix, scores the two sequences against each other
• A match or a mismatch is calculated for each position in the pair of sequences being matched
• Dynamic programming is used (a minimal sketch of the recurrence follows this slide)
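The slides do not include source code, so the following is a minimal C sketch of the similarity-matrix fill, assuming a Smith-Waterman-style local alignment scoring (match = +1, mismatch = -1, gap = -1, floored at 0, consistent with the zero border shown in the slide's matrix). The sequences and score values are illustrative assumptions, not taken from the original ar7 program.

```c
/* Minimal sketch of the basic pairwise alignment DP (slide 4).
 * Assumed scoring: match = +1, mismatch = -1, gap = -1, floor at 0,
 * matching the zero border row/column shown in the slide's matrix. */
#include <stdio.h>
#include <string.h>

#define MAXLEN 1024

static int max4(int a, int b, int c, int d) {
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

int main(void) {
    const char *x = "GAGAAGAGAC";  /* Sequence X from the slide */
    const char *y = "AAGA";        /* Sequence Y is illustrative */
    int n = (int)strlen(x), m = (int)strlen(y);
    static int H[MAXLEN + 1][MAXLEN + 1];  /* similarity matrix, zeroed */

    /* Row 0 and column 0 stay 0, as in the slide's matrix. */
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            int s = (y[i - 1] == x[j - 1]) ? 1 : -1;  /* match/mismatch */
            H[i][j] = max4(0,
                           H[i - 1][j - 1] + s,  /* diagonal step       */
                           H[i - 1][j] - 1,      /* gap in sequence X   */
                           H[i][j - 1] - 1);     /* gap in sequence Y   */
        }
    }

    /* Print the filled similarity matrix. */
    for (int i = 0; i <= m; i++) {
        for (int j = 0; j <= n; j++) printf("%2d ", H[i][j]);
        printf("\n");
    }
    return 0;
}
```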

5. The Reduced Memory Algorithm
[Figure: Processors P1 through P5 over time; each Pi computes its part of the matrix and sends its edge value to Pi+1 as computation completes]
• Keep only the nonzero elements of the matrix
• Memory is dynamically allocated as required
• A new data structure is used for efficiency
The Parallel Method (see the sketch after this slide)
• The genome sequences are divided among the processors
• The similarity matrix is divided among the processors
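The original ar7 MPI program is not shown on the slides; the following is a hypothetical MPI sketch of the pipeline in the figure, assuming that sequence X and the columns of the similarity matrix are divided among ranks, with each rank forwarding its right-edge value to the next rank after every row. The reduced-memory data structure (keeping only nonzero elements) is simplified here to a two-row buffer; all names, sequences, and scoring values are assumptions.

```c
/* Hypothetical MPI sketch of the parallel method on slide 5: the columns
 * of the similarity matrix are divided among processors, and each rank
 * forwards its right edge value to the next rank after every row, forming
 * the P1 -> P2 -> ... pipeline shown in the figure. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const char *x = "GAGAAGAGAC";  /* sequence X, split among ranks */
    const char *y = "AAGA";        /* sequence Y, kept whole on each rank */
    int n = (int)strlen(x), m = (int)strlen(y);

    /* Each rank owns a contiguous block of columns of the matrix. */
    int lo = rank * n / nproc, hi = (rank + 1) * n / nproc;
    int w = hi - lo;

    /* Two-row buffers: only the previous row is kept. This stands in for
     * the reduced-memory structure; no full m x n matrix is ever stored. */
    int *prev = calloc((size_t)w + 1, sizeof(int));
    int *cur  = calloc((size_t)w + 1, sizeof(int));

    for (int i = 1; i <= m; i++) {
        int edge = 0;  /* H[i][lo], the zero border for rank 0 */
        if (rank > 0)
            MPI_Recv(&edge, 1, MPI_INT, rank - 1, i, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        cur[0] = edge;
        for (int j = 1; j <= w; j++) {
            int s = (y[i - 1] == x[lo + j - 1]) ? 1 : -1;
            int v = prev[j - 1] + s;               /* diagonal */
            if (prev[j] - 1 > v) v = prev[j] - 1;  /* gap      */
            if (cur[j - 1] - 1 > v) v = cur[j - 1] - 1;
            cur[j] = v > 0 ? v : 0;                /* floor at 0 */
        }
        if (rank < nproc - 1)  /* forward right edge to the next rank */
            MPI_Send(&cur[w], 1, MPI_INT, rank + 1, i, MPI_COMM_WORLD);
        memcpy(prev, cur, ((size_t)w + 1) * sizeof(int));
    }

    printf("myid = %d finished\n", rank);
    free(prev);
    free(cur);
    MPI_Finalize();
    return 0;
}
```

Run with four ranks (e.g., mpicc sketch.c && mpirun -np 4 ./a.out), which mirrors the count=4 job submitted via RSL on slide 10.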

6. Results
[Figure (a): Computation time in the Cluster and Grid-enabled Cluster environments]
[Figure (b): Computation time in the Shared Memory, Cluster, and Grid-enabled Cluster environments]

7. Results
[Figure (a): Comparison of speedup in the Shared Memory, Cluster, and Grid-enabled Cluster environments]
[Figure (b): Comparison of speedup in the Cluster and Grid-enabled Cluster environments]

8. UAB Multi-Cluster
[Figure: Comparison of multi-cluster grid environments: (a) computation time, (b) speedup]

9. Running Example, 04.08.2004 (per Nova Ahmed, UAB Beowulf cluster: Medusa)
These are the steps for running the genome alignment program on the grid. First, a sample run aligning a very small genome sequence is tested.
• Genome sequence files: t1.txt, t2.txt
• Executable (object file): ar7

10. grid-proxy-init, RSL Script, globusrun
1. First, grid-proxy-init is run to obtain the grid proxy certificate:
   Your identity: /O=Grid/OU=UAB Grid/CN=Nova Ahmed
   Enter GRID pass phrase for this identity:
   Creating proxy ....................................................... Done
   Your proxy is valid until: Fri Apr 9 00:54:24 2004
2. Then the RSL script genome.rsl is created to describe the job:
   & (count=4)
     (executable=/home/nova/ar7)
     (jobtype=mpi)
3. Finally, the program is run on the grid using the globusrun command:
   globusrun -s -r medusa.lab.ac.uab.edu -f ./genome.rsl

11. Output
NOVA1 MyId = 3 NumProc = 4
[3 : 0 ->11 1] [3 : 0 ->21 1] [3 : 1 ->2 2] [3 : 2 ->11 1] [3 : 2 ->31 1] [3 : 3 ->1 1] [3 : 4 ->1 1] [3 : 4 ->12 2] [3 : 4 ->21 1] [3 : 5 ->2 2] [3 : 5 ->12 2] [3 : 5 ->23 3] [3 : 5 ->31 1]
myid = 3 finished
NOVA1 MyId = 0 NumProc = 4
tgatggaggt gatagg
[0 : 0 ->11 1] [0 : 2 ->1 1] [0 : 4 ->11 1] [0 : 5 ->11 1]
Elapsed time is =0.014624
myid = 0 finished
//----------------------
Running the program with longer genome sequences: a1-1000, a1-2000, a1-3000 compared with a2-1000, a2-2000, a2-3000.
Output:
------------------------------------
NOVA1 MyId = 1 NumProc = 4
[1 : 1 ->2 2] [1 : 2 ->13 3] [1 : 3 ->1 1] [1 : 3 ->11 1]
myid = 1 finished
NOVA1 MyId = 2 NumProc = 4
[2 : 0 ->1 1] [2 : 0 ->11 1] [2 : 2 ->1 1] [2 : 3 ->2 2] [2 : 4 ->2 2] [2 : 4 ->13 3] [2 : 5 ->1 1] [2 : 5 ->13 3]
myid = 2 finished

12. USC HPC – Summer 2005
[Figure: Computation time in the Cluster and Grid environments, varying the number of processors: (a) small set of sequences, (b) long set of sequences]

13. USC HPC – Summer 2005
[Figure: Speedup in the Cluster and Grid environments: (a) small set of sequences, (b) long set of sequences]

14. Conclusion
• The grid environment shows performance similar to the cluster environment
• The grid software layer adds little overhead
• The shared memory environment has better speedup than the cluster and grid environments
• The shared memory environment, however, hits its memory limit when computing large genome sequences
• Small-scale applications (as well as large-scale ones) can run efficiently on a grid
• Distributed applications with minimal communication among processors will benefit in a grid environment, perhaps even across multiple clusters

15. Future Work
• Additional work in a SURAgrid environment that includes multiple clusters
• Test data that provides a more computation-intensive challenge for grid environments
• Adapt the application to the grid environment so that it uses less inter-process communication

16. Acknowledgements
• This material is based in part upon work supported by:
• National Science Foundation under Grant No. ANI-0123937, NMI Integration Testbed Program. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF)
• SURA Grant SURA-2005-305, SURAgrid Application Development & Documentation
• Thanks to:
• Nova Ahmed, currently in the Georgia Tech Computer Science PhD program, for the original work carried out as part of the NMI Integration Testbed Program
• John-Paul Robinson and the University of Alabama at Birmingham for access to the medusa cluster
• Jim Cotillier and Shelley Henderson, University of Southern California, for access to HPC resources
• Chao “Bill” Xie, Georgia State Computer Science PhD program, for continuing Nova Ahmed’s work
• Victor Bolet, Georgia State Information Systems & Technology Advanced Campus Services unit, for support of Georgia State’s SURAgrid nodes
• John McGee, RENCI.org, for discussions of the approach using Globus
