
Large-Scale Profile-HMM on the Grid Laurent Falquet Swiss Institute of Bioinformatics

Large-Scale Profile-HMM on the Grid. Laurent Falquet, Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland. Borrowed from Heinz Stockinger. June 17, 2006. Outline: computing-intensive sequence alignment; mapping the problem to the Grid; prototype implementation & results.


Presentation Transcript


  1. Large-Scale Profile-HMM on the Grid Laurent Falquet Swiss Institute of Bioinformatics CH-1015 Lausanne, Switzerland Borrowed from Heinz Stockinger June 17, 2006

  2. Large-Scale Profile-HMM on the Grid (Laurent.Falquet@isb-sib.ch)
     Outline
     • Computing-intensive sequence alignment
     • Mapping the problem to the Grid
     • Prototype implementation & results

  3. Introduction
     • Sequence alignment based on profile-HMMs: popular but CPU intensive
     • The problem can easily be parallelised:
         • Embarrassingly parallel problem domain
         • Ideal for a Data Grid with lots of CPU power

  4. HMM in Biology
     • Originally, HMMs were used mainly in speech recognition
     • In biology, they are used for sequence alignment and database search
     • Packages like HMMER, PFTOOLS, SAM, etc.
     • Profile-HMMs are stored in databases like Prosite, PFAM, SMART, etc.

  5. HMM on the Grid
     Example: hmmsearch model.hmm database.seq
     • Gridification of the scenario:
         • Input dataset needs to be split (pre-processing)
         • Workload generation
         • Grid job submission
         • Remote execution (CPU intensive part)
         • Merging of results
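The first and last gridification steps (splitting the input, merging the results) can be sketched as follows. This is a minimal illustration, not the prototype's code: it assumes the sequence database is in FASTA format (the slides do not state the format), and the function names `split_fasta` and `merge_results` are hypothetical.

```python
def split_fasta(text, seqs_per_chunk):
    """Split FASTA text into chunks of at most seqs_per_chunk records.

    Assumption: each record starts with a '>' header line (FASTA).
    """
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])        # header line opens a new record
        elif line.strip() and records:
            records[-1].append(line)      # sequence line of the current record
    chunks = []
    for start in range(0, len(records), seqs_per_chunk):
        group = records[start:start + seqs_per_chunk]
        chunks.append("\n".join("\n".join(rec) for rec in group) + "\n")
    return chunks


def merge_results(partial_outputs):
    """Concatenate per-job outputs in the order the jobs were created."""
    return "".join(partial_outputs)


example = ">s1\nACGT\n>s2\nGG\nTT\n>s3\nAAAA\n"
chunks = split_fasta(example, seqs_per_chunk=2)
print(len(chunks))  # 2
```

Each chunk would become the input of one Grid job; merging here is plain concatenation, which is enough only if the per-job outputs need no re-sorting.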

  6. HMM on the Grid
     [Diagram: Local Desktop, Grid, and Grid Storage Element (SE)]
     Local Desktop:
         1. Data pre-processing
         2. Creation of job descriptors
         3. Job submission
         4. -> GRID
         5. Merging of results
     Grid:
         4. Remote execution on computing elements (CE)

  7. Chunks
     [Diagram: splitting the input files into chunks]
     • Input files: Profiles 1..nnn and Seqs 1..zzzzz
     • wg on the local site splits them into chunks and stores the files on the Grid Storage Element (SE):
         • Profiles 1..100, Profiles 101..200, …, Profiles kkk..nnn
         • Seqs 1..10’000, Seqs 10’001..20’000, …, Seqs yyyyy..zzzzz
     • wg on the remote site gets files from the SE (e.g. Seqs 1..10’000 and Profiles 1..100) and runs hmmsearch for Profile 1, Profile 2, Profile 3, …, Profile 100
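The chunking in the diagram can be sketched as a small calculation. The sketch below assumes that every profile chunk is paired with every sequence chunk to form a work unit (the diagram shows one such pair but does not state the pairing rule), and it uses the chunk sizes from the diagram together with the dataset sizes from the Benchmarks slide. The function names are hypothetical, and the chunk sizes used in the actual benchmark runs are not given in the slides.

```python
def chunk_ranges(total, size):
    """Return 1-based inclusive (first, last) ranges covering 1..total."""
    return [(first, min(first + size - 1, total))
            for first in range(1, total + 1, size)]


def work_units(n_profiles, n_seqs, profiles_per_chunk, seqs_per_chunk):
    """Pair every profile chunk with every sequence chunk (assumed rule)."""
    return [(p, s)
            for p in chunk_ranges(n_profiles, profiles_per_chunk)
            for s in chunk_ranges(n_seqs, seqs_per_chunk)]


# 7,868 profiles and 10,923 sequences; 100 profiles and 10,000 sequences per chunk.
units = work_units(7868, 10923, 100, 10000)
print(len(units))   # 158
print(units[0])     # ((1, 100), (1, 10000))
print(units[-1])    # ((7801, 7868), (10001, 10923))
```

With these sizes there are 79 profile chunks and 2 sequence chunks, so 158 work units; smaller chunks give more, shorter units.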

  8. Prototype Implementation
     • Prototype code implemented in C++ to wrap hmmsearch as well as pfsearch
     • Other applications are possible, too
     • Client-side code:
         • Uses EGEE as the main Grid middleware
         • Within Swiss BioGrid: ARC/NorduGrid
         • The tool also runs on LSF (Vital-IT cluster)
         • Globus needs to be installed on the local machine
         • Takes care of job creation, remote execution, resubmission, etc.

  9. Usage Example for “wg” (workload generator)
     Usage: wg <hmmFile> <seqFile> [options]
       <hmmFile>  path of the profile-HMM file containing 1 or more profiles.
       <seqFile>  path of the sequence file containing 1 or more sequences.
     The following options are available:
       -p <no>  number of parallel jobs submitted to the Grid.
       -r       used for a remote execution of the program.
       -h       print this help message.
       -s       retrieve all status information of submitted jobs.
       -v       verbose: print debug messages.
       -O       do not retrieve the job output but only display status.
       -P       use pfsearch (default).
       -H       use hmmsearch.
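To make the option set concrete, here is a hypothetical Python mirror of the `wg` command line built with `argparse`. It is not the prototype's code (which is written in C++); the attribute names are invented for illustration, and treating `-P` and `-H` as mutually exclusive is an assumption, since the usage text does not say what happens when both are given.

```python
import argparse


def build_parser():
    """Hypothetical argparse mirror of the documented wg options."""
    # add_help=False so that -h can be declared explicitly, as in the usage text.
    parser = argparse.ArgumentParser(prog="wg", add_help=False)
    parser.add_argument("hmmFile", help="profile-HMM file with 1 or more profiles")
    parser.add_argument("seqFile", help="sequence file with 1 or more sequences")
    parser.add_argument("-p", dest="parallel_jobs", type=int, metavar="<no>",
                        help="number of parallel jobs submitted to the Grid")
    parser.add_argument("-r", dest="remote", action="store_true",
                        help="used for a remote execution of the program")
    parser.add_argument("-h", action="help", help="print this help message")
    parser.add_argument("-s", dest="status", action="store_true",
                        help="retrieve all status information of submitted jobs")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="verbose: print debug messages")
    parser.add_argument("-O", dest="status_only", action="store_true",
                        help="do not retrieve the job output, only display status")
    # Assumption: the two search tools exclude each other.
    tool = parser.add_mutually_exclusive_group()
    tool.add_argument("-P", dest="tool", action="store_const", const="pfsearch",
                      help="use pfsearch (default)")
    tool.add_argument("-H", dest="tool", action="store_const", const="hmmsearch",
                      help="use hmmsearch")
    parser.set_defaults(tool="pfsearch")
    return parser


args = build_parser().parse_args(["model.hmm", "database.seq", "-p", "100", "-H"])
print(args.tool, args.parallel_jobs)  # hmmsearch 100
```

The parsed call corresponds to `wg model.hmm database.seq -p 100 -H`, that is, an hmmsearch run split into 100 parallel jobs.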

  10. Benchmarks
     • Several benchmarks to check for:
         • Functionality (correctness)
         • Performance
         • Performance “prediction”
     • Preliminary benchmark dataset:
         • Profile-HMM DB with 7,868 entries (619 MB)
         • Sequence DB with 10,923 entries (7.4 MB)

  11. Preliminary Results
     [Plot: hmmsearch execution time in hours vs. number of processors (parallel jobs)]

  12. Distribution of Job Execution Time
     [Plot: distribution of job execution time for 100 parallel jobs]
     • Not really “high performant”
     • Need to get rid of peaks

  13. Run Time Sensitive (RTS) Scheduling/Execution
     [Diagram: Task Server (task 1, task 2, ..., task n), Storage Element, and Worker Node with running job]
         1. Get task (worker node asks the task server)
         2. Return task URL (task server answers the worker node)
         3. Retrieve task (worker node fetches it from the Storage Element)
     • “Task done” is reported when a task has finished
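Why pulling tasks at run time removes the peaks can be illustrated with a toy simulation. It is a sketch under stated assumptions, not the RTS algorithm itself: all tasks cost the same, worker speeds are fixed, the task server is reduced to a shared queue, and the transfers from the Storage Element are ignored. The function names and the numbers (400 tasks, 10 workers, one of them ten times slower) are made up for illustration.

```python
import heapq


def static_makespan(task_costs, speeds):
    """Tasks are dealt round-robin to workers before execution starts."""
    finish = [0.0] * len(speeds)
    for i, cost in enumerate(task_costs):
        worker = i % len(speeds)
        finish[worker] += cost / speeds[worker]
    return max(finish)


def pull_makespan(task_costs, speeds):
    """Each worker fetches the next task only when it becomes idle."""
    # Heap of (time at which the worker becomes idle, worker index).
    idle = [(0.0, worker) for worker in range(len(speeds))]
    heapq.heapify(idle)
    end = 0.0
    for cost in task_costs:
        now, worker = heapq.heappop(idle)     # earliest idle worker takes the task
        done = now + cost / speeds[worker]
        end = max(end, done)
        heapq.heappush(idle, (done, worker))
    return end


tasks = [1.0] * 400               # hypothetical: 400 equal work units
speeds = [1.0] * 9 + [0.1]        # hypothetical: nine fast nodes, one slow node
print(static_makespan(tasks, speeds))
print(pull_makespan(tasks, speeds))
```

In this toy setup the static split is dominated by the slow node, which has to work through its full share, while in the pull model the slow node simply ends up taking fewer tasks.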

  14. Execution time of RTS-Algo.
     [Plot: time in hours vs. sequence number of work unit (0 to 400)]
     • Overall performance: 2.5 hours
     • Similar to performance on local cluster

  15. Execution time on LSF
     [Plot: execution time, axis ticks at 1 hour, 2 hours, 3 hours]

  16. Comparison

  17. Conclusion
     • Gridification works well for the selected problem
     • High performance is achieved via the run time sensitive (RTS) scheduling algorithm
     • Heterogeneous Grids can achieve performance comparable to homogeneous clusters
     Work done by Heinz Stockinger in co-operation with Marco Pagni, Lorenzo Cerutti and Laurent Falquet.
     The EMBRACE project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2004-512092.
