HSP-HMMER vs MPI-HMMER & parallelization of PSI-BLAST

Christian Halloy (halloy@jics.utk.edu), April 23, 2009

HMMER, PFAM, and PSI-BLAST
• If you BLAST a protein sequence (or a translated nucleotide sequence), BLAST will look for known domains in the query sequence.
• BLAST can also be used to map annotations from one organism to another or to look for common genes in two related species.
• HmmerPfam compares one or more sequences to a database of profile hidden Markov models, such as the Pfam library, in order to identify known domains within the sequences.
• HMMER’s hmmpfam code searches an HMM-PFAM database for matches to a query sequence.
(Figure, top: part of an alignment for the Globin family from the Pfam website.)
(Figure, left: molecular rendering of the Luciferase protein.)

preliminaries
• first meeting, May 15, 2008:
  - need to develop Computational Biology using HPC at ORNL
  - Jouline’s group uses over 20 common computational biology tools
• initial approach:
  - need to install at least two major tools to run on the Cray XT4: HMMER and PSI-BLAST
• status in May 2008:
  - public MPI-HMMER scales acceptably only up to ~256 cores
  - public (serial) PSI-BLAST had never been run on Jaguar/Kraken

Jaguar / Kraken - Cray XT4 and XT5
(values are given as Jaguar / Kraken)
• Cores: XT4 31328 / 18048; XT5 150152 / 66048
• Peak performance: XT4 263 TF / 166 TF; XT5 1381 TF / 607 TF
• 2 GB Mem / 1 GB Mem
• Quad core AMD Opteron nodes
• Compute nodes + Service nodes (login, I/O)

MPI-HMMER
• HMMER (Hidden Markov Model): program by the HHMI group at Washington University School of Medicine
• MPI-HMMER: Wayne State University and SUNY Buffalo

MPI-HMMER (cont’d)
Analyzing MPI-HMMER: should we try to improve it?
• 97 programs written in C
• ~48,000 lines of code
• 508 MPI_function calls, such as MPI_Init, MPI_Bcast, MPI_Request, MPI_Barrier, MPI_Send, MPI_Recv, MPI_Pack, MPI_Unpack, MPI_Wait, …
• Master-Workers paradigm
• a LOT of I/O going on
Answer: NO!

Highly Scalable Parallel (HSP) - HMMER
Use the serial HMMER code, split the data, launch with MPI.
(Diagram, HSP-HMMER: the input data, thousands of protein sequences, is split into data1, data2, data3, …, dataN; each piece is processed by its own HMMER instance, producing result1, result2, result3, …, resultN.)
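
The "split the data" step can be sketched as follows. This is a minimal illustration in Python, not the HSP-HMMER code itself (which is C inside hmmpfam.c): `rank` and `size` stand in for the values that MPI_Comm_rank and MPI_Comm_size would return, and the function names are hypothetical. Each task takes one contiguous block of the input, with block sizes differing by at most one sequence.

```python
def chunk_bounds(n_items, rank, size):
    """Return the half-open range [start, stop) of items owned by `rank`.

    `rank` and `size` play the role of the values returned by
    MPI_Comm_rank and MPI_Comm_size; no MPI library is needed here.
    """
    base, extra = divmod(n_items, size)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def chunk_for_rank(sequences, rank, size):
    """Return the block of sequences that `rank` would process (hypothetical helper)."""
    start, stop = chunk_bounds(len(sequences), rank, size)
    return sequences[start:stop]
```

Because every task derives its own block from `rank` and `size` alone, no communication between tasks is needed, which is consistent with a wrapper that uses only a handful of MPI calls.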

HSP-HMMER (cont’d – 1)
A few details. The new HSP-HMMER code:
• uses only 4 MPI_function calls (MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize)
• adds only ~100 lines of code to hmmpfam.c
However…
• Although initial performance was better than MPI-HMMER, the scaling leveled off (but did not decrease) at ~1000 cores.
• Intense simultaneous I/O to the Lustre file system was still causing too many slowdowns.

HSP-HMMER (cont’d – 2)
Problem: ‘feeding’ protein sequences of similar lengths to all nodes makes the nodes work in lockstep, so their I/O requests arrive at the same time and create bottlenecks.
Improvement: reorganize the input data so that each processor is given a mixture of protein sequences of different lengths. This randomizes the I/O activity and minimizes bottlenecks.
Performance gained: another 3x to 4x.
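
One possible way to build such a mixture is sketched below. The slides do not say how the reorganization was done, so this is an assumption: sequences are sorted by length, dealt round-robin so every task receives short and long sequences, and then shuffled per task so that tasks do not reach sequences of the same length at the same moment. The function name and the `seed` parameter are hypothetical.

```python
import random


def mix_lengths(sequences, n_ranks, seed=0):
    """Give every rank a mixture of short and long sequences (illustrative only)."""
    ordered = sorted(sequences, key=len)
    # Round-robin dealing: rank r gets items r, r + n_ranks, r + 2*n_ranks, ...
    buckets = [ordered[r::n_ranks] for r in range(n_ranks)]
    # Shuffle each bucket with its own seed so ranks visit lengths in different orders.
    for rank, bucket in enumerate(buckets):
        random.Random(seed + rank).shuffle(bucket)
    return buckets
```

The round-robin step balances the total work per task, while the per-task shuffle is what desynchronizes the I/O.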

HSP-HMMER (cont’d – 3)
Problems:
• Opening a single file (for reads or writes) from 1000 or more processors overloads the MDS (MetaData Server).
• Reads/writes of many different files from many different processors overload the default 4 OSTs (Object Storage Targets).
Solutions:
• Subdivide the total input data into multiple data files.
• Give more work to each processor (more sequences) and write outputs using the Lustre striping mechanism:
  - distribute I/O activities among more OSTs, but have each processor contact only one OST
  - use pthreads to make better use of multiple cores and more memory, while at the same time reducing the number of I/O requests
Another gain of 2x to 3x was observed.
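
The idea of spreading tasks over several input files and several OSTs, while keeping each task on a single OST, can be illustrated with a simple modulo mapping. This is a sketch under stated assumptions: the file naming scheme, the function names, and the modulo rule are hypothetical, and the slides do not describe how the OST choice was applied to Lustre.

```python
from collections import Counter


def io_plan(rank, n_input_files, n_osts):
    """Map one task to one input file and one OST index (hypothetical scheme)."""
    return {
        # Hypothetical file naming; the real input file names are not given in the slides.
        "input_file": f"input_{rank % n_input_files:04d}.fasta",
        # Each task talks to exactly one OST.
        "ost_index": rank % n_osts,
    }


def ost_load(n_ranks, n_osts):
    """Count how many tasks land on each OST under the modulo mapping."""
    return Counter(rank % n_osts for rank in range(n_ranks))
```

With 1024 tasks and 16 OSTs, for example, `ost_load(1024, 16)` assigns 64 tasks to every OST, compared with 256 tasks per OST if only the default 4 OSTs were used.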

HSP-HMMER (cont’d – 4)
Results:
• Identifying the Pfam domains in all 6.5 million proteins of the “nr” (non-redundant) database takes less than 24 hours when using HSP-HMMER on 2048 dual-threaded processors.
• This would have taken ~2 months with MPI-HMMER.
• This is critical, considering that the protein database is doubling in size every 6 months!
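
The arithmetic behind these figures can be checked with a short calculation. Two assumptions are made here and are not from the slides: "2 months" is taken as 60 days, and run time is assumed to grow linearly with database size. The function names are illustrative.

```python
def speedup(baseline_hours, new_hours):
    """Ratio of the old run time to the new run time."""
    return baseline_hours / new_hours


def projected_hours(hours_now, months_ahead, doubling_months=6.0):
    """Run time after `months_ahead` months if the database doubles every
    `doubling_months` months and run time scales linearly with its size
    (both the linear scaling and the fixed core count are assumptions)."""
    return hours_now * 2 ** (months_ahead / doubling_months)
```

Under these assumptions, `speedup(60 * 24, 24)` gives a factor of 60, and `projected_hours(24, 12)` gives 96 hours: a 24-hour run today would take about four days one year later on the same number of cores.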

SUMMARY: Highly Scalable Parallel HMMER on a Cray XT4
C. Halloy, B. Rekapalli, and I. Jouline
• HMMER: protein domain identification tool
• existing MPI-HMMER: limited performance, did not scale well
• new HSP-HMMER: excellent performance (~100x faster than MPI-HMMER for 4096 cores), scales well beyond 8000 cores
• HSP-HMMER brings the time to identify functional domains in millions of proteins down from 2 months to less than 20 hours.
• HSP-HMMER paper accepted for publication in: ACM SAC 2009 Bioinformatics Track
• Using a closely coupled supercomputer with high-bandwidth parallel I/O is crucial.
• Further bioinformatics and genomics research will benefit tremendously from the utilization of such powerful resources.

PSI-BLAST
Position-Specific Iterated Basic Local Alignment Search Tool
NCBI SOFTWARE DEVELOPMENT TOOLKIT, National Center for Biotechnology Information, NIH
First steps:
• The serial version, ‘blastpgp’, runs on 1 core of the Cray XT4 and XT5.
• Studied the NCBI toolkit software and looked into possible MPI implementations of PSI-BLAST.

PSI-BLAST (cont’d)
Analyzing PSI-BLAST (and the whole NCBI toolkit!): should we try to improve it? Should we attempt convoluted MPI routines?
• > 500 programs written in C
• > 1,000,000 lines of code
• no MPI_function calls! (of course!)
Answer: NO! Let’s first try a simple “ideally parallel method”! (nothing embarrassing about that!)

Initial results: HSP-PSIBLAST
Developing a Highly Scalable Parallel (HSP) - PSIBLAST
• Wrote an MPI wrapper, modifying only the ncbimain.c program and adding only ~50 lines of code.
• It uses only 5 MPI_function calls (MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, and MPI_Barrier).
• Tested with an initial set of 27 protein sequences “BLASTed” against the “nr” database, with blastpgp running on 1, 8, 512, 1024, 2048 and 4096 tasks.
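
One way such a wrapper could give each task its own input and output is to rewrite the command-line arguments by rank before the serial code parses them. The sketch below is an assumption, not the actual change made to ncbimain.c: the `.<rank>` suffix convention and the function name are hypothetical, and `rank` stands in for the value MPI_Comm_rank would return. The flags `-i` (query file), `-d` (database), and `-o` (output file) are blastpgp options.

```python
def per_rank_args(argv, rank, flags=("-i", "-o")):
    """Return a copy of argv in which the value following each flag in
    `flags` gets a '.<rank>' suffix (hypothetical naming convention)."""
    out = list(argv)  # leave the caller's list untouched
    for k in range(len(out) - 1):
        if out[k] in flags:
            out[k + 1] = f"{out[k + 1]}.{rank}"
    return out
```

For task 3, `["blastpgp", "-i", "query.fasta", "-d", "nr", "-o", "out.txt"]` becomes `["blastpgp", "-i", "query.fasta.3", "-d", "nr", "-o", "out.txt.3"]`; the database argument is left alone so that all tasks search the same "nr" database.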

Initial results: HSP-PSIBLAST (cont’d)
Highly Scalable Parallel (HSP) - PSIBLAST
(Graph: best time for each run; several runs were done each time.)
• Numerical results were compared and shown to be identical.
• Excellent scaling up to 1024 cores (tasks).

Future steps: HSP-PSIBLAST
Improve the performance and scalability of HSP-PSIBLAST:
• As with HSP-HMMER, we will pre-process the input data sets so that protein sequences of randomized sizes are submitted (in chunks of several thousand) to each MPI task.
• We will test different variations of parallel I/O with Lustre striping, and also different numbers of OSTs.
• Splitting up the “nr” database (1 GB in July 08, now 3 GB in March 09) might also be helpful (and eventually necessary, as it grows much more). This will require further improvements to the MPI I/O component to ensure optimal performance.
• Test the performance and scalability of other NCBI routines, e.g. blast, blastall, megablast, rpsblast, etc.

Summary
• outcome (as of today):
  - developed HSP-HMMER on Cray XT4 and XT5
  - it scales well up to 8000 cores
  - it is MUCH faster than MPI-HMMER
  - developed a parallel PSI-BLAST that runs on up to 4096 cores
  - HSP-PSIBLAST scales very well up to 1024 cores
Questions? Comments? Thank you!
