HSP-HMMER vs MPI-HMMER & parallelization of PSI-BLAST
Christian Halloy, firstname.lastname@example.org
April 23, 2009
HMMER, PFAM, and PSI-BLAST
If you BLAST a protein sequence (or a translated nucleotide sequence), BLAST will look for known domains in the query sequence.
(Fig above): A part of an alignment for the Globin family from the Pfam website
HMMER’s hmmpfam code searches an HMM-PFAM database for matches to a query sequence.
(Left Figure): molecular rendering of Luciferase protein
Analyzing MPI-HMMER – should we try to improve it?
97 programs written in C
~48,000 lines of code
508 MPI_function calls, such as MPI_Init, MPI_Bcast, MPI_Request, MPI_Barrier, MPI_Send, MPI_Recv, MPI_Pack, MPI_Unpack, MPI_Wait, …
Master – Workers paradigm
a LOT of I/O going on
Highly Scalable Parallel (HSP) - HMMER
Use the serial HMMER code, split the data, launch with MPI
Input data – thousands of protein sequences
A few details
The new HSP-HMMER code uses:
only 4 MPI_function calls! (MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize)
it adds only ~100 lines of code to hmmpfam.c
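A minimal Python sketch of the "split the data" idea, assuming a contiguous block split; the slides do not say how HSP-HMMER actually divides the sequences. Here `rank` and `size` stand in for the values that MPI_Comm_rank and MPI_Comm_size would return, and the function names are hypothetical.

```python
def block_for_rank(n_items, rank, size):
    """Return the half-open range [start, stop) of items owned by `rank`.

    Items are split into `size` contiguous blocks whose lengths differ
    by at most one; the first `n_items % size` ranks get the extra item.
    """
    if size <= 0 or not 0 <= rank < size:
        raise ValueError("need 0 <= rank < size")
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def sequences_for_rank(sequences, rank, size):
    """Slice out the sequences this rank would process on its own."""
    start, stop = block_for_rank(len(sequences), rank, size)
    return sequences[start:stop]
```

Because every rank can compute its own range from `rank` and `size` alone, no messages between tasks are needed, which is consistent with the very small number of MPI calls listed above.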
Although initial performance was better than MPI-HMMER, the scaling leveled off (but did not decrease) at ~1000 cores
Intense simultaneous I/O to the Lustre file system was still creating too many slowdowns.
‘Feeding’ protein sequences of similar length to all nodes produces a synchronized effect: all nodes finish and access the file system at about the same time, which creates I/O bottlenecks
Reorganize the input data so that a mixture of protein sequences of different lengths is given to each processor. This randomizes the I/O activity and minimizes bottlenecks. Performance gained: another 3x to 4x
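One possible way to reorganize the input, sketched in Python. This is an assumption for illustration: the slides state the goal (a mixture of lengths per processor) but not the procedure, and `mix_by_length` is a hypothetical name.

```python
import random


def mix_by_length(sequences, n_workers, seed=0):
    """Deal sequences to workers so each gets a mix of lengths.

    Sorting by length and dealing round-robin spreads short and long
    sequences across all workers; the per-worker shuffle then removes
    the shared short-to-long ordering so workers do not reach their
    long sequences at the same time.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    ordered = sorted(sequences, key=len)
    shares = [ordered[w::n_workers] for w in range(n_workers)]
    for share in shares:
        rng.shuffle(share)
    return shares
```

A plain random shuffle of the whole input would also mix lengths on average; the sort-and-deal step is used here only because it balances the total work per worker more evenly.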
Opening a single file (for reads or writes) from 1000 or more processors overloads the MDS – MetaData Server
Reads/writes of many different files from many different processors overload the default 4 OSTs (Object Storage Targets)
Subdivide total input data into multiple data files
Give more work to each processor (more sequences) and write outputs using the Lustre striping mechanism:
- distribute I/O activities among more OSTs (Object Storage Targets), but have each processor contact only one OST
- use pthreads to improve utilization of multicores and memory, while at the same time reducing the number of I/O requests
Another gain of 2x to 3x was observed.
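The file and OST assignment can be sketched as a simple round-robin mapping. This is a hypothetical Python illustration, not the layout HSP-HMMER actually used: the file names and the `io_plan` function are assumptions.

```python
def io_plan(rank, n_input_files, n_osts):
    """Map an MPI rank to an input file and a single OST index.

    Ranks are spread round-robin over the input files and over the
    OSTs, so no single file or storage target serves every rank.
    """
    if rank < 0 or n_input_files <= 0 or n_osts <= 0:
        raise ValueError("rank must be >= 0 and counts must be positive")
    return {
        # hypothetical file naming scheme
        "input_file": f"input_{rank % n_input_files:04d}.fasta",
        "output_file": f"output_{rank:05d}.txt",
        # the one OST this rank's output would live on
        "ost_index": rank % n_osts,
    }
```

On Lustre, the chosen index could then be applied to each output file as a single-stripe layout (for example with `lfs setstripe`, using a stripe count of 1 and that starting OST index) before the file is written.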
Identifying the Pfam domains in all 6.5 million proteins of the “nr” (non-redundant) database takes less than 24 hours when using HSP-HMMER on 2048 dual-threaded processors. This would have taken ~2 months with MPI-HMMER.
This is critical, considering that the protein database is doubling in size every 6 months!
Position-Specific Iterated – Basic Local Alignment Search Tool
NCBI SOFTWARE DEVELOPMENT TOOLKIT
National Center for Biotechnology Information, NIH
serial version ‘blastpgp’ runs on 1 core of Cray XT4 and XT5
Studied the NCBI toolkit software and looked into possible MPI implementations of PSI-BLAST
Analyzing PSI-BLAST (and the whole NCBI toolkit!) – should we try to improve it? Should we attempt convoluted MPI routines?
> 500 programs written in C
> 1,000,000 lines of code
no MPI_function calls! (of course!)
Answer: NO!
Let’s first try a simple “ideally parallel method”! (nothing embarrassing about that!)
Developing a Highly Scalable Parallel (HSP) - PSIBLAST
Wrote an MPI wrapper, modifying only the ncbimain.c program and adding only ~50 lines of code.
It uses only 5 MPI_function calls (MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, and MPI_Barrier)
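A rough Python sketch of what such a wrapper has to decide per task: which query file to read and which report file to write. It is an assumption for illustration only; the real wrapper lives in ncbimain.c, and the file names and the `blastpgp_args` helper are hypothetical. The flags shown (`-d`, `-i`, `-o`, `-j`) are standard blastpgp options.

```python
def blastpgp_args(rank, database="nr", iterations=2):
    """Build the argument list one MPI task would pass to serial blastpgp.

    Each rank reads its own query chunk and writes its own report, so
    tasks never share an input or output file. File names are hypothetical.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return [
        "blastpgp",
        "-d", database,                   # database to search
        "-i", f"query_{rank:05d}.fasta",  # this rank's input chunk
        "-o", f"result_{rank:05d}.out",   # this rank's output report
        "-j", str(iterations),            # maximum number of PSI-BLAST passes
    ]
```

In a C wrapper the equivalent step would be to adjust the argument vector according to the value returned by MPI_Comm_rank before handing control to the unchanged serial code.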
Tested with an initial set of 27 protein sequences “BLASTed” against the “nr” database, with blastpgp running on 1, 8, 512, 1024, 2048 and 4096 tasks.
Highly Scalable Parallel (HSP) - PSIBLAST
The graph below shows the best times for each run (several runs were done each time). Numerical results were compared and shown to be identical.
Excellent scaling up to 1024 cores (tasks)
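Scaling quality is usually quantified as parallel efficiency. The Python sketch below uses the standard textbook definitions for strong and weak scaling; the slides do not say which of the two applies to these runs, so both are offered, and the `efficiency` function name is hypothetical.

```python
def efficiency(timings, weak=False):
    """Return {tasks: efficiency} relative to the smallest task count.

    Strong scaling (fixed total work): (T_ref * p_ref) / (T_p * p).
    Weak scaling (fixed work per task): T_ref / T_p.
    """
    if not timings:
        raise ValueError("timings must not be empty")
    p_ref = min(timings)
    t_ref = timings[p_ref]
    result = {}
    for p, t in sorted(timings.items()):
        if t <= 0:
            raise ValueError("times must be positive")
        result[p] = t_ref / t if weak else (t_ref * p_ref) / (t * p)
    return result
```

Passing the best wall-clock time for each task count gives values near 1.0 where scaling is good and noticeably lower values where it levels off.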
Improve the performance and scalability of HSP-PSIBLAST
As with HSP-HMMER, we will pre-process the input data sets so that protein sequences of randomized sizes are submitted (in chunks of several thousand) to each MPI task
We will test different variations of parallel I/O with Lustre striping, and also different numbers of OSTs
Splitting up the “nr” database (1 GB in July 08, now 3 GB in March 09) might also be helpful (and eventually necessary, as it grows much more). This will require further improvements to the MPI I/O component to ensure optimal performance.
Test performance and scalability of other NCBI routines
e.g. blast, blastall, megablast, rpsblast, etc.
Questions? Comments? Thank you!