
Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal and MULTICLUSTAL


Presentation Transcript


  1. Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal and MULTICLUSTAL Arunesh Mishra CMSC 838 Presentation Authors : Dmitri Mikhailov, Haruna Cofer, Roberto Gomperts SGI

  2. Problem Statement • Multiple Sequence Alignment (MSA) • Basis for phylogenetic analysis - Infer homology relationships • Building protein families - conserved region may imply common function • Aids in function/structure prediction of new proteins • Global MSA – Clustal W • Is it computationally expensive ? Yes, for 100 sequences. • Goal : Parallelize Clustal W • Clustal W takes hours for 100 or more sequences • Parallelization possible for the algorithm • Contribution of the paper • Parallel Clustal W • Parallel version of basic Clustal W • HT Clustal • Parallelize heterogeneous Multiple Sequence Alignment problems • MULTICLUSTAL • Parallel version of an optimization on Clustal W CMSC 838T – Presentation

  3. Talk Overview • Overview of talk • Motivation • Background • Sequential Clustal W • Parallel Clustal W • HT Clustal • Problem Statement • Optimizations • MULTICLUSTAL • Sequential Algorithm • Optimizations • Observations

  4. Introduction • Sequential Clustal W Algorithm • Given N sequences of length M each • Pairwise Alignment (PW) • Creates distance matrix N x N based on pairwise alignment scores • Evolutionary distance • Guide Tree (GT) construction (Phylogenetic tree) • Use Neighbor-joining algorithm • Progressive Multiple Alignment (PA) • Use guide tree to align closely related pairs of sequences • Progressively align next sequence to existing alignment

  5. Parallel Clustal W • Problem Statement • Parallelize the Sequential Clustal W • Execution time breakdown • PW = pairwise alignment, GT = guide tree, PA = progressive alignment

  6. Parallel Clustal W • Pairwise Alignment Stage • N(N-1)/2 pairwise alignments • Send them randomly to different processors • Random – because jobs have different loads • Random also produces statistically uniform distribution (over a large set of jobs) • 1.8X speedup achieved on a 1000 sequence MSA with 8 CPUs • Guide Tree Stage • Parallelize “find closest neighbors from distance matrix” • Used in the neighbor joining algorithm • Find minimum element of each row concurrently • Use the row minima to find minimum element of matrix
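
The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration, not the code from the paper: the function names, the `seed` parameter, and the small example matrix are all made up here, and the per-row loop is written sequentially even though each row minimum could be computed by a different processor.

```python
import random
from itertools import combinations


def assign_pairs_randomly(n_sequences, n_workers, seed=0):
    """Assign each of the N(N-1)/2 sequence pairs to a randomly chosen worker."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(n_workers)]
    for pair in combinations(range(n_sequences), 2):
        buckets[rng.randrange(n_workers)].append(pair)
    return buckets


def closest_pair(dist):
    """Return (distance, i, j) for the smallest off-diagonal matrix entry.

    Step 1 finds the minimum of every row; the rows are independent,
    so this step is the part that could run concurrently.
    Step 2 reduces the row minima to the minimum of the whole matrix.
    """
    n = len(dist)
    row_minima = []
    for i in range(n):
        j = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
        row_minima.append((dist[i][j], i, j))
    return min(row_minima)


# Hypothetical example: 10 sequences give 45 pairs, spread over 4 workers.
buckets = assign_pairs_randomly(10, 4)
print([len(b) for b in buckets])

# Hypothetical 3 x 3 symmetric distance matrix.
example = [[0, 5, 2],
           [5, 0, 7],
           [2, 7, 0]]
print(closest_pair(example))
```

With only 45 pairs the bucket sizes can be visibly uneven; the slide's point is that the imbalance averages out as the number of jobs grows.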

  7. Parallel Clustal W • Progressive Alignment Stage • Values of a function score(I,J) are precomputed in parallel • Alignment score of sequence I and J • Not much parallelization in the third stage • Overall Speedup • Speedup of 10x for a 600-sequence MSA using 16 CPUs • Time reduced from 1 hr 7 minutes to 6.5 minutes • Relative scaling is better for larger inputs
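
As a quick check of the numbers on this slide, speedup is the sequential time divided by the parallel time, and parallel efficiency is speedup divided by the CPU count. The helper names below are made up for this sketch; the timings are the ones quoted on the slide.

```python
def speedup(serial_minutes, parallel_minutes):
    """Speedup = sequential run time / parallel run time."""
    return serial_minutes / parallel_minutes


def efficiency(speedup_value, n_cpus):
    """Parallel efficiency = speedup / number of CPUs."""
    return speedup_value / n_cpus


s = speedup(67.0, 6.5)   # 1 hr 7 min versus 6.5 min
e = efficiency(s, 16)    # 16 CPUs, as on the slide
print(f"speedup = {s:.1f}x, efficiency = {e:.0%}")
```

The result is a little over 10x, which matches the slide, and it corresponds to roughly two thirds of the ideal 16x.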

  8. HT Clustal • Problem Statement • Calculate large numbers of MSAs of various sizes (independent problems) • Such problems seen in high-throughput (HT) research environments • Representative Problem (from paper) : • Perform independent MSA over 100 sets of sequences • Each set has between 20 and 100 sequences with average of 60 sequences • Average Length of sequence = 390

  9. HT Clustal - Optimizations • Basic Idea • Each MSA operation (on one set of sequences) is independent of the other • Run Clustal W as a uniprocessor job on one MSA problem • Launch multiple Clustal W jobs on different processors • Job Scheduling • Jobs of different duration – depends on sequence set • Two scheduling options explored: • Schedule dynamically – if processor is free, schedule an MSA job – chosen randomly • Schedule dynamically – sequence sets are presorted (based on file size)
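
The two scheduling options can be compared with a small simulation. This is a sketch under stated assumptions, not the scheduler used in the paper: `makespan` is a made-up helper, job cost is assumed to grow with the square of the set size (because the pairwise stage needs N(N-1)/2 alignments), and presorting is assumed to mean largest job first, which the slide does not state.

```python
import heapq
import random


def makespan(durations, n_cpus):
    """Simulate dynamic scheduling and return the total elapsed time.

    Jobs are taken from the queue in the given order; each job goes
    to whichever CPU becomes free first.
    """
    finish_times = [0.0] * n_cpus
    heapq.heapify(finish_times)
    for duration in durations:
        earliest_free = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_free + duration)
    return max(finish_times)


# Hypothetical workload shaped like the representative problem:
# 100 sets of 20 to 100 sequences, run on 32 CPUs.
rng = random.Random(0)
set_sizes = [rng.randint(20, 100) for _ in range(100)]
durations = [size ** 2 for size in set_sizes]  # assumed cost model

random_order = durations[:]
rng.shuffle(random_order)
presorted = sorted(durations, reverse=True)    # assumed: largest first

print("random order:", makespan(random_order, 32))
print("presorted   :", makespan(presorted, 32))
```

A tiny case shows why order matters: five jobs with durations 1, 1, 1, 1, 4 on two CPUs finish at time 6 when the long job is queued last, and at time 4 when it is queued first.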

  10. HT Clustal – Performance Numbers • Speedups • Almost linear speedups • 31x on 32 CPUs for the representative MSA problem • 116X on 128 CPUs for a larger test case • Solution time reduced from 18.5 hours to 9.5 minutes • Speedup shown for the example MSA set:

  11. HT Clustal – Effect of Presorting • Effect of presorting • Figure shows effect of presorting for the example MSA set: 32 CPUs, 100 sets, ~3 jobs per CPU • If average number of jobs per CPU is less than 5, presorting helps • For larger numbers of jobs per CPU, statistical averaging reduces load imbalance

  12. MULTICLUSTAL • MULTICLUSTAL Algorithm • A Perl script to generate high quality MSA with little user intervention • Searches for best combination of Clustal W input parameters • To reduce gaps, increase clustering • Parameters to vary : • Scoring matrices : pairwise and multiple • Gap open and extension penalties (pairwise and multiple) • Sequential Algorithm : • Till all parameters are sufficiently varied { • alignment = Run Clustal W () • Calculate quality of alignment • Change Parameters } • Quality of alignment • A numerical quantity based on • Identical amino acid matches • Conservative amino acid substitutions • Gap events, amino acid islands, i.e. –X-, -XX-, -XXX-, -XXXX-
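
The sequential loop on this slide amounts to a search over parameter combinations. The Python sketch below illustrates that loop only; it is not the MULTICLUSTAL Perl script. `search_parameters`, `toy_score` and `fake_run` are hypothetical names, the parameter values are placeholders, the matrix names are used as plain labels, and the toy score covers just two of the listed ingredients (identical matches and gap events).

```python
import re
from itertools import product


def search_parameters(run_alignment, score_alignment,
                      matrices, gap_opens, gap_extends):
    """Try every parameter combination and keep the best-scoring alignment."""
    best = None
    for matrix, gap_open, gap_extend in product(matrices, gap_opens, gap_extends):
        params = {"matrix": matrix, "gap_open": gap_open, "gap_extend": gap_extend}
        alignment = run_alignment(params)        # stands in for a Clustal W run
        quality = score_alignment(alignment)
        if best is None or quality > best[0]:
            best = (quality, params, alignment)
    return best


def toy_score(alignment):
    """Toy quality: +1 per identical gap-free column, -1 per run of gaps."""
    identical = sum(1 for col in zip(*alignment)
                    if "-" not in col and len(set(col)) == 1)
    gap_events = sum(len(re.findall(r"-+", row)) for row in alignment)
    return identical - gap_events


def fake_run(params):
    """Hypothetical stand-in that returns a canned alignment per setting."""
    if params["gap_open"] == 10:
        return ["AC-GT", "ACGGT"]
    return ["A-C-G", "ACGGT"]


quality, params, alignment = search_parameters(
    fake_run, toy_score,
    matrices=["BLOSUM", "PAM"], gap_opens=[5, 10], gap_extends=[0.1])
print(quality, params)
```

Passing the alignment runner in as a function keeps the search loop separate from how Clustal W is actually invoked, which this sketch does not attempt to show.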

  13. MULTICLUSTAL Optimizations • Optimization on MULTICLUSTAL • Run Clustal W once • Reuse tree generated in the PW/GT Stages • Guide tree calculated only once for multiple runs • Results in speedups from 1.5X to 3X • Use Parallel Clustal W for each run of Clustal W

  14. Observations • Parallelizability • First (pairwise alignment) and second (guide tree) stages are parallelizable • Third stage is mostly sequential – speedup limited • 100 sequence MSAs possible ? • PIR at NBRF (Georgetown University) takes maximum of 20 sequences for MSA • Speedup improves user response, for 20 sequences a PC would be sufficient • Probable applications: • Research Environments ? • PIR servers ? • Speedup only on shared memory SGI 3000 workstation ?
