
Massively Parallel Solutions for Molecular Sequence Analysis

Prabhakar R. Gudla, CMSC 838T Presentation. Outline: Motivation, Smith-Waterman Algorithm, Parallelization, High Performance Computing, Hybrid Architecture, Fuzion 150, Performance Evaluation, Conclusions and Comments.


Presentation Transcript


  1. Massively Parallel Solutions for Molecular Sequence Analysis Prabhakar R. Gudla CMSC 838T Presentation

  2. Outline • Motivation • Smith-Waterman Algorithm • Parallelization • High Performance Computing • Hybrid Architecture • Fuzion 150 • Performance Evaluation • Conclusions and Comments CMSC 838T – Presentation

  3. Motivation
     • Discovered sequences are analyzed by comparison with databases
     • Complexity is proportional to the product of query size and database size
     ☞ Analysis is too slow on sequential computers

  4. Sequence Alignment
     • Two possible approaches
       • Heuristics, e.g. BLAST, FASTA, but the more efficient the heuristic, the worse the quality of the results
       • Parallel processing: get high-quality results in reasonable time
     [Figure: BLAST, FASTA, and Smith-Waterman (S-W) placed on two axes, search speed (slower to faster) and data quality (lower to higher)]

  5. Outline • Motivation • Smith-Waterman Algorithm • Parallelization • High Performance Computing • Hybrid Architecture • Fuzion 150 • Performance Evaluation • Conclusion and Comments

  6. Parallelization of S-W
     [Figure: DP matrix for the query A T C T C G (length l1) held on PEs P1 to P6; subject sequences (length l2) are shifted through the array]
     • Matrix cells along a single diagonal are computed in parallel
     • The comparison is performed in l1 + l2 − 1 steps on l1 PEs
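The diagonal-by-diagonal schedule can be sketched sequentially in Python. This is a minimal illustration, not the presenter's implementation: the function name `sw_wavefront`, the scoring values (+2 match, −1 mismatch) and the linear gap penalty of 1 are assumptions chosen for the example. All cells with the same i + j are independent of each other, so the inner loop is the work that the l1 PEs would carry out simultaneously in one step.

```python
def sw_wavefront(s1, s2, match=2, mismatch=-1, gap=1):
    """Fill the Smith-Waterman matrix one anti-diagonal at a time.

    Returns (best_score, steps), where steps counts the anti-diagonals
    processed. Scoring values are illustrative assumptions.
    """
    l1, l2 = len(s1), len(s2)
    # H has one extra row and column of zeros for the empty prefix.
    H = [[0] * (l1 + 1) for _ in range(l2 + 1)]
    best = 0
    steps = 0
    for d in range(2, l1 + l2 + 1):  # d = i + j for 1-based i, j
        # Every cell on this anti-diagonal depends only on diagonals
        # d-1 and d-2, so these iterations could run in parallel.
        for i in range(max(1, d - l1), min(l2, d - 1) + 1):
            j = d - i
            sub = match if s2[i - 1] == s1[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # extend both sequences
                          H[i - 1][j] - gap,      # gap in s1
                          H[i][j - 1] - gap)      # gap in s2
            best = max(best, H[i][j])
        steps += 1
    return best, steps


print(sw_wavefront("ATCTCGTATGATG", "GTCTATCAC"))  # -> (10, 21)
```

For the two example sequences of lengths 13 and 9, the loop runs 13 + 9 − 1 = 21 times, which matches the step count stated on the slide.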

  7. Parallel Architectures
     • Embedded Massively Parallel Accelerators
       • Systola 1024: PC add-on board with 1024 processors
       • Fuzion 150: 1536 processors on a single chip
     • Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan

  8. Outline • Motivation • Smith-Waterman Algorithm • Parallelization • High Performance Computing • Hybrid Architecture • Fuzion 150 • Performance Evaluation • Conclusion and Comments

  9. Previous Applications
     • Volume Visualization [Schmidt '00]
     • Automatic Visual Quality Control (Automobile Industry)
     • Computer Tomography [Schmidt, Schimmler, and Schröder '98]
     • Video Compression [Schmidt and Schimmler '99]
     • Range of Transforms (Fourier, Wavelet, Hough, Radon) [Schmidt, Schimmler, and Schröder '99]
     • Image Processing [Schimmler and Lang '96; Lenders and Schröder '90; Jiang, Edirisinghe, and Schröder '97]

  10. Hybrid Architecture
      • Combines the SIMD and MIMD paradigms within one parallel architecture
      [Figure: hybrid computer with 16 Systola 1024 boards connected through a high-speed Myrinet switch]

  11. Architecture of Systola 1024
      • Instruction Systolic Array (ISA):
        • 32 × 32 mesh of processing elements
        • wavefront instruction execution
      [Figure labels: RAM NORTH, RAM WEST, controller, program memory, host computer bus, ISA, interface processors]

  12. Mapping onto Systola 1024
      [Figure: query characters a0 … a1023 placed on the 32 × 32 array in rows of 32 (a31 … a0, a63 … a32, …, a1023 … a992); subject sequences bk … b1 b0 and … c1 c0 are fed in one after another]
      • a: query sequence (length 1024); b: subject sequence
      • Subject sequences can be pipelined with only one step delay ⇒ k steps for a subject sequence of length k
      • Efficient routing on the ISA: Row Ringshift and Broadcast

  13. Fuzion 150 Architecture
      • 0.25-µm, single-chip, SIMD architecture
      • 1536 PEs @ 200 MHz ⇒ 300 GOPS
      • 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth
      • multithreading (control units interact via semaphores)
      • developed by Clearspeed Technology (UK) for graphics and network processing
      [Figure labels: linear SIMD array of 1536 PEs, each with 2 Kbytes DRAM; SIMD controller; instruction fetch; local memory; host (AGP); Rambus; FUZION bus with 1, 2 or 4 channels (6.4 GB/s); 32-bit EPU (ARC); video I/O; display]

  14. Fuzion 150 Architecture
      [Figure: each PE receives instructions and contains an 8-bit ALU, a 32-byte register file, 2 KByte of PE memory (DRAM), links to its left and right neighbor PEs, and a block I/O channel; the PEs are grouped into Block 0 to Block 5, each holding PE (b,0) to PE (b,255), connected to local memory and the Fuzion bus]

  15. Mapping onto the Fuzion 150
      [Figure: query characters a0 … a255 in Block 0, a256 … a511 in Block 1, up to a1280 … a1535 in Block 5; subject sequences bk … b1 b0 and … c1 c0 are fed in one after another]
      • a: query sequence (length 1536); b: subject sequence
      • No fast global communication ⇒ 2-step local communication
      • Subject sequences can be pipelined with only one step delay

  16. Contents • Motivation • Smith-Waterman Algorithm • Parallelization • High Performance Computing • Hybrid Architecture • Fuzion 150 • Performance Evaluation • Conclusion and Comments

  17. Performance Evaluation
      • Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths; speedup relative to a PIII 1 GHz in parentheses

      Query sequence length        256        512        1024        2048        4096
      Fuzion 150                12 (88)    22 (97)    42 (102)    82 (105)   162 (106)
      Systola 1024             294 (4)    577 (4)   1137 (4)    2241 (4)    4611 (4)
      Cluster of 16 Systolas    20 (53)    38 (56)    73 (58)    142 (60)    290 (59)

      • Parallel implementation scales linearly with sequence length
      • Computing time dominates data transfer time
      • Fuzion 150 is 25 times faster than a single Systola 1024; difference in CMOS technology (0.25 µm vs 1.0 µm)
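The "25 times faster" figure can be checked against the table with a few lines of Python. This sketch assumes that each pair of numbers in the table is a scan time in seconds followed by a speedup; only the scan times are used. The variable and function names are made up for the illustration.

```python
# Scan times in seconds, read from the slide's table (assumed reading:
# first number of each pair is the scan time, second is the speedup).
QUERY_LENGTHS = [256, 512, 1024, 2048, 4096]
FUZION_150 = [12, 22, 42, 82, 162]
SYSTOLA_1024 = [294, 577, 1137, 2241, 4611]
CLUSTER_16 = [20, 38, 73, 142, 290]


def time_ratios(slow, fast):
    """Element-wise ratio slow/fast, i.e. how many times faster 'fast' is."""
    return [s / f for s, f in zip(slow, fast)]


fuzion_vs_systola = time_ratios(SYSTOLA_1024, FUZION_150)
cluster_vs_systola = time_ratios(SYSTOLA_1024, CLUSTER_16)

for length, r1, r2 in zip(QUERY_LENGTHS, fuzion_vs_systola, cluster_vs_systola):
    print(f"{length:5d}  Fuzion/Systola: {r1:5.1f}x   Cluster/Systola: {r2:5.1f}x")
```

Under that reading, the Fuzion 150 comes out between roughly 24 and 29 times faster than one Systola 1024, and the 16-board cluster reaches close to its ideal factor of 16.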

  18. Performance Evaluation
      • Time comparisons for a 10 Mbase search on different parallel architectures with different query lengths
      • 4× faster than 16K-PE MasPar
      • 6× faster than Kestrel
      • 5× faster than SAMBA (special-purpose 3-board architecture)

  19. Performance Evaluation
      Legend:
      • USparc: Sun Ultrasparc 140 MHz
      • B-SYS: 470-PE ISA
      • Alpha: DEC Alpha, 433 MHz
      • 1K MP2: 1K-PE MasPar
      • Paragon: 32-node Paragon
      • Decy-1: 1-board Decypher-II*
      • Merc1: 1-board Mercury+
      • Bcll-1: Biocellerator*
      • Samba: 2-board Samba+
      • 16-MP2: 16K-PE MasPar
      • FDF-3: 5-board Paracell FDF+
      • Kestrel: 1-board Kestrel
      • Decy-15: 15-board Decypher-II*
      + (single purpose); * (FPGA)
      Source: Dahle et al., PDPTA, 1243-1249, 1999

  20. Outline • Motivation • Smith-Waterman Algorithm • Parallelization • High Performance Computing • Hybrid Architecture • Fuzion 150 • Performance Evaluation • Conclusions and Comments

  21. Conclusions
      • Demonstrated how fine-grained and hybrid parallel architectures can be applied efficiently for comparative genomics
      • Significant runtime savings for full genome comparisons and database searching
      • The same systems can be used for accelerating other bioinformatics applications, e.g. Hidden Markov Models

  22. Comments
      ☞ With hardware support, is S-W as fast as BLAST?
      • Comparative search speeds on a 600 MHz 21264A Alpha machine (MCUPS comparable to the Hybrid System and Fuzion 150)
      • Source: Shane Sturrock, SCS, 2(1), April 2002

  23. Comments
      ☞ Is it feasible to use S-W as the default?
      • Currently offered as a default option at EBI (European Bioinformatics Institute), which handles 15K queries per month with a full implementation of S-W
      • Depends on the "objectives" of the search
      ☞ Just how much more accurate is S-W?
      • 5-10% more "sensitive" towards divergent matches than BLAST (Shpaer et al., Genomics 38, 179-191, 1996)
      • BLAST will retrieve most biologically significant similarities, but will miss a few and will include some chance similarities

  24. Comparison of S-W vs BLAST
      • Source: Shpaer et al., Genomics 38(2), pp. 179-191, 1996
      ☞ Is there a real difference in the results?
      • YES

  25. Comparison of S-W, FASTA, and BLAST
      Note: The numbers in the table show, for how many protein SF, the method in the column performed better than the one in the row

  26. Acknowledgements: Dr. Bertil Schmidt, Dr. Chau-Wen Tseng

  27. Q&A

  28. Extra Slides

  29. Full Genome Comparison
      [Figure: Mycobacterium smegmatis and Mycobacterium tuberculosis; 3918 protein sequences (1,329,298 amino acids) and 4289 protein sequences (1,359,008 amino acids)]
      • Related organisms, but tuberculosis causes a disease ⇒ find common and different parts
      • 16 × 10^6 pairwise sequence comparisons

  30. Smith-Waterman Algorithm
      • Optimal local alignment of two sequences
      • Performs an exhaustive search for the optimal local alignment
      • Complexity O(nm) for sequence lengths n and m
      • Based on the "dynamic programming" (DP) algorithm
        • Fill the DP matrix using a substitution (mutation) matrix
        • Find the maximal value (score) in the matrix
        • Trace back from the score until a 0 value is reached
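The three DP steps listed above (fill, find the maximum, trace back to a 0) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the presenter's code: the function name `smith_waterman` is invented here, a fixed +2/−1 match/mismatch score stands in for a real substitution matrix, and a linear gap penalty of 1 is used. These values were picked because they reproduce the example matrix on slide 32.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Local alignment of s1 and s2; returns (score, aligned_s1, aligned_s2).

    Scoring values are illustrative assumptions (no substitution matrix).
    """
    rows, cols = len(s2) + 1, len(s1) + 1

    def sub(i, j):
        return match if s2[i - 1] == s1[j - 1] else mismatch

    # Step 1: fill the DP matrix; row 0 and column 0 stay at 0.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            # Step 2: remember the maximal score and where it occurs.
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Step 3: trace back from the maximum until a 0 cell is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            a1.append(s1[j - 1]); a2.append(s2[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] - gap:
            a1.append(s1[j - 1]); a2.append("-")
            j -= 1
        else:
            a1.append("-"); a2.append(s2[i - 1])
            i -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("ATCTCGTATGATG", "GTCTATCAC"))
# -> (10, 'TCGTATGA', 'TC-TATCA')
```

The traceback prefers a diagonal move when several moves explain a cell's value; a different tie-break could return a different alignment with the same score.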

  31. Smith-Waterman Algorithm
      • Aligning S1 and S2 of lengths l1 and l2 using recurrences [equations not captured in this transcript]
      • Calculate three possible ways to extend the alignment:
        • by one amino acid (AA) in each sequence
        • by one AA in the first sequence, aligned with a gap in the second
        • by one AA in the second sequence, aligned with a gap in the first
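The recurrence equations themselves did not survive in the transcript. One common formulation with affine gap penalties, which matches the three extension cases listed above and the H, E, and F cells mentioned on slide 38, is given below. The notation is an assumption: sbt is the substitution score, α the cost of opening a gap, and β the cost of extending it.

```latex
\begin{align*}
H(i,j) &= \max\bigl\{\,0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + \mathrm{sbt}(S1_i, S2_j)\,\bigr\} \\
E(i,j) &= \max\bigl\{\,H(i,j-1) - \alpha,\; E(i,j-1) - \beta\,\bigr\} \\
F(i,j) &= \max\bigl\{\,H(i-1,j) - \alpha,\; F(i-1,j) - \beta\,\bigr\}
\end{align*}
% Boundary values: H(i,0) = E(i,0) = H(0,j) = F(0,j) = 0
% for 0 \le i \le l1 and 0 \le j \le l2.
```

With α = β the affine model reduces to a linear gap penalty, where every gap position costs the same amount.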

  32. Smith-Waterman Algorithm
      Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC, with α = 1, β = 1

             ∅  A  T  C  T  C  G  T  A  T  G  A  T  G
          ∅  0  0  0  0  0  0  0  0  0  0  0  0  0  0
          G  0  0  0  0  0  0  2  1  0  0  2  1  0  2
          T  0  0  2  1  2  1  1  4  3  2  1  1  3  2
          C  0  0  1  4  3  4  3  3  3  2  1  0  2  2
          T  0  0  2  3  6  5  4  5  4  5  4  3  2  1
          A  0  2  2  2  5  5  4  4  7  6  5  6  5  4
          T  0  1  4  3  4  4  4  6  5  9  8  7  8  7
          C  0  0  3  6  5  6  5  5  5  8  8  7  7  7
          A  0  2  2  5  5  5  5  4  7  7  7 10  9  8
          C  0  1  1  4  4  7  6  5  6  6  6  9  9  8

      Highlighted traceback cells: 2, 4, 3, 5, 7, 9, 8, 10

  33. Principles of the ISA [figure only]

  34. Principles of the ISA [Figure label: Communication Register]

  35. [Figure: ISA with Interface Processors North and Interface Processors West]

  36. Instruction Systolic Array
      [Figure: a stream of instructions (+, −, *) moves across the array and is combined with row selectors and column selectors]
      • Wavefront instruction execution ⇒ fast accumulation operations (e.g. row sum, broadcast, ringshift)

  37. Advantage of ISAs: Performing Aggregate Functions
      [Figure: three worked examples on a row of four PEs: broadcasting the value 234, turning the values 1, 2, 3, 4 into 1, 3, 6, 10 with a row sum, and shifting the value 1000 through the row with a ringshift]
      • Row Broadcast: C := C[WEST]
      • Row Sum: C := C + C[WEST]
      • Row Ringshift: C := C[WEST]; C := C[EAST]
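The effect of wavefront execution on a single row can be imitated in Python. This is a simplified model under stated assumptions, not a description of the real hardware: the helper names are invented, the instruction is assumed to reach column j one step after column j − 1 (so each PE sees its western neighbor's already updated value), and the westernmost PE is assumed to be masked with a noop, as in the slide's row-sum example. Only Row Broadcast and Row Sum are modeled.

```python
def sweep(row, op):
    """Apply one instruction to a row of PEs as a west-to-east wavefront.

    op(own, west) computes the new communication register value of a PE
    from its own value and the value of its western neighbor.
    """
    c = list(row)
    for col in range(1, len(c)):  # column 0 is masked (noop) in this model
        # The western neighbor executed one step earlier, so c[col - 1]
        # already holds its updated value.
        c[col] = op(c[col], c[col - 1])
    return c


def row_broadcast(row):
    """C := C[WEST]: every PE ends up with the westernmost value."""
    return sweep(row, lambda own, west: west)


def row_sum(row):
    """C := C + C[WEST]: running sums, with the row total in the east."""
    return sweep(row, lambda own, west: own + west)


print(row_broadcast([234, 0, 0, 0]))  # -> [234, 234, 234, 234]
print(row_sum([1, 2, 3, 4]))          # -> [1, 3, 6, 10]
```

Both results agree with the values that appear in the slide's figure. A single instruction achieves an aggregate result because the wavefront lets the data travel along with the instruction.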

  38. Data Transfer
      • In Systola 1024:
        • input of a new character (bj) into the lower western IP, and
        • when l1 > 2048, the input of previously computed H, E, and F cells and the output of H, E, and F cells
      • For Fuzion 150, while the 16 new H-cells in each PE are computed, one new character is input via the Fuzion bus

  39. Instruction Counts • Instruction Count (IC) to update 2 and 16 H-cells in Systola 1024 and Fuzion 150, respectively:

  40. Maximum Characters/PE
      • The memory per PE on Systola is 32 (16-bit) registers
        • 2 characters per PE is the maximum possible
        • (2 chars × 20-AA substitution row × 8 bits per substitution value = 20 registers)
      • The memory per PE on Fuzion is 2 KB
        • the maximum is 16 chars per PE
        • restricted due to "indirect addressing" per PE

  41. Indirect Address • An addressing mode found in many processors' instruction sets where the instruction contains the address of a memory location which contains the address of the operand (the "effective address") or specifies a register which contains the effective address

  42. Myrinet - Overview
      • Myrinet is a cost-effective, high-performance, packet-communication and switching technology that is widely used to interconnect clusters of workstations, PCs, servers, or single-board computers
      • Conventional networks (e.g., Ethernet) can be used to build clusters, but do not provide the performance and features required for HPC or high-availability clustering

  43. Myrinet - Characteristics
      • Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports
      • Flow control, error control, and "heartbeat" continuity monitoring on every link
      • Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications
      • Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts
      • Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets

  44. lq processors: Hybrid Query sequence = M, Number of processors in ISA = N2, assuming M = k x N: • k  N: Each k x N subarray computes the alignment of the same query sequence with different subject sequences • k ≥ N : • k/N = 2: load 2 chars per PE • k/N > 2: split query sequence into k/2N passes and load 2N2 chars in each pass CMSC 838T – Presentation

  45. lq processors: Fuzion 150 Length of query sequence = M, Number of processors = 1536: • k x M = 1536: k alignments of same query sequence w/ different subject sequences carried out in parallel • k x 1536 = M: • Split into k passes – requires I/O of intermediate results in each step • Data transfers can be minimized by assigning k/M chars per PE – currently 16 chars per PE is the limit CMSC 838T – Presentation

  46. Concept of true and false hits
      The following cases were distinguished:
      • true positives: alignments between proteins of similar structure that fall above a given threshold (defined by the sequence alignment method)
      • false positives: alignments between proteins of dissimilar structure that fall above a given threshold of the sequence alignment
      • true negatives: alignments between proteins of dissimilar structure that fall below a given threshold
      • false negatives: alignments between proteins of similar structure that fall below a given threshold
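The four cases map directly onto a two-way decision, as the following Python sketch shows. The function name, the example scores, and the choice to treat a score exactly equal to the threshold as "below" are assumptions made for the illustration.

```python
from collections import Counter


def classify_hit(score, threshold, similar_structure):
    """Label one alignment as a true/false positive/negative.

    Assumption: a score exactly at the threshold counts as 'below'.
    """
    if score > threshold:
        return "true positive" if similar_structure else "false positive"
    return "false negative" if similar_structure else "true negative"


# Hypothetical alignments: (alignment score, proteins structurally similar?)
alignments = [(120, True), (95, False), (40, True), (12, False), (80, True)]
counts = Counter(classify_hit(score, 60, similar) for score, similar in alignments)
print(counts["true positive"], counts["false positive"],
      counts["false negative"], counts["true negative"])  # -> 2 1 1 1
```

Raising the threshold trades false positives for false negatives, which is why the comparison between methods depends on where each method's threshold is set.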

  47. Guidelines
      When to use S-W?
      • if you are looking for a protein distantly related to your query sequence (e.g., you have a known protein sequence and you want to find possible distant homologues)
      • if you are looking for the protein encoded in your low-quality DNA query sequence (e.g., you have a badly sequenced cDNA clone)
      • if you are looking for a DNA sequence corresponding to your protein query sequence (e.g., you want to identify potential homologues of your protein in the EST databases)
      When to use BLAST?
      • if you are looking for close matches and you don't mind missing lower-homology sequences
      • if you want a quick answer

  48. Performance Evaluation of SAMBA
      Source: Jamet and Lavenier, CABIOS, 12(7), 609-615, 1997
      ☞ The longer the query length, the better the speed-up

  49. Performance Evaluation of Kestrel
      Legend:
      • USparc: Sun Ultrasparc 140 MHz
      • B-SYS: 470-PE ISA
      • Alpha: DEC Alpha, 433 MHz
      • 1K MP2: 1K-PE MasPar
      • Paragon: 32-node Paragon
      • Decy-1: 1-board Decypher-II*
      • Merc1: 1-board Mercury+
      • Bcll-1: Biocellerator*
      • Samba: 2-board Samba+
      • 16-MP2: 16K-PE MasPar
      • FDF-3: 5-board Paracell FDF+
      • Kestrel: 1-board Kestrel
      • Decy-15: 15-board Decypher-II*
      + (single purpose); * (FPGA)
      Source: Dahle et al., PDPTA, 1243-1249, 1999

  50. Performance Evaluation of Splash-2 Source: Hoang, IEEE-CMM, 185-191, 1993