
Parallel Computational Biochemistry


Presentation Transcript


  1. Parallel Computational Biochemistry

  2. Proteins, DNA, etc. • DNA encodes the information necessary to produce proteins • Proteins are the main molecular building blocks of life (for example, structural proteins, enzymes)

  3. Proteins, DNA, etc. • Proteins are formed from a chain of molecules called amino acids

  4. Proteins, DNA, etc. • The DNA sequence encodes the amino acid sequence that constitutes the protein
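To make the encoding on slide 4 concrete, here is a minimal Python sketch that translates a coding DNA strand into one-letter amino acid codes using the standard genetic code. The function name `translate` and the short example strand are illustrative assumptions and do not come from the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids are listed in TCAG order of the
# three codon positions, with "*" marking stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding strand codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical strand chosen so that it encodes the first four residues
# (MYSF) of the example protein shown later in the deck.
print(translate("ATGTACTCCTTCTAA"))
```

The lookup table is built from two strings rather than typed out as 64 entries, which keeps the sketch short and easy to check against a printed codon table.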

  5. Proteins, DNA, etc. • There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I, ...

  6. Multiple Sequence Alignment

  7. Databases of Biological Sequences • NCBI: 14,976,310 sequences, 15,849,921,438 nucleotides • Swiss-Prot: 104,559 sequences, 38,460,707 residues • PDB: 17,175 structures • Example entry: >BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus. MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
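The example entry on slide 7 looks like a FASTA-style record: a header line starting with ">" followed by sequence lines. A minimal sketch of reading such records is given below; `parse_fasta` is a hypothetical helper, and only the first two 50-residue lines of the slide's sequence are used to keep the example short.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Header and first 100 residues copied from the slide.
EXAMPLE = """>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
"""

for name, seq in parse_fasta(EXAMPLE):
    print(name, len(seq))
```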

  8. Sequence comparison • Compare one sequence (target) to many sequences (database search) • Compare more than two sequences simultaneously

  9. Applications • Phylogenetic analysis • Identification of conserved motifs and domains • Structure prediction

  10. Phylogenetic Analysis

  11. Structure Prediction • Figure labels: Protein sequences, Protein structures, Genomic sequences • Example sequence: > RICIN GLYCOSIDASE MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

  12. Our Contributions • Parallel min vertex cover for improved sequence alignments (to appear in Journal of Computer and System Sciences) • Parallel Clustal W (ICCSA 2003) • In progress: “Clustal XP” portal at http://cgm.dehne.net

  13. Clustal W

  14. Progressive Alignment (example sequences: Human, Mouse, Drosophila, C. elegans, S. cerevisiae)
      1. Do pairwise alignment of all sequences and calculate the distance matrix:

             Scerevisiae [1]
             Celegans    [2]  0.640
             Drosophila  [3]  0.634  0.327
             Human       [4]  0.630  0.408  0.420
             Mouse       [5]  0.619  0.405  0.469  0.289

      2. Create a guide tree based on this pairwise distance matrix
      3. Align progressively following the guide tree • start by aligning the most closely related pairs of sequences • at each step, align two sequences or one sequence to an existing subalignment
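Step 2 on slide 14 can be illustrated with the distance matrix shown there. The sketch below builds a guide tree by average-linkage clustering (UPGMA). This is a simplification chosen for brevity: Clustal W itself builds its guide tree with neighbor joining, so the code is a stand-in and not the method used in the talk. The function name `upgma` is an assumption.

```python
from itertools import combinations

NAMES = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
# Lower-triangular distance matrix from the slide: row i holds d(i, 0..i-1).
LOWER = [
    [],
    [0.640],
    [0.634, 0.327],
    [0.630, 0.408, 0.420],
    [0.619, 0.405, 0.469, 0.289],
]


def upgma(names, lower):
    """Return a nested-tuple guide tree built by average-linkage clustering."""
    dist = {
        frozenset((names[i], names[j])): lower[i][j]
        for i in range(len(names))
        for j in range(i)
    }
    # Each cluster (a leaf name or a nested tuple) maps to its list of leaves.
    clusters = {name: [name] for name in names}

    def average(a, b):
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))


print(upgma(NAMES, LOWER))
```

With these distances the closest pair is Human and Mouse (0.289), followed by C. elegans and Drosophila (0.327); those two groups are joined next, and S. cerevisiae is attached last. Progressive alignment would then follow the tree from the leaves upward.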

  15. Parallel Clustal • Parallel pairwise (PW) alignment matrix • Parallel guide tree calculation • Parallel progressive alignment • Figure: guide tree over Human, Mouse, Drosophila, C. elegans, S. cerevisiae, with the distance matrix:

             Scerevisiae [1]
             Celegans    [2]  0.640
             Drosophila  [3]  0.634  0.327
             Human       [4]  0.630  0.408  0.420
             Mouse       [5]  0.619  0.405  0.469  0.289
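The pairwise stage on slide 15 parallelizes naturally because every pair of sequences can be compared independently. The sketch below distributes the pairs over a worker pool. Several things here are assumptions for illustration only: a normalized edit distance stands in for Clustal W's pairwise alignment score, Python threads stand in for the processors of a cluster, and the names `pair_distance` and `parallel_distance_matrix` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def edit_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def pair_distance(seqs, pair):
    """Stand-in distance: edit distance divided by the longer sequence length."""
    i, j = pair
    return pair, edit_distance(seqs[i], seqs[j]) / max(len(seqs[i]), len(seqs[j]))


def parallel_distance_matrix(seqs, workers=4):
    """Fill a symmetric distance matrix, one independent task per sequence pair."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    pairs = list(combinations(range(n), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for (i, j), d in pool.map(lambda pair: pair_distance(seqs, pair), pairs):
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Short hypothetical sequences, not taken from the slides.
example = ["MYSFPNSF", "MYSFPNSF", "MYAFPNSW", "GGGG"]
for row in parallel_distance_matrix(example):
    print(row)
```

In CPython, threads will not speed up this CPU-bound loop; the point is only the decomposition into independent pair tasks, which maps directly onto message-passing processes.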

  16. Relative Speedup

  17. Clustal XP vs. SGI • SGI data taken from "Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL" by Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts

  18. Parallel Clustal: Improvements • Optimization of input parameters (scoring matrices, gap penalties): requires many repeated Clustal W runs with different input parameters • Minimum vertex cover: used to remove erroneous sequences and to identify clusters of highly similar sequences

  19. Minimum Vertex Cover • Conflict graph: vertex = sequence, edge = conflict (e.g. alignment with very poor score) • TASK: remove the smallest number of gene sequences that eliminates all conflicts • NP-complete
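The conflict graph on slide 19 can be sketched as follows. The code assumes that a conflict is recorded whenever a pairwise alignment score falls below a threshold; the score matrix is made up for illustration, and the brute-force search is only meant to show what a minimum vertex cover is on a tiny graph.

```python
from itertools import combinations


def build_conflict_graph(scores, threshold):
    """Edge (i, j) whenever the pairwise score is below the threshold (assumed rule)."""
    n = len(scores)
    return {(i, j) for i, j in combinations(range(n), 2) if scores[i][j] < threshold}


def is_vertex_cover(edges, removed):
    """True if removing these vertices eliminates every conflict edge."""
    return all(u in removed or v in removed for u, v in edges)


def brute_force_min_cover(n, edges):
    """Try subsets in order of size; exponential, so usable on tiny graphs only."""
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if is_vertex_cover(edges, set(subset)):
                return set(subset)


# Hypothetical scores: sequence 3 aligns poorly with every other sequence.
scores = [
    [0, 40, 35, 5],
    [40, 0, 30, 8],
    [35, 30, 0, 3],
    [5, 8, 3, 0],
]
edges = build_conflict_graph(scores, threshold=10)
print(sorted(edges), brute_force_min_cover(4, edges))
```

Removing the single outlier sequence resolves all three conflicts here, which is the behaviour the slide describes: the cover identifies the smallest set of sequences to drop.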

  20. FPT Algorithms • Phase 1: Kernelization. Reduce the problem to size f(k) • Phase 2: Bounded Tree Search. Exhaustive tree search; exponential in f(k)

  21. Kernelization: Buss's Algorithm for k-vertex cover • Let G=(V,E) and let S be the subset of vertices with degree greater than k (each such vertex must belong to every vertex cover of size at most k). • Remove S and all incident edges: G -> G', k -> k' = k - |S|. • IF G' has more than k x k' edges THEN no k-vertex cover exists ELSE start bounded tree search on G'
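The kernelization step on slide 21 can be written in a few lines. The sketch assumes the usual statement of Buss's rule, in which the forced set S contains the vertices of degree greater than k; the function name `buss_kernel` and its return convention are illustrative choices.

```python
def buss_kernel(edges, k):
    """Apply Buss's rule once.

    Returns (kernel_edges, k_prime, forced) or None when no vertex cover
    of size at most k can exist.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # A vertex with more than k neighbours must be in the cover; otherwise
    # all of its neighbours would be needed, which already exceeds k.
    forced = {v for v, d in degree.items() if d > k}
    k_prime = k - len(forced)
    if k_prime < 0:
        return None

    kernel = {(u, v) for u, v in edges if u not in forced and v not in forced}
    # Every remaining vertex has degree at most k, so k' vertices can
    # cover at most k * k' edges.
    if len(kernel) > k * k_prime:
        return None
    return kernel, k_prime, forced


# Hypothetical graph: a star centred on vertex 0 plus one separate edge.
star_plus_edge = {(0, 1), (0, 2), (0, 3), (0, 4), (5, 6)}
print(buss_kernel(star_plus_edge, k=2))
```

In the example the centre of the star is forced into the cover, leaving a one-edge kernel with k' = 1 for the tree search.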

  22. Bounded Tree Search

  23. Case 1: simple path of length 3 • remove selected vertices from G' • k' -= 2

  24. Case 2: 3-cycle • remove selected vertices from G' • k' -= 2

  25. Case 3: simple path of length 2 • remove v1, v2 from G' • k' -= 1

  26. Case 4: simple path of length 1 • remove v, v1 from G' • k' -= 1

  27. Sequential Tree Search • Depth-first search • backtrack when k' = 0 and G' is non-empty ("dead end") • stop when a solution is found (G' = {}, k' >= 0)
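The depth-first search on slide 27 can be sketched with the simplest possible branching rule: pick any remaining edge and try each of its endpoints in turn. This is a deliberate simplification. The four cases on slides 23 to 26 branch on paths and cycles to keep the tree smaller, and their details are not recoverable from the transcript, so they are not reproduced here.

```python
def tree_search(edges, k):
    """Depth-first bounded search for a vertex cover of size at most k.

    Returns a set of vertices, or None if no such cover exists.
    """
    if not edges:
        return set()          # solution found: G' is empty and k' >= 0
    if k == 0:
        return None           # dead end: edges remain but the budget is spent
    u, v = next(iter(edges))
    # One endpoint of every edge must be in the cover, so branch on both.
    for pick in (u, v):
        remaining = {e for e in edges if pick not in e}
        cover = tree_search(remaining, k - 1)
        if cover is not None:
            return cover | {pick}
    return None


# Hypothetical test graph: a cycle on five vertices (minimum cover size 3).
five_cycle = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)}
print(tree_search(five_cycle, 2), tree_search(five_cycle, 3))
```

The recursion depth is bounded by k, so the search tree has at most 2^k leaves regardless of the size of the graph.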

  28. Parallel Tree Search • Basic idea: build the top log p levels of the search tree (T') • every processor starts a depth-first search at one leaf of T' • randomize the depth-first search by selecting a random child
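The idea on slide 28 can be simulated in a single process. The sketch below expands the search tree breadth-first until p subproblems are open (the leaves of T'), then runs a randomized depth-first search from each leaf. Two assumptions apply: simple two-way edge branching replaces the case analysis of the talk, and the leaves are visited one after another where a real run would hand each leaf to its own processor. All function names are hypothetical.

```python
import random


def expand_frontier(edges, k, p):
    """Expand the top of the search tree until at least p subproblems are open."""
    frontier = [(frozenset(edges), k, frozenset())]
    while len(frontier) < p:
        # Take the oldest node that can still be branched on.
        idx = next((i for i, (e, kk, _) in enumerate(frontier) if e and kk > 0), None)
        if idx is None:
            break
        e, kk, chosen = frontier.pop(idx)
        for pick in next(iter(e)):
            rest = frozenset(x for x in e if pick not in x)
            frontier.append((rest, kk - 1, chosen | {pick}))
    return frontier


def random_dfs(edges, k, rng):
    """Depth-first search that visits the two children in random order."""
    if not edges:
        return set()
    if k == 0:
        return None
    children = list(next(iter(edges)))
    rng.shuffle(children)
    for pick in children:
        rest = frozenset(e for e in edges if pick not in e)
        cover = random_dfs(rest, k - 1, rng)
        if cover is not None:
            return cover | {pick}
    return None


def parallel_search(edges, k, p, seed=0):
    """Simulate p processors, each searching below one leaf of T'."""
    rng = random.Random(seed)
    for sub_edges, sub_k, chosen in expand_frontier(edges, k, p):
        cover = random_dfs(sub_edges, sub_k, rng)
        if cover is not None:
            return cover | set(chosen)
    return None


five_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(parallel_search(five_cycle, k=3, p=4))
```

With two-way branching, log p levels of expansion yield p leaves, which matches the "top log p levels" on the slide.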

  29. Analysis: Balls-in-bins • sequential depth-first search path: total length L, number of solutions m • expected sequential time (random distribution): L/(m+1) • expected parallel time along the parallel search path (random distribution): p + L/(p(m+1)) • expected speedup: p / (1 + (m+1)/L) • if m << L then the expected speedup approaches p
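The expected-time expressions on slide 29 can be checked numerically. The sketch below does two things under stated assumptions: it enumerates every placement of m solutions along a short path to obtain the exact average position of the first solution, and it evaluates the speedup as the direct ratio of the two expected times given on the slide instead of using the simplified closed form. Function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def exact_expected_first_hit(L, m):
    """Average 1-based position of the first of m solutions placed uniformly
    at random on a search path of L leaves (brute force, small L only)."""
    total = count = 0
    for placement in combinations(range(1, L + 1), m):
        total += placement[0]   # tuples are sorted, so index 0 is the first hit
        count += 1
    return Fraction(total, count)


def expected_sequential(L, m):
    return L / (m + 1)


def expected_parallel(L, m, p):
    return p + L / (p * (m + 1))


def expected_speedup(L, m, p):
    return expected_sequential(L, m) / expected_parallel(L, m, p)


# Exact value for a tiny instance next to the slide's approximation.
print(exact_expected_first_hit(10, 2), expected_sequential(10, 2))

# L matches the simulation experiment on the next slide; m and p are
# hypothetical values chosen for illustration.
print(expected_speedup(1_000_000, 10, 32))
```

The enumeration gives (L+1)/(m+1), which is indistinguishable from L/(m+1) once L is large. For the second call the ratio stays slightly below p = 32, consistent with the claim that the speedup approaches p when solutions are rare relative to the path length.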

  30. Simulation Experiment L = 1,000,000

  31. Implementation • test platform: 32-node HPCVL Beowulf cluster • each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk • gcc and LAM/MPI on Red Hat Linux 7.2 • code-s: sequential k-vertex cover • code-p: parallel k-vertex cover

  32. Test Data • Protein sequences • Same protein from several hundred species • Each protein sequence a few hundred amino acid residues in length • Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)

  33. Test Data • Somatostatin • neuropeptide involved in the regulation of many functions in different organ systems • Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255

  34. Test Data • WW • small protein domain that binds proline rich sequences in other proteins and is involved in cellular signaling • Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318

  35. Test Data • Kinase • large family of enzymes involved in cellular regulation • Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397

  36. Test Data • SH2 (src-homology domain 2) • involved in targeting proteins to specific sites in cells by binding to phosphotyrosine • Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397

  37. Test Data • Thrombin • protease in the blood coagulation cascade that promotes blood clotting by converting fibrinogen to fibrin • Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413

  38. Test Data • PHD (pleckstrin homology domain) • involved in cellular signaling • Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603

  39. Test Data • Random Graph |V| = 220, |E| = 2155, k = 122, k' = 122 • Grid Graph |V| = 289, |E| = 544, k = 145, k' = 145

  40. Test Data • |VC| ~ |V| / 2 • k' = k

  41. Sequential Times • Kinase, SH2, Thrombin: n/a

  42. Code-p on Virtual Proc.

  43. Parallel Times

  44. Speedup: Somatostatin

  45. Speedup: WW

  46. Speedup: Rand. Graph

  47. Speedup: Grid Graph

  48. Clustal XP (in progress) • Web portal • Components: Clustal W, Parallel Clustal, Parallel FPT MVC + … • X: Extended, P: Parallel

  49. Clustal XP http://cgm.dehne.net
