Parallel Computational Biochemistry


Proteins, DNA, etc.

  • DNA encodes the information necessary to produce proteins.

  • Proteins are the main molecular building blocks of life (for example, structural proteins, enzymes).

  • Proteins are formed from a chain of molecules called amino acids.

  • The DNA sequence encodes the amino acid sequence that constitutes the protein.

  • There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I, ...
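As a small illustration of how a DNA sequence encodes an amino acid sequence, the sketch below translates a DNA string codon by codon using the standard genetic code. The function name `translate` and the example DNA string are assumptions made for illustration; they do not come from the slides.

```python
# Standard genetic code, with codons ordered by base T, C, A, G
# at the first, second and third position ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Reads the sequence in steps of three bases and stops at the first
    stop codon. Trailing bases that do not form a full codon are ignored.
    Any character other than A, C, G, T raises ValueError.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        b1, b2, b3 = (BASES.index(base) for base in dna[i:i + 3])
        amino_acid = AMINO_ACIDS[16 * b1 + 4 * b2 + b3]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical DNA fragment; one of many that would encode these residues.
print(translate("ATGTATTCTTTTCCC"))  # MYSFP
```

Because most amino acids are encoded by several codons, the mapping from DNA to protein is many-to-one, so the DNA string used here is only one possible encoding.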


Multiple Sequence Alignment


Databases of Biological Sequences

NCBI: 14,976,310 sequences

15,849,921,438 nucleotides

>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.

MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG

DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE

SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH

WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE

YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI

KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR

GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS

LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY

YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT

KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

Swiss-Prot: 104,559 sequences

38,460,707 residues

PDB: 17,175 structures
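The BGAL_SULSO entry above is written in FASTA format: a header line starting with ">" followed by lines of sequence. A minimal sketch of reading such records is shown below; the function name `parse_fasta` is an assumption for illustration, and real databases may need a more tolerant parser.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between sequence lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG

DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
"""
for header, sequence in parse_fasta(example):
    print(header.split()[0], len(sequence))
```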


Sequence comparison

  • Compare one sequence (target) to many sequences (database search)

  • Compare more than two sequences simultaneously


Applications

  • Phylogenetic analysis

  • Identification of conserved motifs and domains

  • Structure prediction


Phylogenetic Analysis


Structure Prediction

> RICIN GLYCOSIDASE

MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG

DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE

SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH

WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE

YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI

KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR

GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS

LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY

YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT

KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

Protein sequences

Protein structures

Genomic sequences


Clustal W


Progressive Alignment

[Figure labels: Human, Mouse, Drosophila, C.elegans, S.cerevisiae]

1. Do pairwise alignment of all sequences and calculate the distance matrix:

                     [1]     [2]     [3]     [4]
   S.cerevisiae [1]
   C.elegans    [2]  0.640
   Drosophila   [3]  0.634   0.327
   Human        [4]  0.630   0.408   0.420
   Mouse        [5]  0.619   0.405   0.469   0.289

2. Create a guide tree based on this pairwise distance matrix.

3. Align progressively following the guide tree.

  • start by aligning the most closely related pairs of sequences

  • at each step align two sequences or one to an existing subalignment
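The guide tree step can be sketched with the distance values from this slide. The code below uses average-linkage clustering (UPGMA) as a simple stand-in; Clustal W itself builds its guide tree by neighbor joining, so this is an illustration of the idea and not the Clustal W procedure. The function name `upgma` and the nested-tuple tree representation are assumptions.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree by repeatedly merging the two closest clusters.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the tree as nested tuples of labels.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


names = ["S.cerevisiae", "C.elegans", "Drosophila", "Human", "Mouse"]
rows = [[], [0.640], [0.634, 0.327], [0.630, 0.408, 0.420],
        [0.619, 0.405, 0.469, 0.289]]  # lower triangle from the slide
dist = {frozenset((names[i], names[j])): rows[i][j]
        for i in range(len(names)) for j in range(i)}
print(upgma(names, dist))
```

With these distances the sketch joins Human with Mouse first (0.289), then C.elegans with Drosophila (0.327), then those two groups, and attaches S.cerevisiae last, which is the order a progressive alignment would follow.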


Parallel Clustal

Parallel pairwise (PW) alignment matrix

Parallel guide tree calculation

Parallel progressive alignment

[Figure: species labels and distance matrix repeated from the Progressive Alignment slide]
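The pairwise stage parallelizes naturally because each of the n(n-1)/2 comparisons is independent. The sketch below distributes the pairs over a pool of workers. It makes two simplifying assumptions: `toy_distance` is a hypothetical stand-in for a real pairwise alignment score, and a thread pool is used for brevity, whereas the implementation described later in these slides used MPI on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_distance(a, b):
    """Hypothetical stand-in for an alignment-based distance.

    Counts mismatched positions (no gaps) plus the length difference,
    divided by the longer length. Assumes non-empty sequences.
    """
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / max(len(a), len(b))


def pairwise_matrix(seqs, workers=4):
    """Compute a symmetric distance matrix, one task per sequence pair."""
    pairs = list(combinations(range(len(seqs)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(
            lambda ij: toy_distance(seqs[ij[0]], seqs[ij[1]]), pairs))
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), value in zip(pairs, values):
        matrix[i][j] = matrix[j][i] = value
    return matrix


print(pairwise_matrix(["ACDE", "ACDF", "WCDF"]))
```

For CPU-bound alignment code in Python, a process pool or a message-passing library would be the more realistic choice; the structure of the work distribution stays the same.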


Parallel Clustal - Improvements

  • Optimization of input parameters

    • scoring matrices, gap penalties - requires many repetitive Clustal W calculations with various input parameters.

  • Minimum Vertex Cover

    • use minimum vertex cover to remove erroneous sequences, and identify clusters of highly similar sequences.


Minimum Vertex Cover

Conflict Graph

vertex: sequence

edge: conflict (e.g. alignment with very poor score)

TASK: remove the smallest number of gene sequences that eliminates all conflicts
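A conflict graph of this kind can be sketched as follows. The assumption here is that two sequences conflict when their pairwise score falls below a threshold; the slides only say "very poor score", so the exact rule, the function names and the example scores are all hypothetical.

```python
from itertools import combinations


def conflict_graph(names, score, threshold):
    """Return the set of conflict edges between sequences.

    score maps frozenset({a, b}) to a pairwise alignment score;
    an edge is created when the score is below the threshold (assumed rule).
    """
    return {frozenset((a, b))
            for a, b in combinations(names, 2)
            if score[frozenset((a, b))] < threshold}


def is_vertex_cover(edges, removed):
    """True if removing these sequences eliminates every conflict."""
    return all(edge & removed for edge in edges)


names = ["a", "b", "c", "d"]
score = {frozenset(pair): value for pair, value in [
    (("a", "b"), 50), (("a", "c"), 5), (("a", "d"), 3),
    (("b", "c"), 40), (("b", "d"), 45), (("c", "d"), 60)]}
edges = conflict_graph(names, score, threshold=10)
print(is_vertex_cover(edges, {"a"}))  # True: both conflicts involve "a"
print(is_vertex_cover(edges, {"c"}))  # False: the a-d conflict remains
```

Removing a set of sequences that touches every conflict edge is exactly a vertex cover of the conflict graph, which is why the smallest such set is a minimum vertex cover.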


FPT Algorithms

Phase 1: Kernelization

Reduce problem to size f(k)

Phase 2: Bounded Tree Search

Exhaustive tree search; exponential in f(k)


Kernelization

Buss's Algorithm for k-vertex cover

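  • Let G=(V,E) and let S be the subset of vertices with degree greater than k. Every such vertex must belong to any vertex cover of size at most k, since otherwise all of its more than k neighbors would have to be in the cover.

  • Remove S and all incident edges:

    G -> G', k -> k' = k - |S|.

  • IF G' has more than k · k' edges

    THEN no k-vertex cover exists

    ELSE start bounded tree search on G'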
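The kernelization step can be sketched in a few lines. The version below uses the strict rule (degree greater than k), performs a single pass as on the slide, and assumes a simple graph given as a collection of two-element edges. The function name `buss_kernel` and its return convention are assumptions for illustration.

```python
def buss_kernel(edges, k):
    """One pass of Buss's kernelization for k-vertex cover.

    Returns (forced, kernel_edges, k_remaining), or None when the
    reduction already proves that no vertex cover of size k exists.
    """
    edges = {frozenset(edge) for edge in edges}
    degree = {}
    for edge in edges:
        for vertex in edge:
            degree[vertex] = degree.get(vertex, 0) + 1
    # Vertices with more than k neighbors must be in every size-k cover.
    forced = {vertex for vertex, deg in degree.items() if deg > k}
    k_remaining = k - len(forced)
    if k_remaining < 0:
        return None
    kernel = {edge for edge in edges if not (edge & forced)}
    # Each remaining vertex covers at most k edges, so k' vertices
    # cover at most k * k' edges.
    if len(kernel) > k * k_remaining:
        return None
    return forced, kernel, k_remaining


star = [(0, leaf) for leaf in range(1, 5)]   # center 0 with four leaves
print(buss_kernel(star, 2))                  # center is forced into the cover
matching = [(2 * i, 2 * i + 1) for i in range(5)]
print(buss_kernel(matching, 2))              # None: 5 edges > 2 * 2
```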


Bounded Tree Search


Case 1: simple path of length 3

remove selected vertices from G'

k' -= 2


Case 2: 3-cycle

remove selected vertices from G'

k' -= 2


Case 3: simple path of length 2

remove v1, v2 from G'

k' -= 1


Case 4: simple path of length 1

remove v, v1 from G'

k' -= 1


Sequential Tree Search

Depth first search

backtrack when k'=0 and G' is not empty ("dead end")

stop when solution found (G'={}, k'>=0)
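The depth-first search with backtracking can be sketched as below. For brevity it uses the simplest branching rule (pick an uncovered edge and try each endpoint) instead of the four-case branching from the preceding slides, so its search tree is larger than the one the slides describe. The function name `vertex_cover_search` is an assumption.

```python
def vertex_cover_search(edges, k):
    """Depth-first bounded search for a vertex cover of size at most k.

    Returns a frozenset of chosen vertices, or None if none exists.
    """
    def search(remaining, budget, chosen):
        if not remaining:
            return chosen            # solution found: no edges left
        if budget == 0:
            return None              # dead end: edges left, budget used up
        u, v = remaining[0]
        for pick in (u, v):          # one endpoint must be in the cover
            rest = [edge for edge in remaining if pick not in edge]
            found = search(rest, budget - 1, chosen | {pick})
            if found is not None:
                return found
        return None                  # both branches failed: backtrack

    return search([tuple(edge) for edge in edges], k, frozenset())


graph = [(0, 1), (1, 2), (0, 2), (2, 3)]   # triangle plus a pendant edge
print(vertex_cover_search(graph, 1))       # None
print(sorted(vertex_cover_search(graph, 2)))
```

The recursion depth is bounded by k, and each node has two children, so the tree has at most 2^k leaves regardless of the graph size.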


Parallel Tree Search

Basic Idea:

Build top log p levels of the search tree (T')

every processor starts depth-first search at one leaf of T'

randomize depth-first search by selecting a random child
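The idea can be sketched by expanding the top levels of the search tree and then running a randomized depth-first search from each leaf. The sketch below visits the leaves one after another in a single process, purely to show the decomposition; in the approach described here each leaf would be handled by its own processor. It also uses simple two-way edge branching, and all function names are assumptions.

```python
import random


def expand_frontier(edges, k, levels):
    """Expand the top `levels` levels of the search tree breadth-first.

    Each subproblem is (remaining_edges, budget, chosen_vertices).
    """
    frontier = [([tuple(edge) for edge in edges], k, frozenset())]
    for _ in range(levels):
        next_frontier = []
        for remaining, budget, chosen in frontier:
            if not remaining or budget == 0:
                next_frontier.append((remaining, budget, chosen))
                continue
            for pick in remaining[0]:
                rest = [edge for edge in remaining if pick not in edge]
                next_frontier.append((rest, budget - 1, chosen | {pick}))
        frontier = next_frontier
    return frontier


def random_dfs(remaining, budget, chosen, rng):
    """Depth-first search that visits the children in random order."""
    if not remaining:
        return chosen
    if budget == 0:
        return None
    children = list(remaining[0])
    rng.shuffle(children)
    for pick in children:
        rest = [edge for edge in remaining if pick not in edge]
        found = random_dfs(rest, budget - 1, chosen | {pick}, rng)
        if found is not None:
            return found
    return None


def parallel_search_simulated(edges, k, p, seed=0):
    """Simulate p workers, one per leaf of the top ceil(log2 p) levels."""
    levels = max(p - 1, 0).bit_length()   # equals ceil(log2 p) for p >= 1
    rng = random.Random(seed)
    for leaf in expand_frontier(edges, k, levels):
        found = random_dfs(*leaf, rng)
        if found is not None:
            return found
    return None


graph = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(parallel_search_simulated(graph, 2, p=4))
```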


Analysis: Balls-in-bins

sequential depth-first search path: total length L, number of solutions m

expected sequential time (rand. distr.): L/(m+1)

parallel search path

expected parallel time (rand. distr.): p + L/(p(m+1))

expected speedup: p / (1 + (m+1)/L)

if m << L then expected speedup ≈ p
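The sequential estimate can be checked with a small Monte Carlo experiment. The sketch assumes the balls-in-bins model stated above: m solutions placed uniformly at random among L positions on the search path, with the search stopping at the first one. Under that model the exact expected position of the first solution is (L+1)/(m+1), which is close to L/(m+1) for large L. The function name and parameter values are assumptions.

```python
import random


def mean_first_solution(L, m, trials, seed=0):
    """Estimate the expected position of the first of m random solutions."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Draw m distinct solution positions in 1..L; the search stops
        # at the smallest one.
        total += min(rng.sample(range(1, L + 1), m))
    return total / trials


L, m = 1000, 9
estimate = mean_first_solution(L, m, trials=20000)
print(estimate, (L + 1) / (m + 1))
```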


Simulation Experiment

L = 1,000,000


Implementation

  • test platform:

    • 32 node Beowulf cluster

    • each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk

    • gcc and LAM/MPI on LINUX Redhat 7.2

  • code-s: Sequential k-vertex cover

  • code-p: Parallel k-vertex cover


HPCVL

High Performance Computing Virtual Laboratory - HPCVL (www.hpcvl.org)

Created by parallel computing researchers from

Carleton U. (Comp. Sci.)

Queen's (Engineering)

Ottawa U. (Life Sci./Hospital)

Obtained $30M+ in Federal (CFI) and Ontario (OIT, ORDCF) grants


Test Data

  • Protein sequences

  • Same protein from several hundred species

  • Each protein sequence a few hundred amino acid residues in length

  • Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)


  • Somatostatin

    • neuropeptide involved in the regulation of many functions in different organ systems

    • Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255


  • WW

    • small protein domain that binds proline rich sequences in other proteins and is involved in cellular signaling

    • Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318


  • Kinase

    • large family of enzymes involved in cellular regulation

    • Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397


  • SH2 (src-homology domain 2)

    • involved in targeting proteins to specific sites in cells by binding to phosphotyrosine

    • Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397


  • Thrombin

    • protease in the blood coagulation cascade that promotes blood clotting by converting fibrinogen to fibrin

    • Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413


  • PHD (pleckstrin homology domain)

    • involved in cellular signaling

    • Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603


  • Random Graph

    |V| = 220, |E| = 2155, k = 122, k' = 122

  • Grid Graph

    |V| = 289, |E| = 544, k = 145, k' = 145


|VC| ~ |V| / 2, k' = k


Sequential Times

Kinase, SH2, Thrombin: n/a


Code-p on Virtual Proc.


Parallel Times


Speedup: Somatostatin


Speedup: WW


Speedup: Rand. Graph


Speedup: Grid Graph


Thank You!

  • Questions?

