Parallel Computational Biochemistry



PowerPoint Slideshow about ' Parallel Computational Biochemistry' - bedros


Presentation Transcript
Proteins, DNA, etc.

  • DNA encodes the information necessary to produce proteins
  • Proteins are the main molecular building blocks of life (for example, structural proteins, enzymes)


Proteins, DNA, etc.

  • Proteins are formed from a chain of molecules called amino acids

Proteins, DNA, etc.

  • The DNA sequence encodes the amino acid sequence that constitutes the protein

Proteins, DNA, etc.

  • There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I, ...
Databases of Biological Sequences

NCBI: 14,976,310 sequences, 15,849,921,438 nucleotides

>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

Swiss-Prot: 104,559 sequences, 38,460,707 residues

PDB: 17,175 structures

Sequence comparison
  • Compare one sequence (target) to many sequences (database search)
  • Compare more than two sequences simultaneously
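A database search of this kind can be sketched as scoring one target against every database entry and ranking the hits. The sketch below is only an illustration: it uses a plain Needleman-Wunsch global alignment score, and the scoring values (`match`, `mismatch`, `gap`) and the function names are assumptions, not taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping only two DP rows.

    The scoring values are illustrative assumptions; real tools use
    substitution matrices and affine gap penalties.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def database_search(target, database):
    """Score the target against every (name -> sequence) entry, best hit first."""
    hits = [(global_alignment_score(target, seq), name)
            for name, seq in database.items()]
    return sorted(hits, reverse=True)


# Hypothetical toy database; the names and sequences are made up.
toy_db = {"x": "MYSF", "y": "MYF", "z": "WWWW"}
ranking = database_search("MYSF", toy_db)
```

Each comparison is independent of the others, which is what makes the pairwise stage a natural candidate for parallel execution.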
Applications
  • Phylogenetic analysis
  • Identification of conserved motifs and domains
  • Structure prediction
Structure Prediction

> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

Protein sequences

Protein structures

Genomic sequences

Progressive Alignment

1. Do pairwise alignment of all sequences and calculate distance matrix

Scerevisiae [1]
Celegans    [2] 0.640
Drosophila  [3] 0.634 0.327
Human       [4] 0.630 0.408 0.420
Mouse       [5] 0.619 0.405 0.469 0.289

2. Create a guide tree based on this pairwise distance matrix

[Figure: guide tree with leaves Human, Mouse, Drosophila, C.elegans, S.cerevisiae]

3. Align progressively following guide tree.

  • start by aligning most closely related pairs of sequences
  • at each step align two sequences or one to an existing subalignment
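Step 2 can be illustrated with the distance matrix from the slide. The sketch below builds a guide tree by UPGMA (average-linkage clustering), which is used here only because it is short; Clustal W itself builds its guide tree by neighbor-joining. The function name and the nested-tuple tree representation are assumptions made for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Average-linkage (UPGMA) clustering.

    dist maps frozenset({a, b}) -> distance between leaves a and b.
    Returns (tree, merges): tree is a nested tuple of leaf names and
    merges lists the pairs of subtrees joined, in order.
    """
    # Each cluster is (subtree, list of leaf names in that subtree).
    clusters = [(name, [name]) for name in names]
    merges = []

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average leaf-to-leaf distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        merges.append((a[0], b[0]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0], merges


# Lower-triangular distance matrix copied from the slide.
names = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
rows = [[], [0.640], [0.634, 0.327], [0.630, 0.408, 0.420],
        [0.619, 0.405, 0.469, 0.289]]
dist = {frozenset((names[i], names[j])): d
        for i, row in enumerate(rows) for j, d in enumerate(row)}
tree, merges = upgma(names, dist)
```

On this matrix the first join is Human with Mouse (distance 0.289), followed by C. elegans with Drosophila (0.327), so those pairs would be aligned first in step 3.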

Parallel Clustal

  • Parallel pairwise (PW) alignment matrix
  • Parallel guide tree calculation
  • Parallel progressive alignment

Scerevisiae [1]
Celegans    [2] 0.640
Drosophila  [3] 0.634 0.327
Human       [4] 0.630 0.408 0.420
Mouse       [5] 0.619 0.405 0.469 0.289

[Figure: guide tree with leaves Human, Mouse, Drosophila, C.elegans, S.cerevisiae]

Parallel Clustal - Improvements
  • Optimization of input parameters
    • scoring matrices, gap penalties - requires many repetitive Clustal W calculations with various input parameters.
  • Minimum Vertex Cover
    • use minimum vertex cover to remove erroneous sequences, and identify clusters of highly similar sequences.
Minimum Vertex Cover

Conflict Graph

vertex: sequence

edge: conflict (e.g. alignment with very poor score)

TASK: remove smallest number of gene sequences that eliminates all conflicts
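The conflict graph and the task stated on this slide can be sketched on a toy example. The code below assumes that a "conflict" means a pairwise alignment score below a threshold (the slides do not define it precisely), and it finds the smallest set of sequences to remove by brute force, which is only feasible for very small graphs. All names and scores are hypothetical.

```python
from itertools import combinations


def conflict_graph(scores, threshold):
    """Edges join sequence pairs whose alignment score is below the threshold.

    Assumption: 'conflict' = score < threshold; scores maps (a, b) -> score.
    """
    return {frozenset(pair) for pair, s in scores.items() if s < threshold}


def min_vertex_cover_bruteforce(edges):
    """Smallest vertex set touching every edge, by trying subsets in size order.

    Exponential time: for illustration only.
    """
    vertices = sorted(set().union(*edges))
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if all(edge & chosen for edge in edges):
                return chosen
    return set()


# Made-up pairwise scores: the sequence "bad" aligns poorly with everything.
toy_scores = {("s1", "s2"): 42, ("s1", "s3"): 37, ("s2", "s3"): 40,
              ("s1", "bad"): 3, ("s2", "bad"): 5, ("s3", "bad"): 2}
edges = conflict_graph(toy_scores, threshold=10)
to_remove = min_vertex_cover_bruteforce(edges)
```

Here every conflict involves the same sequence, so removing that single vertex eliminates all conflicts.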
FPT Algorithms

Phase 1: Kernelization

Reduce problem to size f(k)

Phase 2: Bounded Tree Search

Exhaustive tree search; exponential in f(k)
Kernelization

Buss's Algorithm for k-vertex cover

  • Let G=(V,E) and let S be the subset of vertices with degree more than k (each such vertex must belong to every vertex cover of size at most k).
  • Remove S and all incident edges

G -> G', k -> k' = k - |S|.

  • IF G' has more than k x k' edges

THEN no k-vertex cover exists

ELSE start bounded tree search on G'
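A minimal sketch of this kernelization step follows. It assumes a simple undirected graph given as a collection of two-element edges and uses the strict form of the degree rule (degree greater than k); the function name and return convention are choices made for the example.

```python
def buss_kernel(edges, k):
    """One pass of Buss's kernelization for k-vertex cover.

    Returns (forced, kernel_edges, k_prime), or None when the rules already
    show that no vertex cover of size <= k exists.
    Assumes a simple graph (no self-loops).
    """
    edges = {frozenset(e) for e in edges}
    degree = {}
    for edge in edges:
        for v in edge:
            degree[v] = degree.get(v, 0) + 1

    # A vertex of degree > k must be in the cover; otherwise all of its
    # more-than-k neighbours would have to be.
    forced = {v for v, d in degree.items() if d > k}
    k_prime = k - len(forced)
    if k_prime < 0:
        return None

    # G': drop the forced vertices together with their incident edges.
    kernel = {edge for edge in edges if not (edge & forced)}

    # In G' every vertex has degree <= k, so k' vertices cover at most k * k' edges.
    if len(kernel) > k * k_prime:
        return None
    return forced, kernel, k_prime
```

For example, on a star with four leaves plus one separate edge and k = 2, the centre is forced into the cover and the kernel is the single remaining edge with k' = 1.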

Case 1: simple path of length 3

remove selected vertices from G'

k' -= 2

Case 2: 3-cycle

remove selected vertices from G'

k' -= 2

Case 3: simple path of length 2

remove v1, v2 from G'

k' -= 1

Case 4: simple path of length 1

remove v, v1 from G'

k' -= 1

Sequential Tree Search

Depth first search

backtrack when k'=0 and G'<>{} ("dead end")

stop when solution found (G'={}, k'>=0)
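The backtracking and stopping rules can be sketched with a deliberately simpler branching rule than the four cases in the slides: pick any remaining edge and branch on which endpoint joins the cover. This is a stand-in for illustration, not the case analysis used in the presentation, and it assumes a simple graph without self-loops.

```python
def vertex_cover_search(edges, k):
    """Depth-first bounded search for a vertex cover of size <= k.

    Returns the cover as a set, or None if none exists.
    Branching rule (an assumption for this sketch): for some uncovered edge
    (u, v), either u or v must be in the cover.
    """
    edges = [frozenset(e) for e in edges]

    def dfs(remaining, budget):
        if not remaining:
            return set()              # G' = {}: solution found
        if budget == 0:
            return None               # k' = 0 and G' <> {}: dead end, backtrack
        u, v = tuple(remaining[0])
        for pick in (u, v):
            rest = [e for e in remaining if pick not in e]
            sub = dfs(rest, budget - 1)
            if sub is not None:
                return sub | {pick}
        return None

    return dfs(edges, k)
```

The search tree has depth at most k and two children per node, so this sketch explores at most 2^k leaves; the more refined cases in the slides reduce that bound.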
Parallel Tree Search

Basic Idea:

Build top log p levels of the search tree (T')

every processor starts depth-first search at one leaf of T'

randomize depth-first search by selecting random child
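The basic idea can be sketched as follows. This is not the presentation's MPI code: Python threads stand in for the processors (so there is no real speedup under CPython), the branching rule is simple edge branching rather than the four cases from the slides, and all function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def random_dfs(edges, budget, rng):
    """Randomized bounded search: a cover of size <= budget as a set, or None."""
    if not edges:
        return set()
    if budget == 0:
        return None
    endpoints = list(edges[0])
    rng.shuffle(endpoints)                      # visit the children in random order
    for pick in endpoints:
        sub = random_dfs([e for e in edges if pick not in e], budget - 1, rng)
        if sub is not None:
            return sub | {pick}
    return None


def expand_frontier(edges, k, p):
    """Build the top levels of the search tree until it has at least p leaves."""
    frontier = [(edges, k, frozenset())]        # (remaining edges, budget, chosen)
    while len(frontier) < p:
        nxt = []
        for rem, budget, chosen in frontier:
            if not rem or budget == 0:
                nxt.append((rem, budget, chosen))   # cannot branch any further
                continue
            for pick in rem[0]:
                nxt.append(([e for e in rem if pick not in e],
                            budget - 1, chosen | {pick}))
        if len(nxt) == len(frontier):
            break                               # no node could be expanded
        frontier = nxt
    return frontier


def parallel_vertex_cover(edges, k, p=4, seed=0):
    """Each worker runs a randomized depth-first search from one leaf of T'."""
    edges = [frozenset(e) for e in edges]
    leaves = expand_frontier(edges, k, p)

    def work(item):
        idx, (rem, budget, chosen) = item
        sub = random_dfs(rem, budget, random.Random(seed + idx))
        return None if sub is None else sub | chosen

    with ThreadPoolExecutor(max_workers=p) as pool:
        for result in pool.map(work, enumerate(leaves)):
            if result is not None:
                return set(result)
    return None
```

A real implementation would also signal the other processors to stop as soon as one of them finds a solution; this sketch simply waits for all workers to finish.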
Analysis: Balls-in-bins

sequential depth-first search path total length: L, #solutions: m

expected sequential time (rand. distr.): L/(m+1)

parallel search path

expected parallel time (rand. distr.): p + L/(p(m+1))

expected speedup: p / (1 + (m+1)/L)

if m << L then expected speedup = p
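The sequential estimate can be checked exactly on small instances. The sketch below assumes the balls-in-bins model reads as "m solutions placed uniformly at random among the L positions of the search path" and enumerates every placement; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def expected_first_solution(L, m):
    """Exact expected 1-based position of the first of m solutions placed
    uniformly at random among L positions (brute-force enumeration)."""
    total, count = 0, 0
    for placement in combinations(range(1, L + 1), m):
        total += placement[0]      # tuples are in increasing order, so [0] is the minimum
        count += 1
    return Fraction(total, count)


# Small illustrative values of L and m.
small = expected_first_solution(10, 2)
```

The enumeration gives (L+1)/(m+1), which is close to the L/(m+1) quoted on the slide once L is large.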

Implementation
  • test platform:
    • 32 node Beowulf cluster
    • each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
    • gcc and LAM/MPI on Red Hat Linux 7.2
  • code-s: Sequential k-vertex cover
  • code-p: Parallel k-vertex cover
HPCVL

High Performance Computing Virtual Laboratory - HPCVL (www.hpcvl.org)

Created by parallel computing researchers from

Carleton U. (Comp. Sci.)

Queen's (Engineering)

Ottawa U. (Life Sci./Hospital)

Obtained $30M+ in Federal (CFI) and Ontario (OIT, ORDCF) grants

Test Data
  • Protein sequences
  • Same protein from several hundred species
  • Each protein sequence a few hundred amino acid residues in length
  • Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)
Test Data
  • Somatostatin
    • neuropeptide involved in the regulation of many functions in different organ systems
    • Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255
Test Data
  • WW
    • small protein domain that binds proline-rich sequences in other proteins and is involved in cellular signaling
    • Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318
Test Data
  • Kinase
    • large family of enzymes involved in cellular regulation
    • Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397
Test Data
  • SH2 (src-homology domain 2)
    • involved in targeting proteins to specific sites in cells by binding to phosphotyrosine
    • Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397
Test Data
  • Thrombin
    • protease in the blood coagulation cascade that promotes blood clotting by converting fibrinogen to fibrin
    • Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413
Test Data
  • PHD (pleckstrin homology domain)
    • involved in cellular signaling
    • Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603
Test Data
  • Random Graph

|V| = 220, |E| = 2155, k = 122, k' = 122

  • Grid Graph

|V| = 289, |E| = 544, k = 145, k' = 145

Test Data

|VC| ~ |V| / 2, k' = k

Sequential Times

Kinase, SH2, Thrombin: n/a

Thank You!
  • Questions?