A new approach to protein structure prediction

Peter Smooker, Heiko Schröder, Margaret Hamilton, Aditya, Mannan, Sundara, Saravanan, Rajalingam Aravinthan, Gad Abraham, Abdullah Al Amin, Nalinda, Prashant

What’s on today?
• Predicting protein structures
• Fast implementation
• Special purpose HPC
• Searching for structural similarity
• Visualisation of proteins
• Lots of speculation, some results!

Aim: Prediction of protein structures
• Common methods:
  • Homology modelling: > 30% match → similar fold
  • Molecular modelling: only for small molecules
  • Crystallography: very expensive, very slow and not always possible
• Only a few structures are known and we are falling behind (<1%)
• Major efforts are being made, e.g. Blue Gene (fastest supercomputer, IBM)
• Linear time method?

Motivation
• Genetic sequence databases are growing exponentially (maybe not?)
• The growth rate will continue, since multiple concurrent genome projects have begun, with more to come
• [Figure: growth chart with labels 15%, 45% and 120%]

Full Genome Comparison
• Mycobacterium tuberculosis: 3918 protein sequences, 1,329,298 amino acids
• Mycobacterium smegmatis: 4289 protein sequences, 1,359,008 amino acids
• Related organisms, but M. tuberculosis causes a disease → find the common and the different parts
• 16 × 10^6 pair-wise sequence comparisons
• More clever ways? I guess!
• Many genome-genome comparisons will be required in the near future

Homology Modeling
• Discovered sequences are analyzed by comparison with databases
• The complexity of sequence comparison is proportional to the product of query size and database size
• → Analysis is too slow on sequential computers
• Two possible approaches:
  • Heuristics, e.g. BLAST, FastA; but the more efficient the heuristic, the worse the quality of the results
  • Parallel processing: get high-quality results in reasonable time

Protein Sequence Alignment
• BLAST, FastA, Smith-Waterman

GGHSRLILSQLGEEG.RLLAIDRDPQAIAVAKT....IDDPRFSII
|||::::| : |::| ||:::||||:|:|||:: ::| |::::
GGHAERFL.E.GLPGLRLIGLDRDPTALDVARSRLVRFAD.RLTLV

[Figure: search speed (slower to faster) against data quality (lower to higher) for Smith-Waterman, FastA and BLAST; label: T=O(|S|)]

Smith-Waterman Algorithm
Align S1=ATCTCGTATGATG, S2=GTCTATCAC (α=1, β=1)

    ∅  A  T  C  T  C  G  T  A  T  G  A  T  G
∅   0  0  0  0  0  0  0  0  0  0  0  0  0  0
G   0  0  0  0  0  0  2  1  0  0  2  1  0  2
T   0  0  2  1  2  1  1  4  3  2  1  1  3  2
C   0  0  1  4  3  4  3  3  3  2  1  0  2  2
T   0  0  2  3  6  5  4  5  4  5  4  3  2  1
A   0  2  2  2  5  5  4  4  7  6  5  6  5  4
T   0  1  4  3  4  4  4  6  5  9  8  7  8  7
C   0  0  3  6  5  6  5  5  5  8  8  7  7  7
A   0  2  2  5  5  5  5  4  7  7  7 10  9  8
C   0  1  1  4  4  7  6  5  6  6  6  9  9  8

Trace-back path (highlighted cells): 2, 4, 3, 5, 7, 9, 8, 10
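The alignment of S1 and S2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The slide only states the gap parameters (both 1); the substitution scores used here (+2 for a match, -1 for a mismatch) are an assumption inferred from the matrix values, and the function name `smith_waterman` is hypothetical.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Local alignment with a linear gap penalty.

    Returns (best_score, aligned_s1, aligned_s2).
    Scores match/mismatch are assumptions inferred from the slide's matrix.
    """
    rows, cols = len(s2) + 1, len(s1) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the similarity matrix; s2 runs down the rows, s1 across the columns.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s2[i - 1] == s1[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align two residues
                          H[i - 1][j] - gap,      # gap in s1
                          H[i][j - 1] - gap)      # gap in s2
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    a1, a2 = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s2[i - 1] == s1[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[j - 1]); a2.append(s2[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i][j - 1] - gap:
            a1.append(s1[j - 1]); a2.append("-"); j -= 1
        else:
            a1.append("-"); a2.append(s2[i - 1]); i -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = smith_waterman("ATCTCGTATGATG", "GTCTATCAC")
print(score)
print(top)
print(bottom)
```

Under these assumed scores the best local score is 10, and the trace-back passes through cells with the values 2, 4, 3, 5, 7, 9, 8, 10, the same sequence that the slide highlights.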

Context sensitivity!

Protein folding
• Our approach:
  • Linear method: we do not compute electromagnetic fields; nature has done it for us!
  • Physical forces have short range (they decrease quadratically with distance)
  • → context sensitivity: find the same protein with the same context in the database and copy that structure.
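The lookup idea can be sketched as a table keyed by residue triples. This is only an illustration of the principle: the function names, the 10-degree binning and, above all, the two toy chains with their angle values are made up for this example and are not data from the talk.

```python
from collections import Counter, defaultdict


def build_context_table(chains, bin_size=10):
    """Map each residue triple to a histogram of binned (phi, psi) pairs
    observed for the middle residue."""
    table = defaultdict(Counter)
    for residues, angles in chains:
        for i in range(1, len(residues) - 1):
            key = tuple(residues[i - 1:i + 2])
            phi, psi = angles[i]
            binned = (round(phi / bin_size) * bin_size,
                      round(psi / bin_size) * bin_size)
            table[key][binned] += 1
    return table


def predict(table, triple):
    """Return (phi, psi, confidence) of the most frequent bin, or None
    if the triple never occurs in the table."""
    hist = table.get(tuple(triple))
    if not hist:
        return None
    (phi, psi), count = hist.most_common(1)[0]
    return phi, psi, count / sum(hist.values())


# Hypothetical toy database: (residue names, (phi, psi) per residue).
toy_chains = [
    (["VAL", "VAL", "ALA", "GLY"],
     [(-120, 130), (-118, 128), (-62, -41), (80, 10)]),
    (["GLU", "VAL", "VAL", "ALA"],
     [(-70, -40), (-64, -42), (-121, 131), (-60, -45)]),
]
table = build_context_table(toy_chains)
print(predict(table, ("VAL", "VAL", "ALA")))
```

The confidence returned here is simply the share of observations that fall in the winning bin; a real predictor would need a far larger database and a more careful peak detection.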

Dihedral Angles
• The 6 atoms in each peptide unit lie in the same plane; φ and ψ are free to rotate
• The structure of a protein is almost totally determined if all angles φ and ψ are known
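For readers who want to compute these angles, a dihedral angle is defined by four consecutive atoms: φ by C(i-1), N(i), Cα(i), C(i) and ψ by N(i), Cα(i), C(i), N(i+1). The following is a minimal sketch of the standard geometric formula in plain Python; the function name `dihedral` and the example coordinates are illustrative only.

```python
import math


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0 = sub(p0, p1)          # points back from the central bond
    b1 = sub(p2, p1)          # the central bond (rotation axis)
    b2 = sub(p3, p2)          # points forward from the central bond

    norm = math.sqrt(dot(b1, b1))
    b1 = tuple(x / norm for x in b1)

    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))

    x = dot(v, w)
    y = dot(cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Four points arranged at a right angle around the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```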

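[Figure: dihedral angles ψ and φ. Credit: Abdullah Al Amin]

[Figure: dihedral angle φ. Credit: Abdullah Al Amin]

[Figure: dihedral angle ψ. Credit: Abdullah Al Amin]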

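Ramachandran Plots: # choices
[Figure: Ramachandran plots for the 20 amino acids (ASN, ARG, ALA, GLN, CYS; ASP, HIS, GLY, GLU, LYS; LEU, ILE, PRO, PHE, MET; TRP, THR, SER, VAL, TYR), some annotated with the number of choices; the legible annotations are 3, 4?, 3, 5, 2, 2, 2 and 2. Credit: Abdullah Al Amin]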

Which φ?
• val-’val-ile
• val-’val-val
• val-’val-asn
• Σ val-’val-xxx
[Figure credit: Abdullah Al Amin]

Abdullah Al Amin

[Figure: φ and ψ for val-val-ala; φ → same AA, φ → ψ neighbour. Credit: Abdullah Al Amin]

Confidence
• GLU-CYS’-ALA: ψ
• GLU-’CYS-ALA: φ
• GLU-’CYS-SER: φ
• GLU-CYS’-SER: ψ
• # peaks?
[Figure credit: Abdullah Al Amin]

Complexity: reducing the size of the search space
• Reducing the number of peaks
• 2^X: size of the search space
• 2^(X-Y): assuming we have predicted Y angles with high confidence
• Our aim: large Y (Y=X is not possible)
• Method: increase the context
• Problem: the longer the context, the fewer the matches
• Example: there are 20^k different sequences of length k, so E_k = |PDB|/20^k: k=3, E_3 = 1000; k=5, E_5 = 3; k=9, E_9 = 1/50000.
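The arithmetic behind these figures can be checked with a short sketch. The database size is not given on the slide; the value of 8,000,000 residues used below is an assumption obtained by solving E_3 = 1000 for |PDB|. The function names are hypothetical.

```python
def expected_matches(db_residues, k, alphabet=20):
    """Expected number of database hits for one specific k-mer,
    assuming all alphabet**k sequences are equally likely."""
    return db_residues / alphabet ** k


def search_space(x, y, choices=2):
    """Size of the search space for x angles with `choices` peaks each,
    when y of them are already predicted with high confidence."""
    return choices ** (x - y)


PDB_RESIDUES = 8_000_000  # assumption: inferred from E_3 = 1000, not stated on the slide

for k in (3, 5, 9):
    print(k, expected_matches(PDB_RESIDUES, k))
```

With this assumed database size the values for k=5 and k=9 come out in the same order of magnitude as the rounded numbers on the slide.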

Which context??
• Hydrophobic (O), hydrophilic (I)
• I I O ALA LYS SER O O I (E=20)
  • → reduce number of peaks
• Different lists for different groups of proteins? (inside cells, outside cells), Saravanan
  • → reduce number of peaks
• Short and perfect → longer and less perfect? Rajalingam Aravinthan, Gad Abraham
  • → reduce number of peaks
• Reduce the size of the search space!
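One way to read the mixed context "I I O ALA LYS SER O O I" is as a lookup key that keeps the three central residues exact and reduces the six flanking residues to their O/I class. The sketch below illustrates that reading. The set of residues treated as hydrophobic is an assumption made for this example (the talk does not list one), as are the function names and the database size of 8,000,000 residues.

```python
# Assumption: one possible hydrophobic set; the talk does not specify which residues count as O.
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO", "CYS", "GLY"}


def context_key(window, exact=3):
    """Encode a window of residues: the middle `exact` residues stay as they
    are, the flanks are reduced to 'O' (hydrophobic) or 'I' (hydrophilic)."""
    flank = (len(window) - exact) // 2

    def cls(residue):
        return "O" if residue in HYDROPHOBIC else "I"

    left = [cls(r) for r in window[:flank]]
    middle = list(window[flank:flank + exact])
    right = [cls(r) for r in window[flank + exact:]]
    return tuple(left + middle + right)


def expected_matches_mixed(db_residues, exact=3, flank_total=6):
    """Expected hits when `exact` positions have 20 choices and
    `flank_total` positions have only 2 (O or I)."""
    return db_residues / (20 ** exact * 2 ** flank_total)


window = ["LYS", "SER", "VAL", "ALA", "LYS", "SER", "LEU", "PHE", "ASP"]
print(context_key(window))
print(expected_matches_mixed(8_000_000))
```

With the assumed database size, the expected number of matches for such a nine-residue context is of the same order as the E=20 quoted on the slide, far above the value for nine exact residues.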

[Figure labelled 7, 3, 9 and 13. Credit: Rajalingam Aravinthan, Gad Abraham]

Prediction based on length 3

α-Helix
[Figure: ψ against φ. Credit: Abdullah Al Amin]

Why 9?

Suffix trie and suffix tree: fast search!
• Suffix trie for abcacbcabacb (all suffixes up to length 4)
• Find all strings that are similar to aacb (tolerance 1)
• Breadth first search!
[Figure: the suffix trie, with each node annotated by the number of mismatches (0 or 1) accumulated during the search. Credit: Prashant]
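The search on this slide can be sketched with a nested-dictionary trie. This is an illustration under one assumption: "tolerance 1" is taken to mean at most one mismatching character (no insertions or deletions). The function names are hypothetical.

```python
def build_suffix_trie(text, depth):
    """Insert every suffix of `text`, truncated to `depth` characters,
    into a trie of nested dictionaries."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + depth]:
            node = node.setdefault(ch, {})
    return root


def search_with_tolerance(trie, query, tolerance):
    """Breadth-first search: advance one query character per level and keep
    only the trie paths whose mismatch count stays within `tolerance`."""
    frontier = [(trie, "", 0)]  # (node, path so far, mismatches so far)
    for ch in query:
        next_frontier = []
        for node, prefix, mismatches in frontier:
            for edge, child in node.items():
                m = mismatches + (edge != ch)
                if m <= tolerance:
                    next_frontier.append((child, prefix + edge, m))
        frontier = next_frontier
    return sorted(prefix for _, prefix, _ in frontier)


trie = build_suffix_trie("abcacbcabacb", depth=4)
print(search_with_tolerance(trie, "aacb", tolerance=1))
```

Because whole subtrees are dropped as soon as the mismatch budget is exceeded, the frontier stays small; for the slide's example only the paths bacb and cacb survive all four levels.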

Parallel Architectures for Bioinformatics
• Embedded Massively Parallel Accelerators
  • Systola 1024: PC add-on board with 1024 processors (ISATEC, Germany)
  • Fuzion 150: 1536 processors on a single chip (Clearspeed Technology, UK)
  • FPGA?

Parallel Architectures for Bioinformatics: Hybrid Computer
• Supercomputer performance at low cost
• Combines the SIMD and MIMD paradigms within one parallel architecture
[Figure: 16 Systola1024 boards connected by a high-speed Myrinet switch]

Speculation: Finding similar structures based on sequences of φs and ψs
• We could search for a structure that has a high degree of similarity with a predicted structure (instead of similarity of the sequence), particularly in hydrophobic parts.
• Modify Smith-Waterman: what should be the penalty for gaps (do gaps make any sense?), and how do we treat confidence information?

Smith-Waterman Algorithm
Align S1=ATCTCGTATGATG, S2=GTCTATCAC

$$H(i,j) = \max \left\{ \begin{array}{l} 0 \\ H(i-1,j) - \alpha \\ H(i,j-1) - \beta \\ H(i-1,j-1) + \mathrm{Sbt}(S1_i, S2_j) \end{array} \right.$$

α=1, β=1

[Figure: the similarity matrix H for S1 and S2 with the trace-back path (cells 2, 4, 3, 5, 7, 9, 8, 10) highlighted; the substitution term is marked "function ???"]

Score function

$$\mathrm{Score}(a_i, a_j) = \frac{1500}{50 + \left(|a_i - a_j| \times 0.9\right)^2} - 10$$

α = β = -10

[Figure: score plotted against degrees difference. Credit: Nalinda]
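A sketch of how this score could be plugged into a Smith-Waterman style recurrence over two sequences of angles follows. It is an illustration, not the implementation used for the talk: it treats the value -10 as a score added for each gap, compares plain differences in degrees without handling wrap-around at ±180, and uses hypothetical function names.

```python
def angle_score(a, b):
    """Similarity of two angles (degrees), following the slide's formula:
    20 for identical angles, approaching -10 for very different ones."""
    return 1500.0 / (50.0 + (abs(a - b) * 0.9) ** 2) - 10.0


def align_angles(seq1, seq2, gap=-10.0):
    """Best local alignment score of two angle sequences.
    Only two rows of the matrix are kept, since no trace-back is done."""
    prev = [0.0] * (len(seq1) + 1)
    best = 0.0
    for b in seq2:
        cur = [0.0]
        for j, a in enumerate(seq1, start=1):
            h = max(0.0,
                    prev[j - 1] + angle_score(a, b),  # align two angles
                    prev[j] + gap,                    # gap in seq1
                    cur[j - 1] + gap)                 # gap in seq2
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best


helix_like = [-60, -45, -60]
print(angle_score(30, 30))
print(align_angles(helix_like, helix_like))
```

Identical angles score 20 each, so two identical sequences of length n reach 20n, while angles that differ by more than about 11 degrees already contribute a negative score under this formula.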

Nalinda

Look ahead

Visualisation tool
• Sequence of dihedral angles
• Structure of protein
• Visualise structure
• Indicate confidence
• Translate change of dihedral angle into change of 3D-structure
• Emphasise physical collisions
• Show positions for potential S-S bonds and hydrogen bonds
• Show fields?

Speculation:
• Simulation of the folding process:
• Predict the structure of the following hydrophobic subsequence; it needs to be tested whether hydrophobicity is highly correlated with being “inside a protein”.
• Mark all positions of cysteines
• Mark all positions of potential hydrogen bonds
• Simulate the bending process
• Look for similar structures “up to here similar”
• Compare structures of identical O/I sequences
• Compare surfaces (cut the protein at a hydrophilic position and look at the set of exposed hydrophobic amino acids)
• Develop an algorithm to determine structural similarity, based either on dihedral angles or on Euclidean positions, using dynamic programming.
• With such an algorithm similar “surroundings” can be found.
• Do new parts deform old parts significantly?


Thank you !
