1 / 45

Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas. Computer Science and Artificial Intelligence Laboratory

emi-lynn
Download Presentation

Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Predicting Secondary Structure of All-Helical Proteins UsingHidden Markov Support Vector MachinesBlaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Workshop on Pattern Recognition in Bioinformatics – August 20, 2006

  2. Protein Structure Prediction • Classical problem: given sequence, predict structure • High-level approaches 1. Energy-minimization (ab-initio) techniques - Elegant, but often lack correct parameters 2. Homology-based techniques - Useful, but hard to predict new proteins Sequence Structure Our approach: Use energy minimization, butlearn parameters from existing proteins

  3. Our Framework (Training) Protein Data Bank Correct structure Amino-acid Sequence Prediction Algorithm Energy Parameters Predictedstructure LearningAlgorithm correct incorrect Done! Constraints energy(incorrect) > energy(correct)

  4. Our Framework (Testing) Amino-acid Sequence Prediction Algorithm Energy Parameters Predictedstructure

  5. Initial Focus: Secondary Structure • Classify each residue as alpha helix, beta strand, coil • In this paper, restrict to all-alpha proteins • Applications: • Informing tertiary structure predictors • Identification of homologous proteins • Identification of active sites (coils)

  6. Secondary Structure Predictors

  7. Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods HMMs

  8. Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods Neural Networks Neural Networks HMMs

  9. Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods Neural Networks Neural Networks SVMs HMMs

  10. Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods Neural Networks Neural Networks SVMs HMMs HMMs

  11. Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods 1400-2900 parameters Neural Networks Neural Networks SVMs HMMs HMMs 680 MB of support vectors 471 parameters • Exploits biochemical models • Offers biological insight

  12. Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods 302 params 1400-2900 parameters Neural Networks Neural Networks SVMs HMMs HMMs 680 MB of support vectors 471 parameters • Exploits biochemical models • Offers biological insight

  13. Our Framework Applied to Helix Prediction Protein Data Bank Alpha Helices Correct structure Amino-acid Sequence MNIFEMLRIDEGLHHHHHHHHH HiddenMarkov Model Prediction Algorithm Energy Parameters Support Vector Machines Predictedstructure LearningAlgorithm correct incorrect Done! Constraints energy(incorrect) > energy(correct)

  14. Energy Parameters 302 Total

  15. Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = 302 Total

  16. Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = HF + HE + HL + HR + HI + HD (Helix) 302 Total

  17. Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = HF + HE + HL + HR + HI + HD (Helix) + NM,-3 + NN,-2 + NI,-1 + NF,0 + NE,1 + NL,2 + NR,3 (N-cap) 302 Total

  18. Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = HF + HE + HL + HR + HI + HD (Helix) + NM,-3 + NN,-2 + NI,-1 + NF,0 + NE,1 + NL,2 + NR,3 (N-cap) + CL,-3 + CR,-2 + CI,-1 + CD,0 + CE,1 + CG,2 + CL,3 (C-cap) 302 Total

  19. Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = HF + HE + HL + HR + HI + HD (Helix) + NM,-3 + NN,-2 + NI,-1 + NF,0 + NE,1 + NL,2 + NR,3 (N-cap) + CL,-3 + CR,-2 + CI,-1 + CD,0 + CE,1 + CG,2 + CL,3 (C-cap) 302 Total

  20. Learning the Parameters Feature Space Energy ( ) = HA*A + HG*G = w¢[A G] Legal structure Correct structure where w represents the energy parameters [HA HG] G: # of Glycines in Helices Highest energy in direction of energy parameters w A: # of Alanines in Helices

  21. Learning the Parameters Feature Space Energy ( ) = HA*A + HG*G = w¢[A G] Legal structure Correct structure where w represents the energy parameters [HA HG] G: # of Glycines in Helices Highest energy in direction of energy parameters w w A: # of Alanines in Helices

  22. Learning the Parameters Feature Space 1. Predict stucture Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  23. Learning the Parameters Feature Space 1. Predict stucture Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices

  24. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices Separating Hyperplane A: # of Alanines in Helices

  25. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w Separating Hyperplane A: # of Alanines in Helices

  26. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  27. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  28. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices

  29. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices

  30. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  31. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  32. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  33. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices

  34. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices

  35. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  36. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  37. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters 7. Predict structure Legal structure Structurealready predicted Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  38. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters 7. Predict structure 8. Terminate Legal structure Structurealready predicted Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices

  39. Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters 7. Predict structure 8. Terminate Legal structure Structurealready predicted Correct structure Predicted structure G: # of Glycines in Helices Details in paper: - How to converge faster - Early termination condition - w A: # of Alanines in Helices [Tsochantaridis et al., ICML’02]

  40. Experimental Methodology • Data set: 300 non-homologous all-alpha proteins • From EVA’s sequence-unique subset of the PDB, July 2005 • Only consider alpha helices (“H” symbol in DSSP) • Randomly split into 150 training, 150 test proteins

  41. Results • Comparison to others • Best HMM method to date that does not utilize alignment info • Offers 3.5% (Q), 0.2% (SOV) over previous best • Lags behind neural networks; e.g., Porter overall SOV = 76.6% • However, we could likely gain 6-8% from alignment profiles • Caveats • Moving beyond all-alpha proteins, we could suffer 3% • By considering 3/10 helices, we could decrease 2% [Nguyen02] [Rost93] [Jones99]

  42. Conclusions • Represents first step toward learning biophysical parameters for energy minimization techniques • Iterative, demand-driven learning process using SVMs • Promising results on alpha-helix prediction • 77.6% among best Q for methods without alignment info • Future work: super-secondary structure • Will predict full “contact maps” rather than 3-state labels • For beta sheets, replace HMMs by multi-tape grammars http://protein.csail.mit.edu/

  43. Extra Slides

  44. Prediction Algorithm • Parameters represent energetic benefitof a given feature in a protein structure • Features are fixed, chosen by designer • Example features: • Number of prolines in an alpha helix • Number of coils shorter than 2 residues • Energy (structure) = features 2 structure Energy (feature) • Minimal-energy structure found with dynamic prog. • Idea: consider all structures, exploiting overlapping problems • Implemented as HMM using Viterbi algorithm Structure withMinimal Energy

  45. Learning Algorithm • Constraints have form: For all incorrectly predicted structures Si, in future selection of the parameters w: Energyw (Si) > Energyw (correct structure) Constraints are linear in the energy parameters. • If feasible, could solve with linear programming • In general, solve with Support Vector Machines (SVMs) • Energy(Si) ¸ Energy (correct structure) + 1 -i (i¸ 0) • Find parameters w minimizing ½ ||w||2 + C/n i=1i n • Provides general solution using soft-margin criterion

More Related