Protein Structure Prediction Mason Bially
Types of Structure • Primary Structure • The linear amino acid sequence. • Secondary Structure • The local three-dimensional structure. • Defined by hydrogen bonding patterns. • Tertiary Structure • The global three-dimensional structure. • Defined in atomic coordinates. • Determines the actual function. • Quaternary Structure • The arrangement of multiple folded protein chains (subunits) in a complex.
How do we find Secondary Structure? • A couple of algorithms: • DSSP (Original, Slight Errors) • STRIDE (Newer, Sliding Window) • Both require the primary and tertiary structure. • Because of this they are assignments derived from a known structure, not guesswork. • Find hydrogen bonds. • Use potential energy functions. • Based on amino acid locations and orientations. • STRIDE’s is slightly more accurate. • Return one of 8 types of secondary structure for each amino acid. • 3 helix types • 2 beta-sheet types • 2 turn types • and ‘other’
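The hydrogen-bond test can be sketched in a few lines. This is a minimal illustration, assuming the electrostatic energy form, partial charges (0.42e and 0.20e), scaling factor (332) and -0.5 kcal/mol cutoff from the published DSSP definition; the function names and the example coordinates are hypothetical, and real DSSP also handles hydrogen placement and pattern recognition, which are omitted here.

```python
import math

# Constants from the DSSP definition: partial charges (in units of e),
# a dimensional factor, and the energy cutoff in kcal/mol.
Q1, Q2, F = 0.42, 0.20, 332.0
HBOND_CUTOFF = -0.5

def hbond_energy(c, o, n, h):
    """Electrostatic energy between a C=O group and an N-H group.

    c, o, n, h are (x, y, z) coordinates in angstroms.
    """
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)

def is_hbond(c, o, n, h):
    """True when the energy is below the cutoff."""
    return hbond_energy(c, o, n, h) < HBOND_CUTOFF

# Hypothetical, idealized linear geometry: C=O ... H-N along the x axis.
c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
h, n = (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
print(round(hbond_energy(c, o, n, h), 2), is_hbond(c, o, n, h))
```

Repeating this test for every pair of residues gives the hydrogen-bond pattern from which helix and sheet types are assigned.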
X-Ray Crystallography • Shoot X-rays through a crystal; the structure can be determined from the angles and intensities of the diffracted X-rays. • Some proteins are challenging to crystallize (e.g. intrinsic membrane proteins). • Can handle arbitrarily large sizes.
NMR Protein Spectroscopy • Uses Nuclear Magnetic Resonance, a phenomenon in which atomic nuclei in a magnetic field absorb and re-emit electromagnetic radiation. • Has difficulty with large proteins. • Works on almost anything (including proteins with unstable tertiary structure).
Why do we need Structure Prediction? • Experimentally finding tertiary structure has problems. • Slow and difficult. • Some protein structures can’t be determined experimentally. • We need to cover more ground, quicker. • Drug design. • Bioinformatics tool development. • More detailed interactome information.
But isn’t it computationally hard? • Yes. • Secondary structure prediction. • Machine learning methods. • Tertiary structure prediction. • Homology Modeling • Fold Recognition (AKA Protein Threading) • From scratch (AKA de novo, AKA ab initio)
Basis for Prediction (Comparative Modeling) • Protein structure (Secondary and Tertiary) is evolutionarily more conserved than the DNA or amino acid sequence. • Structure is function; changing it would prevent the protein from doing its job. • Therefore related proteins will probably share structure with each other.
Secondary Structure Prediction • Early attempts. (~60% accuracy) • Chou-Fasman • Uses the probability of each amino acid appearing in each type of secondary structure. • GOR • Bayesian inference applied to the same basic idea. • Machine learning methods. (~70% accuracy) • Neural networks. • Support vector machines. • Hidden Markov models. • Future. • Secondary structure is also based on the environment the protein is folded in. • Including this metadata may improve methods.
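The propensity idea behind Chou-Fasman can be sketched as below. This is a simplified illustration in the spirit of the method, not the published algorithm: it omits the nucleation and extension rules, the training data is a made-up toy example, and the names `propensities` and `predict` are hypothetical.

```python
from collections import Counter

def propensities(sequences, labels, state):
    """Propensity of each amino acid for `state`:
    (its frequency among residues in `state`) / (its overall frequency)."""
    in_state, overall = Counter(), Counter()
    for seq, lab in zip(sequences, labels):
        for aa, s in zip(seq, lab):
            overall[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_state = sum(in_state.values())
    n_all = sum(overall.values())
    return {aa: (in_state[aa] / n_state) / (overall[aa] / n_all)
            for aa in overall}

def predict(seq, helix_p, window=3):
    """Label a residue 'H' when the mean helix propensity of the
    window centred on it exceeds 1.0, otherwise '-'."""
    half = window // 2
    out = []
    for i in range(len(seq)):
        win = seq[max(0, i - half): i + half + 1]
        avg = sum(helix_p.get(aa, 1.0) for aa in win) / len(win)
        out.append("H" if avg > 1.0 else "-")
    return "".join(out)

# Toy, made-up training set: one sequence with known helix labels.
helix_p = propensities(["AAEEGGPP"], ["HHHH----"], "H")
print(predict("AEAGPG", helix_p))
```

A propensity above 1.0 means the amino acid appears in helices more often than chance would suggest; real tables are estimated from many solved structures.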
Homology Modeling • Requires primary structure and a template tertiary structure. • Relies on the idea that if one protein has a specific structure, so do proteins with similar sequences. • Only works with relatively similar sequences. • Sequence identity above 50% gives high quality models. • Comparable to low-resolution X-ray crystallography. • Sequence identity above 30% gives medium quality models. • Anything lower degrades rapidly. • Limited by availability of suitable templates. • Limited by the ability to accurately align and choose distant templates. • Sometimes function/structure will diverge for seemingly similar targets and templates. • Happily generates models against incorrect templates.
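The identity thresholds above can be turned into a quick check. This is a minimal sketch, assuming identity is measured over aligned columns that have no gap (other definitions exist); the function names and the example alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def expected_quality(identity):
    """Rough model quality from the thresholds on the slide."""
    if identity > 50:
        return "high"
    if identity > 30:
        return "medium"
    return "low"

# Hypothetical target/template alignment.
identity = percent_identity("ACDE-G", "ACDQFG")
print(identity, expected_quality(identity))
```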
Homology Modeling • Template selection and sequence alignment • Crucial, but relatively simple if a similar sequence exists (BLAST). • For edge cases: • PSI-BLAST, HMM-based, or profile-profile alignment. • Model Generation • Multiple methods. • Construct the model by placing the amino acids where the aligned template suggests. • Then refine by going back to the chemistry/physics and fixing errors. • Model Assessment • Make sure the resulting fold is correct. • Detects errors in alignments and template selection. • Sometimes chooses the best of many potential models.
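The model generation step, placing residues where the aligned template suggests, can be sketched as a coordinate copy. This is an illustration only, assuming one coordinate per template residue and a gapped pairwise alignment; the function name and data are hypothetical, and real tools also build loops and side chains and refine the result.

```python
def copy_backbone(target_aln, template_aln, template_coords):
    """Give each target residue the coordinate of its aligned template
    residue, or None where the target has an insertion."""
    model = []
    t_idx = 0  # index of the current template residue
    for tgt, tpl in zip(target_aln, template_aln):
        coord = None
        if tpl != "-":
            if tgt != "-":
                coord = template_coords[t_idx]
            t_idx += 1  # advance past this template residue
        if tgt != "-":
            model.append((tgt, coord))
    return model

# Hypothetical alignment; the template has residues A, G, D, E.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
model = copy_backbone("AC-DE", "A-GDE", coords)
print(model)
```

Residues left as None are the ones the refinement step has to build from chemistry and physics rather than from the template.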
Fold Recognition (AKA Protein Threading) • Requires primary structure and a library of tertiary structures. • Relies on the idea that there are (relatively) few folds (tertiary structures) of proteins. • Often feeds the resulting structure back to Homology Modeling techniques as a template to get the final model. • Can use a number of different scoring functions. • Most popular is free energy. • Attempts to find which templates in the library minimize the scoring function. • Threading • Dynamic Programming. (Optimization technique) • Machine Learning. • Often finds a large number of results.
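Threading by dynamic programming can be sketched as follows. This is a toy illustration: each template is reduced to a string of buried (B) or exposed (E) positions, and the scoring function, hydrophobic set and gap penalty are made-up stand-ins for a real free-energy function. All names are hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")  # toy classification, an assumption

def pair_energy(aa, env):
    """Toy score: -1 when a hydrophobic residue sits in a buried position
    or a polar residue in an exposed one, +1 otherwise."""
    fits = (aa in HYDROPHOBIC) == (env == "B")
    return -1.0 if fits else 1.0

def thread_energy(seq, envs, gap=0.5):
    """Minimum total energy of globally aligning seq to the template
    positions, found by dynamic programming."""
    n, m = len(seq), len(envs)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + pair_energy(seq[i - 1], envs[j - 1]),
                dp[i - 1][j] + gap,   # residue left unplaced
                dp[i][j - 1] + gap,   # template position skipped
            )
    return dp[n][m]

def rank_templates(seq, library):
    """Template names sorted from lowest to highest energy."""
    return sorted(library, key=lambda name: thread_energy(seq, library[name]))

# Hypothetical two-template library.
library = {"bad": "EBEB", "alt": "BEBE"}
print(rank_templates("LKVE", library))
```

The lowest-energy templates are the candidates that would be passed on to homology modeling.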
How do we know these models work? • CASP (Critical Assessment of Techniques for Protein Structure Prediction) • Every two years. • Blind tests of prediction algorithms. • In many different categories. • Since 1994. • Other variations.
Future • Mix it all together! • Including evolutionary information. • Improves alignment. • Helps find better folds. • Structural information. • Predicted secondary structure can help. • Mixing with ab initio/de novo methods.
Questions? • Computational Structural Biology: Methods and Applications • By Torsten Schwede and Manuel C. Peitsch • Images from Wikipedia or sources.