G53BIO – Bioinformatics cs.nott.ac.uk/~jqb/G53BIO

G53BIO – Bioinformatics http://www.cs.nott.ac.uk/~jqb/G53BIO Protein Structure Prediction – Part 2 Dr. Jaume Bacardit – jqb@cs.nott.ac.uk Prof. Natalio Krasnogor – nxk@cs.nott.ac.uk Some material taken from “Arthur Lesk, Introduction to Bioinformatics, 2nd edition, Oxford University Press, 2005” and “Introduction to Bioinformatics by Anna Tramontano”

Outline • Prediction of the 3D structure of proteins • Assessment of PSP quality: CASP • Summary

3D Protein Structure Prediction • Approaches for 3D PSP • Template-Based Modelling • Ab-Initio methods • State-of-the-Art methods • I-Tasser • Rosetta

Approaches for 3D PSP • Some PSP methods try to identify a template protein and then adapt the structure of the template to the target protein → Template-based Modelling • Other methods try to generate the structure of the protein from scratch (Ab Initio Modelling) by optimizing an energy function that models the stability of the protein; they are used when no template can be identified

Pipeline for Template-based Modelling • Typical steps • Identify the template (next slide) • Produce the final alignment between the residues of target and template • Determine main chain segments to represent the regions containing insertions and deletions (gaps in the alignment) and stitch them into the main chain of the template to create an initial model for the target • Replace the side chains of residues that have been mutated (mismatches in the alignment), although it is possible that the conformation in the template is still conserved • Examine the model to detect any serious atom collisions and relieve them • Refine the model by energy minimization. This stage is meant to adapt the stitched segments to the conserved structure and to adjust the side chains to find the most stable conformation

Loop remodelling

Template identification • Can we find a sequence with known structure and high sequence identity with the target? • Homology Modelling • Otherwise, there may still be a template (a structure similar to that of the target) that has poor sequence identity. We need to identify it by other means • Fold recognition • Profile-based methods • Threading methods

Profile-based Methods • Aim is to construct 1D representations (profiles) of the structures in our fold database • Afterwards, when a target sequence comes, we construct its profile and check our database for the most similar profile • That is, instead of aligning amino acid sequences, we align structural 1D profiles

How to construct the profile? • We choose a series of structural properties of residues • Most frequent secondary structure state • Alpha helix, Beta sheet, other • Solvent Accessibility • < 40 Å², > 100 Å², intermediate • Hydrophobic/polar • For each amino acid, we decide to which category it belongs based on statistics computed on a large database of structures

How to construct the profile? • Now the sequence for each protein in our database will have a new structural representation • We need to predict SS and Acc for the template
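As a rough illustration of the encoding described above, the sketch below translates a sequence into a 1D structural profile and compares two profiles position by position. The lookup table `RESIDUE_CLASSES`, its class labels and both function names are hypothetical placeholders, not statistics from a real structure database, and the comparison assumes ungapped profiles of equal length.

```python
# Hypothetical per-amino-acid classes: (secondary structure, accessibility, polarity).
# The assignments are placeholders for illustration, not real database statistics.
RESIDUE_CLASSES = {
    "A": ("H", "intermediate", "hydrophobic"),
    "G": ("C", "intermediate", "polar"),
    "K": ("H", "exposed", "polar"),
    "V": ("E", "buried", "hydrophobic"),
}


def structural_profile(sequence):
    """Translate an amino acid sequence into a list of structural classes."""
    return [RESIDUE_CLASSES[aa] for aa in sequence]


def profile_identity(profile_a, profile_b):
    """Fraction of positions whose classes agree (equal-length, ungapped profiles)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have equal length")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)
```

A real fold recognition method would align the profiles with gaps and a scoring matrix rather than compare them position by position.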

Threading methods • We start by compiling a catalogue of unique folds (filtering out repeats) • Afterwards, we evaluate how likely it is that the target sequence adopts each of the folds, and how (alignment) • The name is a metaphor taken from tailoring, as we are trying to fit the sequence (a thread) through a known structure • We will choose the template (and alignment) that has the lowest (estimated) energy

Threading methods • Energy estimation needs to be simple and fast • As we need to evaluate all possible folds and alignments • Energy results from all the pair-wise interactions occurring in a protein • Thus, the energy estimation will be computed as the sum of the energy terms for every pair of residues in the protein • How to compute the energy interaction for a given pair of amino acids?

Pair-wise Energy estimation • Boltzmann’s equation states that the probability of observing a given event depends on its energy • $P(x) = e^{-E(x)/KT}$ • If we invert this equation we get: • $E(x) = -KT \ln P(x)$ • We can compute P(x) for each pair of amino acids from a database of known structures as the frequency with which these amino acids are observed to be in contact
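The inversion E(x) = -KT ln P(x) can be sketched in a few lines. This is a minimal illustration, assuming the observed contacts are supplied as a plain list of amino acid pairs; the function name `contact_energies` and the choice of KT = 1 (energies in units of KT) are assumptions. Practical knowledge-based potentials also normalise against a reference state, which is left out here.

```python
import math
from collections import Counter


def contact_energies(contact_pairs, kt=1.0):
    """Estimate E(x) = -KT ln P(x) for each unordered amino acid pair.

    contact_pairs: iterable of (aa1, aa2) tuples observed in contact.
    kt: the KT factor; 1.0 gives energies in units of KT (an assumption).
    """
    # Sort each pair so that ("L", "V") and ("V", "L") are counted together.
    counts = Counter(tuple(sorted(pair)) for pair in contact_pairs)
    total = sum(counts.values())
    # P(x) is the observed frequency of the pair among all contacts.
    return {pair: -kt * math.log(n / total) for pair, n in counts.items()}
```

Pairs that are seen in contact more often receive a lower (more favourable) energy, which is what the threading score relies on.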

Alignment within threading • We still need to solve the problem of the correspondence of the residues in our template with those of the target • This is a very difficult problem, as a change in an alignment can affect the interactions with many residues • There is an exact (but costly) solution • Instead, most methods adopt an approximate method called the frozen approximation • When evaluating the possibility of assigning one of the amino acids of the target to a certain position in the template, instead of computing the interactions with the rest of the target residues, we will use those of the template
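The frozen approximation can be sketched as follows. This is an illustrative outline only: the function name, the contact list format (`template_contacts` maps a position to the positions it touches) and the `pair_energy` table keyed by sorted residue pairs are all assumptions made for the example.

```python
def frozen_position_score(target_aa, position, template_seq, template_contacts, pair_energy):
    """Score placing target_aa at a template position under the frozen approximation.

    Neighbours keep the template's own residue types instead of the
    (still unknown) target residues that will eventually be aligned there.
    """
    score = 0.0
    for neighbour in template_contacts.get(position, ()):
        # The partner residue is read from the template, not from the target.
        pair = tuple(sorted((target_aa, template_seq[neighbour])))
        # Pairs missing from the table contribute nothing (an assumption).
        score += pair_energy.get(pair, 0.0)
    return score
```

Because each position score no longer depends on the rest of the alignment, the scores can be filled into a standard dynamic programming alignment matrix.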

Frozen Approximation

Aligning target and template • Crucial step before generating the initial model • It is possible, especially for homology modelling, that the best sequence alignment does not correspond to the best structural alignment • That is, finding the best correspondence between the coordinates of each amino acid of target and template • In this case, a better alignment process needs to be performed. To do so, we can use • Information derived from the template’s structure • Information predicted for the target

Aligning Target and Template Wrong alignment. Some atoms are too close (big circle). Some atoms are too far (small circle) Correct alignment after shifting

The poor man’s approach to homology modelling • To find templates • PSI-BLAST • 3D Jury. This program is a meta-server. That is, it asks many other servers which templates they would choose and then produces a consensus decision based on their answers • To produce a model of a protein given a template • MODELLER. Very popular homology modelling package. Free for academic use • To refine the side-chain conformations • SCWRL

Ab-Initio modelling • In general this kind of modelling is still quite primitive when compared to homology modelling • However, without a template it is the only choice • Pure ab-initio modelling is still very costly and ineffective, but hybrid homology/ab-initio methods such as fragment assembly have better performance

Ab-Initio modelling • The most advanced ab-initio method is fragment assembly • It consists of breaking up the sequence into small subsegments of 3 to 9 residues and generating structures for these segments based on a large library of known fragments • Decoys are generated from all possible combinations of fragments • An energy minimization process is applied to all decoys • Decoys are clustered and the final models are selected from the center of the largest clusters
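The final selection step can be illustrated with a very simple clustering rule. This sketch assumes a precomputed matrix of pairwise structural distances between decoys (for instance RMSD values) and a distance cutoff chosen by the user; the function name and the neighbour-counting rule are assumptions for illustration and not the procedure of any specific predictor.

```python
def largest_cluster_centre(distances, cutoff):
    """Return (centre index, member indices) for the decoy with the most neighbours.

    distances: square matrix (list of lists) of pairwise decoy distances.
    cutoff: two decoys belong to the same cluster when their distance is <= cutoff.
    """
    best_centre, best_members = None, []
    for i, row in enumerate(distances):
        # Every decoy within the cutoff of decoy i (including i itself).
        members = [j for j, d in enumerate(row) if d <= cutoff]
        if len(members) > len(best_members):
            best_centre, best_members = i, members
    return best_centre, best_members
```

To obtain several models, the members of the selected cluster can be removed and the search repeated on the remaining decoys.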

Energy minimisation Energy minimization is not easy. We may need to go uphill before we can reach the lowest energy conformation
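One common way to allow uphill moves is the Metropolis Monte Carlo criterion. The sketch below applies it to a one-dimensional toy energy function rather than to a protein; the function name and all parameter values (step size, KT, number of steps, seed) are illustrative assumptions.

```python
import math
import random


def metropolis_minimise(energy, start, steps=10000, step_size=0.5, kt=1.0, seed=0):
    """Minimise a 1D energy function with Metropolis Monte Carlo.

    Uphill moves are accepted with probability exp(-dE / kt), which lets the
    search escape local minima. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    x = best_x = start
    e = best_e = energy(x)
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_new = energy(candidate)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kt):
            x, e = candidate, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

For example, with the double-well function `lambda x: (x * x - 1) ** 2 + 0.3 * x` and `start=1.0`, the search begins in the shallower well and has to climb over the barrier at x = 0 to reach the deeper well near x = -1.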

Energy functions for ab-initio methods • Energy function needs to take into account the interactions of all atoms of all amino acids • Many different types of energy sources • Covalent bonds • Angles and torsions of bonds between atoms • Van der Waals interactions (repulsion/attraction) • Energy of charged atoms • Interactions with solvent • Hydrogen bonds • Exact formulas are very costly, so generally PSP methods use knowledge-based potentials, computed from a large database of structures

I-Tasser • Prediction method from Zhang’s group • Fully automated server, without any human intervention • Steps • Template identification • Structure assembly • Atomic model construction • Model selection

I-Tasser: Template Identification • MUSTER fold recognition method, used both for whole proteins (TBM) and for fragments (Ab Initio) • Profile-based fold recognition • Secondary structure • Structural fragment profile • Solvent accessibility • Backbone torsion angle • Hydrophobicity • For the most difficult targets, a meta-server that combines the outputs of various methods is used

I-Tasser: Structure assembly • Generation of a preliminary model with only coordinates for Cα and sidechain positions • Using the template as starting point where possible and ab-initio methods for amino acids without alignment • Two iterations of refinement • 1st based on templates • 2nd based on clustering the models of the previous iteration and using the centroids of each cluster as starting points

I-Tasser energy function • Knowledge-based statistics of • Cα – sidechain correlation • H-bonds • Hydrophobicity • Spatial restraints of templates • Contact Map prediction from SVMSEQ • 9 predictions included, combinations of • Contacts between Cα, Cβ or side chain centers • Contact cut-offs of 6, 7 or 8 Å

I-Tasser atomic model construction • Full-atom models are constructed from the approximate models produced by the cluster centroids • 1st the backbone is matched with a large library of template fragments with high resolution structure • Then full-atom optimization occurs focusing on H-bonds, removing clashes and using the Charmm22 molecular dynamics force field

I-Tasser model selection • Several full-atom models are generated from each cluster centroid • Models need to be ranked to select the best one • I-Tasser uses a weighted sum of • Number of H-Bonds / target length • TM-score (metric to compare structures) between the full-atom model and the cluster centroid

Rosetta • Predictor from David Baker’s group • It uses a massive distributed computing infrastructure (Rosetta@home) • For CASP7 in 2006 it claimed to dedicate up to 104 cpu years/target • Template identification used a variety of methods depending on sequence identity between target and template • Different protocols for Template-Based Modelling and Free Modelling (fragment assembly) • 3 variants of TBM depending on degree of homology between target and template

Rosetta • Full-atom refinement protocol • Energy function based on • Short-range interactions: Van der Waals energy, H-bonds and solvent accessibility • Long-range interactions (dampening of electrostatic interactions) • Minimization through Monte Carlo with the following steps: • Perturbation of a randomly selected angle from the backbone • Optimisation of side-chain rotamer conformations • Optimisation of both backbone and sidechain torsion angles

PSP and CASP • PSP has improved through the years. This improvement has been assessed mainly in CASP • CASP = Critical Assessment of Techniques for Protein Structure Prediction • It is a biennial community exercise to evaluate the state-of-the-art in PSP • Every day for about three months the organizers release some protein sequences for which nobody knows the structure (128 sequences were released in CASP8 in 2008) • Each prediction group is given three weeks to return their predictions. 24 hours are given to automated servers • Then at the end of the year experts meet in a place close to the sea to discuss the results of the experiment

CASP categories • Several categories of experiments are assessed in CASP • Template-Based Modeling (Homology and fold recognition) • Free Modeling (no template i.e. ab initio) • Contact Map prediction • Functional sites prediction • Domain prediction • Disordered regions • Quality assessment • Categories have changed through time • SS prediction is not assessed anymore after CASP4 • Homology modeling and fold recognition merged into TBM

Progress through CASP (from Nick Grishin’s Humans vs Servers presentation in CASP8) 1. Computers help structure prediction: no more paper models 2. Knowledge-based potentials work better 3. Local “threading” and fragment assembly (Baker) 4. Averaging and consensus methods work: meta-servers (Ginalski-Rychlewski) 5. Sequence profile methods are as powerful as (or more powerful than) threading (Söding) 6. Jamming poorly similar templates together helps (Skolnick-Zhang)

Assessment of 3D PSP • How can we quantify how good a model is? • That is, how similar is a model structure to the actual (native) one? • We will see this in depth when we cover the protein structure comparison topic, later in the module • Now we are just going to describe the most popular metric, GDT-TS

GDT-TS • Global Distance Test – Total Score • This measure tries to produce a balance between good local and global similarity of structures (unlike RMSD) • If a measure only takes a global point of view, good models that only fail badly in a few amino acids could be discarded

GDT-TS steps • All segments of 3, 5 and 7 consecutive amino acids from the model are superimposed on the actual structure • Each of them will be iteratively extended while it is good enough • Good enough = Distance between all residue pairs (represented by their Cα atoms) is less than a certain threshold • A final superposition includes the set of segments covering as many residues as possible • Segments do not need to be continuous

GDT-TS metric • The process of superposition is performed four times, using thresholds of 1, 2, 4 and 8 Å • The reason for including 4 different thresholds is to have a metric which is good both for high accuracy models and for approximate models
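The final score is the average, over the four thresholds, of the percentage of residues whose Cα atoms lie within the threshold. The sketch below is a simplification: it assumes the model and native Cα coordinates have already been superimposed once, whereas the real procedure searches for the best superposition at each threshold. The function name and input format are assumptions.

```python
import math


def gdt_ts(model_ca, native_ca, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS for two already-superimposed lists of C-alpha coordinates.

    For each threshold, count the residues whose model-to-native distance is
    within it, then average the percentages. The real procedure searches for
    the best superposition per threshold; this sketch uses one fixed superposition.
    """
    if len(model_ca) != len(native_ca) or not native_ca:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    distances = [math.dist(m, n) for m, n in zip(model_ca, native_ca)]
    percentages = [
        100.0 * sum(d <= t for d in distances) / len(distances) for t in thresholds
    ]
    return sum(percentages) / len(percentages)
```

Passing `thresholds=(0.5, 1.0, 2.0, 4.0)` gives the corresponding high-accuracy variant under the same simplifying assumption.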

GDT-HA • HA = High Accuracy • Set of thresholds in GDT-TS changed to 0.5, 1, 2 and 4 Å • For high accuracy models, GDT just provides a crude approximation (backbone). So other measures are taken into account • H-bonds • Position and rotation of sidechains • Clashes of atoms

Contact Map prediction in CASP • Contact Map is assessed using the targets in the Free Modelling category • Also, only long-range contacts (with a minimum chain separation of 24 residues) are evaluated • Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction • The assessors then rank the predictions for each protein and take a look at the top L/x ones, where L is the length of the protein and x={5,10}

Contact Map prediction in CASP • From these L/x top ranked contacts two measures are computed • Accuracy: TP/(TP+FP) • Xd: difference between the distribution of predicted distances and a random distribution
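The accuracy measure can be sketched as follows. This is a minimal illustration, assuming predictions are given as (i, j, confidence) tuples and the native contacts as a set of residue index pairs; the function name and data layout are assumptions, and the Xd measure is not covered.

```python
def top_contact_accuracy(predictions, true_contacts, length, x=5, min_separation=24):
    """Accuracy TP/(TP+FP) over the top L/x long-range predicted contacts.

    predictions: list of (i, j, confidence) with residue indices i, j.
    true_contacts: set of frozenset({i, j}) pairs in contact in the native structure.
    length: protein length L; x is 5 or 10 in the assessment described above.
    """
    # Keep only long-range predictions, then rank them by confidence.
    long_range = [p for p in predictions if abs(p[0] - p[1]) >= min_separation]
    ranked = sorted(long_range, key=lambda p: p[2], reverse=True)
    top = ranked[: max(1, length // x)]
    if not top:
        return 0.0
    true_positives = sum(frozenset((i, j)) in true_contacts for i, j, _ in top)
    # Every selected prediction is either a TP or an FP, so TP + FP = len(top).
    return true_positives / len(top)
```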

CASP9 results These two groups derived contact predictions from 3D models http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf

Other CASP prediction categories • Functional sites prediction • Predicting which residues of a given sequence are those that perform the chemistry of the protein • Bind to other proteins/compounds • Methods can use whatever information they can infer to perform this prediction • However, most predictions can be performed simply by homology • Domain prediction • Domains = quasi-independent subsets of a protein, that fold on their own • Their prediction follows a simple divide-and-conquer motivation • It is much easier to create separate models for the different domains of a protein

Disordered regions prediction • Regions of a protein that do not fold into a unique pattern (no coordinates in the PDB file) • 75% of mammal signaling proteins are estimated to contain long (>30) disordered regions, and 25% of the total amount of proteins may be fully disordered • Thus, it is useful to predict from the sequence if that is the case

Disordered protein 2K5K

Quality assessment prediction • Given a model, can we predict how good it is (without comparing it to the native structure)? • Overall and per-residue model quality • Prediction was done based on the models from the server category • Two families of methods • That perform predictions for individual models • That take a set of models and give predictions based on consensus agreements

Summary of topic • Importance of PSP • Many different types of prediction included in the PSP family • 3D PSP • Prediction of amino acid structural features • Others • Families of 3D PSP • Template-based Modelling • Free modelling
