Protein folding: a bioinformatics perspective (Plegamiento de proteínas: Una perspectiva bioinformática)

Ugo Bastolla, Red Nacional de Bioinformática and Centro de Astrobiología (CSIC-INTA). Universidad Politécnica de Madrid, 14 January 2003

Proteins as interdisciplinary molecules
• Proteins are evolving molecular machines, at the border between physics and biology.
• They are molecular machines that obey the laws of statistical mechanics.
• They are evolving machines, produced through the action of mutation and natural selection.
• Bioinformatics integrates both sources of information to predict biological properties.
Thermodynamics sheds light on protein evolution, and evolutionary considerations shed light on protein folding.

Proteins are polymers built from 20 amino-acid types linked by peptide bonds. The soft degrees of freedom are the backbone torsion angles phi and psi.

Torsion angles cluster at values corresponding to regular local structure (secondary structure), stabilized by hydrogen bonds.

Hierarchical organization of protein structure

Many proteins (e.g. antibodies) are formed by several, almost independently folding units called domains.

Protein Folding
AMTYHLDVVSAEQQMFSGLVEKIQVT…
Most proteins fold spontaneously into a well-defined three-dimensional conformation, the Native State. It is believed that the Native State is the state of minimal free energy available to the protein-plus-solvent system. This depends on the state of the solvent, e.g. on temperature and pH.

Statistical mechanics of protein folding
A chain of N residues has an exponentially large number of conformations, of order $e^{aN}$. Boltzmann distribution in configuration space: $P(C) \propto \exp(-E(C)/k_B T)$, where $k_B$ is the Boltzmann constant and $T$ is the absolute temperature. The effective free energy $E(C)$ depends on the state of the solvent (averaged out) through temperature, pH, presence of denaturants…
Under proper conditions, the configuration with minimal effective free energy and its neighbors have a total Boltzmann probability close to one. They are always observed in independent experiments and represent the Native State.
[Figure: free energy profile of the transition between the Unfolded and Native states.]

Lattice models of protein folding
• Exponentially large number of conformations, explored with Monte Carlo simulations.
• Well-designed sequences fold fast to the lowest free energy state and are stable both thermodynamically and against mutations. They have a well-correlated energy landscape.
• Qualitative features are reproduced, but no experimental comparison is possible.

The normalized energy gap $\alpha$ gives a quantitative measure of energy landscape correlations:
$E(C) - E(C_0) > \alpha\,(1 - q(C, C_0))\,|E(C_0)|$
• Random sequences: slow folding, low stability, small $\alpha$.
• Designed sequences: fast folding, high stability, large $\alpha$.
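
The inequality above can be turned into a number for a finite sample of alternative conformations. The following is a minimal sketch, assuming that the normalized energy gap is taken as the tightest bound, that is, the minimum over sampled conformations C of (E(C) - E(C0)) / ((1 - q(C, C0)) |E(C0)|), and that C0 is the native state. The function name and the decoy values are hypothetical.

```python
def normalized_energy_gap(e_native, alternatives):
    """Largest alpha with E(C) - E(C0) >= alpha * (1 - q) * |E(C0)| for every sampled C.

    alternatives: list of (energy, overlap_with_native) pairs.
    """
    ratios = [
        (energy - e_native) / ((1.0 - q) * abs(e_native))
        for energy, q in alternatives
        if q < 1.0  # q == 1 is the native contact map itself; the bound is trivial there
    ]
    if not ratios:
        raise ValueError("need at least one alternative with overlap below 1")
    return min(ratios)


# Hypothetical energies (in units of kB*T) and overlaps for three decoys.
decoys = [(-5.0, 0.5), (-8.0, 0.2), (-2.0, 0.0)]
alpha = normalized_energy_gap(-10.0, decoys)
print(alpha)
```

In this toy case the bound is set by the low-energy decoy with small overlap, which is exactly the kind of competing misfolded state that a small gap signals.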

Molecular Dynamics
• Model: all atoms in the protein.
• Solvent either explicit or implicit.
• Molecular dynamics simulations: $d^2 x_i/dt^2 = F_i(x_1, \dots, x_N)$, $t = 10^{-12}$ s.
• Force field ideally from “first principles” (but simplifications are needed!). E.g. CHARMM, AMBER.
• Very useful to model the functioning of an enzyme, but useless for folding prediction: time scales are too long, simulations can be trapped in energy minima, and it is not even clear whether the model is accurate enough.
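
To make the equation of motion concrete, here is a minimal sketch of a numerical integrator for d2x_i/dt2 = F_i(x). It assumes unit masses (as the equation is written on the slide) and uses the velocity Verlet scheme with a toy harmonic force. Neither the integrator choice nor the force is taken from the talk, and real force fields such as CHARMM or AMBER are far more elaborate.

```python
import math


def velocity_verlet(x, v, force, dt, steps):
    """Integrate d2x_i/dt2 = F_i(x) (unit masses) with the velocity Verlet scheme."""
    f = force(x)
    for _ in range(steps):
        # Position update uses the current velocity and force.
        x = [xi + vi * dt + 0.5 * fi * dt * dt for xi, vi, fi in zip(x, v, f)]
        f_new = force(x)
        # Velocity update averages the old and new forces.
        v = [vi + 0.5 * (fi + fn) * dt for vi, fi, fn in zip(v, f, f_new)]
        f = f_new
    return x, v


def harmonic_force(x):
    """Toy force field: independent unit-frequency springs, F_i = -x_i."""
    return [-xi for xi in x]


# One coordinate released from rest at x = 1; the exact solution is cos(t).
x_end, v_end = velocity_verlet([1.0], [0.0], harmonic_force, dt=0.01, steps=1000)
energy = 0.5 * (x_end[0] ** 2 + v_end[0] ** 2)
print(x_end[0], math.cos(10.0), energy)
```

The total energy stays close to its initial value of 0.5, which is the usual sanity check for this kind of integrator.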

The holy grail of protein folding
Develop a model simple enough to allow computation, yet realistic enough to be comparable with experiments. The only a priori reliable model needs quantum interactions (for instance, interactions between aromatic amino acids) and all atoms of the solvent (PROBLEM!). The simplest models have 2N torsion-angle degrees of freedom. The number of possible conformations is $O(e^{2N})$, far too large to enumerate even for a short chain.

Homology modelling
• Just use biology, not physics! (homology = common origin)
• Proteins with more than 25% sequence similarity always have very similar structures, because structure is strongly conserved in evolution.
• Align query and template.
• Build the backbone from the aligned template.
• Build non-aligned regions (loops).
• Build side chains.
The more similar the sequences, the more similar the structures and the better the model.

Homology models tend to be rather good in conserved regions, but they are poor in more variable regions (loops). They are only reliable if sequence similarity is above 25% (this threshold has been decreased due to better alignment techniques), whereas most protein pairs have lower similarity.

The bioinformatic approach: look at known protein structures
• Related proteins with the same fold typically have low sequence similarity: their relationship can hardly be recognized by sequence alignment alone.
• Score: how suitable is a structure template for a query sequence? (Effective energy function)
• Recognize the known structure that best fits an unknown sequence.
• There is no physical derivation for the scoring scheme, but thermodynamic estimates are sometimes possible.

Reduced representation of proteins
We represent protein structures as contact maps: $C_{ij} = 1$ if residues $i$ and $j$ are in contact, $0$ otherwise.
Similarity is measured as the fraction of common contacts, or overlap: $q(C, C') = \sum_{ij} C_{ij} C'_{ij} / \max\left(\sum_{ij} C_{ij}, \sum_{ij} C'_{ij}\right)$.
Alternative structures of sequence $A = \{A_1 \dots A_N\}$ are generated by aligning $A$ without gaps with all structures in the PDB (gapless threading).
The energy is assumed to be of the form $E(C, A)/k_B T = \sum_{ij} C_{ij}\, U(A_i, A_j)$, depending on 210 parameters $U(a, b)$.
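
The overlap and the contact energy can be sketched in a few lines. This is an illustration only: contact maps are stored as sets of residue-index pairs (i, j) with i < j, and the interaction table `u_toy` is a hypothetical two-letter stand-in for the 210-parameter matrix U(a, b), whose actual values are not given in the slides.

```python
def overlap(c1, c2):
    """Fraction of shared contacts, normalized by the larger contact count."""
    return len(c1 & c2) / max(len(c1), len(c2))


def contact_energy(contacts, sequence, u):
    """E/kBT = sum over contacts (i, j) of U(A_i, A_j); u is keyed by sorted residue pairs."""
    total = 0.0
    for i, j in contacts:
        pair = tuple(sorted((sequence[i], sequence[j])))
        total += u[pair]
    return total


# Hypothetical toy values for a two-letter alphabet (H = hydrophobic, P = polar).
u_toy = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.2}
native = {(0, 3), (1, 4), (0, 5)}
decoy = {(0, 3), (2, 5)}
seq = "HPPHPH"

print(overlap(native, decoy))
print(contact_energy(native, seq, u_toy), contact_energy(decoy, seq, u_toy))
```

Storing only the pairs in contact keeps both functions linear in the number of contacts, which matters when a sequence is threaded over every structure in a database.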

Effective energy for simplified protein models
We have optimized the parameters of a contact energy function such that the Native State has the lowest energy and the energy landscape is well correlated for most independent proteins in the Protein Data Bank.
Our optimization method is based on the maximization of the Boltzmann average of the similarity with the native state: $Q(A) \sim \sum_C \exp(-E(C, A)/k_B T)\, q(C, C_{nat})$.
When this parameter is maximal ($Q \sim 1$), the native state has lowest energy and dissimilar states have high energy (the energy landscape is well correlated). This can be achieved for nearly all proteins in the PDB.
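
As a rough illustration of the quantity being maximized, the sketch below computes the Boltzmann average of the overlap over a small set of alternative structures. It assumes the average is normalized by the partition function (the sum of the Boltzmann weights), which the slide leaves implicit. The function name and the sample values are hypothetical.

```python
import math


def mean_native_overlap(states, kT=1.0):
    """Boltzmann average of q: sum(q * exp(-E/kT)) / sum(exp(-E/kT)).

    states: list of (energy, overlap_with_native) pairs.
    """
    e_min = min(e for e, _ in states)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e, _ in states]
    z = sum(weights)
    return sum(w * q for w, (_, q) in zip(weights, states)) / z


# Well-correlated toy landscape: the native state (q = 1) lies far below the decoys.
good = [(-10.0, 1.0), (-2.0, 0.3), (-1.0, 0.1)]
# Flat toy landscape: the native state and a dissimilar decoy are degenerate.
flat = [(0.0, 1.0), (0.0, 0.0)]
print(mean_native_overlap(good), mean_native_overlap(flat))
```

The first value is close to 1 and the second is 0.5, which shows how Q penalizes low-energy states that are dissimilar from the native one.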

Effective energy applied to crystal structures
Prediction of unfolding free energies using crystal structures and the effective energy function: $\Delta G/(N k_B T) = E_{nat}/(N k_B T) - s$.
The Native States have lowest energy and the energy landscapes are well correlated. The resulting normalized energy gap $\alpha$ (0.2-0.8) is much higher than for random sequences (<0.1) and increases with chain length.

The main contribution to the energy parameters comes from hydrophobicity

A facility for protein structure prediction at CAB (http://www.cab.inta.es/~CAFASP/)
The PROTFINDER algorithm looks for the structure in the PDB that best aligns (with gaps) to the query sequence. It took part in CAFASP4. It is available through a web server built and maintained by Alain Lepinette of CAB.

Scoring function
Sequence-structure alignment $a(i)$:
$\mathrm{Score} = -\sum_{ij} C(a(i), a(j))\, U(A_i, A_j) - S_0 L_{ali} - G_0 N_{gaps} - G_1 L_{gaps}$
• Contact free energy (first term).
• Configurational entropy loss $S_0$ for each aligned residue.
• Gap penalties $G_0$ (create) and $G_1$ (extend).
Sequence homology information is not used. A semi-deterministic algorithm is used to generate candidate alignments.
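
The score of one candidate alignment can be evaluated as in the sketch below. Several details are assumptions made for illustration and are not stated in the slides: the alignment is a list giving the template position of each query residue (or None if unaligned), template positions increase along the query, a gap is any run of skipped query or template residues between two consecutive aligned residues, and terminal gaps are free. The interaction table and parameter values are hypothetical.

```python
def threading_score(alignment, sequence, contacts, u, s0, g0, g1):
    """Score = -(contact energy) - s0 * L_ali - g0 * N_gaps - g1 * L_gaps."""
    aligned = [(i, t) for i, t in enumerate(alignment) if t is not None]

    # Contact term: template contacts between pairs of aligned residues.
    energy = 0.0
    for x in range(len(aligned)):
        for y in range(x + 1, len(aligned)):
            (i, ti), (j, tj) = aligned[x], aligned[y]
            if (ti, tj) in contacts:
                energy += u[tuple(sorted((sequence[i], sequence[j])))]

    # Gap term: skipped query residues and skipped template residues.
    n_gaps = 0
    l_gaps = 0
    for (i, ti), (j, tj) in zip(aligned, aligned[1:]):
        for skipped in (j - i - 1, tj - ti - 1):
            if skipped > 0:
                n_gaps += 1
                l_gaps += skipped

    return -energy - s0 * len(aligned) - g0 * n_gaps - g1 * l_gaps


# Hypothetical two-letter interaction table and penalties.
u_toy = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.2}
ungapped = threading_score([0, 1, 2], "HPH", {(0, 2)}, u_toy, s0=0.1, g0=0.5, g1=0.2)
gapped = threading_score([0, 1, 3], "HPH", {(0, 3)}, u_toy, s0=0.1, g0=0.5, g1=0.2)
print(ungapped, gapped)
```

Both toy alignments gain the same contact energy, so the difference between the two scores is exactly the cost of opening and extending one gap of length one.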

Fold recognition
The ability to predict protein structures depends crucially on the most similar structure available in the database, $q_{max}$.
• When a very similar structure is present in the database, it is correctly selected on the basis of the energy.
• When no structure lies above a threshold of similarity, the prediction is almost random.
The high similarity needed is frequent in proteins of detectable homology, but not in very distant homologs.

Sequence-structure alignments obtained through ProtFinder are very similar to those found in databases of protein alignments (PFAM)

The CASP experiment evaluates protein structure prediction methods

Stability of orthologous proteins Related proteins with the same fold have typically low sequence similarity. What are the common features of their sequences? How similar are their thermodynamic properties?

With our tools we can compare thermodynamic properties of homologous proteins. We estimate two key parameters: the folding free energy $\Delta G$ and the normalized energy gap $\alpha$. We apply our energy function to families of orthologous proteins, predicting their native structure. In all cases, this coincides with the structure of the closest analog in the PDB, even though our algorithm does not use information on sequence similarity.

List of organisms:
• Free-living: B. subtilis, B. anthracis, C. crescentus, D. radiodurans, E. coli, E.acidophylus, H. influenzae, L. lactis, L. monocytogenes, L. innocua, M. tuberculosis, M. smegmatis, N. meningitidis, P. multocida, P. aeruginosa, P. putida, R. loti, R. meliloti, S. typhimurium, S. aureus, S. pyogenes, S. coelicolor, Synechococcus, T. pallidum, V. cholerae, X. fastidiosa, Z. mobilis
• Intracellular: B. burgdorferi, B. aphidicola (APS, BPS, SGR), C. jejuni, C. pneumoniae, C. trachomatis, H. pylori, M. capricolum, M. genitalium, M. pneumoniae, M. leprae, R. prowazekii, U. parvum, Y. pestis, W. glossinidia, Wolbachia sp.
• Thermophiles: A. aeolicus, B. stearothermophilus, T. maritima, T. aquaticus
• Archaea: A. pernix, A. fulgidus, M. jannaschii, M. thermoautotrophicum, P. furiosus
List of genes:
ATPE, ACKA, AROQ, COAD, DDL, DUT, EFTS, FLAV, FOLA, FTSJ, PDF, PTH, PTHP, RL14, RNH, RNPA, TRXA, TRXB, TPIS, TRPA, DNAK

Protein folding thermodynamics depends on hydrophobicity. More hydrophobic sequences have a more negative folding free energy (they are more stable against unfolding), but they have a lower normalized energy gap (they are less stable against misfolding). Evolution has to look for a compromise between these properties! (Frustration)

Folding efficiency (normalized energy gap) is correlated with genome size. Smaller genomes, such as those of intracellular bacteria, have reduced folding efficiency. Possible misfolding problems are consistent with observed high expression of chaperones in these bacteria.

Intracellular bacteria
The genomes of obligate intracellular organisms (organelles, endosymbionts, parasites) share important common features:
• Very small genomes
• High AT content
• High hydrophobicity
• Reduced population size
• Reduced folding ability of proteins
These features can be explained from the point of view of evolutionary theory.

Our results show that the normalized energy gap $\alpha$ is smaller for intracellular bacteria than for free-living bacteria. This can be explained (a) by the mutation bias of intracellular genomes towards A+T, which makes them express more hydrophobic proteins, and (b) by the weaker selection experienced by intracellular bacteria due to their small populations. A smaller folding parameter implies that misfolding occurs much more often. This can lead to protein aggregation, which is very dangerous for cellular processes. To avoid aggregation, these bacteria express very high amounts of chaperones, proteins in charge of helping protein folding. The chaperone DNAK appears more stable in organisms with smaller genomes.

What do sequences with the same fold have in common?
Spectral decomposition of the interaction matrix: $E = \sum_{ik} C_{ik}\, U(A_i, A_k) \sim \sum_{ik} C_{ik}\, h(A_i)\, h(A_k)$.
Sequences with the same fold have a similar Hydrophobicity Vector $h(A_i)$ (HV). The HV has a large correlation $r(h, c)$ with the Principal Eigenvector (PE) of the contact matrix $C_{ij}$.
Therefore, sequences with the same fold have a common hydrophobic fingerprint that coincides with the PE of the contact matrix. The evolutionary average HV correlates with the PE much more strongly than the HV of a single sequence.
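
The correlation between a hydrophobicity vector and the principal eigenvector of a contact matrix can be sketched as follows. This is only an illustration: the eigenvector is obtained by power iteration on C + I (the shift is a choice made here to keep the iteration from oscillating), and both the star-shaped contact matrix and the hydrophobicity values are hypothetical.

```python
import math


def principal_eigenvector(matrix, iterations=200):
    """Power iteration on (C + I) for a symmetric, non-negative contact matrix C."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        # (C + I) v = v + C v; the identity shift leaves the eigenvectors unchanged.
        w = [v[i] + sum(matrix[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy contact matrix: residue 0 is buried and touches residues 1, 2 and 3.
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
pe = principal_eigenvector(star)
h = [1.0, 0.2, 0.1, 0.3]  # hypothetical hydrophobicity of each residue
print(pe, pearson(h, pe))
```

The buried residue gets the largest eigenvector component, so a sequence that places its most hydrophobic residue there shows a high correlation.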

Bioinformatics
• Biological information is accumulating at a very fast pace.
• This information needs to be classified for storage and retrieval. (One could say that biology is the art of classifying!)
• Protein structures: decomposition, structural classification, hidden evolutionary relationships.
• Biological sequences: identification of protein sequences (genes), classification, structure and function prediction.
• Molecular interactions: reconstruction of metabolic networks and cellular regulatory networks (systems biology).
• Organisms: evolutionary classification (phylogeny).
• Biological literature: classification and retrieval.

Proteins are made of modules (domains) that are duplicated and combined in many possible ways to create ever new molecules.

The Protein Data Bank (PDB) contains roughly 24000 protein structures, determined either by X-ray crystallography or by NMR spectroscopy. Fewer than 4000 are different folds. The number of new folds (blue bar) is decreasing each year. Other classification schemes yield fewer than 1000 different folds. Evolution uses a reduced number of folds for a large number of biological functions.

CATH structural classification: 813 folds (Topology level) (Thornton, Orengo)

SCOP Structural Classification of Proteins: 800 folds (Chothia, Murzin)

DALI: Algorithm and server for automatic classification of protein structures (Holm and Sander). It aligns protein structures minimizing the dissimilarity score
$S = \sum_{ik} \frac{|r^a_{ik} - r^b_{ik}|}{r^a_{ik} + r^b_{ik}} \exp\left(-\frac{(r^a_{ik} - r^b_{ik})^2}{4 r_0^2}\right)$, with $r_0 = 20$ Å.
The sum runs over C-alpha atoms $i, k$. It generates the FSSP database of structurally similar proteins (S much smaller than for random pairs of structures, Z-score criterion).

For each new structure:
• Store it in the PDB with the proper format.
• Decompose it into domains.
• Classify the domains and discover new evolutionary relationships.
For each new sequence:
• Find the gene sequences in the genome (easy for prokaryotes, very difficult for eukaryotes because genes are interrupted by introns).
• Find homologous domains, infer structure and function.
• Decide whether structure determination is worthwhile.

Protein databases
• GenBank: protein sequences (not annotated), from genomic projects.
• SwissProt: annotated protein sequences. Domain organization, structure, function and active site may be known from homology.
• Protein Data Bank (PDB): protein structures.

Sequence Alignment

Alignment is the main tool in Bioinformatics. It is justified by the fact that aligned elements have a common evolutionary origin (homology). In evolution, amino acids or nucleotides can be conserved, substituted (usually with minimal modification of the Native State), inserted or deleted. The last two processes generate gaps in the alignment.
The score for an alignment $a(i)$ between two sequences $A^1_i$, $A^2_k$ is
$\mathrm{Score} = \sum_i S(A^1_i, A^2_{a(i)}) - G_0 N_{gaps} - G_1 L_{gaps}$
The 20 × 20 matrix $S(a, b)$ is called the substitution matrix and is determined from aligned protein families. The most used are the BLOSUM62 and PAM250 matrices. $G_0$ is the gap opening penalty and $G_1$ is the gap extension penalty.
The number of possible alignments grows exponentially with sequence length, but the optimal alignment can be found exactly by dynamic programming (Needleman & Wunsch, Smith & Waterman), in $O(L^2)$ time for gap penalties of this affine form.
The optimal solution is often, but not always, the biologically relevant one. The gap parameters and the substitution matrix used are crucial! One has to check the statistical significance.
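
The dynamic programming idea can be sketched for the global alignment score with gap opening and gap extension penalties. This is a minimal illustration and not the exact procedure of any particular program: it uses the standard three-matrix recursion for affine gaps, a plain match/mismatch score as a hypothetical stand-in for a substitution matrix such as BLOSUM62, and it returns only the optimal score, not the alignment itself.

```python
NEG_INF = float("-inf")


def affine_alignment_score(a, b, match=1.0, mismatch=-1.0, g0=2.0, g1=1.0):
    """Best global alignment score; a gap of length L costs g0 + g1 * L."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -g0 - g1 * i  # leading gap in b
    for j in range(1, m + 1):
        Y[0][j] = -g0 - g1 * j  # leading gap in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Open a new gap (pay g0 + g1) or extend the current one (pay g1).
            X[i][j] = max(M[i - 1][j] - g0 - g1, X[i - 1][j] - g1, Y[i - 1][j] - g0 - g1)
            Y[i][j] = max(M[i][j - 1] - g0 - g1, Y[i][j - 1] - g1, X[i][j - 1] - g0 - g1)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_alignment_score("ACGT", "ACGT"))
print(affine_alignment_score("ACGT", "AGT"))
print(affine_alignment_score("AAAA", "AA"))
```

Each cell is filled from a constant number of neighbours, so the cost is proportional to the product of the two sequence lengths. The last call shows the effect of the affine penalty: one gap of length two is preferred over two gaps of length one.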

Multiple Sequence Alignments
• Optimal multiple alignment of M sequences is an NP-hard problem: no algorithm polynomial in M is thought to exist. Once the first two sequences have been aligned, in fact, the score for the next one is modified!
• The most used solution is implemented in the CLUSTALW algorithm. It consists of aligning the easy pairs first:
  • Align all pairs of sequences with a fast algorithm.
  • Build a tree of their relationships.
  • Start by accurately aligning the two most closely related sequences (the easiest). Represent both of them with a single profile.
  • Iterate, looking again for the two most closely related sequences or profiles.

Database search
• Often, we do not need accurate alignments but just a list of database entries that are evolutionarily related to our query sequence. The most used algorithms for this purpose are BLAST and FASTA.
• BLAST compares the query sequence to all sequences in a database like SwissProt or GenBank in a few seconds. For each pair of sequences, it finds all exact matches of length k, extends and combines them, and provides the P value, the probability that the matches are found by chance.
• PSI-BLAST is an iterative procedure based on BLAST:
  • Find all sequences significantly related to the query.
  • Construct a profile (amino acid distribution per site) from the multiple alignment.
  • Iterate the search using the profile as query.
• In this way, very distant evolutionary relationships can be retrieved confidently. This method is very useful for protein structure prediction.
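
The first step described above, finding all exact matches of length k between two sequences, can be sketched with a simple word index. This covers only the seeding step as stated on the slide. The extension of seeds, their combination and the P value computed by the real BLAST program are not reproduced, and the function name and sequences are hypothetical.

```python
from collections import defaultdict


def kmer_seeds(query, subject, k=3):
    """Return sorted (query_pos, subject_pos) pairs where both sequences share an exact k-mer."""
    # Index every k-mer of the subject by its start positions.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    # Look up every k-mer of the query in the index.
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return sorted(seeds)


print(kmer_seeds("MKVLAT", "AAKVLGG", k=3))
```

Indexing the words once is what makes the search fast: each query word is a dictionary lookup rather than a scan of the whole subject sequence.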

Phylogenetic trees
[Figure: tree with leaves A, B, C, D, E, F, G; axis: Distance.]
Evolving species can be placed on the leaves of a phylogenetic tree. The time elapsed since the last common ancestor of species A and B, d(A,B), is a distance that allows classification. This is based on the ultrametric property: all triangles have their two longest sides equal. Phylogenetic trees were once built by comparing external characters, but now they are built using macromolecules such as proteins, RNA and DNA.
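
The ultrametric property can be checked directly on a table of pairwise distances. The sketch below is illustrative: the distance table is stored as a nested dictionary, and the species labels and values are hypothetical.

```python
from itertools import combinations


def is_ultrametric(dist, labels, tol=1e-9):
    """True if, in every triangle, the two longest sides are equal (within tol)."""
    for a, b, c in combinations(labels, 3):
        sides = sorted([dist[a][b], dist[a][c], dist[b][c]])
        if abs(sides[2] - sides[1]) > tol:
            return False
    return True


# A and B diverged 1 time unit ago; C split from their common ancestor 3 units ago.
clock_like = {
    "A": {"B": 1.0, "C": 3.0},
    "B": {"A": 1.0, "C": 3.0},
    "C": {"A": 3.0, "B": 3.0},
}
# Same topology, but the lineage leading to B evolved faster.
not_clock_like = {
    "A": {"B": 1.0, "C": 2.0},
    "B": {"A": 1.0, "C": 3.0},
    "C": {"A": 2.0, "B": 3.0},
}
print(is_ultrametric(clock_like, "ABC"), is_ultrametric(not_clock_like, "ABC"))
```

Distances estimated from real sequences are noisy, so in practice a tolerance larger than the default would be needed.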

The molecular clock
• Empirical observation: the number of amino acid substitutions between two orthologous proteins (e.g. myoglobin) of two species A and B is linearly correlated with their divergence time t(A,B). Fluctuations of the number of substitutions are small.
• K(A,B) ~ a t(A,B)
• If the divergence time is not known, the number of substitutions can be used to estimate it. K(A,B) can be obtained from the number of mismatches in the sequence alignment, using some model of evolution to correct for multiple substitutions.
• Methods to generate phylogenetic trees range from deterministic clustering algorithms to optimization methods. The two most used are:
  • Neighbor Joining: join the two closest sequences, recalculate distances, iterate. Very fast but not very accurate.
  • Maximum Likelihood: for a model of sequence evolution (independent sites needed!), calculate the likelihood of the observed sequences given the parameters and the tree. Exhaustive search of the ML tree is impossible, but approximate algorithms give good results.
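
The estimate of K(A,B) from mismatches can be sketched with one simple model of evolution. The Poisson correction used here, K = -ln(1 - p) with p the fraction of mismatched sites, is an assumption chosen for illustration; the slides do not say which model was used. The rate constant and the sequences are hypothetical.

```python
import math


def poisson_corrected_distance(seq1, seq2):
    """Substitutions per site, K = -ln(1 - p), from two aligned sequences ('-' marks a gap)."""
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    p = sum(1 for x, y in pairs if x != y) / len(pairs)
    # Raises ValueError when p == 1: the mismatches are saturated and K cannot be estimated.
    return -math.log(1.0 - p)


def divergence_time(k, rate):
    """Invert the molecular clock K ~ a * t for a known rate constant a."""
    return k / rate


k = poisson_corrected_distance("AAAAAAAAAA", "AAAAAAAAAT")
print(k, divergence_time(k, rate=0.1))
```

The corrected distance is always at least as large as the raw mismatch fraction, because some sites have changed more than once.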

Tree of seven replication proteins found in all bacterial genomes (using the BLAST algorithm), obtained with the Neighbor-Joining method. The numbers represent bootstrap values (number of times, out of 1000, that the plotted branching is observed using a random resampling of the aligned positions). Some groups (clades) can be confidently reconstructed, for instance Proteobacteria and Gram-positive bacteria, but some divergences are too ancient and no similarity signal is found in their proteins.

Some problems with phylogenetics
• The protein tree, which is what we reconstruct, does not always coincide with the species tree if there has been gene transfer between species (frequent in bacteria) or gene duplication prior to species separation (paralogous proteins).
• The molecular clock is known to hold for neutral evolution (when the properties of the protein do not change), but adaptations happen at a much faster rate. The substitution rate can also vary between branches because of different mutation rates or generation times. When the rate is too variable, the estimates of branch lengths and the reconstructed trees are not reliable.
• The number of substitutions K(A,B) can be reliably estimated from the number of mismatches only when it is not saturated.
• An indication of these problems is that different proteins usually give different tree topologies.
