Protein Folding: Interrelation between Secondary and Tertiary Structure Determination

Protein Folding: Interrelation between Secondary and Tertiary Structure Determination Karl F. Freed James Franck Institute and Department of Chemistry University of Chicago KITPC, Beijing China, July 29, 2009.

Human, Dog, Rat, Worm Genome Projects: Obtain genes which code for protein sequences www.ncbi.nlm.nih.gov/Genbank/genbankgrowth.jpg www.genomesonline.org/images/gold_s1.gif

Proteins:The primary functional biomolecules insulin: cytochrome c: hormone heme group glucose electron transfer levels ribonuclease: lysozyme: myoglobin: cleaves RNA cleaves oxygen carbohydrates storage hemoglobin: oxygen transport glutamine synthetase: synthesize glutamine antibody:recognize/target foreign bodies Our goal is to determine the folded structures Function follows from structure

3D STRUCTURE “Unfolded State” Many conformations Amino acids - polar, neutral, charged, or hydrophobic Protein Folding Problem SEQUENCE Folding 0.00001 – 10+ sec “Native State” Responsible for function

Sequence Unfolded State, “statistical coil” Why haven’t we solvedThe Protein Folding Problemafter 40+ years? “Mother Folding” • Two aspects: • Predict pathways • Predict structure Sequence determines structure

Give me an aa sequence;I’ll produce a pathway and the final structure.Why so difficult? • We’re not smart enough • It’s a very complex system • …

Give me an aa sequence;I’ll produce a pathway and the final structure.Why so difficult? • Complex problem: • Too many atoms (not enough computing power) • Force fields inaccurate (pairwise interactions inadequate) • Complex interplay between secondary and tertiary structure formation (local vs. long-range structure) • High degree of folding cooperativity • Averaging doesn’t work (no mean field models) • H2O solvent is difficult to treat • Don’t know all the rules? • Not enough information? • Reductionist models (e.g. H/P) often too simple

What are the fundamental principles needed to predict pathways and structures?

- Predict structure from sequence: Common Strategy: Use sequence similarity (homology) to proteins with known structures. Use parts or full structure to predict new structure (homology). …ALEVADKVAIYTRE… …ALEVAEKVAIWTRE… ? Two aspects of “The Protein Folding Problem”: - Mechanistic studies: “How does it get from the U-state to the N-state?” Successful when have homology, but side-steps The Question: “What are the Principles?”

Form hydrogen bonds between all the nitrogens and oxygens Pack all atoms with no spaces Hydrophobic (greasy) atoms in the center away from water H2O a helix bstrand H2O H2O H2O H2O H2O van der Waals interactions Why do they fold into specific structures?

What level of representation is needed? Ca Cb Beads-on-a-string

Monomer structure & miscibility of polyolefins K. F. Freed and J. Dudowicz, Adv. Polym. Sci. 183, 63-126 (2005).

y f Mutuallystabilizing Major themes and challenges in protein folding 2) Satisfy main-chain hydrogen bonds and form secondary structure. 1) Polymer bends in certain ways 3) Bury hydrophobic residues and pack the atoms Must satisfy 1, 2 & 3 simultaneously

Vast Conformational Search: Levinthal Paradox Each , pair: 3 conformations ~3n ~100.5n conformations/amino acid e.g., 1050 states for a 100 a.a. protein Random, unbiased search would take 1035 s even with fsec sampling How does a protein find the time to fold? Polypeptide backbone is flexible, adopting specific conformations Poly-Proline II basin Beta basin y preferences Helical, turn basin Ramachandran Map f • And we have to search too!

Step 1: Remove solvent – ( hydrophobic, screening, etc terms) Step 2: Replace side chains by C atoms. y f Cb All-atom protein Only main chain heavy atoms and Cbatoms. Backbone f,y angles are the only coordinates. 2N total for N aa’s (need ~10x3N for x,y,z of heavy atoms) What info is needed to fold proteins? all-atom  dihedral angles All-atom protein + solvent Simulation would take decades

y f Reduced representation: Side chains are only Cb Retains the 3 themes Big Challenge: Retain sequence information lost with removal of side chains

y f -basin PPII-basin 1) Proteins bend in certain ways Sampling in dihedral space:f-yangles and Ramachandran BASINS e PP2 b Extended f&yshow very strong preferences for certain regions of the f-y map, called Ramachandran basins. (due to steric & electrostatic interactions) aL Helical Where to get this information? -basin

Phi Psi From computer simulations? • But, force fields can vary widely Zaman et al. JMB 2003

PDB y f Data mine protein data base (PDB) of crystal structures: Extract the distributions for each type of amino acid THR ALA y f

Ramachandran Map of ALA with neighbors=ALA, ASP ALAASP ALAALA Map depends on neighbor type and conformationStrong correlations in sequence

trimer library PDB Filter by residue type & conformation trimer move Move-set uses highly selected trimers from PDB Specifying side chain type and backbone geometry implicitly includes all-atom side chain information

PDB Knowledge-based energy function:Assign interaction energies according to the observed distances in the PDB ProbPDB(rij) Probability of finding 2 atoms some distance apart e.g. Dist(Caala – Caval). EnergyPDB(rij) = -ln( ProbPDB(rij) )

Statistics Machine Learning QVD Single residue and nearest neighbors Chou and Fasman (1974) Neural Network Jones (1999) Dor & Zhou (2007) Secondary structure probability conditional on identity of neighbors Multiple sequence alignments Support Vector Machines Kim (2003) Hu et al. (2004) McGruffin and Jones (2003): PSIPRED Dor and Zhou (2007) Current Accuracy of Best Methods is 75 – 80 %: Good for , but poorer for  and / Secondary structure prediction methods… + SASA, …

Attempt to integrate secondary and tertiary structure Neural Network + tertiary Meiler and Baker (2003) Add 3-D models to Machine Learning Folds target & homologs (~ 5% gain) What prevents accuracy of secondary structure prediction from reaching ~90% ? Secondary structure often depend on long range interactions, i.e. tertiary structure This is supported by the following studies: • The same fragment from different parts of protein G forms varying secondary structures Minor and Kim (1996) • Secondary structure prediction accuracy decreases with increasing contact order Kihara (2005) Pan et al. (1999) Jacobini et al. (2000) Zhou et al. (2000) Ikeda and Higo (2006) • The same sequence fragment can be found in multiple native secondary structure types

Tertiary structure Secondary structure What we do differently: Couple secondary and tertiary structure during the folding process Restrict possible secondary structure as the chain folds “Eliminate all other factors, and the one which remains must be the truth.” A. C. Doyle, The Sign of the Four (1890)

ItFix Algorithm Keep track of where the trimers came from (Helix, extended, coil) Starting configuration (no SecStr restriction) Stage 1. Fold 100 times with trimers from entire PDB Round 1 Stage 2. Refine each Stage 1 end structure Eliminate an option (H,E,C) if PHelix, PCoil < 10%, PExt < 5% Round 1: Filter by residue identity only Stage 1. Fold 100 times with restricted library Round 2 Stage 2. Refine each Stage 1 end structure Round 2 Stage 1. Fold 100 times with restricted library Round 3 Stage 2. Refine each Stage 1 end structure Rounds 3,4 … Repeat until no further fixing is possible… Stage 1. Fold 100 times with restricted library Final Round Stage 2. Refine each Stage 1 end structure helix Not(Strand) Coil: B, G, I, S, T, N strand Not(Helix) Predicted secondary structure and 3-D model Trimer library: Full PDB

B) Iterations mimic steps in folding pathway Major pathway 1 b1 b2 helix b4 b5 310b3 Unfolded state Round 1 Round 0 Rnd 1 0 1 Rnd 0 b1-b2 hairpin Round 2 + b3 +helix Rnd 2 0 1 Rnd 1 Round 3 + b4 + b3 Rnd 3 0 1 Rnd 2 Round 4 + b4 +helix 0 1 Round 6 +310 helix 0 Rnd 9 1 Round 9 Pred’n + b5 b1 b2 helix b4 b5 310 b3 0 73 residue index 1 Native state Ca-RMSD 3.1 Å Figure 4

1af7 prediction 2.9 Å 1r69 prediction 4.2 Å 1ubq prediction 5.3 Å 1af7 best 2.5 Å 1r69 best 2.4 Å 1ubq best 3.1 Å A B C Energy 1af7 1r69 1ubq C rmsd C rmsd C rmsd

Round 5 1af7 1di2 1r69 1b72A 1 Round 0 Round 0 Round 0 Round 0 0 1 Round 1 Round 1 Round 1 Round 1 0 1 Round 2 Round 2 Round 2 Round 2 Secondary Structure frequency 0 1 Round 4 Round 3 Round 3 Round 3 0 1 Round 6 Round 4 Round 5 Round 4 0 1 Round 6 Round 7 Round 6 Round 8 0 residue index

1AF7 3.4 Å RMSD 1TIF 5.4 Å RMSD 1SAP 7.8 Å RMSD

Novel Aspects • Predict 2° & 3° structure without using homology • Use principles of protein structure and folding: • couple 2° & 3° structure formation • sequential stabilization • Iterative fixing to reduce the search. • Use a Cb representation • Potential function: orientational & 2° structure dependence in a Cb model • Q8-level 2° structure, (f,y) prediction, can outperform PSIPRED • Outputs pathway information

We couple the formation of tertiary structure and the identification of secondary structure to improve both types of structure prediction Goal: Predict secondary structures – Exceed the 80% barrier… without homology Future applications: Improved ab initio structure prediction Assist in identifying and ranking of threading targets Future improvements: Better energy functions = better folding and predictions Bigger proteins – how far can we push secondary structure accuracy even when we fail in determining folded structure Conclusions

Acknowledgements Prof. Tobin Sosnick JoeDeBartolo - ab initio folding, secondary structure prediction Dr. Andres Colubri – Folding simulations, software Dr. Abhishek Jha (MIT) – coil library, unfolded state, a, b propensities, structure refinement, electrostatics James Fitzgerald (Stanford) - Statistical potentials, torsional dynamics Prof. M. Zaman (UT Austin) – Simulations on peptides All-atom statistical potential Dr. Min-yi Shen, Prof. A. Sali, UCSF Funding: NIH, NSF, Burroughs Wellcome Fund

Protein Folding: Interrelation between Secondary and Tertiary Structure Determination