Protein Structure and Function Chem/Biol 8360; Fall Semester 2009 (mini-semester I) Class 6a: “Structural Predictions an

Protein Structure and FunctionChem/Biol 8360; Fall Semester 2009 (mini-semester I) Class 6a: “Structural Predictions and Modeling” September 3, 2009 Basic Concepts in Molecular Modeling of Proteins Lecturer: Bob Wohlhueter; bobwohlhueter@earthlink.net

What are the goals of modeling?1. To visualize a realistic representation of the 3-dimensional structure of a protein molecule.2. To provide a caricature of protein 3D structure to emphasize secondary structures and their dispositions, binding sites, electrical potentials, surfaces, etc.

How realistic? Modelers often purposefullydistort the representation:- atoms occupy their van der Waals radius, but “space-filling” models occlude views of internal molecular space. “bond-only” or “ball-and-stick” are popular compromises.- classes of atoms are often omitted: H's, all side-chain atoms, waters, inorganic ions, etc., to simplify the view.

- π-bonds, lone-pairs, even H-bonds are often omitted, or reduced to “bond-order” representation.- structures are often modeled in vacuo, without solvation, and at 0ºK (i.e. without kinetic energy)- structures are deliberately stylized to remove detail in order to render some features more obvious (“backbone”, “ribbons”, secondary structural elements)

- 3D structures are typically represented on 2D media, like computer displays or paper (though with facilities to simulate three dimensional perception.)

But a sort of minimum standard of realismis strived for, and includes:- Cartesian coordinates of atom centers - average bond lengths and angles - actual dihedral anglesEven in highly simplified “ribbon” renditions,a smoothed curve passes through theactual coordinates of Cα atoms.

Computational Approaches: What arethe computable metric(s) that describe molecular “reality”? Protein modeling is based exclusively on classical, Newtonian physics:- atoms are considered point masses, with a variety of forces acting on them, as described mathematical functions related to “bonded” atoms and the angles subtended by 3 (bond angles) or 4 (dihedral angles) atoms, and by “non-bonded” interactions.

b i+1 i Θ χ rij i+2 i+3 i.e. for each atom, i, length b, bond angle Θ,dihedral angle χ, as well as atom j are considered j

- these mathematical functions typically have an “ideal value” (subscripted 0), and deviations add an energy penalty.- these “internal energy” terms are additive over each atom pair in the ensemble of atoms modeled.Accordingly, the most realistic structure istaken as that structure which minimizes thetotal “internal energy” (V, usually in Kcal)summed over all atom pairs included in the computation.

For example, as calculated by the Quanta/CHARMM algorithm: Note that dihedrals typically have 3 “preferred” angles, hence the integer n The terms ^12 and ^6 are the Lennard-Jones formulation for attraction and repulsion atvan der Waals radii The term in qi qj is thecharge-charge interaction,with dielectric constant εD H-bonds? Hydrophobic interactions?

Mathematically viewed, the problem is tominimize a function of N2 terms (whereN = number of atoms considered) over“coordinate space”, R. [For display purposes,Cartesian, for computational purposes a setof “internal coordinates” bases on bondrotations are used. The two are interconvertible.]This is a prodigious computational task, withno analytical solution, which must thereforebe approached by numerical approximationmethods.

Simplifying assumptions commonly used to reduce the computational load: - “unified atoms” (vs. “all-atom”); the hydrogens of -CH3 and -CH2-account for nearly half the atoms in proteins; simulate with single point-mass and parameter set.- charge-charge interactions diminish with square-of-distance; apply “distance cut-off” (usually about 12 Å).- charge-charge interactions also diminish with dielectric constant (∝ 1/εD); approximate εD for protein interior.- ignore solvent (!?)

To solvate or not to solvate? - add explicit water in a “solvent box”: traditional and most robust; create a cube enclosing the protein molecule, with “periodic boundary” properties; treat waters explicitly, though usually as unified atoms.- some programs support a “water shell”, a sphere of waters surrounding the protein molecule, the outer perimeter of which is simulated as a stochastic tension.- some programs (“MacroModel” by Schödinger) simulate bulk waters by stochastic algorithm.N.B. “Water of crystallization” can be explicitly included incomputations.

Numerical approaches to find the minimum of a function, like V(R), are subject to the localminimum problem: In one variable, the problem iseasily visualized. Methods, like“Newton-Raphson” or “steepestdescent” start with a first-guessand proceed downhill.In two variables, such “error-surfaces” are more daunting.In N-variable space the problemseems intractable.

How do we know the values of the K-parameters in the expression for V(R)? The tasks of “parameterization” and the development of“force fields”:- through model studies with small molecules, including quantum mechanical analyses (see next slide)- parameters are robust for the atom-types found in amino acids, but often missing or weak for atom-types encountered in organic cofactors (substrates, inhibitors, etc.), and for metal ligands- there are now several sets of “force-fields” in common use (CHARMM, AMBER, DISCOVER, GROMOS) for biological macromolecules; these are regularly updated, and may use somewhat different versions of the V(R) equation

From “Computational Biochemistry and Biophysics”, Becker, MacKerell, Roux and Watanabe, eds.; Marcel Dekker 2001

So what's the point of it all?! Given the severe problem of “local minima”, the skeptic willhave observed:1) that de novo prediction of protein structure by this approach is hopeless! (It is for all but the smallest peptides.)2) that to have any chance of falling into the the “right” energy minimum (the global minimum?), one has to have nearly the right answer to 3D structure before one begins computations! (i.e. the starting “approximation” has to be very close to reality.)3) that you shouldn't believe anything that comes out of the computer unless you can verify it experimentally.

The Optimist agrees, but points out: 1) that the job of finding “nearly the right answer”is that of the x-ray cystallographers and NMR structuralists. We can get along with de novo predictions for first principles.2) that these use computational modeling to move their raw data (electron density maps and internuclear distance constraints) into refined structures amenable to visualization.3) that these workers have populated the public database with over a thousand, highly refined structures.4) that these structures, in which we have high confidence, can serve as the starting points for profoundly informative analyses of ligand binding, mutational changes, protein engineering – all subject to experimental confirmation.

Molecular Dynamics: - the “most realistic” models based on algorithms for energy minimalization, contain only potential energy, no kinetic energy. (Ironically both x-ray data and, especially, NMR data do have indicators of heat / atomic motion.)- MD analyses inject kinetic energy into such an ensemble of atoms, and follows the “trajectory” of each atom's movement as a function of time.- subsequent tracking of the trajectories of groups of atoms yields at dynamic picture of the protein (e.g. oscillations in geometry of binding sites, making and breaking of H-bonds, identification of transitions states and reaction coordinates, identification of more and less mobile domains)

Molecular Dynamics: Computational Schemas 1) kinetic energy is introduced by randomly assigning each atom a motion vector such that the total motion of the ensemble corresponds to a desired temperature according to Boltzmann. 2) each atom is allowed to continue movement for a discrete time period, then the (altered) forces on each atom is re-calculated, and a new motion vector assigned accordingly. And so on.3) to avoid sudden shocks, the system is gradually “warmed” up to the desired temperature over a period of time; the data collected during warm-up are thrown away, then a “production run” commences at constant temperature, and is carried out as long as feasible.4) the accumulated trajectory data (huge!) can be rendered by software as “movies” of molecular motion and/or graphs of, e.g., distance variation between any two atoms as a function of time.

How long do you need to track movements? To avoid unrealistically long movements along any vector, “timeslices” are usually limited to about 1 picosecond. The number ofps you need to collect depends on what sort of motion you wouldlike to witness. From “Computational Biochemistry and Biophysics”, Becker, MacKerell, Roux and Watanabe, eds.; Marcel Dekker 2001

Using Molecular Dynamics to explore conformational space: simulated annealing Recall that slope-descent methods lock one into a specific local minimum. MD provides a method for jumping over hills and valleysof the error surface. By heating the molecule up (say to 1000 ºK!), onecan take a series of “snapshots” at different times, and use theseas starting points for new minimizations (i.e. cooling to 0 ºK). The idea can be illustrated thus:

Conformational space can also be explored by random atomic displacements: - such “Monte Carlo simulations” randomly choose atoms in the ensemble to operate upon, then randomly displace these to a random degree (usually by bond rotations in order to preserve bonding geometries.)- MC simulations are computationally efficient, though they lose any real significance of the time evolution of protein movements.- such an approach to searching coordinate space has been useful e.g. in simulating paths of protein folding/unfolding.

“Docking” – predicting the structure of protein – ligand complexes Conceptually, the approach of evaluating the pair-wise atomic interactionenergies should apply to inter- as well as to intra- molecular interactions.- the bonding terms of the force equations are zeroed; only non-bonding forces contribute (including H-bonding).- parameterization may be inadequate, since ligand molecules are apt to contain atom- and bond- types not encountered in proteins. (This problem may be alleviated, if the ligand can be considered “rigid”.) - absent covalent connectivity information, the “coordinate space” issue becomes intractable. That is, if you have no idea where a ligand binds, the computation is much more complex. (“Gridding” algorithms are available available, which systemically moves the ligand around the protein surface, and attempts local energy minimizations at each point.)- to simplify computations, usually only localized interactions are considered (the rest of the protein molecule is “frozen”), precluding observations of conformational effects at a distance.

“Surfaces”: the utility of “space-filling”models - when the atoms of a model are represented as their true, van der Waals spheres, a smooth surface grid can be visualized. Since force fields contain partial atomic charges, such surfaces can be color-coded to reveal the charge topology of a protein.- solvent accessible surface (“SAS”) can be computed by rolling a spherical approximation to a water molecule over the protein and tracing the 3D surface described by its center. The technique reveals water-accessible crevices, etc.- SAS of an intact protein can be broken down residue-by-residue, and expressed as the % of each residue's surface that is exposed to solvent.- SAS of whole, multimeric proteins can be traced, then viewed subunit-by- subunit to visualize protein-protein interaction surfaces/

Homology Modeling: the back door to structure prediction Conceding the impracticality of predicting structure from sequence and our knowledge of chemical principles (ab initio prediction), wecan pose the following questions:- What if we have a protein whose 3D structure has been painstakingly worked out by crystallography or NMR, and we want to compare sequence variants of that protein?- Variants which might have arisen by natural mutation, evolution, protein engineering?- Can we assume that the available structure provides a sufficiently accurate “initial estimate” to justify application of molecular modeling tools to predict the structure of the variant(s)? The answer to the last question is yes, and this is the basic premisebehind the many approaches to homology modeling.

The question (and the validity of the premise)begs a series of more nuanced questions: 1) How much variation can be tolerated? Quite a lot, particularly by way of amino acid substitution, where the result involves adjusting the given structure to accommodate larger of smaller residues. Cases of helix-breakers (e.g. Pro) in the middle of anα-helix should arouse caution. 2) How about indels? Insertions or deletions of amino acids necessitate altering the backbone structure to accommodate them, and thus are more problematic than substitutions. For 3 to 6 residues, inserted at a location on the surface, there are fairly good “rules” for building such loops. As variants become more drastic, one may need to apply techniques to search coordinate-space more broadly than just energy minimization.

3) What about insertions of whole domains? This has been a common occurrence in evolution, and poses a serious question. If the structure of the added domain has been determined in another protein, modern software allows one to merge a model of the main protein with a model of the isolated domain. 4) What's the likelihood of finding a good homology template? With about 1350 high-resolution protein structures in the Protein Database, there is a very good chance that you will find a suitably homologous structure there on which to model the protein of interest to you.

5) How does one manage all this? The most widely used homology modeling software is MODELLER by Andrej Sali's lab at UCSF. It provides using multiple templates, if available; for using different templates in different regions; for performing localized conformational searches and energy minimizations. It is academically as a stand-alone program, or integrated into major commercial graphics software such as Accelrys' Insight II. There are also some good on-line homology modeling facilities, e.g. SWISS-MODEL. 6) How much credence can one give homology models? The greater the sequence variation between template and model, the more problematic. The burden of proof lies with the modeler: experimental corroboration of model predictions is most persuasive.

From the Biologist's Point of View: High-resolution modeling, at the atomic level, is essential to thechemist needing to analyze catalytic reaction mechanisms, ligandbinding, dynamics, etc. Modeling at a much lower resolution, at the level of “foldrecognition”, might suffice for many purposes: - recognizing distant homologies among proteins - protein phylogenetics based on “structurally informed alignments” - placing proteins into the “super families” - guessing function from structural similarities

The Phylogenetics Problem: With the advent of large-scale nucleic acid sequencing, “molecularsystematics” has become a powerful and sophisticated tool forthe study of evolution.Phylogenetic inference from closely similar sequences are quiterobust; as similarity decreases, inferences become less sound.(Note that protein sequences are more useful than DNA sequencesin this regard.)By about 25% similarity, we reach a “twilight zone”, where one cannot distinguish variation from sequence randomization.Algorithms for multiple sequence alignment (like CLUSTALW) fail in the twilight zone. Aligning two residues makes an implicit assumption that they evolved from a common ancestor; if the assumption is wrong, the inference of the state of the ancestor will necessarily be wrong.

The Recognition of Distant Protein Homologieshas had Profound Influences on our Grasp ofEvolution: 1) “Structurally informed alignments” have extended our reach back much further into evolutionary time. (And demonstrated that protein domains are the real “units” of protein evolution.)2) The flip-side is the recognition that there are only a few folding motifs that have passed the test of natural selection. This, in turn, has supported the classification into “fold families” and “super families”. (Protein evolution has proceeded by residue replacements which may alter local function, but rarely global structure.)3) The ability to predict low-resolution folding patterns based on sequence alone allows large-scale, protein sequence data arising from proteomics to assign proteins to families, and thus to make an educated guess at their function.

Happy birthday,Charles! from “Evolution”; Barton, Briggs, Eisen, Goldstein, Patel, eds., Cold Spring Harbor Press

Structural Classification of Proteins:[SCOP database; scop.mrc-lmb.cam.ac.uk/scop/ ] The current release of SCOP comprises 110,800 domains abstracted from 38,221 structures in the PDB. These sort into971 “folds”: Summarized in “Structural Bioinformatics”, Gu and Bourne, eds.; Wiley-Blackwell Given 3.5 billion years of protein evolution, and the enormous number of possible amino acid sequences (in, say, a 200-residue protein = 1.6 x 10260), this represents an amazing small repertory of folds that have proven sustainable. Evolution has been far more conservative with protein structure than with protein sequence!

For example, the famous TIM-barrel fold is the most numerous in the alpha and beta proteins class; it characterizes 33 super-families.Database annotations give a synopsis of common featuresat every level, for examples: On the super-family: The first 7 super-families contain similar phosphate binding sites. On the first and second families within it: The ribulose-phosphate binding barrel consists of a parallel beta-sheet barrel fold containing a phosphate-binding site. Several proteins display this fold, including histidine biosynthesis enzymes, tryptophan biosynthesis enzymes, D-ribulose-5-phosphate 3-epimerase, and decarboxylases Triosephosphate isomerase (TIM) [PubMed2204417] is the glycolytic enzyme that catalyzes the reversible interconversion of glyceraldehyde 3-phosphate and dihydroxyacetone phosphate. TIM plays an important role in several metabolic pathways and is essential for efficient energy production. It is a dimer of identical subunits, each of which is made up of about 250 amino-acid residues. A glutamic acid residue is involved in the catalytic mechanism [PubMed2005961]. The sequence around the active site residue is perfectly conserved in all known TIM's. Deficiencies in TIM are associated with haemolytic anaemia coupled with a progressive, severe neurological disorder

Assigning known structures to a “fold family” versus predicting folds from sequences: - the family classifications of SCOP are based on known structures. In such cases, assignment to classes involves the process of structurecomparison and alignment. Algorithms for this seek to align major topological elements – helices, sheets, loops – and treat each domain as a separate entity. Output is in terms of clustering the query structure by similarity to known families. [SSAP (www.cathdb.info), TM-ALIGN (zhang.bioinformatics.ku.edu), etc. provide such comparisons on-line; the SALIGN module of MODELLER, e.g. is a stand-alone program for this purpose.] - since “folds” tend to be a property of domains, recognizing domains is a usually a prerequisite task. There is no hard definition of “domain”, and, accordingly, algorithms for finding domains are somewhat arbitrary. [Databases like CATH (= class, architecture, topology, homology) and SCOP help to standardize widely recognized domains.]

- note that the objective and methodology of compare and align are very different from the notion of structural superimposition. Superimposition assumes the structures are closely similar, and seeks to quantitate the similarity.- in superimposing two molecules, one designates a set of corresponding atoms (in identical molecules, usually all atoms; in protein variants, perhaps just “backbone” atoms (Cα – carbonyl-C – amide O – amide N), or only Cα's. The computation minimizes the square of the distance between these corresponding atoms in the two structures, and the difference is reported as the root-mean-square distance, “rmsd”.- superimposition is used to quantify spatial differences between related structures, e.g. the same molecule in different conformations, or between homologous structures.

Predicting folds denovo from sequence is a dicier proposition: - the outcome is very apt to depend on reasonable domain identification.- a general strategy is to deduce a “reduced complexity model”,expressed as a “profile” for each domain. Such “profiles” may encode the propensity of residues (usually sliding windows of residues) to form secondary structural elements and/or other residue patterns. The “profile” is matched against the profiles (similarly computed) of proteins of known structure, and a scored list of matches presented.- the pfam database, for example, is computed from a hidden Markov model (HMMER, by Sean Eddy) to recognize patterns of secondary structure and other topological motifs. PSI-BLAST (position-specific iterated BLAST; Altschul et al., NCBI) recognizes highly generalized sequence patterns

- given a set of proteins with similar profiles, the next step is to attempt to “thread” the test sequence onto the known structure(s) of the protein(s) of similar profile.- and finally to evaluate the tentative structure for its “protein health”; that is, to check for van der Waals overlaps, bond strains, permissible Ramachandran plots.- protocols to automate this workflow are receiving much attention, driven by the mass of protein sequences revealed in proteomics experiments. (So far with measured success, though the identification of specific family motifs, e.g. transmembrane proteins, is quite successful.)

Protein Structure and Function Chem/Biol 8360; Fall Semester 2009 (mini-semester I) Class 6a: “Structural Predictions an