Phasing, Concluded; Fitting and Refinement

Phasing, Concluded; Fitting and Refinement Andy Howard Biology 555 20 & 25 September 2018

Agenda • Principles of molecular replacement • Practicalities of molecular replacement • Good or evil • Applications • Practicalities • Density modification, reconsidered • Electron density fitting • Structure refinement • Structure validation Thanks to Rick Walter for assembling the molecular replacement notes!

The Concept of MR • A Crystal OR...Imagining Proteins to be Peanuts/Neck Pillows • A Search Model • A correctly rotated Search Model • A correctly translated Search Model • …”P” is for PROTEIN

The Basics of MR: The Rotation • A “Theoretical View” using “Traditional Methods” • A nice, “typical” 3 atom protein structure • Add Some “Patterson Vectors” • A very clear 3 atom protein Patterson... atomic vectors added for clarity • A nice search model • A Patterson map...looks familiar, but not quite right • Let's try rotating it • We Got it! • OK...sure, you have to tumble it in a third rotation (not shown) ...but that's easy...so THIS is EASY!
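The rotation search described on this slide can be sketched with a 2-D toy. This is only an illustration under my own assumptions: the three atom positions, the 40º target orientation, and the nearest-vector mismatch score are all made up for the example, and real rotation functions work on Patterson maps in three dimensions rather than on lists of point vectors.

```python
import numpy as np

def interatomic_vectors(coords):
    """Return all non-origin interatomic vectors r_i - r_j (the Patterson peaks)."""
    diffs = coords[:, None, :] - coords[None, :, :]
    off_diagonal = ~np.eye(len(coords), dtype=bool)
    return diffs[off_diagonal]

def rotate(coords, angle_deg):
    """Rotate 2-D coordinates counterclockwise about the origin."""
    a = np.radians(angle_deg)
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    return coords @ rot.T

def mismatch(model_vecs, target_vecs):
    """Sum, over model vectors, of the distance to the nearest target vector."""
    d = np.linalg.norm(model_vecs[:, None, :] - target_vecs[None, :, :], axis=2)
    return float(d.min(axis=1).sum())

# Hypothetical 3-atom search model and an "unknown" crystal orientation.
search_model = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 2.0]])
target_vecs = interatomic_vectors(rotate(search_model, 40.0))

# Try every whole-degree rotation and keep the one whose vectors fit best.
angles = np.arange(0.0, 360.0, 1.0)
scores = [mismatch(interatomic_vectors(rotate(search_model, a)), target_vecs)
          for a in angles]
best_angle = float(angles[int(np.argmin(scores))])
print(best_angle)
```

Because a Patterson vector set always contains both r and -r, this 2-D toy fits equally well at 40º and 220º; the search simply reports the first of the two.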

The Basics of MR: The Translation • Correctly rotated molecule sitting at unit cell origin • t = 3D translation • So, that looks even EASIER!

So, MR is EASY...a technique for GOOD! What could POSSIBLY go wrong!

One Slide to TOTAL Confusion • Let's try a 5 atom protein: ...a little more confusing, but still OK...I think? • Two molecules in the cell (from a dimer or just crystal symmetry): Whoa!... That's much more complex ...and it's only 10 atoms! • What if I don't have ALL the atoms right? • What if the rotation is wrong? • So, I might get the vector positions correct...but not their magnitudes??? • Some vectors STILL overlap!

Proteins are Complex • Average residue contains 8 “heavy” atoms • Average protein contains 300 amino acids • Average structure contains 2400 atoms
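To put those averages in perspective, here is a quick count of the Patterson vectors they imply. A structure of N atoms gives N(N-1) non-origin interatomic vectors; the function name is mine, and the count ignores crystal symmetry and overlapping peaks.

```python
def patterson_vector_count(n_atoms: int) -> int:
    """Number of non-origin interatomic (Patterson) vectors for n_atoms atoms."""
    return n_atoms * (n_atoms - 1)

atoms_per_residue = 8   # average "heavy" atoms per residue (from the slide)
residues = 300          # average protein length (from the slide)
n_atoms = atoms_per_residue * residues

print(n_atoms)                          # 2400
print(patterson_vector_count(3))        # 6
print(patterson_vector_count(n_atoms))  # 5757600
```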

Let's Get back to “PRACTICAL” • Our Hero • A Protein • An Asymmetric Unit • A Unit Cell • A Crystal! • A “Model” for our Protein

Revisiting Rotation • A Protein • Our “body-foot domain” looks good...but something's not quite right about the “head domain” (22º CW) • Our “head domain” looks good...but now look at the “body-foot domain” (47º CW) • So, our model that looked so good may not be so good

Revisiting Our Model • A Protein • An Improved Model (22º CW) • Now both our “body-foot” and “head” domains look good...even got some ears! • Excellent!...but how would we know to build such a model a priori?

...And “One” Other Thing • You have to find ALL the contents of the AU • Our rotation & translation look good...but the cell looks too empty • OK...we found a 2nd rotation & translation…sort of ??? but there's still something wrong? • ...AU contents don't have to be “identical” • ...AND…they have to pack reasonably!

...But wait, that's not all! • There are LOTS of atoms in secondary structural elements, which means there are a LOT of resulting Patterson vectors ...RIGHT or WRONG!

What We REALLY Learned • Happy Bunnies are Insidious & Evil • MR is Evil! • Why would ANYONE ever do this horrible technique?

Successes & Failures of MR • Now that we have talked about why MR should not work… • Perhaps we can talk about how to make it work • Because….MR actually DOES work!

The Simple Answer to “Practical”
SOLUTIONRC 1 21.96 55.01 328.44 0.0000 0.0000 0.0000 14.0 55.6 25.3 19.6 1
SOLUTIONRC 1  9.00 54.87 327.31 0.0000 0.0000 0.0000  8.0 57.1 14.2 11.0 2
SOLUTIONRC 1 39.10 75.66  28.54 0.0000 0.0000 0.0000  7.4 57.5 13.6 10.8 3
SOLUTIONRC 1 21.50 28.68  43.50 0.0000 0.0000 0.0000  7.5 57.3 15.1 10.0 4
SOLUTIONRC 1 61.63 76.42  43.19 0.0000 0.0000 0.0000  8.2 56.8 14.4  9.9 5
SOLUTIONRC 1 71.12 48.16 211.00 0.0000 0.0000 0.0000  8.4 57.0 14.4  9.8 6
SOLUTIONRC 1 59.98 50.91 330.51 0.0000 0.0000 0.0000  8.0 57.2 15.1  9.8 7
• If you see this…You are golden • If you do NOT see this….GIVE UP! • Just kidding…sort of!
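What makes this listing "golden" is the gap between the first solution and everything below it. The sketch below parses the seven lines and measures that gap. The slide does not label the columns, so the field layout used here (three angles, three shifts, four scores, then a rank) and the choice of the last score column for the comparison are my assumptions, not a documented file format.

```python
rotation_output = """\
SOLUTIONRC 1 21.96 55.01 328.44 0.0000 0.0000 0.0000 14.0 55.6 25.3 19.6 1
SOLUTIONRC 1 9.00 54.87 327.31 0.0000 0.0000 0.0000 8.0 57.1 14.2 11.0 2
SOLUTIONRC 1 39.10 75.66 28.54 0.0000 0.0000 0.0000 7.4 57.5 13.6 10.8 3
SOLUTIONRC 1 21.50 28.68 43.50 0.0000 0.0000 0.0000 7.5 57.3 15.1 10.0 4
SOLUTIONRC 1 61.63 76.42 43.19 0.0000 0.0000 0.0000 8.2 56.8 14.4 9.9 5
SOLUTIONRC 1 71.12 48.16 211.00 0.0000 0.0000 0.0000 8.4 57.0 14.4 9.8 6
SOLUTIONRC 1 59.98 50.91 330.51 0.0000 0.0000 0.0000 8.0 57.2 15.1 9.8 7
"""

def parse_solutions(text):
    """Split each SOLUTIONRC line into assumed groups of fields."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "SOLUTIONRC":
            continue
        numbers = [float(x) for x in fields[2:]]
        rows.append({
            "angles": numbers[0:3],   # assumed: three rotation angles
            "shift": numbers[3:6],    # assumed: translation, zero at this stage
            "scores": numbers[6:10],  # assumed: four figures of merit
            "rank": int(numbers[10]),
        })
    return rows

solutions = parse_solutions(rotation_output)
# Compare the top solution with the runner-up on the last score column.
ranked = sorted(solutions, key=lambda row: row["scores"][3], reverse=True)
contrast = ranked[0]["scores"][3] / ranked[1]["scores"][3]
print(ranked[0]["rank"], round(contrast, 2))
```

A top score well clear of a flat background of runners-up is the pattern the slide is pointing at; a list in which every row scores about the same is the "give up" case.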

Why You Often Can “See This” • Resolution • Fold Conservation • Example 1 and Example 2: ALL TYROSINE KINASE DOMAINS!

Why Resolution Helps • X-ray data between 3.5 and 6Å will do • Higher-resolution data slow you down or even hurt you • Much easier to match these…than to match these

It’s all about the starting structure • 18% Identity • 15% Identity • The 3D Structure…NOT the Amino Acid Sequence • So how do you get a good starting structure???

Option 1: Try a simple BLAST • All solved structures (sure!) are deposited in the Protein Data Bank (http://www.rcsb.org/pdb/) • BLAST your amino acid sequence (or a Swiss-Prot accession number) against the PDB structure database: • Try http://expasy.org/tools/blast/ or http://www.ncbi.nlm.nih.gov/BLAST/ • 20%+ sequence identity usually means similar 3-D structures.
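BLAST reports percent identity for each hit, but it helps to know what the number means. The sketch below computes it for one pairwise alignment. The two sequences are invented for illustration, and the convention used (identical positions divided by aligned, ungapped positions) is one of several in use, so values may differ slightly from what a given server prints.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# Hypothetical aligned fragments: 9 ungapped columns, 7 of them identical.
query = "MKT-AYIAKQR"
hit   = "MRTLAY-AHQR"
identity = percent_identity(query, hit)
print(round(identity, 1))        # 77.8
print(identity >= 20.0)          # True: above the slide's rule-of-thumb threshold
```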

Option 2: Structural Overlap • Take a diverse subset of your BLAST results. • Structurally overlap this subset using any number of available tools: • Most graphics programs: Quanta, Coot, etc. • On-line servers for 3D structure comparison: Combinatorial Extension (http://cl.sdsc.edu/), Dali (http://www.ebi.ac.uk/dali/), a good comprehensive list is at (http://en.wikipedia.org/wiki/Structural_alignment_software). • Look for a highly conserved core and try several of the structures that closely match it or trim some of the structures down to this core.

Option 3: Model Guided Structure ID • Submit your sequence to a threader (e.g., 3D Jigsaw: http://bmm.cancerresearchuk.org/~3djigsaw/; FUGUE: http://tardis.nibio.go.jp/fugue/prfsearch.html) or similar model building server. • Many databases and servers of programs exist: (http://mbcf.dfci.harvard.edu/cmsmbr/biotools/biotools9.html); (http://www2.uah.es/biomodel/pe/protexpl/psbiores.htm) • My personal favorite is the Meta server at http://bioinfo.pl/ • Throw the models themselves away but pay attention to what PDB files were used to construct the models. • Make a list of the top 20 – 30 PDB files that were used most frequently and structurally overlap them • Repeat “Option 2” with this test set
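The bookkeeping in the last three bullets is a simple tally. Here is a minimal sketch, assuming you have already collected, by hand or by script, the template identifiers each server model was built from. The dictionary contents and the identifiers `T001` to `T005` are placeholders, not real PDB entries.

```python
from collections import Counter

# Hypothetical record: for each server model, the templates it was built from.
templates_per_model = {
    "model_1": ["T001", "T002", "T003"],
    "model_2": ["T001", "T004"],
    "model_3": ["T001", "T002"],
    "model_4": ["T005", "T001", "T001"],  # duplicates within one model count once
}

# Count how many models used each template.
usage = Counter(
    template
    for templates in templates_per_model.values()
    for template in set(templates)
)

# Keep the most frequently used templates (the slide suggests the top 20 to 30).
shortlist = [template for template, _ in usage.most_common(30)]
print(usage["T001"], shortlist[:2])   # 4 ['T001', 'T002']
```

The shortlist is then the input for the structural overlap described under Option 2.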

Option 4: Make Friends with MS • Run your protein on an SDS-PAGE gel. • Give the gel to a skilled Mass Spectrometrist and have her/him cut out the band, tryptic digest the extracted band, and run LC-MS. • Have your MS friend run the tryptic fragment map against his/her database of such digests. • Take the list of proteins IDed by the MS mapping and BLAST them against the PDB, repeating Option 1, 2, and 3 as necessary with these “hits”. • This is a great way to quickly get the structure of a protein if you don’t even know the sequence!

Option Last: Build Homology Models • Take the models that were generated by “Option 3” out of the garbage and use them in MR attempts. • Build homology models by any other standard method that you or (preferably) a skilled modeler friend of yours uses. • Set up heavy-atom soaks; see if you have enough sulfur anomalous signal; hope that yours is an unrecognized metalloenzyme with a reachable edge; undertake MAD.

A final caveat
P41212
SOLUTIONTF1 1 21.96 55.01 328.44 0.0073 0.2403 0.2567 41.7 45.2 41.4  1 27.7
SOLUTIONTF1 1  9.00 54.87 327.31 0.8385 0.2499 0.1907 18.9 54.9 20.3  3 10.9
SOLUTIONTF1 1 39.10 75.66  28.54 0.8766 0.7239 0.1881 18.0 55.2 19.7 10 22.3
SOLUTIONTF1 1 21.50 28.68  43.50 0.8838 0.4984 0.2995 18.1 55.4 20.7  2 16.1
SOLUTIONTF1 1 61.63 76.42  43.19 0.9739 0.2338 0.3070 19.1 54.3 20.1  6 23.7
SOLUTIONTF1 1 71.12 48.16 211.00 0.7945 0.2326 0.1719 19.3 54.8 20.5  4  6.7
SOLUTIONTF1 1 59.98 50.91 330.51 0.6805 0.6670 0.0686 17.5 55.1 20.6  4  5.4
Great Solution!!
P43212
SOLUTIONTF1 1 21.96 55.01 328.44 0.0063 0.7394 0.0000 69.1 34.9 70.5  1 28.5
SOLUTIONTF1 1  9.00 54.87 327.31 0.5901 0.4253 0.4508 19.4 55.5 19.2  6 15.4
SOLUTIONTF1 1 39.10 75.66  28.54 0.1019 0.3282 0.2175 18.0 55.3 21.0  2 24.3
SOLUTIONTF1 1 21.50 28.68  43.50 0.7073 0.4348 0.4263 17.7 55.2 21.1  3 20.8
SOLUTIONTF1 1 61.63 76.42  43.19 0.2749 0.6853 0.3370 17.9 55.0 18.9  1  8.0
SOLUTIONTF1 1 71.12 48.16 211.00 0.9809 0.4550 0.0000 18.2 55.4 18.9  2 11.0
SOLUTIONTF1 1 59.98 50.91 330.51 0.3996 0.5807 0.1555 17.5 55.6 21.3  1  7.5
Even Better Solution??
• Don’t throw away what seems like a guaranteed MR solution because the maps look like crap: make sure that you checked ALL enantiomorphic space groups!
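One way to make sure no enantiomorph is forgotten is to keep the pairs in a lookup table and rerun the translation search in the partner. The sketch below lists the eleven enantiomorphic space-group pairs in the compact notation used on this slide; the helper name `enantiomorph` is mine.

```python
# The eleven enantiomorphic pairs of space groups (screw axes of opposite hand).
ENANTIOMORPH_PAIRS = [
    ("P41", "P43"), ("P4122", "P4322"), ("P41212", "P43212"),
    ("P31", "P32"), ("P3121", "P3221"), ("P3112", "P3212"),
    ("P61", "P65"), ("P62", "P64"), ("P6122", "P6522"),
    ("P6222", "P6422"), ("P4132", "P4332"),
]

# Build a two-way lookup so either member of a pair finds the other.
_PARTNER = {}
for left, right in ENANTIOMORPH_PAIRS:
    _PARTNER[left] = right
    _PARTNER[right] = left

def enantiomorph(space_group: str):
    """Return the enantiomorphic partner, or None if the group has none."""
    return _PARTNER.get(space_group)

print(enantiomorph("P41212"))    # P43212
print(enantiomorph("P212121"))   # None
```

For the example on this slide, a search in P41212 should always be followed by the same search in P43212 before either result is trusted.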

Solvent Flattening and Phase Combination • The next few pages contain notes developed by Dr. William Furey of the University of Pittsburgh and the Pittsburgh VA Medical Center.

Density modification • Typically, after applying a phasing method like isomorphous replacement, we have an electron density map that contains significant information but is far from ideal. • How do we improve the map quality? • We can get a hint about how to do so by remembering the relationships between electron density and structure factors.

Electron density and structure factors • Electron density expressed as a function of structure amplitudes and phases: • Structure factors (amplitude and phase) expressed as a function of electron density values:
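For reference, the standard textbook forms of these two relationships are given below. The notation is my choice and may differ from the original slide: (x, y, z) are fractional coordinates, V is the unit-cell volume, and the sum runs over all reflections hkl.

```latex
\rho(x,y,z) = \frac{1}{V} \sum_{h}\sum_{k}\sum_{l}
    |F_{hkl}| \, e^{i\varphi_{hkl}} \, e^{-2\pi i (hx + ky + lz)}

F_{hkl} = |F_{hkl}| \, e^{i\varphi_{hkl}}
    = V \int_0^1\!\!\int_0^1\!\!\int_0^1
      \rho(x,y,z) \, e^{2\pi i (hx + ky + lz)} \, dx \, dy \, dz
```

Each is the Fourier transform of the other, which is why a change to the density anywhere in the cell changes every structure factor, and the reverse.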

Density and structure factors • From these relationships we see that every phase contributes to, and therefore helps determine the electron density at every point in the unit cell. • Likewise, the density at every point contributes to, and helps determine the phase (and amplitude) of every structure factor.

Phase errors and structural errors • If errors in the phases lead to map features that are unrealistic or physically impossible, then one might be able to improve the phases by: • Eliminating those non-physical features from the map • Replacing them with more plausible features • Inverting the Fourier transform to generate a new set of (hopefully improved) phases

What would we do with those? • Combine the improved phase angle estimates with the experimental amplitudes |Fhkl| and create a better electron-density map • We hope this better map would look more like a real macromolecule and have fewer spurious features in it. • So let’s explore how that would be done.

Steps in modifying density • Decide where the density should be modified • Decide how the density should be modified • Decide how the resulting phase estimates should be combined with the previously-determined phase estimates • Decide when to stop iterating this procedure

Where should the map be modified? • Obviously we want to modify it where it’s wrong! • But we don’t know in advance where it’s wrong, so we need to look for physical principles that tell us that it’s wrong. • One principle that’s easy to articulate: • Electron density should be non-negative everywhere, i.e. ρ(x,y,z) ≥ 0 for all (x,y,z) • Therefore if we find a point where the density is negative, it must have gotten that way by mistake.

The idea behind solvent flattening • This second principle is also straightforward: • We know, with less certainty, but still reasonable confidence, that the electron density in liquid portions of the unit cell (e.g. the solvent regions) should be essentially featureless. • Therefore we can assume that the presence of big peaks or valleys in the solvent region must be due to errors.

The idea behind non-crystallographic symmetry averaging • If there are multiple copies of the same molecule present within the crystallographic asymmetric unit, one would assume the molecules will be in the same conformation and thus have identical electron densities. • Therefore differences in electron density at the corresponding places must be due to errors. • We’ll get back to this idea later.

So how do we implement negative density truncation? • Can we just set all the density values to zero if they come in at negative values? • Not quite. There are a few problems. • To compute true ρ values we have to have measured the F values on an absolute scale, not a relative scale. Our intensities are not calculated on an absolute scale, although we can sometimes determine roughly what the scale factor should be.

F000 • The correct calculation of ρ(x,y,z) requires all the F values, including h=k=l=0, i.e. the F000 (“foo”) term. This is the biggest term, and its magnitude is essentially impossible to measure.

Series termination errors • Any time you approximate an infinite sum like this one with a finite number of terms, you get a few inaccuracies as a direct result of the termination of the sum. • These can cause a few negative values to appear, especially near the boundaries of the macromolecule itself.

Dealing with these issues • Getting the absolute scale is straightforward if the data extend to 2Å; it’s trickier at lower resolution: • Plot ln(<I>) in shells of sinθ/λ and extrapolate back to sinθ/λ = 0 (a “Wilson plot”); • this can give you the absolute scale if you know the total molecular weight in the asymmetric unit, which you do know as long as you have an idea of how much solvent is present.
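The extrapolation in a Wilson plot is a straight-line fit. The sketch below assumes the usual model, ln<I> = ln K - 2B(sinθ/λ)², with the shell-averaged intensities already divided by the expected scattering from the cell contents (that division is where the molecular weight enters). The synthetic data, the values K = 50 and B = 20 Å², and the function name are invented for the example.

```python
import numpy as np

def wilson_fit(s, mean_intensity):
    """Fit ln<I> = ln(K) - 2*B*s**2 and return (K, B).

    s is sin(theta)/lambda for each resolution shell, in 1/Angstrom.
    """
    slope, intercept = np.polyfit(s**2, np.log(mean_intensity), 1)
    scale_k = float(np.exp(intercept))   # value extrapolated to s = 0
    b_factor = float(-slope / 2.0)       # overall temperature factor
    return scale_k, b_factor

# Synthetic shells from about 3.3 A (s = 0.15) to 2.0 A (s = 0.25).
s = np.linspace(0.15, 0.25, 11)
true_k, true_b = 50.0, 20.0
mean_i = true_k * np.exp(-2.0 * true_b * s**2)

k_fit, b_fit = wilson_fit(s, mean_i)
print(round(k_fit, 3), round(b_fit, 3))   # 50.0 20.0
```

With real data the low-resolution shells deviate from the line, which is one reason the slide calls the procedure trickier when the data stop short of about 2 Å.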

Alternative way of dealing with the absolute scale • Following B.C. Wang’s treatment: • We assume that the maximum density within the protein is similar to that found in other proteins, and the average density in the solvent region is some reasonably constant and measurable value. • We then compute S, the ratio of the average solvent density to the maximum protein density.

Using the density ratio, S • Tabulate values of S for known structures. • Then solve this equation for F000/V: • {<ρsolvent> + F000/V} / {ρmax,protein + F000/V} = S • Since V is known, this gives us F000. • It turns out the best values of S to use are smaller than the theoretical ones, and are resolution-dependent.
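Solving the ratio equation for F000/V takes one line of algebra: writing a for the mean solvent density, b for the maximum protein density, and x for F000/V, (a + x)/(b + x) = S gives x = (S·b - a)/(1 - S). The sketch below just evaluates that; the density values are made-up numbers in arbitrary units from a map computed without F000.

```python
def f000_over_v(mean_solvent: float, max_protein: float, s_ratio: float) -> float:
    """Solve (mean_solvent + x) / (max_protein + x) = s_ratio for x = F000/V."""
    if s_ratio == 1.0:
        raise ValueError("S = 1 leaves F000/V undetermined")
    return (s_ratio * max_protein - mean_solvent) / (1.0 - s_ratio)

# Hypothetical map statistics (arbitrary units, F000 term omitted from the map).
mean_solvent = -0.10
max_protein = 1.50
s_ratio = 0.06

x = f000_over_v(mean_solvent, max_protein, s_ratio)
print(round(x, 4))                                      # 0.2021
print(round((mean_solvent + x) / (max_protein + x), 4)) # 0.06, recovering S
```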

S values and their use • For macromolecules with no atoms heavier than phosphorus: • Use S = 0.25 at 6Å resolution, 0.11 at 4Å, 0.09 at 3.5Å, 0.06 at 3Å. • Then if we omit the F000/V term from the summation, negative density truncation means setting the density to -F000/V wherever ρ < -F000/V • To make this work, we have to know what part of the unit cell is solvent and what part is macromolecule!

How do you define the solvent region? • You can do it manually by staring at the map and defining it by hand • This can work well if the map is reasonably clean to begin with, but it’s subjective • Alternatively, Wang describes an automated boundary determination approach. This method is objective and generally works.

Wang’s boundary method • I won’t go into the details here, but briefly: • A smeared (locally averaged) map is computed based on the original map. • A histogram of density values on the smeared map is constructed, and a cutoff value on the histogram is set so that the right percentage of the unit cell is in the solvent region and the right percentage is in the macromolecule region. • This allows us to define a mask such that the mask values are 1 inside the macromolecule region and 0 outside. This is described as a weighting function for the map.
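The smear, histogram, and cutoff steps can be sketched in one dimension. This is a toy under stated assumptions: the local average is a plain box average over a periodic cell rather than Wang's actual weighting function, negative density is zeroed before smearing, and the test map, the 60% solvent fraction, and the window size are all invented.

```python
import numpy as np

def solvent_mask(rho, solvent_fraction, half_width):
    """Return a mask that is 1 in the macromolecule region and 0 in solvent."""
    positive = np.clip(rho, 0.0, None)            # ignore negative density
    shifts = range(-half_width, half_width + 1)
    # Box average over neighbouring grid points, wrapping around the cell.
    smeared = np.mean([np.roll(positive, k) for k in shifts], axis=0)
    # Choose the cutoff so the requested fraction of the cell falls below it.
    cutoff = np.quantile(smeared, solvent_fraction)
    return (smeared > cutoff).astype(int)

# Toy 1-D "map": a block of protein density plus a ripple standing in for noise.
grid = np.arange(100)
rho = np.where((grid >= 20) & (grid < 60), 1.0, 0.0) + 0.2 * np.sin(1.7 * grid)

mask = solvent_mask(rho, solvent_fraction=0.6, half_width=3)
print(mask.sum())   # number of grid points assigned to the macromolecule
```

The solvent fraction has to be supplied from outside, typically from the cell volume and the molecular weight; the method only decides where the boundary goes, not how much solvent there is.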

Flipping this to reciprocal space • Defining this mask allows creation of a modified map that is the convolution of the raw map with the mask. • But the Fourier transform of a convolution is the product of the Fourier transforms of its components, so we can do this calculation much more efficiently in reciprocal (diffraction-vector) space rather than real space! • So we can just multiply all our structure amplitudes by the transform of the mask function, and we will have used the solvent flattening implicitly!
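The convolution theorem that this shortcut relies on is easy to check numerically. The sketch below convolves a small periodic 1-D array with a weighting function twice, once point by point in real space and once by multiplying Fourier transforms, and compares the results. The arrays are arbitrary; only NumPy's standard FFT routines are used.

```python
import numpy as np

def circular_convolve_direct(f, g):
    """Periodic convolution computed point by point in real space."""
    n = len(f)
    return np.array([
        sum(f[j] * g[(i - j) % n] for j in range(n))
        for i in range(n)
    ])

def circular_convolve_fft(f, g):
    """Same convolution: multiply the transforms, then transform back."""
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))

raw_map = np.array([0.0, 1.0, 3.0, 2.0, -0.5, 0.2, 0.0, 0.1])
weights = np.array([0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25])  # local average

direct = circular_convolve_direct(raw_map, weights)
via_fft = circular_convolve_fft(raw_map, weights)
print(np.allclose(direct, via_fft))   # True
```

The direct version costs on the order of n² operations and the FFT version on the order of n log n, which is the efficiency gain the slide refers to.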

What the mask might look like

Combining solvent flattening and negative density truncation • Once we’ve defined the solvent mask either in real space or reciprocal space, we can proceed with modifying our map: • Take the original map and the solvent mask and determine the mean value of the solvent electron density and the maximum value of the protein density. Compute F000/V from those values. • Modify ρ at each grid point according to • ρ’(x,y,z) = <ρsolvent> + F000/V if in solvent region • ρ’(x,y,z) = ρ(x,y,z) + F000/V if in protein region
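The two modification rules, followed by truncation of whatever is still negative, can be written as a few array operations. A minimal sketch, assuming the map and mask are NumPy arrays on the same grid and F000/V has already been estimated; the six-point map and the value 0.2 are invented.

```python
import numpy as np

def modify_density(rho, mask, f000_over_v):
    """Flatten the solvent, add F000/V everywhere, then truncate negatives.

    mask is 1 in the protein region and 0 in the solvent region.
    """
    mean_solvent = rho[mask == 0].mean()
    flattened = np.where(mask == 1, rho, mean_solvent)  # solvent -> its mean
    shifted = flattened + f000_over_v                   # put map on true zero
    return np.maximum(shifted, 0.0)                     # no negative density

# Toy map: first three points are solvent, last three are protein.
rho = np.array([-0.4, 0.1, -0.1, 1.0, -0.5, 0.3])
mask = np.array([0, 0, 0, 1, 1, 1])

new_rho = modify_density(rho, mask, f000_over_v=0.2)
print(np.round(new_rho, 4))
```

In the output the three solvent points share one value, the protein points are shifted up by 0.2, and the one protein point that is still negative after the shift is set to zero.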

How do we use this? • We then invert this modified map to get a new set of calculated structure amplitudes and phases. • On the next slide are the results of those steps for a one-dimensional example:

1-D map modifications • (a) Original with F000 omitted • (b) Setting ρ >= 0 • (c) Averaged map (convolution of b with weighting) • (d) solvent definition • (e) solvent flattening • (f) final map (add F000/V and truncating ρ < 0)
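The step that follows panel (f), inverting the modified map and reusing its phases with the observed amplitudes, can also be sketched in 1-D. Assumptions to note: NumPy's FFT sign convention is the opposite of the crystallographic one, which does not matter as long as the forward and inverse transforms are used consistently, and no phase combination or weighting is attempted here. The function name and test map are mine.

```python
import numpy as np

def recycle_phases(observed_amplitudes, modified_map):
    """Back-transform a modified map and pair its phases with observed |F|.

    Returns the new map and the new phase angles (radians).
    """
    f_calc = np.fft.fft(modified_map)            # map -> structure factors
    new_phases = np.angle(f_calc)                # keep only the phases
    f_new = observed_amplitudes * np.exp(1j * new_phases)
    new_map = np.real(np.fft.ifft(f_new))        # structure factors -> map
    return new_map, new_phases

# Sanity check: if the modified map is already the true map, nothing changes.
true_map = np.array([0.1, 0.9, 1.4, 0.7, 0.1, 0.0, 0.05, 0.0])
observed = np.abs(np.fft.fft(true_map))

recovered, phases = recycle_phases(observed, true_map)
print(np.allclose(recovered, true_map))   # True
```

In a real cycle the modified map is not the true map, so the new map differs from both the starting map and the modified one, and the procedure is iterated until the phases stop changing.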

How do we use the new phases? • Create a map using the observed F values and the modified phase angles • Or combine the new phase information with the original information to compute a map with coefficients that look like m|Fp| exp(iφcomb) • Where m and φcomb are computed as follows:
