
Transferable Atom Equivalent (TAE) augmented Protein-ligand scoring function

Transferable Atom Equivalent (TAE) augmented Protein-ligand scoring function. M. Dominic Ryan DRI Wei Deng, Mark J. Embrechts, Theresa Hepburn, Curt M. Breneman Rensselaer Polytechnic Institute. DRI. Scoring functions. Harder than getting the pose right Two or three main classes:





Presentation Transcript


  1. Transferable Atom Equivalent (TAE) augmented Protein-ligand scoring function • M. Dominic Ryan, DRI • Wei Deng, Mark J. Embrechts, Theresa Hepburn, Curt M. Breneman, Rensselaer Polytechnic Institute

  2. Scoring functions • Harder than getting the pose right • Two or three main classes: • Knowledge-based methods • Force field methods • First principles: • Define a framework that captures the fundamental physics in the correct functional form. Typically a conventional molecular mechanics approach, augmented with better configurational sampling and a more sophisticated treatment of electrostatics, is used to compute binding ΔGs • Docking programs with these include Glide, DOCK, AutoDock, GOLD (GoldScore), with force fields such as CHARMM, MMFF, OPLS • Polarizable force fields are an emerging area (and have been for a while) • Sampling is critical • Holy Grail?

  3. Scoring functions • Knowledge-based: • Learn from existing complexes and count interactions. What you count may or may not be molecular mechanics terms. • Use pharmacophoric atom types and ligand / protein distances to train a function on known complexes. The choice of atom types varies across methods. • Often have a speed advantage, and may also benefit from an abundance of closely related cases. • Provide an implicit estimate of the ΔG of binding through fitting to empirical data. • Examples: FlexX, QXP, PLP, Ludi, Chemscore, SMOG • Many reviews demonstrate no clear winners, though empirical potentials may still have the edge. • Can we capture a new kind of information in the data? Electron density derived properties

  4. TAE/RECON properties • Molecular surfaces and surface property distributions can be reconstructed using electron density-encoded atom types • The QTAIM approach was used to generate the large library of structurally distinct RECON/TAE atom types • C.M. Breneman and M. Rhem, J. Comput. Chem. 18(2), 182-197 (1997) • A connectivity-based selection algorithm is used to assign specific atom types using 2-D connection tables • 3-D property reconstruction is accomplished using TAE primitives • RECON molecular properties may be used to rapidly generate electron density-derived molecular TAE descriptors and Wavelet Coefficient Descriptors (WCDs) for large datasets

  5. TAE/RECON properties • Types of Properties: • Electrostatic Potential • Electronic Kinetic Energy Density • Electron Density Gradients (∇ρ·N) • Laplacian of the Electron Density • Local Average Ionization Potential • Bare Nuclear Potential (BNP) • Fukui function F+(r) = ρHOMO(r) • Scalar Properties: • Property Extrema • Integral Average of Property • Surface Histogram of a Property • Scalarized Vector Properties: • Property Extrema • Integral Average of Property • Surface Histogram of a Property • Fractional Surface Histograms (scaled to molecular size)

  6. Method • Select high reliability complexes • Major concern, several choices • Identify atoms lining whole binding cavity • Not just contact atoms • Compute TAE derived properties for protein, extract those for binding site atoms • Compute TAE derived properties for ligand • Extract Distance based descriptors • Extract TAE summed descriptors • Build new function from descriptors using RPI tools. • QSAR problem.

  7. Protein complexes • Not all crystal structures are ‘good’. • Missing density, partial occupancy, poorly fitted ligand geometries, and R-factors artificially lowered by too many waters are all found, even in high resolution structures. • Even with excellent structures, multiple ambiguities can remain. Sidechain orientations of ASN, GLN and HIS cannot be read from density maps: O, N and C look nearly the same in density maps! • We require all atoms for RECON. The protonation states of HIS, GLU and ASP must be defined. • Could use a good estimate of pKa. • Hydrogen bond donors, such as the OH of SER and THR, require optimization • Depends on all of the above. • Use context to define them with REDUCE. • M. Word et al., J. Mol. Biol. 285, 1733-1745 • See http://kinemage.biochem.duke.edu/software/reduce.php for a new version by Jack Snoeyink, UNC-CH

  8. Protein complexes • Several collections exist (PDBBIND…), what to start with? • Inspection of coordinates may not be enough. • Many other reasons for docking failures remain: must look at density maps! • Astex (with CCDC) created a Gold validation set • Three key filters applied to minimize misleading situations: • Symmetry contacts that direct binding. • Bad clashes in the ligand / protein interface. • Structural errors revealed by the electron density. • Structure ambiguities hand curated. • Literature checked and protonation states adjusted accordingly.

  9. Astex / Gold set* • Available from CCDC • 224 ‘clean’ entries are available; the set of 92 with resolution < 2 Å was used. • The set consists of the binding site region of the proteins. • Beware: lone atoms, partial residues etc. • RECON requires full proteins (and most users would want whole residues). • These binding sites were re-integrated with the full protein: • Resolve protonation and orientation ambiguities of the whole protein with REDUCE. • Patch the Astex binding sites back into the protein. • Batch processing (SVL) was messy, with many issues of symmetry sets, dimer proteins etc. • Special thanks to CCG here! • 71 complexes were kept, representing ~ 35 classes. * J. W. M. Nissink, C. Murray, M. Hartshorn, M. L. Verdonk, J. C. Cole and R. Taylor, Proteins, 49(4), 457-471, 2002

  10. Two asps, one must be protonated

  11. Complexes used

  12. Binding site atoms • Close contacts contribute directly to binding • More distant atoms may represent ‘missing’ contacts that could contribute with some ligands but are lacking with others. • Find all binding pocket atoms, not just contact ones, but only surface ones. • Define the binding pocket using MOE’s alpha-shape tool. • Identify all pockets. • Find the one the ligand occupies. • Create a molecular surface lining the pocket. • Identify the atoms closest under the surface. • Do in batch. • This work uses only the immediate contact atoms, i.e. a very short cutoff. QXP also uses very short cutoffs and has been shown to work well.
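The "immediate contact atoms" selection mentioned on this slide can be illustrated with a plain distance filter. This is only a sketch and not the MOE alpha-shape procedure: the 4.0 Å cutoff, the atom tuple layout, the atom labels and the function name are all assumptions made for the example.

```python
import math

def contact_atoms(protein_atoms, ligand_atoms, cutoff=4.0):
    """Return labels of protein atoms within `cutoff` angstroms of any ligand atom.

    Each atom is a (label, (x, y, z)) tuple. The default cutoff is an
    assumed placeholder, not a value taken from the slides.
    """
    selected = []
    for label, p_xyz in protein_atoms:
        if any(math.dist(p_xyz, l_xyz) <= cutoff for _, l_xyz in ligand_atoms):
            selected.append(label)
    return selected

# Hypothetical coordinates, for illustration only.
protein = [
    ("ASP32:OD1", (0.0, 0.0, 0.0)),
    ("GLY34:N", (3.0, 0.0, 0.0)),
    ("LEU120:CD1", (12.0, 0.0, 0.0)),
]
ligand = [("C1", (1.5, 0.0, 0.0)), ("O2", (5.0, 0.0, 0.0))]
site = contact_atoms(protein, ligand)
```

A longer cutoff would pull in the more distant, potentially ‘missing’ contacts discussed above.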

  13. 1EED: ligand binding site

  14. Alpha-spheres

  15. Matches ligand location

  16. RECON • RECON is used to calculate TAE-based properties on all protein and ligand atoms. • TAE properties are computed for all atoms of the protein, and those on the binding site atoms are extracted. • EP, Delr, K, G, L, PIP, BNP, Fuk are divided into bins. To these are added total surface area and other extrema, producing 147 descriptors per atom. • The atomic surface area attributable to each bin is summed across the binding site to create a ‘molecular’ TAE descriptor, resulting in 147 descriptors each for the binding site and the ligand. • The aggregated set of descriptors is the basis of a QSAR, where the response is the experimental binding affinity: pI.
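The binning and summation described on this slide can be sketched as follows. This is an illustration of the idea only, not RECON itself: the bin edges, the representation of an atom as a list of (property value, area) surface elements, and the function names are assumptions.

```python
def surface_histogram(elements, edges):
    """Sum surface area into property bins for one atom.

    elements: iterable of (property_value, area) surface elements.
    edges: ascending bin edges; bin i covers [edges[i], edges[i + 1]).
    Values outside the edge range are ignored in this sketch.
    """
    hist = [0.0] * (len(edges) - 1)
    for value, area in elements:
        for i in range(len(hist)):
            if edges[i] <= value < edges[i + 1]:
                hist[i] += area
                break
    return hist

def site_descriptor(atom_elements, edges):
    """Add per-atom histograms bin by bin into one site-level descriptor vector."""
    total = [0.0] * (len(edges) - 1)
    for elements in atom_elements:
        for i, area in enumerate(surface_histogram(elements, edges)):
            total[i] += area
    return total

# Two hypothetical atoms and two bins of a single property (made-up numbers).
edges = [-1.0, 0.0, 1.0]
atom_a = [(-0.5, 2.0), (0.25, 1.0)]
atom_b = [(0.75, 3.0), (-0.25, 0.5)]
descriptor = site_descriptor([atom_a, atom_b], edges)
```

The same summation applied to the ligand atoms would give the ligand-side descriptor vector.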

  17. Electron Density Derived Descriptors • Surface Histograms • Surface elements with electronic properties within specified ranges are summed into a number of bins. • Laplacian: Orange region represents one histogram bin

  18. Distance Dependent Atom Type (DDAT) • Simplified atom types defined: • C.3, C.2, C.ar, C.cat, N.3, N.ar, N.am, N.pl3, O.3, O.2, O.co2, S.3, P.3, F, Cl, Br, Metal. • No hydrogen types! • Histograms of distance bins constructed for each protein-ligand atom pair encountered • 1 Å bins between 1 Å and 6 Å. The current cutoff of 6 Å will be extended in increments. • Each complex results in 147 TAE descriptors and a large number of DDAT descriptors for all protein-ligand atom pairs. • TAEs are all-atom!
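A minimal sketch of how such distance-binned atom-type pair counts could be assembled is shown below. The 1 Å bins between 1 Å and 6 Å follow the slide; the data layout, the function name and the coordinates are assumptions for illustration.

```python
import math
from collections import Counter

def ddat_counts(protein_atoms, ligand_atoms, d_min=1.0, d_max=6.0, width=1.0):
    """Count protein-ligand atom-type pairs per distance bin.

    Atoms are (atom_type, (x, y, z)) tuples. Result keys are
    (protein_type, ligand_type, bin_index); bin 0 covers [1, 2) angstroms.
    Pairs outside [d_min, d_max) are not counted.
    """
    counts = Counter()
    for p_type, p_xyz in protein_atoms:
        for l_type, l_xyz in ligand_atoms:
            d = math.dist(p_xyz, l_xyz)
            if d_min <= d < d_max:
                counts[(p_type, l_type, int((d - d_min) // width))] += 1
    return counts

# Hypothetical heavy atoms only, since the DDAT types have no hydrogens.
protein = [("O.co2", (0.0, 0.0, 0.0)), ("C.3", (0.0, 4.5, 0.0))]
ligand = [("N.3", (2.8, 0.0, 0.0)), ("C.ar", (0.0, 0.0, 7.0))]
pairs = ddat_counts(protein, ligand)
```

Each distinct key then becomes one DDAT descriptor column, which is why the number of descriptors grows quickly with the number of atom types.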

  19. Data Analysis Techniques • Feature Selection and Model Building: • GA/PCA - Classification • Support Vector Machines • SVM Feature Selection • SVM Regression • SVM Classification • Partial Least-Squares Regression • GA / Linear PLS • Neural Network Analysis • Stripminer™ • GAFEAT

  20. Scoring function: Feature selection -> model construction • Filter methods • Assess the merits of features from the data alone • Do not take into account the biases of the learning algorithm. • 4σ outlier removal performed • Wrapper methods • Search for an optimal feature subset tailored to a particular learning algorithm • Use the learning algorithm as a black box for evaluating feature subsets • GA used to derive the final set • Data partitioning • Both whole model and N-6 (random) used
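One plausible reading of the 4σ outlier filter is sketched below: flag values in a descriptor column that lie more than four standard deviations from the column mean. The slides do not say what exactly was filtered or how the standard deviation was computed, so the function, the use of the population standard deviation and the data are assumptions.

```python
import statistics

def four_sigma_outliers(values, n_sigma=4.0):
    """Return indices of values more than n_sigma standard deviations from the mean.

    Uses the population standard deviation; this choice is an assumption.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0.0:
        return []  # constant column: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) > n_sigma * sd]

# Made-up descriptor column: 19 ordinary values and one extreme value.
column = [1.0] * 19 + [100.0]
flagged = four_sigma_outliers(column)
```

Note that with very few samples no point can exceed 4σ, so a filter like this only bites on reasonably large columns.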

  21. Results • All complexes used • Linear PLS with GA feature selection run for 200 generations to build a comprehensive model. • Model 1 built from distance dependent atom type (DDAT) descriptors alone. • Model 2 built with the addition of TAE descriptors. • Model 1 • Best model has 4 latent variables comprising 56 descriptors from both TAE and DDAT. • LOO cross-validation resulted in r² = 0.72 • Model 2 • Are TAEs adding more than dimensionality? • Required 7 latent variables • Fewer descriptors initially, but more latent variables required in this model, and 72 descriptors kept after GA
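The leave-one-out (LOO) cross-validated r² quoted on this slide can be illustrated with the sketch below. A one-descriptor least-squares line is used as a stand-in for the PLS model, purely to keep the example self-contained; the data are made up and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in for the PLS model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def loo_r2(xs, ys):
    """Leave-one-out cross-validated r2: 1 - PRESS / total sum of squares."""
    press = 0.0
    for i in range(len(xs)):
        # Refit without sample i, then predict the held-out sample.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    my = sum(ys) / len(ys)
    ss_total = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / ss_total

# Made-up descriptor values and pI responses.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [4.1, 5.0, 6.2, 6.9, 8.1, 8.8]
q2 = loo_r2(x, y)
```

Because every prediction comes from a model that never saw the held-out complex, this value is usually lower than the fitted r² of the full model.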

  22. Model 1. Experimental vs Calculated pI

  23. Results • N-6 partition • Random, not class driven • DDAT model • DDAT + TAE • Early results, but training model shows improvement upon addition of TAE-based information. • Test set • DDAT • Some significant outliers • DDAT + TAE model applied • Qualitatively closer
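The random N-6 partition can be sketched as a simple hold-out split. The complex identifiers below are placeholders rather than the PDB codes used in the study, and the seed and function name are assumptions.

```python
import random

def n_minus_6_split(complex_ids, n_test=6, seed=0):
    """Randomly hold out n_test complexes; the remainder form the training set.

    The split is random, not class driven, as on the slide.
    """
    rng = random.Random(seed)
    test = set(rng.sample(complex_ids, n_test))
    train = [c for c in complex_ids if c not in test]
    return train, sorted(test)

# 71 placeholder identifiers standing in for the curated complexes.
ids = [f"complex_{i:02d}" for i in range(71)]
train_ids, test_ids = n_minus_6_split(ids)
```

A class-driven alternative would hold out whole protein classes instead of random complexes, which is the leave-class-out idea listed under future work.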

  24. DDAT only training model

  25. TAE + DDAT training model

  26. DDAT only results on N-6 set: Unit slope line

  27. Predictions from TAE+DDAT Unit slope line

  28. Unexpected lessons • The outliers contain interesting information • 52 of 71 complexes have calculated pI errors < 0.5 log unit. • Behavior of the error function is non-linear • Inspection of outliers after the break: • No ‘class’ dominance in outliers • No resolution limit dominance • Interesting structural issues. • Astex set was hand curated: • Density fit checked • Clashes ruled out • Protonation states of protein adjusted • Ligands subjected to local optimization and checked against literature.

  29. Outliers: Errors pI units

  30. 1EED: problem contact

  31. Local opt. did not help

  32. 1ELD: missing acetate

  33. 1ELD: affected His protonation • H is not present in Astex complex!

  34. Issues • DDAT is a united atom representation. • TAE atom types are comprehensive • TAEs clearly identifying some H-dependent issues. • Will TAE properties still add a significant increment if all atom types are used? • TAEs might not be available for a very wide range of ligands yet • Is it scalable?

  35. Future work • Evaluate a more complete atom type set for DDAT. • Include distance dependent TAE-based descriptors rather than site-based sums. • Initial study was limited to a small set of high quality structures • Add structures from different collections: lower resolution structures from the Astex validation set, PDBBind collection. • With a larger data set, partition into leave-class-out models • Sensitivity of score to small structural changes • Cross-docking • Multiple ligand configurations • Performance against other scoring functions • Caution, though: comparisons are fraught with difficulties.

  36. Acknowledgements • Chris Williams, Ken Kelly, Andrew Henry, William Long, and the rest of the support crew at CCG. • N. Sukumar. • Colin McMartin, Thistlesoft. • Granting agencies for support of CB, NS, WD, TH, …
