Binding MOAD

CSAR and Binding MOAD: Two different databases, two different aims, one common goal: provide the best protein-ligand data • James B. Dunbar Jr. and Heather A. Carlson • 5th Meeting on U.S. Government Chemical Databases and Open Chemistry • August 25th and 26th, 2011

Binding MOAD • Who are we: • Heather Carlson – Principal Investigator • Mark L. Benson • Richard Smith • Nickolay Khanazov • Leigi Hu • Michael Lerner • John Beaver • Brandon Dimcheff • Jason Nerothin • Jayson Falkner • Peter Dresslar • James Dunbar Jr.

Binding MOAD

Binding MOAD BUDA • HTML + GATE NLP program • Tagged HTML + annotated XML => Scores + text highlights • Web app used to aid in manual biodata extraction and curation • For 2010 update: ~1200 manuscripts to review manually for data • ~2800 new PDB structures

Binding MOAD BUDA

Binding MOAD • Time-consuming steps: • Obtaining the HTML from the journals – format changes • Random – different every year • Hand curation of data within BUDA • Correct data for compound • Correct sequence for the crystal • BUDA – essential for bookkeeping of the curation process • Allows for multiple people to work on the curation • Keeps track of changes and comments on a per-user basis • Stores all in MySQL – records all work done over the years • Orders manuscripts so that those most likely to contain data are at the top

CSAR Specific Aims • CSAR: Community Structure-Activity Resource (CSARdock.org) • SA1. Build the largest, high-quality, freely accessible database of protein-ligand complexes with experimentally determined binding affinities from literature. • SA2. Generate new experimental data: We propose experimentally determining the dissociation constants (Kds) for selected protein-ligand complexes using two complementary techniques: isothermal titration calorimetry (ITC) and surface plasmon resonance (SPR). Consistency between the two approaches would provide confidence in the data. Furthermore, important physicochemical properties for the ligands will be determined (logP/logD, pKa, and solubility), and additional crystal structures will be solved. (Note: actually using ITC, Octet Red, and ThermoFluor – WuXi AppTec is measuring the properties) • SA3. Curate data from the community • SA4. Community outreach

CSAR – the people • Who are we: • Principal Investigators • Heather Carlson • Jason Gestwicki • Jeanne Stuckey • Shaomeng Wang • Researchers • William Clay Brown • Krishnapriya Chinnaswamy • James Delproposto • James Dunbar Jr. • Emilio Esposito • You-Na Kang • Ginger Kubish • Richard Smith • Kelly Damm-Ganamet • Who are we (cont.): • Consultants • Philip Andrews • Charles Brooks III • Hollis Showalter • Janet Smith • Web Programming • Shelly Yang • System Administration • Allen Bailey • Advisory Board • Michael Gilson • Philip Hajduk • Paul Labute • Deborah Loughney • Anthony Nicholls • Tudor Oprea • Catherine Peishoff • Peter Preusch • Alexander Tropsha • Janna Wehrle

CSAR – dataset example

CSAR – compound properties

CSAR – crystallography

Industrial and In-house efforts • Abbott • 4 datasets in progress • Genentech • Signed CDA • 5 datasets in progress • GSK • Signed CDA • Roche • 1 dataset deposited • BMS and Pfizer • CDA in for legal review • In-house • CDK2 (done), CDK2/cyclinA (final stages), LpxC (ongoing), urokinase and Hsp90 (initial stages)

Dataset Selection Process (1) • Analyze target – does it have crystal structures and defined series (expect ~3 or so series) with appropriate biological data? • Analyze crystal data – does each have sufficient data present to refine the density? (.cv, .mtz, scale.log, …) – if so collect into a directory • Obtain biological data on all compounds tested in the relevant assay and any applicable counter screens (Ki, Ka, IC50; no % inhibition) • Export from the corporate database: • Structure (SMILES) • Company identifier • Biological data for screens – including those in crystal structures • Split data into three types: actives, inactives, crystal structures. • For crystal structures obtain any PDB IDs if available.

Dataset Selection Process (2) • For the Actives • Move into MOE with pK(x) or pIC50 values: • Wash and calculate physical properties: • Hydrogen bond acceptors (Acc) • Hydrogen bond donors (Don) • Total number for combined Acc and Don • Heavy atom count • Rotatable bond count • SlogP • TPSA • Weight • Tag each entry as to series • Select using MOE diverse selection ~40 for each series based on pK(x) or pIC50 and Acc and Don • Check spread in other characteristics to be sure they are not skewed and by eye verify a spread in available chemical functionality.
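The selection step above can be approximated outside MOE. What follows is a minimal sketch under stated assumptions, not MOE's diverse-subset method: it converts IC50 values to pIC50 and then uses a greedy MaxMin pick over (pIC50, Acc, Don) as a stand-in. The function names, the nanomolar input unit, the min-max scaling, and the choice of the first entry as seed are all assumptions made for illustration.

```python
import math

def pic50_from_nM(ic50_nM):
    """pIC50 = -log10(IC50 in mol/L); the input is assumed to be in nanomolar."""
    return -math.log10(ic50_nM * 1e-9)

def scale_columns(rows):
    """Min-max scale each descriptor column to [0, 1] so no descriptor dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid dividing by zero
    return [tuple((v - lo) / sp for v, lo, sp in zip(r, lows, spans)) for r in rows]

def maxmin_select(rows, n_pick):
    """Greedy MaxMin: repeatedly add the row farthest from everything picked so far.

    rows: list of (pIC50, Acc, Don) tuples for one series.
    Returns the indices of the picked rows, in pick order.
    """
    pts = scale_columns(rows)
    picked = [0]  # assumption: seed with the first entry
    # nearest[i] = distance from point i to its closest already-picked point
    nearest = [math.dist(p, pts[0]) for p in pts]
    while len(picked) < min(n_pick, len(pts)):
        candidates = [i for i in range(len(pts)) if i not in picked]
        nxt = max(candidates, key=lambda i: nearest[i])
        picked.append(nxt)
        nearest = [min(d, math.dist(p, pts[nxt])) for d, p in zip(nearest, pts)]
    return picked

# Hypothetical series: (pIC50, Acc, Don) per compound
series = [(5.0, 1, 1), (5.1, 1, 1), (9.0, 5, 4), (7.0, 3, 2)]
print(maxmin_select(series, 3))
```

As on the slide, the picked subset would still need a check of the remaining properties (heavy atoms, rotatable bonds, SlogP, TPSA, weight) and a visual inspection of the chemistry.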

Dataset Selection Process (3) • For the Actives (identify previous release of compound data) • Extract the ligands for the target from BindingDB/ChEMBL, load into MOE, then export as SMILES. • Export the selected set from MOE (all fields) into text with structure as SMILES. • In Pipeline Pilot – using canonicalized SMILES – check whether any selected compound is in BindingDB. If yes – select a suitable replacement – if not, the selection stands.
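The prior-release check above amounts to a set lookup on canonical SMILES. The sketch below is illustrative only and does not reproduce the Pipeline Pilot protocol; the function name and the sample identifiers and SMILES are hypothetical. It assumes both lists were canonicalized by the same toolkit, since canonical SMILES produced by different programs generally do not match.

```python
def split_by_prior_release(selected, released_smiles):
    """Partition selected compounds by whether their canonical SMILES is already public.

    selected: dict mapping a compound identifier to its canonical SMILES.
    released_smiles: iterable of canonical SMILES exported from BindingDB/ChEMBL.
    Returns (needs_replacement, stands), both dicts keyed by identifier.
    """
    released = set(released_smiles)
    needs_replacement = {cid: smi for cid, smi in selected.items() if smi in released}
    stands = {cid: smi for cid, smi in selected.items() if smi not in released}
    return needs_replacement, stands

# Hypothetical identifiers and structures, for illustration only
selected = {"CMPD-1": "c1ccccc1O", "CMPD-2": "CCN(CC)CC"}
already_public = ["c1ccccc1O", "CC(=O)O"]
replace_these, keep_these = split_by_prior_release(selected, already_public)
print(sorted(replace_these), sorted(keep_these))
```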

Dataset Selection Process (4) • Find the Inactives • Many should be extremely similar to crystal structure • Using Pipeline Pilot, search the inactives with the SMILES from the crystal structure to find those very similar to the known crystal structure. • MDL public keys with 0.85 to 0.99 as the similarity range. • Select ~10 • If ~10 are found then check BindingDB (Pipeline Pilot) for any that are in literature. If yes – select a suitable replacement – else the selection stands • If only 1 or 2 (or none) then continue in MOE
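The similarity window above can be illustrated with a small filter. This is a sketch under assumptions: the slide does not name the similarity metric, so Tanimoto is assumed here, and each fingerprint is assumed to be available already as the set of "on" key indices (computing MDL public keys is left to a cheminformatics toolkit). Function and parameter names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def near_neighbors(query_bits, candidates, low=0.85, high=0.99, limit=10):
    """Return up to `limit` (identifier, similarity) pairs with low <= similarity <= high.

    candidates: dict mapping compound identifier to its set of on-bit indices.
    The upper bound below 1.0 drops compounds whose keys match the query exactly.
    """
    hits = []
    for cid, bits in candidates.items():
        s = tanimoto(query_bits, bits)
        if low <= s <= high:
            hits.append((s, cid))
    hits.sort(reverse=True)  # most similar first
    return [(cid, s) for s, cid in hits[:limit]]

# Hypothetical fingerprints: the crystal ligand and three inactives
crystal_ligand = set(range(20))
inactives = {
    "INACT-A": set(range(19)),  # 19 shared of 20 in the union
    "INACT-B": set(range(20)),  # identical keys, excluded by the upper bound
    "INACT-C": set(range(10)),  # too dissimilar
}
print(near_neighbors(crystal_ligand, inactives))
```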

Dataset Selection Process (5) • For the Inactives not extremely similar to crystal structure • Move into MOE : • Wash and calculate physical properties: • Hydrogen bond acceptors (Acc) • Hydrogen bond donors (Don) • Total number for combined Acc and Don • Heavy atom count • Rotatable bond count • SlogP • TPSA • Weight • Tag each entry as to series • Select using MOE diverse selection ~10 for each series based on Acc and Don • Check spread in other characteristics to be sure they are not skewed and by eye verify a spread in available chemical functionality. • Check BindingDB (Pipeline Pilot) for any that are in literature.

Datasets – lessons learned • Biological data • Attention to the details – • LpxC – just enough Zn to be active (catalytic site), but not enough to cause inhibition from secondary inhibitory site for Zn • Need to be aware of inherent error limits • Solubility can be a big issue • Particularly how it is handled • i.e. filtered solids from ligand before injecting into ITC • Protocols – did they use exactly what was published • Store output from assays in PDF – spectra, etc. • Allow end users to see and judge what they want to include for themselves • Crystallography – check the quality, provide density • Many different metrics – for us RSCC (real space correlation coefficient) for ligand is very important – but we use several • Setting up lots of proteins for docking and scoring can be a bear • Getting approval of legal departments – very time consuming • Initial confidentiality agreement • Approval of individual compounds for release
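For readers unfamiliar with RSCC, it is commonly defined as the linear correlation between observed and model-calculated electron density sampled at grid points around the ligand. The sketch below computes only that correlation from two equal-length lists of density values. How the values are sampled from the maps is outside its scope, and the function name and the numbers used are hypothetical.

```python
import math

def rscc(rho_obs, rho_calc):
    """Pearson correlation between observed and calculated density values.

    rho_obs, rho_calc: density sampled at the same grid points around the ligand.
    """
    n = len(rho_obs)
    if n != len(rho_calc) or n < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    if var_o == 0 or var_c == 0:
        raise ValueError("density is constant; correlation is undefined")
    return cov / math.sqrt(var_o * var_c)

# Hypothetical density samples at five grid points
print(rscc([0.8, 1.2, 0.5, 1.9, 1.1], [0.7, 1.3, 0.6, 1.7, 1.0]))
```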

Thank you and any comments or questions
