Molecular Docking Ugur Sezerman Sabanci University
What is docking? Docking is finding the binding geometry of two interacting molecules with known structures The two molecules (“Receptor” and “Ligand”) can be: - two proteins - a protein and a drug - a nucleic acid and a drug Two types of docking: - local docking: the binding site in the receptor is known, and docking refers to finding the position of the ligand in that binding site - global docking: the binding site is unknown. The search for the binding site and the position of the ligand in the binding site can then be performed sequentially or simulaneously
What Are Docking & Scoring? • To place a ligand (small molecule) into the binding site of a receptor in the manners appropriate for optimal interactions with a receptor. • To evaluate the ligand-receptor interactions in a way that may discriminate the experimentally observed mode from others and estimate the binding affinity. complex ligand docking scoring receptor X-ray structure & DG … etc
Why Do We Do Docking? • Drug discovery costs are too high: ~$800 millions, 8~14 years, ~10,000 compounds (DiMasi et al. 2003; Dickson & Gagnon 2004) • Drugs interact with their receptors in a highly specific and complementary manner. • Core of the target-based structure-based drug design (SBDD) for lead generation and optimization. Lead is a compound that • shows biological activity, • is novel, and • has the potential of being structurally modified for improved bioactivity, selectivity, and drugeability.
Docking Applications • Determine the lowest free energy structures for the receptor-ligand complex • Search database and rank hits for lead generation • Calculate the differential binding of a ligand to two different macromolecular receptors • Study the geometry of a particular complex • Propose modification of a lead molecules to optimize potency or other properties • de novo design for lead generation • Library design
Key aspects of docking • Scoring Functions • What are they? • Which Scoring Functions are feasible? • Search Methods • How do they work? • Which search method should I use? • Which program should I use?
Docking Challenge • Both molecules are flexible and may alter each other’sstructure as they interact: • Hundreds to thousands of degrees of freedom • Total possible conformations are astronomical
Formulation of Docking Problem • A scoring function that can discriminate correct (experimentally observed) docking complex structure from incorrect ones • A search algorithm that finds the docking complex structure measured by the scoring function
Formulation of Docking Problem Factors Affecting ∆G0 Intramolecular Forces(covalent) • Bond lengths • Bond angles • Dihedral angles Intermolecular Forces (noncovalent) • Electrostatics • Dipolar interactions • Hydrogen bonding • Hydrophobicity • Van der Waals
Types of Docking Problems • Docking • Bound docking : the goal is to reproduce a known complex • Unbound docking : complex structure not known • Protein-Small Molecule Docking • Rigid receptor, rigid ligand • Rigid receptor, flexible ligand • Flexible receptor, flexible ligand
Docking strategies require: • Protein representation • A search method • Finalrefinement and scoring
1. Protein Structure • A 3-Dstructure of the target protein at atomic resolution must be available • Crystal and solution structures (PDB) • Homology models • Pseudoreceptor models • Ideally, the atomicresolution of crystal structures should be below 2.5 A • Even small changes in structure can drastically alterthe outcome
Receptor Structures & Binding Site Descriptions • PDB (Protein Data Bank, www.rcsb.org/pdb/) containing proteins or enzymes: • X-ray crystal: >60,000 structures,~10 % have ≤ 1.5 Å, ~80% between 1.5-2.5 Å • NMR:, ensemble accuracy of 0.4-1 Åin the backbone region, 1.5 Åin average side chain position (Billeter 1992; Clore et al. 1993) • (and high quality homology models built from highly similar sequences) • Limitation of experimental structures (Davis et al. 2003): • Locations of hydrogen atoms, water molecules, and metal ions • Identities and locations of some heavy atoms (e.g., ~1/6 of N/O of Asn & Gln, and N/C of His incorrectly assigned in PDB; up to 0.5 Å uncertainty in position) • Conformational flexibility of proteins • Binding site descriptions: atomic coordinates, surface, volume, points & distances, bond vectors, grid and various properties such as electrostatic potential, hydrophobic moment, polar, nonpolar, atom types, etc. DOCK
Drug, Chemical & Structural Space • Drug-like: MDDR (MDL Drug Data Report) >147,000 entries, CMC (Comprehensive Medicinal Chemistry) >8,600 entries • Non-drug-like: ACD (Available Chemicals Directory) ~3 million entries • Literatures and databases, Beilstein (>8 million compounds), CAS & SciFinder • CSD (Cambridge Structural Database, www.ccdc.cam.ac.uk): ~3 million X-ray crystal structures for >264,000 different compounds and >128,00 organic structures • Available compounds • Available without exclusivity: various vendors (& ACD) • Available with limited exclusivity: Maybridge, Array, ChemDiv, WuXi Pharma, ChemExplorer, etc. • Corporate databases: a few millions in large pharma companies
3D Structural Information & Ligand Descriptions • 2D->3D software: CORINA, OMEGA, CONCORD, MM2/3, WIZARD, COBRA. (reviewed by Robertson et al. 2001) • CSD: <0.1 Å for small molecules, but may not be the bound conformation in the receptor • PDB: ligand-bound protein structures ~6000 entries • Atoms associated with inter-atom distances, physical and chemical properties, types, charges, pharmacophore, etc • Flexibility: conformation ensemble, fragment-based
Scoring Functions • A fast and simplified estimation of binding energies scores <-> DGbinding X-ray structure -scores ? configurations of the complex
3. Scoring Functions • Factors Affecting ∆G0 Intramolecular Forces(covalent) • Bond lengths • Bond angles • Dihedral angles Intermolecular Forces • Electrostatics • Dipolar interactions • Hydrogen bonding • Hydrophobicity • Van der Waals
Types of Scoring Functions • Force field based: nonbonded interaction terms as the score, sometimes in combination with solvation terms • Empirical: multivariate regression methods to fit coefficients of physically motivated structural functions by using a training set of ligand-receptor complexes with measured binding affinity • Knowledge-based: statistical atom pair potentials derived from structural databases as the score • Other: scores and/or filters based on chemical properties, pharmacophore, contact, shape complementary • Consensus scoring functions approach
3. Scoring Functions Force Field Based • CHARMM [Brooks83] • AMBER [Cornell95] Empirical methods: • ChemScore [Eldridge97] • GlideScore [Friesner04] • AutoDock [Morris98] • AutoDock Vina[Trott09] Knowledge-based methods • PMF [Muegge99] • Bleep [Mitchell99] • DrugScore [Gohlke00]
Force Field Based Scoring Functions • Advantages • FF terms are well studied and have some physical basis • Transferable, and fast when used on a pre-computed grid • Disadvantages • Only parts of the relevant energies, i.e., potential energies & sometimes enhanced by solvation or entropy terms • Electrostatics often overestimated, leading to systematic problems in ranking complexes e.g. AMBER FF in DOCK
Molecular mechanics force fields • Usually quantify the sum of two energies • thereceptor–ligand interaction energy • internal ligandenergy (such as steric strain induced by binding) • Interactions between ligand and receptor are mostoften described by using van der Waals and electrostatic energy terms. • Advantages • FF terms are well studied and have some physical basis • Transferable, and fast when used on a pre-computed grid • Disadvantages • Only parts of the relevant energies, i.e., potential energies & sometimes enhanced by solvation or entropy terms • Electrostatics often overestimated, leading to systematic problems in ranking complexes
Molecular mechanics force fields CHARMM [Brooks83]
Molecular mechanics force fields AMBER: [Cornell95]
FF Scoring: Implementations • AMBER FF: DOCK, FLOG, AutoDOCK • CHARMm FF: CDOCK, MC-approach (Caflisch et al. 1997) • Potential Grid: rigid receptor structure upon docking. The grid-based score interpolates from eight surrounding grid points only. 100-fold speed up. Examples: DOCK, CDOCK, and many other docking programs. • Soften VDW: A soft-core vdw potential is needed for the kinetic accessibility of the binding site (Vieth et al. 1998). FLOG: 6-9 Lennard-Jones function; GOLD: 4-8 vdw + H-bond, and intraligand energy. • Solvent Effect on Electrostatic: often approximated by rescaling the in vacuo coulomb interactions by 1/D, where D = 1-80 or = n*r, n = 1-4, r = distance. • Solvation and Entropy Terms: Solvation terms decomposed into nonpolar and electrostatic contributions (e.g., DOCK):
Empirical Scoring Functions • Goals: reproduce the experimental values of binding energies and with its global minimum directed to the X-ray crystal structure • Advantages: fast & direct estimation of binding affinity • Disadvantages • Only a few complexes with both accurate structures & binding energies known • Discrepancy in the binding affinities measured from different labs • Heavy dependence on the placement of hydrogen atoms • Heavy dependence of transferability on the training set • No effective penalty term for bad structures LUDI & FlexX (Boehm 1994)
Empirical Scoring: Implementations Mostly differ by what training set and how many parameters are used • Cerius2/Insight2000: LUDI, ChemScore, PLP, LigScore • SYBYL: FlexX, F-Score • Hammerhead: 17 parameters for hydrophobic, polar complementary, entropy, solvation. sLOO= 1.0 logK for 34 complexes • VALIDATE: 8 parameters for VDW and Coulomb interactions, surface complementarity, lipophilicity, conformational entropy and enthalpy, lipophilic and hydrophilic complementarity between receptor and ligand surfaces • PRO_LEADS: 5 coefficients for lipophilic, metal-binding, H-bond, and a flexibility penalty term. sLOO= 2 kcal/mol for 82 complexes • SCORE (Tao & Lai, 2001); ChemScore (GOLD)
Knowledge-based Potentials of Mean Force Scoring Functions (PMF) • Assumptions • An observed crystallographic complex represents the optimum placement of the ligand atoms relative to the receptor atoms • The Boltzmann hypothesis converts the frequencies of finding atom A of the ligand at a distance r from atom B of the receptor into an effective interaction energy between A and B as a function of r • Advantages • Similar to empirical, but more general (much more distance data than binding energy data) • Disadvantages • The Boltzmann hypothesis originates from the statistics of a spatially uniform liquid, while receptor-ligand complex is a two-component non-uniform medium • PMF are typically pair-wise, while the probability to find atoms A and B at a distance r is non-pairwise and depends also on surrounding atoms
PMF: Implementations • Verkhivker et al.(1995): 12 atom pairs, 30 complexes(HIV-1 and simian immunodeficiency virus). Test on 7 other HIV-1 protease complexes • Wallqvist et al. (1995): 38 complexes, 21 atom types (10 C, 5 O, 5 N, 1 S). Test on 8 complexes sd=1.5 kcal/mol, and 20 complexes rmsd=1.0 A. • Muegge et al. (1999): 697 complexes, 16 atom types from receptor & 34 from ligand, 282 statistically significant PMF interactions.Test on 77 diverse compounds: sd=1.8 log Ki. The PMF was combined with a vdw term to account for short-range interactions for DOCK4 docking: • DrugScore (Gohlke et al, 2000), FlexX, BLEEP where
Two Kinds of Search Stochastic ✽ Random ✽ Outcome varies ✽ Must repeat the search to improve chances of success ✽ Feasible for bigger problems ✽ e.g. AutoDock Systematic ✽ Exhaustive ✽ Deterministic ✽ Outcome is dependent on granularity of sampling ✽ Feasible only for low dimensional problems ✽ e.g. DOT (6D)
Searching Algorithms Systematic search Molecular dynamics Monte Carlo Simulations Simulated annealing Genetic algorithms Lamarckian Genetic Algorithm Incremental construction
Systematic Search • Uniform sampling of search space • Relative position (3) • Relative orientation (3) • Rotatable bonds in ligand (n) • Rotatable bonds in protein (m) FRED [Yang04]
Systematic Search • Uniform sampling of search space • Exhaustive, deterministic • Quality dependent on granularity of sampling • Feasible only for low-dimensional problems • Example: search all rotations FRED [Yang04]
Molecular Mechanics • Energy minimization: • Start from a random or specific state(position, orientation, conformation) • Move in direction indicatedby derivatives of energy function • Stop when reach local minimum
Monte Carlo Simulations Tries to dock the ligand inside the receptor site through many random positions and rotations In ICM and MCDOCK, this method is used tomake random moves of the ligand inside a receptor binding site. After each random move, a force-field based energy minimization is applied. To avoid trapping in local minima, Monte Carlocombine this procedure with other search methods, such as Simulated Annealing, Genetic Algorithm and Lamarckian GA
Simulated Annealing • Global optimization technique based on the Monte Carlo method : • Start from a random or specific state (position, orientation, conformation) • Make random state changes, accepting up-hill moves with probability dictated by “temperature” • Reduce temperature after each move • Stop after temperature gets very small
Genetic Algorithm (GA) • Genetic search of parameter space: • Start with a random population of states • Perform random crossovers and mutations to make children • Select children with highest scoresto populate next generation • Repeat for a number of iterations Gold [Jones95], AutoDock [Morris98]
Lamarckian Genetic Algorithm • LGA finds lowest fitness function (energy) values first, then maps these values to their respective genotypes • Each new child is allowed to create a new generation • Genetic algorithm plus Solis and Wets local search Better performance than either simulated annealing or genetic algorithm alone
Incremental Extension • Used in DOCK, FLEXX, FLOG and Surflex • Greedy fragment-based construction: • Partition ligand into fragments
Incremental Extension • Greedy fragment-based construction: • Partition ligand into fragments • Place base fragment (e.g., with geometric hashing)
Incremental Extension • Greedy fragment-based construction: • Partition ligand into fragments • Place base fragment (e.g., with geometric hashing) • Incrementally extend ligand by attaching fragments
Descriptor Matching Methods: DOCK Distance-compatibility graph in DOCK(Ewing and Kuntz 1997): distances between sphere centers and distances between ligand heavy atoms
Descriptor Matching Methods Distance-compatibility graph in DOCK(Ewing and Kuntz 1997): distances between sphere centers and distances between ligand heavy atoms Interaction site matching in LUDI (Boehm 1992): HBA<->HBD, HYP<->HYP Pose clustering and triplet matching in FlexX (Rarey et al. 1996): HBA<->HBD, HYP<->HYP Shape-matching in FRED (Openeye www.eyesopen.com) Vector matching in CAVEAT (Lauri and Bartlett 1994) Steric effects-matching in CLIX (Lawrence and Davis 1992) Shape chemical complementarity in SANDOCK (Burkhard et al. 1998) Surface complementarity in LIGIN: (Sobolev et al. 1996) H-bond matching in ADAM (Mizutani et al. 1994)
Fragment-based Methods Flexibility and/or de novo design Identification and placement of the base/anchor fragment are very important Energy optimization (during or post-docking) is important Examples Incremental construction in FlexX with triplet matching and pose clustering to maximize the number of favorable interactions Growing and/or joining in LUDI from pre-built fragment and linker libraries and maximize H-bond and hydrophobic interactions Anchor-based fragment joining in DOCK
Molecular Simulation: MD & MC Two major components: The description of the degrees of freedom The energy evaluation The local movement of the atoms is performed Due to the forces present at each step in MD (Molecular Dynamics) Randomly in MC (Monte Carlo) Usually time consuming: Search from a starting orientation to low-energy configuration Several simulations with different starting orientation must be performed to get a statistically significant result Grid for energy calculation. Larger steps or multiple starting poses are often used for speed and sampling coverage in MD: Di Nola et al. 1994; Mangoni et al. 1999; Pak & Wang 2000; CDOCKER by Wu et al. 2003.
MC-based Docking where T is reduced based on a so-called cooling schedule, and grid can be used for energy calculation. An advantage of the MC technique compared with gradient-based methods (e.g. MD) is that a simple energy function can be used which does not require derivative information, and able to step over energy barrier. AutoDOCK (Goodsell & Olson 1990).MCDOCK (Liu & Wang 1999), PRODOCK (Trosset & Scheraga 1999), ICM (Abagyan et al. 1994). Simulated annealing is used in DockVision (Hart & Read 1992) and Affinity (Accelrys Inc., San Diego, CA) Energy minimization is used in QXP (McMartin & Bohacek 1997).
Genetic Algorithm Docking A fitness function is used to decide which individuals (configurations) survive and produce offspring for the next iteration of optimization. Degrees of freedom are encoded into genes or binary strings. The collection of genes (chromosome) is assigned a fitness based on a scoring function. There are three genetic operators: mutation operator randomly changes the value of a gene; crossover exchanges a set of genes from one parent chromosome to another; migration moves individual genes from one sub-population to another. Requires the generation of an initial population where conventional MC and MD require a single starting structure in their standard implementation. GOLD (Jones et al. 1997); AutoDock 3.0 (Morris et al. 1998); DIVALI (Clark & Ajay 1995).
DOCK (Kuntz, UCSF) • Receptor Structure • X-ray crystal • NMR • homology Binding ModeAnalysis for Lead Optimization: binding orientations and scores for each ligands Virtual Screening for MTS/HTS and Library Design: ligands in the order of their best scores Binding Site Scoring Orientations 1. Energy scoring (vdw and electrostatic) 2. Contact scoring (shape complementarity) 3. Chemical scoring 4. Solvation terms Molecular Surface of Binding Site Filters • Ligands • 3D structure • atomic charges • potentials • labeling Spheres describing the shape of binding site and favorable locations of potential ligand atoms Matching heavy atoms of ligands to centers of spheres to generate thousands of binding orientations
FlexX (Tripos/SYBYL) • Fragment-based, descriptor matching, empirical scoring (Rarey et al. 1996) • Procedures: • Select a small set of base fragment suitable for placement using a simple scoring function. • Place base fragments with the pose clustering algorithm: rigid, triplet matching of H-bond & hydrophobic interactions, Bohm's scoring function • Build up the remainder of the ligand incrementally from other fragments • Ligand conformations • MIMUMBA model with CSD derived low energy torsional angles for each rotatable bond and ring from CORINA. • Multiple conformations for each fragment in the ligand building steps • Other works: Explicit waters are placed into binding site during the docking procedure using pre-computed water positions(Rarey et al. 1999). Receptor flexibility using discrete alternative protein conformations (Claussen et al. 2001; Claussen & Hindle 2003)
GOLD • GA method, H-bond matching, FF scoring (Jones et al. 1997) • A configuration is represented by two bit strings: • The conformation of the ligand and the protein defined by the torsions; • A mapping between H-bond partners in the protein and the ligand. • For fitness evaluation, a 3D structure is created from the chromosome representation. The H-bond atoms are then superimposed to H-bond site points in the receptor site. • Fitness (scoring) function: H-bond, the ligand internal energy, the protein-ligand van der Waals energy • Rotational flexibility for selected receptor hydrogens along with full ligand flexibility • Highlights: • Validation test set: 100 complexes, 66 with rmsd<2A. • The structure generation is biased towards inter-molecular H-bonds. • Hydrophobic fitting points was added (GOLD 1.2, CCDC, Cambridge, UK 2001).