Machine Learning in Drug Design

David Page, Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences

Collaborators: Michael Waddell, Paul Finn, Ashwin Srinivasan, John Shaughnessy, Bart Barlogie, Frank Zhan, Stephen Muggleton, Arno Spatola, Sean McIlwain, Brian Kay

Outline • Overview of Drug Design • How Machine Learning Fits Into the Process • Target Search: Single Nucleotide Polymorphisms (SNPs) • Machine Learning from Feature Vectors • Decision Trees • Support Vector Machines • Voting/Ensembles • Predicting Molecular Activity: Learning from Structure

Drugs Typically Are… • Small organic molecules that… • Modulate disease by binding to some target protein… • At a location that alters the protein’s behavior (e.g., antagonist or agonist). • Target protein might be human (e.g., ACE for blood pressure) or belong to invading organism (e.g., surface protein of a bacterium).

Example of Binding

So To Design a Drug: • Identify Target Protein (knowledge of proteome/genome; relevant biochemical pathways). • Determine Target Site Structure (crystallography, NMR; difficult if membrane-bound). • Synthesize a Molecule that Will Bind (imperfect modeling of structure; structures may change at binding). • And even then…

Molecule Binds Target But May: • Bind too tightly or not tightly enough. • Be toxic. • Have other effects (side-effects) in the body. • Break down as soon as it gets into the body, or not leave the body soon enough. • Not get to where it should in the body (e.g., fail to cross the blood-brain barrier). • Not diffuse from gut to bloodstream.

And Every Body is Different: • Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials). • A molecule may work for some people but not others. • A molecule may cause harmful side-effects in some people but not others.

Places to use Machine Learning • Finding target proteins. • Inferring target site structure. • Predicting who will respond positively/negatively.

Healthy vs. Disease (figure with two panels: Healthy, Diseased)

If We Could Sequence DNA Quickly and Cheaply, We Could: • Sequence DNA of people taking a drug, and use ML to identify consistent differences between those who respond well and those who do not. • Sequence DNA of cancer cells and healthy cells, and use ML to detect dangerous mutations… proteins these genes code for may be useful targets. • Sequence DNA of people who get a disease and those who don’t, and use ML to determine genes related to susceptibility… proteins these genes code for may be useful targets.

Problem: Can’t Sequence Quickly • Can quickly test single positions where variation is common: Single Nucleotide Polymorphisms (SNPs). • Can quickly test degree to which every gene is being transcribed: Gene Expression Microarrays (e.g., Affymetrix Gene Chips™). • Can (moderately) quickly test which proteins are present in a sample (Proteomics).

Example of SNP Data

Problem: SNPs are not Genes • If we find a predictive SNP, it may not be part of a gene… we can only infer that the SNP is “near” a gene that may be involved in the disease. • Even if the SNP is part of a gene, it may be another nearby gene that is the key gene.

Problem: Even SNPs are Costly • Typically cannot use all known SNPs. • Can focus on a particular chromosome and area if knowledge permits that. • Can use a scattering of SNPs, since SNPs that are very close together may be redundant… use one SNP per haplotype block, or region where recombination is rare.

Why Machine Learning? • There may be no single SNP in our data that distinguishes disease vs. healthy. • A combination of SNPs may still predict disease status. • The combination itself can give insight.

Decision Trees in One Picture

Naïve Bayes in One Picture (figure with nodes: Age, SNP 1, SNP 2, …, SNP 3000)

Voting Approach • Score SNPs using information gain. • Choose top 1% scoring SNPs. • To classify a new case, let these SNPs vote (majority or weighted majority vote). • We use majority vote here.
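The voting approach above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slides do not say how an individual SNP casts its vote, so the sketch assumes each selected SNP votes for the majority training class among cases sharing its genotype. The names `train_voters` and `predict` and the toy genotype data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in label entropy from splitting the cases on one SNP."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def train_voters(X, y, fraction=0.01):
    """Keep the top-scoring fraction of SNPs and build one vote table per SNP."""
    n_snps = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_snps)]
    ranked = sorted(range(n_snps),
                    key=lambda j: information_gain(columns[j], y),
                    reverse=True)
    keep = ranked[:max(1, round(fraction * n_snps))]
    default = Counter(y).most_common(1)[0][0]
    voters = {}
    for j in keep:
        by_value = {}
        for x, label in zip(columns[j], y):
            by_value.setdefault(x, Counter())[label] += 1
        # Assumption: a SNP votes for the majority class seen with that genotype.
        voters[j] = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return voters, default

def predict(voters, default, case):
    """Unweighted majority vote over the selected SNPs."""
    votes = Counter(table.get(case[j], default) for j, table in voters.items())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 4 patients, 3 SNPs; only SNP 0 tracks the label.
X = [["AA", "GG", "AG"], ["AA", "AG", "GG"], ["GG", "GG", "AG"], ["GG", "AG", "GG"]]
y = ["early", "early", "late", "late"]
voters, default = train_voters(X, y, fraction=0.34)  # keeps 1 of 3 SNPs
prediction = predict(voters, default, ["GG", "AG", "AG"])
```

With 3000 SNPs, the default `fraction=0.01` keeps 30 voters, matching the "top 1%" rule on the slide.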

Task: Predict Early Onset Disease From SNP Data • Only 3000 SNPs, coarsely sampled over entire genome. • 80 patients (examples), 40 with early onset. • Using technology from Orchid. • Can a predictor be learned that performs significantly better than chance on unseen data?

Results • Use all data, only top 1% of features, or only top 10% of features (according to decision tree’s purity measure). • Use Trees, SVMs, Voting. • SVMs with top 10% achieve 71% accuracy. Significantly better than chance (50%).
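One way to check a claim like "significantly better than chance" is a one-sided binomial tail probability. The sketch below is illustrative only: the slides do not say which test was used, it assumes 71% of 80 patients means 57 correct predictions, and it treats the predictions as independent trials, which is only approximately true for cross-validated results.

```python
from math import comb

def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# Assumption: 57 of 80 correct (about 71%), tested against chance (p = 0.5).
p_value = binomial_tail(57, 80)
print(f"one-sided p-value: {p_value:.2e}")
```

A small p-value here says that a coin-flipping classifier would rarely reach 57 correct out of 80.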

Lessons • Feature selection is important for performance. • Methodology note for machine learning specialists: feature selection must be repeated inside each fold of cross-validation, or results will be overly optimistic. • SNP approach is promising… get funding to measure more SNPs. • More work is needed on SVM comprehensibility.
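The methodology note can be made concrete with a small sketch in which feature selection sees only the training part of each fold. The scorer and the voting classifier below are hypothetical stand-ins chosen to keep the example short, not the trees, SVMs, or scoring used in the study; the point is only where `select_features` is called.

```python
from collections import Counter

def value_tables(X, y, j):
    """Count labels per observed value of feature j."""
    tables = {}
    for row, label in zip(X, y):
        tables.setdefault(row[j], Counter())[label] += 1
    return tables

def select_features(X, y, top_k):
    """Hypothetical scorer: cases a feature classifies correctly by per-value majority."""
    def score(j):
        return sum(max(c.values()) for c in value_tables(X, y, j).values())
    return sorted(range(len(X[0])), key=score, reverse=True)[:top_k]

def fit_majority_voter(X, y, features):
    """Stand-in classifier: each selected feature votes for its majority training label."""
    default = Counter(y).most_common(1)[0][0]
    tables = {j: {v: c.most_common(1)[0][0]
                  for v, c in value_tables(X, y, j).items()}
              for j in features}
    def predict(row):
        votes = Counter(tables[j].get(row[j], default) for j in features)
        return votes.most_common(1)[0][0]
    return predict

def cross_validated_accuracy(X, y, n_folds, top_k):
    """Selection and training both use only the training part of each fold."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        features = select_features(X_train, y_train, top_k)  # inside the fold
        predict = fit_majority_voter(X_train, y_train, features)
        correct += sum(predict(X[i]) == y[i] for i in test_idx)
    return correct / len(X)

# Hypothetical toy data: feature 0 equals the label, feature 1 is constant.
y = [0, 0, 1, 1, 0, 0, 1, 1]
X = [[label, 0] for label in y]
accuracy = cross_validated_accuracy(X, y, n_folds=4, top_k=1)
```

Selecting features once on all 80 patients and then cross-validating only the classifier would let test labels influence the chosen SNPs, which is the source of the optimism the slide warns about.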

Places to use Machine Learning • Finding target proteins. • Inferring target site structure. • Predicting who will respond positively/negatively.

Typical Practice when Target Structure is Unknown • Test many molecules (1,000,000) to find some that bind to target (ligands). • Infer (induce) shape of target site from 3D structural similarities. • Shared 3D substructure is called a pharmacophore. • Perfect example of a machine learning task with spatial target.

An Example of Structure Learning (figure with two groups of molecules: Inactive, Active)

Inductive Logic Programming • Represents data points in mathematical logic • Uses Background Knowledge • Returns results in logic

The Logical Representation of a Pharmacophore

Background Knowledge I
• Information about atoms and bonds in the molecules
    atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
    atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
    atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
    bond(m1,a1,a2,1).
    bond(m1,a2,a3,1).

Background Knowledge II
• Definition of distance equivalence
    dist(Drug,Atom1,Atom2,Dist,Error):-
        number(Error),
        coord(Drug,Atom1,X1,Y1,Z1),
        coord(Drug,Atom2,X2,Y2,Z2),
        euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
        Diff is Dist1-Dist,
        absolute_value(Diff,E1),
        E1 =< Error.
    euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D):-
        Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
        D is sqrt(Dsq).
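For readers less familiar with Prolog, the distance test can be paraphrased in Python. This is a rough transliteration, not part of the original background knowledge: the dictionary `coords` stands in for the `coord` predicate (an assumption), and the coordinates are the three `atm` facts shown for molecule m1.

```python
import math

def euc_dist(p, q):
    """Euclidean distance between two 3D points, as in euc_dist/3."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dist_matches(coords, atom1, atom2, dist, error):
    """True if the two atoms lie `dist` apart, give or take `error` (cf. dist/5)."""
    return abs(euc_dist(coords[atom1], coords[atom2]) - dist) <= error

# Coordinates copied from the atm facts for m1; `coords` replaces coord/5.
coords = {
    "a1": (5.915800, -2.441200, 1.799700),
    "a2": (0.574700, -2.773300, 0.337600),
    "a3": (0.408000, -3.511700, -1.314000),
}
close_enough = dist_matches(coords, "a2", "a3", 1.8, 0.75)
```

The tolerance argument is what lets a single distance constraint cover many slightly different conformations.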

Central Idea: Generalize by searching a lattice

Conformational model • Conformational flexibility modelled as multiple conformations: • Sybyl RandomSearch • Catalyst

Pharmacophore description • Atom and site centred • Hydrogen bond donor • Hydrogen bond acceptor • Hydrophobe • Site points (limited at present) • User definable • Distance based

Example 1: Dopamine agonists • Agonists taken from Martin data set on QSAR society web pages • Examples (5-50 conformations/molecule)

Pharmacophore identified
• Molecule A has the desired activity if:
• in conformation B molecule A contains a hydrogen acceptor at C, and
• in conformation B molecule A contains a basic nitrogen group at D, and
• the distance between C and D is 7.05966 +/- 0.75 Angstroms, and
• in conformation B molecule A contains a hydrogen acceptor at E, and
• the distance between C and E is 2.80871 +/- 0.75 Angstroms, and
• the distance between D and E is 6.36846 +/- 0.75 Angstroms, and
• in conformation B molecule A contains a hydrophobic group at F, and
• the distance between C and F is 2.68136 +/- 0.75 Angstroms, and
• the distance between D and F is 4.80399 +/- 0.75 Angstroms, and
• the distance between E and F is 2.74602 +/- 0.75 Angstroms.

Example II: ACE inhibitors • 28 angiotensin converting enzyme inhibitors taken from literature • D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 3-16, (1987)

Experiment 1 • Attempt to identify pharmacophore using original Mayer et al. data (final conformations). • Initial failed attempt traced to “bugs” in background knowledge definition. • 4 pharmacophores found with corrected code (variations on a common theme).

ACE pharmacophore
• Molecule A is an ACE inhibitor if:
• molecule A contains a zinc-site B,
• molecule A contains a hydrogen acceptor C,
• the distance between B and C is 7.899 +/- 0.750 Angstroms,
• molecule A contains a hydrogen acceptor D,
• the distance between B and D is 8.475 +/- 0.750 Angstroms,
• the distance between C and D is 2.133 +/- 0.750 Angstroms,
• molecule A contains a hydrogen acceptor E,
• the distance between B and E is 4.891 +/- 0.750 Angstroms,
• the distance between C and E is 3.114 +/- 0.750 Angstroms,
• the distance between D and E is 3.753 +/- 0.750 Angstroms.
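A learned rule of this kind can be applied to a new molecule by searching for an assignment of its features to the pharmacophore points. The sketch below is an illustration under stated assumptions, not the ILP system's own matcher: it checks only the B, C, D part of the ACE rule, represents a conformation as a list of typed 3D points, and uses invented coordinates; a molecule would count as matching if any of its conformations matches.

```python
import math
from itertools import permutations

TOLERANCE = 0.75  # tolerance taken from the rule

# Sub-rule of the ACE pharmacophore: points B, C, D only.
POINT_TYPES = {"B": "zinc_site", "C": "h_acceptor", "D": "h_acceptor"}
DISTANCES = {("B", "C"): 7.899, ("B", "D"): 8.475, ("C", "D"): 2.133}

def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def match(features):
    """Return a point-to-feature assignment for one conformation, or None.

    `features` is a list of (type, (x, y, z)) tuples (hypothetical format).
    """
    names = list(POINT_TYPES)
    for combo in permutations(range(len(features)), len(names)):
        assign = dict(zip(names, combo))
        if any(features[i][0] != POINT_TYPES[n] for n, i in assign.items()):
            continue  # wrong feature type for some point
        if all(abs(distance(features[assign[a]][1], features[assign[b]][1]) - d)
               <= TOLERANCE for (a, b), d in DISTANCES.items()):
            return assign
    return None

def molecule_matches(conformations):
    """A molecule matches if at least one of its conformations does."""
    return any(match(c) is not None for c in conformations)

# Hypothetical conformations with invented coordinates.
good = [("zinc_site", (0.0, 0.0, 0.0)), ("h_acceptor", (7.9, 0.0, 0.0)),
        ("h_acceptor", (8.2, 2.1, 0.0)), ("hydrophobe", (3.0, 3.0, 3.0))]
bad = [("zinc_site", (0.0, 0.0, 0.0)), ("h_acceptor", (3.0, 0.0, 0.0)),
       ("h_acceptor", (3.5, 2.0, 0.0))]
result = molecule_matches([bad, good])
```

The brute-force search over permutations is fine for a handful of pharmacophore points; it is used here for clarity rather than efficiency.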

Pharmacophore discovered (figure labels: A, B, C; legend: zinc site, H-bond acceptor)

Experiment 2 • Definition of “zinc ligand” added to background knowledge • based on crystallographic data • Multiple conformations • Sybyl RandomSearch

Experiment 2 • Original pharmacophore rediscovered plus one other • different zinc ligand position • similar to alternative proposed by Ciba-Geigy (figure distance labels: 4.0, 3.9, 7.3)

Example III: Thermolysin inhibitors • 10 inhibitors for which crystallographic data is available in PDB • Conformationally challenging molecules • Experimentally observed superposition

Key binding site interactions (figure labels: Asn112-NH, O=C Asn112, S2’, Arg203-NH, S1’, O=C Ala113, Zn)

Interactions made by inhibitors

Pharmacophore Identification • Structures considered: 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, 6TMN • Conformational analysis using “Best” conformer generation in Catalyst • 98-251 conformations/molecule
