Machine Learning in Drug Design

David Page

Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences


Collaborators

Michael Waddell

Paul Finn

Ashwin Srinivasan

John Shaughnessy

Bart Barlogie

Frank Zhan

Stephen Muggleton

Arno Spatola

Sean McIlwain

Brian Kay



Outline

  • Overview of Drug Design

  • How Machine Learning Fits Into the Process

  • Target Search: Single Nucleotide Polymorphisms (SNPs)

  • Machine Learning from Feature Vectors

    • Decision Trees

    • Support Vector Machines

    • Voting/Ensembles

  • Predicting Molecular Activity: Learning from Structure


Drugs Typically Are…

  • Small organic molecules that…

  • Modulate disease by binding to some target protein…

  • At a location that alters the protein’s behavior (e.g., antagonist or agonist).

  • Target protein might be human (e.g., ACE for blood pressure) or belong to invading organism (e.g., surface protein of a bacterium).



So To Design a Drug:

  • Identify Target Protein

    • Knowledge of proteome/genome

    • Relevant biochemical pathways

  • Determine Target Site Structure

    • Crystallography, NMR

    • Difficult if Membrane-Bound

  • Synthesize a Molecule that Will Bind

    • Imperfect modeling of structure

    • Structures may change at binding

  • And even then…


Molecule Binds Target But May:

  • Bind too tightly or not tightly enough.

  • Be toxic.

  • Have other effects (side-effects) in the body.

  • Break down as soon as it gets into the body, or not leave the body soon enough.

  • Not get to where it should in the body (e.g., across the blood-brain barrier).

  • Not diffuse from gut to bloodstream.


And Every Body is Different:

  • Even if a molecule works in the test tube and works in animal studies, it may not work in people (i.e., it will fail in clinical trials).

  • A molecule may work for some people but not others.

  • A molecule may cause harmful side-effects in some people but not others.



Places to use Machine Learning

  • Finding target proteins.

  • Inferring target site structure.

  • Predicting who will respond positively/negatively.



Healthy vs. Disease

[Figure: two panels, labeled Healthy and Diseased]


If We Could Sequence DNA Quickly and Cheaply, We Could:

  • Sequence DNA of people taking a drug, and use ML to identify consistent differences between those who respond well and those who do not.

  • Sequence DNA of cancer cells and healthy cells, and use ML to detect dangerous mutations… proteins these genes code for may be useful targets.

  • Sequence DNA of people who get a disease and those who don’t, and use ML to determine genes related to susceptibility… proteins these genes code for may be useful targets.


Problem: Can’t Sequence Quickly

  • Can quickly test single positions where variation is common: Single Nucleotide Polymorphisms (SNPs).

  • Can quickly test degree to which every gene is being transcribed: Gene Expression Microarrays (e.g., Affymetrix Gene Chips™).

  • Can (moderately) quickly test which proteins are present in a sample (Proteomics).



Problem: SNPs are not Genes

  • If we find a predictive SNP, it may not be part of a gene… we can only infer that the SNP is “near” a gene that may be involved in the disease.

  • Even if the SNP is part of a gene, it may be another nearby gene that is the key gene.


Problem: Even SNPs are Costly

  • Typically cannot use all known SNPs.

  • Can focus on a particular chromosome and area if knowledge permits that.

  • Can use a scattering of SNPs, since SNPs that are very close together may be redundant… use one SNP per haplotype block, or region where recombination is rare.


Why Machine Learning?

  • There may be no single SNP in our data that distinguishes disease vs. healthy.

  • Some combination of SNPs may still be predictive, and we can gain insight from that combination.
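
A minimal sketch of how a combination can predict when no single SNP does. The data are hypothetical toy values, not from the study: two SNPs coded 0/1, with disease status 1 exactly when the two genotypes differ. The name `best_accuracy` is illustrative only.

```python
from collections import Counter

# Hypothetical toy data: ((SNP_1, SNP_2), label); label is 1 when the genotypes differ.
cases = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)] * 10

def best_accuracy(feature_ids):
    """Training accuracy of the best lookup table keyed on the chosen SNPs."""
    groups = {}
    for snps, label in cases:
        key = tuple(snps[i] for i in feature_ids)
        groups.setdefault(key, Counter())[label] += 1
    # Within each genotype group, the best table predicts the majority label.
    return sum(max(counts.values()) for counts in groups.values()) / len(cases)

single = [best_accuracy([i]) for i in range(2)]  # each SNP on its own
pair = best_accuracy([0, 1])                     # both SNPs together
print(single, pair)
```

On this toy data each SNP alone is no better than a coin flip, while the pair determines the label exactly.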



Naïve Bayes in One Picture

[Figure: naïve Bayes network over the features Age, SNP 1, SNP 2, …, SNP 3000]


Voting Approach

  • Score SNPs using information gain.

  • Choose top 1% scoring SNPs.

  • To classify a new case, let these SNPs vote (majority or weighted majority vote).

  • We use majority vote here.
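
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slides do not say how an individual SNP casts its vote, so here each selected SNP votes for the class that was most common among training cases sharing its genotype. The genotype coding (0/1/2) and the names `train_voters` and `predict` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Reduction in label entropy obtained by splitting on one SNP's genotype."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def train_voters(X, y, fraction=0.01):
    """Keep the top-scoring fraction of SNPs; each stores the majority class per genotype."""
    n_features = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_features)]
    ranked = sorted(range(n_features),
                    key=lambda j: information_gain(columns[j], y), reverse=True)
    keep = ranked[:max(1, round(fraction * n_features))]
    voters = {}
    for j in keep:
        by_value = {}
        for x, label in zip(columns[j], y):
            by_value.setdefault(x, Counter())[label] += 1
        # Assumption: a SNP votes for the majority training class of its genotype.
        voters[j] = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
    return voters

def predict(voters, row, default):
    """Unweighted majority vote; SNPs showing a genotype unseen in training abstain."""
    votes = Counter(table[row[j]] for j, table in voters.items() if row[j] in table)
    return votes.most_common(1)[0][0] if votes else default

# Hypothetical toy data: 4 cases, 3 SNPs; only SNP 0 separates the classes perfectly.
X = [[0, 1, 2], [0, 0, 2], [1, 1, 2], [1, 0, 0]]
y = ["early", "early", "late", "late"]
voters = train_voters(X, y, fraction=0.34)  # keeps 1 of the 3 SNPs
print(predict(voters, [0, 1, 0], default="late"))
```

A weighted vote, which the slide also mentions, would multiply each SNP's vote by its score instead of counting it once.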


Task: Predict Early Onset Disease From SNP Data

  • Only 3000 SNPs, coarsely sampled over entire genome.

  • 80 patients (examples), 40 with early onset.

  • Using technology from Orchid.

  • Can a predictor be learned that performs significantly better than chance on unseen data?


Results

  • Use all data, only top 1% of features, or only top 10% of features (according to decision tree’s purity measure).

  • Use Trees, SVMs, Voting.

  • SVMs with the top 10% of features achieve 71% accuracy, significantly better than chance (50%).
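
One way to check a claim of this kind is a one-sided binomial test against chance. The slides do not give the number of correct predictions or say which test was used, so the count below (57 of 80, roughly 71%) is an assumption for illustration. Cross-validated predictions are also not strictly independent trials, so treat the result as a rough guide.

```python
from math import comb

def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p): a one-sided p-value against chance."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# Assumed count: 57 correct out of 80 is about 71% accuracy.
p_value = binomial_tail(57, 80)
print(f"one-sided p-value: {p_value:.2g}")
```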


Lessons

  • Feature selection is important for performance.

  • Methodology note for machine learning specialists: must repeat this entire process, feature selection included, on the training data of each fold of cross-validation, or results will be overly optimistic.

  • SNP approach is promising… get funding to measure more SNPs.

  • More work on SVM comprehensibility.
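
The methodology note can be illustrated on pure noise. The sketch below is a simplified stand-in, not the study's pipeline: features are binary, ranking uses raw agreement with the label rather than information gain, and the classifier is an unweighted majority vote. All names and the random data are hypothetical.

```python
import random

def select_and_fit(X, y, k, candidates):
    """Rank candidate binary features by agreement with the labels (a simple
    stand-in for information gain) and keep the best k with their polarity."""
    scored = []
    for j in candidates:
        agree = sum(1 for row, label in zip(X, y) if row[j] == label)
        flip = 2 * agree < len(y)  # a mostly-disagreeing feature votes inverted
        scored.append((max(agree, len(y) - agree), j, flip))
    scored.sort(reverse=True)
    return [(j, flip) for _, j, flip in scored[:k]]

def predict(model, row):
    """Unweighted majority vote of the selected features (k is odd, so no ties)."""
    votes = sum(row[j] ^ flip for j, flip in model)
    return 1 if 2 * votes > len(model) else 0

def cv_accuracy(X, y, k, n_folds, select_inside_fold):
    """Cross-validated accuracy, selecting features per fold or once up front."""
    all_features = range(len(X[0]))
    if select_inside_fold:
        candidates = all_features
    else:
        # Leaky variant: features are chosen using every case, test folds included.
        candidates = [j for j, _ in select_and_fit(X, y, k, all_features)]
    correct = 0
    for fold in range(n_folds):
        train = [i for i in range(len(y)) if i % n_folds != fold]
        test = [i for i in range(len(y)) if i % n_folds == fold]
        model = select_and_fit([X[i] for i in train], [y[i] for i in train],
                               k, candidates)
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(y)

# Pure-noise data: 80 cases, 3000 random binary features, labels unrelated to them.
random.seed(0)
y = [0] * 40 + [1] * 40
X = [[random.getrandbits(1) for _ in range(3000)] for _ in y]

leaky = cv_accuracy(X, y, k=31, n_folds=10, select_inside_fold=False)
honest = cv_accuracy(X, y, k=31, n_folds=10, select_inside_fold=True)
print(f"selection outside folds: {leaky:.2f}, inside folds: {honest:.2f}")
```

Since the labels are unrelated to the features, any accuracy well above 50% here comes from the test cases having influenced the feature selection. Selecting inside each fold removes that leak, and the estimate typically falls back toward chance.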




Typical Practice when Target Structure is Unknown

  • Test many molecules (1,000,000) to find some that bind to target (ligands).

  • Infer (induce) shape of target site from 3D structural similarities.

  • Shared 3D substructure is called a pharmacophore.

  • Perfect example of a machine learning task with spatial target.



Inductive Logic Programming

  • Represents data points in mathematical logic

  • Uses Background Knowledge

  • Returns results in logic



Background Knowledge I

  • Information about atoms and bonds in the molecules

    atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
    atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
    atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
    bond(m1,a1,a2,1).
    bond(m1,a2,a3,1).


Background Knowledge II

  • Definition of distance equivalence

    % True when Atom1 and Atom2 of Drug are Dist apart, to within Error.
    % absolute_value/2 is assumed to be defined elsewhere in the background knowledge.
    dist(Drug,Atom1,Atom2,Dist,Error):-
        number(Error),
        coord(Drug,Atom1,X1,Y1,Z1),
        coord(Drug,Atom2,X2,Y2,Z2),
        euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
        Diff is Dist1-Dist,
        absolute_value(Diff,E1),
        E1 =< Error.

    % Euclidean distance between two 3D points.
    euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D):-
        Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
        D is sqrt(Dsq).
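
For readers less familiar with Prolog, the same tolerance check can be sketched in Python. The coordinates are copied from the atm facts for molecule m1 under Background Knowledge I. The dictionary layout and the target distance used in the example call are assumptions made for illustration.

```python
import math

# Coordinates copied from the atm facts for molecule m1.
coords = {
    "a1": (5.915800, -2.441200, 1.799700),
    "a2": (0.574700, -2.773300, 0.337600),
    "a3": (0.408000, -3.511700, -1.314000),
}

def euc_dist(p, q):
    """Euclidean distance between two 3D points, mirroring euc_dist/3."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dist(coords, atom1, atom2, target, error):
    """True when the two atoms are within `error` of `target` apart, mirroring dist/5."""
    return abs(euc_dist(coords[atom1], coords[atom2]) - target) <= error

# Example call with an assumed target of 5.5 and a tolerance of 0.75.
print(dist(coords, "a1", "a2", target=5.5, error=0.75))
```

Unlike the Prolog predicate, this version only checks a given target distance; it does not search over atoms.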



Conformational model

  • Conformational flexibility modelled as multiple conformations:

    • Sybyl RandomSearch

    • Catalyst


Pharmacophore description

  • Atom and site centred

    • Hydrogen bond donor

    • Hydrogen bond acceptor

    • Hydrophobe

    • Site points (limited at present)

    • User definable

  • Distance based


Example I: Dopamine agonists

  • Agonists taken from Martin data set on QSAR society web pages

  • Examples (5-50 conformations/molecule)


Pharmacophore identified

  • Molecule A has the desired activity if:

  • in conformation B molecule A contains a hydrogen acceptor at C, and

  • in conformation B molecule A contains a basic nitrogen group at D, and

  • the distance between C and D is 7.05966 +/- 0.75 Angstroms, and

  • in conformation B molecule A contains a hydrogen acceptor at E, and

  • the distance between C and E is 2.80871 +/- 0.75 Angstroms, and

  • the distance between D and E is 6.36846 +/- 0.75 Angstroms, and

  • in conformation B molecule A contains a hydrophobic group at F, and

  • the distance between C and F is 2.68136 +/- 0.75 Angstroms, and

  • the distance between D and F is 4.80399 +/- 0.75 Angstroms, and

  • the distance between E and F is 2.74602 +/- 0.75 Angstroms.
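
A rule of this form can be checked against a molecule by brute force: try every assignment of the molecule's feature points to C, D, E and F, and accept if some conformation satisfies all six distances within tolerance. The sketch below only tests a given rule and does not reproduce the ILP search that found it. The feature-type strings, the function names and the two conformations are hypothetical; the coordinates were constructed to fit the rule and are not real dopamine agonist data.

```python
import math
from itertools import permutations

# The rule above: required feature type per point, and target distances in Angstroms.
POINT_TYPES = {"C": "acceptor", "D": "basic_nitrogen", "E": "acceptor", "F": "hydrophobe"}
DISTANCES = {("C", "D"): 7.05966, ("C", "E"): 2.80871, ("D", "E"): 6.36846,
             ("C", "F"): 2.68136, ("D", "F"): 4.80399, ("E", "F"): 2.74602}
TOLERANCE = 0.75

def matches(conformation, point_types, distances, tolerance):
    """conformation: list of (feature_type, (x, y, z)) tuples."""
    names = list(point_types)
    for combo in permutations(range(len(conformation)), len(names)):
        assign = dict(zip(names, combo))
        # Every pharmacophore point must land on a feature of the required type.
        if any(conformation[i][0] != point_types[n] for n, i in assign.items()):
            continue
        if all(abs(math.dist(conformation[assign[a]][1],
                             conformation[assign[b]][1]) - d) <= tolerance
               for (a, b), d in distances.items()):
            return True
    return False

def is_active(conformations, point_types, distances, tolerance):
    """The molecule satisfies the rule if any single conformation matches."""
    return any(matches(c, point_types, distances, tolerance) for c in conformations)

# Hypothetical conformations of one molecule.
folded = [("acceptor", (0.0, 0.0, 0.0)), ("basic_nitrogen", (7.06, 0.0, 0.0)),
          ("acceptor", (1.22, 2.53, 0.0)), ("hydrophobe", (2.41, 0.33, 1.13))]
stretched = [("acceptor", (0.0, 0.0, 0.0)), ("basic_nitrogen", (12.0, 0.0, 0.0)),
             ("acceptor", (1.22, 2.53, 0.0)), ("hydrophobe", (2.41, 0.33, 1.13))]

print(is_active([stretched, folded], POINT_TYPES, DISTANCES, TOLERANCE))
```

Trying all permutations is only practical for a handful of feature points per conformation. It is used here because it makes the meaning of the rule explicit.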


Example II: ACE inhibitors

  • 28 angiotensin converting enzyme inhibitors taken from literature

    • D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 3-16, (1987)


Experiment 1

  • Attempt to identify pharmacophore using original Mayer et al. data (final conformations).

  • Initial failed attempt traced to “bugs” in background knowledge definition.

  • 4 pharmacophores found with corrected code (variations on common theme)


ACE pharmacophore

  • Molecule A is an ACE inhibitor if:

  • molecule A contains a zinc-site B,

  • molecule A contains a hydrogen acceptor C,

  • the distance between B and C is 7.899 +/- 0.750 Angstroms,

  • molecule A contains a hydrogen acceptor D,

  • the distance between B and D is 8.475 +/- 0.750 Angstroms,

  • the distance between C and D is 2.133 +/- 0.750 Angstroms,

  • molecule A contains a hydrogen acceptor E,

  • the distance between B and E is 4.891 +/- 0.750 Angstroms,

  • the distance between C and E is 3.114 +/- 0.750 Angstroms,

  • the distance between D and E is 3.753 +/- 0.750 Angstroms.


Pharmacophore discovered

[Figure: the discovered pharmacophore, with points labeled A, B and C and a legend marking the zinc site and H-bond acceptor]

Experiment 2

  • Definition of “zinc ligand” added to background knowledge

    • based on crystallographic data

  • Multiple conformations

    • Sybyl RandomSearch


Experiment 2

[Figure: pharmacophore diagram with distances labeled 4.0, 3.9 and 7.3]

  • Original pharmacophore rediscovered plus one other

    • different zinc ligand position

    • similar to alternative proposed by Ciba-Geigy


Example III: Thermolysin inhibitors

  • 10 inhibitors for which crystallographic data is available in PDB

  • Conformationally challenging molecules

  • Experimentally observed superposition


Key binding site interactions

[Figure: binding-site diagram labeled Asn112-NH, O=C Asn112, S2’, Arg203-NH, S1’, O=C Ala113, Zn]



Pharmacophore Identification

  • Structures considered: 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, 6TMN

  • Conformational analysis using “Best” conformer generation in Catalyst

  • 98-251 conformations/molecule


Thermolysin Results

  • Ten 5-point pharmacophores identified, falling into 2 groups (7/10 molecules)

    • 3 “acceptors”, 1 hydrophobe, 1 donor

    • 4 “acceptors”, 1 donor

  • Common core of Zn ligands, Arg203 and Asn112 interactions identified

  • Correct assignments of functional groups

  • Correct geometry to 1 Angstrom tolerance


Thermolysin Results

  • Increasing tolerance to 1.5 Angstroms finds a common 6-point pharmacophore including one extra interaction


Example IV: Antibacterial peptides

  • Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa

    • 6 actives <64mg/ml IC50

    • 5 inactives


Pharmacophore Identified

A molecule M is active against Pseudomonas aeruginosa

if it has a conformation B such that:

M has a hydrophobic group C,

M has a hydrogen acceptor D,

the distance between C and D in conformation B is 11.7 Angstroms,

M has a positively-charged atom E,

the distance between C and E in conformation B is 4 Angstroms,

the distance between D and E in conformation B is 9.4 Angstroms,

M has a positively-charged atom F,

the distance between C and F in conformation B is 11.1 Angstroms,

the distance between D and F in conformation B is 12.6 Angstroms,

the distance between E and F in conformation B is 8.7 Angstroms.

Tolerance 1.5 Angstroms


Ongoing ILP developments (pharmacophores)

  • Continue to extend method validation

  • Extending to combinatorial mixtures

  • Quantitative models

  • Mixing different datatypes in background knowledge

  • Developing graphical front-end


Ongoing developments (Other)

  • Analysis of HTS datasets

  • Analysis of “drug-likeness”

  • Derivation of new descriptors

    • e.g., empirical binding functions

