Explore the synergies between clever algorithms and domain knowledge to mine discriminative molecular fragments in drug discovery. Learn about finding clusters in chemistry space and how to cater to chemists' needs for imprecision.
Mining Imprecise Discriminative Molecular Fragments • Michael R. Berthold • Data Analysis Research Group, Tripos, Inc., South San Francisco, California, USA and ALTANA-Chair for Bioinformatics and Information Mining, University of Konstanz, Germany • Email: berthold@inf.uni-konstanz.de
Outline • Motivation: Imprecise Data in Bioinformatics • Drug Discovery and High Throughput Screening • Finding Clusters in Chemistry Space • Synergies: Clever Algorithms and Domain Knowledge • Mining Molecular Fragments • What the Chemist really wants: Imprecision (Fuzzy Atoms and Flexible Chains) • Some Experimental Results (NCI HIV screens) • Conclusions
Drug Discovery… Classic: • Expert Knowledge available: • Metabolic pathway information • Binding site information • After Specific Target is identified: • Generate Assay to identify desirable effect • Assemble & Test (focused) library of compounds • First Phase: High Throughput Screening (HTS). Often hundreds of thousands of molecules are tested in a highly automated fashion • …After clever data analysis… • Second Phase: Test a few hundred compounds more carefully (IC50)
…Drug Discovery • And then (in the remaining 8-9 years): • Animal Testing • Several rounds of clinical testing • Approval procedures • And most often: late stage failure • Go back to start, do not collect $1,000,000,000 • Lead Rescue: eliminate side effects (ADME/Tox, cardiac effects, sometimes also avoid patents…) avoid bad areas in “drug space” (lead hopping)
High Throughput Screening • Rapidly screen hundreds of thousands of candidates. • Problems • Often thousands of actives • Data extremely noisy (up to 50% false positives, unknown false negatives!) • Positives almost always active for different reasons: separate, diverse clusters! • Goal: Find common properties among similar subsets of active molecules (help user understand activity patterns!)
Motivation • Goal: Find (and describe!) structural groups of molecules that share activity. • For few molecules, manual inspection is feasible. • For more molecules, automated methods are needed…
Molecular Fragment Miner (MoFa) [Ch. Borgelt, M. R. Berthold, IEEE Data Mining, 2002.] Goal: • Find Fragments that are discriminative for a class of interest (high activity, good synthesis result, …): • Appear often in Positives: freq(high activity)>threshold • Appear rarely in Negatives: freq(low activity)<delta MoFa: • Based on Market Basket Analysis (Eclat Algorithm) • Grow Fragment-Candidates from scratch atom-by-atom • Only report significant and unique fragments
Example • 6 example “molecules”, labeled (a) to (f) • Find all unique fragments that occur in at least 4 molecules
[Figure: the six example molecules (a) to (f), built from C, N, O and S atoms with single (-) and double (=) bonds, and the search tree grown from them step by step]
Single atoms and their supports: C #=6, N #=4, O #=6, S #=6
One-bond fragments and their supports: C-C #=3, C-S #=6, N=S #=4, N-S #=4, O=S #=4, S-C #=6, S-N #=4, S=O #=4, S=N #=4
One-bond fragments kept (grown from S): S-C #=6, S-N #=4, S=O #=4, S=N #=4
Two-bond candidates grown from these (| and || mark a further atom attached by a single or double bond): S-C-C, S-C|N, S-C||N, S-C||O, S-N|C, S-N||N, S-N||O, S=N|C, S=N|N, S=N||O, S=O|C, S=O|N, S=O||N
Duplicate Fragments! • How do Apriori, Eclat & Co. avoid duplicate itemsets? • Prefix tree, e.g. for the items a, b, c: first level a, b, c; second level a,b, a,c, b,c; third level a,b,c • BUT: a prefix tree requires a global order defined on the items…
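The prefix-tree idea on this slide can be illustrated with a minimal sketch for plain itemsets. It assumes the global order is simply the sorted order of the item labels; the function name `enumerate_itemsets` is hypothetical and not part of Apriori, Eclat or MoFa.

```python
def enumerate_itemsets(items):
    """Yield every non-empty itemset exactly once.

    Assumption: the global order is the sorted order of the item labels.
    A set is only ever extended with items that come later in that order,
    which is what a prefix tree does, so (a, b) is generated but (b, a) is not.
    """
    ordered = sorted(items)

    def extend(prefix, start):
        for i in range(start, len(ordered)):
            itemset = prefix + (ordered[i],)
            yield itemset
            # Children may only use items after position i.
            yield from extend(itemset, i + 1)

    yield from extend((), 0)


result = list(enumerate_itemsets(["a", "b", "c"]))
```

For three items this produces the seven itemsets of the tree, each once. Molecular fragments lack such a global order, which is the problem the next slide addresses.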
Local Order on Atoms/Bonds • Global order on atoms/bonds is not possible • Use local order on atoms: C < N < O < S • In case of same atom type, use secondary order based on bond: single (-) < aromatic < double (=) < triple • Higher (or equal) extensions are only allowed on last atom extended, and • All extensions are allowed on atoms inserted after last atom extended.
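A minimal sketch of the local order, assuming an extension is written as a (bond, atom) pair and that ":" and "#" stand for aromatic and triple bonds (both symbols are assumptions borrowed from common line notations, not from the slides). It only covers the "higher or equal" rule for extensions on the atom that was extended last; `extension_key` and `allowed_on_last_atom` are hypothetical names.

```python
# Primary order on atom type, secondary order on bond type (from the slide).
ATOM_ORDER = {"C": 0, "N": 1, "O": 2, "S": 3}
BOND_ORDER = {"-": 0, ":": 1, "=": 2, "#": 3}  # ":" and "#" are assumed symbols


def extension_key(extension):
    """Sort key for a (bond, atom) extension: atom type first, then bond type."""
    bond, atom = extension
    return (ATOM_ORDER[atom], BOND_ORDER[bond])


def allowed_on_last_atom(candidates, last_extension):
    """Keep only extensions that are higher than or equal to the last one."""
    floor = extension_key(last_extension)
    kept = [c for c in candidates if extension_key(c) >= floor]
    return sorted(kept, key=extension_key)


candidates = [("=", "O"), ("-", "N"), ("-", "C"), ("=", "N")]
kept = allowed_on_last_atom(candidates, last_extension=("-", "N"))
```

With `("-", "N")` as the last extension, the single-bonded C is rejected because C comes before N, while the double-bonded N and O remain allowed.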
[Figure: the example search tree after applying the local order. Remaining two-bond candidates and their supports: S-C-C #=3, S-C|N #=4, S-C||N #=4, S-C||O #=4, S-N||N #=2, S-N||O #=2, S=N||O #=2]
Support Based Pruning • Support of fragment A: supp(A) = frequency of appearance in molecules • Support declines monotonically with the size of the fragment: fragment A is contained in fragment B ⇒ supp(A) ≥ supp(B) • If supp(node) in a branch is below the threshold, then all child nodes will also be below the threshold.
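The pruning rule can be sketched on a deliberately simplified model. Here each molecule is a set of bond labels and "contained in" means subset, which is an assumption made to avoid subgraph matching; real fragments are connected graphs. The names `support` and `frequent_fragments` are hypothetical.

```python
def support(fragment, molecules):
    """Number of molecules that contain every label of the fragment."""
    return sum(fragment <= molecule for molecule in molecules)


def frequent_fragments(molecules, min_support):
    """Depth-first search that stops growing a branch once it is infrequent."""
    labels = sorted(set().union(*molecules))
    found = {}

    def grow(fragment, start):
        for i in range(start, len(labels)):
            candidate = fragment | {labels[i]}
            count = support(candidate, molecules)
            if count < min_support:
                # Every larger fragment in this branch contains the candidate,
                # so its support cannot be higher: skip the whole branch.
                continue
            found[candidate] = count
            grow(candidate, i + 1)

    grow(frozenset(), 0)
    return found


molecules = [
    frozenset({"C-S", "S=O"}),
    frozenset({"C-S", "S-N"}),
    frozenset({"C-S", "S=O", "S-N"}),
    frozenset({"C-C"}),
]
found = frequent_fragments(molecules, min_support=2)
```

In this toy data the label "C-C" occurs only once, so no fragment containing it is ever counted.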
[Figure: the example search tree after support-based pruning with threshold 4. Surviving two-bond fragments: S-C|N #=4, S-C||N #=4, S-C||O #=4]
Resulting fragments for supp(A) ≥ 4: S-C||O #=4, S-C||N #=4, C-S #=6, C-S-N #=4 • Some fragments which are not reported (due to redundant support): S=N #=4, C #=6, S #=6
Discriminative Fragments • Just finding frequent fragments is usually not interesting • Find fragments that are • frequent in one class of molecules • and infrequent in the remainder of molecules • Discriminative fragments summarize shared properties. • The number of actives and inactives that contain a fragment (and their ratio) indicates its relevance.
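A minimal sketch of this selection step, under the same simplifying assumption that a molecule is a set of bond labels and a fragment is contained in it when it is a subset. The function names, the percentage thresholds and the ranking by the difference of the two percentages are illustrative choices, not taken from MoFa.

```python
def class_percentages(fragment, actives, inactives):
    """Percentage of active and of inactive molecules containing the fragment."""
    def percent(molecules):
        if not molecules:
            return 0.0
        return 100.0 * sum(fragment <= m for m in molecules) / len(molecules)

    return percent(actives), percent(inactives)


def discriminative_fragments(fragments, actives, inactives,
                             min_active_pct, max_inactive_pct):
    """Keep fragments frequent in the actives and rare in the inactives."""
    hits = []
    for fragment in fragments:
        act, inact = class_percentages(fragment, actives, inactives)
        if act > min_active_pct and inact < max_inactive_pct:
            hits.append((fragment, act, inact))
    # Assumed ranking: largest gap between the two classes first.
    return sorted(hits, key=lambda hit: hit[1] - hit[2], reverse=True)


actives = [{"C-S", "S=O"}, {"C-S", "S=O"}, {"C-S"}, {"S=O", "S-N"}]
inactives = [{"C-S"}, {"C-S"}, {"C-C"}, {"C-C"}, {"S-N"}]
fragments = [frozenset({"S=O"}), frozenset({"C-S"}), frozenset({"S-N"})]
hits = discriminative_fragments(fragments, actives, inactives, 50.0, 10.0)
```

In this toy data "C-S" is frequent among the actives but also common among the inactives, so only "S=O" is reported.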
Example: [NCI HIV dataset, ~45000 compounds (~400 active), threshold = 15%] • [Figure: discriminative fragment] 15.08% vs. 0.02%
A few more fragments… • 5.23% vs. 0.05% • 5.23% vs. 0.08% • 4.92% vs. 0.07% • 9.85% vs. 0.07% • 9.85% vs. 0.0% • 10.15% vs. 0.04%
Problems… • Two of the underlying molecules: [Figure: two molecule structures] • However, some fragments puzzled our chemists…
Chemists’ view • Strict graph-based view of molecules is too restrictive • Some tolerances do not affect function, e.g.: • In a specific context, some atoms may be of different type (e.g. N/C equivalence in aromatic rings, all halogens are equivalent, …) • The exact length of a chain connecting two rigid substructures does not matter (e.g. chains of CH2 can be 2-4 carbons long, …)
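The chain-length tolerance in the last bullet can be illustrated with a toy sketch. It assumes molecules are written as plain strings of element symbols along a single path, which is a strong simplification of a molecular graph, and it uses a regular expression with a bounded repetition for the linker; `chain_pattern` and `has_flexible_chain` are hypothetical names.

```python
import re


def chain_pattern(left, right, unit="C", min_len=2, max_len=4):
    """Pattern for two anchor atoms joined by min_len..max_len chain units."""
    linker = "(?:%s){%d,%d}" % (re.escape(unit), min_len, max_len)
    return re.compile(re.escape(left) + linker + re.escape(right))


def has_flexible_chain(pattern, linear_molecule):
    """True if the toy linear string contains the anchors with a valid linker."""
    return pattern.search(linear_molecule) is not None


pattern = chain_pattern("N", "S")  # N and S joined by 2 to 4 carbons
```

A chain of two, three or four carbons between the anchors matches, while one or five carbons do not.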
Fuzzy Matches [H. Hofer, Ch. Borgelt, M. R. Berthold, IDA, Berlin, 2003] • Specifying wildcards via equivalence classes, here • Meta Atoms: certain atoms can be matched by other atoms of the same equivalence class • Maximum number of fuzzy atoms allowed • Equivalence classes can overlap (e.g. {O,C} and {C,N}) • Fuzzy Chains: model flexible chains explicitly • Specify min/max length of chains
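A minimal sketch of meta-atom matching, assuming a fragment and a candidate are compared position by position as sequences of element symbols (a simplification of graph matching). It also reads "maximum number of fuzzy atoms" as a cap on the positions that match only through an equivalence class, which is an interpretation; `fuzzy_atom_match` is a hypothetical name.

```python
def fuzzy_atom_match(pattern, atoms, classes, max_fuzzy):
    """Compare two atom sequences, tolerating equivalent atoms.

    A position matches exactly, or fuzzily when both symbols belong to one
    common equivalence class. At most max_fuzzy fuzzy positions are allowed.
    """
    if len(pattern) != len(atoms):
        return False
    fuzzy_used = 0
    for expected, actual in zip(pattern, atoms):
        if expected == actual:
            continue
        if not any(expected in c and actual in c for c in classes):
            return False
        fuzzy_used += 1
        if fuzzy_used > max_fuzzy:
            return False
    return True


classes = [{"O", "C"}, {"C", "N"}]  # overlapping classes from the slide
```

Because the classes overlap without being merged, C can stand in for O or for N, but O cannot stand in for N.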
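Fuzzy Matches • Fuzzy Atom Matches (HIV data): [Figure: fragments built from Cl, N, O and S atoms, one containing the meta atom {O,N} and one containing the meta atom {O,S}; frequencies as listed in the figure: CA 5.5%, 3.7% • CA 5.5%, 0.01% • CI 0.0%, 0.0% • CI 0.0%, 0.0%]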
MoFa - Summary • Search based on parallel embeddings and large scale data mining algorithm (Apriori/Eclat) • Computationally very efficient • Discovered knowledge is immediately meaningful • Fragments understandable to chemists • Better than rules/decision trees on mystic attributes • Really useful after incorporating Expert Feedback re. Imprecisions: • Markush structures: allow for wildcards in fragments (fuzzy atoms and chains of flexible length) • Applied successfully to HTS data analysis, chemical synthesis success prediction.
Thank you. Preprints/Remarks/further Questions: send email to berthold@inf.uni-konstanz.de
Conclusions… • Data Analysis in Life Sciences is inherently: • multi-disciplinary • imprecise • interactive • based on context-dependent notions of similarity • Focus is not exclusively on building good predictors • Instead the user wants understandable pieces of knowledge (“Information Mining”). • Value of knowledge depends on archival… • Store & Retrieve past “experience” • … and on usability
What is “Similarity”? Tropacocaine 1518-12246 – Local Anesthetic
Types of Molecular Similarity • Structural similarity: • Same basic layout of overall graph • …or at least existence of a common subgraph • Geometrical similarity: • Roughly same shape in 3D, independent of exact atom matches • Instead of simple shape, also other properties (surface charge…) can be compared • Global properties: • Molecular weight • Number of hydrogen-bond donors/acceptors… • And many others…
Knowledge Recycling Hardly ever do we find precise fits • find similar structures • chemical similarity • activity related similarity • … • determine related context • cardiac effects vs. ion channel effects (hERG assay) • appear in same metabolic pathway • Related gene expression profiles • … • and finally draw (inherently imprecise!) inferences Knowledge Archival, Management and Usability are crucial.