6. Machine Learning and Other Predictive Methods

Presentation Transcript


  1. 6. Machine Learning and Other Predictive Methods

  2. Chemical Space

  3. Predictive Methods
     • Predict physical, chemical, and biological properties
     • For example: 3D structure, NMR and mass spectra, boiling point, melting point, solubility (log P), toxicity, reaction rates, binding affinities, QSAR, …
     • Dock PDB to PubChem

  4. Methods
     • Spectrum of methods:
     • Schrödinger equation
     • Molecular dynamics
     • Machine learning (e.g. SS prediction)

  5. Chemical Informatics
     • Informatics must be able to deal with variable-size structured data
     • Graphical Models
     • (Recursive) Neural Networks
     • ILP
     • GA
     • SGs
     • Kernels

  6. Neural Networks
     • Feedforward applied to fingerprints (1D)
     • Recursive applied to bond graph (2D)
     • Directed Acyclic Graph
     • State vectors
     • Weight sharing

  7. Chemo/Bio Informatics: Two Key Ingredients
     1. Data
     2. Similarity Measures
     Bioinformatics analogy and differences:
     • Data (GenBank, Swiss-Prot, PDB)
     • Similarity (BLAST)

  8. Organic Chemicals: Fundamental Importance of Similarity Measures
     • Rapid Search of Large Databases
     • Protein/Receptor (Docking)
     • Small Molecule/Ligand (Similarity)
     • Predictive Methods (Kernel Methods)

  9. Classification
     • Learning to Classify
     • Limited number of training examples (molecules, patients, sequences, etc.)
     • Learning algorithm (how to build the classifier?)
     • Generalization: should correctly classify test data.
     • Formalization
     • X is the input space
     • Y (e.g. toxic/non-toxic, or {1, -1}) is the target class
     • f: X → Y is the classifier.

  10. Linear Classifiers

  11. Classification: Fundamental Point
      • f is entirely determined by the dot products ⟨x_i, x_j⟩ measuring similarity between pairs of data points.
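
One way to see the point on this slide is a perceptron written in its dual form, where training and prediction touch the data only through dot products. The sketch below is illustrative and is not taken from the presentation; the toy points and the names `train_dual_perceptron` and `predict` are assumptions.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_dual_perceptron(xs, ys, epochs=20):
    """Dual perceptron: learns one coefficient per training point.

    The data enter only through the matrix of pairwise dot products.
    """
    n = len(xs)
    gram = [[dot(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    alpha = [0] * n
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * ys[j] * gram[j][i] for j in range(n)) + b
            if ys[i] * score <= 0:  # misclassified (or on the boundary)
                alpha[i] += 1
                b += ys[i]
                mistakes += 1
        if mistakes == 0:  # converged on the training set
            break
    return alpha, b


def predict(x, xs, ys, alpha, b):
    """Classify x using only dot products with the training points."""
    score = sum(a * y * dot(xi, x) for a, y, xi in zip(alpha, ys, xs)) + b
    return 1 if score > 0 else -1


# Toy, linearly separable data (made up for illustration).
xs = [(2.0, 2.0), (3.0, 1.0), (-1.0, -2.0), (-2.0, -1.0)]
ys = [1, 1, -1, -1]
alpha, b = train_dual_perceptron(xs, ys)
predictions = [predict(x, xs, ys, alpha, b) for x in xs]
```

Replacing `dot` with any other similarity function that behaves like a dot product is what the kernel slides that follow build on.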

  12. Nonlinear Classification (Kernel Methods)
      • We can transform a nonlinear problem into a linear one using a kernel.

  13. Nonlinear Classification (Kernel Methods)
      • We can transform a nonlinear problem into a linear one using a kernel K.
      • Fundamental property: the linear decision surface depends on K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
      • All we need is the Gram similarity matrix K. K defines the local metric of the embedding space.
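
As a small worked illustration of the identity K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩, the sketch below uses the quadratic kernel on 2D points, whose feature map φ is known in closed form. The choice of kernel, the sample points, and the function names are assumptions made for the example, not something the slides specify.

```python
import math


def phi(x):
    """Explicit feature map for the quadratic kernel in two dimensions."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def quadratic_kernel(x, z):
    """K(x, z) = (<x, z>)^2, computed without ever building phi."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def explicit_kernel(x, z):
    """The same quantity computed as a dot product in feature space."""
    return sum(a * b for a, b in zip(phi(x), phi(z)))


def gram_matrix(points, k):
    """Gram similarity matrix: one kernel value per pair of points."""
    return [[k(p, q) for q in points] for p in points]


points = [(1.0, 2.0), (0.5, -1.0), (-2.0, 0.0)]
K = gram_matrix(points, quadratic_kernel)
K_explicit = gram_matrix(points, explicit_kernel)
```

The two matrices agree up to floating-point rounding, which is why a learning algorithm that only needs K never has to construct φ.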

  14. Finding a Good Kernel
      • Given: Two molecules.
      • Task: Systematically compute relevant similarity while being storage/time efficient.
      • Motivation: Enable efficient application of search and kernel algorithms.

  15. Similarity: Data Representations (example SMILES: NC(O)C(=O)O)

  16. 1D SMILES Kernel
      • CCCCCCc1ccc(cc1O)O
      • CCCCCc1ccc(cc1)CO
      • Total: 15
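
The transcript does not say how the slide arrives at its total, so the following is only a generic sketch of a string kernel on SMILES: it counts the substrings (up to a chosen length) that two strings share. The function names and the `max_len` parameter are assumptions, and the code treats a SMILES string as raw characters rather than chemically meaningful tokens.

```python
from collections import Counter


def substring_counts(s, max_len):
    """Count every substring of s with length 1..max_len."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def spectrum_kernel(s, t, max_len=3):
    """Dot product of the two substring-count vectors."""
    cs = substring_counts(s, max_len)
    ct = substring_counts(t, max_len)
    return sum(cs[sub] * ct[sub] for sub in cs if sub in ct)


# The two SMILES strings shown on the slide.
smiles_a = "CCCCCCc1ccc(cc1O)O"
smiles_b = "CCCCCc1ccc(cc1)CO"
similarity = spectrum_kernel(smiles_a, smiles_b)
```

Because the value is a dot product of count vectors, it is symmetric and grows with string length; a normalized variant would divide by the square root of the product of the two self-similarities.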

  17. 2D Molecule Graph Kernel
      • For chemical compounds
      • atom/node labels: A = {C, N, O, H, …}
      • bond/edge labels: B = {s, d, t, ar, …}
      • Count labeled paths
      • Fingerprints (CsNsCdO)
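
Counting labeled paths can be sketched with a depth-first walk over the bond graph. The example below is a minimal illustration, assuming a hand-built four-atom chain that reproduces the path string CsNsCdO from the slide; the function name `labeled_paths` and the input layout (a list of atom labels plus a list of labeled bonds) are assumptions.

```python
from collections import Counter


def labeled_paths(atoms, bonds, max_bonds):
    """Count labeled simple paths with at most max_bonds bonds.

    atoms: list of atom labels, e.g. ["C", "N", "C", "O"]
    bonds: list of (i, j, label) with label in {"s", "d", "t", "ar"}
    Paths are counted once per direction and starting atom.
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        adjacency[i].append((j, label))
        adjacency[j].append((i, label))

    counts = Counter()

    def extend(node, visited, text, depth):
        counts[text] += 1
        if depth == max_bonds:
            return
        for nxt, label in adjacency[node]:
            if nxt not in visited:  # simple paths only, no revisits
                extend(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        extend(start, {start}, atoms[start], 0)
    return counts


# Heavy-atom chain C-N-C=O (hydrogens omitted), built by hand.
atoms = ["C", "N", "C", "O"]
bonds = [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")]
path_counts = labeled_paths(atoms, bonds, max_bonds=3)
fingerprint = set(path_counts)  # binary fingerprint: which paths occur
```

The count vector can feed a dot-product kernel directly, while the set form is the kind of binary fingerprint the similarity formulas on the next slide operate on.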

  18. Similarity for Binary Fingerprints
      [Figure: overlap of fingerprints A and B, with regions a, c, b]
      • Tally features:
      • Unique (a, b)
      • In common (c)
      • Similarity Formula
      • Tanimoto = c/(a+b+c)
      • Tversky(α,β) = c/(α·a + β·b + c)
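
The two formulas translate directly into code when a fingerprint is stored as a set of feature indices. This is a minimal sketch; the function names and the convention of returning 0.0 when both fingerprints are empty are assumptions, not something the slide specifies.

```python
def tally(fp_a, fp_b):
    """Return (a, b, c): features unique to A, unique to B, and in common."""
    a = len(fp_a - fp_b)
    b = len(fp_b - fp_a)
    c = len(fp_a & fp_b)
    return a, b, c


def tanimoto(fp_a, fp_b):
    """Tanimoto = c / (a + b + c)."""
    a, b, c = tally(fp_a, fp_b)
    total = a + b + c
    return c / total if total else 0.0  # assumed convention for empty inputs


def tversky(fp_a, fp_b, alpha, beta):
    """Tversky(alpha, beta) = c / (alpha*a + beta*b + c)."""
    a, b, c = tally(fp_a, fp_b)
    denom = alpha * a + beta * b + c
    return c / denom if denom else 0.0  # assumed convention for empty inputs


# Toy fingerprints: a = 2 (features 1, 2), b = 1 (feature 5), c = 2 (features 3, 4).
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5}
t = tanimoto(fp_a, fp_b)           # 2 / (2 + 1 + 2)
d = tversky(fp_a, fp_b, 0.5, 0.5)  # 2 / (1 + 0.5 + 2)
```

With α = β = 1 the Tversky measure reduces to Tanimoto, and with α = β = 0.5 it equals the Dice coefficient; unequal weights make the measure asymmetric, which is useful for substructure-like queries.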

  19. Similarity Measures

  20. 3D Coordinate Kernel
      [Figure: molecule annotated with interatomic distances 2.8 Å, 2.0 Å, 4.2 Å, 1.4 Å, 3.4 Å]
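
Only the distance labels of this slide survive in the transcript, so the following is a guess at the kind of feature a coordinate-based kernel could use, not a reconstruction of the authors' method: bin all pairwise interatomic distances into a histogram and compare histograms with a dot product. The bin width, bin count, function names, and coordinates are all assumptions.

```python
import math
from itertools import combinations


def distance_histogram(coords, bin_width=0.5, n_bins=12):
    """Histogram of pairwise distances (in the same unit as coords).

    Distances beyond the last bin are clamped into it.
    """
    hist = [0] * n_bins
    for p, q in combinations(coords, 2):
        d = math.dist(p, q)
        idx = min(int(d / bin_width), n_bins - 1)
        hist[idx] += 1
    return hist


def histogram_kernel(coords_a, coords_b, bin_width=0.5, n_bins=12):
    """Dot product of the two distance histograms."""
    ha = distance_histogram(coords_a, bin_width, n_bins)
    hb = distance_histogram(coords_b, bin_width, n_bins)
    return sum(x * y for x, y in zip(ha, hb))


# Made-up coordinates in ångströms: pairwise distances of about 1.4, 2.1 and 2.52.
mol = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (1.4, 2.1, 0.0)]
hist = distance_histogram(mol)
self_similarity = histogram_kernel(mol, mol)
```

Because it depends only on distances, such a feature is unchanged by rotating or translating the molecule, but it does change from one conformer to another.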

  21. Datasets
      • Mutag: 230 chemicals. Mutagenicity in Salmonella. 125 positive/63 negative. Leave-one-out cross-validation.
      • PTC: Several hundred chemicals. Toxicity/carcinogenicity in male and female mice and rats. Leave-one-out cross-validation.
      • NCI: Several thousand chemicals. Growth inhibition in 60 tumor cell lines. Close to 50/50. 20 random 80/20 cross-validation splits.
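
Leave-one-out cross-validation, the protocol listed for Mutag and PTC, holds out each example in turn, trains on the rest, and averages the results. The sketch below shows the loop with a deliberately simple stand-in classifier (nearest neighbor under Tanimoto similarity); the toy fingerprints, labels, and function names are invented for illustration and have nothing to do with the actual datasets or the kernels evaluated in the talk.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints stored as sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def nearest_neighbor_label(query, train):
    """Label of the training fingerprint most similar to the query."""
    best_fp, best_label = max(train, key=lambda item: tanimoto(query, item[0]))
    return best_label


def leave_one_out_accuracy(data):
    """Hold out each example once; predict it from all the others."""
    correct = 0
    for i, (fp, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if nearest_neighbor_label(fp, train) == label:
            correct += 1
    return correct / len(data)


# Toy data: (fingerprint, class) pairs with classes +1 and -1.
toy = [
    ({1, 2, 3}, 1), ({1, 2, 4}, 1), ({2, 3, 4}, 1),
    ({7, 8, 9}, -1), ({7, 8, 10}, -1), ({8, 9, 10}, -1),
]
accuracy = leave_one_out_accuracy(toy)
```

For larger collections such as NCI, repeated random 80/20 splits are cheaper than leave-one-out, since each split requires training only once.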

  22. Examples of Results: Mutag and PTC

  23. Results

  24. Example of Results (NCI)

  25. Example of Results: NCI Accuracy/ROC

  26. Comparison of Kernels (NCI)

  27. Regression: Aqueous Solubility
      • 30-fold cross-validation
      • Delaney dataset: 1440 examples

  28. XLogP
      • 40-fold cross-validation
      • Dataset size: 1991
      • Reference: S. J. Swamidass, J. Chen, P. Phung, J. Bruand, L. Ralaivola, and P. Baldi. Kernels for Small Molecules and the Prediction of Mutagenicity, Toxicity, and Anti-Cancer Activity. Proceedings of the 2005 Conference on Intelligent Systems for Molecular Biology, ISMB 05. Bioinformatics, 21, Supplement 1, i359-368 (2005).

  29. Additional Representations
      • 1D: SMILES string (e.g. NC(CO)C(=O)O)
      • 2D: Atomic connection table
      • 2.5D: 2D surface in 3D space
      • 3D: XYZ coordinates of labeled points
      • Multiple conformers:
      • 3.5D: Bag of conformers as 2D surfaces in 3D space
      • 4D: Bag of conformers as XYZ coordinates of labeled points

  30. 2.5D Surface Kernel
      • Build a graph G (V = atoms) which approximates the surface (convex hull).
      • Use spectral graph kernels on G.

  31. 2.5D Surface Kernel
      • Compute the regular/Delaunay tessellation (tetrahedrization) of the convex hull of the atoms in the molecule.
      • Use the alpha-shape algorithm to detect surface triangles at the relevant scale (keep interior and regular edges, remove singular edges; r on the order of water + carbon radius).
      • This yields a triangulated graph that approximates the surface (average degree 6).
      • Use a spectral kernel with paths (l = 3, 4) on the triangulated surface graph.

  32. Alpha Shape
      • The shape formed by a set of points.
      • Closely related to the solvent-accessible surface.
      • Calculated in O(n log n) using CGAL: http://www.cgal.org/Manual/doc_html/cgal_manual/Alpha_shapes_3/Chapter_main.html

  33. The Conformer Problem
      • Atoms connected by proximity
      • Different conformers have different graphs and features.

  34. 2.5D + Conformers = 3.5D
      [Figure: Molecule A, Molecule B]

  35. Molecular Representations and Kernels
      • 1D: SMILES strings
      • 2D: Graph of bonds
      • 2D: Surfaces
      • 2.5D: Conformers
      • 3D: Atomic coordinates (Pharmacophores, Epitopes)
      • 3.5D: Conformers
      • 4D: Temporal evolution
      • 4D: Isomers

  36. Summary
      • ChemDB and other resources
      • Variety of kernels for small molecules
      • State-of-the-art performance on several benchmark datasets
      • For now, 2D kernels slightly better than 1D and 3D kernels
      • Many possible extensions: 2.5D, 3D, 3.5D, 4D kernels
      • Need for larger data sets and new models of cooperation in the chemistry community
      • Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules/reactions, retrosynthesis, prediction of reaction rates, information retrieval from literature, docking, matching table of all proteins against all known compounds, origin of life, etc.)
