
2. Molecular Representations



Presentation Transcript


  1. 2. Molecular Representations

  2. Communicating Chemical Data • Chemical data: text, numbers, and molecules • Standard valence model of chemistry • Discrete bonds represent shared electrons • Codify into a reproducible representation • A graph of atoms and bonds is the most commonly understood representation

  3. 2D Graph of Atoms / Bonds • Labeled graph • Nodes = Atoms, Symbols = {C, N, O, H, …} • Edges = Bonds, Order = {1, 2, 3, aromatic, …} • Organic compound shorthand • Assumed carbons • Implicit hydrogens • Standard valence rules
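
The "implicit hydrogens" and "standard valence rules" bullets can be illustrated with a minimal sketch. It assumes neutral atoms with a single default valence each (the `DEFAULT_VALENCE` table and the function name are illustrative choices, not part of the slides) and uses the four-atom skeleton C-C(=O)-N as input.

```python
# Default valences for a few neutral organic atoms. This is a simplifying
# assumption: charged atoms and elements with several valences are ignored.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def implicit_hydrogens(atoms, bonds):
    """Return the implicit hydrogen count for each atom.

    atoms: list of element symbols, e.g. ["C", "C", "O", "N"]
    bonds: list of (i, j, order) with 0-based atom indices
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    # Whatever valence is not used by explicit bonds is filled with hydrogens.
    return [DEFAULT_VALENCE[sym] - u for sym, u in zip(atoms, used)]

# Heavy-atom skeleton C-C(=O)-N
atoms = ["C", "C", "O", "N"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
print(implicit_hydrogens(atoms, bonds))  # [3, 0, 0, 2]
```

The result reads as CH3, C, O, NH2, which is why the shorthand can omit hydrogens without losing information.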

  4. Tractability • Small, tree-like graphs • Number of vertices is small (e.g. fewer than 50) • Number of edges is small (average degree ~2.3 or so) • Tree-like
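
As a rough sketch of how these quantities are computed: every bond contributes to the degree of two atoms, and a connected graph with no rings has exactly one bond fewer than atoms. The function names are illustrative, and the atom and bond counts are for the heavy-atom graph of phenylalanine.

```python
def average_degree(n_atoms, n_bonds):
    """Each bond adds one to the degree of both of its end atoms."""
    return 2 * n_bonds / n_atoms

def ring_count(n_atoms, n_bonds, n_components=1):
    """Number of independent cycles; 0 means the graph is a tree."""
    return n_bonds - n_atoms + n_components

# Phenylalanine heavy atoms: 9 C + 1 N + 2 O = 12 atoms joined by 12 bonds.
print(average_degree(12, 12))  # 2.0
print(ring_count(12, 12))      # 1
```

A ring count that stays small relative to the atom count is what "tree-like" means here.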

  5. 2D Data Formats • Bond matrix formats exist, but size ~ nAtoms² • Connection table • List of nodes • C1, C2, O3, N4 • List of edges • 1-2, 2=3, 2-4 • SDFile, Mol2 formats • Not human-writable
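
The slide's own example (atoms C1, C2, O3, N4 with bonds 1-2, 2=3, 2-4) can be written out both ways. The sketch below is one possible in-memory layout, not the SDFile or Mol2 format; the helper names are illustrative.

```python
# Connection table: a list of nodes and a list of edges,
# using the 1-based atom numbers from the slide.
atoms = {1: "C", 2: "C", 3: "O", 4: "N"}
bonds = [(1, 2, 1), (2, 3, 2), (2, 4, 1)]  # (atom, atom, bond order)

def to_bond_matrix(atoms, bonds):
    """Expand the edge list into a dense nAtoms x nAtoms bond-order matrix."""
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for a, b, order in bonds:
        matrix[a - 1][b - 1] = order
        matrix[b - 1][a - 1] = order
    return matrix

def neighbors(bonds, atom):
    """List (neighbor, order) pairs for one atom directly from the edge list."""
    out = []
    for a, b, order in bonds:
        if a == atom:
            out.append((b, order))
        elif b == atom:
            out.append((a, order))
    return out

for row in to_bond_matrix(atoms, bonds):
    print(row)
print(neighbors(bonds, 2))  # [(1, 1), (3, 2), (4, 1)]
```

The matrix stores 16 entries for 3 bonds, most of them zero, while the connection table grows only with the number of atoms and bonds.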

  6. 1D Line Notations • Should be human-parseable to facilitate communication without a computer • Nomenclature • IUPAC system: 2-amino-3-phenyl-propanoic acid • Common names: phenylalanine • SMILES: C(C(O)=O)(Cc1ccccc1)N • Widely used, non-standardized • InChI • Recent, IUPAC-supported official standard • Ex. 1/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12) • http://www.iupac.org/inchi/

  7. IUPAC Nomenclature • IUPAC Standard Naming Conventions • propane • propanoic acid • 3-hydroxy-propanoic acid • 2-amino-3-hydroxy-propanoic acid • Unwieldy standard and inconsistent adoption • “Common” names and abbreviations (Serine) • Systematic bidirectional translation unreliable

  8. SMILES Basics • Connection tables as a character string • Atoms: Atomic symbols {C, N, O, S, …} • Bonds: single “-” (implicit), double “=”, triple “#” • Examples • CBr • C=O • C#N • O=CC#N • http://www.daylight.com/smiles/

  9. SMILES Basics • Branching: Parentheses • Cycles: Numerical annotations • CCC(O)C • CC(N)(N)O • C1CCCC1 • N12CCCCC1CCCC2 • N#CC(C#N)N1C=CC=C1 • Extensions for • Inorganic atoms, unusual valence, formal charges, stereochemistry, aromaticity, reactions, etc.
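
To make the atom, bond, branch, and ring-closure rules concrete, here is a minimal sketch of a parser for a small SMILES subset. It is an illustration written for this transcript, not the Daylight implementation. It assumes unbracketed atoms only, single-digit ring closures, and no charges, stereochemistry, or disconnected fragments, and it treats an unmarked bond between two lowercase (aromatic) atoms as aromatic.

```python
def parse_smiles(smiles):
    """Parse a small SMILES subset into (atoms, bonds).

    atoms: list of symbols (lowercase = aromatic)
    bonds: list of (i, j, symbol) with symbol in "-", "=", "#", ":"
    """
    two_letter = ("Cl", "Br")
    one_letter = set("BCNOPSFIbcnops")
    atoms, bonds = [], []
    prev = None         # index of the atom the next atom will bond to
    pending = None      # explicit bond symbol waiting for its second atom
    branch_stack = []   # atoms to return to when ")" closes a branch
    open_rings = {}     # ring digit -> (atom index, bond symbol or None)

    def bond_symbol(i, j, explicit):
        if explicit is not None:
            return explicit
        # Simplification: unmarked bond between two aromatic atoms is aromatic.
        return ":" if atoms[i].islower() and atoms[j].islower() else "-"

    pos = 0
    while pos < len(smiles):
        ch = smiles[pos]
        if smiles[pos:pos + 2] in two_letter:
            symbol, pos = smiles[pos:pos + 2], pos + 2
        elif ch in one_letter:
            symbol, pos = ch, pos + 1
        else:
            symbol, pos = None, pos + 1

        if symbol is not None:
            atoms.append(symbol)
            cur = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, cur, bond_symbol(prev, cur, pending)))
            prev, pending = cur, None
        elif ch in "-=#:":
            pending = ch
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:
                start, start_bond = open_rings.pop(ch)
                bonds.append((start, prev, bond_symbol(start, prev, pending or start_bond)))
            else:
                open_rings[ch] = (prev, pending)
            pending = None
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    if open_rings or branch_stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds

for example in ["CCC(O)C", "C1CCCC1", "N12CCCCC1CCCC2", "N#CC(C#N)N1C=CC=C1"]:
    atoms, bonds = parse_smiles(example)
    print(example, len(atoms), "atoms,", len(bonds), "bonds")
```

A parenthesis pushes the current atom so that the chain can resume from it after the branch, and a repeated digit adds one extra bond between the two atoms that carry it.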

  10. Canonical Representations • Unique representation needed for rapid DB lookup and to check uniqueness • Need to uniquely order the atoms of a molecule • nAtoms! atom orderings possible • Morgan Algorithm • Label nodes by connectivity (heavy degree) • Relax iteratively towards extended connectivity (EC) using neighbor values • Use EC magnitude to decide on atom order • EC “tie-breaking” by atom, bond distinctions
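
The Morgan steps listed above can be sketched as follows. This is a simplified version under stated assumptions: it uses the common stopping rule of halting when the number of distinct EC values no longer increases, and it leaves ties in input order rather than breaking them by atom and bond type, so on its own it does not yield a fully canonical ordering. The function names are illustrative.

```python
def morgan_extended_connectivity(adjacency):
    """Return extended-connectivity (EC) values for each atom.

    adjacency: dict mapping atom index -> list of neighbor indices
    (heavy atoms only).
    """
    ec = {a: len(nbrs) for a, nbrs in adjacency.items()}  # start from degree
    n_classes = len(set(ec.values()))
    while True:
        # Relaxation step: each atom takes the sum of its neighbors' values.
        new_ec = {a: sum(ec[n] for n in nbrs) for a, nbrs in adjacency.items()}
        new_classes = len(set(new_ec.values()))
        if new_classes <= n_classes:
            return ec  # no further discrimination; keep the previous values
        ec, n_classes = new_ec, new_classes

def rank_atoms(adjacency):
    """Order atoms by decreasing EC; ties are left in index order here."""
    ec = morgan_extended_connectivity(adjacency)
    return sorted(adjacency, key=lambda a: (-ec[a], a))

# Five-carbon chain C0-C1-C2-C3-C4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(morgan_extended_connectivity(pentane))  # {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
print(rank_atoms(pentane))                    # [2, 1, 3, 0, 4]
```

Plain degree only separates the chain ends from the interior atoms; one relaxation step also singles out the central atom, and the symmetric pairs that remain tied are genuinely equivalent.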

  11. Stereochemistry / Isomers • Chemical “handedness” • Same connectivity, but not superimposable • Atoms with at least 4 distinct components • Double bonds with distinct components at ends • Specification by atom / bond labels • e.g. O/C=C/N vs. O/C=C\N • e.g. C[C@H](N)O vs. C[C@@H](N)O

  12. 3D Atomic Coordinates • 2D graph only specifies connections • 3D spatial coordinates (center, radius, surface) • Largely unavailable • Usually predicted

  13. 4D Conformers • Molecules are relatively rigid w.r.t. • Bond length • Bond angles • Single bonds are very flexible w.r.t. rotation • More information with collection of multiple, static 3D conformations
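
Rotation about a single bond is usually measured as a torsion (dihedral) angle over four consecutive atoms, so conformers of one molecule share bond lengths and bond angles but differ in these angles. The sketch below computes that angle from 3D coordinates; the coordinates are made up for illustration, and the sign of the result depends on the chosen convention.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle about the p1-p2 bond, in degrees (range -180..180)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    b2_len = math.sqrt(_dot(b2, b2))
    m1 = _cross(n1, tuple(c / b2_len for c in b2))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))

# Two made-up conformations that differ only by rotation about the p1-p2 bond.
cis = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)]
trans = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)]
print(dihedral_degrees(*cis))    # 0.0
print(dihedral_degrees(*trans))  # 180.0
```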

  14. Molecular Surfaces • For intermolecular interactions, externally “visible” surface is most important • Representations: Orbitals, VDW Radii, Accessibility, Tessellations • http://www.netsci.org/Science/Compchem/feature14e.html

  15. Valence Model Limitations • “Bonds” are non-existent • Model of shared electron orbitals • Difficulty modeling • Aromaticity • Resonance • Tautomers • Etc.

  16. Structural Keys • Motivated in part by rapid screening for “functional group” substructures • Pre-compute presence of common / important substructures up front and record in bit-vector • Example of structural keys • Presence of atoms (C, N, O, S, Cl, Br, etc.) • Ring systems • Functional Groups • Aromatic, Phenol, Alcohol (ROH), Amine (RNH2), Acid (RC(=O)OH), Ester, …
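
A toy version of a structural-key bit vector might look like the sketch below. The key set is hypothetical and deliberately tiny: it covers only the atom-presence keys and a single "has ring" key. Functional-group keys would need a substructure matcher, which is not shown.

```python
# Hypothetical key set for illustration only.
ELEMENT_KEYS = ["C", "N", "O", "S", "Cl", "Br"]

def structural_keys(atoms, bonds):
    """Return a bit list: one bit per element key, plus a 'has ring' bit.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Assumes a single connected molecule, so any bond beyond
    len(atoms) - 1 must close a ring.
    """
    present = {sym.capitalize() for sym in atoms}  # treat aromatic "c" as "C"
    bits = [1 if key in present else 0 for key in ELEMENT_KEYS]
    bits.append(1 if len(bonds) > len(atoms) - 1 else 0)
    return bits

# C-C(=O)-N skeleton and a five-membered carbon ring
print(structural_keys(["C", "C", "O", "N"], [(0, 1), (1, 2), (1, 3)]))
# [1, 1, 1, 0, 0, 0, 0]
print(structural_keys(["C"] * 5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]))
# [1, 0, 0, 0, 0, 0, 1]
```

Because the keys are computed once and stored, a substructure query can first compare bit vectors and skip any molecule that lacks a bit the query requires.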

  17. SMILES Examples

  18. SMILES Examples

  19. SMILES Examples

  21. Generalized Fingerprints • Structural Keys • Generalizes only in proportion to knowledge • Sparsely populated • A good screening filter will have thousands of keys, but each item generally sets only a few dozen • Generalized Fingerprints (Spectral Representations) • No pre-defined patterns • Record counts or presence/absence of “substructures” (e.g. labeled paths, trees, etc.) • Fixed-length (binary) vectors • Fast algorithms • Abstract; hard to trace individual bits back to their meaning

  22. Systematic Graph Features • For chemical compounds • atom/node labels: A = {C, N, O, H, …} • bond/edge labels: B = {s, d, t, ar, …} • Trace Paths • Depth First Search (CsNsCdO)
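
Path tracing by depth-first search can be sketched as below, using the slide's labels (atom symbols, and s/d/t/ar for bonds). The function name, the `max_bonds` limit, and the choice to keep one orientation per path are assumptions made for this example; the input is a four-atom chain whose full path reads CsNsCdO.

```python
def enumerate_paths(atoms, bonds, max_bonds=3):
    """Return the set of labeled simple paths with up to max_bonds bonds.

    atoms: list of atom labels, e.g. ["C", "N", "C", "O"]
    bonds: list of (i, j, label) with label in {"s", "d", "t", "ar"}
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        adjacency[i].append((j, label))
        adjacency[j].append((i, label))

    found = set()

    def dfs(current, visited, tokens):
        # Keep one orientation per path so A-B and B-A are not stored twice.
        found.add(min("".join(tokens), "".join(reversed(tokens))))
        if len(visited) > max_bonds:
            return  # the path already contains max_bonds bonds
        for nxt, label in adjacency[current]:
            if nxt not in visited:
                dfs(nxt, visited | {nxt}, tokens + [label, atoms[nxt]])

    for start in range(len(atoms)):
        dfs(start, {start}, [atoms[start]])
    return found

# Chain C-N-C=O
atoms = ["C", "N", "C", "O"]
bonds = [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")]
print(sorted(enumerate_paths(atoms, bonds)))
```

Each returned string is one graph feature; a fingerprint records which of them occur without needing a pre-defined pattern list.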

  23. Fingerprint Flowchart • Graph Feature Extractor • 0 Bonds • O • C • N • 1 Bond • O=C • C-C • C-N • 2 Bonds • O=C-C • C-C-N • 3 Bonds • O=C-C-N • Hash Function → Integer RNG Seed • Random Number Generator → Several Integers • Modulo FP Size • Fingerprint [ 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 ] → [ 0 0 1 0 1 0 0 0 … 0 1 1 1 0 1 0 1 0 0 ]
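
The flowchart stages (feature, hash, RNG seed, several integers, modulo fingerprint size) can be sketched with the standard library. The specific choices here are assumptions for illustration: SHA-256 as the hash function, a 64-bit fingerprint, and two bits per feature. The feature strings are the paths listed on the slide.

```python
import hashlib
import random

def path_fingerprint(features, fp_size=64, bits_per_feature=2):
    """Fold a collection of feature strings into a fixed-length bit list."""
    fingerprint = [0] * fp_size
    for feature in features:
        # Hash function: feature string -> integer seed. hashlib is used
        # because Python's built-in hash() of strings varies between runs.
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "big")
        # Random number generator: seed -> several integers.
        rng = random.Random(seed)
        for _ in range(bits_per_feature):
            # Modulo FP size: map each integer to a bit position.
            position = rng.getrandbits(32) % fp_size
            fingerprint[position] = 1
    return fingerprint

# Path features listed on the slide for the fragment O=C-C-N
features = ["O", "C", "N", "O=C", "C-C", "C-N", "O=C-C", "C-C-N", "O=C-C-N"]
fp = path_fingerprint(features)
print("".join(map(str, fp)))
```

Different features can land on the same bit, which is why an individual bit cannot be traced back to a single substructure. A useful side effect is that the features of a substructure are a subset of the molecule's features, so its bits are a subset of the molecule's bits.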

  24. Other “Fingerprint” Representations • Derived Representations • Information Compression • Example: Locality-Sensitive Hashing (LSH) • Choose K random lines in high-dimensional space • Project data points • Bin coordinates
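
The three LSH steps (choose K random lines, project, bin) can be sketched as follows. The Gaussian choice of directions, the `bin_width` parameter, and the function names are assumptions made for this example rather than details given on the slide.

```python
import math
import random

def make_lsh(dim, k, bin_width, seed=0):
    """Build an LSH function from k random lines in dim-dimensional space."""
    rng = random.Random(seed)
    # Step 1: choose k random directions.
    lines = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

    def lsh_key(point):
        key = []
        for line in lines:
            # Step 2: project the point onto the line.
            projection = sum(p * w for p, w in zip(point, line))
            # Step 3: bin the projected coordinate.
            key.append(math.floor(projection / bin_width))
        return tuple(key)

    return lsh_key

lsh = make_lsh(dim=4, k=3, bin_width=1.0, seed=42)
print(lsh([0.10, 0.20, 0.30, 0.40]))
print(lsh([0.11, 0.19, 0.30, 0.41]))  # nearby points often share a key
```

Points that are close in the original space project to nearby coordinates on every line, so they tend to fall into the same bins, and the K-tuple of bin indices acts as a compressed representation.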

  25. Summary • Rich set of representations • 1D: SMILES, Fingerprints • 2D: Graph of Bonds • 2.5D: Surfaces • 3D: Coordinates • 3.5D: Conformers • 4D: Isomers, temporal evolution, etc.

  26. Chemical Informatics • Informatics must be able to deal with variable-size structured data or convert data to “standard” vectorial format • Graphical Models • (Recursive) Neural Networks • ILP • GA • SGs • Kernels


