
SMILES 2

SMILES 2. C371 Lecture Based on Dr. David Wild’s C571 Presentations, Fall 2004. Linear Notations: represent the atoms, bonds, and connectivity as a linear text string. SMILES is concise, was originally designed for manual command-line entry into text-only systems, and is now widely used.



  1. SMILES 2 C371 Lecture Based on Dr. David Wild’s C571 Presentations Fall 2004

  2. Linear Notations • Represent the atoms, bonds, and connectivity as a linear text string • SMILES • Concise • Originally designed for manual command-line entry into text-only systems • Now widely used • Can be input to a spreadsheet cell, on one line of a text file, or in an Oracle database text field • A system exists to generate a canonical form of SMILES

  3. Review of SMILES • Atoms represented by normal chemical symbols (uppercase for aliphatics, lowercase for aromatic) • Adjacent atoms imply single bonds • Use = for double, # for triple bonds • Hydrogens usually implicit • Parentheses imply branching • Ring closure indicated by numbers

  4. SMILES Review (cont’d) • Can make hydrogens explicit • Atoms outside the organic subset are put in square brackets, e.g., [Xe] • Charged species also in square brackets with a + or -, e.g., [Na+] or [O-] • Unknown atoms indicated by a * • Stereochemistry at a chiral center represented by @ or @@

  5. SMILES for Tyrosine NC(Cc1ccc(O)cc1)C(=O)O
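The notation reviewed above can be illustrated by splitting a SMILES string into its symbols. The following is a minimal sketch, not a full SMILES parser: the regular expression and the helper names `tokenize` and `count_atoms` are assumptions made for this illustration, and the code does not check ring closures or valences.

```python
import re

# Simplified token pattern for the notation reviewed in the slides.
# This is an illustrative assumption, not a complete SMILES grammar.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atom, e.g. [Na+], [O-], [nH]
    r"|Br|Cl"              # two-letter atoms of the organic subset
    r"|[BCNOPSFI]"         # one-letter aliphatic atoms
    r"|[bcnops]"           # aromatic atoms (lowercase)
    r"|%\d\d|\d"           # ring-closure labels
    r"|[=#()*.\-+/\\@:]"   # bonds, branches, wildcard, component dot
)


def tokenize(smiles):
    """Split a SMILES string into tokens; reject unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


def count_atoms(smiles):
    """Count atom tokens (implicit hydrogens are not counted)."""
    return sum(
        1 for t in tokenize(smiles)
        if t[0] == "[" or t[0].isalpha() or t == "*"
    )


tyrosine = "NC(Cc1ccc(O)cc1)C(=O)O"
print(tokenize(tyrosine))
print(count_atoms(tyrosine))  # heavy atoms only
```

Because hydrogens are implicit, the count covers only the non-hydrogen atoms written in the string.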

  6. SMILES for Acetaminophen (Tylenol) CC(=O)Nc1ccc(O)cc1

  7. SMILES for Isatin O=c2[nH]c1ccccc1c2=O

  8. Canonicalizing SMILES – Morgan Algorithm • Each atom has a connectivity value: how many atoms it is connected to • That value is replaced by the sum of the connectivity values of its neighbors • Continues iteratively, until the number of different values is maximized • Atoms are numbered in decreasing order of connectivity value • In case of a tie, other properties are used (e.g., atomic number, bond order, etc.)
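The iteration described on this slide can be sketched on a plain adjacency list. This is a minimal illustration under stated assumptions: the function names are hypothetical, the loop stops as soon as a pass no longer increases the number of distinct values (keeping the values from the previous pass), and ties are broken only by atom index rather than by atomic number or bond order.

```python
def morgan_values(adjacency):
    """Return extended connectivity values for a molecular graph.

    adjacency maps each atom index to a list of neighbouring atom indices.
    """
    # Start from the simple connectivity (number of neighbours).
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # Replace each value by the sum of the neighbours' values.
        new_values = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        new_n_classes = len(set(new_values.values()))
        if new_n_classes <= n_classes:
            # No further discrimination gained: keep the previous values.
            return values
        values, n_classes = new_values, new_n_classes


def rank_atoms(adjacency):
    """Order atoms by decreasing value; index is a placeholder tie-break."""
    values = morgan_values(adjacency)
    return sorted(adjacency, key=lambda atom: (-values[atom], atom))


# Carbon skeleton of 2-methylpentane, CC(C)CCC, atoms numbered as written.
skeleton = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3, 5], 5: [4]}
print(morgan_values(skeleton))
print(rank_atoms(skeleton))
```

In this example one pass raises the number of distinct values from three to four, and a second pass adds none, so the values from the first pass are kept.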

  9. Canonicalizing SMILES – CANGEN • Two-stage procedure used by Daylight • First stage, CANON, generates a canonical connection table using a modified version of the Morgan Algorithm that produces a tree structure • Second stage, GENES, creates a unique SMILES using a depth-first search of the molecular graph tree output by CANON • More information – JCICS 29, 1989, 97-101

  10. Representing reactions CH4 + 2O2 → CO2 + 2H2O • Need to identify the 2D arrangement of products and reagents (and distinguish them) • Possibly record which starting material atoms map to which product atoms • Other information (e.g., yield, equilibrium constants, conditions) generally stored separately • Not all reactions specified stoichiometrically

  11. Simple Reaction SMILES • Each reagent and product represented as SMILES • Reagents on the left of a “>>”; products on the right • Individual reagents and products are separated by a “.” CH4 + 2O2 → CO2 + 2H2O Reaction SMILES: C.O=O>>O=C=O.O

  12. Reaction SMILES example • Agents specified between the two “>” characters Reaction SMILES: C.O=O>O=[O+]-[O-]>O=C=O.O

  13. Reaction SMILES example • Note implicit hydrogens Reaction SMILES: C(=O)Cl.NC>>C(=O)NC.Cl
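The layout of a reaction SMILES can be made concrete by splitting one into its three parts. This is a minimal sketch using only string operations; the function name `parse_reaction_smiles` is hypothetical, and the naive split on “.” treats every dot-separated component as a separate species, which is an assumption that would not hold for salts written as one multi-component molecule.

```python
def parse_reaction_smiles(rxn):
    """Split 'reactants>agents>products' into lists of component SMILES."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' characters in {rxn!r}")
    # An empty middle part (as in 'A>>B') means no agents were given.
    reactants, agents, products = (
        [smiles for smiles in part.split(".") if smiles] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


# Examples taken from slides 12 and 13.
print(parse_reaction_smiles("C.O=O>O=[O+]-[O-]>O=C=O.O"))
print(parse_reaction_smiles("C(=O)Cl.NC>>C(=O)NC.Cl"))
```

The second example has nothing between the two “>” characters, so its agent list is empty.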

  14. Atom-mapping SMIRKS representation • Each reactant atom gets a tag (e.g., “C” becomes “[C:1]”) which maps to the same tag on the product side • Hydrogens are explicit SMIRKS: [C:1](=[O:2])[Cl:3].[H:99][N:4]([H:100])[C:0]>>[C:1](=[O:2])[N:4]([H:100])[C:0].[Cl:3][H:99]
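One quick sanity check on an atom-mapped reaction is that every tag on the reactant side also appears on the product side. The sketch below extracts the tags with a regular expression; the helper names are hypothetical, and the check only compares tag numbers, not element types or bonding.

```python
import re

# Matches the ':n' map tag that closes a bracket atom, e.g. '[C:1]'.
MAP_TAG = re.compile(r":(\d+)\]")


def atom_maps(side):
    """Return the sorted atom-map numbers found in one side of a reaction."""
    return sorted(int(tag) for tag in MAP_TAG.findall(side))


def maps_balanced(smirks):
    """True if reactant and product sides carry the same set of map tags."""
    reactants, _agents, products = smirks.split(">")
    return atom_maps(reactants) == atom_maps(products)


# The mapped reaction from this slide.
smirks = (
    "[C:1](=[O:2])[Cl:3].[H:99][N:4]([H:100])[C:0]"
    ">>[C:1](=[O:2])[N:4]([H:100])[C:0].[Cl:3][H:99]"
)
print(atom_maps(smirks.split(">")[0]))
print(maps_balanced(smirks))
```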

  15. Daylight RS/SMIRKS Sites • Basic reaction representation (Reaction SMILES) • http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html • SMIRKS introduction • http://www.daylight.com/dayhtml_tutorials/languages/smirks/index.html • SMIRKS theory • http://www.daylight.com/dayhtml/doc/theory/theory.rxn.html • SMIRKS depicter • http://www.daylight.com/daycgi_tutorials/react.cgi

  16. Representing generic structures • A generic structure is one which, by ambiguity, represents a (possibly infinite) set of possible structures • Ambiguity usually takes the form of “R” groups • Originally used for representing patents • Now used for representing combinatorial libraries too • Also known as Markush Structures

  17. Specifying a substructure query with SMARTS • SMARTS: a superset of SMILES extended to allow partial structures (substructures) and optional parts of molecules to be represented • Simple example *C(=O)O where the * is a wildcard that matches any single atom (here, the atom the carboxyl group is attached to) • More information: • http://www.daylight.com/meetings/summerschool01/course/basics/smarts.html • http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

  18. SMARTS special characters (examples)

  19. SMARTS examples

  20. Try out a SMARTS search • DepictMatch: • http://www.daylight.com/cgi-bin/contrib/depictmatch.cgi • Enter a set of SMILES and a SMARTS, and any part of each SMILES that matches the SMARTS is highlighted • As an example, we’ll use the sample dataset described on the following two slides, with *C(=O)O (carboxyl group) and [R]C(=O)O (carboxyl attached to a ring atom) as our SMARTS

  21. Sample dataset Acetaminophen Alprenolol Amphetamine Captopril Chlorpromazine Diclofenac Gabapentin Salicylate

  22. Sample Dataset SMILES file • CC(=O)Nc1ccc(O)cc1 Acetaminophen • CC(C)NCC(O)COc1ccccc1CC=C Alprenolol • CC(N)Cc1ccccc1 Amphetamine • CC(CS)C(=O)N1CCCC1C(=O)O Captopril • CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13 Chlorpromazine • OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl Diclofenac • NCC1(CC(=O)O)CCCCC1 Gabapentin • COC(=O)c1ccccc1O Salicylate

  23. Web / Oracle Systems • Advantages • Single database for structures and data • No software to install on client machines (except maybe plug-ins like Chime) • Not dependent on (expensive) contract with MDL • Highly customizable • Disadvantages • Requires extensive web-based interface software to be written, for registration, searching, etc • Company will have to maintain system internally • Requires current ISIS system to be abandoned

  24. Chemistry Cartridges • Daylight DayCart • http://www.daylight.com/products/daycart.html • Tripos Auspyx • http://www.tripos.com/sciTech/inSilicoDisc/chemInfo/auspyx.html • Accelrys Accord for Oracle • http://www.accelrys.com/accord/oracle.html • MDL Direct • http://www.mdl.com/products/framework/rel_chemistry_server/index.jsp • IDBS ActivityBase • http://www.id-bs.com/products/abase/ • JChem Cartridge • http://www.jchem.com

  25. Example - DayCart • Store SMILES as string (VARCHAR2) in Oracle database • Cartridge provides extra functions and extensions to functions for searching based on chemical structures • Structure search implemented by EXACT function • Substructure search implemented by MATCHES function • Similarity search implemented by TANIMOTO and EUCLID functions

  26. Measuring similarity between molecules • Similar Property Principle: “Molecules with similar structure are likely to have similar biological activity” • Generally the Tanimoto Coefficient or Euclidean Distance between fingerprints is used

  27. Fingerprint Similarity – Tanimoto • Also known as the Jaccard Coefficient • Tanimoto Similarity = c / (#a + #b − c), where c = ‘1’s in common, #a = ‘1’s in fingerprint A, #b = ‘1’s in fingerprint B • That is, ‘1’s in common divided by ‘1’s set in either fingerprint • 0’s are treated as not significant • Similarity is between 0 (dissimilar) and 1 (same) • Good cutoff for likely biologically similar molecules is 0.7 or 0.8 • Example: A = 101101011, B = 011101101, so c = 4, #a = 6, #b = 6 and Tanimoto Similarity = 4 / (6 + 6 − 4) = 0.5
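The Tanimoto calculation on this slide can be reproduced in a few lines. This is a minimal sketch that takes fingerprints as strings of ‘0’ and ‘1’ characters; the function name and the choice to return 1.0 for two all-zero fingerprints are assumptions made for the illustration, not part of any particular toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two equal-length bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    n_a = fp_a.count("1")                    # '1's in fingerprint A
    n_b = fp_b.count("1")                    # '1's in fingerprint B
    common = sum(1 for x, y in zip(fp_a, fp_b) if x == y == "1")
    union = n_a + n_b - common               # '1's set in either fingerprint
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return common / union


# The example from the slide: c = 4, #a = 6, #b = 6.
print(tanimoto("101101011", "011101101"))  # 0.5
```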

  28. Fingerprint similarity – Euclidean • Pythagorean distance • For binary dimensions, equivalent to the square root of the Hamming distance (i.e., the square root of the number of bits that are different) • 0’s are treated as significant • Smaller values mean more similar • Example: A = 101101011, B = 011101101; the fingerprints differ at bit positions 1, 2, 7, and 8, so Euclidean distance = sqrt(4) = 2.0
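The Euclidean example can be checked the same way. The sketch below again treats fingerprints as strings of ‘0’ and ‘1’ characters; the function names are hypothetical.

```python
import math


def hamming(fp_a, fp_b):
    """Number of bit positions at which two equal-length fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(1 for x, y in zip(fp_a, fp_b) if x != y)


def euclidean(fp_a, fp_b):
    """Euclidean distance between binary fingerprints: sqrt of the Hamming distance."""
    return math.sqrt(hamming(fp_a, fp_b))


# The example from the slide: four differing bits.
print(hamming("101101011", "011101101"))    # 4
print(euclidean("101101011", "011101101"))  # 2.0
```

Unlike the Tanimoto coefficient, this measure counts positions where both bits are 0 as agreement, which is why the slide notes that 0’s are treated as significant.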

  29. Sample dataset Acetaminophen Alprenolol Amphetamine Captopril Chlorpromazine Diclofenac Gabapentin Salicylate

  30. Sample Dataset SMILES file • CC(=O)Nc1ccc(O)cc1 Acetaminophen • CC(C)NCC(O)COc1ccccc1CC=C Alprenolol • CC(N)Cc1ccccc1 Amphetamine • CC(CS)C(=O)N1CCCC1C(=O)O Captopril • CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13 Chlorpromazine • OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl Diclofenac • NCC1(CC(=O)O)CCCCC1 Gabapentin • COC(=O)c1ccccc1O Salicylate

  31. Oracle table Test for sample dataset
      Smiles                            Name            LogP
      ------                            ----            ----
      CC(=O)Nc1ccc(O)cc1                Acetaminophen   0.27
      CC(C)NCC(O)COc1ccccc1CC=C         Alprenolol      2.81
      CC(N)Cc1ccccc1                    Amphetamine     1.76
      CC(CS)C(=O)N1CCCC1C(=O)O          Captopril       0.84
      CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13  Chlorpromazine  5.20
      OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
      NCC1(CC(=O)O)CCCCC1               Gabapentin      -1.37
      COC(=O)c1ccccc1O                  Salicylate      2.60

  32. DayCart structure search using SQL
      select * from Test where exact(Smiles, 'CC(N)Cc1ccccc1') = 1;
      Smiles                            Name            LogP
      ------                            ----            ----
      CC(N)Cc1ccccc1                    Amphetamine     1.76

  33. DayCart substructure search
      select * from Test where matches(Smiles, '*C(=O)O') = 1;
      Smiles                            Name            LogP
      ------                            ----            ----
      CC(CS)C(=O)N1CCCC1C(=O)O          Captopril       0.84
      OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
      NCC1(CC(=O)O)CCCCC1               Gabapentin      -1.37
      COC(=O)c1ccccc1O                  Salicylate      2.60

  34. Substructure search for carboxylic acid Acetaminophen Alprenolol Amphetamine Captopril Chlorpromazine Diclofenac Gabapentin Salicylate

  35. DayCart substructure / value search
      select * from Test where (matches(Smiles, '*C(=O)O') = 1) AND (LogP > 1.0);
      Smiles                            Name            LogP
      ------                            ----            ----
      OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
      COC(=O)c1ccccc1O                  Salicylate      2.60

  36. DayCart similarity search (query structure: Aspirin)
      select * from TEST where tanimoto(SMILES, 'CC(=O)Oc1ccccc1C(=O)O') > 0.6;
      SMILES                            NAME            LOGP
      ------                            ----            ----
      COC(=O)c1ccccc1O                  Salicylate      2.60
      CC(=O)Nc1ccc(O)cc1                Acetaminophen   0.27
      CC(N)Cc1ccccc1                    Amphetamine     1.76

  37. Similarity search for carboxylic acid   Acetaminophen Alprenolol Amphetamine Captopril  Chlorpromazine Diclofenac Gabapentin Salicylate

  38. More examples of DayCart http://www.daylight.com/meetings/summerschool02/course/admin/daycart_hints.html
