
Computer Structure Codes (after lectures by Dr. J.M. Barnard)



Presentation Transcript


  1. Computer Structure Codes (after lectures by Dr. J.M. Barnard) • How do you store chemical structures on computer? • What can you do with them there? • How do the computer systems used in chemical informatics work?

  2. Representing a chemical structure • How much information do you want to include? • atoms present • connections between atoms • bond types • stereochemical configuration • charges • isotopes • 3D-coordinates for atoms

  3. Representing a chemical structure • How much information do you want to include? • atoms present • connections between atoms • bond types (aromatic ring identification) • stereochemical configuration • charges • isotopes • 3D-coordinates for atoms


  7. 2D structure diagram • chemists’ “natural language” • used by most computer systems for display • shows topology, optionally stereochemistry • several commonly-used computer programs allow input/editing of structure diagrams • ISIS/Draw (MDL) http://www.mdl.com • ChemDraw (CambridgeSoft) http://www.cambridgesoft.com/products/ • GRINS/JavaGRINS (Daylight) http://www.daylight.com/products/javatools.html

  8. 2D structure diagram • provides 2D pictorial representation of chemical structure • display on screen • cut/paste/embed in Word document etc. • inter-convert with other forms for further processing • database searching • structure analysis • property prediction • database analysis

  9. Registry Numbers • unique identifiers for compounds or substances • catalog number • most chemical databases have them • Chemical Abstracts • Beilstein • private compound registries in pharmaceutical companies • usually just “idiot numbers” • no chemical information • may have hierarchical structure: parent compound → stereoisomer → salt → batch • need to decide what is a separate compound

  10. Line Notations • represent structures as compact linear string of alphanumeric symbols • easily handled by computer • compact storage • easily transmitted over a network • allow rapid manual coding/decoding by trained users • much faster for input than using a structure drawing program

  11. Line Notations: SMILES Simplified Molecular Input Line Entry System • developed by Dave Weininger (Daylight) OC(=O)C(N)CC1=CC=C(O)C=C1

  12. Other line notations • ROSDAL (Beilstein) Representation Of Structure Diagram Arranged Linearly 1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O • Sybyl Line Notation (Tripos) OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1 • Wiswesser Line Notation (WLN) (obsolete) QVYZ1R DQ

  13. Connection Tables (CTs) • main form of structure representation in computer systems • list atoms and bonds (and other data) as a table • many different formats • “internal” CTs (in memory) • algorithmic processing • “external” CTs (disk files) • archival storage • data exchange between programs

  14. Internal Connection Table • usually “redundant” • every bond shown twice, once for each atom • implemented as array of records • record for each atom might store • atomic type • hydrogen count • formal charge • 2D display co-ordinates • bonds to neighboring atoms • etc.

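  15. “Redundant” Connection Table (tyrosine; one row per atom in atom-number order: atom number, element, hydrogen count, then pairs of neighbour atom number and bond order)
       1  O  1    2 1
       2  C  0    1 1    3 2    4 1
       3  O  0    2 2
       4  C  1    2 1    5 1    6 1
       5  N  2    4 1
       6  C  2    4 1    7 1
       7  C  0    6 1    8 2   12 1
       8  C  1    7 2    9 1
       9  C  1    8 1   10 2
      10  C  0    9 2   11 1   13 1
      11  C  1   10 1   12 2
      12  C  1   11 2    7 1
      13  O  1   10 1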
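
A minimal sketch of how a redundant table like the one on this slide could be built from a bond list in which each bond appears only once. The atom and bond lists are read off the tyrosine SMILES on slide 11; the `Atom` record and the `build_table` helper are illustrative names, not taken from any real system.

```python
from dataclasses import dataclass, field

@dataclass
class Atom:
    element: str
    h_count: int
    # (neighbour atom number, bond order) pairs
    bonds: list = field(default_factory=list)

# Tyrosine heavy atoms, numbered 1..13 in SMILES order: (element, hydrogen count)
ATOMS = [("O", 1), ("C", 0), ("O", 0), ("C", 1), ("N", 2), ("C", 2), ("C", 0),
         ("C", 1), ("C", 1), ("C", 0), ("C", 1), ("C", 1), ("O", 1)]
# Each bond listed once: (atom, atom, bond order)
BONDS = [(1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1), (4, 6, 1), (6, 7, 1),
         (7, 8, 2), (8, 9, 1), (9, 10, 2), (10, 11, 1), (11, 12, 2),
         (12, 7, 1), (10, 13, 1)]

def build_table(atoms, bonds):
    """Return {atom number: Atom}, storing every bond on both of its atoms."""
    table = {i: Atom(el, h) for i, (el, h) in enumerate(atoms, start=1)}
    for a, b, order in bonds:
        table[a].bonds.append((b, order))
        table[b].bonds.append((a, order))
    return table

table = build_table(ATOMS, BONDS)
for number, atom in table.items():
    pairs = " ".join(f"{n} {o}" for n, o in atom.bonds)
    print(number, atom.element, atom.h_count, pairs)
```

The redundancy costs memory but means the neighbours of any atom can be read directly from its own record, which is what most graph algorithms need.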

  16. MDL Connection Table • proprietary file format developed by MDL • http://www.mdl.com/downloads/latest_releases/index.jsp • de facto standard for exchange of datasets • several different flavours and versions • Molfile (single molecule) • SDfile (set of molecules and data) • RGfile (Markush structure) • Rxnfile (single reaction) • RDfile (set of reactions with data) • separates atoms, bonds into separate blocks
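
To illustrate the separate atom and bond blocks, here is a rough sketch that writes and re-reads a very small Molfile. It assumes the fixed-column V2000 layout (three header lines, a counts line, then one line per atom and one per bond) and handles only element symbols, 2D coordinates and bond orders; `write_molfile` and `read_molfile` are hypothetical helper names, and real files carry many more fields.

```python
def write_molfile(name, atoms, bonds):
    """atoms: list of (symbol, x, y); bonds: list of (a, b, order), 1-based."""
    lines = [name, "  sketch", ""]                       # three header lines
    # counts line: atom count and bond count in 3-character fields
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for symbol, x, y in atoms:                           # atom block
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3}" + "  0" * 12)
    for a, b, order in bonds:                            # bond block
        lines.append(f"{a:3d}{b:3d}{order:3d}  0")
    lines.append("M  END")
    return "\n".join(lines)

def read_molfile(text):
    """Return (element symbols, bonds) using fixed column positions."""
    lines = text.split("\n")
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    atoms = [line[31:34].strip() for line in atom_lines]
    bonds = [(int(l[0:3]), int(l[3:6]), int(l[6:9])) for l in bond_lines]
    return atoms, bonds

# Methanol with implicit hydrogens: one C, one O, one single bond
text = write_molfile("methanol", [("C", 0.0, 0.0), ("O", 1.4, 0.0)], [(1, 2, 1)])
print(text)
print(read_molfile(text))
```

Reading by column position rather than by splitting on whitespace matters because adjacent three-character fields can run together when the numbers are large.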

  17. Standard Connection Table Formats • different vendors have proprietary CT formats • many attempts to establish agreed “standard” formats • no real general success • different user communities have failed to coordinate efforts • some standards exist in restricted areas • SMILES and MDL CT formats widely used • most popular programs read/write several different formats

  18. Standard Connection Table Formats • Standard Molecular Data (SMD) format • never gained wide acceptance • Protein Data Bank (PDB) format • Crystallographic Information File (CIF) • Molecular Information File (MIF) • developed from SMD and compatible with CIF • Chemical Exchange Format (CXF) • Chemical Abstracts Service • Chemical Markup Language (CML) • for data exchange using the Internet • INChI (IUPAC/NIST Chemical Identifier)

  19. Conclusions • There are lots of ways of storing a chemical structure in a computer • including different amounts of information • Most important ones are • line notations (e.g. SMILES) • connection tables (e.g. MDL Molfile) • nomenclature • Structure diagrams used for input/output

  20. Topological Graph Theory • branch of mathematics • particularly useful in chemical informatics and in computer science generally • study of “graphs” which consist of • a set of “nodes” • a set of “edges” joining pairs of nodes

  21. Properties of graphs • graphs are only about connectivity • spatial position of nodes is irrelevant • length of edges is irrelevant • crossing edges are irrelevant

  22. Structure Diagrams as Graphs • 2D structure diagrams very like topological graphs • atoms → nodes • bonds → edges • terminal hydrogen atoms are not normally shown as separate nodes (“implicit” H) • reduces number of nodes by ~50% • “hydrogen count” information used to colour neighbouring “heavy atom” • separate nodes sometimes used for “special” hydrogens • deuterium, tritium • hydrogen bonded to more than one other atom • hydrogens attached to stereocentres
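
As a rough illustration of implicit hydrogens, the hydrogen count of each heavy atom can be estimated as its usual valence minus the sum of its bond orders. The sketch below assumes neutral C, N and O atoms with their lowest normal valences and explicit single/double bonds; the `DEFAULT_VALENCE` table and the function name are illustrative, and charges, aromatic bond types and higher valences are ignored.

```python
# Assumption: neutral atoms with their lowest normal valence
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}

def implicit_h_counts(elements, bonds):
    """elements: symbols for atoms 1..n; bonds: (atom, atom, bond order)."""
    used = [0] * len(elements)
    for a, b, order in bonds:
        used[a - 1] += order
        used[b - 1] += order
    return [DEFAULT_VALENCE[el] - u for el, u in zip(elements, used)]

# Tyrosine heavy-atom graph, atoms numbered in the order of the SMILES on slide 11
ELEMENTS = ["O", "C", "O", "C", "N", "C", "C", "C", "C", "C", "C", "C", "O"]
BONDS = [(1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1), (4, 6, 1), (6, 7, 1),
         (7, 8, 2), (8, 9, 1), (9, 10, 2), (10, 11, 1), (11, 12, 2),
         (12, 7, 1), (10, 13, 1)]

h_counts = implicit_h_counts(ELEMENTS, BONDS)
print(h_counts)        # one hydrogen count per heavy atom
print(sum(h_counts))   # 11, matching the formula C9H11NO3
```

Thirteen nodes stand in for a molecule of 24 atoms, which is the saving in graph size the slide refers to.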

  23. Advantages of using graphs • mathematical theory is well understood • graphs can be easily represented in computers • many useful algorithms are known • identical graphs <=> identical molecules • different graphs <=> different molecules

  24. Disadvantages of graphs • analogy between chemical structures and graphs is not perfect • identical graphs <=/=> identical molecules • different graphs <=/=> different molecules • realities of chemical structures cause problems • aromaticity • stereochemistry • tautomerism • coordination compounds • multi-centre bonds • inorganic compounds • macromolecules • polymers • incompletely-defined substances • many graph algorithms are inherently slow

  25. Aromaticity • electronic property of certain ring systems, giving enhanced chemical stability • bonds in aromatic rings have properties that are distinct from single and double bonds • generally accepted definition is Hückel rule • 4n+2 pi-electrons (n is a small integer) • there are borderline cases • aromaticity causes problems for computer representation • different systems deal with it in different ways
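
The electron-count part of the Hückel rule is simple arithmetic, sketched below. The function only tests whether a supplied number of pi-electrons has the form 4n+2; counting the pi-electrons and checking that the ring is planar and fully conjugated are left to the caller, and the function name is an illustrative choice.

```python
def satisfies_huckel(pi_electrons):
    """True if pi_electrons equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0

# pi-electron counts supplied by hand for a few ring systems
for name, count in [("benzene", 6), ("cyclobutadiene", 4),
                    ("cyclooctatetraene", 8), ("naphthalene", 10)]:
    print(name, count, satisfies_huckel(count))
```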

  26. Aromaticity problems • using single and double bonds can give different topological graphs for the same compound • one solution is to use an aromatic bond type
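
The problem can be seen in a small sketch using an ortho-disubstituted benzene ring (for example o-xylene, heavy atoms only). The two Kekulé drawings place the double bonds differently, and a numbering-independent summary of the bond-labelled graph differs between them; relabelling the ring bonds with a single aromatic type removes the difference. The `invariant` function is a simple illustrative graph invariant, not a full isomorphism test.

```python
from collections import Counter

RING = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
SUBSTITUENTS = [(1, 7), (2, 8)]   # two substituent atoms, ortho to each other

def build(double_bonds, aromatic=False):
    """Return {bond: label} for one way of drawing the ring."""
    bonds = {pair: 1 for pair in SUBSTITUENTS}
    for pair in RING:
        if aromatic:
            bonds[pair] = "ar"
        else:
            bonds[pair] = 2 if pair in double_bonds else 1
    return bonds

def invariant(bonds):
    """Numbering-independent summary: each bond's label and its atoms' degrees."""
    degree = Counter(atom for pair in bonds for atom in pair)
    return sorted((sorted(degree[a] for a in pair), str(label))
                  for pair, label in bonds.items())

kekule_a = {(1, 2), (3, 4), (5, 6)}   # double bond between the substituted carbons
kekule_b = {(2, 3), (4, 5), (6, 1)}   # single bond between the substituted carbons

print(invariant(build(kekule_a)) == invariant(build(kekule_b)))   # False
print(invariant(build(kekule_a, aromatic=True))
      == invariant(build(kekule_b, aromatic=True)))               # True
```

Because the invariants of the two Kekulé forms differ, no renumbering of atoms can make the two bond-labelled graphs identical, even though they depict the same compound.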

  27. Alternating bonds and aromaticity • Chemical Abstracts Registry System uses a “normalised” bond type for all rings with alternating single and double bonds • this includes some systems that are not aromatic • and omits some that are

  28. Representing aromaticity • some systems represent aromaticity as an atom property • SMILES allows use of lower-case atomic symbols for aromatic atoms (adjacent aromatic atoms are assumed to be joined by aromatic bonds) • problem: aromaticity is really a ring property

  29. Tautomerism • dynamic equilibrium between positional isomers (labile H) • are they different compounds? • answer depends on what you want to do with them • can use normalised bonds to represent them by a single graph • gets mixed up with ring alternating bonds • some tautomers may be aromatic, when others are not

  30. Tautomerism • tautomerism is a matter of degree • tautomers can be defined in different ways: HQ–X=R ⇌ Q=X–RH • only certain elements can be Q, X or R • keto-enol tautomers are not recognised by Chemical Abstracts • mono-unsaturated carbon chains are not distinguished by Daylight

  31. Structure conventions • sometimes called “business rules” • some chemical groups can be shown in different but equally valid ways • conventions are needed to determine which is preferred • software may be needed to convert to preferred form

  32. Stereochemistry • different compounds with identical connectivity • same topology, different topography • (structure diagrams: S-tyrosine and R-tyrosine)

  33. Stereochemistry • configuration is often unknown • or partially known (relative stereochemistry) • or you may have a mixture of stereoisomers • in which one isomer may occur in enantiomeric excess • many different descriptors used by chemists • wedge (up) and hatched (down) bonds in structure diagrams • Cahn, Ingold, Prelog (CIP) designators (R, S, E, Z) • text-based descriptors (stereoparent, or optical rotation)

  34. Stereochemistry: up/down bonds • can be used as additional “colours” for graph edges • many connection table formats have special codes for up and down bonds • need to know which end of bond is which • useful for re-generating diagrams for display • can be used to calculate other stereo descriptors

  35. Up/down bond problems • different patterns of up/down bonds can show the same stereoisomer • different graphs, same molecule • some patterns of up and down bonds actually convey no useful information about configuration

  36. Stereochemistry: CIP designators • R.S. Cahn, C. Ingold, and V. Prelog, • Angewandte Chemie Intl. Ed. in English 1966, 5, 385-551 • one-letter designator for stereocenters • based on rules assigning priorities to groups around it • tetrahedral carbons (R, S) • double bonds (E, Z) • additional colors for graph nodes or edges • useful for distinguishing stereoisomers when absolute configuration is known • less useful for matching parts of structures (substructure search) as priority rules can cause designator to change when remote part of structure is changed

  37. Double bond stereo in SMILES • / and \ used as “directional” single bonds • only meaningful when used on both atoms of a double bond • several ways of showing same configuration
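
For the simplest case, a SMILES of the form X/C=C/Y with one written substituent on each end of the double bond, the geometry can be read from whether the two directional bonds point the same way. The sketch below handles only that restricted pattern (it is not a SMILES parser), and the function name is illustrative.

```python
import re

# One substituent symbol, a directional bond, C=C, a directional bond, one substituent
PATTERN = re.compile(r"^([A-Z][a-z]?)([/\\])C=C([/\\])([A-Z][a-z]?)$")

def simple_double_bond_geometry(smiles):
    """Classify X/C=C/Y style SMILES; anything else is rejected."""
    match = PATTERN.match(smiles)
    if match is None:
        raise ValueError("only the simple form X/C=C/Y is handled in this sketch")
    first, second = match.group(2), match.group(3)
    # Same direction on both sides: substituents on opposite sides of the bond
    return "trans" if first == second else "cis"

for s in ["F/C=C/F", r"F\C=C\F", r"F/C=C\F", r"F\C=C/F"]:
    print(s, simple_double_bond_geometry(s))
```

The first two strings are different spellings of the same trans isomer and the last two of the same cis isomer, which is the "several ways of showing same configuration" point on the slide.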

  38. Other complications • Organometallic and co-ordination compounds • complex stereochemistry • special bond types may be needed (dative bonds etc.) • ambiguity over covalent/ionic character of bonds • “business rules” usually needed • Inorganic compounds • topological representation often not possible • composition may not involve integral ratios between elements

  39. Macromolecules • in principle can represent all atoms, as for small molecules • some systems use “shortcuts” or “superatoms” for subunits (e.g. amino acids)

  40. Macromolecules • Each shortcut is defined with appropriate attachment points • ordinary atoms can be mixed with shortcuts • system can expand shortcuts when needed
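
A minimal sketch of shortcut expansion for a linear peptide, assuming a hypothetical dictionary in which each residue is a SMILES fragment written from its amino nitrogen to its carbonyl carbon, so that simple concatenation forms the peptide bonds. Stereochemistry is omitted, and tokens not found in the dictionary are passed through unchanged so that ordinary atoms can be mixed with shortcuts.

```python
# Hypothetical shortcut dictionary: residue code -> SMILES fragment, written
# N-terminus first and ending at the carbonyl carbon that forms the next bond.
SHORTCUTS = {
    "Gly": "NCC(=O)",
    "Ala": "NC(C)C(=O)",
    "Ser": "NC(CO)C(=O)",
}

def expand(sequence):
    """Expand shortcuts into an all-atom SMILES; unknown tokens are kept as-is."""
    body = "".join(SHORTCUTS.get(token, token) for token in sequence)
    return body + "O"   # finish with the free C-terminal acid oxygen

print(expand(["Gly", "Ala"]))      # NCC(=O)NC(C)C(=O)O
print(expand(["CC(=O)", "Gly"]))   # CC(=O)NCC(=O)O (N-acetylglycine)
```

Here the attachment points are implicit in the way each fragment is written; a real system would record them explicitly for each shortcut.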

  41. Polymers • special problems are presented because properties of polymer can be affected by polymerisation conditions • average number of subunits • extent of cross-linking • ratio between different subunits • random / block sequences of subunits • etc. • Two main approaches • monomer representation • structural repeating unit (SRU) representation

  42. Incompletely-defined substances • unknown stereochemistry • unknown attachment position • unknown repetition

  43. Markush (“Generic”) structures • structures with R-groups • shorthand for describing sets of structures with common features

  44. Markush structures • also called “generic” structures • very important in chemical patents • inventor claims whole class of related compounds • can be used to describe combinatorial libraries • can be used as queries in database searches
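
The link between a Markush structure and a combinatorial library can be sketched by enumerating every combination of R-group choices. The template, the R-group lists and the function name below are invented for illustration, and plain text substitution like this only works for substituents that attach through the first atom of their SMILES.

```python
from itertools import product

def enumerate_markush(template, r_groups):
    """Yield one SMILES per combination of R-group choices."""
    names = sorted(r_groups)
    for choice in product(*(r_groups[name] for name in names)):
        yield template.format(**dict(zip(names, choice)))

# Hypothetical generic structure: a para-disubstituted benzene
template = "{R1}c1ccc({R2})cc1"
r_groups = {"R1": ["C", "Cl"], "R2": ["O", "N", "F"]}

library = list(enumerate_markush(template, r_groups))
print(len(library))   # 2 x 3 = 6 specific structures
for smiles in library:
    print(smiles)
```

The number of specific structures is the product of the sizes of the R-group lists, which is why generic structures in patents can cover very large sets of compounds that are never enumerated explicitly.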

  45. Canonicalization • a given chemical structure (or graph) can have many valid and unambiguous representations • different order of rows in connection table • different order of atoms in SMILES • for comparison purposes it would be useful to have a single unique or “canonical” representation • process of converting input representation to canonical form is called “canonicalization” or “canonization” • process of applying “rules” (i.e. an algorithm)

  46. Canonicalization • an obvious approach: • generate all possible valid SMILES • choose the one that comes first alphabetically • this would be very slow, but effective, and there is a danger of missing one • principle was used for canonicalizing Wiswesser Line Notation
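
The brute-force idea can be sketched without a SMILES generator by substituting a simpler string: here each atom numbering is written out as a small connection-table string, and the alphabetically first string over all numberings is taken as the canonical form. The string format and function names are illustrative assumptions; the principle (enumerate everything, keep the first) is the one described on the slide.

```python
from itertools import permutations

def encode(elements, bonds, order):
    """Write the structure as a string using the atom numbering given by order."""
    new_number = {old: new for new, old in enumerate(order)}
    atoms = ".".join(elements[old] for old in order)
    edges = sorted((min(new_number[a], new_number[b]),
                    max(new_number[a], new_number[b]), bond_order)
                   for a, b, bond_order in bonds)
    return atoms + "|" + ",".join(f"{a}-{b}:{o}" for a, b, o in edges)

def canonical(elements, bonds):
    """Brute force: try every numbering and keep the alphabetically first string."""
    return min(encode(elements, bonds, order)
               for order in permutations(range(len(elements))))

# Acetic acid entered with two different atom orders (atoms numbered from 0)
acid_1 = (["C", "C", "O", "O"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
acid_2 = (["O", "C", "O", "C"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
# Ethanol and dimethyl ether: same atoms, different connectivity
ethanol = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ether = (["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

print(canonical(*acid_1) == canonical(*acid_2))   # True
print(canonical(*ethanol) == canonical(*ether))   # False
```

With n atoms there are n! numberings to try, which is why this approach is impractical for anything but very small structures.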

  47. Canonicalization • most methods in use today involve renumbering the atoms in some unique and reproducible way • can be used to number rows in connection table • can determine order of atoms in SMILES • normally involve a node labelling technique called “relaxation” • example is Morgan’s algorithm (1965)

  48. Symmetry perception • if ties between label values cannot be resolved on basis of atom/bond types, the atoms are symmetrically equivalent, and it doesn’t matter which is chosen next • Morgan’s algorithm is thus also useful for identifying symmetry in molecules

  49. Morgan’s algorithm • Works by taking more of the graph into account at each iteration • essence of “relaxation” technique is iteratively updating a value by looking at its immediate neighbours • It is not infallible • graphs (“isospectral” graphs) are known where the algorithm cannot distinguish nodes that are not symmetrically equivalent • There are many variations on it • and several theoretical papers analysing it mathematically
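
The relaxation idea can be sketched as follows. This is one of the variations rather than Morgan's exact procedure: Morgan sums the neighbours' values, whereas this sketch re-ranks the pair (own label, sorted neighbour labels), so classes can only split and never merge. It uses connectivity alone and ignores atom and bond types; the function name is illustrative.

```python
def refine(neighbours):
    """neighbours: {atom: [adjacent atoms]} -> {atom: class label}."""
    labels = {atom: len(adj) for atom, adj in neighbours.items()}  # start from degree
    n_classes = len(set(labels.values()))
    while True:
        # Each atom looks only at itself and its immediate neighbours
        keys = {atom: (labels[atom], tuple(sorted(labels[n] for n in adj)))
                for atom, adj in neighbours.items()}
        rank = {key: i for i, key in enumerate(sorted(set(keys.values())))}
        new_labels = {atom: rank[keys[atom]] for atom in neighbours}
        if len(rank) == n_classes:        # no class was split: finished
            return new_labels
        labels, n_classes = new_labels, len(rank)

# Tyrosine heavy-atom graph, numbered as in the SMILES on slide 11
TYROSINE = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5, 6], 5: [4], 6: [4, 7],
            7: [6, 8, 12], 8: [7, 9], 9: [8, 10], 10: [9, 11, 13],
            11: [10, 12], 12: [11, 7], 13: [10]}

classes = refine(TYROSINE)
print(classes)
print(classes[8] == classes[12], classes[9] == classes[11])   # True True
```

Atoms 8 and 12, and atoms 9 and 11, end in the same class because they are symmetrically equivalent positions on the ring. Atoms 1 and 3 (the two carboxyl oxygens) also remain tied here, but only because bond types and hydrogen counts are ignored; that tie would be broken on atom/bond types rather than by symmetry.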

  50. Ring perception • How many rings are there in these structures and which ones are they? • rings are important features of chemical structures • nomenclature generation • aromaticity perception • synthetic significance • fragment descriptor generation
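
The "how many" half of the question has a simple graph-theoretic answer: the number of independent rings equals bonds minus atoms plus the number of connected components. The sketch below computes that count using a small union-find to count components; the function name is illustrative, and deciding which rings to report is a harder problem that this does not address.

```python
def ring_count(n_atoms, bonds):
    """Number of independent rings: bonds - atoms + connected components."""
    parent = list(range(n_atoms + 1))          # atoms are numbered from 1

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)
    components = len({find(i) for i in range(1, n_atoms + 1)})
    return len(bonds) - n_atoms + components

# Tyrosine heavy atoms: 13 atoms, 13 bonds, one six-membered ring
tyrosine = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8), (8, 9),
            (9, 10), (10, 11), (11, 12), (12, 7), (10, 13)]
# Naphthalene skeleton: two six-membered rings fused at the 5-6 bond
naphthalene = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1),
               (5, 7), (7, 8), (8, 9), (9, 10), (10, 6)]

print(ring_count(13, tyrosine))      # 1
print(ring_count(10, naphthalene))   # 2
```

The count does not always match chemical intuition: for naphthalene it gives 2, although the ten-membered outer cycle is also a ring in the graph, which is part of why choosing "which ones" needs further rules.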
