Introducing CLiDE Pro: A chemical OCR tool Aniko T. Valko, Keymodule Ltd.

Presentation Transcript


  1. Introducing CLiDE Pro: A chemical OCR tool Aniko T. Valko, Keymodule Ltd.

  2. Chemical Structure Diagrams • Chemical structure diagrams are a form of representation of chemical compounds. • Information contained in a structure diagram can be divided into three areas: • Atom information: chemical elements, functional groups, generic elements, vertex label, charge, atomic weight, hybridization, etc. • Bond information: bond orders, bond styles, bond labels • Structural information: atom information, bond information, overall charge, structure label

  3. What is chemical OCR for? • Publication process: chemical structure diagrams are converted to images. All chemical information is lost! • Manual reproduction: slow and prone to errors • Chemical OCR: automatic extraction of chemical information from chemical structure depictions, 20-90 seconds per page • [Figure: excerpt of an MDL MOL file (V2000 counts line: 29 atoms, 31 bonds) listing 2D coordinates for C, S, N and O atoms]

  4. CLiDE Pro • A chemical OCR software tool • The latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project [1-3]. [1] P. Ibison, M. Jacquot, F. Kam, A. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Literature Data Extraction: The CLiDE Project. J. Chem. Inf. Comput. Sci. 1993, 33(3), 338-344. [2] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Structure Recognition and Generic Text in the CLiDE Project. In Proceedings on Online Information 92. 1992, London, England. [3] A. Simon and A.P. Johnson. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J. Chem. Inf. Comput. Sci. 1997, 37(1), 109-116.

  5. Features • Converts chemical images into connection tables • Loads PDF documents, as well as TIFF and BMP image files • Exports chemical information into MDL MOL files • Supports document-oriented processing as opposed to page-oriented processing • The whole document is loaded and processed at once rather than individual pages. • Handles various difficult drawing features • Interprets generic structures • Operates in interactive or batch mode • Tools for structure and text editing
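The export feature listed above (connection table to MDL MOL file) can be illustrated with a minimal sketch. This is not CLiDE Pro's code: the function name `to_molfile` and the tuple layouts are assumptions made for illustration, and only the basic V2000 counts, atom and bond lines are written (no charges, stereo flags or property lines).

```python
from typing import List, Tuple

Atom = Tuple[str, float, float]  # (element symbol, x, y); hypothetical layout
Bond = Tuple[int, int, int]      # (1-based begin atom, 1-based end atom, bond order)


def to_molfile(name: str, atoms: List[Atom], bonds: List[Bond]) -> str:
    """Serialize a 2D connection table as a minimal V2000 MOL block."""
    # Header block: molecule name, then two lines that may be left blank.
    lines = [name, "", ""]
    # Counts line: atom count, bond count, eight unused fields, 999, version.
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for symbol, x, y in atoms:
        # x, y, z in 10.4 fixed-point fields, the symbol, then twelve zero fields.
        lines.append(
            f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3}{0:2d}" + "  0" * 11
        )
    for begin, end, order in bonds:
        # Two atom indices, the bond order, then four zero fields.
        lines.append(f"{begin:3d}{end:3d}{order:3d}" + "  0" * 4)
    lines.append("M  END")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    ethanol_atoms = [("C", 0.0, 0.0), ("C", 1.5, 0.0), ("O", 2.25, 1.299)]
    ethanol_bonds = [(1, 2, 1), (2, 3, 1)]
    print(to_molfile("ethanol", ethanol_atoms, ethanol_bonds))
```

The atom lines produced this way have the same shape as the MOL file excerpt on slide 3: three coordinates, an element symbol, and a run of zero-valued fields.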

  6. Three main problems involved in chemical OCR • Identification of chemical images within a document. • Compilation of chemical graphs of individual molecules from chemical images. • Interpretation of complex objects such as generic structures using the retrieved chemical graphs.

  7. CLiDE Pro’s solutions to Problem 1 • Problem 1: Identification of chemical images within a document • Document image segmentation • Identification of connected components • Bottom-up layout analysis by building the tree structure of the page • [Figures: digitized image of a document page of a patent; segmented document highlighting recognized text blocks and graphic blocks]
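Identification of connected components, the first step named on this slide, can be sketched as a flood fill over a binary page image. The representation (rows of `#` and `.` characters), the choice of 8-connectivity, and the function name `connected_components` are assumptions for illustration; the slides do not say how CLiDE Pro implements this step.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (min_row, min_col, max_row, max_col)


def connected_components(image: List[str]) -> List[Box]:
    """Return bounding boxes of 8-connected groups of '#' pixels, in scan order."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "#" or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == "#" and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


if __name__ == "__main__":
    page = [
        "##...#",
        "#....#",
        "......",
        "..##..",
    ]
    for box in connected_components(page):
        print(box)
```

A bottom-up layout analysis would then merge nearby boxes into words, lines and blocks; that grouping step is not shown here.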

  8. CLiDE Pro’s solutions to Problem 2 • Problem 2: Extraction of connection tables from chemical images • 1. A chemical image • 2. Classification of connected components into basic groups: characters, lines, dashes, graphics • 3. Construction of dashed bonds based on the Hough transform method [4] • 4. Vectorization based on a polygon approximation method [5] • 5. Construction of atom labels: OCR; grouping characters into atom labels; recognition of superatoms • 6. Construction of connection table: connecting lines to atoms; joining lines to form implicit carbon atoms • [Figure: 3D molecular structure after exporting the constructed CT into an SDF file in 2D and converting the structure from 2D to 3D] [4] R.O. Duda and P.E. Hart. Use of the Hough Transform to Detect Lines and Curves in Pictures. Graphics Image Process. 1972, 1. [5] J. Sklansky and V. Gonzalez. Fast Polygonal Approximation of Digitized Curves. Pattern Recognit. 1980, 12, 327-331.
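The slide states that dashed bonds are constructed with the Hough transform. As a rough illustration of the idea only, the sketch below votes dash centroids into (theta, rho) bins and returns the bin that collects the most points, that is, the dashes that lie on one common line. The function name `hough_collinear`, the use of centroids as input, and the bin sizes are assumptions, not details taken from CLiDE Pro.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def hough_collinear(
    points: List[Point], n_theta: int = 180, rho_step: float = 1.0
) -> Tuple[float, float, List[int]]:
    """Return (theta, rho, member indices) of the line supported by most points.

    A line is written in normal form: x*cos(theta) + y*sin(theta) = rho.
    """
    votes: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            # Every point votes once per angle, in the bin of its quantized rho.
            votes[(t, round(rho / rho_step))].append(idx)
    (t, r), members = max(votes.items(), key=lambda item: len(item[1]))
    return math.pi * t / n_theta, r * rho_step, members


if __name__ == "__main__":
    # Four dash centroids on the diagonal y = x, plus one unrelated component.
    centroids = [(0.0, 0.0), (2.0, 2.0), (4.0, 4.0), (6.0, 6.0), (5.0, 0.0)]
    theta, rho, members = hough_collinear(centroids)
    print(round(math.degrees(theta)), rho, members)
```

Because several neighbouring angle bins can tie, the returned angle is only accurate to a few degrees; a practical implementation would refine the line by a least-squares fit through the member points.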

  9. CLiDE Pro’s solutions to Problem 3 • Problem 3: Interpretation of generic structures • 1. Generic text interpretation (GTI): R-groups, substitution values, labels • Currently, GTI is limited to text in which an ‘=’ sign separates the R-groups from the substituents. • However, combined assignments to R-groups are handled successfully. • 2. Association of the generic text block with the structure by matching R-groups present in both the text and the structure
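A minimal sketch of generic text interpretation under the constraint described on this slide: every clause must contain an ‘=’ sign, and combined assignments such as `R1 = R2 = H` give the same substituents to each R-group. The clause separator `;`, the alternative separator `,`, and both function names are hypothetical choices made for this example; the slides do not specify CLiDE Pro's actual text grammar.

```python
from typing import Dict, Iterable, List


def parse_generic_text(text: str) -> Dict[str, List[str]]:
    """Map each R-group label to its list of substituent alternatives."""
    table: Dict[str, List[str]] = {}
    for clause in text.split(";"):
        if "=" not in clause:
            continue  # clauses without '=' are outside the supported form
        # The last '='-separated part holds the values; the rest are R-groups.
        *groups, values = [part.strip() for part in clause.split("=")]
        substituents = [v.strip() for v in values.split(",") if v.strip()]
        for group in groups:
            if group:
                table.setdefault(group, []).extend(substituents)
    return table


def match_r_groups(
    table: Dict[str, List[str]], structure_labels: Iterable[str]
) -> Dict[str, List[str]]:
    """Keep only the R-groups that occur both in the text and in the structure."""
    labels = set(structure_labels)
    return {group: subs for group, subs in table.items() if group in labels}


if __name__ == "__main__":
    parsed = parse_generic_text("R1 = R2 = H; R3 = Me, Et")
    print(parsed)
    print(match_r_groups(parsed, ["R1", "R3", "O", "N"]))
```

Matching on shared labels mirrors the second step on the slide: a text block is associated with a structure only through the R-groups they have in common.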

  10. Alignment of Atom Labels • Two types of alignment of atom labels with more than one character: • Horizontal atom labels • Vertical atom labels • [Figures: examples]

  11. Alignment of Atom Labels • The interpreted structure in CLiDE Pro’s GUI • [Figures: input image; constructed molecule]

  12. Ambiguity in interpretation • Horizontal lines representing dashes of a dashed wedged bond • A horizontal line representing a negative charge • Contextual analysis

  13. Ambiguity in interpretation • The interpreted structure in CLiDE Pro’s GUI • [Figures: input image; constructed molecule]

  14. Ambiguity in interpretation • A vertical line that is part of a double bond • Vertical lines representing iodine atoms • Contextual analysis

  15. Ambiguity in interpretation • The interpreted structure in CLiDE Pro’s GUI • [Figures: input image; constructed molecule]

  16. Ambiguity in interpretation • Circles represent: oxygen atoms or aromatic rings • Contextual analysis

  17. Ambiguity in interpretation • [Figures: input image; constructed molecule]

  18. Crossing bonds in bridged molecule • No extra carbon atom is generated at the point where bonds cross each other • Functional groups are expanded in the exported structure • [Figures: input image; constructed molecule]
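One way to avoid creating a spurious carbon atom where two bonds cross is to distinguish a true crossing (the segments intersect in their interiors) from a junction (the segments share an endpoint). The sketch below uses the standard orientation test for this; it is an illustration under that assumption, and the function names are hypothetical rather than taken from CLiDE Pro.

```python
from typing import Tuple

Point = Tuple[float, float]


def orientation(p: Point, q: Point, r: Point) -> float:
    """Signed area of triangle p, q, r: positive if r lies to the left of p->q."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def bonds_cross(a: Point, b: Point, c: Point, d: Point) -> bool:
    """True if segment a-b and segment c-d intersect strictly inside both.

    Touching at an endpoint or lying on a common line returns False, so
    ordinary junctions between bonds are not reported as crossings.
    """
    d1 = orientation(c, d, a)
    d2 = orientation(c, d, b)
    d3 = orientation(a, b, c)
    d4 = orientation(a, b, d)
    return d1 * d2 < 0 and d3 * d4 < 0


if __name__ == "__main__":
    # Two diagonals of a square cross in the middle: no atom should be created.
    print(bonds_cross((0, 0), (2, 2), (0, 2), (2, 0)))
    # Two bonds meeting at (2, 2): a real vertex, not a crossing.
    print(bonds_cross((0, 0), (2, 2), (2, 2), (4, 0)))
```

In a pipeline that joins line ends to form implicit carbon atoms, pairs flagged by `bonds_cross` would simply be excluded from the joining step.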

  19. A generic structure • R = H • R = Me • [Figures: input image; constructed molecule]

  20. Bad image quality • Isolated black spots (noise from scanning) • Black spots touching one CC • Black spots merging two or more CCs • [Figures: input image; constructed molecule]
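The first noise case listed above, isolated black spots, is the simplest to illustrate. The sketch below clears every black pixel that has no black neighbour among its eight surrounding pixels. It is an assumption-laden example, not CLiDE Pro's method: the image format (rows of `#` and `.`) and the function name are hypothetical, and the other two cases (spots touching or merging connected components) are not addressed.

```python
from typing import List


def remove_isolated_pixels(image: List[str]) -> List[str]:
    """Clear every '#' pixel that has no '#' among its eight neighbours."""
    rows = len(image)
    cleaned: List[str] = []
    for r, row in enumerate(image):
        new_row = []
        for c, pixel in enumerate(row):
            if pixel != "#":
                new_row.append(pixel)
                continue
            # Look at the original image so earlier removals do not cascade.
            has_neighbour = any(
                0 <= r + dr < rows
                and 0 <= c + dc < len(image[r + dr])
                and image[r + dr][c + dc] == "#"
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            new_row.append("#" if has_neighbour else ".")
        cleaned.append("".join(new_row))
    return cleaned


if __name__ == "__main__":
    scan = [
        "#....",
        "...##",
        ".....",
    ]
    print("\n".join(remove_isolated_pixels(scan)))
```

A filter this blunt would also erase legitimate single-pixel marks, so at realistic scan resolutions a size threshold on whole connected components is the more likely choice.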

  21. Bad image quality • [Figures: input image; constructed molecule]

  22. Conclusions and Outlook • CLiDE Pro, a chemical OCR tool • 3 main problems in chemical OCR and CLiDE Pro’s solutions • The quality of interpretation depends on the ability to deal with difficult situations such as: - ambiguous drawing features - distortions resulting from bad image quality • Goal: to extend CLiDE Pro to further chemical drawing features such as: - Reaction schemes (partly implemented) - Improved generic text interpretation (dealing with tables of R-groups) - Frequency variation in Markush structures - Positional variation in Markush structures - Other difficult situations (e.g. missing bonds between ring atoms)

  23. Palytoxin: a complex structure • [Figures: input image; constructed molecule]

  24. Further Information • CLiDE Pro is licensed with Keymodule Ltd. and SimBioSys Inc. • http://www.keymodule.co.uk • http://www.simbiosys.ca • Live demo at Booth #817 • Acknowledgments: people who previously worked on CLiDE
