
SPECTRa-T Project

SPECTRa-T Project. Alan Tonge. Semantic Web Data Repositories from Chemistry e-Thesis Data Mining. Open Repositories 2008, Southampton University, 2 April 2008. Project Overview: Submission, Preservation and Exposure of Chemistry Teaching and Research Data.


Presentation Transcript


  1. SPECTRa-T Project
     Alan Tonge
     Semantic Web Data Repositories from Chemistry e-Thesis Data Mining
     Open Repositories 2008, Southampton University, 2 April 2008

  2. Project Overview
     Submission, Preservation and Exposure of Chemistry Teaching and Research Data
     • 12-month project between University of Cambridge and Imperial College London to develop text- and data-mining tools to extract chemical data from e-theses
     • Part of the JISC Digital Repositories programme – in Theses

  3. Background
     • Chemistry is an experimental science
     • Synthetic Organic Chemistry is the basis of the Pharmaceutical and Agrochemical industries
     Where does the information to make this molecule come from?
     Systematic Name: Ethyl 4,5-epoxy-hex-2-enoate
     Molecular Formula: C8H12O3

  4. Search
     Chemical patent & journal abstracting services, e.g.
     • Chemical Abstracts (9000+ journals; 12,000 structures/day)
     • Beilstein (180 core journals)
     • Patents (CAS, Derwent, MDL) (400,000/annum)
     Academic chemistry publications are largely derived from PhD theses
     • Perhaps ~10K published per year worldwide
     • Synthetic theses: each contains 50-60 preparations, of which only 20% are published in detail

  5. • List of Starting Materials & Reagents
     • Recipe: Reaction Conditions & Work-up
     • Product Characterization – spectroscopic & physical properties

  6. Sample preparation from synthetic chemistry thesis

  7. The Problem
     • ~80% of (academic) synthetic preparations remain locked in theses
     • Manual abstraction (cf. journals/patents) is not an option
     The Solution
     • OSCAR3: automatic high-throughput chemical name and chemical term recognition
     • Open Source Chemistry Analysis Routines (OSCAR) is an extensible Open Source framework which can identify much of the chemical terminology in electronic articles
     • Semantic Web: deposit extracted terms in a searchable RDF triplestore

  8. OSCAR Name recognition:
     1. Dictionary of chemical names/terms (ChEBI Ontology)
     2. Rules; chemical suffix filters
     3. Regular expressions to recognise: data, formulae
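
The three recognition steps on this slide can be illustrated with a toy classifier. This is a minimal sketch and not OSCAR3's actual code: the dictionary entries, the suffix list, and both regular expressions are hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical mini-dictionary; OSCAR3 draws on the far larger ChEBI ontology.
DICTIONARY = {"benzene", "ethanol", "chloroform"}

# Hypothetical suffix filter for words that "look chemical".
CHEMICAL_SUFFIXES = ("ane", "ene", "yne", "ol", "one", "oate", "amine")

# Illustrative patterns for a molecular formula and an IR peak list.
FORMULA_RE = re.compile(r"^(?:[A-Z][a-z]?\d*){2,}$")
IR_DATA_RE = re.compile(r"^\d{3,4}(?:,\s*\d{3,4})*\s*cm-1$")


def classify(token):
    """Label a token by trying dictionary, suffix rule, then regexes."""
    lowered = token.lower()
    if lowered in DICTIONARY:
        return "dictionary"
    if len(lowered) > 5 and lowered.endswith(CHEMICAL_SUFFIXES):
        return "suffix"
    if FORMULA_RE.match(token):
        return "formula"
    if IR_DATA_RE.match(token):
        return "data"
    return None  # not recognised as chemical


for token in ["Benzene", "hexanone", "C8H12O3", "1672, 1604 cm-1", "thesis"]:
    print(token, "->", classify(token))
```

A suffix rule this crude also accepts ordinary words such as "everyone", which is one reason a real recogniser needs much richer rules than this sketch.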

  9. Input: PDF (Legacy Format)
     PDF is the de facto format for electronic document deposition in digital repositories
     Problem: PDF is a Page Description Format, optimized for human, not machine, readability
     • irregular word order
     • line-breaks: loss of continuous text; paragraphs difficult to identify
     • loss of subscripts and superscripts
     • non-printing characters
     • erroneous character assignment with OCR

  10. PDF used ‘as is’
      Programmatic modifications to:
      • Remove line breaks from extended chemical names
      • Remove text fragments derived from Figures and Tables
      • Correct whitespace in chemical names
      [Diagram labels: PDF, UTF-8 text, OSCAR3, SAF XML, XSLT, RDF statements]
      OSCAR used ‘as is’ on PDF e-theses:
      • Gives 5000 terms / thesis (80% duplicates)
      • Cannot identify chemical objects (spectra assignments; properties)
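
Two of the modifications on this slide (removing line breaks from extended chemical names and correcting whitespace inside them) can be sketched with regular expressions. This is an illustrative sketch only: the function name and the patterns are assumptions, not the project's code, and the patterns are deliberately narrow.

```python
import re


def clean_pdf_text(text):
    """Repair chemical names damaged by PDF text extraction (toy rules)."""
    # 1. Rejoin a name split after a hyphen at a line end:
    #    "5,6-epoxy-\n6-methyl" -> "5,6-epoxy-6-methyl"
    text = re.sub(r"(?<=[\w)\]])-[ \t]*\n[ \t]*(?=[\w(\[])", "-", text)
    # 2. Remove the space after a locant comma: "2, 4-dinitro" -> "2,4-dinitro"
    text = re.sub(r"(?<=\d),\s+(?=\d+-)", ",", text)
    # 3. Remove the space after a locant hyphen: "4- dinitro" -> "4-dinitro"
    text = re.sub(r"(?<=\d)-\s+(?=[a-z])", "-", text)
    return text


raw = "(3E,5S,6S)-8-(p-Methoxy-benzyloxy)-5,6-epoxy-\n6-methyl-oct-3-en-2-one"
print(clean_pdf_text(raw))
print(clean_pdf_text("2, 4- dinitrobenzene"))
```

Rule 1 also rejoins ordinary words hyphenated at a line end while keeping the hyphen, so a production pipeline would need to distinguish chemical names from prose before applying it.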

  11. Input: MS Office Open XML – ‘docx’
      • No information loss from student’s deposited thesis (written with MS software)
      • Identification of experimental sections no longer a problem -> Chemical Objects (COs)
      • Conversion of COs into Chemical Markup Language (CML)
      [Diagram labels: DocX, OSCAR3, Extract chemical terms, RDF statements, Extract chemical objects, CML data files, Link together (URI), Data Repository]

  12. Sample preparation from synthetic chemistry thesis

  13. CML Infra-Red ASSIGNMENTS

      <cml:spectrum type="cml:ir">
        <cml:conditionList>
          <cml:condition title="the form of the IR spectrum" dictRef="cml:irform">film</cml:condition>
        </cml:conditionList>
        <cml:peakList>
          <cml:peak id="p1" xValue="3446" title="OH" />
          <cml:peak id="p2" xValue="3062" title="unassigned" />
          <cml:peak id="p3" xValue="3029" title="unassigned" />
          <cml:peak id="p4" xValue="2922" title="unassigned" />
          <cml:peak id="p5" xValue="1672" title="C=O" />
          <cml:peak id="p6" xValue="1604" title="C=C" />
          <cml:peak id="p7" xValue="1496" title="unassigned" />
          <cml:peak id="p8" xValue="1454" title="unassigned" />
          <cml:peak id="p9" xValue="1366" title="unassigned" />
          <cml:peak id="p10" xValue="1299" title="unassigned" />
          <cml:peak id="p11" xValue="1135" title="unassigned" />
          <cml:peak id="p12" xValue="1078" title="unassigned" />
          <cml:peak id="p13" xValue="974" title="unassigned" />
        </cml:peakList>
      </cml:spectrum>

      CML C-13 NMR ASSIGNMENTS

      <cml:spectrum type="cml:cnmr">
        <cml:parameterList>
          <cml:parameter dictRef="cml:frequency" units="units:MHz">50</cml:parameter>
        </cml:parameterList>
        <cml:substanceList>
          <cml:substance ref="" />
        </cml:substanceList>
        <cml:peakList>
          <cml:peak xValue="198.6" integral="" peakMultiplicity="" title="C=O" />
          <cml:peak xValue="198.5" integral="" peakMultiplicity="" title="" />
          <cml:peak xValue="145.0" integral="" peakMultiplicity="" title="C" />
          <cml:peak xValue="142.7" integral="" peakMultiplicity="" title="C" />
          <cml:peak xValue="137.3" integral="" peakMultiplicity="" title="CH2" />
          <cml:peak xValue="136.7" integral="" peakMultiplicity="" title="CH2" />
          <cml:peak xValue="129.1" integral="" peakMultiplicity="" title="" />
          <cml:peak xValue="128.6" integral="" peakMultiplicity="" title="" />
          <cml:peak xValue="126.7" integral="" peakMultiplicity="" title="" />
          <cml:peak xValue="124.0" integral="" peakMultiplicity="" title="aryl-C" />
          <cml:peak xValue="62.5" integral="" peakMultiplicity="" title="CH" />
          <cml:peak xValue="59.0" integral="" peakMultiplicity="" title="CH" />
          <cml:peak xValue="55.2" integral="" peakMultiplicity="" title="CH" />
          <cml:peak xValue="54.9" integral="" peakMultiplicity="" title="CH" />
          <cml:peak xValue="38.5" integral="" peakMultiplicity="" title="CH2" />
          <cml:peak xValue="32.8" integral="" peakMultiplicity="" title="CH2" />
          <cml:peak xValue="26.1" integral="" peakMultiplicity="" title="CH3" />
          <cml:peak xValue="26.0" integral="" peakMultiplicity="" title="CH3" />
        </cml:peakList>
      </cml:spectrum>
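
Once a spectrum is held as CML, its peaks can be read back with any XML parser. The sketch below uses Python's standard library on a shortened copy of the IR peak list. The namespace URI bound to the cml: prefix is an assumption (the slide fragment does not declare one), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the cml: prefix; the slide does not declare it.
CML_NS = "http://www.xml-cml.org/schema"

# Shortened copy of the IR spectrum from the slide.
SAMPLE = f"""
<cml:spectrum xmlns:cml="{CML_NS}" type="cml:ir">
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p5" xValue="1672" title="C=O" />
    <cml:peak id="p6" xValue="1604" title="C=C" />
  </cml:peakList>
</cml:spectrum>
"""


def assigned_peaks(cml_text):
    """Return (position, assignment) pairs for peaks that carry an assignment."""
    root = ET.fromstring(cml_text)
    peaks = []
    for peak in root.iter(f"{{{CML_NS}}}peak"):
        title = peak.get("title", "")
        if title and title != "unassigned":
            peaks.append((float(peak.get("xValue")), title))
    return peaks


print(assigned_peaks(SAMPLE))
```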

  14. RDF - Resource Description Framework
      A component of the Semantic Web, it is based upon the idea of making statements about resources/data in the form of a subject-predicate-object (or resource-property-value) expression, called a triple, e.g.:
      My_thesis has_chemical_entity 2,4-dinitrobenzene
      The value of one property can in turn be used as the resource for another.

  15. SPARQL QUERY

      PREFIX st: <http://wwmm.ch.cam.ac.uk/spectra-t#>
      PREFIX dcrdf: <http://purl.org/metadata/dublin_core#>
      CONSTRUCT {
        ?thesis st:hasBicycloMoleculeAndHNMR ?chemical .
        ?thesis dcrdf:author ?author
      }
      WHERE {
        ?thesis dcrdf:creator ?author .
        ?thesis st:hasChemicalName ?annot .
        ?annot st:chemicalName ?chemical .
        ?annot st:hasHNMRSpectrum ?hnmr .
        FILTER regex(?chemical, ".*bicyclo.*") .
      }

      RDF TRIPLESTORE ENTRY

      <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:dcrdf="http://purl.org/metadata/dublin_core#"
               xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:spectra-t="http://wwmm.ch.cam.ac.uk/spectra-t#">
        <rdf:Description rdf:about="file:/C:/spectra-t-theses/Juergen_Harter.docx">
          <spectra-t:hasChemicalName>
            <rdf:Description>
              <spectra-t:chemicalName>CDCl3</spectra-t:chemicalName>
              <spectra-t:hasSMILES>ClC([2H])(Cl)Cl</spectra-t:hasSMILES>
              <spectra-t:hasInChI>InChI=1/CHCl3/c2-1(3)4/h1H/i1D</spectra-t:hasInChI>
            </rdf:Description>
          </spectra-t:hasChemicalName>
          <spectra-t:hasChemicalName>
            <rdf:Description>
              <spectra-t:chemicalName>1-Benzyloxy-but-3-yne</spectra-t:chemicalName>
              <spectra-t:hasSMILES>C#CCCOCC1=CC=CC=C1</spectra-t:hasSMILES>
              <spectra-t:hasInChI>InChI=1/C11H12O/c1-2-3-9-12-10-11-7-5-4-6-8-11/h1,4-8H,3,9-10H2</spectra-t:hasInChI>
              <spectra-t:hasHNMRSpectrum>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasHNMRSpectrum>
              <spectra-t:hasCMLMolecule>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasCMLMolecule>
              <spectra-t:hasPreparation>http://ch.cam.ac.uk:8182/1ea7f8cd07/preparation-0.sci.xml</spectra-t:hasPreparation>
            </rdf:Description>
          </spectra-t:hasChemicalName>
          <spectra-t:hasChemicalName>
            <rdf:Description>
              <spectra-t:chemicalName>(3E,5S,6S)-8-(p-Methoxy-benzyloxy)-5,6-epoxy-6-methyl-oct-3-en-2-one</spectra-t:chemicalName>
              <spectra-t:hasHNMRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHNMRSpectrum>
              <spectra-t:hasIRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasIRSpectrum>
              <spectra-t:hasMassSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasMassSpectrum>
              <spectra-t:hasHRMSSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHRMSSpectrum>
              <spectra-t:hasPreparation>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/preparation-20.sci.xml</spectra-t:hasPreparation>
            </rdf:Description>
          </spectra-t:hasChemicalName>
        </rdf:Description>
      </rdf:RDF>

      RESULT

      <rdf:Description rdf:about="file:/C:/spectra-t-articles/B207708F.docx">
        <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
        <dcrdf:author>N.R.Champness</dcrdf:author>
        <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
        <dcrdf:author>N.R.Champness</dcrdf:author>
        <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
        <dcrdf:author>N.R.Champness</dcrdf:author>
        <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
        <dcrdf:author>N.R.Champness</dcrdf:author>
        <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
        <dcrdf:author>N.R.Champness</dcrdf:author>
        <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
        <dcrdf:author>N.R.Champness</dcrdf:author>
      </rdf:Description>
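
The logic of the WHERE clause (join thesis to annotation, require an H NMR spectrum, filter names on "bicyclo") can be mimicked over plain tuples. This is a minimal sketch, not a SPARQL engine: the triples, the author name, and the blank-node labels such as "_:a1" are hypothetical, and duplicate handling is simplified.

```python
import re

# Hypothetical (subject, predicate, object) triples; "_:a1" etc. stand in
# for the anonymous annotation nodes of a triplestore entry.
TRIPLES = [
    ("thesis1", "dcrdf:creator", "A. Author"),
    ("thesis1", "st:hasChemicalName", "_:a1"),
    ("_:a1", "st:chemicalName", "5-Phenyl-bicyclo[4.2.1]nona-3,7-diene"),
    ("_:a1", "st:hasHNMRSpectrum", "data-0.cml"),
    ("thesis1", "st:hasChemicalName", "_:a2"),
    ("_:a2", "st:chemicalName", "CDCl3"),
    ("thesis1", "st:hasChemicalName", "_:a3"),
    ("_:a3", "st:chemicalName", "bicyclo[2.2.1]heptane"),  # no H NMR spectrum
]


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def bicyclo_with_hnmr():
    """Return (thesis, chemical name, author) rows matching the query pattern."""
    rows = []
    for thesis, predicate, author in TRIPLES:
        if predicate != "dcrdf:creator":
            continue
        for annot in objects(thesis, "st:hasChemicalName"):
            if not objects(annot, "st:hasHNMRSpectrum"):
                continue  # the pattern requires an H NMR spectrum
            for name in objects(annot, "st:chemicalName"):
                if re.search("bicyclo", name):  # same effect as ".*bicyclo.*"
                    rows.append((thesis, name, author))
    return rows


print(bicyclo_with_hnmr())
```

Only the first annotation survives: CDCl3 fails the name filter and the heptane entry has no H NMR spectrum attached.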

  16. Message to repository managers:
      • PDF is a limited format for data extraction from e-theses
      • Docx allows chemical data object extraction (~80% precision / recall)
      Solutions:
      • Domain ontology development
      • Make your e-theses public!
      Caveats (proof-of-concept):
      • Single subject area (synthetic organic chemistry)
      • Single institution docx (limited variation in document structure)
      • Limited thesis availability

  17. Acknowledgements
      • Project Director: Peter Morgan, UL Cambridge
      • Chemistry leads: Henry Rzepa, Peter Murray-Rust
      • Developers: Jim Downing, Diana Stewart, Joe Townsend, Matt Harvey
      • Project Manager: Alan Tonge
      http://www.lib.cam.ac.uk/spectra-t/

  18. SPECTRa Tools Workshop
      Autumn 2008, Unilever Centre, Cambridge, UK
      Contact: Peter Murray-Rust (pm286@cam.ac.uk), Peter Morgan (pbm2@cam.ac.uk)
