A Pragmatic Vision

Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data? Antony Williams Wolfram Summit, September 2010. A Pragmatic Vision. “Build a Structure Centric Community” Integrate chemistry across the internet based on “chemical structure”


Presentation Transcript


  1. Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data? Antony Williams, Wolfram Summit, September 2010

  2. A Pragmatic Vision: “Build a Structure Centric Community”
     • Integrate chemistry across the internet based on “chemical structure”
     • A “structure-based hub” to information and data
     • Let chemists contribute their own data
     • Allow the community to curate/correct data

  3. www.chemspider.com

  4. We Answer Questions for Chemists
     • Questions a chemist might ask…
       • What is the melting point of n-heptanol?
       • What is the chemical structure of Xanax?
       • Chemically, what is phenolphthalein?
       • What are the stereocenters of cholesterol?
       • Where can I find publications about xylene?
       • What are the different trade names for Aspirin?
       • What is the NMR spectrum of Benzoic Acid?
       • What are the safety handling issues for toluene?

  5. Search for a Chemical…by name

  6. Available Information… • Linked to vendors, safety data, toxicity, metabolism

  7. Available Information….

  8. Search for chemicals

  9. ChemSpider Today
     • 24.8 million structures
     • 400 data sources
     • Grows daily
     • Community annotation and curation
     • We curate, edit, change, enhance data daily

  10. Linked Data on the Web

  11. Three Years of Experience
      • Internet-based chemistry is a mess!
      • Most public compound databases on the web are contaminated. Including ours!
      • The annotation/curation of data online is difficult
      • Most database hosts are non-responsive to feedback – “We are a host/repository of data”
      • Who cares?

  12. Where is chemistry online?
      • Encyclopedic articles (Wikipedia)
      • Chemical vendor databases
      • Metabolic pathway databases
      • Property databases
      • Patents with chemical structures
      • Drug Discovery data
      • Scientific publications
      • Compound aggregators
      • Blogs/Wikis and Open Notebook Science

  13. What is the Structure of Vitamin K?

  14. MeSH – Medical Subject Headings • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

  15. What is the Structure of Vitamin K1?

  16. What is the Structure of Vitamin K1?

  17. Chemical Abstracts “Common Chemistry” Database

  18. Wikipedia

  19. Incorrect Structures

  20. Wow!

  21. Lack of Stereochemistry

  22. Does stereochemistry matter? • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide
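The point of slides 21 and 22 can be made concrete with a small sketch: a record that drops stereochemistry cannot tell the two mirror-image forms of thalidomide apart. The SMILES strings below were written by hand for this illustration and do not come from the slides. `strip_stereo` is a hypothetical helper that only handles the single tetrahedral centre in this molecule; it is not a general canonicalizer, and a real pipeline would rely on a cheminformatics toolkit instead.

```python
import re

# Hand-written SMILES for illustration (assumption: not taken from the slides).
# The two stereo forms differ only in the @ / @@ marker on the one stereocentre.
FLAT = "O=C1CCC(N2C(=O)c3ccccc3C2=O)C(=O)N1"
ENANTIOMER_A = "O=C1CC[C@H](N2C(=O)c3ccccc3C2=O)C(=O)N1"
ENANTIOMER_B = "O=C1CC[C@@H](N2C(=O)c3ccccc3C2=O)C(=O)N1"


def strip_stereo(smiles: str) -> str:
    """Crudely drop tetrahedral stereo markers by rewriting [C@H] or [C@@H] as C.

    Hypothetical helper: it covers only the bracket pattern used above.
    """
    return re.sub(r"\[C@{1,2}H\]", "C", smiles)


# With stereo kept, the two forms are distinct records.
distinct_with_stereo = ENANTIOMER_A != ENANTIOMER_B

# With stereo dropped, both collapse onto the same flat structure.
collapsed = strip_stereo(ENANTIOMER_A) == strip_stereo(ENANTIOMER_B) == FLAT

print(distinct_with_stereo, collapsed)
```

A database that stores only the flat form would attach every name on slide 22 to one record, with no way to say which enantiomer a given data point refers to.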

  23. PubChem

  24. “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
      • Variants of systematic names on PubChem
        • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
        • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
        • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
        • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
        • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
        • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
        • 2-methyl-3-(3,7,11,15-tetramethyl
        • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
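One way to see how the name variants on slide 24 differ is to pull the stereo descriptors out of each name and count how many are specified. The sketch below does that with plain string handling. The names are copied exactly as they appear on the slide, truncation included. `stereo_descriptors` is a hypothetical helper, and its regular expression assumes the descriptors sit in the first parenthesised group directly after an opening square bracket.

```python
import re
from collections import Counter

# Name prefixes as listed on slide 24 (truncated there, kept truncated here).
variants = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]


def stereo_descriptors(name: str) -> tuple:
    """Return the stereo labels in the '[(...)' group, e.g. ('E', '7R', '11R').

    Names without that group (no stereo information) give an empty tuple.
    """
    match = re.search(r"\[\(([^)]*)\)", name)
    return tuple(match.group(1).split(",")) if match else ()


# How many names specify 0, 1, 2 or 3 stereo elements?
by_count = Counter(len(stereo_descriptors(name)) for name in variants)

# How many distinct descriptor sets are there among the listed prefixes?
distinct = {stereo_descriptors(name) for name in variants}

for name in variants:
    print(stereo_descriptors(name), name)
```

As far as the truncated text shows, four of the eight entries specify all three stereo elements (the double bond and positions 7 and 11); the others are partial or carry no stereo information at all.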

  25. ChEBI – Manual Curation

  26. What’s Methane?

  27. What’s Methane?

  28. What ELSE is Methane???

  29. The EXPERTS must get it right?!

  30. Wikipedia, C&E News, PubChem C&E News (from ACS)

  31. Internet-Based Chemistry is a Mess
      • Algorithms can only get you so far
      • Human curation is necessary
      • Only the crowds can help with big data… ChemSpider is approaching 25 million compounds

  32. Search “Vitamin H”

  33. Search “Vitamin H”

  34. “Curate” Identifiers

  35. “Curate” Identifiers

  36. “Curate” Identifiers

  37. “Curate” Identifiers
      • General curation activities
        • Remove incorrect names
        • Correct spellings
        • Add multilingual names
        • Add alternative names
      • In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually
      • 130 people have participated in validation or annotation. “Crowds” can be quite small!
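The curation activities listed on slide 37 can be sketched as operations on a record that ties one structure to many names. This is a minimal illustration under stated assumptions, not a description of how ChemSpider stores or validates data: `CompoundRecord`, its method names, the two status strings and the identifier `"example-1"` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """Hypothetical record: one structure plus the names attached to it."""

    structure_id: str
    # Maps each name to a status: "unvalidated" or "validated".
    names: dict = field(default_factory=dict)

    def add_name(self, name: str) -> None:
        # New names (alternative or multilingual) start out unvalidated.
        self.names.setdefault(name, "unvalidated")

    def remove_name(self, name: str) -> None:
        # Remove an incorrect name; names that are not present are ignored.
        self.names.pop(name, None)

    def correct_spelling(self, wrong: str, right: str) -> None:
        # Replace a misspelt name, carrying its status over unless the
        # correctly spelt name is already on the record.
        status = self.names.pop(wrong)
        self.names.setdefault(right, status)

    def validate(self, name: str) -> None:
        # Mark a structure-identifier relationship as checked.
        if name not in self.names:
            raise KeyError(name)
        self.names[name] = "validated"


record = CompoundRecord("example-1")
record.add_name("Asprin")            # misspelt trade name
record.add_name("Salicylic acid")    # a different compound, so an incorrect name

record.remove_name("Salicylic acid")            # remove incorrect names
record.correct_spelling("Asprin", "Aspirin")    # correct spellings
record.add_name("Acide acétylsalicylique")      # add multilingual names
record.add_name("Acetylsalicylic acid")         # add alternative names
record.validate("Aspirin")

print(record.names)
```

Keeping a status per name rather than per record reflects the slide's unit of work, the structure-identifier relationship, so that a single wrong name can be fixed without touching the rest of the record.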

  38. Crowdsourced “Annotations”
      • Registered Users can add
        • Descriptions/Syntheses/Commentaries
        • Links to articles, blogs, wikis etc
        • Add spectral data
        • Add photos
        • Add MP3 files
        • Add Videos

  39. Data Validation – Not Vitamin K1

  40. Data Validation – Not Beclomethasone Dipropionate (DailyMed Article)

  41. Data Validation …NOT Cholesterol

  42. Data Validation – ONE Cymarin. Question Quality in Big Databases
