Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data? Antony Williams, Wolfram Summit, September 2010
A Pragmatic Vision “Build a Structure Centric Community” • Integrate chemistry across the internet based on “chemical structure” • A “structure-based hub” to information and data • Let chemists contribute their own data • Allow the community to curate/correct data
We Answer Questions for Chemists • Questions a chemist might ask… • What is the melting point of n-heptanol? • What is the chemical structure of Xanax? • Chemically, what is phenolphthalein? • What are the stereocenters of cholesterol? • Where can I find publications about xylene? • What are the different trade names for Aspirin? • What is the NMR spectrum of Benzoic Acid? • What are the safety handling issues for toluene?
Available Information… • Linked to vendors, safety data, toxicity, metabolism
ChemSpider Today • 24.8 million structures • 400 data sources • Grows daily • Community annotation and curation • We curate, edit, change, enhance data daily
Three Years of Experience • Internet-based chemistry is a mess! • Most public compound databases on the web are contaminated. Including ours! • The annotation/curation of data online is difficult • Most database hosts are non-responsive to feedback – “We are a host/repository of data” • Who cares?
Where is chemistry online? • Encyclopedic articles (Wikipedia) • Chemical vendor databases • Metabolic pathway databases • Property databases • Patents with chemical structures • Drug Discovery data • Scientific publications • Compound aggregators • Blogs/Wikis and Open Notebook Science
MeSH – Medical Subject Headings • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
Does stereochemistry matter? • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione” • Variants of systematic names on PubChem • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl • 2-methyl-3-[(E)-3,7,11,15-tetramethyl • 2-methyl-3-(3,7,11,15-tetramethyl • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
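One way to see that these variants describe the same skeleton with different stereochemistry is to strip the stereodescriptors and compare what is left. The following is only a rough sketch: the regular expression and the `strip_stereo` helper are illustrative assumptions, not how PubChem or ChemSpider compare names, and the name fragments are copied from the slide as-is (truncated there, so truncated here). Robust comparison would work on structures rather than on name strings.

```python
import re
from collections import defaultdict

# Name fragments exactly as listed on the slide (truncated in the source).
names = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Assumed pattern: a parenthesised group of E/Z/R/S descriptors, each with an
# optional locant, followed by a hyphen, e.g. "(E,7R,11S)-" or "(E)-".
STEREO = re.compile(r"\((?:\d*[ERSZ])(?:,\d*[ERSZ])*\)-")


def strip_stereo(name):
    """Return (skeleton, descriptors) for a name or name fragment."""
    match = STEREO.search(name)
    descriptors = match.group(0)[1:-2] if match else ""
    # Crude heuristic: once the descriptor group is gone, the square brackets
    # play the role of the plain parentheses in the unspecified name.
    skeleton = STEREO.sub("", name).replace("[", "(").replace("]", ")")
    return skeleton, descriptors


# Group the descriptor sets under the skeleton they decorate.
groups = defaultdict(set)
for name in names:
    skeleton, descriptors = strip_stereo(name)
    groups[skeleton].add(descriptors or "(none)")

for skeleton, variants in groups.items():
    print(skeleton, sorted(variants))
```

Under these assumptions all eight entries collapse to a single skeleton carrying several distinct descriptor sets, ranging from fully specified to no stereochemistry at all.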
Wikipedia, C&E News, PubChem C&E News (from ACS)
Internet-Based Chemistry is a Mess • Algorithms can only get you so far • Human curation is necessary • Only the crowds can help with big data… ChemSpider is approaching 25 million compounds
“Curate” Identifiers • General curation activities • Remove incorrect names • Correct spellings • Add multilingual names • Add alternative names • In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually • 130 people have participated in validation or annotation. “Crowds” can be quite small!
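The idea of validating structure-identifier relationships, robotically and manually, can be pictured with a small data model. This is a hypothetical sketch only: the `Identifier` and `StructureRecord` classes, the status labels, and the curator names are all assumptions made for illustration and do not describe ChemSpider's actual schema or workflow. The example names are taken from other slides in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class Identifier:
    """One name attached to a structure, plus the curation votes it received."""
    name: str
    confirmed_by: set = field(default_factory=set)
    rejected_by: set = field(default_factory=set)

    def status(self):
        # Hypothetical status labels, chosen for this sketch.
        if self.confirmed_by and self.rejected_by:
            return "disputed"
        if self.confirmed_by:
            return "validated"
        if self.rejected_by:
            return "rejected"
        return "unreviewed"


@dataclass
class StructureRecord:
    """A structure and every identifier currently linked to it."""
    structure_key: str  # hypothetical key standing in for the structure
    identifiers: dict = field(default_factory=dict)

    def add_name(self, name):
        return self.identifiers.setdefault(name, Identifier(name))

    def vote(self, name, curator, is_correct):
        # A curator can be a person or an automated check ("robot").
        ident = self.add_name(name)
        target = ident.confirmed_by if is_correct else ident.rejected_by
        target.add(curator)

    def names_with_status(self, status):
        return sorted(
            name for name, ident in self.identifiers.items()
            if ident.status() == status
        )


record = StructureRecord("thalidomide-structure")
record.vote("Thalidomide", "robot", True)       # automated validation
record.vote("Contergan", "curator_1", True)     # manual validation of a trade name
record.vote("Xanax", "curator_1", False)        # incorrect name flagged for removal
record.add_name("Softenon")                     # added but not yet reviewed

print(record.names_with_status("validated"))
print(record.names_with_status("rejected"))
```

Keeping the votes rather than deleting a rejected name outright is a design choice of this sketch: it leaves a record of who flagged what, which matters when only a small number of people do the curating.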
Crowdsourced “Annotations” • Registered Users can add • Descriptions/Syntheses/Commentaries • Links to articles, blogs, wikis etc • Add spectral data • Add photos • Add MP3 files • Add Videos
Data Validation – Not Beclomethasone Dipropionate DailyMed Article
Data Validation – ONE Cymarin
Question Quality in Big Databases