
Antony Williams 5th Meeting on U.S. Government Chemical Databases and Open Chemistry August 2011



Presentation Transcript


  1. ChemSpider – A Crowdsourcing Environment for Hosting and Validating Chemistry Resources (and lessons from President Bush). Antony Williams, 5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011

  2. I want to know about “Vincristine”

  3. Vincristine: Identifiers and Properties

  4. Vincristine: Vendors and Sources

  5. Vincristine: Patents

  6. Vincristine: Articles

  7. Vincristine: RSC Databases

  8. Searches: The INTERNET

  9. Validated Names for Searching…

  10. And InChIs…

  11. ChemSpider • The Free Chemical Database • A central hub for chemists to source information • >26 million unique chemical records • Aggregated from >400 data sources • Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions • A central hub for chemists to deposit & curate data

  12. Essential aspects of ChemSpider • ChemSpider is a BIG database... and growing • Our focus has increasingly become QUALITY over quantity • Data curation and validation is our strength – crowdsourcing is contributing, but more is required • Validated data has enabled linking of the internet

  13. There are NO errors in ChemSpider

  14. There are NO errors in ChemSpider

  15. “All That Glisters is Not Gold”: What is the structure of Discodermolide?

  16. How to distinguish…who’s wrong?

  17. Neither is wrong

  18. Data Curation…long torturous task • Data curation – JUST structure-name validation is a long, torturous, iterative task. • How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra

  19. Hand on my heart….

  20. Hand on my heart • No offence meant by what follows! We ALL have quality issues!

  21. PHYSPROP Database • The freely downloadable database under the EPI Suite prediction software • Very basic filters suggest data quality issues

  22. The Stereochemistry challenge: 12,500 chemicals with “missed” stereo

  23. NIST Webbook

  24. NLM’s DailyMed

  25. NLM’s DailyMed

  26. NLM’s DailyMed

  27. PubChem

  28. Linking

  29. Patents

  30. Patents

  31. WYSIWYG compounds

  32. WYSIWYG compounds

  33. Data Curation…long torturous task • Data curation – JUST structure-name validation is a long, torturous, iterative task. • How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra • The crowd in crowdsourcing is… generally small • Which of the large databases are doing careful curation? How can we share the workload? Hmm...

  34. Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.

  35. Who does the Curation?

  36. ChemSpider can “do it” for us • ChemSpider has built a curation interface used by the community and ourselves for curating. • All curation activities are available for review, online immediately, iteratively checked. • Curators have different abilities based on their profile: There are only a few “Master Curators”. • Can we “share” the curation workload?

  37. Proof of Concept Data Curation Sharing

  38. Identifier Dictionaries • Reciprocal curation processes…share curation with each other. • If a database has a compound already then use InChIKeys to match “suggested” validation against the compound. • A series of “added” and “removed” synonyms against InChIKeys for matching. • Who will participate???
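
The identifier-dictionary idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ChemSpider's actual exchange format: the `CurationDelta` record, the `apply_deltas` function, and the in-memory `local_db` mapping are hypothetical names, and the InChIKeys in the usage lines are made-up placeholders rather than real compounds. The sketch shows a receiving database applying "added" and "removed" synonyms only to compounds it already holds, matched by InChIKey.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CurationDelta:
    """Hypothetical shared curation record: synonym changes for one InChIKey."""
    inchikey: str
    added: set[str] = field(default_factory=set)
    removed: set[str] = field(default_factory=set)


def apply_deltas(local_db: dict[str, set[str]],
                 deltas: list[CurationDelta]) -> list[str]:
    """Apply suggested synonym changes to compounds the local database holds.

    local_db maps InChIKey -> set of synonyms and is updated in place.
    Returns the InChIKeys that had no matching local compound and were skipped.
    """
    unmatched = []
    for delta in deltas:
        synonyms = local_db.get(delta.inchikey)
        if synonyms is None:
            # The receiving database does not have this compound; skip it.
            unmatched.append(delta.inchikey)
            continue
        synonyms -= delta.removed   # drop names a curator rejected
        synonyms |= delta.added     # add names a curator validated
    return unmatched


# Usage with placeholder (not real) InChIKeys and names.
local_db = {"AAAAAAAAAAAAAA-AAAAAAAASA-N": {"compound A", "wrong name"}}
deltas = [
    CurationDelta("AAAAAAAAAAAAAA-AAAAAAAASA-N",
                  added={"trade name A"}, removed={"wrong name"}),
    CurationDelta("BBBBBBBBBBBBBB-BBBBBBBBSA-N", added={"compound B"}),
]
unmatched = apply_deltas(local_db, deltas)
```

Returning the unmatched keys rather than inserting new compounds keeps the exchange limited to curation of records both sides already share, which is how the slide frames it.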

  39. Proof of Concept Data Curation Sharing

  40. Lessons Learned : Big vs Good!

  41. 15 compounds called Yohimbine; 54 skeletons for Yohimbine

  42. Aggregators suffer dilution…

  43. User Understanding of Data • Users searching “Yohimbine” expect to find it…not labeled versions of it, not ambiguous stereochemistries, not partial stereochemistries. • Data “aggregation” into a meaningful form is a major challenge. e.g. Assays for radiolabeled compounds linked to actual drugs. • Data curation efforts such as ChEMBL are essential!
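
One way to see why a name search can return many records is to group records by the first block of their InChIKey. In a standard InChIKey the first 14 characters hash the connectivity (the molecular skeleton), while the second block covers stereochemistry and isotopic labeling, so labeled or partially defined stereo variants share a first block. The sketch below assumes that reading of "skeletons"; the function names are hypothetical and the keys are made-up placeholders, not real Yohimbine identifiers.

```python
from collections import defaultdict


def skeleton_block(inchikey: str) -> str:
    """Return the first 14-character block of an InChIKey (connectivity hash)."""
    first, _, _ = inchikey.partition("-")
    if len(first) != 14:
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return first


def group_by_skeleton(records):
    """Group (name, inchikey) pairs by shared skeleton block."""
    groups = defaultdict(list)
    for name, inchikey in records:
        groups[skeleton_block(inchikey)].append((name, inchikey))
    return dict(groups)


# Placeholder records: two variants of one skeleton plus an unrelated compound.
records = [
    ("variant with full stereo", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("variant with partial stereo", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    ("unrelated compound", "DDDDDDDDDDDDDD-EEEEEEEESA-N"),
]
groups = group_by_skeleton(records)
for block, members in groups.items():
    print(block, len(members))
```

Grouping like this does not decide which record the user meant; it only makes the variants visible so a curator or interface can present the expected compound first.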

  44. SciMobileApps.com
