
Empirical Validation of the Effectiveness of Chemical Descriptors in Data Mining



Presentation Transcript


  1. Empirical Validation of the Effectiveness of Chemical Descriptors in Data Mining Kirk Simmons DuPont Crop Protection Stine-Haskell Research Center 1090 Elkton Road Newark, DE 19711 kirk.a.simmons@usa.dupont.com

  2. The Study • Purpose • Strategy • Methods • Metrics • Results • Practical Application • Conclusions

  3. Purpose • Chemical Structure Conference (1996) – Holland • Data mining/similarity methodologies reported • Used numerous descriptor sets • No standard datasets • Comparisons difficult • Comparative study of chemical descriptors across varied biology

  4. Strategy • Systematically evaluate descriptors within a compound dataset across multiple biological endpoints • All compounds have experimentally measured endpoints • Diversity of biological endpoints • In-Vitro (receptor affinity, enzyme inhibition) • In-Vivo (insect mortality) • Explored nine common descriptor sets • Train and then use model to forecast a validation set
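The slides do not show how this descriptor-by-endpoint evaluation was scripted; as a rough, hypothetical sketch of the design in Python (all names here are placeholders, not from the presentation):

```python
# Rough sketch of the study design described above: each of the nine descriptor
# sets is trained and then forecast against every biological endpoint, and the
# results are tabulated per (descriptor set, endpoint) pair. All names
# (descriptor_sets, endpoints, train_model, score_forecast) are placeholders.
def run_study(descriptor_sets, endpoints, train_model, score_forecast):
    """descriptor_sets: {name: descriptor matrix}; endpoints: {name: measured labels}."""
    table = {}
    for desc_name, X in descriptor_sets.items():
        for assay_name, y in endpoints.items():
            model, validation_data = train_model(X, y)   # fit on the training subset
            table[(desc_name, assay_name)] = score_forecast(model, *validation_data)
    return table
```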

  5. Methods • Four In-Vitro assays • 48K compound dataset for training • Corporate database for validation • Two In-Vivo assays • 75-100K compound datasets • Randomly divided into training and validation subsets • Recursive Partitioning - analytic method • Appropriate method for HTS data • Selected statistically conservative inputs (p-tail < 0.01)
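The presentation does not name the recursive-partitioning software or the exact input-selection procedure; a minimal sketch of the workflow on this slide, assuming scikit-learn's decision tree as a stand-in and hypothetical numpy arrays `X` (descriptors) and `y` (0/1 activity):

```python
# Illustrative sketch only: the talk does not name its software, so this uses
# scikit-learn's DecisionTreeClassifier as a stand-in for recursive partitioning.
# X (descriptor matrix) and y (0/1 activity labels) are assumed numpy arrays.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_rp_model(X, y, p_cutoff=0.01, seed=0):
    """Train a recursive-partitioning (decision tree) model on a random
    training/validation split, keeping only statistically conservative
    descriptors (two-sample t-test p-value below p_cutoff)."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)

    # Keep descriptors whose distributions differ between actives and inactives.
    _, p_values = ttest_ind(X_train[y_train == 1], X_train[y_train == 0], axis=0)
    keep = p_values < p_cutoff

    model = DecisionTreeClassifier(min_samples_leaf=50, random_state=seed)
    model.fit(X_train[:, keep], y_train)
    return model, keep, (X_valid, y_valid)
```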

  6. Metrics • 4-way Interaction • Analytic Method, Compound Set, Biology, and Descriptors • Efficiency of analysis (Lift Chart) • Fraction of Actives found/Fraction of Dataset tested • Rewards efficiency only • Effectiveness of analysis (Composite Score) • Fraction of Actives found x Efficiency • Rewards efficiency as well as completeness
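A minimal sketch of the two metrics exactly as defined on this slide (the function names are ours, not from the study):

```python
# Minimal sketch of the slide's two metrics; the study's own implementation
# is not shown, so these function names are placeholders.
def lift(actives_found, total_actives, n_tested, n_total):
    """Efficiency: fraction of actives found divided by fraction of dataset tested."""
    return (actives_found / total_actives) / (n_tested / n_total)

def composite_score(actives_found, total_actives, n_tested, n_total):
    """Effectiveness: fraction of actives found times the efficiency (lift),
    so it rewards completeness as well as efficiency."""
    recall = actives_found / total_actives
    return recall * lift(actives_found, total_actives, n_tested, n_total)

# Example: testing 10% of the dataset and recovering 40% of the actives
# gives a lift of 0.4 / 0.1 = 4.0 and a composite score of 0.4 * 4.0 = 1.6.
```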

  7. Results - Training

  8. Results - Forecasting

  9. Averaged Results - Training

  10. Averaged Results - Forecasting

  11. Practical Application • RP-based models using screening data on 3 targets • Activity treated as active/inactive • DiverseSolutions® BCUT descriptors • RP models used to forecast vendor compounds (1M) • Selected compounds purchased/screened • Hit rates improved 530% over training sets • New structures and improved activity
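The forecasting step itself is not shown in the slides; a hypothetical sketch, reusing the `model` and `keep` mask from the earlier training sketch and assuming a precomputed BCUT descriptor matrix for the vendor library (`vendor_X`, `vendor_ids`, and the pick size are placeholders):

```python
# Hypothetical sketch of the forecasting step: score a large vendor library
# with a trained RP model and pick the top-ranked compounds for purchase.
# vendor_X, vendor_ids, and n_pick are assumptions, not values from the talk.
import numpy as np

def select_candidates(model, keep, vendor_X, vendor_ids, n_pick=10_000):
    """Rank vendor compounds by predicted probability of activity and
    return the identifiers of the top n_pick candidates."""
    proba = model.predict_proba(vendor_X[:, keep])[:, 1]
    order = np.argsort(proba)[::-1]          # highest predicted activity first
    return [vendor_ids[i] for i in order[:n_pick]]
```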

  12. Historical Screening Results

  13. RP-based Screening Results

  14. Results Comparison

  15. Conclusions • Not all chemical descriptors are equally effective • Whole-molecule property-based descriptors less effective • Chemical feature-based descriptors appear more effective • Training model effectiveness • Averaged 28% of theory • Room for 4-fold improvement • Validation model effectiveness • Averaged 16% of theory • Room for 6-fold improvement

  16. Acknowledgements • Dr. Linrong Yang, FMC Corporation • Completed the work • FMC Corporation • Release of the results • Prof. Peter Willett, University of Sheffield • Prof. Alex Tropsha, University of North Carolina • Prof. Doug Hawkins, University of Minnesota • DuPont Corporation
