
Active Learning Strategies for Compound Screening


Presentation Transcript


  1. Active Learning Strategies for Compound Screening. Megon Walker¹ and Simon Kasif¹,². ¹Bioinformatics Program, Boston University. ²Department of Biomedical Engineering, Boston University. 229th ACS National Meeting, March 13-17, 2005, San Diego, CA

  2. Outline • Introduction to active learning for compound screening • Objectives and performance criteria • Algorithms and procedures • Thrombin dataset results • Preliminary conclusions

  3. Introduction: drug discovery • drug discovery is an iterative process • goal: identify many target-binding compounds with minimal screening iterations [Figure: iterative screening cycle linking compounds, descriptors, screening, and selection]

  4. Introduction: supervised learning • input: a data set with positive and negative examples • output: a classifier f such that f(x) = +1 if example x is positive and f(x) = -1 if x is negative • standard learning: the classifier trains on a static training set (train, then test) • active learning: the classifier chooses data points for its own training set, "requesting" labels over iterative rounds of training and testing (see the sketch below)
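
The active-learning loop on this slide can be summarized in code. The sketch below is not the authors' implementation; it is a minimal Python rendering of the train/query/retrain cycle, with `train`, `select_batch`, and `oracle` as hypothetical placeholders for the committee trainer, the sample-selection strategy, and the screening assay that supplies labels.

```python
import random

def active_learning_loop(pool, oracle, train, select_batch, batch_size, rounds):
    """Iteratively grow a labeled set by letting the learner request labels.

    pool:         list of unlabeled examples
    oracle:       returns the true +1/-1 label (i.e., the screening assay)
    train:        builds a classifier from (example, label) pairs
    select_batch: picks the next examples to label (the selection strategy)
    """
    # Seed with a random first batch (the slides also allow a chemist's pick).
    labeled = []
    for x in random.sample(pool, batch_size):
        labeled.append((x, oracle(x)))
        pool.remove(x)
    model = train(labeled)

    for _ in range(rounds):
        batch = select_batch(model, pool, batch_size)  # learner "requests" labels
        for x in batch:
            labeled.append((x, oracle(x)))
            pool.remove(x)
        model = train(labeled)  # retrain on the enlarged training set
    return model
```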

  5. Introduction: active learning & compound screening • Mamitsuka et al. Proc. 15th Int. Conf. on Machine Learning, 1998: 1-9. • Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43: 667-673. [Figure: compounds chosen in the 1st and 2nd query rounds]

  6. Objectives • exploration: an accurate model of activity, assessed by sensitivity • exploitation: hit performance, assessed by enrichment factor (EF)

  7. Methods: datasets
  • 632 DuPont thrombin-targeting compounds: 149 actives, 483 inactives (retrospective data)
  • a binary feature vector for each compound: shape-based and pharmacophore features, 139,351 features in total
  • 200 features selected by mutual information (MI) with respect to the activity labels; mean MI = 0.126 (a sketch of this selection step follows below)
  [Flowchart spanning the Methods slides: input data files → pick training and testing data for the next cross-validation round → select the 1st batch randomly or by a chemist; later batches by sample selection (P(active), uncertainty, or density) → query training-set batch labels → train a classifier committee on labeled training-set subsamples → predict compound labels by committee weighted majority vote → loop until all training-set labels are queried and cross validation is complete → accuracy and performance statistics]
  • Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43(2): 667-673.
  • Eksterowicz et al. J. Mol. Graph. Model. 2002, 20(6): 469-477.
  • Putta et al. J. Chem. Inf. Comput. Sci. 2002, 42(5): 1230-1240.
  • KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/
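
The 200-feature selection step could look like the following. This is a reconstruction under the assumption that MI is computed between each binary feature column and the binary activity labels; the authors' exact procedure is not shown on the slide.

```python
import numpy as np

def mutual_information_bits(x, y):
    """Mutual information (in bits) between two binary 0/1 vectors."""
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                px, py = np.mean(x == xv), np.mean(y == yv)
                mi += pxy * np.log2(pxy / (px * py))
    return mi

def top_k_by_mi(X, y, k=200):
    """Indices of the k feature columns with the highest MI w.r.t. labels y."""
    scores = np.array([mutual_information_bits(X[:, j], y)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]
```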

  8. Methods: cross validation
  • 5-fold cross validation (a minimal split sketch follows below)
  [Flowchart as on slide 7; inset marks the 1st and 2nd cross-validation splits]
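
For completeness, a plain 5-fold split might be implemented as below; the slides do not show the splitting code, so this is only an assumed form.

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Yield (train_idx, test_idx) index pairs for 5-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), 5)
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        yield train, test
```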

  9. Methods: perceptron
  • given: a binary input vector x, a weight vector w, a threshold value T, a learning rate η, and a classification t ∈ {+1, -1}
  • TEST: predict +1 if w · x > T, otherwise -1
  • TRAIN: if classified correctly, do nothing; if misclassified, update w ← w + η t x (see the sketch below)
  [Flowchart as on slide 7]
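
The TEST and TRAIN rules above are the standard perceptron; in code (with η written as `eta`):

```python
import numpy as np

def perceptron_test(w, x, T):
    """Predict +1 if the weighted sum exceeds the threshold T, else -1."""
    return 1 if np.dot(w, x) > T else -1

def perceptron_train_step(w, x, t, T, eta):
    """One training step: leave w alone if correct; otherwise nudge it
    toward (for t = +1) or away from (for t = -1) the input x."""
    if perceptron_test(w, x, T) != t:
        w = w + eta * t * x
    return w
```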

  10. Methods: classifier committees
  • bagging: uniform sampling distribution
  • boosting: compounds misclassified by classifier #1 are more likely to be resampled by classifier #2 (both schemes are sketched below)
  [Flowchart as on slide 7]
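
The two resampling schemes can be contrasted in a few lines. The weight-update policy for boosting is not specified on the slide, so the sketch only shows the sampling distributions themselves.

```python
import numpy as np

def bagging_indices(n, rng):
    """Bagging: draw n training examples uniformly, with replacement."""
    return rng.integers(0, n, size=n)

def boosting_indices(weights, rng):
    """Boosting-style: examples misclassified so far carry larger weights,
    making them more likely to be resampled for the next classifier."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=len(p), p=p)
```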

  11. Methods: weighted voting
  • a weighted vote of all classifiers predicts each compound's activity label (sketched below)
  [Flowchart as on slide 7]
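
A weighted majority vote over the committee, as described, reduces to the sign of a weighted sum of the individual ±1 votes:

```python
import numpy as np

def committee_predict(classifiers, weights, x):
    """Each classifier votes +1/-1; the weighted sum's sign is the label."""
    votes = np.array([clf(x) for clf in classifiers])
    return 1 if np.dot(weights, votes) > 0 else -1
```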

  12. Methods: sample selection strategies
  • P(active): select the compounds predicted active with highest probability by the committee
  • uncertainty: select the compounds on which the committee disagrees most strongly
  • density with respect to actives: select the compounds most similar to previously labeled or predicted actives, using the Tanimoto similarity metric
  • given compound bitstrings A and B, with a = # bits on in A, b = # bits on in B, and c = # bits on in both A and B: Tanimoto(A, B) = c / (a + b - c) (see the sketch below)
  [Flowchart as on slide 7]
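
The Tanimoto metric, with the slide's a, b, c counts, translates directly:

```python
import numpy as np

def tanimoto(A, B):
    """Tanimoto similarity of two binary bitstrings (0/1 numpy arrays)."""
    a = int(A.sum())                     # bits on in A
    b = int(B.sum())                     # bits on in B
    c = int(np.logical_and(A, B).sum())  # bits on in both
    return c / (a + b - c) if (a + b - c) > 0 else 0.0
```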

  13. Methods: performance criteria
  • Hit Performance
  • Enrichment Factor (EF)
  • Sensitivity
  (standard definitions are sketched below)
  [Flowchart as on slide 7]
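
The slide's formulas were shown as images; the definitions below use the standard virtual-screening forms, which may differ in detail from the authors' exact expressions.

```python
def sensitivity(tp, fn):
    """Fraction of true actives recovered: TP / (TP + FN)."""
    return tp / (tp + fn)

def enrichment_factor(hits_found, n_screened, total_actives, total_compounds):
    """Hit rate among screened compounds relative to the library-wide
    hit rate, i.e. the gain over random selection."""
    return (hits_found / n_screened) / (total_actives / total_compounds)
```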

  14. Results: hit performance

  15. Results: sensitivity • uncertainty sampling achieves the highest testing-set sensitivity initially • no significant increase in testing-set sensitivity thereafter

  16. Results: bagging vs. boosting • with boosting, training-set TP climbs faster and converges higher • boosting overfits to the training data

  17. Results: # classifiers

  18. Conclusions • Sample selection • Bagging vs. boosting • Committee vs. single classifier • Testing-set sensitivity • Trade-off between exploration and exploitation
