
Active Learning for Regression and Optimization Jeff Schneider


Presentation Transcript


  1. Active Learning for Regression and Optimization, Jeff Schneider

  2. Active Learning for Regression and Optimization

  3. Some Applications
  • placing ads on web pages
  • selecting parameters for protein crystallization
  • robot skill learning (e.g. snakes)
  • parameters for clustering algorithms to produce the best clusters
  • fitting scientific simulation models (e.g. cosmology)
  • when to take measurements of expression data in microarrays
  • drug discovery

  4. Active Learning for Regression and Optimization

  5. Active Learning in Regression
  • Optimal Methods
    • expected variance over input space
    • total entropy
  • Myopic Methods
    • expected reduction in variance after one experiment
  • Heuristic Methods
    • widest confidence interval (needs a function approximator that provides confidence intervals or distributions on outputs; see the sketch below)
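A minimal sketch of the widest-confidence-interval heuristic, not from the talk: it assumes a Gaussian-process regressor (scikit-learn's GaussianProcessRegressor) as the function approximator that supplies predictive standard deviations, and the training data and candidate pool are made up for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(5, 1))            # experiments run so far (made up)
y_train = np.sin(X_train).ravel()                     # made-up responses
candidates = np.linspace(0, 10, 200).reshape(-1, 1)   # possible next experiments

gp = GaussianProcessRegressor().fit(X_train, y_train)
mean, std = gp.predict(candidates, return_std=True)

# Widest-confidence-interval heuristic: query where the model is least certain.
next_x = candidates[np.argmax(std)]
print("next experiment at x =", next_x)
```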

  6. Active Learning for Regression and Optimization

  7. Active Learning for Optimization in Regression
  • Instead of mapping the function, suppose you want to find the optimal point on it
  • Optimal Method
    • expected value of best point found
  • Myopic Method
    • compute the expected improvement over the best so far (see the sketch below)
  • Heuristic Methods
    • widest confidence interval
    • highest upper confidence interval
    • highest lower confidence interval (safety)
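A minimal sketch of the expected-improvement criterion, not from the talk: it assumes the model's prediction at each candidate is Gaussian with the mean and standard deviation from a regressor such as the GP above, and that we are maximizing; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, f_best):
    """Expected improvement over the best value seen so far, assuming the
    prediction at each candidate is Gaussian(mean, std)."""
    std = np.maximum(std, 1e-12)              # avoid division by zero
    z = (mean - f_best) / std
    return (mean - f_best) * norm.cdf(z) + std * norm.pdf(z)

# With `mean`, `std` from a regression model over the candidate pool:
# next_x = candidates[np.argmax(expected_improvement(mean, std, y_train.max()))]
```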

  8. Active Learning for Regression and Optimization

  9. Optimization over Discrete Alternatives
  [slide graphic: "WIN!"]
  • Bandit Problems [Gittins 89] [Berry and Fristedt 85]
  • A type of active learning problem that can be solved optimally
  • Objective: choose arm pulls to maximize time-discounted rewards
  • k-armed bandit: arms are assumed to be independent
    • what happens if they are not independent? how might this happen?

  10. A Markov Decision Process
  You run a startup company. In every state you must choose between Saving money (S) or Advertising (A). Discount factor γ = 0.9.
  [state diagram: four states with rewards Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10); the S and A actions move between states with probability 1/2 or 1]

  11. Markov Decision Processes
  An MDP has…
  • A set of states {s_1, …, s_N}
  • A set of actions {a_1, …, a_M}
  • A set of rewards {r_1, …, r_N} (one for each state)
  • A transition probability function P(s_j | s_i, a_k)
  On each step:
  0. Call the current state s_i
  1. Receive reward r_i
  2. Choose an action from {a_1, …, a_M}
  3. If you choose action a_k, you move to state s_j with probability P(s_j | s_i, a_k)
  4. All future rewards are discounted by γ

  12. Bellman's Equation
  Value Iteration for solving MDPs (also known as Dynamic Programming):
  • Compute J_1(s_i) = r_i for all i
  • Compute J_2(s_i) for all i
  • ⋮
  • Compute J_n(s_i) for all i
  • … until converged
  where each sweep applies Bellman's equation J_{n+1}(s_i) = r_i + γ max_k Σ_j P(s_j | s_i, a_k) J_n(s_j) (see the sketch below).
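A minimal value-iteration sketch, not from the talk: the four-state MDP below is made up for illustration (it is not a transcription of the startup example on slide 10), with rewards attached to states and one transition matrix per action, as in the definition above.

```python
import numpy as np

GAMMA = 0.9   # discount factor, as in the slide 10 example

# A small illustrative MDP (numbers made up): rewards[i] is the reward for
# being in state i; P[a][i][j] is the probability of moving from state i to
# state j when taking action a.
rewards = np.array([0.0, 0.0, 10.0, 10.0])
P = np.array([
    [[0.5, 0.5, 0.0, 0.0],   # action 0
     [0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.5, 0.0, 0.0, 0.5]],
    [[1.0, 0.0, 0.0, 0.0],   # action 1
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]],
])

# Value iteration: repeatedly apply the Bellman backup until J stops changing.
J = np.zeros(len(rewards))
while True:
    J_new = rewards + GAMMA * np.max(P @ J, axis=0)   # max over actions
    if np.max(np.abs(J_new - J)) < 1e-8:
        break
    J = J_new
print(J)
```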

  13. Active learning with discrete alternatives
  • Model each arm as a Markov chain
  • A "pull" means taking one more time step in the chain
  • Create a Markov Decision Process by introducing a binary action choice at each state: continue in the chain, or "bail out" for a fixed payoff
  • Solve the MDP for many values of the bailout payoff
  • The Gittins index for each state is the value of the payoff at which the action choice is indifferent
  • Pull the arm with the highest index
  [diagram: the Markov chain of one Bernoulli arm, with states labeled by win/loss counts ("1 win 1 loss", "2 wins 1 loss", "1 win 2 losses", "2 wins 2 losses") and transition probabilities such as 0.5, 0.67, and 0.33]

  14. Active learning with discrete alternatives (build of slide 13: the diagram adds the bailout option at each state, a fixed payoff of M $ for stopping)

  15. Active learning with discrete alternatives (build of slide 13: for a bailout payoff of m_i $ the action choice at state i is indifferent; these indifference values m_1 … m_5 are the Gittins indices of the five states)

  16. Active learning with discrete alternatives (build of slide 13: the same construction applies to an arm summarized by n trials with mean m and std dev s, not just win/loss counts; see the sketch below)
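A minimal sketch of the bailout construction for a Bernoulli arm, not from the talk: it assumes a Beta(1,1) prior over the arm's win probability, truncates the Markov chain after a fixed number of pulls, and binary-searches the bailout payoff for the indifference point, which (per the slide) is the index of the state. The constants GAMMA and MAX_PULLS are illustrative choices.

```python
import functools

GAMMA = 0.9       # discount factor; the talk's MDP example also uses 0.9
MAX_PULLS = 200   # truncate the (infinite) Markov chain after this many pulls

def arm_value(wins, losses, bailout):
    """Value of the bailout MDP at state (wins, losses): at every step choose
    either to stop for the fixed payoff `bailout` or to pull once more."""
    @functools.lru_cache(maxsize=None)
    def V(w, l):
        if w + l >= wins + losses + MAX_PULLS:
            return bailout
        p = (w + 1) / (w + l + 2)   # posterior mean win probability, Beta(1,1) prior
        pull = p * (1 + GAMMA * V(w + 1, l)) + (1 - p) * GAMMA * V(w, l + 1)
        return max(bailout, pull)
    return V(wins, losses)

def gittins_index(wins, losses, iters=50):
    """Binary-search the bailout payoff at which pulling and stopping are
    equally good; that indifference payoff is the index of the state."""
    lo, hi = 0.0, 1.0 / (1 - GAMMA)          # a pull pays at most 1 per step
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if arm_value(wins, losses, mid) > mid + 1e-12:
            lo = mid                          # still worth pulling: index is higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Pull the arm whose current state has the highest index:
print(gittins_index(1, 1), gittins_index(2, 1))
```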

  17. Machine Learning in (in vivo CNS) Drug Discovery
  Jeff Schneider
  Associate Research Professor, School of Computer Science, Carnegie Mellon University
  (former) Chief Informatics Officer, Psychogenics, Inc.

  18. Machine Learning Application: Drug Discovery “Psychogenics is in the business of science for profit” -- Bill Fasnacht, 2004 Psychogenics CFO

  19. Machine Learning Application: Drug Discovery “Psychogenics is in the business of science for profit” -- Bill Fasnacht, 2004 Psychogenics CFO “Science is a process that should be modeled and optimized with machine learning” -- Jeff Schneider, 2004

  20. Drug Discovery Figures
  • ~10^60 possible small drug-like molecules
  • Compounds considered per drug approved: tens of thousands
  • Total time to develop a drug: 12-17 years
  • Cost per approved drug: ~$500M
  • Remaining patent life upon approval for marketing: ~7 years
  • Required annual sales to recoup costs: ~$500M
  • Number of drugs achieving those sales: dozens

  21. Drug Discovery Process
  1. Choose a disease
  2. Identify and validate a target
  3. In vitro search for compounds (leads) that interact with the target
  4. Optimize lead for potency, PK, toxicity, etc.
  5. Pre-clinical in vitro and in vivo studies
  6. Clinical studies in humans
  7. Marketed Drug!!

  22. A Not-so-serious View of the Scientific Method
  [cartoon cycle through the stages of science: grant review and award, resource allocation, test development, data analysis, paper submission, paper review, publication, replicating results, Scientific Discovery!; each stage carries a joke caption such as "Add more of your cologne, parfum de chat", "I think I found your outlier", "Is this what they meant by '2 paw rearing?'", "RA pizza parties are back!", and "I can't understand the proposal, but the CV is impressive"]

  23. A Not-so-serious View of the Scientific Method
  [build of slide 22: the same cartoon cycle, now overlaid with machine-learning analogies: "Learning from a Training Set" on the test development and data analysis stages, and "Validation with a Test Set" on the replicating results stage]

  24. A Not-so-serious View of the Scientific Method
  [build of slide 22: the same cartoon cycle, now labeling the resource allocation stage "Experiment Design (aka Active Learning)"]

  25. Psychogenics in vivo platform

  26. System Overview
  [pipeline diagram: Proprietary Hardware → Computer Vision → Pattern Recognition → Database → Data Mining → Drug Signatures and Classifiers → Selected Compounds]
  1. Run reference compounds to build classifiers
  2. Run novel compounds to screen for new drugs

  27. System Overview
  [same pipeline diagram as slide 26]
  3. Active learning (experiment design) decides which compounds get more mice

  28. SmartCube Features (more than 2000!)
  • Behaviors: anogenital grooming, free rearing, supported rearing, immobile, mobile, stretch attend, approach, misstep, digging, none
  • transition probabilities between behaviors
  • amount of time in each behavior
  • mean, variance of tail position, body ellipticity, height
  • average time, distance rotating in one direction before switching
  • fractal measures of motion: time and space based
  • average speed and distance from center
  • startle latency, amplitude, and integral (with and without prepulse, PPI)
  • "home corner" features
  • number of shock probe contacts and total time in contact
  • features divided into 12 time segments, one for each trial phase
  • ratios of all features between the two shock, startle, and towers periods

  29. Data Mining vs Traditional Statistics: Gaussian Mixture Models Example
  [two example plots, each 1 dimension with 1 Gaussian per group]
  • Traditional t-test: Are the means different?
  • Data Mining Approach: Can a classifier predict which samples are from which group?
  • Data Mining: extends easily to multiple features and multiple Gaussians

  30. Data Mining – Decision Trees
  [example decision tree splitting on features such as feature_973, feature_1782, feature_913, feature_1524, feature_890, feature_38, feature_1643, and feature_1231, with leaves labeled antidepressant, antipsychotic, anxiolytic, or vehicle]

  31. Bagging Classifiers
  Problem: Decision trees are well known to overfit data.
  Solution: Learn many different copies of the classifier and have them combine their predictions to get a final result.
  • Bagging (see the sketch below)
    • Create a bootstrap sample: resample the training data with replacement to create a new data set
    • Learn a classifier
    • Repeat
    • Predict based on a vote of all classifiers
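A minimal bagging sketch, not from the talk: scikit-learn decision trees on made-up data, with the bootstrap resampling and the majority vote written out explicitly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # made-up features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # made-up labels

trees = []
for _ in range(50):
    idx = rng.integers(len(X), size=len(X))   # bootstrap: sample with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Predict by majority vote over all the trees.
votes = np.array([t.predict(X) for t in trees])
prediction = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (prediction == y).mean())
```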

  32. Test Set Evaluation
  • Many mice are run with and without a treatment (treatment vs vehicle)
  • Choose 1/3 of the mice at random and remove them; these form a test set, and the rest are the training set
  • Build a classifier using the training data
  • Check its accuracy on the test data
  • Repeat lots of times and compute the average accuracy

  33. A Data Mining Replacement for the t-test
  [same two example plots as slide 29: 1 dimension, 1 Gaussian per group]
  • Traditional t-test: Are the means different?
  • Data Mining Approach: Can a classifier predict which samples are from which group?
  • Goal: Determine whether a compound creates any behavioral effects distinguishable from its vehicle
  • Problem: We have thousands of features!
  • Approach: Attempt to learn a classifier that can predict which treatment (vehicle or compound) was given
  • Statistical test: is the test set accuracy significantly greater than chance? (chance = 0.50, with 95% confidence intervals of +/- 0.15; see the sketch below)
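A minimal sketch of the repeated random-holdout test, not from the talk: the treatment-vs-vehicle data are made up, scikit-learn's BaggingClassifier over decision trees stands in for the talk's bagged trees, and the 0.50 +/- 0.15 chance band is taken from the slide.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))     # made-up mice x features
y = rng.integers(2, size=60)        # 0 = vehicle, 1 = compound (made up)

accuracies = []
for seed in range(20):              # repeat the random 1/3 holdout
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X_tr, y_tr)
    accuracies.append(clf.score(X_te, y_te))

mean_acc = np.mean(accuracies)
# Chance is 0.50 with a roughly +/- 0.15 band (the slide's 95% interval):
print(mean_acc, "distinguishable from vehicle?", mean_acc > 0.50 + 0.15)
```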

  34. A Data Mining Replacement for the t-test
  Some Reference Drug Dose Responses
  [plots: dose response and time response curves]

  35. Evaluation for Classification
  [reference-library hierarchy: Drug Class → Subclass → Drug Name → Dose (lo, hi); classes include antidepressant (tricyclic, SSRI), antipsychotic (typical, atypical), anxiolytic (5-HT1A agonist, benzodiazepine), and vehicle]
  • Reference library is created from known drugs
  • Many mice are tested for each dose of each drug
  • We use this to build classifiers

  36. Evaluation for Classification
  [same reference-library hierarchy as slide 35]
  • Test set: keep out 1/3 of trials at random
    • learn a classifier from the training set (like the test development process)
    • check accuracy on the test set (like the replication process)
    • evaluates the ability to recognize drugs that have been seen before

  37. Evaluation for Classification
  [same reference-library hierarchy as slide 35, shown twice]
  • Test set: keep out 1/3 of trials at random
    • learn a classifier from the training set (like the test development process)
    • check accuracy on the test set (like the replication process)
    • evaluates the ability to recognize drugs that have been seen before
  • Leave whole drug out: keep all trials from one drug out
    • evaluates the ability to recognize novel compounds

  38. Classification Validation Results • thousands of mice • thousands of features for classification • 70 drugs in 13 classes (including vehicle) • 145 drug/dose combinations

  39. Estimating Class/Next Experiment Probabilities Suppose I’ve seen 2 mice classified as anxiolytic and 2 mice classified as vehicle. What is the probability the compound they were given belongs to each of the classes? What classification do I expect to see for the next mouse?

  40. Class Membership Probability
  Bayes Rule: P(class_j | data) = P(data | class_j) * P(class_j) / P(data)
  • Use the confusion matrix from the reference library validation experiments for active drugs and vehicles
  • Choose the expected distribution of compounds to be screened

  41. Class Membership Probability
  P(class_j | data) = P(data | class_j) * P(class_j) / P(data)
  Use a multinomial distribution to determine the probability of the data for each drug in a class, e.g.
  p(2 vehicle, 2 anxiolytic | drug = tween) = 0.42 * 0.42 * 0.05 * 0.05 * 6
  (the factor 6 is the multinomial coefficient 4! / (2! 2!); see the sketch below)
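A minimal sketch of the class-posterior computation, not from the talk: the per-drug classification distributions, the class prior, and the drug name "diazepam" are made-up placeholders standing in for the confusion-matrix estimates; only "tween" and the 2-vehicle / 2-anxiolytic example come from the slides.

```python
import numpy as np
from math import factorial

# Hypothetical per-drug classification distributions (how mice given that drug
# tend to be classified). Class order: [vehicle, anxiolytic]. Not from the talk.
class_of_drug = {"tween": "vehicle", "diazepam": "anxiolytic"}
per_drug_dist = {"tween": np.array([0.90, 0.10]),
                 "diazepam": np.array([0.25, 0.75])}
prior = {"vehicle": 0.5, "anxiolytic": 0.5}   # assumed screening prior

def multinomial_pmf(counts, probs):
    """P(observed classification counts | per-mouse classification distribution)."""
    n = sum(counts)
    coef = factorial(n) / np.prod([factorial(c) for c in counts])
    return coef * np.prod(np.power(probs, counts))

def class_posterior(counts):
    """Bayes rule: P(class | data) is proportional to P(data | class) P(class),
    where P(data | class) averages the multinomial likelihood over the class's drugs."""
    post = {}
    for cls in prior:
        drugs = [d for d, c in class_of_drug.items() if c == cls]
        lik = np.mean([multinomial_pmf(counts, per_drug_dist[d]) for d in drugs])
        post[cls] = lik * prior[cls]
    z = sum(post.values())
    return {c: p / z for c, p in post.items()}

# 2 mice classified as vehicle and 2 as anxiolytic:
print(class_posterior([2, 2]))
```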

  42. Distribution of Drugs in a Class Two examples of multinomial distributions for the drugs in a class

  43. Smooth Bootstrap to Sample Drug Density in a Class
  • Use a kernel density estimator (KDE) with a Dirichlet kernel on each class's distribution of drugs
  • Draw samples from the KDE
  • Use the samples in place of the known drugs in the sum that computes P(data | class_j) (see the sketch below)
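A minimal sketch of a Dirichlet-kernel smooth bootstrap, not from the talk: the per-drug distributions and the concentration (bandwidth) parameter are made up, and the kernel is simply a Dirichlet centered on each observed drug distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multinomial classification distributions for the drugs of one
# class (not from the talk); each row lies on the probability simplex.
class_drug_dists = np.array([[0.90, 0.05, 0.05],
                             [0.80, 0.15, 0.05],
                             [0.70, 0.20, 0.10]])

def smooth_bootstrap(dists, n_samples, concentration=50.0):
    """KDE on the simplex with a Dirichlet kernel: pick one observed drug
    distribution at random, then draw a new distribution from a Dirichlet
    centered on it. `concentration` acts as the (inverse) kernel bandwidth."""
    idx = rng.integers(len(dists), size=n_samples)
    return np.array([rng.dirichlet(concentration * dists[i]) for i in idx])

samples = smooth_bootstrap(class_drug_dists, 1000)
# These samples stand in for the class's drugs when summing the multinomial
# likelihoods in P(data | class_j).
print(samples.mean(axis=0))
```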

  44. Empirical Evaluation
  • 13 drug classes, n = 145 drug/dose combinations
  • Evaluation protocol
    • Remove 10% of drugs from the data at random to form a test set
    • Construct classifiers and the Bayesian probability model
    • Predict the drugs in the test set
    • Repeat 100 times
  • Count Method: count the mice predicted in each class and choose the class with the most; accuracy 68.9%
  • Bayesian Method: compute posterior probabilities and choose the class with the largest; accuracy 75.0%

  45. Probability Distribution for the Next Data Point
  Suppose I've seen 2 mice classified as anxiolytic and 2 mice classified as vehicle. What classification do I expect to see for the next mouse?
  The prediction combines the class probabilities with the same multinomial model used to compute them.

  46. Active Learning: Which Compound gets the next Mouse?

  47. Gittins' indices and k-armed bandits
  [slide graphic: "WIN!"]
  • Bandit Problems [Gittins 89] [Berry and Fristedt 85]
  • Choose arm pulls to maximize time-discounted rewards

  48. Gittins' index algorithm (repeats the algorithm and Markov-chain diagram of slide 13)

  49. Gittins' index algorithm (repeats slide 14: the diagram adds the bailout option with fixed payoff M $)

  50. Gittins' index algorithm (repeats slide 15: for m_i $ the action choice is indifferent; m_1 … m_5 are the Gittins indices of the states)
