
Active Learning in Comp Bio 02-750

Jaime Carbonell, Language Technologies Institute, Carnegie Mellon University, www.cs.cmu.edu/~jgc, 4 September 2012.


Presentation Transcript


  1. Active Learning in Comp Bio 02-750
     Jaime Carbonell, Language Technologies Institute, Carnegie Mellon University
     www.cs.cmu.edu/~jgc
     4 September 2012

  2. Why is Active Learning Important?
     • Labeled data volumes ≪ unlabeled data volumes
       • 1.2% of all proteins have known structures
       • < .01% of all galaxies in the Sloan Sky Survey have consensus type labels
       • < .0001% of all web pages have topic labels
       • << E-10% of all internet sessions are labeled as to fraudulence (malware, etc.)
       • < .0001 of all financial transactions investigated w.r.t. fraudulence
       • < .01% of all monolingual text is reliably bilingual
     • If labeling is costly, or limited, select the instances with maximal impact for learning
     Jaime G. Carbonell, Language Technologies Institute

  3. Active Learning Relevant to Computational Biology
     • Protein-structure Learning
       • Classification into protein family, …
       • Inferring 3D structure from 1D sequence
     • Protein-protein interactions (PPIs)
       • Within-species (e.g. Human) pathways
       • Host-pathogen (e.g. Human-HIV1)
     • Evidence-Based Medicine
       • Cardiology (selecting diagnosis method)
       • Transplant (selecting immunosuppression)

  4. Active Learning Relevant to Computational Biology
     • Instance Selection in Classification
       • Protein family/sub-family classification
       • Protein structure prediction
       • Motif/promoter-sequence prediction
     • Experiment/source Selection
       • Protein structure: NMR vs X-ray crystallography
       • Granularity of Molecular Dynamics
       • Source of info for PPI network induction
       • Target/reactant selection in microarrays
     • Cascaded active learning (based on results of last experimental cycle)

  5. Active Learning
     • Training data:
       • Special case:
     • Functional space:
     • Fitness Criterion:
       • a.k.a. loss function
     • Sampling Strategy:
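
One common way to formalize the five ingredients named on this slide is sketched below. The notation is an assumption chosen for illustration, not the slide's own: $k$ labeled and $n-k$ unlabeled instances, a parametric family $f_\theta$, a loss $L$ with an optional regularizer $\Omega$, and a generic utility score $s$ that each sampling strategy defines differently. Reading the "special case" as starting with no labels ($k = 0$) is likewise an assumption.

```latex
\begin{align*}
\text{Training data:}\quad
  & \{(x_i, y_i)\}_{i=1}^{k} \;\cup\; \{x_i\}_{i=k+1}^{n}, \qquad k \ll n \\
\text{Special case:}\quad
  & k = 0 \\
\text{Functional space:}\quad
  & \mathcal{F} = \{\, f_\theta : \theta \in \Theta \,\} \\
\text{Fitness criterion:}\quad
  & \hat{\theta} = \arg\min_{\theta \in \Theta}
    \sum_{i=1}^{k} L\bigl(y_i, f_\theta(x_i)\bigr) + \lambda\,\Omega(\theta) \\
\text{Sampling strategy:}\quad
  & x^{*} = \arg\max_{x \in \{x_{k+1}, \dots, x_n\}} s\bigl(x;\, f_{\hat{\theta}}\bigr)
\end{align*}
```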

  6. “Myopic” Sampling Strategies
     • Random sampling (preserves distribution)
     • Uncertainty sampling (Lewis, 1996; Tong & Koller, 2000)
       • proximity to decision boundary
       • maximal distance to labeled x’s
     • Density sampling (kNN-inspired McCallum & Nigam, 2004)
     • Representative sampling (Xu et al, 2003)
     • Instability sampling (probability-weighted)
       • x’s that maximally change decision boundary
     • Ensemble Strategies
       • Boosting-like ensemble (Baram, 2003)
       • DUAL (Donmez & Carbonell, 2007)
         • Dynamically switches strategies
     • [See Settles 2010 review of Active Learning]

  7. Which point to sample?
     Grey = unlabeled; Red = class A; Brown = class B

  8. Density-Based Sampling
     • Centroid of largest unsampled cluster
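
A minimal sketch of the "centroid of largest unsampled cluster" rule, assuming cluster assignments are already available from some clustering step. The function name `density_sample` and its arguments are hypothetical, and "unsampled" is read here as "contains no labeled point".

```python
import math
from collections import defaultdict


def density_sample(points, cluster_of, labeled):
    """Pick the point nearest the centroid of the largest cluster
    that contains no labeled point (assumed meaning of "unsampled").

    points     -- list of coordinate tuples
    cluster_of -- cluster id for each point (assumed precomputed)
    labeled    -- set of indices that already have labels
    Returns an index into points, or None if every cluster is sampled.
    """
    clusters = defaultdict(list)
    for i, c in enumerate(cluster_of):
        clusters[c].append(i)

    # Keep only clusters with no labeled member.
    unsampled = [m for m in clusters.values()
                 if not any(i in labeled for i in m)]
    if not unsampled:
        return None

    members = max(unsampled, key=len)
    dim = len(points[0])
    centroid = [sum(points[i][d] for i in members) / len(members)
                for d in range(dim)]
    # The centroid itself is usually not a data point, so query its nearest member.
    return min(members, key=lambda i: math.dist(points[i], centroid))
```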

  9. Uncertainty Sampling
     • Closest to decision boundary
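
As an illustration of "closest to decision boundary", the sketch below assumes a linear classifier with weights `w` and bias `b` (both hypothetical inputs) and queries the unlabeled point with the smallest geometric distance to the hyperplane w·x + b = 0. Other classifiers would substitute their own uncertainty measure.

```python
import math


def uncertainty_sample(points, unlabeled, w, b):
    """Return the unlabeled index whose point lies closest to the
    hyperplane w.x + b = 0 of an (assumed) linear classifier."""
    norm = math.sqrt(sum(wi * wi for wi in w))

    def distance_to_boundary(i):
        return abs(sum(wi * xi for wi, xi in zip(w, points[i])) + b) / norm

    return min(unlabeled, key=distance_to_boundary)
```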

  10. Maximal Diversity Sampling
     • Maximally distant from labeled x’s
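
One way to make "maximally distant from labeled x’s" concrete is a max-min rule: for each unlabeled point take its distance to the nearest labeled point, then query the point for which that distance is largest. This is a sketch under that assumption; the name `diversity_sample` is hypothetical.

```python
import math


def diversity_sample(points, unlabeled, labeled):
    """Return the unlabeled index that is farthest from its nearest
    labeled point (max-min Euclidean distance)."""
    def nearest_labeled_distance(i):
        return min(math.dist(points[i], points[j]) for j in labeled)

    return max(unlabeled, key=nearest_labeled_distance)
```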

  11. Ensemble-Based Possibilities
     • Uncertainty + Diversity criteria
     • Density + uncertainty criteria
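
A minimal sketch of a density + uncertainty combination, assuming a binary classifier that outputs a class-A probability per point. The product of prediction entropy and a kNN-based density estimate is one plausible combination rule, not necessarily the one the slides had in mind; all function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def knn_density(points, i, k):
    """Inverse of the mean distance from point i to its k nearest neighbours."""
    dists = sorted(math.dist(points[i], points[j])
                   for j in range(len(points)) if j != i)
    return 1.0 / (sum(dists[:k]) / k + 1e-12)


def density_uncertainty_sample(points, prob_a, candidates, k=2):
    """Query the candidate maximizing uncertainty x density, so that
    uncertain points in dense regions win over uncertain outliers."""
    return max(candidates,
               key=lambda i: binary_entropy(prob_a[i]) * knn_density(points, i, k))
```

In the usage below, points 0 and 3 are equally uncertain, but point 0 sits in a dense region and point 3 is an outlier, so point 0 is preferred.

```python
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5)]
probs = [0.5, 0.9, 0.9, 0.5]
print(density_uncertainty_sample(pts, probs, [0, 1, 2, 3]))
```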

  12. Strategy Selection: No Universal Optimum
     • Optimal operating range for AL sampling strategies differs
     • How to get the best of both worlds?
     • (Hint: ensemble methods, e.g. DUAL)

  13. How does DUAL do better?
     • Runs DWUS (density-weighted uncertainty sampling) until it estimates a cross-over
       • Monitors the change in expected error at each iteration to detect when it is stuck in a local minimum
     • DUAL uses a mixture model after the cross-over (saturation) point
     • Our goal should be to minimize the expected future error
       • If we knew the future error of Uncertainty Sampling (US) to be zero, then we’d force the mixture to rely on US alone
       • But in practice, we do not know it
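
The cross-over test described above can be sketched as a simple stall detector: flag the first iteration at which the estimated expected error changes by less than a threshold. The threshold name `delta` and its default value are assumptions for illustration.

```python
def crossover_iteration(expected_errors, delta=1e-3):
    """Return the first iteration t (t >= 1) at which the estimated
    expected error moved by less than delta since iteration t - 1,
    or None if no such stall occurs.

    expected_errors -- per-iteration error estimates for the first strategy
    delta           -- hypothetical stall threshold
    """
    for t in range(1, len(expected_errors)):
        if abs(expected_errors[t] - expected_errors[t - 1]) < delta:
            return t
    return None
```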

  14. More on DUAL [ECML 2007]
     • After cross-over, US does better => uncertainty score should be given more weight
     • The mixture weight should reflect how well US performs
     • It can be calculated from the expected error of US on the unlabeled data*
     • Finally, we have the following selection criterion for DUAL:
     * US is allowed to choose data only from among the already sampled instances, and its expected error is calculated on the remaining unlabeled set
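
A hedged sketch of a mixture selection rule consistent with the bullets above: the estimated error of US weights the DWUS score, and its complement weights the US score, so a well-performing US dominates the choice. The exact form, the function name `dual_select`, and the score dictionaries are assumptions; see the Dual-Strategy Active Learning paper for the authors' criterion.

```python
def dual_select(dwus_score, us_score, us_error, candidates):
    """Pick the candidate maximizing
        us_error * DWUS(x) + (1 - us_error) * US(x).

    dwus_score, us_score -- per-candidate scores (assumed to be on comparable scales)
    us_error             -- estimated expected error of US, in [0, 1]
    """
    if not 0.0 <= us_error <= 1.0:
        raise ValueError("us_error must lie in [0, 1]")
    return max(candidates,
               key=lambda i: us_error * dwus_score[i]
               + (1.0 - us_error) * us_score[i])
```

With `us_error = 0` the rule reduces to pure uncertainty sampling, and with `us_error = 1` it reduces to pure DWUS.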

  15. Results: DUAL vs DWUS

  16. Beyond Dual
     • Paired Sampling with Geodesic Density Estimation
       • Donmez & Carbonell, SIAM 2008
     • Active Rank Learning
       • Search results: Donmez & Carbonell, WWW 2008
       • In general: Donmez & Carbonell, ICML 2008
     • Structure Learning
       • Inferring 3D protein structure from 1D sequence
       • Remains open problem

  17. Issues in Active Learning
     • Abundance of unlabeled examples
     • Paucity of labeled examples
     • High cost of labeling (experimentation, expert)
     • Selection of appropriate sampling strategies
     • Dependency on underlying ML method
     • What if labeling noise, variable costs, …?
     • Applications abound, including in Comp Bio
       • Tertiary/quaternary protein structure prediction
       • Protein-protein interaction prediction
       • Drug target selection

  18. Readings
     • Burr Settles – Comprehensive Survey of AL
       http://www.cs.cmu.edu/~bsettles/pub/settles.activelearning.pdf
     • Donmez, P., Carbonell, J. and Bennett, P., “Dual-Strategy Active Learning”
       http://www.cs.cmu.edu/~jgc/publication/Dual_Strategy_ECML_2007.pdf
     • Cohn, Ghahramani and Jordan, “Active Learning with Statistical Models”
       http://dspace.mit.edu/bitstream/handle/1721.1/7192/AIM-1522.pdf;jsessionid=13C2A9BF0DEC1567B9CA33F0C43BC3C3?sequence=2

  19. THANK YOU!

  20. Active Sampling for RankSVM I
     • Consider a candidate
     • Assume is added to training set with
     • Total loss on pairs that include is:
       • n is the # of training instances with a different label than
     • Objective function to be minimized becomes:

  21. Active Sampling for RankSVM II
     • Assume the current ranking function is
     • There are two possible cases:
     • Assume
     • Derivative w.r.t at a single point or

  22. Active Sampling for RankSVM III
     • Substitute in the previous equation to estimate
     • Magnitude of the total derivative
       • estimates the ability of to change the current ranker if added into training
     • Finally,

  23. Active Sampling for RankBoost I
     • Again, estimate how the current ranker would change if was in the training set
     • Estimate this change by the difference in ranking loss before and after is added
     • Ranking loss w.r.t is (Freund et al., 2003):

  24. Active Sampling for RankBoost II
     • Difference in the ranking loss between the current and the enlarged set:
       • indicates how much the current ranker needs to change to compensate for the loss introduced by the new instance
     • Finally, the instance with the highest loss differential is sampled:

  25. Performance Measures
     • MAP (Mean Average Precision)
       • MAP is the average of AP values for all queries
     • NDCG (Normalized Discounted Cumulative Gain)
       • The impact of each relevant document is discounted as a function of rank position
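
Both measures can be sketched in a few lines. Two assumptions are made here: average precision is normalized by the number of relevant documents present in the ranked list (so every relevant document is assumed to appear in it), and DCG uses the common exponential-gain, log2-discount variant. Other variants exist.

```python
import math


def average_precision(relevance):
    """AP for one query. relevance is a list of 0/1 flags in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """MAP: the average of AP values over all queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


def dcg(gains):
    """Discounted cumulative gain with (2^g - 1) gains and log2(rank + 1) discount."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending-gain) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```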

  26. Results on TREC03
