
Machine Learning Part 2: Intermediate and Active Sampling Methods


Presentation Transcript


  1. Machine Learning Part 2: Intermediate and Active Sampling Methods Jaime Carbonell (with contributions from Pinar Donmez and Jingrui He) Carnegie Mellon University jgc@cs.cmu.edu © 2008, Jaime G. Carbonell

  2. Beyond “Standard” Learning: • Multi-Objective Learning • Structuring Unstructured Data • Text Categorization • Temporal Prediction • Cycle & trend detection • Semi-Supervised Methods • Labeled + Unlabeled Data • Active Learning • Proactive Learning • “Unsupervised” Learning • Predictor attributes, but no explicit objective • Clustering methods • Rare category detection © 2008, Jaime G. Carbonell

  3. Multi-Objective Supervised Learning [Diagram: objectives obj1, obj2, obj3 linked to overlapping sets of predictor attributes p1-p6] • Several objectives to predict, with overlapping sets of predictor attributes -- Independent predictions (each solved ignoring the others) -- Dependent predictions (results of earlier predictions partially feed the next round) • Dependent case: sequence the predictions; if there is feedback, cycle until stability (or a fixed N) © 2008, Jaime G. Carbonell
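
A minimal sketch of the dependent-prediction cycle described above, assuming numeric objectives and scikit-learn-style regressors (all names here are illustrative, not from the slides): each objective's model sees the shared predictors plus the other objectives' latest predictions, and the passes repeat for a fixed N.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def dependent_predictions(X, Y, n_rounds=3):
    """Sequence predictions for k numeric objectives: each round, objective j's
    model sees the shared predictor attributes X plus the latest predictions of
    the *other* objectives, and the rounds cycle a fixed number of times."""
    n, k = Y.shape
    preds = np.zeros((n, k))                       # round 0: no earlier predictions yet
    models = [LinearRegression() for _ in range(k)]
    for _ in range(n_rounds):
        for j in range(k):
            others = np.delete(preds, j, axis=1)   # predictions of the other objectives
            Xj = np.hstack([X, others])
            models[j].fit(Xj, Y[:, j])
            preds[:, j] = models[j].predict(Xj)
    return models, preds
```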

  4. The Vector Space Model: How to Convert Text to “Data” • Definitions of document and query vectors, where wj = jth word and c(wj, di) = count of occurrences of word wj in document di • For topic categorization, use wn+1 as the objective category to predict (e.g. “finance”, “sports”) © 2008, Jaime G. Carbonell
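
A toy illustration of that conversion, assuming a two-document corpus (the documents, labels, and function names are made up): each vector dimension holds c(wj, di), and the topic label rides along as the objective attribute to predict.

```python
from collections import Counter

def to_vector(doc, vocab):
    """Bag-of-words vector: one dimension per vocabulary word, holding c(wj, di)."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

docs = ["stocks fell as the market slid", "the team won the final match"]
labels = ["finance", "sports"]                       # objective category per document
vocab = sorted({w for d in docs for w in d.lower().split()})
data = [(to_vector(d, vocab), y) for d, y in zip(docs, labels)]
```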

  5. Refinements to Word-Based Features • Well-known methods • Stop-word removal (e.g., “it”, “the”, “in”, …) • Phrasing (e.g., “White House”, “heart attack”, …) • Morphology (e.g., “countries” => “country”) • Feature Expansion • Query expansion (e.g., “cheap” => “inexpensive”, “discount”, “economic”,…) • Feature Transformation & Reduction • Singular-value decomposition (SVD) • Linear discriminant analysis (LDA) © 2008, Jaime G. Carbonell

  6. Query-Document Similarity (For Retrieval and for kNN) • Traditional “Cosine Similarity”, where each element in the query and document vectors is a word weight • Rare words count more, e.g.: weight(wordi) = log2(Dall / Dfreq(wordi)) • Getting the top-k documents (or web pages) is done by ranking all documents by their cosine similarity to the query and keeping the k highest-scoring ones © 2008, Jaime G. Carbonell
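
A hedged sketch of the retrieval step (function names are mine, not the slides'): words are weighted by log2(Dall / Dfreq(word)) so that rare words count more, and the top-k documents are the k with the highest cosine similarity to the query.

```python
import math
from collections import Counter

def weights(doc, doc_freq, n_docs):
    """Word-weight vector: term count scaled by log2(Dall / Dfreq(word))."""
    counts = Counter(doc.lower().split())
    return {w: c * math.log2(n_docs / doc_freq.get(w, 1)) for w, c in counts.items()}

def cosine(q, d):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(wq * d.get(w, 0.0) for w, wq in q.items())
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def top_k(query, docs, doc_freq, n_docs, k=10):
    """Rank all documents by cosine similarity to the query; keep the k best."""
    qv = weights(query, doc_freq, n_docs)
    scored = [(cosine(qv, weights(d, doc_freq, n_docs)), i) for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)[:k]
```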

  7. Multi-tier Text Categorization Given text, predict category at each level Issue: What if we need to go beyond words as features? © 2008, Jaime G. Carbonell

  8. Time Series Prediction Process • Find leading indicators • “predictor” variables from earlier epochs • Code values per distinct time interval • E.g. “sales at t-1, at t-2, t-3 …” • E.g. “advertisement $ at t, t-1, t-2” • Objective is to predict desired variable at current or future epochs • E.g. “sales at t, t+1, t+2” • Apply machine learning methods you learned • Regression, d-trees, kNN, Bayesian, … © 2008, Jaime G. Carbonell
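
A minimal sketch of the setup described above, using toy quarterly numbers (the ad-spend figures are made up for illustration): lagged sales and advertising values become the predictor attributes, and sales at the current epoch is the objective.

```python
from sklearn.linear_model import LinearRegression

def lagged_features(sales, ads, lags=3):
    """One training row per epoch t: sales at t-1..t-lags and advertisement $
    at t..t-(lags-1) as predictors; sales at t as the objective."""
    X, y = [], []
    for t in range(lags, len(sales)):
        row = [sales[t - i] for i in range(1, lags + 1)]   # sales at t-1, t-2, t-3
        row += [ads[t - i] for i in range(0, lags)]        # advertisement $ at t, t-1, t-2
        X.append(row)
        y.append(sales[t])                                 # objective: sales at t
    return X, y

# Toy quarterly series (sales taken from the next slide, ad spend invented)
sales = [9.5, 8.5, 7.5, 11, 11, 10, 8.5, 13]
ads = [1.0, 0.9, 0.8, 1.4, 1.2, 1.1, 0.9, 1.5]
X, y = lagged_features(sales, ads)
model = LinearRegression().fit(X, y)                       # or d-trees, kNN, Bayesian, ...
```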

  9. Time Series Prediction: caveat 1
  Total Sales    Q1      Q2      Q3      Q4
  2006           9.5M    8.5M    7.5M    11M
  2007           11M     10M     8.5M    13M
  2008           12M     11M     9.5M    ??
  • Determine periodic cycle • Find within-cycle trend • Find cross-cycle trend • Combine both components © 2008, Jaime G. Carbonell

  10. Time Series Prediction: caveat 2
  Total Airline Sales    Q1      Q2      Q3      Q4
  2006                   9.5M    8.5M    7.5M    11M
  2007                   11M     10M     8.5M    13M
  2008                   12M     11M     9.5M    ??
  • Watch for exogenous variables! (the World Trade Center attack wreaked havoc with airline-industry predictions) • Less tragic and less obvious one-of-a-kind events matter too © 2008, Jaime G. Carbonell

  11. Leveraging Existing Data-Collecting Systems [Chart: the 1999-2000 influenza outbreak, week by week, tracked by multiple existing signals: influenza cultures, sentinel physicians, WebMD queries about ‘cough’ etc., school absenteeism, sales of cough and cold meds, sales of cough syrup, ER respiratory complaints, ER ‘viral’ complaints, influenza-related deaths] [Moore, 2002] © 2008, Jaime G. Carbonell

  12. Adaptive Filtering over a Document Stream [Diagram: a time line of training documents (past) and test documents, with Topic 1, Topic 2, Topic 3, …; for the current document the system asks “On-topic?”, splitting unlabeled documents into on-topic documents (with RF) and off-topic documents] © 2008, Jaime G. Carbonell

  13. MLR threshold function: locally linear, globally non-linear Classifier = Rocchio, Topic = Civil War (R76 in TREC10), Threshold = MLR © 2008, Jaime G. Carbonell

  14. Time Series in a Nutshell • Time-series prediction requires regression, except: • Historical data per time period (aka “epoch”) • Predictor attributes come from both current and earlier epochs • Objective attribute from earlier epochs becomes a predictor attribute for the current epoch • Process differences vs. normal machine learning: • First detect cyclical patterns among epochs • Predict within a cycle • Predict cross-cycle using corresponding epochs only (then combine with the within-cycle prediction) © 2008, Jaime G. Carbonell
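
A toy sketch of that two-step process, using the quarterly numbers from slide 9; the 50/50 combination of the cross-cycle and within-cycle trends is an illustrative choice, not the slides' method.

```python
def fit_trend(values):
    """Least-squares slope and intercept for a series indexed 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0, values[0]
    mx, my = (n - 1) / 2.0, sum(values) / n
    sxx = sum((i - mx) ** 2 for i in range(n))
    slope = sum((i - mx) * (v - my) for i, v in enumerate(values)) / sxx
    return slope, my - slope * mx

def forecast_quarter(history, quarter, year_len=4):
    """history: past values, one per quarter, oldest first, ending just before
    the target quarter of the current year; quarter: 0-based index (Q4 -> 3)."""
    # Cross-cycle trend: the same quarter in earlier cycles (e.g. Q4 of 2006, 2007, ...)
    same_q = history[quarter::year_len]
    s, b = fit_trend(same_q)
    cross = s * len(same_q) + b
    # Within-cycle trend: quarters already observed in the current cycle
    current = history[-quarter:] if quarter else []
    if len(current) >= 2:
        s2, b2 = fit_trend(current)
        within = s2 * quarter + b2
        return 0.5 * (cross + within)        # simple 50/50 combination of both components
    return cross

history = [9.5, 8.5, 7.5, 11, 11, 10, 8.5, 13, 12, 11, 9.5]   # 2006 Q1 .. 2008 Q3
estimate_q4_2008 = forecast_quarter(history, quarter=3)
```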

  15. Active Learning • Assume: • Very few “labeled” instances • Very many “unlabeled” instances • An omniscient “oracle” that can assign a label to an unlabeled instance • Objective: • Select instances to label such that learning accuracy is maximized with the fewest oracle labeling requests © 2008, Jaime G. Carbonell

  16. Active Learning (overall idea) [Diagram: a Data Source supplies unlabeled data to the Learning Mechanism; the learner issues a label request to the Expert, receives labeled data back, learns a new model, and delivers output to the User]

  17. Why is Active Learning Important? • Labeled data volumes << unlabeled data volumes • 1.2% of all proteins have known structures • 0.01% of all galaxies in the Sloan Sky Survey have consensus type labels • 0.0001% of all web pages have topic labels • If labeling is costly, or limited, we want to select the points that will have maximal impact © 2008, Jaime G. Carbonell

  18. Review of Supervised Learning • Training data: • Functional space: • Fitness Criterion: • Variants: online learning, noisy data, … © 2008, Jaime G. Carbonell

  19. Active Learning • Training data: • Special case: • Functional space: • Fitness Criterion: • a.k.a. loss function • Sampling Strategy: © 2008, Jaime G. Carbonell
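
The formulas on slides 18-19 were images and did not survive the transcript; a standard reconstruction of the setup they describe is given below (the notation is assumed, not copied from the slides).

```latex
% Supervised learning (slide 18): labeled training data, a hypothesis space,
% and a fitness criterion (empirical loss minimization)
T = \{(x_i, y_i)\}_{i=1}^{n}, \qquad
\mathcal{F} = \{ f : X \rightarrow Y \}, \qquad
f^{*} = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} L\bigl(f(x_i),\, y_i\bigr)

% Active learning (slide 19): a small labeled set L, a large unlabeled pool U,
% the same functional space and loss, plus a sampling strategy that selects
% which unlabeled point to send to the oracle next
T = L \cup U, \quad |L| \ll |U|, \qquad
x^{*} = \arg\max_{x \in U} \; \phi\bigl(x \mid f, L, U\bigr)
```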

  20. Sampling Strategies • Random sampling (preserves distribution) • Uncertainty sampling (Tong & Koller, 2000) • proximity to decision boundary • maximal distance to labeled x’s • Density sampling (kNN-inspired; McCallum & Nigam, 2004) • Representative sampling (Xu et al., 2003) • Instability sampling (probability-weighted) • x’s that maximally change the decision boundary • Ensemble Strategies • Boosting-like ensemble (Baram, 2003) • DUAL (Donmez & Carbonell, 2007) • Dynamically switches strategies from Density-Based to Uncertainty-Based by estimating the derivative of expected residual error reduction © 2008, Jaime G. Carbonell
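
Minimal sketches of three of these strategies, assuming scikit-learn-style inputs and a binary margin classifier for the uncertainty variant; the cited papers' exact formulations differ in detail.

```python
import numpy as np

def uncertainty_sampling(model, X_pool):
    """Uncertainty sampling: pick the unlabeled point closest to the decision
    boundary of a fitted binary margin classifier (e.g. sklearn LinearSVC)."""
    dist = np.abs(model.decision_function(X_pool))
    return int(np.argmin(dist))

def density_sampling(X_pool, k=10):
    """Density sampling (kNN-inspired): pick the point whose k nearest unlabeled
    neighbors are closest on average, i.e. a representative of a dense region."""
    d = np.linalg.norm(X_pool[:, None, :] - X_pool[None, :, :], axis=-1)
    avg_knn = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    return int(np.argmin(avg_knn))

def diversity_sampling(X_pool, X_labeled):
    """Maximal-diversity sampling: pick the point whose nearest already-labeled
    neighbor is farthest away."""
    d = np.linalg.norm(X_pool[:, None, :] - X_labeled[None, :, :], axis=-1)
    return int(np.argmax(d.min(axis=1)))
```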

  21. Which point to sample? Green = unlabeled Red = class A Brown = class B

  22. Density-Based Sampling Centroid of largest unsampled cluster

  23. Uncertainty Sampling Closest to decision boundary

  24. Maximal Diversity Sampling Maximally distant from labeled x’s

  25. Ensemble-Based Possibilities Uncertainty + Diversity criteria Density + uncertainty criteria

  26. Active Learning Issues • Interaction of active sampling with the underlying classifier(s) • On-line sampling vs. batch sampling • Active sampling for rank learning and for structured learning (e.g. HMMs, sCRFs) • What if the Oracle is fallible, reluctant, or differentially expensive? → proactive learning • How does noisy data affect active learning? • What if we do not have even the first labeled point(s) for one or more classes? → new class discovery • How to “optimally” combine A.L. strategies © 2008, Jaime G. Carbonell

  27. Strategy Selection: No Universal Optimum • Optimal operating range for AL sampling strategies differs • How to get the best of both worlds? • (Hint: ensemble methods, e.g. DUAL) © 2008, Jaime G. Carbonell

  28. Motivation for DUAL • Strength of DWUS (density-weighted uncertainty sampling): • favors higher-density samples close to the decision boundary • fast decrease in error • But DWUS exhibits diminishing returns! Why? • Early iterations -> many points are highly uncertain • Later iterations -> points with high uncertainty are no longer in dense regions • DWUS wastes time picking instances with no direct effect on the error © 2008, Jaime G. Carbonell

  29. How does DUAL do better? • Runs DWUS until it estimates a cross-over • Monitors the change in expected error at each iteration to detect when it is stuck in a local minimum • DUAL uses a mixture model after the cross-over (saturation) point • Our goal should be to minimize the expected future error • If we knew the future error of Uncertainty Sampling (US) to be zero, then we’d force the selection to rely entirely on the uncertainty score • But in practice, we do not know it © 2008, Jaime G. Carbonell

  30. More on DUAL • After cross-over, US does better => the uncertainty score should be given more weight • The weight should reflect how well US performs • It can be calculated from the expected error of US on the unlabeled data* • Finally, we have the following selection criterion for DUAL (see the sketch below) * US is allowed to choose data only from among the already sampled instances, and its error is calculated on the remaining unlabeled set © 2008, Jaime G. Carbonell
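
The selection criterion itself is not in the transcript; below is a rough sketch of the switching behavior slides 28-30 describe, with assumed names. The exact weighting in the DUAL paper (Donmez & Carbonell, 2007) is derived from the estimated error of uncertainty sampling; here it is abstracted into a single mixing weight `pi_hat`.

```python
def dual_select(unc, dens, pi_hat, crossed_over):
    """unc[i], dens[i]: uncertainty and density scores of unlabeled point i.
    Before the estimated cross-over, behave like DWUS (density-weighted
    uncertainty); after it, mix the two scores, giving uncertainty a weight
    pi_hat meant to reflect how well uncertainty sampling is estimated to
    perform (the slides derive it from US's expected error on unlabeled data)."""
    if not crossed_over:
        scores = [u * d for u, d in zip(unc, dens)]                    # DWUS-style score
    else:
        scores = [pi_hat * u + (1 - pi_hat) * d for u, d in zip(unc, dens)]
    return max(range(len(scores)), key=scores.__getitem__)
```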

  31. Results: DUAL vs DWUS © 2008, Jaime G. Carbonell

  32. Paired Density-Based Sampling (Donmez & Carbonell, 2008) • Desiderata • Balanced sampling from both (all) classes • Combine density-based with coverage-based sampling • Method • Non-Euclidean distance function • Select maximally separated pairs of points based on maximizing a utility function © 2008, Jaime G. Carbonell

  33. Paired Density Method (cont.) • Utility function: • Select the two points that optimize utility and are maximally distant © 2008, Jaime G. Carbonell

  34. Results of Paired-Density Sampling © 2008, Jaime G. Carbonell

  35. Active Learning model in NLP [Diagram: an un-annotated corpus / unlabeled set feeds sample selection by the Active Learner; selected samples receive annotation (e.g. translation) and are added to the active training set, which is used to build the NLP model (parsing model, Machine Translation system, Named Entity Recognition module, Word Sense Disambiguation model); the model is evaluated on test data]

  36. Word-Sense Disambiguation • Needed in NLP for parsing, translation, search, … • Example: • “Line” → ax+by+c, rope, queue, track, … • “Banco” → bench, financial institution, sand bank, … • Challenge: how to disambiguate from context • Approach: build an ML classifier (sense = class) • Problem: insufficient training data • Amelioration: Active Learning © 2008, Jaime G. Carbonell

  37. Word Sense Disambiguation: Active Learning Methods • Entropy Sampling • The vector q represents the trained model’s predictions • qc = prediction probability of sense (class) c • Pick the example whose prediction vector displays the greatest entropy • Margin Sampling • If c and c’ are the two most likely categories, pick the example with the smallest margin between them © 2008, Jaime G. Carbonell
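
The two scores described above, written out for a prediction vector q over senses; these are generic formulations, assuming q comes from the trained sense classifier.

```python
import math

def entropy_score(q):
    """Entropy of the prediction vector q; higher = more uncertain example."""
    return -sum(p * math.log(p) for p in q if p > 0)

def margin_score(q):
    """Gap between the two most likely senses c and c'; smaller = better candidate."""
    top = sorted(q, reverse=True)[:2]
    return top[0] - top[1]

# Entropy sampling queries the example with the largest entropy_score;
# margin sampling queries the example with the smallest margin_score.
```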

  38. Word Sense Disambiguation: Experiment • On 5 English verbs that had coarse-grained senses • Double-blind tagging applied to 50 instances of the target word • If the inter-tagger agreement (ITA) < 90%, the sense entry is revised by adding examples and explanations © 2008, Jaime G. Carbonell

  39. Word Sense Disambiguation Results

  40. Active vs. Proactive Learning
  ACTIVE LEARNING                               PROACTIVE LEARNING
  All x's cost the same to label                Labeling cost is f1(D(x),O)
  Max number of labels                          Max labeling budget
  Omniscient oracle: never errs                 Fallible oracles: errs with p(E(x)) ~ f2(D(x),O)
  Indefatigable oracle: always answers          Reluctant oracles: answers with p(A(x)) …
  Single oracle: oracle selection unnecessary   Multiple oracles: joint optimization of oracle and instance selection
  © 2008, Jaime G. Carbonell

  41. Scenario 1: Reluctance • 2 oracles: • reliable oracle: expensive but always answers with a correct label • reluctant oracle: cheap but may not respond to some queries • Define a utility score as expected value of information at unit cost © 2008, Jaime G. Carbonell

  42. How to estimate the reluctant oracle’s probability of answering? • Cluster the unlabeled data using k-means • Ask the reluctant oracle for the label of each cluster centroid: • if a label is received: increase the estimate for nearby points • if no label: decrease the estimate for nearby points • (an indicator variable equals 1 when a label is received, -1 otherwise) • The number of clusters depends on the clustering budget and the oracle fee © 2008, Jaime G. Carbonell

  43. Algorithm for Scenario 1 © 2008, Jaime G. Carbonell
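
The algorithm figure for this slide is not in the transcript; below is a hedged sketch assembling the pieces slides 41-42 describe (cluster the unlabeled pool, probe centroids with the reluctant oracle, raise or lower an answer-probability estimate for nearby points, then score candidates by expected value of information per unit cost). All function and parameter names are assumptions, not the original pseudocode.

```python
import numpy as np
from sklearn.cluster import KMeans

def explore_reluctant_oracle(X_unlabeled, ask_reluctant, n_clusters, decay=1.0):
    """Estimate P(answer | x) for the reluctant oracle.
    ask_reluctant(x) returns a label, or None if the oracle declines."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_unlabeled)
    p_ans = np.full(len(X_unlabeled), 0.5)            # uninformed prior
    for c in km.cluster_centers_:
        answered = ask_reluctant(c) is not None       # +1 if answered, -1 otherwise
        sign = 1.0 if answered else -1.0
        dist = np.linalg.norm(X_unlabeled - c, axis=1)
        p_ans += sign * 0.5 * np.exp(-decay * dist)   # stronger effect on nearby points
    return np.clip(p_ans, 0.0, 1.0)

def select_query(info_value, p_ans, cost_reluctant, cost_reliable):
    """Utility = expected value of information at unit cost, per oracle."""
    u_reluctant = p_ans * info_value / cost_reluctant
    u_reliable = info_value / cost_reliable           # reliable oracle always answers
    best = int(np.argmax(np.maximum(u_reluctant, u_reliable)))
    oracle = "reluctant" if u_reluctant[best] >= u_reliable[best] else "reliable"
    return best, oracle
```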

  44. Scenario 2: Fallibility • Two oracles: • one perfect but expensive oracle • one fallible but cheap oracle that always answers • Algorithm similar to Scenario 1, with slight modifications • During exploration: • the fallible oracle provides the label together with its confidence • confidence reflects how likely the fallible oracle’s label is to be correct • if the confidence is too low, we don’t use the label, but we still update the estimate © 2008, Jaime G. Carbonell

  45. Scenario 3: Non-uniform Cost • Uniform cost: Fraud detection, face recognition, etc. • Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc. • 2 oracles: • Fixed-cost Oracle • Variable-cost Oracle © 2008, Jaime G. Carbonell

  46. Outline of Scenario 3 © 2008, Jaime G. Carbonell

  47. Underlying Sampling Strategy • Conditional-entropy-based sampling, weighted by a density measure • Captures the information content of a close neighborhood (the close neighbors of x) © 2008, Jaime G. Carbonell
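
A rough sketch of that scoring idea with assumed names; the slide's exact formula (a conditional entropy over the close neighbors of x) is not in the transcript, so this simpler density-times-entropy weighting is only illustrative.

```python
import numpy as np

def density_weighted_entropy(model, X_pool, radius=1.0):
    """Entropy of the model's prediction at each point, scaled by how many
    close neighbors the point has within the given radius."""
    proba = model.predict_proba(X_pool)
    ent = -np.sum(proba * np.log(np.clip(proba, 1e-12, 1.0)), axis=1)
    d = np.linalg.norm(X_pool[:, None, :] - X_pool[None, :, :], axis=-1)
    close = (d < radius).sum(axis=1) - 1     # close neighbors of x (excluding x itself)
    return ent * close                       # query the argmax of this score
```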

  48. Results: Reluctance © 2008, Jaime G. Carbonell

  49. Cost varies non-uniformly • Results are statistically significant (p < 0.01) © 2008, Jaime G. Carbonell

  50. Proactive Learning in General • Multiple Experts (a.k.a. Oracles) • Different areas of expertise • Different costs • Different reliabilities • Different availability • What question to ask and whom to query? • Joint optimization of query & oracle selection • Referrals among Oracles (with referral fees) • Learn about Oracle capabilities as well as solving the Active Learning problem at hand © 2008, Jaime G. Carbonell
