DUAL STRATEGY ACTIVE LEARNING
presenter: Pinar Donmez¹
Joint work with Jaime G. Carbonell¹ & Paul N. Bennett²
¹ Language Technologies Institute, Carnegie Mellon University
² Microsoft Research
Active Learning (Pool-based)
[Diagram: the learning mechanism draws unlabeled data from the data source, sends a label request to the user/expert, receives labeled data, learns a new model, and produces its output.]
Two different trends in Active Learning
• Uncertainty Sampling:
  • selects the example with the lowest certainty
  • e.g. closest to the boundary, maximum entropy, ...
• Density-based Sampling:
  • considers the underlying data distribution
  • selects representatives of large clusters
  • aims to cover the input space quickly
  • e.g. representative sampling, active learning using pre-clustering, etc.
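The two scoring trends can be contrasted in a few lines; this is an illustrative sketch, not from the talk, assuming NumPy and scikit-learn are available, with helper names invented for the example.

```python
# Illustrative sketch contrasting the two query-scoring trends above.
# Assumptions: scikit-learn / NumPy; helper names are invented for illustration.
import numpy as np
from sklearn.neighbors import KernelDensity

def uncertainty_scores(proba):
    """Uncertainty score: entropy of each point's predicted class distribution."""
    eps = 1e-12
    return -np.sum(proba * np.log(proba + eps), axis=1)

def density_scores(X_pool, bandwidth=1.0):
    """Density score: log-density of each pool point under a KDE fit on the pool."""
    kde = KernelDensity(bandwidth=bandwidth).fit(X_pool)
    return kde.score_samples(X_pool)

# Uncertainty sampling queries argmax(uncertainty_scores(model.predict_proba(X_pool)));
# density-based sampling queries representatives of dense regions via density_scores.
```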
Goal of this Work
• Find an active learning method that works well everywhere
  • Some work best when very few instances have been sampled (e.g. density-based sampling)
  • Some work best after substantial sampling (e.g. uncertainty sampling)
• Combine the best of both worlds for superior performance
Main Features of DUAL
• DUAL
  • is dynamic rather than static
  • is context-sensitive
  • builds upon the work titled “Active Learning with Pre-Clustering” (Nguyen & Smeulders, 2004)
  • proposes a mixture model of density and uncertainty
• DUAL’s primary focus is to
  • outperform static strategies over a large operating range
  • improve learning for the later iterations rather than concentrating on the initial data labeling
Active Learning with Pre-Clustering
• We call it Density Weighted Uncertainty Sampling (DWUS in short). Why?
• assumes a hidden clustering structure of the data
• calculates the posterior P(y | x) as
  P(y | x) = Σ_k P(y | k) P(k | x)   [2]
• x and y are conditionally independent given k, since points in one cluster are assumed to share the same label
• selection criterion: pick the unlabeled point that maximizes uncertainty score × density score
  x_s = argmax_{x ∈ U} E[(ŷ − y)² | x] · p(x)   [1]
  with density p(x) = Σ_k p(x | k) P(k)   [3]
Outline of DWUS
1. Cluster the data using the K-medoids algorithm to find the cluster centroids c_k
2. Estimate P(k|x) by a standard EM procedure
3. Model P(y|k) as a logistic regression classifier
4. Estimate P(y|x) using Eq. 2
5. Select an unlabeled instance using Eq. 1
6. Update the parameters of the logistic regression model (hence update P(y|k))
7. Repeat steps 3-5 until the stopping criterion is met
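A minimal sketch of this loop for a binary task; it substitutes k-means for K-medoids, a fixed shared σ for the EM re-estimation, and a plain logistic regression fit on the labeled pool, so all names and simplifications are illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def dwus_select(X_pool, labeled_idx, y_labeled, K=10, sigma=1.0):
    # Step 1: cluster the pool; use the centroids as cluster representatives c_k
    centers = KMeans(n_clusters=K, n_init=10).fit(X_pool).cluster_centers_

    # Step 2: P(k|x) from Gaussian responsibilities with a shared sigma
    d2 = ((X_pool[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    resp = np.exp(-d2 / (2 * sigma ** 2))
    p_k_given_x = resp / resp.sum(axis=1, keepdims=True)
    p_x = resp.mean(axis=1)                       # (unnormalized) density score p(x)

    # Step 3: model P(y=1|k) with a logistic model evaluated at the centroids
    clf = LogisticRegression().fit(X_pool[labeled_idx], y_labeled)
    p_y1_given_k = clf.predict_proba(centers)[:, 1]

    # Step 4: P(y=1|x) = sum_k P(y=1|k) P(k|x)   (Eq. 2)
    p_y1 = p_k_given_x @ p_y1_given_k

    # Step 5: pick the point maximizing uncertainty * density (Eq. 1), using
    # E[(y_hat - y)^2 | x] = min(p, 1 - p) for the 0/1 prediction y_hat
    score = np.minimum(p_y1, 1.0 - p_y1) * p_x
    score[labeled_idx] = -np.inf                  # never re-query labeled points
    return int(np.argmax(score))
```

Steps 6-7 would then retrain the logistic model on the enlarged labeled set and call `dwus_select` again until the labeling budget is exhausted.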
Notes on DWUS
• Posterior class distribution: P(y | x) = Σ_k P(y | k) P(k | x)
• P(y | k) is calculated via a logistic regression model applied to the cluster representative c_k, i.e. P(y | k) = 1 / (1 + exp(−y (w · c_k + b)))
• P(k|x) is estimated using an EM procedure after the clustering
• p(x | k) is a multivariate Gaussian with the same σ for all clusters
• The logistic regression parameters (w, b) are estimated by maximizing the likelihood of the labeled data under this model
Motivation for DUAL
• Strength of DWUS:
  • favors higher-density samples close to the decision boundary
  • fast decrease in error
• But! DWUS exhibits diminishing returns! Why?
  • Early iterations -> many points are highly uncertain
  • Later iterations -> points with high uncertainty are no longer in dense regions
  • DWUS wastes time picking instances with no direct effect on the error
How does DUAL do better?
• Runs DWUS until it estimates a cross-over
  • Monitors the change in expected error at each iteration to detect when DWUS is stuck in a local minimum
• DUAL uses a mixture model after the cross-over (saturation) point
• Our goal should be to minimize the expected future error
  • If we knew the future error of Uncertainty Sampling (US) to be zero, then we'd force the mixture weight π on the uncertainty score to 1
  • But in practice, we do not know it
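The slides do not give the exact saturation test, so the following is only a plausible sketch: declare the cross-over once DWUS's estimated error has stopped decreasing by more than a small threshold for a few consecutive iterations.

```python
def crossed_over(err_history, window=3, min_drop=1e-3):
    """Heuristic cross-over check: True once the estimated error of DWUS has
    dropped by less than min_drop for `window` consecutive iterations."""
    if len(err_history) <= window:
        return False
    recent = err_history[-(window + 1):]
    drops = [recent[i] - recent[i + 1] for i in range(window)]
    return all(d < min_drop for d in drops)
```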
More on DUAL
• After the cross-over, US does better => the uncertainty score should be given more weight
• π should reflect how well US performs
• π can be calculated from the expected error of US on the unlabeled data* (e.g. π = 1 − expected error of US)
• Finally, we have the following selection criterion for DUAL:
  x_s = argmax_{x ∈ U} π · E[(ŷ − y)² | x] + (1 − π) · p(x)
* US is allowed to choose data only from among the already sampled instances, and its error is calculated on the remaining unlabeled set
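Putting the pieces together, a sketch of the post-cross-over selection rule under the weighting assumed above (π = 1 − expected US error); the array arguments and names are illustrative.

```python
import numpy as np

def dual_select(uncertainty, density, us_error, labeled_idx):
    """Post-cross-over DUAL rule: weighted mix of uncertainty and density scores."""
    pi = 1.0 - us_error                   # assumption: weight = 1 - expected US error
    score = pi * uncertainty + (1.0 - pi) * density
    score = score.copy()
    score[labeled_idx] = -np.inf          # skip already labeled points
    return int(np.argmax(score))
```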
A simple Illustration I-IV
[Four scatter plots of a two-class dataset (points labeled 1 and 2), shown over successive sampling iterations.]
Experiments
• initial training set size: 0.4% of the entire data (n+ = n−)
• The results are averaged over 4 runs; each run takes 100 iterations
• DUAL outperforms
  • DWUS with p<0.0001 significance* after the 40th iteration
  • Representative Sampling (p<0.0001) on all datasets
  • COMB (p<0.0001) on 4 datasets, and p<0.05 on Image and M-vs-N
  • US (p<0.001) on 5 datasets
  • DS (p<0.0001) on 5 datasets
* All significance results are based on a 2-sided paired t-test on the classification error
Failure Analysis
• The current estimate of the cross-over point is not accurate on the V-vs-Y dataset => simulate a better error estimator
• Currently, DUAL only considers the performance of US. But on Splice, DS is better => modify the selection criterion so that the weight also reflects the relative performance of DS
Conclusion
• DUAL robustly combines density and uncertainty (and can be generalized to other active sampling methods that exhibit differential performance)
• DUAL leads to more effective performance than either individual strategy
• DUAL shows that the error of one method can be estimated using the data labeled by the other
• DUAL can be applied to multi-class problems, where the error is estimated either globally or at the class or instance level
Future Work
• Generalize DUAL to estimate which method is currently dominant, or use a relative success weight
• Apply DUAL to more than two strategies to maximize the diversity of an ensemble
• Investigate better techniques to estimate the future classification error
• The error expectation for a given point:
  E[(ŷ − y)² | x] = (ŷ − 1)² P(y = 1 | x) + ŷ² P(y = 0 | x)
• Data density is estimated as a mixture of K Gaussians:
  p(x) = Σ_k P(k) p(x | k), with p(x | k) = N(x; c_k, σ² I)
• EM procedure to estimate P(k):
  E-step: P(k | x_i) ∝ P(k) p(x_i | k)
  M-step: P(k) = (1/n) Σ_i P(k | x_i)
• Likelihood:
  L = Σ_i log Σ_k P(k) p(x_i | k)
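A sketch of this estimation step under the stated assumptions (fixed centroids c_k from the clustering, shared σ, only the priors P(k) re-estimated); function and variable names are illustrative.

```python
import numpy as np

def em_cluster_priors(X, centers, sigma=1.0, n_iter=20):
    """Estimate cluster priors P(k) for a shared-sigma Gaussian mixture with fixed centers."""
    K = centers.shape[0]
    p_k = np.full(K, 1.0 / K)                           # initial P(k)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    lik = np.exp(-d2 / (2 * sigma ** 2))                # unnormalized Gaussian p(x_i|k)
    for _ in range(n_iter):
        joint = lik * p_k                               # E-step: P(k|x_i) ∝ P(k) p(x_i|k)
        resp = joint / joint.sum(axis=1, keepdims=True)
        p_k = resp.mean(axis=0)                         # M-step: P(k) = (1/n) Σ_i P(k|x_i)
    joint = lik * p_k
    loglik = np.log(joint.sum(axis=1)).sum()            # log-likelihood (up to constants)
    return p_k, resp, loglik
```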