Learn about active learning methods for accurate classification with minimal cost, including selective sampling, membership queries, uncertainty sampling, and Query by Committee. Explore applications and methodologies in collaborative prediction and Bayesian network learning.
Active Learning. COMS 6998-4: Learning and Empirical Inference. Irina Rish, IBM T.J. Watson Research Center
Outline • Motivation • Active learning approaches • Membership queries • Uncertainty Sampling • Information-based loss functions • Uncertainty-Region Sampling • Query by committee • Applications • Active Collaborative Prediction • Active Bayes net learning
Standard supervised learning model: Given m labeled points, we want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞. VC theory: m needs to be roughly d/ε in the realizable case.
Active learning In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?
What is Active Learning? • Unlabeled data are readily available; labels are expensive • Want to use adaptive decisions to choose which labels to acquire for a given dataset • Goal is accurate classifier with minimal cost
Active learning warning • The choice of data is only as good as the model itself • If we assume a linear model, then two data points are sufficient • What happens when the data are not linear?
Active Learning Flavors • Selective sampling vs. membership queries • Pool vs. sequential • Myopic vs. batch
Active Learning Approaches • Membership queries • Uncertainty Sampling • Information-based loss functions • Uncertainty-Region Sampling • Query by committee
Problem: There are many results in this framework, even for complicated hypothesis classes, but the synthetic queries the learner constructs can be meaningless to human labelers. [Baum and Lang, 1991] tried fitting a neural net to handwritten characters: the synthetic instances created were incomprehensible to humans! [Lewis and Gale, 1992] tried training text classifiers: "an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher."
Uncertainty Sampling [Lewis & Gale, 1994] • Query the event that the current classifier is most uncertain about • Used trivially in SVMs, graphical models, etc. • If uncertainty is measured in Euclidean distance, this amounts to querying the point closest to the decision boundary (a sketch follows below). [figure: unlabeled points scattered around the current decision boundary]
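A minimal sketch of this idea in Python, assuming scikit-learn and a linear SVM; the data and names here are illustrative, not from the original slides:

```python
import numpy as np
from sklearn.svm import SVC

def uncertainty_query(clf, X_pool):
    # The pool example nearest the decision boundary is the one the
    # current classifier is least certain about.
    margins = np.abs(clf.decision_function(X_pool))
    return int(np.argmin(margins))

# Toy usage on synthetic 2-D data.
rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(loc=(+2, 0), size=(10, 2)),
                       rng.normal(loc=(-2, 0), size=(10, 2))])
y_labeled = np.array([1] * 10 + [0] * 10)
X_pool = rng.normal(size=(100, 2))

clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
print("query index:", uncertainty_query(clf, X_pool))
```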
Information-based Loss Function [MacKay, 1992] • Maximize KL-divergence between posterior and prior • Maximize reduction in model entropy between posterior and prior • Minimize cross-entropy between posterior and prior • All of these are notions of information gain
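In symbols, a sketch of the first two of these criteria, writing P(w) for the prior and P(w | D) for the posterior over model parameters:

```latex
% Posterior-prior KL divergence and entropy reduction (sketch)
\mathrm{KL}\bigl(P(w \mid \mathcal{D}) \,\|\, P(w)\bigr)
  = \int P(w \mid \mathcal{D}) \,\log \frac{P(w \mid \mathcal{D})}{P(w)}\, dw ,
\qquad
\Delta H = H\bigl(P(w)\bigr) - H\bigl(P(w \mid \mathcal{D})\bigr)
```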
Query by Committee [Seung et al. 1992, Freund et al. 1997] • Assume a prior distribution over hypotheses • Sample a set of classifiers from that distribution • Query an example based on the degree of disagreement among the committee of classifiers (a sketch follows below). [figure: unlabeled points and the decision boundaries of committee members A, B, C]
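A rough Python sketch under simplifying assumptions: bootstrap resampling with decision trees stands in for sampling classifiers from the posterior, disagreement is measured by vote entropy, and labels are assumed to be coded 0/1:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def qbc_query(X_lab, y_lab, X_pool, n_committee=5, seed=0):
    """Pick the pool example the committee disagrees on most (vote entropy).
    Assumes binary labels 0/1; bootstrap resampling is a pragmatic stand-in
    for sampling hypotheses from the posterior."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((len(X_pool), 2))
    for _ in range(n_committee):
        idx = rng.integers(0, len(X_lab), size=len(X_lab))  # bootstrap sample
        member = DecisionTreeClassifier().fit(X_lab[idx], y_lab[idx])
        preds = member.predict(X_pool).astype(int)
        votes[np.arange(len(X_pool)), preds] += 1
    p = votes / n_committee
    vote_entropy = -np.sum(p * np.log(np.clip(p, 1e-12, None)), axis=1)
    return int(np.argmax(vote_entropy))
```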
Notation • We Have: • Dataset, D • Model parameter space, W • Query algorithm, q
Model Example [figure: two-node graphical model St → Ot] Probabilistic classifier notation: • T: number of examples • Ot: vector of features of example t • St: class of example t
Model Example Patient state (St): • St: DiseaseState Patient observations (Ot): • Ot1: Gender • Ot2: Age • Ot3: TestA • Ot4: TestB • Ot5: TestC
Possible Model Structures [figure: several candidate network structures over S and the observations Gender, Age, TestA, TestB, TestC]
Model Space [figure: two-node model St → Ot] Model parameters: P(St) and P(Ot | St). Generative model: must be able to compute P(St = i, Ot = ot | w) (a sketch follows below).
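For instance, under a naive-Bayes-style factorization (an assumption here; the slides only require that the joint be computable), the joint is the class prior times per-feature conditionals. All names and numbers below are illustrative:

```python
import numpy as np

def joint_prob(s, o, prior, cond):
    """P(S = s, O = o | w) for a model that factors as
    P(S) * prod_j P(O_j | S).
    prior[s] = P(S = s); cond[j][s][o_j] = P(O_j = o_j | S = s)."""
    p = prior[s]
    for j, oj in enumerate(o):
        p *= cond[j][s][oj]
    return p

# Toy example: binary class, two binary observations.
prior = np.array([0.7, 0.3])
cond = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # P(O1 | S)
        np.array([[0.6, 0.4], [0.5, 0.5]])]   # P(O2 | S)
print(joint_prob(1, (0, 1), prior, cond))     # 0.3 * 0.2 * 0.5 = 0.03
```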
Model Parameter Space (W) • W = space of possible parameter values • Prior on parameters: P(W) • Posterior over models: P(W | D) ∝ P(D | W) P(W)
Notation • We Have: • Dataset, D • Model parameter space, W • Query algorithm, q q(W,D) returns t*, the next sample to label
Game while NotDone: • Learn P(W | D) • q chooses the next example to label • Expert adds the label to D (A sketch of this loop appears below.)
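The loop as a Python sketch; `learn_posterior`, `q`, and `ask_expert` are caller-supplied placeholders for the components named above, not part of the original slides:

```python
def active_learning_game(D, pool, budget, learn_posterior, q, ask_expert):
    """One round per query: learn P(W | D), let q pick an example,
    have the expert label it, and add it to D."""
    for _ in range(budget):
        posterior = learn_posterior(D)        # learn P(W | D)
        t_star = q(posterior, pool)           # q chooses the next example
        x = pool.pop(t_star)
        D.append((x, ask_expert(x)))          # expert adds the label to D
    return D
```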
Simulation [figure: pool of examples O1…O7 with hidden classes S1…S7; a few labels known (S2 = false, S5 = true, S7 = false); the query algorithm q ponders which example to label next]
Active Learning Flavors • Pool (“random access” to patients) • Sequential (must decide as patients walk in the door)
q? • Recall: q(W, D) returns the “most interesting” unlabeled example. • Well, what makes a doctor curious about a patient?
Uncertainty Sampling Example [figure: predicted class probabilities (FALSE vs. TRUE) for each unlabeled example]
Uncertainty Sampling • GOOD: couldn’t be easier • GOOD: often performs pretty well • BAD: H(St) measures information gain about the samples, not the model, and is sensitive to noisy samples (An entropy-based sketch follows.)
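For a probabilistic classifier, the criterion reads: query the example whose predicted label distribution has the highest entropy H(St). A minimal sketch, assuming class probabilities are available (e.g. from a scikit-learn classifier’s `predict_proba`):

```python
import numpy as np

def entropy_query(probs):
    """probs: array of shape (n_pool, n_classes) holding P(St | Ot, D).
    Returns the index of the example with the largest label entropy H(St)."""
    h = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)), axis=1)
    return int(np.argmax(h))
```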
Model Entropy [figure: three posteriors P(W | D) over W, from diffuse (H(W) high), to more peaked (better), to a point mass (H(W) = 0)]
Information Gain • Choose the example that is expected to most reduce H(W) • I.e., maximize H(W) − H(W | St): the current model-space entropy minus the expected model-space entropy if we learn St (written out below).
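Written out as a sketch, where the expectation is over the unknown label St under the current predictive distribution:

```latex
% Information gain of querying example t
\mathrm{IG}(t) = H(W)
  - \mathbb{E}_{S_t \sim P(S_t \mid \mathcal{D})}\bigl[H(W \mid S_t)\bigr],
\qquad
H(W) = -\int P(w \mid \mathcal{D}) \,\log P(w \mid \mathcal{D})\, dw
```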
We usually can’t just sum over all models w to compute H(W | St) exactly… but we can sample from P(W | D) (a sketch follows below).
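A sketch of that Monte Carlo shortcut: draw models w(1), …, w(K) from P(W | D) and average their predictions instead of integrating over all of W. The posterior sampler itself is left abstract, and the names below are illustrative:

```python
import numpy as np

def predictive_dist(posterior_samples, x, n_classes):
    """Monte Carlo approximation of P(St | Ot = x, D): average the
    predictions of models w drawn from P(W | D). Each sample is assumed
    to be a callable returning a length-n_classes probability vector."""
    p = np.zeros(n_classes)
    for w in posterior_samples:
        p += w(x)                 # P(St | x, w) for one sampled model
    return p / len(posterior_samples)
```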
Score Function [equation shown as an image in the original slide; not preserved] Familiar?
If our objective is to reduce the prediction error, then “the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries”