A Theoretical Model for Learning from Labeled and Unlabeled Data
Maria-Florina Balcan & Avrim Blum
Carnegie Mellon University, Computer Science Department
What is Machine Learning?
• Design of programs that adapt from experience, identify patterns in data.
• Used to:
  • recognize speech, faces, …
  • categorize documents, info retrieval, …
• Goals of ML theory: develop models, analyze algorithmic and statistical issues involved.
Outline of the talk
• Brief Overview of Supervised Learning
  • PAC Model
• Semi-Supervised Learning
  • An Augmented PAC-Style Model
Usual Supervised Learning Problem
• Decide which email messages are spam and which are important.
• Might represent each message by n features (e.g., keywords, spelling, etc.).
• Take a sample S of data, labeled according to whether they were/weren't spam.
• Goal of the algorithm is to use the data seen so far to produce a good prediction rule h (a "hypothesis") for future data.
The Concept Learning Setting
[Table: example feature vectors (e.g., money, pills, known, unknown) with spam/not-spam labels]
• Given data, some reasonable rules might be:
  • Predict SPAM if unknown AND (money OR pills)
  • Predict SPAM if money + pills – known > 0
  • … (see the sketch below)
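Not part of the deck: a tiny sketch of what the two example rules look like in code, assuming a hypothetical boolean-feature representation of a message (the feature names come from the slide; everything else is illustrative):

```python
# Hypothetical representation: each message is a dict of 0/1 keyword features.
def rule_boolean(msg):
    # "Predict SPAM if unknown AND (money OR pills)"
    return msg["unknown"] and (msg["money"] or msg["pills"])

def rule_linear(msg):
    # "Predict SPAM if money + pills - known > 0" (a linear threshold rule)
    return msg["money"] + msg["pills"] - msg["known"] > 0

msg = {"unknown": 1, "money": 1, "pills": 0, "known": 0}
print(rule_boolean(msg), rule_linear(msg))  # 1 True -> both predict SPAM
```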
Supervised Learning, Big Questions
• Algorithm Design. How to optimize?
  • How might we automatically generate rules that do well on observed data?
• Sample Complexity/Confidence Bounds
  • Real goal is to do well on new data.
  • What kind of confidence do we have that rules that do well on the sample will do well in the future?
  • For a given learning algorithm, how much data do we need…
Supervised Learning: Formalization (PAC)
• PAC model – standard model for learning from labeled data.
• Have sample S = {(x, l)} drawn from some distribution D over examples x ∈ X, labeled by some target function c*.
• Algorithm does optimization over S to produce some hypothesis h ∈ C (e.g., C = linear separators).
• Goal is for h to be close to c* over D:
  err(h) = Pr_{x ∼ D}[h(x) ≠ c*(x)]
• Allow failure with small probability δ (to allow for the chance that S is not representative).
The Issue of Sample Complexity
• We want to do well on D, but all we have is S. Are we in trouble?
• How big does S have to be so that low error on S implies low error on D?
• Luckily, sample-complexity bounds.
• Algorithm: pick a concept that agrees with S.
• Sample Complexity Statement:
  If |S| ≥ (1/ε)[ln|C| + ln(1/δ)], then with probability at least 1 − δ, all h ∈ C that agree with sample S have true error ≤ ε. (A worked example follows below.)
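As an aside (not from the talk), plugging numbers into this bound shows how mildly the labeled-data requirement grows with |C|; a minimal sketch, assuming natural logarithms and an illustrative concept class size:

```python
import math

def occam_sample_size(num_concepts, eps, delta):
    """Labeled examples sufficient so that, w.p. >= 1 - delta, every
    concept consistent with the sample has true error <= eps."""
    return math.ceil((1.0 / eps) * (math.log(num_concepts) + math.log(1.0 / delta)))

# e.g., |C| = 2^30 boolean rules, eps = 0.05, delta = 0.01:
print(occam_sample_size(2**30, 0.05, 0.01))  # 508
```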
Outline of the talk
• Brief Overview of Supervised Learning
  • PAC Model
• Semi-Supervised Learning
  • An Augmented PAC-Style Model
Combining Labeled and Unlabeled Data
• Hot topic in recent years in Machine Learning.
• Many applications have lots of unlabeled data, but labeled data is rare or expensive:
  • Web page, document classification
  • OCR, image classification
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
  • Transductive SVM
  • Co-training
  • Graph-based methods (see the sketch below)
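As a concrete illustration of the graph-based family (not from the talk), here is a minimal sketch using scikit-learn's LabelSpreading, which propagates the few known labels over a similarity graph; the dataset, the choice of 6 labels, and the kernel parameters are all illustrative:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
y_partial = np.full_like(y, -1)          # -1 marks unlabeled points
labeled_idx = np.random.RandomState(0).choice(len(y), size=6, replace=False)
y_partial[labeled_idx] = y[labeled_idx]  # keep only 6 labels

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)
print((model.transduction_ == y).mean())  # accuracy over all points
```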
An Augmented PAC-Style Model for Semi-Supervised Learning
• Extends PAC naturally to the case of learning from both labeled and unlabeled data.
• Unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
• Different algorithms are based on different assumptions about how data should behave.
• Question – how to capture many of the assumptions typically used?
Example of "typical" assumption
[Figure: the same +/− data separated three ways: labeled data only, SVM, and Transductive SVM]
• The separator goes through low-density regions of the space / has large margin.
  • Assume we are looking for a linear separator.
  • Belief: there should exist one with large separation. (A sketch of the transductive SVM objective follows below.)
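For reference (not from the slides), the transductive SVM makes this belief explicit by optimizing over the labels of the unlabeled points as well; a hard-margin sketch of the standard formulation, where x_i are labeled and x_j^u are unlabeled:

```latex
\min_{w,\,b,\;\{y_j^u\}\in\{\pm 1\}} \ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad
y_i\,(w\cdot x_i + b) \ge 1 \ \ \text{(labeled)}, \qquad
y_j^u\,(w\cdot x_j^u + b) \ge 1 \ \ \text{(unlabeled)}.
```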
Another Example
• Agreement between two parts: co-training.
  • Examples contain two sufficient sets of features, i.e., an example is x = ⟨x1, x2⟩.
  • Belief: the two parts of the example are consistent, i.e., ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x).
  • For example, if we want to classify web pages:
Another Example, cont
[Figure: a web page as a two-view example x = ⟨x1, x2⟩, where x1 = link info (e.g., the anchor text "My Advisor" on a page linking to it) and x2 = text info (e.g., "Prof. Avrim Blum" on the page itself)]
Co-Training
[Figure: labels propagating between the link-info view (x1) and the text-info view (x2)]
• Works by using unlabeled data to propagate learned information. (A sketch of the algorithm follows below.)
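Not part of the deck: a minimal co-training sketch in the spirit of Blum & Mitchell, assuming a synthetic two-view dataset; the classifier choice, the 4-point confidence cutoff, and the number of rounds are all illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X1, X2 = X[:, :10], X[:, 10:]             # two "views" of each example
labeled = rng.choice(len(y), 20, replace=False)
pool = np.setdiff1d(np.arange(len(y)), labeled)
L_idx, L_y = list(labeled), list(y[labeled])

for _ in range(10):                        # a few co-training rounds
    h1 = LogisticRegression().fit(X1[L_idx], L_y)
    h2 = LogisticRegression().fit(X2[L_idx], L_y)
    for h, Xv in ((h1, X1), (h2, X2)):     # each view labels points for the other
        if len(pool) == 0:
            break
        conf = h.predict_proba(Xv[pool]).max(axis=1)
        pick = pool[np.argsort(conf)[-4:]]          # 4 most confident pool points
        L_idx += list(pick)
        L_y += list(h.predict(Xv[pick]))            # pseudo-labels, may be noisy
        pool = np.setdiff1d(pool, pick)

print("final labeled-pool size:", len(L_idx))
```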
Semi-Supervised Learning Formalization. Main Idea
• Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution (χ(h, D) ∈ [0,1]).
• "Learn C" becomes "learn (C, χ)" (i.e., learn class C under compatibility notion χ).
• Express relationships that one hopes the target function and underlying distribution will possess.
• Use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.
Semi-Supervised Learning Formalization. Main Idea, cont
• Use unlabeled data & our belief to reduce size(C) down to size(highly compatible functions in C) in our sample complexity bounds.
• Need to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
• Require that the degree of compatibility be something that can be estimated from a finite sample.
Margins, Compatibility
[Figure: two consistent separators on +/− data; the one with a wide empty band around it is highly compatible]
• Margins: belief is that there should exist a separator with margin γ.
• χ(h, D) = 1 − (the probability mass within distance γ of h).
• χ can be estimated from a finite sample. (See the sketch below.)
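A minimal sketch (not from the slides) of estimating this margin compatibility from an unlabeled sample, for a linear separator h given by (w, b); the uniform data and the value of γ are illustrative:

```python
import numpy as np

def margin_compatibility(w, b, X_unlabeled, gamma):
    """Empirical chi(h, D) = 1 - fraction of unlabeled points whose
    distance to the hyperplane w.x + b = 0 is less than gamma."""
    dists = np.abs(X_unlabeled @ w + b) / np.linalg.norm(w)
    return 1.0 - np.mean(dists < gamma)

rng = np.random.RandomState(0)
X_u = rng.uniform(-1, 1, size=(1000, 2))      # unlabeled sample
w, b = np.array([1.0, 0.0]), 0.0              # separator: x_1 = 0
print(margin_compatibility(w, b, X_u, gamma=0.1))  # ~0.9 for this data
```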
Types of Results in Our Model
• As in the usual PAC model, can discuss algorithmic and sample complexity issues.
• Can analyze how much unlabeled data we need to see:
  • depends both on the complexity of C and the complexity of our notion of compatibility.
• Can analyze the ability of a finite unlabeled sample to reduce our dependence on labeled examples:
  • as a function of the compatibility of the target function and various measures of the helpfulness of the distribution.
Examples of Results in Our Model
• Algorithm: pick a compatible concept that agrees with the labeled sample.
• Sample Complexity Statement: (see the reconstruction below)
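The formula itself was an image on the slide; what follows is a hedged reconstruction in the spirit of the Balcan–Blum result, writing C_{D,χ}(ε) for the subset of C that is highly compatible with D (the exact constants may differ from the slide):

```latex
% Unlabeled sample size (to uniformly estimate compatibilities):
m_u \;=\; O\!\left(\frac{1}{\epsilon^{2}}\left(\ln|C| + \ln\frac{1}{\delta}\right)\right)
% Labeled sample size (now depends only on the compatible subset of C):
m_l \;\ge\; \frac{1}{\epsilon}\left(\ln\bigl|C_{D,\chi}(\epsilon)\bigr| + \ln\frac{1}{\delta}\right)
```

With probability at least 1 − δ, all compatible h ∈ C that agree with the labeled sample then have true error at most ε; the gain over the purely supervised bound is that ln|C| is replaced by ln|C_{D,χ}(ε)|.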
Examples of Results in Our Model, cont
[Figure: of the separators consistent with the labeled data, only the highly compatible ones remain candidates]
• Algorithm: pick a compatible concept that agrees with the labeled sample.
Summary
• Provided a PAC-style model for semi-supervised learning.
• Captures many of the ways in which unlabeled data is typically used.
• Unified framework for analyzing when and why unlabeled data can help.
• Can get much better bounds in terms of labeled examples.
Thank you!