
A Theoretical Model for Learning from Labeled and Unlabeled Data


  1. A Theoretical Model for Learning from Labeled and Unlabeled Data Maria-Florina Balcan & Avrim Blum Carnegie Mellon University, Computer Science Department Maria-Florina Balcan

  2. What is Machine Learning? • Design of programs that adapt from experience and identify patterns in data. • Used to: • recognize speech, faces, … • categorize documents, information retrieval, ... • Goals of ML theory: develop models, analyze the algorithmic and statistical issues involved.

  3. Outline of the talk • Brief Overview of Supervised Learning • PAC Model • Semi-Supervised Learning • An Augmented PAC Style Model

  4. Usual Supervised Learning Problem • Decide which email messages are spam and which are important. • Might represent each message by n features (e.g., keywords, spelling, etc.). • Take a sample S of data, labeled according to whether each message was or wasn't spam. • Goal of the algorithm is to use the data seen so far to produce a good prediction rule h (a "hypothesis") for future data.

  5. The Concept Learning Setting [Figure: example messages with their labels] • Given data, some reasonable rules might be: • Predict SPAM if unknown AND (money OR pills) • Predict SPAM if money + pills − known > 0 • ...
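The second rule above is a linear threshold function over boolean keyword features. A minimal sketch, assuming bag-of-keywords features (the feature names come from the slide; the helper and messages are made up for illustration):

```python
# Hypothetical sketch of the rule "money + pills - known > 0" as a linear
# threshold function over boolean keyword features (names from the slide).

def predict_spam(msg: str) -> bool:
    f = {w: int(w in msg.lower()) for w in ("money", "pills", "known")}
    # Predict SPAM if money + pills - known > 0
    return f["money"] + f["pills"] - f["known"] > 0
```

A message mentioning money and pills from a sender who is not known crosses the threshold; a message about money from a known colleague does not.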

  6. Supervised Learning, Big Questions • Algorithm Design. How do we optimize? • How might we automatically generate rules that do well on observed data? • Sample Complexity/Confidence Bounds • The real goal is to do well on new data. • What kind of confidence do we have that rules that do well on the sample will do well in the future? • For a given learning algorithm, how much data do we need?

  7. Supervised Learning: Formalization (PAC) • PAC model – standard model for learning from labeled data. • Have sample S = {(x, l)} drawn from some distribution D over examples x ∈ X, labeled by some target function c*. • Algorithm does optimization over S to produce some hypothesis h ∈ C (e.g., C = linear separators). • Goal is for h to be close to c* over D: err(h) = Pr_{x∼D}(h(x) ≠ c*(x)). • Allow failure with small probability δ (to allow for the chance that S is not representative).

  8. The Issue of Sample Complexity • We want to do well on D, but all we have is S. • Are we in trouble? • How big does S have to be so that low error on S implies low error on D? • Luckily, sample-complexity bounds. • Algorithm: pick a concept that agrees with S. • Sample Complexity Statement: • If |S| ≥ (1/ε)[ln|C| + ln(1/δ)], then with probability at least 1 − δ, all h ∈ C that agree with sample S have true error ≤ ε.
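Plugging numbers into this bound gives a feel for the scale. A small sketch, where the concrete |C|, ε, δ values are made-up illustrations:

```python
import math

# Occam-style bound from the slide: if |S| >= (1/eps) * (ln|C| + ln(1/delta)),
# then with probability >= 1 - delta, every h in C consistent with S has
# true error <= eps. The numbers below are illustrative only.

def sample_size(num_concepts: int, eps: float, delta: float) -> int:
    return math.ceil((1 / eps) * (math.log(num_concepts) + math.log(1 / delta)))

# e.g. |C| = 2**20 boolean rules, eps = 0.1, delta = 0.05:
n = sample_size(2**20, 0.1, 0.05)  # a couple hundred labeled examples
```

Note the logarithmic dependence on |C|: squaring the size of the concept class only doubles its contribution to the labeled-sample requirement.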

  9. Outline of the talk • Brief Overview of Supervised Learning • PAC Model • Semi-Supervised Learning • An Augmented PAC Style Model

  10. Combining Labeled and Unlabeled Data • Hot topic in recent years in Machine Learning. • Many applications have lots of unlabeled data, but labeled data is rare or expensive: • Web page, document classification • OCR, Image classification • Several methods have been developed to try to use unlabeled data to improve performance, e.g.: • Transductive SVM • Co-training • Graph-based methods

  11. An Augmented PAC style Model for Semi-Supervised Learning • Extends PAC naturally to the case of learning from both labeled and unlabeled data. • Unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution. • Different algorithms are based on different assumptions about how data should behave. • Question – how to capture many of the assumptions typically used?

  12. Example of "typical" assumption [Figure: + and − points with three separators: labeled data only, SVM, and transductive SVM] • The separator goes through low-density regions of the space / has large margin. • Assume we are looking for a linear separator. • Belief: there should exist one with large separation.

  13. Another Example • Agreement between two parts: co-training. • Examples contain two sufficient sets of features, i.e., an example is x = ⟨x1, x2⟩. • Belief: the two parts of the example are consistent: ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x). • For example, if we want to classify web pages:

  14. Another Example, cont [Figure: a web page x = ⟨x1, x2⟩ with two views, x1 = link info and x2 = text info, both reading "Prof. Avrim Blum, My Advisor"] • Agreement between two parts: co-training. • Examples contain two sufficient sets of features, i.e., an example is x = ⟨x1, x2⟩. • Belief: the two parts of the example are consistent: ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x). • For example, if we want to classify web pages: x = ⟨x1, x2⟩.

  15. Co-Training [Figure: the link-info view (X1) and text-info view (X2) of the "My Advisor" page] • Works by using unlabeled data to propagate learned information.
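The propagation idea can be sketched as a toy two-view self-labeling loop. This is a caricature for illustration, with a made-up dataset and a trivial per-view "majority label" learner, not the actual co-training algorithm from the literature:

```python
# Toy caricature of co-training on two-view examples x = (x1, x2).
# The data and the per-view learner are made-up illustrations.

def train_view(labeled, view):
    """Trivial per-view learner: majority label for each feature value seen."""
    seen = {}
    for x, y in labeled:
        seen.setdefault(x[view], []).append(y)
    return {v: max(set(ys), key=ys.count) for v, ys in seen.items()}

def cotrain(labeled, unlabeled, rounds=5):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        h1, h2 = train_view(labeled, 0), train_view(labeled, 1)
        # Each view labels the unlabeled examples it recognizes; those newly
        # labeled examples then teach the *other* view in the next round.
        confident = [(x, h1[x[0]] if x[0] in h1 else h2[x[1]])
                     for x in pool if x[0] in h1 or x[1] in h2]
        if not confident:
            break
        labeled += confident
        pool = [x for x in pool if x[0] not in h1 and x[1] not in h2]
    return train_view(labeled, 0), train_view(labeled, 1)

# View values "A"/"x" co-occur with label 1; co-training propagates that
# through the shared second view "z" to the never-directly-labeled "C".
h1, h2 = cotrain(labeled=[(("A", "x"), 1), (("B", "y"), 0)],
                 unlabeled=[("A", "z"), ("C", "z"), ("B", "w"), ("D", "w")])
```

After two rounds the example ("C", "z") gets the positive label even though neither of its view values appeared in the original labeled sample: "A" labeled ("A", "z"), and "z" then labeled ("C", "z").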

  16. Semi-Supervised Learning Formalization. Main Idea • Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution (χ(h, D) ∈ [0,1]). • "Learn C" becomes "learn (C, χ)" (i.e., learn class C under compatibility notion χ). • Express relationships that one hopes the target function and underlying distribution will possess.

  17. Semi-Supervised Learning Formalization. Main Idea • Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution (χ(h, D) ∈ [0,1]). • "Learn C" becomes "learn (C, χ)" (i.e., learn class C under compatibility notion χ). • Express relationships that one hopes the target function and underlying distribution will possess. • Use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.

  18. Semi-Supervised Learning Formalization. Main Idea, cont • Use unlabeled data & our belief to reduce size(C) down to size({highly compatible functions in C}) in our sample complexity bounds. • Need to analyze how much unlabeled data is needed to uniformly estimate compatibilities well. • Require that the degree of compatibility be something that can be estimated from a finite sample.
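One way to picture the pruning step: with one-dimensional threshold concepts and a margin-style compatibility notion. All the concepts, data values, and parameter names below are made-up illustrations, not the paper's construction:

```python
# Illustrative sketch: use an unlabeled sample to prune C down to the highly
# compatible concepts before applying the labeled-sample bound.
# Concepts: thresholds t on the line. Compatibility of t: the fraction of
# unlabeled points farther than gamma from t. All values are made up.

def compatible_subset(thresholds, unlabeled, gamma, tau):
    def chi(t):  # estimated compatibility of threshold t
        return sum(1 for x in unlabeled if abs(x - t) > gamma) / len(unlabeled)
    return [t for t in thresholds if chi(t) >= tau]

# Unlabeled data clusters around 0.2 and 0.8, so only a threshold cutting
# through the low-density middle survives the pruning:
survivors = compatible_subset([0.2, 0.5, 0.8],
                              [0.1, 0.15, 0.2, 0.25, 0.8, 0.85, 0.9],
                              gamma=0.1, tau=0.9)
```

The labeled-sample bound is then applied to the surviving concepts only, which is where the savings in labeled examples comes from.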

  19. Margins, Compatibility [Figure: a highly compatible separator with few + and − points within its margin] • Margins: belief is that there should exist a separator with margin γ. • χ(h, D) = 1 − (the probability mass within distance γ of h). • Can be estimated from a finite sample.

  20. Types of Results in Our Model • As in the usual PAC model, can discuss algorithmic and sample complexity issues. • Can analyze how much unlabeled data we need to see: • depends both on the complexity of C and the complexity of our notion of compatibility. • Can analyze the ability of a finite unlabeled sample to reduce our dependence on labeled examples: • as a function of compatibility of the target function and various measures of the helpfulness of the distribution.

  21. Examples of Results in Our Model • Algorithm: pick a compatible concept that agrees with the labeled sample. • Sample Complexity Statement:
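The statement on this slide was rendered as an image in the original deck. In the Balcan–Blum model it has roughly the following form, where C_{D,χ}(ε) denotes the concepts in C with compatibility at least 1 − ε; this is a hedged reconstruction of its shape, not a verbatim quote, so consult the paper for the exact statement and constants:

```latex
% Sketch of the form of the semi-supervised bound: with probability at
% least 1 - \delta, every highly compatible h \in C that agrees with the
% labeled sample S has true error at most \epsilon, provided
|S| \;\ge\; \frac{1}{\epsilon}\left[\ln\bigl|C_{D,\chi}(\epsilon)\bigr| + \ln\frac{1}{\delta}\right].
```

Compared with the purely supervised bound on slide 8, ln|C| has been replaced by ln|C_{D,χ}(ε)|, which is the payoff of the unlabeled-data pruning.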

  22. Examples of Results in Our Model, cont. [Figure: a highly compatible separator on + and − points] • Algorithm: pick a compatible concept that agrees with the labeled sample. • Sample Complexity Statement:

  23. Summary • Provided a PAC style model for semi-supervised learning. • Captures many of the ways in which unlabeled data is typically used. • Unified framework for analyzing when and why unlabeled data can help. • Can get much better bounds in terms of labeled examples.

  24. Thank you!
