
Probably Approximately Correct Learning (PAC)






  1. Probably Approximately Correct Learning (PAC) Leslie G. Valiant. A Theory of the Learnable. Comm. ACM (1984) 1134-1142

  2. Recall: Bayesian learning • Create a model based on some parameters • Assume some prior distribution on those parameters • Learning problem: • adjust the model parameters so as to maximize the likelihood of the model given the data • use Bayes' formula for that.

  3. PAC Learning • Given a distribution D over observables X • Given a family of functions (concepts) F • For each x ∈ X and f ∈ F: f(x) provides the label for x • Given a family of hypotheses H, seek a hypothesis h such that Error(h) = Pr_{x∼D}[f(x) ≠ h(x)] is minimal

  4. PAC New Concepts • Large family of distributions D • Large family of concepts F • Family of hypotheses H • Main questions: • Is there a hypothesis h ∈ H that can be learned? • How fast can it be learned? • What is the error that can be expected?

  5. Estimation vs. approximation • Note: • The distribution D is fixed • There is no noise in the system (currently) • F is a space of binary functions (concepts) • This is thus an approximation problem, as the function is given exactly for each x ∈ X • Estimation problem: f is not given exactly but estimated from noisy data

  6. Example (PAC) • Concept: Average body-size person • Inputs: for each person: • height • weight • Sample: labeled examples of persons • label + : average body-size • label - : not average body-size • Two dimensional inputs

  7. Observable space X with concept f

  8. Example (PAC) • Assumption: target concept is a rectangle. • Goal: • Find a rectangle h that “approximates” the target • Hypothesis family H of rectangles • Formally: • With high probability • output a rectangle such that • its error is low.

  9. Example (Modeling) • Assume: • Fixed distribution over persons. • Goal: • Low error with respect to THIS distribution!!! • What does the distribution look like? • Highly complex. • Each parameter is non-uniform. • Parameters are highly correlated.

  10. Model Based approach • First try to model the distribution. • Given a model of the distribution: • find an optimal decision rule. • Bayesian Learning

  11. PAC approach • Assume that the distribution is fixed. • Samples are drawn i.i.d. • independent • identically distributed • Concentrate on the decision rule rather than on the distribution.

  12. PAC Learning • Task: learn a rectangle from examples. • Input: points (x, f(x)) with classification + or −; f classifies by a rectangle R • Goal: • using the fewest examples • compute h • such that h is a good approximation of f

  13. PAC Learning: Accuracy • Testing the accuracy of a hypothesis: • use the distribution D of examples. • Error = h Δ f (symmetric difference) • Pr[Error] = D(Error) = D(h Δ f) • We would like Pr[Error] to be controllable. • Given a parameter ε: • find h such that Pr[Error] < ε.
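
Since D is only accessible through sampling, Pr[Error] can be estimated empirically. A minimal sketch (not part of the original slides), assuming f, h, and a sampler sample_from_D for the distribution D are supplied as callables:

def estimate_error(f, h, sample_from_D, n_samples=10_000):
    """Monte Carlo estimate of Pr[Error] = D(h Δ f),
    the probability under D that h and f disagree."""
    disagreements = 0
    for _ in range(n_samples):
        x = sample_from_D()      # draw x according to D
        if f(x) != h(x):         # x lies in the symmetric difference h Δ f
            disagreements += 1
    return disagreements / n_samples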

  14. PAC Learning: Hypothesis • Which Rectangle should we choose? • Similar to parametric modeling?

  15. Setting up the Analysis: • Choose the smallest consistent rectangle. • Need to show: • for any distribution D and any target rectangle f • given input parameters ε and δ • select m(ε,δ) examples • let h be the smallest consistent rectangle • then with probability 1−δ (over the sample): D(f Δ h) < ε
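
A minimal sketch of the "smallest consistent rectangle" learner for the two-dimensional example. The input format (a list of ((x, y), label) pairs with label True for '+') is an assumption, not taken from the slides:

def smallest_consistent_rectangle(samples):
    """Return the tightest axis-aligned rectangle containing every
    positively labeled point (assumes at least one '+' example)."""
    positives = [p for p, label in samples if label]
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)

    def h(point):
        # Classify '+' iff the point falls inside the learned rectangle.
        x, y = point
        return x_lo <= x <= x_hi and y_lo <= y <= y_hi

    return h

# Usage: h = smallest_consistent_rectangle(labeled_sample); h((1.7, 0.6))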

  16. More general case (no rectangle) • A distribution: D (unknown) • Target function: ct from C, ct : X → {0,1} • Hypothesis: h from H, h : X → {0,1} • Error probability: error(h) = Prob_D[h(x) ≠ ct(x)] • Oracle: EX(ct, D)

  17. PAC Learning: Definition • C and H are concept classes over X. • C is PAC learnable by H if • there exists an algorithm A such that: • for any distribution D over X and any ct in C • for every input ε and δ • A outputs a hypothesis h in H, • while having access to EX(ct, D), • such that with probability 1−δ we have error(h) < ε • Complexities.

  18. Finite Concept class • Assume C = H and finite. • h is ε-bad if error(h) > ε. • Algorithm: • sample a set S of m(ε,δ) examples. • find h in H which is consistent. • The algorithm fails if h is ε-bad.

  19. PAC learning: formalization (1) • X is the set of all possible examples • D is the distribution from which the examples are drawn • H is the set of all possible hypotheses, c ∈ H • m is the number of training examples • error(h) = Pr(h(x) ≠ c(x) | x is drawn from X according to D) • h is approximately correct if error(h) ≤ ε

  20. PAC learning: formalization (2) • To show: after m examples, with high probability, all consistent hypotheses are approximately correct. • All consistent hypotheses lie in an ε-ball around c; the remaining hypotheses form H_bad ⊆ H.

  21. Complexity analysis (1) • The probability that a hypothesis h_bad ∈ H_bad is consistent with the first m examples: error(h_bad) > ε by definition. The probability that it agrees with a single example is thus at most (1 − ε), and with m examples at most (1 − ε)^m.

  22. Complexity analysis (2) • For H_bad to contain a consistent hypothesis, at least one hypothesis in it must be consistent: Pr(H_bad has a consistent hypothesis) ≤ |H_bad|(1 − ε)^m ≤ |H|(1 − ε)^m

  23. Complexity analysis (3) • To reduce the probability of error below δ: |H|(1 − ε)^m ≤ δ • This holds when at least m ≥ (1/ε)(ln(1/δ) + ln |H|) examples are seen. • This is the sample complexity.
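
The step from |H|(1 − ε)^m ≤ δ to the stated bound uses (1 − ε) ≤ e^(−ε); the calculation, spelled out:

\Pr[\text{some } \varepsilon\text{-bad } h \text{ is consistent}]
  \;\le\; |H|\,(1-\varepsilon)^m
  \;\le\; |H|\,e^{-\varepsilon m} \;\le\; \delta
\quad\Longleftrightarrow\quad
m \;\ge\; \frac{1}{\varepsilon}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr)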

  24. Complexity analysis (4) • "At least m examples are necessary to build a consistent hypothesis h that has error at most ε with probability 1 − δ." • Since |H| = 2^(2^n) for arbitrary boolean functions over n attributes, the sample complexity grows exponentially with n • Conclusion: learning an arbitrary boolean function is no better in the worst case than table lookup!

  25. PAC learning -- observations • "Hypothesis h(X) is consistent with m examples and has an error of at most ε with probability 1 − δ." • This is a worst-case analysis. Note that the result is independent of the distribution D! • Growth rate analysis: • as ε → 0, m grows proportionally to 1/ε • as δ → 0, m grows logarithmically in 1/δ • as |H| → ∞, m grows logarithmically in |H|

  26. PAC: comments • We only assumed that examples are i.i.d. • We have two independent parameters: • accuracy ε • confidence δ • No assumption about the likelihood of a hypothesis. • The hypothesis is tested on the same distribution as the sample.

  27. PAC: non-feasible case • What happens if ct is not in H? • We need to redefine the goal. • Let h* in H minimize the error: b = error(h*) • Goal: find h in H such that error(h) ≤ error(h*) + ε = b + ε

  28. Analysis* • For each h in H: • let obs-error(h) be the average error on the sample S. • Bound the probability of a large deviation (Chernoff bound): Pr[|obs-error(h) − error(h)| ≥ ε/2] < exp(−(ε/2)²m) • Union bound over the entire H: Pr < |H| exp(−(ε/2)²m) • Sample size: m > (4/ε²) ln(|H|/δ)
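
Spelling out the sample-size step: applying the union bound over H to the slide's form of the Chernoff bound and requiring the failure probability to be at most δ gives

\Pr\Bigl[\exists\, h\in H:\ |\mathrm{obs\text{-}error}(h)-\mathrm{error}(h)|\ge \tfrac{\varepsilon}{2}\Bigr]
  \;\le\; |H|\,e^{-(\varepsilon/2)^2 m} \;\le\; \delta
\quad\Longleftrightarrow\quad
m \;\ge\; \frac{4}{\varepsilon^{2}}\,\ln\frac{|H|}{\delta}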

  29. Correctness • Assume that for all h in H: • |obs-error(h) − error(h)| < ε/2 • In particular: • obs-error(h*) < error(h*) + ε/2 • error(h) − ε/2 < obs-error(h) • For the output h: • obs-error(h) ≤ obs-error(h*) • Conclusion: error(h) < error(h*) + ε

  30. Sample size issue • Due to the use of the Chernoff bound: Pr[|obs-error(h) − error(h)| ≥ ε/2] < exp(−(ε/2)²m), and over the entire H: Pr < |H| exp(−(ε/2)²m) • It follows that the sample size is m > (4/ε²) ln(|H|/δ), not (1/ε) ln(|H|/δ) as before

  31. Example: Learning an OR of literals • Inputs: x1, …, xn • Literals: x1, ¬x1, …, xn, ¬xn • OR functions: • each variable may appear in the target disjunction positively, negated, or not at all, so the number of disjunctions is 3^n

  32. ELIM: Algorithm for learning an OR • Keep a list of all literals • For every example whose classification is 0: • erase all the literals that evaluate to 1. • Example: c(00110) = 0 results in deleting x3, x4, ¬x1, ¬x2, ¬x5 • Correctness: • our hypothesis h: an OR of our remaining set of literals. • our set of literals includes the literals of the target OR. • every time h predicts zero, we are correct. • Sample size: m > (1/ε) ln(3^n/δ)
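
A minimal Python sketch of ELIM. The encoding is an assumption not in the slides: examples are 0/1 tuples of length n, and a literal is a pair (i, polarity), with polarity True for the variable x_{i+1} and False for its negation:

def elim(samples, n):
    """Learn an OR (disjunction) of literals over n boolean variables."""
    # Start with every literal still possible.
    literals = {(i, pol) for i in range(n) for pol in (True, False)}

    for x, label in samples:
        if label == 0:
            # Any literal that evaluates to 1 here cannot be in the target OR.
            literals -= {(i, pol) for (i, pol) in literals
                         if (x[i] == 1) == pol}

    def h(x):
        # Predict 1 iff some surviving literal is satisfied.
        return int(any((x[i] == 1) == pol for (i, pol) in literals))

    return h, literals

# E.g. the example ((0, 0, 1, 1, 0), 0) removes x3, x4, ¬x1, ¬x2, ¬x5.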

  33. Learning parity • Functions: e.g. x1 ⊕ x7 ⊕ x9 • Number of functions: 2^n • Algorithm: • sample a set of examples • solve the resulting system of linear equations over GF(2) • Sample size: m > (1/ε) ln(2^n/δ)
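
A sketch of the parity learner. The slides only say "solve linear equations"; the GF(2) Gaussian elimination below, with numpy and 0/1 coefficient vectors, is one way to fill that in:

import numpy as np

def learn_parity(samples, n):
    """Recover a parity function (XOR of a subset of x1..xn) from
    noise-free samples: a list of (x, label) with x a length-n 0/1 tuple.
    Returns a 0/1 coefficient vector a with label = a·x (mod 2)."""
    A = np.array([x for x, _ in samples], dtype=np.uint8)
    b = np.array([y for _, y in samples], dtype=np.uint8)
    M = np.concatenate([A, b[:, None]], axis=1)   # augmented matrix over GF(2)

    row = 0
    for col in range(n):
        pivot = next((r for r in range(row, len(M)) if M[r, col]), None)
        if pivot is None:
            continue                              # free variable, left as 0
        M[[row, pivot]] = M[[pivot, row]]         # move pivot row up
        for r in range(len(M)):
            if r != row and M[r, col]:
                M[r] ^= M[row]                    # eliminate (XOR = add mod 2)
        row += 1

    coeffs = np.zeros(n, dtype=np.uint8)
    for r in range(row):
        lead = next(c for c in range(n) if M[r, c])
        coeffs[lead] = M[r, n]                    # read off the solution
    return coeffs

# Prediction on a new x: int(np.dot(coeffs, x) % 2)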

  34. Infinite Concept class • X = [0,1] and H = {c_q | q in [0,1]} • c_q(x) = 0 iff x < q • Assume C = H: • Which c_q should we choose in [min, max]?

  35. Proof I • Define max = min{x | c(x)=1}, min = max{x | c(x)=0} • Show that Pr[ D([min, max]) > ε ] < δ • Proof: by contradiction. • Suppose the probability that x lies in [min, max] is at least ε. • The probability that we never sample from [min, max] is then (1 − ε)^m • This needs m > (1/ε) ln(1/δ) -- but there is something wrong here.

  36. Proof II (correct): • Let max' be such that D([q, max']) = ε/2 • Let min' be such that D([q, min']) = ε/2 • Goal: show that with high probability • x+ lies in [max', q] and • x− lies in [q, min'] • In such a case any value in [x−, x+] is good. • Compute the sample size!

  37. Proof II (correct): • Pr[none of x1, x2, …, xm lies in [q, min']] = (1 − ε/2)^m < exp(−mε/2) • Similarly for the other side. • We require 2 exp(−mε/2) < δ • Thus m > (2/ε) ln(2/δ)

  38. Comments • The hypothesis space was very simple: H = {c_q | q in [0,1]} • There was no noise in the data or labels • So learning was trivial in some sense (an analytic solution exists)
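
For concreteness, a minimal sketch of the trivial learner for this class. The midpoint rule is just one consistent choice; the slides only require some value between the largest negative and the smallest positive example:

def learn_threshold(samples):
    """Learn a threshold concept c_q(x) = 0 iff x < q from labeled
    points in [0, 1]; samples is a list of (x, label) pairs."""
    negatives = [x for x, label in samples if label == 0]
    positives = [x for x, label in samples if label == 1]
    lo = max(negatives, default=0.0)   # largest example labeled 0
    hi = min(positives, default=1.0)   # smallest example labeled 1
    q_hat = (lo + hi) / 2              # any point between lo and hi is consistent
    return lambda x: 0 if x < q_hat else 1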

  39. Non-Feasible case: Label Noise • Suppose the sampled labels are noisy. • Algorithm: find the function h in H with the lowest observed error.

  40. Analysis • Define the z_i as an ε/4-net (w.r.t. D) • For the optimal h* and our h there are • z_j : |error(h[z_j]) − error(h*)| < ε/4 • z_k : |error(h[z_k]) − error(h)| < ε/4 • Show that with high probability: • |obs-error(h[z_i]) − error(h[z_i])| < ε/4

  41. Exercise (Submission Mar 29, 04) • Assume there is Gaussian (0,σ) noise on xi. Apply the same analysis to compute the required sample size for PAC learning. Note: Class labels are determined by the non-noisy observations.

  42. General ε-net approach • Given a class H, define a class G such that • for every h in H • there exists a g in G with • D(g Δ h) < ε/4 • Algorithm: find the best h in H. • Compute the confidence and sample size.

  43. Occam Razor • W. of Occam (c. 1320): "Entities should not be multiplied unnecessarily" • Einstein: "Simplify a problem as much as possible, but no simpler" • Information-theoretic grounds?

  44. Occam Razor • Finding the shortest consistent hypothesis. • Definition: (a,b)-Occam algorithm • a > 0 and b < 1 • Input: a sample S of size m • Output: hypothesis h • for every (x,b) in S: h(x) = b (consistency) • size(h) < size(ct)^a · m^b • Efficiency.

  45. Occam algorithm and compression • [Figure: a sender A transmits to a receiver B information about the labels b_i of the sample S = {(x_i, b_i)}, x1, …, xm]

  46. Compression • Option 1: • A sends B the values b1, …, bm • m bits of information • Option 2: • A sends B the hypothesis h • Occam: for large enough m, size(h) < m • Option 3 (MDL): • A sends B a hypothesis h and the "corrections" • complexity: size(h) + size(errors)

  47. Occam Razor Theorem • A: an (a,b)-Occam algorithm for C using H • D: a distribution over the inputs X • ct in C the target function, n = size(ct) • Sample size: as derived on the next slide • Then with probability 1−δ, A(S) = h has error(h) < ε

  48. Occam Razor Theorem • Use the bound for a finite hypothesis class. • Effective hypothesis class size: 2^size(h) • size(h) < n^a · m^b • Sample size: solve m ≥ (1/ε)(ln 2 · n^a m^b + ln(1/δ)) for m (sketched below) • The VC dimension will later replace 2^size(h)
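
A sketch of the calculation behind the theorem, plugging the effective class size 2^size(h) into the finite-class bound from slide 23; the closed form on the right is the standard statement of the resulting sample size (the constant is omitted):

m \;\ge\; \frac{1}{\varepsilon}\Bigl(\ln 2\cdot n^{a} m^{b} + \ln\tfrac{1}{\delta}\Bigr)
\quad\Longrightarrow\quad
m \;=\; O\!\Bigl(\frac{1}{\varepsilon}\ln\frac{1}{\delta}
      + \Bigl(\frac{n^{a}}{\varepsilon}\Bigr)^{\!1/(1-b)}\Bigr)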

  49. Exercise (Submission Mar 29, 04) • 2. For an (a,b)-Occam algorithm, given noisy data with noise ~ (0, σ²), find the limitations on m. Hint: use an ε-net and the Chernoff bound.

  50. Learning OR with few attributes • Target function: an OR of k literals • Goal: learn in time • polynomial in k and log n • with ε and δ constant • ELIM makes "slow" progress: • it disqualifies one literal per round • and may remain with O(n) literals
