

  1. Università di Milano-Bicocca, Laurea Magistrale in Informatica. Corso di APPRENDIMENTO E APPROSSIMAZIONE (Learning and Approximation), Prof. Giancarlo Mauri. Lezione 4 (Lecture 4) - Computational Learning Theory

  2. Computational models of cognitive phenomena • Computing capabilities: computability theory • Reasoning/deduction: formal logic • Learning/induction: ?

  3. A theory of the learnable (Valiant '84) • […] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn […] Learning machines must have all three of the following properties: • the machines can provably learn whole classes of concepts, and these classes can be characterized • the classes of concepts are appropriate and nontrivial for general-purpose knowledge • the computational process by which the machine builds the desired programs requires a "feasible" (i.e. polynomial) number of steps

  4. A theory of the learnable • We seek general laws that constrain inductive learning, relating: • Probability of successful learning • Number of training examples • Complexity of hypothesis space • Accuracy to which target concept is approximated • Manner in which training examples are presented

  5. Probably approximately correct learning A formal computational model that aims to shed light on the limits of what can be learned by a machine, by analysing the computational cost of learning algorithms

  6. What we want to learn CONCEPT = recognizing algorithm. LEARNING = computational description of recognizing algorithms, starting from: - examples - incomplete specifications. That is: to determine uniformly good approximations of an unknown function from its values at some sample points • interpolation • pattern matching • concept learning

  7. What's new in p.a.c. learning? Accuracy of results and running time for learning algorithms are explicitly quantified and related. A general problem: use of resources (time, space, …) by computations COMPLEXITY THEORY Example Sorting: n·log n time (polynomial, feasible) Boolean satisfiability: 2ⁿ time (exponential, intractable)

  8. Learning from examples (Diagram: a LEARNER receives EXAMPLES of a concept drawn from a DOMAIN and outputs A REPRESENTATION OF A CONCEPT.) CONCEPT: subset of the domain EXAMPLES: elements of the concept (positive) REPRESENTATION: domain → expressions GOOD LEARNER? EFFICIENT LEARNER?

  9. The P.A.C. model • A domain X (e.g. {0,1}ⁿ, Rⁿ) • A concept: a subset of X, f ⊆ X, or f: X → {0,1} • A class of concepts F ⊆ 2^X • A probability distribution P on X Example 1: X ≡ a square, F ≡ the triangles contained in the square

  10. The P.A.C. model Example 2: X ≡ {0,1}ⁿ, F ≡ a family of boolean functions, e.g. fr(x1,…,xn) = 1 if there are at least r ones in (x1,…,xn), 0 otherwise. P a probability distribution on X (uniform or non-uniform)
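As a concrete illustration (not part of the slides), the threshold concept fr of Example 2 can be written as a one-liner; the function name f_r simply mirrors the slide's notation:

```python
def f_r(x, r):
    """f_r(x_1, ..., x_n) = 1 iff at least r of the bits x_i are 1."""
    return int(sum(x) >= r)

print(f_r((1, 0, 1, 1), 2))   # 1: the example has three ones, and 3 >= r = 2
```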

  11. The P.A.C. model The learning process • Labeled sample: ((x0, f(x0)), (x1, f(x1)), …, (xn, f(xn))) • Hypothesis: a function h consistent with the sample (i.e., h(xi) = f(xi) ∀i) • Error probability: Perr = P{x ∈ X | h(x) ≠ f(x)}

  12. The P.A.C. model (Diagram: a TEACHER, who knows X, F and the target f ∈ F, feeds an examples generator that draws t examples according to the probability distribution P; the labelled t-sample (x1, f(x1)), …, (xt, f(xt)) is passed to the LEARNER, whose inference procedure A outputs a hypothesis h, an implicit representation of a concept.) The learning algorithm A is good if the hypothesis h is "ALMOST ALWAYS" "CLOSE TO" the target concept f

  13. The P.A.C. model "CLOSE TO": a metric. Given P, dP(f,h) = Perr = P{x | f(x) ≠ h(x)}. Given an approximation parameter ε (0 < ε ≤ 1), h is an ε-approximation of f if dP(f,h) ≤ ε. "ALMOST ALWAYS": a confidence parameter δ (0 < δ ≤ 1). The "measure" of the sequences of examples, randomly chosen according to P, such that h is an ε-approximation of f is at least 1-δ

  14. Learning algorithm (Diagram: a generator of examples feeds the Learner, which outputs h.) F a concept class, S a set of labeled samples from a concept in F. A learning algorithm is a map A: S → F such that: ∀ε, δ with 0 < ε, δ < 1, ∀f ∈ F, ∃m ∈ N such that for every sample S with |S| ≥ m: I) A(S) is consistent with S; II) P(Perr < ε) > 1-δ

  15. The efficiency issue Look for algorithms which use a "reasonable" amount of computational resources COMPUTATIONAL RESOURCES: SAMPLE SIZE (statistical PAC learning), COMPUTATION TIME (polynomial PAC learning) DEF 1: a concept class F = ∪n≥1 Fn is statistically PAC learnable if there is a learning algorithm with sample size t = t(n, 1/ε, 1/δ) bounded by some polynomial function in n, 1/ε, 1/δ

  16. The efficiency issue POLYNOMIAL PAC ⇒ STATISTICAL PAC DEF 2: a concept class F = ∪n≥1 Fn is polynomially PAC learnable if there is a learning algorithm with running time bounded by some polynomial function in n, 1/ε, 1/δ

  17. Learning boolean functions Bn = {f: {0,1}ⁿ → {0,1}}, the set of boolean functions in n variables. Fn ⊆ Bn a class of concepts. Example 1: Fn = clauses with literals in {x1, ¬x1, …, xn, ¬xn} Example 2: Fn = linearly separable functions in n variables REPRESENTATION: - TRUTH TABLE (EXPLICIT) - BOOLEAN CIRCUITS (IMPLICIT) BOOLEAN CIRCUITS → BOOLEAN FUNCTIONS

  18. Boolean functions and circuits • BASIC OPERATIONS: ∧, ∨, ¬ • COMPOSITION: [f(g1, …, gm)](x) = f(g1(x), …, gm(x)), with f in m variables and g1, …, gm in n variables CIRCUIT: a finite acyclic directed graph with input nodes, internal nodes labelled by the basic operations, and one output node. Given an assignment {x1, …, xn} → {0,1} to the input variables, the output node computes the corresponding value

  19. Boolean functions and circuits Fn ⊆ Bn; Cn: the class of circuits which compute all and only the functions in Fn Algorithm A to learn F by C • INPUT (n, ε, δ) • The learner computes t = t(n, 1/ε, 1/δ) (t = number of examples sufficient to learn with accuracy ε and confidence δ) • The learner asks the teacher for a labelled t-sample • The learner receives the t-sample S and computes C = An(S) • Output C (C = representation of the hypothesis) Note that the inference procedure A receives as input the integer n and a t-sample on {0,1}ⁿ and outputs An(S) = A(n, S)

  20. Boolean functions and circuits An algorithm A is a learning algorithm with sample size t(n, 1/ε, 1/δ) for a concept class F = ∪n≥1 Fn using the class of representations C = ∪n≥1 Cn if for all n ≥ 1, for all f ∈ Fn, for all 0 < ε, δ < 1 and for every probability distribution P over {0,1}ⁿ the following holds: if the inference procedure An receives as input a t-sample, it outputs a representation c ∈ Cn of a function g that is probably approximately correct, that is, with probability at least 1-δ a t-sample is chosen such that the inferred function g satisfies P{x | f(x) ≠ g(x)} ≤ ε. g is ε-good: g is an ε-approximation of f; g is ε-bad: g is not an ε-approximation of f NOTE: distribution free

  21. Statistical P.A.C. learning PROBLEM: estimate upper and lower bounds on the sample size t = t(n, 1/ε, 1/δ). Upper bounds will be given for consistent algorithms; lower bounds will be given for arbitrary algorithms. DEF: An inference procedure An for the class Fn is consistent if, given the target function f ∈ Fn, for every t-sample S = (<x1,b1>, …, <xt,bt>), An(S) is a representation of a function g "consistent" with S, i.e. g(x1) = b1, …, g(xt) = bt DEF: A learning algorithm A is consistent if its inference procedure is consistent

  22. A simple upper bound THEOREM: t(n, 1/ε, 1/δ) ≤ ε⁻¹(ln(#Fn) + ln(1/δ)) PROOF: Prob{(x1, …, xt) | ∃g: g(x1)=f(x1), …, g(xt)=f(xt) and g ε-bad} ≤ Σ over ε-bad g of Prob{g(x1)=f(x1), …, g(xt)=f(xt)} [since P(A∪B) ≤ P(A)+P(B)] = Σ over ε-bad g of Π(i=1,…,t) Prob{g(xi)=f(xi)} [independent events] ≤ Σ over ε-bad g of (1-ε)^t [since g is ε-bad] ≤ #Fn·(1-ε)^t ≤ #Fn·e^(-εt) Imposing #Fn·e^(-εt) ≤ δ gives the bound. NOTE: #Fn must be finite
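A small numerical sketch of this bound; the choice of the class (monomials over n variables, so #Fn = 3ⁿ) and the parameter values are illustrative assumptions, not the slide's:

```python
import math

def simple_sample_bound(card_F, eps, delta):
    """Sample size sufficient for any consistent learner:
    t >= (1/eps) * (ln(#F_n) + ln(1/delta))."""
    return math.ceil((math.log(card_F) + math.log(1.0 / delta)) / eps)

# Monomials over n = 10 variables: #F_n = 3^10, eps = 0.1, delta = 0.05
print(simple_sample_bound(3 ** 10, 0.1, 0.05))   # 140 examples suffice
```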

  23. Vapnik-Chervonenkis approach (1971) Problem: uniform convergence of relative frequencies to their probabilities. X a domain, F ⊆ 2^X a class of concepts, S = (x1, …, xt) a t-sample. f ≈S g iff f(xi) = g(xi) ∀xi ∈ S (f and g undistinguishable by S) πF(S) = #(F/≈S) the index of F with respect to S mF(t) = max{πF(S) | S is a t-sample} the growth function

  24. A general upper bound THEOREM: Prob{(x1, …, xt) | ∃g: g ε-bad and g(x1) = f(x1), …, g(xt) = f(xt)} ≤ 2·mF(2t)·e^(-εt/2) FACTS: mF(t) ≤ 2^t; mF(t) ≤ #F (this condition immediately gives the simple upper bound); mF(t) = 2^t ⇒ ∀j < t, mF(j) = 2^j

  25. Graph of the growth function (Plot: mF(t) as a function of t; it equals 2^t up to t = d, then bends and stays below #F.) DEFINITION: d = VCdim(F) = max{t | mF(t) = 2^t} FUNDAMENTAL PROPERTY: for t > d, mF(t) is BOUNDED BY A POLYNOMIAL IN t (of degree d)!

  26. Upper and lower bounds THEOREM: if dn = VCdim(Fn) then t(n, 1/ε, 1/δ) ≤ max((4/ε)·log(2/δ), (8dn/ε)·log(13/ε)) PROOF: impose 2·mFn(2t)·e^(-εt/2) ≤ δ A lower bound on t(n, 1/ε, 1/δ), i.e. the number of examples which are necessary for arbitrary algorithms: THEOREM: for 0 < ε ≤ 1/8 and δ ≤ 1/100, t(n, 1/ε, 1/δ) ≥ max(((1-ε)/ε)·ln(1/δ), (dn-1)/(32ε))

  27. An equivalent definition of VCdim πF(S) = #{f⁻¹(1) ∩ {x1, …, xt} | f ∈ F}, i.e. the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F. If πF(S) = 2^(#S) we say that S is shattered by F. The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S ⊆ X that is shattered by F
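A brute-force sketch of this definition on a small finite domain; the helper names and the example class (initial segments of a line, whose VC dimension is 1) are illustrative assumptions:

```python
from itertools import combinations

def is_shattered(S, concepts):
    """S is shattered iff every subset of S equals S ∩ f for some concept f."""
    S = set(S)
    achieved = {frozenset(S & f) for f in concepts}
    return len(achieved) == 2 ** len(S)

def vc_dimension(domain, concepts):
    """Cardinality of the largest subset of the domain shattered by the class."""
    for k in range(len(domain), 0, -1):
        if any(is_shattered(S, concepts) for S in combinations(domain, k)):
            return k
    return 0

# Initial segments {x | x <= a} over a 4-point domain: VC dimension 1
domain = [1, 2, 3, 4]
concepts = [frozenset(x for x in domain if x <= a) for a in range(5)]
print(vc_dimension(domain, concepts))   # 1
```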

  28. Example 1 Learn the family F of circles contained in the square. Sufficient: ≈ 24,000 examples! Necessary: ≈ 690 examples!

  29. Example 2 Learn the family of linearly separable boolean functions in n variables, Ln (threshold functions HS(x)). Comparing the SIMPLE UPPER BOUND with the UPPER BOUND obtained USING the VC dimension: the latter GROWS LINEARLY WITH n!

  30. Example 2 Consider the class L2 of linearly separable functions in two variables. (Figures: two labellings of four points in the plane.) In the first, no straight line can separate the green from the red points; in the second, the green point cannot be separated from the other three

  31. Classes of boolean formulae Monomials: x1 ∧ x2 ∧ … ∧ xk DNF: m1 ∨ m2 ∨ … ∨ mj (the mi monomials) Clauses: x1 ∨ x2 ∨ … ∨ xk CNF: c1 ∧ c2 ∧ … ∧ cj (the ci clauses) k-DNF: ≤ k literals in each monomial k-term-DNF: ≤ k monomials k-CNF: ≤ k literals in each clause k-clause-CNF: ≤ k clauses Monotone formulae: contain no negated literals μ-formulae: each variable appears at most once

  32. The results Th. (Valiant): monomials are learnable from positive examples only, with 2ε⁻¹(n + log δ⁻¹) examples (ε the tolerated error), keeping the literal xi if xi = 1 in all the examples and ¬xi if xi = 0 in all the examples. N.B. Learnability is not monotone: A ⊆ B with B learnable does not imply that A is learnable. Th.: monomials are not learnable from negative examples only
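A runnable sketch of the elimination rule in Valiant's theorem above; the encoding of literals as (index, sign) pairs and the tiny sample are my own illustrative choices:

```python
def learn_monomial(positive_examples, n):
    """Learn a monomial from positive examples only: keep x_i if x_i = 1 in
    every example seen, keep NOT x_i if x_i = 0 in every example seen."""
    literals = {(i, sign) for i in range(n) for sign in (True, False)}
    for x in positive_examples:
        # Drop every literal falsified by the positive example x.
        literals = {(i, sign) for (i, sign) in literals if (x[i] == 1) == sign}
    return literals  # (i, True) stands for x_i, (i, False) for NOT x_i

# Target (hypothetical): x1 AND NOT x3, over n = 3 variables
sample = [(1, 0, 0), (1, 1, 0)]
print(learn_monomial(sample, 3))   # {(0, True), (2, False)}
```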

  33. Positive results 1) k-CNF formulae are learnable from positive examples only 1b) k-DNF formulae are learnable from negative examples only 2) (k-DNF ∪ k-CNF) is learnable from positive and negative examples 3) the class of k-decision lists is learnable Th.: every k-DNF (or k-CNF) formula can be represented by a small k-DL

  34. Negative results 1) μ-formulae are not learnable 2) Boolean threshold functions are not learnable 3) For k ≥ 2, k-term-DNF formulae are not learnable

  35. Mistake bound model • So far: how many examples are needed to learn? • What about: how many mistakes before convergence? • Let's consider a setting similar to PAC learning: • Instances drawn at random from X according to distribution D • The learner must classify each instance before receiving the correct classification from the teacher • Can we bound the number of mistakes the learner makes before converging?

  36. Mistake bound model • Learner: • Receives a sequence of training examples x • Predicts the target value f(x) • Receives the correct target value from the trainer • Is evaluated by the total number of mistakes it makes before converging to the correct hypothesis • I.e.: • Learning takes place during the use of the system, not off-line • Ex.: prediction of fraudulent use of credit cards

  37. Mistake bound for Find-S • Consider Find-S when H = conjunctions of boolean literals FIND-S: • Initialize h to the most specific hypothesis in H: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn • For each positive training instance x • Remove from h any literal not satisfied by x • Output h

  38. Mistake bound for Find-S • If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis • How many errors to learn c ∈ H (only positive examples can be misclassified)? • The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated • Each subsequent error eliminates at least one literal • #mistakes ≤ n+1 (worst case, for the "total" concept ∀x c(x)=1)
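A runnable sketch of FIND-S used online as above, counting its mistakes; the literal encoding matches the earlier monomial sketch, and the example stream is a made-up illustration:

```python
def find_s_online(stream, n):
    """Online FIND-S over conjunctions of boolean literals, counting mistakes.
    h is a set of literals: (i, True) = x_i, (i, False) = NOT x_i."""
    h = {(i, sign) for i in range(n) for sign in (True, False)}  # most specific
    mistakes = 0
    for x, label in stream:
        prediction = all((x[i] == 1) == sign for (i, sign) in h)  # h(x)
        if prediction != label:
            mistakes += 1
        if label:  # FIND-S updates only on positive instances
            h = {(i, sign) for (i, sign) in h if (x[i] == 1) == sign}
    return h, mistakes

# Target (hypothetical): c(x) = x1, over n = 2 variables
stream = [((1, 0), True), ((1, 1), True), ((0, 1), False)]
print(find_s_online(stream, 2))   # ({(0, True)}, 2); at most n+1 = 3 mistakes
```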

  39. Mistake bound for Halving • A version space is maintained and refined (e.g., Candidate-elimination) • Prediction is based on a majority vote among the hypotheses in the current version space • "Wrong" hypotheses are removed (even when x is correctly classified by the majority) • How many errors to exactly learn c ∈ H (H finite)? • A mistake occurs when the majority of the hypotheses misclassifies x • These hypotheses are removed • Hence, for each mistake, the version space is at least halved • At most log2(|H|) mistakes before exact learning (e.g., a single hypothesis remaining) • Note: learning without mistakes is possible!
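A minimal sketch of the Halving rule above, assuming hypotheses are given as Python callables; for simplicity, ties are broken toward 0 instead of at random:

```python
def halving(version_space, stream):
    """Predict by majority vote over the version space, then remove every
    hypothesis that misclassified x. Mistake bound: log2(|H|)."""
    mistakes = 0
    for x, label in stream:
        votes_for_1 = sum(1 for h in version_space if h(x))
        prediction = votes_for_1 * 2 > len(version_space)  # strict majority
        if prediction != label:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes

# Hypothetical H: "the label is the i-th bit", for i = 0..3; the target is h_2
H = [(lambda x, i=i: bool(x[i])) for i in range(4)]
stream = [((1, 0, 1, 0), True), ((0, 1, 1, 0), True)]
print(halving(H, stream)[1])   # at most log2(|H|) = 2 mistakes
```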

  40. Optimal mistake bound • Question: what is the optimal mistake bound (i.e., the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C? • Formally, for any learning algorithm A and any target concept c: • MA(c) = max #mistakes made by A to exactly learn c, over all possible training sequences • MA(C) = max{MA(c) | c ∈ C} Note: MFind-S(C) = n+1, MHalving(C) ≤ log2(|C|) • Opt(C) = min over all A of MA(C), i.e., the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm

  41. Optimal mistake bound • Theorem (Littlestone 1987) • VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|) • There exist concept classes for which VC(C) = Opt(C) = MHalving(C) = log2(|C|), e.g. the power set 2^X of X, for which it holds: VC(2^X) = |X| = log2(|2^X|) • There exist concept classes for which VC(C) < Opt(C) < MHalving(C)

  42. Weighted majority algorithm • Generalizes Halving • Makes predictions by taking a weighted vote among a pool of prediction algorithms • Learns by altering the weight associated with each prediction algorithm • It does not eliminate hypotheses (i.e., algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data

  43. Weighted majority algorithm • ∀i: wi := 1 • ∀ training example (x, c(x)): • q0 := q1 := 0 • ∀ prediction algorithm ai: • If ai(x)=0 then q0 := q0 + wi • If ai(x)=1 then q1 := q1 + wi • if q1 > q0 then predict c(x)=1; if q1 < q0 then predict c(x)=0; if q1 = q0 then predict c(x)=0 or 1 at random • ∀ prediction algorithm ai do: If ai(x)≠c(x) then wi := β·wi (0 ≤ β < 1)
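A runnable sketch of this pseudocode, assuming the prediction algorithms are given as Python callables returning 0 or 1; the function names and the toy pool are illustrative:

```python
import random

def weighted_majority(predictors, stream, beta=0.5):
    """Weighted vote over the pool; every wrong predictor has its weight
    multiplied by beta (beta = 0 recovers the Halving algorithm)."""
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q = [0.0, 0.0]
        for a, w in zip(predictors, weights):
            q[a(x)] += w
        prediction = 1 if q[1] > q[0] else 0 if q[0] > q[1] else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        weights = [w * beta if a(x) != label else w
                   for a, w in zip(predictors, weights)]
    return weights, mistakes

# Toy pool: "predict the i-th bit of x" for i = 0, 1, 2; the target is bit 1
pool = [(lambda x, i=i: x[i]) for i in range(3)]
stream = [((0, 1, 1), 1), ((1, 0, 0), 0), ((1, 1, 0), 1)]
print(weighted_majority(pool, stream))
```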

  44. Weighted majority algorithm (WM) • Coincides with Halving for β=0 • Theorem - D any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A on D, β=1/2. Then WM makes at most 2.4(k + log2 n) mistakes over D

  45. Weighted majority algorithm (WM) • Proof • Since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k • The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, and that half is reduced by a factor of 1/2. • Hence the final total weight W is at most n(3/4)^M, where M is the total number of mistakes made by WM over D.

  46. Weighted majority algorithm (WM) • But the final weight wj cannot be greater than the final total weight W, hence: (1/2)^k ≤ n(3/4)^M, from which M ≤ (k + log2 n) / (-log2(3/4)) ≤ 2.4(k + log2 n) • I.e., the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically with the size of the pool
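A quick numerical check of the constant in the bound above:

```python
import math

# 1 / (-log2(3/4)) = 1 / log2(4/3) ≈ 2.41, which gives the 2.4(k + log2 n) bound
print(1 / -math.log2(3 / 4))
```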
