
Analysis of greedy active learning


Presentation Transcript


  1. Analysis of greedy active learning. Sanjoy Dasgupta, UC San Diego.

  2. Standard learning model. Given m labeled points, we want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞. VC theory: need m to be roughly d/ε, in the realizable case.

  3. Active learning Unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?

  4. Can adaptive querying help? Simple hypothesis class: threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ ℝ}. Start with m ≈ 1/ε unlabeled points. Binary search: need just log m labels, from which the rest can be inferred! An exponential improvement in sample complexity.
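
A minimal sketch of this binary search in Python (illustrative code, not from the talk; the `label` callback stands in for the paid labeling oracle):

```python
# Binary-search active learning for thresholds h_w(x) = 1(x >= w).
# Assumes xs is sorted; each call to `label` is one paid query.
def active_learn_threshold(xs, label):
    lo, hi = -1, len(xs)          # invariant: xs[:lo+1] -> 0, xs[hi:] -> 1
    queries = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        queries += 1
        if label(xs[mid]) == 1:   # true threshold lies at or before xs[mid]
            hi = mid
        else:                     # true threshold lies after xs[mid]
            lo = mid
    return hi, queries            # xs[hi] is the first positive point

# Example: m = 1000 points, hidden threshold 0.37 -> about log2(1000) = 10 queries.
xs = [i / 1000 for i in range(1000)]
first_pos, used = active_learn_threshold(xs, lambda x: int(x >= 0.37))
```

After `used` ≈ log₂ m queries, all m labels are determined.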

  5. Binary search. [Diagram: binary-search query tree asking X1?, X6?, X8? over the data points.] m data points: there are effectively m+1 different hypotheses. The query tree has m+1 leaves, depth ≈ log m. Question: Is this a general phenomenon? For other hypothesis classes, is a generalized binary search possible?

  6. Bad news – I. H = {linear separators in ℝ¹}: active learning reduces sample complexity from m to log m. But H = {linear separators in ℝ²}: there are some target hypotheses for which all m labels need to be queried! (No matter how benign the input distribution.) In this case, learning to accuracy ε requires 1/ε labels…

  7. The benefit of averaging. For linear separators in ℝ²: in the worst case over target hypotheses, active learning offers no improvement in sample complexity. But there is a query tree in which the depths of the O(m²) target hypotheses are spread almost evenly over [log m, m]. The average depth is just log m. Question: does active learning help only in a Bayesian model?

  8. Degrees of Bayesian-ity. Prior π over hypotheses. Pseudo-Bayesian model: the prior is used only to count queries. Bayesian model: the prior is used for counting queries and also for the generalization bound. The two have different stopping criteria: suppose the remaining version space has high π-mass, or low π-mass. [Diagram contrasting a version space of high π-mass with one of low π-mass.]

  9. Effective hypothesis class. Fix a hypothesis class H of VC dimension d < ∞, and a set of unlabeled examples x1, x2, …, xm, where m ≥ d/ε. Sauer's lemma: H can label these points in at most m^d different ways… the effective hypothesis class H_eff = { (h(x1), h(x2), …, h(xm)) : h ∈ H } has size |H_eff| ≤ m^d. Goal (in the realizable case): pick the element of H_eff which is consistent with all the hidden labels, while asking for just a small subset of these labels.
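
As a concrete sketch (my own illustration, with thresholds standing in for H), H_eff can be computed by projecting each hypothesis onto the sample and deduplicating the label vectors:

```python
# H_eff = { (h(x1), ..., h(xm)) : h in H }, computed by projection.
# Thresholds have d = 1: at most m + 1 distinct labelings (Sauer's lemma).
def effective_class(hypotheses, xs):
    return {tuple(h(x) for x in xs) for h in hypotheses}

xs = [0.1, 0.4, 0.5, 0.9]
thresholds = [lambda x, w=w: int(x >= w) for w in (0.0, 0.3, 0.45, 0.6, 0.7, 2.0)]
H_eff = effective_class(thresholds, xs)   # 5 distinct labelings = m + 1 here
```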

  10. Model of an active learner. Query tree: internal nodes are label queries (X1?, X6?, …); each leaf is annotated with an element of H_eff (h1, h2, …). Weights π over H_eff. Goal: a tree T of small average depth, Q(T, π) = Σ_h π(h) · depth(h). (Can also use random coin flips at internal nodes.) Question: in this averaged model, can we always find a tree of depth o(m)?
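
The cost Q(T, π) can be read off a tree directly; a small sketch, assuming a tree is either a leaf (a hypothesis name) or a tuple (query index, negative subtree, positive subtree):

```python
# Average depth Q(T, pi) = sum over hypotheses h of pi(h) * depth(h).
def average_depth(tree, pi, depth=0):
    if not isinstance(tree, tuple):   # leaf: identifies hypothesis `tree`
        return pi[tree] * depth
    _, neg, pos = tree                # internal node: one query, two subtrees
    return average_depth(neg, pi, depth + 1) + average_depth(pos, pi, depth + 1)

# Query x0 first; on a positive answer, query x1 to separate h2 from h3.
tree = (0, "h1", (1, "h2", "h3"))
pi = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
q = average_depth(tree, pi)           # 0.5*1 + 0.25*2 + 0.25*2 = 1.5
```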

  11. Bad news – II. Pick any d > 0 and m ≥ 2d. There is an input space of size m and a hypothesis class H of VC dimension d such that (for uniform π) any active learning strategy requires ≥ m/8 queries on average. Choose: input space = any {x1, …, xm}; H = all concepts which are positive on exactly d inputs.
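
The construction itself is easy to write down; a small sketch in the label-vector representation (function name mine):

```python
# The lower-bound class: all concepts positive on exactly d of the m inputs
# (VC dimension d when m >= 2d).
from itertools import combinations

def lower_bound_class(m, d):
    return {tuple(int(i in positives) for i in range(m))
            for positives in combinations(range(m), d)}

H = lower_bound_class(m=6, d=2)   # C(6, 2) = 15 hypotheses
```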

  12. A revised goal. Depending on the choice of π, the hypothesis class, and perhaps the input distribution, the average number of labels needed by an optimal active learner is somewhere in the range [d log m, m]. Ideal case: d log m (perfect binary search). Worst case: m (randomly chosen labels, within constants). Is there a generic active learning strategy which always achieves close to the optimal number of queries, no matter what it might be?

  13. Heuristics for active learning. A common strategy in many heuristics: the greedy strategy. After seeing t labels, the remaining version space is some H_t; always choose the point which most evenly divides H_t, according to π-mass. For instance, Tong–Koller (2000), for linear separators, take π ∝ volume. Question: How good is this greedy scheme? And how does its performance depend on the choice of π?
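
A sketch of this greedy rule, reusing the label-vector representation of H_eff from above (illustrative code, not the paper's):

```python
# Pick the pool index whose answer splits version space S most evenly by
# pi-mass, i.e. maximize the pi-mass of the smaller side of the split.
def greedy_query(S, pi):
    m = len(next(iter(S)))        # pool size: length of the label vectors
    total = sum(pi[h] for h in S)
    def smaller_side(i):
        pos = sum(pi[h] for h in S if h[i] == 1)
        return min(pos, total - pos)
    return max(range(m), key=smaller_side)

# After querying point i and receiving label y, the version space contracts:
def update(S, i, y):
    return {h for h in S if h[i] == y}
```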

  14. Greedy active learning. Choose any π. How does the greedy query tree T_G compare to the optimal tree T*? Upper bound: Q(T_G, π) ≤ 4 Q(T*, π) log(1/min_h π(h)). Example: for uniform π, the approximation ratio is log |H_eff| ≤ d log m. Lower bounds: [1] Uniform π: we have an example in which Q(T_G, π) ≥ Q(T*, π) · Ω(log |H_eff| / log log |H_eff|). [2] Non-uniform π: an example where π ranges between 1/2 and 1/2^n, and Q(T_G, π) ≥ Q(T*, π) · Ω(n).

  15. Sub-optimality of greedy scheme. [1] The case of uniform π. There are simple examples in which the greedy scheme uses Ω(log n / log log n) times the optimal number of labels: (a) the hypothesis class consists of several clusters; (b) each cluster is efficiently searchable; (c) but first the version space must be narrowed down to one of these clusters, an inefficient process. [Invoke this construction recursively.] The optimal strategy reduces entropy only gradually at first, then ramps it up later – an over-eager greedy scheme is fooled.

  16. Sub-optimality, cont'd. [2] The case of general π. For any n ≥ 2: there is a hypothesis class H of size 2n+1 and a distribution π over H such that: π ranges from 1/2 to 1/2^(n+1); the optimal expected number of queries is < 3; the greedy strategy uses ≥ n/2 queries on average. [Diagram: H = {h0, h11, h21, h12, h22, …, h1n, h2n}, with π proportional to area.]

  17. Sub-optimality, cont'd. Three types of queries: (i) Is the target some h1i? (ii) Is it some h2i? (iii) Is it h1j or h2j?

  18. Upper bound: overview. Upper bound: Q(T_G, π) ≤ 4 Q(T*, π) log(1/min_h π(h)). If the optimal tree is short, then either: there is a query which (in expectation) cuts off a good chunk of the version space; or: some particular hypothesis has high weight. At least in the first case, the greedy scheme gets off to a good start [cf. Johnson's argument for set cover].

  19. Quality of a query. We need a notion of query quality which can only decrease with time. If S is a version space, and query x_i splits it into S+ and S−, we'll say that "x_i shrinks (S, π)" by 2π(S+)π(S−)/π(S). Claim: if x_i shrinks (H_eff, π) by Δ, then it shrinks (S, π) by at most Δ, for any S ⊆ H_eff.
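
In the same label-vector representation, this shrinkage is directly computable (sketch, my naming):

```python
# Shrinkage of query i on version space S: 2 * pi(S+) * pi(S-) / pi(S),
# where S+ / S- are the hypotheses labeling point i positive / negative.
def shrinkage(S, pi, i):
    pos = sum(pi[h] for h in S if h[i] == 1)
    total = sum(pi[h] for h in S)
    return 2 * pos * (total - pos) / total
```

The claim follows because 2ab/(a+b) is increasing in each argument, and passing from H_eff to a subset S can only reduce the mass on each side of the split.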

  20. When is the optimal tree short? Claim: Pick any S ⊆ H_eff, and any tree T whose leaves include all of S. Then there must be a query which shrinks (S, π_S) by at least (1 − CP(π_S)) / Q(T, π_S). Here π_S is π restricted to S (renormalized), and CP(π) = Σ_h π(h)² (collision probability).
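
The quantities in the claim are equally easy to compute in this sketch:

```python
# Collision probability CP(pi_S) = sum of pi_S(h)^2, where pi_S is pi
# restricted to S and renormalized to a probability distribution.
def collision_probability(S, pi):
    total = sum(pi[h] for h in S)
    return sum((pi[h] / total) ** 2 for h in S)
```

When CP(π_S) is small (no single hypothesis dominates S), the claim guarantees some query whose shrinkage is on the order of 1/Q(T, π_S).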

  21. Main argument. If the optimal tree has small average depth, then there are two possible cases. Case one: there is some query which shrinks the version space significantly. In this case, the greedy strategy will find such a query and clear progress will be made. The resulting subtrees, considered together, will also require few queries.

  22. Proof, cont'd. Case two: some classifier h* has very high π-mass. In this case, the version space might shrink by just an insignificant amount in one round. But: in roughly the number of queries that the optimal strategy requires for target h*, the greedy strategy will either eliminate h* or declare it to be the answer. In the former case, by the time h* is eliminated, the version space will have shrunk significantly. These two cases form the basis of an inductive argument.

  23. An open problem. Just about the only positive result in active learning: [FSST97] Query by committee: if the data distribution is uniform over the unit sphere, one can learn homogeneous linear separators using just O(d log 1/ε) labels. But the minute we allow non-homogeneous hyperplanes, the query complexity increases to 1/ε… What's going on?
