
Efficient classification for metric data

Presentation Transcript


  1. Efficient classification for metric data Lee-Ad Gottlieb (Hebrew U.), Aryeh Kontorovich (Ben Gurion U.), Robert Krauthgamer (Weizmann Institute)

  2. Classification problem • A fundamental problem in learning: • Point space X, probability distribution P on X × {-1,1} • Learner observes a sample S of n points (x,y) drawn i.i.d. from P, and wants to predict the labels of other points in X • Produces a hypothesis h: X → {-1,1} with empirical error err_S(h) = (1/n)·|{(x,y) in S : h(x) ≠ y}| and true error err(h) = P{(x,y) : h(x) ≠ y} • Goal: err(h) − err_S(h) → 0 uniformly over h, in probability

  5. Generalization bounds • How do we upper bound the true error? • Use a generalization bound. Roughly speaking (and w.h.p.): true error ≤ empirical error + (complexity of h)/n • More complex classifier ↔ “easier” to fit arbitrary data • VC-dimension: the size of the largest point set that can be shattered by the hypothesis class

  6. Popular approach for classification • Assume the points are in Euclidean space! • Pros • Existence of an inner product • Efficient algorithms (SVM) • Good generalization bounds (max margin) • Cons • Many natural settings are non-Euclidean • Euclidean structure is a strong assumption • Recent popular focus: metric space data

  7. Metric space • (X,d) is a metric space if • X = set of points • d(·,·) = distance function that is • nonnegative • symmetric • satisfies the triangle inequality • inner product → norm, and norm → metric • but the reverse implications do not hold • (Figure: pairwise road distances between Haifa, Tel Aviv and Be'er Sheva, namely 95km, 113km and 208km, obey the triangle inequality)

  8. Classification for metric data? • Advantage: often much more natural • much weaker assumption • strings • images (earthmover distance) • Problem: no vector representation • No notion of dot-product (and no kernel) • What to do? • Invent a kernel (e.g. embed into Euclidean space)? Possible high distortion! • Use some NN heuristic? The NN classifier has infinite VC-dimension!
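
For concreteness (an illustration, not from the talk): the edit distance is a metric on strings that has no natural vector representation.

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: nonnegative, symmetric, and satisfies the triangle
    inequality, so it is a metric on strings."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution
        prev = curr
    return prev[-1]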

  9. Preliminaries: Lipschitz constant • The Lipschitz constant L of a function f: X → R measures its smoothness • It is the smallest value L that satisfies |f(xi) − f(xj)| ≤ L·d(xi, xj) for all points xi, xj in X • Often denoted Lip(f) or ‖f‖_Lip • Suppose hypothesis h: S → {-1,1} is consistent with sample S • The Lipschitz constant of h is determined by the closest pair of differently labeled points • Equivalently, it is at least 2/d(S+,S−)
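
A minimal sketch (not from the talk) of this quantity, assuming a user-supplied metric dist:

def sample_lipschitz_constant(points, labels, dist):
    """Smallest Lipschitz constant of any function that takes the value y_i at x_i:
    2 / d(S+, S-), where d(S+, S-) is the distance between the closest
    oppositely-labeled pair of sample points."""
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    return 2.0 / min(dist(p, q) for p in pos for q in neg)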

  10. Preliminaries: Lipschitz extension • Lipschitz extension: a classic problem in analysis • Given a function f: S → R for S ⊂ X, extend f to all of X without increasing the Lipschitz constant • Example: points on the real line, f(1) = 1, f(-1) = -1 (credit: A. Oberman)
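
One classical construction (standard analysis, not spelled out on the slide) is the McShane extension, which achieves this:

\tilde{f}(x) \;=\; \min_{s \in S} \bigl[ f(s) + \mathrm{Lip}(f) \cdot d(x, s) \bigr]

It agrees with f on S and has the same Lipschitz constant; the classifier on the next slide is exactly this formula with Lip(f) = 2/d(S+,S−).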

  11. Classification for metric data • A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04) • Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions • Estimation of h on X: the problem of evaluating h at new points of X reduces to extending h to a Lipschitz function on all of X • Lipschitz extension problem • For example f(x) = min_i [ f(xi) + 2·d(x, xi)/d(S+,S−) ] over all (xi,yi) in S • Evaluation of h then reduces to exact nearest neighbor search (see the sketch below) • Strong theoretical motivation for the NNS classification heuristic
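
A minimal sketch (not the authors' implementation) of the resulting classifier, again assuming a user-supplied metric dist:

def make_lipschitz_classifier(points, labels, dist):
    """Lipschitz-extension classifier: f(x) = min_i [ y_i + 2*d(x, x_i) / d(S+, S-) ],
    predict the sign of f(x)."""
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    margin = min(dist(p, q) for p in pos for q in neg)   # d(S+, S-)

    def f(x):
        return min(y + 2.0 * dist(x, xi) / margin for xi, y in zip(points, labels))

    return lambda x: 1 if f(x) >= 0 else -1

# Toy usage on the real line: positives near 0, negatives near 5.
classify = make_lipschitz_classifier([0.0, 1.0, 4.0, 5.0], [1, 1, -1, -1],
                                     lambda a, b: abs(a - b))
print(classify(1.4), classify(4.2))   # expected output: 1 -1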

  12. Two new directions • The framework of [vLB '04] leaves open two further questions: • Constructing h: handling noise • Bias-variance tradeoff • Which sample points in S should h ignore? • Evaluating h on X • In an arbitrary metric space, exact NNS requires Θ(n) time • Can we do better?

  13. Doubling dimension • Definition: ball B(x,r) = all points within distance r from x • The doubling constant (of a metric M) is the minimum value λ such that every ball can be covered by λ balls of half the radius • First used by [Assouad '83], algorithmically by [Clarkson '97] • The doubling dimension is ddim(M) = log2 λ(M) • A metric is doubling if its doubling dimension is constant • Euclidean: ddim(R^d) = O(d) • Packing property of doubling spaces: a set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points • (In the illustration, λ ≥ 7)
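
A brute-force sketch (purely illustrative, not from the talk) that upper-bounds the doubling constant of a finite sample by greedily covering every data-spanned ball with half-radius balls:

import math

def doubling_constant_upper_bound(points, dist):
    """For every ball B(x, r) with r a pairwise distance, greedily cover the sample
    points inside it by balls of radius r/2 centered at sample points; the largest
    cover ever used upper-bounds the doubling constant lambda of the sample."""
    lam = 1
    for x in points:
        for y in points:
            r = dist(x, y)
            if r == 0:
                continue
            uncovered = [p for p in points if dist(x, p) <= r]
            used = 0
            while uncovered:
                c = uncovered[0]   # greedy: any uncovered point becomes a center
                uncovered = [p for p in uncovered if dist(c, p) > r / 2]
                used += 1
            lam = max(lam, used)
    return lam   # estimated doubling dimension: math.log2(lam)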

  14. Applications of doubling dimension • Major application to databases • Recall that exact NNS requires Θ(n) time in an arbitrary metric space • There exists a linear size structure that supports approximate nearest neighbor search in time 2^O(ddim) log n • Database/network structures and tasks analyzed via the doubling dimension • Nearest neighbor search structure [KL '04, HM '06, BKL '06, CG '06] • Image recognition (Vision) [KG --] • Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b] • Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11] • Clustering [Tal '04, ABS '08, FM '10] • Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08] • Further applications • Travelling Salesperson [Tal '04] • Embeddings [Ass '84, ABN '08, BRS '07, GK '11] • Machine learning [BLL '09, KKL '10, KKL --] • Note: the above algorithms can be extended to nearly-doubling spaces [GK '10] • Message: this is an active line of research…

  15. Our dual use of doubling dimension • Interestingly, considering the doubling dimension contributes in two different areas • Statistical: function complexity • We bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h • Computational: efficient approximate NNS

  16. Statistical contribution • We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension • vLB provided similar bounds using covering numbers and Rademacher averages • Fat-shattering analysis: • L-Lipschitz functions shatter a set → inter-point distance is at least 2/L • Packing property → the set has at most (diam·L)^O(ddim) points • This is the fat-shattering dimension of the classifier on the space, and is a good measure of its complexity
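
Spelling out the counting step (standard reasoning, slightly expanding the slide): if an L-Lipschitz function f realizes opposite signs with margin 1 on xi and xj, then

2 \le |f(x_i) - f(x_j)| \le L \cdot d(x_i, x_j) \quad\Longrightarrow\quad d(x_i, x_j) \ge 2/L

so any shattered set is 2/L-separated, and the packing property caps its size at (diam·L)^O(ddim).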

  17. Statistical contribution • [BST '99]: • For any f that classifies a sample of size n correctly, we have with probability at least 1−δ: P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n)·(d log(34en/d) log2(578n) + log(4/δ)) • Likewise, if f is correct on all but k examples, we have with probability at least 1−δ: P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n)·(d ln(34en/d) log2(578n) + ln(4/δ))]^(1/2) • In both cases, d is bounded by the fat-shattering dimension, d ≤ (diam·L)^ddim + 1 • Done with the statistical contribution; on to the computational contribution
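
A small helper (illustrative only; it transcribes the second displayed bound literally, reading log2 as the base-2 logarithm and ln as the natural logarithm) for evaluating the bound numerically:

import math

def generalization_bound(k, n, diam, L, ddim, delta):
    """Slide 17's agnostic bound: with probability at least 1 - delta,
    err <= k/n + sqrt( (2/n) * (d*ln(34*e*n/d)*log2(578*n) + ln(4/delta)) ),
    where d is bounded by the fat-shattering dimension (diam*L)^ddim + 1."""
    d = (diam * L) ** ddim + 1
    slack = (2.0 / n) * (d * math.log(34 * math.e * n / d) * math.log2(578 * n)
                         + math.log(4.0 / delta))
    return k / n + math.sqrt(max(slack, 0.0))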

  18. Computational contribution • Evaluation of h for new points in X • Lipschitz extension function f(x) = min_i [ yi + 2·d(x, xi)/d(S+,S−) ] • Requires exact nearest neighbor search, which can be expensive! • New tool: (1+ε)-approximate nearest neighbor search • 2^O(ddim) log n + ε^−O(ddim) time [KL '04, HM '06, BKL '06, CG '06] • If we evaluate f(x) using approximate NNS, we can show that the result agrees with (the sign of) at least one of g(x) = (1+ε)·f(x) + ε and e(x) = (1+ε)·f(x) − ε • Note that g(x) ≥ f(x) ≥ e(x) • g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the approximately evaluated hypothesis, generalize well
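
Since f(x) = min( 1 + 2·d(x,S+)/d(S+,S−), −1 + 2·d(x,S−)/d(S+,S−) ), evaluating the hypothesis only needs the distance from x to each class. A schematic sketch of the approximate evaluation (approx_nn_distance is a hypothetical stand-in for a [KL '04]-style doubling-metric ANN query; the agreement with g and e is the slide's argument, not re-derived here):

def classify_with_ann(x, approx_nn_distance, margin):
    """Evaluate f(x) = min(1 + 2*d(x,S+)/margin, -1 + 2*d(x,S-)/margin)
    with each exact class distance replaced by a (1+eps)-approximate one,
    and predict the sign of the result."""
    d_pos = approx_nn_distance(x, +1)   # within a (1+eps) factor of d(x, S+)
    d_neg = approx_nn_distance(x, -1)   # within a (1+eps) factor of d(x, S-)
    f_approx = min(+1 + 2.0 * d_pos / margin,
                   -1 + 2.0 * d_neg / margin)
    return 1 if f_approx >= 0 else -1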

  19. Final problem: bias-variance tradeoff • Which sample points in S should h ignore? • If f is correct on all but k examples, we have with probability at least 1−δ: P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n)·(d ln(34en/d) log2(578n) + ln(4/δ))]^(1/2), where d ≤ (diam·L)^ddim + 1 • Ignoring more points increases k, but allows a smaller Lipschitz constant L and hence a smaller d

  20. Structural Risk Minimization • Algorithm • Fix a target Lipschitz constant L (O(n^2) possible values) • Locate all pairs of points from S+ and S− whose distance is less than 2/L • At least one point of each such pair must be counted as an error • Goal: remove as few points as possible

  21. Structural Risk Minimization • Algorithm • Fix a target Lipschitz constant L (O(n^2) possible values) • Locate all pairs of points from S+ and S− whose distance is less than 2/L • At least one point of each such pair must be counted as an error • Goal: remove as few points as possible • This is minimum vertex cover • NP-complete in general • Admits a 2-approximation in O(|E|) time

  22. Structural Risk Minimization • Algorithm • Fix a target Lipschitz constant L (O(n^2) possible values) • Locate all pairs of points from S+ and S− whose distance is less than 2/L • At least one point of each such pair must be counted as an error • Goal: remove as few points as possible • This is minimum vertex cover • NP-complete in general • Admits a 2-approximation in O(|E|) time • Minimum vertex cover on a bipartite graph • Equivalent to maximum matching (König's theorem) • Admits an exact solution in O(n^2.376) randomized time [MS '04] (see the sketch below)
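
A minimal sketch of the exact step (illustrative only; it uses networkx's matching routines rather than the O(n^2.376) algebraic algorithm of [MS '04]):

import networkx as nx
from networkx.algorithms import bipartite

def min_errors_for_L(points, labels, dist, L):
    """Conflict graph: one edge per oppositely-labeled pair at distance < 2/L.
    A minimum vertex cover is a smallest set of sample points to discard;
    on this bipartite graph (S+ vs. S-) it is obtained exactly from a
    maximum matching via Konig's theorem."""
    G = nx.Graph()
    pos_nodes = set()
    for i, yi in enumerate(labels):
        for j, yj in enumerate(labels):
            if yi == +1 and yj == -1 and dist(points[i], points[j]) < 2.0 / L:
                G.add_edge(("pos", i), ("neg", j))
                pos_nodes.add(("pos", i))
    if G.number_of_edges() == 0:
        return 0, set()
    matching = bipartite.maximum_matching(G, top_nodes=pos_nodes)
    cover = bipartite.to_vertex_cover(G, matching, top_nodes=pos_nodes)
    return len(cover), cover   # k = number of sample points treated as errors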

  23. Efficient SRM • Algorithm: for each of the O(n^2) candidate values of L • Run the matching algorithm to find the minimum error • Evaluate the generalization bound for this value of L • O(n^4.376) randomized time • Better algorithm • Binary search over the O(n^2) candidate values of L • For each value • Run the greedy 2-approximation: approximate minimum error in O(n^2 log n) total time • Evaluate the approximate generalization bound for this value of L (see the sketch below)
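
A simplified sketch of the overall search (illustrative only; it enumerates every candidate L instead of reproducing the binary search, uses the greedy 2-approximate cover, and takes diam and ddim as inputs):

import math

def greedy_cover_size(edges):
    """Greedy 2-approximation of minimum vertex cover:
    repeatedly take both endpoints of a still-uncovered edge."""
    covered = set()
    for u, v in edges:
        if u not in covered and v not in covered:
            covered.update((u, v))
    return len(covered)

def approximate_srm(points, labels, dist, diam, ddim, delta):
    """For each candidate Lipschitz constant L = 2/d(x_i, x_j) over oppositely-labeled
    pairs, approximate the error count k and evaluate slide 17's bound; keep the best."""
    n = len(points)
    pairs = [(i, j, dist(points[i], points[j]))
             for i, yi in enumerate(labels) for j, yj in enumerate(labels)
             if yi == +1 and yj == -1]
    best = None
    for _, _, d_ij in pairs:
        L = 2.0 / d_ij
        k = greedy_cover_size([(i, j) for i, j, d in pairs if d < 2.0 / L])
        dim = (diam * L) ** ddim + 1
        slack = (2.0 / n) * (dim * math.log(34 * math.e * n / dim) * math.log2(578 * n)
                             + math.log(4.0 / delta))
        bound = k / n + math.sqrt(max(slack, 0.0))
        if best is None or bound < best[0]:
            best = (bound, L, k)
    return best   # (bound value, chosen Lipschitz constant, points discarded)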

  24. Conclusion • Results: • Generalization bounds for Lipschitz classifiers in doubling spaces • Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS • Efficient Structural Risk Minimization • Continuing research: continuous labels • Risk bound via the doubling dimension • Classifier h determined via an LP • Faster LP: low-hop, low-stretch spanners [GR '08a, GR '08b] → fewer constraints, and each variable appears in a bounded number of constraints

  25. Application: earthmover distance • (Figure: earthmover distance between two point sets S and T)
