
Learning with Similarity Functions


Presentation Transcript


1. Learning with Similarity Functions
Maria-Florina Balcan & Avrim Blum, CMU CSD

2. Kernels and Similarity Functions
Kernels have become a powerful tool in ML.
• Useful in practice for dealing with many different kinds of data.
• Elegant theory about what makes a given kernel good for a given learning problem.
Our goal: analyze more general similarity functions.
• In the process we describe ways of constructing good data-dependent kernels.

3. Kernels
• A kernel K is a pairwise similarity function such that there exists an implicit mapping φ with K(x,y) = φ(x)·φ(y).
• Point is: many learning algorithms can be written so that they interact with the data only via dot products.
• If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional φ-space.
• If the data is linearly separable by a large margin in φ-space, we don't have to pay in terms of data or computation time: if the margin is γ in φ-space, only about 1/γ² examples are needed to learn well.
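To make the "interact with the data only via dot products" point concrete, here is a minimal sketch (my illustration, not from the slides) of a kernel perceptron: the perceptron rewritten so every access to the data goes through K(x,y), so any kernel, such as the example RBF kernel below, can be plugged in.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Example kernel; any K(x, y) = phi(x).phi(y) works here.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_perceptron(X, labels, K, epochs=10):
    """Perceptron that touches the data only through K(x, y).

    X: (n, d) array of training points; labels: array of +/-1 values.
    Returns dual coefficients alpha; the learned predictor is
    sign(sum_j alpha[j] * labels[j] * K(X[j], x)).
    """
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # Prediction uses only kernel evaluations, never phi(x) explicitly.
            score = sum(alpha[j] * labels[j] * K(X[j], X[i]) for j in range(n))
            if labels[i] * score <= 0:
                alpha[i] += 1.0
    return alpha

def predict(x, X, labels, alpha, K):
    score = sum(alpha[j] * labels[j] * K(X[j], x) for j in range(len(X)))
    return 1 if score >= 0 else -1
```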

4. General Similarity Functions
Goal: a definition of a good similarity function for a learning problem that:
1) Talks in terms of natural, direct properties:
• no implicit high-dimensional spaces
• no requirement of positive-semidefiniteness
2) Has implications for learning: if K satisfies these properties for our given problem, we can learn well.
3) Is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in φ-space).

5. A First Attempt: Definition Satisfying Properties (1) and (2)
Let P be a distribution over labeled examples (x, l(x)).
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
• Example: suppose any two positives have K(x,y) ≥ 0.2, any two negatives have K(x,y) ≥ 0.2, but for a positive and a negative K(x,y) is uniform random in [-1,1]. Note: this might not be a legal kernel.
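The condition in this definition can be estimated directly from a labeled sample. Below is a minimal sketch (my illustration, not from the slides) that, for each point x, compares the average similarity to same-label points against the average similarity to different-label points.

```python
import numpy as np

def empirical_gap(X, labels, K):
    """For each x, estimate E[K(x,y) | same label] - E[K(x,y) | different label]."""
    n = len(X)
    gaps = np.zeros(n)
    for i in range(n):
        same = [K(X[i], X[j]) for j in range(n) if j != i and labels[j] == labels[i]]
        diff = [K(X[i], X[j]) for j in range(n) if labels[j] != labels[i]]
        gaps[i] = np.mean(same) - np.mean(diff)
    return gaps

def estimate_epsilon(X, labels, K, gamma):
    """Fraction of points violating the gamma gap: an empirical estimate of epsilon."""
    return np.mean(empirical_gap(X, labels, K) < gamma)
```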

6. A First Attempt: Definition Satisfying Properties (1) and (2). How to Use It?
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw S- of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the higher average similarity score.
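A minimal sketch of this first-attempt classifier (illustrative code, not from the paper): average the similarity of x to the positive landmarks and to the negative landmarks, and predict whichever label scores higher.

```python
import numpy as np

def fit_first_attempt(X, labels, n_landmarks, seed=0):
    """Draw landmark sets S+ and S- of n_landmarks points each from a labeled sample."""
    rng = np.random.default_rng(seed)
    pos = X[labels == 1]
    neg = X[labels == -1]
    S_plus = pos[rng.choice(len(pos), size=n_landmarks, replace=False)]
    S_minus = neg[rng.choice(len(neg), size=n_landmarks, replace=False)]
    return S_plus, S_minus

def predict_first_attempt(x, S_plus, S_minus, K):
    """Predict +1 if x is on average more similar to S+ than to S-."""
    pos_score = np.mean([K(x, y) for y in S_plus])
    neg_score = np.mean([K(x, z) for z in S_minus])
    return 1 if pos_score >= neg_score else -1
```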

7. A First Attempt: How to Use It?
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw S- of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the higher average similarity score.
Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.
Proof
• Hoeffding: for any given "good" x, the probability of error on x (over the draw of S+, S-) is at most δ².
• By Markov, there is at most a δ chance that the error rate over the good points exceeds δ. So the overall error rate is ≤ ε + δ.
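The Hoeffding step can be written out explicitly; here is a sketch, with constants of my choosing picked so the bound comes out to δ².

```latex
% For a "good" x, the expected gap between the two sample averages is at least gamma,
% and each term K(x,.) lies in [-1,1]. The difference of averages over d positive and
% d negative landmarks is a sum of 2d bounded independent terms, so Hoeffding gives:
\Pr\left[\frac{1}{d}\sum_{y \in S^+} K(x,y) \;\le\; \frac{1}{d}\sum_{z \in S^-} K(x,z)\right]
  \;\le\; \exp\!\left(-\frac{d\gamma^2}{4}\right)
  \;\le\; \delta^2
  \quad\text{for } d \ge \frac{4}{\gamma^2}\ln\frac{1}{\delta^2}.
% Markov's inequality then says that with probability at least 1 - delta, no more than
% a delta fraction of the good points are misclassified, for total error <= epsilon + delta.
```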

8. A First Attempt: Not Broad Enough
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
• Example (figure: a cluster of positives that is more similar to the negatives than to the typical positives): K(x,y) = x·y has a good (large-margin) separator but doesn't satisfy our definition.

9. A First Attempt: Not Broad Enough
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
• Idea (figure: a region R of the space): the definition would work if we didn't pick y's from the top-left.
• Broaden to say: OK if there exists a large region R such that most x are on average more similar to the y ∈ R of the same label than to the y ∈ R of the other label.

10. Broader/Main Definition
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
  E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
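Analogous to the earlier check, the weighted condition can be estimated from a sample once a candidate weighting w is fixed; a sketch follows (my illustration, not from the slides), with the indicator of a region R, as on the previous slide, as one natural choice of w.

```python
import numpy as np

def weighted_empirical_gap(X, labels, K, w):
    """For each x, estimate E[w(y)K(x,y) | same label] - E[w(y)K(x,y) | different label]."""
    n = len(X)
    gaps = np.zeros(n)
    for i in range(n):
        same = [w(X[j]) * K(X[i], X[j]) for j in range(n) if j != i and labels[j] == labels[i]]
        diff = [w(X[j]) * K(X[i], X[j]) for j in range(n) if labels[j] != labels[i]]
        gaps[i] = np.mean(same) - np.mean(diff)
    return gaps

# Example weighting: indicator of a (hypothetical) region R, here a half-space.
def w_region(y):
    return 1.0 if y[0] >= 0 else 0.0
```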

11. Main Definition, How to Use It
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
  E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+ = {y1, ..., yd}, S- = {z1, ..., zd}, d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data: F(x) = [K(x,y1), ..., K(x,yd), K(x,z1), ..., K(x,zd)].
• Take a new set of labeled examples, project them into this space, and run your favorite algorithm for learning linear separators.
Point is: with probability ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4, namely w = [w(y1), ..., w(yd), -w(z1), ..., -w(zd)].
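A minimal sketch of the triangulation approach (illustrative, not from the paper; the choice of scikit-learn's LinearSVC as the "favorite algorithm for learning linear separators" is my assumption):

```python
import numpy as np
from sklearn.svm import LinearSVC  # any linear-separator learner would do

def triangulate(X, S_plus, S_minus, K):
    """Map each x to F(x) = [K(x,y1),...,K(x,yd), K(x,z1),...,K(x,zd)]."""
    landmarks = list(S_plus) + list(S_minus)
    return np.array([[K(x, l) for l in landmarks] for x in X])

def learn_with_similarity(X_landmark, y_landmark, X_train, y_train, K, d, seed=0):
    rng = np.random.default_rng(seed)
    pos = X_landmark[y_landmark == 1]
    neg = X_landmark[y_landmark == -1]
    S_plus = pos[rng.choice(len(pos), size=d, replace=False)]
    S_minus = neg[rng.choice(len(neg), size=d, replace=False)]
    # Project a fresh labeled sample into the 2d-dimensional similarity space
    # and learn a linear separator there.
    F_train = triangulate(X_train, S_plus, S_minus, K)
    clf = LinearSVC().fit(F_train, y_train)
    return clf, S_plus, S_minus

def predict_similarity(clf, X, S_plus, S_minus, K):
    return clf.predict(triangulate(X, S_plus, S_minus, K))
```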

12. Main Definition, Implications
Algorithm
• Draw S+ = {y1, ..., yd}, S- = {z1, ..., zd}, d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data: F(x) = [K(x,y1), ..., K(x,yd), K(x,z1), ..., K(x,zd)].
Guarantee: with probability ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
Implications: starting from an arbitrary similarity function K, being an (ε,γ)-good similarity function means the map F yields a legal kernel that is an (ε + δ, γ/4)-good kernel function.

13. Good Kernels Are Good Similarity Functions
Main Definition: K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
  E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Theorem
• An (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition. Our current proofs incur some penalty: ε' = ε + ε_extra, γ' = γ³ε_extra.

14. Good Kernels Are Good Similarity Functions
Theorem
• An (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition, where ε' = ε + ε_extra, γ' = γ³ε_extra.
Proof Sketch
• Suppose K is a good kernel in the usual sense.
• Then standard margin bounds imply: if S is a random sample of size Õ(1/(γ²ε)), then whp we can give weights wS(y) to all examples y ∈ S so that the weighted sum of these examples defines a good LTF.
• But we want sample-independent weights, and bounded ones.
• Boundedness is not too hard (imagine a margin-perceptron run over just the good y).
• Sample-independence follows from an averaging argument.

15. Learning with Multiple Similarity Functions
• Let K1, ..., Kr be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good.
Algorithm
• Draw S+ = {y1, ..., yd}, S- = {z1, ..., zd}, d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data: F(x) = [K1(x,y1), ..., Kr(x,yd), K1(x,z1), ..., Kr(x,zd)].
Guarantee: the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + δ at margin at least a bound given as a formula on the slide; the sample complexity is roughly an expression likewise given as a formula on the slide.
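A minimal sketch of the multiple-similarity extension (illustrative; the concatenation order and the learner are my choices): each landmark now contributes r coordinates, one per similarity function, so a linear separator in the expanded space can realize any convex combination of the K_i.

```python
import numpy as np
from sklearn.svm import LinearSVC

def triangulate_multi(X, landmarks, Ks):
    """Map x to [K1(x,l),...,Kr(x,l)] concatenated over all landmarks l."""
    return np.array([[K(x, l) for l in landmarks for K in Ks] for x in X])

def learn_with_similarities(X_landmark, y_landmark, X_train, y_train, Ks, d, seed=0):
    rng = np.random.default_rng(seed)
    pos = X_landmark[y_landmark == 1]
    neg = X_landmark[y_landmark == -1]
    S_plus = pos[rng.choice(len(pos), size=d, replace=False)]
    S_minus = neg[rng.choice(len(neg), size=d, replace=False)]
    landmarks = np.concatenate([S_plus, S_minus])
    # 2*d*r-dimensional similarity space; learn a linear separator in it.
    F_train = triangulate_multi(X_train, landmarks, Ks)
    clf = LinearSVC().fit(F_train, y_train)
    return clf, landmarks
```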

16. Implications & Conclusions
• We develop a theory that provides a formal way of understanding kernels as similarity functions.
• Our algorithms work for similarity functions that aren't necessarily PSD (or even symmetric).
Open Problems
• Improve the existing bounds.
• Better results for learning with multiple similarity functions, extending [SB'06].
