
Learning with Similarity Functions

Maria-Florina Balcan & Avrim Blum

CMU, CSD

Kernels and Similarity Functions

Kernels have become a powerful tool in ML.

  • Useful in practice for dealing with many different kinds of data.
  • Elegant theory about what makes a given kernel good for a given learning problem.

Our Goal: analyze more general similarity functions.

  • In the process, we describe ways of constructing good data-dependent kernels.


Kernels
  • A kernel K is a pairwise similarity function such that there exists an implicit mapping φ with K(x,y) = φ(x)·φ(y).
  • Point is: many learning algorithms can be written so that they interact with the data only via dot products.
  • If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional φ-space.
  • If the data is linearly separable by a large margin in φ-space, we don't have to pay for that higher dimension in terms of data or computation time.

If the margin is γ in φ-space, we only need about 1/γ² examples to learn well.
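As a minimal, illustrative sketch of that point (not from the slides): a perceptron written so it touches the data only through K. The quadratic kernel and the XOR-style toy data below are assumptions chosen just to show that swapping x·y for K(x,y) gives the algorithm access to the implicit φ-space.

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=20):
    """Dual perceptron: all access to the data goes through K(x_i, x_j)."""
    n = len(X)
    alpha = np.zeros(n)  # mistake counts (dual coefficients)
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    for _ in range(epochs):
        for i in range(n):
            # prediction uses only similarity values, never an explicit phi(x)
            if np.sign(np.sum(alpha * y * G[:, i])) != y[i]:
                alpha[i] += 1  # mistake: add this example to the expansion
    return alpha

# Toy data that is not linearly separable in input space, but is in the
# implicit phi-space of the quadratic kernel K(x,y) = (x.y)^2.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron(X, y, K=lambda a, b: (a @ b) ** 2)
```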

General Similarity Functions

Goal: a definition of a good similarity function for a learning problem that:

1) Talks in terms of natural direct properties:

  • no implicit high-dimensional spaces
  • no requirement of positive-semidefiniteness

2) If K satisfies these properties for our given problem, then it has implications for learning.

3) Is broad: includes usual notion of “good kernel”.

(i.e., induces a large-margin separator in φ-space)

A First Attempt: Definition satisfying properties (1) and (2)

Let P be a distribution over labeled examples (x, l(x)).

  • K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfies:

E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

  • Example: suppose positives have K(x,y) ≥ 0.2 with each other, negatives have K(x,y) ≥ 0.2 with each other, but between a positive and a negative K(x,y) is uniform random in [-1,1] (so the between-class expectation is 0 and the definition holds with γ = 0.2).

Note: this might not be a legal kernel.
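For concreteness, a tiny simulation of this example (an illustration only; modeling the within-class similarity as uniform in [0.2, 1] is one way to satisfy the "≥ 0.2" condition):

```python
import numpy as np

rng = np.random.default_rng(0)

def K(label_a, label_b):
    """Similarity from the example: at least 0.2 within a class, uniform in [-1,1] across classes."""
    if label_a == label_b:
        return rng.uniform(0.2, 1.0)   # modeling choice; the slide only requires >= 0.2
    return rng.uniform(-1.0, 1.0)

# Estimate the two conditional expectations for a positive x.
same = np.mean([K(+1, +1) for _ in range(100_000)])
diff = np.mean([K(+1, -1) for _ in range(100_000)])
print(f"E[K | same label] ~ {same:.2f}, E[K | different label] ~ {diff:.2f}")
# Roughly 0.60 vs 0.00, so the definition holds with a comfortable gap gamma.
```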

A First Attempt: Definition satisfying properties (1) and (2). How to use it?
  • K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfies:

E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

Algorithm

  • Draw S+, a set of O((1/γ²) ln(1/δ²)) positive examples.
  • Draw S−, a set of O((1/γ²) ln(1/δ²)) negative examples.
  • Classify x according to which set gives the higher average similarity score (sketched in code below).
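A minimal sketch of that classification rule (illustrative; the cosine-style similarity and the toy vectors are assumptions, not part of the slides):

```python
import numpy as np

def classify(x, S_plus, S_minus, K):
    """Predict +1 if x is on average more K-similar to the positive sample than to the negative sample."""
    score_pos = np.mean([K(x, y) for y in S_plus])
    score_neg = np.mean([K(x, z) for z in S_minus])
    return 1 if score_pos > score_neg else -1

# Illustrative similarity and landmark sets.
K = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
S_plus = [np.array([1.0, 0.2]), np.array([0.9, 0.1])]     # drawn positives
S_minus = [np.array([-1.0, 0.3]), np.array([-0.8, 0.0])]  # drawn negatives
print(classify(np.array([0.7, 0.1]), S_plus, S_minus, K))  # -> 1
```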

A First Attempt: How to use it?
  • K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfies:

E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

Algorithm

  • Draw S+, a set of O((1/γ²) ln(1/δ²)) positive examples.
  • Draw S−, a set of O((1/γ²) ln(1/δ²)) negative examples.
  • Classify x according to which set gives the higher average similarity score.

Guarantee: with probability ≥ 1−δ, error ≤ ε + δ.

Proof

  • Hoeffding: for any given "good x", the probability of error w.r.t. x (over the draw of S+, S−) is at most δ² (see the rough bound below).
  • By Markov, there is at most a δ chance that the error rate over the good x's exceeds δ. So the overall error rate is ≤ ε + δ.
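For intuition, here is a rough version of the Hoeffding step (a sketch with untuned constants, not the slide's exact bound). For a good x, pair up the draws and let D_i = K(x, y_i) − K(x, z_i) ∈ [−2, 2], so E[D_i] ≥ γ; then

$$\Pr\Big[\tfrac{1}{d}\sum_{i=1}^{d} D_i \le 0\Big] \;\le\; \exp\!\left(-\frac{d\gamma^2}{8}\right) \;\le\; \delta^2 \qquad \text{once } d \ge \frac{8}{\gamma^2}\ln\frac{1}{\delta^2}.$$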


A First Attempt: Not Broad Enough

[Figure: positives and negatives where some examples are more similar to the negatives than to a typical positive.]

  • K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfies:

E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

  • K(x,y) = x·y has a good (large-margin) separator but does not satisfy our definition.

A First Attempt: Not Broad Enough
  • K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfies:

E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

[Figure: the same example, with a large region R marked around a subset of the points.]

Idea: the comparison would work if we didn't pick the y's from the top-left.

Broaden to say: it is OK if there exists a large region R such that most x are on average more similar to the y ∈ R of the same label than to the y ∈ R of the other label.

Broader/Main Definition
  • K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε probability mass of x satisfies:

E_{y~P}[w(y)K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ

Main Definition, How to Use It
  • K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε probability mass of x satisfies:

E_{y~P}[w(y)K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ

Algorithm

  • Draw S+ = {y1, …, yd} and S− = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)).
  • Use these to "triangulate" the data:

F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].

  • Take a new set of labeled examples, project them into this space, and run your favorite algorithm for learning linear separators (a code sketch follows below).

Point is: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.

(namely w = [w(y1), …, w(yd), −w(z1), …, −w(zd)])
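A minimal end-to-end sketch of this pipeline (illustrative only; the linear similarity, the Gaussian toy data, and the use of scikit-learn's LinearSVC are assumptions, not part of the slides):

```python
import numpy as np
from sklearn.svm import LinearSVC

def empirical_feature_map(X, landmarks_pos, landmarks_neg, K):
    """Map each x to [K(x,y1),...,K(x,yd), K(x,z1),...,K(x,zd)]."""
    L = list(landmarks_pos) + list(landmarks_neg)
    return np.array([[K(x, l) for l in L] for x in X])

# Illustrative similarity and data (assumptions for the sketch).
K = lambda a, b: float(np.dot(a, b))
rng = np.random.default_rng(0)
X_pos = rng.normal(+1, 0.5, size=(50, 2))
X_neg = rng.normal(-1, 0.5, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

# Step 1: draw landmark sets S+ and S-.
S_plus, S_minus = X_pos[:10], X_neg[:10]
# Step 2: "triangulate" a labeled sample into the 2d-dimensional similarity space.
F = empirical_feature_map(X, S_plus, S_minus, K)
# Step 3: learn a linear separator in that space.
clf = LinearSVC(C=1.0).fit(F, y)
print("training accuracy:", clf.score(F, y))
```

Note that F(x) needs only evaluations of K, so K never has to be positive semidefinite or even symmetric.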

Main Definition, Implications

Algorithm

  • Draw S+ = {y1, …, yd} and S− = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)).
  • Use these to "triangulate" the data:

F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].

Guarantee: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.

Implications

An arbitrary similarity function K that is an (ε,γ)-good similarity function yields, via the mapping F, an (ε+δ, γ/4)-good kernel function, i.e., a legal kernel.

Good Kernels are Good Similarity Functions

Main Definition: K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε probability mass of x satisfies:

E_{y~P}[w(y)K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ

Theorem

  • An (ε,γ)-good kernel is an (ε′,γ′)-good similarity function under the main definition.

Our current proofs incur some penalty:

ε′ = ε + ε_extra,  γ′ = γ³ ε_extra.

Good Kernels are Good Similarity Functions

Theorem

  • An (ε,γ)-good kernel is an (ε′,γ′)-good similarity function under the main definition, where

ε′ = ε + ε_extra,  γ′ = γ³ ε_extra.

Proof Sketch

  • Suppose K is a good kernel in the usual sense.
  • Then, standard margin bounds imply:
    • if S is a random sample of size Õ(1/(γ²ε)), then w.h.p. we can give weights w_S(y) to all examples y ∈ S so that the weighted sum of these examples defines a good linear threshold function (LTF).
  • But we want weights that are sample-independent (and bounded).
  • Boundedness not too hard (imagine a margin-perceptron run over just the good y).
  • Get sample-independence using an averaging argument.


Learning with Multiple Similarity Functions
  • Let K1, …, Kr be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good.

Algorithm

  • Draw S+ = {y1, …, yd} and S− = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)).
  • Use these to "triangulate" the data (a code sketch follows below):

F(x) = [K1(x,y1), …, Kr(x,yd), K1(x,z1), …, Kr(x,zd)].

Guarantee: the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + δ at margin at least γ/(4√r).
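A minimal sketch of the concatenated mapping for multiple similarity functions (illustrative; the two similarity functions and the toy vectors are assumptions, not part of the slides):

```python
import numpy as np

def multi_similarity_features(x, landmarks_pos, landmarks_neg, sims):
    """Concatenate K_i(x, landmark) for every similarity function K_i and every
    landmark drawn from S+ and S-, giving a 2*d*r dimensional vector."""
    feats = []
    for l in list(landmarks_pos) + list(landmarks_neg):
        for K in sims:
            feats.append(K(x, l))
    return np.array(feats)

# Illustrative similarity functions (assumptions for the sketch):
sims = [
    lambda a, b: float(a @ b),                           # linear
    lambda a, b: float(np.exp(-np.sum((a - b) ** 2))),   # Gaussian-style
]
x = np.array([0.5, -0.2])
S_plus  = [np.array([1.0, 0.0]), np.array([0.8, 0.3])]
S_minus = [np.array([-1.0, 0.1]), np.array([-0.7, -0.4])]
print(multi_similarity_features(x, S_plus, S_minus, sims).shape)  # (2 * d * r,) = (8,)
```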

Implications & Conclusions
  • We develop a theory that provides a formal way of understanding kernels as similarity functions.
  • Our algorithms work for similarity functions that aren't necessarily PSD (or even symmetric).

Open Problems

  • Improve existing bounds.
  • Better results for learning with multiple similarity functions, extending [SB'06].
