Similarity-based Classifiers: Problems and Solutions

[Figure: paintings (samples) labeled by painter (class): Van Gogh or Monet. Is the unlabeled painting a Van Gogh or a Monet?]

Computational Biology

- Smith-Waterman algorithm (Smith & Waterman, 1981)
- FASTA algorithm (Lipman & Pearson, 1985)
- BLAST algorithm (Altschul et al., 1990)

Computer Vision

- Tangent distance (Duda et al., 2001)
- Earth mover’s distance (Rubner et al., 2000)
- Shape matching distance (Belongie et al., 2002)
- Pyramid match kernel (Grauman & Darrell, 2007)

Information Retrieval

- Levenshtein distance (Levenshtein, 1966)
- Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)
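As a concrete instance of one of the similarity functions listed above, here is a minimal sketch of the Levenshtein distance. The helper `levenshtein_similarity` is a hypothetical convenience (not from the slides) that maps the distance to a score in [0, 1] so it can be used as a similarity.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    # prev[j] holds the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Hypothetical helper: turn the distance into a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Similarities built this way are generally not guaranteed to yield a positive semidefinite matrix.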

[Figure: 96 books; eigenvalue spectrum (axes: Eigenvalues vs. Rank)]

Flip, Clip or Shift?

Best bet is Clip.
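A minimal sketch of the three spectrum modifications, assuming the usual definitions: flip replaces each eigenvalue by its absolute value, clip sets negative eigenvalues to zero, and shift adds the magnitude of the most negative eigenvalue to every eigenvalue. The function name `modify_spectrum` and the toy matrix are illustrative, not from the slides.

```python
import numpy as np


def modify_spectrum(S, method="clip"):
    """Return a PSD version of a similarity matrix S."""
    S = np.asarray(S, dtype=float)
    S = 0.5 * (S + S.T)  # assumption: symmetrize first if S is asymmetric
    eigvals, U = np.linalg.eigh(S)
    if method == "flip":
        new_vals = np.abs(eigvals)
    elif method == "clip":
        new_vals = np.maximum(eigvals, 0.0)
    elif method == "shift":
        # Subtracting a negative minimum adds |lambda_min| to every eigenvalue.
        new_vals = eigvals - min(eigvals.min(), 0.0)
    else:
        raise ValueError(f"unknown method: {method}")
    # Rebuild U diag(new_vals) U^T; U * new_vals scales the columns of U.
    return (U * new_vals) @ U.T


# Toy indefinite similarity matrix (one negative eigenvalue).
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
S_clip = modify_spectrum(S, "clip")
```

Clip gives the nearest PSD matrix to S in Frobenius norm, which is one common argument for preferring it.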

Learn the best kernel matrix for the SVM:

(Luss & d’Aspremont, NIPS 2007; Chen et al., ICML 2009)

- SVM (Graepel et al., 1998; Liao & Noble, 2003)
- Linear programming (LP) machine (Graepel et al., 1999)
- Linear discriminant analysis (LDA) (Pekalska et al., 2001)
- Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
- Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008)

Take a weighted vote of the k-nearest-neighbors:

Algorithmic parallel of the exemplar model of human learning.
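A minimal sketch of a weighted k-NN vote from similarities alone. The slide's own weighting formula is not recoverable here, so `weight_fn` is a hypothetical hook; when it is omitted, the sketch simply uses each neighbor's similarity as its weight.

```python
from collections import defaultdict


def weighted_knn_vote(similarities, labels, k, weight_fn=None):
    """Classify a test sample from its similarities to the training samples.

    similarities[i] is psi(x, x_i); labels[i] is the class of x_i.
    """
    # Indices of the k most similar training samples.
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)[:k]
    neighbor_sims = [similarities[i] for i in order]
    # Assumption: default weights are the similarities themselves.
    weights = weight_fn(neighbor_sims) if weight_fn else neighbor_sims
    votes = defaultdict(float)
    for i, w in zip(order, weights):
        votes[labels[i]] += w
    # The class with the largest total weight wins.
    return max(votes, key=votes.get)


sims = [0.9, 0.2, 0.8, 0.85, 0.1]
painters = ["Van Gogh", "Monet", "Monet", "Monet", "Van Gogh"]
prediction = weighted_knn_vote(sims, painters, k=3)
```

With k=3 the two Monet neighbors outvote the single, slightly closer Van Gogh neighbor.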

Design Goal 1 (Affinity): w_i should be an increasing function of ψ(x, x_i).

Design Goal 2 (Diversity): w_i should be a decreasing function of ψ(x_i, x_j).

Linear interpolation weights will meet these goals:

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

regularizes the variance of the weights

Only inner products are needed, so they can be replaced with kernel values or similarities!

Kernel ridge interpolation (KRI) weights:

affinity:

diversity:
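The KRI formula itself did not survive extraction. The sketch below assumes the objective given in Chen et al. (JMLR 2009), as best I recall it: minimize (1/2) wᵀSw − sᵀw + (λ/2) wᵀw subject to w ≥ 0 and Σ w_i = 1, where S holds the similarities among the k neighbors and s the similarities between the test sample and each neighbor. Under that assumption, −sᵀw rewards affinity and wᵀSw penalizes putting weight on mutually similar neighbors (diversity). The solver (projected gradient descent) and all function names are illustrative choices, and S is assumed to be PSD already.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                       # entries in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)


def kri_weights(S, s, lam=0.1, n_iter=2000):
    """Approximate KRI weights by projected gradient descent (a sketch)."""
    S = np.asarray(S, dtype=float)
    s = np.asarray(s, dtype=float)
    A = S + lam * np.eye(len(s))               # Hessian of the objective
    step = 1.0 / np.linalg.eigvalsh(A).max()   # 1 / Lipschitz constant
    w = np.full(len(s), 1.0 / len(s))          # start from uniform weights
    for _ in range(n_iter):
        w = project_to_simplex(w - step * (A @ w - s))
    return w


# Neighbors 0 and 1 are near-duplicates; all three are equally similar to x.
S = np.array([[1.00, 0.95, 0.10],
              [0.95, 1.00, 0.10],
              [0.10, 0.10, 1.00]])
w = kri_weights(S, s=[0.8, 0.8, 0.8], lam=0.1)
```

In this toy case the distinct third neighbor receives more weight than either near-duplicate, which is the diversity behavior the design goals ask for.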

Remove the constraints on the weights:

Can show equivalent to local ridge regression:

KRR weights.
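A minimal sketch of the unconstrained case, assuming the objective (1/2) wᵀSw − sᵀw + (λ/2) wᵀw from Chen et al. (JMLR 2009), with S the similarities among the neighbors and s the similarities between the test sample and each neighbor. Setting the gradient (S + λI)w − s to zero gives the closed form w = (S + λI)⁻¹ s. The function name is illustrative.

```python
import numpy as np


def krr_weights(S, s, lam=0.1):
    """Solve (S + lam * I) w = s for the unconstrained weights."""
    S = np.asarray(S, dtype=float)
    s = np.asarray(s, dtype=float)
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(S + lam * np.eye(len(s)), s)


w = krr_weights(np.eye(3), [0.9, 0.5, 0.1], lam=1.0)
```

Unlike the constrained weights, these need not sum to one and can be negative.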

[Figure: KRI weights compared with KRR weights]

Reg. Local SDA performance: competitive.

Performance depends heavily on the oddities of each dataset.

Weighted k-NN with affinity-diversity weights works well.

Preliminary: Reg. Local SDA works well.

Probabilities are useful.

Local models are useful:

- less approximating
- hard to model the entire space (underlying manifold?)
- always feasible

Making S PSD.

Fast k-NN search for similarities

Similarity-based regression

Relationship with learning on graphs

Try it out on real data

Fusion with Euclidean features (see our FUSION 2009 papers)

Open theoretical questions (Chen et al., JMLR 2009; Balcan et al., ML 2008)

Code/Data/Papers: idl.ee.washington.edu/similaritylearning

Similarity-based Classification by Chen et al., JMLR 2009

For a test sample x, given , shall we classify x as

No! If a training sample were used as a test sample, its class could change!
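One way to read this warning, sketched under assumptions: if the classifier is trained on a clipped similarity matrix, a test sample's similarity vector should pass through the same linear transformation, so that a training sample presented at test time is represented by its clipped column. Writing S = U Λ Uᵀ, the matrix P = U diag(1[λ_i ≥ 0]) Uᵀ satisfies P S = S_clip. The function name and toy matrix are illustrative.

```python
import numpy as np


def clip_with_projection(S):
    """Return the clipped matrix S_clip and the matrix P with P @ S == S_clip."""
    eigvals, U = np.linalg.eigh(S)
    keep = (eigvals >= 0).astype(float)         # 1 for kept eigen-directions
    P = (U * keep) @ U.T                        # projects out negative directions
    S_clip = (U * np.maximum(eigvals, 0.0)) @ U.T
    return S_clip, P


S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
S_clip, P = clip_with_projection(S)

# Treat training sample 0 as a test sample: its raw similarity vector is S[:, 0].
s_raw = S[:, 0]
s_consistent = P @ s_raw                        # matches S_clip[:, 0]
```

Using `s_raw` directly would feed the classifier a vector that differs from the column it was trained on, which is how a training sample's predicted class can change.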

[Figure: eigenvalue spectra (Eigenvalue vs. Eigenvalue Rank) for the Amazon, Aural Sonar, Protein, Voting, Yeast-5-7, and Yeast-5-12 datasets]

Empirical risk minimization (ERM) with regularization:

Hinge loss:

SVM Primal:
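The formulas for these three items were lost in extraction. As a rough sketch of the standard forms they usually denote, the code below evaluates the hinge loss max(0, 1 − y f(x)) and a regularized empirical risk for a kernel expansion f(x_i) = Σ_j c_j K_ij + b. The 1/n averaging, the regularizer η cᵀKc, and all names are assumptions, not the slides' notation.

```python
import numpy as np


def hinge_loss(y, score):
    """Hinge loss for labels y in {-1, +1} and real-valued scores."""
    return np.maximum(0.0, 1.0 - y * score)


def regularized_empirical_risk(c, b, K, y, eta):
    """Mean hinge loss plus eta * c^T K c (assumed scaling)."""
    c = np.asarray(c, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = K @ c + b                 # f(x_i) for every training sample
    return hinge_loss(y, scores).mean() + eta * (c @ K @ c)


K = np.eye(2)                          # toy kernel matrix
risk = regularized_empirical_risk([1.0, -1.0], 0.0, K, [1, -1], eta=0.5)
```

Minimizing this objective over c and b is the unconstrained counterpart of the usual slack-variable SVM primal.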

Find the best kernel matrix K for classification, regularized toward S:

SVM that learns the full kernel matrix:

SVM Dual:

Robust SVM (Luss & d’Aspremont, 2007):

“This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K.”

Let

Rewrite the robust SVM as

Theorem (Sion, 1958)

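Let M and N be convex spaces, one of which is compact, and let f(μ, ν) be a function on M × N that is quasiconcave in M, quasiconvex in N, upper semi-continuous in μ for each ν ∈ N, and lower semi-continuous in ν for each μ ∈ M. Then

$$\sup_{\mu \in M} \inf_{\nu \in N} f(\mu, \nu) = \inf_{\nu \in N} \sup_{\mu \in M} f(\mu, \nu).$$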

By Sion’s minimax theorem, the robust SVM is equivalent to:

zero duality gap

Compare

It is not trivial to directly solve:

Lemma (Generalized Schur Complement)

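Let $K \in \mathbb{R}^{n \times n}$ be symmetric, $z \in \mathbb{R}^{n}$, and $u \in \mathbb{R}$. Then

$$\begin{bmatrix} K & z \\ z^{T} & u \end{bmatrix} \succeq 0$$

if and only if $K \succeq 0$, z is in the range of K, and $u - z^{T} K^{\dagger} z \ge 0$.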

Let , and notice that since .

It is not trivial to directly solve:

However, it can be expressed as a convex conic program:

- We can recover the optimal by .

Concerns about learning the full kernel matrix:

- Though the problem is convex, the number of variables is O(n^2).
- The flexibility of the model may lead to overfitting.