
Introduction to Machine Learning

Manik Varma

Microsoft Research India

http://research.microsoft.com/~manik

manik@microsoft.com


Binary Classification

  • Is this person Madhubala or not?
  • Is this person male or female?
  • Is this person beautiful or not?

Multi-Class Classification

  • Is this person Madhubala, Lalu or Rakhi Sawant?
  • Is this person happy, sad, angry or bemused?

Ordinal Regression

  • Is this person very beautiful, beautiful, ordinary or ugly?

Regression

  • How beautiful is this person on a continuous scale of 1 to 10? 9.99?

Ranking

  • Rank these people in decreasing order of attractiveness.

Multi-Label Classification

  • Tag this image with the set of relevant labels from {female, Madhubala, beautiful, IITD faculty}
Are These Problems Distinct?

Can regression solve all these problems?

    • Binary classification – predict p(y=1|x)
    • Multi-Class classification – predict p(y=k|x)
    • Ordinal regression – predict p(y=k|x)
    • Ranking – predict and sort by relevance
    • Multi-Label Classification – predict p(y ∈ {±1}^k|x)
  • Learning from experience and data
    • In what form can the training data be obtained?
    • What is known a priori?
  • Complexity of training
  • Complexity of prediction

In This Course

Supervised learning

      • Classification
        • Generative methods
          • Nearest neighbour, Naïve Bayes
        • Discriminative methods
          • Logistic Regression
        • Discriminant methods
          • Support Vector Machines
      • Regression, Ranking, Feature Selection, etc.
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning

Learning from Noisy Data

Noise and uncertainty

    • Unknown generative model Y = f(X)
    • Noise in measuring input and feature extraction
    • Noise in labels
    • Nuisance variables
    • Missing data
    • Finite training set size

Probability Theory

Non-negativity and unit measure

    • 0 ≤ p(y) , p() = 1, p() = 0
  • Conditional probability – p(y|x)
    • p(x, y) = p(y|x) p(x) = p(x|y) p(y)
  • Bayes’ Theorem
    • p(y|x) = p(x|y) p(y) / p(x)
  • Marginalization
    • p(x) = yp(x, y) dy
  • Independence
    • p(x1, x2) = p(x1) p(x2) ⇔ p(x1|x2) = p(x1)
  • Chris Bishop, “Pattern Recognition & Machine Learning”

Probability Distribution Functions

Bernoulli: Single trial with probability of success = θ

    • n ∈ {0, 1}, θ ∈ [0, 1]
    • p(n|θ) = θ^n (1 – θ)^(1–n)
  • Binomial: N iid Bernoulli trials with n successes
    • n ∈ {0, 1, …, N}, θ ∈ [0, 1]
    • p(n|N,θ) = NCn θ^n (1 – θ)^(N–n)
  • Multinomial: N iid trials, outcome k occurs nk times
    • nk ∈ {0, 1, …, N}, ∑k nk = N, θk ∈ [0, 1], ∑k θk = 1
    • p(n|N,θ) = N! ∏k θk^nk / nk!

A Toy Example

We don’t know whether a coin is fair or not. We are told that heads occurred n times in N coin flips.

  • We are asked to predict whether the next coin flip will result in a head or a tail.
  • Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail
  • We should predict heads if p(y=1|n,N) > p(y=0|n,N)

The Maximum Likelihood Approach

Let p(y=1|n,N) = θ and p(y=0|n,N) = 1 – θ, so that we should predict heads if θ > ½

  • How should we estimate θ?
  • Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of θ that maximizes the likelihood of observing the data
  • θML = argmaxθ p(n|θ) = argmaxθ NCn θ^n (1 – θ)^(N–n)
      • = argmaxθ n log(θ) + (N – n) log(1 – θ)
      • = n / N
  • We should predict heads if n > ½ N

The Maximum A Posteriori Approach

We should choose the value of θ maximizing the posterior probability of θ conditioned on the data

  • We assume a
    • Binomial likelihood : p(n|θ) = NCn θ^n (1 – θ)^(N–n)
    • Beta prior : p(θ|a,b) = θ^(a–1) (1 – θ)^(b–1) Γ(a+b)/(Γ(a)Γ(b))
  • θMAP = argmaxθ p(θ|n,a,b) = argmaxθ p(n|θ) p(θ|a,b)
    • = argmaxθ θ^n (1 – θ)^(N–n) θ^(a–1) (1 – θ)^(b–1)
      • = (n+a–1) / (N+a+b–2) as if we saw an extra a – 1 heads & b – 1 tails
  • We should predict heads if n > ½ (N + b – a)

The Bayesian Approach

We should marginalize over θ

  • p(y=1|n,a,b) = ∫ p(y=1|n,θ) p(θ|a,b,n) dθ
              • = ∫ θ p(θ|a,b,n) dθ
              • = ∫ θ Beta(θ|a + n, b + N – n) dθ
              • = (n + a) / (N + a + b) as if we saw an extra a heads & b tails
  • We should predict heads if n > ½ (N + b – a)
  • The Bayesian and MAP predictions coincide in this case
  • In the very large data limit, both the Bayesian and MAP predictions coincide with the ML prediction (n > ½ N)
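
A small sketch comparing the three estimates on the coin example; the Beta hyper-parameters a and b below are illustrative choices, not values from the slides.

def coin_estimates(n, N, a=2, b=2):
    """Predictive probability of heads under the ML, MAP and Bayesian approaches.
    a = b = 2 is an illustrative Beta prior that mildly favours a fair coin."""
    theta_ml  = n / N                          # maximum likelihood estimate
    theta_map = (n + a - 1) / (N + a + b - 2)  # posterior mode
    theta_bay = (n + a) / (N + a + b)          # posterior mean (Bayesian prediction)
    return theta_ml, theta_map, theta_bay

# 7 heads in 10 flips: all three approaches already agree on predicting heads
for name, p in zip(("ML", "MAP", "Bayes"), coin_estimates(7, 10)):
    print(name, round(p, 3), "-> heads" if p > 0.5 else "-> tails")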

Approaches to Classification

Memorization

    • Can not deal with previously unseen data
    • Large scale annotated data acquisition cost might be very high
  • Rule based expert system
    • Dependent on the competence of the expert.
    • Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
    • Rules might not transfer to similar problems
  • Learning from training data and prior knowledge
    • Focuses on generalization to novel data

Notation

Training Data

    • Set of N labeled examples of the form (xi, yi)
    • Feature vector – xi ∈ ℝ^D. X = [x1 x2 … xN]
    • Label – yi ∈ {±1}. y = [y1, y2 … yN]ᵗ. Y = diag(y)
  • Example – Gender Identification


(x1 = [image], y1 = +1)
(x2 = [image], y2 = +1)
(x3 = [image], y3 = +1)
(x4 = [image], y4 = -1)


Binary Classification

[Figure: a linear classifier – the hyperplane wᵗx + b = 0 with normal vector w and bias b; parameters θ = [w; b]]

Bayes’ Decision Rule

    • p(y=+1|x) > p(y=-1|x) ? y = +1 : y = -1
    • ⇔ p(y=+1|x) > ½ ? y = +1 : y = -1

Issues to Think About

Bayesian versus MAP versus ML

    • Should we choose just one function to explain the data?
    • If yes, should this be the function that explains the data the best?
    • What about prior knowledge?
  • Generative versus Discriminative
    • Can we learn from “positive” data alone?
    • Should we model the data distribution?
    • Are there any missing variables?
    • Do we just care about the final decision?

Bayesian Approach

p(y|x,X,Y) = fp(y,f|x,X,Y) df

  • = fp(y|f,x,X,Y) p(f|x,X,Y) df
  • = fp(y|f,x) p(f|X,Y) df
  • This integral is often intractable.
  • To solve it we can
    • Choose the distributions so that the solution is analytic (conjugate priors)
    • Approximate the true distribution of p(f|X,Y) by a simpler distribution (variational methods)
    • Sample from p(f|X,Y) (MCMC)

Maximum A Posteriori (MAP)

p(y|x,X,Y) = fp(y|f,x) p(f|X,Y) df

  • = p(y|fMAP,x) when p(f|X,Y) = (f – fMAP)
  • The more training data there is the better p(f|X,Y) approximates a delta function
  • We can make predictions using a single function, fMAP, and our focus shifts to estimating fMAP.

MAP & Maximum Likelihood (ML)

fMAP = argmaxf p(f|X,Y)

  • = argmaxf p(X,Y|f) p(f) / p(X,Y)
  • = argmaxf p(X,Y|f) p(f)
  • fML ≡ argmaxf p(X,Y|f) (Maximum Likelihood)
  • Maximum Likelihood holds if
    • There is a lot of training data so that
    • p(X,Y|f) >> p(f)
    • Or if there is no prior knowledge so that p(f) is uniform (improper)

IID Data

fML = argmaxf p(X,Y|f)

  • = argmaxf ∏i p(xi,yi|f)
  • The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels.
  • In particular, p(X,Y) ≠ ∏i p(xi,yi)

Generative Methods

MAP = argmaxp() Ip(xi,yi| )

  • = argmaxp(x) p(y) Ip(xi,yi| )
  • = argmaxp(x) p(y) Ip(xi|yi,) p(yi|)
  • = argmaxp(x) p(y) Ip(xi|yi,) p(yi|)
  • = [argmaxxp(x) Ip(xi|yi,x)] *
  • [argmaxyp(y) Ip(yi|y)]
  • x and y can be solved for independently
  • The parameters of each class decouple and can be solved for independently

Generative Methods – Naïve Bayes

MAP = [argmaxxp(x) Ip(xi|yi,x)] *

  • [argmaxyp(y) Ip(yi|x)]
  • Naïve Bayes assumptions
    • Independent Gaussian features
        • p(xi|yi,x) = jp(xij|yi,x)
        • p(xij|yi=1,x) = N(xij| j1, i)
    • Improper uniform priors (no prior knowledge)
        • p(x) = p(y) = const
    • Bernoulli labels
        • p(yi=+1|y) = , p(yi=-1|y) = 1-

Generative Methods – Naïve Bayes

ML = [argmaxxIjN(xij| j1, i)] *

  • [argmaxI  (1+yi)/2 (1-)(1-yi)/2]
  • Estimating ML
  • ML = argmaxI  (1+yi)/2 (1-)(1-yi)/2
  • = argmax (N+I yi) log()+ (N-I yi) log(1-)
  • = N+ / N (by differentiating and setting to zero)
  • Estimating ML, ML
  • ML = (1 / N)  yi=1xi
  • 2jML = [ yi=+1 (xij - +jML)2 +  yi=-1 (xij - -jML)2 ]/N

Naïve Bayes – Prediction

p(y=+1|x) = p(x|y=+1) p(y=+1) / p(x)

  • = 1 / (1 + exp(log(p(y=-1)/p(y=+1))
  • + log(p(x|y=-1)/p(x|y=+1))))
  • = 1 / (1 + exp( log(1/θ – 1) – ½ μ₋ᵗΣ⁻¹μ₋
  • + ½ μ₊ᵗΣ⁻¹μ₊ – (μ₊ – μ₋)ᵗΣ⁻¹x ))
  • = 1 / (1 + exp(-b – wᵗx)) (Logistic Regression)
  • p(y=-1|x) = exp(-b – wᵗx) / (1 + exp(-b – wᵗx))
  • log(p(y=-1|x)/p(y=+1|x)) = -b – wᵗx
  • y = sign(b + wᵗx)
  • The decision boundary will be linear!
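
A numpy sketch of the Naïve Bayes estimates and of the equivalent linear classifier; the two-class Gaussian data below is synthetic and purely illustrative, and a shared per-feature variance is assumed as on the slides.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data: class +1 centred at (2, 2), class -1 centred at (0, 0)
X = np.vstack([rng.normal([2, 2], 1.0, (50, 2)), rng.normal([0, 0], 1.0, (50, 2))])
y = np.repeat([+1, -1], 50)

# ML estimates from the slides
theta  = np.mean(y == +1)                                      # p(y = +1)
mu_pos = X[y == +1].mean(axis=0)
mu_neg = X[y == -1].mean(axis=0)
var    = (((X[y == +1] - mu_pos) ** 2).sum(axis=0) +
          ((X[y == -1] - mu_neg) ** 2).sum(axis=0)) / len(y)   # shared sigma_j^2

# Equivalent linear classifier y = sign(b + w^t x)
w = (mu_pos - mu_neg) / var
b = (np.log(theta / (1 - theta))
     - 0.5 * (mu_pos**2 / var).sum() + 0.5 * (mu_neg**2 / var).sum())
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))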

Discriminative Methods

MAP = argmaxp() Ip(xi,yi| )

  • We assume that
    • p() = p(w) p(w)
    • p(xi,yi| ) = p(yi| xi, ) p(xi| )
    • = p(yi| xi, w) p(xi| w)
  • MAP = [argmaxwp(w) Ip(yi| xi, w)] *
  • [argmaxwp(w) Ip(xi|w)]
  • It turns out that only w plays no role in determining the posterior distribution
  • p(y|x,X,Y) = p(y|x, MAP) = p(y|x, wMAP)
  • where wMAP = argmaxwp(w) Ip(yi| xi, w)

Disc. Methods – Logistic Regression

θMAP = argmaxw,b p(w) ∏i p(yi|xi,w)

  • Regularized Logistic Regression
    • Gaussian prior – p(w) = exp(-½ wᵗw)
    • Logistic likelihood –
        • p(yi|xi,w) = 1 / (1 + exp(-yi(b + wᵗxi)))

Regularized Logistic Regression

θMAP = argmaxw,b p(w) ∏i p(yi|xi,w)

  • = argminw,b ½wᵗw + ∑i log(1 + exp(-yi(b + wᵗxi)))
  • Bad news: No closed form solution for w and b
  • Good news: We have to minimize a convex function
    • We can obtain the global optimum
    • The function is smooth
  • Tom Minka, “A comparison of numerical optimizers for LR” (Matlab code)
  • Keerthi et al., “A Fast Dual Algorithm for Kernel Logistic Regression”, ML 05
  • Andrew and Gao, “OWL-QN” ICML 07
  • Krishnapuram et al., “SMLR” PAMI 05

Convex Functions

Convex f : f(x1 + (1- )x2)  f(x1) + (1- )f(x2)

  • The Hessian 2f is always positive semi-definite
  • The tangent is always a lower bound to f

Gradient Descent

Iteration : xn+1 = xn - nf(xn)

  • Step size selection : Armijo rule
  • Stopping criterion : Change in f is “miniscule”
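
A generic sketch of this iteration in Python with a simple Armijo backtracking line search (the constants 1e-4 and 0.5 are conventional choices, not from the slides):

import numpy as np

def gradient_descent(f, grad, x0, eta0=1.0, tol=1e-8, max_iter=1000):
    """Minimize f by gradient descent, choosing the step size by the Armijo rule."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, fx = grad(x), f(x)
        eta = eta0
        # Armijo rule: shrink eta until a sufficient decrease in f is obtained
        while f(x - eta * g) > fx - 1e-4 * eta * (g @ g):
            eta *= 0.5
        x_new = x - eta * g
        if abs(f(x_new) - fx) < tol:        # stop when the change in f is miniscule
            return x_new
        x = x_new
    return x

# Example: minimize the convex quadratic f(x) = |x - 1|^2 (minimum at x = 1)
print(gradient_descent(lambda x: ((x - 1) ** 2).sum(), lambda x: 2 * (x - 1), np.zeros(3)))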

Gradient Descent – Logistic Regression

(w, b) = ½wtw+ I log(1+exp(-yi(b+wtxi)))

  • w(w, b) =w –Ip(-yi|xi,w) yi xi
  • b(w, b) = –Ip(-yi|xi,w) yi
  • Beware of numerical issues while coding!
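
A minimal numpy sketch of this objective and its gradients, using np.logaddexp and SciPy's expit to side-step the numerical issues mentioned above; the data and step size are illustrative.

import numpy as np
from scipy.special import expit   # numerically stable 1 / (1 + exp(-z))

def lr_objective(w, b, X, y):
    # L(w, b) = 1/2 w^t w + sum_i log(1 + exp(-y_i (b + w^t x_i)))
    m = y * (X @ w + b)
    return 0.5 * w @ w + np.logaddexp(0.0, -m).sum()

def lr_gradients(w, b, X, y):
    m = y * (X @ w + b)
    p_wrong = expit(-m)                      # p(-y_i | x_i, w)
    return w - X.T @ (p_wrong * y), -np.sum(p_wrong * y)

# Tiny run of plain gradient descent on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
w, b, eta = np.zeros(3), 0.0, 0.01
for _ in range(500):
    gw, gb = lr_gradients(w, b, X, y)
    w, b = w - eta * gw, b - eta * gb
print("final objective:", round(lr_objective(w, b, X, y), 2))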

Newton Methods

Iteration : xn+1 = xn - nH-1f(xn)

  • Approximate f by a 2nd order Taylor expansion
  • The error can now decrease quadratically

Newton Methods

quasi newton methods

Computing and inverting the Hessian is expensive

  • Quasi-Newton methods can approximate H⁻¹ directly (LBFGS)
  • Iteration : xn+1 = xn – ηnBn⁻¹∇f(xn)
  • Secant equation : ∇f(xn+1) – ∇f(xn) = Bn+1(xn+1 – xn)
  • The secant equation does not fully determine B
  • LBFGS updates Bn+1⁻¹ using two rank one matrices
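
In practice these updates are rarely hand-coded; a sketch using SciPy's L-BFGS implementation on the regularized logistic regression objective from the earlier slides (the data is synthetic, and the gradient is left to SciPy's finite-difference approximation for brevity):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=200))

def objective(theta):
    # theta packs [w; b]; L = 1/2 w^t w + sum_i log(1 + exp(-y_i (b + w^t x_i)))
    w, b = theta[:-1], theta[-1]
    return 0.5 * w @ w + np.logaddexp(0.0, -y * (X @ w + b)).sum()

# L-BFGS builds its inverse-Hessian approximation from recent gradient differences
res = minimize(objective, np.zeros(6), method="L-BFGS-B")
print("objective:", round(res.fun, 2), " w:", res.x[:-1].round(2), " b:", round(res.x[-1], 2))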

Generative versus Discriminative

A discriminative model might be correct even when the corresponding generative model is not

  • A discriminative model has fewer parameters than the corresponding generative model
  • A generative model's parameters are uncoupled and can often be estimated in closed form
  • A discriminative model's parameters are correlated and training algorithms can be relatively expensive
  • A discriminative model often has lower test error given a “reasonable” amount of training data.
  • A generative model can deal with missing data

Generative versus Discriminative

Let (hA,N) denote the error of hypothesis h trained using algorithm A on N data points

  • When the generative model is correct
        • (hDis,) = (hGen,)
  • When the generative model is incorrect
              • (hDis,) (hGen,)
  • For a linear classifier trained in D dimensions
    • (hDis,N)  (hDis,) + O( [-z log z]½) where z=D/N1
  • It suffices to pick N = (D) points for discriminative learning of linear classifiers
  • For some generative models N = (log D)

Generative versus Discriminative

A generative classifier might converge much faster to its higher asymptotic error

  • Ng & Jordan, “On Discriminative vs. Generative Classifiers” NIPS 02.
  • Tom Mitchell, “Generative and Discriminative Classifiers“

Multi-class Logistic Regression

Multinomial Logistic Regression

  • 1-vs-All
    • Learn L binary classifiers for an L class problem
    • For the lth classifier, examples from class l are +ve while examples from all other classes are –ve
    • Classify new points according to max probability
  • 1-vs-1
    • Learn L(L-1)/2 binary classifiers for an L class problem by considering every class pair
    • Classify novel points by majority vote
    • Classify novel points by building a DAG
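
A sketch of both reduction schemes in Python, using scikit-learn's LogisticRegression as a stand-in binary learner (any binary classifier would do; the three-blob data set is illustrative):

import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression   # stand-in binary learner

def one_vs_all(X, y, L):
    # One binary classifier per class: class l is +ve, all other classes are -ve
    return [LogisticRegression().fit(X, (y == l).astype(int)) for l in range(L)]

def predict_one_vs_all(models, X):
    # Classify new points according to the maximum class probability
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models]).argmax(axis=1)

def one_vs_one(X, y, L):
    # L(L-1)/2 binary classifiers, one for every pair of classes
    models = {}
    for a, b in combinations(range(L), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = LogisticRegression().fit(X[mask], (y[mask] == a).astype(int))
    return models

def predict_one_vs_one(models, X, L):
    # Classify novel points by majority vote over the pairwise classifiers
    votes = np.zeros((len(X), L), dtype=int)
    for (a, b), m in models.items():
        winner = np.where(m.predict(X) == 1, a, b)
        votes[np.arange(len(X)), winner] += 1
    return votes.argmax(axis=1)

# Three Gaussian blobs as a toy 3-class problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)
print((predict_one_vs_all(one_vs_all(X, y, 3), X) == y).mean(),
      (predict_one_vs_one(one_vs_one(X, y, 3), X, 3) == y).mean())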

Multi-class Logistic Regression

Assume

    • Non-linear multi-class classifier
    • Number of classes = L
    • Number of training points per class = N
    • Algorithm training time for M points = O(M³)
    • Classification time given M training points = O(M)

Multi-class Logistic Regression

Multinomial Logistic Regression

    • Training time = O(L⁶N³)
    • Classification time for a new point = O(L²N)
  • 1-vs-All
    • Training time = O(L⁴N³)
    • Classification time for a new point = O(L²N)
  • 1-vs-1
    • Training time = O(L²N³)
    • Majority vote classification time = O(L²N)
    • DAG classification time = O(LN)

Multinomial Logistic Regression

θMAP = argmaxw,b p(w) ∏i p(yi|xi,w)

  • Regularized Multinomial Logistic Regression
    • Gaussian prior
        • p(w) = exp(-½ ∑l wlᵗwl)
    • Multinomial logistic posterior
        • p(yi = l|xi,w) = e^fl(xi) / ∑k e^fk(xi)
        • where fk(xi) = wkᵗxi + bk
      • Note that we have to learn an extra classifier by not explicitly enforcing ∑l p(yi = l|xi,w) = 1

Multinomial Logistic Regression

(w, b) = ½ kwktwk+ I [log(kfk(xi)) - kkyifk(xi)]

  • wk(w, b) =wk +I [ p(yi = k | xi,w) - kyi ]xi
  • bk(w, b) =I [ p(yi = k | xi,w) - kyi ]


Maximum Margin Hyperplane

  • Geometric Intuition: Choose the perpendicular bisector of the shortest line segment joining the convex hulls of the two classes

SVM Notation

Margin = 2 /wtw

  • Support Vector

b

  • Support Vector
  • Support Vector
  • Support Vector

w

wtx + b = -1

wtx + b = 0

wtx + b = +1

Calculating the Margin

Let x+ be any point on the +ve supporting plane and x- the closest point on the -ve supporting plane

  • Margin = |x+ – x-|
  • = λ|w| (since x+ = x- + λw)
  • = 2|w|/|w|² (assuming λ = 2/|w|²)
  • = 2/|w|
  • wᵗx+ + b = +1
  • wᵗx- + b = -1
  • ⇒ wᵗ(x+ – x-) = 2  ⇒  λwᵗw = 2  ⇒  λ = 2/|w|²

Hard Margin SVM Primal

Maximize 2/|w|

  • such that wᵗxi + b ≥ +1 if yi = +1
  • wᵗxi + b ≤ -1 if yi = -1
  • Difficult to optimize directly
  • Convex Quadratic Program (QP) reformulation
  • Minimize ½wᵗw
  • such that yi(wᵗxi + b) ≥ 1
  • Convex QPs can be easy to optimize

Linearly Inseparable Data

Minimize ½wᵗw + C #(Misclassified points)

  • such that yi(wᵗxi + b) ≥ 1 (for “good” points)
  • The optimization problem is NP Hard in general
  • Disastrous errors are penalized the same as near misses


Inseparable Data – Hinge Loss

Margin = 2 /wtw

 > 1

  • Misclassified point

 < 1

b

  • Support Vector

 = 0

  • Support Vector

w

wtx + b = -1

 = 0

wtx + b = 0

wtx + b = +1

The C-SVM Primal Formulation

Minimize ½wᵗw + C ∑i ξi

  • such that yi(wᵗxi + b) ≥ 1 – ξi
  • ξi ≥ 0
  • The optimization is a convex QP
  • The globally optimal solution will be obtained
  • Number of variables = D + N + 1
  • Number of constraints = 2N
  • Solvers can train on 800K points in 47K (sparse) dimensions in less than 2 minutes on a standard PC
  • Fan et al., “LIBLINEAR” JMLR 08
  • Bordes et al., “LaRank” ICML 07
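
The constrained QP above is equivalent to the unconstrained hinge-loss problem min ½wᵗw + C ∑i max(0, 1 – yi(wᵗxi + b)); a minimal (and far slower than LIBLINEAR) subgradient-descent sketch on synthetic data:

import numpy as np

def linear_csvm(X, y, C=1.0, eta=1e-3, iters=2000):
    """Subgradient descent on 1/2 w^t w + C sum_i max(0, 1 - y_i (w^t x_i + b))."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        viol = y * (X @ w + b) < 1                    # points violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+1, 1, (100, 2)), rng.normal(-1, 1, (100, 2))])
y = np.repeat([+1.0, -1.0], 100)
w, b = linear_csvm(X, y)
print("training accuracy:", (np.sign(X @ w + b) == y).mean())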

The C-SVM Dual Formulation

Maximize 1t – ½tYKY

  • such that 1tY = 0
  • 0    C
  • K is a kernel matrix such that Kij = K(xi, xj) = xitxj
  •  are the dual variables (Lagrange multipliers)
  • Knowing  gives us w and b
  • The dual is also a convex QP
    • Number of variables = N
    • Number of constraints = 2N + 1
  • Fan et al., “LIBSVM” JMLR 05
  • Joachims, “SVMLight”
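
For a small problem the dual QP can be handed to a general-purpose solver; a sketch using SciPy's SLSQP with the box constraints 0 ≤ α ≤ C and the equality constraint 1ᵗYα = 0 (the linear-kernel data is synthetic; dedicated solvers such as LIBSVM are of course far more efficient):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.repeat([+1.0, -1.0], 20)
C = 1.0
K = X @ X.T                                     # linear kernel, K_ij = x_i^t x_j
YKY = (y[:, None] * y[None, :]) * K

# SciPy minimizes, so negate the dual objective 1^t a - 1/2 a^t YKY a
res = minimize(lambda a: 0.5 * a @ YKY @ a - a.sum(), np.zeros(len(y)),
               method="SLSQP", bounds=[(0.0, C)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = X.T @ (alpha * y)                           # w* = X Y alpha (representer theorem)
free = (alpha > 1e-6) & (alpha < C - 1e-6)      # margin support vectors determine b
b = np.mean(y[free] - X[free] @ w) if free.any() else 0.0
print("support vectors:", int((alpha > 1e-6).sum()),
      " accuracy:", (np.sign(X @ w + b) == y).mean())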


SVMs versus Regularized LR

[Three example figures comparing the SVM and regularized LR solutions: in the first two, most of the SVM αs are zero; in the third, most of the SVM αs are not zero]

Duality

Primal P* = Minx f0(x)

  • s. t. fi(x) ≤ 0  1 ≤ i ≤ N
  • hi(x) = 0  1 ≤ i ≤ M
  • Lagrangian L(x,λ,ν) = f0(x) + ∑i λifi(x) + ∑i νihi(x)
  • Dual D* = Maxλ,ν Minx L(x,λ,ν)
  • s. t. λ ≥ 0

Duality

The Lagrange dual is always concave (even if the primal is not convex) and might be an easier problem to optimize

  • Weak duality : P* ≥ D*
    • Always holds
  • Strong duality : P* = D*
    • Does not always hold
    • Usually holds for convex problems
    • Holds for the SVM QP

Karush-Kuhn-Tucker (KKT) Conditions

If strong duality holds, then for x*, λ* and ν* to be optimal the following KKT conditions must necessarily hold

  • Primal feasibility : fi(x*) ≤ 0 & hi(x*) = 0 for 1 ≤ i
  • Dual feasibility : λ* ≥ 0
  • Stationarity : ∇x L(x*, λ*, ν*) = 0
  • Complementary slackness : λi*fi(x*) = 0
  • If x+, λ+ and ν+ satisfy the KKT conditions for a convex problem then they are optimal

SVM – Duality

Primal P = Minw,,b½wtw + Ct

  • s. t. Y(Xtw + b1) 1 – 
  • 0
  • Lagrangian L(,, w,,b) = ½wtw + Ct – t
  • –t[Y(Xtw + b1) – 1 + ]
  • Dual D = Max 1t – ½tYKY
  • s. t. 1tY = 0
  • 0    C

SVM – KKT Conditions

Lagrangian L(,, w,,b) = ½wtw + Ct – t

  • –t[Y(Xtw + b1) – 1 + ]
  • Stationarity conditions
    • wL= 0  w* = XY* (Representer Theorem)
    • L= 0  C = * + *
    • bL= 0  *tY1 = 0
  • Complimentary Slackness conditions
    • i* [ yi (xitw* + b*) – 1 + i*] = 0
    • i*i* = 0

Hinge Loss and Sparseness in α

Misclassifications and margin violations

    • yif(xi) < 1 ⇒ ξi* > 0 ⇒ βi* = 0 ⇒ αi* = C
  • Support vectors
    • yif(xi) = 1 ⇒ ξi* = 0 & 0 ≤ αi* ≤ C
  • Correct classifications
    • yif(xi) > 1 ⇒ yif(xi) – 1 + ξi* > 0 ⇒ αi* = 0

Linearly Inseparable Data

This 1D dataset can not be separated using a single hyperplane (threshold)

  • We need a non-linear decision boundary

The Kernel Trick

Let the “lifted” training set be { (φ(xi), yi) }

  • Define the kernel such that Kij = K(xi, xj) = φ(xi)ᵗφ(xj)
  • Primal P* = Minw,ξ,b ½wᵗw + Cξᵗ1
  • s. t. Y(Φ(X)ᵗw + b1) ≥ 1 – ξ
  • ξ ≥ 0
  • Dual D* = Maxα 1ᵗα – ½αᵗYKYα
  • s. t. 1ᵗYα = 0
  • 0 ≤ α ≤ C
  • Classifier: f(x) = sign(φ(x)ᵗw + b) = sign(αᵗYK(:,x) + b)

The Kernel Trick

Let (x) = [1, 2x1, … , 2xD , x12, … , xD2, 2x1x2, …, 2x1xD, …, 2xD-1xD]t

  • Define K(xi, xj) = (xi)t (xj) = (xitxj + 1)2
  • Primal
    • Number of variables = D + N + 1
    • Number of constraints = 2N
    • Number of flops for calculating (x)tw = O(D2)
    • Number of flops for deg 20 polynomial = O(D20)
  • Dual
    • Number of variables = N
    • Number of constraints = 2N + 1
    • Number of flops for calculating Kij= O(D)
    • Number of flops for deg 20 polynomial = O(D)

Some Popular Kernels

Linear : K(xi,xj) = xit-1xj

  • Polynomial : K(xi,xj) = (xit-1xj + c)d
  • Gaussian (RBF) : K(xi,xj) = exp( –kk(xik – xjk)2)
  • Chi-Squared : K(xi,xj) = exp( –2(xi, xj) )
  • Sigmoid : K(xi,xj) = tanh(xitxj – c)
  •  should be positive definite, c  0,   0 and d should be a natural number
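
A short numpy sketch that evaluates three of these kernels on a data matrix (Σ is taken to be the identity for simplicity, and the per-dimension γk weights for the RBF kernel are assumed, matching the formula above):

import numpy as np

def linear_kernel(Xi, Xj):
    return Xi @ Xj.T

def polynomial_kernel(Xi, Xj, c=1.0, d=2):
    return (Xi @ Xj.T + c) ** d

def rbf_kernel(Xi, Xj, gamma):
    # K(xi, xj) = exp(- sum_k gamma_k (x_ik - x_jk)^2), one gamma_k per feature
    diff = Xi[:, None, :] - Xj[None, :, :]
    return np.exp(-(gamma * diff ** 2).sum(axis=-1))

X = np.random.default_rng(0).normal(size=(5, 3))
gamma = np.full(3, 0.5)
for K in (linear_kernel(X, X), polynomial_kernel(X, X), rbf_kernel(X, X, gamma)):
    print(K.shape, K[0, 1].round(3))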


Valid Kernels – Mercer’s Theorem

Let Z be a compact subset of ℝ^D and K a continuous symmetric function. Then K is a kernel if

  • ∫Z ∫Z f(x) K(x,z) f(z) dx dz ≥ 0
  • for all square integrable real valued functions f on Z.
  • K is a kernel if every finite symmetric matrix formed by evaluating K on pairs of points from Z is positive semi-definite
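
The finite form of this condition is easy to check numerically on a sample of points; a small sketch (the Gaussian kernel and random data below are illustrative):

import numpy as np

def is_psd_on_sample(kernel, X, tol=1e-9):
    # Evaluate K on all pairs of sample points and check the Gram matrix is PSD
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.default_rng(1).normal(size=(20, 4))
rbf = lambda xi, xj: np.exp(-np.sum((xi - xj) ** 2))
print(is_psd_on_sample(rbf, X))     # True: the Gaussian kernel is valid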

Operations on Kernels

The following operations result in valid kernels

    • K(xi,xj) = ∑k αkKk(xi,xj) (αk ≥ 0)
    • K(xi,xj) = ∏k Kk(xi,xj)
    • K(xi,xj) = f(xi) f(xj) (f : ℝ^D → ℝ)
    • K(xi,xj) = p(K1(xi,xj)) (p : +ve coeff poly)
    • K(xi,xj) = exp(K1(xi,xj))
  • Kernels can be defined over graphs, sets, strings and many other interesting data structures

Kernels

Kernels should encode all our prior knowledge about feature similarities.

  • Kernel parameters can be chosen through cross validation or learnt (see Multiple Kernel Learning).
  • Non-linear kernels can sometimes boost classification performance tremendously.
  • Non-linear kernels are generally expensive (both during training and for prediction)

Structured Output Prediction

Minimizef ½|f|² + C ∑i ξi

  • such that f(xi,yi) ≥ f(xi,y) + Δ(yi,y) – ξi ∀ y ≠ yi
  • ξi ≥ 0
  • Prediction: argmaxy f(x,y)
  • This formulation minimizes the hinge on the loss Δ on the training set subject to regularization on f
  • Can be used to predict sets, graphs, etc. for suitable choices of Δ
  • Taskar et al., “Max-Margin Markov Networks” NIPS 03
  • Tsochantaridis et al., “Large Margin Methods for Structured & Interdependent Output Variables” JMLR 05

Multi-Class SVM

Minimizef ½|f|² + C ∑i ξi

  • such that f(xi,yi) ≥ f(xi,y) + Δ(yi,y) – ξi ∀ y ≠ yi
  • ξi ≥ 0
  • Prediction: argmaxy f(x,y)
  • Δ(yi,y) = 1 – δyi,y
  • f(x,y) = wᵗ[φ(x) ⊗ ψ(y)]
  • = wyᵗφ(x) (assuming ψ(y) = ey)
  • Weston and Watkin, “SVMs for Multi-Class Pattern Recognition” ESANN 99
  • Bordes et al., “LaRank” ICML 07

Multi-Class Primal, Dual & Prediction

P=Minw,½

  • s. t.
  • D=Max
  • s. t.
  • y* = argmaxy
  • = argmaxy

Multi-Class SVM Dual

For L classes, with N points per class, the total number of dual variables is NL²

  • Finding the exact solution for real world non-linear problems is often infeasible
  • In practice, we can obtain an approximate solution or switch to the 1-vs-All or 1-vs-1 formulations

Multi-Class Classification

Assume

    • A non-linear problem with L classes and N points/class
    • SMO training is cubic in the number of dual variables
    • The number of support vectors is the same order as the number of training points
