
Mathematical Programming in Support Vector Machines


Presentation Transcript


  1. Mathematical Programming in Support Vector Machines. Olvi L. Mangasarian, University of Wisconsin - Madison. High Performance Computation for Engineering Systems Seminar, MIT, October 4, 2000

  2. What is a Support Vector Machine? • An optimally defined surface • Typically nonlinear in the input space • Linear in a higher dimensional space • Implicitly defined by a kernel function

  3. What are Support Vector Machines Used For? • Classification • Regression & Data Fitting • Supervised & Unsupervised Learning (Will concentrate on classification)

  4. Example of Nonlinear Classifier: Checkerboard Classifier

  5. Outline of Talk • Generalized support vector machines (SVMs) • Completely general kernel allows complex classification (No Mercer condition!) • Smooth support vector machines • Smooth & solve SVM by a fast Newton method • Lagrangian support vector machines • Very fast, simple iterative scheme • One matrix inversion: No LP. No QP. • Reduced support vector machines • Handle large datasets with nonlinear kernels

  6. Generalized Support Vector Machines: 2-Category Linearly Separable Case (figure: the two point sets A+ and A-, separated by two parallel bounding planes)

  7. Generalized Support Vector Machines: Algebra of the 2-Category Linearly Separable Case • Given m points in n-dimensional space • Represented by an m-by-n matrix A • Membership of each point in class +1 or -1 specified by an m-by-m diagonal matrix D with +1 & -1 entries • Separate by two bounding planes x'w = γ + 1 and x'w = γ - 1: Aw ≥ eγ + e for the +1 points, Aw ≤ eγ - e for the -1 points • More succinctly: D(Aw - eγ) ≥ e, where e is a vector of ones (a small MATLAB illustration follows this slide)
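A minimal MATLAB illustration of this setup (the toy points and the plane (w, γ) below are made up for illustration, not taken from the talk): build A and D and check the succinct condition D(Aw - eγ) ≥ e.

    % Toy 2-category data: first three rows in class +1, last three in class -1.
    A = [2 2; 3 1; 2.5 3; -1 -1; -2 0; -1.5 -2];   % m-by-n matrix of points
    d = [1; 1; 1; -1; -1; -1];                     % class memberships
    D = diag(d);                                   % m-by-m diagonal matrix of +1 & -1
    e = ones(size(A,1),1);                         % vector of ones

    % A candidate pair of bounding planes x'*w = gamma +/- 1 (hand-picked here).
    w = [1; 1]; gamma = 0;

    % All points respect the bounding planes iff D*(A*w - e*gamma) - e >= 0 componentwise.
    separated = all(D*(A*w - e*gamma) - e >= 0)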

  8. Generalized Support Vector Machines: Maximizing the Margin between Bounding Planes (figure: point sets A+ and A- with the bounding planes x'w = γ ± 1 and the margin between them)

  9. Generalized Support Vector Machines: The Linear Support Vector Machine Formulation • Solve the following mathematical program for some ν > 0 (reconstructed below) • The nonnegative slack variable y is zero iff: • The convex hulls of A+ and A- do not intersect • ν is sufficiently large
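The mathematical program itself was an image in the original slide; the LaTeX below is the standard linear SVM of Mangasarian's papers, reconstructed as an assumption (in particular, the 2-norm term ½w'w is the usual choice, though the talk may have used a different norm on w):

    \min_{w,\,\gamma,\,y}\; \nu\, e^{\top} y + \tfrac{1}{2}\, w^{\top} w
    \quad\text{s.t.}\quad D(Aw - e\gamma) + y \ge e,\qquad y \ge 0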

  10. Breast Cancer Diagnosis Application: 97% Tenfold Cross-Validation Correctness; 780 Samples: 494 Benign, 286 Malignant

  11. Another Application: Disputed Federalist Papers (Bosch & Smith 1998): 56 Hamilton, 50 Madison, 12 Disputed

  12. Generalized Support Vector Machine Motivation (Nonlinear Kernel Without Mercer Condition) • Linear SVM: linear separating surface x'w = γ • Set w = A'Du. Resulting linear surface: x'A'Du = γ • Replace x'A' by an arbitrary nonlinear kernel K(x', A') • Resulting nonlinear surface: K(x', A')Du = γ (a small numerical check follows this slide)
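A small MATLAB check of the substitution above (random data; the Gaussian kernel and its width mu are illustrative choices): with the linear kernel K(x', A') = x'A', the kernel form of the surface reproduces the linear one exactly, and swapping in a Gaussian kernel changes only the kernel evaluation.

    m = 20; n = 5;
    A = randn(m,n); D = diag(sign(randn(m,1))); u = rand(m,1); gamma = 0.3;
    x = randn(n,1);                         % an arbitrary test point

    w = A'*D*u;                             % dual representation of w
    lin1 = x'*w - gamma;                    % linear surface value x'w - gamma
    lin2 = (x'*A')*D*u - gamma;             % same value via the kernel form with K(x',A') = x'A'
    fprintf('difference: %g\n', abs(lin1 - lin2));

    mu = 0.5;                               % Gaussian kernel width (illustrative)
    Kxa = exp(-mu*sum((A - x').^2, 2))';    % 1-by-m row K(x',A'), Gaussian kernel (R2016b+)
    nonlin = Kxa*D*u - gamma                % nonlinear surface value at x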

  13. SSVM: Smooth Support Vector Machine (SVM as Unconstrained Minimization Problem) • Changing the slack term to the 2-norm and measuring the margin in (w, γ) space gives an unconstrained problem (reconstructed below)
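The resulting problem was an image in the slide; the LaTeX below reconstructs the SSVM unconstrained minimization as it appears in the SSVM literature (an assumption about what the slide showed), where (·)_+ sets negative components to zero:

    \min_{w,\,\gamma}\; \frac{\nu}{2}\,\bigl\|\bigl(e - D(Aw - e\gamma)\bigr)_+\bigr\|_2^2 \;+\; \frac{1}{2}\bigl(w^{\top} w + \gamma^2\bigr)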

  14. Smoothing the Plus Function: Integrate the Sigmoid Function

  15. SSVM: The Smooth Support Vector Machine: Smoothing the Plus Function • Integrating the sigmoid approximation to the step function gives a smooth, excellent approximation to the plus function • Replacing the plus function in the nonsmooth SVM by the smooth approximation gives our SSVM (a numerical sketch follows this slide)
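The smoothing formula was an image in the slide; the sketch below uses the smooth plus function of the SSVM paper, p(x, α) = x + (1/α) log(1 + exp(-αx)), which is the integral of the sigmoid 1/(1 + exp(-αx)), and compares it numerically to (x)_+ = max(x, 0) (the value of α is illustrative):

    alpha  = 5;                                      % smoothing parameter (illustrative)
    plusf  = @(x) max(x, 0);                         % the plus function (x)_+
    smooth = @(x,a) x + (1/a)*log(1 + exp(-a*x));    % smooth approximation p(x, alpha)

    x = linspace(-3, 3, 13);
    disp([x; plusf(x); smooth(x, alpha)]');          % columns: x, (x)_+, p(x, alpha)
    max_gap = max(abs(smooth(x, alpha) - plusf(x)))  % equals log(2)/alpha, attained at x = 0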

  16. • Newton: Minimize a sequence of quadratic approximations to the strongly convex objective function, i.e. solve a sequence of linear equations in n+1 variables. (Small dimensional input space.) • Armijo: Shorten the distance between successive iterates so as to generate sufficient decrease in the objective function. (In computational reality, not needed!) • Global Quadratic Convergence: Starting from any point, the iterates are guaranteed to converge to the unique solution at a quadratic rate, i.e. errors get squared. (Typically 6 to 8 iterations, without an Armijo step.) A generic sketch of the Newton-Armijo iteration follows this slide.
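A generic Newton-Armijo sketch in MATLAB (this is not the SSVM-specific code; the test function, the Armijo constant 1e-4, and the tolerances are illustrative assumptions): each iteration solves one linear system for the Newton direction, then halves the step until the sufficient-decrease condition holds.

    % Illustrative strongly convex objective with gradient and Hessian supplied by hand.
    Q = [3 1; 1 2]; b = [1; -1];
    f    = @(x) 0.5*x'*Q*x - b'*x + log(1 + exp(x(1)));
    grad = @(x) Q*x - b + [1/(1 + exp(-x(1))); 0];
    hess = @(x) Q + [exp(-x(1))/(1 + exp(-x(1)))^2, 0; 0, 0];

    x = [5; -5];                              % arbitrary starting point
    for it = 1:50
        g = grad(x);
        if norm(g) < 1e-10, break; end
        dirn = -(hess(x)\g);                  % Newton direction: one linear system
        t = 1;                                % Armijo backtracking line search
        while f(x + t*dirn) > f(x) + 1e-4*t*(g'*dirn)
            t = t/2;
        end
        x = x + t*dirn;
    end
    fprintf('converged in %d iterations to x = [%.6f %.6f]\n', it, x(1), x(2));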

  17. SSVM with a Nonlinear Kernel: Nonlinear Separating Surface in Input Space

  18. Examples of Kernels: Generate Nonlinear Separating Surfaces in Input Space • Polynomial Kernel • Gaussian (Radial Basis) Kernel • Neural Network Kernel (a MATLAB sketch of the first two follows this slide)
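A minimal MATLAB sketch of two of these kernels on an m-by-n data matrix A (the degree d and width mu are illustrative, and the exact parameterizations used in the talk are assumptions; the standard forms are (AB' + 1)^d and exp(-mu*||A_i - B_j||^2)):

    m = 30; n = 4;
    A = randn(m, n);                            % data matrix, one point per row

    % Polynomial kernel of degree d: K(i,j) = (A_i*A_j' + 1)^d
    d = 3;
    Kpoly = (A*A' + 1).^d;

    % Gaussian (radial basis) kernel: K(i,j) = exp(-mu*||A_i - A_j||^2)
    mu = 0.1;
    sq = sum(A.^2, 2);
    Kgauss = exp(-mu*(sq + sq' - 2*(A*A')));    % implicit expansion (MATLAB R2016b+)

    % Both are m-by-m; replacing the second argument by a different matrix B
    % gives a rectangular kernel K(A, B'), as used later by RSVM.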

  19. LSVM: Lagrangian Support Vector Machine: Dual of SVM • Taking the dual of the SVM formulation gives the following simple dual problem (reconstructed below) • The variables (w, γ) of SSVM are related to the dual variable u by: w = A'Du, γ = -e'Du
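The dual problem was an image in the slide; the LaTeX below reconstructs it from the header comment of the MATLAB code on slide 22 (min ½u'Qu - e'u subject to u ≥ 0, with Q = I/ν + HH' and H = D[A -e]):

    \min_{u \ge 0}\; \tfrac{1}{2}\, u^{\top} Q u - e^{\top} u,
    \qquad Q = \frac{I}{\nu} + H H^{\top}, \quad H = D\,[A \;\; -e],
    \qquad w = A^{\top} D u, \quad \gamma = -e^{\top} D u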

  20. LSVM: Lagrangian Support Vector Machine: Dual SVM as Symmetric Linear Complementarity Problem • Defining the two matrices H = D[A -e] and Q = I/ν + HH' • Reduces the dual SVM to: min_{u ≥ 0} ½u'Qu - e'u • The optimality condition for this dual SVM is the LCP: 0 ≤ u, Qu - e ≥ 0, u'(Qu - e) = 0 • which, by Implicit Lagrangian theory, is equivalent to the fixed-point equation sketched below
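The equivalent equation was an image in the slide; following the LSVM paper (a reconstruction, not taken from the slide), the implicit-Lagrangian fixed-point form and the iteration it suggests are, for α > 0:

    Qu - e = \bigl((Qu - e) - \alpha u\bigr)_+
    \qquad\Longrightarrow\qquad
    u^{i+1} = Q^{-1}\Bigl(e + \bigl((Qu^{i} - e) - \alpha u^{i}\bigr)_+\Bigr)

which converges linearly from any starting point for 0 < α < 2/ν; the code on slide 22 uses α = 1.9/ν.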

  21. LSVM Algorithm: Simple & Linearly Convergent – One Small Matrix Inversion • Key Idea: the Sherman-Morrison-Woodbury formula allows the inversion of an extremely large m-by-m matrix Q by merely inverting a much smaller (n+1)-by-(n+1) matrix, as follows (identity and numerical check below)
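The formula was an image in the slide; the identity actually used by the code on slide 22 is (I/ν + HH')⁻¹ = ν(I - H(I/ν + H'H)⁻¹H'), where the only inversion is of the (n+1)-by-(n+1) matrix I/ν + H'H. A small MATLAB check on random data (dimensions illustrative):

    m = 200; n = 5; nu = 0.1;
    A = randn(m, n); D = diag(sign(randn(m,1))); e = ones(m,1);
    H = D*[A -e];                                        % m-by-(n+1)

    Q    = eye(m)/nu + H*H';                             % large m-by-m matrix
    Qinv = nu*(eye(m) - H*((eye(n+1)/nu + H'*H)\H'));    % SMW: only an (n+1)-by-(n+1) solve
    fprintf('max deviation from identity: %g\n', max(max(abs(Q*Qinv - eye(m)))));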

  22. LSVM Algorithm – Linear Kernel: 11 Lines of MATLAB Code

    function [it, opt, w, gamma] = svml(A,D,nu,itmax,tol)
    % lsvm with SMW for min 1/2*u'*Q*u-e'*u s.t. u=>0,
    % Q=I/nu+H*H', H=D[A -e]
    % Input: A, D, nu, itmax, tol; Output: it, opt, w, gamma
    % [it, opt, w, gamma] = svml(A,D,nu,itmax,tol);
    [m,n]=size(A); alpha=1.9/nu; e=ones(m,1); H=D*[A -e]; it=0;
    S=H*inv((speye(n+1)/nu+H'*H));
    u=nu*(1-S*(H'*e)); oldu=u+1;
    while it<itmax & norm(oldu-u)>tol
      z=(1+pl(((u/nu+H*(H'*u))-alpha*u)-1));
      oldu=u;
      u=nu*(z-S*(H'*z));
      it=it+1;
    end;
    opt=norm(u-oldu); w=A'*D*u; gamma=-e'*D*u;

    function pl = pl(x); pl = (abs(x)+x)/2;
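A hypothetical usage sketch (the random data, ν, and tolerances are illustrative; save the function above as svml.m first):

    % Two well-separated Gaussian clouds as a quick test problem.
    m = 1000; n = 10;
    A = [randn(m/2, n) + 1; randn(m/2, n) - 1];
    D = diag([ones(m/2,1); -ones(m/2,1)]);

    nu = 1; itmax = 100; tol = 1e-5;
    [it, opt, w, gamma] = svml(A, D, nu, itmax, tol);

    % Classify the training points with the resulting plane x'*w = gamma.
    pred = sign(A*w - gamma);
    training_correctness = mean(pred == diag(D))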

  23. LSVM Algorithm – Linear Kernel: Computational Results • 2 million random points in 10-dimensional space • Classified in 6.7 minutes in 6 iterations & e-5 accuracy • 250 MHz UltraSPARC II with 2 gigabyte memory • CPLEX ran out of memory • 32562 points in 123-dimensional space (UCI Adult Dataset) • Classified in 141 seconds & 55 iterations to 85% correctness • SVM classified in 178 seconds & 4497 iterations • 400 MHz Pentium II with 2 gigabyte memory

  24. LSVM – Nonlinear Kernel Formulation • For a nonlinear kernel K, the separating nonlinear surface is given in terms of the kernel and the dual variable u • where u is the solution of the dual problem min_{u ≥ 0} ½u'Qu - e'u • with Q redefined by replacing the linear term with the kernel (a hedged reconstruction and sketch follow this slide)
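The kernel formulas were images in the slide; following the LSVM paper (an assumption about what the slide showed), with G = [A -e] the dual matrix becomes Q = I/ν + D K(G, G') D and the separating surface is K([x' -1], G') D u = 0. A minimal MATLAB sketch of the resulting LSVM iteration u ← Q⁻¹(e + ((Qu - e) - αu)_+), run without the SMW shortcut since this Q is no longer a low-rank update (data, kernel, and parameters are illustrative):

    m = 200; n = 2; nu = 1; mu = 2; alpha = 1.9/nu; itmax = 200; tol = 1e-5;
    A = [randn(m/2, n) + 1.5; randn(m/2, n) - 1.5];
    d = [ones(m/2,1); -ones(m/2,1)]; D = diag(d); e = ones(m,1);

    G  = [A -e];                                 % augmented data, as in the LSVM paper
    sq = sum(G.^2, 2);
    K  = exp(-mu*(sq + sq' - 2*(G*G')));         % Gaussian kernel K(G,G') (illustrative)
    Q  = eye(m)/nu + D*K*D;                      % assumed kernel form of Q

    pl = @(x) (abs(x) + x)/2;                    % plus function
    u = Q\e; oldu = u + 1; it = 0;
    while it < itmax && norm(oldu - u) > tol
        oldu = u;
        u = Q\(e + pl((Q*u - e) - alpha*u));     % LSVM iteration, no SMW shortcut
        it = it + 1;
    end
    pred = sign(K*D*u);                          % surface values at the training points
    training_correctness = mean(pred == d)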

  25. LSVM Algorithm – Nonlinear Kernel Application: 100 Iterations, 58 Seconds on Pentium II, 95.9% Accuracy

  26. Reduced Support Vector Machines (RSVM): Large Nonlinear Kernel Classification Problems • Key idea: use a rectangular kernel K(A, Ā'), where Ā is a small random sample of the rows of A • Typically Ā has 1% to 10% of the rows of A • Two important consequences: the nonlinear separator depends only on Ā, and the separating surface is of the form K(x', Ā')ū = γ • Conventional SVM using only the small sample Ā gives lousy results (next two slides) • RSVM can solve very large problems (a rectangular-kernel sketch follows this slide)
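A minimal MATLAB sketch of the rectangular kernel idea (sample size, kernel, and data are illustrative): take a small random sample Ā of the rows of A and form the m-by-m̄ kernel K(A, Ā'), which is all the reduced nonlinear classifier needs.

    m = 5000; n = 2; mu = 2;
    A = randn(m, n);                              % full data matrix

    mbar = round(0.05*m);                         % keep roughly 5% of the rows (illustrative)
    idx  = randperm(m, mbar);
    Abar = A(idx, :);                             % small random sample Abar of A

    % Rectangular Gaussian kernel K(A, Abar'): m-by-mbar instead of m-by-m.
    sqA = sum(A.^2, 2); sqB = sum(Abar.^2, 2);
    K = exp(-mu*(sqA + sqB' - 2*(A*Abar')));

    fprintf('full kernel: %d-by-%d; reduced kernel: %d-by-%d\n', m, m, size(K,1), size(K,2));
    % RSVM then trains with this K in place of the full kernel, so the resulting
    % separating surface depends only on the mbar sampled rows.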

  27. Conventional SVM Result on Checkerboard Using 50 Random Points Out of 1000

  28. RSVM Result on Checkerboard Using SAME 50 Random Points Out of 1000

  29. RSVM on Large Classification Problems: Standard Error over 50 Runs = 0.001 to 0.002; RSVM Time = 1.24 * (Random Points Time)

  30. Conclusion • Mathematical Programming plays an essential role in SVMs • Theory • New formulations • Generalized SVMs • New algorithm-generating concepts • Smoothing (SSVM) • Implicit Lagrangian (LSVM) • Algorithms • Fast: SSVM • Massive: LSVM, RSVM

  31. Future Research • Theory • Concave minimization • Concurrent feature & data selection • Multiple-instance problems • SVMs as complementarity problems • Kernel methods in nonlinear programming • Algorithms • Multicategory classification algorithms • Chunking for massive classification

  32. Talk & Papers Available on Web www.cs.wisc.edu/~olvi
