# Mathematical Programming in Support Vector Machines
##### Presentation Transcript

1. Mathematical Programming in Support Vector Machines • Olvi L. Mangasarian, University of Wisconsin–Madison • High Performance Computation for Engineering Systems Seminar, MIT, October 4, 2000

2. What is a Support Vector Machine? • An optimally defined surface • Typically nonlinear in the input space • Linear in a higher dimensional space • Implicitly defined by a kernel function

3. What are Support Vector Machines Used For? • Classification • Regression & Data Fitting • Supervised & Unsupervised Learning (Will concentrate on classification)

4. Example of a Nonlinear Classifier: Checkerboard Classifier

5. Outline of Talk • Generalized support vector machines (SVMs) • Completely general kernel allows complex classification (no Mercer condition!) • Smooth support vector machines (SSVM) • Smooth the SVM and solve it by a fast Newton method • Lagrangian support vector machines (LSVM) • Very fast, simple iterative scheme • One matrix inversion: no LP, no QP • Reduced support vector machines (RSVM) • Handle large datasets with nonlinear kernels

6. Generalized Support Vector Machines: The 2-Category Linearly Separable Case [Figure: two linearly separable point sets, A+ and A−]

7. Generalized Support Vector Machines: Algebra of the 2-Category Linearly Separable Case • Given m points in n-dimensional space • Represented by an m-by-n matrix A • Membership of each point in class +1 or −1 specified by an m-by-m diagonal matrix D with +1 and −1 entries • Separate by two bounding planes, $x'w = \gamma + 1$ and $x'w = \gamma - 1$: $A_i w \ge \gamma + 1$ for $D_{ii} = +1$, and $A_i w \le \gamma - 1$ for $D_{ii} = -1$ • More succinctly: $D(Aw - e\gamma) \ge e$, where e is a vector of ones
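
This matrix notation is easy to exercise in code. A minimal MATLAB sketch (my own illustration, not from the talk; the data points and the plane parameters w, gamma are hypothetical) of the test $D(Aw - e\gamma) \ge e$:

```matlab
% Hypothetical check of the separation test D*(A*w - e*gamma) >= e.
m = 4; n = 2;
A = [2 2; 3 1; -2 -1; -3 -2];        % m-by-n matrix: one point per row
d = [1; 1; -1; -1];                  % class memberships, +1 or -1
D = diag(d);                         % m-by-m diagonal matrix of labels
e = ones(m, 1);                      % vector of ones
w = [1; 1]; gamma = 0;               % candidate bounding planes x'*w = gamma +/- 1
separated = all(D*(A*w - e*gamma) >= e)  % true iff both bounding planes hold
```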

8. Generalized Support Vector Machines: Maximizing the Margin between Bounding Planes [Figure: bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$ separating A+ and A−; the distance between them, $2/\|w\|$, is the margin]

9. Generalized Support Vector Machines: The Linear Support Vector Machine Formulation • Solve the following mathematical program for some $\nu > 0$: $\min_{w,\gamma,y} \; \nu e'y + \tfrac{1}{2}w'w$ subject to $D(Aw - e\gamma) + y \ge e$, $y \ge 0$ • The nonnegative slack variable y is zero iff: • Convex hulls of A+ and A− do not intersect • $\nu$ is sufficiently large
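
This program can be prototyped directly with quadprog from MATLAB's Optimization Toolbox. A minimal sketch for illustration only (not one of the fast methods of this talk); the stacking x = [w; gamma; y] and the function name are my own choices:

```matlab
% Sketch: solve  min nu*e'*y + 0.5*w'*w
%          s.t.  D*(A*w - e*gamma) + y >= e,  y >= 0
% by stacking the unknowns as x = [w; gamma; y] and calling quadprog.
function [w, gamma] = linear_svm_qp(A, d, nu)
  [m, n] = size(A);
  D = diag(d); e = ones(m, 1);
  H = blkdiag(eye(n), 0, zeros(m));   % quadratic term 0.5*w'*w only
  f = [zeros(n+1, 1); nu*e];          % linear term nu*e'*y
  Aineq = -[D*A, -D*e, eye(m)];       % quadprog wants Aineq*x <= bineq
  bineq = -e;
  lb = [-inf(n+1, 1); zeros(m, 1)];   % y >= 0; w and gamma free
  x = quadprog(H, f, Aineq, bineq, [], [], lb, []);
  w = x(1:n); gamma = x(n+1);
end
```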

10. Breast Cancer Diagnosis Application: 97% tenfold cross-validation correctness on 780 samples (494 benign, 286 malignant)

11. Another Application: Disputed Federalist Papers (Bosch & Smith, 1998): 56 papers by Hamilton, 50 by Madison, 12 disputed

12. Generalized Support Vector Machine Motivation (Nonlinear Kernel Without Mercer Condition) • Linear SVM: linear separating surface $x'w = \gamma$ • Set $w = A'Du$. Resulting linear surface: $x'A'Du = \gamma$ • Replace $AA'$ by an arbitrary nonlinear kernel $K(A, A')$ • Resulting nonlinear surface: $K(x', A')Du = \gamma$

13. SSVM: Smooth Support Vector Machine (SVM as Unconstrained Minimization Problem) • Changing the slack term to the 2-norm and measuring the margin in the $(w, \gamma)$ space makes the slack at a solution equal to $y = (e - D(Aw - e\gamma))_+$, turning the SVM into the equivalent unconstrained problem: $\min_{w,\gamma} \; \tfrac{\nu}{2}\|(e - D(Aw - e\gamma))_+\|^2 + \tfrac{1}{2}(w'w + \gamma^2)$, where the plus function $(\cdot)_+$ replaces negative components by zero

14. Smoothing the Plus Function: Integrate the Sigmoid Function

15. SSVM: The Smooth Support Vector Machine (Smoothing the Plus Function) • Integrating the sigmoid approximation $s(x, \alpha) = \frac{1}{1 + e^{-\alpha x}}$ to the step function gives a smooth, excellent approximation to the plus function: $p(x, \alpha) = x + \frac{1}{\alpha}\log(1 + e^{-\alpha x})$ • Replacing the plus function in the nonsmooth SVM by the smooth approximation gives our SSVM: $\min_{w,\gamma} \; \tfrac{\nu}{2}\|p(e - D(Aw - e\gamma), \alpha)\|^2 + \tfrac{1}{2}(w'w + \gamma^2)$
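
As a quick sanity check of the integration step (my own verification, consistent with the construction above): differentiating $p$ recovers the sigmoid, and the approximation error is uniformly bounded,

$$\frac{d}{dx}\left[x + \frac{1}{\alpha}\log\left(1 + e^{-\alpha x}\right)\right] = 1 - \frac{e^{-\alpha x}}{1 + e^{-\alpha x}} = \frac{1}{1 + e^{-\alpha x}} = s(x, \alpha),$$

$$0 \le p(x, \alpha) - (x)_+ \le \frac{\log 2}{\alpha} \quad \text{for all } x,$$

so $p(\cdot, \alpha)$ converges uniformly to the plus function as the smoothing parameter $\alpha \to \infty$.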

16. Newton-Armijo Algorithm for SSVM • Newton: minimize a sequence of quadratic approximations to the strongly convex objective function, i.e. solve a sequence of linear equations in n+1 variables (small-dimensional input space). • Armijo: shorten the distance between successive iterates so as to generate sufficient decrease in the objective function. (In computational practice, not needed!) • Global quadratic convergence: starting from any point, the iterates are guaranteed to converge to the unique solution at a quadratic rate, i.e. errors get squared. (Typically 6 to 8 iterations, without an Armijo step.)
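
A compact MATLAB sketch of such a Newton-Armijo iteration on the SSVM objective (my own illustration based on the formulas above; the iteration cap and tolerances are arbitrary, and the Hessian line assumes a recent MATLAB for implicit expansion):

```matlab
% Newton-Armijo sketch for the SSVM objective
%   f(z) = nu/2*||p(e - H*z, alpha)||^2 + 1/2*z'*z,  z = [w; gamma], H = D*[A -e].
function [w, gamma] = ssvm_newton(A, d, nu, alpha)
  [m, n] = size(A);
  H = diag(d) * [A, -ones(m, 1)];
  z = zeros(n+1, 1);
  p = @(x) x + log(1 + exp(-alpha*x))/alpha;       % smooth plus function
  s = @(x) 1 ./ (1 + exp(-alpha*x));               % sigmoid = p'
  f = @(z) nu/2*norm(p(ones(m,1) - H*z))^2 + (z'*z)/2;
  for it = 1:50
    r = ones(m, 1) - H*z;
    g = -nu*H'*(p(r).*s(r)) + z;                   % gradient
    if norm(g) < 1e-8, break; end
    dd = s(r).^2 + alpha*p(r).*s(r).*(1 - s(r));   % curvature of p(r)^2/2
    M = nu*(H'*(dd.*H)) + eye(n+1);                % (n+1)-by-(n+1) Hessian
    dz = -M\g;                                     % Newton direction
    t = 1;                                         % Armijo: sufficient decrease
    while f(z + t*dz) > f(z) + 1e-4*t*(g'*dz), t = t/2; end
    z = z + t*dz;
  end
  w = z(1:n); gamma = z(n+1);
end
```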

17. Examples of Kernels (Generate Nonlinear Separating Surfaces in Input Space) • Polynomial kernel: $K(x, y) = (x'y + 1)^d$ • Gaussian (radial basis) kernel: $K(x, y) = e^{-\mu\|x - y\|^2}$ • Neural network kernel: $K(x, y) = \tanh(\mu\, x'y + \theta)$
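
In matrix form, $K(A, B')$ is evaluated componentwise over all row pairs. A short MATLAB sketch of the three kernels in these standard textbook forms, with hypothetical parameter values:

```matlab
% Kernel matrices between the rows of A and the rows of B (standard forms).
deg = 3; mu = 0.5; theta = -1;                      % hypothetical parameters
sq = @(A,B) sum(A.^2,2) + sum(B.^2,2)' - 2*A*B';    % pairwise squared distances
K_poly  = @(A,B) (A*B' + 1).^deg;                   % polynomial (componentwise power)
K_gauss = @(A,B) exp(-mu*sq(A,B));                  % Gaussian (radial basis)
K_nn    = @(A,B) tanh(mu*(A*B') + theta);           % neural network (sigmoid)
```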

18. LSVM: Lagrangian Support Vector Machine (Dual of SVM) • Taking the dual of the SVM formulation $\min_{w,\gamma,y} \; \tfrac{\nu}{2}\|y\|^2 + \tfrac{1}{2}(w'w + \gamma^2)$ s.t. $D(Aw - e\gamma) + y \ge e$ gives the following simple dual problem: $\min_{0 \le u} \; \tfrac{1}{2}u'Qu - e'u$ • The variables $(w, \gamma)$ of SSVM are related to u by: $w = A'Du$, $\gamma = -e'Du$

19. LSVM: Lagrangian Support Vector Machine (Dual SVM as Symmetric Linear Complementarity Problem) • Defining the two matrices: $H = D[A \;\; -e]$, $Q = \tfrac{I}{\nu} + HH'$ • Reduces the dual SVM to: $\min_{0 \le u} \; \tfrac{1}{2}u'Qu - e'u$ • The optimality condition for this dual SVM is the LCP: $0 \le u \perp Qu - e \ge 0$ • which, by Implicit Lagrangian theory, is equivalent to: $Qu - e = ((Qu - e) - \alpha u)_+$ for any $\alpha > 0$
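
Rearranging that identity gives the fixed-point iteration that the LSVM algorithm on the following slides implements; this form, and the linear-convergence condition on $\alpha$ that motivates alpha = 1.9/nu in the code below, are from the LSVM paper:

$$u^{i+1} = Q^{-1}\left(e + \left(\left(Qu^i - e\right) - \alpha u^i\right)_+\right), \qquad 0 < \alpha < \frac{2}{\nu},$$

which converges linearly to the unique solution from any starting point $u^0$.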

20. LSVM Algorithm: Simple & Linearly Convergent, with One Small Matrix Inversion • Key idea: the Sherman-Morrison-Woodbury formula allows the inversion of an extremely large m-by-m matrix Q by merely inverting a much smaller (n+1)-by-(n+1) matrix, as follows: $\left(\tfrac{I}{\nu} + HH'\right)^{-1} = \nu\left(I - H\left(\tfrac{I}{\nu} + H'H\right)^{-1}H'\right)$
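
The identity is easy to sanity-check numerically; a throwaway MATLAB snippet with arbitrary small dimensions:

```matlab
% Numerical check of the SMW identity used by LSVM.
m = 200; n = 5; nu = 0.1;
H = randn(m, n+1);
lhs = inv(eye(m)/nu + H*H');                      % direct m-by-m inverse
rhs = nu*(eye(m) - H*((eye(n+1)/nu + H'*H)\H'));  % only an (n+1)-by-(n+1) solve
max(abs(lhs(:) - rhs(:)))                         % near machine precision
```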

21. LSVM Algorithm – Linear Kernel: 11 Lines of MATLAB Code

```matlab
function [it, opt, w, gamma] = svml(A,D,nu,itmax,tol)
% lsvm with SMW for min 1/2*u'*Q*u-e'*u s.t. u=>0,
% Q=I/nu+H*H', H=D[A -e]
% Input: A, D, nu, itmax, tol; Output: it, opt, w, gamma
% [it, opt, w, gamma] = svml(A,D,nu,itmax,tol);
[m,n]=size(A);alpha=1.9/nu;e=ones(m,1);H=D*[A -e];it=0;
S=H*inv((speye(n+1)/nu+H'*H));
u=nu*(1-S*(H'*e));oldu=u+1;
while it<itmax & norm(oldu-u)>tol
  z=(1+pl(((u/nu+H*(H'*u))-alpha*u)-1));
  oldu=u;
  u=nu*(z-S*(H'*z));
  it=it+1;
end;
opt=norm(u-oldu);w=A'*D*u;gamma=-e'*D*u;

function pl = pl(x); pl = (abs(x)+x)/2;
```
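
A hypothetical driver (my own, not from the talk) exercising svml on synthetic separable data:

```matlab
% Hypothetical usage of svml on two shifted Gaussian clouds.
m = 1000; n = 10;
A = [randn(m/2, n) + 1; randn(m/2, n) - 1];
d = [ones(m/2, 1); -ones(m/2, 1)];
D = spdiags(d, 0, m, m);                   % sparse diagonal label matrix
[it, opt, w, gamma] = svml(A, D, 1, 100, 1e-5);
acc = mean(sign(A*w - gamma) == d)         % training-set correctness
```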

22. LSVM Algorithm – Linear Kernel: Computational Results • 2 million random points in 10-dimensional space • Classified in 6.7 minutes in 6 iterations to e-5 accuracy • 250 MHz UltraSPARC II with 2 gigabytes of memory • CPLEX ran out of memory • 32562 points in 123-dimensional space (UCI Adult dataset) • Classified in 141 seconds & 55 iterations to 85% correctness • SVM classified in 178 seconds & 4497 iterations • 400 MHz Pentium II with 2 gigabytes of memory

23. LSVM – Nonlinear Kernel Formulation • For the nonlinear kernel $K(G, G')$ with $G = [A \;\; -e]$, the separating nonlinear surface is given by: $K([x' \;\; -1], G')Du = 0$ • where u is the solution of the dual problem: $\min_{0 \le u} \; \tfrac{1}{2}u'Qu - e'u$ • with Q redefined as: $Q = \tfrac{I}{\nu} + DK(G, G')D$
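
A minimal sketch of the resulting iteration (my own illustration; here Q is a full m-by-m matrix, so the SMW shortcut no longer applies, and kfun can be any kernel such as the Gaussian from slide 17):

```matlab
% Nonlinear LSVM sketch: u <- Q\(e + ((Q*u - e) - alpha*u)_+)
% with Q = I/nu + D*K(G,G')*D and G = [A -e].
function [u, it] = lsvm_nonlinear(A, d, nu, kfun, itmax, tol)
  m = size(A, 1); e = ones(m, 1); D = diag(d);
  G = [A, -e];
  Q = eye(m)/nu + D*kfun(G, G)*D;   % kfun(X, Y) returns the matrix K(X, Y')
  alpha = 1.9/nu;
  pl = @(x) (abs(x) + x)/2;         % plus function
  u = Q\e; oldu = u + 1; it = 0;
  while it < itmax && norm(u - oldu) > tol
    oldu = u;
    u = Q\(e + pl((Q*oldu - e) - alpha*oldu));
    it = it + 1;
  end
end
```

For repeated iterations one would factor Q once (e.g. by chol) rather than re-solving; the backslash form is kept here for clarity.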

24. LSVM Algorithm – Nonlinear Kernel Application: 100 iterations, 58 seconds on a Pentium II, 95.9% accuracy

25. Reduced Support Vector Machines (RSVM): Large Nonlinear Kernel Classification Problems • Key idea: use a rectangular kernel $K(A, \bar{A}')$, where $\bar{A}$ is a small random sample of A • Typically $\bar{A}$ has 1% to 10% of the rows of A • Two important consequences: • Nonlinear separator depends on $\bar{A}$ only • Separating surface: $K(x', \bar{A}')\bar{u} = \gamma$, determined by the reduced variables $\bar{u}$ • Using the small square kernel $K(\bar{A}, \bar{A}')$ by itself gives lousy results • RSVM can solve very large problems
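
Forming the reduced kernel is nearly a one-liner; a hypothetical MATLAB sketch using the Gaussian kernel (placeholder data, illustrative parameters):

```matlab
% Rectangular reduced kernel K(A, Abar') for RSVM (hypothetical data).
A = randn(5000, 10);                          % placeholder data matrix
mu = 0.5;                                     % illustrative kernel parameter
m = size(A, 1);
mbar = ceil(0.05*m);                          % keep 5% of the rows of A
idx = randperm(m); Abar = A(idx(1:mbar), :);  % small random sample of A
sq = sum(A.^2,2) + sum(Abar.^2,2)' - 2*A*Abar';   % pairwise squared distances
K = exp(-mu*sq);                              % m-by-mbar rectangular kernel
```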

26. Conventional SVM Result on Checkerboard Using 50 Random Points Out of 1000

27. RSVM on Large Classification Problems • Standard error over 50 runs = 0.001 to 0.002 • RSVM time = 1.24 × (random points time)

28. Conclusion • Mathematical programming plays an essential role in SVMs • Theory • New formulations • Generalized SVMs • New algorithm-generating concepts • Smoothing (SSVM) • Implicit Lagrangian (LSVM) • Algorithms • Fast: SSVM • Massive: LSVM, RSVM

29. Future Research • Theory • Concave minimization • Concurrent feature & data selection • Multiple-instance problems • SVMs as complementarity problems • Kernel methods in nonlinear programming • Algorithms • Chunking for massive classification • Multicategory classification algorithms

30. Talk & Papers Available on Web www.cs.wisc.edu/~olvi