## Mathematical Programming in Support Vector Machines


Olvi L. Mangasarian, University of Wisconsin - Madison
High Performance Computation for Engineering Systems Seminar, MIT, October 4, 2000

**What is a Support Vector Machine?**

- An optimally defined surface
- Typically nonlinear in the input space
- Linear in a higher-dimensional space
- Implicitly defined by a kernel function

**What are Support Vector Machines Used For?**

- Classification
- Regression & data fitting
- Supervised & unsupervised learning

(This talk concentrates on classification.)

**Outline of Talk**

- Generalized support vector machines (GSVMs)
  - A completely general kernel allows complex classification (no Mercer condition!)
- Smooth support vector machines (SSVMs)
  - Smooth the SVM and solve it by a fast Newton method
- Lagrangian support vector machines (LSVMs)
  - Very fast, simple iterative scheme
  - One matrix inversion: no LP, no QP
- Reduced support vector machines (RSVMs)
  - Handle large datasets with nonlinear kernels

**Generalized Support Vector Machines: 2-Category Linearly Separable Case**

Two point sets A+ and A- in n-dimensional space, separated by a pair of parallel bounding planes.

**Generalized Support Vector Machines: Algebra of the 2-Category Linearly Separable Case**

Given m points in n-dimensional space:

- The points are represented by an m-by-n matrix A.
- Membership of each point in class +1 or -1 is specified by an m-by-m diagonal matrix D with +1 and -1 entries.
- Separate the classes by two bounding planes x'w = γ + 1 and x'w = γ - 1 such that:
  - x'w ≥ γ + 1 for each row x' of A in class +1
  - x'w ≤ γ - 1 for each row x' of A in class -1
- More succinctly: D(Aw - eγ) ≥ e, where e is a vector of ones.

**Generalized Support Vector Machines: Maximizing the Margin between Bounding Planes**

The distance (margin) between the bounding planes x'w = γ + 1 and x'w = γ - 1 is 2/‖w‖, so maximizing the margin amounts to minimizing ‖w‖.

**Generalized Support Vector Machines: The Linear Support Vector Machine Formulation**

Solve the following mathematical program for some ν > 0:

min over (w, γ, y) of ν e'y + (1/2) w'w, subject to D(Aw - eγ) + y ≥ e, y ≥ 0.

The nonnegative slack variable y is zero iff:

- the convex hulls of A+ and A- do not intersect, and
- ν is sufficiently large.

**Breast Cancer Diagnosis Application**

- 97% tenfold cross-validation correctness
- 780 samples: 494 benign, 286 malignant

**Another Application: Disputed Federalist Papers (Bosch & Smith, 1998)**

- 56 Hamilton, 50 Madison, 12 disputed

**Generalized Support Vector Machine Motivation (Nonlinear Kernel Without Mercer Condition)**

- Linear SVM: linear separating surface x'w = γ.
- Set w = A'Du. Resulting linear surface: x'A'Du = γ.
- Replace AA' by an arbitrary nonlinear kernel K(A, A'). Resulting nonlinear surface: K(x', A')Du = γ.

**SSVM: Smooth Support Vector Machine (SVM as an Unconstrained Minimization Problem)**

Changing to the 2-norm and measuring the margin in (w, γ) space gives the equivalent unconstrained problem:

min over (w, γ) of (ν/2) ‖(e - D(Aw - eγ))_+‖² + (1/2)(w'w + γ²),

where the plus function (·)_+ replaces negative components by zero.

**SSVM: Smoothing the Plus Function**

Integrating the sigmoid approximation 1/(1 + exp(-αx)) of the step function gives a smooth, excellent approximation to the plus function:

p(x, α) = x + (1/α) log(1 + exp(-αx)).

Replacing the plus function in the nonsmooth SVM by this smooth approximation gives the SSVM:

min over (w, γ) of (ν/2) ‖p(e - D(Aw - eγ), α)‖² + (1/2)(w'w + γ²).

**Solving the SSVM: Newton-Armijo Method**

- Newton: minimize a sequence of quadratic approximations to the strongly convex objective function, i.e., solve a sequence of linear systems in n+1 variables. (Small-dimensional input space.)
- Armijo: shorten the distance between successive iterates so as to generate sufficient decrease in the objective function. (In computational practice, not needed!)
- Global quadratic convergence: starting from any point, the iterates are guaranteed to converge to the unique solution at a quadratic rate, i.e., the errors get squared.
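To make the smoothing step concrete, here is a small Python sketch (not from the talk; the function name `smoothed_plus` is mine) of the approximation p(x, α) = x + (1/α) log(1 + exp(-αx)) and its uniform convergence to the plus function (x)_+ as α grows:

```python
import math

def smoothed_plus(x, alpha):
    """Smooth approximation p(x, alpha) = x + (1/alpha)*log(1 + exp(-alpha*x))
    to the plus function (x)_+ = max(x, 0), obtained by integrating the
    sigmoid approximation 1/(1 + exp(-alpha*x)) of the step function."""
    ax = alpha * x
    if ax >= 0:
        return x + math.log1p(math.exp(-ax)) / alpha
    # For negative ax, rewrite log(1 + exp(-ax)) = -ax + log(1 + exp(ax))
    # to avoid overflow in exp().
    return x + (-ax + math.log1p(math.exp(ax))) / alpha

# As alpha grows, p(x, alpha) converges uniformly to max(x, 0):
for alpha in (1.0, 5.0, 25.0):
    worst = max(abs(smoothed_plus(x / 10.0, alpha) - max(x / 10.0, 0.0))
                for x in range(-50, 51))
    print(alpha, worst)  # worst-case gap is log(2)/alpha, attained at x = 0
```

Note that p(x, α) ≥ (x)_+ everywhere, with the largest gap log(2)/α at x = 0, which is why the approximation improves as α increases.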
(Typically 6 to 8 iterations, without an Armijo step.)

**SSVM with a Nonlinear Kernel: Nonlinear Separating Surface in Input Space**

Replacing the linear kernel AA' by a nonlinear kernel K(A, A') yields a nonlinear separating surface in the input space.

**Examples of Kernels That Generate Nonlinear Separating Surfaces in Input Space**

- Polynomial kernel
- Gaussian (radial basis) kernel: K(x, y) = exp(-μ‖x - y‖²)
- Neural network kernel

**LSVM: Lagrangian Support Vector Machine (Dual of SVM)**

Taking the dual of the SVM formulation

min over (w, γ, y) of (ν/2) ‖y‖² + (1/2)(w'w + γ²), subject to D(Aw - eγ) + y ≥ e,

gives the following simple dual problem:

min over u ≥ 0 of (1/2) u'Qu - e'u.

The variables (w, γ) of SSVM are related to the dual variables u by: w = A'Du, γ = -e'Du.

**LSVM: Dual SVM as a Symmetric Linear Complementarity Problem**

Defining the two matrices H = D[A  -e] and Q = I/ν + HH' reduces the dual SVM to:

min over u ≥ 0 of (1/2) u'Qu - e'u.

The optimality condition for this dual SVM is the linear complementarity problem (LCP)

0 ≤ u ⟂ Qu - e ≥ 0,

which, by implicit Lagrangian theory, is equivalent to:

Qu - e = ((Qu - e) - αu)_+ for any α > 0.

**LSVM Algorithm: Simple & Linearly Convergent, One Small Matrix Inversion**

Iterate

u⁽ⁱ⁺¹⁾ = Q⁻¹(e + ((Qu⁽ⁱ⁾ - e) - αu⁽ⁱ⁾)_+), with 0 < α < 2/ν.

Key idea: the Sherman-Morrison-Woodbury formula allows the inversion of an extremely large m-by-m matrix Q by merely inverting a much smaller (n+1)-by-(n+1) matrix, as follows:

(I/ν + HH')⁻¹ = ν(I - H(I/ν + H'H)⁻¹H').

**LSVM Algorithm (Linear Kernel): 11 Lines of MATLAB Code**

```matlab
function [it, opt, w, gamma] = svml(A, D, nu, itmax, tol)
% lsvm with SMW for min 1/2*u'*Q*u - e'*u  s.t. u >= 0,
% Q = I/nu + H*H', H = D*[A -e]
% Input: A, D, nu, itmax, tol;  Output: it, opt, w, gamma
% [it, opt, w, gamma] = svml(A, D, nu, itmax, tol);
[m, n] = size(A); alpha = 1.9/nu; e = ones(m, 1); H = D*[A -e]; it = 0;
S = H*inv(speye(n+1)/nu + H'*H);
u = nu*(1 - S*(H'*e)); oldu = u + 1;
while it < itmax & norm(oldu - u) > tol
    z = (1 + pl(((u/nu + H*(H'*u)) - alpha*u) - 1));
    oldu = u;
    u = nu*(z - S*(H'*z));
    it = it + 1;
end;
opt = norm(u - oldu); w = A'*D*u; gamma = -e'*D*u;

function pl = pl(x); pl = (abs(x) + x)/2;
```

**LSVM Algorithm (Linear Kernel): Computational Results**

- 2 million random points in 10-dimensional space
  - Classified in 6.7 minutes in 6 iterations to e-5 accuracy
  - 250 MHz UltraSPARC II with 2 gigabytes of memory
  - CPLEX ran out of memory
- 32562 points in 123-dimensional space (UCI Adult dataset)
  - Classified in 141 seconds & 55 iterations to 85% correctness
  - 400 MHz Pentium II with 2 gigabytes of memory
  - For comparison, SVM classified in 178 seconds & 4497 iterations

**LSVM (Nonlinear Kernel) Formulation**

For a nonlinear kernel K(A, A'), the separating nonlinear surface is given by

K(x', A')Du = γ,

where u is the solution of the dual problem

min over u ≥ 0 of (1/2) u'Qu - e'u,

with Q redefined as Q = I/ν + DK(A, A')D.

**LSVM Algorithm (Nonlinear Kernel) Application**

- 100 iterations, 58 seconds on a Pentium II, 95.9% accuracy

**Reduced Support Vector Machines (RSVM): Large Nonlinear Kernel Classification Problems**

Key idea: use a rectangular kernel K(A, Ā'), where Ā is a small random sample of A, typically containing 1% to 10% of the rows of A. Two important consequences:

- The nonlinear separator, and hence the separating surface, depends only on Ā.
- A conventional SVM using only the small sample Ā gives lousy results, whereas RSVM can solve very large problems well.

**Conventional SVM Result on Checkerboard Using 50 Random Points Out of 1000**

**RSVM Result on Checkerboard Using the SAME 50 Random Points Out of 1000**

**RSVM on Large Classification Problems**

- Standard error over 50 runs: 0.001 to 0.002
- RSVM time = 1.24 × (random points time)

**Conclusion**

- Mathematical programming plays an essential role in SVMs.
- Theory
  - New formulations: generalized SVMs
  - New algorithm-generating concepts: smoothing (SSVM), the implicit Lagrangian (LSVM)
- Algorithms
  - Fast: SSVM
  - Massive: LSVM, RSVM

**Future Research**

- Theory
  - Concave minimization
  - Concurrent feature & data selection
  - Multiple-instance problems
  - SVMs as complementarity problems
  - Kernel methods in nonlinear programming
- Algorithms
  - Multicategory classification algorithms
  - Chunking for massive classification

**Talk & Papers Available on the Web**

www.cs.wisc.edu/~olvi
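As a rough illustration of the LSVM iteration and the Sherman-Morrison-Woodbury trick, here is a NumPy transcription of the 11-line MATLAB code (a sketch under my own naming; `lsvm` and the toy data are not part of the talk's materials):

```python
import numpy as np

def lsvm(A, d, nu=1.0, itmax=100, tol=1e-5):
    """LSVM iteration u_{i+1} = Q^{-1}(e + ((Q u_i - e) - alpha*u_i)_+),
    with Q = I/nu + H H', H = D [A -e], using Sherman-Morrison-Woodbury
    so that only an (n+1)-by-(n+1) matrix is inverted."""
    m, n = A.shape
    e = np.ones(m)
    H = d[:, None] * np.hstack([A, -np.ones((m, 1))])  # H = D [A -e]
    alpha = 1.9 / nu                                   # requires 0 < alpha < 2/nu
    # SMW: Q^{-1} z = nu * (z - H (I/nu + H'H)^{-1} H' z)
    S = H @ np.linalg.inv(np.eye(n + 1) / nu + H.T @ H)
    u = nu * (e - S @ (H.T @ e))                       # first iterate u = Q^{-1} e
    oldu = u + 1.0
    it = 0
    while it < itmax and np.linalg.norm(u - oldu) > tol:
        Qu = u / nu + H @ (H.T @ u)                    # Q u without forming Q
        z = e + np.maximum((Qu - e) - alpha * u, 0.0)  # plus function
        oldu = u
        u = nu * (z - S @ (H.T @ z))                   # u = Q^{-1} z via SMW
        it += 1
    w = A.T @ (d * u)                                  # w = A' D u
    gamma = -e @ (d * u)                               # gamma = -e' D u
    return w, gamma, it

# Toy usage: two well-separated 2-D point clouds
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(2.0, 0.5, (50, 2)), rng.normal(-2.0, 0.5, (50, 2))])
d = np.hstack([np.ones(50), -np.ones(50)])
w, gamma, it = lsvm(A, d, nu=1.0)
pred = np.sign(A @ w - gamma)
print((pred == d).mean())
```

The choice `alpha = 1.9/nu` mirrors the MATLAB code and satisfies the 0 < α < 2/ν condition that guarantees the linear convergence of the iteration.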