
### Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

John C. Platt

Presented by: Travis Desell

### Overview

- Introduction
- Motivation
- General SVMs
- General SVM training
- Related Work

- Sequential Minimal Optimization (SMO)
- Choosing the smallest optimization problem
- Solving the smallest optimization problem

- Benchmarks
- Conclusion
- Remarks & Future Work
- References

### Motivation

- Traditional SVM Training Algorithms
- Require quadratic programming (QP) package
- SVM training is slow, especially for large problems

- Sequential Minimal Optimization (SMO)
- Requires no QP package
- Easy to implement
- Often faster
- Good scalability properties

### General SVMs

$$u = \sum_i \alpha_i y_i K(x_i, x) - b \qquad (1)$$

- $u$ : SVM output
- $\alpha_i$ : weight of training example $i$ in the expansion
- $y_i \in \{-1, +1\}$ : desired output
- $b$ : threshold
- $x_i$ : stored training example (vector)
- $x$ : input (vector)
- $K$ : kernel function measuring the similarity of $x_i$ to $x$
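To make eq. (1) concrete, here is a minimal Python sketch of the decision function. NumPy and the Gaussian kernel shown are illustrative choices; the function names and the width parameter `gamma` are not from the slides:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is a hypothetical width parameter."""
    d = a - b
    return np.exp(-gamma * np.dot(d, d))

def svm_output(x, X_train, alpha, y, b, kernel=rbf_kernel):
    """Eq. (1): u = sum_i alpha_i * y_i * K(x_i, x) - b.
    The predicted class is sign(u)."""
    return sum(a_i * y_i * kernel(x_i, x)
               for a_i, y_i, x_i in zip(alpha, y, X_train)) - b
```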

### General SVMs (2)

- For linear SVMs, $K$ is the dot product, so (1) can be expressed as the dot product of $w$ and $x$ minus the threshold:

$$u = w \cdot x - b \qquad (2)$$

$$w = \sum_i \alpha_i y_i x_i \qquad (3)$$

- where $w$, $x$, and $x_i$ are vectors
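A small sketch of the folding in (2)-(3): for a linear kernel the whole expansion collapses into one weight vector, so evaluation costs a single dot product (variable names are illustrative):

```python
import numpy as np

def fold_weights(X_train, alpha, y):
    """Eq. (3): w = sum_i alpha_i * y_i * x_i, as one matrix product.
    X_train has shape (n, d); alpha and y have shape (n,)."""
    return (np.asarray(alpha) * np.asarray(y)) @ np.asarray(X_train)

# Eq. (2): evaluating the linear SVM then costs O(d) instead of O(n * d):
# u = fold_weights(X_train, alpha, y) @ x - b
```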

### General SVM Training

- Training an SVM means finding the $\alpha_i$, expressed as minimizing a dual quadratic form:

$$\min_\alpha \Psi(\alpha) = \min_\alpha \frac{1}{2} \sum_i \sum_j y_i y_j K(x_i, x_j)\, \alpha_i \alpha_j - \sum_i \alpha_i \qquad (4)$$

- subject to the box constraints:

$$0 \le \alpha_i \le C, \quad \forall i \qquad (5)$$

- and the linear equality constraint:

$$\sum_i y_i \alpha_i = 0 \qquad (6)$$

- The $\alpha_i$ are Lagrange multipliers of a primal QP problem: there is a one-to-one correspondence between each $\alpha_i$ and each training example $x_i$
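As a sketch of what (4)-(6) ask for, the following evaluates the dual objective and checks feasibility of a candidate $\alpha$ given a precomputed kernel matrix. It is a checking utility under those assumptions, not a solver:

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Eq. (4): Psi(alpha) = 0.5 * sum_ij y_i y_j K_ij a_i a_j - sum_i a_i.
    K is the precomputed n x n kernel matrix K[i, j] = K(x_i, x_j)."""
    ya = y * alpha                      # elementwise y_i * alpha_i
    return 0.5 * ya @ K @ ya - alpha.sum()

def is_feasible(alpha, y, C, tol=1e-8):
    """Box constraints (5) and linear equality constraint (6)."""
    in_box = np.all(alpha >= -tol) and np.all(alpha <= C + tol)
    return in_box and abs(y @ alpha) <= tol
```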

### General SVM Training (2)

- SMO solves the QP expressed in (4)-(6)
- Terminates when all of the Karush-Kuhn-Tucker (KKT) optimality conditions are fulfilled:

$$\alpha_i = 0 \;\Rightarrow\; y_i u_i \ge 1 \qquad (7)$$

$$0 < \alpha_i < C \;\Rightarrow\; y_i u_i = 1 \qquad (8)$$

$$\alpha_i = C \;\Rightarrow\; y_i u_i \le 1 \qquad (9)$$

- where $u_i$ is the SVM output for the $i$th training example
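A hedged sketch of checking (7)-(9) to within a tolerance `eps`; the exact tolerance handling here is one reasonable choice, not necessarily Platt's:

```python
def violates_kkt(alpha_i, y_i, u_i, C, eps=1e-3):
    """True if example i violates conditions (7)-(9) by more than eps."""
    r = y_i * u_i - 1.0              # margin residual
    if alpha_i < eps:                # (7): alpha_i = 0  ->  y_i u_i >= 1
        return r < -eps
    if alpha_i > C - eps:            # (9): alpha_i = C  ->  y_i u_i <= 1
        return r > eps
    return abs(r) > eps              # (8): 0 < alpha_i < C  ->  y_i u_i = 1
```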

### Related Work

- “Chunking” [9]
- Removing training examples with $\alpha_i = 0$ does not change the solution.
- Breaks the large QP problem down into smaller sub-problems to identify the non-zero $\alpha_i$.
- Each QP sub-problem consists of every non-zero $\alpha_i$ from the previous sub-problem combined with the $M$ worst examples violating (7)-(9), for some $M$ [1].
- The last step solves the entire QP problem, since all non-zero $\alpha_i$ have been found.
- Cannot handle large-scale training problems if standard QP techniques are used; Kaufman [3] describes a QP algorithm that overcomes this.

### Related Work (2)

- Decomposition [6]:
- Breaks the large QP problem into smaller QP sub-problems.
- Osuna et al. [6] suggest using a fixed-size matrix for every sub-problem, which allows very large training sets.
- Joachims [2] suggests adding and subtracting examples according to heuristics for rapid convergence.
- Unlike SMO, these methods require a numerical QP library, which can be costly or slow.

### Sequential Minimal Optimization

- SMO decomposes the overall QP problem (4)-(6) into fixed-size QP sub-problems.
- Chooses the smallest possible optimization problem (SOP) at each step.
- This optimization involves two elements of $\alpha$, because the linear equality constraint (6) prevents a single $\alpha_i$ from changing alone.

- SMO repeatedly chooses two elements of $\alpha$ to jointly optimize until the overall QP problem is solved.

### Choosing the SOP

- Heuristic-based approach
- Terminates when the entire training set obeys (7)-(9) within $\varepsilon$ (typically $\varepsilon \le 10^{-3}$)
- Repeatedly finds a pair $\alpha_1, \alpha_2$ and jointly optimizes it until termination; the sweep structure is sketched below
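The outer control flow alternates full sweeps with sweeps over the non-bound subset. This is a loose paraphrase of Platt's published pseudocode, not a literal transcription; `examine_example` stands in for the pair selection and joint optimization described on the next slides:

```python
def smo_outer_loop(n, alpha, C, examine_example):
    """Alternate sweeps over all examples and over the non-bound subset
    until a full sweep changes nothing (schematic sketch)."""
    examine_all, num_changed = True, 0
    while num_changed > 0 or examine_all:
        num_changed = 0
        for i in range(n):
            if examine_all or 0.0 < alpha[i] < C:     # non-bound subset
                num_changed += examine_example(i)      # 1 if a pair moved
        if examine_all:
            examine_all = False
        elif num_changed == 0:
            examine_all = True
```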

### Finding $\alpha_1$

- “First choice heuristic”
- Searches through the examples most likely to violate the conditions (the non-bound subset, $0 < \alpha_i < C$)
- $\alpha_i$ at the bounds are likely to stay there, while non-bound $\alpha_i$ will move as others are optimized

- “Shrinking heuristic”
- Finds examples which fulfill (7)-(9) by more than the worst example violates them
- Ignores these examples until a final pass at the end ensures that all examples fulfill (7)-(9)

### Finding $\alpha_2$

- Chosen to maximize the size of the step taken during the joint optimization of $\alpha_1$ and $\alpha_2$, approximated by $|E_1 - E_2|$
- A cached error value $E_i$ is kept for each non-bound example
- If $E_1$ is positive, chooses the $\alpha_2$ with minimum $E_2$
- If $E_1$ is negative, chooses the $\alpha_2$ with maximum $E_2$ (see the sketch below)
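A sketch of this second-choice heuristic, assuming a cached error array `E` indexed by example. Platt's implementation falls back to further scans when this choice makes no progress, which is omitted here:

```python
def choose_second(i1, E, non_bound):
    """Pick a2 to roughly maximize |E1 - E2| over the non-bound examples."""
    candidates = [i for i in non_bound if i != i1]
    if not candidates:
        return None
    if E[i1] > 0:                                  # positive E1 -> minimum E2
        return min(candidates, key=lambda i: E[i])
    return max(candidates, key=lambda i: E[i])     # negative E1 -> maximum E2
```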

### Solving the SOP

- Computes the unconstrained minimum along the direction of the linear equality constraint:

$$\alpha_2^{new} = \alpha_2 + \frac{y_2 (E_1 - E_2)}{K(x_1, x_1) + K(x_2, x_2) - 2K(x_1, x_2)} \qquad (10)$$

$$E_i = u_i - y_i \qquad (11)$$

- Clips $\alpha_2^{new}$ to $[L, H]$:

$$L = \max\!\left(0,\; \alpha_2 + s\alpha_1 - \tfrac{1}{2}(s+1)C\right) \qquad (12)$$

$$H = \min\!\left(C,\; \alpha_2 + s\alpha_1 - \tfrac{1}{2}(s-1)C\right) \qquad (13)$$

$$s = y_1 y_2 \qquad (14)$$

- Calculates $\alpha_1^{new}$, as sketched after the equations:

$$\alpha_1^{new} = \alpha_1 + s\left(\alpha_2 - \alpha_2^{new,clipped}\right) \qquad (15)$$
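Putting (10)-(15) together, the analytic two-variable update can be sketched as below. It assumes the curvature $\eta = K_{11} + K_{22} - 2K_{12}$ is positive; Platt's paper handles $\eta \le 0$ as a separate degenerate case, omitted here:

```python
def take_step(a1, a2, y1, y2, E1, E2, k11, k12, k22, C):
    """Analytic solution of the two-variable sub-problem, eqs. (10)-(15)."""
    s = y1 * y2                                      # (14)
    L = max(0.0, a2 + s * a1 - 0.5 * (s + 1) * C)    # (12)
    H = min(C,   a2 + s * a1 - 0.5 * (s - 1) * C)    # (13)
    eta = k11 + k22 - 2.0 * k12                      # curvature along the constraint
    a2_new = a2 + y2 * (E1 - E2) / eta               # (10), unconstrained minimum
    a2_new = min(max(a2_new, L), H)                  # clip to [L, H]
    a1_new = a1 + s * (a2 - a2_new)                  # (15)
    return a1_new, a2_new
```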

### Benchmarks

- UCI Adult: the SVM is given 14 attributes of a census record and asked to predict whether household income is greater than $50k. The 8 categorical and 6 continuous attributes are encoded as 123 binary attributes.
- Web: classify whether a web page belongs to a category or not. 300 sparse binary keyword attributes.
- MNIST: one classifier is trained. 784-dimensional, non-binary vectors stored as sparse vectors.

### Description of Benchmarks

- Web and Adult are trained with linear and Gaussian SVMs.
- Experiments are performed with and without sparse inputs, and with and without kernel caching.
- Projected conjugate gradient (PCG) chunking always uses caching.

### Benchmarking SMO

### Conclusions

- PCG chunking is slower than SMO, partly because SMO ignores examples whose Lagrange multipliers are at $C$.
- The overhead of PCG chunking is not in the kernel: kernel optimizations do not greatly affect its training time.

### Conclusions (2)

- SVMlight solves 10-dimensional QP sub-problems.
- Differences are mostly due to kernel optimizations and numerical QP overhead.
- SMO is faster on linear problems due to linear SVM folding, though SVMlight could potentially use this as well.
- SVMlight benefits from its complex kernel cache at large problem sizes, while SMO has no such cache and thus cannot benefit from one.

### Remarks & Future Work

- Heuristic-based approach to finding the $\alpha_1$ and $\alpha_2$ to optimize:
- Is it possible to determine an optimal choice strategy that minimizes the number of steps?

- Is there a proof that SMO always minimizes the QP problem?

### References

- [1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
- [2] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184. MIT Press, 1998.
- [3] L. Kaufman. Solving the quadratic programming problem arising in support vector classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 147–168. MIT Press, 1998.
- [6] E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proc. IEEE Neural Networks in Signal Processing ’97, 1997.
- [9] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.
