Support Vector Machine (SVM)

  1. Support Vector Machine (SVM) • MUMT611 • Beinan Li • Music Tech @ McGill • 2005-3-17

  2. Content • Related problems in pattern classification • VC theory and VC dimension • Overview of SVM • Application example

  3. Related problems in pattern classification • Small sample-size effect (peaking effect) • A sample size that is too small (or too large relative to the number of features) leads to large classification error. • In a typical Bayesian classifier, the probability densities of the whole population are estimated inaccurately from finite sample sets. • Training data vs. test data • Empirical risk vs. structural risk • Misclassifying yet-to-be-seen data. Picture taken from (Ridder 1997)

  4. Related problems in pattern classification • Avoid solving a more general problem as an intermediate step (Vapnik 1995). • Classify without estimating probability densities. • ANN • Depends on prior knowledge. • Empirical risk minimization (ERM): • Problem of generalization (over-fitting is hard to control). • A theoretical analysis of the validity of ERM is needed.

  5. VC theory and VC dimension • VC dimension (classifier complexity): • The maximum size of a sample set that the family of decision functions can shatter, i.e. separate correctly under every possible labeling. • Finite VC dimension -> consistency of ERM. • Theoretical basis of ANN and SVM. • Linear decision function: • VC dim = number of free parameters. • Non-linear decision function: • VC dim need not equal the number of parameters; it can be smaller or even infinite. • (A small shattering check appears below.)
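
A minimal sketch (not from the slides) of checking shattering empirically, assuming numpy and scikit-learn are available; a hard margin is approximated with a very large C. It confirms that lines in 2-D shatter 3 non-collinear points (VC dim 3) but not the 4 XOR points.

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC

def shatterable(points):
    """Return True if a linear classifier can realize every labeling of `points`."""
    for labels in product([-1, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue  # single-class labelings are trivially realizable
        clf = SVC(kernel="linear", C=1e6).fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

three_points = np.array([[0, 0], [1, 0], [0, 1]])          # non-collinear
xor_points   = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])  # XOR configuration

print(shatterable(three_points))  # True:  VC dim of lines in 2-D is 3
print(shatterable(xor_points))    # False: the XOR labeling is not linearly separable
```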

  6. Overview of SVM • Structural risk minimization (SRM): • Minimize the empirical risk (ER). • Control the VC dimension. • Result: a tradeoff between empirical risk and over-fitting. • Focus on the explicit problem of classification: • Find the optimal hyperplane separating two classes. • Supervised learning.

  7. Margin and Support Vectors (SV) • In the case of 2-category, linearly separable data. • Small vs. large margin. Picture taken from (Ferguson 2004)

  8. Margin and Support Vectors (SV) • In the case of 2-category, linearly separable data: • Find the hyperplane with the largest margin to the sample vectors of both classes. • D(x) = w^T x + b, or in augmented notation D(x') = a^T x' with x' = (x, 1) and a = (w, b). • Multiple solutions exist in weight space. • Find the weight vector that yields the largest margin (see the sketch below). • The margin is determined by the SVs. Picture taken from (Ferguson 2004)
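
A hedged sketch (not from the slides) of fitting a maximum-margin linear SVM on toy 2-D data and reading off the support vectors and the margin width 2/||w||; the data points are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin

w = clf.coef_[0]       # weight vector w of D(x) = w.x + b
b = clf.intercept_[0]  # bias b
print("support vectors:\n", clf.support_vectors_)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```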

  9. Mathematical detail • y_i D(x_i) >= 1, with y_i in {+1, -1}. • y_i D(x_i') / ||a|| >= margin, where D(x_i') = a^T x_i'. • Maximum margin -> minimum ||a||. • Quadratic programming: • Find the minimum ||a|| under linear constraints. • The weights are expressed through Lagrange multipliers. • Via the Kuhn-Tucker conditions, the problem reduces to a dual expressed only through dot products of the samples (written out below). • The parameters of the decision function and its complexity are completely determined by the SVs.
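
The dual problem mentioned above can be written explicitly; the following is a hedged reconstruction of the standard hard-margin primal and its Wolfe dual, not copied from the slides.

```latex
\[
\begin{aligned}
\text{Primal:}\quad & \min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
    \quad \text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1 \\
\text{Dual:}\quad & \max_{\alpha}\ \sum_i \alpha_i
    - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, x_i^{\top}x_j
    \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0 \\
\text{Decision:}\quad & D(x) = \sum_{i \in \mathrm{SV}} \alpha_i y_i\, x_i^{\top}x + b
\end{aligned}
\]
```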

  10. Linearly non-separable case • Example: the XOR problem. • Sample set size: 4. • VC dim of a linear decision function in 2-D = 3, so the four XOR samples cannot all be separated. Pictures taken from (Ferguson 2004)

  11. Linearly non-separable case • Map the data to a higher-dimensional space. • The data become linearly separable in that higher-D space. • Make a linear decision in the higher-D space. • Example: XOR • Degree-2 polynomial map into a 6-D space (spanned by 1, x1, x2, x1^2, x2^2, x1x2). • The decision function reduces to D(x) = x1x2 (see the sketch below). Picture taken from (Ferguson 2004)
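
A hedged sketch (not from the slides): the XOR points become linearly separable once the product feature x1*x2 is appended, mirroring D(x) = x1x2 above; a degree-2 polynomial kernel performs the same lift implicitly. Assumes numpy and scikit-learn.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([1, 1, -1, -1])        # XOR labeling: equal bits -> +1

linear = SVC(kernel="linear", C=1e6).fit(X, y)
print(linear.score(X, y))           # < 1.0: not separable in the original 2-D space

# Explicit lift: append the monomial x1*x2 (one coordinate of the 6-D map)
X_lift = np.hstack([X, X[:, [0]] * X[:, [1]]])
lifted = SVC(kernel="linear", C=1e6).fit(X_lift, y)
print(lifted.score(X_lift, y))      # 1.0: separable after the lift

# Equivalent shortcut: a degree-2 polynomial kernel does the lift implicitly
poly = SVC(kernel="poly", degree=2, coef0=1, C=1e6).fit(X, y)
print(poly.score(X, y))             # 1.0
```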

  12. Linearly non-separable case • The hyperplane in both the original and the higher-D space (projected back onto the 2-D plane). • All 4 samples are SVs. Picture taken from (Ferguson 2004; Luo 2002)

  13. Linearly non-separable case • Modify the quadratic programming: • "Soft margin" • Slack variables: y_i D(x_i) >= 1 - e_i. • Penalty function on the slack. • Upper bound C on the Lagrange multipliers. • Kernel function: • Computes the dot product in the higher-D space in terms of the original variables. • Yields a symmetric, positive semi-definite (Gram) matrix. • Must satisfy Mercer's theorem. • Standard candidates: polynomial, Gaussian radial basis function. • The choice of kernel depends on prior knowledge (see the sketch below).
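
A hedged sketch (not from the slides) of a soft-margin SVM with an RBF kernel: C bounds the Lagrange multipliers (how heavily slack is penalized) and gamma sets the kernel width. The data and parameter values are illustrative only; assumes numpy and scikit-learn.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # non-linear boundary

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X_tr, y_tr)
print("number of SVs per class:", clf.n_support_)
print("test accuracy:", clf.score(X_te, y_te))
```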

  14. Implementation with large sample set • Large computation: one Lagrange multiplier per sample. • Reductionist approach: • Divide the sample set into batches (subsets). • Accumulate the SV set from batch-by-batch operations. • Assumption: samples that are not SVs locally are not global SVs either. • Several algorithms that vary in the size of the subsets: • Vapnik: chunking algorithm. • Osuna: Osuna's decomposition algorithm. • Platt: SMO algorithm. • Only 2 samples per optimization step. • Most popular. • (A toy chunking sketch follows.)
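
A hedged toy sketch (not from the slides) of the chunking idea: train on one batch at a time, keep only the support vectors, and carry them into the next batch. Real chunking and SMO operate on the dual directly; this merely illustrates the "local non-SVs are discarded" assumption using scikit-learn on invented data.

```python
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, batch_size=100, **svm_kwargs):
    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty((0,), dtype=y.dtype)
    clf = None
    for start in range(0, len(X), batch_size):
        batch_X = np.vstack([sv_X, X[start:start + batch_size]])
        batch_y = np.concatenate([sv_y, y[start:start + batch_size]])
        clf = SVC(**svm_kwargs).fit(batch_X, batch_y)
        sv_X = batch_X[clf.support_]   # keep only the current SVs
        sv_y = batch_y[clf.support_]
    return clf

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
model = chunked_svm(X, y, batch_size=200, kernel="linear", C=1.0)
print("final number of SVs:", len(model.support_))
```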

  15. From 2-category to multi-category SVM • No uniform way to extend. • Common ways (see the sketch below): • One-against-all. • One-against-one (e.g. organized as a binary tree).
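
A hedged sketch (not from the slides) of the two common multi-class extensions, realized with scikit-learn wrappers around a binary SVC; the iris dataset stands in as toy data.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)   # C*(C-1)/2 classifiers
ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)  # C classifiers

print("one-vs-one sub-classifiers:", len(ovo.estimators_))   # 3 classes -> 3
print("one-vs-rest sub-classifiers:", len(ovr.estimators_))  # 3
```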

  16. Advantages of SVM • Strong mathematical basis. • The decision function and its complexity are completely determined by the SVs. • Training time does not depend on the dimensionality of the (possibly implicit) feature space, only on the fixed input space. • Good generalization. • Relatively insensitive to the "curse of dimensionality". • Versatile choice of kernel function. • Feature-less classification: the kernel acts as a data-similarity measure.

  17. Drawbacks of SVM • Still relies on prior knowledge: • Choices of C, kernel, and penalty function. • C: how strongly the decision function is adapted to avoid any training error. • Kernel: how much freedom the SVM has to adapt itself (dimension of the implicit feature space). • Overlapping classes. • The reductionist approach may discard promising SVs at any batch step. • Classification can be limited by the size of the problem. • No uniform way to extend from 2-category to multi-category. • "Still not an ideal optimally-generalizing classifier."

  18. Applications • Vapnik et al. at AT&T: • Handwritten digit recognition. • Error rate lower than that of ANNs. • Speech recognition. • Face recognition. • MIR. • SVM-light: an open-source C library.

  19. Application example of SVM in MIR • (Li & Guo 2000), Microsoft Research China: • Problem: • Classify 16 classes of sounds in a database of 409 sounds. • Features: • Concatenated perceptual and cepstral feature vectors. • Similarity measure: • Distance from the boundary (SV-based boundary). • Evaluation: • Average retrieval accuracy. • Average retrieval efficiency.

  20. Application example of SVM in MIR • Details of applying SVM: • Both linear and kernel-based approaches are tested. • Kernel: exponential radial basis function (ERBF). • C: 200. • The corpus is randomly partitioned into training/test sets. • One-against-one / binary tree in the multi-category task (see the setup sketch below). • Compared with other approaches: • NFL: Nearest Feature Line, an unsupervised approach. • Muscle Fish: normalized Euclidean metric with a nearest-neighbor classifier.
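
A hedged sketch (not the authors' code) of an experimental setup along these lines: random train/test split, one-vs-one multi-class SVM, C = 200, and an exponential-RBF kernel. The exact ERBF form exp(-||x - y|| / (2*sigma^2)) and the placeholder feature matrix are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def erbf_kernel(A, B, sigma=1.0):
    """Exponential RBF: uses the (non-squared) Euclidean distance (assumed form)."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / (2.0 * sigma ** 2))

# Placeholder stand-ins for the 409-sound, 16-class corpus described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(409, 20))        # concatenated feature vectors (synthetic)
y = rng.integers(0, 16, size=409)     # 16 sound classes (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# SVC trains one-vs-one binary classifiers internally for multi-class data.
clf = SVC(kernel=erbf_kernel, C=200).fit(X_tr, y_tr)
print("test accuracy on the placeholder data:", clf.score(X_te, y_te))
```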

  21. Application example of SVM in MIR • Comparison of average error rates. • Different feature sets across the different approaches. Picture taken from (Li & Guo 2000)

  22. Application example of SVM in MIR • Complexity comparison • SVM: • Training required: yes. • Classification complexity: C * (C-1) / 2 binary classifiers (binary tree), e.g. 16 * 15 / 2 = 120 for 16 classes. • Within-class complexity: number of SVs. • NFL: • Training required: no. • Classification complexity: linear in the number of classes. • Within-class complexity: Nc * (Nc-1) / 2 feature-line comparisons (Nc = samples per class).

  23. Future work • Speed up the quadratic programming. • Choice of kernel functions. • Explore problems that have so far been considered intractable. • Generalize the non-linear kernel approach to methods other than SVM: • Kernel PCA (principal component analysis).

  24. Bibliography • Summary: • http://www.music.mcgill.ca/~damonli/MUMT611/week9_summary.pdf • HTML bibliography: • http://www.music.mcgill.ca/~damonli/MUMT611/week9_bib.htm
