
Ch 7. Sparse Kernel Machines Pattern Recognition and Machine Learning, C. M. Bishop, 2006.



  1. Ch 7. Sparse Kernel Machines, Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by S. Kim, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

  2. Contents • Maximum Margin Classifiers • Overlapping Class Distributions • Relation to Logistic Regression • Multiclass SVMs • SVMs for Regression • Relevance Vector Machines • RVM for Regression • Analysis of Sparsity • RVMs for Classification

  3. Maximum Margin Classifiers • Problem setting: two-class classification using linear models of the form $y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}) + b$, assuming the training data set is linearly separable • Support vector machine approach: the decision boundary is chosen to be the one for which the margin, the distance to the closest training points (the support vectors), is maximized

  4. Maximum Margin Solution • For all data points, $t_n y(\mathbf{x}_n) > 0$ • The distance of a point to the decision surface is $t_n y(\mathbf{x}_n)/\|\mathbf{w}\| = t_n(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_n) + b)/\|\mathbf{w}\|$ • The maximum margin solution is $\arg\max_{\mathbf{w},b}\left\{\frac{1}{\|\mathbf{w}\|}\min_n\left[t_n(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_n) + b)\right]\right\}$; rescaling so that $t_n(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_n) + b) \ge 1$ for all points, this becomes $\arg\min_{\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^2$

  5. Dual Representation • Introducing Lagrange multipliers $a_n \ge 0$, $L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_n a_n\{t_n(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_n) + b) - 1\}$ • At a minimum the derivatives of $L$ w.r.t. $\mathbf{w}$ and $b$ equal 0, giving $\mathbf{w} = \sum_n a_n t_n \phi(\mathbf{x}_n)$ and $\sum_n a_n t_n = 0$ • Dual representation: maximize $\tilde{L}(\mathbf{a}) = \sum_n a_n - \frac{1}{2}\sum_n\sum_m a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$ subject to $a_n \ge 0$ and $\sum_n a_n t_n = 0$ (see Appendix E for more details)
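As a concrete illustration of this dual, here is a minimal sketch (not from the slides or the book) that solves the hard-margin dual for a toy linearly separable data set with a general-purpose optimizer; the data, the linear kernel, and the use of scipy.optimize.minimize with SLSQP are assumptions made purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, targets t_n in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(t)

K = X @ X.T                              # linear kernel k(x_n, x_m) = x_n^T x_m
H = (t[:, None] * t[None, :]) * K

def neg_dual(a):                         # maximize L~(a) <=> minimize -L~(a)
    return -(a.sum() - 0.5 * a @ H @ a)

cons = ({'type': 'eq', 'fun': lambda a: a @ t},)   # sum_n a_n t_n = 0
bounds = [(0.0, None)] * N                          # a_n >= 0
res = minimize(neg_dual, np.zeros(N), bounds=bounds,
               constraints=cons, method='SLSQP')
a = res.x

w = (a * t) @ X                          # w = sum_n a_n t_n phi(x_n)
sv = a > 1e-6                            # support vectors have a_n > 0
b = np.mean(t[sv] - X[sv] @ w)           # from t_n y(x_n) = 1 on support vectors
print("support vector indices:", np.where(sv)[0], " w:", w, " b:", b)
```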

  6. Classifying New Data • New points are classified using $y(\mathbf{x}) = \sum_n a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$ • The optimization is subject to the Karush-Kuhn-Tucker (KKT) conditions (Appendix E): $a_n \ge 0$, $t_n y(\mathbf{x}_n) - 1 \ge 0$, and $a_n\{t_n y(\mathbf{x}_n) - 1\} = 0$, so for every point either $a_n = 0$ or $t_n y(\mathbf{x}_n) = 1$ (the support vectors) • The multipliers are found by solving a quadratic programming problem, which is $O(N^3)$ in general

  7. Example of Separable Data Classification (Figure 7.2)
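A usage sketch in the spirit of this separable-data example, assuming scikit-learn; the synthetic blobs, the Gaussian-kernel width, and the very large C (to approximate the hard-margin case) are illustrative choices, not values from Figure 7.2.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs, labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(20, 2)),
               rng.normal(loc=[+2.0, 0.0], scale=0.5, size=(20, 2))])
t = np.hstack([-np.ones(20), np.ones(20)])

# Gaussian (RBF) kernel; a very large C approximates the hard-margin case
clf = SVC(kernel='rbf', gamma=0.5, C=1e6)
clf.fit(X, t)

print("number of support vectors per class:", clf.n_support_)
print("first support vectors:\n", clf.support_vectors_[:3])
print("prediction for [0, 0]:", clf.predict([[0.0, 0.0]]))
```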

  8. Overlapping Class Distributions • Allow some examples to be misclassified → soft margin • Introduce slack variables $\xi_n \ge 0$, one per training point, with $\xi_n = 0$ for points on or inside the correct margin boundary and $\xi_n = |t_n - y(\mathbf{x}_n)|$ otherwise, relaxing the constraints to $t_n y(\mathbf{x}_n) \ge 1 - \xi_n$

  9. Soft Margin Solution • Minimize $C\sum_n \xi_n + \frac{1}{2}\|\mathbf{w}\|^2$, where $C > 0$ controls the trade-off between minimizing training errors and controlling model complexity • KKT conditions: $a_n \ge 0$, $t_n y(\mathbf{x}_n) - 1 + \xi_n \ge 0$, $a_n\{t_n y(\mathbf{x}_n) - 1 + \xi_n\} = 0$, $\mu_n \ge 0$, $\xi_n \ge 0$, $\mu_n\xi_n = 0$ • Again, either $a_n = 0$ or $t_n y(\mathbf{x}_n) = 1 - \xi_n$; points with $a_n > 0$ are the support vectors

  10. Dual Representation • Dual representation: maximize $\tilde{L}(\mathbf{a}) = \sum_n a_n - \frac{1}{2}\sum_n\sum_m a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$ subject to the box constraints $0 \le a_n \le C$ and $\sum_n a_n t_n = 0$ • Classifying new data uses the same form $y(\mathbf{x}) = \sum_n a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$ as hard-margin classifiers; $b$ is obtained by averaging over the support vectors with $0 < a_n < C$

  11. Alternative Formulation • ν-SVM (Schölkopf et al., 2000): replaces $C$ with a parameter $\nu \in (0, 1]$ that is - an upper bound on the fraction of margin errors - a lower bound on the fraction of support vectors

  12. Example of Nonseparable Data Classification (ν-SVM)
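A minimal scikit-learn sketch of ν-SVM classification on overlapping classes; the synthetic data and ν = 0.2 are assumed for illustration. The printed fraction of support vectors should come out at or above ν, matching the bound stated on the previous slide.

```python
import numpy as np
from sklearn.svm import NuSVC

# Overlapping Gaussian classes, labels in {-1, +1}
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(100, 2)),
               rng.normal(loc=[+1.0, 0.0], scale=1.0, size=(100, 2))])
t = np.hstack([-np.ones(100), np.ones(100)])

clf = NuSVC(nu=0.2, kernel='rbf', gamma=0.5)
clf.fit(X, t)

frac_sv = clf.support_.size / len(t)
print("fraction of support vectors:", frac_sv)   # should be >= nu
print("training accuracy:", clf.score(X, t))
```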

  13. Solutions of the QP Problem • Chunking (Vapnik, 1982) • Idea: the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix corresponding to Lagrange multipliers that have value zero • Protected conjugate gradients (Burges, 1998) • Decomposition methods (Osuna et al., 1996) • Sequential minimal optimization (Platt, 1999)

  14. Relation to Logistic Regression (Section 4.3.2) • For data points on the correct side of the margin, $\xi_n = 0$; for the remaining points, $\xi_n = 1 - t_n y(\mathbf{x}_n)$ • The objective can therefore be written as $\sum_n E_{SV}(y_n t_n) + \lambda\|\mathbf{w}\|^2$, where $E_{SV}(yt) = [1 - yt]_+$ is the hinge error function and $\lambda = (2C)^{-1}$

  15. Relation to Logistic Regression (Cont'd) • From maximum likelihood logistic regression with targets $t \in \{-1, 1\}$ and $p(t = 1 \mid y) = \sigma(y)$, each point contributes the error $E_{LR}(yt) = \ln(1 + \exp(-yt))$ • Adding a quadratic regularizer gives the error function $\sum_n E_{LR}(y_n t_n) + \lambda\|\mathbf{w}\|^2$, which has the same form as the SVM objective but with a smooth error in place of the hinge

  16. Comparison of Error Functions (Figure 7.5): the hinge error function, the (rescaled) error function for logistic regression, the misclassification error, and the squared error, plotted as functions of $yt$
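A small numpy sketch that evaluates the four error functions on a grid of $z = yt$ values, rescaling the logistic error by $1/\ln 2$ so it passes through $(0, 1)$ as in the figure; the grid itself is an arbitrary choice.

```python
import numpy as np

z = np.linspace(-2, 2, 9)                      # z = y * t

hinge = np.maximum(0.0, 1.0 - z)               # E_SV(z) = [1 - z]_+
logistic = np.log1p(np.exp(-z)) / np.log(2.0)  # rescaled so the error at z = 0 is 1
misclass = (z < 0).astype(float)               # 0/1 misclassification error
squared = (1.0 - z) ** 2                       # (y - t)^2 expressed in terms of z for t = +/-1

for name, err in [("hinge", hinge), ("logistic", logistic),
                  ("misclassification", misclass), ("squared", squared)]:
    print(f"{name:18s}", np.round(err, 3))
```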

  17. Multiclass SVMs • One-versus-the-rest: K separate SVMs • Can lead to inconsistent results (ambiguous regions, cf. Figure 4.2) • Imbalanced training sets • One remedy: positive class targets +1, negative class targets -1/(K-1) (Lee et al., 2001) • An objective function for training all K SVMs simultaneously (Weston and Watkins, 1999) • One-versus-one: K(K-1)/2 SVMs, combined by voting • Approaches based on error-correcting output codes (Allwein et al., 2000), a generalization of the voting scheme of the one-versus-one approach
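A scikit-learn sketch contrasting the two decomposition schemes on a three-class toy problem; the data, kernel parameters, and use of OneVsRestClassifier are illustrative assumptions rather than the constructions cited above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Three Gaussian classes (K = 3)
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(50, 2)) for c in centers])
t = np.repeat([0, 1, 2], 50)

# One-versus-the-rest: K separate SVMs, one per class
ovr = OneVsRestClassifier(SVC(kernel='rbf', gamma=0.5, C=1.0)).fit(X, t)

# One-versus-one: K(K-1)/2 pairwise SVMs combined by voting (SVC's internal scheme)
ovo = SVC(kernel='rbf', gamma=0.5, C=1.0, decision_function_shape='ovo').fit(X, t)

x_new = np.array([[1.5, 1.5]])
print("one-versus-the-rest prediction:", ovr.predict(x_new))
print("one-versus-one prediction:     ", ovo.predict(x_new))
print("pairwise decision values:", ovo.decision_function(x_new))
```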

  18. SVMs for Regression • Simple (regularized) linear regression minimizes $\frac{1}{2}\sum_n (y_n - t_n)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$ • The SVM replaces the quadratic error with the ε-insensitive error function, $E_\epsilon(y - t) = 0$ if $|y - t| < \epsilon$ and $|y - t| - \epsilon$ otherwise, which assigns zero error to points inside a tube of width ε around the prediction (Figure 7.6 compares the ε-insensitive and quadratic error functions)
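A tiny numpy sketch of the ε-insensitive error alongside the quadratic error; the residual values and ε = 0.5 are arbitrary.

```python
import numpy as np

def eps_insensitive(y, t, eps):
    """E_eps(y - t): zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(y - t) - eps)

targets = np.array([0.0, 0.3, 0.6, 1.2, -2.0])   # t values
predictions = np.zeros_like(targets)              # y = 0 for simplicity
print("eps-insensitive:", eps_insensitive(predictions, targets, eps=0.5))
print("quadratic / 2:  ", 0.5 * (predictions - targets) ** 2)
```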

  19. SVMs for Regression (Cont'd) • Introduce two slack variables per point, $\xi_n \ge 0$ for points above the ε-tube and $\hat{\xi}_n \ge 0$ for points below it • Minimize $C\sum_n(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2$ subject to $t_n \le y(\mathbf{x}_n) + \epsilon + \xi_n$ and $t_n \ge y(\mathbf{x}_n) - \epsilon - \hat{\xi}_n$

  20. Dual Problem • Introducing multipliers $a_n, \hat{a}_n \ge 0$ and eliminating $\mathbf{w}$, $b$, and the slack variables gives the dual: maximize $\tilde{L}(\mathbf{a}, \hat{\mathbf{a}}) = -\frac{1}{2}\sum_n\sum_m (a_n - \hat{a}_n)(a_m - \hat{a}_m)k(\mathbf{x}_n, \mathbf{x}_m) - \epsilon\sum_n(a_n + \hat{a}_n) + \sum_n(a_n - \hat{a}_n)t_n$ subject to $0 \le a_n \le C$, $0 \le \hat{a}_n \le C$, and $\sum_n(a_n - \hat{a}_n) = 0$

  21. Predictions • New inputs are predicted with $y(\mathbf{x}) = \sum_n(a_n - \hat{a}_n)k(\mathbf{x}, \mathbf{x}_n) + b$, since the derivatives of the Lagrangian give $\mathbf{w} = \sum_n(a_n - \hat{a}_n)\phi(\mathbf{x}_n)$ • KKT conditions: $a_n(\epsilon + \xi_n + y_n - t_n) = 0$, $\hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) = 0$, $(C - a_n)\xi_n = 0$, $(C - \hat{a}_n)\hat{\xi}_n = 0$, so the support vectors are the points on or outside the ε-tube; $b$ is obtained from a point with $0 < a_n < C$

  22. Alternative Formulation • ν-SVM regression (Schölkopf et al., 2000): replaces ε with a parameter ν that bounds the fraction of points lying outside the ε-insensitive tube (and lower-bounds the fraction of support vectors)

  23. Example of ν-SVM Regression
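A scikit-learn sketch of ν-SVM regression on noisy sinusoidal data, similar in spirit to this example; the sample size, noise level, kernel width, and ν = 0.2 are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import NuSVR

# Noisy sinusoidal data
rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, size=(50, 1)), axis=0)
t = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.1, size=50)

reg = NuSVR(nu=0.2, kernel='rbf', gamma=10.0, C=10.0)
reg.fit(X, t)

frac_sv = reg.support_.size / len(t)
print("fraction of support vectors:", frac_sv)
print("prediction at x = 0.25:", reg.predict([[0.25]]))
```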

  24. Relevance Vector Machines • Limitations of the SVM • Outputs are decisions rather than posterior probabilities • The extension to K > 2 classes is problematic • There is a complexity parameter C that must be tuned (e.g. by cross-validation) • Kernel functions are centered on training data points and are required to be positive definite • RVM • A Bayesian sparse kernel technique • Much sparser models • Faster performance on test data

  25. RVM for Regression • The RVM is a linear model of the form studied in Chapter 3 with a modified prior: $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}), \beta^{-1})$ with $y(\mathbf{x}) = \sum_{n=1}^{N} w_n k(\mathbf{x}, \mathbf{x}_n) + b$ • The prior introduces a separate hyperparameter for each weight: $p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_i \mathcal{N}(w_i \mid 0, \alpha_i^{-1})$

  26. RVM for Regression (Cont'd) • From the result (3.49) for linear regression models, the posterior is $p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \boldsymbol{\Sigma})$ with $\mathbf{m} = \beta\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$ and $\boldsymbol{\Sigma} = (\mathbf{A} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}$, where $\mathbf{A} = \mathrm{diag}(\alpha_i)$ • α and β are determined using the evidence approximation (type-2 maximum likelihood, Section 3.5): maximize the marginal likelihood $p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha}, \beta) = \int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, \mathrm{d}\mathbf{w}$

  27. RVM for Regression (Cont'd) • Two approaches to maximizing the marginal likelihood: • Setting its derivatives to zero gives the re-estimation equations $\alpha_i^{\mathrm{new}} = \gamma_i / m_i^2$ with $\gamma_i = 1 - \alpha_i\Sigma_{ii}$, and $(\beta^{\mathrm{new}})^{-1} = \|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}\|^2 / (N - \sum_i \gamma_i)$ • The EM algorithm (Section 9.3.4) • Predictive distribution (Section 3.3.2): $p(t \mid \mathbf{x}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*) = \mathcal{N}(t \mid \mathbf{m}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \sigma^2(\mathbf{x}))$ with $\sigma^2(\mathbf{x}) = (\beta^*)^{-1} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\boldsymbol{\Sigma}\boldsymbol{\phi}(\mathbf{x})$
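A compact numpy sketch of the derivative-based re-estimation loop just described, using a Gaussian design matrix centred on the training inputs; the kernel width, the cap used in place of explicit pruning, the iteration count, and the initial β are assumptions for illustration, not values from the book.

```python
import numpy as np

# Noisy sinusoidal training data
rng = np.random.default_rng(4)
N = 50
x = np.sort(rng.uniform(0, 1, N))
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)

# Design matrix: one Gaussian basis function centred on each training input
def design(a, centres, ell=0.1):
    return np.exp(-(a[:, None] - centres[None, :]) ** 2 / (2 * ell ** 2))

Phi = design(x, x)                       # N x M with M = N
alpha = np.ones(N)                       # one precision per weight
beta = 100.0                             # initial noise precision

for _ in range(200):
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha = gamma / (m ** 2 + 1e-12)                 # alpha_i = gamma_i / m_i^2
    beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
    alpha = np.minimum(alpha, 1e10)                  # cap instead of deleting columns

relevant = alpha < 1e9                               # surviving "relevance vectors"
print("number of relevance vectors:", relevant.sum(), "out of", N)
print("estimated noise std:", 1.0 / np.sqrt(beta))
```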

  28. Example of RVM Regression (panels: ν-SVM regression vs. RVM regression) • The RVM gives a more compact model than the SVM • Parameters are determined automatically • The RVM requires more training time than the SVM

  29. Mechanism for Sparsity (illustration comparing the marginal likelihood with only isotropic noise, i.e. α = ∞, against a finite value of α: when the basis vector is poorly aligned with the training data, the marginal likelihood is increased by setting α = ∞, pruning that basis function)

  30. Sparse Solution • Pull out the contribution from $\alpha_i$ in the covariance of the marginal likelihood: $\mathbf{C} = \beta^{-1}\mathbf{I} + \sum_{j \ne i}\alpha_j^{-1}\boldsymbol{\varphi}_j\boldsymbol{\varphi}_j^{\mathrm{T}} + \alpha_i^{-1}\boldsymbol{\varphi}_i\boldsymbol{\varphi}_i^{\mathrm{T}} = \mathbf{C}_{-i} + \alpha_i^{-1}\boldsymbol{\varphi}_i\boldsymbol{\varphi}_i^{\mathrm{T}}$ • Using (C.7) and (C.15) in Appendix C, the determinant and inverse of $\mathbf{C}$ can be written in terms of $\mathbf{C}_{-i}$, which isolates the dependence of the log marginal likelihood on $\alpha_i$

  31. Sparse Solution (Cont'd) • The log marginal likelihood separates as $L(\boldsymbol{\alpha}) = L(\boldsymbol{\alpha}_{-i}) + \lambda(\alpha_i)$, where $\lambda(\alpha_i) = \frac{1}{2}\left[\ln\alpha_i - \ln(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i}\right]$ with sparsity $s_i = \boldsymbol{\varphi}_i^{\mathrm{T}}\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i$ and quality $q_i = \boldsymbol{\varphi}_i^{\mathrm{T}}\mathbf{C}_{-i}^{-1}\mathbf{t}$ • Stationary points of the marginal likelihood w.r.t. $\alpha_i$: if $q_i^2 > s_i$, $\alpha_i = s_i^2/(q_i^2 - s_i)$; otherwise $\alpha_i = \infty$ • Sparsity $s_i$ measures the extent to which $\boldsymbol{\varphi}_i$ overlaps with the other basis vectors • Quality $q_i$ represents a measure of the alignment of the basis vector $\boldsymbol{\varphi}_i$ with the error between $\mathbf{t}$ and the prediction $\mathbf{y}_{-i}$ obtained with $\boldsymbol{\varphi}_i$ excluded

  32. Sequential Sparse Bayesian Learning Algorithm • 1. If solving a regression problem, initialize β • 2. Initialize the model using a single basis function $\boldsymbol{\varphi}_1$, with $\alpha_1$ set from (7.101) and the remaining $\alpha_j = \infty$ • 3. Evaluate $\boldsymbol{\Sigma}$ and $\mathbf{m}$, together with $q_i$ and $s_i$, for all basis functions • 4. Select a candidate basis function $\boldsymbol{\varphi}_i$ • 5. If $q_i^2 > s_i$ and $\alpha_i < \infty$ ($\boldsymbol{\varphi}_i$ is already in the model), update $\alpha_i$ • 6. If $q_i^2 > s_i$ and $\alpha_i = \infty$, add $\boldsymbol{\varphi}_i$ to the model and evaluate $\alpha_i$ • 7. If $q_i^2 \le s_i$ and $\alpha_i < \infty$, remove $\boldsymbol{\varphi}_i$ from the model and set $\alpha_i = \infty$ • 8. If solving a regression problem, update β • 9. Go to 3 until converged

  33. RVM for Classification • Probabilistic linear classification model (Chapter 4): $y(\mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}))$ with an ARD prior $p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_i \mathcal{N}(w_i \mid 0, \alpha_i^{-1})$ - Initialize α - Build a Gaussian (Laplace) approximation to the posterior distribution over w - Obtain an approximation to the marginal likelihood - Maximize the marginal likelihood (re-estimate α) until converged

  34. RVM for Classification (Cont'd) • The mode of the posterior distribution is obtained by maximizing $\ln p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}) = \sum_n\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\} - \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{A}\mathbf{w} + \mathrm{const}$, using iterative reweighted least squares (IRLS) from Section 4.3.3 • The resulting Gaussian approximation to the posterior has mean $\mathbf{w}^\star = \mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{t} - \mathbf{y})$ and covariance $\boldsymbol{\Sigma} = (\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{B}\boldsymbol{\Phi} + \mathbf{A})^{-1}$, where $\mathbf{B} = \mathrm{diag}(y_n(1 - y_n))$
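A numpy sketch of this Laplace-approximation step: Newton/IRLS updates of w for fixed α, followed by the Gaussian approximation's mean and covariance; the toy data, the Gaussian design matrix, and the fixed α = 1 are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy two-class data with targets t_n in {0, 1}
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=[-1.5, 0.0], scale=0.8, size=(30, 2)),
               rng.normal(loc=[+1.5, 0.0], scale=0.8, size=(30, 2))])
t = np.hstack([np.zeros(30), np.ones(30)])

# Design matrix: one Gaussian basis function centred on each training point
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Phi = np.exp(-d2 / (2 * 0.5 ** 2))                   # N x M with M = N

alpha = np.ones(Phi.shape[1])                        # ARD precisions held fixed in this sketch
A = np.diag(alpha)

# IRLS / Newton updates to find the posterior mode w*
w = np.zeros(Phi.shape[1])
for _ in range(25):
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (t - y) - A @ w                   # gradient of ln p(w | t, alpha)
    B = np.diag(y * (1.0 - y))
    H = Phi.T @ B @ Phi + A                          # negative Hessian
    w = w + np.linalg.solve(H, grad)

# Gaussian (Laplace) approximation: mean w*, covariance (Phi^T B Phi + A)^-1
y = sigmoid(Phi @ w)
B = np.diag(y * (1.0 - y))
Sigma = np.linalg.inv(Phi.T @ B @ Phi + A)
print("posterior mode ||w*||:", np.linalg.norm(w))
print("covariance diagonal (first 3):", np.diag(Sigma)[:3])
```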

  35. RVM for Classification (Cont'd) • The marginal likelihood is approximated using the Laplace approximation (Section 4.4): $p(\mathbf{t} \mid \boldsymbol{\alpha}) \simeq p(\mathbf{t} \mid \mathbf{w}^\star)\, p(\mathbf{w}^\star \mid \boldsymbol{\alpha})\,(2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2}$ • Setting the derivative of the approximate marginal likelihood w.r.t. $\alpha_i$ equal to zero and rearranging gives $\alpha_i^{\mathrm{new}} = \gamma_i/(w_i^\star)^2$ with $\gamma_i = 1 - \alpha_i\Sigma_{ii}$ • If we define $\hat{\mathbf{t}} = \boldsymbol{\Phi}\mathbf{w}^\star + \mathbf{B}^{-1}(\mathbf{t} - \mathbf{y})$, the quantities $s_i$ and $q_i$ take the same form as in the regression case

  36. Example of RVM Classification
