
Classification: Support Vector Machine


Presentation Transcript


  1. Classification: Support Vector Machine 10/10/07

  2. What hyperplane (line) can separate the two classes of data?

  3. What hyperplane (line) can separate the two classes of data? But there are many other choices! Which one is the best?

  4. What hyperplane (line) can separate the two classes of data? But there are many other choices! Which one is the best? [Figure: M denotes the margin.]

  5. Optimal separating hyperplane The best hyperplane is the one that maximizes the margin, M. [Figure: the maximum-margin hyperplane with margin M on either side.]

  6. Computing the margin width The separating hyperplane is $x^T\beta + \beta_0 = 0$; the “plus” and “minus” planes are $x^T\beta + \beta_0 = 1$ and $x^T\beta + \beta_0 = -1$. Find $x^+$ and $x^-$ on the plus and minus planes so that $x^+ - x^-$ is perpendicular to $\beta$. Then the margin is $M = |x^+ - x^-|$. [Figure: the three parallel planes, the normal direction $\beta$, and the points $x^+$, $x^-$.]

  7. Computing the margin width Find $x^+$ and $x^-$ on the “plus” and “minus” planes so that $x^+ - x^-$ is perpendicular to $\beta$; then $M = |x^+ - x^-|$. Since $x^{+T}\beta + \beta_0 = 1$ and $x^{-T}\beta + \beta_0 = -1$, subtracting gives $(x^+ - x^-)^T\beta = 2$, and therefore $M = |x^+ - x^-| = 2/|\beta|$.
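
As a quick numerical check of this formula (illustrative numbers, not from the slides): for $\beta = (3, 4)^T$ we have $|\beta| = \sqrt{3^2 + 4^2} = 5$, so the margin is $M = 2/|\beta| = 0.4$.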

  8. Computing the margin width The hyperplane is separating if $y_i(x_i^T\beta + \beta_0) \ge 1$ for all $i$. The maximization problem is: maximize $M = 2/|\beta|$ over $\beta, \beta_0$ subject to $y_i(x_i^T\beta + \beta_0) \ge 1$, $i = 1, \dots, N$. The points that lie exactly on the margin are the support vectors. [Figure: the margin M and the support vectors.]

  9. Optimal separating hyperplane Rewrite the problem as $\min_{\beta, \beta_0} \tfrac{1}{2}|\beta|^2$ subject to $y_i(x_i^T\beta + \beta_0) \ge 1$, $i = 1, \dots, N$. The Lagrange function is $L_P = \tfrac{1}{2}|\beta|^2 - \sum_i \alpha_i [y_i(x_i^T\beta + \beta_0) - 1]$. To minimize, set the partial derivatives to 0: $\partial L_P/\partial\beta = 0$ gives $\beta = \sum_i \alpha_i y_i x_i$, and $\partial L_P/\partial\beta_0 = 0$ gives $\sum_i \alpha_i y_i = 0$. The problem can be solved by quadratic programming.
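
A minimal sketch of this optimization (my own illustration, not code from the slides): the primal problem above is handed to a general-purpose constrained optimizer on a tiny separable data set; in practice a dedicated quadratic-programming solver would be used.

```python
# Sketch: solve the hard-margin primal
#   min 1/2 ||beta||^2   s.t.  y_i (x_i^T beta + beta_0) >= 1
# with scipy's constrained optimizer on made-up, linearly separable data.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.0],
              [-1.0, -1.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])

def objective(w):
    beta = w[:-1]                      # w = (beta_1, beta_2, beta_0)
    return 0.5 * beta @ beta

# One inequality constraint y_i (x_i^T beta + beta_0) - 1 >= 0 per point.
constraints = [{"type": "ineq",
                "fun": lambda w, xi=xi, yi=yi: yi * (xi @ w[:-1] + w[-1]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), constraints=constraints)
beta, beta0 = res.x[:-1], res.x[-1]
print("beta =", beta, "beta_0 =", beta0, "margin M =", 2 / np.linalg.norm(beta))
```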

  10. When the two classes are non-separable What is the best hyperplane? Idea: allow some points to lie on the wrong side, but not by much.

  11. Support vector machine When the two classes are not separable, the problem is slightly modified: find $\min_{\beta, \beta_0, \xi} \tfrac{1}{2}|\beta|^2 + C\sum_i \xi_i$ subject to $y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i$ and $\xi_i \ge 0$, where the slack variables $\xi_i$ allow points to fall on the wrong side of their margin and the constant $C$ controls how heavily slack is penalized. This can also be solved using quadratic programming.
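
A hedged illustration of the role of the penalty constant C (my own sketch using scikit-learn and synthetic data, not material from the slides): a larger C tolerates less slack and typically leaves fewer points as support vectors.

```python
# Sketch: effect of the slack penalty C in a soft-margin linear SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping (non-separable) classes.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors")
```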

  12. Convert a non-separable case to a separable one by a nonlinear transformation [Figure: the original data, non-separable in 1D.]

  13. Convert a non-separable case to a separable one by a nonlinear transformation [Figure: after the transformation, the data are separable in 1D.]
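
A small sketch of this idea (my own illustrative example, not the data in the figures): points with small |x| belong to one class and points with large |x| to the other, so no single threshold on x separates them, but a threshold on the transformed value x² does.

```python
# Sketch: a 1D data set that no threshold on x can separate,
# but that becomes separable after the nonlinear map x -> x^2.
import numpy as np

x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.where(np.abs(x) < 1.0, 1, -1)   # class +1 in the middle, -1 outside

# No threshold on x works: class +1 sits between the two class -1 groups.
# After transforming, a single threshold (z < 1) separates the classes.
z = x ** 2
print("separable after transform:", np.all((z < 1.0) == (y == 1)))  # True
```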

  14. Kernel function • Introduce nonlinear transformations $h(x)$ of the inputs and fit the separating hyperplane in the transformed space. Then the separating function is $f(x) = h(x)^T\beta + \beta_0 = \sum_i \alpha_i y_i \langle h(x), h(x_i) \rangle + \beta_0$. In fact, all you need is the kernel function $K(x, x') = \langle h(x), h(x') \rangle$. Common kernels: the $d$th-degree polynomial $K(x, x') = (1 + \langle x, x' \rangle)^d$, the radial basis (Gaussian) kernel $K(x, x') = \exp(-\gamma |x - x'|^2)$, and the neural-network (sigmoid) kernel $K(x, x') = \tanh(\kappa_1 \langle x, x' \rangle + \kappa_2)$.
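
A brief sketch of two of these kernels evaluated directly (my own illustration; the degree d and width γ below are arbitrary choices). Only these kernel values enter the separating function, so the transformation $h(x)$ never needs to be computed explicitly.

```python
# Sketch: common kernel functions evaluated directly on two input vectors.
import numpy as np

def polynomial_kernel(x, z, d=2):
    """d-th degree polynomial kernel K(x, z) = (1 + <x, z>)^d."""
    return (1.0 + np.dot(x, z)) ** d

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis (Gaussian) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.0, 1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z))
```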

  15. Applications

  16. Prediction of central nervous system embryonic tumor outcome • 42 patient samples • 5 cancer types • Array contains 6817 genes • Question: are different tumor types distinguishable from their gene expression patterns? (Pomeroy et al. 2002)

  17. (Pomeroy et al. 2002)

  18. Gene expressions within a cancer type cluster together (Pomeroy et al. 2002)

  19. PCA based on all genes (Pomeroy et al. 2002)

  20. PCA based on a subset of informational genes (Pomeroy et al. 2002)

  21. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks • Four different cancer types. • 88 samples • 6567 genes • Goal: to predict cancer types from gene expression data (Khan et al. 2001)

  22. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks (Khan et al. 2001)

  23. Procedures • Filter out genes that have low expression values (retain 2308 genes) • Dimension reduction by PCA --- select the top 10 principal components • 3-fold cross-validation (Khan et al. 2001)

  24. Artificial Neural Network

  25. (Khan et al. 2001)

  26. Procedures • Filter out genes that have low expression values (retain 2308 genes) • Dimension reduction by PCA --- select the top 10 principal components • 3-fold cross-validation • Repeat 1250 times. (Khan et al. 2001)
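
A hedged sketch of this kind of pipeline (not Khan et al.'s actual code: the data below are synthetic stand-ins, the expression-level filter is assumed to have already been applied, and a simple linear classifier stands in for their neural network):

```python
# Sketch: reduce to 10 principal components and estimate accuracy with
# a shuffled 3-fold cross-validation; Khan et al. repeat such random
# splits many times (1250) rather than using a single split.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(88, 2308))          # 88 samples x 2308 retained genes
y = rng.integers(0, 4, size=88)          # 4 cancer types (placeholder labels)

model = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
print("3-fold CV accuracy:", cross_val_score(model, X, y, cv=cv).mean())
```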

  27. (Khan et al. 2001)

  28. (Khan et al. 2001)

  29. Acknowledgement • Sources of slides: • Cheng Li • http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf • www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt

  30. Aggregating predictors • Sometimes aggregating several predictors performs better than any single predictor alone. Aggregation is achieved by a weighted sum of different predictors, which may be the same kind of predictor trained on slightly perturbed versions of the training data. • The key to the improvement in accuracy is the instability of the individual classifiers, such as classification trees.
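
A minimal sketch of this idea (my own illustration, using bagging of classification trees in scikit-learn; the data set is a placeholder, not one discussed in the slides):

```python
# Sketch: aggregate unstable classifiers (trees) trained on perturbed
# (bootstrap-resampled) versions of the training data and compare with
# a single tree.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```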

  31. AdaBoost • Step 1: Initialize the observation weights $w_i = 1/N$, $i = 1, \dots, N$. • Step 2: For $m = 1$ to $M$: fit a classifier $G_m(x)$ to the training data using the weights $w_i$; compute the weighted error $\mathrm{err}_m = \sum_i w_i I(y_i \ne G_m(x_i)) / \sum_i w_i$; compute $\alpha_m = \log[(1 - \mathrm{err}_m)/\mathrm{err}_m]$; set $w_i \leftarrow w_i \exp[\alpha_m I(y_i \ne G_m(x_i))]$, so misclassified observations are given more weight. • Step 3: Output $G(x) = \mathrm{sign}[\sum_{m=1}^{M} \alpha_m G_m(x)]$.
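
A short sketch implementing these steps (my own illustration, not code from the slides), with decision stumps standing in for the base classifier $G_m$ and a synthetic two-class data set:

```python
# Sketch: AdaBoost with decision stumps, following the steps above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
y = np.where(y == 1, 1, -1)               # labels in {-1, +1}
N, M = len(y), 50

w = np.full(N, 1.0 / N)                   # Step 1: equal observation weights
stumps, alphas = [], []
for m in range(M):                        # Step 2
    G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = (G.predict(X) != y)
    err = np.sum(w * miss) / np.sum(w)    # weighted training error
    alpha = np.log((1 - err) / err)       # classifier weight
    w = w * np.exp(alpha * miss)          # up-weight misclassified points
    stumps.append(G)
    alphas.append(alpha)

# Step 3: weighted-majority output G(x) = sign(sum_m alpha_m G_m(x))
F = sum(a * G.predict(X) for a, G in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```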

  32. Boosting

  33. Optimal separating hyperplane • Substituting, we get the Lagrange (Wolfe) dual function $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i \sum_k \alpha_i \alpha_k y_i y_k x_i^T x_k$, to be maximized subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$. To complete the steps, see Burges et al. • If $\alpha_i > 0$, then $y_i(x_i^T\beta + \beta_0) = 1$; these $x_i$'s are called the support vectors. $\beta = \sum_i \alpha_i y_i x_i$ is determined only by the support vectors.
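
A small sketch checking this last point numerically (my own illustration with scikit-learn, whose dual_coef_ attribute stores $\alpha_i y_i$ for the support vectors):

```python
# Sketch: for a linear SVM, beta = sum_i alpha_i * y_i * x_i taken over the
# support vectors only; verify against the fitted coefficient vector.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_[0] holds alpha_i * y_i for each support vector.
beta = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(beta, clf.coef_[0]))    # True: beta depends only on the SVs
```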

  34. Support vector machine The Lagrange function is $L_P = \tfrac{1}{2}|\beta|^2 + C\sum_i \xi_i - \sum_i \alpha_i [y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] - \sum_i \mu_i \xi_i$. Setting the partial derivatives to 0 gives $\beta = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, and $\alpha_i = C - \mu_i$. Substituting, we get the dual $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i \sum_k \alpha_i \alpha_k y_i y_k x_i^T x_k$, subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
