
Support Vector Machines


Presentation Transcript


  1. Summer Course: Data Mining. Support Vector Machines and other penalization classifiers. Presenter: Georgi Nalbantov. August 2009

  2. Contents • Purpose • Linear Support Vector Machines • Nonlinear Support Vector Machines • (Theoretical justifications of SVM) • Marketing Examples • Other penalization classification methods • Conclusion and Q & A • (some extensions)

  3. Purpose • Task to be solved (The Classification Task): classify cases (customers) into “type 1” or “type 2” on the basis of some known attributes (characteristics) • Chosen tool to solve this task: Support Vector Machines

  4. The Classification Task • Given data on explanatory and explained variables, where the explained variable can take two values {±1}, find a function that gives the “best” separation between the “−1” cases and the “+1” cases: Given: (x1, y1), …, (xm, ym) ∈ ℝⁿ × {±1}. Find: f : ℝⁿ → {±1}. “Best function” = the one whose expected error on unseen data (xm+1, ym+1), …, (xm+k, ym+k) is minimal • Existing techniques to solve the classification task: • Linear and Quadratic Discriminant Analysis • Logit choice models (Logistic Regression) • Decision trees, Neural Networks, Least Squares SVM
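
To make the task concrete, here is a minimal Python sketch of the setup described above: a training sample with labels in {−1, +1}, a candidate decision function f, and its error estimated on unseen data. The data and the particular f are invented for illustration only.

```python
# A minimal sketch of the classification task on made-up data.
import numpy as np

rng = np.random.default_rng(0)

# Training sample (x1, y1), ..., (xm, ym) with labels in {-1, +1}
X_train = rng.normal(size=(100, 2))
y_train = np.sign(X_train[:, 0] + X_train[:, 1] + 0.1 * rng.normal(size=100))

# A candidate decision function f: R^2 -> {-1, +1} (hypothetical w and b)
w, b = np.array([1.0, 1.0]), 0.0
def f(X):
    return np.sign(X @ w + b)

# "Best" = low expected error on unseen data; estimate it on a fresh sample
X_test = rng.normal(size=(50, 2))
y_test = np.sign(X_test[:, 0] + X_test[:, 1] + 0.1 * rng.normal(size=50))
print("estimated error on unseen data:", np.mean(f(X_test) != y_test))
```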

  5. Support Vector Machines: Definition • Support Vector Machines are a non-parametric tool for classification/regression • Support Vector Machines are used for prediction rather than description purposes • Support Vector Machines have been developed by Vapnik and co-workers

  6. Linear Support Vector Machines • A direct marketing company wants to sell a new book: “The Art History of Florence” • Nissan Levin and Jacob Zahavi in Lattin, Carroll and Green (2003). • Problem: how to identify buyers and non-buyers using the two variables: • Months since last purchase • Number of art books purchased [Figure: scatter plot of customers; ∆ = buyers, ● = non-buyers; x-axis: months since last purchase, y-axis: number of art books purchased]

  7. Linear SVM: Separable Case • Main idea of SVM: separate the groups by a line. • However: there are infinitely many lines that have zero training error… • … which line shall we choose? [Figure: several candidate separating lines between the ∆ buyers and ● non-buyers]

  8. Linear SVM: Separable Case • SVMs use the idea of a margin around the separating line. • The thinner the margin, the more complex the model. • The best line is the one with the largest margin. [Figure: a separating line with its margin between the ∆ buyers and ● non-buyers]

  9. Linear SVM: Separable Case • The line having the largest margin is: w1x1 + w2x2 + b = 0 • where • x1 = months since last purchase • x2 = number of art books purchased • Note: • w1xi1 + w2xi2 + b ≥ +1 for i ∈ ∆ • w1xj1 + w2xj2 + b ≤ −1 for j ∈ ● [Figure: the three lines w1x1 + w2x2 + b = +1, = 0, = −1 and the margin between the outer two]
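
A small numeric check of these two constraints; the weights w = (−1, 1), the intercept b = 0 and the customer coordinates below are hypothetical values chosen so that the inequalities hold.

```python
# Checking the margin constraints for a hypothetical line w1*x1 + w2*x2 + b = 0.
import numpy as np

w = np.array([-1.0, 1.0])   # hypothetical (w1, w2): months, art books
b = 0.0                     # hypothetical intercept

buyers = np.array([[2.0, 8.0], [3.0, 6.0]])        # class "+1" (triangles)
non_buyers = np.array([[12.0, 1.0], [15.0, 0.0]])  # class "-1" (circles)

print(buyers @ w + b)       # each value should be >= +1
print(non_buyers @ w + b)   # each value should be <= -1
```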

  10. Linear SVM: Separable Case • The width of the margin is given by: 2 / ‖w‖ = 2 / √(w1² + w2²) • Note: maximizing the margin is therefore the same as minimizing ‖w‖, equivalently minimizing ½(w1² + w2²). [Figure: the margin between w1x1 + w2x2 + b = +1 and w1x1 + w2x2 + b = −1, with its width indicated]
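
For completeness, a short derivation of this width (standard textbook reasoning, not shown on the slide):

```latex
% If x_+ lies on w.x + b = +1 and x_- on w.x + b = -1, then w.(x_+ - x_-) = 2,
% so the distance between the two margin lines along the unit normal w/||w|| is
\[
  \text{margin width}
  \;=\; \frac{w^\top (x_+ - x_-)}{\lVert w \rVert}
  \;=\; \frac{2}{\lVert w \rVert}
  \;=\; \frac{2}{\sqrt{w_1^2 + w_2^2}},
\]
\[
  \text{so maximizing the margin} \iff \text{minimizing } \tfrac{1}{2}\lVert w \rVert^{2}.
\]
```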

  11. Linear SVM: Separable Case • The optimization problem for SVM is: minimize ½(w1² + w2²)  (i.e., maximize the margin) • subject to: • w1xi1 + w2xi2 + b ≥ +1 for i ∈ ∆ • w1xj1 + w2xj2 + b ≤ −1 for j ∈ ● [Figure: the separating line with its margin, as before]
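
In practice such a maximum-margin line can be obtained with scikit-learn's SVC using a linear kernel; a very large C approximates the hard-margin (separable) case. The buyer/non-buyer coordinates below are invented toy values.

```python
# Solving the max-margin problem on toy, linearly separable data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 8.0], [3.0, 6.0], [1.0, 9.0],     # buyers (+1)
              [12.0, 1.0], [15.0, 0.0], [10.0, 2.0]])  # non-buyers (-1)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]        # the line w1*x1 + w2*x2 + b = 0
print("w:", w, " b:", b)
print("margin width:", 2.0 / np.linalg.norm(w))
```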

  12. Linear SVM: Separable Case • “Support vectors” are those points that lie on the boundaries of the margin. • The decision surface (line) is determined only by the support vectors; all other points are irrelevant. [Figure: the support vectors highlighted on the margin boundaries]
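
A quick way to see that only the support vectors matter: refit using the support vectors alone and compare the resulting line (same invented toy data as above; agreement is up to the solver's numerical tolerance).

```python
# Only the support vectors determine the separating line.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 8.0], [3.0, 6.0], [1.0, 9.0],
              [12.0, 1.0], [15.0, 0.0], [10.0, 2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
sv, sv_y = clf.support_vectors_, y[clf.support_]
print("support vectors:\n", sv)

# Refit on the support vectors only; w and b should barely change.
clf_sv = SVC(kernel="linear", C=1e6).fit(sv, sv_y)
print("full data:            w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors only: w =", clf_sv.coef_[0], " b =", clf_sv.intercept_[0])
```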

  13. Linear SVM: Nonseparable Case • Non-separable case: there is no line that separates the two groups without errors. Training set: 1000 targeted customers. • Here, SVM minimizes L(w,C) = Complexity + Errors, i.e. it maximizes the margin while minimizing the training errors (the slacks ξ): minimize L(w,C) = ½(w1² + w2²) + C Σ ξi • subject to: • w1xi1 + w2xi2 + b ≥ +1 − ξi for i ∈ ∆ • w1xj1 + w2xj2 + b ≤ −1 + ξj for j ∈ ● • ξi, ξj ≥ 0 [Figure: overlapping ∆ buyers and ● non-buyers with the margin lines w1x1 + w2x2 + b = ±1]
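
The slacks ξi can be read off any fitted linear SVM as ξi = max(0, 1 − yi(w·xi + b)); a sketch on made-up, overlapping data:

```python
# Recovering the slack variables (margin violations) of a soft-margin SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2.0, 7.0], 2.0, size=(20, 2)),    # "buyers"
               rng.normal([10.0, 2.0], 2.0, size=(20, 2))])  # "non-buyers"
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # the xi_i in L(w, C)
print("complexity term 0.5*||w||^2 =", 0.5 * np.dot(w, w))
print("error term sum(xi_i)        =", slack.sum())
```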

  14. Linear SVM: The Role of C • Bigger C: increased complexity (thinner margin), smaller number of training errors (better fit on the data). • Smaller C: decreased complexity (wider margin), bigger number of training errors (worse fit on the data). • So C lets us vary both complexity and empirical error, by affecting the optimal w and the optimal number of training errors. [Figure: the same data fitted with C = 1 and with C = 5; the larger C gives the thinner margin]
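
A sketch of this effect: refit the same made-up, overlapping data for several values of C and report the margin width and the number of training errors. The tendency is the one stated above, though it need not be perfectly monotone on a small sample.

```python
# Varying C trades off margin width (complexity) against training errors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([2.0, 7.0], 2.5, size=(30, 2)),
               rng.normal([8.0, 3.0], 2.5, size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    errors = int((clf.predict(X) != y).sum())
    print(f"C = {C:>6}: margin width = {width:.2f}, training errors = {errors}")
```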

  15. Bias – Variance trade-off

  16. From Regression into Classification • We have a linear model, such as y = wx + b. • We have to estimate this relation using our training data set, having in mind the so-called “accuracy”, or “0-1”, loss function (our evaluation criterion). • The training data set we have consists of only MANY observations, for instance: Training data:

  17. From Regression into Classification • We have a linear model, such as y = wx + b. • We have to estimate this relation using our training data set, having in mind the so-called “accuracy”, or “0-1”, loss function (our evaluation criterion). • The training data set we have consists of only MANY observations, for instance: Training data: [Figure: the training points in the (x, y) plane with the fitted line, the levels y = +1 and y = −1, the two support vectors lying on these levels, and the “margin” between them]

  18. From Regression into Classification: Support Vector Machines • flatter line ⇒ greater penalization • equivalently: smaller slope ⇒ bigger margin [Figure: two candidate lines in the (x, y) plane; the flatter one has the wider “margin” between the y = +1 and y = −1 levels]

  19. From Regression into Classification: Support Vector Machines • flatter line ⇒ greater penalization • equivalently: smaller slope ⇒ bigger margin [Figure: the same idea shown both in the regression view (y against x) and in the classification view in the (x1, x2) plane, with the “margin” marked]

  20. Nonlinear SVM: Nonseparable Case • Mapping into a higher-dimensional space. • Optimization task: minimize L(w,C) as before, with the data mapped into the higher-dimensional space, • subject to: • w·φ(xi) + b ≥ +1 − ξi for i ∈ ∆ • w·φ(xj) + b ≤ −1 + ξj for j ∈ ●, with ξi, ξj ≥ 0 [Figure: nonseparable ∆ and ● points in the original (x1, x2) space]

  21. Nonlinear SVM: Nonseparable Case • Map the data into a higher-dimensional space: ℝ² → ℝ³ [Figure: the four points (−1,1), (1,1), (−1,−1) and (1,−1) in the (x1, x2) plane, labelled ∆ and ● so that no straight line separates the two classes]
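
A sketch of one such map (the slide does not show its exact form in the transcript): for these four XOR-like points, adding the product x1·x2 as a third coordinate makes the two classes linearly separable. The class labels below follow the usual XOR assignment, which may differ from the slide's.

```python
# Mapping R^2 -> R^3 so that the XOR-like pattern becomes linearly separable.
import numpy as np

X = np.array([[-1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([-1, -1, 1, 1])            # usual XOR-style labelling (assumed)

# phi(x1, x2) = (x1, x2, x1*x2): one possible feature map
Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
print(Phi)

# The third coordinate equals the label, so the plane z3 = 0 separates perfectly.
print("separable by the third coordinate:", bool(np.all(np.sign(Phi[:, 2]) == y)))
```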

  22. Nonlinear SVM: Nonseparable Case • Find the optimal hyperplane in the transformed space. [Figure: the same four points after the mapping into ℝ³, where a separating plane exists]

  23. Nonlinear SVM: Nonseparable Case • Observe the decision surface in the original space (optional). [Figure: the resulting nonlinear decision boundary drawn back in the original (x1, x2) plane]
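
One way to look at that surface is to evaluate a fitted nonlinear SVM's decision function on a grid of points in the original space and inspect its sign; the RBF kernel and its parameters below are arbitrary illustrative choices, not necessarily the slide's.

```python
# Signing the decision function on a grid reveals the nonlinear boundary.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(-2, 2, 5), np.linspace(-2, 2, 5))
grid = np.column_stack([xx.ravel(), yy.ravel()])
Z = np.sign(clf.decision_function(grid)).reshape(xx.shape)
print(Z)   # +1 / -1 regions (0 exactly on the boundary)
```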

  24. Nonlinear SVM: Nonseparable Case • Dual formulation of the (primal) SVM minimization problem: the primal problem (minimize over w, b and the slacks, subject to the margin constraints) is equivalent to a dual problem (maximize over one multiplier per observation, subject to its own constraints); see the sketch below.
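
For reference, the standard soft-margin primal and its dual in textbook form (the slide's own notation may differ slightly):

```latex
% Soft-margin SVM, primal and dual (standard form).
\textbf{Primal:}\quad
\min_{w,\,b,\,\xi}\;\tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0.

\textbf{Dual:}\quad
\max_{\alpha}\;\sum_{i=1}^{m}\alpha_i
 - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j\,y_i y_j\,x_i^\top x_j
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{m}\alpha_i y_i = 0.
```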

  25. Nonlinear SVM: Nonseparable Case • Dual formulation of the (primal) SVM minimization problem: in the dual, the data enter only through inner products, so these can be replaced by a kernel function K(xi, xj), subject to the same constraints as before.

  26. Nonlinear SVM: Nonseparable Case • Dual formulation of the (primal) SVM minimization problem (continued): because only the kernel function K(xi, xj) is needed, the mapping into the higher-dimensional space never has to be computed explicitly.
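
Since the dual needs only the values K(xi, xj), the kernel (Gram) matrix can be computed directly from the data. Below is a sketch for the RBF kernel, which reappears in the ketchup example later; the sample points and σ are made up.

```python
# Computing an RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2*sigma^2)).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))   # five made-up observations in R^2
sigma = 1.0                   # made-up kernel width

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2.0 * sigma ** 2))

# K is symmetric, has ones on the diagonal, and is all the dual problem needs.
print(K.shape, np.allclose(K, K.T), np.all(np.diag(K) == 1.0))
```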

  27. Strengths and Weaknesses of SVM • Strengths of SVM: • Training is relatively easy • No local minima • It scales relatively well to high dimensional data • Trade-off between classifier complexity and error can be controlled explicitly via C • Robustness of the results • The “curse of dimensionality” is avoided • Weaknesses of SVM: • What is the best trade-off parameter C ? • Need a good transformation of the original space

  28. The Ketchup Marketing Problem • Two types of ketchup: Heinz and Hunts • Seven Attributes • Feature Heinz • Feature Hunts • Display Heinz • Display Hunts • Feature&Display Heinz • Feature&Display Hunts • Log price difference between Heinz and Hunts • Training Data: 2498 cases (89.11% Heinz is chosen) • Test Data: 300 cases (88.33% Heinz is chosen)

  29. The Ketchup Marketing Problem • Choose a kernel mapping: linear kernel, polynomial kernel, or RBF kernel. • Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here: C and σ). [Figure: surface of cross-validation mean squared errors for the SVM with RBF kernel over a grid of C and σ values, from the minimum to the maximum error]
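
A sketch of that 5-fold grid search with scikit-learn. Note that scikit-learn parametrizes the RBF kernel by gamma, which corresponds to 1/(2σ²) under the common convention, and that the search below scores by accuracy rather than the slide's mean squared error. The data are random stand-ins, not the ketchup data.

```python
# 5-fold cross-validated grid search over C and the RBF kernel width.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 7))                       # 7 attributes, as in the slides
y = np.where(X[:, -1] + 0.5 * rng.normal(size=200) > 0, 1, -1)

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}          # i.e. a grid over sigma
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best (C, gamma):", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```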

  30. The Ketchup Marketing Problem – Training Set

  31. The Ketchup Marketing Problem – Training Set

  32. The Ketchup Marketing Problem – Training Set

  33. The Ketchup Marketing Problem – Training Set

  34. The Ketchup Marketing Problem – Test Set

  35. The Ketchup Marketing Problem – Test Set

  36. The Ketchup Marketing Problem – Test Set

  37. Part II: Penalized classification and regression methods • Support Hyperplanes • Nearest Convex Hull classifier • Soft Nearest Neighbor • Application: an example financial study using Support Vector Regression • Conclusion

  38. Classification: Support Hyperplanes • Consider a (separable) binary classification case: training data (+, −) and a test point x. • There are infinitely many hyperplanes that are semi-consistent (= commit no error) with the training data. [Figure: the + and − training points with a test point x]

  39. Classification: Support Hyperplanes • For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data: the support hyperplane of x. • The SH decision surface: each point on it has 2 support hyperplanes. [Figure: the support hyperplane of x among the + and − training points]

  40. Classification: Support Hyperplanes • Toy problem experiment with Support Hyperplanes and Support Vector Machines. [Figure: the toy data set used for the comparison]

  41. Classification: Support Vector Machines and Support Hyperplanes [Figure: side-by-side decision boundaries on the toy data: Support Vector Machines vs. Support Hyperplanes]

  42. Classification: Support Vector Machines and Nearest Convex Hull classification [Figure: side-by-side decision boundaries: Support Vector Machines vs. Nearest Convex Hull classification]

  43. Classification: Support Vector Machines and Soft Nearest Neighbor [Figure: side-by-side decision boundaries: Support Vector Machines vs. Soft Nearest Neighbor]

  44. Classification: Support Hyperplanes [Figure: Support Hyperplanes with bigger penalization vs. Support Hyperplanes]

  45. Classification: Nearest Convex Hull classification [Figure: Nearest Convex Hull classification with bigger penalization vs. Nearest Convex Hull classification]

  46. Classification: Soft Nearest Neighbor [Figure: Soft Nearest Neighbor with bigger penalization vs. Soft Nearest Neighbor]

  47. Classification: Support Vector Machines, Nonseparable Case [Figure: Support Vector Machines decision boundary on nonseparable data]

  48. Classification: Support Hyperplanes, Nonseparable Case [Figure: Support Hyperplanes decision boundary on nonseparable data]

  49. Classification: Nearest Convex Hull classification, Nonseparable Case [Figure: Nearest Convex Hull classification decision boundary on nonseparable data]

  50. Classification: Soft Nearest Neighbor, Nonseparable Case [Figure: Soft Nearest Neighbor decision boundary on nonseparable data]
