
Support Vector Machines


Presentation Transcript


  1. Summer Course: Data Mining. Support Vector Machines and other penalization classifiers. Presenter: Georgi Nalbantov. August 2009

  2. Contents • Purpose • Linear Support Vector Machines • Nonlinear Support Vector Machines • (Theoretical justifications of SVM) • Marketing Examples • Other penalization classification methods • Conclusion and Q & A • (some extensions)

  3. Purpose • Task to be solved (The Classification Task): classify cases (customers) into “type 1” or “type 2” on the basis of some known attributes (characteristics) • Chosen tool to solve this task: Support Vector Machines

  4. The Classification Task • Given data on explanatory and explained variables, where the explained variable can take two values {±1}, find a function that gives the “best” separation between the “−1” cases and the “+1” cases: Given: (x1, y1), …, (xm, ym) ∈ ℝⁿ × {±1}. Find: f : ℝⁿ → {±1}. “Best function” = the one whose expected error on unseen data (xm+1, ym+1), …, (xm+k, ym+k) is minimal • Existing techniques to solve the classification task: • Linear and Quadratic Discriminant Analysis • Logit choice models (Logistic Regression) • Decision trees, Neural Networks, Least Squares SVM
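
To make the task concrete, here is a minimal Python sketch of the setup described above: a training sample with labels in {−1, +1}, a candidate decision function f, and its error estimated on unseen data. The data and the particular f are invented for illustration only.

```python
# A minimal sketch of the classification task on made-up data.
import numpy as np

rng = np.random.default_rng(0)

# Training sample (x1, y1), ..., (xm, ym) with labels in {-1, +1}
X_train = rng.normal(size=(100, 2))
y_train = np.sign(X_train[:, 0] + X_train[:, 1] + 0.1 * rng.normal(size=100))

# A candidate decision function f: R^2 -> {-1, +1} (hypothetical w and b)
w, b = np.array([1.0, 1.0]), 0.0
def f(X):
    return np.sign(X @ w + b)

# "Best" = low expected error on unseen data; estimate it on a fresh sample
X_test = rng.normal(size=(50, 2))
y_test = np.sign(X_test[:, 0] + X_test[:, 1] + 0.1 * rng.normal(size=50))
print("estimated error on unseen data:", np.mean(f(X_test) != y_test))
```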

  5. Support Vector Machines: Definition • Support Vector Machines are a non-parametric tool for classification/regression • Support Vector Machines are used for prediction rather than description purposes • Support Vector Machines have been developed by Vapnik and co-workers

  6. Linear Support Vector Machines • A direct marketing company wants to sell a new book: “The Art History of Florence” • Nissan Levin and Jacob Zahavi in Lattin, Carroll and Green (2003). • Problem: how to identify buyers and non-buyers using the two variables: • Months since last purchase • Number of art books purchased [Figure: scatter plot of customers; ∆ = buyers, ● = non-buyers; x-axis: months since last purchase, y-axis: number of art books purchased]

  7. Linear SVM: Separable Case • Main idea of SVM: separate the groups by a line. • However: there are infinitely many lines that have zero training error… • … which line shall we choose? [Figure: several candidate separating lines between the ∆ buyers and ● non-buyers]

  8. Linear SVM: Separable Case • SVMs use the idea of a margin around the separating line. • The thinner the margin, the more complex the model. • The best line is the one with the largest margin. [Figure: a separating line with its margin between the ∆ buyers and ● non-buyers]

  9. Linear SVM: Separable Case • The line having the largest margin is: w1x1 + w2x2 + b = 0 • where • x1 = months since last purchase • x2 = number of art books purchased • Note: • w1xi1 + w2xi2 + b ≥ +1 for i ∈ ∆ • w1xj1 + w2xj2 + b ≤ −1 for j ∈ ● [Figure: the three lines w1x1 + w2x2 + b = +1, = 0, = −1 and the margin between the outer two]
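
A small numeric check of these two constraints; the weights w = (−1, 1), the intercept b = 0 and the customer coordinates below are hypothetical values chosen so that the inequalities hold.

```python
# Checking the margin constraints for a hypothetical line w1*x1 + w2*x2 + b = 0.
import numpy as np

w = np.array([-1.0, 1.0])   # hypothetical (w1, w2): months, art books
b = 0.0                     # hypothetical intercept

buyers = np.array([[2.0, 8.0], [3.0, 6.0]])        # class "+1" (triangles)
non_buyers = np.array([[12.0, 1.0], [15.0, 0.0]])  # class "-1" (circles)

print(buyers @ w + b)       # each value should be >= +1
print(non_buyers @ w + b)   # each value should be <= -1
```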

  10. Linear SVM: Separable Case • The width of the margin is given by: 2 / ‖w‖ = 2 / √(w1² + w2²) • Note: maximizing the margin is therefore the same as minimizing ‖w‖, equivalently minimizing ½(w1² + w2²). [Figure: the margin between w1x1 + w2x2 + b = +1 and w1x1 + w2x2 + b = −1, with its width indicated]
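
For completeness, a short derivation of this width (standard textbook reasoning, not shown on the slide):

```latex
% If x_+ lies on w.x + b = +1 and x_- on w.x + b = -1, then w.(x_+ - x_-) = 2,
% so the distance between the two margin lines along the unit normal w/||w|| is
\[
  \text{margin width}
  \;=\; \frac{w^\top (x_+ - x_-)}{\lVert w \rVert}
  \;=\; \frac{2}{\lVert w \rVert}
  \;=\; \frac{2}{\sqrt{w_1^2 + w_2^2}},
\]
\[
  \text{so maximizing the margin} \iff \text{minimizing } \tfrac{1}{2}\lVert w \rVert^{2}.
\]
```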

  11. Linear SVM: Separable Case • The optimization problem for SVM is: minimize ½(w1² + w2²)  (i.e., maximize the margin) • subject to: • w1xi1 + w2xi2 + b ≥ +1 for i ∈ ∆ • w1xj1 + w2xj2 + b ≤ −1 for j ∈ ● [Figure: the separating line with its margin, as before]
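
In practice such a maximum-margin line can be obtained with scikit-learn's SVC using a linear kernel; a very large C approximates the hard-margin (separable) case. The buyer/non-buyer coordinates below are invented toy values.

```python
# Solving the max-margin problem on toy, linearly separable data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 8.0], [3.0, 6.0], [1.0, 9.0],     # buyers (+1)
              [12.0, 1.0], [15.0, 0.0], [10.0, 2.0]])  # non-buyers (-1)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]        # the line w1*x1 + w2*x2 + b = 0
print("w:", w, " b:", b)
print("margin width:", 2.0 / np.linalg.norm(w))
```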

  12. Linear SVM: Separable Case • “Support vectors” are those points that lie on the boundaries of the margin. • The decision surface (line) is determined only by the support vectors; all other points are irrelevant. [Figure: the support vectors highlighted on the margin boundaries]
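
A quick way to see that only the support vectors matter: refit using the support vectors alone and compare the resulting line (same invented toy data as above; agreement is up to the solver's numerical tolerance).

```python
# Only the support vectors determine the separating line.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 8.0], [3.0, 6.0], [1.0, 9.0],
              [12.0, 1.0], [15.0, 0.0], [10.0, 2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
sv, sv_y = clf.support_vectors_, y[clf.support_]
print("support vectors:\n", sv)

# Refit on the support vectors only; w and b should barely change.
clf_sv = SVC(kernel="linear", C=1e6).fit(sv, sv_y)
print("full data:            w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors only: w =", clf_sv.coef_[0], " b =", clf_sv.intercept_[0])
```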

  13. Linear SVM: Nonseparable Case • Non-separable case: there is no line that separates the two groups without errors. Training set: 1000 targeted customers. • Here, SVM minimizes L(w,C) = Complexity + Errors, i.e. it maximizes the margin while minimizing the training errors (the slacks ξ): minimize L(w,C) = ½(w1² + w2²) + C Σ ξi • subject to: • w1xi1 + w2xi2 + b ≥ +1 − ξi for i ∈ ∆ • w1xj1 + w2xj2 + b ≤ −1 + ξj for j ∈ ● • ξi, ξj ≥ 0 [Figure: overlapping ∆ buyers and ● non-buyers with the margin lines w1x1 + w2x2 + b = ±1]
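
The slacks ξi can be read off any fitted linear SVM as ξi = max(0, 1 − yi(w·xi + b)); a sketch on made-up, overlapping data:

```python
# Recovering the slack variables (margin violations) of a soft-margin SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2.0, 7.0], 2.0, size=(20, 2)),    # "buyers"
               rng.normal([10.0, 2.0], 2.0, size=(20, 2))])  # "non-buyers"
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # the xi_i in L(w, C)
print("complexity term 0.5*||w||^2 =", 0.5 * np.dot(w, w))
print("error term sum(xi_i)        =", slack.sum())
```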

  14. Linear SVM: The Role of C • Bigger C: increased complexity (thinner margin), smaller number of training errors (better fit on the data). • Smaller C: decreased complexity (wider margin), bigger number of training errors (worse fit on the data). • So C lets us vary both complexity and empirical error, by affecting the optimal w and the optimal number of training errors. [Figure: the same data fitted with C = 1 and with C = 5; the larger C gives the thinner margin]
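
A sketch of this effect: refit the same made-up, overlapping data for several values of C and report the margin width and the number of training errors. The tendency is the one stated above, though it need not be perfectly monotone on a small sample.

```python
# Varying C trades off margin width (complexity) against training errors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([2.0, 7.0], 2.5, size=(30, 2)),
               rng.normal([8.0, 3.0], 2.5, size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    errors = int((clf.predict(X) != y).sum())
    print(f"C = {C:>6}: margin width = {width:.2f}, training errors = {errors}")
```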

  15. Bias – Variance trade-off

  16. From Regression into Classification • We have a linear model, such as y = wx + b. • We have to estimate this relation using our training data set, having in mind the so-called “accuracy”, or “0-1”, loss function (our evaluation criterion). • The training data set we have consists of only MANY observations, for instance: Training data:

  17. From Regression into Classification • We have a linear model, such as y = wx + b. • We have to estimate this relation using our training data set, having in mind the so-called “accuracy”, or “0-1”, loss function (our evaluation criterion). • The training data set we have consists of only MANY observations, for instance: Training data: [Figure: the training points in the (x, y) plane with the fitted line, the levels y = +1 and y = −1, the two support vectors lying on these levels, and the “margin” between them]

  18. From Regression into Classification: Support Vector Machines • flatter line ⇒ greater penalization • equivalently: smaller slope ⇒ bigger margin [Figure: two candidate lines in the (x, y) plane; the flatter one has the wider “margin” between the y = +1 and y = −1 levels]

  19. From Regression into Classification: Support Vector Machines • flatter line ⇒ greater penalization • equivalently: smaller slope ⇒ bigger margin [Figure: the same idea shown both in the regression view (y against x) and in the classification view in the (x1, x2) plane, with the “margin” marked]

  20. Nonlinear SVM: Nonseparable Case • Mapping into a higher-dimensional space. • Optimization task: minimize L(w,C) as before, with the data mapped into the higher-dimensional space, • subject to: • w·φ(xi) + b ≥ +1 − ξi for i ∈ ∆ • w·φ(xj) + b ≤ −1 + ξj for j ∈ ●, with ξi, ξj ≥ 0 [Figure: nonseparable ∆ and ● points in the original (x1, x2) space]

  21. Nonlinear SVM: Nonseparable Case • Map the data into a higher-dimensional space: ℝ² → ℝ³ [Figure: the four points (−1,1), (1,1), (−1,−1) and (1,−1) in the (x1, x2) plane, labelled ∆ and ● so that no straight line separates the two classes]
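
A sketch of one such map (the slide does not show its exact form in the transcript): for these four XOR-like points, adding the product x1·x2 as a third coordinate makes the two classes linearly separable. The class labels below follow the usual XOR assignment, which may differ from the slide's.

```python
# Mapping R^2 -> R^3 so that the XOR-like pattern becomes linearly separable.
import numpy as np

X = np.array([[-1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([-1, -1, 1, 1])            # usual XOR-style labelling (assumed)

# phi(x1, x2) = (x1, x2, x1*x2): one possible feature map
Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
print(Phi)

# The third coordinate equals the label, so the plane z3 = 0 separates perfectly.
print("separable by the third coordinate:", bool(np.all(np.sign(Phi[:, 2]) == y)))
```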

  22. Nonlinear SVM: Nonseparable Case • Find the optimal hyperplane in the transformed space. [Figure: the same four points after the mapping into ℝ³, where a separating plane exists]

  23. Nonlinear SVM: Nonseparable Case • Observe the decision surface in the original space (optional). [Figure: the resulting nonlinear decision boundary drawn back in the original (x1, x2) plane]
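
One way to look at that surface is to evaluate a fitted nonlinear SVM's decision function on a grid of points in the original space and inspect its sign; the RBF kernel and its parameters below are arbitrary illustrative choices, not necessarily the slide's.

```python
# Signing the decision function on a grid reveals the nonlinear boundary.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(-2, 2, 5), np.linspace(-2, 2, 5))
grid = np.column_stack([xx.ravel(), yy.ravel()])
Z = np.sign(clf.decision_function(grid)).reshape(xx.shape)
print(Z)   # +1 / -1 regions (0 exactly on the boundary)
```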

  24. Nonlinear SVM: Nonseparable Case • Dual formulation of the (primal) SVM minimization problem: the primal problem (minimize over w, b and the slacks, subject to the margin constraints) is equivalent to a dual problem (maximize over one multiplier per observation, subject to its own constraints); see the sketch below.
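
For reference, the standard soft-margin primal and its dual in textbook form (the slide's own notation may differ slightly):

```latex
% Soft-margin SVM, primal and dual (standard form).
\textbf{Primal:}\quad
\min_{w,\,b,\,\xi}\;\tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0.

\textbf{Dual:}\quad
\max_{\alpha}\;\sum_{i=1}^{m}\alpha_i
 - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j\,y_i y_j\,x_i^\top x_j
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{m}\alpha_i y_i = 0.
```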

  25. Nonlinear SVM: Nonseparable Case • Dual formulation of the (primal) SVM minimization problem: in the dual, the data enter only through inner products, so these can be replaced by a kernel function K(xi, xj), subject to the same constraints as before.

  26. Nonlinear SVM: Nonseparable Case • Dual formulation of the (primal) SVM minimization problem (continued): because only the kernel function K(xi, xj) is needed, the mapping into the higher-dimensional space never has to be computed explicitly.
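
Since the dual needs only the values K(xi, xj), the kernel (Gram) matrix can be computed directly from the data. Below is a sketch for the RBF kernel, which reappears in the ketchup example later; the sample points and σ are made up.

```python
# Computing an RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2*sigma^2)).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))   # five made-up observations in R^2
sigma = 1.0                   # made-up kernel width

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2.0 * sigma ** 2))

# K is symmetric, has ones on the diagonal, and is all the dual problem needs.
print(K.shape, np.allclose(K, K.T), np.all(np.diag(K) == 1.0))
```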

  27. Strengths and Weaknesses of SVM • Strengths of SVM: • Training is relatively easy • No local minima • It scales relatively well to high dimensional data • Trade-off between classifier complexity and error can be controlled explicitly via C • Robustness of the results • The “curse of dimensionality” is avoided • Weaknesses of SVM: • What is the best trade-off parameter C ? • Need a good transformation of the original space

  28. The Ketchup Marketing Problem • Two types of ketchup: Heinz and Hunts • Seven Attributes • Feature Heinz • Feature Hunts • Display Heinz • Display Hunts • Feature&Display Heinz • Feature&Display Hunts • Log price difference between Heinz and Hunts • Training Data: 2498 cases (89.11% Heinz is chosen) • Test Data: 300 cases (88.33% Heinz is chosen)

  29. The Ketchup Marketing Problem • Choose a kernel mapping: linear kernel, polynomial kernel, or RBF kernel. • Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here: C and σ). [Figure: surface of cross-validation mean squared errors for the SVM with RBF kernel over a grid of C and σ values, from the minimum to the maximum error]
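
A sketch of that 5-fold grid search with scikit-learn. Note that scikit-learn parametrizes the RBF kernel by gamma, which corresponds to 1/(2σ²) under the common convention, and that the search below scores by accuracy rather than the slide's mean squared error. The data are random stand-ins, not the ketchup data.

```python
# 5-fold cross-validated grid search over C and the RBF kernel width.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 7))                       # 7 attributes, as in the slides
y = np.where(X[:, -1] + 0.5 * rng.normal(size=200) > 0, 1, -1)

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}          # i.e. a grid over sigma
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best (C, gamma):", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```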

  30. The Ketchup Marketing Problem – Training Set

  31. The Ketchup Marketing Problem – Training Set

  32. The Ketchup Marketing Problem – Training Set

  33. The Ketchup Marketing Problem – Training Set

  34. The Ketchup Marketing Problem – Test Set

  35. The Ketchup Marketing Problem – Test Set

  36. The Ketchup Marketing Problem – Test Set

  37. Part II: Penalized classification and regression methods • Support Hyperplanes • Nearest Convex Hull classifier • Soft Nearest Neighbor • Application: an example financial study using Support Vector Regression • Conclusion

  38. Classification: Support Hyperplanes • Consider a (separable) binary classification case: training data (+, −) and a test point x. • There are infinitely many hyperplanes that are semi-consistent (= commit no error) with the training data. [Figure: the + and − training points with a test point x]

  39. Classification: Support Hyperplanes • For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data: the support hyperplane of x. • The SH decision surface: each point on it has 2 support hyperplanes. [Figure: the support hyperplane of x among the + and − training points]

  40. Classification: Support Hyperplanes • Toy problem experiment with Support Hyperplanes and Support Vector Machines. [Figure: the toy data set used for the comparison]

  41. Classification: Support Vector Machines and Support Hyperplanes [Figure: side-by-side decision boundaries on the toy data: Support Vector Machines vs. Support Hyperplanes]

  42. Classification: Support Vector Machines and Nearest Convex Hull classification [Figure: side-by-side decision boundaries: Support Vector Machines vs. Nearest Convex Hull classification]

  43. Classification: Support Vector Machines and Soft Nearest Neighbor [Figure: side-by-side decision boundaries: Support Vector Machines vs. Soft Nearest Neighbor]

  44. Classification: Support Hyperplanes [Figure: Support Hyperplanes with bigger penalization vs. Support Hyperplanes]

  45. Classification: Nearest Convex Hull classification [Figure: Nearest Convex Hull classification with bigger penalization vs. Nearest Convex Hull classification]

  46. Classification: Soft Nearest Neighbor [Figure: Soft Nearest Neighbor with bigger penalization vs. Soft Nearest Neighbor]

  47. Classification: Support Vector Machines, Nonseparable Case [Figure: Support Vector Machines decision boundary on nonseparable data]

  48. Classification: Support Hyperplanes, Nonseparable Case [Figure: Support Hyperplanes decision boundary on nonseparable data]

  49. Classification: Nearest Convex Hull classification, Nonseparable Case [Figure: Nearest Convex Hull classification decision boundary on nonseparable data]

  50. Classification: Soft Nearest Neighbor, Nonseparable Case [Figure: Soft Nearest Neighbor decision boundary on nonseparable data]
