Support Vector Machines & Kernel Machines

Presentation Transcript


  1. Support Vector Machines & Kernel Machines, IP Seminar 2008, IDC Herzliya, Ohad Hageby

  2. Introduction To Support Vector Machines (SVM) Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. (from Wikipedia)

  3. Introduction Continued • Often we are interested in classifying data as a part of a machine-learning process. • Each data point will be represented by a p-dimensional vector (a list of p numbers). • Each of these data points belongs to only one of two classes.

  4. Training Data • We want to estimate a function f: R^N → {+1, −1} using input-output training pairs (x_1, y_1), …, (x_n, y_n) generated independently and identically distributed according to an unknown distribution P(x, y). • If f(x_i) = −1, x_i is in class 1; if f(x_i) = +1, x_i is in class 2.
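
A minimal sketch (using NumPy, with an artificial choice of P(x, y) made up purely for illustration) of what such i.i.d. training pairs (x_i, y_i) look like in code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw n i.i.d. training pairs (x_i, y_i) from an artificial P(x, y):
# class +1 points are centred at (+2, +2), class -1 points at (-2, -2).
n = 20
y = rng.choice([-1, 1], size=n)                  # labels in {-1, +1}
centres = np.where(y[:, None] == 1, 2.0, -2.0)   # class-dependent mean
X = centres + rng.normal(size=(n, 2))            # inputs x_i in R^2

print(X.shape, y.shape)   # (20, 2) (20,)
```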

  5. The machine • The machine's task is to learn the mapping from x_i to y_i. • It is defined by a set of possible mappings x ↦ f(x).

  6. Expected Error • The test examples are assumed to come from the same probability distribution P(x, y) as the training data. • The best function f we could have is the one minimizing the expected error (risk): R[f] = ∫ l(f(x), y) dP(x, y).

  7. Here l denotes the loss function; the "0/1 loss" is l(f(x), y) = 0 if f(x) = y and 1 otherwise. • A common alternative loss function is the squared loss: l(f(x), y) = (f(x) − y)².

  8. Empirical Risk • Unfortunately the risk cannot be minimized directly, because the probability distribution P(x, y) is unknown. • The "empirical risk" is defined as the measured mean error rate on the training set (for a fixed, finite number of observations): R_emp[f] = (1/n) Σ_{i=1..n} l(f(x_i), y_i).
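
A short NumPy sketch of the two losses and of the empirical risk as the mean loss over the training set; the function names and the toy decision function are illustrative, not from the slides:

```python
import numpy as np

def zero_one_loss(f_x, y):
    """0/1 loss: 1 if the predicted sign disagrees with the label, else 0."""
    return (np.sign(f_x) != y).astype(float)

def squared_loss(f_x, y):
    """Squared loss (f(x) - y)^2."""
    return (f_x - y) ** 2

def empirical_risk(f, X, y, loss=zero_one_loss):
    """R_emp[f]: mean loss of the decision function f over the training set."""
    return np.mean(loss(f(X), y))

# Example with a fixed linear decision function f(x) = w.x + b:
w, b = np.array([1.0, -1.0]), 0.5
f = lambda X: X @ w + b
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
y = np.array([1, -1, 1])
print(empirical_risk(f, X, y))                  # fraction of misclassified points
print(empirical_risk(f, X, y, squared_loss))    # mean squared loss
```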

  9. The overfitting dilemma • It is possible to give conditions on the learning machine which ensure that as n → ∞, R_emp converges toward R_expected. • For small sample sizes, however, overfitting might occur.

  10. The overfitting dilemma (cont.) • Figure from "An Introduction to Kernel-Based Learning Algorithms".

  11. VC Dimension • A concept in VC theory, introduced by Vladimir Vapnik and Alexey Chervonenkis. • It is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.

  12. Shattering Example (from Wikipedia) • For example, consider a straight line as the classification model: the model used by a perceptron. The line should separate positive data points from negative data points. When there are 3 points that are not collinear, a line can shatter them.

  13. Shattering • A classification model f with some parameter vector θ is said to shatter a set of data points (x_1, x_2, …, x_n) if, for all assignments of labels to those points, there exists a θ such that the model f makes no errors when evaluating that set of data points.

  14. Shattering Continued • The VC dimension of a model f is the maximum h such that some data point set of cardinality h can be shattered by f. • The VC dimension is useful in statistical learning theory because it yields a probabilistic upper bound on the test error of a classification model.
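
The shattering definition can be checked mechanically. The sketch below (assuming SciPy; the helper names are made up) tests whether a line sign(w·x + b) shatters a point set by solving a small linear-programming feasibility problem for every possible labelling; three non-collinear points are shattered, while four points in the "XOR" configuration are not:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Check, via an LP feasibility problem, whether some w, b satisfy
    y_i (w . x_i + b) >= 1 for all i, i.e. a line strictly separates the labelling."""
    n, d = X.shape
    # Variables: (w_1, ..., w_d, b). Constraints: -y_i (w . x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered_by_lines(X):
    """True if every +/-1 labelling of the points in X is linearly separable."""
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                  # not collinear
four_xor = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(shattered_by_lines(three))     # True
print(shattered_by_lines(four_xor))  # False: the XOR labelling is not separable
```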

  15. Upper Bound on Error • In our case an upper bound on the expected risk (the test error) is given by (Vapnik, 1995): for all δ > 0 and f ∊ F, with probability at least 1 − δ, R[f] ≤ R_emp[f] + sqrt( (h(ln(2n/h) + 1) + ln(4/δ)) / n ), where h is the VC dimension of F and n is the number of training samples.
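
A worked numeric example of the confidence term in this bound, with h, n and δ chosen arbitrarily for illustration:

```python
import numpy as np

def vc_confidence(h, n, delta):
    """Confidence term of the VC bound: sqrt((h (ln(2n/h) + 1) + ln(4/delta)) / n)."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) + np.log(4 / delta)) / n)

# e.g. hyperplanes in R^2 (h = 3), n = 1000 training points, delta = 0.05:
print(vc_confidence(h=3, n=1000, delta=0.05))   # roughly 0.16
```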

  16. Theorem: VC Dimension in R^n • The VC dimension of the set of oriented hyperplanes in R^n is n+1, since we can always choose n+1 points and then choose one of them as the origin such that the position vectors of the remaining n points are linearly independent, but we can never choose n+2 such points. (Anthony and Biggs, 1995)

  17. Structural Risk Minimization • A model that fits the training points too tightly may predict poorly on new test points; one that fits too loosely may not learn enough. • One way to avoid the overfitting dilemma is to limit the complexity of the function class F from which we choose f. • Intuition: a "simple" (e.g. linear) function that explains most of the data is preferable to a complex one (Occam's razor).

  18. Figure from "An Introduction to Kernel-Based Learning Algorithms".

  19. The Support Vector Machine: Linear Case • In a linearly separable dataset there is some choice of w and b (which represent a hyperplane) such that y_i(w·x_i + b) > 0 for all i. • Because the set of training data is finite, there is a whole family of such hyperplanes. • We would like to maximize the distance (margin) of each class's points from the separating plane. • We can scale w and b such that y_i(w·x_i + b) ≥ 1 for all i, with equality for the points closest to the hyperplane.

  20. SVM – Linear Case • Linear separating hyperplanes. The support vectors are the ones used to find the hyperplane (circled).

  21. Important observations • Only a small part of the training set is used to build the hyperplane (the support vectors). • At least one point on each side of the hyperplane achieves the equality y_i(w·x_i + b) = 1. • For two such opposite points x₊ and x₋ with minimal distance, w·(x₊ − x₋) = 2, so the distance between the two margin hyperplanes is 2/‖w‖.

  22. Reformulating as a quadratic optimization problem • This means that maximizing the margin 2/‖w‖ is the same as minimizing ½‖w‖²: minimize ½‖w‖² over w, b subject to y_i(w·x_i + b) ≥ 1, i = 1, …, n.
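
A minimal sketch, assuming scikit-learn (which the slides do not use), that approximates the hard-margin problem with a very large penalty and reads off w, b, the margin 2/‖w‖ and the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Small linearly separable toy set.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem: minimize 1/2 ||w||^2.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # the weight vector w
b = clf.intercept_[0]     # the bias b
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```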

  23. Solving the SVM • We can solve this by introducing Lagrange multipliers α_i ≥ 0 to obtain the Lagrangian L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i(w·x_i + b) − 1], which should be minimized with respect to w and b and maximized with respect to the α_i (Karush-Kuhn-Tucker conditions).

  24. Solving the SVM (cont.) • A little manipulation (setting ∂L/∂b = 0 and ∂L/∂w = 0) leads to the requirements Σ_i α_i y_i = 0 and w = Σ_i α_i y_i x_i. • Note: we expect most α_i to be zero; those which aren't correspond to the support vectors.

  25. The dual problem • Substituting these back into the Lagrangian gives the dual problem: maximize W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i·x_j) subject to α_i ≥ 0 for all i and Σ_i α_i y_i = 0.
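
A sketch of solving this dual directly on a tiny dataset. It assumes SciPy's general-purpose SLSQP solver rather than a dedicated QP solver, and the variable names are made up; it recovers w = Σ_i α_i y_i x_i and b from the support vectors:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Gram matrix of the linear kernel, weighted by the labels.
K = X @ X.T
Q = np.outer(y, y) * K

def neg_dual(alpha):
    """Negative dual objective -W(alpha), to be minimized."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,                       # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                 # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                   # non-zero alphas mark the support vectors
b = np.mean(y[sv] - X[sv] @ w)      # from y_i (w.x_i + b) = 1 and y_i in {-1, +1}
print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", b)
```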

  26. SVM - Non-linear case • The dataset is not always linearly separable!

  28. Mapping F to higher dimension • We need a mapping Φ(x) = x′ that maps x into a higher-dimensional feature space F.

  29. Mapping F to higher dimension • Pro: in many problems we can separate linearly once the feature space has a higher dimension. • Con: mapping to a higher dimension is computationally expensive, and "the curse of dimensionality" (in statistics) tells us we would need exponentially more data! • Is that really so?

  30. Mapping F to higher dimension • Statistical learning theory tells us that learning in F can be simpler if one uses low-complexity decision rules (like a linear classifier). • In short, it is not the dimensionality but the complexity of the function class that matters. • Fortunately, for some feature spaces and their mapping Φ we can use a trick!
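
One way to see the cost of working with Φ explicitly: for a polynomial kernel of degree d on N-dimensional inputs, the explicit feature space contains all monomials of degree up to d, i.e. C(N+d, d) coordinates, while evaluating the kernel itself only touches the N input coordinates. A small illustration (math.comb is standard Python; the kernel identity itself is verified in the example after slide 33):

```python
from math import comb

def poly_feature_dim(N, d):
    """Number of monomials of degree <= d in N variables,
    i.e. the dimension of the explicit feature space of the kernel (x.y + c)^d."""
    return comb(N + d, d)

for N in (2, 16, 256):
    print(N, poly_feature_dim(N, d=5))
# 2 -> 21, 16 -> 20349, 256 -> roughly 10^10: the explicit map quickly becomes
# infeasible, while k(x, y) = (x.y + c)^d still needs only an N-dimensional dot product.
```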

  31. The "Kernel Trick" • A kernel function corresponds to mapping data vectors into a higher-dimensional feature space (like the Φ we are looking for). • Some kernel functions have a special property: they can be used to calculate the scalar product in the feature space directly, without ever computing Φ.

  32. Kernel Trick Example • Given a feature map Φ, take vectors x and y in R² and see how the kernel value K(x, y) can be calculated as the dot product Φ(x)·Φ(y). For example, with Φ(x) = (x₁², √2·x₁x₂, x₂²): Φ(x)·Φ(y) = (x₁y₁ + x₂y₂)² = (x·y)² = K(x, y).

  33. Conclusion: we do not have to calculate Φ every time we compute k(x, y)! It reduces to a straightforward calculation involving only the dot product of x and y.
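
A numerical check of this identity, using the feature map Φ(x) = (x₁², √2·x₁x₂, x₂²) and the degree-2 polynomial kernel from the example above:

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, y):
    """Polynomial kernel of degree 2: (x . y)^2."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 1.0
print(k(x, y))                  # 1.0 -- same value, no explicit mapping needed
```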

  34. Moving back to SVM in the higher dimension • The Lagrangian will be L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i(w·Φ(x_i) + b) − 1]. • At the optimal point the "saddle point equations" ∂L/∂b = 0 and ∂L/∂w = 0 hold, which translate to Σ_i α_i y_i = 0 and w = Σ_i α_i y_i Φ(x_i).

  35. And the optimization problem becomes: maximize W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0, where k(x_i, x_j) = Φ(x_i)·Φ(x_j) is the kernel.

  36. The Decision Function • Solving the (dual) optimization problem leads to the non-linear decision function f(x) = sgn( Σ_i α_i y_i k(x_i, x) + b ).
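
A sketch of this decision function written against a generic kernel; the RBF kernel, the value of gamma, and the support vectors and multipliers in the usage example are made up, and in practice α and b would come from solving the dual as above:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian / RBF kernel k(x, z) = exp(-gamma ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_function(x, support_X, support_y, alpha, b, kernel=rbf_kernel):
    """f(x) = sign( sum_i alpha_i y_i k(x_i, x) + b ), summed over the support vectors."""
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, support_y, support_X))
    return np.sign(s + b)

# Usage with made-up support vectors and multipliers:
support_X = np.array([[1.0, 1.0], [-1.0, -1.0]])
support_y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
print(decision_function(np.array([0.8, 0.9]), support_X, support_y, alpha, b=0.0))  # 1.0
```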

  37. The non-separable case • Until now we considered the separable case, which is consistent with zero empirical error. • For noisy data this may not be the minimum of the expected risk (overfitting!). • Solution: use "slack variables" ξ_i ≥ 0 to relax the hard-margin constraints: y_i(w·x_i + b) ≥ 1 − ξ_i.

  38. We now also have to minimize an upper bound on the empirical risk: minimize ½‖w‖² + C Σ_i ξ_i over w, b, ξ subject to y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, where the constant C controls the trade-off between margin size and training errors.

  39. And the dual problem becomes: maximize W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0; the only change from the separable case is the upper bound C on the α_i.
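
A sketch, again assuming scikit-learn, of how the trade-off constant C behaves on noisy, overlapping data: a small C tolerates more slack (wider margin, more support vectors), while a large C penalizes margin violations heavily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping classes in R^2.
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin={margin:.2f}, support vectors={clf.n_support_.sum()}")
```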

  40. Examples of Kernel Functions • Polynomials • Gaussians • Sigmoids • Radial basis functions • …
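
Sketches of these kernels in NumPy; the degree, gamma, kappa and theta values are free hyperparameters chosen here only as defaults:

```python
import numpy as np

def polynomial_kernel(x, z, degree=3, c=1.0):
    """k(x, z) = (x . z + c)^degree"""
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma ||x - z||^2)  (a Gaussian radial basis function)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=-1.0):
    """k(x, z) = tanh(kappa x . z + theta)"""
    return np.tanh(kappa * np.dot(x, z) + theta)
```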

  41. Example of an SV classifier found using an RBF kernel k(x, x′) = exp(−‖x − x′‖²). Here the input space is X = [−1, 1]². (Taken from Bill Freeman's notes.)

  42. Part 2: Gender Classification with SVMs

  43. The Goal • Learning to classify pictures according to gender (male/female) when only the facial features appear (almost no hair).

  44. The experiment • Faces from FERET database pictures were processed to be consistent with the requirements of the experiment.

  45. The experiment • SVM performance was compared with: • Linear classifier • Quadratic classifier • Fisher Linear Discriminant • Nearest Neighbor

  46. The experiment (cont.) • The experiment was conducted on two sets of data, high- and low-resolution versions of the same pictures, and a performance comparison was made. • The goal was to learn the minimal data required for a classifier to classify gender. • The performance of 30 humans was also measured for comparison. • The data: 1755 pictures, 711 females and 1044 males.

  47. Training Data • 80-by-40-pixel images for the "high resolution" set. • 21-by-12-pixel images for the thumbnails. • Each classifier was estimated with 5-fold cross-validation (4/5 training and 1/5 testing).
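
A sketch of this 5-fold protocol, assuming scikit-learn; random vectors stand in for the flattened 21-by-12 thumbnails, so the printed accuracies are meaningless and only the mechanics are shown:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for flattened images: 21x12 thumbnails -> 252-dimensional vectors.
X = rng.normal(size=(200, 21 * 12))
y = rng.choice([-1, 1], size=200)        # gender labels would go here

clf = SVC(kernel="rbf", gamma="scale", C=1.0)
scores = cross_val_score(clf, X, y, cv=5)    # 5-fold: train on 4/5, test on 1/5
print("fold accuracies:", np.round(scores, 3))
print("mean accuracy:", scores.mean())
```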

  48. Support Faces

  49. Results on Thumbnails

  50. Human Error Rate
