
A tutorial about SVM



  1. A tutorial about SVM Omer Boehm omerb@il.ibm.com

  2. Outline • Introduction • Classification • Perceptron • SVM for linearly separable data. • SVM for almost linearly separable data. • SVM for non-linearly separable data.

  3. Introduction • Machine learning is a branch of artificial intelligence: a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. • An important task of machine learning is classification. • Classification is also referred to as pattern recognition.

  4. Example (diagram): Objects → Learning Machine → Classes

  5. Types of learning problems • Supervised learning (n classes, n > 1) • Classification • Regression • Unsupervised learning (no class labels) • Clustering (building equivalence classes) • Density estimation

  6. Supervised learning • Regression • Learn a continuous function from input samples • Stock prediction • Input – future date • Output – stock price • Training – information on stock price over the last period • Classification • Learn a separation function from discrete inputs to classes • Optical Character Recognition (OCR) • Input – images of digits • Output – labeling 0-9 • Training – labeled images of digits • In fact, these are both approximation problems

  7. Regression

  8. Classification

  9. Density estimation

  10. What makes learning difficult? Given the following examples, how should we draw the line?

  11. What makes learning difficult? Which one is most appropriate?

  12. What makes learning difficult? The hidden test points

  13. What is Learning (mathematically)? • We would like to ensure that small changes in an input point, relative to a training point, will not result in a jump to a different classification • Such an approximation is called a stable approximation • As a rule of thumb, small derivatives ensure a stable approximation

  14. Stable vs. Unstable approximation • Lagrange interpolation (unstable): given points $(x_i, y_i)$, $i = 0, \dots, n$, we find the unique polynomial $p_n(x)$ of degree $n$ that passes through the given points • Spline approximation (stable): given points $(x_i, y_i)$, $i = 0, \dots, n$, we find a piecewise approximation by third-degree polynomials such that they pass through the given points, have common tangents at the division points and, in addition, have common second derivatives there
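
To make the stability contrast concrete, here is a minimal sketch (toy data and perturbation chosen for illustration, not taken from the slides) that interpolates the same points with a single high-degree polynomial and with a cubic spline, then perturbs one sample slightly; the polynomial typically shifts noticeably more between the sample points than the spline does:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Illustrative sample points: Runge's function on an equally spaced grid.
x = np.linspace(-1.0, 1.0, 11)
y = 1.0 / (1.0 + 25.0 * x**2)
y_perturbed = y.copy()
y_perturbed[5] += 0.01                 # a small change in one training point

grid = np.linspace(-1.0, 1.0, 501)

def poly_interp(xs, ys, t):
    # Interpolating polynomial through all points (Lagrange-style), via polyfit.
    coeffs = np.polyfit(xs, ys, deg=len(xs) - 1)
    return np.polyval(coeffs, t)

poly_shift = np.max(np.abs(poly_interp(x, y, grid) - poly_interp(x, y_perturbed, grid)))
spline_shift = np.max(np.abs(CubicSpline(x, y)(grid) - CubicSpline(x, y_perturbed)(grid)))

print("max change, interpolating polynomial:", poly_shift)
print("max change, cubic spline:            ", spline_shift)
```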

  15. What would be the best choice? • The “simplest” solution • A solution where the distance from each example is as small as possible and where the derivative is as small as possible

  16. Vector Geometry – just in case…

  17. Dot product • The dot product of two vectors $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$ is defined as: $a \cdot b = \sum_{i=1}^{n} a_i b_i$ • An example: $(1, 2) \cdot (3, 4) = 1 \cdot 3 + 2 \cdot 4 = 11$

  18. Dot product • Geometrically, $a \cdot b = \|a\| \, \|b\| \cos\theta$, where $\|a\|$ denotes the length (magnitude) of $a$ and $\theta$ is the angle between the two vectors • Unit vector: $\hat{a} = \frac{a}{\|a\|}$, a vector of length 1 in the direction of $a$
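
Both forms of the dot product, and the unit-vector normalization, are easy to verify numerically; a tiny sketch with made-up vectors (not from the slides):

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, 0.0, 4.0])

dot = np.dot(a, b)                        # coordinate form: sum_i a_i * b_i
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
a_hat = a / np.linalg.norm(a)             # unit vector in the direction of a

print(dot)                                                 # 11.0
print(np.linalg.norm(a) * np.linalg.norm(b) * cos_theta)   # geometric form, same value
print(np.linalg.norm(a_hat))                               # 1.0
```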

  19. Plane/Hyperplane • A hyperplane can be defined by: • Three points • Two vectors • A normal vector and a point

  20. Plane/Hyperplane • Let $w$ be a vector perpendicular to the hyperplane H • Let $p$ be the position vector of some known point in the plane. A point P with position vector $x$ is in the plane iff the vector drawn from $p$ to $x$ is perpendicular to $w$ • Two vectors are perpendicular iff their dot product is zero • The hyperplane H can be expressed as $w \cdot (x - p) = 0$, i.e. $w \cdot x + b = 0$ with $b = -w \cdot p$
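
A quick numerical check of the normal-and-point characterization (the vectors below are chosen only for illustration, they are not from the slides):

```python
import numpy as np

w = np.array([1.0, -2.0, 2.0])       # normal vector to the hyperplane H
p = np.array([0.0, 0.0, 1.0])        # position vector of a known point in H
b = -np.dot(w, p)                    # so H is the set of x with w . x + b = 0

def on_hyperplane(x):
    # x is in H iff the vector drawn from p to x is perpendicular to w.
    return np.isclose(np.dot(w, x - p), 0.0)

print(on_hyperplane(np.array([2.0, 1.0, 1.0])))   # True:  w . (x - p) = 0
print(on_hyperplane(np.array([1.0, 1.0, 1.0])))   # False: w . (x - p) = -1
print(np.dot(w, np.array([2.0, 1.0, 1.0])) + b)   # 0.0, the w . x + b = 0 form
```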

  21. Classification

  22. Solving approximation problems • First we define the family of approximating functions $F$ • Next we define the cost function $C(f)$; this function tells how well $f \in F$ performs the required approximation • With these in place, the approximation/classification consists of solving the minimization problem $\min_{f \in F} C(f)$ • A first necessary condition (after Fermat) is $C'(f) = 0$ • As we know, it is always possible to apply Newton-Raphson to this condition and obtain a sequence of approximations $f_{k+1} = f_k - \frac{C'(f_k)}{C''(f_k)}$
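
As an illustration of the Newton-Raphson step mentioned above, here is a minimal one-dimensional sketch; the cost function is hypothetical and only serves to show the iteration, it is not taken from the slides:

```python
# Minimize a toy cost C(f) = (f - 3)^2 + 1 by driving its derivative to zero.
def newton_minimize(dC, d2C, f0, iters=20, tol=1e-10):
    f = f0
    for _ in range(iters):
        step = dC(f) / d2C(f)       # Newton-Raphson step applied to C'(f) = 0
        f -= step
        if abs(step) < tol:
            break
    return f

dC = lambda f: 2.0 * (f - 3.0)      # first derivative of the cost
d2C = lambda f: 2.0                 # second derivative (constant here)
print(newton_minimize(dC, d2C, f0=0.0))   # converges to the minimizer f = 3
```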

  23. Classification • A classifier is a function or an algorithm that maps every possible input (from a legal set of inputs) to a finite set of categories • $X$ is the input space; $x \in X$ is a data point from the input space • A typical input space is high-dimensional, for example $X \subseteq \mathbb{R}^n$; such an $x$ is also called a feature vector • $\Omega$ is a finite set of categories to which the input data points belong: $\Omega = \{1, 2, \dots, C\}$ • $\omega \in \Omega$ are called labels

  24. Classification • $Y$ is a finite set of decisions – the output set of the classifier • The classifier is a function $f : X \rightarrow Y$

  25. The Perceptron

  26. Perceptron - Frank Rosenblatt (1957) • Linear separation of the input space

  27. Perceptron algorithm • Start: the weight vector $w_0$ is generated randomly; set $k = 0$ • Test: a vector $x$ is selected randomly; if $x \in$ class 1 and $w_k \cdot x > 0$ go to Test, if $x \in$ class 1 and $w_k \cdot x \le 0$ go to Add, if $x \in$ class 2 and $w_k \cdot x < 0$ go to Test, if $x \in$ class 2 and $w_k \cdot x \ge 0$ go to Subtract • Add: $w_{k+1} = w_k + x$, $k = k + 1$, go to Test • Subtract: $w_{k+1} = w_k - x$, $k = k + 1$, go to Test

  28. Perceptron algorithm – shorter version • With labels $y_i \in \{-1, +1\}$, the update rule for iteration $k+1$ (one iteration per data point) is: if $y_i (w_k \cdot x_i) \le 0$ then $w_{k+1} = w_k + \eta \, y_i x_i$, otherwise $w_{k+1} = w_k$
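
The update rule above translates almost directly into code. The following sketch (NumPy; the learning rate, epoch limit and toy data are my own choices, not from the slides) implements it for a small linearly separable set:

```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=100):
    """Train a perceptron; X is (n_samples, n_features), y has labels in {-1, +1}."""
    # Append a constant 1 to every sample so the bias is learned as an extra weight.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or exactly on the boundary)
                w += eta * yi * xi        # the update rule from the slide
                mistakes += 1
        if mistakes == 0:                 # converged: every point classified correctly
            break
    return w

# Toy linearly separable data (illustrative only).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))   # should reproduce y
```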

  29. Perceptron – visualization (intuition)

  30. Perceptron – visualization (intuition)

  31. Perceptron – visualization (intuition)

  32. Perceptron – visualization (intuition)

  33. Perceptron – visualization (intuition)

  34. Perceptron - analysis • The solution is a linear combination of training points: $w = \sum_i \alpha_i y_i x_i$ with $\alpha_i \ge 0$ • Only uses informative points (mistake driven) • The coefficient $\alpha_i$ of a point reflects its 'difficulty' • The perceptron learning algorithm does not terminate if the learning set is not linearly separable (e.g. XOR)
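
To make the 'linear combination of training points' view concrete, here is a minimal sketch of the dual form of the same algorithm (my own illustrative code, not from the slides): the weight vector is never stored, only a coefficient per training point that grows each time that point is misclassified.

```python
import numpy as np

def perceptron_dual(X, y, epochs=100):
    """Dual-form perceptron: keep only the per-point mistake counts alpha_i."""
    n = X.shape[0]
    alpha = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            # Prediction uses w . x_i = sum_j alpha_j * y_j * (x_j . x_i)
            score = np.sum(alpha * y * (X @ X[i])) + b
            if y[i] * score <= 0:
                alpha[i] += 1.0       # point i was 'difficult': raise its coefficient
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = perceptron_dual(X, y)
print(alpha)   # larger alpha_i -> the point caused more mistakes during training
```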

  35. Support Vector Machines

  36. Advantages of SVM (Vladimir Vapnik, 1979, 1998) • Exhibit good generalization • Can implement confidence measures, etc. • Hypothesis has an explicit dependence on the data (via the support vectors) • Learning involves optimization of a convex function (no false minima, unlike NN) • Few parameters required for tuning the learning machine (unlike NN, where the architecture and various parameters must be found)

  37. Advantages of SVM • From the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error • These generalization bounds have two important features:

  38. Advantages of SVM • The upper bound on the generalization error does not depend on the dimensionality of the space. • The bound is minimized by maximizing the margin, i.e. the minimal distance between the hyperplane separating the two classes and the closest data-points of each class.

  39. Basic scenario - Separable data set

  40. Basic scenario – define margin

  41. In an arbitrary-dimensional space, a separating hyperplane can be written as: $w \cdot x + b = 0$ • where $w$ is the normal • The decision function would be: $f(x) = \operatorname{sign}(w \cdot x + b)$

  42. Note that the argument of $\operatorname{sign}(w \cdot x + b)$ is invariant under a rescaling of the form $w \rightarrow \lambda w$, $b \rightarrow \lambda b$ • Implicitly, the scale can be fixed by requiring $|w \cdot x_i + b| = 1$ for the training points closest to the hyperplane – the support vectors – giving the canonical hyperplanes $w \cdot x + b = \pm 1$

  43. The task is to select $w, b$ so that the training data can be described as: $w \cdot x_i + b \ge +1$ for $y_i = +1$, $w \cdot x_i + b \le -1$ for $y_i = -1$ • These can be combined into: $y_i (w \cdot x_i + b) \ge 1$

  44. The margin will be given by the projection of the vector $(x_1 - x_2)$, drawn between the closest points of the two classes, onto the unit normal to the hyperplane, i.e. onto $\frac{w}{\|w\|}$. So the (Euclidean) distance can be formed as $d = \frac{w \cdot (x_1 - x_2)}{\|w\|}$

  45. Note that $x_1$ lies on $w \cdot x + b = +1$, i.e. $w \cdot x_1 + b = +1$ • Similarly for $x_2$: $w \cdot x_2 + b = -1$ • Subtracting the two results in $w \cdot (x_1 - x_2) = 2$

  46. The margin can be put as $m = \frac{w \cdot (x_1 - x_2)}{\|w\|} = \frac{2}{\|w\|}$ • Maximizing the margin is therefore equivalent to the problem: minimize $J(w) = \frac{1}{2}\|w\|^2$ subject to the constraints $y_i (w \cdot x_i + b) \ge 1$ for all $i$ • $J(w)$ is a quadratic function, thus there is a single global minimum
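
The quadratic program above is small enough to hand to a general-purpose solver. The sketch below uses scipy.optimize.minimize with SLSQP on toy data (both the data and the variable packing are my own choices for illustration); dedicated SVM solvers such as the SMO algorithm behind libsvm/scikit-learn are what one would use in practice:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative only); labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

def objective(v):
    w = v[:d]                        # v packs (w, b) into a single vector
    return 0.5 * np.dot(w, w)        # J(w) = 1/2 ||w||^2

def margin_constraints(v):
    w, b = v[:d], v[d]
    return y * (X @ w + b) - 1.0     # must be >= 0:  y_i (w . x_i + b) >= 1

res = minimize(objective,
               x0=np.zeros(d + 1),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")

w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```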

  47. Lagrange multipliers • Problem definition: maximize $f(x, y)$ subject to $g(x, y) = c$ • A new variable $\lambda$, called a 'Lagrange multiplier', is used to define $\Lambda(x, y, \lambda) = f(x, y) + \lambda \bigl(g(x, y) - c\bigr)$
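
A small worked instance (my own example, not from the slides): maximize $f(x, y) = x + y$ subject to $x^2 + y^2 = 1$. Setting all partial derivatives of the Lagrangian to zero and solving can be done symbolically, for example with SymPy:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)

f = x + y                     # objective to maximize
g = x**2 + y**2 - 1           # constraint written as g(x, y) = 0

L = f + lam * g               # the Lagrangian  Lambda(x, y, lambda)

# Stationary points: all partial derivatives of the Lagrangian vanish.
solutions = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
for sol in solutions:
    print(sol, " f =", f.subs(sol))
# The maximum is attained at x = y = 1/sqrt(2), where f = sqrt(2).
```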

  48. Lagrange multipliers

  49. Lagrange multipliers
