
Support Vector Machines and ML on Documents

A comprehensive overview of Support Vector Machines (SVM) and machine learning techniques for document classification. Topics covered include linear and non-linear variants, separating hyperplanes, vector math, discriminant functions, and soft-margin classifiers.


Presentation Transcript


  1. Support Vector Machines and ML on Documents Sampath Jayarathna, Cal Poly Pomona. Credit for some of the slides in this lecture goes to Prof. Mooney at UT Austin and Andrew Moore at CMU.

  2. Recap of last week • Rocchio classification • Finds the “center of mass” for each class! • A new point x will be classified in class c if it is closest to the centroid for c
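To make the recap concrete, here is a minimal NumPy sketch of Rocchio-style nearest-centroid classification; the toy points, labels, and function names are illustrative rather than taken from the lecture.

```python
import numpy as np

def rocchio_fit(X, y):
    """Compute one centroid ("center of mass") per class."""
    classes = np.unique(y)
    return {c: X[y == c].mean(axis=0) for c in classes}

def rocchio_predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Toy example: two 2-D classes
X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 4.5]])
y = np.array([0, 0, 1, 1])
centroids = rocchio_fit(X, y)
print(rocchio_predict(centroids, np.array([1.2, 1.5])))  # -> 0
```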

  3. Recap of last week • k-NN adopts a different approach • It relies on local decisions based on the closest neighbors! • k = number of neighbors to consider
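For comparison, a similarly minimal k-NN sketch; the toy data and the knn_predict helper are made up for illustration.

```python
from collections import Counter
import numpy as np

def knn_predict(X, y, x, k=3):
    """Majority vote among the k training points closest to x."""
    dists = np.linalg.norm(X - x, axis=1)   # distance from x to every training point
    nearest = y[np.argsort(dists)[:k]]      # labels of the k closest neighbors
    return Counter(nearest).most_common(1)[0][0]

X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 4.5]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([5.5, 4.0]), k=3))  # -> 1
```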

  4. Linear vs Non-linear classification

  5. Support Vector Machines • State-of-the-art classification method! • Both linear and non-linear variants! • Non-parametric & discriminative! • Good generalization properties (quite resistant to overfitting) • Extensions for multiclass classification, structured prediction, regression

  6. What is a separating hyperplane? • A hyperplane is a generalization of a plane. • in one dimension, a hyperplane is called a point • in two dimensions, it is a line • in three dimensions, it is a plane • in more dimensions you can call it a hyperplane

  7. Some Vector Math! • The magnitude or length of a vector u is written ∥u∥ and is called its norm. It can easily be calculated from the Pythagorean theorem. • The direction is the second component of a vector. • The direction of a vector u(u1, u2) is the vector w(u1/∥u∥, u2/∥u∥) • The direction of the point u(3, 4) is the vector w(0.6, 0.8), whose norm is 1
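A tiny sketch reproducing the slide's numbers: the norm of u = (3, 4) is 5 by Pythagoras, and dividing u by its norm gives the unit-length direction (0.6, 0.8).

```python
import numpy as np

u = np.array([3.0, 4.0])
norm = np.linalg.norm(u)        # sqrt(3^2 + 4^2) = 5.0 (Pythagoras)
direction = u / norm            # (0.6, 0.8)
print(norm, direction, np.linalg.norm(direction))  # 5.0 [0.6 0.8] 1.0
```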

  8. Some Vector Math! • You probably learnt that the equation of a line is y = ax + b. • The equation of a hyperplane is defined by: wT x + b = 0

  9. Discriminant Function • A discriminant function can be an arbitrary function of x, for example: nearest neighbor, decision trees and other nonlinear functions, or linear functions

  10. Linear Discriminant Function (figure: points labeled +1 and -1 in the x1-x2 plane, with a candidate boundary drawn) • How would you classify these points using a linear discriminant function in order to minimize the error rate? • Infinite number of answers!

  11. Linear Discriminant Function (the same points with another candidate boundary) • Infinite number of answers!

  12. Linear Discriminant Function (the same points with yet another candidate boundary) • Infinite number of answers!

  13. Linear Discriminant Function (the same points with several candidate boundaries) • Infinite number of answers! • Which one is the best?

  14. Hard-Margin Linear Classifier (figure: points labeled +1 and -1 in the x1-x2 plane, with the margin shown as a “safe zone” around the boundary) • The linear discriminant function (classifier) with the maximum margin is the best • Margin is defined as the width that the boundary could be increased by before hitting a data point • Why is it the best? • It is robust to outliers and thus has strong generalization ability

  15. Hard-Margin Linear Classifier (figure: points labeled +1 and -1 in the x1-x2 plane) • Given a set of data points {(xi, yi)}, i = 1, ..., n, where yi = +1 for positive examples and yi = -1 for negative examples, we want a hyperplane that separates them: wT xi + b > 0 when yi = +1 and wT xi + b < 0 when yi = -1

  16. Hard-Margin Linear Classifier: Support Vectors (figure: the support vectors x+, x+ and x- lie on the margin boundaries in the x1-x2 plane) • We know that wT x+ + b = 1 and wT x- + b = -1, while wT x + b = 0 on the separating hyperplane • Subtracting the first two equations gives wT(x+ - x-) = 2, so the margin width is ρ = (x+ - x-) · w/∥w∥ = 2/∥w∥ • http://www.svm-tutorial.com/2015/06/svm-understanding-math-part-3/
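A short illustration of the margin formula ρ = 2/∥w∥; the weight vector and offset below are made-up values, not parameters learned from data.

```python
import numpy as np

# Hypothetical separating hyperplane w^T x + b = 0
w = np.array([2.0, 1.0])
b = -3.0

margin = 2.0 / np.linalg.norm(w)   # distance between w^T x + b = +1 and w^T x + b = -1
print(margin)                       # ~0.894
```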

  17. Hard-Margin Linear Classifier (figure: the margin between wT x + b = 1 and wT x + b = -1, with the separating hyperplane wT x + b = 0 in between) • Formulation: maximize the margin 2/∥w∥ such that yi(wT xi + b) ≥ 1 for every training point (xi, yi)

  18. Demo of LibSVM • LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ • SVM Simple Explanation: https://www.youtube.com/watch?v=5zRmhOUjjGY
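Since the slide only points at the LibSVM demo, here is a hedged stand-in using scikit-learn's SVC, which wraps LibSVM; the toy data is the same illustrative set used above, and a large C is used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 4.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # very large C: behaves like a hard-margin SVM
clf.fit(X, y)
print(clf.support_vectors_)           # the points that define the hyperplane
print(clf.predict([[2.0, 2.0], [5.5, 4.0]]))  # -> [-1  1]
```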

  19. Hard-Margin Linear Classifier • Since we want to maximize the gap, we need to minimize ∥w∥ • Or, equivalently, minimize ½∥w∥² • The factor ½ is convenient for taking the derivative later.

  20. Soft-Margin Linear Classifier (figure: the lines wT x + b = 1, wT x + b = 0 and wT x + b = -1 in the x1-x2 plane, with points labeled +1 and -1) • What if the data is not linearly separable? (noisy data, outliers, or the data is slightly non-linear) • Approach: assign a “slack variable” ξi ≥ 0 to each instance, which can be thought of as the distance from the separating hyperplane if an instance is misclassified, and 0 otherwise

  21. Soft-Margin Linear Classifier • When C is very large, the soft-margin SVM is equivalent to the hard-margin SVM; • When C is very small, we admit more misclassifications in the training data in exchange for a w vector with a small norm (a wider margin);
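A rough sketch of the trade-off the slide describes, assuming scikit-learn's SVC and two overlapping Gaussian blobs as stand-in data: small C tolerates training errors and keeps ∥w∥ small (a wide margin), while large C behaves more like the hard-margin classifier.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1.5, rng.randn(50, 2) + 1.5])  # two overlapping blobs
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: ||w||={np.linalg.norm(w):.2f}, "
          f"train accuracy={clf.score(X, y):.2f}")
# Small C -> small ||w|| (wide margin, more training errors tolerated);
# large C -> closer to a hard-margin fit on the training data.
```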

  22. Linear SVMs: Summary • The classifier is a separating hyperplane. • The most “important” training points are the support vectors; they define the hyperplane. • Quadratic optimization algorithms can identify which training points are support vectors with non-zero Lagrangian multipliers. • Given a new point x, we can score its projection onto the hyperplane normal: f(x) = wT x + b, and classify by the sign of the score.
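A minimal illustration of scoring by projection onto the hyperplane normal; w and b below are hypothetical learned parameters, not values from the lecture.

```python
import numpy as np

# Hypothetical learned hyperplane parameters
w = np.array([0.5, 0.5])
b = -2.0

def score(x):
    """f(x) = w^T x + b: projection onto the hyperplane normal, plus offset."""
    return w @ x + b

print(score(np.array([1.0, 1.0])))   # -1.0 -> classify as -1
print(score(np.array([5.0, 5.0])))   #  3.0 -> classify as +1
```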

  23. Non-linear SVMs (figure: 1-D datasets along the x axis, one that is easy to separate and one that is not) • Datasets that are linearly separable with noise work out great • But what are we going to do if the dataset is just too hard? • How about mapping the data to a higher-dimensional space?

  24. Non-linear SVMs: Feature Space • General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

  25. Non-linear SVMs: Feature Space • If a linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found; • The feature space is constructed via a very clever mathematical projection (the “kernel trick”)

  26. Popular kernels • A kernel is a dot product in some feature space: K(xi, xj) = φ(xi)T φ(xj) • Why use kernels? • Make non-separable problems separable. • Map data into a better representational space
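One common way to see that a kernel is a dot product in some feature space: the quadratic kernel K(x, z) = (x·z)² matches the explicit map φ(x) = (x1², √2·x1·x2, x2²). This particular kernel and map are just an illustrative example, not the only choice.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel in 2-D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def quad_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever building phi."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(quad_kernel(x, z))   # 121.0
print(phi(x) @ phi(z))     # 121.0, the same dot product computed in feature space
```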

  27. Doing multi-class classification • SVMs can only handle two-class outputs (i.e. a categorical output variable with two values). • How to handle multiple classes? • E.g., classify documents into three categories: sports, business, politics • Answer: one-vs-all, learn N SVMs • SVM 1 learns “Output==1” vs “Output != 1” • SVM 2 learns “Output==2” vs “Output != 2” • : • SVM N learns “Output==N” vs “Output != N”
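A sketch of the one-vs-all scheme, assuming scikit-learn's SVC for the binary learners and a made-up three-class toy problem; each binary SVM learns "class c" vs "not class c", and the class with the highest decision score wins.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 3-class problem (e.g. sports=0, business=1, politics=2)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2) + c for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 30)

# One-vs-all: train one binary SVM per class ("class c" vs "not class c")
models = [SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in range(3)]

def predict(x):
    # Pick the class whose SVM gives the highest decision score
    scores = [m.decision_function([x])[0] for m in models]
    return int(np.argmax(scores))

print(predict([4.0, 0.5]))   # -> 1
```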

  28. Generalization and overfitting • Generalization: A classifier or a regression algorithm learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples. • Overfitting: A classifier or a regression algorithm learns to correctly predict output from given inputs in previously seen samples but fails to do so in previously unseen samples. • Overfitting implies poor generalization.

  29. Example of Generalization and overfitting • Algorithm 1 learned non-reproducible peculiarities of the specific sample available for learning but did not learn the general characteristics of the function that generated the data. Thus, it is overfitted and has poor generalization. • Algorithm 2 learned general characteristics of the function that produced the data. Thus, it generalizes.

  30. “Loss + penalty” to avoid overfitting • Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem: Minimize (Loss + λ · Penalty) • Loss measures the error of fitting the data • Penalty penalizes the complexity of the learned function • λ is a regularization parameter that balances Loss and Penalty
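For a linear SVM, this objective can be written with the hinge loss as the Loss term and ∥w∥² as the Penalty. The sketch below just evaluates that objective for made-up parameters and data; lam plays the role of λ.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Loss + lambda * Penalty for a linear SVM with hinge loss."""
    margins = y * (X @ w + b)
    loss = np.maximum(0.0, 1.0 - margins).sum()   # hinge loss: error of fitting the data
    penalty = w @ w                                # ||w||^2: complexity of the function
    return loss + lam * penalty

X = np.array([[1.0, 1.0], [5.0, 5.0]])
y = np.array([-1, 1])
print(svm_objective(np.array([0.5, 0.5]), -3.0, X, y, lam=0.1))  # 0.05
```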

  31. Classification: Practical aspects • Assembling data resources is often the real bottleneck in classification • Collect, store, organize, quality-check the data • Financial and legal aspects (ownership, privacy) • ML-based classifiers must sometimes be overlaid with hand-crafted rules! • To enforce particular business rules, or allow the user to control the classification

  32. Classification: Practical aspects • Which features to use? • Designing the right features is often key to success • If too few features: not informative enough • If too many features: data sparseness • In text classification, the most basic features are the document terms: • But preprocessing is important to filter/modify some tokens • Other features, such as document length, zones, links, etc.
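A hedged end-to-end sketch of the document-classification setup the slide describes, assuming scikit-learn: term features with basic preprocessing (tokenization, lowercasing, stop-word removal) feeding a linear SVM. The documents and labels are invented; other features such as document length, zones, or links could be appended to the feature matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = [
    "the match ended with a late goal",        # sports
    "shares fell after the earnings report",   # business
    "parliament passed the new budget bill",   # politics
    "the striker scored twice in the final",   # sports
]
labels = ["sports", "business", "politics", "sports"]

# TfidfVectorizer handles tokenization, lowercasing and stop-word filtering;
# LinearSVC trains one-vs-rest linear SVMs on the resulting term features.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["the goal in the final match"]))
```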
