
SVM and SVR as Convex Optimization Techniques


Presentation Transcript


  1. SVM and SVR as Convex Optimization Techniques Mohammed Nasser Department of Statistics Rajshahi University Rajshahi 6205

  2. Acknowledgement Andrew W. Moore Professor School of Computer Science Carnegie Mellon University Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department of Statistical Science, Graduate University for Advanced Studies • Georgi Nalbantov • Econometric Institute, School of Economics, Erasmus University Rotterdam

  3. Contents • Glimpses of Historical Development • Optimal Separating Hyperplane • Soft Margin Support Vector Machine • Support Vector Regression • Convex Optimization • Use of Lagrange and Duality Theory • Example • Conclusion

  4. Early History • In 1900 Karl Pearson published his famous article on goodness of fit, judged one of the twelve best scientific articles of the twentieth century. • In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the properties that • A solution exists • The solution is unique • The solution depends continuously on the data, in some reasonable topology • (a Well-Posed Problem)

  5. Early History • In 1940 Fréchet, a PhD student of Hadamard, sharply criticized the mean and standard deviation as measures of location and scale respectively. He expressed his belief in a different development of statistics, but without proposing any alternative. • During the sixties and seventies Tukey, Huber and Hampel tried to develop Robust Statistics in order to remove the ill-posedness of classical statistics. • Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the model. • The onslaught of data mining and the problems of non-linearity and non-vectorial data have made robust statistics somewhat less attractive. Let us see what kernel methods (KM) present.

  6. Recent History • Support Vector Machines (SVM), introduced at COLT-92 (Conference on Learning Theory), have been greatly developed since then. • Result: a class of algorithms for Pattern Recognition (Kernel Machines) • Now: a large and diverse community, from machine learning, optimization, statistics, neural networks, functional analysis, etc. • Centralized website: www.kernel-machines.org • First textbook (2000): see www.support-vector.net • Now (2012): at least twenty books of different tastes are available on the international market. • The book “The Elements of Statistical Learning” (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.

  7. Kernel methods: Heuristic View What is the common characteristic (structure) among the following statistical methods? 1. Principal Components Analysis 2. (Ridge) regression 3. Fisher discriminant analysis 4. Canonical correlation analysis 5. Singular value decomposition 6. Independent component analysis In all of them we consider linear combinations of the input vector, f(x) = wTx, and we make use of the concepts of length and dot product available in Euclidean space.

  8. Kernel methods: Heuristic View Linear learning typically has nice properties: unique optimal solutions, fast learning algorithms and better statistical analysis. But it has one big problem: insufficient capacity. That means that in many data sets it fails to detect non-linear relationships among the variables. • The other demerit: it cannot handle non-vectorial data.

  9. Kernel Methods • Outlier detection • Test of independence • Data depth function • Test of equality of distributions • More…

  10. Kernel methods: Heuristic View In Classical Multivariate Analysis we consider linear combinations of the input vector, f(x) = wTx, and make use of the concepts of length and dot product available in Euclidean space. In Modern Multivariate Analysis we consider linear combinations of the feature vector, f(x) = ⟨w, Φ(x)⟩, and make use of the concepts of length and dot product/inner product available in Euclidean/non-Euclidean space.
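Written out as display equations (the kernel expansion below is a standard identity for kernel methods, added here for clarity rather than taken verbatim from the slide):

$$f_{\text{classical}}(x) = w^{\top} x, \qquad f_{\text{modern}}(x) = \langle w, \Phi(x) \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i \, k(x_i, x), \qquad k(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}},$$

so the modern methods only ever need the kernel values k(xi, x), never the feature map Φ itself.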

  11. Some Review of College Geometry [Figure: the line y + x − 1 = 0, with normal direction (1, 1) meeting it at 90°, divides the plane into the signed regions y + x − 1 > 0 and y + x − 1 < 0; the scaled equation ky + kx − k = 0 describes the same line, but k has a different effect on the two signed regions.]

  12. Some Review of College Geometry: In General Form [Figure: the hyperplane wx + b = 0, with normal vector w meeting it at 90°, divides the space into the signed regions wx + b > 0 and wx + b < 0; the scaled equation kwx + kb = 0 describes the same hyperplane, but k has a different effect on the two signed regions.]

  13. Some Review of College Geometry: In General Form [Figure: the hyperplane wx + b = 0 with its normal vector w at 90°; one panel shows the effect of a change in b (the hyperplane is translated), the other the effect of a change in w (the hyperplane is rotated and its normal rescaled).]
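The quantity this geometric review is building toward (a standard fact, not spelled out on the slide) is the signed distance of a point x0 from the hyperplane:

$$d\big(x_0, \{x : w^{\top}x + b = 0\}\big) = \frac{w^{\top}x_0 + b}{\lVert w \rVert},$$

so rescaling (w, b) to (kw, kb) with k > 0 changes the value wᵀx0 + b but leaves the distance, and hence the geometry, unchanged.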

  14. Linear Kernel Let k(x, y) = ⟨x, y⟩, x, y ∈ Rd. Its RKHS is H = { fw : fw(x) = ⟨w, x⟩, w ∈ Rd }. It can be shown that ‖fw‖H = ‖w‖.

  15. Linearly Separable Classes

  16. Linear Classifiers f(x, w, b) = sign(wx + b) [Figure: input x fed to the classifier f producing the estimate yest; points labelled +1 and −1 with a candidate separating line wx + b = 0 and the half-planes wx + b > 0 and wx + b < 0.] How would you classify this data?
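As a minimal sketch of this decision rule (not part of the original slides; the weights below are arbitrary placeholders chosen to match the line y + x − 1 = 0 from the geometry review):

```python
import numpy as np

def linear_classifier(x, w, b):
    """Predict the label sign(w·x + b): +1 on one side of the hyperplane, -1 on the other."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Arbitrary placeholder parameters: the separating line x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0

print(linear_classifier(np.array([2.0, 2.0]), w, b))  # lies in the w·x + b > 0 region -> +1
print(linear_classifier(np.array([0.0, 0.0]), w, b))  # lies in the w·x + b < 0 region -> -1
```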

  17. Linear Classifiers f(x, w, b) = sign(wx + b) [Figure: the same labelled points with a different candidate separating line.] How would you classify this data?

  18. Linear Classifiers f(x, w, b) = sign(wx + b) [Figure: the same labelled points with yet another candidate separating line.] How would you classify this data?

  19. Linear Classifiers f(x, w, b) = sign(wx + b) [Figure: several candidate separating lines, all of which separate the data.] Any of these would be fine .. but which is the best?

  20. Linear Classifiers f(x, w, b) = sign(wx + b) [Figure: a poorly chosen separating line; one point is misclassified to the +1 class.] How would you classify this data?

  21. Classifier Margin f(x, w, b) = sign(wx + b) [Figure: two copies of the labelled data, each with a separating line and the band around it shaded.] Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

  22. Geometric margin versus functional margin
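For a training point (xi, yi) and classifier (w, b), the two notions compared here are the standard definitions (added for completeness):

$$\hat{\gamma}_i = y_i\,(w^{\top}x_i + b) \quad \text{(functional margin)}, \qquad \gamma_i = \frac{y_i\,(w^{\top}x_i + b)}{\lVert w \rVert} \quad \text{(geometric margin)}.$$

The functional margin can be inflated by rescaling (w, b); the geometric margin is invariant to such rescaling, which is why the maximum-margin formulation below fixes the functional margin of the closest points to 1.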

  23. Maximum Margin (Linear SVM) f(x, w, b) = sign(wx + b) The maximum margin linear classifier is the linear classifier with the maximum margin; this is the simplest kind of SVM (called an LSVM). Support vectors are those datapoints that the margin pushes up against. • Maximizing the margin is good according to intuition and PAC theory. • It implies that only the support vectors are important; the other training examples are ignorable. • Empirically it works very well. [Figure: the maximum-margin separating line with the support vectors lying on the margin boundaries.]

  24. Linear SVM Mathematically Our Goal • 1) Correctly classify all training data: wTxi + b ≥ +1 if yi = +1, wTxi + b ≤ −1 if yi = −1, i.e. yi(wTxi + b) ≥ 1 for all i. • 2) Maximize the margin 2/‖w‖, which is the same as minimizing ½ wTw.

  25. Linear SVM Mathematically We can formulate a Quadratic Optimization Problem and solve for w and b: • Minimize Φ(w) = ½ wTw, a strictly convex quadratic function, • subject to the linear inequality constraints yi(wTxi + b) ≥ 1 for all i.
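A minimal sketch of solving this quadratic program numerically (not from the slides): scikit-learn's SVC with a linear kernel and a very large C approximates the hard-margin problem, and the toy data below are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable toy data: two small point clouds
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C makes the soft-margin solver behave like the hard-margin QP
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # optimal normal vector w
b = clf.intercept_[0]   # optimal offset b
print("w =", w, ", b =", b)
print("geometric margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```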

  26. Soft Margin Classification [Figure: the lines wx + b = 1, wx + b = 0 and wx + b = −1, with a few points inside the margin or misclassified and their slack values (ε2, ε7, ε11) marked.] Slack variables ξi can be added to allow misclassification of difficult or noisy examples. What should our quadratic optimization criterion be? Minimize ½ wTw + C Σi ξi.

  27. Hard Margin vs. Soft Margin • The old formulation: Find w and b such that Φ(w) = ½ wTw is minimized and, for all {(xi, yi)}, yi(wTxi + b) ≥ 1. • The new formulation incorporating slack variables: Find w and b such that Φ(w) = ½ wTw + CΣξi is minimized and, for all {(xi, yi)}, yi(wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i. • The parameter C can be viewed as a way to control overfitting.
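A short sketch of how C trades margin width against slack (again not from the slides; the overlapping toy data are invented and scikit-learn's SVC serves as the soft-margin solver):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Invented, slightly overlapping classes, so some slack is unavoidable
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C = {C:>6}: margin = {margin:.3f}, support vectors = {clf.n_support_.sum()}")
```

Small C tolerates many margin violations and gives a wide margin; large C penalizes slack heavily and narrows the margin.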

  28. Linear Support Vector Regression • A Marketing Problem [Figure: scatter plot of Expenditures against Age.] Given variables: • person’s age • income group • season • holiday duration • location • number of children • etc. (12 variables) Predict: • the level of holiday Expenditures

  29. Linear Support Vector Regression [Figure: three Expenditures-vs-Age fits: the “lazy case” (underfitting), the “suspiciously smart case” (overfitting), and the “compromise case”, SVR (good generalizability).]

  30. Linear Support Vector Regression • The epsilon-insensitive loss function [Figure: the penalty plotted against the error: zero penalty inside the ε-tube, then increasing linearly (at 45°) outside it.]
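Written out explicitly (a standard definition, not shown in this excerpt), the loss that the figure depicts is

$$L_{\varepsilon}\big(y, f(x)\big) = \max\big(0,\ |y - f(x)| - \varepsilon\big),$$

so residuals of size at most ε incur no penalty at all, and larger residuals are penalized linearly, which is what produces the 45° line in the figure.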

  31. Linear Support Vector Regression [Figure: the same three Expenditures-vs-Age fits, now with their ε-tubes drawn: the biggest tube area for the “lazy case” (underfitting), a middle-sized area for the “compromise case”, SVR (good generalizability), and the smallest area for the “suspiciously smart case” (overfitting); the points lying on the tube boundary are the “support vectors”.] • The thinner the “tube”, the more complex the model.

  32. Non-linear Support Vector Regression • Map the data into a higher-dimensional space: [Figure: a non-linear pattern of Expenditures against Age.]

  33. Non-linear Support Vector Regression • Map the data into a higher-dimensional space: [Figure: scatter of Expenditures against Age.]

  34. Non-linear Support Vector Regression • Finding the value of a new point: [Figure: the fitted non-linear curve of Expenditures against Age, with the prediction read off at a new Age value.]
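A minimal sketch of this step (not from the slides; it uses scikit-learn's SVR with an RBF kernel, and the Expenditures-vs-Age data are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Invented toy data: a non-linear Expenditures-vs-Age relationship plus noise
age = np.sort(rng.uniform(18, 70, size=80)).reshape(-1, 1)
expenditures = 500 + 4 * age[:, 0] * np.sin(age[:, 0] / 8.0) + rng.normal(0, 20, size=80)

# epsilon sets the width of the insensitive tube, C the penalty on points outside it
model = SVR(kernel="rbf", C=100.0, epsilon=10.0)
model.fit(age, expenditures)

# Finding the value of a new point
new_age = np.array([[45.0]])
print("predicted expenditures at age 45:", model.predict(new_age)[0])
print("number of support vectors:", len(model.support_))
```

Internally the prediction is the kernel expansion f(x) = Σi (αi − αi*) k(xi, x) + b over the support vectors, which is what “finding the value of a new point” amounts to.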

  35. Linear SVR: Derivation [Figure: Expenditures against Age with a linear fit.] • Given training data {(xi, yi)}, i = 1, …, n • Find w and b such that f(x) = wTx + b optimally describes the data: (1)
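The derivation that starts here leads to the standard ε-SVR primal problem; it is stated here for completeness (the remainder of the derivation is not included in this excerpt):

$$\min_{w,\,b,\,\xi,\,\xi^{*}} \ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*}) \quad \text{subject to} \quad y_i - (w^{\top}x_i + b) \le \varepsilon + \xi_i, \quad (w^{\top}x_i + b) - y_i \le \varepsilon + \xi_i^{*}, \quad \xi_i, \xi_i^{*} \ge 0,$$

which, like the SVM problem above, is a convex quadratic program with linear constraints, so the Lagrange and duality machinery listed in the Contents applies directly.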
