
Machine learning continued



Presentation Transcript


  1. Machine learning continued Image source: https://www.coursera.org/course/ml

  2. More about linear classifiers • When the data is linearly separable, there may be more than one separator (hyperplane) • Which separator is best?

  3. Support vector machines • Find hyperplane that maximizes the margin between the positive and negative examples C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

  4. Support vector machines • Find hyperplane that maximizes the margin between the positive and negative examples • For support vectors, |w·xi + b| = 1 • Distance between a point and the hyperplane: |w·xi + b| / ||w||, so each support vector lies at distance 1 / ||w|| from the hyperplane • Therefore, the margin is 2 / ||w|| C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

  5. Finding the maximum margin hyperplane • Maximize margin 2 / ||w|| • Correctly classify all training data: yi(w·xi + b) ≥ 1 • Quadratic optimization problem: minimize (1/2)||w||² subject to yi(w·xi + b) ≥ 1 for all i C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
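
The quadratic program above is exactly what standard SVM solvers handle. A minimal sketch (not from the slides; the toy 2D data and scikit-learn are my choices) that fits a linear SVM and reads the margin 2 / ||w|| off the learned weight vector:

```python
# Sketch (illustration only): fit a hard-margin-like linear SVM on toy
# separable data and recover the margin 2 / ||w|| from the learned w.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # negative cluster
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # positive cluster
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin = 2 / ||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```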

  6. Finding the maximum margin hyperplane • Solution: w = Σi αi yi xi, a linear combination of the training points, where the learned weight αi is nonzero only for the support vectors C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

  7. Finding the maximum margin hyperplane • Solution: b = yi – w·xi for any support vector • Classification function (decision boundary): f(x) = sgn(Σi αi yi xi·x + b) • Notice that it relies on an inner product between the test point x and the support vectors xi • Solving the optimization problem also involves computing the inner products xi·xj between all pairs of training points C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
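
Because the decision boundary only touches the data through inner products with the support vectors, it can be reconstructed from them directly. A sketch (my toy data; scikit-learn stores αi·yi in `dual_coef_`) comparing a hand-computed f(x) = Σi αi yi xi·x + b with the library's value:

```python
# Sketch (illustration only): rebuild the SVM decision value
# f(x) = sum_i alpha_i y_i <x_i, x> + b from the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

x_test = np.array([3.0, 3.2])
# dual_coef_[0][i] holds alpha_i * y_i for the i-th support vector
f = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_test) + clf.intercept_[0]

print("manual decision value :", f)
print("sklearn decision value:", clf.decision_function([x_test])[0])
print("predicted label       :", int(np.sign(f)))
```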

  8. Nonlinear SVMs • Datasets that are linearly separable work out great • But what if the dataset is just too hard? • We can map it to a higher-dimensional space (figures: 1D data on the x axis, then the same data plotted against x and x²) Slide credit: Andrew Moore

  9. Nonlinear SVMs • General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable Φ: x → φ(x) Slide credit: Andrew Moore
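
A sketch of that idea (my own 1D example, not the figure from the slide): points whose label depends on |x| cannot be split by a threshold on the line, but after the lift φ(x) = (x, x²) a linear classifier separates them.

```python
# Sketch (illustration only): a 1D dataset that is not linearly separable
# becomes separable after the explicit lift phi(x) = (x, x^2).
import numpy as np
from sklearn.svm import LinearSVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])        # negatives sit in the middle

X_lifted = np.column_stack([x, x ** 2])       # phi(x) = (x, x^2)

clf = LinearSVC(C=100.0).fit(X_lifted, y)
print("training accuracy after lifting:", clf.score(X_lifted, y))  # expect 1.0
```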

  10. Nonlinear SVMs • The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(x, y) = φ(x)·φ(y) • (to be valid, the kernel function must satisfy Mercer’s condition) • This gives a nonlinear decision boundary in the original feature space: Σi αi yi K(xi, x) + b C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
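
The practical consequence of the kernel trick is that a solver only ever needs kernel values, never φ(x) itself. A sketch (my example; scikit-learn's SVC accepts a callable that returns the matrix of pairwise kernel values):

```python
# Sketch (illustration only): train an SVM through a user-supplied kernel
# function; the explicit lifting phi(x) is never computed.
import numpy as np
from sklearn.svm import SVC

def quadratic_kernel(A, B):
    """K(x, y) = (x . y)^2 for every pair of rows of A and B."""
    return (A @ B.T) ** 2

X = np.array([[-3.0], [-2.5], [-0.5], [0.0], [0.5], [2.5], [3.0]])
y = np.array([1, 1, -1, -1, -1, 1, 1])        # label depends on |x|

clf = SVC(kernel=quadratic_kernel, C=10.0).fit(X, y)
print("predictions:", clf.predict(X))         # separable via the kernel alone
```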

  11. Nonlinear kernel: Example • Consider the mapping φ(x) = (x, x²) • Then φ(x)·φ(y) = (x, x²)·(y, y²) = xy + x²y², so the corresponding kernel is K(x, y) = xy + x²y²
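
A quick numerical check (my sketch, using the mapping assumed above) that the explicit feature map and the kernel really compute the same inner product:

```python
# Sketch (illustration only): phi(x) . phi(y) == K(x, y) = x*y + x^2*y^2
# for the 1D mapping phi(x) = (x, x^2).
import numpy as np

def phi(x):
    return np.array([x, x ** 2])

def K(x, y):
    return x * y + (x ** 2) * (y ** 2)

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2)
    assert np.isclose(phi(x) @ phi(y), K(x, y))
print("explicit map and kernel agree on random inputs")
```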

  12. Polynomial kernel: K(x, y) = (c + x·y)^d, where d is the degree and c is a constant offset
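
A sketch (my choice of data and parameters) of the polynomial kernel in scikit-learn, where `degree` plays the role of d and `coef0` the role of c; a degree-2 kernel easily separates concentric circles:

```python
# Sketch (illustration only): degree-2 polynomial kernel on concentric circles.
# scikit-learn's poly kernel is (gamma * <x, y> + coef0)^degree.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```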

  13. Gaussian kernel • Also known as the radial basis function (RBF) kernel: K(x, y) = exp(–‖x – y‖² / (2σ²)) • The corresponding mapping φ(x) is infinite-dimensional! • What is the role of parameter σ? • What if σ is close to zero? • What if σ is very large?
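
A sketch (my toy points) of how σ shapes the Gaussian kernel matrix; note that scikit-learn parameterizes the same kernel with gamma = 1 / (2σ²). As σ → 0 the off-diagonal entries vanish (each point is similar only to itself, so the model can overfit badly); as σ grows, all entries approach 1 and the boundary becomes nearly linear.

```python
# Sketch (illustration only): effect of sigma on the RBF kernel matrix
# K(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
import numpy as np

def rbf_kernel_matrix(X, sigma):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
for sigma in (0.1, 1.0, 100.0):
    K = rbf_kernel_matrix(X, sigma)
    off_diag = K[~np.eye(len(X), dtype=bool)]
    print(f"sigma={sigma}: off-diagonal values in "
          f"[{off_diag.min():.3f}, {off_diag.max():.3f}]")
# sigma -> 0: K approaches the identity; sigma large: all entries approach 1.
```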

  14. Gaussian kernel (figure: nonlinear decision boundary of a Gaussian-kernel SVM, with the support vectors, SVs, highlighted)

  15. What about multi-class SVMs? • Unfortunately, there is no “definitive” multi-class SVM formulation • In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs • One vs. others • Training: learn an SVM for each class vs. the others • Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value • One vs. one • Training: learn an SVM for each pair of classes • Testing: each learned SVM “votes” for a class to assign to the test example
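
Both strategies are available as generic wrappers in scikit-learn; a sketch (iris is my stand-in 3-class dataset):

```python
# Sketch (illustration only): multi-class SVMs built from two-class SVMs,
# via one-vs-rest and one-vs-one.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                          # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # 3 binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one per class pair

print("one-vs-rest training accuracy:", ovr.score(X, y))
print("one-vs-one  training accuracy:", ovo.score(X, y))
```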

  16. SVMs: Pros and cons • Pros • Many publicly available SVM packages: http://www.kernel-machines.org/software • Kernel-based framework is very powerful, flexible • SVMs work very well in practice, even with very small training sample sizes • Cons • No “direct” multi-class SVM, must combine two-class SVMs • Computation, memory (esp. for nonlinear SVMs) • During training time, must compute matrix of kernel values for every pair of examples • Learning can take a very long time for large-scale problems

  17. Beyond simple classification: Structured prediction • Example: image → word Source: B. Taskar

  18. Structured Prediction • Example: sentence → parse tree Source: B. Taskar

  19. Structured Prediction • Example: sentence in two languages → word alignment Source: B. Taskar

  20. Structured Prediction • Example: amino-acid sequence → bond structure Source: B. Taskar

  21. Structured Prediction • Many image-based inference tasks can loosely be thought of as “structured prediction” Source: D. Ramanan

  22. Unsupervised Learning • Idea: Given only unlabeled data as input, learn some sort of structure • The objective is often more vague or subjective than in supervised learning • This is more of an exploratory/descriptive data analysis

  23. Unsupervised Learning • Clustering • Discover groups of “similar” data points
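
A minimal clustering sketch (synthetic blobs and k-means are my choices; no labels are used for fitting):

```python
# Sketch (illustration only): discover groups of similar points with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels unused
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
print("cluster centers:\n", kmeans.cluster_centers_)
```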

  24. Unsupervised Learning • Quantization • Map a continuous input to a discrete (more compact) output
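
Quantization can reuse the same machinery: the learned cluster centers serve as a codebook, and each continuous input is encoded as the index of its nearest center. A sketch (my synthetic data):

```python
# Sketch (illustration only): vector quantization with a k-means codebook.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
codebook = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)

new_points = np.array([[0.0, 0.0], [5.0, -5.0]])
codes = codebook.predict(new_points)            # discrete codes
decoded = codebook.cluster_centers_[codes]      # decode back to centers
print("codes:", codes)
print("decoded (quantized) points:\n", decoded)
```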

  25. Unsupervised Learning • Dimensionality reduction, manifold learning • Discover a lower-dimensional surface on which the data lives
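
A dimensionality-reduction sketch (PCA on the scikit-learn digits set is my example; nonlinear manifold learners such as Isomap follow the same fit/transform pattern):

```python
# Sketch (illustration only): project 64-dimensional digit images onto
# a 2-dimensional linear subspace with PCA.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # shape (1797, 64)
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                      # shape (1797, 2)

print("embedded shape:", X_2d.shape)
print("variance captured by 2 components:",
      pca.explained_variance_ratio_.sum())
```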

  26. Unsupervised Learning • Density estimation • Find a function that approximates the probability density of the data (i.e., value of the function is high for “typical” points and low for “atypical” points) • Can be used for anomaly detection
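
A density-estimation sketch (my example): fit a kernel density estimate to "typical" points, then flag queries whose log-density falls below a low quantile as anomalies.

```python
# Sketch (illustration only): kernel density estimation used for
# simple anomaly detection via a log-density threshold.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_typical = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_typical)
threshold = np.quantile(kde.score_samples(X_typical), 0.01)  # 1st percentile

queries = np.array([[0.1, -0.2],   # typical point
                    [6.0, 6.0]])   # atypical point
log_density = kde.score_samples(queries)
print("log densities:", log_density)
print("flagged as anomaly:", log_density < threshold)
```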

  27. Semi-supervised learning • Lots of data is available, but only a small portion is labeled (e.g., because labeling is expensive) • Why is learning from labeled and unlabeled data better than learning from labeled data alone?
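
One concrete answer to the question above: unlabeled points reveal the shape of the data, so labels can propagate along dense regions. A sketch (my choice of dataset and of scikit-learn's LabelSpreading; unlabeled points are marked with -1):

```python
# Sketch (illustration only): semi-supervised learning by spreading a few
# labels over the unlabeled data; unlabeled points are marked -1.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

labeled_idx = np.concatenate([np.where(y_true == c)[0][:3] for c in (0, 1)])
y_partial = np.full_like(y_true, -1)          # everything unlabeled ...
y_partial[labeled_idx] = y_true[labeled_idx]  # ... except 3 points per class

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
print("accuracy over all points:", (model.transduction_ == y_true).mean())
```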

  28. Active learning • The learning algorithm can choose its own training examples, or ask a “teacher” for an answer on selected inputs S. Vijayanarasimhan and K. Grauman, “Cost-Sensitive Active Visual Category Learning,” 2009
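
A pool-based sketch of the idea (uncertainty sampling with a logistic-regression learner; this is my illustration, not the cost-sensitive method of the cited paper): repeatedly train on the labeled pool and ask the "teacher" for the unlabeled example the model is least confident about.

```python
# Sketch (illustration only): active learning by uncertainty sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Small initial labeled pool with both classes represented.
labeled = np.concatenate([np.where(y == c)[0][:5] for c in (0, 1)]).tolist()
labeled_set = set(labeled)
unlabeled = [i for i in range(len(X)) if i not in labeled_set]

for _ in range(5):                                  # five query rounds
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[unlabeled])
    uncertainty = 1.0 - probs.max(axis=1)           # least-confident strategy
    query = unlabeled[int(np.argmax(uncertainty))]
    print("querying the teacher for example", query)
    labeled.append(query)                           # teacher reveals y[query]
    unlabeled.remove(query)
```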

  29. Lifelong learning http://rtw.ml.cmu.edu/rtw/

  30. Lifelong learning http://rtw.ml.cmu.edu/rtw/

  31. Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. NEIL: Extracting Visual Knowledge from Web Data. In ICCV 2013
