
Recall, in linear methods for classification and regression


Presentation Transcript


  1. Ch 6. Kernel Methods. Kernel methods were introduced by Aizerman et al. (1964) and re-introduced in the context of large margin classifiers by Boser et al. (1992). See also Vapnik (1995), Burges (1998), Cristianini and Shawe-Taylor (2000), Müller et al. (2001), Schölkopf and Smola (2002), and Herbrich (2002). C. M. Bishop, 2006.

  2. Recall, in linear methods for classification and regression • Classical approaches: linear, parametric or nonparametric. A set of training data is used to obtain a parameter vector w. • Step 1: Train • Step 2: Recognize • Kernel methods: memory-based • store the entire training set in order to make predictions for future data points (nearest neighbors). • Transform the data to a higher-dimensional space for linear separability

  3. Kernel methods approach • The kernel methods approach is to stick with linear functions but work in a high dimensional feature space: • The expectation is that the feature space has a much higher dimension than the input space. CCKM'06

  4. Example • Consider the mapping • If we consider a linear equation in this feature space: • We actually have an ellipse – i.e. a non-linear shape in the input space. CCKM'06
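
A minimal sketch of this example, assuming the standard quadratic feature map φ(x) = (x1^2, x2^2, √2·x1·x2) (the slide's own equations are not reproduced here): a linear function of φ(x) is quadratic in the original two-dimensional input, and for suitable weights it traces out an ellipse.

import numpy as np

def phi(x):
    # Assumed quadratic feature map R^2 -> R^3
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# A linear equation w . phi(x) = c in feature space ...
w = np.array([1.0, 4.0, 0.0])   # hypothetical weights
c = 1.0

# ... is non-linear in input space: here w . phi(x) = x1^2 + 4*x2^2, so
# w . phi(x) = c is the ellipse x1^2 + 4*x2^2 = 1.
x = np.array([0.6, 0.4])
print(np.dot(w, phi(x)))        # 1.0: this point lies on the ellipse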

  5. Capacity of feature spaces • The capacity is proportional to the dimension • 2-dim:

  6. Form of the functions • So kernel methods use linear functions in a feature space: • For regression this could be the function • For classification, thresholding is required CCKM'06

  7. Problems of high dimensions • Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means it is unlikely to generalise well • Computational costs involved in dealing with large vectors CCKM'06

  8. Recall • Two theoretical approaches converged on similar algorithms: • Bayesian approach led to Bayesian inference using Gaussian Processes • Frequentist Approach: MLE • First we briefly discuss the Bayesian approach before mentioning some of the frequentist results CCKM'06

  9. I. Bayesian approach • The Bayesian approach relies on a probabilistic analysis by positing • a pdf model • a prior distribution over the function class • Inference involves updating the prior distribution with the likelihood of the data • Possible outputs: • MAP function • Bayesian posterior average CCKM'06

  10. Bayesian approach • Avoids overfitting by • Controlling the prior distribution • Averaging over the posterior CCKM'06

  11. Bayesian approach • Subject to assumptions about the pdf model and prior distribution: • Can get error bars on the output • Compute evidence for the model and use it for model selection • The approach has been developed for different pdf models, e.g. classification • Typically requires approximate inference CCKM'06

  12. II. Frequentist approach • The source of randomness is assumed to be a distribution that generates the training data i.i.d., with the same distribution generating the test data • Different/weaker assumptions than the Bayesian approach, so more general, but typically less analysis can be derived • The main focus is on generalisation error analysis CCKM'06

  13. Generalization • What do we mean by generalisation? CCKM'06

  14. Generalization of a learner CCKM'06

  15. Example of Generalisation • We consider the Breast Cancer dataset • Use the simple Parzen window classifier: the weight vector is w = μ+ − μ−, where μ+ (μ−) is the average of the positive (negative) training examples. • The threshold is set so the hyperplane bisects the line joining these two points. CCKM'06
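
A minimal sketch of the classifier described above, under the assumption that the missing formulas are w = μ+ − μ− and a threshold chosen so the hyperplane bisects the line joining the two class means (synthetic data stands in for the Breast Cancer set):

import numpy as np

def train_mean_classifier(X_pos, X_neg):
    # Weight vector: difference of the class means; threshold: midpoint projected on w
    mu_pos = X_pos.mean(axis=0)
    mu_neg = X_neg.mean(axis=0)
    w = mu_pos - mu_neg
    b = 0.5 * (mu_pos + mu_neg) @ w
    return w, b

def predict(w, b, X):
    return np.sign(X @ w - b)

# Hypothetical usage with random Gaussian data
rng = np.random.default_rng(0)
X_pos = rng.normal(+1.0, 1.0, size=(50, 5))
X_neg = rng.normal(-1.0, 1.0, size=(50, 5))
w, b = train_mean_classifier(X_pos, X_neg)
print((predict(w, b, X_pos) == 1).mean())   # training accuracy on the positive class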

  16. Example of Generalisation • By repeatedly drawing random training sets S of size m we estimate the distribution of the generalisation error, using the test set error as a proxy for the true generalisation • We plot the histogram and the average of the distribution for various training set sizes: 648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7. CCKM'06
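
A sketch of the resampling experiment just described, reusing the helper functions from the previous sketch; it assumes every random draw contains examples of both classes:

import numpy as np

def error_distribution(X, y, m, trials=200, seed=0):
    # Repeatedly draw a random training set of size m, train the mean-difference
    # classifier, and record the held-out error as a proxy for the generalisation error.
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        idx = rng.permutation(len(y))
        train, test = idx[:m], idx[m:]
        w, b = train_mean_classifier(X[train][y[train] == 1], X[train][y[train] == -1])
        errs.append(np.mean(predict(w, b, X[test]) != y[test]))
    return np.array(errs)   # its histogram and mean correspond to the plots that follow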

  17. Example of Generalisation • Since the expected classifier is in all cases the same, we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same. CCKM'06

  18. Error distribution: full dataset CCKM'06

  19. Error distribution: dataset size: 342 CCKM'06

  20. Error distribution: dataset size: 273 CCKM'06

  21. Error distribution: dataset size: 205 CCKM'06

  22. Error distribution: dataset size: 137 CCKM'06

  23. Error distribution: dataset size: 68 CCKM'06

  24. Error distribution: dataset size: 34 CCKM'06

  25. Error distribution: dataset size: 27 CCKM'06

  26. Error distribution: dataset size: 20 CCKM'06

  27. Error distribution: dataset size: 14 CCKM'06

  28. Error distribution: dataset size: 7 CCKM'06

  29. Observations • Things can get bad if the number of training examples is small compared to the dimension • The mean can be a bad predictor of the true generalisation: • i.e. things can look okay in expectation, but still go badly wrong • Key ingredient of learning: keep flexibility high while still ensuring good generalisation CCKM'06

  30. Controlling generalisation • The critical method of controlling generalisation for classification is to force a large margin on the training data: CCKM'06
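
For reference, the geometric margin being enforced can be written in standard notation, assuming targets t_n ∈ {−1, +1} and a linear function w^T φ(x) + b (notation not taken from the slide itself):

\gamma \;=\; \min_{n}\ \frac{t_n \left( \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}_n) + b \right)}{\lVert \mathbf{w} \rVert}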

  31. Kernel methods approach • The kernel methods approach is to stick with linear functions but work in a high dimensional feature space: • The expectation is that the feature space has a much higher dimension than the input space.

  32. Study: Hilbert Space • Functionals: a map from a vector space to a field • Duality: • Inner product • Norm • Similarity • Distance • Metric (C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  33. Kernel Functions • k(x, x') = φ(x)^T φ(x'). • For example, k(x, x') = (x^T x' + c)^M • What if x and x' are two images? • The kernel represents a particular weighted sum of all possible products of M pixels in the first image with M pixels in the second image.
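
A small numerical check of this kernel (a sketch; the explicit feature map is written out only for the special case M = 2, c = 0 in two dimensions):

import numpy as np

def poly_kernel(x, z, c=1.0, M=2):
    # Polynomial kernel k(x, z) = (x^T z + c)^M
    return (x @ z + c) ** M

def phi_quadratic(x):
    # Explicit feature map matching the case M = 2, c = 0 (assumed form)
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([0.7, 0.5])
print(poly_kernel(x, z, c=0.0, M=2))        # (x^T z)^2 evaluated directly
print(phi_quadratic(x) @ phi_quadratic(z))  # same value via the explicit feature space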

  34. Kernel function evaluated at the training data points: k(x, x') = φ(x)^T φ(x'). • Linear kernels: k(x, x') = x^T x' • Stationary kernels: invariant to translation, k(x, x') = k(x − x') • Homogeneous kernels, i.e. radial basis functions: k(x, x') = k(‖x − x'‖)

  35. KernelTrick • if we have an algorithm in which the input vector x enters only in the form of scalar products, then we can replace that scalar product with some other choice of kernel.
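
As an illustration of the trick (a sketch, not taken from the slides): any quantity built from scalar products, such as a squared distance in feature space, can be computed from kernel evaluations alone, without ever forming φ(x).

import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def feature_space_sq_dist(x, z, k):
    # ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z): no explicit phi needed
    return k(x, x) - 2 * k(x, z) + k(z, z)

x = np.array([1.0, 2.0])
z = np.array([0.5, 1.0])
print(feature_space_sq_dist(x, z, gaussian_kernel))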

  36. 6.1 Dual Representations (1/4) • Consider a linear regression model with the regularized sum-of-squares error function J(w) = (1/2) Σ_n { w^T φ(x_n) − t_n }^2 + (λ/2) w^T w • If we set the gradient of J(w) with respect to w to zero, we obtain w = −(1/λ) Σ_n { w^T φ(x_n) − t_n } φ(x_n) = Φ^T a • Where the nth row of the design matrix Φ is φ(x_n)^T • And a = (a_1, ..., a_N)^T with a_n = −(1/λ) { w^T φ(x_n) − t_n }

  37. 6.1 Dual Representations (2/4) • We can now reformulate the least-squares algorithm in terms of a (a dual representation). We substitute w = Φ^T a into J(w) to obtain J(a) = (1/2) a^T Φ Φ^T Φ Φ^T a − a^T Φ Φ^T t + (1/2) t^T t + (λ/2) a^T Φ Φ^T a • Define the Gram matrix K = Φ Φ^T with entries K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m)

  38. 6.1 Dual Representations (3/4) • The sum-of-squares error function can then be written as J(a) = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a • Setting the gradient of J(a) with respect to a to zero, we obtain the optimal a = (K + λ I_N)^{-1} t • Recall that a was defined as a function of w

  39. 6.1 Dual Representations (4/4) • We obtain the following prediction for a new input x by substituting this back into the linear regression model: y(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λ I_N)^{-1} t, where we define the vector k(x) with elements k_n(x) = k(x_n, x) • The prediction y(x) is computed as a linear combination of the target values t • y(x) is expressed entirely in terms of the kernel function k(x, x'). • w is expressed as a linear combination of the feature vectors: w = Φ^T a = Σ_n a_n φ(x_n)

  40. Recall • Linear regression solution: w = [Φ^T Φ + λ I_M]^{-1} Φ^T t • Dual representation: a = [K + λ I_N]^{-1} t • Note K is N×N, whereas Φ^T Φ is M×M (Φ itself is N×M)
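
A minimal numerical sketch of this equivalence, with a random design matrix standing in for real basis functions: the primal solution w and the dual solution a give identical predictions.

import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 20, 5, 0.1
Phi = rng.normal(size=(N, M))   # design matrix, nth row is phi(x_n)^T (N x M)
t = rng.normal(size=N)

# Primal: w = (Phi^T Phi + lam I_M)^{-1} Phi^T t   (M x M system)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual: a = (K + lam I_N)^{-1} t with K = Phi Phi^T  (N x N system)
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)

# Both give the same prediction y(x) = w^T phi(x) = k(x)^T a
phi_new = rng.normal(size=M)
print(w @ phi_new, (Phi @ phi_new) @ a)   # identical up to rounding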

  41. 6.2 Constructing Kernels (1/5) • A kernel function is defined as the inner product of two feature vectors: k(x, x') = φ(x)^T φ(x') • Example of a kernel: k(x, z) = (x^T z)^2, which in two dimensions corresponds to the feature map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T

  42. Basis Functions and corresponding Kernels • Figure 6.1: the upper plots show basis functions (polynomials, Gaussians, logistic sigmoids) and the lower plots show the corresponding kernel functions. (C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  43. Constructing Kernels • A necessary and sufficient condition for a function to be a valid kernel is that the Gram matrix K should be positive semidefinite. • Techniques for constructing new kernels: given valid kernels k1(x, x') and k2(x, x'), new kernels such as c k1(x, x'), f(x) k1(x, x') f(x'), q(k1(x, x')) for a polynomial q with nonnegative coefficients, exp(k1(x, x')), k1(x, x') + k2(x, x'), k1(x, x') k2(x, x'), and x^T A x' for a symmetric positive semidefinite matrix A will also be valid. (C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/
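
A quick numerical sanity check of the positive-semidefiniteness condition (a sketch: for a valid kernel, the eigenvalues of any Gram matrix should be non-negative up to rounding error):

import numpy as np

def gram_matrix(X, k):
    # Gram matrix K with K_nm = k(x_n, x_m)
    return np.array([[k(xn, xm) for xm in X] for xn in X])

def looks_psd(K, tol=1e-10):
    # Symmetric Gram matrices of valid kernels have no eigenvalue below -tol
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.default_rng(2).normal(size=(10, 3))
k_gauss = lambda x, z: np.exp(-np.sum((x - z) ** 2))
print(looks_psd(gram_matrix(X, k_gauss)))               # True: a valid kernel
print(looks_psd(gram_matrix(X, lambda x, z: -(x @ z)))) # False: -x^T z is not a kernel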

  44. Gaussian Kernel k(x, x') = exp(−‖x − x'‖^2 / 2σ^2) • Show that the feature vector corresponding to the Gaussian kernel has infinite dimensionality. (C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/
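
A sketch of the argument: factor out the norms and expand the cross term as a power series, so the Gaussian kernel becomes an infinite sum of polynomial kernels and the corresponding feature vector has infinitely many components.

\begin{aligned}
k(\mathbf{x}, \mathbf{x}') &= \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}}{2\sigma^{2}}\right)
= \exp\!\left(-\frac{\lVert \mathbf{x} \rVert^{2}}{2\sigma^{2}}\right)
  \exp\!\left(\frac{\mathbf{x}^{\top}\mathbf{x}'}{\sigma^{2}}\right)
  \exp\!\left(-\frac{\lVert \mathbf{x}' \rVert^{2}}{2\sigma^{2}}\right), \\
\exp\!\left(\frac{\mathbf{x}^{\top}\mathbf{x}'}{\sigma^{2}}\right) &= \sum_{m=0}^{\infty} \frac{(\mathbf{x}^{\top}\mathbf{x}')^{m}}{\sigma^{2m}\, m!}.
\end{aligned}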

  45. Construction of Kernels from Generative Models • Given p(x), define a kernel function k(x, x') = p(x) p(x') • A kernel function measuring the similarity of two inputs through a hidden variable z: k(x, x') = Σ_z p(x|z) p(x'|z) p(z) • This leads to a hidden Markov model if x and x' are sequences of observations

  46. Fisher Kernel • Consider the Fisher score g(θ, x) = ∇_θ ln p(x|θ) • The Fisher kernel is then defined as k(x, x') = g(θ, x)^T F^{-1} g(θ, x'), where F is the Fisher information matrix, F = E_x[ g(θ, x) g(θ, x)^T ] (C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/
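
A minimal worked sketch for a toy generative model, a univariate Gaussian with unknown mean μ and known σ (chosen here only for illustration; any parametric model p(x|θ) would do):

import numpy as np

mu, sigma = 0.0, 1.5

def score(x):
    # Fisher score for N(mu, sigma^2) with respect to mu: d/dmu ln p(x|mu) = (x - mu) / sigma^2
    return (x - mu) / sigma**2

def fisher_kernel(x, xp):
    # Fisher information for this model is F = E[score^2] = 1 / sigma^2
    F = 1.0 / sigma**2
    return score(x) * (1.0 / F) * score(xp)

print(fisher_kernel(2.0, -1.0))   # (2 - mu)(-1 - mu) / sigma^2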

  47. Sigmoid kernel • k(x, x') = tanh(a x^T x' + b) • This sigmoid kernel form gives the support vector machine a superficial resemblance to a neural network model.

  48. How to select the basis functions? • Assume a fixed nonlinear transformation • Transform the inputs using a vector of basis functions φ(x) • The resulting decision boundaries will be linear in the feature space: y(x) = w^T φ(x) (C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  49. Radial Basis Function Networks • Each basis function depends only on the radial distance from a center μ_j, so that φ_j(x) = h(‖x − μ_j‖) (C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/
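
A minimal sketch of such a network with Gaussian basis functions and hypothetical centres μ_j; the weights here are fitted by ordinary least squares, which is one possible choice rather than the one prescribed by the slide:

import numpy as np

def rbf_design(X, centers, s=1.0):
    # Design matrix with Phi[n, j] = h(||x_n - mu_j||), Gaussian h assumed
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(40, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
centers = np.linspace(-3, 3, 9)[:, None]            # hypothetical centres mu_j

Phi = rbf_design(X, centers)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)         # least-squares weights
print(rbf_design(np.array([[0.5]]), centers) @ w)   # prediction at x = 0.5, roughly sin(0.5)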

  50. 6.3 Radial Basis Function Networks (2/3) • Let's consider the interpolation problem when the input variables are noisy. If the noise on the input vector x is described by a variable ξ having a distribution ν(ξ), the sum-of-squares error function becomes E = (1/2) Σ_n ∫ { y(x_n + ξ) − t_n }^2 ν(ξ) dξ • Using the calculus of variations, the optimal solution is y(x) = Σ_n t_n h(x − x_n), with basis functions h(x − x_n) = ν(x − x_n) / Σ_m ν(x − x_m) (the Nadaraya-Watson model). (C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/
