
Virtual Vector Machine for Bayesian Online Classification


Presentation Transcript


  1. Virtual Vector Machine for Bayesian Online Classification Yuan (Alan) Qi, CS & Statistics, Purdue. Joint work with T.P. Minka and R. Xiang. June 2009

  2. Motivation Ubiquitous data stream: emails, stock prices, images from satellites, video surveillance How to process a data stream using a small memory buffer and make accurate predictions?

  3. Outline • Introduction • Virtual Vector Machine • Experimental Results • Summary

  4. Introduction • Online learning: • Update model and make predictions based on data points received sequentially • Use a fixed-size memory buffer

  5. Classical online learning • Classification: • Perceptron • Linear regression: • Kalman filtering

  6. Bayesian treatment • Monte Carlo methods (e.g., particle filters) • Difficult for classification models due to high dimensionality • Deterministic methods: • Assumed-density filtering: Gaussian process classification models (Csató, 2002).

  7. Virtual Vector Machine preview • Two parts: • Gaussian approximation factors • Virtual points for non-Gaussian factors • Summarize multiple real data points • Flexible functional forms • Stored in a data cache with a user-defined size.

  8. Outline • Introduction • Virtual Vector Machine • Experimental Results • Summary

  9. Online Bayesian classification • Model parameters: w • Data from time 1 to T: D_T = {(x_t, y_t)} for t = 1, …, T • Likelihood function at time t: p(y_t | x_t, w) • Prior distribution: p_0(w) • Posterior at time T: p(w | D_T) ∝ p_0(w) ∏_t p(y_t | x_t, w)
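The slide's equations were lost in transcription; the LaTeX block below is a standard reconstruction of the quantities named above, including the recursive update that online methods exploit. The symbols w, x_t, y_t are a notational choice, not copied from the original deck.

```latex
% Bayesian online classification, reconstructed from the slide's labels.
\begin{aligned}
&\text{Model parameters: } \mathbf{w}, \qquad
 \text{Data: } \mathcal{D}_T = \{(\mathbf{x}_t, y_t)\}_{t=1}^{T} \\
&\text{Likelihood at time } t:\; p(y_t \mid \mathbf{x}_t, \mathbf{w}), \qquad
 \text{Prior: } p_0(\mathbf{w}) \\
&\text{Posterior: } p(\mathbf{w} \mid \mathcal{D}_T)
 \propto p_0(\mathbf{w}) \prod_{t=1}^{T} p(y_t \mid \mathbf{x}_t, \mathbf{w})
 \;\propto\; p(\mathbf{w} \mid \mathcal{D}_{T-1})\, p(y_T \mid \mathbf{x}_T, \mathbf{w})
\end{aligned}
```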

  10. Flipping noise model • Labeling error rate ε: each observed label is flipped with probability ε • x̃_t = y_t x_t: feature vector scaled by 1 or −1 depending on the label • Posterior distribution: planes cutting a sphere in the 3-D case.
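As a concrete illustration, here is a minimal sketch of a flipping-noise likelihood of the kind the slide describes: a step function on the label-scaled feature vector, softened by the labeling error rate ε. The exact functional form and the value eps=0.1 are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def flip_noise_likelihood(w, x, y, eps=0.1):
    """Flipping-noise likelihood p(y | x, w) = eps + (1 - 2*eps) * step(y * w.x).

    With probability eps the observed label is flipped, so a point on the
    wrong side of the hyperplane still retains likelihood eps.
    """
    x_tilde = y * x                        # feature vector scaled by the +/-1 label
    on_correct_side = float(w @ x_tilde > 0)
    return eps + (1.0 - 2.0 * eps) * on_correct_side
```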

  11. Gaussian approximation by EP • EP approximates each likelihood factor p(y_t | x_t, w) by a Gaussian factor f̃_t(w) • Both the prior and the factors f̃_t(w) have the form of a Gaussian. Therefore, the approximate posterior q(w) is a Gaussian.
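A sketch of the corresponding ADF/EP-style moment-matching step for the flipping-noise likelihood under a Gaussian posterior N(w; m, V), following standard Bayes-point-machine derivations (cf. Minka's EP work). This is a reconstruction under those assumptions, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def adf_update(m, V, x, y, eps=0.1):
    """Moment-match N(w; m, V) * [eps + (1-2*eps) * step(y * w.x)] to a Gaussian."""
    x_t = y * x                          # label-scaled feature vector
    v = V @ x_t                          # V x
    s = np.sqrt(float(x_t @ v))          # predictive std dev sqrt(x^T V x)
    z = float(x_t @ m) / s
    Z = eps + (1.0 - 2.0 * eps) * norm.cdf(z)    # normalizer E_q[f(w)]
    alpha = (1.0 - 2.0 * eps) * norm.pdf(z) / (Z * s)
    m_new = m + alpha * v                # match the first moment
    beta = alpha * (z / s + alpha)
    V_new = V - beta * np.outer(v, v)    # match the second moment
    return m_new, V_new
```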

  12. VVM enlarges the approximation family: q(w) ∝ g(w) ∏_i f(x̃_i, w) • x̃_i: virtual point • f(x̃_i, w): exact form of the original likelihood function (could be more flexible) • g(w): Gaussian residue

  13. Reduction to Gaussian • From the augmented representation, we can reduce q(w) to a Gaussian by EP smoothing on the virtual points, with g(w) as the prior • Since g(w) and the EP factor approximations are Gaussian, the reduced distribution is Gaussian too.

  14. Cost function for finding virtual points • Minimizing a cost function in the ADF spirit: KL(p̂ ‖ q), where p̂ contains one more nonlinear factor than q • Computationally intractable…

  15. Cost function for finding virtual points • Instead, maximizing a surrogate function for the KL objective • Keeps the informative (non-Gaussian) information in the virtual points.
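A plausible LaTeX reconstruction of the ADF-spirit objective named on slides 14–15: the tilted distribution p̂ augments the current approximation with the newest, still-exact likelihood factor (this is what "one more nonlinear factor" refers to). The surrogate function itself is not reproduced here.

```latex
% ADF-spirit cost: the new approximation q_t should stay close to the
% tilted distribution that keeps the newest likelihood factor exact.
\hat{p}_t(\mathbf{w}) \propto q_{t-1}(\mathbf{w})\, p(y_t \mid \mathbf{x}_t, \mathbf{w}),
\qquad
q_t = \arg\min_{q \in \mathcal{Q}} \,
\mathrm{KL}\!\left(\hat{p}_t(\mathbf{w}) \,\middle\|\, q(\mathbf{w})\right)
```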

  16. Two basic operations • Searching over all possible locations for the virtual points: computationally expensive! • For efficiency, consider only two operations to generate virtual points (a control-flow sketch follows below): • Eviction: delete the least informative point • Merging: merge two similar points into one
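A hypothetical sketch of the resulting buffer management: when the cache exceeds its user-defined size, either evict the least informative point or merge the most similar pair. The callables info_score, similarity, and merge, and the evict-vs-merge rule, are illustrative placeholders standing in for the paper's criteria.

```python
def add_point(cache, new_point, capacity, info_score, similarity, merge):
    """Keep the virtual-point cache at a fixed size via eviction or merging."""
    cache.append(new_point)
    if len(cache) <= capacity:
        return cache
    # Candidate 1: evict the least informative point.
    worst = min(cache, key=info_score)
    # Candidate 2: merge the most similar pair of points.
    pairs = [(a, b) for i, a in enumerate(cache) for b in cache[i + 1:]]
    a, b = max(pairs, key=lambda p: similarity(*p))
    # Illustrative rule: prefer merging when a very similar pair exists.
    if similarity(a, b) > 0.9:
        cache.remove(a)
        cache.remove(b)
        cache.append(merge(a, b))
    else:
        cache.remove(worst)
    return cache
```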

  17. Eviction • After adding the new point to the virtual point set, select the point to evict by maximizing an informativeness-based deletion criterion (the least informative point, e.g., the one with the largest margin; see the next slide) • Remove it from the cache • Update the residue g(w) to absorb a Gaussian approximation of the deleted factor

  18. Version space for the 3-D case • Version space: brown area • EP approximation: red ellipse • Four data points: hyperplanes • Shown: the version space with three points after deleting one point (the one with the largest margin)

  19. Merging • Remove the two similar points x̃_i and x̃_j from the cache • Insert the merged point x̃_m into the cache • Update the residue via g(w) ← g(w) r(w), where the Gaussian residue term r(w) captures the information lost from the original two factors • Equivalent to replacing f(x̃_i, w) f(x̃_j, w) by r(w) f(x̃_m, w)

  20. Version space for the 3-D case • Version space: brown area • EP approximation: red ellipse • Four data points: hyperplanes • Shown: the version space with three points after merging two similar points

  21. Computing the residue term • Inverse ADF: match the moments of the distributions on the left- and right-hand sides of the replacement above • Efficiently solved by the Gauss–Newton method as a one-dimensional problem
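A generic sketch of one-dimensional Gauss–Newton moment matching of the kind the slide refers to: treat the mismatch between target moments and the moments induced by a scalar parameter theta as a residual vector, and iterate. The moment functions here are placeholders; the paper's inverse-ADF equations define the actual ones.

```python
import numpy as np

def gauss_newton_1d(theta0, moment_fn, target, iters=20, tol=1e-10, h=1e-6):
    """Solve moment_fn(theta) ~= target for a scalar theta by Gauss-Newton.

    moment_fn maps the scalar parameter to a vector of moments (e.g. mean
    and variance along one direction); target holds the moments to match.
    """
    theta = float(theta0)
    for _ in range(iters):
        r = moment_fn(theta) - target                          # moment residuals
        # Finite-difference Jacobian of the residuals w.r.t. theta.
        J = (moment_fn(theta + h) - moment_fn(theta - h)) / (2.0 * h)
        step = (J @ r) / (J @ J)                               # 1-D Gauss-Newton step
        theta -= step
        if abs(step) < tol:
            break
    return theta
```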

  22. Algorithm Summary

  23. Classification with random features • Random feature expansion (Rahimi & Recht, 2007) • For RBF kernels, we use random Fourier features: φ(x) = √(2/D) [cos(ω_1ᵀx + b_1), …, cos(ω_Dᵀx + b_D)]ᵀ • The frequencies ω_i are sampled from a special distribution: the Fourier transform of the RBF kernel, which is a Gaussian; the phases b_i are uniform on [0, 2π].
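A minimal sketch of the Rahimi & Recht construction, assuming an RBF kernel k(x, x') = exp(−γ‖x − x'‖²); the dimension D=100 matches the expansion size quoted in the experiments, while γ is an assumed kernel width.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, D=100, gamma=1.0):
    """Map rows of X to D random Fourier features approximating an RBF kernel."""
    d = X.shape[1]
    # Frequencies from the kernel's spectral density (a Gaussian for RBF),
    # phases uniform on [0, 2*pi]. In an online setting, draw W and b once
    # and reuse them for every incoming point.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```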

  24. Outline • Introduction • Virtual Vector Machine • Experimental Results • Summary

  25. Estimation accuracy of the posterior mean Mean squared error of the estimated posterior mean obtained by EP, the virtual vector machine (VVM), ADF, and window-EP (W-EP). The exact posterior mean is obtained via a Monte Carlo method. The results are averaged over 20 runs.

  26. Online classification (1) Accumulative prediction error rates of VVM, the sparse online Gaussian process classifier (SOGP), the Passive-Aggressive (PA) algorithm, and the Topmoumoute online natural gradient (NG) algorithm on the Spambase dataset. The size of the virtual point set used by VVM is 30, while the online Gaussian process model has 143 basis points.

  27. Online nonlinear classification (2) Accumulative prediction error rates of VVM and competing methods on the Thyroid dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer to cache 10 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 12 and 91 basis points, respectively.

  28. Online nonlinear classification (3) Accumulative prediction error rates of VVM and the competing methods on the Ionosphere dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer to cache 30 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 279 and 189 basis points, respectively.

  29. Summary • Efficient Bayesian online classification • A small constant space cost • A smooth trade-off between prediction accuracy and computational cost • Improved prediction accuracy over alternative methods • Future work: more flexible functional forms for virtual points, and other applications
