
Virtual Vector Machine for Bayesian Online Classification


Presentation Transcript


  1. Virtual Vector Machine for Bayesian Online Classification Yuan (Alan) Qi, CS & Statistics, Purdue. Joint work with T.P. Minka and R. Xiang. June 2009

  2. Motivation Ubiquitous data stream: emails, stock prices, images from satellites, video surveillance How to process a data stream using a small memory buffer and make accurate predictions?

  3. Outline • Introduction • Virtual Vector Machine • Experimental Results • Summary

  4. Introduction • Online learning: • Update model and make predictions based on data points received sequentially • Use a fixed-size memory buffer

  5. Classical online learning • Classification: • Perceptron • Linear regression: • Kalman filtering

  6. Bayesian treatment • Monte Carlo methods (e.g., particle filters) • Difficult for classification models due to high dimensionality • Deterministic methods: • Assumed-density filtering: Gaussian process classification models (Csató, 2002).

  7. Virtual Vector Machine preview • Two parts: • Gaussian approximation factors • Virtual points for non-Gaussian factors • Summarize multiple real data points • Flexible functional forms • Stored in a data cache with a user-defined size.

  8. Outline • Introduction • Virtual Vector Machine • Experimental Results • Summary

  9. Online Bayesian classification • Model parameters: w • Data from time 1 to T: D_T = {(x_t, y_t)} for t = 1, …, T • Likelihood function at time t: p(y_t | x_t, w) • Prior distribution: p_0(w) • Posterior at time T: p(w | D_T) ∝ p_0(w) ∏_t p(y_t | x_t, w)
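The slide's equations were lost in transcription; the LaTeX block below is a standard reconstruction of the quantities named above, including the recursive update that online methods exploit. The symbols w, x_t, y_t are a notational choice, not copied from the original deck.

```latex
% Bayesian online classification, reconstructed from the slide's labels.
\begin{aligned}
&\text{Model parameters: } \mathbf{w}, \qquad
 \text{Data: } \mathcal{D}_T = \{(\mathbf{x}_t, y_t)\}_{t=1}^{T} \\
&\text{Likelihood at time } t:\; p(y_t \mid \mathbf{x}_t, \mathbf{w}), \qquad
 \text{Prior: } p_0(\mathbf{w}) \\
&\text{Posterior: } p(\mathbf{w} \mid \mathcal{D}_T)
 \propto p_0(\mathbf{w}) \prod_{t=1}^{T} p(y_t \mid \mathbf{x}_t, \mathbf{w})
 \;\propto\; p(\mathbf{w} \mid \mathcal{D}_{T-1})\, p(y_T \mid \mathbf{x}_T, \mathbf{w})
\end{aligned}
```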

  10. Flipping noise model • Labeling error rate ε: each observed label is flipped with probability ε • x̃_t = y_t x_t: feature vector scaled by 1 or −1 depending on the label • Posterior distribution: planes cutting a sphere in the 3-D case.
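As a concrete illustration, here is a minimal sketch of a flipping-noise likelihood of the kind the slide describes: a step function on the label-scaled feature vector, softened by the labeling error rate ε. The exact functional form and the value eps=0.1 are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def flip_noise_likelihood(w, x, y, eps=0.1):
    """Flipping-noise likelihood p(y | x, w) = eps + (1 - 2*eps) * step(y * w.x).

    With probability eps the observed label is flipped, so a point on the
    wrong side of the hyperplane still retains likelihood eps.
    """
    x_tilde = y * x                        # feature vector scaled by the +/-1 label
    on_correct_side = float(w @ x_tilde > 0)
    return eps + (1.0 - 2.0 * eps) * on_correct_side
```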

  11. Gaussian approximation by EP • EP approximates each likelihood factor p(y_t | x_t, w) by a Gaussian factor f̃_t(w) • Both the prior and the factors f̃_t(w) have the form of a Gaussian. Therefore, the approximate posterior q(w) is a Gaussian.
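A sketch of the corresponding ADF/EP-style moment-matching step for the flipping-noise likelihood under a Gaussian posterior N(w; m, V), following standard Bayes-point-machine derivations (cf. Minka's EP work). This is a reconstruction under those assumptions, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def adf_update(m, V, x, y, eps=0.1):
    """Moment-match N(w; m, V) * [eps + (1-2*eps) * step(y * w.x)] to a Gaussian."""
    x_t = y * x                          # label-scaled feature vector
    v = V @ x_t                          # V x
    s = np.sqrt(float(x_t @ v))          # predictive std dev sqrt(x^T V x)
    z = float(x_t @ m) / s
    Z = eps + (1.0 - 2.0 * eps) * norm.cdf(z)    # normalizer E_q[f(w)]
    alpha = (1.0 - 2.0 * eps) * norm.pdf(z) / (Z * s)
    m_new = m + alpha * v                # match the first moment
    beta = alpha * (z / s + alpha)
    V_new = V - beta * np.outer(v, v)    # match the second moment
    return m_new, V_new
```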

  12. VVM enlarges the approximation family: q(w) ∝ g(w) ∏_i f(x̃_i, w) • x̃_i: virtual point • f(x̃_i, w): exact form of the original likelihood function (could be more flexible) • g(w): Gaussian residue

  13. Reduction to Gaussian • From the augmented representation, we can reduce q(w) to a Gaussian by EP smoothing on the virtual points, with g(w) as the prior • Since g(w) and the EP factor approximations are Gaussian, the reduced distribution is Gaussian too.

  14. Cost function for finding virtual points • Minimizing a cost function in the ADF spirit: KL(p̂ ‖ q), where p̂ contains one more nonlinear factor than q • Computationally intractable…

  15. Cost function for finding virtual points • Instead, maximizing a surrogate function for the KL objective • Keeps the informative (non-Gaussian) information in the virtual points.
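A plausible LaTeX reconstruction of the ADF-spirit objective named on slides 14–15: the tilted distribution p̂ augments the current approximation with the newest, still-exact likelihood factor (this is what "one more nonlinear factor" refers to). The surrogate function itself is not reproduced here.

```latex
% ADF-spirit cost: the new approximation q_t should stay close to the
% tilted distribution that keeps the newest likelihood factor exact.
\hat{p}_t(\mathbf{w}) \propto q_{t-1}(\mathbf{w})\, p(y_t \mid \mathbf{x}_t, \mathbf{w}),
\qquad
q_t = \arg\min_{q \in \mathcal{Q}} \,
\mathrm{KL}\!\left(\hat{p}_t(\mathbf{w}) \,\middle\|\, q(\mathbf{w})\right)
```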

  16. Two basic operations • Searching over all possible locations for the virtual points: computationally expensive! • For efficiency, consider only two operations to generate virtual points (a control-flow sketch follows below): • Eviction: delete the least informative point • Merging: merge two similar points into one
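A hypothetical sketch of the resulting buffer management: when the cache exceeds its user-defined size, either evict the least informative point or merge the most similar pair. The callables info_score, similarity, and merge, and the evict-vs-merge rule, are illustrative placeholders standing in for the paper's criteria.

```python
def add_point(cache, new_point, capacity, info_score, similarity, merge):
    """Keep the virtual-point cache at a fixed size via eviction or merging."""
    cache.append(new_point)
    if len(cache) <= capacity:
        return cache
    # Candidate 1: evict the least informative point.
    worst = min(cache, key=info_score)
    # Candidate 2: merge the most similar pair of points.
    pairs = [(a, b) for i, a in enumerate(cache) for b in cache[i + 1:]]
    a, b = max(pairs, key=lambda p: similarity(*p))
    # Illustrative rule: prefer merging when a very similar pair exists.
    if similarity(a, b) > 0.9:
        cache.remove(a)
        cache.remove(b)
        cache.append(merge(a, b))
    else:
        cache.remove(worst)
    return cache
```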

  17. Eviction • After adding the new point to the virtual point set, select the point to evict by maximizing an informativeness-based deletion criterion (the least informative point, e.g., the one with the largest margin; see the next slide) • Remove it from the cache • Update the residue g(w) to absorb a Gaussian approximation of the deleted factor

  18. Version space for the 3-D case • Version space: brown area • EP approximation: red ellipse • Four data points: hyperplanes • Shown: the version space with three points after deleting one point (the one with the largest margin)

  19. Merging • Remove the two similar points x̃_i and x̃_j from the cache • Insert the merged point x̃_m into the cache • Update the residue via g(w) ← g(w) r(w), where the Gaussian residue term r(w) captures the information lost from the original two factors • Equivalent to replacing f(x̃_i, w) f(x̃_j, w) by r(w) f(x̃_m, w)

  20. Version space for the 3-D case • Version space: brown area • EP approximation: red ellipse • Four data points: hyperplanes • Shown: the version space with three points after merging two similar points

  21. Computing the residue term • Inverse ADF: match the moments of the distributions on the left- and right-hand sides of the replacement above • Efficiently solved by the Gauss–Newton method as a one-dimensional problem
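A generic sketch of one-dimensional Gauss–Newton moment matching of the kind the slide refers to: treat the mismatch between target moments and the moments induced by a scalar parameter theta as a residual vector, and iterate. The moment functions here are placeholders; the paper's inverse-ADF equations define the actual ones.

```python
import numpy as np

def gauss_newton_1d(theta0, moment_fn, target, iters=20, tol=1e-10, h=1e-6):
    """Solve moment_fn(theta) ~= target for a scalar theta by Gauss-Newton.

    moment_fn maps the scalar parameter to a vector of moments (e.g. mean
    and variance along one direction); target holds the moments to match.
    """
    theta = float(theta0)
    for _ in range(iters):
        r = moment_fn(theta) - target                          # moment residuals
        # Finite-difference Jacobian of the residuals w.r.t. theta.
        J = (moment_fn(theta + h) - moment_fn(theta - h)) / (2.0 * h)
        step = (J @ r) / (J @ J)                               # 1-D Gauss-Newton step
        theta -= step
        if abs(step) < tol:
            break
    return theta
```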

  22. Algorithm Summary

  23. Classification with random features • Random feature expansion (Rahimi & Recht, 2007) • For RBF kernels, we use random Fourier features: φ(x) = √(2/D) [cos(ω_1ᵀx + b_1), …, cos(ω_Dᵀx + b_D)]ᵀ • The frequencies ω_i are sampled from a special distribution: the Fourier transform of the RBF kernel, which is a Gaussian; the phases b_i are uniform on [0, 2π].
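A minimal sketch of the Rahimi & Recht construction, assuming an RBF kernel k(x, x') = exp(−γ‖x − x'‖²); the dimension D=100 matches the expansion size quoted in the experiments, while γ is an assumed kernel width.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, D=100, gamma=1.0):
    """Map rows of X to D random Fourier features approximating an RBF kernel."""
    d = X.shape[1]
    # Frequencies from the kernel's spectral density (a Gaussian for RBF),
    # phases uniform on [0, 2*pi]. In an online setting, draw W and b once
    # and reuse them for every incoming point.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```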

  24. Outline • Introduction • Virtual Vector Machine • Experimental Results • Summary

  25. Estimation accuracy of the posterior mean Mean squared error of the estimated posterior mean obtained by EP, the virtual vector machine (VVM), ADF, and window-EP (W-EP). The exact posterior mean is obtained via a Monte Carlo method. The results are averaged over 20 runs.

  26. Online classification (1) Accumulative prediction error rates of VVM, the sparse online Gaussian process classifier (SOGP), the Passive-Aggressive (PA) algorithm, and the Topmoumoute online natural gradient (NG) algorithm on the Spambase dataset. The size of the virtual point set used by VVM is 30, while the online Gaussian process model has 143 basis points.

  27. Online nonlinear classification (2) Accumulative prediction error rates of VVM and competing methods on the Thyroid dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer to cache 10 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 12 and 91 basis points, respectively.

  28. Online nonlinear classification (3) Accumulative prediction error rates of VVM and the competing methods on the Ionosphere dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer to cache 30 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 279 and 189 basis points, respectively.

  29. Summary • Efficient Bayesian online classification • A small constant space cost • A smooth trade-off between prediction accuracy and computational cost • Improved prediction accuracy over alternative methods • Future work: more flexible functional forms for virtual points, and other applications
