
SVM + local PCA-based SV “blurring”: Exploiting invariance patterns for knowledge discovery using structured risk minimization
Michel Bera, KXEN Inc.; Emmanuel Viennet, University of Paris 13; Damien Weldon, Norkom Technologies


Presentation Transcript


  1. SVM + local PCA-based SV “blurring”: Exploiting invariance patterns for knowledge discovery using structured risk minimization. Michel Bera, KXEN Inc.; Emmanuel Viennet, University of Paris 13; Damien Weldon, Norkom Technologies. Introduction: The purpose of this poster is to show an example of a modelling process based on a sequence of multiple SRM-tuned treatments of a data set, demonstrating how a regularization scheme based on the local geometry of the data density improves a linear SVM for classification. Normally it is dangerous to put data through multiple SRM-based treatments, since fit is traded against robustness at each step: the first treatments very often regularize the data too much, leaving no more information to be learned in further treatments when building the final model. This is an open matter for discussion. The data used here is a random subset of 10,000 rows of the Adult Census set. The challenge is to build a classifier that determines who in a population earns more than $50,000 of annual income, given 14 attributes describing each profile (e.g. education, portfolio, race, country of origin, etc.).
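The poster itself uses KXEN tooling on a private 10,000-row extract; purely as a hypothetical illustration of the same task setup, the sketch below loads the public OpenML copy of the Adult data and defines the binary income target (the dataset name, version and label string are assumptions about that copy, not taken from the poster):

```python
# Hypothetical task setup using the public OpenML copy of the Adult data
# (the poster uses its own 10,000-row extract and KXEN tooling).
from sklearn.datasets import fetch_openml

adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data.iloc[:10000]                 # 14 profile attributes
y = adult.target.iloc[:10000] == ">50K"     # binary target: income above $50,000
print(X.shape, float(y.mean()))
```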

  2. Model build-up process: 4 steps. Note: for ease of interpretation, the accompanying sketch is in 2D, with 8 virtual points instead of 26 in 3D. STEP 1: encode in real values (first SRM). Encode the data from strings and nominal values to real values, using K2C, a data-preparation tool. STEP 2: first linear SVM. Run a first linear SVM to determine an initial set of support vectors for the problem; this is based on an idea of Bernhard Schoelkopf [1].
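A minimal sketch of STEPs 1-2, continuing from the loading sketch above. K2C is proprietary, so a plain one-hot/standardising encoder stands in for it (an assumption, not the authors' tool), a linear-kernel SVC is used so the initial support vectors can be read off, and the split sizes only roughly follow the poster's cut:

```python
# Hypothetical stand-in for STEPs 1-2 (continues from the loading sketch above).
# K2C is proprietary, so one-hot encoding + standardisation is assumed here.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

cat_cols = list(X.select_dtypes(include=["category", "object"]).columns)
num_cols = [c for c in X.columns if c not in cat_cols]
X = X.copy()
X[cat_cols] = X[cat_cols].astype(str)             # plain strings; NaN becomes "nan"

# STEP 1: encode strings/nominals into real values (stand-in for K2C)
encode = ColumnTransformer(
    [("num", StandardScaler(), num_cols),
     ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    sparse_threshold=0.0,                          # force a dense design matrix
)

# roughly the poster's cut: 5,941 estimation rows, the rest held out
X_est, X_hold, y_est, y_hold = train_test_split(X, y, train_size=5941, random_state=0)
Z_est = encode.fit_transform(X_est)

# STEP 2: first linear SVM; its support vectors seed the "blurring" step
svm1 = SVC(kernel="linear", C=1.0).fit(Z_est, y_est)
sv = Z_est[svm1.support_]
print("number of SVs:", len(sv))
```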

  3. STEP 3: PCA-based grid build-up, “blurring” each SV. 1) Estimate the local geometry of the data around each support vector, based on its nearest neighbours (we take the 50 closest SVs); this is very close to the Hastie and Tibshirani idea behind DANN [2]. We then compute a local PCA, keeping the first three factors and eigenvalues. 2) “Blur” each SV location by generating artificial “shaken” new points around it (see Fig. 1), on a small parallelepipedic grid. We use the first three factors/eigenvalues of each local PCA to define the orientation and stretch of the grid, so each SV gives birth to 26 new points (the 3 x 3 x 3 grid has 27 nodes; dropping the centre, which is the original SV, leaves 26). STEP 4: second linear SVM. Compute a second linear SVM on the new points, which gives the final classifier.
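The sketch below, continuing from the previous one, shows one possible reading of STEPs 3-4: a local PCA over the 50 nearest SVs, 26 “shaken” points per SV on a 3 x 3 x 3 grid (minus its centre) oriented and stretched by the first three factors/eigenvalues, then a second linear SVM. The grid step `eps` and the use of the square root of each eigenvalue as the stretch are assumptions; the poster does not give these constants:

```python
# One possible reading of STEPs 3-4 (continues from the sketch above).
# The grid step `eps` and the sqrt(eigenvalue) stretch are assumptions.
from itertools import product
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA
from sklearn.svm import SVC

eps = 0.5                                          # assumed half-width of the grid
knn = NearestNeighbors(n_neighbors=50).fit(sv)     # local geometry: 50 closest SVs
_, nbr_idx = knn.kneighbors(sv)

# the 3 x 3 x 3 grid minus its centre: 26 offsets per support vector
offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]

blurred, blurred_y = [], []
for i, x in enumerate(sv):
    pca = PCA(n_components=3).fit(sv[nbr_idx[i]])  # local PCA around SV i
    axes = pca.components_                         # first three factors (3 x d)
    stretch = np.sqrt(pca.explained_variance_)     # eigenvalue-based stretch (3,)
    label = y_est.iloc[svm1.support_[i]]           # blurred copies keep the SV's label
    for o in offsets:                              # 26 "shaken" points around SV i
        blurred.append(x + eps * (np.array(o) * stretch) @ axes)
        blurred_y.append(label)
blurred = np.asarray(blurred)

# STEP 4: second linear SVM on the estimation set with each original SV
# replaced by its 26 blurred copies (as described on the results slide)
keep = np.ones(len(Z_est), dtype=bool)
keep[svm1.support_] = False
Z2 = np.vstack([Z_est[keep], blurred])
y2 = np.concatenate([np.asarray(y_est)[keep], blurred_y])
svm2 = SVC(kernel="linear", C=1.0).fit(Z2, y2)
print("second SVM, number of SVs:", len(svm2.support_))
```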

  4. Results. “Adult Census” file set: first 10,000 lines (out of 50,000). Cut: estimation set 5,941; validation set 1,985; test set 2,074. KSVM (OrdLinear), standard: number of SVs 2,715; TestErrRate 17.6%; TestKi 0.758. KSVM, learned on the estimation set plus the “blurred” SVs from the local PCA (p = 3), with an estimation size of 26 x 2,715 = 70,590 points (we drop each original SV because it is surrounded by its “blurred” copies and would therefore never become an SV in the second SVM): number of SVs 12,876; TestErrRate 16.56% (a gain of more than one point); TestKi 0.780. Note: KSVM and K2C are two standard software components of the KXEN Analytic Framework™.

  5. Regularization and Lie Group interpretation. The underlying data probability density of the sample, with n attributes, can be considered as a manifold in R^n. If we take a point on its surface, say one of the SVs from the first SVM, and want to create “blur” by placing virtual new SVs around it, we will place them in the iso-density tangent hyperplane (orthogonal to the local density gradient), which is another manifold in R^n. By computing a local PCA on the nearest neighbours, we project this hyperplane into R^3, using for the projection the three PCA factors with the highest variance. As J. Friedman, T. Hastie and R. Tibshirani suggest in their book “The Elements of Statistical Learning” (2001), p. 145, a PCA is very similar to a generalized ridge-regression penalty approach (a ridge regression shrinks the directions with small eigenvalues, while a PCA bluntly kills them; see the worked comparison below). Vapnik (e.g. 1995, “The Nature of Statistical Learning Theory”, Springer) first demonstrated the relationship between the ridge-regression penalty and the resulting control of the VC dimension: the loop is closed. Now the 3-D grid parallelepiped is a very simple set of orbits from a subset of the group of transformations that leave this projection into R^3 invariant: so we have here all the usual ingredients of Lie group and Riemannian manifold theory (see for example Chevalley, C. (1946), “Theory of Lie Groups”, Princeton University Press), as quoted by C. Burges and B. Schoelkopf in various papers to describe concepts such as invariant distances. This will be detailed in a forthcoming paper.
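One standard way to make the ridge/PCA comparison cited above concrete (this is the textbook SVD argument, not a formula from the poster): writing the design matrix as X = U D V^T, the ridge fit shrinks each principal direction smoothly by d_j^2 / (d_j^2 + lambda), whereas keeping only the first M principal components sets the remaining directions to zero.

```latex
% Ridge regression vs. PCA truncation, via the SVD X = U D V^T
% (textbook result, cf. Hastie, Tibshirani & Friedman; not from the poster).
\[
  X\hat{\beta}^{\text{ridge}}
    = \sum_{j=1}^{n} u_j \,\frac{d_j^{2}}{d_j^{2} + \lambda}\, u_j^{\top} y ,
  \qquad
  X\hat{\beta}^{\text{pcr}}
    = \sum_{j=1}^{M} u_j \, u_j^{\top} y .
\]
% Ridge down-weights directions with small singular values d_j;
% principal-components regression keeps the first M directions and "kills" the rest.
```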

  6. References
  • [1] The “virtual SV” methodology is an idea from Bernhard Schoelkopf: http://www.kyb.tuebingen.mpg.de/~bs/
  • Chapelle, O. and Schoelkopf, B. (2001). Incorporating Invariances in Nonlinear Support Vector Machines. Technical report.
  • Fernandez, R. and Viennet, E. (1999). Face Identification using Support Vector Machines. ESANN'99, European Symposium on Artificial Neural Networks.
  • [2] Hastie, T. and Tibshirani, R. (1996). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 18: 607-616.
  • Schoelkopf, B. (1997). Support Vector Learning. München: R. Oldenbourg Verlag. Doktorarbeit, TU Berlin.
  • Generating new samples from an image, or using the invariant transformation group of a picture (stretching, rotating, translating, etc.) to improve learning efficiency, is an “old” 1993 idea (see Burges, C.); see also:
  • Simard, P., LeCun, Y. and Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. Cowan and L. Giles, eds, Advances in Neural Information Processing 5, Morgan Kaufmann Publishers, San Mateo, CA.
  Acknowledgements: This work profited greatly from discussions with Léon Bottou, Françoise Fogelman, Yann LeCun, Erik Marcade, Gilbert Saporta, Bernhard Schoelkopf and Vladimir Vapnik.
