Giorgio Valentini

Random aggregated and bagged ensembles of SVMs: an empirical bias-variance analysis
Giorgio Valentini
e-mail: valentini@dsi.unimi.it
DSI – Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano

Goals
• Developing methods and procedures to estimate the bias-variance decomposition of the error in ensembles of learning machines.
• A quantitative evaluation of the variance reduction property in random aggregated and bagged ensembles (Breiman, 1996).
• A characterization of the bias-variance (BV) decomposition of the error in bagged and random aggregated ensembles of SVMs, comparing the results with the BV decomposition in single SVMs (Valentini and Dietterich, 2004).
• Getting insights into the reasons why the ensemble method Lobag (Valentini and Dietterich, 2003) works.
• Getting insights into the reasons why random subsampling techniques work with large data mining problems (Breiman, 1999; Chawla et al., 2002).

Random aggregated ensembles
Let $D = \{(x_j, t_j)\}$, $1 \le j \le m$, be a set of $m$ samples drawn independently and identically from a population $U$ according to $P$, where $P(x, t)$ is the joint distribution of the data points in $U$. Let $L$ be a learning algorithm, and define $f_D = L(D)$ as the predictor produced by $L$ applied to a training set $D$. The model produces a prediction $f_D(x) = y$. Suppose that a sequence of learning sets $\{D_k\}$ is given, each drawn i.i.d. from the same underlying distribution $P$. Breiman proposed to aggregate the predictors $f_D$ trained on different samples drawn from $U$ to get a better predictor $f_A(x, P)$. For classification problems $t_j \in S \subset \mathbb{N}$, and $f_A(x, P) = \arg\max_j |\{k \mid f_{D_k}(x) = j\}|$. As the training sets $D$ are randomly drawn from $U$, we name the procedure to build $f_A$ random aggregating.
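The majority-vote rule for classification can be illustrated with a minimal sketch. The function name `aggregate_vote` and the toy threshold classifiers are hypothetical stand-ins for predictors trained on different samples; they are not part of the original work.

```python
from collections import Counter

def aggregate_vote(predictors, x):
    """Return the label predicted most often for x by the base predictors."""
    votes = Counter(f(x) for f in predictors)
    # most_common(1) yields [(label, count)]; ties go to the label seen first
    return votes.most_common(1)[0][0]

# Hypothetical base predictors: three threshold rules that disagree near 0.5,
# standing in for predictors trained on different learning sets D_k
predictors = [
    lambda x: int(x > 0.4),
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.6),
]

label_high = aggregate_vote(predictors, 0.55)  # votes are 1, 1, 0
label_low = aggregate_vote(predictors, 0.45)   # votes are 1, 0, 0
```

With an odd number of two-class predictors no ties can occur, which is one reason odd ensemble sizes are convenient in this setting.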

Random aggregation reduces variance
Considering regression problems, if $T$ and $X$ are random variables having joint distribution $P$, the expected squared loss $EL$ for the single predictor $f_D(X)$ is:
$EL = E_D[E_{T,X}[(T - f_D(X))^2]]$
while the expected squared loss $EL_A$ for the aggregated predictor is:
$EL_A = E_{T,X}[(T - f_A(X))^2]$
Breiman showed that $EL \ge EL_A$. This inequality depends on the instability of the predictions, that is, on how unequal the two sides of the following relation are:
$E_D[f_D(X)]^2 \le E_D[f_D^2(X)]$
There is a strict relationship between the instability and the variance of the base predictor. Indeed, the variance $V(X)$ of the base predictor is:
$V(X) = E_D[(f_D(X) - E_D[f_D(X)])^2] = E_D[f_D^2(X)] - E_D[f_D(X)]^2$
Breiman also showed that in classification problems, as in regression, aggregating "good" predictors can lead to better performance, as long as the base predictor is unstable, whereas, unlike in regression, aggregating poor predictors can lower performance.
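The inequality between the two losses can be derived in a few lines. The derivation assumes, as in Breiman (1996), that the aggregated regression predictor is the average over training sets, $f_A(X) = E_D[f_D(X)]$ (a definition not stated on the slide), and that $D$ is independent of $(T, X)$:

```latex
\begin{align*}
EL &= E_{T,X}[T^2] - 2\,E_{T,X}\big[T\,E_D[f_D(X)]\big] + E_{T,X}\big[E_D[f_D^2(X)]\big] \\
   &\ge E_{T,X}[T^2] - 2\,E_{T,X}[T\,f_A(X)] + E_{T,X}[f_A(X)^2] = EL_A, \\
EL - EL_A &= E_X\big[E_D[f_D^2(X)] - E_D[f_D(X)]^2\big] = E_X[V(X)] \ge 0 .
\end{align*}
```

Under these assumptions the gap between the two losses is exactly the variance of the base predictor averaged over $X$, so a stable predictor (small $V(X)$) gains little from aggregation.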

How much does the variance reduction property hold for bagging too?
Breiman showed theoretically that random aggregating reduces variance. Bagging is only an approximation of random aggregating, for at least two reasons:
• Bootstrap samples are not "real" data samples: they are drawn from a data set $D$, which is in turn a sample from the population $U$. By contrast, $f_A$ uses samples drawn directly from $U$.
• Bootstrap samples are drawn from $D$ according to a uniform probability distribution, which is only an approximation of the unknown true distribution $P$.
This raises two questions:
• Does the variance reduction property hold for bagging too?
• Can we provide a quantitative estimate of variance reduction in both random aggregating and bagging?

A quantitative estimate of bias-variance decomposition of the error in random aggregated (RA) and bagged ensembles of learning machines
• We developed procedures to quantitatively evaluate the bias-variance decomposition of the error according to Domingos' unified bias-variance theory (Domingos, 2000).
• We proposed three basic techniques (Valentini, 2003):
  • Out-of-bag, or cross-validation estimate (when only small samples are available)
  • Hold-out techniques (when relatively large data sets are available)
• To get a reliable estimate of the error, we applied the second technique, evaluating the bias-variance decomposition on quite large test sets.
• We summarize here the two main experimental steps to perform bias-variance analysis with resampling-based ensembles:
  • Procedures to generate data for ensemble training
  • Bias-variance decomposition of the error on a separate test set

Procedure to generate training samples for bagged ensembles
Procedure to generate training samples for random aggregated ensembles
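The difference between the two sampling schemes can be sketched as follows. This is a minimal illustration, not the original procedure: the function names, the parameters `n` (number of training-set groups), `m` (training sets per group) and `s` (training-set size), and the use of a finite list as a stand-in for the population $U$ are all assumptions.

```python
import random

def bagging_samples(population, n, m, s, rng):
    """n groups of m bootstrap replicates; each group resamples one data set D_i."""
    groups = []
    for _ in range(n):
        d_i = rng.sample(population, s)  # D_i: a single sample of size s from U
        # Bootstrap replicates: drawn uniformly with replacement from D_i, not from U
        groups.append([rng.choices(d_i, k=s) for _ in range(m)])
    return groups

def random_aggregating_samples(population, n, m, s, rng):
    """n groups of m training sets, each of size s, drawn directly from U."""
    return [[rng.sample(population, s) for _ in range(m)] for _ in range(n)]

# Assumption: a large finite pool approximates i.i.d. draws from P
population = list(range(1000))
rng = random.Random(0)
bag = bagging_samples(population, n=2, m=3, s=10, rng=rng)
ra = random_aggregating_samples(population, n=2, m=3, s=10, rng=rng)
```

In the bagged case all `m` replicates of a group share at most `s` distinct examples, while in the random aggregated case each training set brings fresh examples from the population. This is the first of the two approximations that separate bagging from random aggregating.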

Procedure to estimate the bias-variance decomposition of the error in ensembles of learning machines
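A minimal sketch of such an estimate for the 0/1 loss follows, using the quantities of Domingos' decomposition (main prediction, bias, unbiased and biased variance). It assumes two classes and noise-free test labels. The function name and data layout are hypothetical: `predictions[i][k]` is the label assigned to test point `i` by the `k`-th ensemble, each ensemble being trained on a different group of training sets.

```python
from collections import Counter

def bias_variance_01(predictions, targets):
    """Estimate the 0/1-loss decomposition on a test set (two classes, no noise)."""
    n_points = len(targets)
    bias = unbiased_var = biased_var = 0.0
    for preds, t in zip(predictions, targets):
        # Main prediction: the label predicted most often across the models
        main = Counter(preds).most_common(1)[0][0]
        # Variance at this point: fraction of models disagreeing with the main prediction
        var = sum(p != main for p in preds) / len(preds)
        if main == t:
            unbiased_var += var   # unbiased point: variance increases the error
        else:
            bias += 1.0           # biased point: the main prediction is wrong
            biased_var += var     # here variance decreases the error
    bias /= n_points
    unbiased_var /= n_points
    biased_var /= n_points
    net_var = unbiased_var - biased_var
    return {
        "bias": bias,
        "unbiased_variance": unbiased_var,
        "biased_variance": biased_var,
        "net_variance": net_var,
        "error": bias + net_var,
    }

# Toy data: 3 test points, 4 models each
predictions = [[1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 0, 0]]
targets = [1, 0, 0]
result = bias_variance_01(predictions, targets)
```

In the two-class, noise-free case the sum of bias and net variance equals the average error rate of the individual models, which gives a simple consistency check on the estimate.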

Comparison of bias-variance decomposition of the error in random aggregated (RA) and bagged ensembles of SVMs on 7 two-class classification problems
[Figure panels: Gaussian kernels; Linear kernels]
• Results represent changes relative to single SVMs (e.g. zero change means no difference). Lines labeled with squares refer to random aggregated ensembles, lines labeled with triangles to bagged ensembles.
• In random aggregated ensembles the error decreases by 15 to 70% w.r.t. single SVMs, while in bagged ensembles the error decreases by 0 to 15%, depending on the data set.
• Variance is significantly reduced in RA ensembles (about 90%), while in bagging the variance reduction is quite limited compared with the RA decrement (between 0 and 35%). No substantial bias reduction is registered.

Characterization of bias-variance decomposition of the error in random aggregated ensembles of SVMs (Gaussian kernel)
• Lines labeled with crosses: single SVMs
• Lines labeled with triangles: RA SVM ensembles

Lobag works when unbiased variance is relatively high
• Lobag (Low bias bagging) is a variant of bagging that uses low-bias base learners selected through bias-variance analysis procedures (Valentini and Dietterich, 2003).
• Our experiments with bagging show why Lobag works: bagging lowers variance, but the bias remains substantially unchanged. Hence, by selecting low-bias base learners, Lobag reduces both bias (through bias-variance analysis) and variance (through classical aggregation techniques).
• Valentini and Dietterich showed experimentally that Lobag is effective, with SVMs as base learners, when small samples are used, that is, when the variance due to the reduced cardinality of the available data is relatively high. With relatively large data sets, however, we may expect that Lobag does not outperform bagging, because in this case the unbiased variance will, on average, be relatively low.

Why do random subsampling techniques work with large databases?
• Breiman proposed random subsampling techniques for classification in large databases, using decision trees as base learners (Breiman, 1999), and these techniques have also been successfully applied in distributed environments (Chawla et al., 2002).
• Random aggregating can also be interpreted as a technique that draws small subsamples from a large population to train the base learners and then aggregates them, e.g. by majority voting.
• Our experiments on random aggregated ensembles show that the variance component of the error is strongly reduced, while the bias remains unchanged or is lowered, which gives insight into why random subsampling techniques work with large data mining problems. In particular, our experimental analysis suggests applying SVMs trained on small subsamples when large databases are available or when they are fragmented across distributed systems.

Conclusions
• We showed how to apply bias-variance decomposition techniques to the analysis of bagged and random aggregated ensembles of learning machines.
• These techniques have been applied to the analysis of bagged and random aggregated ensembles of SVMs, but can be directly applied to a large set of ensemble methods.*
• The experimental analysis shows that random aggregated ensembles significantly reduce the variance component of the error w.r.t. single SVMs, but this property only partially holds for bagged ensembles.
• The empirical bias-variance analysis also gives insights into the reasons why Lobag works, while highlighting some limitations of the Lobag approach.
• The bias-variance analysis of random aggregated ensembles also highlights the reasons for their successful application to large-scale data mining problems.
* The C++ classes and applications to perform BV analysis are freely available at: http://homes.dsi.unimi.it/~valenti/sw/NEURObjects

References
• Breiman, L.: Bagging predictors. Machine Learning 24 (1996) 123-140
• Breiman, L.: Pasting Small Votes for Classification in Large Databases and On-Line. Machine Learning 36 (1999) 85-103
• Chawla, N., Hall, L., Bowyer, K., Moore, T., Kegelmeyer, W.: Distributed pasting of small votes. In: MCS2002, Cagliari, Italy. Vol. 2364 of Lecture Notes in Computer Science, Springer-Verlag (2002) 52-61
• Domingos, P.: A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. In: Proc. 17th National Conference on Artificial Intelligence, Austin, TX, AAAI Press (2000) 564-569
• Valentini, G., Dietterich, T.G.: Low Bias Bagged Support Vector Machines. In: ICML 2003, Washington D.C., USA, AAAI Press (2003) 752-759
• Valentini, G.: Ensemble methods based on bias-variance analysis. PhD thesis, DISI, Università di Genova, Italy (2003). ftp://ftp.disi.unige.it/person/ValentiniG/Tesi/finalversion/vale-th-2003-04.pdf
• Valentini, G., Dietterich, T.G.: Bias-variance analysis of Support Vector Machines for the development of SVM-based ensemble methods. Journal of Machine Learning Research (2004) (accepted for publication)
