Chapter 9

Presentation Transcript

Slide 1:Chapter 9. Given a range of classification algorithms, which is the best? Some algorithms may be preferred because of their low complexity, their ability to incorporate prior knowledge, and so on. Principle of Occam’s Razor: given two classifiers that perform equally well on the training set, it is asserted that the simpler classifier may do better on the test set. This chapter focuses on mathematical foundations that do not depend on a particular classifier or learning algorithm: the bias and variance dilemma; ensembles of classifiers (classifier combination); cross validation; resampling.

Slide 2:No Free Lunch Theorem. Suppose we make no prior assumptions about the nature of the classification task. Can we expect any classification method to be superior or inferior overall? The No Free Lunch Theorem answers this question: no. If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one algorithm over others. If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to that particular pattern recognition problem. For a new classification problem, what matters most is prior information, the data distribution, the size of the training set, and the cost function.

Slide 3:No Free Lunch Theorem. It is the assumptions about the learning algorithm that are important. Even popular algorithms will perform poorly on some problems, where the learning algorithm and the data distribution do not match well. In practice, experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.

Slide 4:Ugly Duckling Theorem. In the absence of assumptions, there is no best feature representation. The similarity between patterns is fundamentally based on implicit assumptions about the problem domain. Consider the example where features x and y represent blind_in_right_eye and blind_in_left_eye, respectively. If we base similarity on shared features, person P1 = {1, 0} (blind only in the right eye) is maximally different from person P2 = {0, 1} (blind only in the left eye). In this scheme P1 is more similar to a totally blind person and to a normally sighted person than he is to P2! There may be situations where we want P1 to be more similar to P2; for example, such persons may be able to drive an automobile.
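The two-eye example can be checked in a few lines. This is a minimal sketch that assumes "similarity based on shared features" means counting the feature positions on which two people agree; the helper names `shared_feature_similarity` and `sighted_eyes` are made up for illustration and do not come from the slides.

```python
def shared_feature_similarity(a, b):
    """Count the feature positions at which two binary tuples agree (assumed measure)."""
    return sum(1 for u, v in zip(a, b) if u == v)


def sighted_eyes(person):
    """Alternative representation (hypothetical): number of working eyes."""
    return 2 - sum(person)


# Features are (blind_in_right_eye, blind_in_left_eye).
people = {
    "P1": (1, 0),
    "P2": (0, 1),
    "totally_blind": (1, 1),
    "normally_sighted": (0, 0),
}

# Under the shared-feature measure, P1 agrees with P2 on no feature at all,
# but agrees with the blind and the sighted person on one feature each.
for name, features in people.items():
    print(name, shared_feature_similarity(people["P1"], features))

# Under the "number of working eyes" representation, P1 and P2 are identical.
print(sighted_eyes(people["P1"]) == sighted_eyes(people["P2"]))
```

Which representation is "right" depends on the task, which is the point of the slide: the driving example favors the second one.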

Slide 5:Bias and Variance. There is no “best classifier” in general, hence the necessity of exploring a variety of methods. How do we evaluate whether the learning algorithm “matches” the classification problem? Bias measures the quality of the match: high bias implies a poor match. Variance measures the specificity of the match: high variance implies a weak match. Bias and variance are not independent of each other.

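Slide 6:Bias and Variance. Given a true function F(x), we estimate a function g(x; D) from a training set D. The training set D is a random variable, so each training set gives its own estimate and its own error in the fit. Taking the average over all training sets of size n, the MSE is $E_D[(g(x; D) - F(x))^2] = (E_D[g(x; D) - F(x)])^2 + E_D[(g(x; D) - E_D[g(x; D)])^2]$, where the first term is the squared bias and the second term is the variance.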

Slide 7:Bias-Variance Dilemma in Regression

Slide 8:Bias Variance Dilemma. Procedures with increased flexibility to adapt to the training data (a large number of parameters) fit the data well and have low bias, but high variance. Inflexible procedures (fewer parameters) may not fit the data well: they have high bias, but low variance. A large amount of training data generally helps improve the performance of estimation if the model is sufficiently general to represent the target function. Bias/variance considerations recommend that we gather as much prior information about the problem as possible to find the best match for the classifier, and as large a dataset as possible to reduce the variance.
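The trade-off can be made concrete with a small simulation. The sketch below is an illustration under assumed settings that are not in the slides: a target F(x) = x^2, Gaussian noise, a fixed set of ten input points, and two deliberately extreme estimators, a constant fit (inflexible) and a fit that simply memorizes the noisy targets (maximally flexible). It estimates squared bias and variance by averaging over many training sets D.

```python
import random


def true_f(x):
    """Assumed target function F(x); chosen only for illustration."""
    return x * x


def bias_variance(fit, xs, noise_sd=0.2, n_sets=2000, seed=0):
    """Estimate (squared bias, variance) of `fit`, averaged over the points xs.

    `fit(xs, ys)` returns the predictions g(x; D) for every x in xs, where the
    training set D consists of the noisy targets ys.
    """
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        preds.append(fit(xs, ys))
    bias2 = var = 0.0
    for i, x in enumerate(xs):
        column = [p[i] for p in preds]
        mean_g = sum(column) / n_sets            # E_D[g(x; D)]
        bias2 += (mean_g - true_f(x)) ** 2       # (E_D[g] - F)^2
        var += sum((g - mean_g) ** 2 for g in column) / n_sets
    return bias2 / len(xs), var / len(xs)


def constant_fit(xs, ys):
    """Inflexible model: one parameter, the mean of the targets."""
    mean_y = sum(ys) / len(ys)
    return [mean_y] * len(xs)


def memorize_fit(xs, ys):
    """Maximally flexible model: one parameter per training point."""
    return list(ys)


xs = [i / 9 for i in range(10)]
print("constant fit (bias^2, variance):", bias_variance(constant_fit, xs))
print("memorize fit (bias^2, variance):", bias_variance(memorize_fit, xs))
```

The constant fit should show the larger bias and the smaller variance, and the memorizing fit the reverse.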

Slide 9:Bias and Variance for Classification. Low variance is more important for accurate classification than low boundary bias. Classifiers with large flexibility to adapt to the training data (more free parameters) tend to have low bias but high variance. Example: a 2-class problem with 2D Gaussian distributions with diagonal covariances, where a small number of training samples is used to estimate the parameters of three different models. For the best classification given a small training set, we need to match the model to the true distributions.

Slide 10:Ensemble-based Systems in Decision Making. R. Polikar, “Ensemble Based Systems in Decision Making,” IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21-45, 2006. R. Polikar, “Bootstrap Inspired Techniques in Computational Intelligence,” IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 56-72, 2007. R. Polikar, “Ensemble Learning,” Scholarpedia, 2008.

Slide 11:Ensemble-based Classifiers. Ensemble based systems provide favorable results compared to single-expert systems for a broad range of applications and under a variety of scenarios. Two questions arise: (i) how to generate the individual components of the ensemble system (the base classifiers), and (ii) how to combine the outputs of the individual classifiers? Popular ensemble based algorithms: bagging, boosting, AdaBoost, stacked generalization, and hierarchical mixture of experts. Commonly used combination rules: algebraic combination of outputs, voting methods, behavior knowledge space, and decision templates.

Slide 12:Why Ensemble Based Systems? Statistical reasons: a set of classifiers with similar training performances may have different generalization performances; combining the outputs of several classifiers reduces the risk of selecting a poorly performing classifier. Large volumes of data: if the amount of data to be analyzed is too large, a single classifier may not be able to handle it; train different classifiers on different partitions of the data. Too little data: ensemble systems can also be used when there is too little data, by means of resampling techniques.

Slide 13:Why Ensemble Based Systems? Divide and conquer: divide the data space into smaller and easier-to-learn partitions; each classifier learns only one of the simpler partitions.

Slide 14:Why Ensemble Based Systems? Data fusion: given several sets of data from various sources, where the nature of the features is different (heterogeneous features), training a single classifier may not be appropriate (e.g., MRI data, EEG recordings, blood tests). Applications in which data from different sources are combined are called data fusion applications, and ensembles have successfully been used for fusion. All ensemble systems must have two key components: a method for generating the component classifiers of the ensemble, and a method for combining the classifier outputs.

Slide 15:Brief History of Ensemble Systems. Dasarathy and Sheela (1979) partitioned the feature space using two or more classifiers. Schapire (1990) proved that a strong classifier can be generated by combining weak classifiers through boosting, the predecessor of the AdaBoost algorithm. There are two types of combination. Classifier selection: each classifier is trained to become an expert in some local area of the feature space; one or more local experts can be nominated to make the decision. Classifier fusion: all classifiers are trained over the entire feature space; fusion involves merging the individual (weaker) classifiers to obtain a single (stronger) expert of superior performance.

Slide 16:Diversity of Ensemble. Objective: create many classifiers, and combine their outputs to improve upon the performance of a single classifier. Intuition: if each classifier makes different errors, then their strategic combination can reduce the total error! We need base classifiers whose decision boundaries are adequately different from those of the others; such a set of classifiers is said to be diverse. How to achieve classifier diversity? Use different training sets to train the individual classifiers. How to obtain different training sets? Use resampling techniques such as bootstrapping or bagging, where training subsets are drawn randomly, usually with replacement, from the entire training set.

Slide 17:Sampling with Replacement. Random and overlapping training sets are used to train three classifiers, which are then combined to obtain a more accurate classification.

Slide 18:Sampling without Replacement. Jackknife or k-fold data split: the entire dataset is split into k blocks, and each classifier is trained on a different subset of (k-1) blocks.
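The two resampling schemes on the preceding slides can be sketched as follows. The function names are illustrative, and the way blocks are formed (every k-th item) is an assumption; any partition into k disjoint blocks would serve.

```python
import random


def bootstrap_sample(data, rng):
    """Sampling with replacement: draw len(data) items, duplicates allowed."""
    return [rng.choice(data) for _ in data]


def k_fold_training_subsets(data, k):
    """Sampling without replacement: split data into k blocks and return
    k training subsets, each made of a different set of (k - 1) blocks."""
    blocks = [data[i::k] for i in range(k)]
    subsets = []
    for held_out in range(k):
        subset = [item
                  for j, block in enumerate(blocks) if j != held_out
                  for item in block]
        subsets.append(subset)
    return subsets


data = list(range(10))
rng = random.Random(0)
print(bootstrap_sample(data, rng))          # overlapping, random replica
for subset in k_fold_training_subsets(data, 5):
    print(subset)                           # each omits one block of 2 items
```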

Slide 19:Other Approaches to Achieve Diversity. Use different training parameters for a classifier: a series of MLPs can be trained using different weight initializations, numbers of layers/nodes, etc. Adjusting these parameters controls the instability of such classifiers (local minima). A similar strategy can be used to generate different decision trees for the same problem. Different types of classifiers (MLPs, decision trees, NN classifiers, SVMs) can be combined for added diversity. Diversity can also be achieved by using random feature subsets, which is called the random subspace method.

Slide 20:Creating An Ensemble. Two questions: How will the individual classifiers be generated? How will they differ from each other? The answers determine the diversity of the classifiers and the fusion performance. We seek to improve ensemble diversity by heuristic methods.

Slide 21:Bagging. Bagging, short for bootstrap aggregating, is one of the earliest ensemble based algorithms. It is also one of the most intuitive and simplest to implement, with surprisingly good performance. It uses bootstrapped replicas of the training data: a large number (say 200) of training subsets are randomly drawn, with replacement, from the entire training data. Each resampled training set is used to train a different classifier of the same type. The individual classifiers are combined by taking a majority vote of their decisions. Bagging is appealing for small training sets, since a relatively large portion of the samples is included in each subset.
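The bagging procedure described above can be sketched as follows. The base classifier here, a one-dimensional threshold "stump" on labels 0/1, is an assumption chosen to keep the example short; the slides do not prescribe a base learner, and all function names are illustrative.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a 1-D threshold classifier to (x, label) pairs with labels 0/1.

    Returns (threshold, label_above): predict label_above when x >= threshold
    and the other label otherwise. Chosen to minimize training errors.
    """
    best = None
    for thr in sorted({x for x, _ in samples}):
        for label_above in (0, 1):
            errors = sum(
                1 for x, y in samples
                if (label_above if x >= thr else 1 - label_above) != y
            )
            if best is None or errors < best[0]:
                best = (errors, thr, label_above)
    return best[1], best[2]


def stump_predict(stump, x):
    thr, label_above = stump
    return label_above if x >= thr else 1 - label_above


def bagging(samples, n_classifiers, rng):
    """Train one stump per bootstrapped replica of the training data."""
    ensemble = []
    for _ in range(n_classifiers):
        replica = [rng.choice(samples) for _ in samples]  # with replacement
        ensemble.append(train_stump(replica))
    return ensemble


def bagging_predict(ensemble, x):
    """Combine the individual decisions by majority vote."""
    votes = Counter(stump_predict(stump, x) for stump in ensemble)
    return votes.most_common(1)[0][0]


# Toy data: class 0 for x in 0..9, class 1 for x in 10..19.
samples = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
ensemble = bagging(samples, n_classifiers=25, rng=random.Random(0))
print(bagging_predict(ensemble, 3), bagging_predict(ensemble, 16))
```

An odd number of classifiers is used so that a two-class majority vote cannot tie.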

Slide 22:Bagging

Slide 23:Variations of Bagging. Random forests are so called because they are constructed from decision trees. A random forest is created from individual decision trees whose training parameters vary randomly. Such parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets, as in random subspace methods.

Slide 24:Boosting. Boosting raises the performance of a weak learner to the level of a strong one. Boosting creates an ensemble of classifiers by resampling the data; the classifiers are combined by majority voting. The resampling is strategically geared to provide the most informative training data for each consecutive classifier. Boosting creates three weak classifiers. The first classifier C1 is trained with a random subset of the available training data. The training set for the second classifier C2 is chosen as the most informative subset given C1: half of the training data for C2 is correctly classified by C1, and the other half is misclassified by C1. The third classifier C3 is trained on the instances on which C1 and C2 disagree.

Slide 25:Boosting

Slide 26:AdaBoost. AdaBoost (1997) is a more general version of the boosting algorithm; AdaBoost.M1 can handle multiclass problems. AdaBoost generates a set of hypotheses (classifiers), and combines them through weighted majority voting of the classes predicted by the individual hypotheses. The hypotheses are generated by training a weak classifier on samples drawn from an iteratively updated distribution of the training set. This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier. Consecutive classifiers are therefore trained on increasingly hard-to-classify samples.

Slide 27:AdaBoost. AdaBoost maintains a weight distribution $D_t(i)$ on the training instances $x_i$, $i = 1, \ldots, N$, from which the training data subsets $S_t$ are chosen for each consecutive classifier (hypothesis) $h_t$. From the error $\varepsilon_t$ of $h_t$ (the sum of the distribution weights of the instances it misclassifies), a normalized error $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$ is then obtained, such that for $0 < \varepsilon_t < 1/2$ we have $0 < \beta_t < 1$. Distribution update rule: the distribution weights of those instances that are correctly classified by the current hypothesis are reduced by a factor of $\beta_t$, whereas the weights of the misclassified instances are unchanged. Once the weights are renormalized to sum to 1, this raises the weights of the instances misclassified by $h_t$ and lowers the weights of the correctly classified instances, so AdaBoost focuses on increasingly difficult instances. Once training is complete, AdaBoost is ready for classifying unlabeled test instances. Unlike bagging or boosting, AdaBoost uses weighted majority voting: $1/\beta_t$ is a measure of the performance of the $t$-th hypothesis and can be used to weight the classifiers.
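The update rule above can be sketched in a few lines. This is an illustrative sketch under several assumptions that are not in the slides: the weak learner is a one-dimensional threshold stump on labels 0/1; the stump is fitted by minimizing the weighted error directly instead of drawing a subset S_t from D_t; and each vote is weighted by log(1/beta_t), the weight used in the standard AdaBoost.M1 formulation.

```python
import math


def stump_predict(thr, label_above, x):
    """Predict label_above when x >= thr, and the other label otherwise."""
    return label_above if x >= thr else 1 - label_above


def train_weighted_stump(xs, ys, weights):
    """Return (weighted_error, threshold, label_above) with minimal error."""
    best = None
    for thr in sorted(set(xs)):
        for label_above in (0, 1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(thr, label_above, x) != y)
            if best is None or err < best[0]:
                best = (err, thr, label_above)
    return best


def adaboost_m1(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                  # D_1(i): uniform distribution
    ensemble = []                            # (beta_t, threshold, label_above)
    for _ in range(n_rounds):
        eps, thr, label_above = train_weighted_stump(xs, ys, weights)
        if eps >= 0.5:                       # weak-learning condition violated
            break
        eps = max(eps, 1e-10)                # guard: keep beta_t above zero
        beta = eps / (1.0 - eps)             # normalized error, 0 < beta < 1
        # Reduce the weights of correctly classified instances by beta_t;
        # misclassified instances keep their weight until renormalization.
        weights = [w * beta if stump_predict(thr, label_above, x) == y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((beta, thr, label_above))
    return ensemble


def adaboost_predict(ensemble, x, classes=(0, 1)):
    """Weighted majority vote; hypothesis t votes with weight log(1/beta_t)."""
    votes = {c: 0.0 for c in classes}
    for beta, thr, label_above in ensemble:
        votes[stump_predict(thr, label_above, x)] += math.log(1.0 / beta)
    return max(votes, key=votes.get)


# Toy data that no single stump can classify: class 1 only in the middle.
xs = list(range(10))
ys = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
ensemble = adaboost_m1(xs, ys, n_rounds=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

On this toy set the best single stump misclassifies three of the ten points, while the weighted vote of three stumps reproduces all ten labels.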

Slide 29:AdaBoost.M1. The AdaBoost algorithm is sequential: classifier $C_{K-1}$ is created before classifier $C_K$.

Slide 30:Boosting

Slide 32:Performance of AdaBoost. In most practical cases, the ensemble error decreases very rapidly in the first few iterations, and approaches zero or stabilizes as new classifiers are added. AdaBoost does not seem to be affected by overfitting, which is explained by margin theory.

Slide 33:Stacked Generalization. An ensemble of classifiers is first created, whose outputs are used as inputs to a second-level meta-classifier that learns the mapping between the ensemble outputs and the actual correct classes. Classifiers $C_1, \ldots, C_T$ are trained using training parameters $\theta_1$ through $\theta_T$ to output hypotheses $h_1$ through $h_T$. The outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second-level classifier, $C_{T+1}$.

Slide 34:Mixture-of-Experts. A conceptually similar technique is the mixture-of-experts model, where a set of classifiers $C_1, \ldots, C_T$ constitutes the ensemble, followed by a second-level classifier $C_{T+1}$ used for assigning weights for the subsequent combiner. The combiner itself is usually not a classifier, but rather a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all. The weight distribution used for the combiner is determined by a second-level classifier, usually a neural network, called the gating network. The inputs to the gating network are the actual training data instances themselves (unlike stacked generalization, where they are the outputs of the first-level classifiers). Mixture-of-experts can, therefore, be seen as a classifier selection algorithm. Individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance x.

Slide 35:Mixture of Experts. The pooling system may use the weights in several different ways. It may choose a single classifier with the highest weight, or calculate a weighted sum of the classifier outputs for each class and pick the class that receives the highest weighted sum.
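The two pooling options can be sketched as follows. The gating weights and expert outputs are made-up numbers; in a real mixture of experts the weights would come from the gating network for the instance at hand, which is not modeled here.

```python
def winner_takes_all(weights, expert_outputs):
    """Use only the expert with the highest gating weight."""
    best_expert = max(range(len(weights)), key=lambda t: weights[t])
    outputs = expert_outputs[best_expert]
    return max(range(len(outputs)), key=lambda j: outputs[j])


def weighted_sum_decision(weights, expert_outputs):
    """Pick the class with the highest weighted sum of expert outputs."""
    n_classes = len(expert_outputs[0])
    totals = [
        sum(w * out[j] for w, out in zip(weights, expert_outputs))
        for j in range(n_classes)
    ]
    return max(range(n_classes), key=lambda j: totals[j])


# Hypothetical gating weights for one instance x, and per-class outputs
# of three experts (rows) for two classes (columns).
gating_weights = [0.5, 0.3, 0.2]
expert_outputs = [
    [0.55, 0.45],
    [0.10, 0.90],
    [0.20, 0.80],
]
print(winner_takes_all(gating_weights, expert_outputs))       # class index 0
print(weighted_sum_decision(gating_weights, expert_outputs))  # class index 1
```

The example is chosen so that the two rules disagree: the top-weighted expert narrowly prefers class 0, while the weighted sum over all experts favors class 1.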

Slide 36:Combining Classifiers. How to combine classifiers? Combination rules are grouped as (i) trainable vs. non-trainable, and (ii) rules for class labels vs. rules for class-specific continuous outputs. Trainable rules: the parameters of the combiner, called weights, are determined through a separate training algorithm. Weights from trainable rules are usually instance specific, and hence these rules are also called dynamic combination rules. Non-trainable rules: the combination parameters are available as the classifiers are generated; weighted majority voting is an example. Combination rules that apply to class labels need only the classification decision (that is, one of $\omega_j$, $j = 1, \ldots, C$). Other rules need the continuous-valued outputs of the individual classifiers.

Slide 37:Combining Class Labels. Assume that only class labels are available from the classifier outputs. Define the decision of the t-th classifier as $d_{t,j} \in \{0, 1\}$, $t = 1, \ldots, T$ and $j = 1, \ldots, C$, where T is the number of classifiers and C is the number of classes. If the t-th classifier chooses class $\omega_j$, then $d_{t,j} = 1$, and 0 otherwise. Majority voting: choose class $\omega_J$ if $\sum_{t=1}^{T} d_{t,J} = \max_{j} \sum_{t=1}^{T} d_{t,j}$. Weighted majority voting: with a weight $w_t$ for the t-th classifier, choose class $\omega_J$ if $\sum_{t=1}^{T} w_t d_{t,J} = \max_{j} \sum_{t=1}^{T} w_t d_{t,j}$. Behavior Knowledge Space (BKS): uses a lookup table. Borda count: each voter (classifier) rank orders the candidates (classes). If there are N candidates, the first-place candidate receives N - 1 votes, the second-place candidate receives N - 2, and the candidate in i-th place receives N - i votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision.
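Three of the label-based rules above can be sketched directly. The class names and classifier weights in the example are made up; ties are not handled specially and are resolved by whichever class was seen first.

```python
from collections import Counter


def majority_vote(labels):
    """labels[t] is the class chosen by classifier t."""
    return Counter(labels).most_common(1)[0][0]


def weighted_majority_vote(labels, weights):
    """Each classifier t votes for labels[t] with weight weights[t]."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def borda_count(rankings):
    """rankings[t] lists all N classes from best to worst for classifier t.

    The class in i-th place (i = 1..N) receives N - i votes.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for place, label in enumerate(ranking, start=1):
            scores[label] = scores.get(label, 0) + (n - place)
    return max(scores, key=scores.get)


labels = ["a", "b", "b"]
print(majority_vote(labels))                            # b: two votes to one
print(weighted_majority_vote(labels, [0.6, 0.2, 0.1]))  # a: 0.6 against 0.3
print(borda_count([["a", "b", "c"],
                   ["b", "c", "a"],
                   ["b", "a", "c"]]))                   # b: 5 votes, a: 3, c: 1
```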

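Slide 38:Combining Continuous Outputs. Algebraic combiners, where $d_{t,j}(x)$ is the continuous support given by the t-th classifier to class $\omega_j$ and $\mu_j(x)$ is the total support for that class. Mean rule: $\mu_j(x) = \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)$. Weighted average: $\mu_j(x) = \sum_{t=1}^{T} w_t d_{t,j}(x)$. Minimum/maximum/median rule: $\mu_j(x) = \min_t d_{t,j}(x)$, $\max_t d_{t,j}(x)$, or $\mathrm{median}_t \, d_{t,j}(x)$. Product rule: $\mu_j(x) = \prod_{t=1}^{T} d_{t,j}(x)$. Generalized mean: $\mu_j(x) = \left( \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)^{\alpha} \right)^{1/\alpha}$. Many of the above rules are in fact special cases of the generalized mean: $\alpha \to -\infty$ gives the minimum rule; $\alpha \to +\infty$ gives the maximum rule; $\alpha \to 0$ gives the geometric mean; $\alpha = 1$ gives the mean rule.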
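The algebraic combiners can be sketched as follows, assuming the classifier outputs are arranged as a list `supports[t][j]` holding the support of classifier t for class j. The function names and the example numbers are illustrative.

```python
import math
from statistics import median


def combine(supports, rule):
    """Return the total support for each class under the named rule."""
    rules = {
        "mean": lambda column: sum(column) / len(column),
        "min": min,
        "max": max,
        "median": median,
        "product": math.prod,
    }
    columns = zip(*supports)             # one column of T supports per class
    return [rules[rule](column) for column in columns]


def generalized_mean(supports, alpha):
    """Generalized mean with exponent alpha (alpha must be nonzero here)."""
    return [
        (sum(d ** alpha for d in column) / len(column)) ** (1.0 / alpha)
        for column in zip(*supports)
    ]


def decide(total_support):
    """Choose the index of the class with the highest total support."""
    return max(range(len(total_support)), key=lambda j: total_support[j])


# Three classifiers (rows), two classes (columns); made-up supports.
supports = [
    [0.6, 0.4],
    [0.7, 0.3],
    [0.1, 0.9],
]
for rule in ("mean", "min", "max", "median", "product"):
    print(rule, decide(combine(supports, rule)))

# alpha = 1 reproduces the mean rule; large |alpha| approaches max or min.
print(generalized_mean(supports, 1))
print(generalized_mean(supports, 50), generalized_mean(supports, -50))
```

On these numbers the median rule picks the first class while the other rules pick the second, which illustrates that the choice of combiner can change the decision.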

Slide 39:Combining Continuous Outputs

Slide 40:Conclusions. Ensemble systems are useful in practice. Diversity of the base classifiers is important. Ensemble generation techniques: bagging, AdaBoost, mixture of experts. Classifier combination strategies: algebraic combiners, voting methods, and decision templates. No single ensemble generation algorithm or combination rule is universally better than the others. Effectiveness on real world data depends on the classifier diversity and the characteristics of the data.
