Download Presentation
## Chapter 9

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Slide 1:**Chapter 9 Given a range of classification algorithms, which is the best?
Some algorithms may be preferred because of their low complexity, ability to incorporate prior knowledge,….
Principle of Occam’s Razor: given two classifiers that perform equally well on the training set, it is asserted that the simpler classifier may do better on test set
This chapter focuses on mathematical foundations that do not depend on a particular classifier or learning algorithm
Bias and variance dilemma
Ensemble of Classifiers (classifier combination)
Cross validation
Resampling

**Slide 2:**No Free Lunch Theorem Suppose we make no prior assumptions about the nature of the classification task. Can we expect any classification method to be superior or inferior overall?
No Free Lunch Theorem: Answer to above question: NO
If the goal is to obtain good generalization performance, there is no context-independent or usage-independent reasons to favor one algorithm over others
If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to a particular pattern recognition problem
For a new classification problem, what matters most: prior information, data distribution, size of training set, cost fn.

**Slide 3:**No Free Lunch Theorem It is the assumptions about the learning algorithm that are important
Even popular algorithms will perform poorly on some problems, where the learning algorithm and data distribution do not match well
In practice, experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems

**Slide 4:**Ugly Duckling Theorem In the absence of assumptions, there is no best feature representation
The similarity between patterns is fundamentally based on implicit assumptions about the problem domain
Consider the example where features x and y represent blind_in_right_eye and blind_in_left_eye, respectively. If we base similarity on shared features, person P1 = {1, 0} (blind only in the right eye} is maximally different from person P2 = {0, 1} (blind only in the left eye). In this scheme P1 is more similar to a totally blind person and to a normally sighted person than he is to P2!
There may be situations where we want P1 to be more similar to P2—such persons may be able to drive an automobile!

**Slide 5:**Bias and Variance No “best classifier” in general
Necessity for exploring a variety of methods
How to evaluate if the learning algorithm “matches” the classification problem
Bias: measures the quality of the match
High-bias implies poor match
Variance: measures the specificity of the match
High-variance implies a weak match
Bias and variance are not independent of each other

**Slide 6:**Bias and Variance Given true function F(x)
Estimated function g(x; D) from a training set D
Training set D is a random variable. Each training set gives an estimate of error in the fit
Taking average over all training sets of size n, MSE is

**Slide 7:**Bias-Variance Dilemma in Regression

**Slide 8:**Bias Variance Dilemma Procedures with increased flexibility to adapt to training data have lower bias, but higher variance
Large number of parameters
Fits well and have low bias, but high variance
Inflexible procedures have higher bias, but lower variance
Fewer number of parameters
May not fit well to data: have high bias, but low variance
A large amount of training data generally helps improve performance of estimation if the model is sufficiently general to represent the target function
Bias/Variance considerations recommend that we gather as much prior information about the problem as possible to find a best match for the classifier, and as large a dataset as possible to reduce the variance

**Slide 9:**Bias and Variance for Classification Low variance is more important for accurate classification than low boundary bias
Classifiers with large flexibility to adapt to training data (more free parameters) tend to have low bias but high variance
2-class problem; 2D Gaussian distribution with diagonal covariances. Small no. of training data to estimate parameters of 3 different models
For best classification given small training data, need to match model to the true distributions

**Slide 10:**Ensemble-based Systems in Decision Making Polikar R., “Ensemble Based Systems in Decision Making,” IEEE Circuits and Systems Magazine, vol.6, no. 3, pp. 21-45, 2006
Polikar R., “Bootstrap Inspired Techniques in Computational Intelligence,” IEEE Signal Processing Magazine, vol.24, no. 4, pp. 56-72, 2007
Polikar R., “Ensemble Learning,” Scholarpedia, 2008.

**Slide 11:**Ensemble-based Classifiers Ensemble based systems provide favorable results compared to single-expert systems for a broad range of applications & under a variety of scenarios
How to (i) generate individual components of the ensemble systems (base classifiers), and (ii) how to combine the outputs of individual classifiers?
Popular ensemble based algorithms
Bagging, boosting, AdaBoost, stacked generalization, and hierarchical mixture of experts
Commonly used combination rules
Algebraic combination of outputs, voting methods, behavior knowledge space & decision templates

**Slide 12:**Why Ensemble Based Systems? Statistical reasons
A set of classifiers with similar training performances may have different generalization performances
Combining outputs of several classifiers reduces the risk of selecting a poorly performing classifier
Large volumes of data
If the amount of data to be analyzed is too large, a single classifier may not be able to handle it; train different classifiers on different partitions of data
Too little data
Ensemble systems can also be used when there is too little data; resampling techniques

**Slide 13:**Why Ensemble Based Systems? Divide and Conquer
Divide data space into smaller & easier-to-learn partitions; each classifier learns only one of the simpler partitions

**Slide 14:**Why Ensemble Based Systems? Data Fusion
Given several sets of data from various sources, where the nature of features is different (heterogeneous features), training a single classifier may not be appropriate (e.g., MRI data, EEG recording, blood test,..)
Applications in which data from different sources are combined are called data fusion applications
Ensembles have successfully been used for fusion
All ensemble systems must have two key components:
Generate component classifiers of the ensemble
Method for combining the classifier outputs

**Slide 15:**Brief History of Ensemble Systems Dasarathy and Sheela (1979) partitioned the feature space using two or more classifiers
Schapire (1990) proved that a strong classifier can be generated by combining weak classifiers through boosting; predecessor of AdaBoost algorithm
Two types of combination:
classifier selection
Each classifier is trained to become an expert in some local area of the feature space; one or more local experts can be nominated to make the decision
classifier fusion
All classifiers are trained over the entire feature space; fusion involves merging the individual (weaker) classifiers to obtain a single (stronger) expert of superior performance

**Slide 16:**Diversity of Ensemble Objective: create many classifiers, and combine their outputs to improve the performance of a single classifier
Intuition: if each classifier makes different errors, then their strategic combination can reduce the total error!
Need base classifiers whose decision boundaries are adequately different from those of others
Such a set of classifiers is said to be diverse
How to achieve classifier diversity?
Use different training sets to train individual classifiers
How to obtain different training sets?
Resampling techniques: bootstrapping or bagging, training subsets are drawn randomly, usually with replacement, from the entire training set

**Slide 17:**Sampling with Replacement Random & overlapping training sets to train three classifiers; they are combined to obtain a more accurate classification

**Slide 18:**Sampling without Replacement Jackknife or k-fold data split:
Entire data is split into k blocks; each classifier is trained only on different subset of (k-1) blocks

**Slide 19:**Other Approaches to Achieve Diversity Use different training parameters for a classifier
A series of MLP can be trained using different weight initializations, number of layers/nodes, etc.
Adjusting these parameters controls the instability of such classifiers (local minima)
Similar strategy can be used to generate different decision trees for the same problem
Different types of classifiers (MLPs, decision trees, NN classifiers, SVM) can be combined for added diversity
Diversity can also be achieved by using random feature subsets, called random subspace method

**Slide 20:**Creating An Ensemble Two questions:
How will the individual classifiers be generated?
How will they differ from each other?
Answer determines the diversity of classifiers & fusion performance
Seek to improve ensemble diversity by some heuristic methods

**Slide 21:**Bagging Bagging, short for bootstrap aggregating, is one of the earliest ensemble based algorithms
It is also one of the most intuitive and simplest to implement, with a surprisingly good performance
Use bootstrapped replicas of the training data; large number of (say 200) training subsets are randomly drawn - with replacement - from the entire training data
Each resampled training set is used to train a different classifier of the same type
Individual classifiers are combined by taking a majority vote of their decisions
Bagging is appealing for small training set; relatively large portion of the samples is included in each subset

**Slide 22:**Bagging

**Slide 23:**Variations of Bagging Random Forests
so-called because it is constructed from decision trees
A random forest is created from individual decision trees, whose training parameters vary randomly
Such parameters can be bootstrapped replicas of the training data, as in bagging
But they can also be different feature subsets as in random subspace methods

**Slide 24:**Boosting Boost the performance of a weak learner to the level of a strong one
Boosting creates an ensemble of classifiers by resampling the data; classifiers combined by majority voting
resampling is strategically geared to provide the most informative training data for each consecutive classifier
Boosting creates three weak classifiers:
First classifier C1 is trained with a random subset of the available training data
Training set for second classifier C2 is chosen as the most informative subset, given C1; half of the training data for C2 is correctly classified by C1, other half is misclassified by C1
Third classifier C3 is trained on instances on which both C1 & C2 disagree

**Slide 25:**Boosting

**Slide 26:**AdaBoost AdaBoost (1997) is a more general version of the boosting algorithm; AdaBoost.M1 can handle multiclass problems
AdaBoost generates a set of hypotheses (classifiers), and combines them through weighted majority voting of the classes predicted by the individual hypotheses
Hypotheses are generated by training a weak classifier; samples are drawn from an iteratively updated distribution of the training set
This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier
Consecutive classifiers are trained on increasingly hard-to-classify samples

**Slide 27:**AdaBoost A weight distribution Dt(i) on training instances xi , i=1,…,N from which training data subsets St are chosen for each consecutive classifier (hypothesis) ht
A normalized error is then obtained as ?t , such that for 0<?t <1/2, they have 0< ?t <1
Distribution update rule:
The distribution weights of those instances that are correctly classified by the current hypothesis are reduced by a factor of ?t , whereas the weights of the misclassified instances are unchanged.
AdaBoost focuses on increasingly difficult instances
AdaBoost raises the weights of instanced misclassified by ht , and lowers the weights of correctly classified instances
AdaBoost is ready for classifying unlabeled test instances. Unlike bagging or boosting, AdaBoost uses the weighted majority voting
1/?t is therefore a measure of performance, of the tth hypothesis and can be used to weight the classifiers

**Slide 28:**AdaBoost.M1

**Slide 29:**AdaBoost.M1 AdaBoost algorithm is sequential; classifier (CK-1) is created before classifier CK

**Slide 30:**Boosting

**Slide 31:**AdaBoost

**Slide 32:**Performance of AdaBoost In most practical cases, the ensemble error decreases very rapidly in the first few iterations, and approaches zero or stabilizes as new classifiers are added
AdaBoost does not seem to be affected by overfitting; explained by margin theory

**Slide 33:**Stacked Generalization An ensemble of classifiers is first created, whose outputs are used as inputs to a second level meta-classifier to learn the mapping between the ensemble outputs and the actual correct classes
C1, …,CT are trained using training parameters ?1 through ?T to output hypotheses h1 through hT
The outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second level classifier, CT+1

**Slide 34:**Mixture-of-Experts A conceptually similar technique is the mixture-of-experts model, where a set of classifiers C1, …,CT constitute the ensemble, followed by a second-level classifier CT+1 used for assigning weights for the consecutive combiner
The combiner itself is usually not a classifier, but rather a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all
the weight distribution used for the combiner is determined by a second level classifier, usually a neural network, called the gating network
The inputs to the gating network are the actual training data instances themselves (unlike outputs of first level classifiers for stacked generalization)
Mixture-of-experts can, therefore, be seen as a classifier selection algorithm
Individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance x

**Slide 35:**Mixture of Experts The pooling system may use the weights in several different ways.
it may choose a single classifier with the highest weight, or calculate a weighted sum of the classifier outputs for each class, and pick the class that receives the highest weighted sum.

**Slide 36:**Combining Classifiers How to combine classifiers? Combination rules grouped as
(i) trainable vs. non-trainable
Trainable rules: parameters of the combiner, called weights determined through a separate training algorithm
Weights from trainable rules are usually instance specific, and hence are also called dynamic combination rules
Non-trainable rules: combination parameters are available as classifiers are generated; Weighted majority voting is an example
(ii) combination rules for class labels vs. class-specific continuous outputs
combination rules that apply to class labels only need the classification decision (that is, one of ?j , j=1,…,C)
Other rules need continuous-valued outputs of individual classifiers

**Slide 37:**Combining Class Labels Assume that only class labels are available from the classifier outputs
Define the decision of the tth classifier as dt,j?{0,1} , t=1,…,T and j=1,…,C , where T is the number of classifiers and C is the number of classes
If tth classifier chooses class ?j , then dt,j=1, 0 otherwise
Majority Voting :
Weighted Majority Voting –
Behavior Knowledge Space (BKS) – look up Table
Borda Count –
each voter (classifier) rank orders the candidates (classes). If there are N candidates, the first-place candidate receives N - 1 votes, the second-place candidate receives N - 2, with the candidate in ith place receiving N - i votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision

**Slide 38:**Combining Continuous Outputs Algebraic combiners
Mean Rule:
Weighted Average:
Minimum/Maximum/Median Rule:
Product Rule:
Generalized Mean:
Many of the above rules are in fact special cases of the generalized mean
: minimum rule; :maximum rule; :
: mean rule

**Slide 39:**Combining Continuous Outputs

**Slide 40:**Conclusions Ensemble systems are useful in practice
Diversity of the base classifiers is important
Ensemble generation techniques: bagging, AdaBoost, mixture of experts
Classifier combination strategies: algebraic combiners, voting methods, and decision templates.
No single ensemble generation algorithm or combination rule is universally better than others
Effectiveness on real world data depends on the classifier diversity and characteristics of the data