Combining Multiple Classifiers

PowerPoint Slideshow about ' Combining Multiple Classifiers' - shaun

Combining Multiple Classifiers

  • Pattern Recognition
    • Best possible classification rates
    • Increase efficiency & accuracy
  • Multiple classifier systems
    • Improve generalization, robustness, and accuracy

Combining Multiple Classifiers

  • Why?
    • Multiple classifiers are available, but none of them are perfect
    • Multiple types of features can be extracted for a given pattern
    • Certain complementary properties exist among different classifiers and different features.
  • Issues
    • How many classifiers are needed?
    • What kind of classifiers should be used?
    • Which features should be used in each classifier?
    • How should the results from different classifiers be combined?

Need for Combination

  • There are a number of different classifiers.
  • Sometimes more than a single training set is available.
  • Different classifiers may show strong local differences.
  • Some classifiers give different results with different parameter settings; combining them takes advantage of all the attempts to learn from the data.
  • The training data may not provide sufficient information for choosing a single best classifier from the hypothesis space.
  • The learning algorithms may not be able to solve the difficult search problems.
  • The hypothesis space may not contain the true classification; instead, it may include several equally good estimates.

Combination Methods

  • For different applications we may have different feature sets, different training sets, different classification methods or different training sessions, all resulting in a set of classifiers whose outputs may be combined, with the hope of improving the overall classification accuracy.
    • hybrid methods, decision combination, multiple experts, mixture of experts, classifier ensembles, cooperative agents, union pool, sensor fusion, and more ...
  • Various combination methods may differ from each other in their architectures, the characteristics of the combiner, and selection of the individual classifiers.
    • Architectures: parallel, serial, hierarchical


  • Majority Voting Principle: A pattern is assigned to the class which receives the highest vote from multiple classifiers.
  • Re-Ranking (Re-Ordering) Approaches: Each classifier produces a set of ranked candidates, and the candidates in the union of all the individual sets are re-ranked based on their old ranks in each set.
  • Hierarchical Re-Ranking Approach: All the classifiers are ordered based on their individual performance. A classifier is used for re-ordering only if its predecessors are not "confident" in their ranking.
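
One simple way to realise the re-ranking idea is a Borda count, sketched below in Python. This is an illustrative assumption and not necessarily the scheme the slides intend: the function name `borda_rerank` and the scoring (a candidate at position r in a list of length n earns n - r points, and a candidate absent from a list earns nothing there) are hypothetical choices.

```python
from collections import defaultdict

def borda_rerank(rankings):
    """Re-rank the union of candidates from several ranked lists.

    rankings: list of lists, each ordered from best to worst candidate.
    Assumed scoring (Borda count): position r (0-based) in a list of
    length n earns n - r points; a missing candidate earns 0 there.
    """
    scores = defaultdict(int)
    for ranked in rankings:
        n = len(ranked)
        for r, candidate in enumerate(ranked):
            scores[candidate] += n - r
    # Highest total first; ties are broken by candidate name.
    return sorted(scores, key=lambda c: (-scores[c], c))

# Three classifiers, each returning its three best candidate classes.
rankings = [["A", "B", "C"], ["B", "A", "D"], ["B", "C", "A"]]
combined = borda_rerank(rankings)
```

Here "B" collects 8 points, "A" 6, "C" 3 and "D" 1, so the combined order is B, A, C, D even though "D" appears in only one list.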

  • Bayesian Optimization Techniques: The idea is to minimize the probability of error given all the decisions made by individual classifiers.
  • Linear Combination Methods: A new decision is made based on a linear combination of the confidence measures of individual classifiers.
  • Dempster-Shafer Theory: Provides a method for combining the contributions of individual classifiers to give the final result.

Classifier Combination

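According to Bayesian theory, given the feature vectors $x_{test,k}$, $k = 1, \dots, K$, used by the $K$ classifiers, the test pattern $x_{test}$ should be assigned to class $w_c$ provided the a posteriori probability of that interpretation is maximum:

Assign $x_{test} \to w_c$ if

$$P(w_c \mid x_{test,1}, \dots, x_{test,K}) = \max_i P(w_i \mid x_{test,1}, \dots, x_{test,K})$$

Let us rewrite the a posteriori probability $P(w_c \mid x_{test,1}, \dots, x_{test,K})$ using Bayes' theorem. We have

$$P(w_c \mid x_{test,1}, \dots, x_{test,K}) = \frac{p(x_{test,1}, \dots, x_{test,K} \mid w_c)\, P(w_c)}{p(x_{test,1}, \dots, x_{test,K})}$$

where the unconditional joint probability density function can be expressed in terms of conditional probabilities as

$$p(x_{test,1}, \dots, x_{test,K}) = \sum_i p(x_{test,1}, \dots, x_{test,K} \mid w_i)\, P(w_i)$$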


Product Rule

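  • Let us assume that the classifiers are statistically independent (conditionally on the class), which lets us rewrite the joint probability density function as $p(x_{test,1}, \dots, x_{test,K} \mid w_i) = \prod_{k=1}^{K} p(x_{test,k} \mid w_i)$.
  • The product rule is: assign $x_{test} \to w_c$ if $P(w_c) \prod_{k=1}^{K} p(x_{test,k} \mid w_c) = \max_i \left[ P(w_i) \prod_{k=1}^{K} p(x_{test,k} \mid w_i) \right]$
  • In terms of the posterior probabilities the rule can be written as:
  • Assign $x_{test} \to w_c$ if $P(w_c)^{-(K-1)} \prod_{k=1}^{K} P(w_c \mid x_{test,k}) = \max_i \left[ P(w_i)^{-(K-1)} \prod_{k=1}^{K} P(w_i \mid x_{test,k}) \right]$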

Min Rule

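We can approximate the product rule with the min rule by bounding the product of posterior probabilities from above:

$$\prod_{k=1}^{K} P(w_i \mid x_{test,k}) \le \min_k P(w_i \mid x_{test,k})$$

We obtain

Assign $x_{test} \to w_c$ if

$$P(w_c)^{-(K-1)} \min_k P(w_c \mid x_{test,k}) = \max_i \left[ P(w_i)^{-(K-1)} \min_k P(w_i \mid x_{test,k}) \right]$$

If we further assume that the prior probabilities are equal, this simplifies to

Assign $x_{test} \to w_c$ if

$$\min_k P(w_c \mid x_{test,k}) = \max_i \min_k P(w_i \mid x_{test,k})$$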


Sum Rule

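  • Let us assume that the posterior probabilities can be expressed as $P(w_i \mid x_{test,k}) = P(w_i)(1 + \delta_{ik})$, where $\delta_{ik}$ satisfies $|\delta_{ik}| \ll 1$.
  • If we expand the product and neglect any terms of second and higher order, we can approximate the right-hand side of the product rule as $P(w_i) \prod_{k=1}^{K} (1 + \delta_{ik}) \approx P(w_i) + P(w_i) \sum_{k=1}^{K} \delta_{ik}$
  • Substituting $P(w_i)\,\delta_{ik} = P(w_i \mid x_{test,k}) - P(w_i)$, this simplifies to the sum rule:
  • Assign $x_{test} \to w_c$ if $(1-K)\,P(w_c) + \sum_{k=1}^{K} P(w_c \mid x_{test,k}) = \max_i \left[ (1-K)\,P(w_i) + \sum_{k=1}^{K} P(w_i \mid x_{test,k}) \right]$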

Max Rule

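We can approximate the sum rule by the maximum of the posterior probabilities, since

$$\frac{1}{K} \sum_{k=1}^{K} P(w_i \mid x_{test,k}) \le \max_k P(w_i \mid x_{test,k})$$

Assign $x_{test} \to w_c$ if

$$(1-K)\,P(w_c) + K \max_k P(w_c \mid x_{test,k}) = \max_i \left[ (1-K)\,P(w_i) + K \max_k P(w_i \mid x_{test,k}) \right]$$

If we further assume that the prior probabilities are equal, this simplifies to

Assign $x_{test} \to w_c$ if

$$\max_k P(w_c \mid x_{test,k}) = \max_i \max_k P(w_i \mid x_{test,k})$$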


Mean Rule

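If we assume equal prior probabilities, the sum rule can be viewed as computing the average posterior probability for each class over all the classifier outputs:

Assign $x_{test} \to w_c$ if

$$\sum_{k=1}^{K} P(w_c \mid x_{test,k}) = \max_i \sum_{k=1}^{K} P(w_i \mid x_{test,k})$$

or, equivalently,

Assign $x_{test} \to w_c$ if

$$\frac{1}{K} \sum_{k=1}^{K} P(w_c \mid x_{test,k}) = \max_i \frac{1}{K} \sum_{k=1}^{K} P(w_i \mid x_{test,k})$$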


Majority Vote Rule

Let us force the posterior probabilities to produce a binary-valued function: $\Delta_{ik} = 1$ if $P(w_i \mid x_{test,k}) = \max_j P(w_j \mid x_{test,k})$, and 0 otherwise.

This function makes the outputs to be combined class labels rather than posterior probabilities. If we further assume that the prior probabilities are equal, we find:

Assign $x_{test} \to w_c$ if

$$\sum_{k=1}^{K} \Delta_{ck} = \max_i \sum_{k=1}^{K} \Delta_{ik}$$

Note that for each class $w_i$ the sum on the right-hand side is the count of the votes received from the individual classifiers. The class which receives the largest number of votes is then selected as the majority decision.
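
A minimal Python sketch of the vote counting follows. The name `majority_vote`, the `posteriors[k][i]` layout and the tie-breaking choice (lowest class index wins) are assumptions made for the example, not something the slides specify.

```python
from collections import Counter

def majority_vote(posteriors):
    """Majority vote over hardened classifier outputs.

    posteriors[k][i]: P(w_i | x_k) from classifier k (assumed layout).
    Each classifier votes for its most probable class; the class with
    the most votes wins. Ties go to the lowest class index (assumption).
    """
    votes = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    counts = Counter(votes)
    best = max(counts.values())
    winner = min(c for c, n in counts.items() if n == best)
    return winner, counts

posteriors = [[0.5, 0.3, 0.2], [0.5, 0.3, 0.2], [0.1, 0.3, 0.6]]
winner, counts = majority_vote(posteriors)
```

Two classifiers vote for class 0 and one for class 2, so class 0 is selected; how confident each classifier was plays no role once the outputs are hardened.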



Error Sensitivity

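  • Let the estimated posterior probabilities be $\hat{P}(w_i \mid x_{test,k}) = P(w_i \mid x_{test,k}) + e_{ik}$, and assume that $e_{ik} \ll P(w_i \mid x_{test,k})$, $P(w_i \mid x_{test,k}) > 0$.
  • Product rule error factor: $1 + \sum_{k=1}^{K} \dfrac{e_{ik}}{P(w_i \mid x_{test,k})}$
  • Sum rule error factor: $1 + \dfrac{\sum_{k=1}^{K} e_{ik}}{\sum_{k=1}^{K} P(w_i \mid x_{test,k})}$
  • The sum decision rule is much more robust to estimation errors.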
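
To see the difference numerically, the two error factors can be evaluated for one class. The sketch below assumes the first-order factors 1 + sum_k e_k / P_k for the product rule and 1 + (sum_k e_k) / (sum_k P_k) for the sum rule; the function names and the numbers are hypothetical.

```python
def product_error_factor(posteriors, errors):
    """First-order error factor of the product rule for one class.

    posteriors[k]: true P(w_i | x_k); errors[k]: estimation error e_ik.
    Each error is divided by its own posterior, so small posteriors
    amplify the error.
    """
    return 1 + sum(e / p for p, e in zip(posteriors, errors))

def sum_error_factor(posteriors, errors):
    """First-order error factor of the sum rule for one class.

    The summed error is divided by the summed posteriors.
    """
    return 1 + sum(errors) / sum(posteriors)

# One class seen by three classifiers; the third posterior is small.
posteriors = [0.9, 0.8, 0.05]
errors = [0.001, 0.001, 0.001]
product_factor = product_error_factor(posteriors, errors)
sum_factor = sum_error_factor(posteriors, errors)
```

With these values the product rule factor is roughly 1.022 and the sum rule factor roughly 1.002: the small posterior 0.05 inflates its error twentyfold in the product, while in the sum the errors are divided by the total 1.75.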

Performance Measures

  • Performance (ratio for correct classification)
  • Reliability (probability of correct classification for that class)


Performance Measures

  • Class performance (ratio of correct classification to the sample size)
  • Probability performance (based on distances between the posterior probabilities p't of the classification result and the true classification probability pt)


Performance Measures

  • Overall classification performance (combines the products of performance_i and reliability_i, with the count of samples_i in the corresponding class_i as a weight)
  • Sum of squared errors based on probabilities (sum of squared differences between the posterior probabilities of the classification result and the true classification probability)


Performance Measures

  • Distance of probabilities (Euclidean distance between the posterior probabilities of the classification result and the true classification probability)
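
Three of the measures described in words above can be sketched directly in Python. The exact formulas of the slides (normalisation, averaging over samples) are not preserved in the transcript, so the functions below are plausible readings of the verbal definitions only; all names and example values are hypothetical.

```python
import math

def class_performance(predicted, actual):
    """Ratio of correctly classified samples to the sample size."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def sum_squared_error(p_result, p_true):
    """Sum of squared differences between the posterior probabilities of
    the classification result and the true classification probabilities."""
    return sum((r - t) ** 2 for r, t in zip(p_result, p_true))

def probability_distance(p_result, p_true):
    """Euclidean distance between the two probability vectors."""
    return math.sqrt(sum_squared_error(p_result, p_true))

# Four samples with predicted and true class labels.
predicted = [0, 1, 1, 2]
actual = [0, 1, 2, 2]
ratio = class_performance(predicted, actual)

# One sample: combiner output versus the true class (class 0).
p_result = [0.7, 0.2, 0.1]
p_true = [1.0, 0.0, 0.0]
sse = sum_squared_error(p_result, p_true)
distance = probability_distance(p_result, p_true)
```

Unlike the class performance, the two probability-based measures still distinguish between a confident and a barely correct decision.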


Class based combination

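The classifiers' class assignments are used for combination.

The classifiers are forced to produce a binary-valued function $\Delta_{ik}$ from the posterior probabilities: $\Delta_{ik} = 1$ if $P(w_i \mid x_{test,k}) = \max_j P(w_j \mid x_{test,k})$, and 0 otherwise.

Assign $x_{test} \to w_c$ if [equation not preserved in the transcript]

Assign $x_{test} \to w_c$ if [equation not preserved in the transcript]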


Probability based combination

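We use the posterior probabilities of the classifiers to carry out the combination.

  • Assign $x_{test} \to w_c$ if [equation not preserved in the transcript]
  • Assign $x_{test} \to w_c$ if [equation not preserved in the transcript]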

Combined class and probability based combination

We can convert the class assignments of the class and probability based combination algorithm to posterior probabilities.


Combined class and probability based combination

  • Similarly, we can integrate reliability:
  • Assign $x_{test} \to w_c$ if [equation not preserved in the transcript]

Weight assignment for combination

  • Equal weights:
  • Normalized overall performances are assigned as weights:
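
The two weighting schemes can be sketched as follows. The weight formulas themselves are not preserved in the transcript, so this assumes the natural readings: equal weights of 1/K, and each classifier's overall performance divided by the sum of all performances. The function names, the weighted-sum combiner and the numbers are hypothetical.

```python
def equal_weights(n_classifiers):
    """Every classifier receives the same weight 1/K."""
    return [1.0 / n_classifiers] * n_classifiers

def performance_weights(performances):
    """Overall performances normalised to sum to one (assumed scheme)."""
    total = sum(performances)
    return [p / total for p in performances]

def weighted_combine(posteriors, weights):
    """Weighted sum of posteriors; posteriors[k][i] = P(w_i | x_k)."""
    n_classes = len(posteriors[0])
    scores = [sum(w * p[i] for w, p in zip(weights, posteriors))
              for i in range(n_classes)]
    winner = max(range(n_classes), key=scores.__getitem__)
    return winner, scores

# Classifier 0 performs best overall and disagrees with the other two.
posteriors = [[0.7, 0.3], [0.35, 0.65], [0.4, 0.6]]
weights = performance_weights([0.9, 0.6, 0.5])
winner_equal, _ = weighted_combine(posteriors, equal_weights(3))
winner_weighted, _ = weighted_combine(posteriors, weights)
```

With equal weights class 1 wins; with the performance weights 0.45, 0.30 and 0.25 the stronger classifier tips the decision to class 0.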

Weight assignment for combination

Another proposal is to assign weights using a linear fit on the posterior probabilities of the leave-one-out results.

  • The least squares fit parameters for the training data set are used as the weights of the classifiers in the combination:
  • Integration of the reliability of the classifier for the assigned class:
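
A least squares fit of this kind can be sketched in pure Python through the normal equations. The formulation is an assumption, since the fit used in the slides is not preserved: each row of `A` holds the K classifiers' leave-one-out posteriors for one class on one training sample, `y` holds the 0/1 targets, and the fitted coefficients serve as weights. The function name and data are hypothetical, and the columns of `A` are assumed to be linearly independent.

```python
def least_squares_weights(A, y):
    """Solve min_w ||A w - y||^2 via the normal equations (A^T A) w = A^T y.

    A: list of rows, one per sample, one column per classifier.
    y: list of target values, one per sample.
    Assumes linearly independent columns (otherwise a pivot is zero).
    """
    n = len(A[0])
    # Augmented normal-equation matrix [A^T A | A^T y].
    M = [[sum(row[i] * row[j] for row in A) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(A, y))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

# Leave-one-out posteriors of two classifiers for class 1 on four samples,
# with the true 0/1 membership of each sample as target.
A = [[0.9, 0.6], [0.2, 0.5], [0.8, 0.7], [0.1, 0.4]]
y = [1.0, 0.0, 1.0, 0.0]
weights = least_squares_weights(A, y)
```

The fitted weights are not constrained to be positive or to sum to one; whether to normalise them afterwards is a separate design choice.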

Classifier set

KMClus: K-means clustering with maximum iteration=10; maximum error=0.5.

SOM : Self organizing map clustering with iteration=1000; learning rate=1.

FANN : Fuzzy neural network classifier with fuzzification level=3; fuzzification type=0; number of hidden layer units=25; learning rate 0.001; maximum iteration=1000; minimum error=0.02.

ANN : Artificial neural network classifier with number of hidden layer units=25; learning rate 0.001; maximum iteration=1000; minimum error=0.02.


Classifier set

KMClas: K-means classifier.

Parzen : Parzen classifier with alfa=1.

KNN : K-nearest neighbour classifier with k=3.

PQD : Piecewise quadratic distance classifier.

PLD : Piecewise linear distance classifier.

SVC : Support vector machine using radial basis kernel with p=1


Data sets

BIO : Carriers of a rare genetic disorder. 5(127+67)

DIB : Pima Indians Diabetes. 8(500+268)

D10 : Duin 10 dimensional distribution. 10(100+100)

GID : Glass Identification. 9(70+76+17+13+9+29)

IMX : IMOX IEEE data file of letters. 8(48+48+48+48)


Data sets

SMR: Sonar. 60(97+111)

2SD : Two spirals two dimensional. 2(97+97)

WQD: Wine quality. 13(59+71+48)

80X : IEEE 80X data set. 8(15+15+15)

ZMM: 6 Zernike moments of 8 characters. 6(12+12+12+12+12+12+12+12)


Data sets

BEM: Equal mean but different variance(20 % Bayes error). 2(100+100)

BEV : Different mean but equal variance (20 % Bayes error). 2(100+100)

HRD: Highleyman random patterns. 2(100+100)

IFD : Classical Fisher’s iris flowers. 4(50+50+50)


Increasing the learning performance

linear : linear

poly3 : third degree polynomial

rbf : radial basis with unit width

erbf : radial basis with unit width and square root of distances

sigmoid : sigmoid with scale one and no offset

fourier : Fourier with zero degree

spline : spline

bspline : third degree B-spline


Sensitivity Analysis

  • Removal of worst classifiers
  • Removal of best classifier
  • Best classifier subset
  • Incremental classifier addition

Sensitivity of incremental addition of classifiers

Class performance of equal weight assignment

Class performance of polynomial weight assignment with reliability


Design Guidelines

  • All classifiers may be tuned for better performance.
  • Leave-one-out algorithm should be used.
  • The reliabilities should be calculated.
  • A sensitivity analysis should be carried out.
  • Either the best subset or, if the difference in performance is small, all classifiers should be selected for the classifier set.
  • Finally using results of leave-one-out, the polynomial weights and the class reliabilities of classifiers should be calculated and used for the testing phase.
  • The probability based combination algorithm is the more robust choice of combination algorithm.

Data Handling

  • In cross-validation, some of the data is removed before training begins. When training is done, the data that was removed is used to test the performance of the classifier.
  • The holdout method is the simplest kind of cross-validation. The data set is separated into two sets, called the training set and the testing set.
  • In k-fold cross-validation, the cases are randomly divided into k mutually exclusive test partitions of approximately equal size. The holdout method is repeated k times, each time using a different partition as the test set.
  • Leave-one-out cross-validation is k-fold cross-validation taken to its logical extreme, with k equal to N, the number of samples: each time, the classifier is trained on all the data except for one sample.
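
The splitting schemes above can be sketched with one small index generator. The name `k_fold_splits` and the `seed` parameter are hypothetical; leave-one-out is obtained by setting k to the number of samples.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    The indices are shuffled once and dealt into k mutually exclusive
    folds of approximately equal size; each fold serves once as the
    test set while the remaining folds form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test

five_fold = list(k_fold_splits(10, 5))     # 5 splits, 2 test samples each
leave_one_out = list(k_fold_splits(6, 6))  # k = N: one test sample per split
```

Every sample is used for testing exactly once, so the k test results together cover the whole data set.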

Data Handling

  • Bootstrapping repeatedly analyzes subsamples of the data; each subsample is a random sample drawn with replacement from the full sample.
  • In bagging, the training set is perturbed by bootstrapping and the results combined with a majority vote.
  • Boosting constructs a sequence of classifiers, each trained on a subset of the data drawn so that previously misclassified patterns are more likely to be selected; the classifiers are then combined by a weighted vote. At each iteration, the training set is modified to focus the classification algorithm on samples that are difficult to classify correctly.
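
Bagging as described above (bootstrap the training set, train one classifier per bootstrap sample, combine by majority vote) can be sketched in pure Python. The base classifier here is a deliberately simple nearest-class-mean rule on one-dimensional data, chosen only to keep the example short; all names, the number of models and the data are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Random sample of the same size as data, drawn with replacement."""
    return [rng.choice(data) for _ in data]

def train_nearest_mean(sample):
    """Toy base classifier on 1-D data: remember each class mean and
    assign a new value to the class whose mean is closest."""
    sums, counts = {}, Counter()
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def bagging_predict(data, x, n_models=11, seed=0):
    """Train n_models classifiers on bootstrap samples and majority-vote."""
    rng = random.Random(seed)
    models = [train_nearest_mean(bootstrap_sample(data, rng))
              for _ in range(n_models)]
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Two well separated classes on the real line.
data = [(0.1, "a"), (0.3, "a"), (0.2, "a"), (0.4, "a"),
        (1.9, "b"), (2.1, "b"), (2.0, "b"), (1.8, "b")]
label_low = bagging_predict(data, 0.0)
label_high = bagging_predict(data, 2.5)
```

Boosting would differ in the sampling step: instead of drawing uniformly, each round would favour the samples misclassified so far and the final vote would be weighted.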