Combining Multiple Classifiers

Combining Multiple Classifiers • Pattern Recognition • Best possible classification rates • Increase efficiency & accuracy • Multiple classifier systems • Improve generalization, robustness, and accuracy

Combining Multiple Classifiers • Why? • Multiple classifiers are available, but none of them is perfect • Multiple types of features can be extracted for a given pattern • Certain complementary properties exist among different classifiers and different features • Issues • How many classifiers are needed? • What kind of classifiers should be used? • Which features should be used in each classifier? • How should the results from different classifiers be combined?

Need for Combination • There are a number of different classifiers. • Sometimes more than a single training set is available. • Different classifiers may show strong local differences. • Some classifiers give different results with different parameters; one can combine them, thereby taking advantage of all the attempts to learn from the data. • The training data may not provide sufficient information for choosing a single best classifier from the hypothesis space. • The learning algorithms may not be able to solve the difficult search problems. • The hypothesis space may not contain the true classification; instead, it may include several equally good estimates.

Combination Methods • For different applications we may have different feature sets, different training sets, different classification methods or different training sessions, all resulting in a set of classifiers whose outputs may be combined, with the hope of improving the overall classification accuracy. • Related terms: hybrid methods, decision combination, multiple experts, mixture of experts, classifier ensembles, cooperative agents, union pool, sensor fusion, and more ... • The various combination methods differ from each other in their architectures, the characteristics of the combiner, and the selection of the individual classifiers. • Architectures: parallel, serial, hierarchical

Approaches • Majority Voting Principle: A pattern is assigned to the class that receives the most votes from the multiple classifiers. • Re-Ranking (Re-Ordering) Approaches: Each classifier produces a set of ranked candidates, and the candidates in the union of all the individual sets are re-ranked based on their old ranks in each set. • Hierarchical Re-Ranking Approach: All the classifiers are ordered based on their individual performance. A classifier is used for re-ordering only if its predecessors are not "confident" in their ranking.
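The re-ranking idea can be illustrated with a Borda-style count. This is only a sketch: the slide does not say how the old ranks are merged, so the scoring scheme (a candidate at position r in a list of n earns n - r points), the function name `borda_rerank`, and the example candidates are all assumptions.

```python
def borda_rerank(ranked_lists):
    """Re-rank the union of candidates from several classifiers.

    ranked_lists: one list per classifier, best candidate first.
    Assumed scoring: position r (0-based) in a list of length n earns
    n - r points; a candidate absent from a list earns nothing from it.
    """
    scores = {}
    for ranking in ranked_lists:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] = scores.get(candidate, 0) + (n - position)
    # Highest total first; ties broken by candidate name for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


rankings = [
    ["A", "B", "C"],  # classifier 1
    ["B", "A", "D"],  # classifier 2
    ["B", "C", "A"],  # classifier 3
]
combined = borda_rerank(rankings)
print(combined)
```

Here candidate B collects 8 points, A 6, C 3 and D 1, so the union of the three lists is re-ordered as B, A, C, D.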

• Bayesian Optimization Techniques: The idea is to minimize the probability of error given all the decisions made by individual classifiers. • Linear Combination Methods: A new decision is made based on a linear combination of the confidence measures of individual classifiers. • Dempster-Shafer Theory: Provides a method for combining the contributions of individual classifiers to give the final result.
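As an illustration of the Dempster-Shafer item, the following sketch applies Dempster's rule of combination to two mass functions. How a classifier's output is turned into a mass function is not stated on the slide; the masses, the class labels and the name `dempster_combine` are hypothetical.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of class labels to its mass,
    with the masses summing to 1.
    """
    combined = {}
    conflict = 0.0
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            intersection = set_a & set_b
            if intersection:
                combined[intersection] = (
                    combined.get(intersection, 0.0) + mass_a * mass_b
                )
            else:
                # Mass that falls on the empty set counts as conflict.
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize so the remaining masses sum to 1.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


frame = frozenset({"w1", "w2"})              # "don't know" = whole frame
m1 = {frozenset({"w1"}): 0.6, frame: 0.4}    # evidence from classifier 1
m2 = {frozenset({"w2"}): 0.3, frame: 0.7}    # evidence from classifier 2
fused = dempster_combine(m1, m2)
for subset, mass in fused.items():
    print(sorted(subset), round(mass, 4))
```

In this example 0.18 of the joint mass is conflicting and is discarded; the rest is rescaled by 1 / 0.82.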

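Classifier Combination According to Bayesian theory, given the feature vectors $x^{te}_{tk}$, $k = 1, \dots, K$, seen by the $K$ classifiers, the test pattern $x^{te}_t$ should be assigned to class $w_c$ provided the a posteriori probability of that interpretation is maximum: Assign $x^{te}_t \to w_c$ if $P(w_c \mid x^{te}_{t1}, \dots, x^{te}_{tK}) = \max_i P(w_i \mid x^{te}_{t1}, \dots, x^{te}_{tK})$. Let us rewrite the a posteriori probability $P(w_c \mid x^{te}_{t1}, \dots, x^{te}_{tK})$ using the Bayes theorem. We have $P(w_c \mid x^{te}_{t1}, \dots, x^{te}_{tK}) = \dfrac{p(x^{te}_{t1}, \dots, x^{te}_{tK} \mid w_c)\, P(w_c)}{p(x^{te}_{t1}, \dots, x^{te}_{tK})}$, where the unconditional joint probability density function can be expressed in terms of conditional probabilities as $p(x^{te}_{t1}, \dots, x^{te}_{tK}) = \sum_i p(x^{te}_{t1}, \dots, x^{te}_{tK} \mid w_i)\, P(w_i)$.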

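Product Rule • Let us assume that the classifiers are statistically independent given the class, which lets us rewrite the joint probability density function as $p(x^{te}_{t1}, \dots, x^{te}_{tK} \mid w_i) = \prod_{k=1}^{K} p(x^{te}_{tk} \mid w_i)$. • The product rule is: assign $x^{te}_t \to w_c$ if $P(w_c) \prod_{k=1}^{K} p(x^{te}_{tk} \mid w_c) = \max_i P(w_i) \prod_{k=1}^{K} p(x^{te}_{tk} \mid w_i)$. • In terms of the posterior probabilities the rule can be written as: assign $x^{te}_t \to w_c$ if $P(w_c)^{-(K-1)} \prod_{k=1}^{K} P(w_c \mid x^{te}_{tk}) = \max_i P(w_i)^{-(K-1)} \prod_{k=1}^{K} P(w_i \mid x^{te}_{tk})$.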
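A minimal sketch of the product rule in its posterior form, assuming each classifier reports a full vector of class posteriors. The layout `posteriors[k][i]` (classifier k, class i), the function name and the numbers are made up for illustration.

```python
from math import prod


def product_rule(posteriors, priors):
    """Score each class by P(w_i)^-(K-1) * prod_k P(w_i | x_k).

    posteriors[k][i]: posterior of class i reported by classifier k.
    priors[i]: prior probability of class i.
    Returns (index of winning class, list of scores).
    """
    K = len(posteriors)
    scores = [
        priors[i] ** (-(K - 1)) * prod(p[i] for p in posteriors)
        for i in range(len(priors))
    ]
    winner = max(range(len(scores)), key=scores.__getitem__)
    return winner, scores


posteriors = [
    [0.7, 0.2, 0.1],  # classifier 1
    [0.5, 0.4, 0.1],  # classifier 2
    [0.1, 0.6, 0.3],  # classifier 3
]
priors = [1 / 3, 1 / 3, 1 / 3]
winner, scores = product_rule(posteriors, priors)
print(winner, scores)
```

Class 0 has the two largest individual posteriors, but the single low value 0.1 from classifier 3 pulls its product below that of class 1, so class 1 (index 1) wins: one dissenting classifier can effectively veto a class under this rule.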

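Min Rule We can approximate the product rule with the min rule by bounding the product of posterior probabilities from above: $\prod_{k=1}^{K} P(w_i \mid x^{te}_{tk}) \le \min_k P(w_i \mid x^{te}_{tk})$. We obtain: assign $x^{te}_t \to w_c$ if $P(w_c)^{-(K-1)} \min_k P(w_c \mid x^{te}_{tk}) = \max_i P(w_i)^{-(K-1)} \min_k P(w_i \mid x^{te}_{tk})$. If we further assume that the prior probabilities are equal, this simplifies to: assign $x^{te}_t \to w_c$ if $\min_k P(w_c \mid x^{te}_{tk}) = \max_i \min_k P(w_i \mid x^{te}_{tk})$.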

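Sum Rule • Let us assume that the posterior probabilities can be expressed as $P(w_c \mid x^{te}_{tk}) = P(w_c)(1 + \delta_{ck})$, where $\delta_{ck}$ satisfies $|\delta_{ck}| \ll 1$. • If we expand the product and neglect any terms of second and higher order, we can approximate the right-hand side of the product rule as $P(w_i)^{-(K-1)} \prod_{k=1}^{K} P(w_i \mid x^{te}_{tk}) = P(w_i) \prod_{k=1}^{K} (1 + \delta_{ik}) \approx P(w_i) + P(w_i) \sum_{k=1}^{K} \delta_{ik}$. • Substituting $P(w_i)\,\delta_{ik} = P(w_i \mid x^{te}_{tk}) - P(w_i)$, this simplifies to the sum rule: • Assign $x^{te}_t \to w_c$ if $(1-K)\,P(w_c) + \sum_{k=1}^{K} P(w_c \mid x^{te}_{tk}) = \max_i \left[ (1-K)\,P(w_i) + \sum_{k=1}^{K} P(w_i \mid x^{te}_{tk}) \right]$.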
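The sum rule can be sketched as follows, scoring each class by (1 - K) P(w_i) plus the sum of its posteriors over the K classifiers. The data layout `posteriors[k][i]`, the function name and the example values are illustrative assumptions.

```python
def sum_rule(posteriors, priors):
    """Score each class by (1 - K) * P(w_i) + sum_k P(w_i | x_k).

    posteriors[k][i]: posterior of class i reported by classifier k.
    priors[i]: prior probability of class i.
    Returns (index of winning class, list of scores).
    """
    K = len(posteriors)
    scores = [
        (1 - K) * priors[i] + sum(p[i] for p in posteriors)
        for i in range(len(priors))
    ]
    winner = max(range(len(scores)), key=scores.__getitem__)
    return winner, scores


posteriors = [
    [0.7, 0.2, 0.1],  # classifier 1
    [0.5, 0.4, 0.1],  # classifier 2
    [0.1, 0.6, 0.3],  # classifier 3
]
priors = [1 / 3, 1 / 3, 1 / 3]
winner, scores = sum_rule(posteriors, priors)
print(winner, scores)
```

With these numbers class 0 wins (summed posteriors 1.3 against 1.2 for class 1): the low value reported by classifier 3 lowers the score of class 0 but does not dominate it.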

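Max Rule We can approximate the sum rule by the maximum of the posterior probabilities, since $\sum_{k=1}^{K} P(w_i \mid x^{te}_{tk}) \le K \max_k P(w_i \mid x^{te}_{tk})$: assign $x^{te}_t \to w_c$ if $(1-K)\,P(w_c) + K \max_k P(w_c \mid x^{te}_{tk}) = \max_i \left[ (1-K)\,P(w_i) + K \max_k P(w_i \mid x^{te}_{tk}) \right]$. If we further assume that the prior probabilities are equal, this simplifies to: assign $x^{te}_t \to w_c$ if $\max_k P(w_c \mid x^{te}_{tk}) = \max_i \max_k P(w_i \mid x^{te}_{tk})$.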
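Under equal priors the min and max rules reduce to taking, for each class, the smallest or the largest posterior over the classifiers and then picking the best class. A sketch under that assumption, with hypothetical function names and example values:

```python
def min_rule(posteriors):
    """Equal priors: pick the class whose smallest posterior is largest."""
    n_classes = len(posteriors[0])
    scores = [min(p[i] for p in posteriors) for i in range(n_classes)]
    return scores.index(max(scores)), scores


def max_rule(posteriors):
    """Equal priors: pick the class whose largest posterior is largest."""
    n_classes = len(posteriors[0])
    scores = [max(p[i] for p in posteriors) for i in range(n_classes)]
    return scores.index(max(scores)), scores


# posteriors[k][i]: posterior of class i reported by classifier k
posteriors = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.6, 0.3],
]
min_winner, min_scores = min_rule(posteriors)
max_winner, max_scores = max_rule(posteriors)
print(min_winner, min_scores)
print(max_winner, max_scores)
```

On this example the min rule selects class 1, in line with its role as an approximation of the product rule, while the max rule selects class 0, as an approximation of the sum rule.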

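Mean Rule If we assume equal prior probabilities, the sum rule can be viewed as computing the average posterior probability for each class over all the classifier outputs: assign $x^{te}_t \to w_c$ if $\sum_{k=1}^{K} P(w_c \mid x^{te}_{tk}) = \max_i \sum_{k=1}^{K} P(w_i \mid x^{te}_{tk})$, or equivalently, dividing both sides by $K$, assign $x^{te}_t \to w_c$ if $\frac{1}{K} \sum_{k=1}^{K} P(w_c \mid x^{te}_{tk}) = \max_i \frac{1}{K} \sum_{k=1}^{K} P(w_i \mid x^{te}_{tk})$.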

Majority Vote Rule Let us force the posterior probabilities to produce a binary-valued function: $\Delta_{ctk} = 1$ if $P(w_c \mid x^{te}_{tk}) = \max_i P(w_i \mid x^{te}_{tk})$, and 0 otherwise. With this function the outputs being combined are class labels rather than posterior probabilities. If we further assume that the prior probabilities are equal, we find: assign $x^{te}_t \to w_c$ if $\sum_{k=1}^{K} \Delta_{ctk} = \max_i \sum_{k=1}^{K} \Delta_{itk}$. Note that for each class $w_i$ the sum on the right-hand side is the count of the votes received from the individual classifiers. The class that receives the largest number of votes is then selected as the majority decision.
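A short sketch of the vote count: each classifier's posterior vector is hardened into a single vote for its top class, and the votes are tallied per class. The data layout, the tie-breaking choice (lowest class index) and the example values are assumptions.

```python
def majority_vote(posteriors):
    """Harden each classifier's posteriors into one vote and count votes.

    posteriors[k][i]: posterior of class i reported by classifier k.
    Ties (within a classifier or in the final count) go to the lowest
    class index, which is an arbitrary choice made for this sketch.
    Returns (index of winning class, votes per class).
    """
    n_classes = len(posteriors[0])
    votes = [0] * n_classes
    for p in posteriors:
        votes[p.index(max(p))] += 1  # the classifier votes for its top class
    return votes.index(max(votes)), votes


posteriors = [
    [0.7, 0.2, 0.1],  # votes for class 0
    [0.5, 0.4, 0.1],  # votes for class 0
    [0.1, 0.6, 0.3],  # votes for class 1
]
winner, votes = majority_vote(posteriors)
print(winner, votes)
```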

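Error Sensitivity • Let the estimated posterior probabilities be $\hat{P}(w_i \mid x^{te}_{tk}) = P(w_i \mid x^{te}_{tk}) + e_{itk}$, and assume that $e_{itk} \ll P(w_i \mid x^{te}_{tk})$, $P(w_i \mid x^{te}_{tk}) > 0$ • Product rule error factor: $1 + \sum_{k=1}^{K} \dfrac{e_{itk}}{P(w_i \mid x^{te}_{tk})}$ • Sum rule error factor: $1 + \dfrac{\sum_{k=1}^{K} e_{itk}}{\sum_{k=1}^{K} P(w_i \mid x^{te}_{tk})}$ • The sum decision rule is much more robust to estimation errors.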
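A numeric illustration of the two error factors, assuming the usual first-order forms: 1 plus the sum over classifiers of error divided by posterior for the product rule, and 1 plus total error divided by total posterior for the sum rule. The posterior and error values are invented for the example.

```python
def product_error_factor(true_posteriors, errors):
    """First-order factor by which estimation errors scale the product."""
    return 1.0 + sum(e / p for e, p in zip(errors, true_posteriors))


def sum_error_factor(true_posteriors, errors):
    """Factor by which estimation errors scale the sum of posteriors."""
    return 1.0 + sum(errors) / sum(true_posteriors)


# Posteriors of one class under K = 3 classifiers, and their estimation errors.
true_posteriors = [0.7, 0.5, 0.1]
errors = [0.01, 0.01, 0.01]

pf = product_error_factor(true_posteriors, errors)
sf = sum_error_factor(true_posteriors, errors)
print(round(pf, 4), round(sf, 4))
```

Because every posterior is at most 1, dividing each error by its own posterior amplifies it, most strongly for the classifier reporting 0.1. The sum rule instead divides the total error by the total posterior, which dampens it.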

Performance Measures • Performance: ratio of correct classification • Reliability: probability of correct classification for that class

Performance Measures • Class performance: ratio of correct classification to the sample size • Probability performance: based on distances of the posterior probabilities $p'_t$ of the classification result and the true classification probability $p_t$

Performance Measures • Overall classification performance: combines the products of performance$_i$ and reliability$_i$, with the count of samples$_i$ in the corresponding class$_i$ as a weight • Sum of squared errors based on probabilities: sum of squared differences between the posterior probabilities of the classification result and the true classification probability

Performance Measures • Distance of probabilities: Euclidean distance between the posterior probabilities of the classification result and the true classification probability
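The two probability-based measures can be sketched directly from their verbal definitions. The exact formulas on the slides (for example any normalization) are not preserved, so this assumes plain sums, and it assumes the true classification probability is a one-hot vector; all names and values are illustrative.

```python
from math import sqrt


def sum_squared_error(predicted, true):
    """Sum of squared differences over all samples and all classes."""
    return sum(
        (p - q) ** 2
        for p_vec, q_vec in zip(predicted, true)
        for p, q in zip(p_vec, q_vec)
    )


def euclidean_distance(p_vec, q_vec):
    """Euclidean distance between two probability vectors."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(p_vec, q_vec)))


# Two test samples, two classes.
predicted = [[0.8, 0.2], [0.4, 0.6]]  # posteriors from the classifier
true = [[1.0, 0.0], [0.0, 1.0]]       # assumed one-hot true probabilities

sse = sum_squared_error(predicted, true)
distances = [euclidean_distance(p, q) for p, q in zip(predicted, true)]
print(round(sse, 4), [round(d, 4) for d in distances])
```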

Class based combination The classifiers' class assignments are used for combination. The classifiers are forced to produce a binary-valued function $\Delta_{ctk}$ from the posterior probabilities: $\Delta_{ctk} = 1$ if $P(w_c \mid x^{te}_{tk}) = \max_i P(w_i \mid x^{te}_{tk})$, and 0 otherwise. • Assign $x^{te}_t \to w_c$ if […] • Assign $x^{te}_t \to w_c$ if […]

Probability based combination We use the posterior probabilities of the classifiers to carry out the combination. • Assign $x^{te}_t \to w_c$ if […] • Assign $x^{te}_t \to w_c$ if […]

Combined class and probability based combination We can convert the class assignments of the class based and probability based combination algorithms to posterior probabilities.

Combined class and probability based combination • Assign $x^{te}_t \to w_c$ if […]

Combined class and probability based combination • Similarly, we can integrate reliability • Assign $x^{te}_t \to w_c$ if […]

Weight assignment for combination • Equal weights: $w_k = 1/K$ • Normalized overall performances are assigned as weights: $w_k = \mathrm{performance}_k \big/ \sum_{j=1}^{K} \mathrm{performance}_j$
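A sketch of the two weighting schemes applied to a weighted sum of posteriors. The slides' exact weighted decision rule is not preserved, so the weighted-sum form, the function names and the performance values are assumptions made for illustration.

```python
def equal_weights(n_classifiers):
    """Every classifier gets the same weight 1/K."""
    return [1.0 / n_classifiers] * n_classifiers


def normalized_weights(performances):
    """Weights proportional to each classifier's overall performance."""
    total = sum(performances)
    return [p / total for p in performances]


def weighted_sum_rule(posteriors, weights):
    """Score class i by sum_k w_k * P(w_i | x_k); return (winner, scores)."""
    n_classes = len(posteriors[0])
    scores = [
        sum(w * p[i] for w, p in zip(weights, posteriors))
        for i in range(n_classes)
    ]
    return scores.index(max(scores)), scores


# posteriors[k][i]: posterior of class i reported by classifier k
posteriors = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.6, 0.3],
]
performances = [0.9, 0.6, 0.5]  # hypothetical overall performances

perf_weights = normalized_weights(performances)
winner_equal, scores_equal = weighted_sum_rule(posteriors, equal_weights(3))
winner_perf, scores_perf = weighted_sum_rule(posteriors, perf_weights)
print(perf_weights, winner_equal, winner_perf)
```

With the performance-based weights the best classifier's opinion counts most, which widens the margin of class 0 over class 1 compared with equal weights.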

Weight assignment for combination Another proposal is to assign weights using a linear fit on the posterior probabilities of the leave-one-out results. • The least-squares fit parameters for the training data set are used as the weights of the classifiers in the combination: […] • Integration of the reliability of the classifier for the assigned class: […]

Result file

Classifier set • KMClus: K-means clustering with maximum iteration=10; maximum error=0.5. • SOM: Self-organizing map clustering with iteration=1000; learning rate=1. • FANN: Fuzzy neural network classifier with fuzzification level=3; fuzzification type=0; number of hidden layer units=25; learning rate=0.001; maximum iteration=1000; minimum error=0.02. • ANN: Artificial neural network classifier with number of hidden layer units=25; learning rate=0.001; maximum iteration=1000; minimum error=0.02.

Classifier set • KMClas: K-means classifier. • Parzen: Parzen classifier with alpha=1. • KNN: K-nearest neighbour classifier with k=3. • PQD: Piecewise quadratic distance classifier. • PLD: Piecewise linear distance classifier. • SVC: Support vector machine using radial basis kernel with p=1.

Data sets (each entry gives the number of features, followed in parentheses by the number of samples per class) • BIO: Carriers of a rare genetic disorder. 5(127+67) • DIB: Pima Indians Diabetes. 8(500+268) • D10: Duin 10 dimensional distribution. 10(100+100) • GID: Glass Identification. 9(70+76+17+13+9+29) • IMX: IMOX IEEE data file of letters. 8(48+48+48+48)

Data sets • SMR: Sonar. 60(97+111) • 2SD: Two spirals, two dimensional. 2(97+97) • WQD: Wine quality. 13(59+71+48) • 80X: IEEE 80X data set. 8(15+15+15) • ZMM: 6 Zernike moments of 8 characters. 6(12+12+12+12+12+12+12+12)

Data sets • BEM: Equal mean but different variance (20% Bayes error). 2(100+100) • BEV: Different mean but equal variance (20% Bayes error). 2(100+100) • HRD: Highleyman random patterns. 2(100+100) • IFD: Classical Fisher's iris flowers. 4(50+50+50)

Time performance of classifiers

Performance of classifiers on data sets

Performance of boosting a classifier using weighted combination

Increasing the learning performance • linear: linear • poly3: third degree polynomial • rbf: radial basis with unit width • erbf: radial basis with a unit width and square root of distances • sigmoid: sigmoid with scale one and no offset • fourier: fourier with zero degree • spline: spline • bspline: third degree bspline

Increasing the learning performance

Performance of boosting a classifier's learning performance

Performance of class based classifier combination

Performance of probability based classifier combination

Performance of combined classifier combination

Sensitivity Analysis • Removal of worst classifiers • Removal of best classifier • Best classifier subset • Incremental classifier addition

Class based combination - No Clustering

Probability based combination - No Clustering

Combined combination - No Clustering

Class based combination - No K-NN

Probability based combination - No K-NN

Combined combination - No K-NN

Sum of Squared Errors on probabilities of classifier set
