Explore automatic pattern recognition advancements in machine learning, examining the capabilities and limitations of classifiers in addressing diverse real-world challenges. Dive into complexities of data geometry and classifier behavior, shedding light on evolving technologies.
Geometrical Complexity of Classification Problems. Tin Kam Ho, Computing Sciences Research Center, Bell Labs, Lucent Technologies. With contributions from Mitra Basu, Ester Bernado, Martin Law.
Automatic Pattern Recognition A fascinating theme. What will the world be like if machines can do it? • Robots can see, read, hear, and smell. • We can get rid of all spam email, and find exactly what we need from the web. • We can protect everything with our signatures, fingerprints, iris patterns, or just our face. • We can track anybody by look, gesture, and gait. • We can be warned of disease outbreaks, terrorist threats. • Feed in our vital data and we will have perfect diagnosis. • We will know whom we can trust to lend our money.
Automatic Pattern Recognition And more … • We can tell the true emotions of all our encounters! • We can predict stock prices! • We can identify all potential criminals! • Our weapons will never lose their targets! • We will be alerted about any extraordinary events going on in the heavens, on or inside the Earth! • We will have machines discovering all knowledge! … But how far are we from these?
Automatic Pattern Recognition [Diagram: Samples → Feature Extraction → Feature vectors. Supervised learning: Classifier Training → classifier → Classification → class membership. Unsupervised learning: Clustering → Cluster Validation → Cluster Interpretation. Scatter plot axes: feature 1, feature 2]
Supervised Learning: Many automatically trainable classifiers • Statistical Classifiers (for numerical vectors): Bayesian classifiers, polynomial discriminators, nearest-neighbors, decision trees & forests, neural networks, support vector machines, ensemble methods, … Also: • Syntactic Classifiers (for symbol strings) • Structural Classifiers (for graphs) [Figure: scatter plot; axes: feature 1, feature 2]
We have had some successes. But … • Why are we still far from perfect? • What is still missing in our techniques?
Large Variations in Accuracies of Different Classifiers [Figure: accuracy by classifier and application] Will I have any luck with my problem?
Many new classifiers are in close rivalry with each other. Why? • Do they represent the limit of our technology? • What have they added to the body of methods? • Is there still value in the older methods? • What is still missing in our techniques? When I face a new problem … • How much can automatic classifiers do? • How should I choose a classifier?
Sources of Difficulty in Classification • Class ambiguity • Boundary complexity • Sample size and dimensionality
Class Ambiguity • Is the concept intrinsically ambiguous? • Are the classes well defined? • What information do the features carry? • Are the features sufficient for discrimination?
Boundary Complexity • Kolmogorov complexity • Length can be exponential in dimensionality • Trivial description: list all points & class labels • Is there a shorter description?
Classification Boundaries As Decided by Different Classifiers [Figure: training samples for a 2D classification problem; axes: feature 1, feature 2]
Classification Boundaries • XCS (a genetic algorithm): hyperrectangles [Figure: classification boundaries in feature space; axes: feature 1, feature 2]
Other Classifiers: Other Boundaries • Nearest neighbor: Voronoi cells [Figure: classification boundaries in feature space; axes: feature 1, feature 2]
Other Classifiers: Other Boundaries • Linear classifier: hyperplanes [Figure: classification boundaries in feature space; axes: feature 1, feature 2]
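Match between Classifiers and Problems [Figure: Problem A and Problem B, each classified by XCS and NN, with the better classifier marked "Better!" for each problem; errors as extracted, in order: 0.6%, 0.06%, 0.7%, 1.9%; classifier labels as extracted, in order: XCS, NN, NN, XCS]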
Sampling Density • A problem may appear deceptively simple or complex with small samples [Figure panels: 2 points, 10 points, 100 points, 500 points, 1000 points]
Measures of Geometrical Complexity of Classification Problems Our goal: tools and languages for studying • Characteristics of geometry & topology of high-dim data sets • How they change with feature transformations and sampling • How they interact with classifier geometry We want to know: • What are real-world problems like? • What is my problem like? • What can be expected of a method on a specific problem?
Building up the Language • Identify key features of data geometry that are relevant for classification • Develop algorithms to extract such features from a dataset • Describe patterns of classifier behavior in terms of geometrical features • … pattern recognition in pattern recognition problems • … classification of classification problems
Geometry of Datasets and Classifiers • Data sets: • length of class boundary • fragmentation of classes / existence of subclasses • global or local linear separability • convexity and smoothness of boundaries • intrinsic / extrinsic dimensionality • stability of these characteristics as sampling rate changes • Classifier models: • polygons, hyper-spheres, Gaussian kernels, axis-parallel hyper-planes, piece-wise linear surfaces, polynomial surfaces, their unions or intersections, …
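Fisher’s Discriminant Ratio • Defined for one feature: $f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$, where $\mu_1, \mu_2$ and $\sigma_1^2, \sigma_2^2$ are the means and variances of classes 1 and 2 • One good feature makes a problem easy • Take maximum over all features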
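As a rough illustration of this slide, the sketch below computes the ratio per feature and takes the maximum. It assumes the usual form of the ratio (squared difference of the class means divided by the sum of the class variances) and population variances; the function names and the handling of zero variance are illustrative choices, not taken from the talk.

```python
from statistics import fmean, pvariance

def fisher_ratio(values_1, values_2):
    """Fisher's discriminant ratio for ONE feature, given its values in each class."""
    mean_gap = fmean(values_1) - fmean(values_2)
    spread = pvariance(values_1) + pvariance(values_2)
    if spread == 0:
        # Assumption: constant feature in both classes. Distinct means are
        # treated as perfectly discriminating, identical means as useless.
        return float("inf") if mean_gap != 0 else 0.0
    return mean_gap ** 2 / spread

def max_fisher_ratio(class_1, class_2):
    """Maximum per-feature ratio; each class is a list of equal-length feature vectors."""
    n_features = len(class_1[0])
    return max(
        fisher_ratio([x[j] for x in class_1], [x[j] for x in class_2])
        for j in range(n_features)
    )
```

A large value means at least one feature separates the class means well relative to the class spreads, which matches the remark that one good feature makes a problem easy.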
Linear Separability • Studies from early literature: • Perceptrons, Perceptron Cycling Theorem, 1962 • Fractional Correction Rule, 1954 • Widrow-Hoff Delta Rule, 1960 • Ho-Kashyap algorithm, 1965 • Many algorithms stop only with positive conclusions. • But this one works: • Linear programming, 1968
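The remark that many algorithms stop only with positive conclusions can be illustrated with a fixed-increment perceptron. This is a minimal sketch, not the linear programming test named on the slide: the function name and the `max_epochs` cap are assumptions added for illustration. If the data are linearly separable the loop ends with a separating hyperplane; otherwise it never reaches an error-free pass, and the cap only lets it give up without proving anything.

```python
def perceptron_separates(points, labels, max_epochs=1000):
    """Return (weights, bias) if a separating hyperplane is found, else None.

    labels must be +1 or -1. A None result means "inconclusive within
    max_epochs", not "proven nonseparable".
    """
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:
                # Misclassified (or on the boundary): move the hyperplane toward x.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b  # positive conclusion: every point is strictly on its side
    return None  # no conclusion reached
```

A method that can also certify nonseparability, such as the linear programming formulation the slide points to, is needed to split problems into separable and nonseparable groups.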
Length of Class Boundary • Friedman & Rafsky 1979 • Find MST (minimum spanning tree) connecting all points regardless of class • Count edges joining opposite classes • Sensitive to separability and clustering effects
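The procedure on this slide can be sketched as follows: build a minimum spanning tree over all points, ignoring labels, then count the tree edges whose endpoints carry different labels. The sketch assumes Euclidean distance and uses a simple O(n²) version of Prim's algorithm; the function names are illustrative.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) index pairs."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Add the vertex outside the tree that is cheapest to connect.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    return edges

def cross_class_edge_count(points, labels):
    """Number of MST edges joining points of different classes."""
    return sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])
```

Two well-separated clusters give a count of 1, while heavily interleaved classes push the count toward the total number of tree edges, n - 1.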
Shapes of Class Manifolds • Lebourgeois & Emptoz 1996 • Packing of same-class points in hyper-spheres • Thick and spherical, or thin and elongated manifolds
Data Sets: UCI • UC-Irvine machine learning archive • 14 datasets (no missing values, > 500 pts) • 844 two-class problems • 452 linearly separable • 392 linearly nonseparable • 2 - 4648 points each • 8 - 480 dimensional feature spaces
Data Sets: Random Noise • Randomly located and labeled points • 100 artificial problems • 1 to 100 dimensional feature spaces • 2 classes, 1000 points per class
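A noise problem of this kind could be generated along the following lines. The slide says only that points are randomly located and labeled, so the uniform distribution on the unit hypercube, the balanced shuffled labels, and the function name are all assumptions made for this sketch.

```python
import random

def random_noise_problem(n_dims, points_per_class=1000, seed=0):
    """Return (points, labels) for a two-class problem with no real structure."""
    rng = random.Random(seed)
    n_points = 2 * points_per_class
    # Assumption: locations are uniform in the unit hypercube [0, 1)^n_dims.
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    # Labels are balanced, then shuffled so they carry no information about location.
    labels = [0] * points_per_class + [1] * points_per_class
    rng.shuffle(labels)
    return points, labels
```

Calling `random_noise_problem(d)` for d from 1 to 100 would give one problem per dimensionality, which is one way to read "100 artificial problems" in 1 to 100 dimensional feature spaces.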
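Space of Complexity Measures [Figure: problems plotted in the space of complexity measures; labeled groups: random noise, linearly separable]
Patterns in Complexity Measure Space [Figure legend: lin.sep, lin.nonsep, random]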
Observations on Distributions of Problems • Noise sets and linearly separable sets occupy opposite ends in many dimensions • In-between positions tell relative difficulty • Fan-like structure in most plots • At least 2 independent factors, joint effects • Noise sets are far from real data • Ranges of complexity value for the noise sets show effects of apparent complexity under the influence of sampling density
Continuum of Difficulty Spanned by Problems of the Same Family • Study specific application domains, alternative problem formulations • Guide feature transformations • Compare classifiers’ domain of competence • Help in choosing methods • Set expectations on recognition accuracy [Figure labels: k=1, k=99, random labels]
Domains of Competence of Classifiers • Given a classification problem, determine which classifier is the best for it [Figure: regions labeled LC, XCS, Decision Forest, and NN in the plane of complexity measure 1 vs. complexity measure 2; a new problem ("Here is my problem!") is marked "?"]
Domain of Competence Experiment • Use a set of 9 complexity measures: Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim • Characterize 392 two-class problems from UCI data, all shown to be linearly non-separable • Evaluate 6 classifiers: NN (1-nearest neighbor), LP (linear classifier by linear programming), Odt (oblique decision tree), Pdfc (random subspace decision forest; ensemble method), Bdfc (bagging based decision forest; ensemble method), XCS (a genetic-algorithm based classifier; ensemble method)
Best Classifier (Seen in Projection to Boundary-NonLinNN) [Figure: scatter plot; legend as extracted: Bdfc + LP X NN O XCS ⃟ Odt . Several]
Best Classifier (Seen in Projection to IntraInter-Pretop) [Figure: scatter plot; legend as extracted: Bdfc + LP X NN O XCS ⃟ Odt . Several]
Best Classifier (Seen in Projection to MaxEff-VolumeOverlap) [Figure: scatter plot; legend as extracted: Bdfc + LP X NN O XCS ⃟ Odt . Several]
Worst Classifier [Figure panels: Boundary-NonLinNN, IntraInter-Pretop, MaxEff-VolumeOverlap; legend as extracted: Bdfc + LP X NN O XCS ⃟ Odt . Several]
Best Classifier Being NN, LP, or Odt vs. an Ensemble Technique [Figure panels: Boundary-NonLinNN, IntraInter-Pretop, MaxEff-VolumeOverlap; legend as extracted: ensemble + nn,lp,odt]
Summary of Observations • 270 / 392 problems have a significantly best (dominant) classifier • 157 / 392 have a significant worst • 4 classifiers (NN, LP, Bdfc, XCS) are dominant for some problems • Almost all problems with a dominant classifier are best solved by either NN or LP • Simple methods (NN, LP) have extreme behavior: When conditions are favorable, they are optimal for the problem • Single decision trees (by Odt) have no domain of dominant competence: Significantly worst methods are almost always single decision trees (Odt), followed by LP and NN • Ensemble methods are in close rivalry for many problems; they are safe choices when boundary complexity is unknown
Extension to Multiple Classes • Fisher’s discriminant score → multiple discriminant scores • Boundary point in an MST: a point is a boundary point as long as it is next to a point from another class in the MST
Intrinsic Ambiguity • The complexity measures can be severely affected when there exists intrinsic class ambiguity (or data mislabeling) • Example: FeatureOverlap (in 1D only) • Cannot distinguish between intrinsic ambiguity and a complex class decision boundary
Tackling Intrinsic Ambiguity • Compute the complexity measure at different error levels • f(D): a complexity measure on the data set D • D*: a “perturbed” version of D, so that some points are relabeled • h(D, D*): a distance measure between D and D* (error level) • The new complexity measure is defined as a curve: • The curve can be summarized by, say, area under curve • Minimization by greedy procedures • Discard erroneous points that decrease complexity by the most
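The greedy procedure on this slide might look like the sketch below. The curve's formula is not in the transcript, so several things here are assumptions: the error level h(D, D*) is taken to be the fraction of relabeled points, the stand-in measure f(D) is the fraction of points whose nearest neighbor has a different label, labels are 0/1, and the search stops once no single relabeling lowers the measure. All names are illustrative.

```python
import math

def nn_disagreement(points, labels):
    """Stand-in f(D): fraction of points whose nearest other point has a different label."""
    n = len(points)
    disagreements = 0
    for i in range(n):
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: math.dist(points[i], points[j]))
        if labels[nearest] != labels[i]:
            disagreements += 1
    return disagreements / n

def greedy_complexity_curve(points, labels, max_flips, measure=nn_disagreement):
    """Return [(error_level, complexity), ...] obtained by greedy relabeling."""
    current = list(labels)
    n = len(current)
    flipped = set()
    curve = [(0.0, measure(points, current))]
    for step in range(1, max_flips + 1):
        best_index, best_value = None, math.inf
        for i in range(n):
            if i in flipped:
                continue
            current[i] = 1 - current[i]        # try relabeling point i
            value = measure(points, current)
            current[i] = 1 - current[i]        # undo the trial
            if value < best_value:
                best_index, best_value = i, value
        # Assumption: stop when no single relabeling reduces the complexity.
        if best_index is None or best_value >= curve[-1][1]:
            break
        current[best_index] = 1 - current[best_index]
        flipped.add(best_index)
        curve.append((step / n, best_value))
    return curve
```

The resulting list can be summarized by the area under it, as the slide suggests. A curve that drops to a low value after a few relabelings hints at mislabeling or ambiguity, whereas a curve that stays high suggests a genuinely complex boundary.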
Global vs. Local Properties • Boundaries can be simple locally but complex globally • These types of problems are relatively simple, but are characterized as complex by the measures • Solution: complexity measure at different scales • This can be combined with different error levels • Let Ni,k be the k neighbors of the i-th point defined by, say, Euclidean distance. The complexity measure for data set D, error level , evaluated at scale k is
Real Problems Have a Mixture of Difficulties • E.g., sparse samples and complex geometry cause ill-posedness • Can’t tell which is better! • Requires further hypotheses on data geometry
To conclude: • We have some early success in using geometrical measures to characterize classification complexity, pointing to a potentially fruitful research direction. To continue: • More, better measures; • Detailed studies of their utilities, interactions, and estimation uncertainty; • Deeper understanding of constraints on these measures from point set geometry and topology; • Apply these to understand practical problems; • Apply these to make better pattern recognition algorithms.
Ongoing Applications • Simplifying class boundaries in MR spectra via feature reduction (with Richard Baumgartner, Ray Somorjai et al.) • Risk assessment of biometrical authentication (with Anil Jain et al.) • “Data Complexity in Pattern Recognition”, M. Basu, T.K. Ho (eds.), Springer-Verlag, to appear, 2005: a collection of 15 papers about this subject