Geometrical Complexity of Classification Problems

Tin Kam Ho

Computing Sciences Research Center

Bell Labs, Lucent Technologies

With contributions from Mitra Basu, Ester Bernado, Martin Law

What Is the Story in this Image?

A fascinating theme.

What will the world be like if machines can do it?

- Robots can see, read, hear, and smell.
- We can get rid of all spam email, and find exactly what we need from the web.
- We can protect everything with our signatures, fingerprints, iris patterns, or just our face.
- We can track anybody by look, gesture, and gait.
- We can be warned of disease outbreaks and terrorist threats.
- Feed in our vital data and we will have a perfect diagnosis.
- We will know whom we can trust when we lend our money.

And more …

- We can tell the true emotions of all our encounters!
- We can predict stock prices!
- We can identify all potential criminals!
- Our weapons will never lose their targets!
- We will be alerted about any extraordinary events going on in the heavens, on or inside the Earth!
- We will have machines discovering all knowledge!

… But how far are we from these?

[Figure: pattern recognition workflow. Samples go through Feature Extraction to give feature vectors. Supervised learning: Classifier Training, classifier, Classification, class membership. Unsupervised learning: Clustering, Cluster Validation, Cluster Interpretation. Inset scatter plot with axes feature 1 and feature 2.]

- Statistical Classifiers (for numerical vectors)
  - Bayesian classifiers,
  - polynomial discriminators,
  - nearest-neighbors,
  - decision trees & forests,
  - neural networks,
  - support vector machines,
  - ensemble methods, …

Also:

- Syntactic Classifiers (for symbol strings)
- Structural Classifiers (for graphs)

[Figure: scatter plot with axes Feature 1 and Feature 2.]

- Why are we still far from perfect?
- What is still missing in our techniques?

Large Variations in Accuracies of Different Classifiers

[Figure: accuracies plotted by classifier and application.]

Will I have any luck for my problem?

- Do they represent the limit of our technology?
- What have they added to the body of methods?
- Is there still value in the older methods?
- What is still missing in our techniques?

When I face a new problem …

- How much can automatic classifiers do?
- How should I choose a classifier?

- Class ambiguity
- Boundary complexity
- Sample size and dimensionality

- Is the concept intrinsically ambiguous?
- Are the classes well defined?
- What information do the features carry?
- Are the features sufficient for discrimination?

- Kolmogorov complexity
- Length can be exponential in dimensionality
- Trivial description: list all points & class labels
- Is there a shorter description?

Training samples for a 2D classification problem

[Figure: scatter plot with axes feature 1 and feature 2.]

Classification boundaries in feature space

[Figures: the same feature space (axes feature 1 and feature 2) partitioned by each classifier.]

- XCS (a genetic algorithm): hyperrectangles
- Nearest neighbor: Voronoi cells
- Linear classifier: hyperplanes

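[Figure: Problem A and Problem B, each solved by XCS and NN, with one classifier marked "Better!" on each problem. Error labels in extraction order: 0.6%, 0.06%, 0.7%, 1.9%. Classifier labels in extraction order: XCS, NN, NN, XCS.]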

A problem may appear deceptively simple or complex with small samples

[Figure: panels showing 2, 10, 100, 500, and 1000 points.]

Our goal: tools and languages for studying

- Characteristics of geometry & topology of high-dim data sets
- How they change with feature transformations and sampling
- How they interact with classifier geometry

We want to know:

- What are real-world problems like?
- What is my problem like?
- What can be expected of a method on a specific problem?

- Identify key features of data geometry that are relevant for classification
- Develop algorithms to extract such features from a dataset
- Describe patterns of classifier behavior in terms of geometrical features
- … pattern recognition in pattern recognition problems
- … classification of classification problems

- Data sets:
  - length of class boundary
  - fragmentation of classes / existence of subclasses
  - global or local linear separability
  - convexity and smoothness of boundaries
  - intrinsic / extrinsic dimensionality
  - stability of these characteristics as sampling rate changes

- Classifier models:
  - polygons, hyper-spheres, Gaussian kernels, axis-parallel hyper-planes, piece-wise linear surfaces, polynomial surfaces, their unions or intersections, …

- Defined for one feature:

means, variances of classes 1,2

- One good feature makes a problem easy
- Take maximum over all features
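
As a minimal sketch of the per-feature measure outlined above: the transcript does not preserve the formula, so this assumes the standard form of Fisher's discriminant ratio, (μ1 − μ2)² / (σ1² + σ2²), and it assumes population (not sample) variances. The function names are hypothetical.

```python
from statistics import mean, pvariance


def fisher_ratio(values_class1, values_class2):
    """Fisher's discriminant ratio for ONE feature (assumed standard form)."""
    m1, m2 = mean(values_class1), mean(values_class2)
    # Assumption: population variances; the slide does not say which is used.
    v1, v2 = pvariance(values_class1), pvariance(values_class2)
    denom = v1 + v2
    if denom == 0:
        # Both classes are constant on this feature.
        return float("inf") if m1 != m2 else 0.0
    return (m1 - m2) ** 2 / denom


def max_fisher_ratio(class1, class2):
    """Take the maximum over all features: one good feature makes a problem easy."""
    n_features = len(class1[0])
    return max(
        fisher_ratio([p[d] for p in class1], [p[d] for p in class2])
        for d in range(n_features)
    )
```

For two classes that differ only in their first feature, for example `[(0.0, 5.0), (2.0, 6.0)]` versus `[(10.0, 5.0), (12.0, 6.0)]`, the second feature contributes a ratio of zero and the maximum comes entirely from the first.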

- Studies from early literature:
  - Perceptrons, Perceptron Cycling Theorem, 1962
  - Fractional Correction Rule, 1954
  - Widrow-Hoff Delta Rule, 1960
  - Ho-Kashyap algorithm, 1965

- Many algorithms stop only with positive conclusions: they halt once a separating hyperplane is found, but give no verdict on nonseparable data.
- But this one works:
  - Linear programming, 1968
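
The remark that many algorithms stop only with positive conclusions can be illustrated with a plain perceptron. This is a sketch, not the procedure used in the study: the function name, the `max_epochs` cap, and the ±1 label encoding are assumptions. When the classes are linearly separable the loop ends with a separating hyperplane; otherwise it simply runs out of epochs, which proves nothing either way.

```python
def perceptron_separable(points, labels, max_epochs=1000):
    """Try to find a separating hyperplane with the perceptron rule.

    labels must be +1 or -1. Returns (w, b) on success (a positive
    conclusion). Returns None when the epoch cap is hit, which is
    inconclusive: the data may be nonseparable, or the cap too small.
    """
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None
```

A linear-programming formulation avoids this asymmetry because an infeasible program is itself a definite negative answer; that variant is not sketched here since it needs an LP solver.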

- Friedman & Rafsky 1979
  - Find MST (minimum spanning tree) connecting all points regardless of class
  - Count edges joining opposite classes
  - Sensitive to separability and clustering effects
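
The MST-based count above can be sketched as follows, assuming Euclidean distance and a small data set (Prim's algorithm on the complete graph, quadratic in the number of points). The function names and the returned pair are hypothetical choices for illustration.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0
    edges = []
    for _ in range(n - 1):
        # Attach the closest point that is not yet in the tree.
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best_dist[k])
        in_tree[j] = True
        edges.append((best_from[j], j))
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[j], points[k])
                if d < best_dist[k]:
                    best_dist[k] = d
                    best_from[k] = j
    return edges


def boundary_stats(points, labels):
    """Return (number of MST edges joining different classes,
    fraction of points incident to such an edge)."""
    cross = [(i, j) for i, j in mst_edges(points) if labels[i] != labels[j]]
    boundary_points = {i for edge in cross for i in edge}
    return len(cross), len(boundary_points) / len(points)
```

Two well-separated clusters are joined by a single cross-class edge, while classes interleaved along a line make every MST edge a cross-class edge.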

- Lebourgeois & Emptoz 1996
  - Packing of same-class points in hyper-spheres

- Thick and spherical, or thin and elongated manifolds

- UC-Irvine machine learning archive
  - 14 datasets (no missing values, > 500 pts)
  - 844 two-class problems
    - 452 linearly separable
    - 392 linearly nonseparable
  - 2 to 4648 points each
  - 8- to 480-dimensional feature spaces

- Randomly located and labeled points
  - 100 artificial problems
  - 1- to 100-dimensional feature spaces
  - 2 classes, 1000 points per class

Space of Complexity Measures

[Figure: problems plotted in the space of complexity measures, with regions labeled "Random noise" and "Linearly separable".]

- Noise sets and linearly separable sets occupy opposite ends in many dimensions
- In-between positions tell relative difficulty
- Fan-like structure in most plots
- At least 2 independent factors, joint effects
- Noise sets are far from real data
- Ranges of complexity value for the noise sets show effects of apparent complexity under the influence of sampling density

[Figure labels: k=99, k=1.]

Continuum of Difficulty Spanned by Problems of the Same Family

- Study specific application domains, alternative problem formulations
- Guide feature transformations
- Compare classifiers’ domain of competence
- Help in choosing methods
- Set expectations on recognition accuracy

[Figure: plane of Complexity measure 1 versus Complexity measure 2, with regions labeled LC, XCS, Decision Forest, NN, Random labels, and "?".]

- Given a classification problem, determine which classifier is the best for it

[Figure: a point marked "Here is my problem!", with regions labeled "ensemble methods".]

- Use a set of 9 complexity measures:
  Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim

- Characterize 392 two-class problems from UCI data, all shown to be linearly nonseparable

- Evaluate 6 classifiers:
  - NN (1-nearest neighbor)
  - LP (linear classifier by linear programming)
  - Odt (oblique decision tree)
  - Pdfc (random subspace decision forest)
  - Bdfc (bagging based decision forest)
  - XCS (a genetic-algorithm based classifier)

[Figures: problems plotted in the Boundary-NonLinNN, IntraInter-Pretop, and MaxEff-VolumeOverlap planes, marked by classifier. Legend: Bdfc; + LP; X NN; O XCS; ⃟ Odt; . Several. A second set of the same three panels uses the legend: ensemble; + nn,lp,odt.]

- 270 / 392 problems have a significantly best (dominant) classifier
- 157 / 392 have a significant worst
- 4 classifiers (NN, LP, Bdfc, XCS) are dominant for some problems
- Almost all problems with a dominant classifier are best solved by either NN or LP
- Simple methods (NN, LP) have extreme behavior: when conditions are favorable, they are optimal for the problem

- Single decision trees (by Odt) have no domain of dominant competence: significantly worst methods are almost always single decision trees (Odt), followed by LP and NN
- Ensemble methods are in close rivalry for many problems; they are safe choices when boundary complexity is unknown

- Fisher’s discriminant score → Multiple discriminant scores
- Boundary point in a MST: a point is a boundary point as long as it is next to a point from other classes in the MST

- The complexity measures can be severely affected when there exists intrinsic class ambiguity (or data mislabeling)
- Example: FeatureOverlap (in 1D only)

- Cannot distinguish between intrinsic ambiguity and a complex class decision boundary

- Compute the complexity measure at different error levels
- f(D): a complexity measure on the data set D
- D*: a “perturbed” version of D, so that some points are relabeled
- h(D, D*): a distance measure between D and D* (error level)
- The new complexity measure is defined as a curve:
- The curve can be summarized by, say, area under curve

- Minimization by greedy procedures
- Discard the erroneous points whose removal decreases complexity the most
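
A rough sketch of the error-level idea, under several loudly stated assumptions, since the transcript does not preserve the curve's formula. Here the curve is taken to be the lowest complexity reachable at each error level; the error level h is taken to be the fraction of points discarded; and `nn_disagreement` is a simplified stand-in for f(D), not one of the nine measures from the study. All function names are hypothetical.

```python
import math


def nn_disagreement(points, labels):
    """Stand-in complexity measure f(D): fraction of points whose
    nearest other point carries a different label."""
    n = len(points)
    if n < 2:
        return 0.0
    bad = 0
    for i in range(n):
        j = min(
            (k for k in range(n) if k != i),
            key=lambda k: math.dist(points[i], points[k]),
        )
        if labels[j] != labels[i]:
            bad += 1
    return bad / n


def complexity_curve(points, labels, measure, max_removed):
    """Greedy approximation: at each step discard the single point whose
    removal lowers the measure the most. Returns (error level, complexity)
    pairs, where error level = fraction of the original points discarded."""
    pts, labs = list(points), list(labels)
    n = len(pts)
    curve = [(0.0, measure(pts, labs))]
    for step in range(1, max_removed + 1):
        best = min(
            range(len(pts)),
            key=lambda i: measure(pts[:i] + pts[i + 1:], labs[:i] + labs[i + 1:]),
        )
        del pts[best]
        del labs[best]
        curve.append((step / n, measure(pts, labs)))
    return curve


def area_under_curve(curve):
    """Summarize the curve by its area (trapezoidal rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
    )
```

With one stray point of the second class sitting inside the first class, the stand-in measure starts well above zero and drops to zero once that single point is discarded, which is the behavior that separates mislabeling from a genuinely complex boundary. The greedy search recomputes the measure for every candidate removal, so it is only practical for small data sets.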

- Boundaries can be simple locally but complex globally
- These types of problems are relatively simple, but are characterized as complex by the measures
- Solution: complexity measure at different scales
- This can be combined with different error levels

- Let Ni,k be the k nearest neighbors of the i-th point, defined by, say, Euclidean distance. The complexity measure for data set D, at a given error level and evaluated at scale k, is computed over these neighborhoods.

E.g. Sparse Sample & Complex Geometry cause ill-posedness

Can’t tell which is better!

Requires further hypotheses on data geometry

To conclude:

- We have some early success in using geometrical measures to characterize classification complexity, pointing to a potentially fruitful research direction.

To continue:

- More, better measures;
- Detailed studies of their utilities, interactions, and estimation uncertainty;
- Deeper understanding of constraints on these measures from point set geometry and topology;
- Apply these to understand practical problems;
- Apply these to make better pattern recognition algorithms.

- Simplifying class boundaries in MR spectra via feature reduction, with Richard Baumgartner, Ray Somorjai et al.

- Risk assessment of biometrical authentication, with Anil Jain et al.

“Data Complexity In Pattern Recognition”, M. Basu, T.K. Ho (eds.), Springer-Verlag, to appear, 2005

A collection of 15 papers about this subject