Feature Selection Focused within Error Clusters

Sui-Yu Wang and Henry Baird

Presented by Sui-Yu Wang

Feature Selection

  • Given a set of n features, find a subset of k < n features that still performs well

    • The best k features chosen separately are usually not the best k when chosen together (Elashoff et al., 1967)

    • To select the optimal subset, one has to search exhaustively through all k-element subsets (Cover and Van Campenhout, 1977)

    • Given a limited number of training samples and features, finding the minimum subset of features that misclassifies no training sample is NP-complete (Van Horn and Martinez, 1994)
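The first two points can be illustrated with a toy sketch. The scores and the redundancy penalty below are hypothetical values chosen only to show the effect; they do not come from the presentation.

```python
from itertools import combinations
from math import comb

# Hypothetical per-feature scores; features 0 and 1 are assumed redundant,
# so using both together earns a penalty.
gain = [3.0, 2.9, 2.0, 0.5]

def score(subset):
    """Toy subset score: sum of individual gains minus a redundancy penalty."""
    total = sum(gain[j] for j in subset)
    if 0 in subset and 1 in subset:
        total -= 2.5
    return total

# Ranking features one at a time picks the two redundant features.
top_k_individually = tuple(sorted(range(4), key=lambda j: -gain[j])[:2])

# Exhaustive search over all k-element subsets finds a different pair.
best_jointly = max(combinations(range(4), 2), key=score)

# The number of subsets to examine grows combinatorially.
n_small = comb(20, 5)      # 15504
n_large = comb(450, 10)    # far too many to enumerate
```

Here `top_k_individually` is `(0, 1)` while `best_jointly` is `(0, 2)`, which is why exhaustive search is needed in general and why it quickly becomes infeasible.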

Feature Selection

  • Methods can be divided into three categories: wrappers, filters, and embedded methods. (Guyon and Elisseeff, 2003)

    • Filters: rank features according to various metrics

    • Wrappers: evaluate subsets of features according to a given classifier

    • Embedded methods: similar to wrappers, but use non-exhaustive search methods
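As a rough illustration of the difference between a filter and a wrapper, here is a minimal sketch. The Fisher-style score, the nearest-centroid classifier, and the greedy search are stand-ins chosen for brevity, not methods named in the presentation, and the data is synthetic.

```python
import numpy as np

def fisher_scores(X, y):
    """Filter: score each feature on its own by class-mean separation."""
    X0, X1 = X[y == 0], X[y == 1]
    return (X0.mean(0) - X1.mean(0)) ** 2 / (X0.var(0) + X1.var(0) + 1e-12)

def centroid_accuracy(X, y):
    """Training accuracy of a nearest-centroid classifier (stand-in classifier)."""
    c0, c1 = X[y == 0].mean(0), X[y == 1].mean(0)
    pred = (np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)).astype(int)
    return (pred == y).mean()

def wrapper_forward(X, y, k):
    """Wrapper: greedily add the feature that most improves the classifier."""
    chosen = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(rest, key=lambda j: centroid_accuracy(X[:, chosen + [j]], y))
        chosen.append(best)
    return chosen

# Synthetic data: only feature 0 carries class information.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
X = rng.normal(size=(400, 5))
X[:, 0] += 3.0 * y

filter_rank = np.argsort(fisher_scores(X, y))[::-1]   # best feature first
wrapper_pick = wrapper_forward(X, y, k=1)
```

The filter never consults a classifier, while the wrapper retrains the classifier for every candidate subset, which is what makes wrappers more expensive.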

A Motivating Example


Task: Classify each pixel into handwriting or blank:

We have to search in a diameter of 25 pixels to get any useful features: D ≈ 450+ pixel values

So possible features can be extremely numerous: any combination of 450 pixel values
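The pixel count can be checked with a short sketch. The rule for which pixels count as inside the circular window (centre pixel at the origin, distance at most 12.5) is an assumption made here, so the exact count may differ from the authors' figure.

```python
import numpy as np

diameter = 25
r = diameter / 2
offsets = np.arange(diameter) - diameter // 2    # -12 .. 12
dx, dy = np.meshgrid(offsets, offsets)

# Pixels whose centres lie within the circular window (assumed convention).
in_disc = dx**2 + dy**2 <= r**2
n_disc = int(in_disc.sum())

# Pixels in the bounding 25 x 25 square, for comparison.
n_square = diameter * diameter
```

Under this convention `n_disc` is a few hundred pixel values, and the bounding square holds 625.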

Popular Method: PCA

Principal Components Analysis

  • PCA finds a small number of linear combinations of original features

  • PCA finds the dimensions that represent the data best in a least-squares sense, but does not guarantee good separation of the classes (Pearson, 1901)

  • Most algorithms employ PCA first, then run their own feature selection algorithm on the reduced set

    • This could throw away potentially interesting information
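A small synthetic sketch of this risk follows. The data is made up for illustration: one coordinate has high variance but no class information, and the other has low variance but separates the classes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
y = np.repeat([0, 1], n)
X = np.column_stack([
    rng.normal(scale=10.0, size=2 * n),            # high variance, no class info
    rng.normal(scale=0.2, size=2 * n) + 2.0 * y,   # low variance, separates classes
])

Xc = X - X.mean(axis=0)
# Rows of Vt are the principal directions, sorted by decreasing variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]

def class_gap(v):
    """Distance between class means along v, in units of the overall std."""
    return abs(v[y == 1].mean() - v[y == 0].mean()) / v.std()

gap_kept = class_gap(Xc @ pc1)        # component PCA would keep
gap_discarded = class_gap(Xc @ pc2)   # component PCA would drop
```

Keeping only the first principal component retains almost all the variance here, yet nearly all the class separation sits in the discarded component.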

Our Research Strategy

  • We want to find methods for guiding the search for a few strongly discriminating features.

  • We adopt a greedy heuristic: constructing one feature at a time.

  • We focus our search on cases where the current features fail.


  • We assume a two-class problem

  • The original sample space is $\mathbb{R}^D$, where $D$ is huge

  • We are given $d \ll D$ hand-crafted features; all samples are projected into this feature space by a feature extractor $f: \mathbb{R}^D \to \mathbb{R}^d$. We may lose information during the process

  • If there is any discriminating information in the sample space $\mathbb{R}^D$ but not in the feature space $\mathbb{R}^d$, it must be in the null space of $f$

Finding the Null Space

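  • If the feature extractor $f$ is linear, the null space can be computed by linear algebra methods

  • Given the $d \times D$ matrix $A$ of $f$, a singular value decomposition (SVD) can be used to find a set of vectors spanning the null space of $A$:

    • $A$ can be factorized as $A = U \Sigma V^\top$, where $U$ and $V$ are orthogonal matrices

    • And the columns of $V$ that correspond to zero singular values span the null space of $A$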
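A minimal NumPy sketch of this computation is given below. The matrix `A` stands for a hypothetical linear feature extractor, and the tolerance used to decide which singular values count as zero is an assumption.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Orthonormal basis (as columns) for {x : A x = 0}, from the SVD of A."""
    # full_matrices=True returns all D right singular vectors as rows of Vt.
    _, s, Vt = np.linalg.svd(A, full_matrices=True)
    rank = int((s > tol * s.max()).sum())
    # Right singular vectors beyond the rank span the null space.
    return Vt[rank:].T

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))    # hypothetical extractor with d = 3, D = 8
N = null_space_basis(A)        # 8 x 5 matrix of null-space basis vectors
```

Every column of `N` is mapped to zero by `A`, so any variation of a sample along these directions is invisible to the current features.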

Finding the Next Feature

  • Samples that fall at the same point in the feature space $\mathbb{R}^d$ are not discriminated by the current feature set

  • Samples that lie in tight clusters in the feature space are only weakly discriminated by the current feature set

  • A tight cluster of errors of both classes indicates cases where the current feature set fails completely

  • Therefore, we use these tight clusters to guide the forward search for new features

  • Once we have projected samples from the tight error cluster into the null space, we find a hyperplane that best separates the two classes, and use a given sample $x$'s distance to this hyperplane as the new feature
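The last step might look like the following sketch. The least-squares fit is only a stand-in for whatever hyperplane-finding method the authors used, and the extractor `A`, the helper names, and the synthetic cluster are all hypothetical.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Orthonormal basis (as columns) of the null space of A."""
    _, s, Vt = np.linalg.svd(A, full_matrices=True)
    return Vt[int((s > tol * s.max()).sum()):].T

def fit_hyperplane(Z, y):
    """Least-squares hyperplane w.z + b = 0 for labels y in {0, 1} (assumed method)."""
    Z1 = np.column_stack([Z, np.ones(len(Z))])
    sol, *_ = np.linalg.lstsq(Z1, 2.0 * y - 1.0, rcond=None)
    return sol[:-1], sol[-1]

def make_feature(A, X_cluster, y_cluster):
    """Return a function mapping samples to signed distance from the hyperplane."""
    N = null_space_basis(A)
    w, b = fit_hyperplane(X_cluster @ N, y_cluster)   # fit in null-space coordinates
    norm = np.linalg.norm(w)
    return lambda X: (X @ N @ w + b) / norm

# Synthetic cluster: the extractor keeps coordinates 0 and 1, where the two
# classes coincide; the classes differ only in coordinate 3.
rng = np.random.default_rng(0)
A = np.hstack([np.eye(2), np.zeros((2, 3))])
y = np.repeat([0, 1], 100)
X = rng.normal(scale=0.3, size=(200, 5))
X[:, 3] += 3.0 * y - 1.5

new_feature = make_feature(A, X, y)
acc = ((new_feature(X) > 0).astype(int) == y).mean()
```

In this toy setting the sign of the new feature separates the two classes, even though the existing features cannot distinguish them.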

Operate on Points in the Null Space

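  • There are many ways to project points in the sample space into the null space of the feature extractor $f$

    • The orthogonal projection onto a particular subspace is unique

    • Let $B = [b_1, \dots, b_m]$, where $\{b_1, \dots, b_m\}$ is an orthonormal basis for the subspace $S$. Then the orthogonal projection of $x$ onto $S$ is $B B^\top x$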
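The orthogonal projection can be sketched in a few lines. The basis below is generated by a QR factorization purely for illustration, and the names `B` and `project_onto` are assumptions.

```python
import numpy as np

def project_onto(B, x):
    """Orthogonal projection of x onto span(B); columns of B must be orthonormal."""
    return B @ (B.T @ x)

# Orthonormal basis for a 2-D subspace of R^4 (illustrative only).
rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.normal(size=(4, 2)))
x = rng.normal(size=4)
p = project_onto(B, x)
```

Projecting twice changes nothing, and the residual `x - p` is orthogonal to the subspace, which is what makes this projection the unique orthogonal one.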

Outline of the Algorithm


Repeat

    Draw enough samples to train a classifier

    Draw enough samples to build a test set

    Find clusters of errors in the feature space $\mathbb{R}^d$

    Repeat

        Choose a tight cluster with both types of errors

        Draw enough samples to populate this cluster (if necessary)

        Project the cluster into the null space

        Find a hyperplane in the null space whose normal vector best separates the samples in this cluster

        Construct a new feature and examine its performance

    Until the feature lowers the error rate sufficiently

Until the error rate is satisfactory to the user
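The outline can be sketched end to end on synthetic data. This is a simplified illustration under several assumptions: a linear extractor `A`, a nearest-centroid classifier, a nearest-neighbour rule for picking a balanced error cluster, a least-squares hyperplane, and no separate test set. It accepts every new feature rather than checking how much each one helps, and none of the function names come from the presentation.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Orthonormal basis (as columns) of the null space of A."""
    _, s, Vt = np.linalg.svd(A, full_matrices=True)
    return Vt[int((s > tol * s.max()).sum()):].T

def centroid_predict(F_train, y_train, F):
    """Stand-in classifier: assign each row of F to the nearest class centroid."""
    c0 = F_train[y_train == 0].mean(axis=0)
    c1 = F_train[y_train == 1].mean(axis=0)
    return (np.linalg.norm(F - c1, axis=1) < np.linalg.norm(F - c0, axis=1)).astype(int)

def error_rate(A, X_tr, y_tr, X, y):
    return (centroid_predict(X_tr @ A.T, y_tr, X @ A.T) != y).mean()

def pick_error_cluster(F, y, wrong, m=40):
    """Assumed rule: the m mutually nearest errors with the best class balance."""
    idx = np.flatnonzero(wrong)
    if len(idx) == 0:
        return None
    d = np.linalg.norm(F[idx, None] - F[None, idx], axis=2)
    near = idx[np.argsort(d, axis=1)[:, :m]]     # m nearest errors per error
    balance = np.minimum(y[near].sum(axis=1), (1 - y[near]).sum(axis=1))
    return near[np.argmax(balance)] if balance.max() > 0 else None

def discover_features(A, X_tr, y_tr, X_dv, y_dv, target=0.05, max_rounds=5):
    for _ in range(max_rounds):
        pred = centroid_predict(X_tr @ A.T, y_tr, X_dv @ A.T)
        wrong = pred != y_dv
        if wrong.mean() <= target:               # error rate is satisfactory
            break
        idx = pick_error_cluster(X_dv @ A.T, y_dv, wrong)
        N = null_space_basis(A)
        if idx is None or N.shape[1] == 0:
            break
        # Fit a hyperplane to the cluster in null-space coordinates.
        Z = np.column_stack([X_dv[idx] @ N, np.ones(len(idx))])
        sol, *_ = np.linalg.lstsq(Z, 2.0 * y_dv[idx] - 1.0, rcond=None)
        w = N @ sol[:-1]                         # normal vector in sample space
        A = np.vstack([A, w / np.linalg.norm(w)])  # append the new linear feature
    return A

# Synthetic demo: the current features (coordinates 0, 1) are pure noise;
# the class information is hidden in coordinate 3.
rng = np.random.default_rng(0)

def sample(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(scale=0.3, size=(n, 6))
    X[:, :2] = rng.normal(size=(n, 2))
    X[:, 3] += 3.0 * y - 1.5
    return X, y

A0 = np.hstack([np.eye(2), np.zeros((2, 4))])
X_tr, y_tr = sample(1000)
X_dv, y_dv = sample(1000)
A1 = discover_features(A0, X_tr, y_tr, X_dv, y_dv)
err_before = error_rate(A0, X_tr, y_tr, X_dv, y_dv)
err_after = error_rate(A1, X_tr, y_tr, X_dv, y_dv)
```

Because each new feature is a linear function of the sample, it can be appended as a row of `A`, so the null space shrinks by one dimension per accepted feature.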


  • Experiments were conducted on a document image content extraction problem

    • Each image pixel is treated as a sample

    • The task is to classify each sample into handwriting or machine print

    • Possible features are extracted from a 25 × 25 pixel square, D = 625


  • We divide the data into three sets: a training set, a feature discovery set, and a test set.

    • The training set consists of 4,469,740 machine-print (MP) samples and 943,178 handwriting (HW) samples

    • The feature discovery set consists of 4,980,418 MP samples and 1,496,949 HW samples

    • The test set consists of 816,673 MP samples and 649,113 HW samples

Which Cluster is Best?

  • Experiments suggest that tight balanced clusters are best

Future Work

  • Apply the method to other problems

  • Continue the experiment to see how low the error can drop

  • Analyze cluster statistics to establish rules for selecting better cluster candidates

  • Try other hyperplane-finding methods

  • Establish a theoretical framework for when this approach is guaranteed to work and when it fails