
Feature Selection Focused within Error Clusters

Sui-Yu Wang and Henry Baird

Presented by Sui-Yu Wang

Feature Selection

  • Given a set of n features, find a subset of k < n features that still performs well

    • The best k features chosen individually are usually not the best k when chosen jointly (Elashoff et al., 1967)

    • To select the optimal subset, one has to exhaustively search through all k-element subsets (Cover and Van Campenhout, 1977)

    • Given a limited number of training samples and features, finding the minimum subset of features that misclassifies no training sample is NP-complete (Van Horn and Martinez, 1994)

Feature Selection

  • Methods can be divided into three categories: wrappers, filters, and embedded methods (Guyon and Elisseeff, 2003)

    • Filters: rank individual features according to various metrics

    • Wrappers: evaluate subsets of features according to a given classifier's performance

    • Embedded methods: similar to wrappers, but use non-exhaustive search methods
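
To make the distinction concrete, here is a minimal sketch assuming scikit-learn; the metric (mutual information), the classifier (logistic regression), and the function names are illustrative choices, not from the slides:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def filter_select(X, y, k):
    # Filter: score each feature independently (here by mutual
    # information with the labels) and keep the top k.
    scores = mutual_info_classif(X, y, random_state=0)
    return np.argsort(scores)[::-1][:k]

def wrapper_score(X, y, subset):
    # Wrapper: judge a candidate subset by the cross-validated
    # accuracy of the classifier that will actually be deployed.
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, subset], y, cv=3).mean()
```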

A Motivating Example


Task: classify each pixel as handwriting or blank.

We have to search a window about 25 pixels in diameter around each pixel to get any useful features: D ≈ 450+ pixel values

So candidate features are extremely numerous: any combination of those 450+ pixel values
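
A sample's raw features might be gathered as follows; a minimal numpy sketch in which the function name, image layout, and radius-12 disk (about 450 grid points, matching the count above) are assumptions for illustration:

```python
import numpy as np

def window_features(img, r, c, radius=12):
    # Collect the pixel values in a disk of diameter 25 centered on
    # (r, c); a radius-12 disk holds roughly 450 integer grid points.
    # Assumes (r, c) is at least `radius` pixels from every border.
    rr, cc = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    mask = rr**2 + cc**2 <= radius**2
    patch = img[r - radius:r + radius + 1, c - radius:c + radius + 1]
    return patch[mask]          # ~450 candidate pixel-value features
```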

Popular Method: Principal Components Analysis (PCA)

  • PCA finds a small number of linear combinations of the original features

  • PCA finds the dimensions that represent the data best in a least-squares sense, but does not guarantee good separation of the classes (Pearson, 1901)

  • Most algorithms apply PCA first and then run their respective feature selection algorithms on the reduced set

    • Could throw away potentially interesting information
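
For reference, a minimal PCA sketch via the SVD (numpy only; the function name and interface are illustrative):

```python
import numpy as np

def pca_reduce(X, k):
    # Center the data, then project it onto the k directions of
    # largest variance. This is optimal for least-squares
    # reconstruction, but nothing forces those directions to
    # separate the two classes.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T        # n x k matrix of component scores
```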

Our Research Strategy

  • We want to find methods for guiding the search for a few strongly discriminating features.

  • We adopt a greedy heuristic: constructing one feature at a time.

  • We focus our search on cases where the current features fail.


  • We assume a two-class problem

  • The original sample space is R^D, where D is huge

  • We are given d << D hand-crafted features; all samples are projected into the feature space R^d by a feature extractor E: R^D → R^d. We may lose information in the process

  • If there is any discriminating information in the sample space R^D but not in the feature space R^d, it must lie in the null space of E

Finding the Null Space

  • If E is linear, the null space can be computed by standard linear-algebra methods

  • Given E, a singular value decomposition (SVD) can be used to find a set of vectors spanning the null space of E:

    • E can be factorized as E = U Σ V^T, where U and V are orthogonal matrices

    • The rows of V^T (equivalently, the columns of V) whose singular values are zero form an orthonormal basis for the null space of E
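
A minimal numpy sketch of this computation, assuming the linear extractor is given as a d × D matrix E; the function name and tolerance are illustrative:

```python
import numpy as np

def null_space_basis(E, tol=1e-12):
    # SVD: E = U @ diag(s) @ Vt. The rows of Vt whose singular
    # values are (numerically) zero span the null space of E;
    # return them as orthonormal columns of a D x (D - rank) matrix.
    U, s, Vt = np.linalg.svd(E)
    rank = int(np.sum(s > tol * s.max()))
    return Vt[rank:].T
```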

Finding the Next Feature

  • Samples that fall at the same point in the feature space R^d are not discriminated by the current feature set

  • Samples that lie in tight clusters in R^d are only weakly discriminated by the current feature set

  • A tight cluster of errors of both classes indicates cases where the current feature set fails completely

  • Therefore, we use these tight clusters to guide the forward search for new features

  • Once we have projected samples from the tight error cluster into the null space, we find a hyperplane that best separates the data and take a given sample x's distance to this hyperplane, d(x), as the new feature
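
A sketch of this step, assuming a linear SVM as the hyperplane finder (the slides do not commit to a particular method) and a null-space basis N with orthonormal columns, e.g. from the SVD above; all names are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def feature_from_cluster(cluster_X, cluster_y, N):
    # Project the cluster samples into the null space, fit a
    # separating hyperplane (w, b) there, and return a function
    # computing d(x): the signed distance of any sample x to it.
    Z = cluster_X @ N                     # rows of cluster_X are samples
    svm = LinearSVC().fit(Z, cluster_y)
    w, b = svm.coef_[0], svm.intercept_[0]
    scale = np.linalg.norm(w)

    def d(x):
        return (x @ N @ w + b) / scale    # the new scalar feature
    return d
```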

Operate on Points in the Null Space

  • There are many ways to project points in the sample space into the null space of E:

    • The orthogonal projection onto a particular subspace is unique

    • Let N = [n_1, …, n_m] be a matrix whose columns form an orthonormal basis for the null space. Then the orthogonal projection of a sample x is P(x) = N N^T x
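
In code the projection is one line; a minimal numpy sketch assuming N holds the orthonormal basis as columns:

```python
import numpy as np

def project(x, N):
    # Orthogonal projection onto the span of N's columns:
    # P(x) = N @ N.T @ x. Since N.T @ N = I, applying P twice
    # gives the same point (idempotence), and P is unique.
    return N @ (N.T @ x)
```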

Outline of the Algorithm


Repeat:

  • Draw enough samples to train a classifier

  • Draw enough samples to build a test set

  • Find clusters of errors in the feature space R^d

  • Repeat:

    • Choose a tight cluster containing both types of errors

    • Draw enough samples to populate this cluster (if necessary)

    • Project the cluster into the null space of E

    • Find a separating hyperplane in the null space whose normal vector best separates the samples in this cluster

    • Construct a new feature d(x) and examine its performance

    until the new feature lowers the error rate sufficiently

Until the error rate is satisfactory to the user
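
The nested loops can be sketched as follows; every helper name is a hypothetical stand-in for one step of the outline above, not the authors' implementation:

```python
def feature_discovery(features, helpers, target_error):
    # Outer loop: retrain, re-test, and re-cluster after each
    # accepted feature, until the error rate satisfies the user.
    while True:
        clf = helpers.train_classifier(features)
        clusters = helpers.find_error_clusters(clf, features)
        # Inner loop: try tight, balanced error clusters until one
        # yields a feature that lowers the error rate sufficiently.
        while True:
            cluster = helpers.choose_tight_cluster(clusters)
            new_f = helpers.feature_from_cluster(cluster)
            if helpers.error_rate(features + [new_f]) < helpers.error_rate(features):
                break
        features = features + [new_f]
        if helpers.error_rate(features) <= target_error:
            return features
```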


Experiments

  • Experiments were conducted on a document image content extraction problem

    • Each image pixel is treated as a sample

    • The task is to classify each sample as handwriting (HW) or machine print (MP)

    • Possible features are extracted from a 25×25 pixel square, so D = 625

Experimental Results
  • We divide the data into three sets: a training set, a feature discovery set, and a test set.

    • The training set consists of 4,469,740 MP samples and 943,178 HW samples

    • The feature discovery set consists of 4,980,418 MP and 1,496,949 HW samples

    • The test set consists of 816,673 MP samples and 649,113 HW samples


Which Cluster is Best?

  • Experiments suggest that tight, balanced clusters, containing errors of both classes, are best

Future Work

  • Apply the method to other problems

  • Continue the experiments to see how low the error rate can drop

  • Analyze cluster statistics to establish rules for selecting better cluster candidates

  • Try other hyperplane-finding methods

  • Establish a theoretical framework for when this approach is guaranteed to work and when it fails
