
Feature Selection in Nonlinear Kernel Classification



Presentation Transcript



Workshop on Optimization-Based Data Mining Techniques with Applications

IEEE International Conference on Data Mining

Omaha, Nebraska, October 28, 2007

Feature Selection in Nonlinear Kernel Classification

Olvi Mangasarian & Edward Wild

University of Wisconsin–Madison


Example

  • The best linear classifier that uses only 1 feature selects the feature x1

  • However, the data is nonlinearly separable using only the feature x2

  • In general, nonlinear kernels use both x1 and x2

[Figure: scatter plot of + and − points in the (x1, x2) plane; the classes are separated by a nonlinear surface that depends only on x2, but not by any linear classifier using a single feature]

Feature selection in nonlinear classification is important



Outline

  • Minimize the number of input space features selected by a nonlinear kernel classifier

  • Start with a standard 1-norm nonlinear support vector machine (SVM)

  • Add a 0-1 diagonal matrix to suppress or keep features

    • Leads to a nonlinear mixed-integer program

  • Introduce an algorithm that obtains a good local solution to the resulting mixed-integer program

  • Evaluate the algorithm on two public datasets from the UCI repository and on synthetic NDCC data


Support Vector Machines

  • x ∈ Rn

  • The SVM is defined by the parameters u and the threshold γ of the nonlinear separating surface

  • A contains all data points: the rows labeled + form A+ and the rows labeled − form A−

  • e is a vector of ones

The classifier is trained to satisfy

K(A+, A′)u ≥ eγ + e

K(A−, A′)u ≤ eγ − e

  • Minimize e′y (the hinge loss, i.e., the plus function max{·, 0}) to fit the data

  • Minimize e′s (which equals ||u||1 at the solution) to reduce overfitting

  • The slack variable y ≥ 0 allows points to lie on the wrong side of the bounding surfaces

[Figure: + and − points separated by the nonlinear surface K(x′, A′)u = γ, with bounding surfaces K(x′, A′)u = γ + 1 and K(x′, A′)u = γ − 1]



Reduced Feature SVM

  • Start with the full SVM, in which all features are present in the kernel matrix K(A, A′)

  • Replace A with AE, where E is an n×n diagonal matrix with Eii ∈ {1, 0}, i = 1, …, n

    • If Eii is 0, the ith feature is removed

  • To suppress features, add the number of features present (e′Ee) to the objective with weight σ ≥ 0

  • As σ is increased, more features are removed from the classifier
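Combining the substitution A → AE with the feature-count term gives the nonlinear mixed-integer program mentioned in the Outline. A hedged sketch, again with ν as our name for the error weight (note (AE)′ = EA′ since E is diagonal):

```latex
% Hedged sketch of the resulting mixed-integer program: the 1-norm SVM LP
% with A replaced by AE and the feature-count term \sigma e'Ee added.
\begin{align*}
\min_{u,\,\gamma,\,y,\,s,\,E}\quad & \nu\, e'y + e's + \sigma\, e'Ee \\
\text{s.t.}\quad & D\bigl(K(AE, EA')u - e\gamma\bigr) + y \ge e, \\
& -s \le u \le s, \qquad y \ge 0, \\
& E = \operatorname{diag}(E_{11},\dots,E_{nn}),\; E_{ii} \in \{0,1\}.
\end{align*}
```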



Reduced Feature SVM (RFSVM)

  1) Initialize the diagonal matrix E randomly

  2) For fixed 0-1 values E, solve the SVM linear program to obtain (u, γ, y, s)

  3) Fix (u, γ, s) and sweep through E repeatedly as follows:

    • For each component of E, replace 1 by 0 (and conversely) provided the change decreases the overall objective function by more than tol

  4) Go to (3) if a change was made in the last sweep; otherwise continue to (5)

  5) Solve the SVM linear program with the new matrix E. If the objective decrease is less than tol, stop; otherwise go to (3)
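A minimal Python sketch of steps (1)–(5), assuming a Gaussian kernel; the authors' implementation was MATLAB with CPLEX, so scipy's LP solver stands in here, and the names nu, sigma, mu follow the notation above while their defaults are arbitrary. During sweeps, the optimal slack y is recovered in closed form for fixed (u, γ).

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(A, B, mu):
    # (K(A, B))ij = exp(-mu * ||A_i - B_j||^2), with B given row-wise
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

def solve_svm_lp(A, d, E, nu, mu):
    """1-norm SVM LP for fixed E; variable vector z = [u, gamma, y, s]."""
    m = A.shape[0]
    K = gaussian_kernel(A * E, A * E, mu)  # K(AE, EA') with E diagonal
    D = np.diag(d)
    c = np.concatenate([np.zeros(m), [0.0], nu * np.ones(m), np.ones(m)])
    # D(K u - e*gamma) + y >= e  rewritten as  -D K u + d*gamma - y <= -e
    row1 = np.hstack([-D @ K, d[:, None], -np.eye(m), np.zeros((m, m))])
    # u <= s  and  -u <= s
    row2 = np.hstack([np.eye(m), np.zeros((m, 1)), np.zeros((m, m)), -np.eye(m)])
    row3 = np.hstack([-np.eye(m), np.zeros((m, 1)), np.zeros((m, m)), -np.eye(m)])
    A_ub = np.vstack([row1, row2, row3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * m)])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m], res.x[m], res.x[2 * m + 1:]  # u, gamma, s

def objective(A, d, E, u, gamma, s, nu, sigma, mu):
    K = gaussian_kernel(A * E, A * E, mu)
    y = np.maximum(0.0, 1.0 - d * (K @ u - gamma))  # optimal slacks for fixed (u, gamma)
    return nu * y.sum() + s.sum() + sigma * E.sum()  # e'Ee = sum of diagonal

def rfsvm(A, d, nu=1.0, sigma=1.0, mu=1.0, tol=1e-6, rng=np.random.default_rng(0)):
    n = A.shape[1]
    E = rng.integers(0, 2, size=n).astype(float)     # step 1: random 0-1 diagonal
    u, gamma, s = solve_svm_lp(A, d, E, nu, mu)      # step 2: LP for fixed E
    best = objective(A, d, E, u, gamma, s, nu, sigma, mu)
    while True:
        changed = True
        while changed:                               # steps 3-4: sweep until no change
            changed = False
            for i in range(n):
                E[i] = 1.0 - E[i]                    # tentatively flip feature i
                obj = objective(A, d, E, u, gamma, s, nu, sigma, mu)
                if obj < best - tol:
                    best, changed = obj, True        # keep the flip
                else:
                    E[i] = 1.0 - E[i]                # revert
        u, gamma, s = solve_svm_lp(A, d, E, nu, mu)  # step 5: re-solve the LP
        obj = objective(A, d, E, u, gamma, s, nu, sigma, mu)
        if obj > best - tol:                         # insufficient decrease: stop
            return u, gamma, E
        best = obj
```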



RFSVM Convergence (for tol = 0)

  • Objective function value converges

    • Each step decreases the objective

    • Objective is bounded below by 0

  • Limit of the objective function value is attained at any accumulation point of the sequence of iterates

  • Accumulation point is a “local minimum solution”

    • Continuous variables are optimal for the fixed integer variables

    • Changing any single integer variable will not decrease the objective



Experimental Results

  • Classification accuracy versus number of features used

    • Compare our RFSVM to Relief and RFE (Recursive Feature Elimination)

    • Results given on two public datasets from the UCI repository

  • Ability of RFSVM to handle problems with up to 1000 features tested on synthetic NDCC datasets

    • Set feature selection parameter = 1



Relief and RFE

  • Relief

    • Kira and Rendell, 1992

    • Filter method: feature selection is a preprocessing procedure

    • Features are selected as relevant if they tend to have different feature values for points in different classes (see the sketch after this list)

  • RFE (Recursive Feature Elimination)

    • Guyon, Weston, Barnhill, and Vapnik, 2002

    • Wrapper method: feature selection is based on classification

    • Features are selected as relevant if removing them causes a large change in the margin of an SVM
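To make the Relief criterion concrete, here is a hedged Python sketch of the classic Relief score in the spirit of Kira and Rendell (1992): a feature scores high when each point's value differs more from its nearest neighbor of the other class (the "miss") than from its nearest neighbor of the same class (the "hit"). This is a simplified deterministic pass over all points, not necessarily the variant used in the experiments.

```python
import numpy as np

def relief_scores(A, d):
    """A: m x n data matrix; d: labels in {+1, -1}. Returns per-feature scores."""
    m, n = A.shape
    w = np.zeros(n)
    for i in range(m):
        dist = np.abs(A - A[i]).sum(axis=1)   # L1 distances to all points
        dist[i] = np.inf                      # exclude the point itself
        same, other = d == d[i], d != d[i]
        hit = np.where(same)[0][np.argmin(dist[same])]    # nearest same-class point
        miss = np.where(other)[0][np.argmin(dist[other])] # nearest other-class point
        w += np.abs(A[i] - A[miss]) - np.abs(A[i] - A[hit])
    return w / m                              # higher score = more relevant feature
```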



Ionosphere Dataset: 351 Points in R34

  • If the appropriate value of σ is selected, RFSVM can obtain higher accuracy using fewer features than SVM1

  • Even for feature selection parameter σ = 0, some features may be removed when removing them decreases the hinge loss

  • Note that accuracy decreases slightly until about 10 features remain, and then decreases more sharply as further features are removed

[Figure: cross-validation accuracy vs. number of features used, comparing RFSVM with a nonlinear SVM with no feature selection and a linear 1-norm SVM]


Normally Distributed Clusters on Cubes Dataset (Thompson, 2006)

  • Points are generated from normal distributions centered at vertices of 1-norm cubes

  • The dataset is not linearly separable
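A minimal Python sketch of the idea, not the actual NDCC generator (linked in the Questions slide): sample points from normal distributions centered at cube vertices with alternating class labels. For simplicity the centers here are random ±1 sign vertices, which differs in detail from the 1-norm cube vertices of the real generator; all parameters are assumptions.

```python
import numpy as np

def ndcc_like(n_points, n_features, n_centers=8, scale=3.0,
              rng=np.random.default_rng(0)):
    # random cube vertices, alternating +1/-1 labels between centers
    centers = rng.choice([-1.0, 1.0], size=(n_centers, n_features)) * scale
    labels = np.where(np.arange(n_centers) % 2 == 0, 1, -1)
    idx = rng.integers(0, n_centers, size=n_points)
    X = centers[idx] + rng.normal(size=(n_points, n_features))  # Gaussian clusters
    return X, labels[idx]
```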



RFSVM vs. SVM without Feature Selection (NKSVM1) on NDCC Data

  • With 100 true features and 1000 irrelevant features, average accuracy on 1000 test points: RFSVM 0.70 vs. NKSVM1 0.53

  • With 20 true features and varying numbers of irrelevant features, each point is the average test set correctness over 10 datasets with 200 training, 200 tuning, and 1000 testing points

  • When 480 irrelevant features are added, the accuracy of RFSVM is 45% higher than that of NKSVM1



Conclusion

  • New rigorous formulation with precise objective for feature selection in nonlinear SVM classifiers

    • Obtain a local solution to the resulting mixed-integer program

    • Alternate between a linear program to compute continuous variables and successive sweeps to update the integer variables

  • Efficiently learns accurate nonlinear classifiers with reduced numbers of features

  • Handles problems with 1000 features, 900 of which are irrelevant



Questions?

  • Websites with links to papers and talks

    • http://www.cs.wisc.edu/~olvi

    • http://www.cs.wisc.edu/~wildt

  • NDCC generator

    • http://www.cs.wisc.edu/dmi/svm/ndcc/



Running Time on the Ionosphere Dataset

  • Averages 5.7 sweeps through the integer variables

  • Averages 3.4 linear programs

  • 75% of the time consumed in objective function evaluations

  • 15% of the time consumed in solving linear programs

  • Complete experiment (1960 runs) took 1 hour

    • 3 GHz Pentium 4

    • Written in MATLAB

    • CPLEX 9.0 used to solve the linear programs

    • Gaussian kernel written in C


Sonar Dataset: 208 Points in R60

[Figure: cross-validation accuracy vs. number of features used]



Related Work

  • Approaches that use specialized kernels

    • Weston, Mukherjee, Chapelle, Pontil, Poggio, and Vapnik, 2000: structural risk minimization

    • Gold, Holub, and Sollich, 2005: Bayesian interpretation

    • Zhang, 2006: smoothing spline ANOVA kernels

  • Margin-based approach

    • Fröhlich and Zell, 2004: remove features if the margin changes little when they are removed

  • Other approaches which combine feature selection with basis reduction

    • Bi, Bennett, Embrechts, Breneman, and Song, 2003

    • Avidan, 2004



Future Work

  • Datasets with more features

    • Reduce the number of objective function evaluations

    • Limit the number of integer cycles

    • Other ways to update the integer variables

  • Application to regression problems

  • Automatic choice of σ



Algorithm

  • A global solution to the nonlinear mixed-integer program cannot be found efficiently

    • Requires solving 2^n linear programs, one for each 0-1 assignment of the n diagonal entries of E

  • For fixed values of the integer diagonal matrix, the problem is reduced to an ordinary SVM linear program

  • Solution strategy: alternate optimization of continuous and integer variables:

    • For fixed values of E, solve a linear program for (u, γ, y, s)

    • For fixed values of (u, γ, s), sweep through the components of E and make updates which decrease the objective function



Notation

  • Data points represented as rows of an m×n matrix A

  • Data labels of +1 or −1 are given as elements of an m×m diagonal matrix D

  • Example

    • XOR: 4 points in R2

    • Points (0, 1), (1, 0) have label +1

    • Points (0, 0), (1, 1) have label −1

  • Kernel K(A, B) : Rm×n × Rn×k → Rm×k

    • Linear kernel: (K(A, B))ij = (AB)ij = AiB·j = K(Ai, B·j)

    • Gaussian kernel with parameter μ: (K(A, B))ij = exp(−μ||Ai′ − B·j||²)
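A tiny Python illustration of this notation on the XOR example; the value μ = 1 is an assumption made purely for the demonstration.

```python
import numpy as np

A = np.array([[0., 1.], [1., 0.], [0., 0.], [1., 1.]])  # rows = the 4 XOR points
d = np.array([1, 1, -1, -1])                            # labels from the slide
D = np.diag(d)                                          # m x m diagonal label matrix

# Gaussian kernel K(A, A'): (K)ij = exp(-mu * ||A_i' - A_j'||^2)
mu = 1.0
K = np.exp(-mu * ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2))
print(K)  # 4 x 4 kernel matrix; XOR is separable with this nonlinear kernel
```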



Methodology

  • UCI Datasets

    • To reduce running time, 1/11 of each dataset was used as a tuning set to select ν and the kernel parameter μ (see the sketch after this list)

    • Remaining 10/11 used for 10-fold cross validation

    • Procedure repeated 5 times for each dataset with different random choice of tuning set each time

  • NDCC

    • Generate multiple datasets with 200 training, 200 tuning, and 1000 testing points
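A hedged Python sketch of the UCI evaluation protocol above: hold out 1/11 of the data for tuning, run 10-fold cross validation on the remaining 10/11, and repeat with different random tuning sets. The tune, fit, and accuracy callables are hypothetical placeholders for parameter selection, classifier training, and scoring.

```python
import numpy as np

def evaluate(X, y, tune, fit, accuracy, repeats=5, rng=np.random.default_rng(0)):
    """tune/fit/accuracy are placeholder hooks, not part of the original slides."""
    m = X.shape[0]
    scores = []
    for _ in range(repeats):
        perm = rng.permutation(m)
        tune_idx, rest = perm[: m // 11], perm[m // 11:]   # 1/11 tuning set
        params = tune(X[tune_idx], y[tune_idx])            # select nu, mu here
        for fold in np.array_split(rest, 10):              # 10-fold CV on the rest
            train = np.setdiff1d(rest, fold)
            model = fit(X[train], y[train], params)
            scores.append(accuracy(model, X[fold], y[fold]))
    return np.mean(scores)
```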

