Feature Selection in Nonlinear Kernel Classification


### Feature Selection in Nonlinear Kernel Classification

Olvi Mangasarian

Edward Wild

University of Wisconsin-Madison

Problem
• Input space feature selection in nonlinear kernel classification
• Mixed-integer programming problem
• Inherently difficult (NP-hard)
• Most research of recent vintage

Best linear classifier that uses only 1 feature selects the feature x1

Example

[Figure: points in the (x1, x2) plane arranged in three horizontal bands (+, −, +); the classes can be separated nonlinearly using x2 alone]

Hence, feature selection in nonlinear classification is important

Outline
• Add 0-1 diagonal matrix to suppress or keep features
• Leads to a nonlinear mixed-integer program
• Introduce algorithm to obtain a good local solution to the resulting mixed-integer program
• Evaluate algorithm on five public datasets from the UCI repository
Notation
• Data points represented as rows of an m × n matrix A
• Data labels of +1 or −1 are given as elements of an m × m diagonal matrix D
• XOR Example: 4 points in R^2
• Points (0, 1), (1, 0) have label +1
• Points (0, 0), (1, 1) have label −1
• Kernel K(A, B): R^(m×n) × R^(n×k) → R^(m×k)
• For x ∈ R^n and the commonly used Gaussian kernel with parameter μ, the ith component of K(x′, A′) is: K(x′, A′)_i = K(x′, A′_i) = exp(−μ‖x − A_i‖²)
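As a concrete sketch, the Gaussian kernel above can be computed as follows (NumPy; the function name and the use of the XOR points as toy data are ours):

```python
import numpy as np

def gaussian_kernel(X, A, mu):
    """K(X, A')_ij = exp(-mu * ||X_i - A_j||^2) for rows X_i of X and A_j of A."""
    # Pairwise squared Euclidean distances via the expansion ||x||^2 + ||a||^2 - 2 x.a
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(A**2, axis=1)[None, :]
          - 2.0 * X @ A.T)
    return np.exp(-mu * np.maximum(sq, 0.0))  # clip tiny negatives from roundoff

# XOR example from the Notation slide: 4 points in R^2
A = np.array([[0., 1.], [1., 0.], [0., 0.], [1., 1.]])
K = gaussian_kernel(A, A, mu=1.0)  # 4x4 kernel matrix, ones on the diagonal
```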

[Figure: two-class (+/−) data points separated by a nonlinear surface]

SVMs

Support Vector Machines
• x ∈ R^n; e is a vector of ones
• SVM defined by the parameters (u, γ) of the nonlinear surface K(x′, A′)u = γ
• A contains all data points: {+ … +} ⊂ A+, {− … −} ⊂ A−
• Constraints: K(A+, A′)u ≥ eγ + e and K(A−, A′)u ≤ eγ − e
• Slack variable y ≥ 0 allows points to be on the wrong side of the bounding surfaces K(x′, A′)u = γ ± 1
• Minimize e′y (hinge loss, the plus function max{·, 0}) to fit the data
• Minimize e′s (= ‖u‖₁) to reduce overfitting

To suppress features, add the number of features present (e′Ee) to the objective

As the parameter σ ≥ 0 weighting e′Ee is increased, more features will be removed from the classifier
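For fixed features, the 1-norm SVM above is a linear program. The following is our own illustrative packing of it into SciPy's `linprog` (the variable layout, parameter `nu` weighting the hinge loss, and the toy data are assumptions, not the authors' code):

```python
import numpy as np
from scipy.optimize import linprog

def svm_1norm_lp(K, d, nu):
    """min nu*e'y + e's  s.t.  D(K u - e*gamma) + y >= e,  y >= 0,  -s <= u <= s.
    Variables packed as x = [u (m), gamma (1), y (m), s (m)]."""
    m = K.shape[0]
    c = np.concatenate([np.zeros(m), [0.0], nu * np.ones(m), np.ones(m)])
    # Margin constraints rewritten for linprog:  -d_i K_i u + d_i gamma - y_i <= -1
    A1 = np.hstack([-d[:, None] * K, d[:, None], -np.eye(m), np.zeros((m, m))])
    # 1-norm bounding:  u - s <= 0  and  -u - s <= 0
    A2 = np.hstack([np.eye(m), np.zeros((m, 1)), np.zeros((m, m)), -np.eye(m)])
    A3 = np.hstack([-np.eye(m), np.zeros((m, 1)), np.zeros((m, m)), -np.eye(m)])
    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * m)])
    bounds = ([(None, None)] * (m + 1)     # u and gamma are free
              + [(0, None)] * (2 * m))     # y >= 0, s >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m], res.x[m], res.fun

# Tiny separable example: two clusters on a line, Gaussian kernel matrix
A = np.array([[0.], [0.2], [1.], [1.2]])
d = np.array([-1., -1., 1., 1.])
K = np.exp(-(A - A.T) ** 2)          # Gaussian kernel with mu = 1
u, gamma, obj = svm_1norm_lp(K, d, nu=10.0)
pred = np.sign(K @ u - gamma)        # classify the training points
```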

Reduced Feature SVM: Starting with Full SVM

Replace A with AE, where E is a diagonal n × n matrix with Eii ∈ {0, 1}, i = 1, …, n

All features are present in the kernel matrix K(A, A′); if Eii = 0, the ith feature is removed
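A tiny illustration of the feature-suppression matrix E (our own toy matrix):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])     # 2 points, 3 features (rows of A)
E = np.diag([1., 0., 1.])        # Eii = 0 suppresses the ith feature
AE = A @ E                       # the second column is zeroed out
# AE == [[1., 0., 3.], [4., 0., 6.]]
```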

Algorithm
• Global solution to the nonlinear mixed-integer program cannot be found efficiently
• An exhaustive search would require solving 2^n linear programs
• For fixed values of the integer diagonal matrix, the problem reduces to an ordinary SVM linear program
• Solution strategy: alternate optimization of continuous and integer variables:
• For fixed values of E, solve a linear program for (u, γ, y, s)
• For fixed values of (u, γ, s), sweep through the components of E and make updates which decrease the objective function
Reduced Feature SVM (RFSVM) Algorithm
1. Initialize E randomly. The cardinality of E is inversely proportional to σ: for large σ, a preponderance of 0 elements in E; for small σ, a preponderance of 1 elements
2. Let k be the maximum number of sweeps and tol be the stopping tolerance
3. For fixed integer values E, solve the SVM linear program to obtain (u, γ, y, s). Then sweep through E repeatedly: for each component of E, replace 1 by 0 and conversely, provided the change decreases the overall objective function by more than tol. Go to (4) if no change was made in the last sweep, or k sweeps have been completed
4. Solve the SVM linear program with the new matrix E. If the objective decrease is less than tol, stop; otherwise go to (3)
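The alternating scheme can be sketched as follows. For brevity, this sketch replaces the SVM linear program with a regularized kernel least-squares fit (our simplification), so it shows the sweep structure over E rather than the exact objective; all names are ours:

```python
import numpy as np

def gaussian_kernel(X, Z, mu=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-mu * sq)

def objective(A, d, e_diag, sigma, reg=1e-3):
    """Surrogate objective: kernel fit error + sigma * (number of features kept)."""
    AE = A * e_diag                                  # apply diag(E) to the columns of A
    K = gaussian_kernel(AE, AE)
    u = np.linalg.solve(K + reg * np.eye(len(A)), d) # continuous step (stand-in for the LP)
    fit = np.sum((K @ u - d) ** 2)
    return fit + sigma * e_diag.sum()

def rfsvm_sweep(A, d, sigma=0.5, max_sweeps=5, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    e_diag = rng.integers(0, 2, size=n).astype(float)  # random 0/1 initialization of diag(E)
    best = objective(A, d, e_diag, sigma)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):                     # sweep through the components of E
            e_diag[i] = 1.0 - e_diag[i]        # tentatively flip 0 <-> 1
            val = objective(A, d, e_diag, sigma)
            if val < best - tol:               # keep the flip only if it helps
                best, changed = val, True
            else:
                e_diag[i] = 1.0 - e_diag[i]    # undo the flip
        if not changed:
            break
    return e_diag, best

# Toy data: feature 0 determines the label, feature 1 is noise
rng = np.random.default_rng(1)
A = np.column_stack([np.repeat([0., 1.], 10), rng.normal(size=20)])
d = np.repeat([-1., 1.], 10)
e_diag, best = rfsvm_sweep(A, d, sigma=0.5)
```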
RFSVM Convergence (for tol = 0)
• Objective function value converges
• Each step decreases the objective
• Objective is bounded below by 0
• Limit of the objective function value attained at any accumulation point of the sequence of iterates
• Accumulation point is a “local minimum solution”
• Continuous variables are optimal for the fixed integer variables
• Changing any single integer variable will not decrease the objective

No single feature (element of E) can be added or dropped to decrease the objective value

The classifier defined by the continuous variables is optimal for the fixed choice of features

The objective function value at iteration r is bounded below and nonincreasing, hence convergent
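In symbols (our notation, writing f^r for the objective value at iteration r):

```latex
0 \le f^{r+1} \le f^{r} \quad \text{for all } r
\quad\Longrightarrow\quad \lim_{r \to \infty} f^{r} \text{ exists.}
```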

Numerical Experience
• Test on 5 public datasets from the UCI repository
• Plot classification accuracy versus number of features used
• Compare to linear classifiers which reduce features automatically (BM 1998)
• Compare to nonlinear SVM with no feature reduction
• To reduce running time, 1/11 of each dataset was used as a tuning set to select ν and the kernel parameter μ
• Remaining 10/11 used for 10-fold cross validation
• Procedure repeated 5 times for each dataset with different random choice of tuning set each time
• Similar behavior on all datasets indicates acceptable method variance
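The tuning/cross-validation split can be sketched as follows (our own arrangement of the 1/11 holdout plus 10-fold scheme described above):

```python
import numpy as np

def tuning_cv_split(m, seed=0):
    """Hold out 1/11 of the data as a tuning set; split the rest into 10 CV folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)          # random choice of tuning set
    n_tune = m // 11
    tune, rest = idx[:n_tune], idx[n_tune:]
    folds = np.array_split(rest, 10)  # 10 folds for cross validation
    return tune, folds

tune, folds = tuning_cv_split(110)    # e.g. 110 points -> 10 tuning, 10 folds of 10
```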
Ionosphere Dataset: 351 Points in R^34

[Figure: cross-validation accuracy (vertical axis) vs. number of features used (horizontal axis), comparing RFSVM, nonlinear SVM with no feature selection, the linear 1-norm SVM (SVM1), and the feature-selecting linear SVM using concave minimization (FSV; BM, 1998)]

If the appropriate value of σ is selected, RFSVM can obtain higher accuracy using fewer features than SVM1 and FSV

Even for σ = 0, some features may be removed when removing them decreases the hinge loss

Note that accuracy decreases slightly until about 10 features remain, and then decreases more sharply as they are removed

Running Time on the Ionosphere Dataset
• Averages 5.7 sweeps through the integer variables
• Averages 3.4 linear programs
• 75% of the time consumed in objective function evaluations
• 15% of time consumed in solving linear programs
• Complete experiment (1960 runs) took 1 hour
• 3 GHz Pentium 4
• Written in MATLAB
• CPLEX 9.0 used to solve the linear programs
• Gaussian kernel written in C
WPBC (24 mo.) Dataset: 155 Points in R^32

[Figure: cross-validation accuracy vs. number of features used]

BUPA Liver Dataset: 345 Points in R^6

[Figure: cross-validation accuracy vs. number of features used]

Sonar Dataset: 208 Points in R^60

[Figure: cross-validation accuracy vs. number of features used]

Spambase Dataset: 4601 Points in R^57

Nearly identical results are obtained for σ = 0 and σ = 1

[Figure: cross-validation accuracy vs. number of features used]

Related Work
• Approaches that use specialized kernels
• Weston, Mukherjee, Chapelle, Pontil, Poggio, and Vapnik, 2000: structural risk minimization
• Gold, Holub, and Sollich, 2005: Bayesian interpretation
• Zhang, 2006: smoothing spline ANOVA kernels
• Margin-based approach
• Fröhlich and Zell, 2004: remove features if there is little change to the margin when they are removed
• Other approaches which combine feature selection with basis reduction
• Bi, Bennett, Embrechts, Breneman, and Song, 2003
• Avidan, 2004
Conclusion
• Implemented a new rigorous formulation for feature selection in nonlinear SVM classifiers
• Obtained a local solution to the resulting mixed-integer program by alternating between a linear program to compute the continuous variables and successive sweeps through the objective function to update the integer variables
• Results on 5 publicly available datasets show that the approach efficiently learns accurate nonlinear classifiers with reduced numbers of features
Future Work
• Datasets with more features
• Reduce the number of objective function evaluations
• Limit the number of integer cycles
• Other ways to update the integer variables
• Application to regression problems
• Automatic choice of σ
Questions
• Websites with links to papers and talks:
• http://www.cs.wisc.edu/~olvi
• http://www.cs.wisc.edu/~wildt