
Feature Selection Methods for Embedded Models

Learn about filters, wrappers, and embedded methods for feature subset selection in machine learning models. Understand the different criteria, search algorithms, and assessment techniques used in these methods. Explore the advantages and disadvantages of each approach.


Presentation Transcript


  1. Lecture 4: Embedded Methods. Isabelle Guyon, isabelle@clopinet.com

  2. Filters, Wrappers, and Embedded Methods. (Diagram.) • Filter: all features → filter → feature subset → predictor. • Wrapper: all features → multiple feature subsets, each assessed by training the predictor → chosen feature subset → predictor. • Embedded method: all features → predictor with feature selection built into training → feature subset.

  3. Filters. Methods: • Criterion: measure feature / feature-subset “relevance” • Search: usually order features (individual feature ranking or nested subsets of features) • Assessment: use statistical tests. Results: • Are (relatively) robust against overfitting • May fail to select the most “useful” features

  4. Wrappers. Methods: • Criterion: a risk functional • Search: search the space of feature subsets • Assessment: use cross-validation. Results: • Can in principle find the most “useful” features, but • Are prone to overfitting

  5. Embedded Methods. Methods: • Criterion: a risk functional • Search: search guided by the learning process • Assessment: use cross-validation. Results: • Similar to wrappers, but • Less computationally expensive • Less prone to overfitting

  6. Three “Ingredients”. Every feature selection method combines three ingredients, and filters, wrappers, and embedded methods each pick from the same menu: • Criterion: single feature relevance, feature subset relevance, relevance in context, or performance of the learning machine • Search: single feature ranking, exhaustive search, heuristic or stochastic search, or nested subsets (forward selection / backward elimination) • Assessment: statistical tests, cross-validation, or performance bounds

  7. Forward Selection (wrapper). Start from the empty set and add one feature at a time: n candidates at the first step, then n−1, n−2, …, down to 1. Also referred to as SFS: Sequential Forward Selection.
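
The wrapper version of forward selection can be sketched in a few lines. This is an illustrative implementation, not code from the lecture: it uses an ordinary least-squares fit as the predictor and a held-out validation set as the assessment (in practice cross-validation would be used), and all names are my own.

```python
import numpy as np

def sequential_forward_selection(X_tr, y_tr, X_val, y_val, k):
    """Greedy SFS: at each step, add the candidate feature whose
    inclusion gives the lowest validation error of a least-squares fit."""
    def val_error(feats):
        # train the predictor on the candidate subset, assess on held-out data
        w, *_ = np.linalg.lstsq(X_tr[:, feats], y_tr, rcond=None)
        return float(np.mean((X_val[:, feats] @ w - y_val) ** 2))

    selected = []
    while len(selected) < k:
        candidates = [j for j in range(X_tr.shape[1]) if j not in selected]
        best = min(candidates, key=lambda j: val_error(selected + [j]))
        selected.append(best)
    return selected
```

Note the cost: each of the k steps retrains the predictor once per remaining candidate, which is exactly the expense the embedded variants below try to avoid.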

  8. Forward Selection (embedded). Same nested search (n, n−1, n−2, …, 1 candidates), but the search is guided by the learning process: we do not consider alternative paths.

  9. Forward Selection with GS. Embedded method for the linear least-square predictor (Stoppiglia, 2002), based on Gram-Schmidt orthogonalization: • Select a first feature X_ν(1) with maximum cosine with the target: cos(x_i, y) = x_i·y / (||x_i|| ||y||) • For each remaining feature X_i, project X_i and the target Y onto the null space of the features already selected, and compute the cosine of X_i with the target in that projection • Select the feature X_ν(k) with maximum cosine with the target in the projection.
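
A minimal numpy sketch of this procedure (illustrative, not Stoppiglia's original code; the function name and tolerance are my own choices):

```python
import numpy as np

def gram_schmidt_selection(X, y, k):
    """Forward-select k features by Gram-Schmidt orthogonalization.

    At each step, pick the feature whose projection (onto the null
    space of the already-selected features) has maximum |cosine|
    with the similarly projected target."""
    X = X.astype(float).copy()
    y = y.astype(float).copy()
    selected = []
    for _ in range(k):
        # cosine of each remaining (projected) feature with the target
        norms = np.linalg.norm(X, axis=0) * np.linalg.norm(y)
        cosines = np.where(norms > 1e-12,
                           (X.T @ y) / np.maximum(norms, 1e-12), 0.0)
        cosines[selected] = 0.0  # ignore already-selected features
        j = int(np.argmax(np.abs(cosines)))
        selected.append(j)
        # project every column and the target onto the null space of feature j
        u = X[:, j] / np.linalg.norm(X[:, j])
        X = X - np.outer(u, u @ X)
        y = y - u * (u @ y)
    return selected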

  10. Forward Selection w. Trees. Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993), perform an embedded forward selection: at each step, choose the feature that “reduces entropy” most, working towards “node purity”. (Figure: all the data split first on f1, then on f2.)
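
The entropy-reduction criterion can be made concrete with a short sketch (my own illustrative code, for discrete features, using the information-gain formulation associated with C4.5):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction from splitting the data on a discrete feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def best_feature(columns, labels):
    """Pick the feature index with maximal information gain."""
    gains = [information_gain(col, labels) for col in columns]
    return max(range(len(gains)), key=gains.__getitem__)
```

A feature that splits the labels perfectly gets gain equal to the full label entropy; an uninformative feature gets gain 0, so the greedy tree-growing step is a forward selection in disguise.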

  11. Backward Elimination (wrapper). Also referred to as SBS: Sequential Backward Selection. Start from the full set of n features and remove one at a time: n, n−1, n−2, …, down to 1.

  12. Backward Elimination (embedded). Same nested search (n, n−1, n−2, …, 1), but guided by the learning process: no alternative paths are considered.

  13. Backward Elimination: RFE. Embedded method for SVMs, kernel methods, and neural nets (RFE-SVM; Guyon, Weston, et al., 2002): • Start with all the features • Train a learning machine f on the current subset of features by minimizing a risk functional J[f] • For each remaining feature X_i, estimate, without retraining f, the change in J[f] resulting from the removal of X_i • Remove the feature X_ν(k) whose removal improves J or degrades it least.
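
The RFE loop itself is short. Below is an illustrative sketch that follows the slide's recipe but, to stay self-contained, swaps the SVM for a ridge-regularized least-squares learner; ranking features by w_i² is the linear-model approximation used in Guyon et al. (2002), and the function name and defaults are my own.

```python
import numpy as np

def rfe(X, y, n_keep, ridge=1e-2):
    """Recursive Feature Elimination with a linear learner.

    Train on the remaining features, estimate each feature's
    contribution to the objective by w_i**2 (no retraining per
    candidate), and drop the feature with the smallest contribution."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        Xs = X[:, remaining]
        # ridge-regularized least squares: w = (Xs'Xs + r I)^-1 Xs'y
        w = np.linalg.solve(Xs.T @ Xs + ridge * np.eye(len(remaining)),
                            Xs.T @ y)
        remaining.pop(int(np.argmin(w ** 2)))
    return remaining
```

Compared with the wrapper SBS, each elimination step here trains the learner once instead of once per candidate feature.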

  14. Scaling Factors. Idea: transform a discrete space into a continuous one, s = [s1, s2, s3, s4]: • Discrete indicators of feature presence: s_i ∈ {0, 1} • Continuous scaling factors: s_i ∈ ℝ. Now we can do gradient descent!

  15. Formalism (chap. 5). A learning algorithm takes a training set as input and outputs a model. Definition: an embedded feature selection method is a machine learning algorithm that returns a model using a limited number of features. (Next few slides: André Elisseeff.)

  16. Parameterization. Consider a set of functions f(α, σ∘x) parameterized by α and by σ ∈ {0, 1}^n, where σ_i = 1 represents the use and σ_i = 0 the rejection of feature i (in the example pictured, σ_1 = 1 and σ_3 = 0).

  17. Example: Kernel methods. f(α, σ∘x) = Σ_i α_i k(σ∘x_i, σ∘x). (Figure: the m × n data matrix X with rows x_i, the coefficient vector α, and the indicator vector σ = (1, 1, 1, …, 0, 0, 0, …).)

  18. Feature selection as an optimization problem. Find α and σ that minimize a risk functional R[f] = ∫ L(f(α, σ∘x), y) dP(x, y), where L is a loss function and P(x, y) is the unknown data distribution. Problem: we do not know P(x, y); all we have are training examples (x1, y1), (x2, y2), …, (xm, ym).

  19. Approximations of R[f] • Empirical risk: R_train[f] = (1/m) Σ_{i=1..m} L(f(x_i; w), y_i) • Guaranteed risk: with probability (1−δ), R[f] ≤ R_gua[f], where R_gua[f] = R_train[f] + ε(δ, C) • Structural risk minimization: define nested sets S_k = { w | ||w||² < w_k² }, w_1 < w_2 < … < w_k, and minimize R_train[f] s.t. ||w||² < w_k² • Regularized risk: R_reg[f, γ] = R_train[f] + γ ||w||²

  20. Carrying out the optimization. How to minimize? Most approaches alternate: optimize the model parameters α for fixed σ, then update σ. The optimization over σ is often done by relaxing the constraint σ ∈ {0, 1}^n to σ ∈ [0, 1]^n.

  21. Add/Remove features 1 • Many learning algorithms are cast into a minimization of some regularized functional G(σ) = min_α (empirical error + regularization), where the regularization term provides capacity control • What does G(σ) become if one feature is removed? • Sometimes, G can only increase (e.g. SVM)

  22. Add/Remove features 2 • It can be shown (under some conditions) that the removal of one feature induces a change in G proportional to the gradient of f with respect to the i-th feature at the training points x_k • Example, linear SVM / RFE: with Ω(σ) = Ω(w) = Σ_i w_i², the change caused by removing feature i is proportional to w_i²

  23. Add/Remove features - RFE • Recursive Feature Elimination: minimize the estimate of R(σ, α) with respect to α, then minimize the estimate of R(σ, α) with respect to σ under the constraint that only a limited number of features be selected

  24. Add/Remove features: summary • Many algorithms can be turned into embedded methods for feature selection by using the following approach: • Choose an objective function that measures how well the model returned by the algorithm performs • “Differentiate” this objective function (or perform a sensitivity analysis) with respect to the parameter σ, i.e. ask how its value changes when one feature is removed and the algorithm is rerun • Select the features whose removal (resp. addition) induces the desired change in the objective function (e.g. minimize an error estimate, maximize alignment with the target). What makes this an embedded method is the use of the structure of the learning algorithm to compute the gradient and to search for / weight relevant features.

  25. Gradient descent - 1 • How to minimize? Instead of the combinatorial add/remove search, would it make sense to perform just a gradient step on σ here too, i.e. a gradient step in [0, 1]^n?

  26. Gradient descent 2. Advantages of this approach: • It can be applied to non-linear systems (e.g. SVMs with Gaussian kernels) • It can mix the search for features with the search for an optimal regularization parameter and/or other kernel parameters. Drawbacks: • Heavy computations • The usual difficulties of gradient-based learning return (early stopping, initialization, etc.)

  27. Gradient descent: summary • Many algorithms can be turned into embedded methods for feature selection by using the following approach: • Choose an objective function that measures how well the model returned by the algorithm performs • Differentiate this objective function with respect to the parameter σ • Perform a gradient descent on σ; at each iteration, rerun the initial learning algorithm to compute its solution on the newly scaled feature space • Stop when there are no more changes (or use early stopping, etc.) • Threshold the σ values to get the list of features, and retrain the algorithm on that subset. The difference from the add/remove approach is the search strategy: it still uses the inner structure of the learning model, but it scales features rather than selecting them.
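
The loop above can be sketched concretely. The slides do not spell out an objective, so this illustrative version makes assumptions: ridge least squares stands in for the learning machine (its closed form is the "rerun the algorithm" step), and an explicit sparsity penalty γ·Σσ_j is added so that scaling factors of irrelevant features are driven to 0 (without some such term, nothing pushes σ down). All names and default values are my own.

```python
import numpy as np

def scaled_feature_descent(X, y, n_iter=300, lr=0.05, ridge=0.1, gamma=0.1):
    """Alternating scheme: at each iteration, rerun the learner
    (closed-form ridge regression on the sigma-scaled features),
    then take a projected gradient step on sigma in [0, 1]^n.

    Objective: ||X diag(sigma) w - y||^2 + ridge ||w||^2 + gamma sum(sigma)."""
    d = X.shape[1]
    sigma = np.full(d, 0.5)
    for _ in range(n_iter):
        Xs = X * sigma                       # scale feature j by sigma_j
        w = np.linalg.solve(Xs.T @ Xs + ridge * np.eye(d), Xs.T @ y)
        r = Xs @ w - y                       # residual at the optimal w
        grad = 2.0 * w * (X.T @ r) + gamma   # d/dsigma_j of the objective
        sigma = np.clip(sigma - lr * grad, 0.0, 1.0)
    # threshold sigma to get the selected features, then retrain on them
    return sigma, w
```

On well-separated problems the useful σ_j saturate at 1 and the useless ones hit 0, after which thresholding is trivial.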

  28. Changing the structure • Shrinkage (weight decay, ridge regression, SVM): S_k = { w | ||w||_2 < w_k }, w_1 < w_2 < … < w_k, or equivalently γ_1 > γ_2 > … > γ_k (γ is the ridge) • Feature selection (0-norm SVM): S_k = { w | ||w||_0 < s_k }, s_1 < s_2 < … < s_k (s is the number of features) • Feature selection (lasso regression, 1-norm SVM): S_k = { w | ||w||_1 < b_k }, b_1 < b_2 < … < b_k

  29. The ℓ0 SVM • Replace the regularizer ||w||² by the ℓ0 norm, i.e. the number of non-zero weights • Since the ℓ0 norm is not differentiable, further replace it by Σ_i log(ε + |w_i|) • This boils down to a multiplicative update algorithm: train, rescale each feature by the magnitude of its weight, and repeat
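
The multiplicative update can be sketched as follows (illustrative code in the spirit of the ℓ0 approximation, with ridge least squares standing in for the SVM; names and defaults are my own assumptions):

```python
import numpy as np

def l0_multiplicative_update(X, y, n_iter=20, ridge=0.1):
    """Approximate l0 feature selection by multiplicative updates:
    train a linear learner on the rescaled features, then multiply
    each feature's scaling by |w_i|. Scalings of irrelevant
    features decay geometrically towards 0."""
    d = X.shape[1]
    z = np.ones(d)                       # per-feature scaling factors
    for _ in range(n_iter):
        Xz = X * z                       # rescale each feature by z_j
        w = np.linalg.solve(Xz.T @ Xz + ridge * np.eye(d), Xz.T @ y)
        z = z * np.abs(w)                # the multiplicative update
    return z                             # threshold z to select features
```

After a handful of iterations the scaling vector is effectively an indicator of the selected subset.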

  30. The ℓ1 SVM • The version of the SVM where ||w||² is replaced by the ℓ1 norm Σ_i |w_i| can be considered an embedded method: • Only a limited number of weights will be non-zero (the penalty tends to remove redundant features) • This differs from the regular SVM, where redundant features are all included (non-zero weights)
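
A self-contained way to see the sparsity is to run the ℓ1-penalized least-squares analogue (the lasso) with a simple proximal-gradient (ISTA) solver. This is my own illustrative stand-in for a 1-norm SVM, not the slide's algorithm; the penalty weight `alpha` must be chosen relative to the data scale.

```python
import numpy as np

def lasso_ista(X, y, alpha, n_iter=500):
    """Minimize 0.5 ||Xw - y||^2 + alpha ||w||_1 by ISTA.

    The soft-thresholding step sets small weights exactly to zero,
    which is what makes the l1 penalty an embedded feature selector."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of grad
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y))   # gradient step on the loss
        w = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)
    return w
```

Unlike the ℓ2 penalty, which merely shrinks all weights, the weights of irrelevant features come out exactly zero here, so the selected subset can be read off the solution's support.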

  31. Mechanical interpretation. (Figure: level sets in the (w1, w2) plane, with w* the unconstrained optimum.) • Ridge regression: J = λ ||w||² + ||w − w*||², or equivalently J = ||w||² + (1/λ) ||w − w*||² • Lasso: J = ||w||_1 + (1/λ) ||w − w*||²

  32. Embedded methods - summary • Embedded methods are a good inspiration for designing new feature selection techniques for your own algorithms: • Find a functional that represents your prior knowledge about what a good model is • Add the σ weights into the functional and make sure it is either differentiable or amenable to an efficient sensitivity analysis • Optimize alternately with respect to α and σ • Use early stopping (on a validation set) or your own stopping criterion to stop and to select the subset of features • Embedded methods are therefore not too far from wrapper techniques, and can be extended to multiclass problems, regression, etc.

  33. Book of the NIPS 2003 challenge: Feature Extraction, Foundations and Applications, I. Guyon et al., Eds., Springer, 2006. http://clopinet.com/fextract-book
