Feature selection


Presentation Transcript


  1. Feature selection Using slides by Gideon Dror, Alon Kaufman and Roy

  2. Learning to Classify Learning of binary classification • Given: a set of m examples (xi, yi), i = 1, 2, …, m, sampled from some distribution D, where xi ∈ Rn and yi ∈ {−1, +1} • Find: a function f: Rn → {−1, +1} which classifies ‘well’ examples xj sampled from D. Examples: • Microarray data: separate malignant from healthy tissues • Text categorization: spam detection • Face detection: discriminating human faces from non-faces. Learning algorithms: decision trees, nearest neighbors, Bayesian networks, neural networks, Support Vector Machines, …
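To make the setup concrete, here is a minimal sketch of this workflow in scikit-learn; the synthetic dataset, the LinearSVC choice and the split sizes are illustrative assumptions, not something taken from the slides.

```python
# Minimal binary classification sketch: m examples in R^n, labels in {-1, +1}.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           random_state=0)
y = 2 * y - 1  # map {0, 1} labels to {-1, +1}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LinearSVC()            # the learned function f: R^n -> {-1, +1}
clf.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```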

  3. Advantages of dimensionality reduction • May improve the performance of the classification algorithm by removing irrelevant features • Defies the curse of dimensionality, giving improved generalization • The classification algorithm may not scale up to the size of the full feature set, either in space or time • Allows us to better understand the domain • Cheaper to collect and store data based on a reduced feature set

  4. Two approaches for dimensionality reduction • Feature construction • Feature selection (This talk)

  5. Methods of Feature construction • Linear methods • Principal component analysis (PCA) • ICA • Fisher linear discriminant • …. • Non-linear methods • Non linear component analysis (NLCA) • Kernel PCA • Local linear embedding (LLE) • ….
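For reference, several of these constructions are available in scikit-learn; a small sketch on synthetic data, where the component counts and kernel choice are arbitrary illustrations:

```python
# Feature construction examples: linear (PCA) and non-linear (kernel PCA, LLE).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_classification(n_samples=300, n_features=50, random_state=0)

X_pca = PCA(n_components=10).fit_transform(X)                       # linear
X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)  # non-linear
X_lle = LocallyLinearEmbedding(n_components=10,
                               n_neighbors=15).fit_transform(X)     # non-linear

print(X_pca.shape, X_kpca.shape, X_lle.shape)  # each is (300, 10)
```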

  6. Feature selection • Given examples (xi, yi) where xi ∈ Rn, select a minimal subset of features which maximizes the performance (accuracy, …). • Exhaustive search is computationally prohibitive, except for a small number of dimensions: there are 2^n − 1 possible combinations. • Basically it is an optimization problem, where the classification error is the function to be minimized.
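To see why exhaustive search only works for a handful of dimensions, a brute-force sketch over all 2^n − 1 subsets of a small synthetic problem; n = 8, LinearSVC and 3-fold cross-validation are illustrative choices:

```python
# Brute-force feature-subset search: only feasible for tiny n (2^n - 1 subsets).
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

n = 8
X, y = make_classification(n_samples=200, n_features=n, n_informative=3,
                           random_state=0)

best_score, best_subset = -np.inf, None
for k in range(1, n + 1):
    for subset in combinations(range(n), k):          # 2^n - 1 subsets in total
        score = cross_val_score(LinearSVC(), X[:, list(subset)], y, cv=3).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best subset:", best_subset, "CV accuracy:", round(best_score, 3))
```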

  7. Feature selection methods • Filter methods • Wrapper methods • Embedded methods • (Diagram: how the feature selection step and the classifier are coupled in each family)

  8. Filtering • Order all features according to the strength of association with the target yi • Various measures of association may be used: • Pearson correlation: R(Xi) = cov(Xi, Y) / (σXi σY) • χ² statistic (for discrete variables Xi) • Fisher criterion scoring: F(Xi) = |μ⁺Xi − μ⁻Xi| / ((σ⁺Xi)² + (σ⁻Xi)²) • Golub criterion: F(Xi) = |μ⁺Xi − μ⁻Xi| / (σ⁺Xi + σ⁻Xi) • Mutual information: I(Xi, Y) = Σ p(Xi, Y) log( p(Xi, Y) / (p(Xi) p(Y)) ) • … • Choose the first k features and feed them to the classifier
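A sketch of the filter approach on synthetic data, ranking features by two of the criteria above (absolute Pearson correlation and mutual information) and keeping the top k; the dataset and k = 10 are assumptions for illustration:

```python
# Filter-style feature selection: score each feature, keep the top k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)
k = 10

# Pearson correlation of each feature with the target
pearson = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
top_by_corr = np.argsort(np.abs(pearson))[::-1][:k]

# Mutual information between each feature and the target
mi = mutual_info_classif(X, y, random_state=0)
top_by_mi = np.argsort(mi)[::-1][:k]

print("top-k by |Pearson R|:", sorted(top_by_corr.tolist()))
print("top-k by mutual info:", sorted(top_by_mi.tolist()))
```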

  9. Wrappers Use the classifier as a black box to search the space of feature subsets for the subset which maximizes classification accuracy. Exhaustive search is exponentially hard. A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved (“forward selection”). Alternatively, we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (“backward selection”).
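A sketch of greedy forward selection, treating the classifier as a black box scored by cross-validated accuracy; the synthetic data, the LinearSVC classifier and the stopping rule are illustrative choices:

```python
# Wrapper-style forward selection: greedily add the feature that most improves
# cross-validated accuracy, stop when no candidate feature helps.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

selected, best_score = [], -np.inf
while True:
    candidates = [i for i in range(X.shape[1]) if i not in selected]
    if not candidates:                      # every feature already selected
        break
    scores = {i: cross_val_score(LinearSVC(), X[:, selected + [i]], y, cv=3).mean()
              for i in candidates}
    best_i = max(scores, key=scores.get)
    if scores[best_i] <= best_score:        # no further improvement: stop
        break
    selected.append(best_i)
    best_score = scores[best_i]

print("selected features:", selected, "CV accuracy:", round(best_score, 3))
```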

  10. Embedded methods: Recursive Feature Elimination (RFE) 0. Set V = n (the total number of features) 1. Build a linear Support Vector Machine classifier using the V features 2. Compute the weight vector w = Σi αi yi xi of the optimal hyperplane and omit the V/2 features with the lowest |wi| 3. Repeat steps 1 and 2 until one feature is left 4. Choose the feature subset that gives the best performance (using cross-validation) (Has strong theoretical justification)
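A sketch of the loop above, halving the feature set at each round by the |wi| of a linear SVM and keeping the best subset found by cross-validation; scikit-learn also ships an RFE class, but the explicit loop mirrors the listed steps (synthetic data and parameter values are illustrative):

```python
# Recursive Feature Elimination: train a linear SVM, drop the half of the
# features with the smallest |w_i|, repeat, and keep the best subset by CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=64, n_informative=8,
                           random_state=0)

remaining = np.arange(X.shape[1])
best = (-np.inf, remaining)
while len(remaining) > 1:
    clf = LinearSVC().fit(X[:, remaining], y)
    w = clf.coef_.ravel()
    score = cross_val_score(LinearSVC(), X[:, remaining], y, cv=3).mean()
    if score > best[0]:
        best = (score, remaining.copy())
    keep = np.argsort(np.abs(w))[len(remaining) // 2:]   # drop lowest-|w| half
    remaining = remaining[keep]

print("best subset size:", len(best[1]), "CV accuracy:", round(best[0], 3))
```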

  11. Margin Based Feature Selection: Theory and Algorithms (Ran Gilad-Bachrach, Amir Navot and Naftali Tishby) • Feature selection based on the quality of the margin the features induce • Idea: use the large-margin principle for feature selection • Supervised classification problem • “Study-case” predictor: 1-NN

  12. Margins • Margins measure the classifier's confidence • Sample-margin – the distance between the instance and the decision boundary (as in SVM) • Hypothesis-margin – given an instance, the distance between the hypothesis and the closest hypothesis that assigns an alternative label • In the 1-NN case the hypothesis-margin has a closed form (Crammer et al. 2002; see slide 14) • Previous results: the hypothesis-margin lower-bounds the sample-margin • Motivation: choose the features that induce large margins

  13. Margins • Given a weight vector w over the features, the margin of each instance is computed using the w-weighted distance • The evaluation function is defined for any weight vector w as the sum of the margins of the sample's instances

  14. Margins • For 1-NN (Crammer et al. 2002, Gilad-Bachrach et al. 2004): q = ½( ||x − nearmiss(x)|| − ||x − nearhit(x)|| ) • (Figure: an instance x with its nearest hit and nearest miss; the margin q is half the difference of the two distances)

  15. Iterative Search Based Algorithm (Simba) • For a set S with m samples and N features: • w = (1, 1, 1, …, 1) • For t = 1:T (number of iterations): • Pick a random instance x from S • Calculate nearmiss(x) and nearhit(x) considering w • For i = 1:N: wi = wi + (xi − nearmiss(x)i)² − (xi − nearhit(x)i)² • Complexity: Θ(TNm) / Θ(Nm²)

  16. (Same Simba pseudocode as slide 15.)
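A NumPy sketch of the Simba loop as written on the slide, using the simplified weight update and a weighted Euclidean distance; the synthetic two-class data and the absence of any normalization step are assumptions of the sketch:

```python
# Simba-style iterative feature weighting: update w using nearhit/nearmiss of
# a randomly drawn instance, with the simplified update from the slide.
import numpy as np

rng = np.random.default_rng(0)
m, N, T = 100, 20, 200
X = rng.normal(size=(m, N))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=m) > 0, 1, -1)  # feature 0 informative

def w_dist(x, others, w):
    # Euclidean distance weighted by the current feature weights
    return np.sqrt((((others - x) * w) ** 2).sum(axis=1))

w = np.ones(N)
for t in range(T):
    j = rng.integers(m)
    x, label = X[j], y[j]
    rest = np.delete(np.arange(m), j)
    d = w_dist(x, X[rest], w)
    hit = y[rest] == label
    nearhit = X[rest][hit][np.argmin(d[hit])]
    nearmiss = X[rest][~hit][np.argmin(d[~hit])]
    w = w + (x - nearmiss) ** 2 - (x - nearhit) ** 2   # update from the slide

print("top features by weight:", np.argsort(w)[::-1][:5])
```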

  17. Application: Face Images • AR face database • 1456 images of females and males • 5100 features • Train: 1000 faces, test: 456

  18. Faces – Average Results

  19. Unsupervised feature selection • Background: Motivation and Methods • Our Solution • SVD-Entropy and the CE criterion • Three Feature Selection Methods • Results (R. Varshavsky, A. Gottlieb, M. Linial, D. Horn, ISMB 2006)

  20. Background: Motivation • Gene expression, sequence similarities • ‘Curse of dimensionality’, dimension reduction, compression • Thousands to tens of thousands of genes in an array • Number of proteins in databases > a million • Noise

  21. The Data: An Example • Gene expression experiments • (Figure: the data matrix, with genes/features along one axis and samples along the other)

  22. Background: Methods • Extraction vs. selection • Most methods are supervised (i.e., have an objective function) • Unsupervised: • Variance • Projection on the first PC (e.g., ‘gene-shaving’) • Statistically significant overabundance (Ben-Dor et al., 2001)

  23. SVD in gene expression

  24. Our Solution: SVD-Entropy • Normalized relative values (Wall et al., 2003)*: ρj = sj² / Σk sk² • SVD-entropy (Alter et al., 2000): E = −(1/log n) Σj ρj log(ρj) • * sj² are the eigenvalues of the [n×n] XX' matrix

  25. SVD-Entropy (Example) A comparison of two eigenvalue distributions; the left has high entropy (0.87) and the right one has low entropy (0.14)

  26. CE – Contribution to the Entropy • The contribution of the i-th feature to the overall entropy is determined according to a leave-one-out measurement: CEi = E(X[n×m]) − E(X[n×(m−1)]) (the entropy of the full matrix minus the entropy of the matrix with the i-th feature removed)
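A sketch of the SVD-entropy and the leave-one-out CE scores on a synthetic samples × features matrix; the matrix orientation and the random data are assumptions of the sketch:

```python
# SVD-entropy of a data matrix and the contribution CE_i of each feature,
# computed by leaving that feature out (leave-one-out).
import numpy as np

def svd_entropy(X):
    s = np.linalg.svd(X, compute_uv=False)
    rho = s**2 / np.sum(s**2)            # normalized relative values
    rho = rho[rho > 0]
    return -np.sum(rho * np.log(rho)) / np.log(len(s))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 30))            # 50 samples, 30 features
X[:, :5] += np.outer(rng.normal(size=50), np.ones(5))  # a few correlated features

E = svd_entropy(X)
CE = np.array([E - svd_entropy(np.delete(X, i, axis=1)) for i in range(X.shape[1])])

print("overall SVD-entropy:", round(E, 3))
print("features with highest CE:", np.argsort(CE)[::-1][:5])
```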

  27. Golub AML/ALL data

  28. CEs suggest 3 groups of features • CEi > c: high contribution → meaningful (?) • CEi ≈ c: average contribution → neutral • CEi < c: low contribution → uniformity

  29. Three Feature Selection Methods • Simple Ranking (SR) • Forward Selection (FS), in two variants: • aggregate the feature with the highest CE, one at a time • or select and remove the feature with the highest CE, one at a time • Backward Elimination (BE)
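A sketch of Simple Ranking and of the select-and-remove Forward Selection variant driven by the CE score; it repeats the small entropy/CE helpers from the previous sketch so that it stays self-contained, and the synthetic matrix and k = 5 are illustrative:

```python
# Simple Ranking (SR) and Forward Selection (FS, select-and-remove variant)
# driven by the CE score; Backward Elimination would run the mirror-image loop.
import numpy as np

def svd_entropy(X):
    s = np.linalg.svd(X, compute_uv=False)
    rho = s**2 / np.sum(s**2)
    rho = rho[rho > 0]
    return -np.sum(rho * np.log(rho)) / np.log(len(s))

def ce_scores(X):
    E = svd_entropy(X)
    return np.array([E - svd_entropy(np.delete(X, i, axis=1))
                     for i in range(X.shape[1])])

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 25))
k = 5

# SR: rank once by CE and keep the top k
sr = np.argsort(ce_scores(X))[::-1][:k].tolist()

# FS (select and remove): pick the highest-CE feature, drop it, re-score the rest
remaining, fs = list(range(X.shape[1])), []
for _ in range(k):
    best = remaining[int(np.argmax(ce_scores(X[:, remaining])))]
    fs.append(best)
    remaining.remove(best)

print("SR:", sorted(sr), "FS:", sorted(fs))
```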

  30. Fauquet virus problem 61 viruses. 18 features (amino-acid compositions of coat proteins of the viruses). Four known classes.

  31. Ranking of the different methods

  32. Test: classification results

  33. Results - Example (Golub et al. 1999) • Leukemia • 72 patients (samples) • 7129 genes • 4 groups: • two major types, ALL & AML • T & B cells in ALL • with/without treatment in AML • (Figure: the genes/features × samples data matrix)

  34. Results (Cont’)

  35. Results (Cont’)

  36. Overlap of features among methods

  37. Results (Cont’)

  38. Clustering Assessment • n11 – the number of pairs that are classified together, both in the ‘real’ classification and by the algorithm • n10 – the number of pairs that are classified together in the ‘real’ classification, but not by the algorithm • n01 – the number of pairs that are classified together by the algorithm, but not in the ‘real’ classification • (Figure: the pair counts n11, n10, n01, comparing the ‘real’ classification with the algorithm's clustering)
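A sketch of the pair-counting assessment; the slide does not spell out how the three counts are combined, so the Jaccard-style ratio n11 / (n11 + n10 + n01) is used here as one standard choice:

```python
# Pair-counting comparison of a clustering against the 'real' labels:
# n11, n10, n01 as defined on the slide, combined into a Jaccard-style score.
from itertools import combinations

def pair_counts(real, algo):
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(real)), 2):
        same_real = real[i] == real[j]
        same_algo = algo[i] == algo[j]
        if same_real and same_algo:
            n11 += 1
        elif same_real and not same_algo:
            n10 += 1
        elif same_algo and not same_real:
            n01 += 1
    return n11, n10, n01

real = [1, 1, 2, 2, 3, 3, 4, 4]      # toy 'real' classification
algo = [1, 1, 2, 3, 3, 3, 4, 2]      # toy clustering produced by an algorithm
n11, n10, n01 = pair_counts(real, algo)
print("n11 =", n11, "n10 =", n10, "n01 =", n01)
print("Jaccard score:", n11 / (n11 + n10 + n01))
```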
