
AMCS/CS 340 : Data Mining



Presentation Transcript


  1. Feature Selection AMCS/CS 340 : Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

  2. Outline • Introduction • Unsupervised Feature Selection • Clustering • Matrix Factorization • Supervised Feature Selection • Individual Feature Ranking (Single Variable Classifier) • Feature subset selection • Filters • Wrappers • Summary 2 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  3. Problems due to poor variable selection • The input dimension is too large, and the curse of dimensionality may arise • A poor model may be built with additional unrelated inputs, or with too few relevant inputs • Complex models with too many inputs are more difficult to understand

  4. Applications • OCR (optical character recognition) • HWR (handwriting recognition)

  5. Benefits of feature selection • Facilitating data visualization and data understanding • Reducing the measurement and storage requirements • Reducing training and utilization times • Defying the curse of dimensionality to improve prediction performance

  6. Feature Selection/Extraction • Thousands to millions of low-level features: select/extract the most relevant ones to build better, faster, and easier-to-understand learning machines • The N×m data matrix X with features {Fj} is reduced to N×d with features {fi}, d << m • Using the label Y → supervised; without the label Y → unsupervised

  7. Feature Selection vs. Extraction • Selection: choose the best subset of size d from the m features; {fi} is a subset of {Fj}, i=1,…,d, j=1,…,m • Extraction: extract d new features by linear or non-linear combination of all m features: {fi} = f({Fj}) • Extracted features may have no physical interpretation/meaning

  8. Outline • Introduction • Unsupervised Feature Selection • Clustering • Matrix Factorization • Supervised Feature Selection • Individual Feature Ranking (Single Variable Classifier) • Feature subset selection • Filters • Wrappers • Summary

  9. Feature Selection by Clustering • Group features into clusters • Replace (many) similar variables in one cluster by a (single) cluster centroid • E.g., K-means, Hierarchical clustering
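As a sketch of the idea on this slide, the toy example below clusters the columns of a data matrix with a small hand-rolled k-means and keeps one representative feature per cluster. One common variant, used here, keeps the actual feature nearest each centroid rather than the centroid itself; the data and all names are invented for illustration.

```python
import numpy as np

def select_by_clustering(X, k, n_iter=20, seed=0):
    """Cluster the columns (features) of X with a small k-means and
    keep, from each cluster, the feature closest to the centroid."""
    rng = np.random.default_rng(seed)
    F = X.T  # one row per feature
    centroids = F[rng.choice(len(F), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each feature to its nearest centroid
        d = np.linalg.norm(F[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster goes empty)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = F[labels == c].mean(axis=0)
    # representative = feature nearest to each final centroid
    d = np.linalg.norm(F[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(set(d.argmin(axis=0)))

# toy data: features 0 and 1 nearly identical, feature 2 independent
rng = np.random.default_rng(1)
f0 = rng.normal(size=100)
X = np.column_stack([f0, f0 + 0.01 * rng.normal(size=100),
                     rng.normal(size=100)])
print(select_by_clustering(X, k=2))
```

The two near-duplicate features collapse into one cluster, so only one of them survives alongside the independent feature.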

  10. Example of student project • Abdullah Khamis, AMCS/CS340 2010 Fall, “Statistical Learning Based System for Text Classification”

  11. Other unsupervised FS methods • Matrix Factorization • PCA (Principal Component Analysis) • use PCs with largest eigenvalues as “features” • SVD (Singular Value Decomposition) • use singular vectors with largest singular values as “features” • NMF (Non-negative Matrix Factorization) • Nonlinear Dimensionality Reduction • Isomap • LLE (Locally Linear Embedding)
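The PCA/SVD route can be sketched with plain NumPy: center the data, take the SVD, and use the singular vectors with the largest singular values as the new features. The toy data below (mostly rank-one plus noise) is invented for illustration.

```python
import numpy as np

def pca_features(X, d):
    """Project X onto the d principal components with the largest
    singular values (equivalently, the largest eigenvalues of the
    covariance matrix)."""
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T, S                        # new features + spectrum

rng = np.random.default_rng(0)
# 200 points lying mostly along one direction in 5-D
t = rng.normal(size=(200, 1))
X = t @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(200, 5))
Z, S = pca_features(X, d=2)
print(Z.shape, S[0] / S.sum())  # the first PC carries most of the spectrum
```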

  12. Outline • Introduction • Unsupervised Feature Selection • Clustering • Matrix Factorization • Supervised Feature Selection • Individual Feature Ranking (Single Variable Classifier) • Feature subset selection • Filters • Wrappers • Summary 12 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  13. Feature Ranking • Build better, faster, and easier-to-understand learning machines • Discover the features most relevant to the target label, e.g., find genes that discriminate between healthy and diseased patients • Eliminate useless features (distracters) • Rank useful features • Eliminate redundant features

  14. Example of detecting attacks in real HTTP logs • Request types: a common request, a JS XSS attack, a remote file inclusion attack, a DoS attack • Represent each HTTP request by a vector in 95 dimensions, one per ASCII code between 33 and 127, giving the character distribution: the frequency of each ASCII code in the path of the HTTP request • Classify HTTP vectors in the 95-dim space vs. in a reduced-dimension space? Which dimensions to choose? Which is better?
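The character-distribution representation described on this slide can be sketched in a few lines. The example request path is made up; ASCII codes 33..127 (95 values) are used exactly as stated on the slide.

```python
def char_distribution(path):
    """95-dim character-distribution vector for an HTTP request path:
    one slot per ASCII code 33..127, holding that character's
    frequency among the counted characters."""
    counts = [0] * 95
    total = 0
    for ch in path:
        code = ord(ch)
        if 33 <= code <= 127:
            counts[code - 33] += 1
            total += 1
    return [c / total for c in counts] if total else counts

# hypothetical suspicious request path
v = char_distribution("/index.php?id=1%27%20OR%201=1")
print(len(v), sum(v))  # 95 dimensions, frequencies summing to 1
```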

  15. Individual Feature Ranking (1): by AUC • Rank the features by the AUC of a single-variable classifier on xi • AUC → 1: most related; AUC → 0.5: most unrelated • (Figure: ROC curve, True Positive Rate vs. False Positive Rate; AUC is the area under it)
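A minimal sketch of AUC-based ranking: the AUC of a single feature used as a score equals the fraction of (positive, negative) pairs the feature orders correctly, with ties counted half, and features can be ranked by how far their AUC is from 0.5. The tiny score/label arrays are illustrative.

```python
import numpy as np

def auc(scores, y):
    """AUC of a single feature used as a score: the probability that
    a random positive outranks a random negative (ties count 1/2)."""
    pos = scores[y == 1]
    neg = scores[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1])
x = np.array([0.1, 0.4, 0.35, 0.8])
print(auc(x, y))  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75

# rank features by how far their AUC is from 0.5 (0.5 = uninformative)
X = np.column_stack([x, [0.5, 0.5, 0.5, 0.5]])  # second feature is constant
ranking = sorted(range(X.shape[1]),
                 key=lambda i: abs(auc(X[:, i], y) - 0.5), reverse=True)
print(ranking)  # the informative feature comes first
```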

  16. Individual Feature Ranking (2): by Mutual Information • Rank features by the mutual information I(i) between each variable and the target: I(i) = Σx Σy P(X = xi, Y = y) log [ P(X = xi, Y = y) / (P(X = xi) P(Y = y)) ] • P(Y = y): frequency count of class y • P(X = xi): frequency count of attribute value xi • P(X = xi, Y = y): frequency count of attribute value xi jointly with class y • The higher I(i), the more related attribute xi is to class y
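The mutual-information score above can be computed from frequency counts alone (natural log used here); the toy labels are invented.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """I(X;Y) from frequency counts, for a discrete attribute x
    and class labels y (natural log)."""
    n = len(x)
    pxy = Counter(zip(x, y))   # joint counts P(X=x, Y=y)
    px = Counter(x)            # marginal counts P(X=x)
    py = Counter(y)            # marginal counts P(Y=y)
    return sum(c / n * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

y = [0, 0, 1, 1, 0, 1]
print(mutual_information(y, y))        # attribute identical to the class: maximal
print(mutual_information([1] * 6, y))  # constant attribute: 0
```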

  17. Individual Feature Ranking (3): with a continuous target • Rank features by the Pearson correlation coefficient R(i), which detects linear dependencies between a variable and the target • Rank features by R(i) or R²(i) (as in linear regression) • |R(i)| → 1: related; → 0: unrelated
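A sketch of the correlation ranking using NumPy's corrcoef, on synthetic data where one feature is linear in the target and the other is noise.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)                                  # continuous target
X = np.column_stack([2 * y + 0.1 * rng.normal(size=200),  # linear in y
                     rng.normal(size=200)])               # pure noise

# R(i): Pearson correlation of each feature with the target
R = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
ranking = np.argsort(-R**2)   # rank by R^2, largest first
print(np.round(R, 3), ranking)
```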

  18. Individual Feature Ranking (4): by T-test • μ+, μ−: class means of xi; σ+, σ−: class standard deviations • Null hypothesis H0: μ+ = μ− (xi and Y are independent) • Relevance index = test statistic; if H0 is true, the T statistic t(i) = (μ+ - μ−) / (s_pooled · sqrt(1/n+ + 1/n−)) follows a Student-t distribution • Rank by p-value, the false-positive rate: the lower the p-value, the more related xi is to class y
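The pooled two-sample t statistic can be computed directly as a sketch; the informative and noise features below are synthetic.

```python
import numpy as np

def t_statistic(x, y):
    """Pooled two-sample t statistic for feature x against binary y."""
    xp, xn = x[y == 1], x[y == 0]
    np_, nn = len(xp), len(xn)
    # pooled within-class variance
    sp2 = ((np_ - 1) * xp.var(ddof=1) + (nn - 1) * xn.var(ddof=1)) / (np_ + nn - 2)
    return (xp.mean() - xn.mean()) / np.sqrt(sp2 * (1 / np_ + 1 / nn))

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
informative = np.where(y == 1, 2.0, 0.0) + rng.normal(size=100)
noise = rng.normal(size=100)
print(abs(t_statistic(informative, y)), abs(t_statistic(noise, y)))
```

The class-separated feature produces a far larger |t| (hence a far smaller p-value) than the noise feature.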

  19. Individual Feature Ranking (5): by Fisher Score • Fisher discrimination: F = between-class variance / pooled within-class variance • Two-class case, with class means μ+, μ− and standard deviations σ+, σ−: F(i) = (μ+ - μ−)² / (σ+² + σ−²) • Rank by F value: the higher F, the more related xi is to class y
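A sketch of the two-class Fisher score in its simple form, squared distance between class means over the sum of within-class variances, on synthetic two-class data.

```python
import numpy as np

def fisher_score(x, y):
    """Two-class Fisher score: squared distance between class means
    over the sum of the within-class variances."""
    xp, xn = x[y == 1], x[y == 0]
    return (xp.mean() - xn.mean()) ** 2 / (xp.var() + xn.var())

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = np.column_stack([np.where(y == 1, 2.0, 0.0) + rng.normal(size=100),
                     rng.normal(size=100)])        # second feature is noise
F = [fisher_score(X[:, i], y) for i in range(2)]
print(np.round(F, 3))  # the separated feature scores much higher
```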

  20. Rank features in HTTP logs http://www.lri.fr/~xlzhang/KAUST/CS340_slides/FS_rank_demo.zip

  21. Issues of individual feature ranking • Relevance vs. usefulness: relevance does not imply usefulness, and usefulness does not imply relevance • Leads to the selection of a redundant subset: the k best features != the best k features • A variable that is useless by itself can be useful together with others

  22. Useless features become useful • Separation can be gained by using two variables instead of one, or by adding variables • Ranking variables individually and independently of each other cannot determine which combination of variables would give the best performance

  23. Outline • Introduction • Unsupervised Feature Selection • Clustering • Matrix Factorization • Supervised Feature Selection • Individual Feature Ranking (Single Variable Classifier) • Feature subset selection • Filters • Wrappers • Summary

  24. Multivariate feature selection is complex (Kohavi-John, 1997) • M features → 2^M possible feature subsets!

  25. Objectives of feature selection

  26. Questions before subset feature selection • How to search the space of all possible variable subsets? • Do we use the prediction performance to guide the search? No → Filter; Yes → Wrapper • How to assess the prediction performance of a learning machine to guide the search, and when to halt it? • Which predictor to use? Popular predictors include decision trees, Naive Bayes, least-squares linear predictors, and SVMs

  27. Filter: feature subset selection • The feature subset is chosen by an evaluation criterion that measures the relevance of each subset of input variables, e.g., the correlation-based feature selector (CFS) • CFS prefers subsets whose features are highly correlated with the class and uncorrelated with each other • Mean feature-class correlation: how predictive of the class a set of features is • Average feature-feature intercorrelation: how much redundancy there is among the feature subset
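The CFS trade-off described above has a standard closed form, merit(S) = k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature intercorrelation of a k-feature subset. A sketch with invented correlation values:

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """CFS merit of a k-feature subset: r_cf is the mean feature-class
    correlation, r_ff the mean feature-feature intercorrelation."""
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# same class correlation, less redundancy -> higher merit
print(cfs_merit(2, 0.8, 0.1), cfs_merit(2, 0.8, 0.9))
```

Holding r_cf fixed, raising the intercorrelation r_ff (more redundancy) lowers the merit, which is exactly the behavior the slide describes.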

  28. Filter: feature subset selection (2) • Search all possible feature subsets, k = 1,…,M? Exhaustive enumeration is infeasible • Instead: forward selection, backward elimination, or best-first search, with a stopping criterion • A filter method is a pre-processing step, independent of the learning algorithm

  29. Forward Selection • Sequential forward selection (SFS): features are sequentially added to an empty candidate set until the addition of further features no longer improves the criterion • The search proceeds over candidate sets of size 1, 2, …, n
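Greedy SFS can be sketched as below. The scoring criterion (in-sample R² of a least-squares fit) and the synthetic data are illustrative choices, not the slide's; with a held-out criterion the `s <= best` test would stop the search on its own.

```python
import numpy as np

def sfs(n_features, score, max_k):
    """Sequential forward selection: start from the empty set and, at
    each step, add the feature whose addition improves `score` the
    most; stop when nothing improves (or max_k features are chosen)."""
    selected, best = [], -np.inf
    while len(selected) < max_k:
        candidates = [f for f in range(n_features) if f not in selected]
        s, f = max((score(selected + [f]), f) for f in candidates)
        if s <= best:
            break
        selected, best = selected + [f], s
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 1] + 2 * X[:, 4] + 0.1 * rng.normal(size=200)

def r2(subset):
    # in-sample R^2 of a least-squares fit of y on the chosen columns
    A = X[:, subset]
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

print(sfs(5, r2, max_k=2))  # -> [1, 4]: the two truly relevant features
```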

  30. Backward Elimination • Sequential backward selection (SBS): features are sequentially removed from the full candidate set until the removal of further features increases the criterion • The search proceeds over candidate sets of size n, n−1, n−2, …, 1

  31. Wrapper: feature selection methods • The learning model is used as part of the evaluation function and also to induce the final learning model • Subsets of features are scored according to their predictive power • The parameters of the model are optimized by measuring a cost function • Danger of over-fitting with intensive search!

  32. RFE-SVM: Recursive Feature Elimination SVM (Guyon-Weston, 2000; US patent 7,117,188) • Starting from all features, alternate: train an SVM, then eliminate the least useful feature(s); stop when performance degrades • 1: repeat • 2: Find w and b by training a linear SVM • 3: Remove the feature with the smallest value |wi| • 4: until a desired number of features remain
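The four-line loop on the slide can be written almost verbatim, assuming scikit-learn's LinearSVC is available as the linear SVM; the data below are synthetic, with a single informative feature.

```python
import numpy as np
from sklearn.svm import LinearSVC  # any linear classifier exposing coef_ works

def rfe_svm(X, y, n_keep):
    """The RFE loop from the slide: train a linear SVM, drop the
    feature with the smallest |w_i|, repeat until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w = LinearSVC(max_iter=10000).fit(X[:, remaining], y).coef_[0]
        remaining.pop(int(np.argmin(np.abs(w))))
    return remaining

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 6))
X[:, 2] += 3 * y                 # only feature 2 separates the classes
print(rfe_svm(X, y, n_keep=1))   # the informative feature survives
```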

  33. Selecting feature subsets in HTTP logs

  34. Comparison of Filter and Wrapper • Main goal: rank subsets of useful features • Search strategies: explore the space of all possible feature combinations • Two criteria: predictive power (maximize) and subset size (minimize) • Predictive power assessment: – Filter methods: criteria not involving any learning machine, e.g., a relevance index based on correlation coefficients or test statistics – Wrapper methods: the performance of a learning machine trained on a given feature subset • Wrappers are potentially very time-consuming, since they typically evaluate a cross-validation scheme at every iteration; filter methods are much faster but do not incorporate learning

  35. Feature subset selection by Random Forest: forward selection with trees • Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993), start from all the data and, at each step, choose the feature that “reduces entropy” most, working towards “node purity” • E.g., the tree first chooses f1, then f2
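As a sketch of the Random Forest side of this slide (assuming scikit-learn is available), impurity-based importances from a forest of such trees can rank features; the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # assumed available

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 5))
X[:, 3] += 2 * y                 # feature 3 carries the class signal

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_   # mean impurity reduction per feature
print(np.argsort(-importances))             # feature 3 should rank first
```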

  36. Outline • Introduction • Unsupervised Feature Selection • Clustering • Matrix Factorization • Supervised Feature Selection • Individual Feature Ranking (Single Variable Classifier) • Feature subset selection • Filters • Wrappers • Summary

  37. Conclusion • Feature selection focuses on uncovering subsets of variables X1, X2, … predictive of the target Y • Univariate feature selection: how to rank the features? • Multivariate (subset) feature selection (Filter, Wrapper, Embedded): how to search the subsets of features, and how to evaluate them? • Feature extraction: how to construct new features in linear/non-linear ways?

  38. In practice • No method is universally better: there is a wide variety of types of variables, data distributions, learning machines, and objectives • Match the method complexity to the ratio M/N: univariate feature selection may work better than multivariate feature selection; non-linear classifiers are not always better • Feature selection is not always necessary to achieve good performance • NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges

  39. Feature selection toolboxes • Matlab: sequentialfs (sequential feature selection, shown in the demo); forward selection works well, but be careful with the definition of the criterion for backward elimination • Feature Selection Toolbox 3: freely available, open-source software in C++ • Weka

  40. References • An Introduction to Variable and Feature Selection, Isabelle Guyon and André Elisseeff, JMLR, 2003 • Feature Extraction: Foundations and Applications, Isabelle Guyon et al., Eds., Springer, 2006. http://clopinet.com/fextract-book • Pabitra Mitra, C. A. Murthy, and Sankar K. Pal, “Unsupervised Feature Selection Using Feature Similarity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 2002 • Prof. Marc Van Hulle, Katholieke Universiteit Leuven, http://134.58.34.50/~marc/DM_course/slides_selection.pdf
