Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution

Presentation Transcript

  1. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution Presented by Jingting Zeng 11/26/2007

  2. Outline • Introduction to Feature Selection • Feature Selection Models • Fast Correlation-Based Filter (FCBF) Algorithm • Experiment • Discussion • Reference

  3. Introduction to Feature Selection • Definition • A process that chooses an optimal subset of features according to an objective function • Objectives • To reduce dimensionality and remove noise • To improve mining performance • Speed of learning • Predictive accuracy • Simplicity and comprehensibility of mined results

  4. An Example of an Optimal Subset • Data set (whole set) • Five Boolean features • C = F1 ∨ F2 • F3 = ¬F2, F5 = ¬F4 • Optimal subset: • {F1, F2} or {F1, F3}
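
The example above can be checked by brute force. The sketch below is illustrative only: it assumes the "whole set" means every consistent assignment of the five Boolean features, and the helper name `determines_class` is hypothetical. A subset is treated as sufficient when no two rows agree on the subset but disagree on the class.

```python
from itertools import product


def determines_class(subset, rows):
    """Return True if the features in `subset` uniquely determine the class label."""
    seen = {}
    for features, label in rows:
        key = tuple(features[name] for name in subset)
        # setdefault stores the first label seen for this key and returns it
        if seen.setdefault(key, label) != label:
            return False
    return True


# Build the whole data set: F1, F2, F4 vary freely; F3 = not F2, F5 = not F4.
rows = []
for f1, f2, f4 in product([False, True], repeat=3):
    features = {"F1": f1, "F2": f2, "F3": not f2, "F4": f4, "F5": not f4}
    rows.append((features, f1 or f2))  # class C = F1 or F2

for subset in (("F1", "F2"), ("F1", "F3"), ("F1",), ("F4", "F5")):
    print(subset, determines_class(subset, rows))
```

Both {F1, F2} and {F1, F3} determine C because F3 carries the same information as F2, while F4 and F5 say nothing about the class.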

  5. Models of Feature Selection • Filter model • Separating feature selection from classifier learning • Relying on general characteristics of data (information, distance, dependence, consistency) • No bias toward any learning algorithm, fast • Wrapper model • Relying on a predetermined classification algorithm • Using predictive accuracy as goodness measure • High accuracy, computationally expensive

  6. Filter Model

  7. Wrapper Model

  8. Two Aspects for Feature Selection • How to decide whether a feature is relevant to the class or not • How to decide whether such a relevant feature is redundant or not compared to other features

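  9. Linear Correlation Coefficient • For a pair of variables (x, y): $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$ • However, it may not capture non-linear correlations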
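
To illustrate the limitation named on this slide, here is a small sketch of the standard Pearson coefficient in plain Python. The function name `pearson_r` and the toy data are assumptions made for illustration. The second data set is fully determined by x (y = x²), yet its linear correlation is zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # perfectly linear in x
quadratic = [x * x for x in xs]    # fully dependent on x, but not linearly

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(r_linear, r_quadratic)
```

This is the motivation for moving to entropy-based measures, which do not assume a linear relationship.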

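  10. Information Measures • Entropy of variable X: $H(X) = -\sum_i P(x_i) \log_2 P(x_i)$ • Entropy of X after observing Y: $H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)$ • Information Gain: $IG(X|Y) = H(X) - H(X|Y)$ • Symmetrical Uncertainty: $SU(X,Y) = 2\left[\frac{IG(X|Y)}{H(X) + H(Y)}\right]$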
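
The four measures on this slide can be sketched for discrete variables as follows. This is a minimal illustration using empirical frequencies as probabilities; the function names are hypothetical, and returning 0 when both entropies are zero is a convention chosen here, not something the slides specify.

```python
import math
from collections import Counter


def entropy(xs):
    """H(X) in bits, estimated from value frequencies."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X|Y): entropy of X within each group of equal Y, weighted by group size."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)


def symmetrical_uncertainty(xs, ys):
    """SU(X,Y) = 2 * IG(X|Y) / (H(X) + H(Y)), a value in [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    if denom == 0:
        return 0.0  # assumed convention when both variables are constant
    return 2.0 * information_gain(xs, ys) / denom


x = [0, 0, 1, 1]
negated = [1, 1, 0, 0]      # carries exactly the same information as x
unrelated = [0, 1, 0, 1]    # statistically independent of x in this sample

su_negated = symmetrical_uncertainty(x, negated)
su_unrelated = symmetrical_uncertainty(x, unrelated)
print(su_negated, su_unrelated)
```

Symmetrical uncertainty normalizes information gain by the two entropies, so a value of 1 means either variable fully predicts the other and 0 means they are independent.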

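  11. Fast Correlation-Based Filter (FCBF) Algorithm • How to decide whether a feature is relevant to the class C or not • Find a subset $S'$ such that $\forall F_i \in S'$, $1 \le i \le N$, $SU_{i,c} \ge \delta$, where $SU_{i,c}$ is the symmetrical uncertainty between feature $F_i$ and the class C and $\delta$ is a relevance threshold • How to decide whether such a relevant feature is redundant • Use the correlation of features and class as a reference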

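  12. Definitions • Predominant Correlation • The correlation between a feature $F_i$ and the class C is predominant iff $SU_{i,c} \ge \delta$ and there is no $F_j \in S'$ ($j \ne i$) such that $SU_{j,i} \ge SU_{i,c}$ • Redundant peer (RP) • If there is an $F_j$ with $SU_{j,i} \ge SU_{i,c}$, $F_j$ is a RP of $F_i$ • Use $S_{P_i}$ to denote the set of RPs for $F_i$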

  13. [Figure: labels i and C]

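  14. Three Heuristics • Notation: $S_{P_i}^{+}$ and $S_{P_i}^{-}$ are the redundant peers $F_j$ of $F_i$ with $SU_{j,c} > SU_{i,c}$ and $SU_{j,c} \le SU_{i,c}$, respectively • If $S_{P_i}^{+} = \emptyset$, treat $F_i$ as a predominant feature, remove all features in $S_{P_i}^{-}$, and skip identifying redundant peers for them • If $S_{P_i}^{+} \ne \emptyset$, process all the features in $S_{P_i}^{+}$ first. If none of them becomes predominant, follow the first heuristic • The feature with the largest $SU_{i,c}$ value is always a predominant feature and can be a starting point to remove other features.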

  15. [Figure: labels i and C]

  16. FCBF Algorithm • Time complexity: O(N)

  17. FCBF Algorithm (cont.) • Time complexity: O(N log N)
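
The algorithm listing on these two slides did not survive extraction. The sketch below follows the two-part procedure as described in Yu and Liu (2003), assuming discrete features: first keep features whose symmetrical uncertainty (SU) with the class is at least a threshold and rank them in descending order, then walk the ranked list and drop every later feature that is at least as correlated with the current feature as with the class. The names `su` and `fcbf`, the dictionary input format, and the threshold value 0.1 are illustrative assumptions, and the toy data is the five-Boolean-feature example from slide 4.

```python
import math
from collections import Counter
from itertools import product


def entropy(xs):
    """H(X) in bits, estimated from value frequencies."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def su(xs, ys):
    """Symmetrical uncertainty: 2 * (H(X) - H(X|Y)) / (H(X) + H(Y))."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    h_x_given_y = sum(len(g) / n * entropy(g) for g in groups.values())
    denom = entropy(xs) + entropy(ys)
    return 0.0 if denom == 0 else 2.0 * (entropy(xs) - h_x_given_y) / denom


def fcbf(features, labels, delta):
    """Return names of the selected features, most relevant first.

    `features` maps a feature name to its column of discrete values.
    """
    # Part 1: relevance analysis, then rank by SU with the class (descending).
    relevance = {name: su(col, labels) for name, col in features.items()}
    ranked = sorted(
        (name for name in relevance if relevance[name] >= delta),
        key=lambda name: relevance[name],
        reverse=True,
    )
    # Part 2: redundancy analysis. The head of the list is predominant;
    # remove every remaining feature for which it is a redundant peer.
    selected = []
    while ranked:
        p = ranked.pop(0)
        selected.append(p)
        ranked = [q for q in ranked if su(features[p], features[q]) < relevance[q]]
    return selected


# Toy data: C = F1 or F2, F3 = not F2, F5 = not F4.
columns = {name: [] for name in ("F1", "F2", "F3", "F4", "F5")}
labels = []
for f1, f2, f4 in product([False, True], repeat=3):
    row = {"F1": f1, "F2": f2, "F3": not f2, "F4": f4, "F5": not f4}
    for name, value in row.items():
        columns[name].append(value)
    labels.append(f1 or f2)

selected = fcbf(columns, labels, delta=0.1)
print(selected)
```

On this data F4 and F5 fail the relevance threshold, and F3 is removed as redundant with F2. F1, F2, and F3 tie on relevance, so the result depends on input order here; a different order would keep F3 instead of F2, which matches the two optimal subsets on slide 4.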

  18. Experiments • FCBF is compared with ReliefF, CorrSF, and ConsSF • Summary of the 10 data sets

  19. Results

  20. Results (cont.)

  21. Pros and Cons • Advantages • Very fast • Selects fewer features with higher accuracy • Disadvantage • Cannot detect some features • Example: with 4 features generated by 4 Gaussian functions plus 4 additional redundant features, FCBF selected only 3 features

  22. Discussion • FCBF compares only individual features with each other • Possible extension: use PCA to capture groups of features first, then apply FCBF to the result

  23. References • L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proc. 20th Int. Conf. on Machine Learning (ICML-03), pages 856–863, 2003 • J. Biesiada and W. Duch. Feature selection for high-dimensional data: A Kolmogorov-Smirnov correlation-based filter solution. In CORES'05, Advances in Soft Computing, Springer Verlag, pages 95–104, 2005 • www.cse.msu.edu/~ptan/SDM07/Yu-Ye-Liu.pdf • www1.cs.columbia.edu/~jebara/6772/proj/Keith.ppt

  24. Thank you! Q and A