Feature Selection - PowerPoint Presentation Transcript

  1. Feature Selection
     Advanced Statistical Methods in NLP, Ling 572, January 24, 2012

  2. Roadmap
     • Feature representations:
       • Features in attribute-value matrices
       • Motivation: text classification
     • Managing features:
       • General approaches
     • Feature selection techniques
     • Feature scoring measures
     • Alternative feature weighting
     • Chi-squared feature selection

  7. Representing Input: Attribute-Value Matrix
     • Choosing features:
       • Define features, e.g., with feature templates
       • Instantiate features
       • Perform dimensionality reduction
     • Weighting features: increase/decrease feature importance
       • Global feature weighting: weight a whole column
       • Local feature weighting: weight an individual cell, i.e., conditioned on the particular instance
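
To make "define a template, then instantiate it" concrete, here is a minimal sketch (not part of the original slides): a single word feature template instantiated over a tiny invented corpus, yielding one attribute-value row per document.

```python
# Minimal sketch: instantiate a single "word" feature template over a toy
# corpus, producing one row of an attribute-value matrix per document.
# The documents and labels below are made up purely for illustration.
from collections import Counter

docs = [("spam", "buy cheap pills now"),
        ("ham",  "meeting moved to friday"),
        ("spam", "cheap pills cheap pills")]

# Instantiation: one count feature per word type observed in the data.
vocab = sorted({w for _, text in docs for w in text.split()})

def to_feature_vector(text):
    """Map one document to its row in the attribute-value matrix."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

print(vocab)
for label, text in docs:
    print(label, to_feature_vector(text))
```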

  12. Feature Selection Example
      • Task: text classification
      • Feature template definition:
        • Word: just one template
      • Feature instantiation:
        • Words from training (and test?) data
      • Feature selection:
        • Stopword removal: remove the top K (~100) highest-frequency words
        • Words like: the, a, have, is, to, for, …
      • Feature weighting:
        • Apply tf*idf feature weighting
        • tf = term frequency; idf = inverse document frequency
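
A minimal sketch (not from the slides) of the tf*idf weighting step just described, computed from document-level counts; the toy documents are invented for illustration.

```python
# tf*idf sketch: tf = raw term count in the document,
# idf = log(N / number of documents containing the term).
import math
from collections import Counter

docs = ["buy cheap pills now",
        "meeting moved to friday",
        "cheap pills cheap pills"]

tokenized = [d.split() for d in docs]
N = len(tokenized)

# df[t]: number of documents in which term t appears
df = Counter()
for toks in tokenized:
    df.update(set(toks))

def tf_idf(toks):
    """Weight each term of one document by tf * idf."""
    tf = Counter(toks)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for toks in tokenized:
    print(tf_idf(toks))
```

Terms that occur in every document get idf = log(1) = 0, which is in the same spirit as the stopword-removal step above.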

  19. The Curse of Dimensionality
      • Think of the instances as vectors of features
        • # of features = # of dimensions
      • Number of features potentially enormous
        • # of words in a corpus continues to increase with corpus size
      • High dimensionality is problematic:
        • Leads to data sparseness
          • Hard to create a valid model
          • Hard to predict and generalize: think kNN
        • Leads to high computational cost
        • Leads to difficulty with estimation/learning
          • More dimensions → more samples needed to learn the model

  23. Breaking the Curse
      • Dimensionality reduction:
        • Produce a representation with fewer dimensions
        • But with comparable performance
      • More formally, given an original feature set r,
        • Create a new set r' with |r'| < |r| and comparable performance
      • Functionally, many ML algorithms do not scale well:
        • Expensive: training cost, testing cost
        • Poor prediction: overfitting, sparseness

  26. Dimensionality Reduction
      • Given an initial feature set r,
        • Create a feature set r' s.t. |r'| < |r|
      • Approaches:
        • r': same for all classes (aka global), vs.
        • r': different for each class (aka local)
        • Feature selection/filtering, vs.
        • Feature mapping (aka extraction)
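
The global vs. local distinction can be made concrete with a short sketch (not from the slides); raw document frequency stands in for a real scoring measure (chi-squared, information gain, ...) purely to keep the example small, and the data are invented.

```python
# Global selection picks one reduced feature set r' for all classes;
# local selection picks a separate r' per class.
from collections import Counter

# hypothetical labeled documents: (class, tokens)
docs = [("spam", ["cheap", "pills", "now"]),
        ("spam", ["cheap", "offer"]),
        ("ham",  ["meeting", "friday"]),
        ("ham",  ["meeting", "notes", "now"])]

def top_k(counts, k=2):
    return [t for t, _ in counts.most_common(k)]

# Global: score over all documents, one shared r'.
global_df = Counter(t for _, toks in docs for t in set(toks))
r_global = top_k(global_df)

# Local: score within each class, a different r' per class.
r_local = {}
for c in {label for label, _ in docs}:
    df_c = Counter(t for label, toks in docs if label == c for t in set(toks))
    r_local[c] = top_k(df_c)

print("global r':", r_global)
print("local  r':", r_local)
```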

  30. Feature Selection
      • Feature selection: r' is a subset of r
      • How can we pick features?
        • Extrinsic 'wrapper' approaches:
          • For each subset of features: build and evaluate a classifier for some task
          • Pick the subset of features with the best performance
        • Intrinsic 'filtering' methods:
          • Use some intrinsic (statistical?) measure
          • Pick features with the highest scores
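
A minimal sketch (not from the slides) of the extrinsic 'wrapper' idea: enumerate feature subsets and keep the one a classifier likes best. evaluate_classifier is a hypothetical stand-in for "train with these features, score on held-out data".

```python
# Wrapper-style selection: exhaustively evaluate every non-empty feature
# subset with a task-specific evaluation function and keep the best one.
from itertools import combinations

def wrapper_select(features, evaluate_classifier):
    best_subset, best_score = None, float("-inf")
    # Roughly 2^|r| subsets: this is why wrappers are intractable at scale.
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = evaluate_classifier(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Toy usage with a fake evaluation function that rewards two "useful"
# features and mildly penalizes subset size.
toy_eval = lambda s: len(set(s) & {"cheap", "meeting"}) - 0.01 * len(s)
print(wrapper_select(["cheap", "pills", "meeting", "now"], toy_eval))
```

The exponential loop is exactly the "computationally intractable" con listed on the next slide; a filtering method replaces it with one score per feature.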

  35. Feature Selection
      • Wrapper approach:
        • Pros:
          • Easy to understand and implement
          • Clear relationship between selected features and task performance
        • Cons:
          • Computationally intractable: 2^|r| * (training + testing)
          • Specific to task and classifier; ad hoc
      • Filtering approach:
        • Pros: theoretical basis; less task- and classifier-specific
        • Cons: doesn't always boost task performance

  40. Feature Mapping
      • Feature mapping (extraction) approaches:
        • Features in r' represent combinations/transformations of features in r
      • Example: many words are near-synonyms, but are treated as unrelated
        • Map them to a new concept representing all of them
        • big, large, huge, gigantic, enormous → the concept of 'bigness'
      • Examples:
        • Term classes, e.g., class-based n-grams
          • Derived from term clusters
        • Dimensions in Latent Semantic Analysis (LSA/LSI)
          • Result of Singular Value Decomposition (SVD) on the term-document matrix
          • Produces the 'closest' rank-r' approximation of the original matrix
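
A minimal sketch (not from the slides) of the LSA-style mapping just described: SVD of a small made-up term-document count matrix, truncated to r' dimensions, which yields the closest rank-r' approximation of the original matrix in the least-squares sense.

```python
# LSA/SVD sketch: map a term-document matrix into an r'-dimensional space.
import numpy as np

# rows = terms, columns = documents (toy counts, invented for illustration)
X = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

r_prime = 2                      # target number of dimensions
X_hat = U[:, :r_prime] @ np.diag(s[:r_prime]) @ Vt[:r_prime, :]

# Each document is now represented by an r'-dimensional vector.
doc_vectors = (np.diag(s[:r_prime]) @ Vt[:r_prime, :]).T
print("rank-%d reconstruction error: %.3f" % (r_prime, np.linalg.norm(X - X_hat)))
print(doc_vectors)
```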

  43. Feature Mapping
      • Pros:
        • Data-driven
        • Theoretical basis: guarantees on matrix similarity
        • Not bound by the initial feature space
      • Cons:
        • Some ad hoc factors, e.g., the number of dimensions
        • The resulting feature space can be hard to interpret

  46. Feature Filtering
      • Filtering approaches:
        • Apply some scoring method to features to rank their informativeness or importance w.r.t. some class
        • Fairly fast and classifier-independent
      • Many different measures:
        • Mutual information
        • Information gain
        • Chi-squared
        • etc.
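
A minimal sketch (not from the slides) of a filtering method using the notation introduced on the next slide: each term is scored against a class with pointwise mutual information estimated from binary document counts, and the top-k terms are kept. Chi-squared or information gain would slot into the same loop; only the scoring function changes. The data are invented.

```python
# Filtering sketch: score every term against one class, keep the top k.
import math

# hypothetical labeled documents: (class, set of terms appearing in the doc)
docs = [("spam", {"cheap", "pills", "now"}),
        ("spam", {"cheap", "offer"}),
        ("ham",  {"meeting", "friday", "now"}),
        ("ham",  {"meeting", "notes"})]

N = len(docs)
terms = set().union(*(t for _, t in docs))

def pmi(term, cls):
    """Pointwise mutual information log[P(t, c) / (P(t) P(c))] from doc counts."""
    n_t  = sum(1 for _, t in docs if term in t)            # docs containing term
    n_c  = sum(1 for c, _ in docs if c == cls)             # docs of the class
    n_tc = sum(1 for c, t in docs if c == cls and term in t)
    if n_tc == 0:
        return float("-inf")
    return math.log((n_tc / N) / ((n_t / N) * (n_c / N)))

k = 2
ranked = sorted(terms, key=lambda t: pmi(t, "spam"), reverse=True)
print("top-%d features for class 'spam':" % k, ranked[:k])
```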

  47. Feature Scoring Measures

  50. Basic Notation, Distributions
      • Assume a binary representation of terms and classes
        • tk: term in T; ci: class in C
      • P(tk): proportion of documents in which tk appears
      • P(ci): proportion of documents of class ci
      • Binary, so we also have the complements: P(¬tk) = 1 - P(tk) and P(¬ci) = 1 - P(ci)
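
A minimal sketch (not from the slides) estimating these document-level proportions, and their complements, from a toy labeled collection.

```python
# Estimate P(tk) and P(ci) as document-level proportions.
from collections import Counter

# hypothetical documents: (class, set of terms appearing in the doc)
docs = [("spam", {"cheap", "pills"}),
        ("spam", {"cheap", "offer"}),
        ("ham",  {"meeting", "friday"}),
        ("ham",  {"meeting", "cheap"})]

N = len(docs)
p_t = Counter()   # P(tk): fraction of documents containing term tk
p_c = Counter()   # P(ci): fraction of documents with class ci
for cls, terms in docs:
    p_c[cls] += 1 / N
    for t in terms:
        p_t[t] += 1 / N

print("P(cheap) =", p_t["cheap"], " P(not cheap) =", 1 - p_t["cheap"])
print("P(spam)  =", p_c["spam"],  " P(not spam)  =", 1 - p_c["spam"])
```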