Feature Selection

  1. Feature Selection Advanced Statistical Methods in NLP Ling 572 January 24, 2012

  2. Roadmap
     • Feature representations:
       • Features in attribute-value matrices
       • Motivation: text classification
     • Managing features:
       • General approaches
       • Feature selection techniques
     • Feature scoring measures
     • Alternative feature weighting
     • Chi-squared feature selection

  7. Representing Input: Attribute-Value Matrix
     • Choosing features:
       • Define features, e.g. with feature templates
       • Instantiate features
       • Perform dimensionality reduction
     • Weighting features: increase/decrease feature importance
       • Global feature weighting: weight a whole column
       • Local feature weighting: weight an individual cell, conditioned on the instance
     (A small sketch of template definition and instantiation follows below.)
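A minimal Python sketch of the template-definition and instantiation steps above (not part of the original slides; the two-document corpus, labels, and function names are invented for illustration):

```python
# Sketch: define one feature template ("word=X"), instantiate it over a toy
# corpus, and fill a binary attribute-value matrix (rows = instances,
# columns = features). All data here is made up.

docs = ["the team won the game", "stocks fell sharply today"]
labels = ["sports", "finance"]

def instantiate_word_features(documents):
    """Instantiate the 'word' template: one binary feature per word type."""
    vocab = sorted({w for d in documents for w in d.split()})
    return [f"word={w}" for w in vocab]

def to_matrix(documents, features):
    """Build the attribute-value matrix: 1 if the feature fires, else 0."""
    rows = []
    for d in documents:
        words = set(d.split())
        rows.append([1 if f.split("=", 1)[1] in words else 0 for f in features])
    return rows

features = instantiate_word_features(docs)
for label, row in zip(labels, to_matrix(docs, features)):
    print(label, row)
```

Global feature weighting would scale an entire column of this matrix; local weighting would adjust individual cells.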

  12. Feature Selection Example
     • Task: text classification
     • Feature template definition:
       • Word – just one template
     • Feature instantiation:
       • Words from the training (and test?) data
     • Feature selection:
       • Stopword removal: remove the top K (~100) highest-frequency words
       • Words like: the, a, have, is, to, for, …
     • Feature weighting:
       • Apply tf*idf feature weighting
       • tf = term frequency; idf = inverse document frequency
     (An end-to-end sketch of this pipeline follows below.)
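A rough end-to-end sketch of the pipeline on this slide, using a tiny invented corpus; the tf*idf variant shown (tf × log(N/df)) is one common formulation, not necessarily the exact one used in the course:

```python
import math
from collections import Counter

# Toy documents; in practice these come from the training data.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks rose as the market rallied".split(),
]

# Feature selection: remove the top K highest-frequency words as stopwords.
K = 1
freq = Counter(w for d in docs for w in d)
stopwords = {w for w, _ in freq.most_common(K)}
vocab = sorted(set(freq) - stopwords)

# Feature weighting: tf * idf, with idf = log(N / df).
N = len(docs)
df = {w: sum(1 for d in docs if w in d) for w in vocab}

def tfidf_vector(doc):
    tf = Counter(doc)
    return [tf[w] * math.log(N / df[w]) for w in vocab]

for d in docs:
    print([round(x, 2) for x in tfidf_vector(d)])
```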

  19. The Curse of Dimensionality
     • Think of the instances as vectors of features
       • # of features = # of dimensions
     • Number of features potentially enormous
       • # of words in a corpus continues to increase with corpus size
     • High dimensionality is problematic:
       • Leads to data sparseness
         • Hard to create a valid model
         • Hard to predict and generalize – think kNN
       • Leads to high computational cost
       • Leads to difficulty with estimation/learning
         • More dimensions → more samples needed to learn the model
     (A small illustration of the kNN issue follows below.)
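The kNN point can be seen directly: with random points in a unit hypercube, the nearest and farthest neighbors of a query become nearly equidistant as dimensionality grows, so distance-based prediction degrades. A small illustration (assumes NumPy; not from the original slides):

```python
import numpy as np

# As the number of dimensions grows, the ratio of nearest to farthest
# distance from a query to random points approaches 1, so "nearest
# neighbor" carries less and less information.
rng = np.random.default_rng(0)
n_points = 1000

for dim in (2, 10, 100, 1000):
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"dim={dim:4d}  nearest/farthest = {dists.min() / dists.max():.3f}")
```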

  23. Breaking the Curse
     • Dimensionality reduction:
       • Produce a representation with fewer dimensions
       • But with comparable performance
     • More formally, given an original feature set r,
       • Create a new set r’ with |r’| < |r| and comparable performance
     • Functionally,
       • Many ML algorithms do not scale well
       • Expensive: training cost, testing cost
       • Poor prediction: overfitting, sparseness

  26. Dimensionality Reduction
     • Given an initial feature set r,
     • Create a feature set r’ s.t. |r’| < |r|
     • Approaches:
       • r’: same for all classes (aka global), vs.
       • r’: different for each class (aka local)
       • Feature selection/filtering, vs.
       • Feature mapping (aka extraction)

  30. Feature Selection
     • Feature selection: r’ is a subset of r
     • How can we pick features?
     • Extrinsic ‘wrapper’ approaches:
       • For each subset of features:
         • Build, evaluate classifier for some task
       • Pick subset of features with best performance
     • Intrinsic ‘filtering’ methods:
       • Use some intrinsic (statistical?) measure
       • Pick features with highest scores
     (Both strategies are sketched below.)
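The contrast between the two strategies can be sketched as follows; this uses scikit-learn and a synthetic dataset purely for illustration (the classifier, scorer, and data are stand-ins, not the course's choices):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny synthetic task standing in for text classification.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
clf = LogisticRegression(max_iter=1000)

# Wrapper: train and evaluate the classifier on feature subsets, keep the best.
# (Exhaustive over size-3 subsets; only feasible because there are 8 features.)
best_score, best_subset = -1.0, None
for subset in combinations(range(X.shape[1]), 3):
    score = cross_val_score(clf, X[:, list(subset)], y, cv=3).mean()
    if score > best_score:
        best_score, best_subset = score, subset
print("wrapper pick:", best_subset, round(best_score, 3))

# Filter: score each feature intrinsically (here ANOVA F), keep the top 3.
scores, _ = f_classif(X, y)
print("filter pick:", sorted(np.argsort(scores)[::-1][:3].tolist()))
```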

  35. Feature Selection
     • Wrapper approach:
       • Pros:
         • Easy to understand, implement
         • Clear relationship between selected features and task performance
       • Cons:
         • Computationally intractable: 2^|r| * (training + testing cost)
         • Specific to task, classifier; ad hoc
     • Filtering approach:
       • Pros: theoretical basis, less task- and classifier-specific
       • Cons: Doesn’t always boost task performance

  40. Feature Mapping
     • Feature mapping (extraction) approaches:
       • Features in r’ represent combinations/transformations of the features in r
     • Example: many words are near-synonyms, but are treated as unrelated
       • Map them to a new concept representing all of them
       • big, large, huge, gigantic, enormous → concept of ‘bigness’
     • Examples:
       • Term classes: e.g. class-based n-grams
         • Derived from term clusters
       • Dimensions in Latent Semantic Analysis (LSA/LSI)
         • Result of Singular Value Decomposition (SVD) on the term-document matrix
         • Produces the ‘closest’ rank-r’ approximation of the original matrix
     (A small SVD sketch follows below.)
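A small sketch of LSA-style feature mapping via truncated SVD on a toy term-document count matrix (NumPy assumed; the matrix and k are invented for illustration):

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
# The first three terms ("big", "large", "huge") co-occur across documents.
A = np.array([
    [2, 1, 0, 0],   # big
    [1, 2, 0, 0],   # large
    [1, 1, 0, 0],   # huge
    [0, 0, 2, 1],   # stock
    [0, 0, 1, 2],   # market
], dtype=float)

# Truncated SVD: keep only the top-k singular values/vectors.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k matrix closest to A (Eckart-Young), i.e. the 'closest' approximation.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Documents re-represented in the k-dimensional latent "concept" space.
doc_vectors = np.diag(s_k) @ Vt_k   # shape (k, number of documents)
print(np.round(doc_vectors, 2))
```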

  43. Feature Mapping
     • Pros:
       • Data-driven
       • Theoretical basis – guarantees on matrix similarity
       • Not bound by initial feature space
     • Cons:
       • Some ad-hoc factors: e.g. # of dimensions
       • Resulting feature space can be hard to interpret

  46. Feature Filtering
     • Filtering approaches:
       • Apply some scoring method to the features to rank their informativeness or importance w.r.t. some class
       • Fairly fast and classifier-independent
     • Many different measures:
       • Mutual information
       • Information gain
       • Chi-squared
       • etc.
     (A chi-squared scoring sketch follows below.)
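As a preview of the chi-squared measure, here is a minimal scoring sketch from 2x2 document counts; the counts and term names are invented, and the formula is the standard 2x2 chi-squared statistic:

```python
# Intrinsic (filter) scoring: chi-squared for a (term, class) pair from
# binary document counts. All counts below are made up for illustration.
def chi_squared(n_tc, n_tnc, n_ntc, n_ntnc):
    """2x2 chi-squared statistic.
    n_tc   : # docs containing the term and in the class
    n_tnc  : # docs containing the term, not in the class
    n_ntc  : # docs without the term, in the class
    n_ntnc : # docs without the term, not in the class
    """
    n = n_tc + n_tnc + n_ntc + n_ntnc
    num = n * (n_tc * n_ntnc - n_tnc * n_ntc) ** 2
    den = ((n_tc + n_tnc) * (n_ntc + n_ntnc) *
           (n_tc + n_ntc) * (n_tnc + n_ntnc))
    return num / den if den else 0.0

# Rank terms by their score for one class; a filter keeps the top k.
counts = {"goal": (40, 10, 10, 140),
          "the": (45, 145, 5, 5),
          "stock": (2, 48, 48, 102)}
ranked = sorted(counts, key=lambda t: chi_squared(*counts[t]), reverse=True)
print(ranked)   # most to least informative for the class
```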

  47. Feature Scoring Measures

  50. Basic Notation, Distributions
     • Assume a binary representation of terms and classes
       • t_k: term in T; c_i: class in C
     • P(t_k): proportion of documents in which t_k appears
     • P(c_i): proportion of documents of class c_i
     • Binary, so the complements follow: P(¬t_k) = 1 − P(t_k), P(¬c_i) = 1 − P(c_i)
     (A sketch of estimating these from document counts follows below.)
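These distributions can be estimated by simple document counting; a toy sketch (the corpus and labels are invented):

```python
# Estimate the binary distributions from a small labeled corpus.
#   P(t_k): fraction of documents containing term t_k
#   P(c_i): fraction of documents labeled with class c_i
docs = [
    ("sports",  {"game", "team", "win"}),
    ("sports",  {"team", "score"}),
    ("finance", {"stock", "market", "win"}),
]
N = len(docs)

def p_term(term):
    return sum(1 for _, words in docs if term in words) / N

def p_class(cls):
    return sum(1 for label, _ in docs if label == cls) / N

print(p_term("win"), 1 - p_term("win"))          # P(t_k), P(not t_k)
print(p_class("sports"), 1 - p_class("sports"))  # P(c_i), P(not c_i)
```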
