
Feature Selection




  1. Feature Selection Machine Learning Workshop August 23rd, 2007 Alex Shyr Adapted from Ben Blum’s Slides

  2. Outline • Introduction • What is feature selection? Why do it? • Filtering • Model selection • Model evaluation • Model search • Regularization • Kernel methods • Miscellaneous topics • Summary

  3. Introduction • Data: pairs (x, y) • x = (x_1, …, x_n): vector of features • Features can be real (x_i ∈ ℝ), categorical (x_i ∈ {1, …, k}), or more structured • y: response (dependent) variable • y ∈ {−1, +1}: binary classification • y ∈ ℝ: regression • Typically, y is what we want to be able to predict, having observed some new x.

  4. Featurization • Data is often not originally in vector form • Have to choose features • Features often encode expert knowledge of the domain • Can have a huge effect on performance • Example: documents • “Bag of words” featurization: throw out order, keep count of how many times each word appears. • Surprisingly effective for many tasks
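
As a concrete illustration (not part of the original slides), a minimal bag-of-words featurizer might look like the sketch below; the tokenization and vocabulary are my own choices:

```python
from collections import Counter
import re

def bag_of_words(document, vocabulary):
    """Featurize raw text as word counts: one dimension per vocabulary word.
    Word order is thrown away; only the counts survive."""
    counts = Counter(re.findall(r"[a-z']+", document.lower()))
    return [counts[word] for word in vocabulary]

vocab = ["cat", "cats", "purr", "stock", "market"]
print(bag_of_words("Cats purr. My cat ignores the stock market.", vocab))
# -> [1, 1, 1, 1, 1]
```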

  5. What is feature selection? • Reducing the feature space by throwing out some of the features (covariates) • Also called variable selection • Motivating idea: try to find a simple, “parsimonious” model • Occam’s razor: the simplest explanation that accounts for the data is best • Augmenting the feature space to fit a better model is also possible • see kernel methods

  6. What is feature selection? • Task: classify whether a document is about cats • Data: word counts in the document • [Figure: full word-count feature vector X next to a reduced X that keeps only a few informative word counts]

  7. Why do it? • Case 1 (Interpretation): • We’re interested in features—we want to know which are relevant. The model should be interpretable. • Case 2 (Performance): • We’re interested in prediction; we just want to build a good predictor, with as few features as possible. • By removing redundant and irrelevant features, we can usually improve model performance. • By adding to and enriching the feature space, we can also improve model performance.

  8. Why do it? Case 1. We want to know which features are relevant. Examples • What causes lung cancer? • Features are aspects of a patient’s medical history • Binary response variable: did the patient develop lung cancer? • Which features best predict whether lung cancer will develop? Might want to legislate against these features. • What causes a program to crash? [Alice Zheng ’03, ’04, ‘05] • Features are aspects of a single program execution • Which branches were taken? • What values did functions return? • Binary response variable: did the program crash? • Features that predict crashes well are probably bugs.

  9. Why do it? Case 2. We want to build a good predictor. Examples • Text classification • Features for all 10^5 English words, and maybe all word pairs • Common practice: throw in every feature you can think of, let feature selection get rid of useless ones • Training too expensive with all features • The presence of irrelevant features hurts generalization. • Embedded systems with limited resources • Voice recognition on a cell phone • Branch prediction in a CPU (4K code limit) • Classifier must be compact

  10. Outline • Review/introduction • What is feature selection? Why do it? • Filtering • Model selection • Model evaluation • Model search • Regularization • Kernel methods • Miscellaneous topics • Summary

  11. Filtering • Simple techniques for weeding out irrelevant features without fitting a model

  12. Filtering • Basic idea: assign a score S(f) to each feature f indicating how “related” x_f and y are. • Intuition: if x_f by itself determines y, then feature f is good no matter what our model is—it contains all the information about y. • Many popular scores [see Yang and Pedersen ’97] • Classification with categorical data: chi-squared, information gain • Can use binning to make continuous data categorical • Regression: correlation, mutual information • Markov blanket [Koller and Sahami, ’96] • Then somehow pick how many of the highest-scoring features to keep (nested models)
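
A minimal sketch of score-based filtering, using information gain for categorical features; the function names are mine, not from the talk:

```python
import numpy as np

def information_gain(x, y):
    """I(X;Y) = H(Y) - H(Y|X) for one categorical feature column x and a
    categorical response y, estimated from empirical counts."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    h_y_given_x = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return entropy(y) - h_y_given_x

def filter_top_k(X, y, k):
    """Score every column of X independently, keep the k highest-scoring ones."""
    scores = np.array([information_gain(X[:, f], y) for f in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]   # indices of the k best features
```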

  13. Comparison of filtering methods for text categorization [Yang and Pedersen ’97]

  14. Filtering • Advantages: • Very fast • Simple to apply • Disadvantages: • Doesn’t take into account which learning algorithm will be used. • Doesn’t take into account correlations between features • This can be an advantage if we’re only interested in ranking the relevance of features, rather than performing prediction. • Suggestion: use light filtering as an efficient initial step if there are many obviously irrelevant features • Caveat here too—apparently useless features can be useful when grouped with others

  15. Outline • Introduction • What is feature selection? Why do it? • Filtering • Model selection • Model evaluation • Model search • Regularization • Kernel methods • Miscellaneous topics • Summary

  16. Model Selection • Choosing between possible models of varying complexity • In our case, a “model” means a set of features • Running example: linear regression model

  17. Linear Regression Model • Model: y = w·x + ε • Prediction rule: ŷ = w·x • Data: X ∈ ℝ^(m×n) • Response: y ∈ ℝ^m • Parameters: w ∈ ℝ^n • Typical loss function is sum squared error: L(w) = Σ_i (y_i − w·x_i)² • Recall that we can fit it via the normal equations: ŵ = (XᵀX)⁻¹Xᵀy • Can be interpreted as maximum likelihood with Gaussian noise ε ~ N(0, σ²)
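
For reference, a small numpy sketch of the least-squares fit via the normal equations (toy data, my own naming):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit: w = (X^T X)^{-1} X^T y, computed with a linear solve
    rather than an explicit inverse for numerical stability."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.0])          # the third feature is irrelevant
y = X @ w_true + 0.1 * rng.normal(size=100)
print(fit_linear_regression(X, y))           # close to [2, -1, 0]
```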

  18. Model Selection • Same linear regression model, data, and parameters as before • Consider a reduced model with only those features x_f for f in a subset s ⊆ {1, …, n} • Squared error is now L_s(w) = Σ_i (y_i − Σ_{f∈s} w_f x_{if})² • We want to pick out the “best” s. Maybe this means the one with the lowest training error L_s(ŵ_s)? • Note that L(ŵ) ≤ L_s(ŵ_s): just zero out the extra terms in w to match ŵ_s. • Generally speaking, training error will only go up in a simpler model. So why should we use one?

  19. Overfitting example • [Figure: degree-15 polynomial fit to a small set of noisy data points] • This model is too rich for the data • Fits the training data well, but doesn’t generalize.

  20. Model evaluation • Moral 1: In the presence of many irrelevant features, we might just fit noise. • Moral 2: Training error can lead us astray. • To evaluate a feature set s, we need a better scoring function K(s) • We’ve seen that training error L_s(ŵ_s) is not appropriate. • We’re not ultimately interested in training error; we’re interested in test error (error on new data). • We can estimate test error by pretending we haven’t seen some of our data. • Keep some data aside as a validation set.

  21. K-fold cross validation • A technique for estimating test error • Uses all of the data to validate • Divide the data into K groups X_1, …, X_K • Use each group in turn as a validation set, then average all K validation errors • [Figure: data split into folds X_1, …, X_7; one fold held out for testing, the rest used for learning]

  22. K-fold cross validation • [Figure: the same folds, with a different fold held out for testing and the rest used for learning]

  23. K-fold cross validation • [Figure: the procedure repeated for each remaining fold]
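
A minimal sketch of the K-fold procedure; `fit` and `loss` are placeholders supplied by the caller (e.g. the normal-equations fit above with squared error):

```python
import numpy as np

def k_fold_cv_error(X, y, fit, loss, K=5):
    """Hold out each of K folds in turn, fit on the rest, and average the
    held-out losses. fit(X, y) -> model, loss(model, X, y) -> float."""
    folds = np.array_split(np.random.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train_idx], y[train_idx])
        errors.append(loss(model, X[test_idx], y[test_idx]))
    return float(np.mean(errors))
```

Used this way, the score K(s) of a feature subset s is simply k_fold_cv_error(X[:, sorted(s)], y, fit, loss).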

  24. Model Search • We have an objective function K(s), e.g. cross-validation error • Time to search for a good model. • This is known as a “wrapper” method • The learning algorithm is a black box • Just use it to compute the objective function, then do search • Exhaustive search is expensive • 2^n possible subsets • Greedy search is common and effective

  25. Model search • Forward selection: initialize s = {}; do: add the feature to s which improves K(s) most; while K(s) can be improved • Backward elimination: initialize s = {1, 2, …, n}; do: remove the feature from s which improves K(s) most; while K(s) can be improved • Backward elimination tends to find better models • Better at finding models with interacting features • But it is frequently too expensive to fit the large models at the beginning of the search • Both can be too greedy.
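
A runnable sketch of forward selection as a wrapper around any scoring function, e.g. the K-fold cross-validation error above (lower is better); all names here are mine:

```python
def forward_selection(features, score):
    """Greedily grow a feature subset: at each step add the single feature
    that lowers score(subset) the most; stop when no addition helps."""
    selected = set()
    best = score(selected)            # score({}) = error of a baseline predictor
    while True:
        candidates = [(score(selected | {f}), f) for f in features if f not in selected]
        if not candidates:
            break
        new_best, best_f = min(candidates)
        if new_best >= best:          # no single feature improves K(s) any further
            break
        selected.add(best_f)
        best = new_best
    return selected
```

Backward elimination is the mirror image: start from the full feature set and greedily remove the feature whose removal helps most.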

  26. Model search • More sophisticated search strategies exist • Best-first search • Stochastic search • See “Wrappers for Feature Subset Selection”, Kohavi and John 1997 • For many models, search moves can be evaluated quickly without refitting • E.g. linear regression: add the feature with the largest covariance with the current residuals • YALE can do feature selection with cross-validation and either forward selection or backward elimination. • Other objective functions exist which add a model-complexity penalty to the training error • AIC: penalize the log-likelihood by k, the number of parameters • BIC: penalize the log-likelihood by (k/2) log m
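
As a worked illustration of the penalized objectives, here is how AIC and BIC could be computed for the Gaussian linear model above, under one common convention (conventions differ on whether the noise variance counts as a parameter):

```python
import numpy as np

def aic_bic_linear(X, y, w):
    """AIC = -2 log L + 2k and BIC = -2 log L + k log m for a Gaussian linear
    model; equivalently, penalize the log-likelihood by k or (k/2) log m."""
    m, k = len(y), int(np.sum(w != 0))
    rss = np.sum((y - X @ w) ** 2)
    log_lik = -0.5 * m * (np.log(2 * np.pi * rss / m) + 1)   # noise variance at its ML estimate
    return -2 * log_lik + 2 * k, -2 * log_lik + k * np.log(m)
```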

  27. Outline • Review/introduction • What is feature selection? Why do it? • Filtering • Model selection • Model evaluation • Model search • Regularization • Kernel methods • Miscellaneous topics • Summary

  28. Regularization • In certain cases, we can move model selection into the induction algorithm • Only have to fit one model; more efficient. • This is sometimes called an embedded feature selection algorithm

  29. Regularization • Regularization: add a model-complexity penalty to the training error: L_reg(w) = L(w) + C‖w‖ for some constant C • Now ŵ = argmin_w L_reg(w) • Regularization forces weights to be small, but does it force weights to be exactly zero? • Note that w_f = 0 is equivalent to removing feature f from the model

  30. L1 vs L2 regularization

  31. L1 vs L2 regularization • To minimize L(w) + C‖w‖, we can solve by (e.g.) gradient descent. • Minimization is a tug-of-war between the two terms

  32. L1 vs L2 regularization • To minimize L(w) + C‖w‖₁, we can solve by (e.g.) gradient descent. • For L1 regularization, w is forced into the corners of the constraint region • many components exactly 0 • Solution is sparse

  33. L1 vs L2 regularization • To minimize L(w) + C‖w‖₂², we can solve by (e.g.) gradient descent. • L2 regularization does not promote sparsity • Even without sparsity, regularization promotes generalization • it limits the expressiveness of the model

  34. Lasso Regression [Tibshirani ‘94] • Simply linear regression with an L1 penalty for sparsity: ŵ = argmin_w Σ_i (y_i − w·x_i)² + C‖w‖₁ • Two big questions: • 1. How do we perform this minimization? • With an L2 penalty it’s easy (still a least-squares problem) • With L1 it’s not a least-squares problem any more • 2. How do we choose C?
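
One way to perform the L1-penalized minimization is proximal gradient descent (ISTA): a gradient step on the squared error followed by coordinate-wise soft-thresholding. This is an illustration, not the algorithm from the talk:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink every coordinate toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, C, n_iter=2000):
    """Minimize ||y - Xw||^2 + C*||w||_1."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * C)
    return w
```

Increasing C drives more components of w to exactly zero, which is the sparsity the slide is after.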

  35. Least-Angle Regression • Up until a few years ago this was not trivial • Must fit the model for every candidate C value • LARS (Least Angle Regression, Efron et al., 2004) • Finds the trajectory of w for all possible C values simultaneously, about as efficiently as a single least-squares fit • Can choose how many features are wanted • [Figure: coefficient trajectories, taken from Efron et al. (2004)]
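
Assuming scikit-learn is available, its lars_path function returns the whole lasso coefficient path at once, so C need not be fixed in advance; this is a hedged sketch with made-up toy data:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.0]                  # only 3 of the 10 features matter
y = X @ w_true + 0.1 * rng.normal(size=200)

alphas, active, coefs = lars_path(X, y, method="lasso")
print(active)          # indices of the features that entered the model along the path
print(coefs.shape)     # (n_features, n_breakpoints): one coefficient vector per breakpoint
```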

  36. Outline • Review/introduction • What is feature selection? Why do it? • Filtering • Model selection • Model evaluation • Model search • Regularization • Kernel methods • Miscellaneous topics • Summary

  37. Kernel Methods • Expanding feature space gives us new potentially useful features. • Kernel methods let us work implicitly in a high-dimensional feature space. • All calculations performed quickly in low-dimensional space.

  38. Feature engineering • Linear models: convenient, fairly broad, but limited • We can increase the expressiveness of linear models by expanding the feature space. • E.g. Φ(x₁, x₂) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁x₂) • Now the feature space is ℝ⁶ rather than ℝ² • Example linear predictor in these features: f(x) = w·Φ(x), which is quadratic in the original x
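
A tiny sketch of this expansion (the exact map is my reconstruction; the √2 scalings are chosen so the inner product works out nicely on the next slide):

```python
import numpy as np

def phi(x):
    """Map (x1, x2) into a 6-dimensional quadratic feature space."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])

# a linear predictor in the expanded space is a quadratic predictor in the original space
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])            # f(x) = x1^2 + x2^2 - 1
f = lambda x: w @ phi(x)
print(f(np.array([0.5, 0.5])), f(np.array([2.0, 0.0])))  # negative inside the unit circle, positive outside
```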

  39. The kernel trick • Can still fit by the old methods, but it’s more expensive • Many algorithms we’ve looked at only see data through inner products (or can be rephrased to do so) • Perceptron, logistic regression, etc. • But notice: Φ(x)·Φ(z) = (1 + x·z)² • We can just compute the inner product in the original space. • This is called the kernel trick: working in a high-dimensional feature space implicitly, through an efficiently-computable inner product kernel k(x, z) = Φ(x)·Φ(z).
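
The identity can be checked numerically: the 6-dimensional inner product equals a degree-2 polynomial kernel evaluated entirely in the original 2-dimensional space (same assumed map as the previous sketch):

```python
import numpy as np

def phi(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: k(x, z) = (1 + x.z)^2."""
    return (1.0 + x @ z) ** 2

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(phi(x) @ phi(z), poly_kernel(x, z))   # the two numbers agree
```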

  40. Kernel methods • For some models, we can write the coefficients as a linear combination of the data points • Update rule for the perceptron: on a mistake, w ← w + y_i Φ(x_i), so w stays of the form Σ_i a_i Φ(x_i) • Representer theorem: for many kinds of models with linear parameters w, we can write w = Σ_i a_i Φ(x_i) for some a. • For linear regression, our predictor can then be written f(x) = w·Φ(x) = Σ_i a_i k(x_i, x) • Never need to deal with w explicitly; just need a kernel k(·,·) to take the place of Φ(x_i)·Φ(x) in comparing data points to each other.
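
A short sketch of a fully kernelized linear model; kernel ridge regression is used here purely as an illustration of the a-coefficients, not as the specific model from the talk:

```python
import numpy as np

def fit_kernel_ridge(K, y, lam=1e-3):
    """Solve a = (K + lam*I)^{-1} y, so the predictor is f(x) = sum_i a_i k(x_i, x)."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(a, X_train, x, kernel):
    return sum(a_i * kernel(x_i, x) for a_i, x_i in zip(a, X_train))

rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2))      # RBF kernel, bandwidth 1
X_train = np.random.default_rng(0).normal(size=(50, 2))
y_train = np.sin(X_train[:, 0]) + X_train[:, 1] ** 2
K = np.array([[rbf(xi, xj) for xj in X_train] for xi in X_train])
a = fit_kernel_ridge(K, y_train)
print(predict(a, X_train, np.array([0.1, 0.2]), rbf))
```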

  41. Kernel methods • Often more natural to define a similarity kernel than to define a feature space • Mercer’s theorem: every “qualifying” inner product kernel has an associated (possibly infinite-dimensional) feature space. • Polynomial kernel k(x, z) = (1 + x·z)^d: feature space = all monomials in x of degree ≤ d • RBF kernel k(x, z) = exp(−‖x − z‖²/(2σ²)): feature space is infinite dimensional

  42. Dynamic programming string kernel [Lodhi et al, 2002] • Feature space: all possible substrings of k letters, not necessarily contiguous. • E.g. a-p-l-s in “apples are tasty” • Value for each feature is exp{−(full length of the substring in the text)} • Very high dimensional! • Surprisingly, the kernel can be computed efficiently using dynamic programming. • Runs in time linear in the length of the documents • Text classification results: superior to using a bag-of-words feature space. • No way we could use this feature space without kernel methods.

  43. Kernel methods vs feature selection • Kernelizing is often, but not always, a good idea. • Often more natural to define a similarity kernel than to define a feature space, particularly for structured data • Sparsity • Typically regularize the model parameters a • An L1 norm gives sparse solutions • Solutions are sparse in the sense that only a few data points have non-zero weight: “support vectors”. • Feature/data exchange • After kernelization, data points act as features. • If there are many more (implicit) features than data points, this is more efficient • Given a set of support vectors x_1, …, x_m, a new data point x has implicit feature vector (k(x_1, x), …, k(x_m, x)) • Prediction is then f(x) = Σ_i a_i k(x_i, x)

  44. Outline • Review/introduction • What is feature selection? Why do it? • Filtering • Model selection • Model evaluation • Model search • Regularization • Kernel methods • Miscellaneous topics • Summary

  45. Decision Trees • Effectively a stepwise filtering method • In each subtree, only a subset of the data is considered • Split on the top feature according to a filtering criterion • Stop according to some stopping criterion • Depth, homogeneity, etc. • In the final tree, only a subset of the features is used • Very useful with boosting • Connection between AdaBoost and forward selection • [Figure: example decision tree splitting on features Tor4, Tor23, Tor27, and Tor40]
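
To make the “stepwise filtering” view concrete, a toy sketch (binary 0/1 features and labels, my own names) that reuses a per-feature filtering score such as information_gain from the earlier sketch at every node:

```python
import numpy as np

def grow_tree(X, y, score, depth=0, max_depth=3):
    """Tiny recursive decision tree: each node scores the features on its own
    subset of the data, splits on the best one, and recurses. Leaves store the
    majority label."""
    if depth == max_depth or len(np.unique(y)) == 1:
        return {"leaf": int(np.round(y.mean()))}
    f = int(np.argmax([score(X[:, j], y) for j in range(X.shape[1])]))
    left, right = X[:, f] == 0, X[:, f] == 1
    if left.all() or right.all():                      # the split separates nothing
        return {"leaf": int(np.round(y.mean()))}
    return {"feature": f,
            "left": grow_tree(X[left], y[left], score, depth + 1, max_depth),
            "right": grow_tree(X[right], y[right], score, depth + 1, max_depth)}
```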

  46. Dimension Reduction • Principal Component Analysis (PCA) • Project onto the subspace with the most variance • Unsupervised learning (doesn’t take the response y into account) • Sufficient Dimension Reduction • Project onto the subspace where, given the projected data, the response is most independent from the data • Other dimensionality reduction techniques in a future lecture
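
A compact PCA sketch via the SVD of the centered data matrix (my own function name):

```python
import numpy as np

def pca(X, d):
    """Project X onto its d highest-variance directions."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:d]               # top-d principal directions, one per row
    return Xc @ components.T, components

X = np.random.default_rng(0).normal(size=(100, 10))
Z, comps = pca(X, 2)
print(Z.shape)                        # (100, 2)
```

Note that, unlike the supervised methods above, nothing here looks at the response y.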

  47. Outline • Review/introduction • What is feature selection? Why do it? • Filtering • Model selection • Model evaluation • Model search • Regularization • Kernel methods • Miscellaneous topics • Summary

  48. Summary • Methods, roughly ordered by increasing computational cost: • Filtering • L1 regularization (embedded methods) • Kernel methods • Wrappers • Forward selection • Backward selection • Other search • Exhaustive

  49. Summary: Filtering • Good preprocessing step • Information-based scores seem most effective • Information gain • More expensive: Markov blanket [Koller & Sahami ’96] • Fails to capture relationships between features

  50. Summary: L1 regularization (embedded methods) • Fairly efficient • LARS-type algorithms now exist for many linear models • Not applicable to all models • Linear methods can be limited • Common: fit a linear model initially to select features, then fit a nonlinear model with the new feature set
