
Selection of Relevant Features and Examples in Machine Learning



Presentation Transcript


  1. Selection of Relevant Features and Examples in Machine Learning Paper by: Avrim L. Blum, Pat Langley Presented by: Arindam Bhattacharya (10305002), Akshat Malu (10305012), Yogesh Kakde (10305039), Tanmay Haldankar (10305911)

  2. Overview • Introduction • Selecting Relevant Features • Embedded Approaches • Filter Approaches • Wrapper Approaches • Feature Weighting Approaches • Selecting Relevant Examples • Selecting Labeled Data • Selecting Unlabeled Data • Challenges and Future Work

  3. Introduction • Machine learning is addressing larger and more complex tasks. • The Internet contains a huge volume of low-quality information. • We focus on: • Selecting the most relevant features • Selecting the most relevant examples

  4. Problems of Irrelevant Features • Not helpful in classification • They slow down the learning process [1] • The number of training examples required grows exponentially with the number of irrelevant features [2] [1] Cover and Hart, 1967 [2] Langley and Iba, 1993

  5. Definitions of Relevance • Definition 1: Relevant to the Target: • A feature xi is relevant to a target concept c if there exists a pair of examples A and B such that A and B differ only in feature xi and c(A) ≠ c(B). (Blum and Langley, 1997)

  6. Definitions of Relevance • Definition 2: Strongly Relevant to the Sample: • A feature xi is said to be strongly relevant to the sample S if there exist examples A and B in S that differ only in feature xi and have different labels. (John, Kohavi and Pfleger, 1994)

  7. Definitions of Relevance • Definition 3: Weakly Relevant to the Sample: • A feature xi is said to be weakly relevant to the sample S if it is possible to remove a subset of the features so that xi becomes strongly relevant. (John, Kohavi and Pfleger, 1994)
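The two sample-based definitions above are easy to operationalize. Below is a minimal sketch (not from the paper) that tests strong relevance per Definition 2 on a toy sample of binary feature vectors; weak relevance (Definition 3) could be tested by rerunning the same check after removing subsets of the other features. The function name and toy data are illustrative.

```python
from itertools import combinations

def strongly_relevant(i, sample):
    """Definition 2: feature i is strongly relevant to the sample if two
    examples differ only in feature i yet carry different labels.
    `sample` is a list of (features, label) pairs with equal-length tuples."""
    for (xa, ya), (xb, yb) in combinations(sample, 2):
        differing = [j for j in range(len(xa)) if xa[j] != xb[j]]
        if differing == [i] and ya != yb:
            return True
    return False

# Toy sample: the first feature decides the label, the second is noise.
S = [((0, 0), 0), ((1, 0), 1), ((0, 1), 0), ((1, 1), 1)]
print(strongly_relevant(0, S))  # True
print(strongly_relevant(1, S))  # False
```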

  8. Definitions of Relevance • Definition 4: Relevance as a Complexity Measure: • Given a sample S and a set of concepts C, let r(S,C) be the number of features relevant (using Definition 1) to that concept in C which, out of all those whose error over S is least, has the fewest relevant features. (Blum and Langley, 1997)

  9. Definitions of Relevance • Definition 5: Incremental Usefulness: • Given a sample S, a learning algorithm L, and a feature set A, feature xi is incrementally useful to L if the accuracy of the hypothesis that L produces using the feature set {xi} ∪ A is better than the accuracy achieved using just the feature set A. (Caruana and Freitag, 1994)

  10. Example • Consider a setting in which concepts can be expressed as disjunctions, and the algorithm sees the following examples:

  11. Example • Using Definitions 2 and 3, we can say that x1 is strongly relevant while x2 is weakly relevant. • Using Definition 4, we can say that there are three relevant features (r(S,C) = 3). • Using Definition 5, given the feature set {1,2}, the third feature may not be useful, but features 4 and 5 would be useful.

  12. Feature Selection as Heuristic Search • Heuristic search is an ideal paradigm for feature selection algorithms.

  13. Feature Selection as Heuristic Search • [Figure: the search space of feature subsets, organized as a partial order]

  14. Four Basic Issues • Where to start? • Forward Selection • Backward Elimination

  15. Four Basic Issues • How to organize the search? • Exhaustive search: 2^n possibilities for n attributes • Greedy search: • Hill climbing • Best-first search

  16. Four Basic Issues • Which alternative is better? - A strategy for evaluating candidate feature subsets • Accuracy on the training set or on a separate evaluation set • Interaction between feature selection and the basic induction algorithm

  17. Four Basic Issues • When to stop? • Stop when nothing improves • Go on until things worsen • Reach the end and select the best • Each combination of selected features maps to a single class • Order features by relevance and determine a break point

  18. An Example – The Set-Cover Algorithm • Begins at the left of the figure with a disjunction of zero features • From the safe features, selects the one that maximizes the number of correctly classified positive examples • Incrementally moves right, asking at each step whether any safe feature that improves performance is left • Evaluates candidates by performance on the training set, with an infinite penalty for misclassifying a negative example • Halts when no further step improves performance and outputs the selected features
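A minimal sketch of this greedy, set-cover style learner for monotone disjunctions, assuming 0/1 features and labels; the function name `learn_disjunction` and the choice to return feature indices are illustrative, not taken from the paper.

```python
def learn_disjunction(sample):
    """Greedy set-cover sketch: build a disjunction of features that covers
    the positive examples without ever firing on a negative one.
    `sample` is a list of (features, label) pairs with 0/1 values."""
    n = len(sample[0][0])
    positives = [x for x, y in sample if y == 1]
    negatives = [x for x, y in sample if y == 0]

    # "Safe" features never appear in a negative example, so adding one to
    # the disjunction can never misclassify a negative (the infinite penalty).
    safe = [i for i in range(n) if all(x[i] == 0 for x in negatives)]

    chosen, uncovered = [], list(positives)
    while uncovered and safe:
        # Greedily pick the safe feature covering the most uncovered positives.
        best = max(safe, key=lambda i: sum(x[i] for x in uncovered))
        if sum(x[best] for x in uncovered) == 0:
            break  # no safe feature improves performance: halt
        chosen.append(best)
        safe.remove(best)
        uncovered = [x for x in uncovered if x[best] == 0]
    return chosen  # indices of the features in the learned disjunction
```

Restricting the search to safe features is what realizes the infinite penalty on negatives: the learned disjunction can never fire on a negative training example.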

  19. Feature Selection Methods • Feature selection methods are grouped into three classes: • Those that embed the selection within the induction algorithm • Those that use a feature selection algorithm to filter the attributes passed to the induction algorithm • Those that treat feature selection as a wrapper around the induction process

  20. Embedded Approaches to Feature Selection • For this class of algorithms, feature selection is embedded within the basic induction algorithm. • Most algorithms for inducing logical concepts (e.g., the set-cover algorithm) add or remove features from the concept description based on prediction errors. • For these algorithms, the feature space is also the concept space.

  21. Embedded Approaches in Binary Feature Spaces • Gives attractive results for systems learning pure conjunctive (or pure disjunctive) rules: • The learned hypothesis is at most a logarithmic factor larger than the smallest possible hypothesis! • Also applies in settings where the target hypothesis is characterized by a conjunction (or disjunction) of functions produced by induction algorithms • e.g., algorithms for learning DNF in n^O(log n) time [1] [1] (Verbeurgt, 1990)

  22. Embedded Approaches for Complex Logical Concepts • In this approach, the core method adds/removes features while inducing complex logical concepts • e.g., ID3 [1] and C4.5 [2] • Greedy search through the space of decision trees • At each stage, select the attribute that best discriminates among the classes, using an evaluation function (usually based on information theory) [1] (Quinlan, 1983) [2] (Quinlan, 1993)
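As a concrete illustration of that evaluation step, here is a small sketch of information gain, the information-theoretic criterion ID3 uses when choosing the attribute for a node; the helper names and the data format (lists of (features, label) pairs) are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(sample, i):
    """Reduction in label entropy obtained by splitting on feature i.
    `sample` is a list of (features, label) pairs."""
    labels = [y for _, y in sample]
    gain = entropy(labels)
    for value in set(x[i] for x, _ in sample):
        subset = [y for x, y in sample if x[i] == value]
        gain -= (len(subset) / len(sample)) * entropy(subset)
    return gain

def best_attribute(sample, candidates):
    """The greedy choice made at each stage of tree construction."""
    return max(candidates, key=lambda i: information_gain(sample, i))
```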

  23. Embedded Approaches: Scalability Issues • Experimental studies [1] suggest that decision-list learners scale linearly with an increase in irrelevant features for some target concepts • For other target concepts, they exhibit exponential growth • Kira and Rendell (1992) show a substantial decrease in accuracy when irrelevant features are inserted into a Boolean target concept [1] (Langley and Sage, 1997)

  24. Embedded Approaches: Remedies • The problems are caused by the reliance on greedy selection of attributes to discriminate among classes. • Some researchers [1] have attempted to replace the greedy approach with look-ahead techniques. • Others let the greedy search take larger steps [2]. • None has been able to handle scaling effectively. [1] Norton, 1989 [2] (Matheus and Rendell, 1989; Pagallo and Haussler, 1990)

  25. Filter Approaches • Feature selection is done based on some general characteristics of the training set. • Independent of the induction algorithm used, and can therefore be combined with any such method. (John et al., 1994)

  26. A Simple Filtering Scheme • Evaluate each feature individually based on its correlation with the target function. • Select the k features with the highest values. • The best choice of k can be determined by testing on a holdout set. (Blum and Langley, 1997)
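A minimal sketch of this filter, assuming NumPy arrays and using the absolute Pearson correlation as the per-feature score; the function name is illustrative.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Filter: score each feature by |Pearson correlation| with the target,
    then keep the k highest-scoring feature indices.
    X: (n_examples, n_features) array; y: (n_examples,) array of labels."""
    scores = np.array([
        abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])
    ])
    scores = np.nan_to_num(scores)       # constant columns give NaN scores
    return np.argsort(scores)[::-1][:k]  # indices of the selected features
```

In practice one would sweep k over a range and keep the value that gives the best accuracy on a holdout set, as the slide suggests.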

  27. FOCUS Algorithm • Looks for the minimal combination of attributes that perfectly discriminates among the classes • Halts only when a pure partition of the training set is generated • Performance: under similar conditions, FOCUS was almost unaffected by the introduction of irrelevant attributes, whereas decision-tree accuracy degraded significantly. • [Figure: candidate feature subsets {f1, f2, f3, …, fn} with features progressively struck out] (Almuallim and Dietterich, 1991)
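A minimal sketch of the kind of search FOCUS performs, assuming discrete features and a sample given as (features, label) pairs: examine subsets in order of increasing size and return the first one under which no two examples agree on the selected features yet disagree on the label. The exhaustive enumeration is what makes the worst case exponential, matching the 2^n remark on slide 15.

```python
from itertools import combinations

def focus_search(sample):
    """Return the smallest feature subset that yields a pure partition of
    the sample, i.e. no two examples share values on the subset but differ
    in label. `sample` is a list of (features, label) pairs."""
    n = len(sample[0][0])
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            seen = {}
            consistent = True
            for x, y in sample:
                key = tuple(x[i] for i in subset)
                if seen.setdefault(key, y) != y:
                    consistent = False  # same projection, different labels
                    break
            if consistent:
                return list(subset)
    return list(range(n))  # fall back to all features (noisy data)
```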

  28. Comparing Various Filter Approaches • [Comparison table from Blum and Langley, 1997]

  29. Wrapper Approaches (1/2) • Motivation: the features selected should depend not only on the relevance of the data, but also on the learning algorithm. (John et al., 1994)

  30. Wrapper Approaches (2/2) • Advantage: the inductive method that uses the feature subset provides a better estimate of accuracy than a separate measure that may have an entirely different inductive bias. • Disadvantage: computational cost, which results from calling the induction algorithm for each feature set considered. • Modifications: • Caching decision trees • Reducing the percentage of training cases
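A minimal wrapper sketch: greedy forward selection in which each candidate subset is scored by the cross-validated accuracy of the induction algorithm itself. The use of scikit-learn's DecisionTreeClassifier and cross_val_score is an assumption for illustration; any learner could be substituted, and the stopping rule follows the "stop when nothing improves" criterion from slide 17.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_forward_selection(X, y, learner=None, cv=5):
    """Greedy forward selection wrapped around an induction algorithm."""
    learner = learner or DecisionTreeClassifier()
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score each single-feature extension of the current subset by
        # cross-validated accuracy of the wrapped learner.
        scores = {
            j: cross_val_score(learner, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break  # stop when nothing improves
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return selected
```

The repeated calls to the learner inside the loop are exactly the computational cost the slide flags as the wrapper's main disadvantage.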

  31. OBLIVION Algorithm • It carries out a backward elimination search through the space of feature sets. • Start with all the features and iteratively remove the one whose removal leads to the tree with the greatest improvement in estimated accuracy. • Continue this process as long as the estimated accuracy improves or remains constant. (Langley et al., 1994)
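A sketch of just the backward-elimination search, with the accuracy estimate left abstract so any induction method and evaluation scheme can be plugged in; it does not reproduce OBLIVION's own tree-based estimates.

```python
def backward_elimination(features, estimate_accuracy):
    """Backward elimination: repeatedly drop the feature whose removal gives
    the best estimated accuracy, as long as accuracy does not decline.
    `estimate_accuracy` maps a set of feature indices to an accuracy value."""
    current = set(features)
    best = estimate_accuracy(current)
    while len(current) > 1:
        candidates = {f: estimate_accuracy(current - {f}) for f in current}
        f_drop, score = max(candidates.items(), key=lambda kv: kv[1])
        if score < best:
            break  # every removal now hurts estimated accuracy
        current.remove(f_drop)
        best = score
    return current
```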

  32. Comparing Various Wrapper Approaches • [Comparison table from Blum and Langley, 1997]

  33. Feature Selection vs. Feature Weighting

  34. Winnow Algorithm • Initialize the weights w1, w2, …, wn of the features to 1. • Given an example (x1, …, xn), output 1 if w1x1 + … + wnxn ≥ n, and output 0 otherwise. • If the algorithm predicts negative on a positive example: for each xi equal to 1, double the value of wi. • If the algorithm predicts positive on a negative example: for each xi equal to 1, cut the value of wi in half. (Littlestone, 1988)
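A minimal runnable sketch of the update rule above for Boolean features; the function name and streaming interface are illustrative. Winnow belongs under feature weighting because its multiplicative updates keep the mistake bound growing only logarithmically with the number of irrelevant attributes (Littlestone, 1988).

```python
def winnow(stream, n):
    """Online Winnow. `stream` yields (x, label) pairs, where x is a 0/1
    tuple of length n and label is 0 or 1. Returns the final weights."""
    w = [1.0] * n
    for x, label in stream:
        prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
        if prediction == 0 and label == 1:    # false negative: promote
            w = [wi * 2 if xi == 1 else wi for wi, xi in zip(w, x)]
        elif prediction == 1 and label == 0:  # false positive: demote
            w = [wi / 2 if xi == 1 else wi for wi, xi in zip(w, x)]
    return w
```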

  35. References • A.L. Blum and P. Langley, Selection of relevant features and examples in machine learning. Artificial Intelligence 97 (1–2), pp. 245–271, (1997). • D. Aha, A study of instance-based algorithms for supervised learning tasks: mathematical, empirical and psychological evaluations. University of California, Irvine, CA, (1990). • K. Verbeurgt, Learning DNF under the uniform distribution in quasi-polynomial time. In: Proceedings 3rd Annual Workshop on Computational Learning Theory, San Francisco, CA, Morgan Kaufmann, San Mateo, CA, pp. 314–325, (1990). • T.M. Cover and P.E. Hart, Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13, pp. 21–27, (1967). • P. Langley and W. Iba, Average-case analysis of a nearest neighbor algorithm. In: Proceedings IJCAI-93, pp. 889–894, (1993).

  36. References (contd.) • G.H. John, R. Kohavi and K. Pfleger, Irrelevant features and the subset selection problem. In: Proceedings 11th International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, San Mateo, CA, pp. 121–129, (1994). • J.R. Quinlan, Learning efficient classification procedures and their application to chess end games. In: R.S. Michalski, J.G. Carbonell and T.M. Mitchell, Editors, Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo, CA, (1983). • J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, (1993). • C.J. Matheus and L.A. Rendell, Constructive induction on decision trees. In: Proceedings IJCAI-89, Detroit, MI, Morgan Kaufmann, San Mateo, CA, pp. 645–650, (1989). • N. Littlestone, Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2, pp. 285–318, (1988).
