Learning Agents Laboratory Computer Science Department George Mason University

CS 782 Machine Learning 5. Evaluation of Empirical Inductive Learners Prof. Gheorghe Tecuci Learning Agents Laboratory Computer Science Department George Mason University

Overview Introduction Computational learning theory Empirical evaluation: Single partitioning Empirical evaluation: Resampling Recommended reading

Introduction
• Suppose we have collected a body of training examples, adopted a learning bias, implemented the learning algorithm, executed the algorithm, and learned the concept “c” represented by the examples.
• There are several questions we may ask about this process:
  • Can we believe that we have learned the right concept?
  • What is the likelihood that “c” will correctly classify previously unseen examples?
  • How can we have confidence that the concept “c” is approximately correct?
• There are two possible answers, a theoretical answer and an experimental one.

The Computational Learning Theory
Computational learning theory, pioneered by Valiant, is concerned with finding theoretical answers to the previous questions. In this theory, learning is viewed as function reconstruction.
Given: a set of input-output pairs {(x, f(x))} for a boolean function f : {0,1}^n → {0,1}.
Determine: an expression f1 that provides a good approximation of the boolean function f.
The Valiant framework provides bounds on the number of training examples required, for a given bias, in order to have high confidence that the learned hypothesis f1 is approximately correct. That is, how many training examples would one need so that the probability that the error rate of f1 is at most ε is at least 1 − δ:
Probability(error rate of f1 ≤ ε) ≥ 1 − δ

The Computational Learning Theory (cont.)
This style of analysis is called probably approximately correct (PAC) learning. The basic idea is to analyze the expressiveness of the hypothesis space. If a restricted hypothesis space H is very small, then it is unlikely that a learning algorithm could succeed by chance in finding a hypothesis f1 ∈ H that is consistent with the training examples. Therefore, if such an f1 is found, it is more likely to be a good approximation of the correct hypothesis.
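The argument above can be made quantitative in the simplest setting. The following is a sketch of the standard bound, under assumptions that the slides do not state explicitly: H is finite, the learner always returns a hypothesis consistent with all m training examples, and the examples are drawn independently from a fixed distribution. The symbols m and |H| are introduced here for illustration.

```latex
% Requires amsmath. Assumptions: finite H, consistent learner, i.i.d. examples.
\begin{align*}
% A single "bad" hypothesis (true error > epsilon) survives m examples with probability:
\Pr[\text{a fixed } h \text{ with error} > \varepsilon \text{ is consistent with } m \text{ examples}]
  &\le (1-\varepsilon)^m \\
% Union bound over at most |H| bad hypotheses, then 1 - x <= e^{-x}:
\Pr[\text{some } h \in H \text{ with error} > \varepsilon \text{ is consistent}]
  &\le |H|\,(1-\varepsilon)^m \le |H|\,e^{-\varepsilon m} \\
% Requiring this failure probability to be at most delta and solving for m:
|H|\,e^{-\varepsilon m} \le \delta
  \quad &\Longleftrightarrow \quad
  m \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
\end{align*}
```

For example, for conjunctions over n = 10 boolean variables (each variable appears positively, negatively, or not at all, so |H| = 3^10), with ε = 0.1 and δ = 0.05, the bound gives m ≥ 10 (10 ln 3 + ln 20) ≈ 139.8, so about 140 examples suffice. The bound grows only logarithmically with |H|, which is why a small hypothesis space requires few examples.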

The Computational Learning Theory (cont.) The theoretical analysis has provided insight into the relationship between: - the number of training examples, - the bias of the learning algorithm, and - the confidence that we can have in the hypothesis f1 produced by the algorithm. This analysis has been successful only for simple learning algorithms. Most applied work in machine learning employs experimental techniques for determining the correctness of f1.

Simple partitioning: the holdout method 1. The available examples are randomly broken into two disjoint groups: the training set and the testing set; 2. The concept is learned by using only the examples from the training set; 3. The learned concept is then used to classify examples from the testing set; 4. The obtained results are compared with the correct classification to produce an error rate.
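The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the 70/30 split, the fixed random seed, and the toy one-dimensional nearest-neighbor learner are all assumptions made for the example.

```python
import random


def learn_nearest_neighbor(train):
    """Hypothetical toy learner: 1-nearest-neighbor over numeric inputs."""
    def classify(x):
        # Return the label of the training example whose input is closest to x.
        _, nearest_label = min(train, key=lambda example: abs(example[0] - x))
        return nearest_label
    return classify


def holdout_error(examples, learn, train_fraction=0.7, seed=0):
    """Estimate the error rate of `learn` from one random train/test split.

    examples: list of (x, label) pairs
    learn:    function mapping a training list to a classifier x -> label
    """
    # Step 1: randomly break the examples into two disjoint groups.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    train, test = shuffled[:n_train], shuffled[n_train:]
    if not train or not test:
        raise ValueError("training and testing sets must both be non-empty")
    # Step 2: learn the concept from the training set only.
    concept = learn(train)
    # Steps 3 and 4: classify the test examples and count disagreements.
    errors = sum(1 for x, label in test if concept(x) != label)
    return errors / len(test)


# Toy data: the target concept is "x is at least 50".
examples = [(x, x >= 50) for x in range(100)]
error_rate = holdout_error(examples, learn_nearest_neighbor)
print(f"holdout error rate: {error_rate:.3f}")
```

Changing the seed changes the partition and may change the estimate, which is one way to observe how sensitive a single split is.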

Discussion How does the number of examples affect the result of the evaluation? How does the distribution of examples affect the result of the evaluation? How can we evaluate a learner if there are very few examples? How can examples be reused?

Resampling: the leave-one-out method Let n be the number of available examples. A concept is learned from n-1 examples and is tested on the remaining example. This is repeated n times, each time leaving out a different example. The error rate is the total number of errors on the n single test cases divided by n.
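The procedure can be sketched as follows. The function names, the toy nearest-neighbor learner, and the seven-example data set are assumptions chosen for illustration; any learner with the same interface could be substituted.

```python
def learn_nearest_neighbor(train):
    """Hypothetical toy learner: 1-nearest-neighbor over numeric inputs."""
    def classify(x):
        _, nearest_label = min(train, key=lambda example: abs(example[0] - x))
        return nearest_label
    return classify


def leave_one_out_error(examples, learn):
    """Leave-one-out error rate: n rounds, each testing on one held-out example."""
    n = len(examples)
    if n < 2:
        raise ValueError("need at least two examples")
    errors = 0
    for i in range(n):
        held_out_x, held_out_label = examples[i]
        # Learn from the other n-1 examples.
        concept = learn(examples[:i] + examples[i + 1:])
        # Test on the single example that was left out.
        if concept(held_out_x) != held_out_label:
            errors += 1
    # Total errors over the n single test cases, divided by n.
    return errors / n


# Toy data: the point 4.5 lies closer to class "a" than to its own class "b".
data = [(1, "a"), (2, "a"), (3, "a"), (4.5, "b"), (7, "b"), (8, "b"), (9, "b")]
loo_error = leave_one_out_error(data, learn_nearest_neighbor)
print(f"leave-one-out error rate: {loo_error:.3f}")
```

In this toy data set only the example at 4.5 is misclassified when it is held out, so the estimate is 1/7. Note that the learner is run n times, which becomes expensive for large n or slow learners.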

Discussion How is the error estimate likely to compare with single partitioning? What about the repeatability of the experimental results? Why is this important? What is a likely problem with the leave-one-out method and how could it be avoided?

Resampling: the cross-validation method In k-fold cross-validation, the cases are randomly divided into k (usually 10) mutually disjoint sets of approximately equal size (of at least 30 examples). The concept is learned from the examples in k-1 sets, and is tested on the examples from the remaining set. This is repeated k times, once for each set (i.e. each set is used once as a test set). The average of the error rates over all k sets is the cross-validated error rate.
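A minimal sketch of k-fold cross-validation follows. The function names, the way folds are formed (shuffle, then deal examples round-robin), the fixed seed, and the toy nearest-neighbor learner are assumptions made for this example.

```python
import random


def learn_nearest_neighbor(train):
    """Hypothetical toy learner: 1-nearest-neighbor over numeric inputs."""
    def classify(x):
        _, nearest_label = min(train, key=lambda example: abs(example[0] - x))
        return nearest_label
    return classify


def cross_validation_error(examples, learn, k=10, seed=0):
    """Cross-validated error rate: the mean of the k per-fold error rates."""
    if not 2 <= k <= len(examples):
        raise ValueError("k must be between 2 and the number of examples")
    # Randomly divide the cases into k disjoint folds of nearly equal size.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    fold_error_rates = []
    for i, test in enumerate(folds):
        # Learn from the k-1 other folds, test on fold i.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        concept = learn(train)
        errors = sum(1 for x, label in test if concept(x) != label)
        fold_error_rates.append(errors / len(test))
    return sum(fold_error_rates) / k


# Toy data: the target concept is "x is at least 50".
examples = [(x, x >= 50) for x in range(100)]
cv_error = cross_validation_error(examples, learn_nearest_neighbor, k=10)
print(f"10-fold cross-validated error rate: {cv_error:.3f}")
```

Setting k equal to the number of examples makes every fold a single example, which reduces this procedure to the leave-one-out method. The toy data set has only 10 examples per fold, fewer than the 30 suggested above; it is kept small for readability.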

Resampling vs single partitioning Resampling is a powerful idea. With a single train and test partition, too few cases in the training group can lead to the learning of a poor concept, while too few test cases can lead to erroneous error estimates. Resampling allows for more accurate estimates of the error rates while training on most cases. Resampling allows the duplication of the analysis conditions in future experiments on the same data.

Discussion How could we compare two learning algorithms? What can be said about the result of the comparison?

Other types of experiments • Determine other characteristics of the learning methods: • the speed of learning; • the asymptotic behavior and the number of examples needed to approximate this behavior; • predictive accuracy versus concept complexity; • the influence of different types of noise on the predictive accuracy; • the influence of different biases on the predictive accuracy; • etc.

Recommended reading
Mitchell, T.M., Machine Learning, Chapter 5: Evaluating Hypotheses, pp. 128-153, McGraw Hill, 1997.
Weiss, S.M., Kapouleas, I., An Experimental Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods, in Readings in Machine Learning.
Kibler, D., Langley, P., Machine Learning as an Experimental Science, in Readings in Machine Learning.
