
Data Mining Chapter 5 Credibility: Evaluating What’s Been Learned

Presentation Transcript


  1. Data Mining Chapter 5: Credibility: Evaluating What’s Been Learned. Kirk Scott

  2. One thing you’d like to do is evaluate the performance of a data mining algorithm • More specifically, you’d like to evaluate results obtained by applying a data mining algorithm to a certain data set • For example, can you estimate what percent of new instances it will classify correctly?

  3. Problems include: • The performance level on the training set is not a good indicator of performance on other data sets • Data may be scarce, so that enough data for separate training and test data sets may be hard to come by • A related question is how to compare the performance of 2 different algorithms

  4. Parameters for evaluation and comparison: • Are you predicting classification? • Or are you predicting the probability that an instance falls into a classification? • Or are you doing numeric prediction? • What does performance evaluation mean for things other than prediction, like association rules?

  5. Can you define the cost of a misclassification or other “failure” of machine learning? • There are false positives and false negatives • Different kinds of misclassifications will have different costs in practice

  6. There are broad categories of answers to these questions • For evaluation of one algorithm, a large amount of data makes estimating performance easy • For smaller amounts of data, a technique called cross-validation is commonly used • Comparing the performance of different data mining algorithms relies on statistics

  7. 5.1 Training and Testing

  8. For tasks like prediction, a natural performance measure is the error rate • This is the proportion of incorrect predictions out of all predictions made • How can this be estimated?

  9. A rule set, a tree, etc. may be imperfect • It may not classify all of the training set instances correctly • But it has been derived from the training set • In effect, it is optimized for the training set • Its error rate on an independent test set is likely to be much higher

  10. The error rate on the training set is known as the resubstitution error • You put the same data into the classifier that was used to create it • The resubstitution error rate may be of interest • But it is not a good estimate of the “true” error rate

  11. In a sense, you can say that the classifier, by definition, is overfitted on the training set • If the training set is a very large sample, its error rate could be closer to the true rate • Even so, as time passes and conditions change, new instances might not fall into quite the same distribution as the training instances and will have a higher rate of misclassification

  12. Not surprisingly, the way to estimate the error rate is with a test set • You’d like both the training set and the test set to be representative samples of all possible instances • It is important that the training set and the test set be independent • Any test data should have played no role in the training of the classifier

  13. There is actually a third set to consider • The validation set • It may serve either of these two purposes: • Selecting one of several possible data mining algorithms • Optimizing the one selected

  14. All three sets should be independent • Train only with the training set • Validate only with the validation set • Test only with the test set • In general, error rate estimation is done on the test set • After all decisions are made, it is permissible to combine all of the data sets and retrain the final classifier on this superset for better results

  15. 5.2 Predicting Performance

  16. This section can be dealt with quickly • Its main conclusions are based on statistical concepts, which are presented in a box • You may be familiar with the derivations from statistics class • They are beyond the scope of this course

  17. These are the basic ideas: • Think in terms of a success rate rather than an error rate • Based on a sample of n instances, you have an observed success rate in the test data set

  18. Statistically, you can derive a confidence interval around the observed rate that depends on the sample size • Doing this provides more complete knowledge about the real success rate, whatever it might be, based on the observed success rate
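
As a concrete illustration (this is a sketch of the standard Wilson score interval for a proportion; the exact formula derived in the book's box may differ in detail), the interval can be computed directly from the observed success rate and the test set size:

    import math

    def success_rate_interval(successes, n, z=1.96):
        """Wilson score interval for the true success rate, given
        `successes` correct predictions out of n test instances.
        z = 1.96 corresponds to roughly 95% confidence."""
        f = successes / n                      # observed success rate
        denom = 1 + z**2 / n
        center = (f + z**2 / (2 * n)) / denom
        half_width = (z / denom) * math.sqrt(f * (1 - f) / n + z**2 / (4 * n**2))
        return center - half_width, center + half_width

    # Example: 750 correct out of 1000 test instances
    low, high = success_rate_interval(750, 1000)
    print(f"observed rate 0.750, approximate 95% interval ({low:.3f}, {high:.3f})")

The larger the test set, the tighter the interval around the observed rate.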

  19. 5.3 Cross-Validation

  20. Cross-validation is the general term for techniques that can be used to select training/testing data sets and estimate performance when the overall data set is limited in size • In simple terms, this might be a reasonable rule of thumb: • Hold out 1/3 of the data for testing, for example, and use the rest for training (and validation)
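
A minimal sketch of this holdout rule of thumb in Python (the function name and the use of NumPy arrays are illustrative assumptions, not anything prescribed by the book):

    import numpy as np

    def holdout_split(X, y, test_fraction=1/3, seed=0):
        """Randomly hold out a fraction of the instances for testing
        and return the rest for training (X and y are NumPy arrays)."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        n_test = int(round(len(y) * test_fraction))
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        return X[train_idx], y[train_idx], X[test_idx], y[test_idx]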

  21. Statistical Problems with the Sample • Even if you do a proper random sample, poorly matched test and training sets can lead to poor error estimates • Suppose that all of the instances of one classification fall into the test set and none fall into the training set

  22. The rules resulting from training will not be able to classify those test set instances • The error rate will be high (justifiably) • But a different selection of training set and test set would give rules that covered that classification • And when the rules are evaluated using the test set, a more realistic error rate would result

  23. Stratification • Stratification can be used to better match the test set and the training set • You don’t simply obtain the test set by random sampling • You randomly sample from each of the classifications in the overall data set in proportion to their occurrence there • This means all classifications are represented in both the test and training sets
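
If scikit-learn is available, its train_test_split helper can produce a stratified holdout directly; a minimal sketch, using the iris data purely as a stand-in data set:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Hold out 1/3 for testing; stratify=y samples from each class in
    # proportion to its frequency, so both sets reflect the class mix.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=0)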

  24. Repeated Holdout • With additional computation, you can improve on the error estimate obtained from a stratified sample alone • n times, randomly hold out 1/3 of the instances for testing (training on the rest), possibly using stratification • Average the error rates over the n repetitions • This is called repeated holdout
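
A rough sketch of repeated holdout, assuming scikit-learn is available and using a decision tree purely as a stand-in classifier:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    error_rates = []
    for seed in range(10):                         # n = 10 repetitions
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1/3, stratify=y, random_state=seed)
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        error_rates.append(1 - model.score(X_te, y_te))   # error = 1 - accuracy
    print("repeated-holdout error estimate:", np.mean(error_rates))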

  25. Cross-Validation • The idea of multiple samples can be generalized • Partition the data into n “folds” (partitions) • Do the following n times: • Train on (n – 1) of the partitions • Test on the remaining 1 • Average the error rates over the n times
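
A sketch of plain n-fold cross-validation written out by hand (model_factory is a hypothetical callable that returns a fresh, untrained classifier with scikit-learn-style fit/predict methods; X and y are assumed to be NumPy arrays):

    import numpy as np

    def cross_validation_error(model_factory, X, y, n_folds=10, seed=0):
        """Partition the data into n_folds folds; train on n - 1 folds,
        test on the held-out fold, and average the error rates."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        errors = []
        for i in range(n_folds):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            model = model_factory().fit(X[train_idx], y[train_idx])
            errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
        return np.mean(errors)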

  26. Stratified 10-Fold Cross-Validation • The final refinement of cross validation is to make the partitions so that all classifications are roughly in proportion • The standard rule of thumb is to use 10 partitions • Why? • The answer boils down to this essentially: • Human beings have 10 fingers…

  27. In general, it has been observed that 10 partitions lead to reasonable results • 10 partitions is not computationally out of the question

  28. The final refinement presented in the book is this: • If you want really good error estimates, do 10-fold cross validation 10 times with different stratified partitions and average the results • At this point, estimating error rates has become a computationally intensive task
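
With scikit-learn, the whole 10-times-10-fold stratified procedure can be expressed in a few lines; a sketch, again using a decision tree and the iris data only as stand-ins:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    print("estimated error rate:", 1 - scores.mean())   # averaged over 100 train/test runs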

  29. 5.4 Other Estimates

  30. 10 times 10-fold stratified cross-validation is the current standard for estimating error rates • There are other methods, including these two: • Leave-One-Out Cross-Validation • The Bootstrap

  31. Leave-One-Out Cross-Validation • This is basically n-fold cross validation taken to the max • For a data set with n instances, you hold out one for testing and train on the remaining (n – 1) • You do this for each of the instances and then average the results

  32. This has these advantages: • It’s deterministic: • There’s no sampling • In a sense, you maximize the information you can squeeze out of the data set

  33. It has these disadvantages: • It’s computationally intensive • By definition, a holdout of 1 can’t be stratified • By definition, the classification of a single instance will not conform to the distribution of classifications in the remaining instances • This can lead to poor error rate estimates
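
A sketch of leave-one-out estimation, assuming scikit-learn and using a decision tree as a stand-in classifier:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    wrong = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        wrong += int(model.predict(X[test_idx])[0] != y[test_idx][0])
    print("leave-one-out error rate:", wrong / len(y))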

  34. The Bootstrap • This is another technique that ultimately relies on statistics (presented in a box) • The question underlying the development and use of the bootstrap technique is this: • Is there a way of estimating the error rate that is especially well-suited to small data sets?

  35. The basic statistical idea is this: • Do not just randomly select a test set from the data set • Instead, select a training set (the larger of the two sets) using sampling with replacement

  36. Sampling with replacement in this way will lead to these results: • Some of the instances will not be selected for the training set • These instances will be the test set • By definition, you expect duplicates in the training set

  37. Having duplicates in the training set will mean that it is a poorer match to the test set • The error estimate is skewed to the high side • This is corrected by combining this estimate with the resubstitution error rate, which is naturally skewed to the low side

  38. This is the statistically based formula for the overall error rate: • Overall error rate = 0.632 × (test set error rate) + 0.368 × (training set error rate) • To improve the estimate, randomly sample with replacement multiple times and average the results
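
A sketch of the 0.632 bootstrap, with a decision tree and the iris data as stand-ins; the instances never drawn into the training sample serve as the test set:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)
    n = len(y)
    estimates = []
    for _ in range(50):                              # repeat and average
        boot = rng.integers(0, n, size=n)            # sample n instances with replacement
        oob = np.setdiff1d(np.arange(n), boot)       # never-selected instances = test set
        model = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
        test_err = np.mean(model.predict(X[oob]) != y[oob])
        resub_err = np.mean(model.predict(X[boot]) != y[boot])
        estimates.append(0.632 * test_err + 0.368 * resub_err)
    print("0.632 bootstrap error estimate:", np.mean(estimates))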

  39. Like the leave-one-out technique, there are cases where this technique does not give good results • This depends on the distribution of the classifications in the overall data set • Sampling with replacement in some cases can lead to the component error rates being skewed in such a way that they don’t compensate for each other

  40. 5.5 Comparing Data Mining Schemes

  41. This section can be dealt with quickly because it is largely based on statistical derivations presented in a box • The fundamental idea is this: • Given a data set (or data sets) and two different data mining algorithms, you’d like to choose which one is best

  42. For researchers in the field, this is the broader question: • Across all possible data sets in a given problem domain (however that may be defined), which algorithm is superior overall? • We’re really only interested in the simpler, applied question

  43. At heart, to compare two algorithms, you compare their estimated error or success rates • In a previous section it was noted that you can get a confidence interval for an estimated success rate • In a situation like this, you have two probabilistic quantities you want to compare

  44. The paired t-test is an established statistical technique for comparing two such quantities and determining if they are significantly different
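
A minimal sketch with SciPy; the per-fold error rates below are made-up illustrative numbers, not results from any real experiment:

    from scipy.stats import ttest_rel

    # Per-fold error rates for two algorithms evaluated on the same
    # cross-validation folds (illustrative numbers only).
    errors_a = [0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.12, 0.16, 0.10, 0.12]
    errors_b = [0.14, 0.17, 0.13, 0.15, 0.14, 0.16, 0.13, 0.18, 0.12, 0.15]

    t_stat, p_value = ttest_rel(errors_a, errors_b)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
    # A small p-value suggests the difference between the two error
    # rates is statistically significant rather than due to chance.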

  45. 5.6 Predicting Probabilities

  46. The discussion so far has had to do with evaluating schemes that do simple classification • Either an instance is predicted to be in a certain classification or it isn’t • Numerically, you could say a successful classification was a 1 and an unsuccessful classification was a 0 • In this situation, evaluation boiled down to counting the number of successes

  47. Quadratic Loss Function • Some classification schemes are more finely tuned • Instead of just predicting a classification, they produce the following: • For each instance examined, a probability that that instance falls into each of the classes

  48. The set of k probabilities can be thought of as a vector of length k containing elements p_i • Each p_i represents the probability of one of the classifications for that instance • The p_i sum to 1

  49. The following discussion will be based on the case of one instance, not initially considering the fact that a data set will contain many instances • How do you judge the goodness of the vector of p_i values that a data mining algorithm produces?

  50. Evaluation is based on what is known as a loss function • Loss measures the predicted probabilities against the actual classification that is observed • A good prediction of probabilities should have a low loss value • In other words, the difference between what is observed and the predicted probabilities is small
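
For a single instance with predicted probabilities p_1, ..., p_k, the quadratic loss can be written as the sum over classes of (p_j - a_j)^2, where a_j is 1 for the class actually observed and 0 otherwise; a small sketch:

    def quadratic_loss(probs, actual_class):
        """Quadratic loss for one instance: sum over classes of
        (p_j - a_j)^2, where a_j is 1 for the actual class, else 0."""
        return sum((p - (1.0 if j == actual_class else 0.0)) ** 2
                   for j, p in enumerate(probs))

    # A confident, correct prediction gives a small loss ...
    print(quadratic_loss([0.9, 0.05, 0.05], actual_class=0))   # ~0.015
    # ... while a confident, wrong prediction gives a large one.
    print(quadratic_loss([0.05, 0.9, 0.05], actual_class=0))   # ~1.715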
