
Machine Learning and Bioinformatics 機器學習與生物資訊學

Evaluation: the key to success. Three datasets, for which the answers must be known. Note on parameter tuning: it is important that the testing data is not used in any way to create the classifier.

Presentation Transcript


  1. Machine Learning and Bioinformatics 機器學習與生物資訊學 Machine Learning & Bioinformatics

  2. Evaluation The key to success Machine Learning and Bioinformatics

  3. Three datasets for which the answers must be known Machine Learning and Bioinformatics

  4. Note on parameter tuning • It is important that the testing data is not used in any way to create the classifier • Some learning schemes operate in two stages • build the basic structure • optimize parameters • The testing data cannot be used for parameter tuning • proper procedure uses three sets: training, tuning and testing data Machine Learning and Bioinformatics
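
As a hedged illustration (not part of the original slides), the sketch below shows one way to obtain separate training, tuning and testing sets with scikit-learn; the toy dataset and the 60/20/20 proportions are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset (placeholder): 100 samples, 5 features, imbalanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 80 + [1] * 20)

# First carve off 20% for final testing, then split the rest into training
# and tuning sets; the testing set is never touched during parameter tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_tune, y_train, y_tune = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)  # 0.25 of 80% = 20%
```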

  5. Data is usually limited • Error on the training data is NOT a good indicator of performance on future data • otherwise 1-NN would be the optimum classifier • Not a problem if lots of (answered) data is available • split data into training, tuning and testing sets • However, (answered) data is usually limited • More sophisticated techniques need to be used Machine Learning and Bioinformatics

  6. Issues in evaluation • Statistical reliability of estimated differences in performance (significance tests) • Choice of performance measures • number of correctly classified samples • ratio of correctly classified samples • error in numeric predictions • Costs assigned to different types of errors • many practical applications involve costs Machine Learning and Bioinformatics

  7. Training and testing sets • Testing set must play no part, including parameter tuning, in classifier formation • Ideally, both training and testing sets are representative samples of the underlying problem, but they may differ in nature • e.g., we get data from two different towns, A and B, and want to estimate the performance of our classifier in a completely new town Machine Learning and Bioinformatics

  8. Which (training vs. tuning/testing) should be more similar to the target new town? Machine Learning and Bioinformatics

  9. Making the most of the data • Once evaluation is complete, all the data can be used to build the final classifier for real (unknown) data • A dilemma • generally, the larger the training data the better the classifier (but returns diminish) • the larger the testing data the more accurate the error estimate Machine Learning and Bioinformatics

  10. Holdout procedure • Method of splitting original data into training and testing sets • Reserve a certain amount for testing and use the remainder for training • usually one third for testing and the rest for training • The samples might not be representative • e.g., a class might be missing in the testing data • Stratification • ensures that each class is represented with approximately equal proportions in both subsets Machine Learning and Bioinformatics
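
A minimal sketch of the stratified holdout procedure, reusing the toy `X` and `y` defined in the earlier tuning sketch; one third is reserved for testing, as on the slide.

```python
from sklearn.model_selection import train_test_split

# Stratified holdout: one third for testing, class proportions preserved
# in both subsets (X and y come from the earlier sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

print(y_train.mean(), y_test.mean())  # proportion of class 1 is roughly equal
```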

  11. Repeated holdout procedure • Holdout procedure can be made more reliable by repeating the process with different subsamples • in each iteration, a certain proportion is randomly selected for testing (possibly with stratification) • the error rates on the different iterations are averaged to yield an overall error rate • This is called the repeated holdout procedure • A problem is that the different testing sets overlap Machine Learning and Bioinformatics
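
A sketch of the repeated holdout procedure under the same assumptions; the logistic-regression classifier is only a placeholder, not a method prescribed by the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def repeated_holdout_error(X, y, n_repeats=10, test_size=1/3):
    """Average the test error over several random stratified holdout splits."""
    errors = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        errors.append(np.mean(clf.predict(X_te) != y_te))
    return float(np.mean(errors))  # note: the different testing sets overlap
```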

  12. Cross-validation • Cross-validation avoids overlapping test sets • split data into n subsets of equal size • use each subset in turn for testing, the remainder for training • the error estimates are averaged to yield an overall error estimate • Called n-fold cross-validation • Often the subsets are stratified before the cross-validation is performed Machine Learning and Bioinformatics
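
A sketch of stratified n-fold cross-validation with scikit-learn (here n = 10); the decision-tree classifier and the data from the first sketch are placeholders.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10-fold stratified cross-validation: each subset is used once for testing,
# the remaining nine for training; the fold scores are then averaged.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("estimated accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```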

  13. More on cross-validation • Stratified ten-fold cross-validation • Why ten? • extensive experiments have shown that this is the best choice to get an accurate estimate • there is also some theoretical evidence for this • Repeated stratified cross-validation • e.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance) Machine Learning and Bioinformatics
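
Ten-times-repeated ten-fold stratified cross-validation can be sketched as follows; again, the classifier and data are placeholders.

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Repeat 10-fold CV ten times with different random partitions and average
# all 100 fold scores; this reduces the variance of the estimate.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())
```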

  14. Leave-One-Out cross-validation • A particular form of cross-validation • set number of folds to number of training instances • Makes best use of the data and involves no random subsampling (advantages) • Very computationally expensive (disadvantage) Machine Learning and Bioinformatics
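
Leave-one-out cross-validation in the same style; with n instances the classifier is trained n times, which is what makes it expensive. The 3-NN classifier is a placeholder.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# One fold per instance: no random subsampling, but n classifiers to train.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y,
                         cv=LeaveOneOut())
print("LOO-CV accuracy:", scores.mean())  # each individual score is 0 or 1
```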

  15. LOO-CV and stratification • Stratification is not possible • there is only one instance in the testing set • An extreme example • random dataset split equally into two classes • best inducer predicts majority class • 50% accuracy on fresh data • LOO-CV estimate is 100% error Machine Learning and Bioinformatics
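
The extreme example on the slide can be checked directly; the sketch below uses a majority-class predictor on a 50/50 random dataset and reproduces the 100% LOO-CV error.

```python
import numpy as np

# 100 instances split equally into two classes; the features are random noise,
# so the best an inducer can do is predict the majority class of its training data.
y = np.array([0] * 50 + [1] * 50)

errors = []
for i in range(len(y)):
    train = np.delete(y, i)                  # leave instance i out
    majority = np.bincount(train).argmax()   # 49 vs. 50 -> always the *other* class
    errors.append(majority != y[i])

print("LOO-CV error estimate:", np.mean(errors))  # 1.0, although 50% accuracy is expected on fresh data
```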

  16. Cost Machine Learning and Bioinformatics

  17. Counting the cost • In practice, different types of classification errors often incur different costs • Examples • terrorist profiling, where always predicting ‘negative’ achieves 99.99% accuracy • loan decisions • oil-slick detection • fault diagnosis • promotional mailing Machine Learning and Bioinformatics

  18. Confusion matrix Machine Learning and Bioinformatics
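
The matrix itself is not reproduced in this transcript; as a generic stand-in, the sketch below builds a two-class confusion matrix with scikit-learn from hypothetical labels and predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # hypothetical actual classes
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])   # hypothetical predicted classes

# Rows are actual classes, columns are predicted classes (labels sorted 0, 1):
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```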

  19. Classification with costs • Two cost matrices • Error rate is replaced by average cost per prediction Machine Learning and Bioinformatics

  20. Cost-sensitive learning • A basic idea is to only predict the high-cost class when very confident about the prediction • Instead of predicting the most likely class, we should make the prediction that minimizes the expected cost • dot product of class probabilities and appropriate column in cost matrix • choose column (class) that minimizes expected cost • This happens at prediction time, not at training time • Most learning schemes do not perform cost-sensitive learning • they generate the same classifier no matter what costs are assigned to the different classes Machine Learning and Bioinformatics
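
A minimal sketch of cost-sensitive prediction at classification time, as described on the slide: the expected cost of each possible prediction is the dot product of the class probabilities with the corresponding column of the cost matrix. The specific cost values and probabilities are made up for the example.

```python
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i (assumed values).
cost = np.array([[0.0,  1.0],    # true class 0: correct, false positive
                 [10.0, 0.0]])   # true class 1: false negative (expensive), correct

def min_expected_cost_class(class_probs, cost):
    """Choose the column (predicted class) that minimizes the expected cost."""
    expected = class_probs @ cost          # expected[j] = sum_i P(i) * cost[i, j]
    return int(np.argmin(expected))

probs = np.array([0.7, 0.3])               # classifier's probability estimates
print(min_expected_cost_class(probs, cost))
# Predicts class 1 even though class 0 is more likely, because missing a
# class-1 instance costs ten times as much as a false alarm.
```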

  21. A simple method for cost­-sensitive learning Machine Learning and Bioinformatics

  22. Resampling of instances according to costs Machine Learning and Bioinformatics
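
The slide gives no further detail in this transcript; the sketch below shows one common realization of the idea, drawing a bootstrap sample in which instances of costly classes are picked proportionally more often. The cost values in the example are assumptions.

```python
import numpy as np

def resample_by_cost(X, y, class_cost, seed=0):
    """Bootstrap sample in which an instance's chance of being drawn is
    proportional to the misclassification cost of its class."""
    rng = np.random.default_rng(seed)
    weights = np.array([class_cost[label] for label in y], dtype=float)
    idx = rng.choice(len(y), size=len(y), replace=True, p=weights / weights.sum())
    return X[idx], y[idx]

# Example: make errors on class 1 count ten times as much as errors on class 0.
# X_res, y_res = resample_by_cost(X, y, class_cost={0: 1.0, 1: 10.0})
```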

  23. Measures Machine Learning and Bioinformatics

  24. Lift charts • In practice, costs are rarely known • Decisions are usually made by comparing possible scenarios • E.g., promotional mail to 1,000,000 households • mail to all; 0.1% respond (1000) • a data mining tool identifies subset of 100,000 most promising, 0.4% of these respond (400) • another tool identifies subset of 400,000 most promising, 0.2% respond (800) • Which is better? • A lift chart allows a visual comparison Machine Learning and Bioinformatics

  25. Generating a lift chart • Sort instances according to predicted probability of being positive • x-axis is sample size; y-axis is number of true positives Machine Learning and Bioinformatics
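
A sketch of the construction described on the slide: instances are sorted by predicted probability of being positive, and for every sample size the number of true positives found so far is recorded.

```python
import numpy as np

def lift_chart_points(y_true, scores):
    """x: sample size (top-k most promising instances), y: true positives in it."""
    order = np.argsort(scores)[::-1]              # most promising first
    hits = np.cumsum(np.asarray(y_true)[order])   # true positives among the top k
    sizes = np.arange(1, len(order) + 1)
    return sizes, hits

# Plotting sizes against hits (e.g. with matplotlib) gives the lift chart.
```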

  26. A hypothetical lift chart Machine Learning and Bioinformatics

  27. ROC curves • ROC curves are similar to lift charts • stands for “receiver operating characteristic” • used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel • Differences to lift chart • y-axis shows percentage of true positives in sample rather than absolute number • x-axis shows percentage of false positives in sample rather than sample size Machine Learning and Bioinformatics
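
A sketch of the corresponding ROC construction: the same sorting as for the lift chart, but both axes become percentages (true-positive rate against false-positive rate). Tied scores are ignored here for simplicity; scikit-learn's roc_curve handles them properly.

```python
import numpy as np

def roc_points(y_true, scores):
    """False-positive rate (x) and true-positive rate (y) as the decision
    threshold sweeps from the highest score downwards (labels are 0/1)."""
    y = np.asarray(y_true)[np.argsort(scores)[::-1]]
    tpr = np.cumsum(y) / y.sum()                   # hits / all positives
    fpr = np.cumsum(1 - y) / (len(y) - y.sum())    # false alarms / all negatives
    return fpr, tpr
```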

  28. A sample ROC curve • Jagged curve → one set of test data • Smooth curve → use cross-validation Machine Learning and Bioinformatics

  29. More measures • Precision = TP / (TP + FP), percentage of reported samples that are positive • Recall = TP / (TP + FN), percentage of positive samples that are reported • Precision/recall curves have hyperbolic shape • Three-point average is the average precision at 20%, 50% and 80% recall • F-measure = 2 × precision × recall / (precision + recall), harmonic mean of precision and recall • makes precision and recall as equal as possible • Specificity = TN / (TN + FP), percentage of negative samples that are not reported • Area under the ROC curve (AUC) Machine Learning and Bioinformatics
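
These measures can be computed directly from the confusion-matrix counts; a short sketch, assuming binary 0/1 labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def summary_measures(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision   = tp / (tp + fp)       # reported samples that really are positive
    recall      = tp / (tp + fn)       # positive samples that get reported
    specificity = tn / (tn + fp)       # negative samples that are not reported
    f_measure   = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f_measure

# AUC needs scores/probabilities rather than hard predictions:
# auc = roc_auc_score(y_true, scores)
```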

  30. Summary of some measures Machine Learning and Bioinformatics

  31. Evaluating numeric prediction Same strategies, including independent testing sets, cross-validation, significance tests, etc. Machine Learning and Bioinformatics

  32. Measures in numeric prediction • Actual target values: a_1, a_2, ..., a_n • Predicted target values: p_1, p_2, ..., p_n • The most popular measure is mean squared error (MSE), ((p_1 - a_1)^2 + ... + (p_n - a_n)^2) / n, because it is easy to manipulate mathematically Machine Learning and Bioinformatics

  33. Other measures • Root mean squared error (RMSE) = sqrt(((p_1 - a_1)^2 + ... + (p_n - a_n)^2) / n) • Mean absolute error (MAE), (|p_1 - a_1| + ... + |p_n - a_n|) / n, is less sensitive to outliers than MSE • Sometimes relative error values are more appropriate Machine Learning and Bioinformatics

  34. Improvement on the mean • How much does the scheme improve on simply predicting the average? • Relative squared error = ((p_1 - a_1)^2 + ... + (p_n - a_n)^2) / ((ā - a_1)^2 + ... + (ā - a_n)^2) • Relative absolute error = (|p_1 - a_1| + ... + |p_n - a_n|) / (|ā - a_1| + ... + |ā - a_n|), where ā is the mean of the actual values Machine Learning and Bioinformatics
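
The numeric-prediction measures from the last three slides written out as a sketch (`a` holds the actual values, `p` the predictions):

```python
import numpy as np

def numeric_measures(a, p):
    """a: actual target values, p: predicted target values (same length)."""
    a, p = np.asarray(a, float), np.asarray(p, float)
    mse  = np.mean((p - a) ** 2)                                 # mean squared error
    rmse = np.sqrt(mse)                                          # root mean squared error
    mae  = np.mean(np.abs(p - a))                                # mean absolute error
    rse  = np.sum((p - a) ** 2) / np.sum((a.mean() - a) ** 2)    # relative squared error
    rae  = np.sum(np.abs(p - a)) / np.sum(np.abs(a.mean() - a))  # relative absolute error
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "RSE": rse, "RAE": rae}
```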

  35. Correlation coefficient / 相關係數 • Measures the statistical correlation between the predicted values and the actual values • Scale independent, between –1 and +1 • Good performance leads to values close to +1 Machine Learning and Bioinformatics
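
For the correlation coefficient, numpy's corrcoef can be used directly on the actual and predicted values; the numbers below are made up for the example.

```python
import numpy as np

a = np.array([2.0, 3.5, 1.0, 4.0])   # actual values (made-up example)
p = np.array([2.4, 3.0, 1.2, 3.6])   # predicted values (made-up example)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(p, a)[0, 1]
print(r)   # close to +1 for good predictions, independent of scale
```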

  36. http://upload.wikimedia.org/wikipedia/commons/8/86/Correlation_coefficient.gif

  37. Which measure? • Best to look at all of them • Often it doesn’t matter • D is the best; C is the second-best; A and B are arguable Machine Learning and Bioinformatics

  38. Today’s exercise Machine Learning & Bioinformatics

  39. Parameter tuning Design your own select, feature, buy and sell programs. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 11/5 (Mon). Machine Learning & Bioinformatics

  40. Possible ways • Enlarge parameter range in CV • Stratified, repeated… • minimize the variance • Make a tuning set • use a large training set; make the tuning set as similar to the target stocks as possible • Cost matrix • resampling, otherwise it would be very difficult • Change measures • or plot ROC curves to understand your classifiers • The best measure is the transaction profit, but it requires the simulation system. Instead, you can develop a compromise evaluation script, which is more complicated than any theoretic measures but simpler than the real problem. This is usually required in practice. Machine Learning and Bioinformatics
