Learning from observations (b) • How good is a machine learner? • Experimentation protocols • Performance measures • Academic benchmarks vs Real Life KI1 / L. Schomaker - 2007
Experimentation protocols • Fooling yourself: training a decision tree on 100 example instances from Earth and sending the robot to Mars • training set / test set distinction • both must be of sufficient size: • large training set for a reliable ‘h’ (coefficients etc.) • large test set for a reliable prediction of real-life performance (see the sketch below)
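A minimal sketch of the training/test split in Python; the data, split fraction, and seed are hypothetical, not from the slides:

```python
import random

def train_test_split(samples, test_fraction=0.3, seed=42):
    """Randomly split samples into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy, so the original order is kept
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Usage: 100 hypothetical labeled instances (feature vector, class label)
data = [([i, i % 7], i % 2) for i in range(100)]
train, test = train_test_split(data)
print(len(train), "training /", len(test), "test instances")
```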
Experimentation protocols • one training set / one test set, four-year PhD project: still fooling yourself! • Solution: • training set • test set • final evaluation set with real-life data • k-fold evaluation: k subsets from a large database, measuring the standard deviation of performance over experiments (see the sketch below)
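A sketch of k-fold evaluation; the `train_and_test` callback and the dummy learner are hypothetical placeholders for your own learner:

```python
import random
import statistics

def k_fold_scores(samples, k, train_and_test, seed=1):
    """Split samples into k folds; train on k-1 folds, test on the held-out fold."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(train_and_test(train, test))
    return scores

# Usage with a dummy learner that returns a random score (stand-in only):
dummy = lambda train, test: random.random()
scores = k_fold_scores(list(range(100)), k=10, train_and_test=dummy)
print("mean:", statistics.mean(scores), "stdev:", statistics.stdev(scores))
```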
Experimentation protocols • What to do if you don’t have enough data? • Solution: leave-one-out • use N-1 samples for training • use the Nth sample for testing • repeat for all samples • compute the average performance (see the sketch below)
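A leave-one-out sketch; the 1-NN scorer and the tiny data set are hypothetical illustrations:

```python
def leave_one_out(samples, train_and_test):
    """N experiments: train on N-1 samples, test on the held-out one; average."""
    scores = []
    for i in range(len(samples)):
        held_out = samples[i]
        rest = samples[:i] + samples[i + 1:]
        scores.append(train_and_test(rest, held_out))   # 1.0 if correct, else 0.0
    return sum(scores) / len(scores)

# Usage: 1-NN on hypothetical 1-D data; score 1.0 when the label matches
data = [(0.1, 'A'), (0.2, 'A'), (0.9, 'B'), (1.1, 'B'), (0.15, 'A')]
def one_nn(train, test_sample):
    x, label = test_sample
    nearest = min(train, key=lambda s: abs(s[0] - x))
    return 1.0 if nearest[1] == label else 0.0
print("LOO accuracy:", leave_one_out(data, one_nn))
```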
Performance • Example: % correctly classified samples (P) • Ptrain • Ptest • Preal ≈ Ptest (the test set predicts real-life performance)
Performance, two-class • Precision = 100 * #correct_hits / #says_Yes [%] • Recall = 100 * #correct_hits / #is_Yes [%] (see the sketch below)
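A direct transcription of these two formulas into Python; the labels and the name of the positive class are hypothetical:

```python
def precision_recall(predictions, truths, positive='Yes'):
    """Two-class precision and recall in percent, as defined on the slide."""
    says_yes     = sum(1 for p in predictions if p == positive)
    is_yes       = sum(1 for t in truths if t == positive)
    correct_hits = sum(1 for p, t in zip(predictions, truths)
                       if p == positive and t == positive)
    precision = 100.0 * correct_hits / says_yes if says_yes else 0.0
    recall    = 100.0 * correct_hits / is_yes   if is_yes   else 0.0
    return precision, recall

# Usage on hypothetical labels: 2 correct hits, 3 said Yes, 3 are Yes
pred  = ['Yes', 'Yes', 'No', 'Yes', 'No']
truth = ['Yes', 'No',  'No', 'Yes', 'Yes']
print(precision_recall(pred, truth))   # (66.7, 66.7) approximately
```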
Performance, multi-class • Confusion matrix: counts of how often each true class is classified as each of the possible classes (see the sketch below)
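A minimal confusion-matrix sketch; the class names and labels are hypothetical:

```python
def confusion_matrix(truths, predictions, classes):
    """counts[i][j] = how often true class i was labeled as class j."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(truths, predictions):
        counts[index[t]][index[p]] += 1
    return counts

# Usage on hypothetical 3-class output; the diagonal holds the correct counts
classes = ['a', 'b', 'c']
truth   = ['a', 'a', 'b', 'c', 'c', 'c']
pred    = ['a', 'b', 'b', 'c', 'a', 'c']
for row in confusion_matrix(truth, pred, classes):
    print(row)
```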
Rankings / hit lists • Given a query Q, the system returns a hit list of matches M: an ordered set, with instances i in decreasing likelihood of correctness • Precision: proportion of correct instances in the hit list M • Recall: proportion of correct instances found, out of the total number of target samples in the database (see the sketch below)
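A sketch of hit-list precision and recall, assuming hypothetical document IDs and a known target set:

```python
def hitlist_precision_recall(hitlist, targets):
    """Precision/recall for a ranked hit list M returned for a query Q."""
    hits = [m for m in hitlist if m in targets]
    precision = len(hits) / len(hitlist)   # correct instances in the hit list
    recall    = len(hits) / len(targets)   # correct instances found of all targets
    return precision, recall

# Usage: the database holds 4 relevant targets, the system returns 5 matches
hitlist = ['d3', 'd7', 'd1', 'd9', 'd4']   # ordered, most likely first
targets = {'d1', 'd3', 'd5', 'd8'}
print(hitlist_precision_recall(hitlist, targets))   # (0.4, 0.5)
```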
Function approximation • For, e.g., regression models, learning an ‘analog’ output • Example: target function t(x) • Obtained output function o(x) • For performance evaluation, compute the root-mean-square error (RMS error): RMS = sqrt( Σx (o(x) - t(x))² / N ) (see the sketch below)
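The RMS formula in runnable form; the outputs and targets are hypothetical:

```python
import math

def rms_error(outputs, targets):
    """Root-mean-square error between obtained o(x) and target t(x)."""
    n = len(outputs)
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(outputs, targets)) / n)

# Usage on hypothetical regression outputs:
o = [0.9, 2.1, 2.9]
t = [1.0, 2.0, 3.0]
print(rms_error(o, t))   # 0.1
```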
Learning curves [figure: P [% OK] plotted against #epochs (presentations of the training set); performance on the training set keeps climbing towards 100%, while performance on the test set first rises and then drops again: no generalization, overfit. Stop training at the peak of the test-set curve; see the sketch below.]
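One possible way to act on the test-set curve is to stop at its peak (early stopping). The curve values below are hypothetical, and `patience` is an assumed tolerance, not something the slides specify:

```python
def early_stop(p_test_per_epoch, patience=5):
    """Given the test-set learning curve, return (epoch, P) where to stop."""
    best_p, best_epoch = -1.0, 0
    for epoch, p in enumerate(p_test_per_epoch):
        if p > best_p:
            best_p, best_epoch = p, epoch       # test curve still improving
        elif epoch - best_epoch >= patience:    # curve going down: overfit, Stop!
            break
    return best_epoch, best_p

# Usage on a hypothetical test-set curve that peaks and then overfits:
curve = [60, 70, 78, 83, 85, 84, 82, 80, 79, 77, 75]
print(early_stop(curve, patience=3))   # (4, 85): stop at the peak
```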
Overfitting • The learner learns the training set • even perfectly, like a lookup table (LUT) • memorizing training instances • without correctly handling unseen data • Usual cause: more free parameters in the learner than data values to constrain them
Preventing overfit • For good generalization: • the number of training examples must be much larger than the number of attributes (features): Nsamples / Nattr >> 1
Preventing overfit • For good generalization: • also: Nsamples >> Ncoefficients • e.g., solving a linear equation: 2 coefficients need 2 data points in 2D • Coefficients: model parameters, weights, etc.
Preventing overfit • For good generalization: • Ndatavalues >> Ncoefficients, where Ndatavalues = Nsamples * Nattributes • Coefficients: model parameters, weights, etc. • e.g., use Ndatavalues / Ncoefficients for system comparison (see the sketch below)
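A tiny sketch of using Ndatavalues / Ncoefficients as a comparison figure; all counts are hypothetical:

```python
def overfit_ratio(n_samples, n_attributes, n_coefficients):
    """Data values per free coefficient; should be >> 1 for good generalization."""
    return (n_samples * n_attributes) / n_coefficients

# Usage: two hypothetical learners trained on the same 100 x 5 data set
print(overfit_ratio(100, 5, 20))    # 25.0 data values per coefficient: fine
print(overfit_ratio(100, 5, 2000))  # 0.25: expect overfitting
```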
Example: machine-print OCR • Very accurate today, but: • needs 5000 examples of each character • printed on ink-jet, laser, and matrix printers, and fax copies • of many brands of printers • on many paper types • for 1 font & point size!
Ensemble methods • Boosting: • train a learner h[m] • reweigh each of the instances (emphasizing those h[m] got wrong) • weigh the method m • train a new learner h[m+1] • perform (weighted) majority voting on the ensemble opinions (see the sketch below)
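An AdaBoost-style sketch of these steps on 1-D data with threshold “stumps” as the weak learners; the stump learner, the data, and the round count are illustrative assumptions, not the slide’s specific method:

```python
import math

def adaboost(points, labels, rounds=10):
    """Boosting: train h[m], weigh instances and methods, repeat."""
    n = len(points)
    w = [1.0 / n] * n                     # instance weights
    ensemble = []                         # (alpha, threshold, sign) per round
    for _ in range(rounds):
        # weak learner: threshold stump minimizing the weighted error
        best = None
        for thr in sorted(set(points)):
            for sign in (+1, -1):
                err = sum(wi for wi, x, y in zip(w, points, labels)
                          if (sign if x >= thr else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)       # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)     # weight of method m
        ensemble.append((alpha, thr, sign))
        # reweigh instances: boost the ones this learner got wrong
        for i, (x, y) in enumerate(zip(points, labels)):
            h = sign if x >= thr else -sign
            w[i] *= math.exp(-alpha * y * h)
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def vote(ensemble, x):
    """Weighted majority vote over the ensemble opinions."""
    s = sum(alpha * (sign if x >= thr else -sign)
            for alpha, thr, sign in ensemble)
    return 1 if s >= 0 else -1

# Usage on hypothetical 1-D two-class data (labels +1 / -1):
pts  = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
lbls = [-1, -1, -1, 1, 1, 1, 1, 1]
model = adaboost(pts, lbls, rounds=5)
print([vote(model, x) for x in pts])
```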
The advantage of democracy: partly intelligent, independent deciders
Learning methods • Gradient descent, parameter finding (multi-layer perceptron, regression) • Expectation Maximization (smart Monte Carlo search for the best model, given the data) • Knowledge-based, symbolic learning (Version Spaces) • Reinforcement learning • Bayesian learning
Memory-based ‘learning’ • Lookup table (LUT) • Nearest neighbour: argmin(dist) • k-Nearest neighbour: majority(argmin_k(dist)), i.e. a majority vote over the k nearest samples (see the sketch below)
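A k-nearest-neighbour sketch matching the majority(argmin) idea; the data and distance choice are hypothetical:

```python
from collections import Counter

def k_nearest_neighbour(train, query, k=3):
    """Classify `query` by majority vote over the k nearest training samples."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean
    neighbours = sorted(train, key=lambda s: dist(s[0], query))[:k]
    labels = [label for _, label in neighbours]
    return Counter(labels).most_common(1)[0][0]

# Usage on hypothetical 2-D samples (feature vector, class):
train = [((0, 0), 'A'), ((0, 1), 'A'), ((1, 0), 'A'),
         ((5, 5), 'B'), ((5, 6), 'B'), ((6, 5), 'B')]
print(k_nearest_neighbour(train, (1, 1), k=3))   # 'A'
```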
Unsupervised learning • K-means clustering • Kohonen self-organizing maps (SOM) • Hierarchical clustering (see the sketch below)
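A plain k-means sketch as one representative of these methods; the data, k, and iteration count are hypothetical:

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Plain k-means clustering on vectors (lists of floats)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # initialize on random points
    for _ in range(iterations):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            d = lambda c: sum((a - b) ** 2 for a, b in zip(p, c))
            clusters[min(range(k), key=lambda i: d(centroids[i]))].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:   # keep the old centroid if the cluster went empty
                centroids[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return centroids

# Usage on hypothetical 2-D data with two obvious groups:
pts = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.0], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
print(k_means(pts, k=2))
```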
Summary (1) • Learning is needed for unknown environments (and/or lazy designers) • Learning agent = performance element + learning element • Learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation
Summary (2) • For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with the training examples • Decision-tree learning using information gain: entropy-based • Learning performance = prediction accuracy measured on test set(s)