
Learning Theory Put to Work


Presentation Transcript


  1. Learning Theory Put to Work Isabelle Guyon isabelle@clopinet.com

  2. What is the process of Data Mining / Machine Learning? [Diagram: TRAINING DATA → Learning algorithm → Trained machine; Query → Trained machine → Answer]

  3. For which tasks? • Classification (binary/categorical target) • Regression and time series prediction (continuous targets) • Clustering (targets unknown) • Rule discovery

  4. For which applications? [Chart: applications placed by number of training examples (10 to 10^6) vs. number of inputs (10 to 10^5): OCR / HWR / machine vision; text categorization; market analysis / customer knowledge / quality control; system diagnosis; bioinformatics]

  5. Banking / Telecom / Retail • Identify: • Prospective customers • Dissatisfied customers • Good customers • Bad payers • Obtain: • More effective advertising • Less credit risk • Less fraud • Decreased churn rate

  6. Biomedical / Biometrics • Medicine: • Screening • Diagnosis and prognosis • Drug discovery • Security: • Face recognition • Signature / fingerprint / iris verification • DNA fingerprinting

  7. Computer / Internet • Computer interfaces: • Troubleshooting wizards • Handwriting and speech • Brain waves • Internet • Hit ranking • Spam filtering • Text categorization • Text translation • Recommendation

  8. From Statistics to Machine Learning… and back! • Old textbook statistics were descriptive: • Mean, variance • Confidence intervals • Statistical tests • Fit data, discover distributions (past data) • Machine learning (1960s) is predictive: • Training / validation / test sets • Build robust predictive models (future data) • Learning theory (1990s): • Rigorous statistical framework for ML • Proper monitoring of fit vs. robustness

  9. Some Learning Machines • Linear models • Polynomial models • Kernel methods • Neural networks • Decision trees

  10. Conventions [Diagram: data matrix X = {x_ij} with m rows (samples / customers / patients, each a vector x_i) and n columns (attributes / features); target vector y = {y_j}; parameter vectors w and α]

  11. Linear Models f(x) = Σ_{j=1:n} w_j x_j + b Linear discriminant (for classification): • F(x) = 1 if f(x) > 0 • F(x) = -1 if f(x) ≤ 0 LINEAR = WEIGHTED SUM
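The weighted-sum discriminant above can be sketched in a few lines of Python; the weights and bias below are illustrative values, not learned ones.

```python
# Minimal sketch of the linear discriminant: f(x) = sum_j w_j x_j + b,
# with F(x) = 1 if f(x) > 0 and F(x) = -1 otherwise.

def f(x, w, b):
    """Linear score: weighted sum of the inputs plus a bias."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def F(x, w, b):
    """Linear discriminant: sign of the linear score."""
    return 1 if f(x, w, b) > 0 else -1

w = [2.0, -1.0]   # hypothetical weights
b = -0.5          # hypothetical bias

print(F([1.0, 0.5], w, b))   # f = 2*1 - 1*0.5 - 0.5 = 1.0 > 0, so class 1
print(F([0.0, 1.0], w, b))   # f = -1 - 0.5 = -1.5 <= 0, so class -1
```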

  12. Non-linear models Linear models (artificial neurons) • f(x) = Σ_{j=1:n} w_j x_j + b Models non-linear in their inputs, but linear in their parameters • f(x) = Σ_{j=1:N} w_j φ_j(x) + b (Perceptron) • f(x) = Σ_{i=1:m} α_i k(x_i, x) + b (Kernel method) Other non-linear models • Neural networks / multi-layer perceptrons • Decision trees
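The kernel-method form can be sketched as follows; the Gaussian (RBF) kernel, its width, and the α values are illustrative assumptions, not part of the slide.

```python
import math

# Sketch of f(x) = sum_i alpha_i * k(x_i, x) + b:
# non-linear in the input x, but linear in the parameters alpha_i.

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    sq = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq)

def f_kernel(x, X_train, alpha, b, gamma=1.0):
    """Kernel expansion over the m training points."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, X_train)) + b

X_train = [[0.0, 0.0], [1.0, 1.0]]   # two toy training points
alpha = [1.0, -1.0]                  # hypothetical dual coefficients
b = 0.0

print(f_kernel([0.0, 0.0], X_train, alpha, b))  # positive: query sits on the first point
```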

  13. Linear Decision Boundary [Figure: hyperplane f(x) = 0 in the (x1, x2) plane, separating the region f(x) < 0 from the region f(x) > 0]

  14. NL Decision Boundary [Figure: non-linear boundary f(x) = 0 in the (x1, x2) plane, separating the region f(x) < 0 from the region f(x) > 0]

  15. Fit / Robustness Tradeoff [Figure: two classifiers on the same (x1, x2) data — a complex boundary that fits every training point vs. a simpler, more robust one]

  16. Performance Assessment Confusion matrix (predictions F(x) vs. truth y):

                      Prediction -1    Prediction +1    Total
  Truth: Class -1     tn               fp               neg = tn+fp       False alarm = fp/neg
  Truth: Class +1     fn               tp               pos = fn+tp       Hit rate = tp/pos
  Total               rej = tn+fn      sel = fp+tp      m = tn+fp+fn+tp   Frac. selected = sel/m
                                       Precision = tp/sel

False alarm rate = type I error rate = 1 - specificity. Hit rate = 1 - type II error rate = sensitivity = recall = test power. • Compare F(x) = sign(f(x)) to the target y, and report: • Error rate = (fn + fp)/m • {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected} • Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2 • F measure = 2·precision·recall / (precision + recall) • Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot: • ROC curve: Hit rate vs. False alarm rate • Lift curve: Hit rate vs. Fraction selected • Precision/recall curve: Hit rate vs. Precision
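The measures above all derive from the four confusion counts; a minimal sketch, with made-up counts for illustration:

```python
# Compute the slide's performance measures from the confusion counts
# tp, fp, tn, fn (the example values below are invented toy numbers).

def metrics(tp, fp, tn, fn):
    pos, neg = tp + fn, tn + fp            # actual positives / negatives
    sel, total = tp + fp, tp + fp + tn + fn  # predicted positives / all samples
    hit_rate = tp / pos                    # sensitivity = recall
    precision = tp / sel
    return {
        "error_rate": (fn + fp) / total,
        "hit_rate": hit_rate,
        "false_alarm": fp / neg,           # 1 - specificity
        "precision": precision,
        "BER": (fn / pos + fp / neg) / 2,
        "F": 2 * precision * hit_rate / (precision + hit_rate),
    }

m = metrics(tp=40, fp=10, tn=45, fn=5)
print(m["error_rate"])   # (5 + 10) / 100 = 0.15
print(m["BER"])          # (5/45 + 10/55) / 2
```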

  17. ROC Curve Patients are diagnosed by putting a threshold on f(x); each threshold gives one point on the ROC curve. [Figure: Hit rate (sensitivity) vs. False alarm rate (1 - specificity), both from 0 to 100%; ideal ROC curve (AUC = 1), actual ROC, random ROC (AUC = 0.5); 0 ≤ AUC ≤ 1]

  18. Lift Curve Customers are ranked according to f(x); the top-ranking customers are selected. [Figure: Hit rate (fraction of good customers selected) vs. fraction of customers selected, both from 0 to 100%; ideal lift, actual lift, random lift; Gini = 2·AUC - 1, 0 ≤ Gini ≤ 1]

  19. What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples: • Classification: • Error rate: (1/m) Σ_{i=1:m} 1(F(x_i) ≠ y_i) • 1 - AUC (Gini index = 2·AUC - 1) • Regression: • Mean square error: (1/m) Σ_{i=1:m} (f(x_i) - y_i)²
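The two risk functionals on this slide can be sketched directly; the toy data and the two hypothetical predictors below are illustrative, not learned.

```python
# Empirical versions of the two risk functionals above, on a toy sample.

def error_rate(F, X, y):
    """Classification risk: fraction of points with F(x_i) != y_i."""
    return sum(F(xi) != yi for xi, yi in zip(X, y)) / len(y)

def mean_square_error(f, X, y):
    """Regression risk: (1/m) * sum (f(x_i) - y_i)^2."""
    return sum((f(xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)

# Toy 1-D data and hypothetical predictors.
X = [0.0, 1.0, 2.0, 3.0]
y_class = [-1, -1, 1, 1]
y_reg = [0.1, 1.1, 1.9, 3.2]

print(error_rate(lambda x: 1 if x > 1.5 else -1, X, y_class))  # 0.0
print(mean_square_error(lambda x: x, X, y_reg))                # small residual error
```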

  20. How to train? • Define a risk functional R[f(x, w)] • Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.) [Figure: risk R[f(x, w)] over the parameter space (w), with its minimum at w*]
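The first of those optimizers, gradient descent, can be sketched on a toy problem: minimizing the mean square error of a 1-D linear model f(x) = w·x + b. The step size, iteration count, and data are arbitrary choices for illustration.

```python
# Plain gradient descent on R[w, b] = (1/m) * sum (w*x_i + b - y_i)^2.

def train(X, y, lr=0.05, steps=500):
    w, b = 0.0, 0.0
    m = len(y)
    for _ in range(steps):
        # Gradients of the mean square error w.r.t. w and b.
        gw = sum(2 * (w * xi + b - yi) * xi for xi, yi in zip(X, y)) / m
        gb = sum(2 * (w * xi + b - yi) for xi, yi in zip(X, y)) / m
        w -= lr * gw
        b -= lr * gb
    return w, b

X = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1
w, b = train(X, y)
print(w, b)                # approaches w = 2, b = 1
```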

  21. Theoretical Foundations • Structural Risk Minimization • Regularization • Weight decay • Feature selection • Data compression Training powerful models, without overfitting

  22. Ockham’s Razor • Principle proposed by William of Ockham in the fourteenth century: “Pluralitas non est ponenda sine necessitate”. • Of two theories providing similarly good predictions, prefer the simplest one. • Shave off unnecessary parameters of your models.

  23. Risk Minimization • Learning problem: find the best function f(x; w) minimizing a risk functional • R[f] = ∫ L(f(x; w), y) dP(x, y), where L is the loss function and P(x, y) the unknown data distribution • Examples are given: (x1, y1), (x2, y2), … (xm, ym)

  24. Approximations of R[f] • Empirical risk: Rtrain[f] = (1/m) Σ_{i=1:m} L(f(x_i; w), y_i) • 0/1 loss 1(F(x_i) ≠ y_i): Rtrain[f] = error rate • square loss (f(x_i) - y_i)²: Rtrain[f] = mean square error • Guaranteed risk: With high probability (1 - δ), R[f] ≤ Rgua[f], where Rgua[f] = Rtrain[f] + ε(δ, C)

  25. Structural Risk Minimization (Vapnik, 1974) Nested subsets of models of increasing complexity/capacity: S1 ⊂ S2 ⊂ … ⊂ SN. [Figure: as a function of model complexity/capacity C, the training error Tr decreases, the capacity term ε(C) increases, and the guaranteed risk Gua = Tr + ε(C) passes through a minimum]

  26. S1 S2 … SN R capacity SRM Example • Rank with ||w||2 = Si wi2 Sk = { w | ||w||2< wk2 }, w1<w2<…<wk • Minimization under constraint: min Rtrain[f] s.t. ||w||2< wk2 • Lagrangian: Rreg[f,g] = Rtrain[f] + g ||w||2

  27. Multiple Structures • Shrinkage (weight decay, ridge regression, SVM): Sk = { w | ||w||² < w_k }, w1 < w2 < … < wk; γ1 > γ2 > γ3 > … > γk (γ is the ridge) • Feature selection: Sk = { w | ||w||0 < s_k }, s1 < s2 < … < sk (s is the number of features) • Data compression: k1 < k2 < … < kk (k may be the number of clusters)

  28. Hyper-parameter Selection • Learning = adjusting: parameters (the vector w) and hyper-parameters (γ, s, k). • Cross-validation with K folds: For various values of γ, s, k: - Adjust w on a fraction (K-1)/K of the training examples, e.g. 9/10th. - Test on the 1/K remaining examples, e.g. 1/10th. - Rotate the folds and average the test results (CV error). - Select γ, s, k to minimize the CV error. - Re-compute w on all training examples using the optimal γ, s, k. [Diagram: training data (X, y) split into K folds; separate test data; prospective study / “real” validation]
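The K-fold procedure above can be sketched end to end; as an assumed stand-in for "the model", the 1-D closed-form ridge fit is reused, and the data and candidate γ values are toy choices.

```python
# K-fold cross-validation: for each candidate hyper-parameter, fit on K-1
# folds, measure error on the held-out fold, rotate, and average.

def kfold_cv_error(X, y, gamma, K=3):
    m = len(y)
    folds = [list(range(k, m, K)) for k in range(K)]  # simple interleaved folds
    errors = []
    for held_out in folds:
        train_idx = [i for i in range(m) if i not in held_out]
        # Fit 1-D ridge on the training fraction (closed form, as in slide 26).
        num = sum(X[i] * y[i] for i in train_idx)
        den = sum(X[i] ** 2 for i in train_idx) + len(train_idx) * gamma
        w = num / den
        # Mean square error on the held-out fold.
        errors.append(sum((w * X[i] - y[i]) ** 2 for i in held_out) / len(held_out))
    return sum(errors) / K

X = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]   # roughly y = 2x
gammas = [0.0, 0.1, 1.0]
best = min(gammas, key=lambda g: kfold_cv_error(X, y, g))
print(best)   # the gamma with the lowest average CV error
```

In a full run one would then re-fit w on all training examples with the selected γ, as the slide's last step says.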

  29. Summary • SRM provides a theoretical framework for robust predictive modeling (overfitting avoidance), using the notions of guaranteed risk and model capacity. • Multiple structures may be used to control the model capacity, including: feature selection, data compression, ridge regression.

  30. KXEN (simplified) architecture [Diagram: data passes through Data Preparation and Encoding into a Learning Algorithm, which selects from a Class of Models using hyper-parameters (k, σ) and outputs parameters w]

  31. KXEN: SRM put to work Customers are ranked according to f(x); the top-ranking customers are selected. [Figure: fraction of good customers selected vs. fraction of customers selected, both from 0 to 100%; ideal lift, CV lift, training lift, test lift, random lift]

  32. Want to Learn More? • Statistical Learning Theory, V. Vapnik. Theoretical book. Reference book on generalization, VC dimension, Structural Risk Minimization, SVMs. ISBN: 0471030031. • Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html • The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/ • Feature Extraction: Foundations and Applications. I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book
