
Learning Theory Put to Work


Presentation Transcript


  1. Learning Theory Put to Work Isabelle Guyon isabelle@clopinet.com

  2. What is the process of Data Mining / Machine Learning? [Diagram: TRAINING DATA → Learning algorithm → Trained machine; Query → Trained machine → Answer]

  3. For which tasks? • Classification (binary/categorical target) • Regression and time series prediction (continuous targets) • Clustering (targets unknown) • Rule discovery

  4. For which applications? [Chart: applications placed by number of training examples (10 to 10^6) vs. number of inputs (10 to 10^5): OCR / HWR / machine vision; text categorization; market analysis / customer knowledge / quality control; system diagnosis; bioinformatics]

  5. Banking / Telecom / Retail • Identify: • Prospective customers • Dissatisfied customers • Good customers • Bad payers • Obtain: • More effective advertising • Less credit risk • Less fraud • Decreased churn rate

  6. Biomedical / Biometrics • Medicine: • Screening • Diagnosis and prognosis • Drug discovery • Security: • Face recognition • Signature / fingerprint / iris verification • DNA fingerprinting

  7. Computer / Internet • Computer interfaces: • Troubleshooting wizards • Handwriting and speech • Brain waves • Internet • Hit ranking • Spam filtering • Text categorization • Text translation • Recommendation

  8. From Statistics to Machine Learning… and back! • Old textbook statistics were descriptive: • Mean, variance • Confidence intervals • Statistical tests • Fit data, discover distributions (past data) • Machine learning (1960s) is predictive: • Training / validation / test sets • Build robust predictive models (future data) • Learning theory (1990s): • Rigorous statistical framework for ML • Proper monitoring of fit vs. robustness

  9. Some Learning Machines • Linear models • Polynomial models • Kernel methods • Neural networks • Decision trees

  10. Conventions [Diagram: data matrix X = {x_ij} with m rows (samples / customers / patients, each a vector x_i) and n columns (attributes / features); target vector y = {y_j}; parameter vectors w and α]

  11. Linear Models f(x) = Σ_{j=1:n} w_j x_j + b Linear discriminant (for classification): • F(x) = 1 if f(x) > 0 • F(x) = -1 if f(x) ≤ 0 LINEAR = WEIGHTED SUM
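The weighted-sum discriminant above can be sketched in a few lines of Python; the weights and bias below are illustrative values, not learned ones.

```python
# Minimal sketch of the linear discriminant: f(x) = sum_j w_j x_j + b,
# with F(x) = 1 if f(x) > 0 and F(x) = -1 otherwise.

def f(x, w, b):
    """Linear score: weighted sum of the inputs plus a bias."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def F(x, w, b):
    """Linear discriminant: sign of the linear score."""
    return 1 if f(x, w, b) > 0 else -1

w = [2.0, -1.0]   # hypothetical weights
b = -0.5          # hypothetical bias

print(F([1.0, 0.5], w, b))   # f = 2*1 - 1*0.5 - 0.5 = 1.0 > 0, so class 1
print(F([0.0, 1.0], w, b))   # f = -1 - 0.5 = -1.5 <= 0, so class -1
```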

  12. Non-linear models Linear models (artificial neurons) • f(x) = Σ_{j=1:n} w_j x_j + b Models non-linear in their inputs, but linear in their parameters • f(x) = Σ_{j=1:N} w_j φ_j(x) + b (Perceptron) • f(x) = Σ_{i=1:m} α_i k(x_i, x) + b (Kernel method) Other non-linear models • Neural networks / multi-layer perceptrons • Decision trees
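The kernel-method form can be sketched as follows; the Gaussian (RBF) kernel, its width, and the α values are illustrative assumptions, not part of the slide.

```python
import math

# Sketch of f(x) = sum_i alpha_i * k(x_i, x) + b:
# non-linear in the input x, but linear in the parameters alpha_i.

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    sq = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq)

def f_kernel(x, X_train, alpha, b, gamma=1.0):
    """Kernel expansion over the m training points."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, X_train)) + b

X_train = [[0.0, 0.0], [1.0, 1.0]]   # two toy training points
alpha = [1.0, -1.0]                  # hypothetical dual coefficients
b = 0.0

print(f_kernel([0.0, 0.0], X_train, alpha, b))  # positive: query sits on the first point
```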

  13. Linear Decision Boundary [Figure: hyperplane f(x) = 0 in the (x1, x2) plane, separating the region f(x) < 0 from the region f(x) > 0]

  14. NL Decision Boundary [Figure: non-linear boundary f(x) = 0 in the (x1, x2) plane, separating the region f(x) < 0 from the region f(x) > 0]

  15. Fit / Robustness Tradeoff [Figure: two classifiers on the same (x1, x2) data — a complex boundary that fits every training point vs. a simpler, more robust one]

  16. Performance Assessment Confusion matrix (predictions F(x) vs. truth y):

                      Prediction -1    Prediction +1    Total
  Truth: Class -1     tn               fp               neg = tn+fp       False alarm = fp/neg
  Truth: Class +1     fn               tp               pos = fn+tp       Hit rate = tp/pos
  Total               rej = tn+fn      sel = fp+tp      m = tn+fp+fn+tp   Frac. selected = sel/m
                                       Precision = tp/sel

False alarm rate = type I error rate = 1 - specificity. Hit rate = 1 - type II error rate = sensitivity = recall = test power. • Compare F(x) = sign(f(x)) to the target y, and report: • Error rate = (fn + fp)/m • {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected} • Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2 • F measure = 2·precision·recall / (precision + recall) • Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot: • ROC curve: Hit rate vs. False alarm rate • Lift curve: Hit rate vs. Fraction selected • Precision/recall curve: Hit rate vs. Precision
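The measures above all derive from the four confusion counts; a minimal sketch, with made-up counts for illustration:

```python
# Compute the slide's performance measures from the confusion counts
# tp, fp, tn, fn (the example values below are invented toy numbers).

def metrics(tp, fp, tn, fn):
    pos, neg = tp + fn, tn + fp            # actual positives / negatives
    sel, total = tp + fp, tp + fp + tn + fn  # predicted positives / all samples
    hit_rate = tp / pos                    # sensitivity = recall
    precision = tp / sel
    return {
        "error_rate": (fn + fp) / total,
        "hit_rate": hit_rate,
        "false_alarm": fp / neg,           # 1 - specificity
        "precision": precision,
        "BER": (fn / pos + fp / neg) / 2,
        "F": 2 * precision * hit_rate / (precision + hit_rate),
    }

m = metrics(tp=40, fp=10, tn=45, fn=5)
print(m["error_rate"])   # (5 + 10) / 100 = 0.15
print(m["BER"])          # (5/45 + 10/55) / 2
```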

  17. ROC Curve Patients are diagnosed by putting a threshold on f(x); each threshold gives one point on the ROC curve. [Figure: Hit rate (sensitivity) vs. False alarm rate (1 - specificity), both from 0 to 100%; ideal ROC curve (AUC = 1), actual ROC, random ROC (AUC = 0.5); 0 ≤ AUC ≤ 1]

  18. Lift Curve Customers are ranked according to f(x); the top-ranking customers are selected. [Figure: Hit rate (fraction of good customers selected) vs. fraction of customers selected, both from 0 to 100%; ideal lift, actual lift, random lift; Gini = 2·AUC - 1, 0 ≤ Gini ≤ 1]

  19. What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples: • Classification: • Error rate: (1/m) Σ_{i=1:m} 1(F(x_i) ≠ y_i) • 1 - AUC (Gini index = 2·AUC - 1) • Regression: • Mean square error: (1/m) Σ_{i=1:m} (f(x_i) - y_i)²
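The two risk functionals on this slide can be sketched directly; the toy data and the two hypothetical predictors below are illustrative, not learned.

```python
# Empirical versions of the two risk functionals above, on a toy sample.

def error_rate(F, X, y):
    """Classification risk: fraction of points with F(x_i) != y_i."""
    return sum(F(xi) != yi for xi, yi in zip(X, y)) / len(y)

def mean_square_error(f, X, y):
    """Regression risk: (1/m) * sum (f(x_i) - y_i)^2."""
    return sum((f(xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)

# Toy 1-D data and hypothetical predictors.
X = [0.0, 1.0, 2.0, 3.0]
y_class = [-1, -1, 1, 1]
y_reg = [0.1, 1.1, 1.9, 3.2]

print(error_rate(lambda x: 1 if x > 1.5 else -1, X, y_class))  # 0.0
print(mean_square_error(lambda x: x, X, y_reg))                # small residual error
```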

  20. How to train? • Define a risk functional R[f(x, w)] • Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.) [Figure: risk R[f(x, w)] over the parameter space (w), with its minimum at w*]
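The first of those optimizers, gradient descent, can be sketched on a toy problem: minimizing the mean square error of a 1-D linear model f(x) = w·x + b. The step size, iteration count, and data are arbitrary choices for illustration.

```python
# Plain gradient descent on R[w, b] = (1/m) * sum (w*x_i + b - y_i)^2.

def train(X, y, lr=0.05, steps=500):
    w, b = 0.0, 0.0
    m = len(y)
    for _ in range(steps):
        # Gradients of the mean square error w.r.t. w and b.
        gw = sum(2 * (w * xi + b - yi) * xi for xi, yi in zip(X, y)) / m
        gb = sum(2 * (w * xi + b - yi) for xi, yi in zip(X, y)) / m
        w -= lr * gw
        b -= lr * gb
    return w, b

X = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1
w, b = train(X, y)
print(w, b)                # approaches w = 2, b = 1
```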

  21. Theoretical Foundations • Structural Risk Minimization • Regularization • Weight decay • Feature selection • Data compression Training powerful models, without overfitting

  22. Ockham’s Razor • Principle proposed by William of Ockham in the fourteenth century: “Pluralitas non est ponenda sine necessitate”. • Of two theories providing similarly good predictions, prefer the simplest one. • Shave off unnecessary parameters of your models.

  23. Risk Minimization • Learning problem: find the best function f(x; w) minimizing a risk functional • R[f] = ∫ L(f(x; w), y) dP(x, y), where L is the loss function and P(x, y) the unknown data distribution • Examples are given: (x1, y1), (x2, y2), … (xm, ym)

  24. Approximations of R[f] • Empirical risk: Rtrain[f] = (1/m) Σ_{i=1:m} L(f(x_i; w), y_i) • 0/1 loss 1(F(x_i) ≠ y_i): Rtrain[f] = error rate • square loss (f(x_i) - y_i)²: Rtrain[f] = mean square error • Guaranteed risk: With high probability (1 - δ), R[f] ≤ Rgua[f], where Rgua[f] = Rtrain[f] + ε(δ, C)

  25. Structural Risk Minimization (Vapnik, 1974) Nested subsets of models of increasing complexity/capacity: S1 ⊂ S2 ⊂ … ⊂ SN. [Figure: as a function of model complexity/capacity C, the training error Tr decreases, the capacity term ε(C) increases, and the guaranteed risk Gua = Tr + ε(C) passes through a minimum]

  26. S1 S2 … SN R capacity SRM Example • Rank with ||w||2 = Si wi2 Sk = { w | ||w||2< wk2 }, w1<w2<…<wk • Minimization under constraint: min Rtrain[f] s.t. ||w||2< wk2 • Lagrangian: Rreg[f,g] = Rtrain[f] + g ||w||2

  27. Multiple Structures • Shrinkage (weight decay, ridge regression, SVM): Sk = { w | ||w||² < w_k }, w1 < w2 < … < wk; γ1 > γ2 > γ3 > … > γk (γ is the ridge) • Feature selection: Sk = { w | ||w||0 < s_k }, s1 < s2 < … < sk (s is the number of features) • Data compression: k1 < k2 < … < kk (k may be the number of clusters)

  28. Hyper-parameter Selection • Learning = adjusting: parameters (the vector w) and hyper-parameters (γ, s, k). • Cross-validation with K folds: For various values of γ, s, k: - Adjust w on a fraction (K-1)/K of the training examples, e.g. 9/10th. - Test on the 1/K remaining examples, e.g. 1/10th. - Rotate the folds and average the test results (CV error). - Select γ, s, k to minimize the CV error. - Re-compute w on all training examples using the optimal γ, s, k. [Diagram: training data (X, y) split into K folds; separate test data; prospective study / “real” validation]
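The K-fold procedure above can be sketched end to end; as an assumed stand-in for "the model", the 1-D closed-form ridge fit is reused, and the data and candidate γ values are toy choices.

```python
# K-fold cross-validation: for each candidate hyper-parameter, fit on K-1
# folds, measure error on the held-out fold, rotate, and average.

def kfold_cv_error(X, y, gamma, K=3):
    m = len(y)
    folds = [list(range(k, m, K)) for k in range(K)]  # simple interleaved folds
    errors = []
    for held_out in folds:
        train_idx = [i for i in range(m) if i not in held_out]
        # Fit 1-D ridge on the training fraction (closed form, as in slide 26).
        num = sum(X[i] * y[i] for i in train_idx)
        den = sum(X[i] ** 2 for i in train_idx) + len(train_idx) * gamma
        w = num / den
        # Mean square error on the held-out fold.
        errors.append(sum((w * X[i] - y[i]) ** 2 for i in held_out) / len(held_out))
    return sum(errors) / K

X = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]   # roughly y = 2x
gammas = [0.0, 0.1, 1.0]
best = min(gammas, key=lambda g: kfold_cv_error(X, y, g))
print(best)   # the gamma with the lowest average CV error
```

In a full run one would then re-fit w on all training examples with the selected γ, as the slide's last step says.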

  29. Summary • SRM provides a theoretical framework for robust predictive modeling (overfitting avoidance), using the notions of guaranteed risk and model capacity. • Multiple structures may be used to control the model capacity, including: feature selection, data compression, ridge regression.

  30. KXEN (simplified) architecture [Diagram: data passes through Data Preparation and Encoding into a Learning Algorithm, which selects from a Class of Models using hyper-parameters (k, σ) and outputs parameters w]

  31. KXEN: SRM put to work Customers are ranked according to f(x); the top-ranking customers are selected. [Figure: fraction of good customers selected vs. fraction of customers selected, both from 0 to 100%; ideal lift, CV lift, training lift, test lift, random lift]

  32. Want to Learn More? • Statistical Learning Theory, V. Vapnik. Theoretical book. Reference book on generalization, VC dimension, Structural Risk Minimization, SVMs. ISBN: 0471030031. • Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html • The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/ • Feature Extraction: Foundations and Applications. I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book
