Lab 1


Lab 1

Getting started with Basic Learning Machines and the Overfitting Problem


Lab 1

Polynomial regression


Matlab: POLY_GUI

  • The code implements the ridge regression algorithm:

    w = argmin_w Σ_i (1 − y_i f(x_i))² + γ ‖w‖²

    f(x) = w_1 x + w_2 x² + … + w_n x^n = w x^T

    x = [x, x², …, x^n]

    w^T = X⁺ Y

    X⁺ = X^T (X X^T + γI)^(−1) = (X^T X + γI)^(−1) X^T

    X = [x(1); x(2); …; x(p)] (a p × n matrix, one row per training example)

  • The leave-one-out (LOO) error is obtained with the PRESS statistic (Predicted REsidual Sum of Squares):

    LOO error = (1/p) Σ_k [ r_k / (1 − (X X⁺)_kk) ]²
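A minimal MATLAB sketch of these formulas on toy data (illustrative only, not the POLY_GUI source; the data and parameter values are made up, and implicit expansion requires a recent MATLAB):

    % Ridge regression on polynomial features, with the PRESS leave-one-out error.
    x = linspace(-1, 1, 20)'; Y = sin(3*x) + 0.1*randn(20, 1); % toy data
    n = 5; gamma = 1e-3; p = numel(x);
    X = x.^(1:n);                           % p x n matrix [x, x^2, ..., x^n]
    Xplus = (X'*X + gamma*eye(n)) \ X';     % X+ = (X'X + gamma*I)^(-1) * X'
    w = (Xplus*Y)';                         % w' = X+ * Y
    r = Y - X*w';                           % training residuals r_k
    H = X*Xplus;                            % "hat" matrix X*X+
    loo = sum((r ./ (1 - diag(H))).^2) / p; % PRESS: (1/p) sum_k [r_k/(1-H_kk)]^2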



Matlab: POLY_GUI

  • At the prompt type: poly_gui;

  • Vary the parameters. Refrain from hitting “CV”. Explain what happens in the following situations:

    • Sample num. << Target degree (small noise)

    • Large noise, small sample num

    • Target degree << Model degree

  • Why is the LOO error sometimes larger than the training and test error?

  • Are there local minima in the LOO error? Is the LOO error flat near the optimum?

  • Propose ways of getting a better solution.



CLOP Data Objects

The poly_gui emulates CLOP objects of type “data”:

  • X = rand(10,5)

  • Y = rand(10,1)

  • D = data(X,Y) % constructor

  • methods(D)

  • get_X(D)

  • get_Y(D)

  • plot(D);



CLOP Model Objects

poly_ridge is a “model” object.

  • P = poly_ridge; h = plot(P);

  • D = gene(P); plot(D, h);

  • Dt = gene(P);

  • [resu, fittedP] = train(P, D);

  • mse(resu)

  • tresu = test(fittedP, Dt);

  • mse(tresu)

  • plot(P, h);



Lab 1

Support Vector Machines



Support Vector Classifier

f(x) = Σ_{i ∈ SV} α_i y_i k(x, x_i), with x = [x_1, x_2]

[Figure: two-class data in the (x_1, x_2) plane; the decision boundary is f(x) = 0, separating the regions f(x) > 0 and f(x) < 0.]

(Boser, Guyon, Vapnik, 1992)



Matlab: SVC_GUI

  • At the prompt type: svc_gui;

  • The code implements the Support Vector Machine algorithm with the kernel:

    k(s, t) = (1 + s·t)^q exp(−γ ‖s − t‖²)

  • Regularization similar to ridge regression:

    Hinge loss: L(x_i) = max(0, 1 − y_i f(x_i))^b

    Empirical risk: Σ_i L(x_i)

    w = argmin_w (1/C) ‖w‖² + Σ_i L(x_i)

    The (1/C) ‖w‖² term plays the role of the shrinkage (regularization) term.
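A hedged sketch of this objective for a linear classifier on toy data, trained by plain subgradient descent (illustrative only; the solver actually used by SVC_GUI is not shown in the transcript):

    % Linear SVC: minimize (1/C)*||w||^2 + sum_i max(0, 1 - y_i*f(x_i)), f(x) = x*w.
    X = [randn(20,2)+1; randn(20,2)-1];       % toy two-class data
    y = [ones(20,1); -ones(20,1)];
    C = 10; eta = 0.01; w = zeros(2,1);
    for t = 1:500
        z = y .* (X*w);                       % functional margins z_i = y_i*f(x_i)
        active = z < 1;                       % patterns inside the hinge region
        g = (2/C)*w - X(active,:)'*y(active); % subgradient of the objective
        w = w - eta*g;                        % gradient step
    end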



Lab 1

More loss functions…



Loss Functions

All the losses below are functions of the margin z = y f(x); z > 0 means well classified, z < 0 misclassified:

  • 0/1 loss

  • Perceptron loss: max(0, −z)

  • SVC loss, b=1 (hinge): max(0, 1 − z)

  • SVC loss, b=2 (squared hinge): max(0, 1 − z)²

  • square loss: (1 − z)²

  • logistic loss: log(1 + e^(−z))

  • Adaboost loss: e^(−z)

[Figure: plot of L(y, f(x)) against z, showing the decision boundary and the margin.]
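These losses are easiest to compare by plotting them against z; a small MATLAB sketch:

    % Plot each loss as a function of the margin z = y*f(x).
    z = linspace(-2, 2, 400);
    plot(z, double(z <= 0), 'k', ...          % 0/1 loss
         z, max(0, 1-z), 'b', ...             % SVC hinge, b=1
         z, max(0, 1-z).^2, 'b--', ...        % SVC squared hinge, b=2
         z, (1-z).^2, 'g', ...                % square loss
         z, log(1 + exp(-z)), 'm', ...        % logistic loss
         z, exp(-z), 'r', ...                 % Adaboost (exponential) loss
         z, max(0, -z), 'c');                 % Perceptron loss
    legend('0/1', 'hinge', 'hinge^2', 'square', 'logistic', 'exponential', 'perceptron');
    xlabel('z = y f(x)'); ylabel('L(z)');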



Exercise: Gradient Descent

  • Linear discriminant: f(x) = Σ_j w_j x_j

  • Functional margin: z = y f(x), y = ±1

  • Compute ∂z/∂w_j.

  • Derive the learning rules Δw_j = −η ∂L/∂w_j corresponding to the following loss functions:

    SVC loss: max(0, 1 − z)
    Adaboost loss: e^(−z)
    square loss: (1 − z)²
    logistic loss: log(1 + e^(−z))
    Perceptron loss: max(0, −z)
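As a check on the method, here is the worked rule for the square loss (the other losses follow the same chain-rule pattern):

    \frac{\partial z}{\partial w_j} = y\,x_j, \qquad
    L(z) = (1-z)^2 \;\Rightarrow\;
    \frac{\partial L}{\partial w_j} = \frac{dL}{dz}\,\frac{\partial z}{\partial w_j}
    = -2\,(1-z)\,y\,x_j, \qquad
    \Delta w_j = -\eta\,\frac{\partial L}{\partial w_j} = 2\eta\,(1-z)\,y\,x_j.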



Exercise: Dual Algorithms

  • From the Δw_j, derive the Δw.

  • w = Σ_i α_i x_i

  • From the Δw, derive the Δα_i of the dual algorithms.
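A worked example for the Perceptron loss: its primal update is Δw = η y x on patterns with z < 0, so writing w = Σ_i α_i x_i, presenting pattern x_i updates only its own coefficient:

    \Delta \alpha_i =
    \begin{cases}
      \eta & \text{if } y_i f(x_i) < 0,\\
      0 & \text{otherwise,}
    \end{cases}
    \qquad \text{with } w = \sum_i \alpha_i\, y_i\, x_i
    \text{ or, equivalently, } \Delta \alpha_i = \eta\, y_i \text{ for } w = \sum_i \alpha_i x_i.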



Summary

  • Modern ML algorithms optimize a penalized risk functional:

    R[f] = Σ_i L(x_i) + γ ‖w‖²   (empirical risk + regularization term)



Lab 2

Getting started with CLOP



Lab 2

CLOP tutorial



What is CLOP?

  • CLOP=Challenge Learning Object Package.

  • Based on the Spider developed at the Max Planck Institute.

  • Two basic abstractions:

    • Data object

    • Model object

  • Put the CLOP directory in your path.

  • At the prompt type: use_spider_clop;

  • If you have used poly_gui before, type:

    clear classes



CLOP Data Objects

At the Matlab prompt:

  • addpath(<clop_dir>);

  • use_spider_clop;

  • X=rand(10,8);

  • Y=[1 1 1 1 1 -1 -1 -1 -1 -1]';

  • D=data(X,Y); % constructor

  • [p,n]=get_dim(D)

  • get_x(D)

  • get_y(D)



CLOP Model Objects

D is a data object previously defined.

  • model = kridge; % constructor

  • [resu, model] = train(model, D);

  • resu, model.W, model.b0

  • Yhat = D.X*model.W' + model.b0

  • testD = data(rand(3,8), [-1 -1 1]');

  • tresu = test(model, testD);

  • balanced_errate(tresu.X, tresu.Y)



Hyperparameters and Chains

A model often has hyperparameters:

  • default(kridge)

  • hyper = {'degree=3', 'shrinkage=0.1'};

  • model = kridge(hyper);

Models can be chained:

  • model = chain({standardize, kridge(hyper)});

  • [resu, model] = train(model, D);

  • tresu = test(model, testD);

  • balanced_errate(tresu.X, tresu.Y)



Hyper-parameters

  • Kernel methods (kridge and svc); see the sketch after this list:

    k(x, y) = (coef0 + x·y)^degree · exp(−gamma ‖x − y‖²)

    k_ij = k(x_i, x_j)

    k_ii ← k_ii + shrinkage

  • Naïve Bayes (naive): no hyperparameters

  • Neural network (neural): units, shrinkage, maxiter

  • Random Forest (rf, Windows only): mtry
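A generic MATLAB sketch of how these hyperparameters enter the kernel matrix (illustrative values; not the CLOP source):

    % Kernel matrix k_ij = (coef0 + x_i*x_j')^degree * exp(-gamma*||x_i - x_j||^2).
    coef0 = 1; degree = 3; gamma = 0.5; shrinkage = 0.1;  % example values
    X = rand(10, 5);                         % 10 patterns, 5 features
    sq = sum(X.^2, 2);                       % squared norms of the patterns
    D2 = max(0, sq + sq' - 2*(X*X'));        % pairwise squared distances
    K = ((coef0 + X*X').^degree) .* exp(-gamma*D2);
    K = K + shrinkage*eye(size(K, 1));       % k_ii <- k_ii + shrinkage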



Exercise

  • Here are some of the pattern recognition CLOP objects:

    @rf @naive @gentleboost
    @svc @neural @logitboost
    @kridge @lssvm @klogistic
    [use the Spider @svm]

  • At the prompt, try example(neural)

  • Try other pattern recognition objects

  • Try different sets of hyperparameters, e.g., example(kridge({'gamma=1', 'shrinkage=0.001'}))

  • Remember: use default(method) to get the HP.



Lab 2

Example: Digit Recognition

Subset of the MNIST data of LeCun and Cortes, used for the NIPS 2003 challenge.



data(X, Y)

% Go to the Gisette directory:

  • cd('GISETTE')

    % Load “validation” data:

  • Xt=load('gisette_valid.data');

  • Yt=load('gisette_valid.labels');

    % Create a data object

    % and examine it:

  • Dt=data(Xt, Yt);

  • browse(Dt, 2);

    % Load “training” data (longer):

  • X=load('gisette_train.data');

  • Y=load('gisette_train.labels');

  • [p, n]=get_dim(Dt);

  • D=train(subsample(['p_max=' num2str(p)]), data(X, Y));

  • clear X Y Xt Yt

    % Save for later use:

  • save('gisette', 'D', 'Dt');



model(hyperparam)

% Define some hyperparameters:

  • hyper = {'degree=3', 'shrinkage=0.1'};

    % Create a kernel ridge

    % regression model:

  • model = kridge(hyper);

    % Train it and test it:

  • [resu, Model] = train(model, D);

  • tresu = test(Model, Dt);

    % Visualize the results:

  • roc(tresu);

  • idx=find(tresu.X.*tresu.Y<0);

  • browse(get(D, idx), 2);



Exercise

  • Here are some pattern recognition CLOP objects:

    @rf @naive @gentleboost

    @svc @neural @logitboost

    @kridge @lssvm @klogistic

  • Instantiate a model with some hyperparameters (use default(method) to get the HP)

  • Vary the HP and the number of training examples (Hint: use get(D, 1:n) to restrict the data to n examples).



chain({model1, model2,…})

% Combine preprocessing and kernel ridge regression:

  • my_prepro=normalize;

  • model = chain({my_prepro,kridge(hyper)});

    % Combine replicas of a base learner:

  • for k=1:10

  • base_model{k}=neural;

  • end

  • model=ensemble(base_model);

ensemble({model1, model2,…})



Exercise

  • Here are some preprocessing CLOP objects:

    @normalize @standardize @fourier

  • Chain a preprocessing and a model, e.g.,

  • model=chain({fourier, kridge('degree=3')});

  • my_classif=svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'});

  • model=chain({normalize, my_classif});

  • Train, test, visualize the results. Hint: you can browse the preprocessed data:

  • browse(train(standardize, D), 2);



Summary

% After creating your complex model, just one command: train

  • model=ensemble({chain({standardize,kridge(hyper)}),chain({normalize,naive})});

  • [resu, Model] = train(model, D);

    % After training your complex model, just one command: test

  • tresu = test(Model, Dt);

    % You can use a “cv” object to perform cross-validation:

  • cv_model=cv(model);

  • [resu, Model] = train(cv_model, D);

  • roc(resu);



Lab 3

Getting started with Feature Selection



POLY_GUI again…

  • clear classes

  • poly_gui;

  • Check the “Multiplicative updates” (MU) box.

  • Play with the parameters.

  • Try CV

  • Compare with no MU



Lab 3

Exploring feature selection methods



Re-load the GISETTE data

% Start CLOP:

  • clear classes

  • use_spider_clop;

    % Go to the Gisette directory:

  • cd('GISETTE')

  • load('gisette');



Visualization

1) Create a heatmap of the data matrix or a subset:

show(D);

show(get(D,1:10, 1:2:500));

2) Look at individual patterns:

browse(D);

browse(D, 2); % For 2d data

% Display feature positions:

browse(D, 2, [212, 463, 429, 239]);

3) Make a scatter plot of a few features:

scatter(D, [212, 463, 429, 239]);



Example

  • my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});

  • my_classif=svm('optimizer=''andre''');

  • my_classif.algorithm.use_signed_output=0;

  • model=chain({normalize, s2n('f_max=100'), my_classif});

  • [resu, Model] = train(model, D);

  • tresu = test(Model, Dt);

  • roc(tresu);

    % Show the misclassified first

  • [s,idx]=sort(tresu.X.*tresu.Y);

  • browse(get(Dt, idx), 2, Model{2});



Some Filters in CLOP

Univariate:

  • @s2n (Signal to noise ratio.)

  • @Ttest (T statistic; similar to s2n.)

  • @Pearson (Uses Matlab corrcoef. Gives the same results as Ttest when the classes are balanced.)

  • @aucfs (ranksum test)

    Multivariate:

  • @relief (no elimination of redundancy)

  • @gs (Gram-Schmidt orthogonalization; complementary features)
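For instance, the s2n criterion ranks feature j by |μ_j⁺ − μ_j⁻| / (σ_j⁺ + σ_j⁻); a minimal sketch on toy data (generic MATLAB, not the CLOP @s2n code):

    % Signal-to-noise ratio of each feature, for labels Y in {-1,+1}.
    X = randn(100, 50); Y = sign(randn(100, 1));          % toy data
    mu_p = mean(X(Y == +1, :)); sd_p = std(X(Y == +1, :));
    mu_m = mean(X(Y == -1, :)); sd_m = std(X(Y == -1, :));
    s2n_score = abs(mu_p - mu_m) ./ (sd_p + sd_m + eps);  % larger = more relevant
    [~, order] = sort(s2n_score, 'descend');              % ranked feature indices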



Exercise

  • Change the feature selection algorithm

  • Visualize the features

  • What can you say of the various methods?

  • Which one gives the best results for 2, 10, 100 features?

  • Can you improve by changing the preprocessing? (Hint: try @pc_extract)



Lab 3

Feature significance



T-test

[Figure: class-conditional densities P(X_i | Y = 1) and P(X_i | Y = −1) of a feature x_i, with means μ⁻, μ⁺ and standard deviations σ⁻, σ⁺.]

  • Normally distributed classes; equal variance σ², unknown, estimated from the data as σ²_within.

  • Null hypothesis H0: μ⁺ = μ⁻

  • T statistic: if H0 is true, then

    t = (μ⁺ − μ⁻) / (σ_within √(1/m⁺ + 1/m⁻)) ~ Student(m⁺ + m⁻ − 2 d.f.)

    where m⁺ and m⁻ are the numbers of examples in each class.
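A sketch of this statistic computed per feature (assumes X is p × n, Y in {-1,+1}, and the Statistics Toolbox for tcdf):

    % Two-sample T statistic and analytic two-sided p-value for each feature.
    Xp = X(Y == +1, :); Xm = X(Y == -1, :);
    mp = size(Xp, 1); mm = size(Xm, 1);
    s_w = sqrt(((mp-1)*var(Xp) + (mm-1)*var(Xm)) / (mp + mm - 2)); % sigma_within
    t = (mean(Xp) - mean(Xm)) ./ (s_w * sqrt(1/mp + 1/mm));
    pval = 2 * tcdf(-abs(t), mp + mm - 2);   % Student, mp+mm-2 d.f.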



Evaluation of pval and FDR

  • The Ttest object:

    • computes pval analytically

    • FDR ≈ pval × n_sc/n

  • The probe object:

    • takes any feature ranking object as an argument (e.g. s2n, relief, Ttest)

    • pval ≈ n_sp/n_p

    • FDR ≈ pval × n_sc/n
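A minimal sketch of the probe idea, using s2n as the ranking criterion (the exact normalizations used by the CLOP probe object may differ; the choice of as many probes as real features is an assumption):

    % Append random probes, rank everything, read pval off the probes' ranks.
    [p, n] = size(X); np = n;               % assumption: as many probes as features
    Xa = [X, randn(p, np)];                 % real features + random probes
    sc = abs(mean(Xa(Y==+1,:)) - mean(Xa(Y==-1,:))) ./ ...
         (std(Xa(Y==+1,:)) + std(Xa(Y==-1,:)) + eps);   % s2n scores
    [~, order] = sort(sc, 'descend');
    is_probe = order > n;                   % probes among the ranked candidates
    pval = cumsum(is_probe) / np;           % pval(r) ~ n_sp/n_p at rank r
    fdr = pval * n ./ max(1, cumsum(~is_probe)); % one common FDR convention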



Analytic vs. probe

[Figure: FDR (0 to 1) as a function of feature rank (0 to 5000), comparing the analytic and probe estimates.]



Example

  • [resu, FS] = train(Ttest, D);

  • [resu, PFS] = train(probe(Ttest), D);

  • figure('Name', 'pvalue');

  • plot(get_pval(FS, 1), 'r');

  • hold on; plot(get_pval(PFS, 1));

  • figure('Name', 'FDR');

  • plot(get_fdr(FS, 1), 'r');

  • hold on; plot(get_fdr(PFS, 1));



Exercise

  • What could explain the differences between the pval and FDR obtained with the analytic and probe methods?

  • Replace Ttest with chain({rmconst('w_min=0'), Ttest}).

  • Recompute the pval and FDR curves. What do you notice?

  • Choose an optimum number fnum of features based on pvalue or FDR. Visualize with browse(D, 2,FS, fnum);

  • Create a model with fnum. Is fnum optimal? Do you get something better with CV?



Lab 3

Local feature selection



Exercise

Consider the 1-nearest-neighbor algorithm. We define the following score, where s(k) (resp. d(k)) is the index of the nearest neighbor of x_k belonging to the same class as x_k (resp. to a different class).



Exercise

  • Motivate the choice of such a cost function to approximate the generalization error (qualitative answer).

  • How would you derive an embedded method that performs feature selection for the 1-nearest-neighbor algorithm using this functional?

  • Motivate your choice (what makes your method an "embedded" method rather than a "wrapper" method)?



Relief

Relief = ⟨D_miss / D_hit⟩ (averaged over all patterns)

Local_Relief = D_miss / D_hit

[Figure: for each pattern, D_hit is the distance to its nearest hit (nearest example of the same class) and D_miss the distance to its nearest miss (nearest example of a different class).]
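A sketch of the slide's distance-ratio form on a generic data matrix (illustrative; the CLOP relief object scores individual features, which is not shown here):

    % Per-pattern ratio of nearest-miss to nearest-hit distance.
    p = size(X, 1);
    sq = sum(X.^2, 2);
    D = sqrt(max(0, sq + sq' - 2*(X*X')));  % pairwise distances
    D(1:p+1:end) = Inf;                     % ignore self-distances
    local_relief = zeros(p, 1);
    for k = 1:p
        Dhit  = min(D(k, Y == Y(k)));       % nearest hit (same class)
        Dmiss = min(D(k, Y ~= Y(k)));       % nearest miss (other class)
        local_relief(k) = Dmiss / Dhit;     % Local_Relief = Dmiss/Dhit
    end
    relief_score = mean(local_relief);      % Relief = <Dmiss/Dhit>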



Exercise

  • [resu, FS] = train(relief, D);

  • browse(D, 2,FS, 20);

  • [resu, LFS] = train(local_relief,D);

  • browse(D, 2,LFS, 20);

  • Propose a modification to the nearest neighbor algorithm that uses features relevant to individual patterns (like those provided by “local_relief”).

  • Would you expect such an algorithm to perform better than the non-local version using “relief”?



Epilogue

Becoming a pro and playing with other datasets



Some CLOP objects

Feature selection, pre- and post-processing

Basic learning machines

Compound models



http://clopinet.com/challenges/

  • Challenges in

    • Feature selection

    • Performance prediction

    • Model selection

    • Causality

  • Large datasets


NIPS 2003 Feature Selection Challenge

  • Class taught at ETH, Zurich, winter 2005.

  • Task of the students: a baseline method was provided, achieving performance BER0 with n0 features.

  • Goal: get BER < BER0, or BER = BER0 with n < n0.

  • Extra credit for beating the best challenge entry.

Dataset     Size      Type            Features   Training   Validation   Test
Arcene      8.7 MB    Dense           10000      100        100          700
Gisette     22.5 MB   Dense           5000       6000       1000         6500
Dexter      0.9 MB    Sparse integer  20000      300        300          2000
Dorothea    4.7 MB    Sparse binary   100000     800        350          800
Madelon     2.9 MB    Dense           500        2000       600          1800

Sample DEXTER (text) document: “NEW YORK, October 2, 2001 – Instinet Group Incorporated (Nasdaq: INET), the world’s largest electronic agency securities broker, today announced tha”

Baseline CLOP models per dataset:

ARCENE (best BER = 11.9±1.2%, n0 = 1100 features (11%), BER0 = 14.7%):

my_svc=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.1'});
my_model=chain({standardize, s2n('f_max=1100'), normalize, my_svc})

GISETTE (best BER = 1.26±0.14%, n0 = 1000 (20%), BER0 = 1.80%):

my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model=chain({normalize, s2n('f_max=1000'), my_classif});

DEXTER (best BER = 3.30±0.40%, n0 = 300 (1.5%), BER0 = 5%):

my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model=chain({s2n('f_max=300'), normalize, my_classif})

DOROTHEA (best BER = 8.54±0.99%, n0 = 1000 (1%), BER0 = 12.37%):

my_model=chain({TP('f_max=1000'), naive, bias});

MADELON (best BER = 6.22±0.57%, n0 = 20 (4%), BER0 = 7.33%):

my_classif=svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
my_model=chain({probe(relief,{'p_num=2000', 'pval_max=0'}), standardize, my_classif})

Reference: Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Isabelle Guyon, Jiwen Li, Theodor Mader, Patrick A. Pletscher, Georg Schneider and Markus Uhr. Pattern Recognition Letters, Volume 28, Issue 12, 1 September 2007, Pages 1438–1444.


NIPS 2006 Model Selection Game

Dataset   Domain                Feature #   Training #   Validation #   Test #
ADA       Marketing             48          4147         415            41471
GINA      Digit recognition     970         3153         315            31532
HIVA      Drug discovery        1617        3845         384            38449
NOVA      Text classification   16969       1754         175            17537
SYLVA     Ecology               216         13086        1309           130857

CLOP models selected by the two winning entries:

ADA:   {sns, std, norm, neural(units=5), bias}  |  2*{sns,std,norm,gentleboost(neural),bias}; 2*{std,norm,gentleboost(kridge),bias}; 1*{rf,bias}
GINA:  6*{std,gs,svc(degree=1)}; 3*{std,svc(degree=2)}  |  {norm, svc(degree=5, shrinkage=0.01), bias}
HIVA:  3*{norm,svc(degree=1),bias}  |  {std, norm, gentleboost(kridge), bias}
NOVA:  {norm, gentleboost(neural), bias}  |  5*{norm,gentleboost(kridge),bias}
SYLVA: 4*{std,norm,gentleboost(neural),bias}; 4*{std,neural}; 1*{rf,bias}  |  {std, norm, neural(units=1), bias}

sns = shift’n’scale, std = standardize, norm = normalize (some details of hyperparameters not shown).

First place: Juha Reunanen, cross-indexing-7.
Second place: Hugo Jair Escalante Balderas, BRun2311062.
Note: entry Boosting_1_001_x900 gave better results, but was older.

Sample NOVA (text classification) document: “Subject: Re: Goalie masks. Lines: 21. Tom Barrasso wore a great mask, one time, last season. It was all black, with Pgh city scenes on it. The "Golden Triangle" graced the top, along with a steel mill on one side and the Civic Arena on the other. On the back of the helmet was the old Pens' logo the current (at the time) Pens logo, and a space for the "new" logo. Lori”

Proc. IJCNN07, Orlando, FL, August 2007:

  • PSMS for Neural Networks, H. Jair Escalante, Manuel Montes y Gómez, and Luis Enrique Sucar.

  • Model Selection and Assessment Using Cross-indexing, Juha Reunanen.

