
Lab 1


Getting started with

Basic Learning Machines

and

the Overfitting Problem

Polynomial regression

- The code implements the ridge regression algorithm:

w = argmin_w Σi (yi − f(xi))² + γ ||w||²

f(x) = w1 x + w2 x² + … + wn xⁿ = w xᵀ, with x = [x, x², …, xⁿ]

wᵀ = X⁺Y, where X⁺ = Xᵀ(XXᵀ + γI)⁻¹ = (XᵀX + γI)⁻¹ Xᵀ

X = [x(1); x(2); …; x(p)] (a p × n matrix with one training example per row)

- The leave-one-out (LOO) error is obtained with the PRESS statistic (Predicted REsidual Sum of Squares):

LOO error = (1/p) Σk [ rk / (1 − (XX⁺)kk) ]²

where rk = yk − f(xk) is the k-th training residual.
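The PRESS shortcut can be checked numerically. Below is a minimal NumPy sketch (the lab itself uses Matlab's poly_gui; the data, degree, and γ here are invented for illustration) that computes the ridge weights via the shrunk pseudo-inverse and verifies the LOO formula against explicit retraining:

```python
import numpy as np

# Illustrative sketch of the formulas above; data and gamma are made up.
rng = np.random.default_rng(0)
p, n, gamma = 20, 3, 0.1
x = rng.uniform(-1, 1, p)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(p)

# Row k of X holds the monomial features [x_k, x_k^2, ..., x_k^n]
X = np.column_stack([x ** d for d in range(1, n + 1)])

# Pseudo-inverse with shrinkage: X+ = (X'X + gamma I)^-1 X'
Xplus = np.linalg.solve(X.T @ X + gamma * np.eye(n), X.T)
w = Xplus @ y                       # w' = X+ Y

# PRESS: LOO error = (1/p) sum_k [ r_k / (1 - (X X+)_kk) ]^2
r = y - X @ w                       # training residuals
h = np.diag(X @ Xplus)              # leverages (X X+)_kk
loo_press = np.mean((r / (1 - h)) ** 2)

# The shortcut agrees with brute-force leave-one-out retraining
errs = []
for k in range(p):
    m = np.arange(p) != k
    wk = np.linalg.solve(X[m].T @ X[m] + gamma * np.eye(n), X[m].T @ y[m])
    errs.append((y[k] - X[k] @ wk) ** 2)
assert np.isclose(loo_press, np.mean(errs))
```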

- At the prompt type: poly_gui;
- Vary the parameters. Refrain from hitting "CV" for now. Explain what happens in the following situations:
- Sample number << target degree (small noise)
- Large noise, small sample number
- Target degree << model degree

- Why is the LOO error sometimes larger than the training and test error?
- Are there local minima in the LOO error? Is the LOO error flat near the optimum?
- Propose ways of getting a better solution.

The poly_gui emulates CLOP objects of type “data”:

- X = rand(10,5)
- Y = rand(10,1)
- D = data(X,Y) % constructor
- methods(D)
- get_X(D)
- get_Y(D)
- plot(D);

poly_ridge is a “model” object.

- P = poly_ridge; h = plot(P);
- D = gene(P); plot(D, h);
- Dt = gene(P);
- [resu, fittedP] = train(P, D);
- mse(resu)
- tresu = test(fittedP, Dt);
- mse(tresu)
- plot(P, h);

Support Vector Machines

[Figure: a linear decision boundary in the (x1, x2) plane, x = [x1, x2]. The boundary f(x) = 0 separates the region f(x) > 0 from f(x) < 0; the decision function is f(x) = Σ_{k ∈ SV} ak yk k(x, xk), the sum running over the support vectors. (Boser, Guyon & Vapnik, 1992)]

- At the prompt type: svc_gui;
- The code implements the Support Vector Machine algorithm with kernel
k(s, t) = (1 + s·t)^q exp(−γ ||s − t||²)

- Regularization similar to ridge regression:

Hinge loss: L(xi) = max(0, 1 − yi f(xi))^b

Empirical risk: Σi L(xi)

w = argmin (1/C) ||w||² + Σi L(xi)

The (1/C) ||w||² term plays the same role as the shrinkage penalty in ridge regression.

More loss functions…

[Figure: the losses below, L(y, f(x)), plotted against the margin z = y f(x); z < 0 means misclassified, z > 0 well classified. Vertical lines mark the decision boundary (z = 0) and the margin (z = 1).]

- 0/1 loss
- Perceptron loss: max(0, −z)
- SVC loss, b=1: max(0, 1 − z)
- SVC loss, b=2: max(0, 1 − z)²
- Square loss: (1 − z)²
- Logistic loss: log(1 + exp(−z))
- Adaboost loss: exp(−z)

- Linear discriminant f(x) = Σj wj xj
- Functional margin z = y f(x), y = ±1
- Compute ∂z/∂wj
- Derive the learning rules Δwj = −η ∂L/∂wj corresponding to the following loss functions:
  - SVC loss: max(0, 1 − z)
  - Adaboost loss: exp(−z)
  - Square loss: (1 − z)²
  - Logistic loss: log(1 + exp(−z))
  - Perceptron loss: max(0, −z)
- From the Δwj, derive the Δw
- w = Σi αi xi
- From the Δw, derive the Δαi of the dual algorithms.
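As a sanity check for this exercise, here is a NumPy sketch (not part of the lab code; names invented) implementing the five loss gradients for a linear discriminant and verifying the smooth ones by finite differences:

```python
import numpy as np

def loss_val(w, x, y, loss):
    """Loss as a function of the margin z = y * f(x), f(x) = w.x."""
    z = y * (w @ x)
    return {'svc': max(0.0, 1 - z), 'adaboost': np.exp(-z),
            'square': (1 - z) ** 2, 'logistic': np.log1p(np.exp(-z)),
            'perceptron': max(0.0, -z)}[loss]

def loss_grad(w, x, y, loss):
    """dL/dw = dL/dz * dz/dw, with dz/dw_j = y * x_j."""
    z = y * (w @ x)
    dLdz = {'svc': -1.0 if z < 1 else 0.0,
            'adaboost': -np.exp(-z),
            'square': -2.0 * (1 - z),
            'logistic': -np.exp(-z) / (1 + np.exp(-z)),
            'perceptron': -1.0 if z < 0 else 0.0}[loss]
    return dLdz * y * x   # the update rule is then dw = -eta * dL/dw

# Finite-difference check of the smooth losses
rng = np.random.default_rng(1)
w, x, y, eps = rng.standard_normal(4), rng.standard_normal(4), 1.0, 1e-6
for loss in ('adaboost', 'square', 'logistic'):
    g = loss_grad(w, x, y, loss)
    for j in range(4):
        e = np.zeros(4); e[j] = eps
        fd = (loss_val(w + e, x, y, loss) - loss_val(w - e, x, y, loss)) / (2 * eps)
        assert abs(fd - g[j]) < 1e-5
```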

- Modern ML algorithms optimize a penalized risk functional: an empirical risk term (the sum of the losses) plus a regularization penalty, as in the examples above.

Getting started with CLOP

CLOP tutorial

- CLOP=Challenge Learning Object Package.
- Based on the Spider developed at the Max Planck Institute.
- Two basic abstractions:
- Data object
- Model object

- Put the CLOP directory in your path.
- At the prompt type: use_spider_clop;
- If you have used poly_gui before, type
clear classes

At the Matlab prompt:

- addpath(<clop_dir>);
- use_spider_clop;
- X=rand(10,8);
- Y=[1 1 1 1 1 -1 -1 -1 -1 -1]';
- D=data(X,Y); % constructor
- [p,n]=get_dim(D)
- get_x(D)
- get_y(D)

D is a data object previously defined.

- model = kridge; % constructor
- [resu, model] = train(model, D);
- resu, model.W, model.b0
- Yhat = D.X*model.W' + model.b0
- testD = data(rand(3,8), [-1 -1 1]');
- tresu = test(model, testD);
- balanced_errate(tresu.X, tresu.Y)

A model often has hyperparameters:

- default(kridge)
- hyper = {'degree=3', 'shrinkage=0.1'};
- model = kridge(hyper);
- model = chain({standardize,kridge(hyper)});
- [resu, model] = train(model, D);
- tresu = test(model, testD);
- balanced_errate(tresu.X, tresu.Y)

Models can be chained:

- Kernel methods: kridge and svc:

k(x, y) = (coef0 + x·y)^degree exp(−gamma ||x − y||²)

kij = k(xi, xj)

kii ← kii + shrinkage

- Naïve Bayes: naive (no hyperparameters)
- Neural network: neural (units, shrinkage, maxiter)
- Random Forest: rf (Windows only; mtry)
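To make the kernel and the diagonal shrinkage concrete, here is a hypothetical NumPy sketch of how kridge-style training could look (the real kridge is a Matlab/CLOP object; all names here are invented):

```python
import numpy as np

def kernel(A, B, coef0=1.0, degree=3, gamma=0.1):
    """k(x, y) = (coef0 + x.y)^degree * exp(-gamma ||x - y||^2)."""
    dot = A @ B.T
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * dot
    return (coef0 + dot) ** degree * np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 8))
Y = np.sign(rng.standard_normal(10))

K = kernel(X, X)
K_reg = K + 0.1 * np.eye(len(X))    # k_ii <- k_ii + shrinkage
alpha = np.linalg.solve(K_reg, Y)   # dual weights of kernel ridge
Yhat = K @ alpha                    # predictions on the training set
```

The shrinkage added to the diagonal makes the kernel matrix well conditioned, exactly as the γ term does for the primal ridge solution.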

- Here are some of the pattern recognition CLOP objects:

@rf @naive [use spider @svm]

@svc @neural

@gentleboost @lssvm

@gkridge @kridge

@klogistic @logitboost

- Try at the prompt example(neural)
- Try other pattern recognition objects
- Try different sets of hyperparameters, e.g., example(kridge({'gamma=1', 'shrinkage=0.001'}))
- Remember: use default(method) to get the HP.

Example:

Digit Recognition

Subset of the MNIST data of LeCun and Cortes used for the NIPS 2003 challenge.

% Go to the Gisette directory:

- cd('GISETTE')
% Load “validation” data:

- Xt=load('gisette_valid.data');
- Yt=load('gisette_valid.labels');
% Create a data object

% and examine it:

- Dt=data(Xt, Yt);
- browse(Dt, 2);
% Load “training” data (longer):

- X=load('gisette_train.data');
- Y=load('gisette_train.labels');
- [p, n]=get_dim(Dt);
- D=train(subsample(['p_max=' num2str(p)]), data(X, Y));
- clear X Y Xt Yt
% Save for later use:

- save('gisette', 'D', 'Dt');

% Define some hyperparameters:

- hyper = {'degree=3', 'shrinkage=0.1'};
% Create a kernel ridge

% regression model:

- model = kridge(hyper);
% Train it and test it:

- [resu, Model] = train(model, D);
- tresu = test(Model, Dt);
% Visualize the results:

- roc(tresu);
- idx=find(tresu.X.*tresu.Y<0);
- browse(get(D, idx), 2);

- Here are some pattern recognition CLOP objects:
@rf @naive @gentleboost

@svc @neural @logitboost

@kridge @lssvm @klogistic

- Instantiate a model with some hyperparameters (use default(method) to get the HP)
- Vary the HP and the number of training examples (Hint: use get(D, 1:n) to restrict the data to n examples).

% Combine preprocessing and kernel ridge regression:

- my_prepro=normalize;
- model = chain({my_prepro,kridge(hyper)});
% Combine replicas of a base learner:

- for k=1:10
- base_model{k}=neural;
- end
- model=ensemble(base_model);

ensemble({model1, model2,…})
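The effect of ensemble can be sketched outside CLOP as well: average the real-valued outputs of base models trained on perturbed versions of the data. A NumPy toy example (invented data; bootstrap-resampled least-squares models stand in for the neural replicas):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
true_w = rng.standard_normal(5)
Y = np.sign(X @ true_w)            # toy binary labels

# "Replicas of a base learner": least-squares fits on bootstrap samples
models = []
for k in range(10):
    idx = rng.integers(0, len(X), len(X))          # bootstrap resample
    w, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
    models.append(w)

# The ensemble output is the average of the base model outputs
f_ens = np.mean([X @ w for w in models], axis=0)
accuracy = np.mean(np.sign(f_ens) == Y)
```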

- Here are some preprocessing CLOP objects:
@normalize @standardize @fourier

- Chain a preprocessing and a model, e.g.,
- model=chain({fourier, kridge('degree=3')});
- my_classif=svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'});
- model=chain({normalize, my_classif});
- Train, test, visualize the results. Hint: you can browse the preprocessed data:
- browse(train(standardize, D), 2);

% After creating your complex model, just one command: train

- model=ensemble({chain({standardize,kridge(hyper)}),chain({normalize,naive})});
- [resu, Model] = train(model, D);
% After training your complex model, just one command: test

- tresu = test(Model, Dt);
% You can use a “cv” object to perform cross-validation:

- cv_model=cv(model);
- [resu, Model] = train(cv_model, D);
- roc(resu);
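What the cv object does can be sketched in a few lines: split the data into k folds, train on k−1 folds, test on the held-out one, and average. A minimal NumPy version with ridge regression standing in for the model (all data and names invented):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(30)

k = 5
folds = np.array_split(rng.permutation(len(X)), k)
fold_mse = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    A, b = X[train_idx], Y[train_idx]
    w = np.linalg.solve(A.T @ A + 0.1 * np.eye(4), A.T @ b)   # ridge fit
    fold_mse.append(np.mean((Y[test_idx] - X[test_idx] @ w) ** 2))
cv_mse = np.mean(fold_mse)          # cross-validated error estimate
```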

Getting started with

Feature Selection

- clear classes
- poly_gui;
- Check the “Multiplicative updates” (MU) box.
- Play with the parameters.
- Try CV
- Compare with no MU

Exploring

feature selection methods

% Start CLOP:

- clear classes
- use_spider_clop;
% Go to the Gisette directory:

- cd('GISETTE')
- load('gisette');

1) Create a heatmap of the data matrix or a subset:

show(D);

show(get(D,1:10, 1:2:500));

2) Look at individual patterns:

browse(D);

browse(D, 2); % For 2d data

% Display feature positions:

browse(D, 2, [212, 463, 429, 239]);

3) Make a scatter plot of a few features:

scatter(D, [212, 463, 429, 239]);

- my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
- my_classif=svm('optimizer=''andre''');
- my_classif.algorithm.use_signed_output=0;
- model=chain({normalize, s2n('f_max=100'), my_classif});
- [resu, Model] = train(model, D);
- tresu = test(Model, Dt);
- roc(tresu);
% Show the misclassified first

- [s,idx]=sort(tresu.X.*tresu.Y);
- browse(get(Dt, idx), 2, Model{2});

Univariate:

- @s2n (Signal to noise ratio.)
- @Ttest (T statistic; similar to s2n.)
- @Pearson (Uses Matlab corrcoef. Gives the same results as Ttest when the classes are balanced.)
- @aucfs (ranksum test)
Multivariate:

- @relief (no elimination of redundancy)
- @gs (Gram-Schmidt orthogonalization; complementary features)
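For concreteness, the univariate s2n criterion can be sketched in NumPy (a plausible reading of the signal-to-noise ratio, |μ+ − μ−| / (σ+ + σ−) per feature; the actual CLOP s2n object may differ in details):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
Y = np.sign(rng.standard_normal(100))
X[:, 3] += 2.0 * Y                  # plant one informative feature

pos, neg = X[Y > 0], X[Y < 0]
s2n = np.abs(pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0))
ranking = np.argsort(-s2n)          # best features first
```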

- Change the feature selection algorithm
- Visualize the features
- What can you say of the various methods?
- Which one gives the best results for 2, 10, 100 features?
- Can you improve by changing the preprocessing? (Hint: try @pc_extract)

Feature significance

[Figure: the class-conditional densities P(Xi|Y=1) and P(Xi|Y=−1) of a feature xi, with class means μ+ and μ− and standard deviations σ+ and σ−.]

- Normally distributed classes, equal variance σ² unknown; estimated from the data as σ²within.
- Null hypothesis H0: μ+ = μ−
- T statistic: if H0 is true,

t = (μ+ − μ−) / (σwithin √(1/m+ + 1/m−)) ~ Student(m+ + m− − 2 d.f.)

where m+ and m− are the class sizes.
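The statistic can be computed directly. A NumPy sketch with invented data (one feature, two classes whose means differ by one standard deviation):

```python
import numpy as np

rng = np.random.default_rng(0)
xp = 1.0 + rng.standard_normal(30)   # feature values in class +1 (m+ = 30)
xn = rng.standard_normal(40)         # feature values in class -1 (m- = 40)

mp, mn = len(xp), len(xn)
# Pooled within-class standard deviation
s_within = np.sqrt(((mp - 1) * xp.var(ddof=1) + (mn - 1) * xn.var(ddof=1))
                   / (mp + mn - 2))
t = (xp.mean() - xn.mean()) / (s_within * np.sqrt(1 / mp + 1 / mn))
# Under H0, t follows a Student distribution with m+ + m- - 2 = 68 d.f.
```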

- Ttest object:
- computes pval analytically
- FDR~pval*nsc/n

- probe object:
- takes any feature ranking object as an argument (e.g. s2n, relief, Ttest)
- pval~nsp/np
- FDR~pval*nsc/n
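The probe idea is simple enough to sketch: append random "probe" features, rank everything together, and use the probes as a null distribution. A NumPy illustration (invented data; an s2n-style criterion stands in for the wrapped ranking object):

```python
import numpy as np

rng = np.random.default_rng(0)
n_real, n_probe = 50, 500
X = rng.standard_normal((100, n_real))
Y = np.sign(rng.standard_normal(100))
X[:, 0] += 2.0 * Y                       # one truly informative feature
probes = rng.standard_normal((100, n_probe))
Xall = np.hstack([X, probes])

pos, neg = Xall[Y > 0], Xall[Y < 0]
score = np.abs(pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0))

order = np.argsort(-score)               # best scores first
is_probe = order >= n_real
n_probes_above = np.cumsum(is_probe)     # probes ranked above each position
pvals = {}                               # pval ~ nsp / np for real features
for rank, idx in enumerate(order):
    if idx < n_real:
        pvals[idx] = n_probes_above[rank] / n_probe
```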

[Figure: FDR as a function of feature rank; FDR ranges from 0 to 1 over ranks 0 to 5000.]

- [resu, FS] = train(Ttest, D);
- [resu, PFS] = train(probe(Ttest), D);
- figure('Name', 'pvalue');
- plot(get_pval(FS, 1), 'r');
- hold on; plot(get_pval(PFS, 1));
- figure('Name', 'FDR');
- plot(get_fdr(FS, 1), 'r');
- hold on; plot(get_fdr(PFS, 1));

- What could explain differences between the pvalue and fdr with the analytic and probe method?
- Replace Ttest with chain({rmconst('w_min=0'), Ttest})
- Recompute the pvalue and fdr curves. What do you notice?
- Choose an optimum number fnum of features based on pvalue or FDR. Visualize with browse(D, 2,FS, fnum);
- Create a model with fnum. Is fnum optimal? Do you get something better with CV?

Local feature selection

Consider the 1 nearest neighbor algorithm. We define the following score:

where s(k) (resp. d(k)) is the index of the nearest neighbor of xk belonging to the same class (resp. a different class) as xk.

- Motivate the choice of such a cost function to approximate the generalization error (qualitative answer)
- How would you derive an embedded method to perform feature selection for 1 nearest neighbor using this functional?
- Motivate your choice (what makes your method an ‘embedded method’ and not a ‘wrapper’ method)

Relief = ⟨Dmiss/Dhit⟩ (averaged over all examples)

Local_Relief = Dmiss/Dhit (computed per example)

[Figure: for each example, Dhit is the distance to its nearest hit (nearest neighbor of the same class) and Dmiss the distance to its nearest miss (nearest neighbor of the other class).]
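The two scores can be sketched directly. A NumPy toy example (invented data; Dhit and Dmiss as defined above, with the global Relief averaging the per-example ratio):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
Y = np.sign(rng.standard_normal(40))
X[:, 0] += 1.5 * Y                     # separate the classes along x1

# Pairwise distances; a point is never its own neighbor
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(D, np.inf)

local_relief = []
for k in range(len(X)):
    same = Y == Y[k]
    d_hit = D[k, same].min()           # distance to the nearest hit
    d_miss = D[k, ~same].min()         # distance to the nearest miss
    local_relief.append(d_miss / d_hit)
relief = np.mean(local_relief)         # global Relief = <Dmiss/Dhit>
```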

- [resu, FS] = train(relief, D);
- browse(D, 2, FS, 20);
- [resu, LFS] = train(local_relief, D);
- browse(D, 2, LFS, 20);

- Propose a modification of the nearest neighbor algorithm that uses features relevant to individual patterns (like those provided by "local_relief").
- Would you expect such an algorithm to perform better than the non-local version using "relief"?

Becoming a pro and

playing with

other datasets

Feature selection, pre- and post- processing

Basic learning machines

Compound models

- Challenges in
- Feature selection
- Performance prediction
- Model selection
- Causality

- Large datasets

- Class taught at ETH, Zurich, winter 2005
- Task of the students:
- Baseline method provided, BER0 performance and n0 features.
- Get BER<BER0 or BER=BER0 but n<n0.
- Extra credit for beating best challenge entry.

NIPS 2003 Feature Selection Challenge datasets:

Dataset   Size     Type            Features  Training  Validation  Test
Arcene    8.7 MB   Dense           10000     100       100         700
Gisette   22.5 MB  Dense           5000      6000      1000        6500
Dexter    0.9 MB   Sparse integer  20000     300       300         2000
Dorothea  4.7 MB   Sparse binary   100000    800       350         800
Madelon   2.9 MB   Dense           500       2000      600         1800

Example DEXTER document: "NEW YORK, October 2, 2001 – Instinet Group Incorporated (Nasdaq: INET), the world's largest electronic agency securities broker, today announced tha"

Baseline models:

DEXTER: Best BER = 3.30±0.40%, n0 = 300 (1.5%), BER0 = 5%
my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model=chain({s2n('f_max=300'), normalize, my_classif})

GISETTE: Best BER = 1.26±0.14%, n0 = 1000 (20%), BER0 = 1.80%
my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model=chain({normalize, s2n('f_max=1000'), my_classif});

ARCENE: Best BER = 11.9±1.2%, n0 = 1100 (11%), BER0 = 14.7%
my_svc=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.1'});
my_model=chain({standardize, s2n('f_max=1100'), normalize, my_svc})

MADELON: Best BER = 6.22±0.57%, n0 = 20 (4%), BER0 = 7.33%
my_classif=svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
my_model=chain({probe(relief,{'p_num=2000', 'pval_max=0'}), standardize, my_classif})

DOROTHEA: Best BER = 8.54±0.99%, n0 = 1000 (1%), BER0 = 12.37%
my_model=chain({TP('f_max=1000'), naive, bias});

Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark, Isabelle Guyon, Jiwen Li, Theodor Mader, Patrick A. Pletscher, Georg Schneider and Markus Uhr, Pattern Recognition Letters, Volume 28, Issue 12, 1 September 2007, Pages 1438-1444.

NIPS 2003 Feature Selection Challenge

NIPS 2006 Model Selection Game

Datasets:

Dataset  Domain               Feature #  Training #  Validation #  Test #
ADA      Marketing            48         4147        415           41471
GINA     Digit recognition    970        3153        315           31532
HIVA     Drug discovery       1617       3845        384           38449
NOVA     Text classification  16969      1754        175           17537
SYLVA    Ecology              216        13086       1309          130857

Example NOVA document: "Subject: Re: Goalie masks / Lines: 21 / Tom Barrasso wore a great mask, one time, last season. It was all black, with Pgh city scenes on it. The "Golden Triangle" graced the top, along with a steel mill on one side and the Civic Arena on the other. On the back of the helmet was the old Pens' logo the current (at the time) Pens logo, and a space for the "new" logo. Lori"

First place: Juha Reunanen, cross-indexing-7

Dataset  CLOP models selected
ADA      2*{sns,std,norm,gentleboost(neural),bias}; 2*{std,norm,gentleboost(kridge),bias}; 1*{rf,bias}
GINA     6*{std,gs,svc(degree=1)}; 3*{std,svc(degree=2)}
HIVA     3*{norm,svc(degree=1),bias}
NOVA     5*{norm,gentleboost(kridge),bias}
SYLVA    4*{std,norm,gentleboost(neural),bias}; 4*{std,neural}; 1*{rf,bias}

Second place: Hugo Jair Escalante Balderas, BRun2311062

Dataset  CLOP models selected
ADA      {sns, std, norm, neural(units=5), bias}
GINA     {norm, svc(degree=5, shrinkage=0.01), bias}
HIVA     {std, norm, gentleboost(kridge), bias}
NOVA     {norm, gentleboost(neural), bias}
SYLVA    {std, norm, neural(units=1), bias}

sns = shift'n'scale, std = standardize, norm = normalize (some details of hyperparameters not shown)

Note: entry Boosting_1_001_x900 gave better results, but was older.

Proc. IJCNN07, Orlando, FL, Aug. 2007:
- PSMS for Neural Networks, H. Jair Escalante, Manuel Montes y Gómez, and Luis Enrique Sucar
- Model Selection and Assessment Using Cross-indexing, Juha Reunanen