Feature Selection and Bioinformatics Applications

Presentation Transcript


Feature Selection and Bioinformatics Applications

Isabelle Guyon



Part I

INTRODUCTION


Objectives

[Diagram: Input x → Predictor f(x) → Output y]

  • Reduce the number of features as much as possible without significantly degrading prediction performance.

  • Possibly improve prediction performance.

  • Gain insight.


Applications

[Chart: application domains placed by number of training examples (vertical axis, 10 to 10^5) versus number of inputs (horizontal axis, 10 to 10^5): High Energy Physics, Market Analysis, OCR, HWR, Machine Vision, Text Categorization, Genomics, System Diagnosis, Bioinformatics, Proteomics.]



This talk:

  • Simple is beautiful, but some (moderate) sophistication is needed.

  • “Classical statistics” is pessimistic: it advocates the simplest methods to overcome the curse of dimensionality.

  • Modern statistical methods from soft computing and machine learning provide the necessary additional sophistication and still defeat the curse of dimensionality.



Part II

PROBLEM STATEMENT


Correlation Analysis

Rank each feature (gene) {xik}, k = 1…num_patients, by its correlation with the target labels {yk} (equivalently, by anti-correlation with {-yk}), using the class means m+ and m- and standard deviations s+ and s-.

[Figure: heat map of the top 25 positively correlated and the top 25 negatively correlated features (genes).]

38 training examples (27 ALL, 11 AML); 34 test examples (20 ALL, 14 AML).

Golub et al., Science, Vol. 286, 15 Oct. 1999.
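
The slide does not spell out the exact ranking criterion; a minimal sketch in the spirit of Golub et al.'s per-gene signal-to-noise statistic, (m+ - m-) / (s+ + s-), could look like the following (Python/NumPy; the golub_score helper, the random data, and the array shapes are illustrative assumptions, not the authors' code):

```python
import numpy as np

def golub_score(X, y):
    """Per-feature signal-to-noise statistic (m+ - m-) / (s+ + s-).
    X: (num_patients, num_genes) expression matrix; y: labels in {-1, +1}."""
    pos, neg = X[y == +1], X[y == -1]
    m_pos, m_neg = pos.mean(axis=0), neg.mean(axis=0)
    s_pos, s_neg = pos.std(axis=0), neg.std(axis=0)
    return (m_pos - m_neg) / (s_pos + s_neg + 1e-12)   # epsilon avoids division by zero

# Illustrative data with the Golub training-set shape: 38 patients (27 ALL, 11 AML).
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 7129))
y = np.array([+1] * 27 + [-1] * 11)

scores = golub_score(X, y)
order = np.argsort(scores)
top_25_positive = order[-25:][::-1]   # most positively correlated genes
top_25_negative = order[:25]          # most negatively correlated genes
```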


Yes, but ...

[Figure: class distributions annotated with the class means m-, m+ and standard deviations s-, s+ (two panels).]


I.I.D. Features

[Figure: two-feature scatter plot.]


I.I.D. Features

[Figure: two-feature scatter plot, with the class means m- and m+ marked.]

Smaller Win

[Figure: two-feature scatter plot.]


Bigger Win

[Figure: two-feature scatter plot.]



Example from Real Data



Explanation:

F1: The peak of interest.

F2: The best local estimate of the baseline.


Two “Useless” Features

[Figure: two-feature scatter plot.]

Axis projections do not help in finding good features.


Higher-Dimension Problem

Even two-dimensional projections may not help in finding good features.


Part III

ALGORITHMS


Main Goal

[Diagram: Predictor f(x) → Output]

Main goal: rank subsets of useful features.

Sub-goals:

- Eliminate useless features (distracters).

- Rank useful features.

- Eliminate redundant features.


Filters and Wrappers

  • Main goal: rank subsets of useful features.

  • Danger of overfitting: greedy search often works better.

[Diagram. Filter: all features → filter → feature subset → predictor. Wrapper: all features → multiple feature subsets, each evaluated by the predictor.]
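
To make the two architectures concrete, here is a rough sketch (not the specific methods benchmarked later in the talk): a filter scores features without consulting the predictor, while a wrapper lets the predictor itself score candidate subsets. The correlation criterion, the logistic-regression predictor, and the random candidate subsets are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def filter_select(X, y, k):
    """Filter: score each feature independently of the predictor
    (here, absolute Pearson correlation with the target) and keep the top k."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(scores)[-k:]

def wrapper_select(X, y, k, n_candidates=50, seed=0):
    """Wrapper: ask the predictor to evaluate candidate feature subsets
    (cross-validated accuracy) and keep the best. Random candidates here;
    greedy search (next slides) is the usual, less overfitting-prone choice."""
    rng = np.random.default_rng(seed)
    best_subset, best_score = None, -np.inf
    for _ in range(n_candidates):
        subset = rng.choice(X.shape[1], size=k, replace=False)
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, subset], y, cv=3).mean()
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset
```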



Nested Subset Methods

Nested subset methods perform a greedy search:

- At each step add or remove a single feature to best improve (or least degrade) the cost function.

- Backward elimination:

Start with all features, progressively remove (never add). Example: RFE (Guyon, Weston, et al., 2002).

- Forward selection:

Start with an empty set, progressively add (never remove). Example: Gram-Schmidt orthogonalization (Stoppiglia et al., 2003; Rivals and Personnaz, 2003).
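
A skeleton of the greedy nested-subset search just described, in the forward direction; the cost J below (a cross-validated linear model) is an illustrative stand-in for whatever cost function the predictor provides, and backward elimination is the mirror image, removing the feature whose removal least degrades J.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cost_J(X, y, subset):
    """Illustrative cost: negative cross-validated accuracy of a linear
    model restricted to the candidate subset (lower is better)."""
    return -cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, sorted(subset)], y, cv=3).mean()

def forward_selection(X, y, n_features):
    """At each step, add the single feature that best improves
    (or least degrades) the cost function J."""
    selected, remaining = [], set(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        best = min(remaining, key=lambda j: cost_J(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```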



Backward elimination: RFE

Improve (or least degrade) the cost function J:

  • Exact or approximate calculation of the difference ΔJ = J(feat+1) − J(feat).

  • RFE with a linear predictor f(x) = w·x + b: eliminate the feature with the smallest w_i² (Guyon, Weston, et al., 2002).

  • Zero norm / multiplicative updates (MU): rescale the inputs by |w_i| at each iteration (Weston, Elisseeff, et al., 2003).

  • Non-linear RFE and non-linear MU: estimate (ΔJ)_i ≈ αᵀ H^(i) α (α: dual coefficients of the trained predictor; H^(i): the corresponding kernel/Hessian matrix recomputed without feature i).
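
A minimal sketch of the linear-RFE rule in the second bullet: retrain a linear predictor and discard the feature with the smallest squared weight at each pass. The choice of LinearSVC, its hyperparameters, and the one-feature-per-step elimination are illustrative assumptions; scikit-learn also packages this scheme as sklearn.feature_selection.RFE.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rfe_linear(X, y, n_keep):
    """Recursive Feature Elimination with a linear predictor f(x) = w.x + b:
    repeatedly retrain and eliminate the feature with the smallest w_i^2."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        clf = LinearSVC(C=1.0, max_iter=10000).fit(X[:, remaining], y)
        weakest = int(np.argmin(clf.coef_.ravel() ** 2))   # position within `remaining`
        del remaining[weakest]
    return remaining   # indices of the surviving features
```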


Forward selection: Gram-Schmidt

Feature ranking in the context of others

  • Vanilla (linear) GS: At every iteration, project onto null space of features already selected; select feature most correlated with target.

  • Relief (Kira and Rendell, 1992):

  • GS-Relief combination (Guyon, 2003).
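
A sketch of the vanilla (linear) Gram-Schmidt forward selection in the first bullet; the centering step and the fixed number of features to select are illustrative assumptions, and the Relief and GS-Relief variants are not shown.

```python
import numpy as np

def gram_schmidt_forward(X, y, n_select):
    """At every iteration, project the candidate features and the target onto
    the null space of the features already selected (Gram-Schmidt step),
    then pick the feature most correlated with the residual target."""
    X = X - X.mean(axis=0)              # work on centered copies
    y = y.astype(float) - y.mean()
    selected = []
    for _ in range(n_select):
        corr = np.abs(X.T @ y) / (np.linalg.norm(X, axis=0) * np.linalg.norm(y) + 1e-12)
        corr[selected] = -1.0           # never re-select a feature
        best = int(np.argmax(corr))
        selected.append(best)
        q = X[:, best] / (np.linalg.norm(X[:, best]) + 1e-12)
        X = X - np.outer(q, q @ X)      # remove the selected direction from X ...
        y = y - q * (q @ y)             # ... and from the target
    return selected
```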



Part IV

EXPERIMENTS



Mass Spectrometry Experiments

In collaboration with Biospect Inc., 2003

Data from Cancer Research, Adam et al., 2002

[Figure: time-of-flight (TOF) mass spectrometry.]

- EVMS prostate cancer data: 326 samples (167 cancer, 159 control).

- Preprocessing including restriction to m/z 200-10000 and baseline removal.

- Data split into 3 equal parts; 3 experiments run with 2/3 for training and 1/3 for testing.

- Forty-four methods tried.
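
The preprocessing bullet above only names the steps; a toy sketch of what restriction to m/z 200-10000 plus a simple baseline subtraction might look like follows. The running-minimum baseline estimate and the window size are illustrative assumptions, not the procedure actually used in these experiments.

```python
import numpy as np

def preprocess_spectrum(mz, intensity, mz_min=200.0, mz_max=10000.0, window=101):
    """Keep the m/z 200-10000 range, then subtract a crude baseline estimated
    as a running minimum of the intensity over a sliding window."""
    keep = (mz >= mz_min) & (mz <= mz_max)
    mz, intensity = mz[keep], intensity[keep]
    half = window // 2
    padded = np.pad(intensity, half, mode="edge")
    baseline = np.array([padded[i:i + window].min() for i in range(intensity.size)])
    return mz, np.clip(intensity - baseline, 0.0, None)
```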



Method Comparison: 100 Features

...

Non-linear multivariate > Linear multivariate > Linear univariate



Method Comparison: 7 Features

...

Non-linear multivariate > Linear multivariate > Linear univariate



Part V

CONCLUSION



Experimental Results

In spite of the risk of overfitting ...

  • Subset selection methods can outperform single feature ranking by correlation with the target.

  • Non-linear feature selection can outperform linear feature selection.


… in prediction performance and number of features.



Which method works best?

See the results of the NIPS 2003 competition.

Presentation on December 19th.

See also:

JMLR special issue:

www.jmlr.org/papers/special/feature.html

I. Guyon and A. Elisseeff, editors, March 2003.

Workshop website:

www.clopinet.com/isabelle/Projects/NIPS2003

Acknowledgements: Masoud Nikravesh

