Molecular Signaling & Drug Development Course: Development of Molecular Signatures from High-Throughput Assay Data

Alexander Statnikov, Ph.D.

Director, Computational Causal Discovery Laboratory

Benchmarking Director, Best Practices Integrative Informatics Consultation Service

Assistant Professor, Department of Medicine, Division of Clinical Pharmacology

Center for Health Informatics and Bioinformatics, NYU School of Medicine

5/16/2011

Outline
  • Part 1: Introduction to molecular signatures
  • Part 2: Key principles for developing accurate molecular signatures
  • Part 3: Comprehensive evaluation of algorithms to develop molecular signatures for cancer classification
  • Part 4: Analysis and computational dissection of molecular signature multiplicity
  • Conclusion
  • Homework assignment
Definition of a molecular signature

A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.

FDA view on molecular signatures

The FDA calls them “in vitro diagnostic multivariate index assays”

1. “Class II Special Controls Guidance Document: Gene Expression Profiling Test System for Breast Cancer Prognosis”:

  • Addresses device classification.

2. “The Critical Path to New Medical Products”:

  • Identifies pharmacogenomics as crucial to advancing medical product development and personalized medicine.

3. “Draft Guidance on Pharmacogenetic Tests and Genetic Tests for Heritable Markers” & “Guidance for Industry: Pharmacogenomic Data Submissions”

  • Identifies 3 main goals (dose, ADEs, responders),
  • Defines IVDMIA,
  • Encourages “fault-free” sharing of pharmacogenomic data,
  • Separates “probable” from “valid” biomarkers,
  • Focuses on genomics (and not other omics).
Main uses of molecular signatures
  • Direct benefits: Models of disease phenotype/clinical outcome
    • Diagnosis
    • Prognosis, long-term disease management
    • Personalized treatment (drug selection, titration)
  • Ancillary benefits 1: Biomarkers for diagnosis or outcome prediction
    • Make the above tasks resource-efficient and easy to use in clinical practice
    • Help next-generation molecular imaging
    • Leads for potential new drug candidates
  • Ancillary benefits 2: Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)
    • Leads for potential new drug candidates
Less conventional uses of molecular signatures
  • Increase clinical trial sample efficiency, decrease costs, or both, using placebo responder signatures;
  • In silico signature-based candidate drug screening;
  • Drug “resurrection”;
  • Establishing existence of biological signal in very small sample situations where univariate signals are too weak;
  • Assess the importance of markers and of the mechanisms that involve them;
  • Choosing the right animal model;
  • …?
Recent molecular signatures available for patient care

  • Agendia
  • Clarient
  • Prediction Sciences
  • LabCorp
  • University Genomics
  • Genomic Health
  • Veridex
  • BioTheranostics
  • Applied Genomics
  • Power3
  • OvaSure
  • Correlogic Systems


MammaPrint

• Developed by Agendia (www.agendia.com)

• 70-gene signature to stratify women with non-metastatic breast cancer into “low risk” and “high risk” for recurrence of the disease

• Independently validated in >1,000 patients

• So far performed >10,000 tests

• Cost of the test is ~$3,000

• In February 2007, the FDA cleared the MammaPrint test for marketing in the U.S. for node-negative women under 61 years of age with tumors of less than 5 cm.

• TIME Magazine’s 2007 “medical invention of the year”.

Oncotype DX

• Developed by Genomic Health (www.genomichealth.com)

• 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse

• Independently validated in >1,000 patients

• So far performed >50,000 tests

• Cost of the test is ~$3,000

• The following paper demonstrates the health and cost-effectiveness benefits of using Oncotype DX: http://www3.interscience.wiley.com/cgi-bin/abstract/114124513/ABSTRACT

Main ingredients for developing a molecular signature

[Diagram: a well-defined clinical problem & access to patients/samples, plus high-throughput assays, feed into computational & biostatistical analysis, which produces the molecular signature]

Challenges in computational analysis of omics data
  • Relatively easy to develop a predictive model and even easier to believe that a model is good when it is not → false sense of security
  • Several problems exist: some theoretical and some practical
  • Omics data has many special characteristics and is tricky to analyze!
Example: OvaCheck (1/2)
  • Developed by Correlogic (www.correlogic.com)
  • Blood test for the early detection of epithelial ovarian cancer
  • Failed to obtain FDA approval
  • Looks for subtle changes in patterns among the tens of thousands of proteins, protein fragments and metabolites in the blood
  • Signature developed by genetic algorithm
  • Significant artifacts in data collection & analysis questioned validity of the signature:
    • Results are not reproducible
    • Data collected differently for different groups of patients

http://www.nature.com/nature/journal/v429/n6991/full/429496a.html

Example: OvaCheck (2/2)

[Figure: panels A-F, from Baggerly et al. (Bioinformatics, 2004)]

E.g., for classification (predict response to treatment)

[Scatter plot: patients plotted by p53 and Rb expression; a decision surface separates patients who respond to treatment Tx1 from those who do not]

Another use of clustering
  • Cluster genes (instead of patients):
    • Genes that cluster together may belong to the same pathways
    • Genes that cluster apart may be unrelated
Unfortunately, clustering is a non-specific method and falls into the ‘one solution fits all’ trap when used for classification.

[Scatter plot: patients plotted by p53 and Rb expression, labeled squamous carcinoma vs. adenocarcinoma]

Clustering is also non-specific when used to discover pathways, or other mechanistic relationships

It is entirely possible in this simple illustrative counter-example for G3 (a gene causally unrelated to the phenotype) to be more strongly associated with, and thus cluster with, the phenotype (or its surrogate genes) than the true causal oncogenes G1 and G2.

[Network diagram: genes G1 and G2 causally influence phenotype Ph; G3 is associated with Ph but causally unrelated]

Two improved classes of methods
  • Supervised learning → classification/molecular signatures and markers
  • Regulatory network reverse engineering → pathways

Supervised learning: Use the known phenotypes (a.k.a. “class labels”) in training data to build signatures or find markers highly specific for that phenotype

[Workflow diagram: training samples with variables A, B, C, D and target T (rows A1, B1, C1, D1, T1 through An, Bn, Cn, Dn, Tn) are fed into a classifier/regression algorithm, which outputs a molecular signature; the signature is then applied to testing/validation samples to estimate classification performance]


Input data for supervised learning methods

[Table: rows are samples, each with a class label (Primary or Metastatic); columns are variables/features, e.g., gene expression measurements]
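To make the data layout concrete, here is a minimal sketch (the numbers are made up for illustration, not the slide's values) of how such a table is typically represented for a learning algorithm: a numeric matrix X of samples × features and a label vector y.

```python
import numpy as np

# Hypothetical stand-in for the slide's table: each row is one sample
# (patient), each column one variable/feature (e.g., a gene's expression).
X = np.array([
    [2.1, 0.3, 5.6, 1.2],   # sample 1
    [1.9, 4.2, 0.8, 3.3],   # sample 2
    [2.4, 0.1, 6.0, 0.9],   # sample 3
    [1.7, 3.9, 1.1, 3.6],   # sample 4
])

# Class label for each sample (the phenotype to be predicted).
y = np.array(["Primary", "Metastatic", "Primary", "Metastatic"])

print(X.shape)  # (4, 4): 4 samples, 4 features
```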


Principles and geometric representation for supervised learning (1/7)

  • Want to classify objects as boats and houses.

Principles and geometric representation for supervised learning (2/7)

  • All objects before the coast line are boats and all objects after the coast line are houses.
  • The coast line serves as a decision surface that separates the two classes.

Principles and geometric representation for supervised learning (3/7)

[Map: with the coast line as the decision surface, a few boats end up on the wrong side and will be misclassified as houses, and one house will be misclassified as a boat]


Principles and geometric representation for supervised learning (4/7)

[Scatter plot: each boat and house becomes a point in a latitude/longitude coordinate system]

  • The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example.
  • First all objects are represented geometrically.

Principles and geometric representation for supervised learning (5/7)

[Scatter plot: a linear decision surface separating boats from houses in latitude/longitude space]

Then the algorithm seeks to find a decision surface that separates classes of objects


Principles and geometric representation for supervised learning (6/7)

[Scatter plot: new, unlabeled objects (shown as “?”) on both sides of the decision surface; those above it are classified as houses, those below it as boats]

Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if they fall above it


Principles and geometric representation for supervised learning (7/7)

[Scatter plot: three new objects (Object #1, #2, #3) positioned relative to the decision surface]

In 2-D this looks simple but what happens in higher dimensional data…
  • 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)
  • >500,000 (tiled microarrays, SNP arrays)
  • 10,000-300,000 (regular MS proteomics)
  • >10,000,000 (LC-MS proteomics)
  • >100,000,000 (next-generation sequencing)

This is the ‘curse of dimensionality’ problem

High-dimensionality (especially with small samples) causes:
  • Some methods do not run at all (e.g., classical regression)
  • Some methods give bad results (KNN, decision trees)
  • Very slow analysis
  • Very expensive/cumbersome clinical application
  • A tendency to “overfit”
Two problems: Over-fitting & Under-fitting
  • Over-fitting (a model to your data) = building a model that performs well on the original data but fails to generalize to new/unseen data
  • Under-fitting (a model to your data) = building a model that performs poorly on both the original data and new/unseen data
Over/under-fitting are related to complexity of the decision surface and how well the training data is fit

[Plot: outcome of interest Y vs. predictor X, showing training data and future data; a moderately complex curve (“this line is good!”) follows both, while a highly complex curve (“this line overfits!”) tracks the training data closely but misses the future data]

[Plot: same axes; an overly simple line (“this line underfits!”) fits neither the training data nor the future data well]

Very important concept…
  • Successful data analysis methods balance training data fit with complexity:
    • Too complex a signature (to fit training data well) → overfitting (i.e., the signature does not generalize);
    • Too simplistic a signature (to avoid overfitting) → underfitting (it will generalize, but the fit to both training and future data will be low and predictive performance poor).
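A minimal sketch of this trade-off on synthetic 1-D data (all values are made up for illustration): a degree-1 polynomial underfits, a high-degree polynomial overfits, and a moderate degree balances fit and complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical smooth relationship between predictor X and outcome Y, plus noise.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = make_data(20)    # small training sample
x_test, y_test = make_data(200)     # stands in for "future data"

for degree in (1, 3, 15):           # too simple, balanced, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# Typical outcome: degree 1 underfits (both errors high); degree 15 overfits
# (train error near zero, test error high); degree 3 does well on both.
```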
The Support Vector Machine (SVM) approach for building molecular signatures
  • The support vector machine (SVM) is a binary classification algorithm.
  • SVMs are important because of (a) theoretical reasons:
    • Robust to very large number of variables and small samples
    • Can learn both simple and highly complex classification models
    • Employ sophisticated mathematical principles to avoid overfitting

and (b) superior empirical results.

Main ideas of SVMs (1/3)

[Scatter plot: normal patients and cancer patients represented as points by expression of gene X and gene Y]

  • Consider an example dataset described by 2 genes, gene X and gene Y
  • Represent patients geometrically (by “vectors”)
Main ideas of SVMs (2/3)

[Scatter plot: a linear decision surface with the largest possible gap (margin) separating normal patients from cancer patients in gene X / gene Y space]

  • Find a linear decision surface (“hyperplane”) that can separate patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”);
Main ideas of SVMs (3/3)
  • If such a linear decision surface does not exist, the data is mapped into a much higher-dimensional space (“feature space”) where the separating decision surface is found;
  • The feature space is constructed via very clever mathematical projection (“kernel trick”).
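As an illustration, here is a minimal scikit-learn sketch of both ideas on a synthetic two-gene dataset (the data and parameter values are placeholders, not from the study): a linear SVM finds a maximum-margin hyperplane, and an RBF-kernel SVM handles the case where no good linear surface exists.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for a "gene X / gene Y" dataset with two patient classes.
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Linear SVM: finds the separating hyperplane with the largest margin.
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", linear_svm.support_vectors_.shape[0])  # border-line patients

# Kernel SVM: the RBF kernel implicitly maps the data into a high-dimensional
# feature space where a separating surface is sought (the "kernel trick").
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", rbf_svm.score(X, y))
```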
On estimation of signature accuracy

Large sample case: use hold-out validation.

[Diagram: the data is split once into a train portion and a test portion]

Small sample case: use N-fold cross-validation.

[Diagram: the data is split into N folds; in each of the N rounds a different fold serves as the test set and the remaining folds are used for training, so every sample is tested exactly once]
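A minimal sketch of both designs with scikit-learn (a bundled dataset is used purely as a placeholder for real assay data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # placeholder data

# Large sample case: hold-out validation (single train/test split).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print("hold-out accuracy:", clf.score(X_te, y_te))

# Small sample case: N-fold cross-validation (here N = 5); every sample
# appears in a test fold exactly once.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```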

Nested N-fold cross-validation

Recall the main idea of cross-validation: the data is repeatedly split into train and test folds. What combination of learner parameters should be applied to the training data? Perform a “grid search” using another, nested loop of cross-validation: each outer training set is further split into train and validation folds to select the parameters.
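A minimal sketch of nested cross-validation in scikit-learn (the grid and data are illustrative placeholders): the inner loop selects learner parameters on training folds only, and the outer loop estimates the performance of the whole procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # placeholder data

# Inner loop: grid search over parameter combinations within the training folds.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
inner_cv = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)

# Outer loop: the test folds never influence parameter selection, so the
# resulting estimate is not optimistically biased by tuning.
outer_scores = cross_val_score(inner_cv, X, y, cv=5)
print("nested CV accuracy:", outer_scores.mean())
```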

Overview of challenges in computational analysis of omics data for development of molecular signatures

[Concept map centered on “Data Analytics of Molecular Signatures”, surrounded by:
  • Many variables, small sample, noise, artifacts
  • Rashomon effect / marker multiplicity
  • Assay validity / reproducibility
  • Research designs
  • Efficiency: statistical / computational
  • Is there predictive signal?
  • Causality vs. predictiveness / biological significance
  • Methods development: re-inventing the wheel & specialization
  • Epistasis
  • Instability
  • Performance: predictivity, compactness
  • Protocols / guidelines
  • Editorializing / over-simplifying / sensationalism]

Part 3: Comprehensive evaluation of algorithms to develop molecular signatures for cancer classification
Comprehensive evaluation of algorithms for classification of cancer microarray data
  • Main goals:
    • Find the best-performing algorithms for building molecular signatures for cancer diagnosis from microarray gene expression data;
    • Investigate benefits of using gene selection and ensemble classification methods.
Classification algorithms
  • Instance-based: K-Nearest Neighbors (KNN)
  • Neural networks: Backpropagation Neural Networks (NN), Probabilistic Neural Networks (PNN)
  • Kernel-based: Multi-Class SVM One-Versus-Rest (OVR), One-Versus-One (OVO), DAGSVM, Multi-Class SVM by Weston & Watkins (WW), Multi-Class SVM by Crammer & Singer (CS)
  • Voting: Weighted Voting One-Versus-Rest, Weighted Voting One-Versus-One
  • Decision trees: CART

Ensemble classification methods

[Diagram: the dataset is given to Classifier 1, Classifier 2, ..., Classifier N, producing Prediction 1, Prediction 2, ..., Prediction N; an ensemble classifier combines these predictions into a final prediction]
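A minimal sketch of the idea with scikit-learn's VotingClassifier (the base learners and data are placeholders, not the study's exact configuration): several classifiers are trained on the same dataset and their predictions are combined by majority vote.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder data

# N base classifiers produce N predictions; a majority vote gives the final one.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear")),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # majority voting on predicted class labels
)
print("ensemble 5-fold CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```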
Gene selection methods

[Diagram: genes ranked by a selection criterion, separating highly discriminatory genes from uninformative genes]
  • Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;
  • Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;
  • Kruskal-Wallis nonparametric one-way ANOVA (KW);
  • Ratio of genes between-categories to within-category sum of squares (BW).
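As an illustration of the first method, here is a minimal sketch of the signal-to-noise ratio in one-versus-rest form on synthetic data (the per-gene formula (mean_pos - mean_rest) / (std_pos + std_rest) is the standard Golub-style S2N; the data are made up):

```python
import numpy as np

def s2n_ovr(X, y, positive_class):
    """Signal-to-noise ratio of each gene, one-versus-rest:
    (mean_pos - mean_rest) / (std_pos + std_rest)."""
    pos, rest = X[y == positive_class], X[y != positive_class]
    return (pos.mean(axis=0) - rest.mean(axis=0)) / (pos.std(axis=0) + rest.std(axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1000))   # hypothetical 60 samples x 1,000 genes
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :10] += 2.0             # plant 10 truly discriminatory genes

scores = s2n_ovr(X, y, positive_class=1)
top10 = np.argsort(np.abs(scores))[::-1][:10]  # keep the highest-|S2N| genes
print(sorted(top10))              # mostly indices 0-9, the planted genes
```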
Performance metrics and statistical comparison
  • Accuracy
    + can compare to previous studies
    + easy to interpret & simplifies statistical comparison
  • Relative classifier information (RCI)
    + easy to interpret & simplifies statistical comparison
    + not sensitive to distribution of classes
    + accounts for difficulty of a decision problem
  • Randomized permutation testing to compare accuracies of the classifiers (α = 0.05)
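A minimal sketch of one common form of such a test (the study's exact statistic may differ): under the null hypothesis the two classifiers are exchangeable, so their per-case outcomes can be randomly swapped to build the null distribution of the accuracy difference.

```python
import numpy as np

def perm_test_accuracy(correct_a, correct_b, n_perm=10_000, seed=0):
    """Two-sided randomized permutation test for the accuracy difference of two
    classifiers on the same test cases; correct_a/correct_b are boolean arrays
    marking which predictions were correct."""
    rng = np.random.default_rng(seed)
    observed = correct_a.mean() - correct_b.mean()
    hits = 0
    for _ in range(n_perm):
        swap = rng.random(correct_a.size) < 0.5   # exchange outcomes per case
        a = np.where(swap, correct_b, correct_a)
        b = np.where(swap, correct_a, correct_b)
        hits += abs(a.mean() - b.mean()) >= abs(observed)
    return hits / n_perm

# Hypothetical per-case correctness of two classifiers on 100 test cases:
rng = np.random.default_rng(1)
p = perm_test_accuracy(rng.random(100) < 0.85, rng.random(100) < 0.70)
print("p-value:", p)  # declare a significant difference if p < alpha = 0.05
```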
Microarray datasets
  • Total:
    • ~1,300 samples
    • 74 diagnostic categories
    • 41 cancer types and 12 normal tissue types
Summary of methods and datasets

  • Gene expression datasets (11):
    • Multicategory Dx: 9_Tumors, 11_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2, Leukemia1, Leukemia2, Lung_Cancer, SRBCT
    • Binary Dx: Prostate_Tumors, DLBCL
  • Classifiers (11): MC-SVM (One-Versus-Rest, One-Versus-One, DAGSVM, method by WW, method by CS), KNN, backpropagation NN, probabilistic NN, weighted voting (One-Versus-Rest, One-Versus-One), decision trees
  • Ensemble classifiers (7): based on MC-SVM outputs (majority voting, MC-SVM OVR, MC-SVM OVO, MC-SVM DAGSVM, decision trees); based on outputs of all classifiers (majority voting, decision trees)
  • Gene selection methods (4): S2N One-Versus-Rest, S2N One-Versus-One, non-parametric ANOVA (KW), BW ratio
  • Cross-validation designs (2): 10-fold CV, LOOCV
  • Performance metrics (2): accuracy, RCI
  • Statistical comparison: randomized permutation testing
Results without gene selection

[Bar chart: accuracy (%) of the multi-class SVM methods (OVR, OVO, DAGSVM, WW, CS) versus KNN, NN, and PNN]
Results with gene selection

[Bar charts: diagnostic performance (accuracy, %) before and after gene selection on the 9_Tumors, 14_Tumors, Brain_Tumor1, and Brain_Tumor2 datasets, for SVM and non-SVM methods (OVR, OVO, DAGSVM, WW, CS, KNN, NN, PNN); panels also show the improvement in accuracy from gene selection, averaged over the four datasets]

Average reduction of genes is 10-30 times.

Comparison with previously published results

[Bar chart: accuracy (%) of multiclass SVMs (this study) versus multiple specialized classification methods (original primary studies)]
Summary of results
  • Multi-class SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV;
  • Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms;
  • Ensemble classification does not improve performance of SVM and other classifiers;
  • Results obtained by SVMs favorably compare with the literature.

Random Forest (RF) classifiers

  • Appealing properties
    • Work when # of predictors > # of samples
    • Embedded gene selection
    • Incorporate interactions
    • Based on theory of ensemble learning
    • Can work with binary & multiclass tasks
    • Do not require much fine-tuning of parameters
  • Strong theoretical claims
  • Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods
Key principles of RF classifiers

[Diagram:
  1) Generate bootstrap samples from the training data;
  2) Random gene selection;
  3) Fit unpruned decision trees;
  4) Apply the trees to testing data & combine their predictions]
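A minimal scikit-learn sketch of these four steps (parameters and data are placeholders, not those of Diaz-Uriarte and Alvarez de Andres):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,      # 1) one bootstrap sample of the training data per tree
    max_features="sqrt",   # 2) random gene (feature) subset considered at each split
    bootstrap=True,
    random_state=0,        # 3) trees are grown unpruned by default
).fit(X_tr, y_tr)

# 4) Apply to testing data; predictions are combined across trees.
print("test accuracy:", rf.score(X_te, y_te))
# Embedded gene selection: importances rank the features.
print("top 5 features:", rf.feature_importances_.argsort()[::-1][:5])
```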

Results without gene selection
  • SVMs nominally outperform RFs in 15 datasets, RFs outperform SVMs in 4 datasets, and the algorithms perform exactly the same in 3 datasets.
  • In 7 datasets SVMs outperform RFs statistically significantly.
  • On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI.
Results with gene selection
  • SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, and the algorithms perform exactly the same in 2 datasets.
  • In 1 dataset SVMs outperform RFs statistically significantly.
  • On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.
Molecular signature multiplicity
  • Different methods or samples from the same population lead to different but apparently maximally predictive signatures;
  • Far-reaching implications for biological discovery and development of next generation patient diagnostics and personalized treatments:
    • Generation of biological hypotheses is very hard even when signatures are maximally predictive of the phenotype since thousands of completely different signatures are equally consistent with the data;
    • Produced signatures are not statistically generalizable to new cases, and thus not reliable enough for translation to clinical practice.
Molecular signature multiplicity
  • Causes of this phenomenon are unknown; several contradictory conjectures exist in the field:
    • Signature multiplicity is due to small samples [Michiels et al., 2005]
    • Signature multiplicity leads to predictively non-reproducible signatures [Ein-Dor et al., 2006]; building reproducible signatures requires thousands of samples [Ioannidis, 2005]
    • Signature multiplicity is a by-product of the complex regulatory connectivity of genome [Dougherty and Brun, 2006]
    • Artifacts of data pre-processing, e.g. normalization [Gold et al., 2005; Qiu et al., 2005; Ploner et al., 2005]
Major goals
  • Develop a Markov boundary characterization of molecular signature multiplicity phenomenon;
  • Design and study algorithms that can correctly identify the set of maximally predictive and non-redundant molecular signatures;
  • Conduct an empirical evaluation of the novel algorithms and compare to the existing state-of-the-art methods;
  • Test and refine previously stated hypotheses about the causes of signature multiplicity phenomenon.
Optimality criteria of signatures

Signatures that are the focus of this research satisfy the following two optimality criteria:

  • maximally predictive of the phenotype (they achieve best predictivity of the phenotype in the given dataset over all signatures based on different gene sets);
  • do not contain predictively redundant genes (i.e., genes that can be removed from the signature without adversely affecting its predictivity).
Why do we need algorithms to extract as many optimal signatures as possible?
  • A deeper understanding of the signature multiplicity phenomenon and how it affects reproducibility of signatures;
  • Improving discovery of the underlying biological mechanisms by not missing genes that are implicated biologically in disease processes;
  • Catalyzing regulatory approval by establishing in-silico equivalence to previously validated signatures
Existing algorithms for multiple signature extraction: Resampling-based methods

[Diagram: 1) generate resampled datasets from the training data (e.g., by bootstrapping); 2) apply a standard signature extraction algorithm (e.g., SVM-RFE) to each, yielding signatures X1, X2, X3, ..., XN]

  • Based on the assumption that multiplicity is strictly a small-sample phenomenon;
  • An infinite number of resamplings is required to extract all optimal signatures;
  • May stop producing multiple signatures in large sample sizes.

Existing algorithms for multiple signature extraction: Iterative removal

[Diagram: extract signature X1 from the original data (all genes); remove the corresponding genes and extract X2 from the reduced data (excluding X1 genes); remove those genes and extract X3 from the data excluding X1 and X2 genes; ...continue until a signature has statistically significantly reduced predictivity]

  • Agnostic to what causes molecular signature multiplicity;
  • Cannot discover signatures that have genes in common.
Existing algorithms for multiple signature extraction: Stochastic gene selection

Genetic Algorithms (e.g., GA/KNN or GA/SVM)

  • Can output all signatures that are discoverable by a genetic algorithm when it is allowed to evolve an infinite number of generations.

KIAMB

  • Stochastic Markov boundary method based on IAMB algorithm;
  • In a specific class of distributions, every optimal signature will be output by this method with nonzero probability;
  • Requires an infinite number of iterations to discover all optimal signatures; will discover the same signature over and over again;
  • Sample requirements are of exponential order in the number of genes in a signature.
Existing algorithms for multiple signature extraction: Brute-force exhaustive search

LIKNON

  • Examines predictivity of all individual genes in the dataset, all pairs of genes, all triples of genes, and so on;
  • It is infeasible when a signature has more than 2-3 genes;
  • Agnostic to what causes signature multiplicity.

In summary, no current algorithm provides a systematic and efficient approach for identification of the set of maximally predictive and non-redundant molecular signatures that exist in the underlying distribution.

Key definitions (1/2)
  • Definition of maximally predictive molecular signature: A maximally predictive molecular signature is a molecular signature that maximizes predictivity of the phenotype relative to all other signatures that can be constructed from the same dataset.
  • Definition of maximally predictive and non-redundant molecular signature: A maximally predictive and non-redundant molecular signature based on variables X is a maximally predictive signature such that any signature based on a proper subset of the variables in X is not maximally predictive.
Key definitions (2/2)
  • Definition of Markov blanket: A Markov blanket M of the response variable T ∈ V in the joint probability distribution P over variables V is a set of variables conditioned on which all other variables are independent of T, i.e., for every X ∈ V \ (M ∪ {T}): P(T | M, X) = P(T | M).
  • Definition of Markov boundary (or non-redundant Markov blanket): If M is a Markov blanket of T and no proper subset of M satisfies the definition of a Markov blanket of T, then M is called a Markov boundary (or non-redundant Markov blanket) of T.
Theoretical results
  • Variable sets that participate in the maximally predictive signatures of T are precisely the Markov blankets of T and vice-versa;
  • Similarly, variable sets that participate in the maximally predictive and non-redundant signatures of T are precisely the Markov boundaries of T and vice-versa;
  • If a joint probability distribution P over variables V satisfies the intersection property, then there exists a unique Markov boundary of T [Pearl, 1988].
A fundamental reduction used in this research for the analysis of signatures

[Scatter plot: cases (+) and controls (*) plotted by expression of gene X and gene Y, with candidate decision surfaces S1-S5; S3, S4, and S5 have maximal predictivity of the phenotype relative to their genes, while S1 and S2 have worse predictivity]

  • Since there is an infinite number of signatures with maximal predictivity, when I refer to a signature, I mean one of the predictively equivalent classifiers (e.g., S3 or S4 or S5);
  • Can study signature classes by reference only to their genes;
  • This reduction is justified whenever the classifiers used can learn the minimum error decision function given sufficient sample.
Example of Markov boundary multiplicity

[Figure: network structure and distributional constraints]

  • Many optimal signatures exist: e.g., {A, C} and {B, C} are maximally predictive and non-redundant signatures of T. Furthermore, {A, C} and {B, C} remain maximally predictive even in infinite samples;
  • The network has very low connectivity;
  • Genes in optimal signatures do not have to be deterministically related: e.g., A and B are not deterministically related, yet individually convey the same information about T;
  • If an algorithm selects only one optimal signature, there is a danger of missing biologically important causative genes;
  • The union of all optimal signatures includes all genes located in the local pathway around T;
  • In this example, the intersection of all optimal signatures contains only genes in the local pathway around T.

II. A novel algorithm to correctly identify the set of maximally predictive and non-redundant signatures
Trace of the TIE* algorithm

[Trace diagram:
  • Initial run: M = {A, B, F} is a Markov boundary;
  • G = {A} removed: Mnew = {C, B, F} is a Markov boundary;
  • G = {B} removed: Mnew = {A, D, E, F} is a Markov boundary;
  • G = {A, B} removed: Mnew = {C, D, E, F} is a Markov boundary;
  • G = {F} removed: Mnew = {A, B} is not a Markov boundary; do not consider any G that is a superset of {F}]

Theoretical results (1/2)
  • TIE* returns all and only Markov boundaries of T (i.e., maximally predictive and non-redundant signatures) if its input components X, Y, Z are admissible
  • IAMB is an admissible Markov boundary algorithm (input component X) under assumptions
    • IAMB correctly outputs a Markov boundary if only the composition property holds
  • HITON-PC is an admissible Markov boundary algorithm (input component X) under assumptions
    • HITON-PC correctly outputs a Markov boundary if the adjacency faithfulness assumption holds except for violations of the intersection axiom, global Markov condition holds, and there are no “spouses” in the Markov boundary
Theoretical results (2/2)
  • Stated three strategies (IncLex, IncMinAssoc, and IncMaxAssoc) to generate subsets of variables that have to be removed from V to identify new Markov boundaries of T, and proved their admissibility (input component Y);
  • Stated two criteria (Independence and Predictivity) to verify Markov boundaries and proved their admissibility (input component Z).
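To make the generative scheme concrete, here is an illustrative sketch of the TIE* control flow (my own simplified rendering, not the authors' implementation): component X is any admissible Markov boundary routine, component Y is approximated by lexicographic subset generation over the first boundary, and component Z is an abstract verification callback. All function interfaces here are assumptions for illustration.

```python
from itertools import combinations

def tie_star(data, target, markov_boundary, verify):
    """Simplified sketch of TIE*. Assumed (hypothetical) interfaces:
    markov_boundary(data, target, excluded) -> set of variables (component X,
    e.g., IAMB or HITON-PC run with `excluded` removed from the data);
    verify(candidate, reference) -> bool (component Z, e.g., the Independence
    or Predictivity criterion)."""
    reference = markov_boundary(data, target, excluded=frozenset())
    boundaries = [set(reference)]
    rejected = []  # excluded sets that destroyed Markov-boundary status

    genes = sorted(reference)
    # Component Y (simplified): consider subsets G of the first boundary in
    # increasing size; the full algorithm draws G from all found boundaries.
    for size in range(1, len(genes) + 1):
        for combo in combinations(genes, size):
            g = frozenset(combo)
            if any(r <= g for r in rejected):
                continue  # skip supersets of exclusions that already failed
            m_new = set(markov_boundary(data, target, excluded=g))
            if verify(m_new, reference):
                if m_new not in boundaries:
                    boundaries.append(m_new)  # another Markov boundary of T
            else:
                rejected.append(g)
    return boundaries
```

On the slide's trace, this loop would recover {A, B, F}, then {C, B, F} for G = {A}, {A, D, E, F} for G = {B}, {C, D, E, F} for G = {A, B}, and would prune all supersets of {F} once Mnew = {A, B} fails verification.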
III. Empirical evaluation of the novel algorithms and comparison with existing state-of-the-art methods
A. Experiments with artificial simulated data

Generative model is available, and the set of Markov boundaries (and thus the set of maximally predictive and non-redundant signatures) is known.

  • Generate samples of systematically varied sizes;
  • Compare to the gold standard;
  • Test whether the TIE* algorithm behaves according to theoretical expectations and study its empirical properties;
  • Obtain clues about behavior of TIE* and baseline comparison algorithms in experiments with real gene expression data.
Experiments with discrete networks TIED1 and TIED2
  • Two artificial discrete networks were created:
    • TIED1 consists of 30 variables (including a response variable T) and contains 72 Markov boundaries of T;
    • TIED2 consists of 1,000 variables (including a response variable T) and contains the same 72 Markov boundaries of T as TIED1.
Experiments
  • Goal: Compare TIE* to state-of-the-art algorithms (resampling-based methods, KIAMB, and Iterative Removal) and examine sensitivity of the tested methods to high dimensionality.
  • Findings:
    • TIE* correctly identifies the set of true Markov boundaries (maximally predictive and non-redundant signatures) in the datasets with 30 or 1,000 variables;
    • Iterative Removal identifies only 1 signature;
    • KIAMB fails to identify any true signature, and its output signatures have poor predictivity;
    • Resampling-based methods either miss true signatures and/or output many redundant variables in the signatures.
Experiments with linear continuous network LIND

LIND consists of 41 variables (including a response variable T) and contains 12 Markov boundaries of T.

Experiments
  • Goals:
    • Analyze behavior of TIE* as a function of sample size using data generated from a continuous network;
    • Compare the Independence and Predictivity criteria for verification of Markov boundaries in the TIE* algorithm.
  • Findings:
    • As sample size increases, the performance of both instantiations of TIE* generally improves and the algorithms discover the set of true Markov boundaries;
    • The α-level in the Predictivity criterion significantly affects the number of Markov boundaries output by the TIE* algorithm;
    • TIE* with the Predictivity criterion typically outputs a larger number of Markov boundaries and has, on average, superior performance compared to the Independence criterion.
Experiments with discrete network XORD

XORD consists of 41 variables (including a response variable T) and contains 25 Markov boundaries of T.

Experiments
  • Goal: Evaluate TIE* when popular Markov boundary algorithms such as IAMB and HITON-PC are not applicable due to violations of their fundamental assumptions.
  • Findings:
    • TIE* discovers the set of true Markov boundaries when the sample size is ≥ 2,000;
    • There is ~1 false positive variable in each discovered Markov boundary for large sample sizes.
B. Experiments with resimulated microarray gene expression data
  • Resimulated data by design closely resembles real human lung cancer microarray gene expression data;
  • Knowledge of the generative model allows generating arbitrarily large samples and studying the behavior of TIE* as a function of sample size;
  • Unlike prior experiments with artificial simulated datasets, the set of maximally predictive and non-redundant signatures is not known a priori.
Experiment

Goal: Examine whether the signature multiplicity phenomenon vanishes as the sample size grows.

Results: [results figure not included in the transcript]

Findings of other experiments
  • TIE* is not sensitive to the choice of the initial signature discovered by the algorithm;
  • Post-processing TIE* signatures with wrapping results in more signatures with a smaller number of genes;
  • Signatures output by tested non-TIE* methods are either redundant or have inferior predictivity compared to signatures output by TIE* techniques.
C. Experiments with real human microarray gene expression data
  • Independent-Dataset Experiments:Using pairs of microarray datasets either from different laboratories or different platforms;
  • Single-Dataset Experiments:Additional experiments with relatively large sample size microarray datasets;
  • The primary goal of both experiments is to compare TIE* and baseline algorithms for multiple signature extraction in terms of maximal predictivity of induced signatures and reproducibility in independent data.
  • Operational definition of “maximal predictivity”: the empirically best classification performance (AUC) achievable in each dataset over all tested methods.
TIE* signatures have maximal predictivity
  • TIE* achieves maximal predictivity in 5 out of 6 validation datasets;
  • Non-TIE* methods achieve maximal predictivity in 0 to 2 datasets depending on the method;
  • In the dataset where the predictivity of TIE* is statistically distinguishable from the empirically maximal one (Lung Cancer Subtype Classification), the magnitude of this difference is only 0.009 AUC on average over all discovered signatures.
TIE* signatures are reproducible, other signatures may be overfitted
  • TIE* has no overfitting on average over all signatures and datasets;
  • Other methods achieve predictivity in the validation data that is lower than in the discovery data (by 0.02-0.03 AUC), in addition to having inferior predictivity overall.
TIE* signatures in comparison with other signatures

[Scatter plot: predictivity results for the Leukemia 5 Yr. Prognosis task, plotting classification performance (AUC) in the discovery dataset against AUC in the validation dataset. Each dot corresponds to a signature (computational model) of the outcome, e.g., Outcome(x) = sign(w·x + b), where x, w ∈ ℝ^m, b ∈ ℝ, and m is the number of genes in the signature. Multiple signatures output by TIE* have maximal predictivity & low variance; multiple signatures output by other methods do not achieve maximal predictivity and have high variance]

Single-dataset experiments: Datasets
  • Validation dataset → subset of 100 samples/patients
  • Discovery dataset → all remaining samples/patients
  • Repeat splits into discovery & validation datasets 10 times to minimize variance
Single-dataset experiments: Summary results
  • Results are similar to the ones from independent-dataset experiments;
  • TIE* achieves maximal predictivity in 6 out of 7 validation datasets;
  • Non-TIE* methods achieve maximal predictivity in 0 to 1 datasets depending on the method;
  • In the dataset where TIE* has predictivity that is statistically distinguishable from the empirically maximal one (Breast Cancer Subtype Classification II), the magnitude of this difference is only <0.01 AUC on average over all discovered signatures.
Revisiting previously published hypotheses about signature multiplicity
  • Signature reproducibility neither precludes multiplicity nor requires sample sizes with thousands of subjects;
  • Multiplicity of signatures does not require dense connectivity;
  • Noisy measurements or normalization are not necessary conditions for signature multiplicity;
  • Multiplicity can be produced by a combination of small sample size-related variance and intrinsic multiplicity in the underlying network;
  • Multiple signatures output by TIE* are reproducible even though they are derived from small sample, noisy, and heavily-processed data.
A more complete picture is emerging regarding causes of multiplicity...
  • Intrinsic information redundancy in the underlying biological system;
  • Variability in the output of gene selection and classifier algorithms, especially in small sample sizes;
  • Small-sample statistical indistinguishability of signatures with different large-sample predictivity and/or redundancy characteristics;
  • Presence of hidden variables;
  • Correlated measurement noise;
  • RNA amplification techniques that systematically distort measurements of transcript ratios;
  • Cellular aggregation and sampling from mixtures of distributions that affect inference of conditional independence relations;
  • Normalization and other data pre-processing methods that artificially increase correlations among genes;
  • Engineered redundancy in the assay technology platforms.
Summary of results
  • Developed a Markov boundary characterization of molecular signature multiplicity;
  • Designed a generative algorithm that can correctly identify the set of maximally predictive and non-redundant molecular signatures in principle independently of data distribution;
  • Conducted an empirical evaluation of the novel algorithm and compared it to existing state-of-the-art methods using artificial simulated, resimulated microarray gene expression, and real human microarray gene expression data;
  • Tested and refined several hypotheses about the causes of molecular signature multiplicity phenomenon.
General conclusions
  • Molecular signatures play a crucial role in personalized medicine and translational bioinformatics.
  • Molecular signatures are being used to treat patients today, not in the future.
  • Development of accurate molecular signatures should rely on the use of supervised methods.
  • In general, there are many challenges for computational analysis of omics data for development of molecular signatures.
  • One of these challenges is molecular signature multiplicity.
  • There exists an algorithm that can extract the set of maximally predictive and non-redundant molecular signatures from high-throughput data.
Homework (Due next Monday)
  • Read the paper “Analysis and Computational Dissection of Molecular Signature Multiplicity”.
  • Describe a novel and interesting application area for the TIE* algorithm. Feel free to use an example from your research where there exist many molecular signatures of some response variable (1/2 page max).
  • Come up with another cause of molecular signature multiplicity that was not mentioned in the paper (1/2 page max).

Email your work to Alexander.Statnikov@med.nyu.edu

Computational Causal Discovery Laboratory at NYU Center for Health Informatics and Bioinformatics (CHIBI)
  • The purpose of our lab is to develop, test and apply computational causal discovery methods suitable for molecular, clinical, imaging and multi-modal data of high-dimensionality.
  • We are interested in methods to address the following questions:
    • What is causing disease/phenotype?
    • What are the effects of disease/phenotype?
    • What are involved biological pathways?
    • How to design drugs/treatments?
    • How does genotype cause differences in response to treatment?
    • How does the environment modify or even supersede the normal causal function of genes and other molecular variables?
    • How are genes and proteins organized in complex causal regulatory networks?
  • Questions? Email to Alexander.Statnikov@med.nyu.edu