
Alexander Statnikov, Ph.D. Director, Computational Causal Discovery Laboratory


Presentation Transcript

### Molecular Signaling & Drug Development Course: Development of Molecular Signatures from High-Throughput Assay Data

Alexander Statnikov, Ph.D.

Director, Computational Causal Discovery Laboratory

Benchmarking Director, Best Practices Integrative Informatics Consultation Service

Assistant Professor, Department of Medicine, Division of Clinical Pharmacology

Center for Health Informatics and Bioinformatics , NYU School of Medicine

5/16/2011

Outline

- Part 1: Introduction to molecular signatures
- Part 2: Key principles for developing accurate molecular signatures
- Part 3: Comprehensive evaluation of algorithms to develop molecular signatures for cancer classification
- Part 4: Analysis and computational dissection of molecular signature multiplicity
- Conclusion
- Homework assignment

Definition of a molecular signature

Molecular signature is a computational or mathematical model that links high-dimensional molecular information to phenotype or other response variable of interest.

FDA view on molecular signatures

The FDA calls them “in vitro diagnostic multivariate index assays”

1. “Class II Special Controls Guidance Document: Gene Expression Profiling Test System for Breast Cancer Prognosis”:

- Addresses device classification.

2. “The Critical Path to New Medical Products”:

- Identifies pharmacogenomics as crucial to advancing medical product development and personalized medicine.

3. “Draft Guidance on Pharmacogenetic Tests and Genetic Tests for Heritable Markers” & “Guidance for Industry: Pharmacogenomic Data Submissions”

- Identifies 3 main goals (dose, ADEs, responders),
- Defines IVDMIA,
- Encourages “fault-free” sharing of pharmacogenomic data,
- Separates “probable” from “valid” biomarkers,
- Focuses on genomics (and not other omics).

Main uses of molecular signatures

- Direct benefits: Models of disease phenotype/clinical outcome
- Diagnosis
- Prognosis, long-term disease management
- Personalized treatment (drug selection, titration)
- Ancillary benefits 1: Biomarkers for diagnosis, or outcome prediction
- Make the above tasks resource efficient, and easy to use in clinical practice
- Helps next-generation molecular imaging
- Leads for potential new drug candidates
- Ancillary benefits 2: Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)
- Leads for potential new drug candidates

Less conventional uses of molecular signatures

- Increase clinical trial sample efficiency, decrease costs, or both, using placebo responder signatures;
- In silico signature-based candidate drug screening;
- Drug “resurrection”;
- Establishing existence of biological signal in very small sample situations where univariate signals are too weak;
- Assess importance of markers and of mechanisms involving those;
- Choosing the right animal model;
- …?

- available for patient care

Agendia

Clarient

Prediction Sciences

LabCorp

University Genomics

Genomic Health

Veridex

BioTheranostics

Applied Genomics

Power3

OvaSure

Correlogic Systems

MammaPrint

• Developed by Agendia (www.agendia.com)

• 70-gene signature to stratify women with breast cancer that hasn’t spread into “low risk” and “high risk” for recurrence of the disease

• Independently validated in >1,000 patients

• So far performed >10,000 tests

• Cost of the test is ~$3,000

• In February, 2007 the FDA cleared the MammaPrint test for marketing in the U.S. for node negative women under 61 years of age with tumors of less than 5 cm.

• TIME Magazine’s 2007 “medical invention of the year”.

Oncotype DX

• Developed by Genomic Health (www.genomichealth.com)

• 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse

• Independently validated in >1,000 patients

• So far performed >50,000 tests

• Cost of the test is ~$3,000

• The following paper shows the health benefits and cost-effectiveness benefits of using Oncotype DX: http://www3.interscience.wiley.com/cgi-bin/abstract/114124513/ABSTRACT

Main ingredients for developing a molecular signature

- Well-defined clinical problem & access to patients/samples
- High-throughput assays
- Computational & biostatistical analysis

Together these yield the molecular signature.

Challenges in computational analysis of omics data

- It is relatively easy to develop a predictive model and even easier to believe that a model is good when it is not, which creates a false sense of security
- Several problems exist: some theoretical and some practical
- Omics data has many special characteristics and is tricky to analyze!

Example: OvaCheck (1/2)

- Developed by Correlogic (www.correlogic.com)
- Blood test for the early detection of epithelial ovarian cancer
- Failed to obtain FDA approval
- Looks for subtle changes in patterns among the tens of thousands of proteins, protein fragments and metabolites in the blood
- Signature developed by genetic algorithm
- Significant artifacts in data collection & analysis questioned validity of the signature:
- Results are not reproducible
- Data collected differently for different groups of patients

http://www.nature.com/nature/journal/v429/n6991/full/429496a.html

E.g., for classification (predict response to treatment)

[Figure: patients plotted by p53 and Rb; one group responds to treatment Tx1, the other does not respond to treatment Tx1.]

Another use of clustering

- Cluster genes (instead of patients):
- Genes that cluster together may belong to the same pathways
- Genes that cluster apart may be unrelated

Unfortunately clustering is a non-specific method and falls into the ‘one-solution fits all’ trap when used for classification

[Figure: patients plotted by p53 and Rb; groups labeled squamous carcinoma and adenocarcinoma.]

Clustering is also non-specific when used to discover pathways, or other mechanistic relationships

It is entirely possible in this simple illustrative counter-example for G3 (a causally unrelated gene to the phenotype) to be more strongly associated and thus cluster with the phenotype (or its surrogate genes) than the true causal oncogenes G1, G2

[Figure: graph over the genes G1, G2, G3 and the phenotype Ph.]

Two improved classes of methods

- Supervised learning → classification/molecular signatures and markers
- Regulatory network reverse engineering → pathways

Supervised learning: Use the known phenotypes (a.k.a. “class labels”) in training data to build signatures or find markers highly specific for that phenotype

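[Figure: training samples (variables A, B, C, D and response T) are given to a classifier/regression algorithm, which produces the molecular signature. The signature is applied to testing/validation samples (A1, B1, C1, D1, T1; A2, B2, C2, D2, T2; …; An, Bn, Cn, Dn, Tn) to obtain classification performance.]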

Input data for supervised learning methods

[Table: one row per sample. The first column is the class label (Primary or Metastatic); the remaining columns are the variables/features.]

Principles and geometric representation for supervised learning (1/7)

- Want to classify objects as boats and houses.

Principles and geometric representation for supervised learning (2/7)

- All objects before the coast line are boats and all objects after the coast line are houses.
- Coast line serves as a decision surface that separates two classes.

Principles and geometric representation for supervised learning (3/7)

These boats will be misclassified as houses

This house will be misclassified as boat

Principles and geometric representation for supervised learning (4/7)

[Figure: boats and houses plotted by latitude and longitude.]

- The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example.
- First all objects are represented geometrically.

Principles and geometric representation for supervised learning (5/7)

[Figure: boats and houses plotted by latitude and longitude.]

Then the algorithm seeks to find a decision surface that separates classes of objects

Principles and geometric representation for supervised learning (6/7)

[Figure: latitude versus longitude. Unseen objects (marked “?”) on one side of the decision surface are classified as houses; those on the other side are classified as boats.]

Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if they fall above it

Principles and geometric representation for supervised learning (7/7)

[Figure: Object #1, Object #2, and Object #3 plotted by latitude and longitude.]

In 2-D this looks simple but what happens in higher dimensional data…

- 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)
- >500,000 (tiled microarrays, SNP arrays)
- 10,000-300,000 (regular MS proteomics)
- >10,000,000 (LC-MS proteomics)
- >100,000,000 (next-generation sequencing)

This is the ‘curse of dimensionality’ problem

High-dimensionality (especially with small samples) causes:

- Some methods do not run at all (classical regression)
- Some methods give bad results (KNN, Decision trees)
- Very slow analysis
- Very expensive/cumbersome clinical application
- Tends to “overfit”

Two problems: Over-fitting & Under-fitting

- Over-fitting (a model to your data) = building a model that is good in original data but fails to generalize well to new/unseen data
- Under-fitting (a model to your data) = building a model that is poor in both original data and new/unseen data

Over/under-fitting are related to complexity of the decision surface and how well the training data is fit


[Figure: outcome of interest Y versus predictor X, with training data and future data. One fitted line is marked “This line is good!” and another “This line overfits!”]

[Figure: outcome of interest Y versus predictor X, with training data and future data. One fitted line is marked “This line is good!” and another “This line underfits!”]

Very important concept…

- Successful data analysis methods balance training data fit with complexity.
- Too complex a signature (to fit training data well) → overfitting (i.e., the signature does not generalize)
- Too simplistic a signature (to avoid overfitting) → underfitting (it will generalize, but the fit to both the training and future data will be low and predictive performance small).
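The balance described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the data are invented, the "too complex" model is a 1-nearest-neighbour rule that memorizes the training set, the "balanced" model is a single cut-off fixed by hand, and the "too simple" model ignores the predictor.

```python
# Toy illustration of over- and under-fitting (all data are made up).
# One predictor x; the true rule is "class 1 when x > 5".
# Two training labels are deliberately flipped to simulate noise.
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 1, 0, 1, 1]   # labels at x=3 and x=7 are noise
test_x = [1.2, 2.8, 3.2, 4.5, 5.5, 6.8, 7.2, 8.8]
test_y = [0, 0, 0, 0, 1, 1, 1, 1]    # noise-free "future" data


def memorizer(x):
    """Too complex: copies the label of the closest training point (1-NN)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def threshold_rule(x):
    """Balanced: a single cut-off, fixed by hand at 5 for this sketch."""
    return 1 if x > 5 else 0


def constant_rule(x):
    """Too simple: ignores the predictor entirely."""
    return 0


def accuracy(model, xs, ys):
    """Fraction of samples whose predicted label equals the given label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


models = [("memorizer", memorizer),
          ("threshold", threshold_rule),
          ("constant", constant_rule)]
results = {
    name: (accuracy(m, train_x, train_y), accuracy(m, test_x, test_y))
    for name, m in models
}
for name, (train_acc, test_acc) in results.items():
    print(f"{name:10s} train={train_acc:.2f} test={test_acc:.2f}")
```

The memorizer fits the noisy training labels perfectly but does worse on the future data than the simple threshold, while the constant rule is poor on both.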

The Support Vector Machine (SVM) approach for building molecular signatures

- Support vector machines (SVMs) are binary classification algorithms.
- SVMs are important because of (a) theoretical reasons:
- Robust to very large number of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting

and (b) superior empirical results.

Main ideas of SVMs (1/3)

[Figure: normal patients and cancer patients plotted by Gene X and Gene Y.]

- Consider example dataset described by 2 genes, gene X and gene Y
- Represent patients geometrically (by “vectors”)

Main ideas of SVMs (2/3)

[Figure: normal patients and cancer patients plotted by Gene X and Gene Y, with the gap between the two groups marked.]

- Find a linear decision surface (“hyperplane”) that can separate patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”);
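To make the "gap" concrete, here is a minimal sketch that measures the margin of two candidate hyperplanes on invented two-gene data. The patient values, the labels (+1 cancer, -1 normal), and both hyperplanes are assumptions chosen for illustration; the code only evaluates margins and does not train an SVM.

```python
import math

# Hypothetical expression values (gene X, gene Y) with labels:
# +1 = cancer patient, -1 = normal patient.
patients = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((1.0, 2.0), -1),
            ((4.0, 4.0), +1), ((5.0, 4.0), +1), ((4.0, 5.0), +1)]


def margin(w, b):
    """Smallest signed distance from any patient to the line w.x + b = 0.

    The value is positive only when every patient is on the correct side.
    """
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, y in patients)


# Two separating lines: both classify every patient correctly,
# but they leave gaps of different width.
diagonal = margin((1.0, 1.0), -5.5)   # line x + y = 5.5
vertical = margin((1.0, 0.0), -3.0)   # line x = 3

print(f"diagonal margin: {diagonal:.3f}")
print(f"vertical margin: {vertical:.3f}")
```

Both lines separate the classes, but the diagonal one leaves the wider gap, so it is the kind of decision surface the margin criterion prefers. The patients that attain the minimum distance play the role of the support vectors.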

Main ideas of SVMs (3/3)

- If such linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found;
- The feature space is constructed via very clever mathematical projection (“kernel trick”).
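The idea of the kernel trick can be sketched with a degree-2 polynomial kernel, chosen here purely as an example (the slides do not say which kernel is used). The kernel value computed in the original 2-D space equals an ordinary dot product in an explicit 3-D feature space, so the feature space never has to be built explicitly.

```python
import math


def phi(x):
    """Explicit map from the 2-D input space to a 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, evaluated in the original 2-D space."""
    return dot(x, z) ** 2


# XOR-like toy data: not separable by a line in the input space.
points = [((1.0, 1.0), +1), ((-1.0, -1.0), +1),
          ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]

# The kernel equals the dot product of the mapped points.
for x, _ in points:
    for z, _ in points:
        assert abs(poly_kernel(x, z) - dot(phi(x), phi(z))) < 1e-9

# In feature space the second coordinate alone separates the classes.
separable = all((phi(x)[1] > 0) == (label > 0) for x, label in points)
print("linearly separable in feature space:", separable)
```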

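On estimation of signature accuracy

- Large sample case: use hold-out validation (split the data once into a train part and a test part).
- Small sample case: use N-fold cross-validation (split the data into N parts; each part serves once as the test set while the remaining parts are used for training).

Nested N-fold cross-validation

Recall the main idea of cross-validation: each part of the data is held out for testing in turn, and a model is trained on the rest.

- What combination of learner parameters to apply on training data?
- Perform “grid search” using another nested loop of cross-validation: within each training portion, further parts are held out as validation sets (“valid”) to compare parameter settings.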
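The nested scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the learner is a small k-nearest-neighbour classifier on one made-up predictor, the parameter grid is k in (1, 3), and folds are formed by striding through the sample indices. None of these choices come from the slides.

```python
def k_fold_indices(n_samples, n_folds):
    """Split indices 0..n_samples-1 into n_folds disjoint test folds."""
    return [list(range(f, n_samples, n_folds)) for f in range(n_folds)]


def fit_knn(train, k):
    """Return a predictor that votes among the k nearest training samples."""
    def predict(x):
        nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict


def cv_accuracy(xs, ys, n_folds, param):
    """Plain N-fold cross-validation accuracy for one parameter value."""
    hits = 0
    for test_idx in k_fold_indices(len(xs), n_folds):
        held_out = set(test_idx)
        train = [(xs[i], ys[i]) for i in range(len(xs)) if i not in held_out]
        model = fit_knn(train, param)
        hits += sum(model(xs[i]) == ys[i] for i in test_idx)
    return hits / len(xs)


def nested_cv(xs, ys, n_outer, n_inner, grid):
    """Outer loop estimates accuracy; inner loop picks the parameter."""
    hits, chosen = 0, []
    for test_idx in k_fold_indices(len(xs), n_outer):
        held_out = set(test_idx)
        tr_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        tr_y = [ys[i] for i in range(len(xs)) if i not in held_out]
        # Grid search sees only the outer-training samples.
        best = max(grid, key=lambda p: cv_accuracy(tr_x, tr_y, n_inner, p))
        chosen.append(best)
        model = fit_knn(list(zip(tr_x, tr_y)), best)
        hits += sum(model(xs[i]) == ys[i] for i in test_idx)
    return hits / len(xs), chosen


# Made-up, well-separated data: class 0 near 1..6, class 1 near 21..26.
xs = [1, 21, 2, 22, 3, 23, 4, 24, 5, 25, 6, 26]
ys = [0, 1] * 6
estimate, chosen_params = nested_cv(xs, ys, n_outer=3, n_inner=2, grid=(1, 3))
print("estimated accuracy:", estimate, "chosen k per outer fold:", chosen_params)
```

The key design point is that the outer test fold is never used when choosing the parameter, so the outer estimate is not biased by the grid search.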

Overview of challenges in computational analysis of omics data for development of molecular signatures

[Diagram: “Data Analytics of Molecular Signatures” at the center, surrounded by the following challenges.]

- Rashomon effect/Marker multiplicity
- Assay validity/reproducibility
- Research designs
- Efficiency: statistical/computational
- Is there predictive signal?
- Causality vs predictiveness/Biological significance
- Methods development: re-inventing the wheel & specialization
- Epistasis
- Many variables, small sample, noise, artifacts
- Instability
- Performance: predictivity, compactness
- Protocols/Guidelines
- Editorializing/Over-simplifying/Sensationalism

Part 3: Comprehensive evaluation of algorithms to develop molecular signatures for cancer classification

Comprehensive evaluation of algorithms for classification of cancer microarray data

- Main goals:
- Find the best performing algorithms for building molecular signatures for cancer diagnosis from microarray gene expression data;
- Investigate benefits of using gene selection and ensemble classification methods.

Classification algorithms

- Instance-based:
  - K-Nearest Neighbors (KNN)
- Neural networks:
  - Backpropagation Neural Networks (NN)
  - Probabilistic Neural Networks (PNN)
- Kernel-based:
  - Multi-Class SVM: One-Versus-Rest (OVR)
  - Multi-Class SVM: One-Versus-One (OVO)
  - Multi-Class SVM: DAGSVM
  - Multi-Class SVM by Weston & Watkins (WW)
  - Multi-Class SVM by Crammer & Singer (CS)
- Voting:
  - Weighted Voting: One-Versus-Rest
  - Weighted Voting: One-Versus-One
- Decision trees:
  - Decision Trees: CART

Ensemble classification methods

[Figure: the dataset is given to Classifier 1, Classifier 2, …, Classifier N; their outputs (Prediction 1, Prediction 2, …, Prediction N) are combined by an ensemble classifier into the final prediction.]

Gene selection methods

[Figure: genes, with the uninformative genes marked.]

- Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;
- Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;
- Kruskal-Wallis nonparametric one-way ANOVA (KW);
- Ratio of between-categories to within-category sum of squares (BW).
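As an illustration of the first method in the list, here is a minimal sketch of signal-to-noise ranking in one-versus-rest fashion. The slides do not give the formula; this sketch assumes the commonly used form (mean of class minus mean of rest) divided by (standard deviation of class plus standard deviation of rest), uses sample standard deviations, and runs on invented expression values. The gene names and the `rank_genes` helper are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """S2N score: difference of means over sum of standard deviations.

    Assumes each group has at least two values and nonzero spread.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes(expression, labels, target_class):
    """Rank genes by |S2N| for target_class versus all other samples."""
    scores = {}
    for gene, values in expression.items():
        in_class = [v for v, lab in zip(values, labels) if lab == target_class]
        rest = [v for v, lab in zip(values, labels) if lab != target_class]
        scores[gene] = signal_to_noise(in_class, rest)
    ranking = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranking, scores


# Invented data: six samples, three genes.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "geneA": [10, 11, 9, 1, 2, 3],    # strongly higher in tumor
    "geneB": [5, 6, 7, 5, 7, 6],      # uninformative
    "geneC": [2, 4, 6, 8, 10, 12],    # moderately lower in tumor
}
ranking, scores = rank_genes(expression, labels, "tumor")
print(ranking)
```

Keeping only the top-ranked genes is one simple way to discard uninformative genes before a classifier is trained.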

Performance metrics and statistical comparison

1. Accuracy

- + can compare to previous studies
- + easy to interpret & simplifies statistical comparison

2. Relative classifier information (RCI)

- + easy to interpret & simplifies statistical comparison
- + not sensitive to distribution of classes
- + accounts for difficulty of a decision problem

Randomized permutation testing to compare accuracies of the classifiers (α = 0.05)
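A permutation test for comparing two classifiers' accuracies could look like the sketch below. The exact procedure used in the study is not described in the slides; this version assumes a paired design in which, for each sample, the two classifiers' correct/incorrect outcomes are swapped at random, and the p-value is the share of permutations whose accuracy difference is at least as large as the observed one. The function name and the add-one correction are choices made for this sketch.

```python
import random


def accuracy(correct_flags):
    """Fraction of samples classified correctly (flags are 0 or 1)."""
    return sum(correct_flags) / len(correct_flags)


def permutation_test(correct_a, correct_b, n_permutations=1000, seed=0):
    """Two-sided paired permutation test on the accuracy difference."""
    rng = random.Random(seed)
    observed = abs(accuracy(correct_a) - accuracy(correct_b))
    as_extreme = 0
    for _ in range(n_permutations):
        perm_a, perm_b = [], []
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:   # swap the two classifiers' outcomes
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = abs(accuracy(perm_a) - accuracy(perm_b))
        if diff >= observed - 1e-12:
            as_extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return (as_extreme + 1) / (n_permutations + 1)


# Invented outcomes on 30 test samples (1 = correct, 0 = wrong).
always_right = [1] * 30
always_wrong = [0] * 30
p_different = permutation_test(always_right, always_wrong)
p_identical = permutation_test(always_right, always_right)
print("p (very different classifiers):", p_different)
print("p (identical classifiers):", p_identical)
```

A difference would be called significant at the level quoted above when the returned p-value falls below 0.05.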

Microarray datasets

- Total:
- ~1300 samples
- 74 diagnostic categories
- 41 cancer types and
- 12 normal tissue types

Summary of methods and datasets

- Designs (2): 10-Fold CV; LOOCV
- Performance Metrics (2): Accuracy; RCI
- Statistical Comparison: Randomized permutation testing
- Gene Selection Methods (4): S2N One-Versus-Rest; S2N One-Versus-One; Non-param. ANOVA; BW ratio
- Gene Expression Datasets (11):
  - Multicategory Dx: 11_Tumors; 14_Tumors; 9_Tumors; Brain_Tumor1; Brain_Tumor2; Leukemia1; Leukemia2; Lung_Cancer; SRBCT
  - Binary Dx: Prostate_Tumors; DLBCL
- Classifiers (11):
  - MC-SVM: One-Versus-Rest; One-Versus-One; DAGSVM; Method by WW; Method by CS
  - KNN
  - Backprop. NN
  - Prob. NN
  - Decision Trees
  - WV: One-Versus-Rest; One-Versus-One
- Ensemble Classifiers (7):
  - Based on MC-SVM outputs: Majority Voting; MC-SVM OVR; MC-SVM OVO; MC-SVM DAGSVM; Decision Trees
  - Based on outputs of all classifiers: Majority Voting; Decision Trees

[Figure: panels for 9_Tumors, 14_Tumors, Brain_Tumor1, and Brain_Tumor2; axes “Accuracy, %” and “Improvement in accuracy, %”; SVM methods (OVR, OVO, DAGSVM, WW, CS) shown against non-SVM methods (KNN, NN, PNN).]

Results with gene selection

Improvement of diagnostic performance by gene selection (averages for the four datasets)

Diagnostic performance before and after gene selection

Average reduction of genes is 10-30 times

Comparison with previously published results

[Figure: accuracy (%) obtained in this study versus multiple specialized classification methods (original primary studies).]

Summary of results

- Multi-class SVMs are the best family among the tested algorithms outperforming KNN, NN, PNN, DT, and WV.
- Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms;
- Ensemble classification does not improve performance of SVM and other classifiers;
- Results obtained by SVMs favorably compare with the literature.

Random Forest (RF) classifiers

- Appealing properties
- Work when # of predictors > # of samples
- Embedded gene selection
- Incorporate interactions
- Based on theory of ensemble learning
- Can work with binary & multiclass tasks
- Does not require much fine-tuning of parameters
- Strong theoretical claims
- Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods

Key principles of RF classifiers

1) Generate bootstrap samples from the training data
2) Random gene selection
3) Fit unpruned decision trees
4) Apply to testing data & combine predictions
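The four steps can be sketched in a deliberately simplified form. To stay short, this sketch fits one-split decision stumps instead of unpruned trees and draws one random gene subset per stump, whereas real random forests grow full trees and typically re-draw the gene subset at every split. The data, function names, and parameter values are all assumptions for illustration.

```python
import random
from collections import Counter


def fit_stump(samples, labels, genes):
    """Pick the (gene, threshold) split with the fewest training errors."""
    best = None
    for g in genes:
        values = sorted({s[g] for s in samples})
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2
            left = [lab for s, lab in zip(samples, labels) if s[g] <= t]
            right = [lab for s, lab in zip(samples, labels) if s[g] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, g, t, left_label, right_label)
    if best is None:   # no split possible: predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return lambda sample: majority
    _, g, t, left_label, right_label = best
    return lambda sample: left_label if sample[g] <= t else right_label


def fit_forest(samples, labels, n_trees=25, genes_per_tree=2, seed=0):
    rng = random.Random(seed)
    n_genes = len(samples[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(samples)) for _ in samples]   # 1) bootstrap
        genes = rng.sample(range(n_genes), genes_per_tree)     # 2) random genes
        forest.append(fit_stump([samples[i] for i in idx],     # 3) fit
                                [labels[i] for i in idx], genes))
    return forest


def predict(forest, sample):
    """4) Combine the individual predictions by majority vote."""
    return Counter(tree(sample) for tree in forest).most_common(1)[0][0]


# Invented data: four genes, low values in normal, high values in cancer.
samples = [(1, 2, 3, 4), (2, 3, 4, 1), (3, 4, 1, 2), (4, 1, 2, 3),
           (11, 12, 13, 14), (12, 13, 14, 11), (13, 14, 11, 12), (14, 11, 12, 13)]
labels = ["normal"] * 4 + ["cancer"] * 4
forest = fit_forest(samples, labels)
predictions = [predict(forest, s) for s in samples]
print(predictions)
```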

Results without gene selection

- SVMs nominally outperform RFs in 15 datasets, RFs outperform SVMs in 4 datasets, algorithms are exactly the same in 3 datasets.
- In 7 datasets SVMs outperform RFs statistically significantly.
- On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI.

Results with gene selection

- SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, algorithms are exactly the same in 2 datasets.
- In 1 dataset SVMs outperform RFs statistically significantly.
- On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.

Part 4: Analysis and computational dissection of molecular signature multiplicity

Molecular signature multiplicity

- Different methods or samples from the same population lead to different but apparently maximally predictive signatures;
- Far-reaching implications for biological discovery and development of next generation patient diagnostics and personalized treatments:
- Generation of biological hypotheses is very hard even when signatures are maximally predictive of the phenotype since thousands of completely different signatures are equally consistent with the data;
- Produced signatures are not statistically generalizable to new cases, and thus not reliable enough for translation to clinical practice.

Molecular signature multiplicity

- Causes of this phenomenon are unknown; several contradictory conjectures exist in the field:
- Signature multiplicity is due to small samples [Michiels et al., 2005]
- Signature multiplicity leads to predictively non-reproducible signatures [Ein-Dor et al., 2006]; building reproducible signatures requires thousands of samples [Ioannidis, 2005]
- Signature multiplicity is a by-product of the complex regulatory connectivity of genome [Dougherty and Brun, 2006]
- Artifacts of data pre-processing, e.g. normalization [Gold et al., 2005; Qiu et al., 2005; Ploner et al., 2005]

Major goals

- Develop a Markov boundary characterization of molecular signature multiplicity phenomenon;
- Design and study algorithms that can correctly identify the set of maximally predictive and non-redundant molecular signatures;
- Conduct an empirical evaluation of the novel algorithms and compare to the existing state-of-the-art methods;
- Test and refine previously stated hypotheses about the causes of signature multiplicity phenomenon.

Optimality criteria of signatures

Signatures that are focus of this research satisfy the following two optimality criteria:

- maximally predictive of the phenotype (they achieve best predictivity of the phenotype in the given dataset over all signatures based on different gene sets);
- do not contain predictively redundant genes (i.e., genes that can be removed from the signature without adversely affecting its predictivity).

Why do we need algorithms to extract as many optimal signatures as possible?

- A deeper understanding of the signature multiplicity phenomenon and how it affects reproducibility of signatures;
- Improving discovery of the underlying biological mechanisms by not missing genes that are implicated biologically in disease processes;
- Catalyzing regulatory approval by establishing in-silico equivalence to previously validated signatures

Existing algorithms for multiple signature extraction: Resampling-based methods

1) Generate resampled datasets from the training data (e.g., by bootstrapping)
2) Apply a standard signature extraction algorithm (e.g., SVM-RFE) to each resampled dataset, giving signatures X1, X2, X3, …, XN

- Based on assumption that multiplicity is strictly a small-sample phenomenon;
- An infinite number of resamplings is required to extract all optimal signatures;
- May stop producing multiple signatures in large sample sizes.

Existing algorithms for multiple signature extraction: Iterative removal

1) Original data (for all genes) → signature X1; remove corresponding genes from the dataset
2) Reduced data (excluding X1 genes) → signature X2; remove corresponding genes from the dataset
3) Reduced data (excluding X1 and X2 genes) → signature X3; remove corresponding genes from the dataset

…until a signature has statistically significantly reduced predictivity

- Agnostic to what causes molecular signature multiplicity;
- Cannot discover signatures that have genes in common.
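The iterative removal loop can be sketched as below. The signature extractor and the predictivity score are passed in as functions because the slides do not fix them; the toy versions used here, the gene names, and the scores are hypothetical. The statistical test for "significantly reduced predictivity" is replaced by a simple tolerance, which is an assumption of this sketch.

```python
def iterative_removal(genes, extract_signature, predictivity, tolerance=0.0):
    """Extract signatures one after another, removing their genes each time."""
    remaining = set(genes)
    signatures = []
    best_score = None
    while remaining:
        signature = extract_signature(remaining)
        if not signature:
            break
        score = predictivity(signature)
        if best_score is None:
            best_score = score
        elif score < best_score - tolerance:
            break   # stand-in for "statistically significantly reduced"
        signatures.append(set(signature))
        remaining -= set(signature)
    return signatures


# Hypothetical toy world: three candidate signatures with known scores.
TOY_SCORES = {
    frozenset({"A", "C"}): 0.95,
    frozenset({"B", "D"}): 0.95,
    frozenset({"E"}): 0.60,
}


def toy_extract(remaining):
    """Return the first known signature whose genes are all still available."""
    for candidate in TOY_SCORES:
        if candidate <= remaining:
            return candidate
    return frozenset()


def toy_predictivity(signature):
    return TOY_SCORES[frozenset(signature)]


found = iterative_removal("ABCDE", toy_extract, toy_predictivity)
print(found)
```

Because every extracted gene is removed, a signature such as {B, C} could never be returned after {A, C}, which mirrors the limitation noted in the last bullet above.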

Existing algorithms for multiple signature extraction: Stochastic gene selection

Genetic Algorithms (e.g., GA/KNN or GA/SVM)

- Can output all signatures that are discoverable by a genetic algorithm when it is allowed to evolve an infinite number of generations.

KIAMB

- Stochastic Markov boundary method based on IAMB algorithm;
- In a specific class of distributions, every optimal signature will be output by this method with nonzero probability;
- Requires an infinite number of iterations to discover all optimal signatures; will discover same signature over and over again;
- Sample requirements are of exponential order in the number of genes in a signature.

Existing algorithms for multiple signature extraction: Brute-force exhaustive search

LIKNON

- Examines predictivity of all individual genes in the dataset, all pairs of genes, all triples of genes, and so on;
- It is infeasible when a signature has more than 2-3 genes;
- Agnostic to what causes signature multiplicity.

In summary, no current algorithm provides a systematic and efficient approach for identification of the set of maximally predictive and non-redundant molecular signatures that exist in the underlying distribution.

Key definitions (1/2)

- Definition of maximally predictive molecular signature: A maximally predictive molecular signature is a molecular signature that maximizes predictivity of the phenotype relative to all other signatures that can be constructed from the same dataset.
- Definition of maximally predictive and non-redundant molecular signature: A maximally predictive and non-redundant molecular signature based on variables X is a maximally predictive signature such that any signature based on a proper subset of variables in X is not maximally predictive.

Key definitions (2/2)

- Definition of Markov blanket: A Markov blanket $M$ of the response variable $T \in V$ in the joint probability distribution $P$ over variables $V$ is a set of variables conditioned on which all other variables are independent of $T$, i.e. for every $Y \in V \setminus (M \cup \{T\})$, $T \perp Y \mid M$.

- Definition of Markov boundary (or non-redundant Markov blanket): If M is a Markov blanket of T and no proper subset of M satisfies the definition of Markov blanket of T, then M is called a Markov boundary (or non-redundant Markov blanket) of T.
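These two definitions can be checked by brute force on a tiny discrete distribution. The sketch below invents a joint distribution over binary variables A, B, T in which B is an exact copy of A and T depends on A; the probabilities and function names are assumptions for illustration. Conditional independence is tested directly from the definition, using P(t, y, m)·P(m) = P(y, m)·P(t, m).

```python
from itertools import combinations, product

VARIABLES = ("A", "B", "T")


def build_joint():
    """Joint distribution where B copies A and T depends on A (made up)."""
    joint = {}
    for a, t in product((0, 1), repeat=2):
        p_t1 = 0.9 if a == 1 else 0.2
        joint[(a, a, t)] = 0.5 * (p_t1 if t == 1 else 1 - p_t1)
    return joint


def prob(joint, assignment):
    """Marginal probability of a partial assignment {variable: value}."""
    return sum(p for outcome, p in joint.items()
               if all(outcome[VARIABLES.index(v)] == val
                      for v, val in assignment.items()))


def is_markov_blanket(joint, blanket, target="T", tol=1e-9):
    """True if every variable outside the blanket is independent of the
    target given the blanket."""
    others = [v for v in VARIABLES if v != target and v not in blanket]
    for y in others:
        for values in product((0, 1), repeat=len(blanket) + 2):
            m = dict(zip(blanket, values[:-2]))
            y_val, t_val = values[-2], values[-1]
            lhs = prob(joint, {**m, y: y_val, target: t_val}) * prob(joint, m)
            rhs = prob(joint, {**m, y: y_val}) * prob(joint, {**m, target: t_val})
            if abs(lhs - rhs) > tol:
                return False
    return True


def is_markov_boundary(joint, blanket):
    """A Markov blanket none of whose proper subsets is a Markov blanket."""
    if not is_markov_blanket(joint, blanket):
        return False
    return not any(is_markov_blanket(joint, subset)
                   for r in range(len(blanket))
                   for subset in combinations(blanket, r))


joint = build_joint()
boundaries = [s for r in range(3) for s in combinations(("A", "B"), r)
              if is_markov_boundary(joint, s)]
print("Markov boundaries of T:", boundaries)
```

In this toy distribution both {A} and {B} are Markov boundaries of T, while {A, B} is a Markov blanket but not a boundary. Making B an exact copy of A is just the simplest way to produce two boundaries in a few lines.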

Theoretical results

- Variable sets that participate in the maximally predictive signatures of T are precisely the Markov blankets of T and vice-versa;
- Similarly, variable sets that participate in the maximally predictive and non-redundant signatures of T are precisely the Markov boundaries of T and vice-versa;
- If a joint probability distribution P over variables V satisfies the intersection property*, then there exists a unique Markov boundary of T [Pearl, 1988].

A fundamental reduction used in this research for the analysis of signatures

[Figure: cases and controls plotted by Gene X and Gene Y, together with classifiers S1, S2, S3, S4, and S5. Some are labeled “Signatures that have maximal predictivity of the phenotype relative to their genes”; the others are labeled “Signatures with worse predictivity”.]

- Since there is an infinite number of signatures with maximal predictivity, when I refer to a signature, I mean one of the predictively equivalent classifiers (e.g., S3 or S4 or S5);
- Can study signature classes by reference only to their genes;
- This reduction is justified whenever the classifiers used can learn the minimum error decision function given sufficient sample.

Example of Markov boundary multiplicity

[Figure: network structure and distributional constraints.]

- Many optimal signatures exist: e.g., {A, C} and {B, C} are maximally predictive and non-redundant signatures of T. Furthermore, {A, C} and {B, C} remain maximally predictive even in infinite samples;
- The network has very low connectivity;
- Genes in optimal signatures do not have to be deterministically related: e.g., A and B are not deterministically related, yet individually convey the same information about T;
- If an algorithm selects only one optimal signature, then there is a danger of missing biologically important causative genes;
- The union of all optimal signatures includes all genes located in the local pathway around T;
- In this example the intersection of all optimal signatures contains only genes in the local pathway around T.

II. A Novel algorithm to correctly identify the set of maximally predictive and non-redundant signatures

Trace of the TIE* algorithm

- M = {A, B, F}
- G = {F}: Mnew = {A, B}. Not a Markov boundary; do not consider any G that is a superset of {F}
- G = {A}: Mnew = {C, B, F}. Markov boundary
- G = {B}: Mnew = {A, D, E, F}. Markov boundary
- G = {A, B}: Mnew = {C, D, E, F}. Markov boundary
- …
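The control flow behind such a trace can be sketched with toy stand-ins. This is not the TIE* implementation: the Markov boundary inducer, the subset generator, and the verification criterion are replaced by hypothetical oracles for a made-up world in which C can substitute for A, the pair {D, E} can substitute for B, and F has no substitute. Only the loop structure (remove a subset G, re-induce, verify, skip supersets of failed subsets) is meant to be illustrative.

```python
from itertools import combinations


def induce_boundary(removed):
    """Toy inducer: the boundary found once `removed` is hidden."""
    found = set()
    if "A" not in removed:
        found.add("A")
    elif "C" not in removed:
        found.add("C")
    if "B" not in removed:
        found.add("B")
    elif not ({"D", "E"} & removed):
        found |= {"D", "E"}
    if "F" not in removed:
        found.add("F")
    return frozenset(found)


def is_boundary(candidate):
    """Toy verification: needs (A or C), (B or both D and E), and F."""
    return (("A" in candidate or "C" in candidate)
            and ("B" in candidate or {"D", "E"} <= candidate)
            and "F" in candidate)


def tie_star_sketch():
    boundaries = [induce_boundary(frozenset())]   # first boundary M
    tried, blocked = set(), []
    while True:
        pool = sorted(set().union(*boundaries))
        # Toy subset generator: smallest subsets first, skipping any
        # superset of a subset that already failed verification.
        candidates = (frozenset(c)
                      for r in range(1, len(pool) + 1)
                      for c in combinations(pool, r))
        removed = next((g for g in candidates
                        if g not in tried
                        and not any(b <= g for b in blocked)), None)
        if removed is None:
            return boundaries
        tried.add(removed)
        m_new = induce_boundary(removed)
        if not is_boundary(m_new):
            blocked.append(removed)
        elif m_new not in boundaries:
            boundaries.append(m_new)


all_boundaries = tie_star_sketch()
for boundary in all_boundaries:
    print(sorted(boundary))
```

In this toy world the loop recovers the same four sets that appear in the trace above, although it visits the subsets G in a different order.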

Theoretical results (1/2)

- TIE* returns all and only Markov boundaries of T (i.e., maximally predictive and non-redundant signatures) if its input components X, Y, Z are admissible
- IAMB is an admissible Markov boundary algorithm (input component X) under assumptions
- IAMB correctly outputs a Markov boundary if only the composition property holds
- HITON-PC is an admissible Markov boundary algorithm (input component X) under assumptions
- HITON-PC correctly outputs a Markov boundary if the adjacency faithfulness assumption holds except for violations of the intersection axiom, global Markov condition holds, and there are no “spouses” in the Markov boundary

Theoretical results (2/2)

- Stated three strategies (IncLex, IncMinAssoc, and IncMaxAssoc) to generate subsets of variables that have to be removed from V to identify new Markov boundaries of T and proved their admissibility (input component Y)
- Stated two criteria (Independence and Predictivity) to verify Markov boundaries and proved their admissibility (input component Z).

III. Empirical evaluation of the novel algorithms and comparison with existing state-of-the-art methods

A. Experiments with artificial simulated data

Generative model is available, and the set of Markov boundaries (and thus the set of maximally predictive and non-redundant signatures) is known.

- Generate samples of systematically varied sizes;
- Compare to the gold standard;
- Test whether the TIE* algorithm behaves according to theoretical expectations and study its empirical properties;
- Obtain clues about behavior of TIE* and baseline comparison algorithms in experiments with real gene expression data.

Experiments with discrete networks TIED1 and TIED2

- Two artificial discrete networks were created:
- TIED1 consists of 30 variables (including a response variable T) and contains 72 Markov boundaries of T;
- TIED2 consists of 1,000 variables (including a response variable T) and contains the same 72 Markov boundaries of T as TIED1.

Experiments

- Goal: Compare TIE* to state-of-the-art algorithms (Resampling-based methods, KIAMB, and Iterative Removal) and examine sensitivity of the tested methods to high dimensionality.
- Findings:
- TIE* correctly identifies the set of true Markov boundaries (maximally predictive and non-redundant signatures) in the datasets with 30 or 1,000 variables;
- Iterative Removal identifies only 1 signature;
- KIAMB fails to identify any true signature, and its output signatures have poor predictivity;
- Resampling-based methods either miss true signatures and/or output many redundant variables in the signatures.

Experiments with linear continuous network LIND

LIND consists of 41 variables (including a response variable T) and contains 12 Markov boundaries of T.

Experiments

- Goals:
- Analyze behavior of TIE* as a function of sample size using data generated from a continuous network;
- Compare criteria Independence and Predictivity for verification of Markov boundaries in the TIE* algorithm.
- Findings:
- As sample size increases, the performance of both instantiations of TIE* generally improves and the algorithms discover the set of true Markov boundaries;
- The α-level in the criterion Predictivity significantly affects the number of Markov boundaries output by the TIE* algorithm;
- TIE* with criterion Predictivity typically leads to a larger number of output Markov boundaries and on average superior performance compared to criterion Independence.

Experiments with discrete network XORD

XORD consists of 41 variables (including a response variable T) and contains 25 Markov boundaries of T.

Experiments

- Goal: Evaluate TIE* when the popular Markov boundary algorithms such as IAMB and HITON-PC are not applicable due to violations of their fundamental assumptions.
- Findings:
- TIE* discovers the set of true Markov boundaries when the sample is ≥ 2,000;
- There is ~1 false positive variable in each discovered Markov boundary for large sample sizes.

B. Experiments with resimulated microarray gene expression data

- Resimulated data by design closely resembles real human lung cancer microarray gene expression data;
- Knowledge of the generative model allows generating arbitrarily large samples and studying the behavior of TIE* as a function of sample size;
- Unlike prior experiments with artificial simulated datasets, the set of maximally predictive and non-redundant signatures is not known a priori.

Experiment

Goal: Examine whether the signature multiplicity phenomenon vanishes as the sample size grows.

Results:

Findings of other experiments

- TIE* is not sensitive to the choice of the initial signature discovered by the algorithm;
- Post-processing TIE* signatures with wrapping results in more signatures with smaller number of genes;
- Signatures output by tested non-TIE* methods are either redundant or have inferior predictivity compared to signatures output by TIE* techniques.

C. Experiments with real human microarray gene expression data

- Independent-Dataset Experiments:Using pairs of microarray datasets either from different laboratories or different platforms;
- Single-Dataset Experiments:Additional experiments with relatively large sample size microarray datasets;
- The primary goal of both experiments is to compare TIE* and baseline algorithms for multiple signature extraction in terms of maximal predictivity of induced signatures and reproducibility in independent data.
- Operational definition of “maximal predictivity”: Empirically best classification performance (AUC) achievable in each dataset over all tested methods.
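The operational definition above can be sketched in a few lines. The labels, scores, and method names below are hypothetical, and the sketch omits the statistical test that the slides use to decide whether a method's AUC is distinguishable from the empirical maximum:

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation labels and classifier scores from two methods.
labels = [1, 1, 1, 0, 0, 0]
scores_by_method = {
    "method_A": [0.9, 0.8, 0.4, 0.5, 0.2, 0.1],
    "method_B": [0.9, 0.3, 0.4, 0.5, 0.6, 0.1],
}

auc_by_method = {m: auc(labels, s) for m, s in scores_by_method.items()}

# "Maximal predictivity" in this dataset: the best AUC over all tested methods.
empirical_max = max(auc_by_method.values())
print(auc_by_method, empirical_max)
```

Because the maximum is taken only over the methods that were actually tested, it is an empirical benchmark for the dataset rather than a theoretical upper bound.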

TIE* signatures have maximal predictivity

- TIE* achieves maximal predictivity in 5 out of 6 validation datasets;
- Non-TIE* methods achieve maximal predictivity in 0 to 2 datasets depending on the method;
- In the dataset where the predictivity of TIE* is statistically distinguishable from the empirically maximal one (Lung Cancer Subtype Classification), the magnitude of this difference is only 0.009 AUC on average over all discovered signatures.

TIE* signatures are reproducible, other signatures may be overfitted

- TIE* has no overfitting on average over all signatures and datasets;
- Other methods achieve predictivity in the validation data that is lower than that in the discovery data (by 0.02-0.03 AUC), besides having inferior predictivity…
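One simple way to quantify the overfitting described above is the average gap between discovery and validation AUC over all signatures. This is a sketch of that reading, not necessarily the exact statistic used in the study, and the AUC values are hypothetical:

```python
def mean_overfitting(discovery_aucs, validation_aucs):
    """Average (discovery AUC - validation AUC) over paired signatures.

    A value clearly above zero suggests overfitting to the discovery data.
    """
    gaps = [d - v for d, v in zip(discovery_aucs, validation_aucs)]
    return sum(gaps) / len(gaps)

# Hypothetical AUCs for two signatures produced by one method.
gap = mean_overfitting([0.90, 0.92], [0.88, 0.89])
print(round(gap, 3))
```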

TIE* signatures in comparison with other signatures

Predictivity results for Leukemia 5 Yr. Prognosis task

Multiple signatures output by TIE* have maximal predictivity & low variance

Classification performance (AUC) in discovery dataset

Multiple signatures output by other methods do not achieve maximal predictivity and have high variance

Each dot in the plot corresponds to a signature (computational model) of the outcome, e.g., Outcome(x) = Sign(w∙x + b),

where x, w ∈ ℝ^m, b ∈ ℝ, and m is the number of genes in the signature.
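As a concrete reading of the linear model above, the sketch below applies a signature to one sample. The weights, offset, and expression values are hypothetical, and mapping a score of exactly zero to +1 is an assumed convention:

```python
def sign(z):
    """Return +1 for z >= 0 and -1 otherwise (zero handling is an assumption)."""
    return 1 if z >= 0 else -1

def outcome(x, w, b):
    """Outcome(x) = Sign(w . x + b) for one sample x with m gene values."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical signature with m = 3 genes.
w = [0.8, -1.2, 0.5]
b = -0.1

pred_a = outcome([1.0, 0.2, 0.3], w, b)  # w.x + b = 0.61
pred_b = outcome([0.1, 1.0, 0.0], w, b)  # w.x + b = -1.22
print(pred_a, pred_b)
```

The raw score w∙x + b, rather than its sign, is what would typically be ranked to compute an AUC.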

Classification performance (AUC) in validation dataset

Single-dataset experiments: Datasets

- Validation dataset: a subset of 100 samples/patients
- Discovery dataset: all remaining samples/patients
- Repeat splits into discovery & validation datasets 10 times to minimize variance
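The splitting protocol above can be sketched as follows. The slides do not say how the 100 validation samples are chosen, so uniform random sampling without stratification is an assumption here, as are the function name and the example sample count:

```python
import random

def repeated_splits(n_samples, n_validation=100, n_repeats=10, seed=0):
    """Return n_repeats (discovery_indices, validation_indices) pairs.

    Each repeat draws n_validation samples at random for validation and
    assigns all remaining samples to discovery.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    splits = []
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        validation = sorted(shuffled[:n_validation])
        discovery = sorted(shuffled[n_validation:])
        splits.append((discovery, validation))
    return splits

# Hypothetical dataset with 286 samples/patients.
splits = repeated_splits(286)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Performance would then be averaged over the 10 repeats, which is what reduces the variance of the estimate relative to a single split.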

Single-dataset experiments: Summary results

- Results are similar to the ones from independent-dataset experiments;
- TIE* achieves maximal predictivity in 6 out of 7 validation datasets;
- Non-TIE* methods achieve maximal predictivity in 0 to 1 datasets depending on the method;
- In the dataset where TIE* has predictivity that is statistically distinguishable from the empirically maximal one (Breast Cancer Subtype Classification II), the magnitude of this difference is less than 0.01 AUC on average over all discovered signatures.

Revisiting previously published hypotheses about signature multiplicity

- Signature reproducibility neither precludes multiplicity nor requires sample sizes with thousands of subjects;
- Multiplicity of signatures does not require dense connectivity;
- Noisy measurements or normalization are not necessary conditions for signature multiplicity;
- Multiplicity can be produced by a combination of small sample size-related variance and intrinsic multiplicity in the underlying network;
- Multiple signatures output by TIE* are reproducible even though they are derived from small sample, noisy, and heavily-processed data.

A more complete picture is emerging regarding causes of multiplicity...

- Intrinsic information redundancy in the underlying biological system;
- Variability in the output of gene selection and classifier algorithms, especially in small sample sizes;
- Small sample statistical indistinguishability of signatures with different large sample predictivity and/or redundancy characteristics;
- Presence of hidden variables;
- Correlated measurement noise;
- RNA amplification techniques that systematically distort measurements of transcript ratios;
- Cellular aggregation and sampling from mixtures of distributions that affect inference of conditional independence relations;
- Normalization and other data pre-processing methods that artificially increase correlations among genes;
- Engineered redundancy in the assay technology platforms.

Summary of results

- Developed a Markov boundary characterization of molecular signature multiplicity;
- Designed a generative algorithm that can, in principle, correctly identify the set of maximally predictive and non-redundant molecular signatures independently of the data distribution;
- Conducted an empirical evaluation of the novel algorithm and compared it to existing state-of-the-art methods using artificial simulated, resimulated microarray gene expression, and real human microarray gene expression data;
- Tested and refined several hypotheses about the causes of the molecular signature multiplicity phenomenon.

General conclusions

- Molecular signatures play a crucial role in personalized medicine and translational bioinformatics.
- Molecular signatures are being used to treat patients today, not in the future.
- Development of accurate molecular signatures should rely on the use of supervised methods.
- In general, there are many challenges for computational analysis of omics data for development of molecular signatures.
- One of these challenges is molecular signature multiplicity.
- There exists an algorithm that can extract the set of maximally predictive and non-redundant molecular signatures from high-throughput data.

Homework (Due next Monday)

- Read the paper “Analysis and Computational Dissection of Molecular Signature Multiplicity”.
- Describe a novel and interesting application area for the TIE* algorithm. Feel free to use an example from your research where there exist many molecular signatures of some response variable (1/2 page max).
- Come up with another cause of molecular signature multiplicity that was not mentioned in the paper (1/2 page max).

Email your work to Alexander.Statnikov@med.nyu.edu

Computational Causal Discovery Laboratory at NYU Center for Health Informatics and Bioinformatics (CHIBI)

- The purpose of our lab is to develop, test and apply computational causal discovery methods suitable for molecular, clinical, imaging and multi-modal data of high-dimensionality.
- We are interested in methods to address the following questions:
- What is causing disease/phenotype?
- What are the effects of disease/phenotype?
- Which biological pathways are involved?
- How should drugs/treatments be designed?
- How does genotype cause differences in response to treatment?
- How does the environment modify or even supersede the normal causal function of genes and other molecular variables?
- How are genes and proteins organized in complex causal regulatory networks?
- Questions? Email to Alexander.Statnikov@med.nyu.edu
