Molecular Signaling & Drug Development Course: Development of Molecular Signatures from High-Throughput Assay Data

Alexander Statnikov, Ph.D.

Director, Computational Causal Discovery Laboratory

Benchmarking Director, Best Practices Integrative Informatics Consultation Service

Assistant Professor, Department of Medicine, Division of Clinical Pharmacology

Center for Health Informatics and Bioinformatics, NYU School of Medicine

5/16/2011

Outline
  • Part 1: Introduction to molecular signatures
  • Part 2: Key principles for developing accurate molecular signatures
  • Part 3: Comprehensive evaluation of algorithms to develop molecular signatures for cancer classification
  • Part 4: Analysis and computational dissection of molecular signature multiplicity
  • Conclusion
  • Homework assignment
Definition of a molecular signature

A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.

FDA view on molecular signatures

The FDA calls them “in vitro diagnostic multivariate index assays”

1. “Class II Special Controls Guidance Document: Gene Expression Profiling Test System for Breast Cancer Prognosis”:

  • Addresses device classification.

2. “The Critical Path to New Medical Products”:

  • Identifies pharmacogenomics as crucial to advancing medical product development and personalized medicine.

3. “Draft Guidance on Pharmacogenetic Tests and Genetic Tests for Heritable Markers” & “Guidance for Industry: Pharmacogenomic Data Submissions”

  • Identifies 3 main goals (dose, ADEs, responders),
  • Defines IVDMIA,
  • Encourages “fault-free” sharing of pharmacogenomic data,
  • Separates “probable” from “valid” biomarkers,
  • Focuses on genomics (and not other omics).
Main uses of molecular signatures
  • Direct benefits: Models of disease phenotype/clinical outcome
    • Diagnosis
    • Prognosis, long-term disease management
    • Personalized treatment (drug selection, titration)
  • Ancillary benefits 1: Biomarkers for diagnosis or outcome prediction
    • Make the above tasks resource-efficient and easy to use in clinical practice
    • Help next-generation molecular imaging
    • Leads for potential new drug candidates
  • Ancillary benefits 2: Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)
    • Leads for potential new drug candidates
Less conventional uses of molecular signatures
  • Increase clinical trial sample efficiency, decrease costs, or both, using placebo responder signatures;
  • In silico signature-based candidate drug screening;
  • Drug “resurrection”;
  • Establishing existence of biological signal in very small sample situations where univariate signals are too weak;
  • Assess the importance of markers and of the mechanisms that involve them;
  • Choosing the right animal model;
  • …?
Recent molecular signatures available for patient care

  • Agendia
  • Clarient
  • Prediction Sciences
  • LabCorp
  • University Genomics
  • Genomic Health
  • Veridex
  • BioTheranostics
  • Applied Genomics
  • Power3
  • OvaSure
  • Correlogic Systems


MammaPrint

• Developed by Agendia (www.agendia.com)

• 70-gene signature to stratify women with non-metastatic breast cancer into “low risk” and “high risk” for recurrence of the disease

• Independently validated in >1,000 patients

• So far performed >10,000 tests

• Cost of the test is ~$3,000

• In February 2007, the FDA cleared the MammaPrint test for marketing in the U.S. for node-negative women under 61 years of age with tumors of less than 5 cm.

• TIME Magazine’s 2007 “medical invention of the year”.

Oncotype DX

• Developed by Genomic Health (www.genomichealth.com)

• 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse

• Independently validated in >1,000 patients

• So far performed >50,000 tests

• Cost of the test is ~$3,000

• The following paper demonstrates the health and cost-effectiveness benefits of using Oncotype DX: http://www3.interscience.wiley.com/cgi-bin/abstract/114124513/ABSTRACT

Main ingredients for developing a molecular signature

[Diagram: a well-defined clinical problem & access to patients/samples, plus high-throughput assays, feed into computational & biostatistical analysis, which produces the molecular signature]

Challenges in computational analysis of omics data
  • Relatively easy to develop a predictive model and even easier to believe that a model is good when it is not → false sense of security
  • Several problems exist: some theoretical and some practical
  • Omics data has many special characteristics and is tricky to analyze!
Example: OvaCheck (1/2)
  • Developed by Correlogic (www.correlogic.com)
  • Blood test for the early detection of epithelial ovarian cancer
  • Failed to obtain FDA approval
  • Looks for subtle changes in patterns among the tens of thousands of proteins, protein fragments and metabolites in the blood
  • Signature developed by genetic algorithm
  • Significant artifacts in data collection & analysis questioned validity of the signature:
    • Results are not reproducible
    • Data collected differently for different groups of patients

http://www.nature.com/nature/journal/v429/n6991/full/429496a.html

Example: OvaCheck (2/2)

[Figure: panels A-F, from Baggerly et al. (Bioinformatics, 2004)]

E.g., for classification (predict response to treatment)

[Scatter plot: patients plotted by p53 and Rb expression; a decision surface separates patients who respond to treatment Tx1 from those who do not]

Another use of clustering
  • Cluster genes (instead of patients):
    • Genes that cluster together may belong to the same pathways
    • Genes that cluster apart may be unrelated
Unfortunately, clustering is a non-specific method and falls into the ‘one solution fits all’ trap when used for classification.

[Scatter plot: patients plotted by p53 and Rb expression, labeled squamous carcinoma vs. adenocarcinoma]

Clustering is also non-specific when used to discover pathways, or other mechanistic relationships

It is entirely possible in this simple illustrative counter-example for G3 (a gene causally unrelated to the phenotype) to be more strongly associated with, and thus cluster with, the phenotype (or its surrogate genes) than the true causal oncogenes G1 and G2.

[Network diagram: genes G1 and G2 causally influence phenotype Ph; G3 is associated with Ph but causally unrelated]

Two improved classes of methods
  • Supervised learning → classification/molecular signatures and markers
  • Regulatory network reverse engineering → pathways

Supervised learning: Use the known phenotypes (a.k.a. “class labels”) in training data to build signatures or find markers highly specific for that phenotype

[Workflow diagram: training samples with variables A, B, C, D and target T (rows A1, B1, C1, D1, T1 through An, Bn, Cn, Dn, Tn) are fed into a classifier/regression algorithm, which outputs a molecular signature; the signature is then applied to testing/validation samples to estimate classification performance]


Input data for supervised learning methods

[Table: rows are samples, each with a class label (Primary or Metastatic); columns are variables/features, e.g., gene expression measurements]
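To make the data layout concrete, here is a minimal sketch (the numbers are made up for illustration, not the slide's values) of how such a table is typically represented for a learning algorithm: a numeric matrix X of samples × features and a label vector y.

```python
import numpy as np

# Hypothetical stand-in for the slide's table: each row is one sample
# (patient), each column one variable/feature (e.g., a gene's expression).
X = np.array([
    [2.1, 0.3, 5.6, 1.2],   # sample 1
    [1.9, 4.2, 0.8, 3.3],   # sample 2
    [2.4, 0.1, 6.0, 0.9],   # sample 3
    [1.7, 3.9, 1.1, 3.6],   # sample 4
])

# Class label for each sample (the phenotype to be predicted).
y = np.array(["Primary", "Metastatic", "Primary", "Metastatic"])

print(X.shape)  # (4, 4): 4 samples, 4 features
```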


Principles and geometric representation for supervised learning (1/7)

  • Want to classify objects as boats and houses.

Principles and geometric representation for supervised learning (2/7)

  • All objects before the coast line are boats and all objects after the coast line are houses.
  • The coast line serves as a decision surface that separates the two classes.

Principles and geometric representation for supervised learning (3/7)

[Map: with the coast line as the decision surface, a few boats end up on the wrong side and will be misclassified as houses, and one house will be misclassified as a boat]


Principles and geometric representation for supervised learning (4/7)

[Scatter plot: each boat and house becomes a point in a latitude/longitude coordinate system]

  • The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example.
  • First all objects are represented geometrically.

Principles and geometric representation for supervised learning (5/7)

[Scatter plot: a linear decision surface separating boats from houses in latitude/longitude space]

Then the algorithm seeks to find a decision surface that separates classes of objects


Principles and geometric representation for supervised learning (6/7)

[Scatter plot: new, unlabeled objects (shown as “?”) on both sides of the decision surface; those above it are classified as houses, those below it as boats]

Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if they fall above it


Principles and geometric representation for supervised learning (7/7)

[Scatter plot: three new objects (Object #1, #2, #3) positioned relative to the decision surface]

In 2-D this looks simple but what happens in higher dimensional data…
  • 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)
  • >500,000 (tiled microarrays, SNP arrays)
  • 10,000-300,000 (regular MS proteomics)
  • >10,000,000 (LC-MS proteomics)
  • >100,000,000 (next-generation sequencing)

This is the ‘curse of dimensionality’ problem

High-dimensionality (especially with small samples) causes:
  • Some methods do not run at all (e.g., classical regression)
  • Some methods give bad results (KNN, decision trees)
  • Very slow analysis
  • Very expensive/cumbersome clinical application
  • A tendency to “overfit”
Two problems: Over-fitting & Under-fitting
  • Over-fitting (a model to your data) = building a model that performs well on the original data but fails to generalize to new/unseen data
  • Under-fitting (a model to your data) = building a model that performs poorly on both the original data and new/unseen data
Over/under-fitting are related to complexity of the decision surface and how well the training data is fit

[Plot: outcome of interest Y vs. predictor X, showing training data and future data; a moderately complex curve (“this line is good!”) follows both, while a highly complex curve (“this line overfits!”) tracks the training data closely but misses the future data]

[Plot: same axes; an overly simple line (“this line underfits!”) fits neither the training data nor the future data well]

Very important concept…
  • Successful data analysis methods balance training data fit with complexity:
    • Too complex a signature (to fit training data well) → overfitting (i.e., the signature does not generalize);
    • Too simplistic a signature (to avoid overfitting) → underfitting (it will generalize, but the fit to both training and future data will be low and predictive performance poor).
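A minimal sketch of this trade-off on synthetic 1-D data (all values are made up for illustration): a degree-1 polynomial underfits, a high-degree polynomial overfits, and a moderate degree balances fit and complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical smooth relationship between predictor X and outcome Y, plus noise.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = make_data(20)    # small training sample
x_test, y_test = make_data(200)     # stands in for "future data"

for degree in (1, 3, 15):           # too simple, balanced, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# Typical outcome: degree 1 underfits (both errors high); degree 15 overfits
# (train error near zero, test error high); degree 3 does well on both.
```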
The Support Vector Machine (SVM) approach for building molecular signatures
  • The support vector machine (SVM) is a binary classification algorithm.
  • SVMs are important because of (a) theoretical reasons:
    • Robust to very large number of variables and small samples
    • Can learn both simple and highly complex classification models
    • Employ sophisticated mathematical principles to avoid overfitting

and (b) superior empirical results.

Main ideas of SVMs (1/3)

[Scatter plot: normal patients and cancer patients represented as points by expression of gene X and gene Y]

  • Consider an example dataset described by 2 genes, gene X and gene Y
  • Represent patients geometrically (by “vectors”)
Main ideas of SVMs (2/3)

[Scatter plot: a linear decision surface with the largest possible gap (margin) separating normal patients from cancer patients in gene X / gene Y space]

  • Find a linear decision surface (“hyperplane”) that can separate patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”);
Main ideas of SVMs (3/3)
  • If such a linear decision surface does not exist, the data is mapped into a much higher-dimensional space (“feature space”) where the separating decision surface is found;
  • The feature space is constructed via very clever mathematical projection (“kernel trick”).
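As an illustration, here is a minimal scikit-learn sketch of both ideas on a synthetic two-gene dataset (the data and parameter values are placeholders, not from the study): a linear SVM finds a maximum-margin hyperplane, and an RBF-kernel SVM handles the case where no good linear surface exists.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for a "gene X / gene Y" dataset with two patient classes.
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Linear SVM: finds the separating hyperplane with the largest margin.
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", linear_svm.support_vectors_.shape[0])  # border-line patients

# Kernel SVM: the RBF kernel implicitly maps the data into a high-dimensional
# feature space where a separating surface is sought (the "kernel trick").
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", rbf_svm.score(X, y))
```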
On estimation of signature accuracy

Large sample case: use hold-out validation.

[Diagram: the data is split once into a train portion and a test portion]

Small sample case: use N-fold cross-validation.

[Diagram: the data is split into N folds; in each of the N rounds a different fold serves as the test set and the remaining folds are used for training, so every sample is tested exactly once]
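A minimal sketch of both designs with scikit-learn (a bundled dataset is used purely as a placeholder for real assay data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # placeholder data

# Large sample case: hold-out validation (single train/test split).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print("hold-out accuracy:", clf.score(X_te, y_te))

# Small sample case: N-fold cross-validation (here N = 5); every sample
# appears in a test fold exactly once.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```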

Nested N-fold cross-validation

Recall the main idea of cross-validation: the data is repeatedly split into train and test folds. What combination of learner parameters should be applied to the training data? Perform a “grid search” using another, nested loop of cross-validation: each outer training set is further split into train and validation folds to select the parameters.
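A minimal sketch of nested cross-validation in scikit-learn (the grid and data are illustrative placeholders): the inner loop selects learner parameters on training folds only, and the outer loop estimates the performance of the whole procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # placeholder data

# Inner loop: grid search over parameter combinations within the training folds.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
inner_cv = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)

# Outer loop: the test folds never influence parameter selection, so the
# resulting estimate is not optimistically biased by tuning.
outer_scores = cross_val_score(inner_cv, X, y, cv=5)
print("nested CV accuracy:", outer_scores.mean())
```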

Overview of challenges in computational analysis of omics data for development of molecular signatures

[Concept map centered on “Data Analytics of Molecular Signatures”, surrounded by:
  • Many variables, small sample, noise, artifacts
  • Rashomon effect / marker multiplicity
  • Assay validity / reproducibility
  • Research designs
  • Efficiency: statistical / computational
  • Is there predictive signal?
  • Causality vs. predictiveness / biological significance
  • Methods development: re-inventing the wheel & specialization
  • Epistasis
  • Instability
  • Performance: predictivity, compactness
  • Protocols / guidelines
  • Editorializing / over-simplifying / sensationalism]

Part 3: Comprehensive evaluation of algorithms to develop molecular signatures for cancer classification
Comprehensive evaluation of algorithms for classification of cancer microarray data
  • Main goals:
    • Find the best-performing algorithms for building molecular signatures for cancer diagnosis from microarray gene expression data;
    • Investigate benefits of using gene selection and ensemble classification methods.
Classification algorithms
  • Instance-based: K-Nearest Neighbors (KNN)
  • Neural networks: Backpropagation Neural Networks (NN), Probabilistic Neural Networks (PNN)
  • Kernel-based: Multi-Class SVM One-Versus-Rest (OVR), One-Versus-One (OVO), DAGSVM, Multi-Class SVM by Weston & Watkins (WW), Multi-Class SVM by Crammer & Singer (CS)
  • Voting: Weighted Voting One-Versus-Rest, Weighted Voting One-Versus-One
  • Decision trees: CART

Ensemble classification methods

[Diagram: the dataset is given to Classifier 1, Classifier 2, ..., Classifier N, producing Prediction 1, Prediction 2, ..., Prediction N; an ensemble classifier combines these predictions into a final prediction]
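A minimal sketch of the idea with scikit-learn's VotingClassifier (the base learners and data are placeholders, not the study's exact configuration): several classifiers are trained on the same dataset and their predictions are combined by majority vote.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder data

# N base classifiers produce N predictions; a majority vote gives the final one.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear")),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # majority voting on predicted class labels
)
print("ensemble 5-fold CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```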
Gene selection methods

[Diagram: genes ranked by a selection criterion, separating highly discriminatory genes from uninformative genes]
  • Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;
  • Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;
  • Kruskal-Wallis nonparametric one-way ANOVA (KW);
  • Ratio of genes between-categories to within-category sum of squares (BW).
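As an illustration of the first method, here is a minimal sketch of the signal-to-noise ratio in one-versus-rest form on synthetic data (the per-gene formula (mean_pos - mean_rest) / (std_pos + std_rest) is the standard Golub-style S2N; the data are made up):

```python
import numpy as np

def s2n_ovr(X, y, positive_class):
    """Signal-to-noise ratio of each gene, one-versus-rest:
    (mean_pos - mean_rest) / (std_pos + std_rest)."""
    pos, rest = X[y == positive_class], X[y != positive_class]
    return (pos.mean(axis=0) - rest.mean(axis=0)) / (pos.std(axis=0) + rest.std(axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1000))   # hypothetical 60 samples x 1,000 genes
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :10] += 2.0             # plant 10 truly discriminatory genes

scores = s2n_ovr(X, y, positive_class=1)
top10 = np.argsort(np.abs(scores))[::-1][:10]  # keep the highest-|S2N| genes
print(sorted(top10))              # mostly indices 0-9, the planted genes
```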
Performance metrics and statistical comparison
  • Accuracy
    + can compare to previous studies
    + easy to interpret & simplifies statistical comparison
  • Relative classifier information (RCI)
    + easy to interpret & simplifies statistical comparison
    + not sensitive to distribution of classes
    + accounts for difficulty of a decision problem
  • Randomized permutation testing to compare accuracies of the classifiers (α = 0.05)
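A minimal sketch of one common form of such a test (the study's exact statistic may differ): under the null hypothesis the two classifiers are exchangeable, so their per-case outcomes can be randomly swapped to build the null distribution of the accuracy difference.

```python
import numpy as np

def perm_test_accuracy(correct_a, correct_b, n_perm=10_000, seed=0):
    """Two-sided randomized permutation test for the accuracy difference of two
    classifiers on the same test cases; correct_a/correct_b are boolean arrays
    marking which predictions were correct."""
    rng = np.random.default_rng(seed)
    observed = correct_a.mean() - correct_b.mean()
    hits = 0
    for _ in range(n_perm):
        swap = rng.random(correct_a.size) < 0.5   # exchange outcomes per case
        a = np.where(swap, correct_b, correct_a)
        b = np.where(swap, correct_a, correct_b)
        hits += abs(a.mean() - b.mean()) >= abs(observed)
    return hits / n_perm

# Hypothetical per-case correctness of two classifiers on 100 test cases:
rng = np.random.default_rng(1)
p = perm_test_accuracy(rng.random(100) < 0.85, rng.random(100) < 0.70)
print("p-value:", p)  # declare a significant difference if p < alpha = 0.05
```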
Microarray datasets
  • Total:
    • ~1,300 samples
    • 74 diagnostic categories
    • 41 cancer types and 12 normal tissue types
Summary of methods and datasets

  • Gene expression datasets (11):
    • Multicategory Dx: 9_Tumors, 11_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2, Leukemia1, Leukemia2, Lung_Cancer, SRBCT
    • Binary Dx: Prostate_Tumors, DLBCL
  • Classifiers (11): MC-SVM (One-Versus-Rest, One-Versus-One, DAGSVM, method by WW, method by CS), KNN, backpropagation NN, probabilistic NN, weighted voting (One-Versus-Rest, One-Versus-One), decision trees
  • Ensemble classifiers (7): based on MC-SVM outputs (majority voting, MC-SVM OVR, MC-SVM OVO, MC-SVM DAGSVM, decision trees); based on outputs of all classifiers (majority voting, decision trees)
  • Gene selection methods (4): S2N One-Versus-Rest, S2N One-Versus-One, non-parametric ANOVA (KW), BW ratio
  • Cross-validation designs (2): 10-fold CV, LOOCV
  • Performance metrics (2): accuracy, RCI
  • Statistical comparison: randomized permutation testing
Results without gene selection

[Bar chart: accuracy (%) of the multi-class SVM methods (OVR, OVO, DAGSVM, WW, CS) versus KNN, NN, and PNN]
Results with gene selection

[Bar charts: diagnostic performance (accuracy, %) before and after gene selection on the 9_Tumors, 14_Tumors, Brain_Tumor1, and Brain_Tumor2 datasets, for SVM and non-SVM methods (OVR, OVO, DAGSVM, WW, CS, KNN, NN, PNN); panels also show the improvement in accuracy from gene selection, averaged over the four datasets]

Average reduction of genes is 10-30 times.

Comparison with previously published results

[Bar chart: accuracy (%) of multiclass SVMs (this study) versus multiple specialized classification methods (original primary studies)]
Summary of results
  • Multi-class SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV;
  • Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms;
  • Ensemble classification does not improve performance of SVM and other classifiers;
  • Results obtained by SVMs favorably compare with the literature.

Random Forest (RF) classifiers

  • Appealing properties
    • Work when # of predictors > # of samples
    • Embedded gene selection
    • Incorporate interactions
    • Based on theory of ensemble learning
    • Can work with binary & multiclass tasks
    • Do not require much fine-tuning of parameters
  • Strong theoretical claims
  • Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods
Key principles of RF classifiers

[Diagram:
  1) Generate bootstrap samples from the training data;
  2) Random gene selection;
  3) Fit unpruned decision trees;
  4) Apply the trees to testing data & combine their predictions]
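A minimal scikit-learn sketch of these four steps (parameters and data are placeholders, not those of Diaz-Uriarte and Alvarez de Andres):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,      # 1) one bootstrap sample of the training data per tree
    max_features="sqrt",   # 2) random gene (feature) subset considered at each split
    bootstrap=True,
    random_state=0,        # 3) trees are grown unpruned by default
).fit(X_tr, y_tr)

# 4) Apply to testing data; predictions are combined across trees.
print("test accuracy:", rf.score(X_te, y_te))
# Embedded gene selection: importances rank the features.
print("top 5 features:", rf.feature_importances_.argsort()[::-1][:5])
```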

Results without gene selection
  • SVMs nominally outperform RFs in 15 datasets, RFs outperform SVMs in 4 datasets, and the algorithms perform exactly the same in 3 datasets.
  • In 7 datasets SVMs outperform RFs statistically significantly.
  • On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI.
Results with gene selection
  • SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, and the algorithms perform exactly the same in 2 datasets.
  • In 1 dataset SVMs outperform RFs statistically significantly.
  • On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.
Molecular signature multiplicity
  • Different methods or samples from the same population lead to different but apparently maximally predictive signatures;
  • Far-reaching implications for biological discovery and development of next generation patient diagnostics and personalized treatments:
    • Generation of biological hypotheses is very hard even when signatures are maximally predictive of the phenotype since thousands of completely different signatures are equally consistent with the data;
    • Produced signatures are not statistically generalizable to new cases, and thus not reliable enough for translation to clinical practice.
Molecular signature multiplicity
  • Causes of this phenomenon are unknown; several contradictory conjectures exist in the field:
    • Signature multiplicity is due to small samples [Michiels et al., 2005]
    • Signature multiplicity leads to predictively non-reproducible signatures [Ein-Dor et al., 2006]; building reproducible signatures requires thousands of samples [Ioannidis, 2005]
    • Signature multiplicity is a by-product of the complex regulatory connectivity of genome [Dougherty and Brun, 2006]
    • Artifacts of data pre-processing, e.g. normalization [Gold et al., 2005; Qiu et al., 2005; Ploner et al., 2005]
Major goals
  • Develop a Markov boundary characterization of molecular signature multiplicity phenomenon;
  • Design and study algorithms that can correctly identify the set of maximally predictive and non-redundant molecular signatures;
  • Conduct an empirical evaluation of the novel algorithms and compare to the existing state-of-the-art methods;
  • Test and refine previously stated hypotheses about the causes of signature multiplicity phenomenon.
Optimality criteria of signatures

Signatures that are the focus of this research satisfy the following two optimality criteria:

  • maximally predictive of the phenotype (they achieve best predictivity of the phenotype in the given dataset over all signatures based on different gene sets);
  • do not contain predictively redundant genes (i.e., genes that can be removed from the signature without adversely affecting its predictivity).
Why do we need algorithms to extract as many optimal signatures as possible?
  • A deeper understanding of the signature multiplicity phenomenon and how it affects reproducibility of signatures;
  • Improving discovery of the underlying biological mechanisms by not missing genes that are implicated biologically in disease processes;
  • Catalyzing regulatory approval by establishing in-silico equivalence to previously validated signatures
Existing algorithms for multiple signature extraction: Resampling-based methods

[Diagram: 1) generate resampled datasets from the training data (e.g., by bootstrapping); 2) apply a standard signature extraction algorithm (e.g., SVM-RFE) to each, yielding signatures X1, X2, X3, ..., XN]

  • Based on the assumption that multiplicity is strictly a small-sample phenomenon;
  • An infinite number of resamplings is required to extract all optimal signatures;
  • May stop producing multiple signatures in large sample sizes.

Existing algorithms for multiple signature extraction: Iterative removal

[Diagram: extract signature X1 from the original data (all genes); remove the corresponding genes and extract X2 from the reduced data (excluding X1 genes); remove those genes and extract X3 from the data excluding X1 and X2 genes; ...continue until a signature has statistically significantly reduced predictivity]

  • Agnostic to what causes molecular signature multiplicity;
  • Cannot discover signatures that have genes in common.
Existing algorithms for multiple signature extraction: Stochastic gene selection

Genetic Algorithms (e.g., GA/KNN or GA/SVM)

  • Can output all signatures that are discoverable by a genetic algorithm when it is allowed to evolve an infinite number of generations.

KIAMB

  • Stochastic Markov boundary method based on IAMB algorithm;
  • In a specific class of distributions, every optimal signature will be output by this method with nonzero probability;
  • Requires an infinite number of iterations to discover all optimal signatures; will discover the same signature over and over again;
  • Sample requirements are of exponential order in the number of genes in a signature.
Existing algorithms for multiple signature extraction: Brute-force exhaustive search

LIKNON

  • Examines predictivity of all individual genes in the dataset, all pairs of genes, all triples of genes, and so on;
  • It is infeasible when a signature has more than 2-3 genes;
  • Agnostic to what causes signature multiplicity.

In summary, no current algorithm provides a systematic and efficient approach for identification of the set of maximally predictive and non-redundant molecular signatures that exist in the underlying distribution.

Key definitions (1/2)
  • Definition of maximally predictive molecular signature: A maximally predictive molecular signature is a molecular signature that maximizes predictivity of the phenotype relative to all other signatures that can be constructed from the same dataset.
  • Definition of maximally predictive and non-redundant molecular signature: A maximally predictive and non-redundant molecular signature based on variables X is a maximally predictive signature such that any signature based on a proper subset of the variables in X is not maximally predictive.
Key definitions (2/2)
  • Definition of Markov blanket: A Markov blanket M of the response variable T ∈ V in the joint probability distribution P over variables V is a set of variables conditioned on which all other variables are independent of T, i.e., for every X ∈ V \ (M ∪ {T}): P(T | M, X) = P(T | M).
  • Definition of Markov boundary (or non-redundant Markov blanket): If M is a Markov blanket of T and no proper subset of M satisfies the definition of a Markov blanket of T, then M is called a Markov boundary (or non-redundant Markov blanket) of T.
Theoretical results
  • Variable sets that participate in the maximally predictive signatures of T are precisely the Markov blankets of T and vice-versa;
  • Similarly, variable sets that participate in the maximally predictive and non-redundant signatures of T are precisely the Markov boundaries of T and vice-versa;
  • If a joint probability distribution P over variables V satisfies the intersection property, then there exists a unique Markov boundary of T [Pearl, 1988].
A fundamental reduction used in this research for the analysis of signatures

[Scatter plot: cases (+) and controls (*) plotted by expression of gene X and gene Y, with candidate decision surfaces S1-S5; S3, S4, and S5 have maximal predictivity of the phenotype relative to their genes, while S1 and S2 have worse predictivity]

  • Since there is an infinite number of signatures with maximal predictivity, when I refer to a signature, I mean one of the predictively equivalent classifiers (e.g., S3 or S4 or S5);
  • Can study signature classes by reference only to their genes;
  • This reduction is justified whenever the classifiers used can learn the minimum error decision function given sufficient sample.
Example of Markov boundary multiplicity

[Figure: network structure and distributional constraints]

  • Many optimal signatures exist: e.g., {A, C} and {B, C} are maximally predictive and non-redundant signatures of T. Furthermore, {A, C} and {B, C} remain maximally predictive even in infinite samples;
  • The network has very low connectivity;
  • Genes in optimal signatures do not have to be deterministically related: e.g., A and B are not deterministically related, yet individually convey the same information about T;
  • If an algorithm selects only one optimal signature, there is a danger of missing biologically important causative genes;
  • The union of all optimal signatures includes all genes located in the local pathway around T;
  • In this example, the intersection of all optimal signatures contains only genes in the local pathway around T.

II. A novel algorithm to correctly identify the set of maximally predictive and non-redundant signatures
Trace of the TIE* algorithm

[Trace diagram:
  • Initial run: M = {A, B, F} is a Markov boundary;
  • G = {A} removed: Mnew = {C, B, F} is a Markov boundary;
  • G = {B} removed: Mnew = {A, D, E, F} is a Markov boundary;
  • G = {A, B} removed: Mnew = {C, D, E, F} is a Markov boundary;
  • G = {F} removed: Mnew = {A, B} is not a Markov boundary; do not consider any G that is a superset of {F}]

Theoretical results (1/2)
  • TIE* returns all and only Markov boundaries of T (i.e., maximally predictive and non-redundant signatures) if its input components X, Y, Z are admissible
  • IAMB is an admissible Markov boundary algorithm (input component X) under assumptions
    • IAMB correctly outputs a Markov boundary if only the composition property holds
  • HITON-PC is an admissible Markov boundary algorithm (input component X) under assumptions
    • HITON-PC correctly outputs a Markov boundary if the adjacency faithfulness assumption holds except for violations of the intersection axiom, global Markov condition holds, and there are no “spouses” in the Markov boundary
Theoretical results (2/2)
  • Stated three strategies (IncLex, IncMinAssoc, and IncMaxAssoc) to generate subsets of variables that have to be removed from V to identify new Markov boundaries of T, and proved their admissibility (input component Y);
  • Stated two criteria (Independence and Predictivity) to verify Markov boundaries and proved their admissibility (input component Z).
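To make the generative scheme concrete, here is an illustrative sketch of the TIE* control flow (my own simplified rendering, not the authors' implementation): component X is any admissible Markov boundary routine, component Y is approximated by lexicographic subset generation over the first boundary, and component Z is an abstract verification callback. All function interfaces here are assumptions for illustration.

```python
from itertools import combinations

def tie_star(data, target, markov_boundary, verify):
    """Simplified sketch of TIE*. Assumed (hypothetical) interfaces:
    markov_boundary(data, target, excluded) -> set of variables (component X,
    e.g., IAMB or HITON-PC run with `excluded` removed from the data);
    verify(candidate, reference) -> bool (component Z, e.g., the Independence
    or Predictivity criterion)."""
    reference = markov_boundary(data, target, excluded=frozenset())
    boundaries = [set(reference)]
    rejected = []  # excluded sets that destroyed Markov-boundary status

    genes = sorted(reference)
    # Component Y (simplified): consider subsets G of the first boundary in
    # increasing size; the full algorithm draws G from all found boundaries.
    for size in range(1, len(genes) + 1):
        for combo in combinations(genes, size):
            g = frozenset(combo)
            if any(r <= g for r in rejected):
                continue  # skip supersets of exclusions that already failed
            m_new = set(markov_boundary(data, target, excluded=g))
            if verify(m_new, reference):
                if m_new not in boundaries:
                    boundaries.append(m_new)  # another Markov boundary of T
            else:
                rejected.append(g)
    return boundaries
```

On the slide's trace, this loop would recover {A, B, F}, then {C, B, F} for G = {A}, {A, D, E, F} for G = {B}, {C, D, E, F} for G = {A, B}, and would prune all supersets of {F} once Mnew = {A, B} fails verification.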
III. Empirical evaluation of the novel algorithms and comparison with existing state-of-the-art methods
A. Experiments with artificial simulated data

Generative model is available, and the set of Markov boundaries (and thus the set of maximally predictive and non-redundant signatures) is known.

  • Generate samples of systematically varied sizes;
  • Compare to the gold standard;
  • Test whether the TIE* algorithm behaves according to theoretical expectations and study its empirical properties;
  • Obtain clues about behavior of TIE* and baseline comparison algorithms in experiments with real gene expression data.
Experiments with discrete networks TIED1 and TIED2
  • Two artificial discrete networks were created:
    • TIED1 consists of 30 variables (including a response variable T) and contains 72 Markov boundaries of T;
    • TIED2 consists of 1,000 variables (including a response variable T) and contains the same 72 Markov boundaries of T as TIED1.
Experiments
  • Goal: Compare TIE* to state-of-the-art algorithms (resampling-based methods, KIAMB, and Iterative Removal) and examine sensitivity of the tested methods to high dimensionality.
  • Findings:
    • TIE* correctly identifies the set of true Markov boundaries (maximally predictive and non-redundant signatures) in the datasets with 30 or 1,000 variables;
    • Iterative Removal identifies only 1 signature;
    • KIAMB fails to identify any true signature, and its output signatures have poor predictivity;
    • Resampling-based methods either miss true signatures and/or output many redundant variables in the signatures.
Experiments with linear continuous network LIND

LIND consists of 41 variables (including a response variable T) and contains 12 Markov boundaries of T.

Experiments
  • Goals:
    • Analyze behavior of TIE* as a function of sample size using data generated from a continuous network;
    • Compare the Independence and Predictivity criteria for verification of Markov boundaries in the TIE* algorithm.
  • Findings:
    • As sample size increases, the performance of both instantiations of TIE* generally improves and the algorithms discover the set of true Markov boundaries;
    • The α-level in the Predictivity criterion significantly affects the number of Markov boundaries output by the TIE* algorithm;
    • TIE* with the Predictivity criterion typically outputs a larger number of Markov boundaries and has, on average, superior performance compared to the Independence criterion.
Experiments with discrete network XORD

XORD consists of 41 variables (including a response variable T) and contains 25 Markov boundaries of T.

Experiments
  • Goal: Evaluate TIE* when popular Markov boundary algorithms such as IAMB and HITON-PC are not applicable due to violations of their fundamental assumptions.
  • Findings:
    • TIE* discovers the set of true Markov boundaries when the sample size is ≥ 2,000;
    • There is ~1 false positive variable in each discovered Markov boundary for large sample sizes.
B. Experiments with resimulated microarray gene expression data
  • Resimulated data by design closely resembles real human lung cancer microarray gene expression data;
  • Knowledge of the generative model allows generating arbitrarily large samples and studying the behavior of TIE* as a function of sample size;
  • Unlike prior experiments with artificial simulated datasets, the set of maximally predictive and non-redundant signatures is not known a priori.
Experiment

Goal: Examine whether the signature multiplicity phenomenon vanishes as the sample size grows.

Results: [results figure not included in the transcript]

Findings of other experiments
  • TIE* is not sensitive to the choice of the initial signature discovered by the algorithm;
  • Post-processing TIE* signatures with wrapping results in more signatures with a smaller number of genes;
  • Signatures output by tested non-TIE* methods are either redundant or have inferior predictivity compared to signatures output by TIE* techniques.
C. Experiments with real human microarray gene expression data
  • Independent-Dataset Experiments:Using pairs of microarray datasets either from different laboratories or different platforms;
  • Single-Dataset Experiments:Additional experiments with relatively large sample size microarray datasets;
  • The primary goal of both experiments is to compare TIE* and baseline algorithms for multiple signature extraction in terms of maximal predictivity of induced signatures and reproducibility in independent data.
  • Operational definition of “maximal predictivity”: the empirically best classification performance (AUC) achievable in each dataset over all tested methods.
TIE* signatures have maximal predictivity
  • TIE* achieves maximal predictivity in 5 out of 6 validation datasets;
  • Non-TIE* methods achieve maximal predictivity in 0 to 2 datasets depending on the method;
  • In the dataset where the predictivity of TIE* is statistically distinguishable from the empirically maximal one (Lung Cancer Subtype Classification), the magnitude of this difference is only 0.009 AUC on average over all discovered signatures.
TIE* signatures are reproducible, other signatures may be overfitted
  • TIE* has no overfitting on average over all signatures and datasets;
  • Other methods achieve predictivity in the validation data that is lower than in the discovery data (by 0.02-0.03 AUC), in addition to having inferior predictivity overall.
TIE* signatures in comparison with other signatures

[Scatter plot: predictivity results for the Leukemia 5 Yr. Prognosis task, plotting classification performance (AUC) in the discovery dataset against AUC in the validation dataset. Each dot corresponds to a signature (computational model) of the outcome, e.g., Outcome(x) = sign(w·x + b), where x, w ∈ ℝ^m, b ∈ ℝ, and m is the number of genes in the signature. Multiple signatures output by TIE* have maximal predictivity & low variance; multiple signatures output by other methods do not achieve maximal predictivity and have high variance]

Single-dataset experiments: Datasets
  • Validation dataset → subset of 100 samples/patients
  • Discovery dataset → all remaining samples/patients
  • Repeat splits into discovery & validation datasets 10 times to minimize variance
Single-dataset experiments: Summary results
  • Results are similar to the ones from independent-dataset experiments;
  • TIE* achieves maximal predictivity in 6 out of 7 validation datasets;
  • Non-TIE* methods achieve maximal predictivity in 0 to 1 datasets depending on the method;
  • In the dataset where TIE* has predictivity that is statistically distinguishable from the empirically maximal one (Breast Cancer Subtype Classification II), the magnitude of this difference is only <0.01 AUC on average over all discovered signatures.
Revisiting previously published hypotheses about signature multiplicity
  • Signature reproducibility neither precludes multiplicity nor requires sample sizes with thousands of subjects;
  • Multiplicity of signatures does not require dense connectivity;
  • Noisy measurements or normalization are not necessary conditions for signature multiplicity;
  • Multiplicity can be produced by a combination of small sample size-related variance and intrinsic multiplicity in the underlying network;
  • Multiple signatures output by TIE* are reproducible even though they are derived from small sample, noisy, and heavily-processed data.
A more complete picture is emerging regarding causes of multiplicity...
  • Intrinsic information redundancy in the underlying biological system;
  • Variability in the output of gene selection and classifier algorithms, especially in small sample sizes;
  • Small-sample statistical indistinguishability of signatures with different large-sample predictivity and/or redundancy characteristics;
  • Presence of hidden variables;
  • Correlated measurement noise;
  • RNA amplification techniques that systematically distort measurements of transcript ratios;
  • Cellular aggregation and sampling from mixtures of distributions that affect inference of conditional independence relations;
  • Normalization and other data pre-processing methods that artificially increase correlations among genes;
  • Engineered redundancy in the assay technology platforms.
Summary of results
  • Developed a Markov boundary characterization of molecular signature multiplicity;
  • Designed a generative algorithm that can correctly identify the set of maximally predictive and non-redundant molecular signatures in principle independently of data distribution;
  • Conducted an empirical evaluation of the novel algorithm and compared it to existing state-of-the-art methods using artificial simulated, resimulated microarray gene expression, and real human microarray gene expression data;
  • Tested and refined several hypotheses about the causes of molecular signature multiplicity phenomenon.
General conclusions
  • Molecular signatures play a crucial role in personalized medicine and translational bioinformatics.
  • Molecular signatures are being used to treat patients today, not in the future.
  • Development of accurate molecular signatures should rely on the use of supervised methods.
  • In general, there are many challenges for computational analysis of omics data for development of molecular signatures.
  • One of these challenges is molecular signature multiplicity.
  • There exists an algorithm that can extract the set of maximally predictive and non-redundant molecular signatures from high-throughput data.
Homework (Due next Monday)
  • Read the paper “Analysis and Computational Dissection of Molecular Signature Multiplicity”.
  • Describe a novel and interesting application area for the TIE* algorithm. Feel free to use an example from your research where there exist many molecular signatures of some response variable (1/2 page max).
  • Come up with another cause of molecular signature multiplicity that was not mentioned in the paper (1/2 page max).

Email your work to Alexander.Statnikov@med.nyu.edu

Computational Causal Discovery Laboratory at NYU Center for Health Informatics and Bioinformatics (CHIBI)
  • The purpose of our lab is to develop, test and apply computational causal discovery methods suitable for molecular, clinical, imaging and multi-modal data of high-dimensionality.
  • We are interested in methods to address the following questions:
    • What is causing disease/phenotype?
    • What are the effects of disease/phenotype?
    • What are involved biological pathways?
    • How to design drugs/treatments?
    • How does genotype cause differences in response to treatment?
    • How does the environment modify or even supersede the normal causal function of genes and other molecular variables?
    • How are genes and proteins organized in complex causal regulatory networks?
  • Questions? Email to Alexander.Statnikov@med.nyu.edu