
Systems Approaches to Disease Stratification

Nathan Price

Introduction to Systems Biology Short Course

August 20, 2012



Goals and Motivation

  • Currently most diagnoses based on symptoms and visual features (pathology, histology)

  • However, many diseases appear deceptively similar, but are, in fact, distinct entities from the molecular perspective

  • Drive towards personalized medicine



Outline

  • Molecular signature classifiers: main issues

    • Signal to noise

    • Small sample size issues

    • Error estimation techniques

    • Phenotypes and sample heterogeneity

    • Example study

  • Advanced topics

    • Network-based classification

    • Importance of broad disease context



Molecular signature classifiers

Overall strategy



Molecular signatures for diagnosis

  • The goals of molecular classification of tumors:

    • Identify subpopulations of cancer

    • Inform choice of therapy

  • Generally, a set of microarray experiments is used with

    • ~100 patient samples

    • ~10^4 transcripts (genes)

  • This very small number of samples relative to the number of transcripts is a key issue

    • Feature selection & model selection

    • Small sample size issues dominate

    • Error estimation techniques

  • Also, the microarray platform used can have a significant effect on results



Randomness

  • Expression values have randomness arising from both biological and experimental variability.

  • Design, performance evaluation, and application of classifiers must take this randomness into account.



Three critical issues arise…

  • Given a set of variables, how does one design a classifier from the sample data that provides good classification over the general population?

  • How does one estimate the error of a designed classifier when data is limited?

  • Given a large set of potential variables, such as the large number of expression levels provided by each microarray, how does one select a set of variables as the input vector to the classifier?



Small sample issues

  • Our task is to predict future events

    • Thus, we must avoid overfitting

    • It is easy (if the model is complicated enough) to fit the data we have

    • Model simplicity is vital when data are sparse and the space of possible relationships is large

      • This is exactly the case in virtually all microarray studies, including ours

  • In the clinic

    • In the end, we want a test that can easily be implemented and actually benefit patients



Error estimation and variable selection

  • An error estimator may be unbiased but have a large variance, and therefore the estimates it produces will often be low.

  • This can produce a large number of gene sets and classifiers with low error estimates.

  • For a small sample, one can end up with thousands of gene sets for which the error estimate from the sample data is near zero!



Overfitting

  • Complex decision boundary may be unsupported by the data relative to the feature-label distribution.

  • Relative to the sample data, a classifier may have small error; but relative to the feature-label distribution, the error may be severe!

  • Classification rule should not cut up the space in a manner too complex for the amount of sample data available.


Overfitting: example of KNN rule

[Figure: k-NN (k = 3) decision regions for training sets of N = 30 and N = 90 points, with a test sample shown]
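The effect in the figure can be reproduced with a small sketch (pure Python on 1-D toy data; the data generator and names are illustrative, not from the lecture): a 1-nearest-neighbor rule memorizes every training label, noise included, so its training accuracy is perfect even though its behavior on fresh samples is typically worse than that of a smoother rule.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points (1-D)."""
    near = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in near).most_common(1)[0][0]

random.seed(0)

def sample(n):
    """Toy data: class is the sign of x, with 20% of labels flipped (noise)."""
    pts = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        label = int(x > 0)
        if random.random() < 0.2:
            label = 1 - label
        pts.append((x, label))
    return pts

train = sample(30)
test = sample(200)

# k = 1 reproduces every training label, noise included: zero training error.
train_acc_k1 = sum(knn_predict(train, x, 1) == y for x, y in train) / len(train)
# On fresh data the memorized noise typically hurts k = 1 relative to larger k.
test_acc = {k: sum(knn_predict(train, x, k) == y for x, y in test) / len(test)
            for k in (1, 3, 7)}
print(train_acc_k1, test_acc)  # training accuracy for k = 1 is exactly 1.0
```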


Example: How to identify appropriate models (regression, but the issues are the same)

[Figure: data generated as y = f(x) + noise; the task is to learn f from the data]


Linear…

Quadratic…

Piecewise linear interpolation…

Which one is best?


Cross-validation

  • Simple: just choose the classifier with the best cross-validation error

  • But… (there is always a but)

    • we are training on even less data, so the classifier design is worse

    • if sample size is small, test set is small and error estimator has high variance

    • so we may be fooling ourselves into thinking we have a good classifier…
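A sketch of the selection procedure just described (toy 1-D data; the k-NN rule and the 5-fold split are illustrative choices): each candidate model is scored by its average error on held-out folds, and the one with the best cross-validation error is chosen, with the caveats above about variance on small samples.

```python
import random
from collections import Counter

def knn(train, x, k):
    """Majority vote among the k nearest training points (1-D toy data)."""
    near = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in near).most_common(1)[0][0]

def cv_error(data, k, folds=5):
    """Average misclassification rate of k-NN across a 5-fold split."""
    total = 0.0
    for f in range(folds):
        test = data[f::folds]                               # every folds-th point
        train = [p for i, p in enumerate(data) if i % folds != f]
        total += sum(knn(train, x, k) != y for x, y in test) / len(test)
    return total / folds

random.seed(2)

def noisy_point():
    """Class is the sign of x, with 15% of labels flipped."""
    x = random.uniform(-1, 1)
    label = int(x > 0)
    return (x, label if random.random() > 0.15 else 1 - label)

data = [noisy_point() for _ in range(60)]
errors = {k: cv_error(data, k) for k in (1, 3, 7, 15)}
best_k = min(errors, key=errors.get)   # the classifier with the best CV error
print(errors, best_k)
```

With only 60 samples the fold estimates are themselves noisy, which is exactly the high-variance caveat in the bullets above.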



LOOCV (leave-one-out cross validation)


[Figure: the three fits compared by mean square error (0.96, 2.12, 3.33), with the best-scoring fit highlighted]
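The idea behind the figure can be sketched in pure Python (toy data from an assumed noisy line; only a least-squares line and a piecewise-linear interpolant are implemented here): LOOCV refits with each point held out, so the interpolant's zero training error no longer guarantees a low error estimate.

```python
import random

def fit_line(pts):
    """Least-squares line through the points."""
    n = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda x: slope * x + intercept

def interpolate(pts):
    """Piecewise-linear interpolant: essentially zero error on training points."""
    pts = sorted(pts)
    def f(x):
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        return pts[0][1] if x < pts[0][0] else pts[-1][1]  # clamp outside range
    return f

def loocv_mse(pts, fit):
    """Refit with each point held out; average squared error on held-out points."""
    errs = [(fit(pts[:i] + pts[i + 1:])(x) - y) ** 2
            for i, (x, y) in enumerate(pts)]
    return sum(errs) / len(errs)

random.seed(1)
# Noisy line y = 2x + noise: the interpolant fits the sample (almost) perfectly,
# but LOOCV scores each model on points it has not seen.
data = [(i / 10, 2 * (i / 10) + random.gauss(0, 0.3)) for i in range(12)]
resub = sum((interpolate(data)(x) - y) ** 2 for x, y in data) / len(data)
line_cv, interp_cv = loocv_mse(data, fit_line), loocv_mse(data, interpolate)
print(resub, line_cv, interp_cv)
```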



Estimating Error on Future Cases

[Figure: the data set is shuffled repeatedly into training and test sets, with no information passing between them. Average performance on the test sets estimates behavior on future cases, which can differ greatly from behavior on the training set.]

  • Methodology

  • Best case: have an independent test set

  • Resampling techniques

    • Use cross validation to estimate accuracy on future cases

    • Feature selection and model selection must be inside the loop to avoid overly optimistic estimates
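A minimal sketch of why selection must sit inside the loop, using pure-noise toy data (the single-feature threshold rule and all names are illustrative): selecting the feature on the full data set leaks test information into the classifier, while selecting inside each fold does not.

```python
import random
from statistics import mean

random.seed(0)
# Pure-noise data: 40 samples, 200 candidate "genes", labels independent of X.
# Any apparent signal is an artifact of feature selection.
n, p = 40, 200
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

def best_feature(rows, labels):
    """Feature whose class means differ most (a crude filter criterion)."""
    def gap(j):
        a = [r[j] for r, l in zip(rows, labels) if l == 1]
        b = [r[j] for r, l in zip(rows, labels) if l == 0]
        return abs(mean(a) - mean(b))
    return max(range(p), key=gap)

def classify(rows, labels, j, sample):
    """Assign the class whose mean on feature j is closer to the sample."""
    a = mean(r[j] for r, l in zip(rows, labels) if l == 1)
    b = mean(r[j] for r, l in zip(rows, labels) if l == 0)
    return int(abs(sample[j] - a) < abs(sample[j] - b))

def loocv_accuracy(select_inside):
    j_leaky = best_feature(X, y)   # selected on ALL data: test info leaks in
    hits = 0
    for i in range(n):
        Xtr, ytr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        j = best_feature(Xtr, ytr) if select_inside else j_leaky
        hits += classify(Xtr, ytr, j, X[i]) == y[i]
    return hits / n

acc_leaky = loocv_accuracy(select_inside=False)
acc_honest = loocv_accuracy(select_inside=True)
print(acc_leaky, acc_honest)
```

Since the labels carry no real signal, the honest estimate should hover near chance; the leaky one tends to look deceptively good.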



Classification methods

  • k-nearest neighbor

  • Support vector machine (SVM)

  • Linear and quadratic classifiers

  • Perceptrons, neural networks

  • Decision trees

  • k-Top Scoring Pairs

  • Many others



Molecular signature classifiers

Example Study



Diagnosing similar cancers with different treatments

[Figure: a patient presents; is it a GIST patient or an LMS patient?]

  • Challenge in medicine: diagnosis, treatment, and prevention of disease suffer from a lack of knowledge

  • Gastrointestinal Stromal Tumor (GIST) and Leiomyosarcoma (LMS)

    • morphologically similar, hard to distinguish using current methods

    • different treatments, correct diagnosis is critical

    • studying genome-wide patterns of expression aids clinical diagnosis

  • Goal: Identify molecular signature that will accurately differentiate these two cancers


Relative Expression Reversal Classifiers

Geman, D. et al., Stat. Appl. Genet. Mol. Biol., 3: Article 19, 2004

Tan et al., Bioinformatics, 21:3896-904, 2005

  • Find a classification rule as follows:

    • IF gene A > gene B THEN class 1, ELSE class 2

  • The classifier is chosen by finding the most accurate and robust rule of this type among all possible gene pairs in the dataset

  • If needed, a set of classifiers of the above form can be used, with final classification resulting from a majority vote (k-TSP)
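A minimal sketch of the rule search described above (toy expression vectors; exhaustive pair search with a simple score, without the robustness tie-breaking of the full TSP method): for each gene pair, score how differently the ordering "gene A > gene B" behaves in the two classes, and keep the top-scoring pair.

```python
from itertools import combinations

def train_tsp(samples, labels):
    """Top-scoring pair: the gene pair (a, b) maximizing
    |P(a > b | class 1) - P(a > b | class 2)|, oriented so that
    observing gene a > gene b votes for the class where that is more common."""
    classes = sorted(set(labels))
    best, best_score = None, -1.0
    for a, b in combinations(range(len(samples[0])), 2):
        probs = []
        for c in classes:
            rows = [s for s, l in zip(samples, labels) if l == c]
            probs.append(sum(s[a] > s[b] for s in rows) / len(rows))
        score = abs(probs[0] - probs[1])
        if score > best_score:
            best_score = score
            hi, lo = (classes if probs[0] > probs[1] else classes[::-1])
            best = (a, b, hi, lo)
    return best

def predict(rule, sample):
    """IF gene a > gene b THEN class hi, ELSE class lo."""
    a, b, hi, lo = rule
    return hi if sample[a] > sample[b] else lo

# Toy data: gene 0 exceeds gene 1 in every "GIST" sample and never in "LMS".
X = [(5, 1, 3), (4, 2, 9), (6, 3, 1), (1, 5, 2), (2, 6, 8), (0, 4, 7)]
y = ["GIST", "GIST", "GIST", "LMS", "LMS", "LMS"]
rule = train_tsp(X, y)
print(rule, predict(rule, (7, 2, 0)))  # rule picks genes (0, 1); sample -> "GIST"
```

A k-TSP extension would keep the k best pairs and take a majority vote over their individual IF/ELSE decisions.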



Rationale for k-TSP

  • Based on the concept of relative expression reversals

  • Advantages

    • Does not require data normalization

    • Does not require population-wide cutoffs or weighting functions

    • Has reported accuracies in the literature comparable to SVMs, PAM, and other state-of-the-art classification methods

    • Results in classifiers that are easy to implement

    • Designed to avoid overfitting

      • n = number of genes, m = number of samples

      • For the example I will show, this equation yields:

      • 10^9 << 10^20


Diagnostic Marker Pair

Price, N.D. et al., PNAS 104:3414-9, 2007

[Figure: scatter plot of OBSCN expression vs. C9orf65 expression on log-log axes; a diagonal decision boundary separates samples classified as GIST from samples classified as LMS (X = clinicopathological GIST, O = LMS)]

Accuracy on data = 99%; predicted accuracy on future data (LOOCV) = 98%


RT-PCR Classification Results

Price, N.D. et al., PNAS 104:3414-9, 2007

[Figure: RT-PCR measurements separating the LMS and GIST samples]

  • 100% Accuracy

    • 19 independent samples

    • 20 samples from microarray study

      • including a previously indeterminate case



Comparative biomarker accuracies

Price, N.D. et al., PNAS 104:3414-9, 2007

[Figure: C-kit gene expression alone vs. the 2-gene relative expression classifier (GIST = X, LMS = O)]



Price, N.D. et al, PNAS 104:3414-9 (2007)

Kit Protein Staining of GIST-LMS

Blue arrows: GIST; red arrows: LMS

Accuracy as a classifier ~ 87%.

  • Top row: GIST-positive staining

  • Bottom row: GIST-negative staining



A few general lessons

  • Choosing markers based on relative expression reversals of gene pairs has proven to be very robust with high predictive accuracy in sets we have tested so far

    • Simple and independent of normalization

  • Ultimately an easy-to-implement clinical test

    • All that’s needed is RT-PCR on two genes

  • Advantages of this approach may be even more applicable to proteins in the blood

    • Each decision rule requires measuring only the relative concentration of two proteins



Network-based classification


Chuang, Lee, Liu, Lee, Ideker, Molecular Systems Biology, 3:140, 2007

Network-based classification

  • Can modify feature selection methods based on networks

  • Can improve performance (not always)

  • Generally improves biological insight by integrating heterogeneous data

  • Shown to improve prediction of breast cancer metastasis (complex phenotype)



Rationale: Differential Rank Analysis (DIRAC)

Price, N.D. et al., PNAS 104:3414-9, 2007

  • Networks or pathways inform best targets for therapies

    • Cancer is a multi-genic disease

  • Analyze high-throughput data to identify aspects of the genome-scale network that are most affected

  • Initial version uses a priori defined gene sets

    • BioCarta, KEGG, GO, etc.

  • Differential rank conservation (DIRAC) for studying

    • Expression rank conservation for pathways within a phenotype

    • Pathways that discriminate well between phenotypes

Eddy, J.A. et al, PLoS Computational Biology (2010)
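A sketch of the rank-conservation idea just outlined (toy samples as gene-to-expression dicts; the majority-ordering template and matching score follow the description above, simplified from the published DIRAC definitions): a phenotype's template records, for each gene pair, which gene is higher in most samples, and conservation is how consistently the samples themselves match that template.

```python
from itertools import combinations

def rank_template(samples, genes):
    """Majority ordering for each gene pair within one phenotype's samples."""
    return {(a, b): sum(s[a] > s[b] for s in samples) > len(samples) / 2
            for a, b in combinations(genes, 2)}

def rank_matching(sample, template):
    """Fraction of pairwise orderings in a sample that agree with the template."""
    hits = sum((sample[a] > sample[b]) == expected
               for (a, b), expected in template.items())
    return hits / len(template)

def rank_conservation(samples, genes):
    """Mean matching score of a phenotype's samples against its own template:
    near 1 for a tightly regulated pathway, lower for a shuffled one."""
    template = rank_template(samples, genes)
    return sum(rank_matching(s, template) for s in samples) / len(samples)

genes = [0, 1, 2]
# Tightly regulated pathway: the ordering g0 < g1 < g2 holds in every sample.
tight = [{0: 1, 1: 2, 2: 3}, {0: 2, 1: 4, 2: 9}, {0: 1, 1: 5, 2: 6}]
# Weakly regulated pathway: the ordering shuffles from sample to sample.
loose = [{0: 1, 1: 2, 2: 3}, {0: 3, 1: 2, 2: 1}, {0: 2, 1: 3, 2: 1}]
print(rank_conservation(tight, genes), rank_conservation(loose, genes))
```

Comparing one phenotype's samples against another phenotype's template, pathway by pathway, gives the discrimination side of the method.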


Differential Rank Conservation

[Figure: gene expression rankings (g1 to g8) compared across pathways within a phenotype and across phenotypes (GIST vs. LMS) for a single pathway. A tightly regulated pathway shows the highest rank conservation; a weakly regulated pathway, whose ranking shuffles between phenotypes, shows the lowest.]



Visualizing global network rank conservation



Average rank conservation across all 248 networks: 0.903



Global regulation of networks across phenotypes

[Figure: networks ordered from highest to lowest rank conservation]

Eddy et al, PLoS Computational Biology, (2010)


Tighter network regulation: normal prostate

Looser network regulation: primary prostate cancer

Loosest network regulation: metastatic prostate cancer

Eddy et al, PLoS Computational Biology, (2010)





Differential rank conservation of the MAPK network



DIRAC classification is comparable to other methods

Cross-validation accuracies in prostate cancer



Eddy et al, PLoS Computational Biology, (2010)

Differential Rank Conservation (DIRAC): Key Features

  • Independent of data normalization

  • Independent of genes/proteins outside of network

  • Can show massive/complete perturbations

    • Unlike Fisher's exact test (e.g., GO enrichment)

  • Measures the “shuffling” of the network in terms of the hierarchy of expression of the components

    • Distinct from enrichment or GSEA

  • Provides a distinct mathematical classifier, yielding a measurement of predictive accuracy on test data

    • Stronger than p-value for determining signal

  • Code for the method can be found at our website:

    http://price.systemsbiology.net



Global Analysis of Human Disease

Importance of broad context to disease diagnosis



The envisioned future of blood diagnostics



Next generation molecular disease-screening



Why global disease analyses are essential

  • Organ-specificity: separating signal from noise

  • Hierarchy of classification

    • Context-independent classifiers

      • Based on organ-specific markers

    • Context-dependent classifiers

      • Based on excellent markers once organ specificity is defined

  • Provide context for how disease classifiers should be defined

  • Provide broad perspective on how separable diseases are and whether current diagnostic categories are appropriate



GLOBAL ANALYSIS OF DISEASE-PERTURBED TRANSCRIPTOMES IN THE HUMAN BRAIN

Example case study



Multidimensional scaling plot of brain disease data



Identification of Structured Signatures And Classifiers (ISSAC)

  • At each node in the decision tree, a test sample is either allowed to pass down the tree for further classification or is rejected (i.e., 'does not belong to this class') and cannot pass further
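The pass-or-reject traversal can be sketched as follows (the tree, tests, and field names here are hypothetical stand-ins; real ISSAC signatures are learned from expression data): each node applies a membership test, and a sample that fails is rejected at that point rather than forced into a leaf class.

```python
def classify(sample, node):
    """Walk the tree: a sample must pass each node's membership test to
    descend; failing the test means rejection at that node."""
    if not node["test"](sample):
        return None                  # rejected: does not belong to this class
    label = node["name"]
    for child in node.get("children", []):
        deeper = classify(sample, child)
        if deeper is not None:
            return deeper            # accepted further down the tree
    return label                     # deepest class that accepted the sample

# Hypothetical two-level tree: a brain-tissue gate, then tumor vs. normal.
tree = {
    "name": "brain", "test": lambda s: s["organ"] == "brain",
    "children": [
        {"name": "tumor", "test": lambda s: s["marker"] > 5},
        {"name": "normal", "test": lambda s: s["marker"] <= 5},
    ],
}
print(classify({"organ": "brain", "marker": 8}, tree))  # "tumor"
print(classify({"organ": "liver", "marker": 8}, tree))  # None (rejected at root)
```

The rejection option is what distinguishes this hierarchy from an ordinary decision tree, which must assign every sample to some leaf.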



Accuracy on randomly split test sets

[Figure: bar chart of classification accuracy (%) for GBM, MDL, MNG, NB, OLG, PRK, AI, ALZ, and normal/control; per-class accuracies range from 81.8% to 100%]

  • Average accuracy of all class samples: 93.9%



The challenge of ‘Lab Effects’

Sample heterogeneity issues in personalized medicine



Independent hold-out trials for 18 GSE datasets

[Figure: hold-out classification accuracy (0 to 100%) for each dataset: GBM (GSE4412), Normal (GSE3526), EPN (GSE16155), EPN (GSE21687), GBM (GSE9171), MDL (GSE12992), PA (GSE5675), MNG (GSE4780), MNG (GSE9438), GBM (GSE4271), GBM (GSE8692), GBM (GSE4290), MDL (GSE10327), Normal (GSE7307), MNG (GSE16581), PA (GSE12907), OLG (GSE4412), OLG (GSE4290)]



Leave-batch-out validation shows impact of other batch effects



Take home messages

  • There is tremendous promise in high-throughput approaches to identify biomarkers

    • Significant challenges remain to their broad success

  • Integrative systems approaches that link data together broadly are essential

  • If training set is representative of population, there are robust signals in the data and excellent accuracy is possible

  • Forward-looking study designs and close partnership with clinicians are essential, as is standardization of data collection and analysis



Summary

  • Molecular signature classifiers provide a promising avenue for disease stratification

  • Machine-learning approaches are key

    • Goal is optimal prediction of future data

    • Must avoid overfitting

  • Model complexity

    • Feature selection & model selection

  • Technical challenges

    • Measurement platforms

  • Network-based classification

  • Global disease context is key

  • Lab and batch effects critical to overcome

  • Sampling of heterogeneity for some diseases is now sufficient to achieve stability in classification accuracies



Acknowledgments

Nathan D. Price Research Laboratory
Institute for Systems Biology, Seattle, WA | University of Illinois, Urbana-Champaign, IL

Collaborators

Don Geman (Johns Hopkins)

Wei Zhang (MD Anderson)

Price Lab Members

Seth Ament, PhD

Daniel Baker

Matthew Benedict

Julie Bletz, PhD

Victor Cassen

Sriram Chandrasekaran

Nicholas Chia, PhD (now Asst. Prof. at Mayo Clinic)

John Earls

James Eddy

Cory Funk, PhD

Pan Jun Kim, PhD (now Asst. Prof. at POSTECH)

Alexey Kolodkin, PhD

Charu Gupta Kumar, PhD

Ramkumar Hariharan, PhD

Ben Heavner, PhD

Piyush Labhsetwar

Andrew Magis

Caroline Milne

Shuyi Ma

Beth Papanek

Matthew Richards

Areejit Samal, PhD

Vineet Sangar, PhD

Bozenza Sawicka

Evangelos Simeonidis

Jaeyun Sung

Chunjing Wang

  • Funding

  • NIH / National Cancer Institute - Howard Temin Pathway to Independence Award

  • NSF CAREER

  • Department of Energy

  • Energy Biosciences Institute (BP)

  • Department of Defense (TATRC)

  • Luxembourg-ISB Systems Medicine Program

  • Roy J. Carver Charitable Trust Young Investigator Award

  • Camille Dreyfus Teacher-Scholar Award

