
Classification of Microarray Data - Recent Statistical Approaches


Classification of Microarray Data - Recent Statistical Approaches. Geoff McLachlan and Liat Ben-Tovim Jones Department of Mathematics & Institute for Molecular Bioscience University of Queensland Tutorial for the APBC January 2005.


Classification of Microarray Data- Recent Statistical Approaches

Geoff McLachlan and Liat Ben-Tovim Jones

Department of Mathematics & Institute for Molecular Bioscience

University of Queensland

Tutorial for the APBC January 2005

Outline of Tutorial

- Introduction to Microarray Technology

- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples

- Cluster Analysis: Clustering Genes and Clustering Tissues

- Supervised Classification of Tissue Samples

- Linking Microarray Data with Survival Analysis

Outline of Tutorial

- Introduction to Microarray Technology

- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples

BREAK

- Cluster Analysis: Clustering Genes and Clustering Tissues

- Supervised Classification of Tissue Samples

- Linking Microarray Data with Survival Analysis

Outline of Tutorial

- Introduction to Microarray Technology

- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples

- Cluster Analysis: Clustering Genes and Clustering Tissues

- Supervised Classification of Tissue Samples

- Linking Microarray Data with Survival Analysis

“Large-scale gene expression studies are not a passing fashion, but are instead one aspect of new work of biological experimentation, one involving large-scale, high throughput assays.”

Speed et al., 2002, Statistical Analysis of Gene Expression Microarray Data, Chapman and Hall/ CRC

Growth of microarray and microarray methodology literature listed in PubMed from 1995 to 2003.

The category ‘all microarray papers’ includes those found by searching PubMed for microarray* OR ‘gene expression profiling’. The category ‘statistical microarray papers’ includes those found by searching PubMed for ‘statistical method*’ OR ‘statistical techniq*’ OR ‘statistical approach*’ AND microarray* OR ‘gene expression profiling’.

A microarray is a new technology which allows the measurement of the expression levels of thousands of genes simultaneously.

- (1) sequencing of the genome (human, mouse, and others)
- (2) improvement in technology to generate high-density arrays on chips (glass slides or nylon membranes)

The entire genome of an organism can be probed at a single point in time.

(1) mRNA Levels Indirectly Measure Gene Activity

Every cell contains the same DNA.

The activity of a gene (expression) can be determined by the presence of its complementary mRNA.

Cells differ in the DNA (gene) which is active at any one time.

Genes code for proteins through the intermediary of mRNA.

Target and Probe DNA

Probe DNA

- known

Sample (target)

- unknown

(2) Microarrays Indirectly Measure Levels of mRNA

- mRNA is extracted from the cell

- mRNA is reverse transcribed to cDNA (mRNA itself is unstable)

- cDNA is labeled with fluorescent dye TARGET

- The sample is hybridized to known DNA sequences on the array (tens of thousands of genes) PROBE

- If present, complementary target binds to probe DNA (complementary base pairing)

- Target bound to probe DNA fluoresces

Spotted cDNA Microarray

Compare the gene expression levels for two cell populations on a single microarray.

Microarray Image

Red: High expression in target labelled with cyanine 5 dye

Green: High expression in target labelled with cyanine 3 dye

Yellow: Similar expression in both target samples

Assumptions:

(1) Gene Expression → mRNA: cellular mRNA levels directly reflect gene expression.

(2) mRNA → Fluorescence Intensity: the intensity of bound target is a measure of the abundance of the mRNA in the sample.

Experimental Error

Sample contamination

Poor quality/insufficient mRNA

Reverse transcription bias

Fluorescent labeling bias

Hybridization bias

Cross-linking of DNA (double strands)

Poor probe design (cross-hybridization)

Defective chips (scratches, degradation)

Background from non-specific hybridization

The Microarray Technologies

Spotted Microarray:

- cDNAs, clones, or short and long oligonucleotides deposited onto glass slides
- Each gene (or EST) represented by its purified PCR product
- Simultaneous analysis of two samples (treated vs untreated cells) provides an internal control
- Measures relative gene expressions

Affymetrix GeneChip:

- Short oligonucleotides synthesized in situ onto glass wafers
- Each gene represented multiply, using 16-20 (preferably non-overlapping) 25-mers
- Each oligonucleotide has a single-base mismatch partner for internal control of hybridization specificity
- Measures absolute gene expressions

Each with its own advantages and disadvantages.

Pros and Cons of the Technologies

Spotted Microarray:

- Flexible and cheaper
- Allows study of genes not yet sequenced (spotted ESTs can be used to discover new genes and their functions)
- Variability in spot quality from slide to slide
- Provides information only on relative gene expressions between cells or tissue samples

Affymetrix GeneChip:

- More expensive yet less flexible
- Good for whole-genome expression analysis where the genome of that organism has been sequenced
- High quality with little variability between slides
- Gives a measure of absolute expression of genes

Aims of a Microarray Experiment

- To observe changes in a gene in response to external stimuli (cell samples exposed to hormones, drugs, toxins)
- To compare gene expressions between different tissue types (tumour vs normal cell samples)
- To gain understanding of the function of unknown genes and of the disease process at the molecular level
- Ultimately, to use microarrays as tools in clinical medicine for diagnosis, prognosis and therapeutic management

Importance of Experimental Design

- Good DNA microarray experiments should have clear objectives; they should not be performed as “aimless data mining in search of unanticipated patterns that will provide answers to unasked questions” (Richard Simon, BioTechniques 34:S16-S21, 2003).

Replicates

Technical replicates: arrays that have been hybridized to the same biological source (using the same treatment, protocols, etc.).

Biological replicates: arrays that have been hybridized to different biological sources, but with the same preparation, treatments, etc.

Extracting Data from the Microarray

- Cleaning
  - Image processing
  - Filtering
  - Missing value estimation
- Normalization
  - Remove sources of systematic variation


Gene Expressions from Measured Intensities

Spotted Microarray:

log2(Intensity Cy5 / Intensity Cy3)

Affymetrix:

(Perfect Match Intensity – Mismatch Intensity)

Data Transformation

Rocke and Durbin (2001), Munson (2001), Durbin et al. (2002),

and Huber et al. (2002)

Representation of Data from M Microarray Experiments

[Diagram: the N x M matrix of gene expressions, with columns Sample 1, ..., Sample M and rows Gene 1, ..., Gene N. A column is an expression signature; a row is an expression profile. Gene expressions can be shown as heat maps.]

Assume we have extracted gene expression values from the intensities.

Microarrays present new problems for statistics because the data is very high dimensional with very little replication.

Gene Expression Data Represented as an N x M Matrix

[Diagram: the N x M matrix, with columns Sample 1, ..., Sample M and rows Gene 1, ..., Gene N; a column is an expression signature, a row is an expression profile.]

The N rows correspond to the N genes; the M columns correspond to the M samples (microarray experiments).

Microarray Data Notation

Represent the N x M matrix A:

A = (y1, ... , yM): Classifying Tissues on Gene Expressions

The feature vector yj contains the expression levels of the N genes in the jth experiment (j = 1, ... , M); yj is the expression signature.

AT = (y1, ... , yN): Classifying Genes on the Basis of the Tissues

The feature vector yj contains the expression levels over the M tissues for the jth gene (j = 1, ... , N); yj is the expression profile.

In the N x M matrix A:

N = No. of genes (10^3 - 10^4)

M = No. of tissues (10 - 10^2)

Classification of Tissues on Gene Expressions:

Standard statistical methodology is appropriate when M >> N, BUT here N >> M.

Classification of Genes on the Basis of the Tissues:

Falls in the standard framework, BUT not all the genes are independently distributed.

Mehta et al. (Nature Genetics, Sept. 2004):

“The field of expression data analysis is particularly active with novel analysis strategies and tools being published weekly”, and the value of many of these methods is questionable. Some results produced by using these methods are so anomalous that a breed of ‘forensic’ statisticians (Ambroise and McLachlan, 2002; Baggerly et al., 2003) who doggedly detect and correct other HDB (high-dimensional biology) investigators’ prominent mistakes, has been created.

Outline of Tutorial

- Introduction to Microarray Technology

- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples

- Cluster Analysis: Clustering Genes and Clustering Tissues

- Supervised Classification of Tissue Samples

- Linking Microarray Data with Survival Analysis

[Diagram: the N x M expression matrix, with rows Gene 1, ..., Gene N and columns Sample 1, ..., Sample M.]

[Diagram: the same N x M expression matrix, with the M samples now partitioned into Class 1 and Class 2.]

Fold Change is the Simplest Method

Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen.

Fold change is not a statistical test.
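A minimal sketch of the fold-change rule (the gene names and expression means are invented for illustration):

```python
import math

# Mean expression (on the raw scale) in each class, per gene (made-up values):
class1_means = {"gene_a": 8.0, "gene_b": 3.0, "gene_c": 5.0}
class2_means = {"gene_a": 2.0, "gene_b": 2.9, "gene_c": 1.0}

CUTOFF = 1.0  # |log2 ratio| > 1, i.e. more than a two-fold difference

flagged = [
    g for g in class1_means
    if abs(math.log2(class1_means[g] / class2_means[g])) > CUTOFF
]
print(sorted(flagged))  # ['gene_a', 'gene_c']
```

Note that the cutoff is arbitrary and no notion of variability enters, which is exactly why fold change is not a statistical test.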

Multiplicity Problem

Perform a test for each gene to determine the statistical significance of differential expression for that gene.

Problem: when many hypotheses are tested, the probability of a type I error (false positive) increases sharply with the number of hypotheses.

This is further complicated by gene co-regulation and the consequent correlation between the test statistics.

Example:

Suppose we measure the expression of 10,000 genes in a microarray experiment. If none of the 10,000 genes were differentially expressed, then we would expect, for:

P = 0.05 for each test, 500 false positives.

P = 0.05/10,000 for each test, 0.05 false positives.

Methods for Dealing with the Multiplicity Problem

- The Bonferroni Method controls the family-wise error rate (FWER), the probability that at least one false positive error will be made. This method is very conservative, as it tries to make it unlikely that even one false rejection is made.

- The False Discovery Rate (FDR) emphasizes the proportion of false positives among the identified differentially expressed genes.

Test of a Single Hypothesis

The M tissue samples are classified with respect to g classes on the basis of the N gene expressions. Assume that there are ni tissue samples from each class Ci (i = 1, ..., g), where M = n1 + ... + ng.

Take the simple case where g = 2. The aim is to detect whether some of the thousands of genes have different expression levels in class C1 than in class C2.

Test of a Single Hypothesis (cont.)

For gene j (j = 1, ..., N), let Hj denote the null hypothesis of no association between its expression level and membership of the class:

Hj = 0: the null hypothesis for the jth gene holds.

Hj = 1: the null hypothesis for the jth gene does not hold.

Two-Sample t-Statistic

Student's t-statistic for gene j (the formula was a graphic in the original slides; the standard form is):

tj = (ybar1j - ybar2j) / sqrt(s1j^2/n1 + s2j^2/n2),

where ybar_ij and s_ij^2 denote the sample mean and variance of the expression levels of gene j in class Ci.

Two-Sample t-Statistic

Pooled form of the Student's t-statistic, with an assumed common variance in the two classes:

tj = (ybar1j - ybar2j) / (sj sqrt(1/n1 + 1/n2)),

where sj^2 is the pooled within-class sample variance for gene j.

Two-Sample t-Statistic

Modified t-statistic of Tusher et al. (2001), which adds a small constant a0 to the denominator so that genes with very small variance do not dominate:

tj = (ybar1j - ybar2j) / (a0 + sj sqrt(1/n1 + 1/n2)).
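A minimal sketch of these three statistics (the toy expression values are made up; with equal class sizes the Welch-type and pooled forms coincide):

```python
import math

def two_sample_t(y1, y2, a0=0.0, pooled=False):
    """Sketch of the two-sample t-statistics discussed above.

    y1, y2: expression values of one gene in classes C1 and C2.
    pooled=False -> Welch-type statistic; pooled=True -> pooled variance;
    a0 > 0 gives the SAM-style modified statistic (Tusher et al., 2001).
    """
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    v1 = sum((y - m1) ** 2 for y in y1) / (n1 - 1)
    v2 = sum((y - m2) ** 2 for y in y2) / (n2 - 1)
    if pooled or a0 > 0:
        s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        denom = a0 + math.sqrt(s2) * math.sqrt(1 / n1 + 1 / n2)
    else:
        denom = math.sqrt(v1 / n1 + v2 / n2)
    return (m1 - m2) / denom

y1 = [2.1, 2.5, 1.9]   # made-up expressions, class C1
y2 = [0.9, 1.1, 1.0]   # made-up expressions, class C2
print(two_sample_t(y1, y2))                # Welch-type form
print(two_sample_t(y1, y2, pooled=True))   # pooled form
print(two_sample_t(y1, y2, a0=0.5))        # SAM-style modified form
```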

Multiple Hypothesis Testing

Consider measures of error for multiple hypotheses. Focus on the rate of false positives with respect to the number of rejected hypotheses, Nr.

Possible Outcomes for N Hypothesis Tests

|            | Accept null | Reject null | Total |
|------------|-------------|-------------|-------|
| Null true  | U           | V           | N0    |
| Null false | T           | S           | N1    |
| Total      | N - Nr      | Nr          | N     |

Here V is the number of false positives and Nr the number of rejected hypotheses.

FWER is the probability of getting one or more false positives out of all the hypotheses tested:

FWER = Pr(V >= 1).
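Under independence, the FWER at level alpha grows as 1 - (1 - alpha)^N, which is essentially 1 for microarray-sized N. A two-line check, using the 10,000-gene figure from the earlier example:

```python
# For N independent tests each at level alpha, the family-wise error rate
# under the complete null is 1 - (1 - alpha)^N.
alpha, N = 0.05, 10_000
fwer = 1 - (1 - alpha) ** N
print(f"FWER for {N} tests at alpha = {alpha}: {fwer:.6f}")  # essentially 1
print(f"Expected number of false positives: {alpha * N:.0f}")
```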

Bonferroni Method for Controlling the FWER

Consider N hypothesis tests:

H0j versus H1j, j = 1, ..., N,

and let P1, ..., PN denote the N P-values for these tests.

The Bonferroni Method: given P-values P1, ..., PN, reject the null hypothesis H0j if

Pj < α / N.

False Discovery Rate (FDR)

The FDR is essentially the expectation of the proportion of false positives among the identified differentially expressed genes.

Possible Outcomes for N Hypothesis Tests

Controlling the FDR

Benjamini and Hochberg (1995)

Key papers on controlling the FDR

- Genovese and Wasserman (2002)
- Storey (2002, 2003)
- Storey and Tibshirani (2003a, 2003b)
- Storey, Taylor and Siegmund (2004)
- Black (2004)
- Cox and Wong (2004)

Benjamini-Hochberg (BH) Procedure

Controls the FDR at level α when the P-values following the null distribution are independent and uniformly distributed.

(1) Let P(1) ≤ ... ≤ P(N) be the observed ordered P-values.

(2) Calculate k = max{k : P(k) ≤ kα/N}.

(3) If such a k exists, then reject the null hypotheses corresponding to P(1), ..., P(k). Otherwise, reject nothing.

Example: Bonferroni and BH Tests

Suppose that 10 independent hypothesis tests are carried out, leading to the following ordered P-values:

0.00017 0.00448 0.00671 0.00907 0.01220
0.33626 0.39341 0.53882 0.58125 0.98617

(a) With α = 0.05, the Bonferroni test rejects any hypothesis whose P-value is less than α / 10 = 0.005. Thus only the first two hypotheses are rejected.

(b) For the BH test, we find the largest k such that P(k) ≤ kα/N. Here k = 5, thus we reject the first five hypotheses.
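Both rules can be checked directly on these ten P-values; this sketch implements the Bonferroni threshold and the BH step-up rule as stated above:

```python
# The ten ordered P-values from the example above.
pvals = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
         0.33626, 0.39341, 0.53882, 0.58125, 0.98617]
alpha, N = 0.05, len(pvals)

# Bonferroni: reject P-values below alpha / N.
bonferroni_rejections = [p for p in pvals if p < alpha / N]

# BH step-up: largest k with P_(k) <= k * alpha / N (pvals already sorted).
k_hat = max((k for k in range(1, N + 1) if pvals[k - 1] <= k * alpha / N),
            default=0)
bh_rejections = pvals[:k_hat]

print(len(bonferroni_rejections))  # 2
print(len(bh_rejections))          # 5
```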

SHORT BREAK

Null Distribution of the Test Statistic

Permutation Method

The null distribution has a resolution on the order of the number of permutations. If we perform B permutations, then the P-value will be estimated with a resolution of 1/B.

If we assume that each gene has the same null distribution and combine the permutations, then the resolution will be 1/(NB) for the pooled null distribution.

Using just the B permutations of the class labels for the gene-specific statistic Tj, the P-value for Tj = tj is assessed as:

pj = #{b : |t(b)0j| ≥ |tj|} / B,

where t(b)0j is the null version of tj after the bth permutation of the class labels.

If we pool over all N genes, then:

pj = #{(i, b) : |t(b)0i| ≥ |tj|} / (NB).
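A small sketch of the permutation method for a single gene, enumerating all distinct relabellings rather than sampling B of them (the expression values are toy numbers, and for brevity the statistic is a simple difference of class means rather than a t-statistic):

```python
from itertools import combinations

def perm_pvalue(y1, y2):
    """Exact permutation P-value for one gene.

    Enumerates every distinct relabelling of the pooled observations into
    groups of sizes len(y1) and len(y2); the resolution is 1/B, with B the
    number of relabellings.
    """
    pooled = y1 + y2
    n1 = len(y1)
    obs = abs(sum(y1) / n1 - sum(y2) / len(y2))
    count, total = 0, 0
    for idx in combinations(range(len(pooled)), n1):
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in idx]
        stat = abs(sum(g1) / len(g1) - sum(g2) / len(g2))
        if stat >= obs:
            count += 1
        total += 1
    return count / total

# Made-up expressions for one gene in two classes of three samples each:
print(perm_pvalue([2.1, 2.5, 1.9], [0.9, 1.1, 1.0]))
```

With three samples per class there are only C(6, 3) = 20 relabellings (10 up to mirror symmetry), illustrating how granular the single-gene null distribution is.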

Null Distribution of the Test Statistic: Example

Class 1 Class 2

Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1) B6(1)

Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2) B6(2)

Suppose we have two classes of tissue samples, with three samples from each class. Consider the expressions of two genes, Gene 1 and Gene 2.

Class 1 Class 2

Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1) B6(1)

Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2) B6(2)

To find the null distribution of the test statistic for Gene 1, we proceed under the assumption that there is no difference between the classes (for Gene 1), so that:

Gene 1 A1(1) A2(1) A3(1) A4(1) A5(1) A6(1)

and permute the class labels:

Perm. 1 A1(1) A2(1) A4(1) A3(1) A5(1) A6(1)

...

There are 10 distinct permutations.

Ten Permutations of Gene 1

A1(1) A2(1) A3(1) A4(1) A5(1) A6(1)

A1(1) A2(1) A4(1) A3(1) A5(1) A6(1)

A1(1) A2(1) A5(1) A3(1) A4(1) A6(1)

A1(1) A2(1) A6(1) A3(1) A4(1) A5(1)

A1(1) A3(1) A4(1) A2(1) A5(1) A6(1)

A1(1) A3(1) A5(1) A2(1) A4(1) A6(1)

A1(1) A3(1) A6(1) A2(1) A4(1) A5(1)

A1(1) A4(1) A5(1) A2(1) A3(1) A6(1)

A1(1) A4(1) A6(1) A2(1) A3(1) A5(1)

A1(1) A5(1) A6(1) A2(1) A3(1) A4(1)

As there are only 10 distinct permutations here, the null distribution based on these permutations is too granular. Hence consideration is given to permuting the labels of each of the other genes and estimating the null distribution of a gene based on the pooled permutations so obtained.

But there is a problem with this method: the null values of the test statistic for each gene do not necessarily have the theoretical null distribution that we are trying to estimate.

Suppose we were to use Gene 2 also to estimate the null distribution of Gene 1. If Gene 2 is differentially expressed, then the null values of the test statistic for Gene 2 will have a mixed distribution.

Class 1 Class 2

Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1) B6(1)

Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2) B6(2)

Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2) B6(2)

Permute the class labels:

Perm. 1 A1(2) A2(2) B4(2) A3(2) B5(2) B6(2)

...

Example of a null case: with 7 N(0,1) points and 8 N(0,1) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.

Example of a non-null case: with 7 N(0,1) points and 8 N(10,9) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.

The SAM Method

Use the permutation method to calculate the null distribution of the modified t-statistic (Tusher et al., 2001). The order statistics t(1), ..., t(N) are plotted against their null expectations.

A good test in situations where there are more genes being over-expressed than under-expressed, or vice versa.

Two-Component Mixture Model Framework

Two-component model:

f(tj) = π0 f0(tj) + π1 f1(tj),

where π0 is the proportion of genes that are not differentially expressed, and π1 = 1 - π0 is the proportion that are.

Two-component model:

f(tj) = π0 f0(tj) + π1 f1(tj),

where π0 is the proportion of genes that are not differentially expressed, and π1 = 1 - π0 is the proportion that are. Then

τ0(tj) = π0 f0(tj) / f(tj)

is the posterior probability that gene j is not differentially expressed.

1) Form a statistic tj for each gene, where a large positive value of tj corresponds to a gene that is differentially expressed across the tissues.

2) Compute the Pj-values according to the tj and fit a mixture of beta distributions (including a uniform component) to them, where the uniform component corresponds to the class of genes that are not differentially expressed.

or

3) Fit to t1, ..., tN a mixture of two normal densities with a common variance, where the first component has the smaller mean (it corresponds to the class of genes that are not differentially expressed). It is assumed that the tj have been transformed so that they are (approximately) normally distributed.

4) Let τ0(tj) denote the (estimated) posterior probability that gene j belongs to the first component of the mixture.

If we conclude that gene j is differentially expressed when

τ0(tj) ≤ c0,

then this decision minimizes the (estimated) Bayes risk.

Estimated FDR

The FDR is estimated by averaging the posterior probabilities of non-differential expression over the rejected genes:

estimated FDR = (1/Nr) Σj τ0(tj), where the sum is over the Nr genes with τ0(tj) ≤ c0.
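A sketch of the two-component decision rule and this FDR estimate, with made-up component parameters (π0 = 0.9, a null N(0,1) component, and a non-null N(3,1) component are illustrative assumptions, not fitted values):

```python
import math

def normal_pdf(t, mu, sigma):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed two-component model (illustrative, not fitted to real data):
pi0, pi1 = 0.9, 0.1

def tau0(t):
    """Posterior probability that a gene with statistic t is NOT diff. expressed."""
    f0 = pi0 * normal_pdf(t, 0, 1)   # null component, N(0, 1)
    f1 = pi1 * normal_pdf(t, 3, 1)   # non-null component, N(3, 1)
    return f0 / (f0 + f1)

tstats = [0.2, 0.8, 1.5, 2.9, 3.4, 4.1]   # made-up gene statistics
c0 = 0.5                                   # threshold on tau_0
rejected = [t for t in tstats if tau0(t) <= c0]

# Estimated FDR: average posterior null probability over the rejected genes.
fdr_hat = sum(tau0(t) for t in rejected) / len(rejected)
print(len(rejected), round(fdr_hat, 3))
```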

Suppose τ0(t) is monotonic decreasing in t. Then the rule τ0(tj) ≤ c0 is equivalent to declaring gene j differentially expressed for tj ≥ t0, where t0 is the threshold at which τ0(t0) = c0.

For a desired control level α, say α = 0.05, take c0 to be the largest value for which the estimated FDR is no greater than α. (1)

If the estimated FDR is monotonic in the threshold t, then using (1) to control the FDR [with the unknown distribution functions taken to be their empirical versions] is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistic tj.

Example

The study of Hedenfalk et al. (2001) used cDNA arrays to measure the gene expressions in breast tumour tissues taken from patients with mutations in either the BRCA1 or BRCA2 genes.

We consider their data set of M = 15 patients, comprising two patient groups, BRCA1-mutation positive (7) versus BRCA2-mutation positive (8), with N = 3,226 genes.

The problem is to find genes which are differentially expressed between the BRCA1 and BRCA2 patients.


Fit the two-component normal mixture model to the N values of tj (the pooled two-sample t-statistic).

The jth gene is taken to be differentially expressed if τ0(tj) ≤ c0.

Estimated FDR for various levels of c0

Use of the P-Value as a Summary Statistic

(Allison et al., 2002)

Instead of using the pooled form of the t-statistic, we can work with the value pj, which is the P-value associated with tj in the test of the null hypothesis of no difference in expression between the two classes.

The distribution of the P-value is modelled by the h-component mixture model

f(pj) = Σ_{i=1}^{h} πi b_{ai1,ai2}(pj),

where a11 = a12 = 1 (so that the first component is the uniform distribution).

Use of the P-Value as a Summary Statistic

Under the null hypothesis of no difference in expression for the jth gene, pj will have a uniform distribution on the unit interval; i.e., the b1,1 distribution.

The b_{a1,a2} density is given by

b_{a1,a2}(p) = p^(a1-1) (1 - p)^(a2-1) / B(a1, a2), for 0 < p < 1,

where B(a1, a2) = Γ(a1) Γ(a2) / Γ(a1 + a2).
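This density can be written down directly; as a check, b_{1,1} is the uniform density on (0, 1), matching the null distribution of a P-value:

```python
import math

def beta_density(p, a1, a2):
    """Beta density b_{a1,a2}(p) for 0 < p < 1, as given above."""
    B = math.gamma(a1) * math.gamma(a2) / math.gamma(a1 + a2)
    return p ** (a1 - 1) * (1 - p) ** (a2 - 1) / B

# Sanity check: the b_{1,1} density is uniform on (0, 1).
print([round(beta_density(p, 1, 1), 6) for p in (0.1, 0.5, 0.9)])  # [1.0, 1.0, 1.0]
```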

Outline of Tutorial

- Introduction to Microarray Technology

- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples

- Cluster Analysis: Clustering Genes and Clustering Tissues

- Supervised Classification of Tissue Samples

- Linking Microarray Data with Survival Analysis

Gene Expression Data Represented as an N x M Matrix

[Diagram: the N x M matrix, with columns Sample 1, ..., Sample M and rows Gene 1, ..., Gene N; a column is an expression signature, a row is an expression profile.]

The N rows correspond to the N genes; the M columns correspond to the M samples (microarray experiments).

Two Clustering Problems:

- Clustering of genes on the basis of the tissues; the genes are not independent (n = N, p = M).

- Clustering of tissues on the basis of the genes; the latter is a nonstandard problem in cluster analysis (n = M, p = N, so n << p).

CLUSTERING OF GENES WITH REPEATED MEASUREMENTS:

Medvedovic and Sivaganesan (2002)

Celeux, Martin, and Lavergne (2004)

Chapter 5 of McLachlan et al. (2004)

UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS)

INFER CLASS LABELS z1, ..., zn of y1, ..., yn

Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes: Eisen, Spellman, Brown, & Botstein (1998, PNAS).

Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters.

“in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a ‘good’ clustering algorithm or the ‘right’ number of clusters.”

(Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)

McLachlan and Khan (2004). On a resampling approach for tests on the number of clusters with mixture model-based clustering of the tissue samples.

Special issue of the Journal of Multivariate Analysis 90 (2004) edited by Mark van der Laan and Sandrine Dudoit (UC Berkeley).

Attention is now turning towards a model-based approach to the analysis of microarray data

For example:

- Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9
- Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18
- Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7
- Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3
- Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17

The notion of a cluster is not easy to define.

There is a very large literature devoted to clustering when there is a metric known in advance, e.g. k-means. Usually, however, there is no a priori metric (or, equivalently, no user-defined distance matrix) for a cluster analysis.

That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.

In this case, one attractive feature of adopting mixture models with elliptically symmetric components such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data).

Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.

For a mixture of g normal components with a common, unrestricted covariance matrix, an observation is assigned on the basis of the smallest Mahalanobis distance to a component mean (the contours of constant density are ellipsoids); when the common covariance matrix is spherical, this reduces to the Euclidean distance (the contours of constant density are spheres).

k-means produces spherical clusters: it corresponds to fitting a mixture of g normal components with equal spherical covariance matrices.

With a mixture model-based approach to clustering, an observation is assigned outright to the ith cluster if its density in the ith component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other (g-1) components.
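A one-dimensional sketch of this outright-assignment rule (the mixing proportions, means, and variances below are made-up illustrative values, not fitted parameters):

```python
import math

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed fitted two-component normal mixture (illustrative parameters):
components = [
    {"pi": 0.7, "mu": 0.0, "sigma": 1.0},   # cluster 1
    {"pi": 0.3, "mu": 4.0, "sigma": 1.0},   # cluster 2
]

def assign(y):
    """Assign y outright to the component with the largest pi_i * f_i(y)."""
    weighted = [c["pi"] * normal_pdf(y, c["mu"], c["sigma"]) for c in components]
    return weighted.index(max(weighted))

print([assign(y) for y in (-0.5, 1.0, 3.5, 5.0)])  # [0, 0, 1, 1]
```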

Figure 7: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.

Estimation of Mixture Distributions

It was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data.

McLachlan and Krishnan (1997, Wiley)

- If need be, the normal mixture model can be made less sensitive to outlying observations by using t component densities.
- With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom.

The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger.

Mixtures of Factor Analyzers

A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular, with high dimensional data.

One approach for reducing the number of parameters is to work in a lower-dimensional space by using principal components; another is to use mixtures of factor analyzers (Ghahramani & Hinton, 1997).

Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.

Principal components or a single-factor analysis model provides only a global linear model. A globally nonlinear approach is obtained by postulating a mixture of linear submodels.

Bi is a p x q matrix and Di is a diagonal matrix.

Single-Factor Analysis Model

The Uj are iid N(0, Iq), independently of the errors ej, which are iid N(0, D), where D is a diagonal matrix.

Conditional on ith component membership of the mixture,

where Ui1, ..., Uin are independent, identically distributed (iid) N(0, Iq), independently of the eij, which are iid N(0, Di), where Di is a diagonal matrix (i = 1, ..., g).

There is an infinity of choices for Bi, since the model still holds if Bi is replaced by BiCi, where Ci is an orthogonal matrix.

Choose Ci so that BiTDi-1Bi is diagonal.

Number of free parameters is then

Reduction in the number of parameters is then
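As a concrete check of this reduction, here is a small sketch of the two parameter counts per component-covariance matrix (the values p = 2000 genes and q = 6 factors below are hypothetical, chosen only for illustration):

```python
def cov_params_full(p):
    """Free parameters in an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2

def cov_params_fa(p, q):
    """Free parameters in the factor-analytic covariance BBt + D:
    p*q loadings plus p diagonal entries, minus q(q-1)/2 constraints
    that fix the rotational indeterminacy in B."""
    return p * q + p - q * (q - 1) // 2

p, q = 2000, 6
print(cov_params_full(p))                      # unrestricted count
print(cov_params_fa(p, q))                     # factor-analyzer count
print(cov_params_full(p) - cov_params_fa(p, q))  # reduction per component
```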

We can fit the mixture of factor analyzers model using an alternating ECM algorithm.

1st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of

and

2nd cycle: declare the missing data to be also the factors. Update the estimates of

and

M-step on 1st cycle:

for i = 1, ..., g.

M-step on 2nd cycle:

where

Work in q-dim space:

(BiBiT + Di)-1 = Di-1 - Di-1Bi(Iq + BiTDi-1Bi)-1BiTDi-1,

|BiBiT + Di| = |Di| / |Iq - BiT(BiBiT + Di)-1Bi| .
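These two identities mean the inversion and determinant need only q-dimensional computations. A quick numerical sketch (random B and D, not from the slides) verifies both:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 5, 2
B = rng.standard_normal((p, q))          # p x q loading matrix
D = np.diag(rng.uniform(0.5, 1.5, p))    # diagonal noise covariance
Dinv = np.linalg.inv(D)

Sigma = B @ B.T + D

# Inverse via the q-dimensional identity
lhs = Dinv - Dinv @ B @ np.linalg.inv(np.eye(q) + B.T @ Dinv @ B) @ B.T @ Dinv
assert np.allclose(lhs, np.linalg.inv(Sigma))

# Determinant via the q-dimensional identity
det = np.linalg.det(D) / np.linalg.det(np.eye(q) - B.T @ np.linalg.inv(Sigma) @ B)
assert np.allclose(det, np.linalg.det(Sigma))
print("identities verified")
```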

PROVIDES A MODEL-BASED APPROACH TO CLUSTERING

McLachlan, Bean, and Peel (2002). A Mixture Model-Based Approach to the Clustering of Microarray Expression Data. Bioinformatics 18, 413-422.

http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf

Example: Microarray Data. Colon Data of Alon et al. (1999)

M = 62 tissue samples (40 tumours; 22 normals) of N = 2,000 genes, in a 2,000 x 62 matrix.

Mixture of 2 normal components

Mixture of 2 t components

Clustering of COLON Data

Genes using EMMIX-GENE

Tissues using EMMIX-GENE

Heat Map Displaying the Reduced Set of 4,869 Genes

on the 98 Breast Cancer Tumours


Heat Map of Top 1867 Genes

i   mi   Ui

1 146 112.98

2 93 74.95

3 61 46.08

4 55 35.20

5 43 30.40

6 92 29.29

7 71 28.77

8 20 28.76

9 23 28.44

10 23 27.73

11 66 25.72

12 38 25.45

13 28 25.00

14 53 21.33

15 47 18.14

16 23 18.00

17 27 17.62

18 45 17.51

19 80 17.28

20 55 13.79

21 44 13.77

22 30 13.28

23 25 13.10

24 67 13.01

25 12 12.04

26 58 12.03

27 27 11.74

28 64 11.61

29 38 11.38

30 21 10.72

31 53 9.84

32 36 8.95

33 36 8.89

34 38 8.86

35 44 8.02

36 56 7.43

37 46 7.21

38 19 6.14

39 29 4.64

40 35 2.44

where i = group number

mi = number in group i

Ui = -2 log λi

Number of Components in a Mixture Model

Testing for the number of components, g, in a mixture is an important but very difficult problem which has not been completely resolved.

Order of a Mixture Model

A mixture density with g components might be empirically indistinguishable from one with either fewer than g components or more than g components. It is therefore sensible in practice to approach the question of the number of components in a mixture model in terms of an assessment of the smallest number of components in the mixture compatible with the data.

Likelihood Ratio Test Statistic

An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the LRTS, -2 log λ. Suppose we wish to test the null hypothesis

versus

for some g1>g0.

We let the MLE of the parameter vector calculated under Hi be denoted accordingly (i = 0, 1). Then the evidence against H0 will be strong if λ is sufficiently small, or equivalently, if -2 log λ is sufficiently large, where

Bootstrapping the LRTS

McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing

for a specified value of g0.

Bayesian Information Criterion

The Bayesian information criterion (BIC) of Schwarz (1978) is given by 2 log L - d log n (with d the number of free parameters and n the sample size) as the penalized log likelihood to be maximized in model selection, including the present situation for the number of components g in a mixture model.
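A minimal sketch of model selection by BIC, using hypothetical maximized log likelihoods and parameter counts for univariate normal mixtures with g = 1, 2, 3 components (the numbers are invented for illustration):

```python
import math

def bic(loglik, d, n):
    """Schwarz's penalized log likelihood 2*logL - d*log(n), to be maximized."""
    return 2.0 * loglik - d * math.log(n)

# Hypothetical fits: g -> (max log likelihood, number of free parameters 3g - 1)
fits = {1: (-520.0, 2), 2: (-480.0, 5), 3: (-478.0, 8)}
n = 100
best_g = max(fits, key=lambda g: bic(fits[g][0], fits[g][1], n))
print(best_g)  # -> 2
```

Here g = 3 improves the log likelihood only slightly over g = 2, so the penalty term tips the choice to the smaller model.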

Gap statistic (Tibshirani et al., 2001); Clest (Dudoit and Fridlyand, 2002)

- Introduction to Microarray Technology

- Detecting Differentially Expressed Genes in Known Classes
- of Tissue Samples

- Cluster Analysis: Clustering Genes and Clustering Tissues

- Supervised Classification of Tissue Samples

- Linking Microarray Data with Survival Analysis

Supervised Classification (Two Classes)

[Diagram: gene-expression data matrix with rows Gene 1, ..., Gene p and columns Sample 1, ..., Sample n; samples are labelled Class 1 (good prognosis) or Class 2 (poor prognosis).]

Microarray to be used as routine clinical screen

by C. M. Schubert, Nature Medicine 9, 9, 2003.

The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.

Selection bias in gene extraction on the basis of microarray gene-expression data

Ambroise and McLachlan, Proceedings of the National Academy of Sciences 99(10), 6562-6566, May 14, 2002.

http://www.pnas.org/cgi/content/full/99/10/6562

We OBSERVE the CLASS LABELS z1, ..., zn, where zj = i if the jth tissue sample comes from the ith class (i = 1, ..., g).

AIM: TO CONSTRUCT A CLASSIFIER c(y) FOR

PREDICTING THE UNKNOWN CLASS LABEL z

OF A TISSUE SAMPLE y.

e.g. g = 2 classes C1 - DISEASE-FREE

C2 - METASTASES

Supervised Classification of Tissue Samples

FORM a classifier c(y) for the prediction of the group label z of a future entity with feature vector y.

FISHER’S LINEAR DISCRIMINANT FUNCTION

where the sample means and the pooled sample covariance matrix S are found from the training data.

Vapnik (1995)

where β0 and β are obtained as follows:

subject to

where the ξj relate to the slack variables (separable case: all ξj = 0)

with non-zero coefficients only for those observations j for which the constraints are exactly met (the support vectors).

Support Vector Machine (SVM)

REPLACE the inner product of the feature vectors by the kernel function, where the kernel function is the inner product in the transformed feature space.

HASTIE et al. (2001, Chapter 12)

The Lagrangian (primal) function is

which we minimize w.r.t. β, β0, and the ξj.

Setting the respective derivatives to zero, we get

with

and

By substituting (2) to (4) into (1), we obtain the Lagrangian dual function

We maximize (5) subject to

In addition to (2) to (4), the constraints include

Together these equations (2) to (8) uniquely characterize the solution to the primal and dual problem.

Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231. Discussants include Brad Efron and David Cox.

LEUKAEMIA DATA:

Only 2 genes are needed to obtain a zero CVE (cross-validated error rate)

COLON DATA:

Using only 4 genes, CVE is 2%

Since p >> n, consideration is given to the selection of suitable genes.

SVM: FORWARD or BACKWARD (in terms of magnitude of weight βi)

RECURSIVE FEATURE ELIMINATION (RFE)

FISHER: FORWARD ONLY (in terms of CVE)
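The RFE loop can be sketched as: train a linear rule on the surviving genes, rank them by |weight|, and eliminate the lower-ranked half until the desired number remains. The centroid-difference weights below are a hypothetical stand-in for the SVM weights βi, and the data are invented:

```python
def rfe(X, weights_fn, n_keep):
    """Recursive feature elimination sketch: repeatedly refit on the
    surviving features and drop the half with the smallest |weight|,
    until n_keep features remain."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        w = weights_fn([[row[j] for j in active] for row in X])
        ranked = sorted(range(len(active)), key=lambda k: abs(w[k]))
        n_next = max(n_keep, len(active) // 2)
        active = sorted(active[k] for k in ranked[len(active) - n_next:])
    return active

# Toy data: feature 0 separates the classes, features 1-3 are noise.
X = [[5, 0.1, -0.2, 0.0], [6, -0.3, 0.2, 0.1],
     [-5, 0.2, 0.1, -0.1], [-6, 0.0, -0.1, 0.2]]
y = [1, 1, 0, 0]

def centroid_diff(Xa):
    """Stand-in linear weights: difference of the two class means."""
    n1, n0 = y.count(1), y.count(0)
    m1 = [sum(r[j] for r, c in zip(Xa, y) if c == 1) / n1 for j in range(len(Xa[0]))]
    m0 = [sum(r[j] for r, c in zip(Xa, y) if c == 0) / n0 for j in range(len(Xa[0]))]
    return [a - b for a, b in zip(m1, m0)]

print(rfe(X, centroid_diff, 1))  # -> [0]
```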


“The success of the RFE indicates that RFE has a built in regularization mechanism that we do not understand yet that prevents overfitting the training data in its selection of gene subsets.”

Suppose there are two groups G1 and G2.

c(y) is a classifier formed from the data set

(y1, y2, y3,……………, yn)

The apparent error is the proportion of the data set misallocated by c(y).

From the original data set, remove y1 to give the reduced set

(y2, y3,……………, yn)

Cross-Validation

Then form the classifier c(1)(y) from this reduced set.

Use c(1)(y1) to allocate y1 to either G1 or G2.

Repeat this process for the second data point, y2, so that this point is assigned to either G1 or G2 on the basis of the classifier c(2)(y2).

And so on up to yn.
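The leave-one-out procedure just described can be sketched as follows (the nearest-mean rule here is a hypothetical stand-in for c(y), and the data are invented):

```python
def nearest_mean_rule(train, labels, y):
    """Toy classifier c(y): allocate y to the group with the nearer mean."""
    m1 = sum(x for x, g in zip(train, labels) if g == 1) / labels.count(1)
    m2 = sum(x for x, g in zip(train, labels) if g == 2) / labels.count(2)
    return 1 if abs(y - m1) <= abs(y - m2) else 2

def loocv_error(data, labels):
    """Leave-one-out CV: delete y_j, refit c^(j) on the rest, then try to
    reallocate y_j; report the proportion misallocated."""
    errors = 0
    for j in range(len(data)):
        train = data[:j] + data[j + 1:]
        lab = labels[:j] + labels[j + 1:]
        if nearest_mean_rule(train, lab, data[j]) != labels[j]:
            errors += 1
    return errors / len(data)

data = [0.0, 0.2, 0.1, 5.0, 5.2, 4.9]
labels = [1, 1, 1, 2, 2, 2]
print(loocv_error(data, labels))  # -> 0.0
```

The key point is that the deleted point never contributes to the classifier used to allocate it.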

Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples

Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure using all the colon data

Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues

ADDITIONAL REFERENCES

Selection bias ignored:

XIONG et al. (2001, Molecular Genetics and Metabolism)

XIONG et al. (2001, Genome Research)

ZHANG et al. (2001, PNAS)

Aware of selection bias:

SPANG et al. (2001, Silico Biology)

WEST et al. (2001, PNAS)

NGUYEN and ROCKE (2002)

BOOTSTRAP APPROACH

Efron’s (1983, JASA) .632 estimator

where B1 is the bootstrap error when the rule is applied to a point not in the training sample.

A Monte Carlo estimate of B1 is

where

Toussaint & Sharpe (1975) proposed the

ERROR RATE ESTIMATOR

where

McLachlan (1977) proposed w = wo, where wo is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups.

The value of wo was found to range between 0.6 and 0.7, depending on the values of

.632+ estimate of Efron & Tibshirani (1997, JASA)

where

(relative overfitting rate)

(estimate of no information error rate)

If r = 0, then w = .632, and so B.632+ = B.632;
if r = 1, then w = 1, and so B.632+ = B1.
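The .632+ weighting can be sketched directly from these definitions, with AE the apparent error, B1 the leave-one-out bootstrap error, and gamma the no-information error rate (the clipping of r to [0, 1] and the weight w = .632/(1 - .368 r) follow Efron & Tibshirani, 1997; the numeric inputs are hypothetical):

```python
def b632plus(ae, b1, gamma):
    """Efron & Tibshirani's .632+ error-rate estimator."""
    # relative overfitting rate r, clipped to [0, 1]
    r = (b1 - ae) / (gamma - ae) if gamma > ae else 0.0
    r = min(max(r, 0.0), 1.0)
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * ae + w * b1

# r = 0 recovers the ordinary .632 weighting; r = 1 returns B1 itself.
print(b632plus(0.05, 0.05, 0.5))
print(b632plus(0.0, 0.5, 0.5))
```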

MARKER GENES FOR HARVARD DATA

For a SVM based on 64 genes, and using 10-fold CV,

we noted the number of times a gene was selected.

No. of genes Times selected

55 1

18 2

11 3

7 4

8 5

6 6

10 7

8 8

12 9

17 10


Breast cancer data set of van ’t Veer et al.

(van ’t Veer et al., 2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415)

These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours.

The overall goal was to identify a set of genes that could distinguish between the different tumour groups based upon the gene expression information for these groups.

van de Vijver et al. (2002) considered a further 234 breast cancer tumours, but made available only the data for the top 70 genes based on the previous study of van ’t Veer et al. (2002).

Nearest-Shrunken Centroids

(Tibshirani et al., 2002)

The usual estimates of the class means are shrunk toward the overall mean of the data.

The nearest-centroid rule is given by

where yv is the vth element of the feature vector y.

In the previous definition, we replace the sample mean of the vth gene by its shrunken estimate, where

Comparison of Nearest-Shrunken Centroids with SVM

Apply (i) nearest-shrunken centroids and (ii) the SVM with RFE to the colon data set of Alon et al. (1999), with N = 2000 genes and M = 62 tissues (40 tumours, 22 normals).

Nearest-Shrunken Centroids applied to Alon data

(a) Overall Error Rates

(b) Class-specific Error Rates

- Introduction to Microarray Technology

- Detecting Differentially Expressed Genes in Known Classes
- of Tissue Samples

- Cluster Analysis: Clustering Genes and Clustering Tissues

- Supervised Classification of Tissue Samples

- Linking Microarray Data with Survival Analysis

Breast tumours have a genetic signature. The expression pattern of a set of 70 genes can predict whether a tumour is going to prove lethal, despite treatment, or not.

“This gene expression profile will outperform all currently

used clinical parameters in predicting disease outcome.”

van ’t Veer et al. (2002), van de Vijver et al. (2002)

Problems

- Censored Observations – the time of occurrence of the event
- (death) has not yet been observed.
- Small Sample Sizes – study limited by patient numbers
- Specific Patient Group – is the study applicable to other
- populations?
- Difficulty in integrating different studies (different
- microarray platforms)

A Case Study: The Lung Cancer Data Sets from CAMDA’03

Four independently acquired lung cancer data sets

(Harvard, Michigan, Stanford and Ontario).

The challenge: To integrate information from different

data sets (2 Affy chips of different versions, 2 cDNA arrays).

The final goal: To make an impact on cancer biology and

eventually patient care.

“Especially, we welcome the methodology of survival analysis

using microarrays for cancer prognosis (Park et al.

Bioinformatics: S120, 2002).”

Methodology of Survival Analysis using Microarrays

Cluster the tissue samples (eg using hierarchical clustering), then

compare the survival curves for each cluster using a non-parametric

Kaplan-Meier analysis (Alizadeh et al. 2000).
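The Kaplan-Meier estimate itself is simple to compute: at each observed event time, multiply the running survival estimate by (1 - d/n), where d events occur among the n subjects still at risk. A minimal sketch (invented data; tied event times ignored for simplicity):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve as a list of (event time, S(t)) pairs.
    events[j] is 1 for an observed event, 0 for a censored observation."""
    s, curve = 1.0, []
    order = sorted(range(len(times)), key=lambda j: times[j])
    at_risk = len(times)
    for j in order:
        if events[j] == 1:
            s *= 1.0 - 1.0 / at_risk   # step down at each event
            curve.append((times[j], s))
        at_risk -= 1                   # censored subjects leave the risk set
    return curve

# Three events and one censored observation at t = 5:
print(kaplan_meier([2, 3, 5, 7], [1, 1, 0, 1]))
```

Note that the censored observation at t = 5 produces no step in the curve but does reduce the risk set for the later event.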

Park et al. (2002), Nguyen and Rocke (2002) used partial least

squares with the proportional hazards model of Cox.

Unsupervised vs. Supervised Methods

Semi-supervised approach of Bair and Tibshirani (2004), to combine

gene expression data with the clinical data.

AIM: To link gene-expression data with survival from lung cancer in the CAMDA’03 challenge.

A CLUSTER ANALYSIS

We apply a model-based clustering approach to classify tumour tissues on the basis of microarray gene expression.

B SURVIVAL ANALYSIS

The association between the clusters so formed and patient survival (recurrence) times is established.

C DISCRIMINANT ANALYSIS

We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.

Lung Cancer

Approx. 80% of lung cancer patients have NSCLC (of which

adenocarcinoma is the most common form).

All Patients diagnosed with NSCLC are treated on the basis of

stage at presentation (tumour size, lymph node involvement and

presence of metastases).

Yet 30% of patients with resected stage I lung cancer will die of

metastatic cancer within 5 years of surgery.

Want a prognostic test for early-stage lung adenocarcinoma to

identify patients more likely to recur, and therefore who would

benefit from adjuvant therapy.

(see http://www.camda.duke.edu/camda03)

Wigle et al. (2002), Garber et al. (2001), Bhattacharjee et al. (2001),

Beer et al. (2002).

Heat Maps for the 20 Ontario Gene-Groups (39 Tissues)

Genes

Tissues

Tissues are ordered as:

Recurrence (1-24) and Censored (25-39)

Expression Profiles for Useful Metagenes (Ontario 39 Tissues)

Gene Group 1

Gene Group 2

Our Tissue Cluster 1

Our Tissue Cluster 2

Log Expression Value

Recurrence (1-24)

Censored (25-39)

Gene Group 19

Gene Group 20

Tissues

Tissue Clusters

CLUSTER ANALYSIS via EMMIX-GENE of 20 METAGENES yields TWO CLUSTERS:

CLUSTER 1 (31): 23 (recurrence) plus 8 (censored)

CLUSTER 2 (8): 1 (recurrence) plus 7 (censored)

Poor-prognosis

Good-prognosis

SURVIVAL ANALYSIS:

LONG-TERM SURVIVOR (LTS) MODEL

where T is the time to recurrence and p1 = 1 - p2 is the prior probability of recurrence.

Adopt Weibull model for the survival function for

recurrence S1(t).
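Under this model the survival curve plateaus at p2 = 1 - p1 (the long-term survivors) rather than dropping to zero. A small sketch with hypothetical parameter values:

```python
import math

def weibull_survival(t, shape, scale):
    """Weibull survival function S1(t) = exp(-(t/scale)**shape)."""
    return math.exp(-((t / scale) ** shape))

def lts_survival(t, p1, shape, scale):
    """Long-term survivor model: S(t) = p2 + p1*S1(t), with p2 = 1 - p1
    the probability of never experiencing recurrence."""
    return (1.0 - p1) + p1 * weibull_survival(t, shape, scale)

# With p1 = 0.7, the curve levels off at p2 = 0.3 for large t:
print(lts_survival(0.0, 0.7, 1.5, 20.0))     # starts at 1.0
print(lts_survival(1000.0, 0.7, 1.5, 20.0))  # plateau near 0.3
```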

Fitted LTS Model vs. Kaplan-Meier

Cluster-Specific Kaplan-Meier Plots

Survival Analysis for Ontario Dataset

- Nonparametric analysis:

A significant difference between Kaplan-Meier estimates for the two clusters (P=0.027).

- Cox’s proportional hazards analysis:

Discriminant Analysis (Supervised Classification)

A prognosis classifier was developed to predict the class of origin of a tumor tissue with a small error rate after correction for the selection bias.

A support vector machine (SVM) was adopted to identify important genes that play a key role on predicting the clinical outcome, using all the genes, and the metagenes.

A cross-validation (CV) procedure was used to calculate the prediction error, after correction for the selection bias.

ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)

[Plot: Error Rate (CV10E) versus log2 (number of genes).]

Ten-fold Cross-Validation Error Rate (CV10E) of the Support Vector Machine (SVM), applied to g=2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39).

STANFORD DATA

918 genes based on 73 tissue samples from 67 patients.

Row and column normalized, retained 451 genes after

select-genes step. Used 20 metagenes to cluster tissues.

Retrieved histological groups.

Heat Maps for the 20 Stanford Gene-Groups (73 Tissues)

Genes

Tissues

Tissues are ordered by their histological classification: Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73)

STANFORD CLASSIFICATION:

Cluster 1: 1-19 (good prognosis)

Cluster 2: 20-26 (long-term survivors)

Cluster 3: 27-35 (poor prognosis)

Heat Maps for the 15 Stanford Gene-Groups (35 Tissues)

Genes

Tissues

Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35)

Expression Profiles for Top Metagenes (Stanford 35 AC Tissues)

Gene Group 1

Gene Group 2

Stanford AC group 1

Stanford AC group 2

Stanford AC group 3

Misallocated

Log Expression Value

Gene Group 4

Gene Group 3

Tissues

Cluster-Specific Kaplan-Meier Plots

Cluster-Specific Kaplan-Meier Plots

Survival Analysis for Stanford Dataset

- Kaplan-Meier estimation:

A significant difference in survival between clusters (P<0.001)

- Cox’s proportional hazards analysis:

Survival Analysis for Stanford Dataset

- Univariate Cox’s proportional hazards analysis (metagenes):

Survival Analysis for Stanford Dataset

- Multivariate Cox’s proportional hazards analysis (metagenes):

The final model consists of four metagenes.

STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)

[Plot: Error Rate (CV10E) versus log2 (number of genes).]

Ten-fold Cross-Validation Error Rate (CV10E) of the Support Vector Machine (SVM), applied to g=2 clusters.

CONCLUSIONS
- We applied a model-based clustering approach to classify tumors using their gene signatures into:
- (a) clusters corresponding to tumor type
- (b) clusters corresponding to clinical outcomes for tumors of a given subtype
- In (a), almost perfect correspondence between cluster and tumor type, at least for non-AC tumors (but not in the Ontario dataset).

CONCLUSIONS (cont.)

The clusters in (b) were identified with clinical outcomes (e.g. recurrence/recurrence-free and death/long-term survival).

We were able to show that gene-expression data provide prognostic information, beyond that of clinical indicators such as stage.

CONCLUSIONS (cont.)

Based on the tissue clusters, a discriminant analysis using support vector machines (SVM) demonstrated further the potential of gene expression as a tool for guiding treatment therapy and patient care for lung cancer patients.

This supervised classification procedure was used to provide marker genes for prediction of clinical outcomes.

(In addition to those provided by the cluster-genes step in the initial unsupervised classification.)

LIMITATIONS

Small number of tumors available (e.g. Ontario and Stanford datasets).

Clinical data available for only subsets of the tumors; often for only one tumor type (AC).

High proportion of censored observations limits comparison of survival rates.
