
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring

Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES.

Presentation Transcript


  1. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139, USA. golub@genome.wi.mit.edu

  2. Contents • Background • Objective • Methods • Results and GeneCluster • Conclusion

  3. Background: Cancer Classification • Cancer Classification has been primarily based on morphological appearance of tumor. • Limitations of morphology classification: tumors of similar histopathological appearance can have significantly different clinical courses and response to therapy. • Cancer classification has been difficult because it has historically relied on specific biological insights, rather than systematic unbiased approaches for recognizing tumor subtypes.

  4. Background: Cancer Classification • No general approach has previously existed for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). • A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemia as a test case.

  5. Background: Cancer Classification • Cancer classification can be divided into two challenges: class discovery and class prediction. • Class discovery refers to defining previously unrecognized tumor subtypes. • Class prediction refers to the assignment of particular tumor samples to already-defined classes. • We will look at class prediction for classifying Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL).

  6. Background: Leukemia • Acute leukemia: cancer of the bone marrow. • The marrow cannot produce the appropriate amount of red and white blood cells. • Subtypes: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). • Particular subtypes of acute leukemia have been found to be associated with specific chromosomal translocations. • ALL: t(12;21)(p13;q22) in 25% of patients. • AML: t(8;21)(q22;q22) in 15% of patients.

  7. Background: Leukemia • ALL: 58% survival rate. • AML: 14% survival rate. • The distinction between ALL and AML is essential for successful treatment. • Treatment: chemotherapy, bone marrow transplant. • ALL: corticosteroids, vincristine, methotrexate, L-asparaginase. • AML: daunorubicin, cytarabine. • Although the distinction between ALL and AML is well established, no single test is currently sufficient to establish the diagnosis; instead, a combination of different tests in morphology, histochemistry, and immunophenotyping is performed in a specialized laboratory. • Although usually accurate, leukemia classification remains imperfect and errors do occur.

  8. Objective • To develop a more systematic approach to cancer classification based on the simultaneous expression monitoring of thousands of genes using DNA microarrays, with leukemia as a test case.

  9. Overview • Expression data + known classes → Assess gene-class correlation (neighborhood analysis) → Build predictor → Test predictor by cross-validation → Test predictor on independent dataset.

  10. Method: Biological Samples • Primary samples: 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis. • Measured the expression of 6817 human genes using Affymetrix arrays.

  11. Method: Microarray • RNA prepared from cells was hybridized to high-density oligonucleotide Affymetrix microarrays containing probes for 6817 human genes. • Samples were subjected to a priori quality-control standards regarding the amount of labeled RNA and the quality of the scanned microarray image. Eight of 80 leukemia samples were discarded. Samples were rejected if: • they yielded less than 15 micrograms of biotinylated RNA, • hybridization was weak, • or there were visible defects in the array (such as scratches).

  12. General method for prediction: • Determine whether there were genes whose expression pattern was strongly correlated with class distinction. • Given a collection of known samples, create a class predictor. • Test for validity of predictors.

  13. Basic Definitions • Each gene is represented by an expression vector v(g) = (e1, e2, …, en), where ei denotes the expression level of gene g in the ith sample. • A class distinction is represented by an idealized expression pattern c = (c1, c2, …, cn), where ci is 1 or 0 according to whether the ith sample belongs to class 1 or class 2. • "Correlation" between a gene and a class distinction can be measured in a variety of ways.

  14. Correlation • A measure of correlation, P(g,c), is used in this paper. • P(g,c) measures the degree to which the expression of a given gene in the set of samples correlates with assignment to either class (AML or ALL). • Let [μ1(g), σ1(g)] and [μ2(g), σ2(g)] denote the means and standard deviations of the log of the expression levels of gene g for the samples in class 1 and class 2. • P(g,c) = [μ1(g) − μ2(g)] / [σ1(g) + σ2(g)]. P(g,c) captures the separation between the two classes and the variation within the individual classes.

  15. Correlation Cont. • Large values of |P(g,c)| indicate a strong correlation between gene expression and the class distinction. • P(g,c) being positive or negative corresponds to the gene being more highly expressed in class 1 or class 2, respectively.
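The signal-to-noise correlation defined above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's GeneCluster implementation; function and variable names are ours, and the input is assumed to already be log expression levels:

```python
from statistics import mean, pstdev

def p_gc(expr, labels):
    """P(g,c) = [mu1 - mu2] / [sigma1 + sigma2] for one gene.

    expr   : log expression levels of the gene across all samples
    labels : 1 if the sample belongs to class 1, 0 if to class 2
    """
    c1 = [e for e, l in zip(expr, labels) if l == 1]
    c2 = [e for e, l in zip(expr, labels) if l == 0]
    # pstdev is the population standard deviation over each class
    return (mean(c1) - mean(c2)) / (pstdev(c1) + pstdev(c2))
```

A gene uniformly higher in class 1 yields a positive score; swapping the class labels flips the sign, matching the interpretation on this slide.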

  16. Method: Neighborhood Analysis

  17. Method: Neighborhood Analysis • This method helps establish whether the observed correlations were stronger than would be expected by chance. • One defines an "idealized expression pattern" c corresponding to a gene that is uniformly high in one class and uniformly low in the other. • One tests whether there is an unusually high density of genes "nearby" (or similar to) this idealized pattern, as compared to equivalent random patterns.

  18. Each gene is represented by an expression vector, consisting of its expression level in each of the tumor samples. • In the figure, the data set is composed of 6 ALLs and 6 AMLs. • Gene g1 is well correlated with the class distinction, whereas gene g2 is poorly correlated.

  19. Neighborhood analysis involves counting the number of genes having various levels of correlation with c. • The results are compared to the corresponding distribution obtained for random idealized expression patterns c*, obtained by randomly permuting the coordinates of c. • Unusually high density of genes indicates that there are many more genes correlated with the pattern than expected by chance.
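The permutation comparison described above can be sketched as follows. This is an illustrative sketch, not the paper's code; the paper permutes the coordinates of c, which is equivalent to shuffling the class labels, and the `p_gc` helper (the P(g,c) signal-to-noise score) is redefined here so the block is self-contained:

```python
import random
from statistics import mean, pstdev

def p_gc(expr, labels):
    """Signal-to-noise score P(g,c) for one gene (log expression assumed)."""
    c1 = [e for e, l in zip(expr, labels) if l == 1]
    c2 = [e for e, l in zip(expr, labels) if l == 0]
    return (mean(c1) - mean(c2)) / (pstdev(c1) + pstdev(c2))

def neighborhood_analysis(data, labels, threshold, n_perm=100, seed=0):
    """Count genes with P(g,c) above `threshold`, then repeat the count
    under random permutations of the class labels to estimate how many
    such genes would arise by chance.

    data : list of per-gene expression vectors (log scale)
    Returns (observed_count, list of counts under permuted labels).
    """
    rng = random.Random(seed)
    def count_correlated(lab):
        return sum(1 for gene in data if p_gc(gene, lab) > threshold)
    observed = count_correlated(labels)
    permuted = [count_correlated(rng.sample(labels, len(labels)))
                for _ in range(n_perm)]
    return observed, permuted
```

If the observed count sits far above, say, the 99th percentile of the permuted counts, the correlation structure is unlikely to be due to chance, which is how the 1% significance level on the next slides is read off.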

  20. For the 38 acute leukemia samples in the initial data set, neighborhood analysis showed that roughly 1100 genes were more highly correlated with the AML-ALL class distinction than would be expected by chance (Fig. 2). This suggested that classification could indeed be based on expression data.

  21. High in ALL: the number of genes with correlation P(g,c) > 0.30 was 709 for the AML-ALL distinction. • Note that P(g,c) = 0.30 is the point where the observed data intersects the 1% significance level.

  22. Similarly, in the right panel (higher in AML), 711 genes with P(g,c) > 0.28 were observed.

  23. Choosing a prediction set S • Once the neighborhood analysis graph shows that there are genes significantly correlated with class distinction c , the next step is to choose a set of genes for prediction. • The goal is to choose a set S of k genes such that most genes in S are likely to be predictive of the class distinction in future samples.

  24. One could simply choose the top k genes by |P(g,c)|, but there is a possibility that, for example, all such genes might be expressed at a high level in class 1 and at a low level (or not at all) in class 2. • They found that the predictors often do better when they include some genes expressed at high levels in each class.

  25. Choosing S cont'd • So, they performed separate neighborhood analyses for positive and negative P(g,c) scores • and selected the top k1 genes (highly expressed in class 1) • and the bottom k2 genes (highly expressed in class 2).
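Picking the top k1 and bottom k2 genes by score is a simple ranking step; it might look like this (illustrative sketch; the helper name is ours):

```python
def choose_prediction_set(scores, k1, k2):
    """Select genes for the prediction set S from per-gene P(g,c) scores.

    Returns the indices of the k1 genes with the largest positive scores
    (markers of class 1) followed by the k2 genes with the most negative
    scores (markers of class 2).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    top_class1 = ranked[-k1:][::-1]   # largest scores, descending
    top_class2 = ranked[:k2]          # most negative scores, ascending
    return top_class1 + top_class2
```

In practice one would apply this only to genes that cleared the permutation-test significance cutoff, per the constraints discussed on the following slides.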

  26. Constraints on the selection of k1 and k2 • They wanted to determine how sensitive the prediction method is to the exact genes used, and to choose k1 and k2 accordingly. • There are several constraints: they wanted to limit the number of genes to those shown to be significantly correlated by the permutation test, perhaps at the 1% level. • Furthermore, suppose that there were a single gene whose expression level was perfectly correlated with the class distinction in the available data.

  27. We still might like a predictor that includes more than one gene, in order to provide robustness against noise and to allow estimation of prediction accuracy (described later). • Additional genes may be active in different biological pathways or may provide independent estimates of activity along the same pathway; in either case they can add information that can be combined to improve prediction. • On the other hand…

  28. On the other hand, this argument cannot be extended indefinitely. • If there are 1000 genes that are significantly correlated with a class distinction, it is unlikely that they all represent different biological mechanisms. • Their expression patterns are probably dependent, so the thousandth gene would be unlikely to add information not already provided by the previous 999, but it would add noise to the system.

  29. In general, then, there is a tradeoff between • the amount of additional information and robustness gained by adding more genes, • and the amount of noise added.

  30. Therefore, the optimal size of the prediction set is likely to vary somewhat with the genetics of the class being predicted (i.e., whether there are many independent classes of genes correlated with the class distinction, or just a single co-regulated pathway; whether many genes or few).

  31. Choosing k1 and k2 • They evaluated two methods for choosing k1 and k2: 1) The first method chooses all genes above the 1% significance level in the neighborhood analysis, but sets a maximum of 50 genes in each direction to avoid being dominated by noise. 2) The second method tries many different values for |S|, with the constraint that k1 and k2 are roughly equal. • Preliminary results comparing the two methods indicate that they provide roughly equivalent predictive ability.

  32. Questions? • How do we use the chosen genes to predict? • How to evaluate the method’s success?

  33. After the selection of S, the next step is to try predicting new samples. • They assumed that they have a set of samples called the training set whose correct classifications are already known, • and a test set of additional samples whose classes are unknown, at least to the algorithm.

  34. Prediction by weighted voting • To determine the classification of a new sample in the test set, they used a simple weighted voting scheme. • Each gene in S gets to cast its vote for exactly one class. • The gene’s vote on a new sample x is weighted by how closely its expression in the training set correlates with c. • The vote is the product of this weight and a measure of how informative the gene appears to be for predicting the new sample.

  35. Intuitively, we’d expect the gene’s expression in x to look like that of either a typical class-1 sample or a typical class-2 sample in the training set. • So, they compared the expression in the new sample to the class means in the training set.

  36. They defined a "decision boundary" b, the midpoint between the two class means. • The vote corresponds to the distance between b and the gene's expression in the new sample. • (See figure.)

  37. So, each gene casts a weighted vote V = weight(g) · distance(x, b). • They used P(g,c) as weight(g). • The weights are defined so that positive votes count for class 1 and negative votes for class 2.

  38. Explanation of Weighted Voting • Consider a gene represented by gene expression vector g. • Let x be the raw expression level of that gene in a new sample whose class we want to predict. • Let x~ = log10 x, and let g~ = (log10(g1), …, log10(gn)). • Let μ and σ represent the mean and standard deviation of g~. • Normalized: g~norm = ((g~1 − μ)/σ, …, (g~n − μ)/σ) and x~norm = (x~ − μ)/σ. • (Note that x~norm is normalized by the mean and standard deviation over the training set only.)

  39. Class mean for g~norm: μ1 = (Σi∈Class1 g~norm,i) / |Class 1|, and μ2 similarly. • b = (μ1 + μ2)/2. • V = P(g,c)(x~norm − b). • Positive votes are counted as votes for the new sample's membership in class 1, negative votes for class 2.

  40. To see why this works: • Recall, P(g,c) = (μ1 − μ2)/(σ1 + σ2). • So, P(g,c) is positive if μ1 > μ2 and negative if μ1 < μ2. • Suppose that μ1 > μ2. • If x looks like a typical sample in class 1, x~norm > b, so x~norm − b > 0 and the weighted vote will be positive. • If x looks like a typical sample in class 2, x~norm < b, so x~norm − b < 0 and the weighted vote will be negative. • A similar argument holds if μ1 < μ2: the sign of (x~norm − b) is reversed but cancels with the negative sign of P(g,c), so the weighted votes are still positive for class 1 and negative for class 2.
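Slides 38-40 taken together can be sketched as a single function. This is an illustration under the definitions above, not the authors' code; names are ours, and inputs are assumed to already be log-transformed:

```python
from statistics import mean, pstdev

def gene_vote(train_expr, labels, x):
    """Weighted vote of one gene on a new sample.

    train_expr : log expression of the gene in the training samples
    labels     : 1 for class 1, 0 for class 2
    x          : log expression of the gene in the new sample
    Positive votes favor class 1, negative votes favor class 2.
    """
    # Normalize using the training-set mean and SD only.
    mu, sigma = mean(train_expr), pstdev(train_expr)
    norm = [(e - mu) / sigma for e in train_expr]
    x_norm = (x - mu) / sigma

    c1 = [e for e, l in zip(norm, labels) if l == 1]
    c2 = [e for e, l in zip(norm, labels) if l == 0]
    weight = (mean(c1) - mean(c2)) / (pstdev(c1) + pstdev(c2))  # P(g,c)
    b = (mean(c1) + mean(c2)) / 2   # decision boundary: midpoint of class means
    return weight * (x_norm - b)
```

Because the normalization is the same affine transform for every sample, P(g,c) computed on the normalized values equals the score computed on the raw log values, so the sign argument on this slide goes through unchanged.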

  41. The votes for all genes in S are combined: V1 is the sum of all positive votes, and V2 is the magnitude of the sum of all negative votes. • The winner, and the direction of prediction, is simply the class receiving the larger total vote.

  42. Intuitively, if one class receives most of the votes and the other class has only a token representation, it is reasonable to predict with the majority. However, if the margin of victory is slight, a prediction for the majority class seems somewhat arbitrary and can be made only with low confidence. • They therefore defined the "prediction strength" (PS) to measure the margin of victory.

  43. Prediction Strength (PS) • PS = (Vwinner − Vloser)/(Vwinner + Vloser). Since Vwinner is always greater than Vloser, PS varies between 0 and 1. Note: Typical prediction strengths for incorrect predictions tend to be much lower than those for correct predictions.
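Combining the per-gene votes (slide 41) with the PS formula above can be sketched as follows (illustrative; the 0.3 default threshold is the value the paper uses, but the function name and shape are ours):

```python
def predict(votes, threshold=0.3):
    """Combine signed per-gene votes into a class call plus prediction strength.

    votes : per-gene votes (positive -> class 1, negative -> class 2);
            assumed not all zero.
    Returns (predicted class, PS); the class is None (a "no call")
    when PS falls below the threshold.
    """
    v1 = sum(v for v in votes if v > 0)    # V1: total vote for class 1
    v2 = -sum(v for v in votes if v < 0)   # V2: total vote for class 2
    winner, loser = max(v1, v2), min(v1, v2)
    ps = (winner - loser) / (winner + loser)
    label = 1 if v1 > v2 else 2
    return (label if ps >= threshold else None), ps
```

A lopsided vote yields PS near 1 and a confident call; a near-tie yields PS near 0 and, under the threshold, no call at all.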

  44. The tradeoff between reliability and usefulness • Thus, the PS measure provides a quantitative way of defining a tradeoff between a "reliable" predictor (one that is almost always correct but sometimes refuses to predict) • and a "useful" predictor (one that makes a prediction in every case, but may sometimes be incorrect).

  45. When to use reliable predictor or useful predictor • Useful predictor: If it is essential to make a prediction every time, one can always predict in favor of the winning class, however small the margin of victory.

  46. Reliable predictor • On the other hand, if the cost of an incorrect prediction is high, one can choose a PS threshold below which predictions are not made. For example, when the predictions are used to diagnose or direct treatment of cancer patients, an incorrect prediction could have potentially devastating results. In this case, they chose a conservatively high PS threshold (0.3) to minimize the chance of making an incorrect diagnosis (and potentially treating a patient with the wrong chemotherapy regimen).

  47. Evaluating the method • The preceding discussion suggests two criteria for evaluating a predictor: • The error rate, or the percentage of incorrect predictions out of the total number of predictions made. • The "no-call" rate, or the percentage of samples for which no prediction was made (due to a PS below the threshold).
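These two criteria are straightforward to compute once no-calls are marked; a minimal sketch (names ours, with None standing in for a no-call):

```python
def evaluate(predictions, truths):
    """Error rate among calls made, and no-call rate over all samples.

    predictions : predicted class labels, with None where PS fell
                  below the threshold (a "no call")
    truths      : the true class labels
    """
    called = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    errors = sum(1 for p, t in called if p != t)
    error_rate = errors / len(called) if called else 0.0
    no_call_rate = (len(predictions) - len(called)) / len(predictions)
    return error_rate, no_call_rate
```

Note the denominators differ: the error rate is taken over predictions actually made, while the no-call rate is taken over all samples, which is what makes the reliability/usefulness tradeoff visible.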
