
David Elashoff UCLA Department of Biostatistics

Impact of the Choice of Expression Metric on the Standard Statistical Analysis of Oligonucleotide Microarray Data. David Elashoff UCLA Department of Biostatistics. Outline. Introduction to Affymetrix Microarrays Data Preprocessing Methods Within metric comparisons


Presentation Transcript


  1. Impact of the Choice of Expression Metric on the Standard Statistical Analysis of Oligonucleotide Microarray Data David Elashoff UCLA Department of Biostatistics

  2. Outline • Introduction to Affymetrix Microarrays • Data Preprocessing Methods • Within metric comparisons • Between metric comparisons • Results of a standard statistical analysis

  3. DNA Microarrays in Publications • 1080 papers on the analysis of microarray data. (1997-2005) • 8052 papers specific to gene expression microarrays. (1995-2005)

  4. Data Preprocessing • Five major techniques (MAS 4.0, MAS 5.0, dChip (PM-only, Diff), RMA) + a number of newer techniques (SUM, PDNN, GCRMA, others) • Currently there is no agreement in the scientific community as to which method should be used.

  5. Expression Metrics

  6. Our project • How much does the choice of method impact the results of the data analysis? • Methods: 14 Human 133A data sets that are two group comparisons. • For each data set compute the expression indices using each of the 5 data preprocessing techniques. • With each of these 70 data sets perform a standard statistical analysis and compare the results. (~30 million values)

  7. Data Sets

  8. Within Metric Comparison • The first step is to examine how the different methods perform. • We compute for each gene in each data set using each method: 1. Two sample t-statistic 2. Fold-Change 3. Overall Mean 4. Sp: the within group variance estimator.
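
The four per-gene quantities listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `gene_stats` is hypothetical, the expression values are assumed to be on a log scale (so fold change is taken as a difference of group means), and Sp is returned as the pooled within-group standard deviation (its square is the pooled variance).

```python
import math
from statistics import mean, variance


def gene_stats(group_a, group_b):
    """Summary statistics for one gene in a two-group comparison.

    group_a, group_b: expression values for this gene in each group.
    Assumption: values are on a log scale, so the fold change is
    reported as a difference of group means (a log fold change).
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)

    # Pooled within-group variance (equal-variance two-sample setting).
    sp2 = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    sp = math.sqrt(sp2)

    # Two-sample t-statistic; undefined (division by zero) if sp == 0.
    t = (mean_a - mean_b) / (sp * math.sqrt(1 / n_a + 1 / n_b))

    return {
        "t": t,
        "log_fold_change": mean_a - mean_b,
        "overall_mean": mean(list(group_a) + list(group_b)),
        "sp": sp,
    }
```

For metrics reported on the raw intensity scale, a ratio of group means would be the more natural fold-change definition; the slides do not say which was used.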

  9. Within Metric Comparisons

  10. Within Metric Comparisons

  11. Within Metric Comparisons

  12. Scatter plots between rank percentage of mean expression of all genes in the two groups

  13. Within Metric Comparison Conclusion • RMA appears to have a number of properties that make it a better estimator. 1. Uncorrelated mean and test statistic. 2. Uncorrelated mean and standard deviation. 3. Less difference in ranks overall between groups.

  14. Plots and Spearman’s correlations of Mean expression measure

  15. Hierarchical tree of expression values based on their correlations; a) tree shape is for all data sets

  16. Standard Statistical Analysis • Two main components • Gene Filtering • Clustering/Classification • Gene Filtering uses combinations of comparison statistics to identify a small number of differentially expressed genes • Clustering/Classification uses combinations of genes to develop prediction models.

  17. Gene Filtering • Wide variety of criteria • Statistical Tests (t-test / ANOVA / Regression / Survival) • Fold Change(FC), • Confidence interval for the fold change • Absent/Present Call • Absolute Difference • Significance Analysis of Microarrays (SAM) • Much literature on controlling false positive rates or false discovery rates (FDR).

  18. Correlation of T-statistics

  19. Hierarchical tree of t-statistics based on their correlations; a) tree shape is for five data sets b) and c) in one set each

  20. Assessing Agreement between methods • For each data set and each method and each test statistic (t-statistic and fold change) we find the subset with the 200 largest t-statistics or FC values.
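
One way to quantify the agreement described here is the percentage overlap between two top-n gene lists. The sketch below is an assumption about how that could be computed: the helper names `top_genes` and `percent_agreement` are hypothetical, and ranking by absolute value (so that strongly down-regulated genes also count as "largest") is a choice the slides do not confirm.

```python
def top_genes(scores, n=200):
    """Return the set of gene ids with the n largest absolute scores.

    scores: dict mapping gene id -> test statistic (t or fold change).
    Assumption: "largest" is interpreted as largest in absolute value.
    """
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return set(ranked[:n])


def percent_agreement(scores_a, scores_b, n=200):
    """Percentage of the top-n genes shared by two expression metrics."""
    shared = top_genes(scores_a, n) & top_genes(scores_b, n)
    return 100.0 * len(shared) / n
```

Applying `percent_agreement` to every pair of the five expression metrics within a data set would fill a 5 x 5 agreement matrix of the kind the later slides summarize.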

  21. Average % (Min~Max) of significant genes by the cut-off value used for testing

  22. Agreement of Gene Lists The matrix of the % of average agreement on the most significant 200 genes identified by t-statistics of each expression measure

  23. Results of Test Statistic Comparison

  24. Average agreement over 5 expression metrics on top 1000 (4.5%) significant genes – Each cell gives the number (and %) of genes on which the number of expression metrics shown in the column header agree, when the statistic in the row is used.

  25. Plots of rank percentage of mean expression of 200 significant gene sets in the two groups

  26. Gene Filtering Conclusions • This is a nightmare in terms of reproducibility. • Overall the methods are not identifying genes in different regions of the expression spectrum. • We know that all methods produce gene lists that can be confirmed via RT-PCR.* *Rosati, B., Frau, F., Kuehler, A., Rodriguez, S. and McKinnon, D. (2004) Comparison of different probe-level analysis techniques for oligonucleotide microarrays. BioTechniques Vol. 36, 2:316-322

  27. SAM • Can we compare the “quality” of the results between methods? • SAM is based on the permutation test • Using a variable cut-off it computes the FDR for varying numbers of “significant genes” • Does not function well on all data sets. • We used 40 data sets, 25 giving results and 18 with sufficient sample size to work well.

  28. SAM: 25 data sets

  29. SAM: 18 data sets

  30. Clustering/Classification • Currently there is a huge literature on the application of every multivariate statistical method to the analysis of microarray data. • The techniques fall into two philosophical categories, unsupervised and supervised learning. • Typically we want to determine whether the microarray data can produce a classifier that correctly predicts the true classes. • There is no clear agreement on how many genes should be used for these methods

  31. Unsupervised Learning (Clustering) • In each of the five expression indices for each of the seven data sets, samples are partitioned into two groups using the K-means clustering (we set K=2). • For the K-means clustering we use various subsets of the 22283 genes corresponding to typical gene filtering criteria. • 1) the subset of genes that are present in at least one sample (typically 5000-15000 genes) • 2) the subset of 5000 with the largest coefficient of variation (CV) • 3) the subset with the top 1000 CVs. • The Rand index is used to measure the level of agreement between predicted group assignment and the true group information.
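
The filtering and scoring steps around the clustering can be sketched as below. This is only an illustration under stated assumptions: `top_cv_genes` and `rand_index` are hypothetical helper names, expression values are assumed positive (so the coefficient of variation is well defined), and the K-means step itself is left to any standard implementation.

```python
from itertools import combinations
from statistics import mean, stdev


def top_cv_genes(expr, n):
    """Return the n gene ids with the largest coefficient of variation.

    expr: dict mapping gene id -> list of expression values across samples.
    Genes with a non-positive mean are skipped (CV is not meaningful there).
    """
    cv = {g: stdev(v) / mean(v) for g, v in expr.items() if mean(v) > 0}
    return sorted(cv, key=cv.get, reverse=True)[:n]


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which two partitions agree.

    A pair agrees if it is grouped together in both partitions or
    separated in both. The index is invariant to relabeling clusters.
    """
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Because the Rand index compares pairs rather than label names, a K-means result that swaps cluster "0" and cluster "1" relative to the true groups still scores 1.0.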

  32. Results of Clustering Comparison

  33. Supervised Learning (Classification) • We use a standard classification method, k-nearest neighbor (KNN) classification, to assess the ability of the methods to produce gene expression information that can accurately classify the samples from each data set. • 1. Exclude one sample to be used as a test case. Next, we find the top x genes based on t-statistics computed from the remaining samples. • 2. These genes form a new x-dimensional space. Within that space, we compute the Euclidean distance between the left-out sample and all other samples. • 3. The KNN classification rule assigns the left-out sample to the class represented by a majority of its k (k=3) nearest neighbors. • 4. This process is then iterated for each of the samples in the data set.
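
The four steps above can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, genes are ranked by the absolute pooled-variance t-statistic (an assumption), and each class is assumed to keep at least two training samples with non-zero variance in every fold.

```python
import math
from collections import Counter
from statistics import mean, variance


def t_stat(a, b):
    """Pooled-variance two-sample t-statistic for one gene."""
    sp2 = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))


def loocv_knn_accuracy(samples, labels, n_genes, k=3):
    """Leave-one-out accuracy (%) of KNN with gene selection inside each fold.

    samples: list of per-sample expression vectors (one value per gene).
    labels:  class label per sample; exactly two classes are assumed.
    """
    classes = sorted(set(labels))
    correct = 0
    for held in range(len(samples)):
        train = [i for i in range(len(samples)) if i != held]

        # Step 1: rank genes by |t| using the training samples only.
        def score(g):
            a = [samples[i][g] for i in train if labels[i] == classes[0]]
            b = [samples[i][g] for i in train if labels[i] == classes[1]]
            return abs(t_stat(a, b))

        genes = sorted(range(len(samples[0])), key=score, reverse=True)[:n_genes]

        # Step 2: Euclidean distance to the held-out sample in the gene subspace.
        def dist(i):
            return math.sqrt(
                sum((samples[i][g] - samples[held][g]) ** 2 for g in genes)
            )

        # Step 3: majority vote among the k nearest training samples.
        nearest = sorted(train, key=dist)[:k]
        vote = Counter(labels[i] for i in nearest).most_common(1)[0][0]
        correct += vote == labels[held]

    # Step 4: the loop has visited every sample once.
    return 100.0 * correct / len(samples)
```

Selecting genes inside the loop, without the held-out sample, matters: choosing them once on the full data set would leak information about the test case into the classifier and inflate the accuracy estimate.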

  34. Leave-One-Out Cross-Validation in KNN (k=3) –Accuracy rate (%) with n-1 genes

  35. Leave-One-Out Cross-Validation in KNN (k=3) –Accuracy rate (%) with top 1000 genes

  36. Clustering Conclusions • No method gives consistently better results. • The number of genes used does not seem to matter. • There is an apparent advantage for MAS5, although not enough to draw any real conclusions. • Interesting that each method appears to be producing information that can appropriately group the samples.

  37. Final Conclusions • There is a large difference in the gene filtering results from each method. • We have no reason to think that one method is giving results that are more biologically relevant than another. • What do we do now?

  38. Acknowledgements • Myungshin Oh • Fiona O’Kirwan • Nik Brown • Steve Horvath
