Practical Issues in Microarray Data Analysis
Presentation Transcript

  1. Practical Issues in Microarray Data Analysis • Mark Reimers, National Cancer Institute, Bethesda, Maryland

  2. Overview • Scales for analysis • Systematic errors • Sample outliers & experimental consistency • Useful graphics • Implications for experimental design • Platform consistency • Individual differences

  3. Distribution of Signals • Most genes are expressed at very low levels • Even after log-transform the distribution is skewed • NB: the signal-to-abundance ratio is NOT the same for different genes on the chip

  4. Explanation of Distribution Shape • The steep bell curve on the left-hand side is probably due to measurement noise • The underlying real distribution is probably even steeper • [Figure: abundances + noise = observed values]

  5. Variation Between Chips • Technical variation: differences between measures of transcript abundance in the same samples • Causes: sample preparation, slide, hybridization, measurement • Individual variation: variation between samples or individuals • Healthy individuals really do have consistently different levels of gene expression!

  6. Replicates in True Scale • Signals vary more between replicates at the high end • Level of ‘noise’ increases with signal • [Figures: comparison of chips (Affy), chip 1 vs chip 2; SD as a function of mean signal across all chips, red line is lowess fit]

  7. Replicates on Log Scale • Measures fold-change identically across genes • Noise at the lower end is higher after log transform • [Figures: chip 1 vs chip 2 after log transform; SD vs signal after log transform]

  8. Ratio-Intensity (R-I) plots • Log scale makes it convenient to represent fold-changes up or down symmetrically • R = log(Red/Green); I = (1/2)log(Red*Green) • Also known as MA (minus, add) plots • [Figure axes: (log) Ratio vs (log) Intensity]
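
The R and I definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not state a log base, so base 2 is an assumption here, and the function name `ratio_intensity` and the sample intensities are made up for the example.

```python
import math

def ratio_intensity(red, green):
    """Return (R, I) lists for paired two-colour intensities.

    R = log2(red / green), I = 0.5 * log2(red * green).
    Base 2 is an assumption; the slide only says "log".
    """
    r_vals, i_vals = [], []
    for r, g in zip(red, green):
        r_vals.append(math.log2(r / g))
        i_vals.append(0.5 * math.log2(r * g))
    return r_vals, i_vals

# Hypothetical spot intensities: 2-fold up, 2-fold down, unchanged.
red = [200.0, 50.0, 800.0]
green = [100.0, 100.0, 800.0]
R, I = ratio_intensity(red, green)
```

A two-fold increase and a two-fold decrease land at equal distances above and below zero on the R axis, which is the symmetry the slide refers to.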

  9. Variance Stabilization • Simple power transforms (Box-Cox) often nearly stabilize variance • Durbin and Huber derived variance-stabilizing transforms from a theoretical model: y = a + m·e^η + ε, where a is background, e^η is multiplicative error, and ε is additive (static) error • m is the true signal; η and ε have N(0, σ) distributions, with standard deviations σ_η and σ_ε • Transform: [formula missing from transcript] • Could estimate a (background) and σ_η/σ_ε empirically • In practice the best effect on variance often comes from parameters different from the empirical estimates • Huber’s is harder to estimate
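
The transform formula for this slide is not in the transcript. As a hedged sketch, one commonly cited variance-stabilizing form for this kind of additive-plus-multiplicative error model is the generalized log, h(y) = ln((y - a) + sqrt((y - a)^2 + c)). Whether this is exactly what the slide showed is an assumption. The function name `glog` and the tuning constant `c` (which in the usual derivation depends on the two error variances) are illustrative choices.

```python
import math

def glog(y, a=0.0, c=1.0):
    """Generalized-log transform (assumed form, not taken from the slide).

    a: background offset (assumption: estimated elsewhere)
    c: positive constant controlling where the transform switches
       from roughly linear (low signal) to roughly logarithmic (high signal)
    """
    z = y - a
    return math.log(z + math.sqrt(z * z + c))

# Behaves like log(2*z) for large signals, and stays defined when the
# background-corrected value is zero or negative.
high = glog(1000.0)
zero = glog(0.0, c=4.0)
negative = glog(-5.0)
```

Unlike a plain log, this form does not blow up near zero, which is where the slide on log-scale replicates notes the extra noise.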

  10. Box-Cox Transforms • Simple power transformations (including log as the limiting case), e.g. cube root • Often work almost as well as a variance-stabilizing transform
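
A minimal sketch of the Box-Cox family, using the standard definition (x^λ - 1)/λ with the natural log at λ = 0. The function name `box_cox` and the example values are illustrative, and the choice of λ for any real data set is left open, as on the slide.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform of a positive value x.

    lam = 0 gives the natural log; lam = 1/3 is a shifted and scaled
    cube root; lam = 1 is a shifted identity.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

cube_root_like = box_cox(8.0, 1.0 / 3.0)   # (2 - 1) / (1/3)
log_case = box_cox(math.e, 0)
near_log = box_cox(5.0, 1e-8)              # approaches log(5) as lam -> 0
```

The last value shows why the log counts as a member of the family: as λ shrinks toward zero the power transform converges to it.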

  11. Should you use Transforms? • Transforms change the list of genes that are differentially regulated • The common argument is that bright genes have higher variability • However, you aren’t comparing different genes • Log transform expands the variability of repressed genes • Strong transforms (e.g. log) are most suitable for situations where large fold-changes occur (e.g. cancers) • Weak transforms are more suited to situations where small changes are of interest (e.g. neurobiology)

  12. Graphical methods • Aims: exploratory analysis, to see natural groupings and to detect outliers; to identify combinations of features that usefully characterize samples or genes • Not really suitable for quantitative measures of confidence • Principal Components Analysis (PCA): standard procedure of finding combinations with greatest variance • Multi-dimensional scaling (MDS): represents distances between samples as distances in a two- or three-dimensional plot; easy to visualize
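
To make "combinations with greatest variance" concrete, here is a small standard-library sketch that finds the first principal component by power iteration. It is an illustration, not the procedure used for the slides: the function name, the fixed iteration count, the starting vector, and the four-sample data set are all assumptions. For real arrays a linear-algebra library would be the usual choice.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: one list of expression values per sample.

    Returns (direction, scores): a unit-length weight vector over genes
    and each sample's projection onto it. Uses power iteration on the
    centered data; it can fail to converge if the starting vector happens
    to be orthogonal to the leading component.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]

    def project(v):
        return [sum(c[j] * v[j] for j in range(p)) for c in centered]

    start = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(iterations):
        scores = project(v)
        # Multiply by X^T X without forming the covariance matrix.
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v, project(v)

# Hypothetical data: two low-expressing and two high-expressing samples.
rows = [[1.0, 2.0, 0.0], [1.1, 2.1, 0.0], [5.0, 6.0, 0.1], [5.1, 6.2, 0.0]]
direction, scores = first_principal_component(rows)
```

Plotting the scores would place the two groups on opposite sides of zero, which is the kind of natural grouping the slide describes looking for.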

  13. MDS Plots

  14. Representing Groups • [Figures: Day 1 chips shown as a cluster diagram and as a multi-dimensional scaling plot]

  15. Different Metrics – Same Scale • 8 tumor; 2 normal tissue samples • Distances are similar in each tree • Normals close • Tree topologies appear different • Take with a grain of salt!

  16. Volcano Plot • Displays both biological importance and statistical significance • [Figure axes: log2(p-value) or t-score vs log2(fold change)]
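
The two coordinates of a volcano plot can be computed per gene as sketched below. This assumes the inputs are already log2-scale expression values, so the difference of group means is the log2 fold change, and it uses the Welch (unequal-variance) t-score for the vertical axis. Both choices, the function name `volcano_point`, and the example values are assumptions for illustration.

```python
import math
from statistics import mean, variance

def volcano_point(group1, group2):
    """Return (log2 fold change, Welch t-score) for one gene.

    Inputs are log2-scale values (assumption). Each group needs at least
    two values, and the groups must not both have zero variance.
    """
    lfc = mean(group2) - mean(group1)
    se = math.sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return lfc, lfc / se

# Hypothetical gene: about a 4-fold increase (2 units on the log2 scale).
lfc, t_score = volcano_point([8.0, 8.2, 7.8], [10.0, 10.2, 9.8])
```

Genes of interest are those far from zero on both axes: a large fold change that is also large relative to its standard error.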

  17. Quantile Plot • Plot sample t-scores against t-scores expected under the random (null) hypothesis • Statistically significant genes stand out • [Figure axes: sample t-scores vs corresponding quantiles of the t-distribution]

  18. Systematic Variation • Intensity-dependent dye bias due to ‘quenching’ • Stringency (specificity) of hybridization due to ionic strength of hyb solution • How far hybridization reaction progresses due to variation in mixing efficiency • Spatial variation in all of the above

  19. Relevance for Experimental Designs • Balanced designs with several replicates built in have smaller standard errors than a reference design with the same number of chips (Kerr & Churchill) • Assuming error is random! • In practice it is very hard to deal with systematic errors in a symmetric design • No two slides with comparable fold-changes • [Figure: design diagram with Samples 1 to 5]

  20. Critique of Optimal Designs • Optimal for reduction of variance if all chips are good quality and there are no systematic errors, only random noise • In fact systematic error is almost as great as random noise in many microarray experiments • With loop designs, single chip failures cause more loss of information than with reference designs

  21. Individual Variation • Numerous genes show high levels of inter-individual variation • Level of variation also depends on tissue • Donors or experimental animals may be infected or under social stress • Tissues are hypoxic or ischemic for variable times before freezing

  22. Frequent False Positives • Immunoglobulins and stress response proteins are often 5-10X higher than typical in one or two samples • Permutation p-values will be insignificant, even if the t-score appears large • [Figure: frequency of gene levels in Group 1 and Group 2]
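
A permutation p-value of the kind mentioned on this slide can be sketched as an exact test over all relabelings of the samples. The slide does not say which statistic was permuted; the absolute difference of group means is an assumption here, as are the function name and the toy data. Enumerating every split is only practical for small groups.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group1, group2):
    """Exact two-sided permutation p-value for a difference of means.

    Counts the fraction of ways to split the pooled values into groups
    of the original sizes that give a difference at least as large as
    the observed one. The small tolerance guards against rounding ties.
    """
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = abs(mean(group1) - mean(group2))
    hits = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Cleanly separated groups: only 2 of the 20 splits are as extreme.
p_separated = permutation_p_value([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
# One outlying sample drives the whole difference: every split is as extreme.
p_outlier = permutation_p_value([1.0, 1.0, 1.0], [1.0, 1.0, 9.0])
```

In the second case the apparent group difference comes entirely from one sample, so relabeling the samples reproduces it just as often, and the permutation p-value stays large.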