## Practical Issues in Microarray Data Analysis

Mark Reimers, National Cancer Institute, Bethesda, Maryland

**Overview**
• Scales for analysis
• Systematic errors
• Sample outliers & experimental consistency
• Useful graphics
• Implications for experimental design
• Platform consistency
• Individual differences

**Distribution of Signals**
• Most genes are expressed at very low levels
• Even after log-transform the distribution is skewed
• NB: The signal-to-abundance ratio is NOT the same for different genes on the chip

**Explanation of Distribution Shape**
• Left-hand steep bell curve probably due to measurement noise
• Underlying real distribution probably even steeper
[Figure: abundances + noise = observed values]

**Variation Between Chips**
• Technical variation: differences between measures of transcript abundance in same samples
  • Causes:
    • Sample preparation
    • Slide
    • Hybridization
    • Measurement
• Individual variation: variation between samples or individuals
  • Healthy individuals really do have consistently different levels of gene expression!

**Replicates in True Scale**
• Signals vary more between replicates at high end
• Level of ‘noise’ increases with signal
[Figure: comparison of chips (Affy), chip 1 vs. chip 2; SD as a function of mean signal across all chips. Red line is lowess fit]

**Replicates on Log Scale**
• Measures fold-change identically across genes
• Noise at lower end is higher in log transform
[Figure: chip 1 vs. chip 2 after log transform; SD vs. signal after log transform]

**Ratio-Intensity (R-I) plots**
• Log scale makes it convenient to represent fold-changes up or down symmetrically
• $R = \log(\mathrm{Red}/\mathrm{Green})$; $I = \tfrac{1}{2}\log(\mathrm{Red} \cdot \mathrm{Green})$
• a.k.a. MA (minus, add) plots
[Figure axes: (log) Ratio vs. (log) Intensity]

**Variance Stabilization**
• Simple power transforms (Box-Cox) often nearly stabilize variance
• Durbin and Huber derived a variance-stabilizing transform from a theoretical model:
  • $y = a + m e^{\eta} + \varepsilon$, where $a$ is the background, $e^{\eta}$ the multiplicative error, and $\varepsilon$ the static error
  • $m$ is the true signal; $\eta$ and $\varepsilon$ have $N(0, \sigma)$ distributions, with $\sigma_\eta$ and $\sigma_\varepsilon$ respectively
• Transform: [formula not preserved in the source]
  • Could estimate $a$ (background) and $\sigma_\eta/\sigma_\varepsilon$ empirically
  • In practice the best effect on variance often comes from parameters different from the empirical estimates
  • Huber's is harder to estimate

**Box-Cox Transforms**
• Simple power transformations (including log as extreme case), e.g. cube root
• Often work almost as well as variance-stabilizing transform

**Should you use Transforms?**
• Transforms change the list of genes that are differentially regulated
• The common argument is that bright genes have higher variability
  • However, you aren't comparing different genes
• Log transform expands the variability of repressed genes
• Strong transforms (e.g. log) are most suitable for situations where large fold-changes occur (e.g. cancers)
• Weak transforms are more suited to situations where small changes are of interest (e.g. neurobiology)

**Graphical methods**
• Aims:
  • Exploratory analysis, to see natural groupings, and to detect outliers
  • To identify combinations of features that usefully characterize samples or genes
  • Not really suitable for quantitative measures of confidence
• Principal Components Analysis (PCA)
  • Standard procedure of finding combinations with greatest variance
• Multi-dimensional scaling (MDS)
  • Represent distances between samples as a two- or three-dimensional distance
  • Easy to visualize

**Representing Groups**
[Figure: Day 1 chips shown as a cluster diagram and by multi-dimensional scaling]

**Different Metrics – Same Scale**
• 8 tumor; 2 normal tissue samples
• Distances are similar in each tree
• Normals close
• Tree topologies appear different
• Take with a grain of salt!

**Volcano Plot**
• Displays both biological importance and statistical significance
[Figure axes: log2(p-value) or t-score vs. log2(fold change)]

**Quantile Plot**
• Plot sample t-scores against t-scores under random hypothesis
• Statistically significant genes stand out
[Figure axes: sample t-scores vs. corresponding quantiles of t-distribution]

**Systematic Variation**
• Intensity-dependent dye bias due to ‘quenching’
• Stringency (specificity) of hybridization due to ionic strength of hybridization solution
• How far hybridization reaction progresses due to variation in mixing efficiency
• Spatial variation in all of the above

**Relevance for Experimental Designs**
• Balanced designs with several replicates built in have smaller standard errors than reference design with same number of chips (Kerr & Churchill)
  • Assuming error is random!
• In practice very hard to deal with systematic errors in a symmetric design
  • No two slides with comparable fold-changes
[Figure: design diagram with Samples 1 to 5]

**Critique of Optimal Designs**
• Optimal for reduction of variance, if
  • All chips are good quality
  • No systematic errors, only random noise
• In fact systematic error is almost as great as random noise in many microarray experiments
• With loop designs single chip failures cause more loss of information than with reference designs

**Individual Variation**
• Numerous genes show high levels of inter-individual variation
• Level of variation depends on tissue also
• Donors or experimental animals may be infected, or under social stress
• Tissues are hypoxic or ischemic for variable times before freezing

**Frequent False Positives**
• Immunoglobulins and stress response proteins are often 5-10X higher than typical in one or two samples
• Permutation p-values will be insignificant, even if t-score appears large
[Figure: frequency of gene levels in Group 1 and Group 2]
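
The permutation p-value mentioned under Frequent False Positives can be sketched as follows. This is a minimal illustration, not the presenter's procedure: the log2 expression values are made up (two samples in one group carry an elevated signal), the function names `t_score` and `permutation_p_value` are hypothetical, and the absolute Welch t-score is assumed as the permutation statistic.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def t_score(x, y):
    """Welch t-score for two independent groups."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def permutation_p_value(x, y):
    """Exact two-sided permutation p-value based on |t|.

    Every way of relabelling the pooled samples into groups of the
    original sizes is enumerated; the p-value is the fraction of
    relabellings whose |t| is at least as large as the observed one.
    """
    pooled = list(x) + list(y)
    n = len(pooled)
    observed = abs(t_score(x, y))
    hits = 0
    total = 0
    for idx in combinations(range(n), len(x)):
        chosen = set(idx)
        gx = [pooled[i] for i in range(n) if i in chosen]
        gy = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        # Small tolerance so the observed labelling always counts itself.
        if abs(t_score(gx, gy)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical log2 expression levels for one gene (illustrative only).
group1 = [7.1, 6.9, 7.3, 7.0, 7.2]
group2 = [7.2, 7.0, 9.8, 10.1, 7.1]  # two samples with elevated signal

t = t_score(group1, group2)
p = permutation_p_value(group1, group2)
print(f"t-score = {t:.2f}, permutation p-value = {p:.3f}")
```

Enumerating every relabelling is practical only for small groups (here 252 splits of 10 samples); with more samples, a random subset of permutations is the usual substitute.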