Statistical analysis of expression data:
560 likes | 824 Views
Statistical analysis of expression data:. Normalization, differential expression and multiple testing Jelle Goeman. Outline. Normalization Expression variation Modeling the log Fold change Complex designs Shrinkage and empirical Bayes (limma) Multiple testing (False Discovery Rate).
Statistical analysis of expression data:
E N D
Presentation Transcript
Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman
Outline • Normalization • Expression variation • Modeling the log Fold change • Complex designs • Shrinkage and empirical Bayes (limma) • Multiple testing (False Discovery Rate)
Platforms • Microarrays • RNAseq • Common: • Need for normalization • Batch effects
Why normalization • Some experimental factors cannot be completely controlled • Amount of material • Amount of degradation • Print tip differences • Quality of hybridization • Effects are systematic • Cause variation between samples and between batches
What is normalization? • Normalization = An attempt to get rid of unwanted systematic variation by statistical means • Note 1: this will never completely succeed • Note 2: this may do more harm than good • Much better, but often impossible Better control of the experimental conditions
How do normalization methods work? • General approach • Assume: data from an ideal experiment would have characteristic A E.g. mean expression is equal for each sample Note: this is an assumption! • If the data do not have characteristic A, change the data such that the data now do have characteristic A E.g. Multiply each sample’s expression by a factor
Example: quantile normalization • Assume: “Most probes are not differentially expressed” “As many probes are up and downregulated” • Reasonable consequence: The distribution of the expression values is identical for each sample • Normalization: Make the distribution of expression values identical for each sample
Quantile normalization in practice • Choose a target distribution • Typically the average of the measured distributions • All samples will get this distribution after normalization • Quantile normalization: • Replace the ith largest expression value in each sample by the ith largest value in the target distribution • Consequence: • Distribution of expressions the same between samples • Expressions for specific genes may differ
Less radical forms of normalization • Make the means per sample the same • Make the medians the same • Make the variances the same • Loess curve smoothing • Same idea, but less change to the data
Overnormalizing • Normalizing can remove or reduce true biological differences • Example: global increase in expression • Normalization can create differences that are not there • Example: almost global increase in expression • Usually: normalization reduces unwanted variation
Batch effects • Differences between batches are even stronger than between samples in the same batch • Note: batch effects at several stages • Normalization is not sufficient to remove batch-effects • Methods available (comBat) but not perfect • Best: avoid batch effects if possible
Confounding by batch • Take care of batch-effects in experimental design • Problem: confounding of effect of interest by batch effects • Example: Golub data • Solution: balance or randomize
Differential expression • Two experimental conditions • Treated versus untreated • Two distinct phenotypes • Tumor versus normal tissue • Which genes can reliably be called differentially expressed? • Also: continuous phenotypes • Which gene expressions are correlated with phenotype?
Variation in gene expression • Technical variation • Variation due to measurement technique • Variability of measured expression from experiment to experiment on the same subject • Biological variation • Variation between subjects/samples • Variability of “true” expression between different subjects • Total variation • Sum of technical and biological variation
Reliable assessment • Two samples always have different expression • Maybe even a high fold change • Due to random biological and technical variation • Reliable assessment of differential expression: • Show: fold change found cannot be explained by random variation
Assessment of differential expression • Two interrelated aspects: • Fold change: • How large is the expression difference found? • P-value: • How sure are we that a true difference exists?
Modeling variation • How does gene expression depend on experimental conditions? • Can often be well modeled with linear models • Limma: • linear models for microarray analysis • Gordon Smyth, W. and E. Hall Institute, Australia
Multiplicative scale effects • Assumption: effects on gene expression work in a multiplicative way (“fold change”) • Example: treatment increases gene expression of gene MMP8 by a factor 2 • “2-fold increase” • Treatment decreases gene expression of gene MMP8 by a factor 2 • “2-fold decrease”
Multiplicative scale errors • Assumption: variation on gene expression works in a multiplicative way • A 2-fold increase by chance is just as likely as a 2-fold decrease by chance • When true expression is 4, measuring 8 is as likely as measuring 2
Working on the log scale • When effects are multiplicative, log-transform! • Usual in microarray analysis: log to base 2 • Remember: log(ab) = log(a)+log(b) • 2 fold increase = +1 to log expression • 2 fold decrease = -1 to log expression • Log scale makes multiplicative effects symmetric • ½ and 2 are not symmetric around 1 (= no change) • -1 and +1 are symmetric around 0 (= no change)
A simple linear model • Example: treated and untreated samples • Model separately for each gene • Log Expression of gene 1: E1 • E1 = a + b * Treatment + error • a: intercept = average untreated logexpression • b: slope = treatment effect
Modeling all genes simultaneously • E1 = a1 + b1 * Treatment + error • E2 = a2 + b2 * Treatment + error • … • E20,000 = a20,000 + b20,000 * Treatment + error • Same model, but • Separate intercept and slope for each gene • And separate sd sigma1, sigma2, … of error
Estimates and standard errors • Gene 1: Estimates for a1, b1 and sigma1 • Estimate of treatment effect of gene 1 • b1 is the estimated log fold change • standard error s.e.(b1) depends on sigma1 • Regular t-test for H0: b1=0: • T = b1/s.e.(b1) • Can be used to calculate p-values. • Just like regular regression, only 20,000 times
Back to original scale • Log scale regression coefficient b1 • Average log fold change • Back to a fold change: 2^b1 • b1= 1 becomes fold change 2 • b1 = -1 becomes fold change 1/2
Confounders • Other effects may influence gene expression • Example: batch effects • Example: sex or age of patients • In a linear model we can adjust for such confounders
Flexibility of the linear model • Earlier: E1 = a1 + b1 * Treatment + error • Generalize: • E1 = a1 + b1 * X + c1 * Y + d1 + Z + error • Add as many variables as you need.
Empirical Bayes • So far: each gene on its own • 20,000 unrelated models • Limma: exchange information between genes • “Borrowing strength” • By empirical Bayes arguments
Estimating variance • For each gene a variance is estimated • Small sample size: variance estimate is unreliable • Too small for some genes • Too large for others • Variance estimated too small: false positives • Variance estimated too large: low power
Large and small estimated variance • Gene with low variance estimate • Likely to have low true variance • But also: likely to have underestimated variance • Gene with high variance estimate • Likely to have high true variance • But also: likely to have overestimated variance • Limma’s idea: • Use information from other genes to assess whether variance is over/underestimated
Variance model • Limma has a gene variance model • All gene’s variances are drawn at random from an inverse gamma distribution • Based on this model: • Large variances are shrunk downwards • Small variances are shrunk upwards
Effect of variance shrinkage • Genes with large fold change and large variance • More power • More likely to be significant • Genes with small fold change and small variance • Less power • Less likely to be significant
Limma and sample size • Shrinkage of limma only effective for small sample size (< 10 samples/group) • Added information of other genes becomes negligeable if sample size gets large • Large samples: Doing limma is the same as doing regression per gene
Modelling count data • Distinguish three types of variation • Biological variation • Technical variation • Count variation • Count variation is important for low-expressed genes • Generally biological variation most important
Overdispersion • Modelling count data: two stages • Model how gene expression varies from sample to sample • Model how the observed count varies by repeated sequencing of the same sample • Stage 2 is specific for RNAseq
Two approaches • Approach 1: Model the count variation and the between-sample variation • edgeR • Deseq • Approach 2: Normalize the count data and model only the biological variation • Voom + limma • Approach 3: Model count variation only • Popular but very wrong!
20,000 p-values • Fitting 20,000 linear models • Some variance shrinkage • Result: • 20,000 fold changes • 20,000 p-values • Which ones are truly differentially expressed?
Multiple testing • Doing 20,000 tests: risk false positive 20,000 times • If 5% of null hypotheses is significant, expect 1,000 significant by pure chance • How to make sure you can really trust the results?
Bonferroni • Classical way of doing multiple testing • Call K the number of tests performed • Bonferroni: significant = p-value < 0.05/K • “Adjusted p-value” • Multiply all p-values by K, compare with 0.05
Advantages of Bonferroni • Familywise error control • =Probability of making any type I error < 0.05 • With 95% chance, list of differentially expressed genes has no errors • Very strict • Easy to do
Disadvantages of Bonferroni • Very strict • “No” false positives • Many false negatives • It is not a big problem to have a few false positives • Do validation experiments later
False discovery rate (Benjamini and Hochberg) • FDR = expected proportion of false discoveries among all discoveries • Control of FDR at 0.05 means in the long run experiments average about 5% type I errors among the reported genes • Percentage: longer lists of genes are allowed to have more errors
Benjamini and Hochberg by hand 1. Order the p-values small to large Example: 0.0031, 0.0034, 0.02, 0.10, 0.65 2. Multiply the k-th p-value by m/k, where m is the number of p-values, so 0.0031 * 5/1, 0.0034 * 5/2, 0.02 * 5/3, 0.10 * 5/4, 0.65 * 5/5 which becomes 0.0155, 0.0085, 0.033, 0.125, 0.65 3. If the p-values are no longer in increasing order, replace each p-value by the smallest p-value that is later in the list. In the example, we replace 0.0155 by 0.0085. The final Benjamini-Hochberg adjusted p-values become 0.0085, 0.0085, 0.033, 0.125, 0.65