Biases in RNA-Seq dataOctober 30, 2013NBIC Advanced RNA-Seq course Prof. dr. Antoine van Kampen Bioinformatics Laboratory Academic Medical Center Biosystems Data Analysis Group Swammerdam Institute for Life Sciences email@example.com
Aim: to provide you with a brief (almost up-to-date) overview of literature about biases in RNA-seq data such that you become aware of this potential problem (and solutions)
What is the problem? • Experimental (and computational) biases affect expression estimates and, therefore, subsequent data analysis: • Differential expression analysis • Study of alternative splicing • Transcript assembly • Gene set enrichment analysis • Other downstream analysis • We must attempt to avoid, detect and correct these biases
Types of bias • Library size • Gene length • Mappability of reads • lower sequence complexity, repeats, ...... • Position • Fragments are preferentially located towards either the beginning or end of transcripts • Sequence-specific • biased likelihood for fragments being selected • %GC
Few words about microarrays Malone and Oliver (2011) BMC Biology, 9:34 • Are not free of bias • It has taken a decade to understand these biases and to provide solutions • Recognition of biases (e.g., by the MicroArray Quality Control (MAQC) consortium) has led to the development of quality control standards • For RNA-Seq it will also take some time to "understand the data". • Comparison of microarrays and RNA-Seq may help to identify bias
Normalization for gene length and library size: RPKM / FPKM
Within one sample transcript 1 (size = L) Count =6 transcript 2 (size=2L) Count = 12 You can’t conclude that gene 2 has a higher expression than gene 1!
Comparison of two samples transcript 1 (sample 1) Count =6, library size = 600 transcript 1 (sample 2) Count =12, library size = 1200 You can’t conclude that gene 1 has a higher expression in sample 2 compared to sample 1!
RPKM: Reads per kilobase per million mapped reads • Unit of measurement • RPKM reflects the molar concentration of a transcript in the starting sample by normalizing for • RNA length • Total read number in the measurement • This facilitates comparison of transcript levels within and between samples Mortazavi et al (2008) Nature Methods, 5(7), 621
Example1 2500kb transcript with 900 alignments in a sample of 10 million reads (out of which 8 million reads can be mapped)
Example1 2500kb transcript with 900 alignments in a sample of 10 million reads(out of which 8 million reads can be mapped)
Example 2 Given a 40M read measurement, how many reads would we expect for a 1 RPKM measurement for a 2kb transcript?
Example 2 Given a 40M read measurement, how many reads would we expect for a 1 RPKM measurement for a 2kb transcript?
FPKM: Fragments per K per M Trapnell et al (2010) Nature Biotechnology, 28(5), 511 What's the difference between FPKM and RPKM? Paired-end RNA-Seq experiments produce two reads per fragment, but that doesn't necessarily mean that both reads will be mappable. For example, the second read is of poor quality. If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value. Thus, FPKM is calculated by counting fragments, not reads.
Other normalization methods Irizarry, et al (2003). Biostatistics (Oxford, England), 4(2), 249–64. Bullard et al (2010) BMC bioinformatics, 11, 94. Robinson & Oshlack (2010). Genome biology, 11(3), R25. Dillies, et al (2013) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing. Briefings in Bioinformatics Spike-ins Housekeeping genes (Bullard et al, 2010) Upper-quartile (Bullard et al, 2010). Counts are divided by (75th) upper-quartile of counts for transcripts with at least one read TMM (Robinson and Oshlack, 2010). Trimmed Mean of M values Quantile normalization (Irizarry et al, 2003; developed for microarrays) Comparison of normalization methods (Dillies, 2013)
Why (not) use spike-ins? • Opinions differ on whether this could be made to work. • In M.D. Robinson and A. Oshlack “A scaling normalisation method for differentialexpression analysis of RNA-seq data”, Genome Biology 11, R25 (2010), • it is claimed that • “In order to use spike-in controls for normalisation, the ratio of the concentration of the spike to the sample must be kept constant throughout the experiment. In practice this is difficult to achieve and small variations will lead to biased estimation of the normalisation factor.”
Quantile Normalization Schematic representation of quantile normalization: the value x, which is the α-thquantile of all probes on chip 1, is mapped to the value y, which is the α quantile of the reference distribution F2. Transform intensities /read counts into one standard distribution shape http://www.people.vcu.edu/~mreimers/OGMDA/normalize.expression.html
Normalization objective Normalization factors/procedures should ensure that a gene with the same expression level in two samples is not detected as differentially expressed (DE).
Thought experiment 150 150 100 100 50 50 Condition A Condition B No differential expression of these genes Suppose • Two RNA populations (samples): A and B • The same 3 genes expressed in both samples • Numbers indicate number of transcripts / cell
Thought experiment 150 150 150 100 100 100 50 50 50 Condition A Condition B Still no differential expression of first three genes However, RNA production in A is twice the size of B Suppose • Two RNA populations (samples): A and B • The same 3 genes expressed in both samples • Numbers indicate number of transcripts / cell • Now condition A has 3 additional genes not in B with equal number and expression
Thought experiment 600 400 300 300 200 200 200 #reads #reads 100 100 Reads: 600 600 1200 A correct normalization factor adjust condition A by a factor of two (no differential expression) Proportion of reads attributed to a gene in a library depends on the expression properties of whole sample If a sample has larger RNA output (S) then RNAseq will under-sample many genes (lower number of reads) Suppose we sequence both samples with the same depth (1200 reads) These reads get ‘distributed’ over the expressed genes
RKPM would fail in this example (assume transcript lengths are the same) In this example: Condition A, first (red) gene: Condition B, first (red) gene: RKPM normalization would result in differential expression while this is not the case! Because we didn’t take into account the total RNA production.
When does RKPM fail? • If samples have largely different RNA production • Many unique genes and/or highly expressed genes • If many genes in one sample have a very high expression compared to the other samples • If RNA sample is contaminated • Reads that represent the contamination will take away reads from the true sample, thus dropping the number of reads of interest. • If you assume that your samples are ‘comparable’ then RKPM is OK • e.g., technical replicates
RPKM is OK for measuring relative abundance of transcripts within one experiment (e.g. one lane of Illumina sequencer): • RNA spikes in Illumina: • 300 and 1500nt (arabidopsis) and 10000nt (-phage) • 104, 105, …, 109 transcripts per 100ng mRNA • (data from Mortazavi et al.)
Let us take RNA production into consideration Transcripts Reads RNA production Condition A = 600 Condition B = 300 Larger RNA production, results in less reads per gene (assuming same sequence depth). Correction (for ‘red’ gene): condition A: 100 reads*600 = 60000 condition B: 200 reads*300 = 60000 Again, don’t consider gene length (assume transcript lengths are of equal size)
Closer look: relative gene expression μ = unknown true expression (transcripts/cell) L = unknown true gene length (bp) Sk = Total RNA production (bp) g = gene k = library In cell only two genes are expressed A: 10 transcripts; L=10 B: 40 transcripts; L=5 • total RNA production = 10*10 + 40*5 = 300 • Relative expression of A is 10*10/300 = 0.33 (not 10/50=0.2) • Relative expression of B = 40*5/300 = 0.66 (not 40/50=0.8)
Expected number of reads Y = read countμ = unknown true expression (transcripts/cell) L = unknown true gene length (bp) Sk = Total RNA production (bp) Nk = library size g = gene k = library • The total RNA production (Sk) cannot be estimated directly • we do not know the expression levels and true lengths of every gene. • Thus, how to correct for RNA production?
Back to our example Transcripts Reads RNA production Condition A = 600 Condition B = 300 (factor 2 difference) Larger RNA production, results in less reads per gene (assuming same sequence depth). The correction factor that we applied is the ratio of RNA production Correction (for ‘red’ gene): condition A: 100 reads*600/300 = 200 condition B: 200 reads
Normalization factor Essentially a global fold change • The total RNA production (Sk) cannot be estimated directly • Relative RNA production of two samples (fk) can more easily be determined. • Empirical strategy • assumption that the majority of them are not differentially expressed. • How to do this? E.g., Trimmed Mean of M-values (TMM)
Yig = read counts for gene g in sample i = 1, 2 Ni = total read counts for sample i = 1, 2 Ratio Average expression
Example. Accounting for total number of reads Technical replicates mean log ratio ~0 Data from Marioni, 2008
Example. Accounting for total number of reads Technical replicates housekeepinggenes Liver / Kidney mean log ratio shifted to higher kidney expression
A few strongly expressed, differentially expressed genes in liver less sequence reads available for bulk of lower expressed liver genes ratio=liver/kidney becomes smaller (i.e., shift of distribution towards kidney)
A few strongly expressed, differentially expressed genes in liver less sequence reads available for bulk of lower expressed liver genes ratio=liver/kidney becomes smaller (i.e., shift of distribution towards kidney) Trim the data M 30% A 5%
Then, from the trimmed subset of genes, calculate a relative scaling factor from a weighted average of M –values: Implemented in edgeR (Bioconductor)
A few strongly expressed, differentially expressed genes in liver less sequence reads available for bulk of lower expressed liver genes ratio=liver/kidney becomes smaller (i.e., shift of distribution towards kidney) Trim the data M 30% A 5% log λTMM Offset for HK genes is similar to λ
These data have not been normalized with RPKM! Gene length bias This bias (a) affects comparison between genes or isoforms within one sample and (b) results in more power to detect longer transcripts 33% of highest expressed genes 33% of lowest expressed genes Oshlack and Wakefield (2009) Biology Direct, 16, 4
Mean-variance relationship . Sample variance across lanes in the liver sample from the Marioni et al (2008) Genome research, 18(9), 1509–17. Red line: for the one third of shortest genes Blue line: for the longest genes. Black line: line of equality Plot A: blue/red lines close to line of equality between mean and variance which is what would be expected from a Poisson process.
Mean-variance relationship log(variance) variance mean log(mean / length) Plot B: Counts divided by gene length (which you do when using RPKM). The short genes have higher variance for a given expression level than long genes. Because of the change in variance we are still left with a gene length dependency. Thus, RPKM does not fully correct After correction no longer Poisson distributed
Just to refresh your memory The power of a statistical test is the probability that the test will reject the null hypothesis when the alternative hypothesis is true (i.e. the probability of not committing a Type II error).
Power and gene length bias t = t statistic SE = standard error L = gene length c = proportionality constant δ = effect size (power is related to effect size) In the paper they show that δ is still related to L after accounting for gene length More power to detect longer differentially expressed transcripts
Gene set enrichment analysis and gene length bias • This bias affects Gene Set Analysis (GSA) • In GSA we compare sets of transcripts that are potentially of different length • For gene length corrections in this context see: • Gao et al (2011) Bioinformatics, 27(5), 662 (R package) • Young et al (2010) Genome Biology, 11:R14 (GOseq) • Correction at gene level or gene set level
Mappability bias Schwartz et al (2011) PLoS One, 6(1), e16685 • Uniquely mapping reads are typically summarized over genomic regions • E.g., regions with lower sequence complexity will tend to end up with lower sequence coverage • Regions with higher/lower mappability may give spurious results in downstream analysis • Test: generate all 32nt fragments from hg18 and align them back to hg18 • 32 nt corresponds to trimmed Illumina reads • Each fragment that cannot be uniquely aligned is unmappableand its first position is considered an unmappable position
Result of test Unexpected because introns are assumed to have lower sequence complexity in general Since in RNA-seq we align reads prior to further analysis, this step may already introduce a (slight) bias.