Biases in RNA-Seq data

Transcript length bias

Two transcripts, 50 and 100 bases long, have the same abundance in a control sample. The expression of both is doubled in a treatment sample, and the biological variance is the same for both, so they have the same level of differential expression. The RNA-Seq experiment fragments the transcripts into short reads of 10 bases and reports those reads. The 100-base transcript receives more hits: its read count n is larger, so the same fold change is reported as more significant.

Oshlack and Wakefield 2009, Biology Direct, 4, 14
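The arithmetic behind the length-bias example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis from the cited paper: the number of transcript copies (`copies_control=20`) is a hypothetical value, each copy is assumed to be cut into non-overlapping 10-base reads, and read counts are treated as Poisson so that a simple normal-approximation z-statistic can stand in for a test of differential expression.

```python
import math

READ_LENGTH = 10  # bases per read, as in the example above


def expected_reads(transcript_length, copies):
    """Expected read count if every copy is cut into non-overlapping reads."""
    return copies * (transcript_length // READ_LENGTH)


def z_statistic(control_count, treatment_count):
    """Normal-approximation z for the difference of two Poisson counts (an assumption)."""
    return (treatment_count - control_count) / math.sqrt(treatment_count + control_count)


def compare(copies_control=20, fold_change=2):
    """Return {length: (control reads, treatment reads, z)} for the two transcripts.

    copies_control is a hypothetical abundance chosen only for illustration.
    """
    results = {}
    for length in (50, 100):
        control = expected_reads(length, copies_control)
        treatment = expected_reads(length, copies_control * fold_change)
        results[length] = (control, treatment, z_statistic(control, treatment))
    return results


if __name__ == "__main__":
    for length, (control, treatment, z) in compare().items():
        print(f"length {length}: control={control}, treatment={treatment}, z={z:.2f}")
```

Both transcripts show exactly the same twofold change, but under these assumptions the z-statistic of the 100-base transcript is larger by a factor of the square root of 2, because its counts are twice as large.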

Random priming aims to sample transcripts uniformly along their length, rather than from just one end (as with the oligo-dT primer ……)

Counts of reads along the gene Apoe in different tissues of the Wold data: (a) brain, (b) liver, (c) skeletal muscle. Each vertical line stands for the count of reads starting at that position. The grey lines are counts in the UTR regions and a further 100 bp. Introns are deleted and exons are connected into a single piece.

Li et al. 2010, Genome Biology, 11, R50

Nucleotide frequencies versus position for stringently mapped reads. For each experiment, mapped reads were extended upstream of the 5′-start position, such that the first position of the actual read is 1 and positions 0 to −20 are obtained from the genome. The first hexamer of the read is shaded. Brief experimental protocols are indicated in the key.

Biases are caused by hexamer priming that is not random.

Hansen et al. Nucleic Acids Research, 2010, 38, e31
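A plot of nucleotide frequency versus position can be built from a table like the one computed below. This is a minimal sketch, not the procedure used by Hansen et al.: the function name `positional_frequencies` is hypothetical, the reads are assumed to be plain strings already oriented 5′ to 3′, and only positions inside the read are covered (the upstream genomic positions 0 to −20 are left out).

```python
from collections import Counter

BASES = "ACGT"


def positional_frequencies(reads, n_positions):
    """Return one dict per read position giving the frequency of A, C, G and T.

    Reads shorter than a given position, and non-ACGT characters such as N,
    are skipped at that position.
    """
    frequencies = []
    for pos in range(n_positions):
        counts = Counter(
            read[pos] for read in reads if len(read) > pos and read[pos] in BASES
        )
        total = sum(counts.values())
        frequencies.append(
            {base: (counts[base] / total if total else 0.0) for base in BASES}
        )
    return frequencies


if __name__ == "__main__":
    toy_reads = ["ACGT", "AGGT", "ACTT", "TCGA"]  # made-up reads for illustration
    for position, freq in enumerate(positional_frequencies(toy_reads, 4), start=1):
        print(position, freq)
```

With unbiased priming the frequencies at each of the first six positions would be expected to resemble those in the rest of the read; a position-specific pattern in the first hexamer is the signature described above.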

Roberts et al. 2011, Genome Biology, 12, R22

Human experiment (SRA012427); yeast experiment (SRA020818_RH).

GC content biases some RNA-Seq experiments, but not at the same level in all experiments.

Roberts et al. 2011, Genome Biology, 12, R22
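One simple way to look for a GC effect in a given experiment is to bin fragments by GC fraction and compare the mean read count per bin. The sketch below is illustrative only and is not the bias model of Roberts et al.: the function names, the `(sequence, read_count)` input format and the number of bins are all assumptions.

```python
def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def mean_count_by_gc_bin(fragments, n_bins=4):
    """Mean read count per GC bin.

    fragments: iterable of (sequence, read_count) pairs (hypothetical format).
    Returns a list of length n_bins; empty bins are reported as None.
    """
    totals = [0.0] * n_bins
    sizes = [0] * n_bins
    for sequence, count in fragments:
        # A GC fraction of exactly 1.0 is placed in the last bin.
        index = min(int(gc_fraction(sequence) * n_bins), n_bins - 1)
        totals[index] += count
        sizes[index] += 1
    return [total / size if size else None for total, size in zip(totals, sizes)]


if __name__ == "__main__":
    toy = [("ATAT", 2), ("ATAA", 4), ("GGCC", 10)]  # made-up data
    print(mean_count_by_gc_bin(toy, n_bins=2))
```

A flat profile across bins would suggest little GC dependence in that experiment, while a sloped or peaked profile would suggest a GC bias of the kind that differs between experiments.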

Next-generation sequencing is evolving rapidly. There is no market leader, and even the most popular NGS platforms have only a relatively small number of published RNA-Seq studies. There are clearly biases in the data, and the protocols and chemistry used to generate the data leave signatures. This makes meta-analysis hard.

Affymetrix GeneChips are the dominant platform for microarray observations, and have been so for almost a decade: there are more than one hundred thousand hybridizations in the public domain. Only a handful of standardised protocols have been used. This huge dataset allows sensitive meta-analysis.

Affymetrix

Applied Biosystems

Illumina

Life Technologies

Pacific Biosciences

Helicos (panels: 1 year; since 2007)

Biases in RNA-Seq data, October 30, 2013, NBIC Advanced RNA-Seq course
