
Differential expression analysis for sequence count data

Differential expression analysis for sequence count data. Wolfgang Huber, Simon Anders. Context: research group on statistical methods for genome biology; joint appointment between EMBL HD and EBI; non-coding RNA, pervasive transcription; genetics of complex traits.


Presentation Transcript


  1. Differential expression analysis for sequence count data • Wolfgang Huber, Simon Anders

  2. Context • Research group on statistical methods for genome biology • Joint appointment between EMBL HD and EBI • non-coding RNA, pervasive transcription • genetics of complex traits • HT microscopy for systems analysis • large-scale combinatorial RNAi and morphology phenotypes • ‘metrology’ for several genomic and proteomic technologies • Bioconductor

  3. Count data in HTS • RNA-Seq • Tag-Seq • ChIP-Seq • Bar-Seq

     Example count table (rows: genes, columns: samples):

     Gene       GliNS1   G144   G166   G179   CB541   CB660
     13CDNA73        4      0      6      1       0       5
     A2BP1          19     18     20      7       1       8
     A2M          2724   2209     13     49     193     548
     A4GALT          0      0     48      0       0       0
     AAAS           57     29    224     49     202      92
     AACS         1904   1294   5073   5365    3737    3511
     AADACL1         3     13    239    683     158      40
     [...]

  4. Effect size vs significance

  5. Statistical testing • Formulate a null hypothesis (e.g. ‘expression levels in these two conditions are the same’) • Define a value computed from the data (‘test statistic’) • Use your understanding of the null hypothesis, and the rules of probability calculus, to derive its null distribution. • Compare the observed value with the distribution: if it is too extreme, it is unlikely to have happened by chance, so reject the null hypothesis.
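As a minimal sketch of the recipe on slide 5, consider one gene with a single count in each of two conditions. Assuming pure Poisson counting noise and equal sequencing depth (simplifying assumptions made here for illustration, not the model the talk goes on to use), the null hypothesis of equal expression implies that, given the total n = k_a + k_b, the count k_a follows Binomial(n, 1/2). The function name `binomial_two_sided_p` is hypothetical.

```python
from math import comb


def binomial_two_sided_p(k_a: int, k_b: int) -> float:
    """Two-sided p value for H0: both conditions have the same expected count.

    Assumption (illustrative only): Poisson noise and equal sequencing depth,
    so that k_a | (k_a + k_b = n) ~ Binomial(n, 1/2) under H0.
    """
    n = k_a + k_b
    # Null distribution of the test statistic k_a, conditional on the total n.
    probs = [comb(n, k) / 2**n for k in range(n + 1)]
    p_obs = probs[k_a]
    # "Too extreme" = every outcome that is no more probable than the observed one.
    # The small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in probs if p <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    print(binomial_two_sided_p(10, 0))  # counts 10 vs 0
    print(binomial_two_sided_p(5, 5))   # counts 5 vs 5
```

For counts of 10 versus 0, only the two outcomes (10, 0) and (0, 10) are as extreme as the observation, so the p value is 2/1024, roughly 0.002. For 5 versus 5 it is 1.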

  6. Challenges with count data • discrete, positive, skewed (i.e. no normal approximation) • small numbers of replicates (i.e. cannot use distribution-free methods, e.g. rank based or permutation) • sequencing depth (coverage) varies

  7. Strategies that have served us well with microarray data • Use a distribution approximation in order to infer the ‘tail behaviour’ (probability of extreme values) from mean and variance. • Share data across genes in order to improve the estimation of the variance: similar genes should have similar variance. • limma / eBayes • SAGE: edgeR by Robinson and Smyth

  8. Variance and mean are correlated • Tag-Seq counts of two replicate glioblastoma-derived tissue cultures (P. Bertone / EBI) • Curves shown: local regression, v = f(x) + x; linear, v = ax² + x; Poisson, v = x

  9. Technical and biological replicates • RNA-Seq of yeast (Nagalakshmi et al. 2008) • Panels: biological replicates; technical replicates

  10. Poisson • Is a natural ‘first try’ for count data. It models the minimal amount of variability that just comes from random sampling - even if all other variables are exactly fixed. • It fits well for technical replicates [1] - but hopelessly underestimates variance for biological replicates [2]. • [1] Marioni et al. (2008) • [2] Robinson and Smyth (2007), Nagalakshmi et al. (2008)

  11. The negative-binomial distribution overdispersion parameter

  12. NB distribution can be motivated by a hierarchical model • Biological sample-to-sample variability: Γ • Poisson counting statistics: P • Overall distribution: NB • NB(μ, σ² + μ) = Γ(μ, σ²) ∗ P(μ)
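The moments quoted on slide 12 follow from the laws of total expectation and total variance. In the sketch below, the symbols λ (the true expression level of a gene in one sample) and K (the observed count) are introduced here for illustration and do not appear on the slide; λ is assumed to be gamma-distributed with mean μ and variance σ², and K to be Poisson-distributed given λ.

```latex
\begin{align*}
\lambda &\sim \Gamma(\text{mean } \mu,\ \text{variance } \sigma^2),
\qquad K \mid \lambda \sim \mathrm{Poisson}(\lambda) \\[4pt]
\mathbb{E}[K] &= \mathbb{E}\bigl[\mathbb{E}[K \mid \lambda]\bigr]
              = \mathbb{E}[\lambda] = \mu \\[4pt]
\operatorname{Var}(K)
  &= \mathbb{E}\bigl[\operatorname{Var}(K \mid \lambda)\bigr]
   + \operatorname{Var}\bigl(\mathbb{E}[K \mid \lambda]\bigr) \\
  &= \mathbb{E}[\lambda] + \operatorname{Var}(\lambda)
   = \mu + \sigma^2
\end{align*}
```

The μ term is the Poisson ("shot noise") contribution and σ² is the biological variability, so the variance of the mixture always exceeds the Poisson variance whenever σ² > 0.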

  13. Model fitting • to get an unbiased estimate of σi², subtract an estimator of the “shot-noise” contribution
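The subtraction on slide 13 can be sketched for a single gene as follows. This assumes all replicates have the same sequencing depth, so that the shot-noise contribution is estimated by the sample mean itself; with varying depth the counts would first need to be normalised, which is not shown here. The function name `raw_variance_estimate` and the example counts are hypothetical.

```python
from statistics import mean, variance


def raw_variance_estimate(counts):
    """Return (mean, sample variance, variance minus shot noise) for one gene.

    Assumption (illustrative only): equal sequencing depth across replicates,
    so the Poisson shot-noise term is estimated by the sample mean.
    """
    m = mean(counts)
    v = variance(counts)  # unbiased sample variance (divides by n - 1)
    # Total variance = shot noise + biological variance, so subtract the mean.
    return m, v, v - m


if __name__ == "__main__":
    # Hypothetical replicate counts for one gene.
    print(raw_variance_estimate([10, 20, 30]))
```

With so few replicates the per-gene result is very noisy and can even be negative, which is one reason to pool information across genes as described on slide 7.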

  14. Testing for differential expression • We use a test similar to the one used in edgeR • For each of two conditions A and B, add the counts from all replicates, and consider them NB-distributed with moments as fitted. • Calculate the probability of observing the difference KiA − KiB (or more extreme), conditioned on the sum KiA + KiB, resulting in a p value.
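A conditional test of the kind described on slide 14 can be sketched as below. This is an illustration under stated assumptions, not the DESeq code: the pooled counts of each condition are taken as independent NB variables with a given mean and variance, and "more extreme" is read as "any split of the same total that is no more probable than the observed one". The function names `nb_logpmf` and `nb_conditional_p` and all numeric values are hypothetical.

```python
from math import exp, lgamma, log


def nb_logpmf(k, mu, var):
    """Log probability of count k under an NB with mean mu and variance var."""
    if var <= mu:
        raise ValueError("NB requires variance > mean (overdispersion)")
    p = mu / var                  # 'success' probability
    r = mu * mu / (var - mu)      # size parameter
    return (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
            + r * log(p) + k * log(1 - p))


def nb_conditional_p(k_a, k_b, mu_a, var_a, mu_b, var_b):
    """P value for the observed split (k_a, k_b), conditioned on k_a + k_b."""
    total = k_a + k_b
    # Joint log probability of every split (a, total - a) of the same total.
    logp = [nb_logpmf(a, mu_a, var_a) + nb_logpmf(total - a, mu_b, var_b)
            for a in range(total + 1)]
    shift = max(logp)             # rescale to avoid floating-point underflow
    probs = [exp(lp - shift) for lp in logp]
    p_obs = probs[k_a]
    # Sum the splits that are at most as probable as the observed one.
    extreme = sum(p for p in probs if p <= p_obs * (1 + 1e-9))
    return min(1.0, extreme / sum(probs))


if __name__ == "__main__":
    # Hypothetical fitted moments: mean 50, variance 300 in both conditions.
    print(nb_conditional_p(50, 50, 50, 300, 50, 300))
    print(nb_conditional_p(100, 0, 50, 300, 50, 300))
```

Dividing by the sum over all splits is what implements the conditioning on KiA + KiB; a balanced split gives a p value of 1 here, while the 100-versus-0 split gives a very small one.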

  15. Differential expression • RNA-Seq data: tumor vs control

  16. Type I error control • Comparison of one GNS replicate with another one.

  17. Selection across the dynamic range • Legend: all transcripts; hits from DESeq; hits from edgeR

  18. Working without replicates • Comparing 1-vs-1 with 2-vs-2 • Counts shown in the figure: 620, 202, 271, 15,529

  19. Variance-stabilizing transformation • The estimated variance-mean dependence allows a transformation that renders the count data approximately homoskedastic. This is useful e.g. for computing sample-sample distances.
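The transformation on slide 19 can be motivated with the delta method: if a count K has mean μ and variance v(μ), a function τ with derivative 1/√v(μ) has approximately constant variance. The two closed forms below are worked examples for the Poisson and the quadratic variance functions listed on slide 8; the symbol τ is introduced here for illustration, and for a fitted local regression the integral would presumably have to be evaluated numerically.

```latex
\begin{align*}
\tau(x) &= \int^{x} \frac{d\mu}{\sqrt{v(\mu)}},
\qquad
\operatorname{Var}\bigl(\tau(K)\bigr) \approx \tau'(\mu)^{2}\, v(\mu) = 1 \\[6pt]
v(\mu) = \mu
  &\;\Longrightarrow\; \tau(x) = 2\sqrt{x} \\[4pt]
v(\mu) = \mu + a\mu^{2}
  &\;\Longrightarrow\; \tau(x) = \frac{2}{\sqrt{a}}\,
     \operatorname{arsinh}\!\bigl(\sqrt{a\,x}\bigr)
\end{align*}
```

The second result follows from the substitution u = √(aμ), which turns the integrand into (2/√a) / √(1 + u²).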

  20. Conclusions • Parametric model provides power for detecting differentially expressed genes even if there are few replicates (while controlling type I error) • Poisson model describes the minimal amount of error - between biological replicates, it will be larger. • Key assumption: negative binomial distribution. Mean estimated directly for each gene. Variance (or over-dispersion) estimated jointly for all genes, in the form of a local regression relationship. • Software: R package DESeq • Extensions: transcript length (RNA-Seq); other covariates (‘GC’); more complex contrasts (ANOVA); regression on continuous variables

  21. Acknowledgements • Simon Anders • Bernd Fischer • Greg Pau • Elin Axelsson • Daniel Murrell • Julien Gagneur • Nicolas delHomme • Stefan Wilkening • Emilie Fritsch • Lars Steinmetz • Paul Bertone • Jan Korbel • All contributors to the R and Bioconductor projects
