Data Analysis for High-Throughput Sequencing

Data Analysis for High-Throughput Sequencing. Mark Reimers, Tobias Guennel, Department of Biostatistics. Unto the Frontiers of Ignorance. “I love the way this workshop starts off with things we understand fairly well and works up to the cutting edge of things we don’t understand at all”

Presentation Transcript


  1. Data Analysis for High-Throughput Sequencing. Mark Reimers, Tobias Guennel, Department of Biostatistics

  2. Unto the Frontiers of Ignorance “I love the way this workshop starts off with things we understand fairly well and works up to the cutting edge of things we don’t understand at all” - Mike Neale, Oct 14, 2010

  3. The New Boyfriend/Girlfriend

  4. Where Does HTS Really Make the Difference? • Sequencing for novel variants • ChIP-Seq for DNA-binding proteins or less common histone marks • Allele-specific expression • COMING SOON • DNA methylation

  5. Outline • Biases in reads • RNA-Seq • normalization • basic tests • differential splicing • Finding peaks in ChIP-Seq

  6. Technical Biases – Sequence Start The initial bases of reads are highly biased, and the bias depends on RNA/DNA preparation

  7. Sequence Biases – K-mers Differ: (Schroeder et al, PLoS One, 2010) calculated proportions of words (k-mers) starting at various positions. Figure label: expected frequencies if bases random
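The per-position k-mer tabulation described on this slide can be sketched as follows. This is a minimal illustration, not the procedure used by Schroeder et al; the function name `kmer_proportions_by_position`, its parameters, and the toy reads are all assumptions made for the example.

```python
from collections import Counter

def kmer_proportions_by_position(reads, k=2, max_pos=5):
    """For each start position 0..max_pos-1, return {kmer: proportion of reads}."""
    props = []
    for pos in range(max_pos):
        # Count the k-mer that starts at this position in every read long enough.
        counts = Counter(
            read[pos:pos + k] for read in reads if len(read) >= pos + k
        )
        total = sum(counts.values())
        props.append({kmer: n / total for kmer, n in counts.items()} if total else {})
    return props

# Toy reads, made up for illustration only.
reads = ["ACGTAC", "ACGGAC", "TCGTAA", "ACTTAC"]
props = kmer_proportions_by_position(reads, k=2, max_pos=3)

# Under the "bases random" baseline every 2-mer has the same expected frequency.
expected_if_random = 1 / 4 ** 2

print(props[0], expected_if_random)
```

Comparing each position's proportions with `expected_if_random` gives a rough picture of how far the start of the read departs from uniform base usage.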

  8. Position of single mismatch in uniquely mapped tags (courtesy Jean & Danielle Thierry-Mieg)

  9. Types of mismatches in uniquely mapped tags with a single mismatch are profoundly asymmetric and biased (courtesy Jean & Danielle Thierry-Mieg)

  10. Technical Biases – Initiation Sites COX1

  11. Different Platforms Have Different Biases: (Harismendy et al, Genome Biology, 2009) sequenced a section of 4 HapMap individuals on Roche 454, on Illumina, and on SOLiD. 454 had the most even coverage

  12. Initiation Biases Dwarf Splicing: Counts of reads along gene APOE in different tissues, from Wold lab data. (a) Brain, (b) liver, (c) skeletal muscle

  13. Variation in Technical Biases • Sometimes the initial base biases change substantially – most base proportions change together – one PC explains 95% • In most preparations the initiation site biases change by a few percent • In a few preparations the initiation site biases change by ~20%-30% • This may have consequences for representation in ChIP-Seq assays

  14. RNA-Seq Data Analysis

  15. Biases in Proportions • Fragments compete for real-estate on the lane • If a few dozen genes are highly expressed in one tissue, they will competitively inhibit the sequencing of other genes, resulting in what appears to be lower expression
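A small numeric sketch of the competition effect: with a fixed number of reads per lane, a gene's expected read count depends on its share of all fragments, not on its absolute abundance. The function `expected_reads`, the gene names, and the abundances below are hypothetical values chosen for illustration.

```python
def expected_reads(abundances, depth):
    """Expected reads per gene when `depth` reads are shared in proportion to abundance."""
    total = sum(abundances.values())
    return {gene: depth * a / total for gene, a in abundances.items()}

# geneX has the same true abundance in both tissues (hypothetical numbers).
tissue_a = {"geneX": 100, "geneY": 100, "hog": 100}
tissue_b = {"geneX": 100, "geneY": 100, "hog": 1800}  # one very highly expressed gene

reads_a = expected_reads(tissue_a, depth=3000)
reads_b = expected_reads(tissue_b, depth=3000)

print(reads_a["geneX"], reads_b["geneX"])
```

Although geneX is equally abundant in both tissues, it receives far fewer reads in tissue B, which is the apparent lower expression the slide describes.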

  16. Effects of Competition • (Robinson & Oshlack, Genome Biology, 2010)

  17. A Simple Normalization • Align the medians of the housekeeping genes, or the genes that are not expressed at very high levels in any sample, across the samples
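One way to sketch this median alignment, assuming counts are held as nested dictionaries and that a list of reference (housekeeping or moderately expressed) genes is already known. The function names, the choice of the median of the sample medians as the common target, and the toy counts are assumptions for illustration; the sketch also assumes no sample has a reference median of zero.

```python
from statistics import median

def median_scale_factors(counts, reference_genes):
    """counts: {sample: {gene: count}}. Return {sample: factor} aligning reference medians."""
    medians = {
        sample: median(gene_counts[g] for g in reference_genes)
        for sample, gene_counts in counts.items()
    }
    target = median(medians.values())  # common level every sample is scaled to
    return {sample: target / m for sample, m in medians.items()}

def normalize(counts, factors):
    """Multiply every count in a sample by that sample's scale factor."""
    return {
        sample: {g: v * factors[sample] for g, v in gene_counts.items()}
        for sample, gene_counts in counts.items()
    }

# Toy data: s2 was sequenced about twice as deeply as s1.
counts = {
    "s1": {"hk1": 10, "hk2": 20, "hk3": 30, "geneX": 500},
    "s2": {"hk1": 20, "hk2": 40, "hk3": 60, "geneX": 400},
}
reference = ["hk1", "hk2", "hk3"]
factors = median_scale_factors(counts, reference)
normalized = normalize(counts, factors)

print(factors, normalized["s1"]["geneX"], normalized["s2"]["geneX"])
```

Because the highly expressed geneX is excluded from the reference set, it does not influence the scale factors.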

  18. A Simple Model for Counts • Poisson distribution of counts within a gene with mean proportional to Np • SD of variation equal to square root of Np • Problem: Actual variation of counts between replicate samples is significantly higher than root Np • Probably reflecting systematic biases
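A quick check of the Poisson assumption for one gene is to compare the variance of its replicate counts with their mean: under a Poisson model the ratio should be close to 1. The function name `dispersion_index` and the replicate counts below are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def dispersion_index(replicate_counts):
    """Sample variance divided by mean; about 1 if counts are Poisson."""
    return variance(replicate_counts) / mean(replicate_counts)

# Hypothetical counts for one gene across five replicate samples.
replicates = [80, 120, 95, 140, 65]

ratio = dispersion_index(replicates)
poisson_sd = sqrt(mean(replicates))       # SD the Poisson model predicts
observed_sd = sqrt(variance(replicates))  # SD actually seen between replicates

print(ratio, poisson_sd, observed_sd)
```

A ratio well above 1, as in this toy data, is the over-dispersion the slide refers to.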

  19. Hacks for Over-Dispersion • Like the λ fudge-factor in GWAS • Use negative binomial model • There is no relation to the original meaning of the distribution – numbers of nulls until something happens • Convenient way to parametrise over-dispersion • Bioconductor package edgeR estimates parameters by Maximum Likelihood
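To illustrate how the negative binomial parametrises over-dispersion, the sketch below uses the common mean-variance form Var = μ + φμ² and solves for φ by the method of moments. This is a deliberately simpler technique than the maximum-likelihood estimation the slide attributes to edgeR and is not edgeR's procedure; the function name and the counts are assumptions.

```python
from statistics import mean, variance

def nb_dispersion_moments(counts):
    """Method-of-moments estimate of phi in Var = mu + phi * mu**2, floored at 0."""
    mu = mean(counts)
    var = variance(counts)
    # phi = 0 recovers the Poisson case; negative estimates are clipped to 0.
    return max(0.0, (var - mu) / mu ** 2)

# Hypothetical replicate counts for one gene.
replicates = [80, 120, 95, 140, 65]
phi = nb_dispersion_moments(replicates)

print(phi)
```

With only a handful of replicates a per-gene estimate like this is very noisy, which is one reason dedicated packages pool information across genes.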

  20. Alternate Transcripts: Splicing Index • For each exon, the proportion of transcripts in which the exon appears • Hard to estimate because different exons have different representation probabilities • Use ratios of exons • Use constitutive exons (if known) as baseline: for them SI=1 (from Wang et al, Nature, 2008)

  21. Detecting Alternate Splicing – I: (Wang et al, Nature, 2008) measured splicing index for several tissues
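A rough sketch of a splicing index computed as a ratio against constitutive exons: each exon's read density (reads per base) is divided by the average density of the exons assumed to be constitutive, so those average to SI = 1. This is not necessarily the exact definition used by Wang et al; the function name, exon labels, and numbers are hypothetical.

```python
def splicing_index(exon_counts, exon_lengths, constitutive):
    """Exon read density relative to the mean density of constitutive exons."""
    density = {e: exon_counts[e] / exon_lengths[e] for e in exon_counts}
    baseline = sum(density[e] for e in constitutive) / len(constitutive)
    return {e: d / baseline for e, d in density.items()}

# Toy gene with three exons; e1 and e3 are assumed constitutive.
counts = {"e1": 200, "e2": 50, "e3": 400}
lengths = {"e1": 100, "e2": 100, "e3": 200}
si = splicing_index(counts, lengths, constitutive=["e1", "e3"])

print(si)
```

This simple ratio ignores the exon-specific representation probabilities mentioned on the slide, so it is best read as a relative measure compared across samples rather than an absolute proportion.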

  22. Splicing: Junction Reads • Some reads will span two different exons • Need long enough reads to be able to reliably map both sides • Can use information from one exon to identify the gene and restrict possibilities for the 5’ end of the other exon (from Wang et al, NAR, 2010)

  23. ChIP-Seq

  24. Courtesy Raphael Gottardo

  25. A View of ChIP-Seq Data • Typically reads are quite sparsely distributed over the genome • Controls (i.e. no pull-down by antibody) often show smaller peaks at the same locations • Probably due to open chromatin at promoter (Rozowsky et al, Nature Methods, 2009)

  26. Always Have a Control: High correlation between peaks in control samples and peaks in ChIP sample. Must subtract estimate of background from control tags (from Zhang et al, Genome Biol, 2008)

  27. Locating Binding Sites: Use the fact that reads on opposite sides of the site are sequenced in opposite senses (from Zhao et al, NAR, 2009)
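A minimal sketch of subtracting a control-based background estimate, assuming tags have already been counted in matching genomic bins for the ChIP and control samples. Scaling the control by the ratio of total tag counts is an assumption made here for simplicity, not the method of Zhang et al; the function name and the bin counts are invented.

```python
def subtract_background(chip, control, scale=None):
    """Per-bin ChIP count minus scaled control count, floored at zero."""
    if scale is None:
        # Assumed default: scale control to the ChIP library's total tag count.
        scale = sum(chip) / sum(control)
    return [max(0.0, c - scale * b) for c, b in zip(chip, control)]

# Toy tag counts in five consecutive bins; the control shows a smaller peak
# at the same location as the ChIP sample.
chip = [2, 3, 20, 4, 1]
control = [1, 2, 4, 2, 1]

enriched = subtract_background(chip, control)
print(enriched)
```

Only the bin whose ChIP signal exceeds the scaled control survives, which is why a peak that also appears in the control needs a larger ChIP signal to be called.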
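The strand asymmetry can be sketched with a very simple estimator: forward-strand reads pile up on one side of the binding site and reverse-strand reads on the other, so the midpoint between the two piles approximates the site. This is an illustrative simplification, not the algorithm of the cited paper; the function name and the read coordinates are hypothetical.

```python
from statistics import mean

def estimate_binding_site(forward_starts, reverse_starts):
    """Return (estimated site, strand shift) from 5' positions of reads on each strand."""
    fwd_center = mean(forward_starts)  # forward reads sit upstream of the site
    rev_center = mean(reverse_starts)  # reverse reads sit downstream of the site
    site = (fwd_center + rev_center) / 2
    shift = rev_center - fwd_center    # roughly the fragment length
    return site, shift

# Toy 5' coordinates around a single peak.
forward_starts = [900, 920, 940]
reverse_starts = [1060, 1080, 1100]

site, shift = estimate_binding_site(forward_starts, reverse_starts)
print(site, shift)
```

In practice the reads belonging to one peak would first have to be grouped, and a mode or cross-correlation may be more robust than a mean; those steps are left out of this sketch.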
