RNA- s eq: From experimental design to gene expression. - PowerPoint PPT Presentation

rna s eq from experimental design to gene expression n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
RNA- s eq: From experimental design to gene expression. PowerPoint Presentation
Download Presentation
RNA- s eq: From experimental design to gene expression.

play fullscreen
1 / 85
RNA- s eq: From experimental design to gene expression.
192 Views
Download Presentation
reuben-waller
Download Presentation

RNA- s eq: From experimental design to gene expression.

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. RNA-seq: From experimental design to gene expression. Steve Munger @stevemunger The Jackson Laboratory UMaine Computational Methods in Biology/ Genomics September 29, 2014

  2. Outline General overview of RNA-seq analysis. • Introduction to RNA-seq • The importance of a good experimental design • Quality control • Read alignment • Quantifying isoform and gene expression • Normalization of expression estimates

  3. Understandinggene expression Alwineet. al. PNAS 1977 DeRisiet. al. Science 1997 ON/OFF 1.5 1.5 10.5

  4. Next Generation Genome Sequencers IlluminaHiSeq and MiSeq PacBio SMRT 454 GS FLX Oxford Nanopore N

  5. RNA AGCTA ATGCTCA RNA-Seq: Sequencing Transcriptomes AGCTA TAGATGCTCA AGCTAATC ATGCTCA AGCTA ATGCTCA AGCTA AGTAGATGCTCA AGCTA ATGCTCA AGCTA ATGCTCA AGCTA ATGCTCA TAGATGCTCA AGCTAATC CTCA AGCTAATCCTAG N

  6. A/G TAGATGCTCA AGCTAATCCTAG Applications of RNA-Seq Technology A AGCTA ATGCTCA A AGCTA TAGATGCTCA AGCTAATC A ATGCTCA A AGCTA ATGCTCA AGCTA A AGTAGATGCTCA AGCTA G ATGCTCA AGCTA G ATGCTCA Gene expression analysis Novel Exon discovery G AGCTA ATGCTCA TAGATGCTCA G AGCTAATC G CTCA AGCTAATCCTAG TAGATGCTCAA AGCTAATCCTAG A AGCTA ATGCTCA A AGCTA TAGATGCTCA AGCTAATC A ATGCTCA A AGCTA ATGCTCA AGCTA A AGTAGATGCTCA AGCTA A ATGCTCA AGCTA A ATGCTCA G AGCTA ATGCTCA TAGATGCTCA G AGCTAATC G CTCA AGCTAATCCTAG Allele Specific Gene Expression RNA Editing N

  7. Total RNA RNA-Seq mRNA mRNA after fragmentation cDNA Adaptors ligated to cDNA Single/ Paired End Sequencing N

  8. Challenges in RNA-Seq Work Flow Study Design RNA isolation/ Library Prep Sequencing Reads (SE or PE) Aligned Reads Quantified isoform and gene expression N

  9. Know your application – Design your experiment accordingly • Differential expression of highly expressed and well annotated genes? • Smaller sample depth; more biological replicates • No need for paired end reads; shorter reads (50bp) may be sufficient. • Better to have 20 million 50bp reads than 10 million 100bp reads. • Looking for novel genes/splicing/isoforms? • More read depth, paired-end reads from longer fragments. • Allele specific expression • “Good” genomes for both strains. N

  10. Good Experimental Design One IlluminaHiSeqFlowcell = 8 lanes Multiplex up to 24 samples in a lane Multiplexing Replication Randomization 187 Million SE reads per lane 374 Million PE reads per lane • Cost per lane: • 100 bases PE: $1800 - $2700 • 50 bases PE: $1400 - $2100 • 50 bases SE: $800 - $1200 N

  11. RNA-Seq Experimental Design: Randomization Experimental Group 1 Experimental Group 2 Two Illumina Lanes Random.org Bad Design N

  12. RNA-Seq Experimental Design: Randomization Experimental Group 1 Experimental Group 2 Two Illumina Lanes Random.org Bad Design Good Design N

  13. RNA-Seq Experimental Design: Randomization Experimental Group 1 Experimental Group 2 Two Illumina Lanes Random.org Better Design Bad Design Good Design N

  14. Challenges in RNA-Seq Work Flow Study Design RNA isolation/ Library Prep Sequencing Reads (SE or PE) Aligned Reads Quantified isoform and gene expression N

  15. Total RNA poly-A tail selection mRNA mRNA after fragmentation Not actually random “Not So” random primers cDNA Size selection step (gel extraction) Adaptors ligated to cDNA PCR amplification • Adapter dimers S

  16. Sequence Read: Sanger fastq format @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1 TCACCCGTAAGGTAACAAACCGAAAGTATCCAAAGCTAAAAGAAGTGGACGACGTGCTTGGTGGAGCAGCTGCATG + CCCFFFFFHHHHDHHJJJJJJJJIJJ?FGIIIJJJJJJIJJJJJJFHIJJJIJHHHFFFFD>AC?B??C?ACCAC>BB<<<>C@CCCACCCDCCIJ @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1 The member of a pair Instrument: run/flowcell id Flowcell lane and tile number Index Sequence X-Y Coordinate in flowcell Q = -10 log10 P 10 indicates 1 in 10 chance of error 20 indicates 1 in 100, 30 indicates 1 in 1000, Phred Score: SN

  17. To Trim or not to Trim? Quality control S

  18. Quality Control: Sequence quality per base position • Bad data • High Variance • Quality Decrease with Length • Good data • Consistent • High Quality Along the reads S

  19. Per sequence quality distribution bad data Y= number of reads X= Mean sequence quality Average data NGS Data Preprocessing S

  20. Per sequence quality distribution bad data Good data Y= number of reads X= Mean sequence quality Average data NGS Data Preprocessing S

  21. Quality Control: Sequence Content Across Bases S

  22. K-mer content counts the enrichment of every 5-mer within the sequence library Bad: If k-mer enrichment >= 10 fold at any individual base position NGS Data Preprocessing

  23. K-mer content Most samples

  24. Duplicated sequences Good: non-unique sequences make up less than 20% Bad: non-unique sequences make >50% NGS Data Preprocessing S

  25. PCR duplicates or high expressed genes? The case of Albumin. ~80,000x coverage here S

  26. Tradeoffs to preprocessing data • Signal/noise -> Preprocessing can remove low-quality “noise”, but the cost is information loss. • Some uniformly low-quality reads map uniquely to the genome. • Trimming reads to remove lower quality ends can adversely affect alignment, especially if aligning to the genome and the read spans a splice site. • Duplicated reads or just highly expressed genes? • Most aligners can take quality scores into consideration. • Currently, we do not recommend preprocessing reads aside from removing uniformly low quality samples. S

  27. The problem with trimming all SE reads 100bp reads All reads trimmed to 75 bp Longer is better for splice junction spanning reads S

  28. Quality Control: Resources • FASTX-Toolkit • http://hannonlab.cshl.edu/fastx_toolkit/ • FastQC • http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ NGS Data Preprocessing S

  29. RNA-Seq Work Flow Study Design Total RNA Sequencing Reads (SE or PE) Aligned Reads Quantified isoform and gene expression KB

  30. Alignment 101 100bp Read ACATGCTGCGGA Chr 3 ACATGCTGCGGA Chr 2 Chr 1 KB

  31. The perfect read: 1 read = 1 unique alignment. 100bp Read ACATGCTGCGGA ✓ Chr 3 ACATGCTGCGGA Chr 2 Chr 1 KB

  32. Some reads will align equally well to multiple locations. “Multireads” 100bp Read ✗ ACATGCTGCGGA ACATGCTGCGGA ✓ ✗ ACATGCTGCGGA 1 read 3 valid alignments Only 1 alignment is correct ACATGCTGCGGA KB

  33. The worst case scenario. 100bp Read ACATGCTGCGGA ✗ ACATGCTGCGGA ✗ ACATGCTGCGGC 1 read 2 valid alignments Neither is correct ACATGCTGCGGA KB

  34. Individual genetic variation may affect read alignment. 100bp Read ✗ ATATGCTGCGGA ACATGCTGCGGA ACATGCTGCGGC ✗ ACATGCTGCGGA KB

  35. Aligning Billions of Short Sequence Reads Gene A Gene B Aligners: Bowtie, GSNAP, BWA, MAQ, BLAT Designed to align the short reads fast, but not accurate KB

  36. Aligning Sequence Reads Gene A Gene B • Gene family (orthologous/paralogous) • Low-complexity sequence • Alternatively spliced isoforms • Pseudogenes • Polymorphisms • Indels • Structural Variants • Reference sequence Errors KB

  37. Align to Genome or Transcriptome? Genome Transcriptome KB

  38. Aligning to Reference Genome: Exon First Alignment Exon 1 Exon 2 Exon 3 KB

  39. Aligning to Reference Genome: Exon First Alignment Exon 1 Exon 2 Exon 3 KB

  40. Aligning to Reference Genome: Exon First Alignment Exon 1 Exon 2 Exon 3 KB

  41. Exon First Alignment: Pseudo-gene problem Exon 1 Exon 2 Exon 3 Processed Pseudo-gene Exon 2 Exon 3 Exon 1 KB

  42. Exon First Alignment: Pseudo-gene problem Exon 1 Exon 2 Exon 3 Processed Pseudo-gene Exon 2 Exon 3 Exon 1 KB

  43. Exon First Alignment: Pseudo-gene problem Exon 1 Exon 2 Exon 3 Processed Pseudo-gene Exon 2 Exon 3 Exon 1 KB

  44. Align to Genome or Transcriptome? Genome Advantages: Can align novel isoforms. Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes Transcriptome KB

  45. Reference Transcriptome Exon 1 Gene 1 Exon 2 Exon 3 Isoform 1 Isoform 2 Isoform 3 Exon 1 Gene 2 Exon 2 Exon 3 Exon 4 Isoform 1 Isoform 2 Isoform 3 KB

  46. Align to Genome or Transcriptome? Genome Advantages: Can align novel isoforms. Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes Transcriptome Advantages: Easy, Focused to the part of the genome that is known to be transcribed. Disadvantages: Reads that come from novel isoforms may not align at all or may be misattributed to a known isoform. KB

  47. Better Approach: Aligning to Transcriptome and Genome Align to Transcriptome First Align the remaining reads to Genome next Advantages: relatively simpler, overcomes the pseudo-gene and novel isoform problems KB RUM, TopHat2, STAR Advantages: Can align novel isoforms. Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes

  48. Conclusions • There is no perfect aligner. Pick one well-suited to your application. • E.g. Want to identify novel exons? Don’t align only to the known set of isoforms. • Visually inspect the resulting alignments. Setting a parameter a little too liberal or conservative can have a huge effect on alignment. • Consider running the same fastq files through multiple alignment pipelines specific to each application. • Gene expression -> Bowtie to transcriptome • Exon discovery -> RUM or other hybrid mapper • Variant detection -> GSNAP, GATK, Samtoolsmpileup • If your species has not been sequenced, use a de novo assembly method. Can also use the genome of a related species as a scaffold. KB

  49. Output of most aligners: Bam/Sam file of reads and genome positions S

  50. Visualization of alignment data (BAM/SAM) • Genome browsers – UCSC, IGV, etc.