
RNA-seq: From experimental design to gene expression.

Steve Munger

@stevemunger

The Jackson Laboratory

UMaine Computational Methods in Biology/Genomics

September 29, 2014


Outline

General overview of RNA-seq analysis.

  • Introduction to RNA-seq

  • The importance of a good experimental design

  • Quality control

  • Read alignment

  • Quantifying isoform and gene expression

  • Normalization of expression estimates


Understanding gene expression

Alwine et al. PNAS 1977

DeRisi et al. Science 1997

[Figure: expression readouts labeled ON/OFF, 1.5, 1.5, and 10.5]

Next Generation Genome Sequencers

Illumina HiSeq and MiSeq

PacBio SMRT

454 GS FLX

Oxford Nanopore

RNA-Seq: Sequencing Transcriptomes

[Figure: RNA sequenced as many short reads, e.g. AGCTA, ATGCTCA, TAGATGCTCA, AGCTAATCCTAG]

Applications of RNA-Seq Technology

  • Gene expression analysis

  • Novel exon discovery

  • Allele-specific gene expression

  • RNA editing

[Figure: example short reads for each application, some carrying an A or a G at a variable position]

RNA-Seq

Total RNA

mRNA

mRNA after fragmentation

cDNA

Adaptors ligated to cDNA

Single/Paired-End Sequencing

Challenges in RNA-Seq Work Flow

Study Design

RNA isolation/Library Prep

Sequencing Reads (SE or PE)

Aligned Reads

Quantified isoform and gene expression


Know your application – Design your experiment accordingly

  • Differential expression of highly expressed and well-annotated genes?

    • Lower sequencing depth per sample; more biological replicates.

    • No need for paired-end reads; shorter reads (50 bp) may be sufficient.

    • Better to have 20 million 50 bp reads than 10 million 100 bp reads.

  • Looking for novel genes/splicing/isoforms?

    • More read depth; paired-end reads from longer fragments.

  • Allele-specific expression?

    • “Good” genomes for both strains.


Good Experimental Design

Multiplexing, Replication, Randomization

One Illumina HiSeq flowcell = 8 lanes

Multiplex up to 24 samples in a lane

187 million SE reads per lane

374 million PE reads per lane

  • Cost per lane:

    • 100 bases PE: $1800 - $2700

    • 50 bases PE: $1400 - $2100

    • 50 bases SE: $800 - $1200
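
To make the multiplexing trade-off concrete, the sketch below divides one lane's reads and price evenly across samples. The even split and the helper name `per_sample` are assumptions for illustration; real lanes rarely divide perfectly evenly.

```python
def per_sample(reads_per_lane, lane_cost_range, samples_per_lane):
    """Split one lane's reads and cost evenly across multiplexed samples."""
    low, high = lane_cost_range
    return {
        "reads": reads_per_lane / samples_per_lane,
        "cost_low": low / samples_per_lane,
        "cost_high": high / samples_per_lane,
    }

# 50-base SE lane from the slide: 187 million reads, $800 - $1200 per lane.
for n in (6, 12, 24):
    s = per_sample(187e6, (800, 1200), n)
    print(f"{n:2d} samples/lane: {s['reads'] / 1e6:5.1f} M reads each, "
          f"${s['cost_low']:.0f} - ${s['cost_high']:.0f} per sample")
```

At the maximum of 24 samples per lane, each sample receives a little under 8 million SE reads, which is why the choice of multiplexing level depends on the application.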


RNA-Seq Experimental Design: Randomization

Experimental Group 1

Experimental Group 2

Two Illumina Lanes

Random.org

[Figure: three lane layouts labeled Bad Design, Good Design, and Better Design]


Total RNA

poly-A tail selection

mRNA

mRNA after fragmentation

Not actually random

“Not so” random primers

cDNA

Size selection step (gel extraction)

Adaptors ligated to cDNA

PCR amplification

  • Adapter dimers

Sequence Read: Sanger fastq format

@HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1
TCACCCGTAAGGTAACAAACCGAAAGTATCCAAAGCTAAAAGAAGTGGACGACGTGCTTGGTGGAGCAGCTGCATG
+
CCCFFFFFHHHHDHHJJJJJJJJIJJ?FGIIIJJJJJJIJJJJJJFHIJJJIJHHHFFFFD>AC?B??C?ACCAC>BB<<<>

Fields of the read name @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1:

  • HISEQ2000_0074 = Instrument: run/flowcell id

  • 8:1101 = Flowcell lane and tile number

  • 7544:2225 = X-Y coordinate in flowcell

  • TAGCTT = Index sequence

  • /1 = The member of a pair

Phred Score: Q = -10 log10 P

  • 10 indicates 1 in 10 chance of error

  • 20 indicates 1 in 100

  • 30 indicates 1 in 1000
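
The read-name layout and the Phred formula can be sketched in a few lines of Python. This is a minimal illustration, assuming the older Illumina read-name style shown on the slide and the Sanger quality encoding (ASCII offset 33); the function names are made up for this example.

```python
import re

# Read-name pattern for the older Illumina style shown on the slide.
HEADER = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[ACGTN]+)/(?P<mate>[12])$"
)

def parse_header(line):
    """Split a read name into instrument, lane, tile, x, y, index, and mate."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized read name: {line!r}")
    return match.groupdict()

def phred_scores(quality, offset=33):
    """Convert a Sanger-encoded quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in quality]

def error_probability(q):
    """Invert Q = -10 log10 P to get the per-base error probability."""
    return 10 ** (-q / 10)

fields = parse_header("@HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1")
print(fields["lane"], fields["tile"], fields["index"], fields["mate"])
for q in (10, 20, 30):
    print(q, error_probability(q))
```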

Quality control

To Trim or not to Trim?

Quality Control: Sequence quality per base position

  • Bad data

    • High variance

    • Quality decreases with length

  • Good data

    • Consistent

    • High quality along the reads

Per sequence quality distribution

[Figure: per-sequence quality distributions labeled bad data, average data, and good data. X = mean sequence quality; Y = number of reads]

Quality Control: Sequence Content Across Bases


K-mer content

Counts the enrichment of every 5-mer within the sequence library.

Bad: k-mer enrichment >= 10-fold at any individual base position.

[Figure: k-mer content plot labeled "Most samples"]

Duplicated sequences

Good: non-unique sequences make up less than 20% of the library.

Bad: non-unique sequences make up more than 50% of the library.
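
A rough way to apply these thresholds is to count how many reads share their exact sequence with at least one other read. The sketch below is a simplified illustration, not the calculation any particular QC tool performs; the "intermediate" label for values between the two thresholds is an assumption.

```python
from collections import Counter

def non_unique_fraction(sequences):
    """Fraction of reads whose exact sequence occurs more than once."""
    counts = Counter(sequences)
    total = sum(counts.values())
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / total if total else 0.0

def classify(fraction):
    """Apply the slide's thresholds: < 20% good, > 50% bad."""
    if fraction < 0.20:
        return "good"
    if fraction > 0.50:
        return "bad"
    return "intermediate"

reads = ["AAA", "AAA", "CCC", "GGG", "TTT"]
fraction = non_unique_fraction(reads)
print(fraction, classify(fraction))
```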


PCR duplicates or highly expressed genes? The case of Albumin.

~80,000x coverage here

Tradeoffs to preprocessing data

  • Signal/noise -> Preprocessing can remove low-quality “noise”, but at the cost of information loss.

    • Some uniformly low-quality reads still map uniquely to the genome.

    • Trimming reads to remove lower-quality ends can adversely affect alignment, especially when aligning to the genome and the read spans a splice site.

    • Duplicated reads, or just highly expressed genes?

    • Most aligners can take quality scores into consideration.

    • Currently, we do not recommend preprocessing reads aside from removing uniformly low-quality samples.

The problem with trimming all SE reads

100bp reads

All reads trimmed to 75 bp

Longer is better for splice junction spanning reads
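
One way to see why length matters: a read only helps place a splice junction if enough bases fall on each side of it. The sketch below counts how many read placements satisfy that, assuming a hypothetical minimum overhang of 8 bases on each side (this value is an illustration, not a setting from any specific aligner).

```python
def junction_spanning_offsets(read_length, min_overhang):
    """Number of ways a read can straddle a junction with at least
    `min_overhang` bases on both sides."""
    return max(0, read_length - 2 * min_overhang + 1)

for length in (100, 75, 50):
    print(length, junction_spanning_offsets(length, min_overhang=8))
```

Under this assumption, trimming every read from 100 bp to 75 bp reduces the usable placements around each junction from 85 to 60.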


Quality Control: Resources

  • FASTX-Toolkit

    • http://hannonlab.cshl.edu/fastx_toolkit/

  • FastQC

    • http://www.bioinformatics.babraham.ac.uk/projects/fastqc/


RNA-Seq Work Flow

Study Design

Total RNA

Sequencing Reads (SE or PE)

Aligned Reads

Quantified isoform and gene expression


Alignment 101

[Figure: a 100 bp read (shown as ACATGCTGCGGA) to be placed on Chr 1, Chr 2, or Chr 3]

The perfect read: 1 read = 1 unique alignment.

[Figure: the 100 bp read ACATGCTGCGGA matches exactly one location among Chr 1, Chr 2, and Chr 3]

Some reads will align equally well to multiple locations. “Multireads”

[Figure: the 100 bp read ACATGCTGCGGA matches three identical reference locations]

1 read

3 valid alignments

Only 1 alignment is correct

The worst case scenario.

[Figure: the 100 bp read ACATGCTGCGGA matches two reference locations reading ACATGCTGCGGA, while a third location reads ACATGCTGCGGC]

1 read

2 valid alignments

Neither is correct
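
The unique, multiread, and mismatch cases can be reproduced with a toy brute-force search. This is only an illustration of the ambiguity: the reference sequences below are invented around the slide's 12-base example, and real aligners use indexed search rather than scanning every position.

```python
def find_alignments(read, reference, max_mismatches=0):
    """Return (chromosome, start, mismatches) for every placement of `read`
    with at most `max_mismatches` substitutions."""
    hits = []
    for name, seq in reference.items():
        for start in range(len(seq) - len(read) + 1):
            window = seq[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((name, start, mismatches))
    return hits

# Hypothetical reference: two exact copies and one near copy of the read.
reference = {
    "chr1": "GGGACATGCTGCGGATTT",
    "chr2": "CCACATGCTGCGGAGG",
    "chr3": "TTACATGCTGCGGCAA",
}

print(find_alignments("ACATGCTGCGGA", reference))      # exact matches only
print(find_alignments("ACATGCTGCGGA", reference, 1))   # allow one mismatch
print(find_alignments("ATATGCTGCGGA", reference, 1))   # read carrying a variant
```

Loosening the mismatch setting turns a two-way ambiguity into a three-way one, and a read carrying a variant finds no exact placement at all, which is the effect the following slides describe.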


Individual genetic variation may affect read alignment.

[Figure: the 100 bp read ATATGCTGCGGA, carrying a variant, compared with reference locations reading ACATGCTGCGGA and ACATGCTGCGGC]

Aligning Billions of Short Sequence Reads

[Figure: short reads aligned to Gene A and Gene B]

Aligners: Bowtie, GSNAP, BWA, MAQ, BLAT

Designed to align short reads fast, but not necessarily accurately.

Aligning Sequence Reads

Gene A

Gene B

  • Gene family (orthologous/paralogous)

  • Low-complexity sequence

  • Alternatively spliced isoforms

  • Pseudogenes

  • Polymorphisms

  • Indels

  • Structural Variants

  • Reference sequence errors

Align to Genome or Transcriptome?

Genome

Transcriptome


Aligning to Reference Genome: Exon First Alignment

[Figure: reads aligned first to Exon 1, Exon 2, and Exon 3 of the reference genome]

Exon First Alignment: Pseudo-gene problem

[Figure: a gene with Exon 1, Exon 2, and Exon 3 alongside a processed pseudo-gene containing the same three exons]

Align to Genome or Transcriptome?

Genome

Advantages: Can align novel isoforms.

Disadvantages: Difficult; spurious alignments, spliced alignment, gene families, pseudogenes.

Reference Transcriptome

[Figure: Gene 1 (Exon 1, Exon 2, Exon 3) and Gene 2 (Exon 1, Exon 2, Exon 3, Exon 4), each with Isoform 1, Isoform 2, and Isoform 3]

Transcriptome

Advantages: Easy; focused on the part of the genome that is known to be transcribed.

Disadvantages: Reads that come from novel isoforms may not align at all or may be misattributed to a known isoform.

Better Approach: Aligning to Transcriptome and Genome

Align to the transcriptome first.

Align the remaining reads to the genome next.

Advantages: relatively simple; overcomes the pseudo-gene and novel isoform problems.

Tools: RUM, TopHat2, STAR
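
The two-stage idea can be sketched with exact string matching on a toy gene. Everything here is hypothetical (the sequences, the names `tx1` and `chr1`, and the exact-match rule); it illustrates only the order of operations, not how RUM, TopHat2, or STAR are implemented.

```python
# Toy gene: two annotated exons, an intron, and an unannotated exon downstream.
exon1 = "ACGTACGTAC"
intron = "GTAAGTCTCTTTCAG"
exon2 = "TTGACCTGAA"
novel_exon = "CCGGATATCC"

transcriptome = {"tx1": exon1 + exon2}
genome = {"chr1": exon1 + intron + exon2 + "AAAAAA" + novel_exon}

def exact_hits(read, sequences):
    """First exact occurrence of `read` in each named sequence."""
    return [(name, seq.find(read)) for name, seq in sequences.items() if read in seq]

def align_two_stage(reads, transcriptome, genome):
    """Try the transcriptome first; fall back to the genome for the rest."""
    results = {}
    for read in reads:
        hits = exact_hits(read, transcriptome)
        if hits:
            results[read] = ("transcriptome", hits)
            continue
        hits = exact_hits(read, genome)
        results[read] = ("genome", hits) if hits else ("unaligned", [])
    return results

junction_read = exon1[-5:] + exon2[:5]   # spans the exon1-exon2 splice junction
novel_read = novel_exon[:8]              # comes from the unannotated exon
for read, outcome in align_two_stage([junction_read, novel_read],
                                     transcriptome, genome).items():
    print(read, outcome)
```

The junction-spanning read is contiguous only in the transcript, and the novel-exon read exists only in the genome, so each stage rescues reads the other would miss.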



Conclusions

  • There is no perfect aligner. Pick one well-suited to your application.

    • E.g. Want to identify novel exons? Don’t align only to the known set of isoforms.

  • Visually inspect the resulting alignments. Setting a parameter a little too liberal or conservative can have a huge effect on alignment.

  • Consider running the same fastq files through multiple alignment pipelines specific to each application.

    • Gene expression -> Bowtie to transcriptome

    • Exon discovery -> RUM or other hybrid mapper

    • Variant detection -> GSNAP, GATK, Samtools mpileup

  • If your species has not been sequenced, use a de novo assembly method. You can also use the genome of a related species as a scaffold.

Output of most aligners: BAM/SAM file of reads and genome positions

Visualization of alignment data (BAM/SAM)

  • Genome browsers – UCSC, IGV, etc.



IGV is your friend.

[Figure: IGV screenshot annotated with a SNP, the coverage density plot, and read color = strand]

Aligned Reads to Gene Abundance

Total RNA

100bp Reads

Aligned Reads

Quantified isoform and gene expression


Aligned Reads to Gene Abundance: Challenges

[Figure: reads aligned to a long gene and a short gene]

Many approaches to quantify expression abundance

1000 reads in total:

Gene         Length   Reads
1 (Long)     200      400
2 (Medium)   100      400
3 (Short)    50       200

Relative abundance for these genes: f1 = 0.2, f2 = 0.4, f3 = 0.4
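
The relative abundances follow from dividing each gene's read count by its length and rescaling so the values sum to 1. The sketch below reads the slide's numbers as lengths of 200, 100, and 50 with 400, 400, and 200 of the 1000 reads; that pairing is an interpretation of the figure, chosen because it reproduces the stated answer.

```python
def relative_abundance(reads, lengths):
    """Length-normalized read rates, rescaled to sum to 1."""
    rates = [r / l for r, l in zip(reads, lengths)]
    total = sum(rates)
    return [rate / total for rate in rates]

f1, f2, f3 = relative_abundance(reads=[400, 400, 200], lengths=[200, 100, 50])
print(f1, f2, f3)
```

Although the long and medium genes receive the same number of reads, the long gene is twice as long, so its abundance is half that of the medium gene.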


Multireads: Reads Mapping to Multiple Genes/Transcripts

[Figure: reads that map to both the long gene and the medium gene]

Gene multireads

Reads that map to multiple genomic locations

[Figure: reads shared among Gene A, Gene B, and Gene C]

Isoform multireads

[Figure: reads from Exon 1, Exon 2, and Exon 3 shared among Isoform 1, Isoform 2, and Isoform 3]

Multireads: Reads Mapping to Multiple Genes/Transcripts

Gene         Length   Unique reads
1 (Long)     200      350
2 (Medium)   100      300
3 (Short)    50       200

Multireads: 150 (shared between genes 1 and 2)

Relative abundance for these genes, f1, f2, f3

Approach 1: Ignore Multireads

Use only the unique reads (350, 300, and 200) to estimate the relative abundances f1, f2, f3.

Nagalakshmi et al. Science 2008

Marioni et al. Genome Research 2008

  • Over-estimates the abundance of genes with only unique reads

  • Under-estimates the abundance of genes with multireads

  • Not an option at all if interested in isoform expression

Approach 2: Allocate Fraction of Multireads Using Estimates From Uniques

Allocate the 150 multireads between the genes they map to, in proportion to the abundance estimates from the unique reads, then estimate the relative abundances f1, f2, f3.

Ali Mortazavi et al. Nature Methods 2008

RSEM, Cufflinks
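
Repeating the allocation step until it stops changing gives a small EM-style loop. The sketch below is a deliberately simplified stand-in, not the algorithm RSEM or Cufflinks actually implements: it assumes the 150 multireads are equally compatible with genes 1 and 2, that coverage is uniform, and that the slide's numbers are lengths 200, 100, 50 and unique counts 350, 300, 200.

```python
def em_multireads(unique, lengths, shared, shared_genes, iterations=50):
    """Iteratively split `shared` multireads among `shared_genes` in
    proportion to each gene's current per-base coverage."""
    alloc = {g: 0.0 for g in shared_genes}

    def rates():
        return [(unique[g] + alloc.get(g, 0.0)) / lengths[g]
                for g in range(len(unique))]

    for _ in range(iterations):
        current = rates()                                  # estimate abundances
        weight = sum(current[g] for g in shared_genes)
        alloc = {g: shared * current[g] / weight           # reassign multireads
                 for g in shared_genes}

    final = rates()
    total = sum(final)
    return [r / total for r in final], alloc

fractions, alloc = em_multireads(
    unique=[350, 300, 200], lengths=[200, 100, 50],
    shared=150, shared_genes=[0, 1],
)
print(fractions, alloc)
```

The first pass of the loop is the one-step allocation from unique reads; further passes refine it, and with these numbers the estimates settle near the 0.2, 0.4, 0.4 abundances of the example without multireads.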


Conclusions for quantitation

  • Isoform-level estimates will become easier as read length increases.

  • EM approaches are currently the best option.

  • Bad alignment = bad quantitation

  • Pseudogenes are problematic.

  • Check your alignments -> adjust method. Iterative process.

Normalization

A speed bump on the road from raw counts to differential expression.

Large pool, small sample problem

  • A typical RNA library is estimated to contain 2.4 x 10^12 molecules (McIntyre et al. 2011).

  • Typical sequencing run = 25 million reads/sample.

  • This means that only about 0.00001 (1/1000th of a percent) of RNA molecules are sampled in a given run.

  • High-abundance transcripts are sampled more frequently.

    Example: Albumin = 13% of all reads in liver RNA-seq samples.

  • Sampling errors affect low-abundance transcripts most.
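
The arithmetic behind these bullets can be checked directly. The sampling-noise part of the sketch assumes read counts behave like Poisson draws, so the relative error is roughly 1/sqrt(expected count); that model and the example transcript fractions other than albumin's 13% are assumptions for illustration.

```python
import math

library_molecules = 2.4e12
reads_per_sample = 25e6

# Fraction of the library that ends up as a read.
fraction_sampled = reads_per_sample / library_molecules
print(f"fraction sampled: {fraction_sampled:.2e}")

def expected_reads(transcript_fraction):
    """Expected read count for a transcript making up this share of reads."""
    return transcript_fraction * reads_per_sample

def relative_sampling_error(transcript_fraction):
    """Coefficient of variation under a Poisson sampling assumption."""
    return 1 / math.sqrt(expected_reads(transcript_fraction))

for label, share in [("albumin (13%)", 0.13), ("1 in a million", 1e-6)]:
    print(label, expected_reads(share), relative_sampling_error(share))
```

A transcript at one part per million is expected to yield only about 25 reads, so its count fluctuates by roughly 20% from sampling alone, while albumin's count is essentially noise-free.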


A finite pool of reads.


Perfect world: All transcripts counted.

[Figure: Alb and Low1 counts in Sample 1 and Sample 2]

Real world: More reads taken up by highly expressed genes means fewer reads available for lowly expressed genes.

[Figure: Alb and Low1 counts in Sample 1]

[Figure: Alb and Low1 counts in Sample 1 and Sample 2]

Highly expressed genes that are differentially expressed can cause lowly expressed genes that are not actually differentially expressed to appear that way.

Normalization of raw counts

  • Wrong way to normalize data

    • Normalizing to the total number of mapped reads (e.g. FPKM). The top 10 most highly expressed genes soak up 20% of reads in the liver. FPKM is widely used, and problematic.

  • Better ways to normalize data

    • Normalize to the upper quartile (75th percentile) of non-zero counts, the median of scaled counts (DESeq), or the weighted trimmed mean of the log expression ratios (edgeR).
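
The difference between the two families of methods shows up in a small made-up example. The sketch below implements a median-of-ratios size factor in the spirit of the DESeq approach (a simplified illustration, not the package's code); the counts are invented so that only Alb truly changes between samples while both samples have the same total.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors. `counts` maps sample -> gene -> count."""
    samples = list(counts)
    genes = [g for g in counts[samples[0]]
             if all(counts[s][g] > 0 for s in samples)]
    # Log geometric mean of each gene across samples (the reference sample).
    log_ref = {g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
               for g in genes}
    return {s: math.exp(median([math.log(counts[s][g]) - log_ref[g]
                                for g in genes]))
            for s in samples}

# Invented counts: Alb is truly higher in sample2; the other genes are unchanged,
# but they receive fewer reads because both samples have 2000 reads in total.
counts = {
    "sample1": {"Alb": 1000, "g1": 100, "g2": 200, "g3": 300, "g4": 400},
    "sample2": {"Alb": 1500, "g1": 50, "g2": 100, "g3": 150, "g4": 200},
}

factors = size_factors(counts)
for sample, genes in counts.items():
    total = sum(genes.values())
    print(sample,
          "total-count g1:", genes["g1"] / total,
          "median-of-ratios g1:", genes["g1"] / factors[sample])
```

Scaling by total reads makes g1 look two-fold lower in sample2, while the median-of-ratios scaling gives g1 the same normalized value in both samples and leaves Alb as the only gene that differs.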


Normalizing for gene length: Longer transcripts will produce relatively more reads when fragmented.

[Figure: reads from a long gene and a short gene in Sample 1]

Only necessary if you are comparing the relative expression levels of two genes in one sample. TPM = Transcripts per million.

Summary

Experimental Design

QC and Analysis Pipeline

As sequences get longer, alignment and isoform quantitation become easier!

Resources

Aligner

  • Bowtie 2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

  • GSNAP http://research-pub.gene.com/gmap/

  • TopHat http://tophat.cbcb.umd.edu/

Transcript Abundance

  • Cufflinks http://cufflinks.cbcb.umd.edu/

  • IsoEM http://dna.engr.uconn.edu/?page_id=105

  • RSEM http://deweylab.biostat.wisc.edu/rsem/

  • MISO http://genes.mit.edu/burgelab/miso/

  • Cuffdiff http://cufflinks.cbcb.umd.edu/

  • DESeq http://www-huber.embl.de/users/anders/DESeq/

  • edgeR http://bioconductor.org/packages/release/bioc/html/edgeR.html

De novo Assembly

  • Trinity http://trinityrnaseq.sourceforge.net/

Acknowledgements

  • Narayanan Raghupathy, KB Choi

  • Gary Churchill

  • Ron Korstanje/ Karen Svenson/ Elissa Chesler

  • Joel Graber

  • Anuj Srivastava

  • Churchill Lab – Dan Gatti

  • Al Simons and Matt Hibbs

  • Lisa Somes, Steve Ciciotte, mouse room staff

  • Mice



Thank you!



Differential Expression

  • Over-estimation of variance -> too conservative

  • Under-estimation of variance -> too sensitive (many false positives)

Expression abundance tends to be overdispersed in RNA-seq measurements. A simple Poisson model would under-estimate the variance and produce many false positive calls.

Variance = Poisson term + a term that addresses over-dispersion

The majority of DE detection methods try to estimate the over-dispersion term in their own way (e.g. DESeq, edgeR, EBSeq, DSS, NBPSeq, baySeq).
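
For reference, the negative binomial parameterization commonly used for RNA-seq counts writes this decomposition explicitly. The symbols are notation assumed here rather than taken from the slide: Y is a gene's count, mu its mean, and phi the dispersion.

```latex
\operatorname{Var}(Y) \;=\; \underbrace{\mu}_{\text{Poisson}} \;+\; \underbrace{\phi\,\mu^{2}}_{\text{over-dispersion}}
```

Setting phi = 0 recovers the Poisson model, which is why ignoring over-dispersion under-estimates the variance.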



Multiple Testing Correction and False Discovery Rate

[Figure: XKCD, “Significant”]

2012 Ig Nobel Prize in Neuroscience for “finding brain activity signal in dead salmon using fMRI”
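
As one concrete illustration of false discovery rate control, the sketch below implements the Benjamini-Hochberg adjustment. The slide does not name a specific procedure, so this choice is an assumption; the p-values are invented to mimic running 20 tests where one falls just under 0.05 by chance.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented example: 20 tests, one nominally "significant" at p = 0.04.
p_values = [0.04] + [0.5] * 19
adjusted = benjamini_hochberg(p_values)
print("raw:", p_values[0], "adjusted:", adjusted[0])
```

After adjustment the lone small p-value no longer clears a 0.05 threshold, which is the point of correcting when thousands of genes are tested at once.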


Northern Blot

