Schedule change
Sponsored Links
This presentation is the property of its rightful owner.
1 / 34

Schedule change PowerPoint PPT Presentation


  • 72 Views
  • Uploaded on
  • Presentation posted in: General

Schedule change. Galaxy server going down for maintenance on Thursday. Day 2: AM - Introduction to RNA-Seq (and a touch of miRNA-Seq) Day 2: PM - RNA-Seq practical ( Tophat + Cuffdiff pipeline on Galaxy) Day 3: AM – Introduction to Exome Sequencing and Variant Discovery

Download Presentation

Schedule change

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Schedule change

Galaxy server going down for maintenance on Thursday

  • Day 2: AM - Introduction to RNA-Seq (and a touch of miRNA-Seq)

  • Day 2: PM - RNA-Seq practical (Tophat + Cuffdiff pipelineon Galaxy)

  • Day 3: AM –Introduction to Exome Sequencing and Variant Discovery

  • Day 3: PM - Exome sequence analysis practical (Galaxy)


Quick Recap

  • NGS data production becoming commonplace

  • Many applications -> research intent determines technology platform choice

  • High volume data BUT error prone

  • FASTQ is accepted format standard

  • Must assess quality scores before proceeding

  • ‘Bad’ data can be rescued


Introduction to RNAseq


The Central Dogma of Molecular Biology

Reverse

Transcription


RNAseq Protocols

  • cDNA, not RNA sequencing

  • Types of libraries available:

    • Total RNA sequencing (not advised)

    • polyA+ RNA sequencing

    • Small RNA sequencing (specific size range targeted)


cDNA Synthesis


Genome-scale Applications

  • Transcriptome analysis

  • Identifying new transcribed regions

  • Expression profiling

  • Resequencing to find genetic polymorphisms:

    • SNPs, micro-indels

    • CNVs

    • Question: Why even bother with exome sequencing then?


Sequencing details

  • Standard sequencing

    • polyA/total RNA

    • Size selection

    • Primers and adapters

    • Single- and paired-end sequencing

  • Strand-specific sequencing

    • still immature tech

    • Sequencing only + or – strand

    • Mostly paired-end


What about microarrays??!!!

  • Assumes we know all transcribed regions and that spliceforms are not important

  • Cannot find anything novel

  • BUT may be the best choice depending on QUESTION


Arrays vs RNAseq (1)

  • Correlation of fold change between arrays and RNAseq is similar to correlation between array platforms (0.73)

  • Technical replicates almost identical

  • Extra analysis: prediction of alternative splicing, SNPs

  • Low- and high-expressed genes do not match


RNA-Seq promises/pitfalls

  • can reveal in a single assay:

    • new genes

    • splice variants

    • quantify genome-wide gene expression

  • BUT

    • Data is voluminous and complex

    • Need scalable, fast and mathematically principled analysis software and LOTS of computing resources


Experimental considerations

  • Comparative conditions must make biological sense

  • Biological replicates are always better than technical ones

  • Aim for at least 3 replicates per condition

  • ISOLATE the target mRNA species you are after

  • NOT looking for new transcripts can bias expression estimates


Analysis strategies

  • De novo assembly of transcripts:

    + re-constructs actual spliced transcripts

    + does not require genome sequence

    easier to work post-transcriptional modifications

    - requires huge computational resources (RAM)

    - low sensitivity: hard to capture low abundance transcripts

  • Alignment to the genome => Transcript assembly

    + computationally feasible

    + high sensitivity

    + easier to annotate using genomic annotations

    - need to take special care of splice junctions


Basic analysis flowchart

Illumina

reads

Remove

artifacts

AAA..., ...N...

Clip adapters

(small RNA)

"Collapse"

identical

reads

Align

to the

genome

Pre-filter:

low complexity

synthetic

Count

and

discard

Re-align

with different

number of mismatches

etc

un-mapped

mapped

mapped

un-mapped

Assemble:

contigs (exons)

+ connectivity

Filter out low

confidence

contigs

(singletons)

Annotate


Software

  • Short reads aligners

    • Stampy, BWA, Novoalign, Bowtie, TOPHAT

  • Data preprocessing

    • Fastx toolkit

    • samtools

  • Expression studies

    • Cufflinks package

    • R packages (DESeq, edgeR, more…)

  • Alternative splicing

    • Cufflinks

    • Augustus


The ‘Tuxedo’ protocol

  • TOPHAT + CUFFLINKS

  • TopHat aligns reads to genome and discovers splice sites

  • Cufflinks predicts transcripts present in dataset

  • Cuffdiff identifies differential expression

    Very widely adopted suite


‘Tuxedo’ protocol limitations

  • Uses shortread data - Illumina OR SOLiD

  • Requires a sequenced genome

  • No GUI

  • Versions implemented in GALAXY are old(ish)


Read alignment with TopHat


Splice junctions

  • In humans, terminal exons are ~1kb long, and since mRNAs are ~2kb,

  • ~half of the reads should originate in initial and internal exons

  • Initial and internal exons are ~200b long

  • => for 75-mer reads, ~20% of reads are supposed to cross splice junctions

R

RNA:

Lexon

Genome:


Splice junctions strategies

  • Create a splice junctions database joining together donors and acceptors

  • Typically, use known (annotated) splice junctions or known splice sites

  • TopHat: uses putative exons from mapped reads, database is made of canonical splice sites around putative exons


Read alignment with TopHat (2)

  • Uses BOWTIE aligner to align reads to genome

  • BOWTIE cannot deal with large gaps (introns)

  • Tophat segments reads that remain unaligned

  • Smaller segments mostly end up aligning


Read alignment with TopHat (3)

  • When there is a large gap between segments of same read -> probable INTRON

  • Tophat uses this to build an index of probable splice sites

  • Allows accurate measurement of spliceform expression

  • Possibility of detecting gene fusion events


Cufflinks package

  • http://cufflinks.cbcb.umd.edu/

  • Cufflinks:

    • Expression values calculation

    • Transcripts de novo assembly

  • Cuffcompare:

    • Transcripts comparison (de novo/genome annotation)

  • Cuffdiff:

    • Differential expression analysis


Cufflinks: Transcript assembly

  • Assembles individual transcripts based on aligned reads

  • Infers likely spliceforms of each gene

  • Builds ‘transfrags’

    • The smallest number of spliceforms that can be explained by the data

    • NOTE:assembly errors do occur -> sequencing depthhelps


Cufflinks: Transcript assembly (2)

  • Quantifies expression level of each transfrag

  • Filters out those likely to be premature terminations, non-mature mRNAs, etc


Cuffmerge

  • Merges transfrags into transcripts where appropriate

  • Also performs a reference based assembly of transcripts using known transcripts

  • Produces single annotation file which aids downstream analysis


Cuffdiff: Differential expression

  • Calculates expression level in two or more samples

  • Expression level relates to read abundance

  • Because of bias sources, cuffdiff tries to model the variance in its significance calculation

    What else is important?


FPKM (RPKM): Expression Values

  • Fragments Reads Per Kilobase of exon model per Million mapped fragments

  • Nat Methods. 2008, Mapping and quantifying mammalian transcriptomes by RNA-Seq. Mortazavi A et al.

C= the number of reads mapped onto the gene's exons

N= total number of reads in the experiment

L= the sum of the exons in base pairs.


Cufflinks (Expression analysis)

gene_id bundle_id chr left right FPKM FPKM_conf_lo FPKM_conf_hi status

ENSG00000236743 31390 chr1 459655 461954 0 0 0 OK

ENSG00000248149 31391 chr1 465693 688071 787.12 731.009 843.232 OK

ENSG00000236679 31391 chr1 470906 471368 0 0 0 OK

ENSG00000231709 31391 chr1 521368 523833 0 0 0 OK

ENSG00000235146 31391 chr1 523008 530148 0 0 0 OK

ENSG00000239664 31391 chr1 529832 532878 0 0 0 OK

ENSG00000230021 31391 chr1 536815 659930 2.53932 0 5.72637 OK

ENSG00000229376 31391 chr1 657464 660287 0 0 0 OK

ENSG00000223659 31391 chr1 562756 564390 0 0 0 OK

ENSG00000225972 31391 chr1 564441 564813 96.9279 77.2375 116.618 OK

ENSG00000243329 31391 chr1 564878 564950 0 0 0 OK

ENSG00000240155 31391 chr1 564951 565019 0 0 0 OK


Cuffdiff (differential expression)

  • Pairwise or time series comparison

  • Normal distribution of read counts

  • Fisher’s test

test_idgenelocussample_1sample_2statusvalue_1value_2ln(fold_change)test_statp_valuesignificant

ENSG00000000003TSPAN6chrX:99883666-99894988q1q2NOTEST00001no

ENSG00000000005TNMDchrX:99839798-99854882q1q2NOTEST00001no

ENSG00000000419DPM1chr20:49551403-49575092q1q2NOTEST15.077523.86270.459116-1.395560.162848no

ENSG00000000457SCYL3chr1:169631244-169863408q1q2OK32.562616.5208-0.67854115.81860yes


Visualization: Genome Viewers

  • Web based:

    • UCSC Genome Browser (http://genome.ucsc.edu/)

  • Standalone

    • Integrated Genome Viewer (http://www.broadinstitute.org/software/igv/)


RNAseq hands-on practical (Galaxy)

  • Data QC and trimming

  • Aligning reads to reference genome

  • Running CUFFLINKS and looking at some transcripts using the UCSC genome browser

  • Finding differentially expressed genes with CUFFDIFF


  • Login