RNA seq (I) - PowerPoint PPT Presentation

Presentation Transcript

  1. RNA seq (I) Edouard Severing

  2. A typical heat stress experiment (climate change). Economically important frog. Conditions: heat stress (convection) vs. control. Timeline labels: 85 minutes, 5 days. How does the frog adapt and survive?

  3. Coping with heat stress • The frog likely has to change several processes in order to cope with the heat stress: • Adaptation of metabolic pathways. • Prevention of water loss through the skin. • Changes in the concentration of several enzymes, other proteins and molecules. • We want to determine these changes in molecule concentration, • Starting with proteins.

  4. Changes at the molecular level • We could measure protein concentration directly • Not often done on a large scale • We could measure changes in the expression of the genes that encode these proteins. • Gene expression can be approximated by measuring the amount of mRNA molecules that are produced by the gene.

  5. Gene count and complexity: 20,000 genes vs. 25,000 genes

  6. From genes to proteins (I) Initial assumption: N protein-coding genes → N mRNA molecules → N proteins. This assumption is based on studies that were performed on bacterial systems.

  7. From genes to proteins (II) Current view: N protein-coding genes → X N mRNA molecules → ? N proteins. What happens here?

  8. Splicing: Gene → pre-mRNA (5'- Exon, Intron, Exon, Intron, Exon -3') → splicing → mRNA (5'- Exon Exon Exon -3')

  9. Alternative splicing: the same pre-mRNA (5' to 3') can be spliced in different ways, giving different mRNAs.

  10. Gene count and complexity: 90% of genes have AS vs. 60% of genes have AS. The average number of transcripts produced by human genes is also higher than the average number of transcripts produced by plant genes.

  11. An extreme case: the Drosophila Dscam gene can potentially produce over 38,000 different transcripts.

  12. Major alternative splicing event types: In humans, exon skipping is the most frequent AS event type. In plants, intron retention is the most common AS event type.

  13. RNA editing: primary transcript (predicted sequence) → RNA editing → edited transcript (observed sequence). In the figure, a C in the predicted sequence is observed as a U after editing. Difficulty: distinguishing genuine RNA editing from sequencing errors.

  14. Not everything is translated • A large fraction (>30%) of transcripts of protein coding genes are degraded by the nonsense-mediated decay (NMD) pathway. • The position of the stop codon is used to predict whether a transcript is likely to be degraded by the NMD pathway

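  15. Detecting putative NMD candidates: pre-mRNA → mRNA with exon/exon junctions. The open reading frame runs from the start codon (M) to the stop codon. A transcript is a putative NMD candidate when the distance d from the stop codon to a downstream exon/exon junction is > 50-55 nt.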
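The distance rule on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the code of any particular tool: the function name, the coordinate convention (0-based mRNA coordinates, stop codon end exclusive) and the choice to measure d to the last exon/exon junction are assumptions made here.

```python
def is_nmd_candidate(stop_codon_end, exon_lengths, min_distance=50):
    """Return True if the stop codon ends more than `min_distance` nt
    upstream of the last exon/exon junction of the spliced mRNA.

    stop_codon_end: 0-based, exclusive end of the stop codon on the mRNA
                    (hypothetical convention chosen for this sketch).
    exon_lengths:   lengths of the exons in 5' to 3' order.
    """
    if len(exon_lengths) < 2:
        return False  # a single-exon transcript has no exon/exon junction
    # mRNA coordinate at which the last exon starts = last junction
    last_junction = sum(exon_lengths[:-1])
    d = last_junction - stop_codon_end
    return d > min_distance


exons = [200, 150, 100]  # last junction at mRNA position 350
print(is_nmd_candidate(250, exons))  # d = 100
print(is_nmd_candidate(320, exons))  # d = 30
```

The threshold is a parameter because the slide gives a range (50-55 nt) rather than a single value.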

  16. Remember • The number of unique mRNA molecules is much larger than the number of genes. • A large fraction of the mRNA molecules is degraded by the NMD pathway. • NMD provides a means to regulate gene-expression at the post-transcriptional level

  17. Process the frogs into reads for analysis: grind (N2) → prepare for sequencing → sequencing → reads (>s1 ATCGTAGGGTA, >s2 ATGGCCTAGGT) → bioinformatics

  18. Basic transcriptome analysis steps • Many research questions require the following steps: • Reconstruction of the transcriptome • We usually only have fragments • Quantification of the transcriptome • Differential expression analysis • Other fun stuff.

  19. de novo transcriptome reconstruction (I)

  20. de novo transcriptome reconstruction (II)

  21. Genome-guided transcriptome reconstruction (figure: mRNA aligned to the genome, 5' to 3')

  22. Genome-guided transcriptome reconstruction

  23. Genome-guided transcriptome reconstruction

  24. Remember • de novo transcriptome assembly • When no reference genome is available • Finding features which are not on the reference genome (e.g. T-DNA insertions) • Programs: Trinity, Trans-ABySS, Velvet/Oases • Genome-guided transcriptome reconstruction • Reference genome is available with or without annotation • Mapping programs: TopHat, GSNAP • Transcriptome reconstruction: Scripture, Cufflinks

  25. RNA seq (II) Quantification Edouard Severing

  26. A typical heat stress experiment Heat stress (convection) Control

  27. Raw counts • Counting the number of reads/fragments falling within exonic regions of a gene. • Example: htseq-count (figure: gene with Exon 1 to Exon 4)
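Raw counting can be illustrated with a toy sketch. It only shows the idea of counting reads that fall within exons; it is not how htseq-count is implemented (that tool has several overlap modes and handles spliced alignments). The function name and the half-open (start, end) interval convention are assumptions.

```python
def count_reads_in_exons(exons, reads):
    """Count reads that lie completely inside one exon of a gene.

    exons, reads: lists of (start, end) tuples, half-open intervals
                  on the same chromosome (assumed convention).
    """
    count = 0
    for read_start, read_end in reads:
        inside = any(
            exon_start <= read_start and read_end <= exon_end
            for exon_start, exon_end in exons
        )
        if inside:
            count += 1
    return count


exons = [(100, 200), (300, 400)]
reads = [(110, 160), (190, 240), (310, 360), (500, 550)]
print(count_reads_in_exons(exons, reads))  # 2: the first and third read
```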

  28. The same fragment count yet different expression levels: library size matters. (Figure: Exon 1 with the same fragment count in two libraries of different size.)

  29. The same fragment count yet different expression levels: transcript/gene length matters. (Figure: a gene with Exon 1 to Exon 4 compared with a gene with only Exon 1.)

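  30. Normalizing/correcting for feature length and library size: $\mathrm{RPKM} = \frac{10^9 \cdot \text{reads mapped to region}}{\text{feature length (nt)} \cdot \text{all mapped reads}}$. Example from the slide: feature length = 300 nt, all mapped reads = 10,000,000, RPKM ≈ 1.7.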
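The RPKM definition (reads per kilobase of feature per million mapped reads) is easy to check numerically. In the sketch below the feature length and library size are the slide's values; the read count of 5 is an assumption chosen for illustration, since the count itself is not given in the transcript. It happens to give a value that rounds to the slide's ≈ 1.7.

```python
def rpkm(reads_in_feature, feature_length_nt, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads."""
    return reads_in_feature * 1e9 / (feature_length_nt * total_mapped_reads)


# 300 nt and 10,000,000 reads are from the slide; 5 reads is assumed.
value = rpkm(reads_in_feature=5, feature_length_nt=300,
             total_mapped_reads=10_000_000)
print(round(value, 1))  # 1.7
```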

  31. Normalizing/correcting for feature length: FPKM is analogous to RPKM, but counts fragments (read pairs) instead of individual reads. Figure labels: RPKM = 2, FPKM = 1, RPKM = 1. A different picture emerges from raw counts than from RPKM/FPKM values.

  32. Counting method issues • What to do with reads that map to multiple isoforms (alternative splicing) or genes? Pure random assignment? No, expression can differ. Count multiple times? No, the read has been derived from a single transcript. (Figure: Gene 1 with Isoform 1 and Isoform 2; Gene 2 with Isoform 1.)

  33. Count issues: Back to the gene level (I)

  34. Count issues: Back to the gene level (II)

  35. Statistical methods: Expression levels of transcripts

  36. Fishing in the dark lake experiment Question: What fraction (t) of the fish in the lake is green? Method: We catch a number of fish and determine what fraction is green. Caution: Fish have to be immediately thrown back in the water.

  37. Fishing in the dark lake results (I) Sane people would do: Sample(X) Fraction of fish that is green t = 1/3

  38. Fishing in the dark lake results (II): maximum likelihood estimate of t. Given the sample X, consider the probability of observing X for a certain t, $P(X \mid t)$, and find the t that maximizes this probability. (Plot: probability of the observation against t.)
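The maximum likelihood idea can be tried out with a small sketch. Because every fish is thrown back, each catch is an independent draw, so the probability of a sample with g green fish out of n is proportional to t^g (1 - t)^(n - g). The sample of 1 green fish out of 3 is an assumption that matches the t = 1/3 of the previous slide, and the grid search is just one simple way to find the maximum.

```python
def likelihood(t, n_green, n_total):
    """Probability (up to a constant factor) of the sample given fraction t."""
    return t ** n_green * (1 - t) ** (n_total - n_green)


def ml_estimate(n_green, n_total, steps=1000):
    """Grid search for the t in [0, 1] that maximizes the likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: likelihood(t, n_green, n_total))


# Assumed sample: 1 green fish among 3 caught.
print(ml_estimate(1, 3))
```

The grid maximum lies next to the intuitive estimate (green fish caught divided by fish caught), which is the point of the slide: the "sane" answer and the maximum likelihood answer agree.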

  39. Fishing in a complex dark lake. Transcript quantification using RNA-seq is like fishing in a dark lake with fragmented fish. We are also forced to determine the possible origin(s) of the fish fragments. (Figure caption: only lost an eye and a fin, but not its tail.)

  40. Estimating relative transcript abundances. Target: α1 (Transcript 1) and α2 (Transcript 2). Observation: fragmentation → sequencing (>s1 ATCGTAGGGTA, >s2 ATGGCCTAGGT) → read mapping. Which values of α1 and α2 give the highest probability of observing these reads? (α1 + α2 = 1)

  41. Maximum likelihood alignments • The likelihood of our observation ($\lambda$) corresponds to the product of the probabilities of observing each of the individual mapped reads ($r_j$) in our set ($R$): $\lambda = \prod_{r_j \in R} P(r_j)$

  42. Probability of observing a read • The probability of observing a read $r_j$ is the sum of the individual probabilities that the read originates from each transcript ($t$) in our transcript set ($T$): $P(r_j) = \sum_{t \in T} P(r_j \text{ originated from } t)$

  43. Component 1: Compatibility. Does read j map to transcript t? Example: t = 1: K_j1 = 1; t = 2: K_j2 = 1; t = 3: K_j3 = 0

  44. Component 2: Sequencing a read from a specific transcript. The probability of "sequencing" a read from transcript t is based on the product of the relative expression level and the length of transcript t.

  45. Component 2: Sequencing a read from a specific transcript • Longer transcripts produce more fragments than shorter transcripts at equal expression levels. Why α_t · l_t and not just α_t? (Figure: two transcripts with α1 = α2; after fragmentation the longer one yields more fragments.)

  46. Component 2 (example): l1 = 200, α1 = 0.3; l2 = 150, α2 = 0.2; l3 = 50, α3 = 0.5. Adjust for length, then normalize.
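The "adjust for length, then normalize" step can be worked through with the numbers on this slide. The sketch assumes that "adjust for length" means multiplying each α_t by l_t, as described for component 2, and that "normalize" means dividing by the sum; the function name is made up for illustration.

```python
def read_origin_probabilities(lengths, alphas):
    """Probability that a sequenced read comes from each transcript:
    alpha_t * l_t, normalized over all transcripts."""
    weights = [length * alpha for length, alpha in zip(lengths, alphas)]
    total = sum(weights)
    return [w / total for w in weights]


lengths = [200, 150, 50]   # l1, l2, l3 from the slide
alphas = [0.3, 0.2, 0.5]   # alpha1, alpha2, alpha3 from the slide
probs = read_origin_probabilities(lengths, alphas)
print([round(p, 3) for p in probs])  # [0.522, 0.261, 0.217]
```

Transcript 3 is the most abundant (α3 = 0.5) but, being short, is expected to contribute the fewest reads; transcript 1 contributes the most.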

  47. Component 3: Probability of originating from position q on transcript t In the case of no bias:
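For the no-bias case, a common modelling assumption is that a fragment is equally likely to start at any position where it still fits on the transcript. The sketch below encodes that assumption; it is not necessarily the exact expression shown on the slide, and the function and parameter names are hypothetical.

```python
def position_probability(transcript_length, fragment_length):
    """Uniform (no-bias) probability of one particular start position.

    A fragment of length f fits at (l_t - f + 1) start positions on a
    transcript of length l_t; each is assumed to be equally likely.
    """
    n_positions = transcript_length - fragment_length + 1
    if n_positions <= 0:
        return 0.0  # the fragment is longer than the transcript
    return 1.0 / n_positions


print(position_probability(300, 200))  # 1/101: 101 possible start positions
```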

  48. Components: Fragment comes from a certain position of the transcript (I). Figure labels: 300 nt, 200 nt; occurrence; more likely.

  49. Components: Fragment comes from a certain position of the transcript (II). Not all regions are equally covered. (Figure: four panels of coverage frequency along a transcript.)

  50. Search for abundances that best explain the observed fragments. The method used to find the optimum differs per program. (Trapnell et al. 2010)
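One widely used way to search for such abundances is expectation-maximization (EM). The sketch below is a toy version under simplifying assumptions, not the optimizer of any specific program: it combines the compatibility K_jt, the length-weighted abundance α_t · l_t, and a uniform position probability of 1 / l_t (fragment length and coverage bias are ignored). All names are hypothetical.

```python
def estimate_abundances(compat, lengths, n_iter=500):
    """Toy EM estimate of relative transcript abundances (alpha).

    compat[j][t] is 1 if read j maps to transcript t, else 0.
    lengths[t] is the length of transcript t.
    """
    n_t = len(lengths)
    alpha = [1.0 / n_t] * n_t  # start from equal abundances
    for _ in range(n_iter):
        # E-step: split every read over its compatible transcripts.
        expected = [0.0] * n_t
        for row in compat:
            # compatibility * (alpha_t * l_t) * position probability (1 / l_t)
            weights = [k * (a * l) * (1.0 / l)
                       for k, a, l in zip(row, alpha, lengths)]
            total = sum(weights)
            if total == 0.0:
                continue  # read is compatible with no transcript
            for t in range(n_t):
                expected[t] += weights[t] / total
        # M-step: expected reads per nucleotide, normalized to sum to 1.
        per_nt = [e / l for e, l in zip(expected, lengths)]
        norm = sum(per_nt)
        alpha = [p / norm for p in per_nt]
    return alpha


# 4 reads unique to a 200 nt transcript, 2 reads unique to a 100 nt one:
reads = [[1, 0]] * 4 + [[0, 1]] * 2
print(estimate_abundances(reads, [200, 100]))  # equal abundances
```

In this example the longer transcript has twice as many reads but also twice the length, so both transcripts come out equally abundant. Reads that map to both transcripts would be shared in proportion to the current abundance estimates in each iteration.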