RNA-Seq Assembly: Fundamental Limits, Algorithms and Software

David Tse, Stanford University. Symposium on Turbo Codes and Iterative Information Processing, Bremen, Germany, August 20, 2014. Joint work with Sreeram Kannan and Lior Pachter.

Presentation Transcript


  1. RNA-Seq Assembly: Fundamental Limits, Algorithms and Software. David Tse, Stanford University. Symposium on Turbo Codes and Iterative Information Processing, Bremen, Germany, August 20, 2014. Joint work with Sreeram Kannan and Lior Pachter. Research supported by NSF Center for Science of Information.

  2. Communication system design 1) Establish fundamental limits. 2) Design codes and algorithms to approach the limit. 3) Implement a system. We apply this methodology to the RNA-Seq assembly problem.

  3. DNA sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

  4. High throughput sequencing revolution

  5. Shotgun sequencing read

  6. Sequencing Technologies

  7. High throughput sequencing: Microscope in the big data era. Assembly. Genomic variations, 3-D structures, transcription, translation, protein interaction, etc. Today’s focus: RNA sequencing.

  8. Central dogma of molecular biology. DNA → (transcription) → RNA → (translation) → Protein. RNA transcripts and their abundances capture the dynamic state of a cell at a given time.

  9. From DNA to RNA. [Figure: a DNA segment with exons (ATC, GAT, CAT, TCG) separated by introns, annotated "1000’s to 10,000’s symbols long"; two RNA transcripts, ATC CAT TCG and GAT TCG, labeled RNA Transcript 1 and RNA Transcript 2.] Alternative splicing yields different isoforms.

  10. Transcriptome. [Figure: two transcripts, ATC CAT TCG and GAT TCG, labeled "20 copies in cell" and "30 copies in cell".] • Different transcripts are present at different abundances. • Transcriptome is the mixture of transcripts from all the genes. • Human transcriptome has 10,000’s of transcripts from 20,000 genes.

  11. RNA-Seq (Mortazavi et al, Nature Methods 08)

  12. RNA-Seq assembly. [Figure: the transcriptome (transcripts built from ATC, CAT, TCG and GAT) is sequenced into short reads; the assembler reconstructs the transcripts from the reads.]

  13. RNA assembly: state-of-the-art. Popular assemblers diverge significantly when fed the same input. [Venn diagram comparing the transcripts reported by IsoLasso, Cufflinks and Scripture; region counts as extracted: 448216, 6457, 24243, 9741, 7553, 5588, 59647.] Source: Wei Li et al, JCB 2011; data from the ENCODE project.

  14. Assembly as a software engineering problem • A single sequencing experiment can generate 100’s of millions of reads, 10’s to 100’s of gigabytes of data. • Primary concerns are to minimize time and memory requirements. • No guarantee on the optimality of assembly quality, and in fact no optimality criterion at all.

  15. A new approach • Establish information theoretic limits under simplifying assumptions. • Design an assembly algorithm that achieves close to the limits. • Build software and test on simulated and real data.

  16. Information theoretic limits. Basic question: What is the length, number and error rate of the reads needed for reliable reconstruction of a transcriptome? A simplified question: What is the minimum read length Lcritical needed, assuming infinite noiseless reads? (cf. earlier work on DNA assembly: Bresler, Bresler and T. 2013 BMC Bioinformatics, Motahari et al 2013 ISIT)

  17. Sequencing Technologies

  18. Lcritical depends on repeats. Lcritical is a measure of the repeat complexity of the transcriptome from the point of view of assembly.

  19. What is Lcritical for a transcriptome? Lcritical depends on: • intra-transcript repeats • inter-transcript repeats in the transcriptome.

  20. Intra-transcript repeats: interleaved repeats. [Figure: a single transcript containing interleaved repeats, with repeat copies marked L-1 and the read length marked L.] Lcritical is lower bounded by the length of the longest intra-transcript interleaved repeat.

  21. Inter-transcript repeats. Lcritical is typically much larger due to inter-transcript repeats of exons across isoforms. [Figure: isoforms ATC CAT TCG and GAT TCG sharing exons that are 100’s of symbols long.]

  22. Ambiguity due to inter-transcript repeats. L = read length. [Figure: transcript 1 (exons s1, s3, s4) and transcript 2 (exons s2, s3, s5) share the exon s3, marked with length L-1.]
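The ambiguity on this slide can be checked numerically. The following is a minimal sketch, assuming idealized noiseless reads (every length-L substring of every transcript is observed) and ignoring abundances. The exon sequences and the names `read_spectrum` and `distinguishable` are hypothetical, chosen only for illustration. Two transcriptomes that pair the exons around the shared exon s3 differently produce the same read set whenever no read is long enough to bridge s3.

```python
def read_spectrum(transcripts, L):
    """Set of all length-L substrings (idealized noiseless reads)."""
    return {t[i:i + L] for t in transcripts for i in range(len(t) - L + 1)}

# Hypothetical exon sequences, not taken from the slides.
s1, s2, s3, s4, s5 = "ATCG", "GGTA", "CATCAT", "TTGC", "AGGA"

# Two different transcriptomes that both contain the shared exon s3.
transcriptome_a = [s1 + s3 + s4, s2 + s3 + s5]
transcriptome_b = [s1 + s3 + s5, s2 + s3 + s4]

def distinguishable(L):
    """True if reads of length L tell the two transcriptomes apart."""
    return read_spectrum(transcriptome_a, L) != read_spectrum(transcriptome_b, L)

# Reads of length len(s3) + 1 cannot bridge s3 with a base on each side.
short_reads_ok = distinguishable(len(s3) + 1)
# Reads of length len(s3) + 2 can bridge s3, which resolves the pairing here.
long_reads_ok = distinguishable(len(s3) + 2)
```

In this toy instance `short_reads_ok` is `False` and `long_reads_ok` is `True`, matching the picture of a shared region of length L-1 defeating reads of length L.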

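  23. Ambiguity due to inter-transcript repeats. L = read length. [Figure: four transcripts built from exons s1 to s5, each containing the shared exon s3 (marked L-1). Labels as extracted: transcript 1: s3 s1 s4; transcript 2: s3 s5 s2; transcript 3: s3 s1 s4; transcript 4: s3 s5 s1.]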

  24. Abundance diversity. [Figure: transcript abundances in a lymphoblastoid cell line, Geuvadis dataset.]

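  25. Equal abundance vs. generic abundances. [Figure: transcripts built from exons s1 to s5 around the shared exon s3 (marked L-1). With equal abundances the pairing of exons on either side of s3 is ambiguous (marked "?"); with generic abundances a, b, c the pairing is determined.] Unique generic solution, also sparse.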
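The role of generic abundances can be illustrated with a small sketch. It assumes the simplest case only: each flow entering the shared exon leaves on exactly one outgoing edge, so incoming and outgoing edges can be paired by equal flow. The function name `match_by_flow` and the numbers are hypothetical, and this is not claimed to be the algorithm from the talk.

```python
def match_by_flow(in_flows, out_flows, tol=1e-9):
    """Pair incoming and outgoing edges of a node by equal flow.

    in_flows / out_flows map a neighbouring exon name to the flow on the edge.
    Returns a list of (in_exon, out_exon) pairs, or None if the pairing is
    not unique (for example when several abundances coincide).
    """
    pairs = []
    for u, f_in in in_flows.items():
        candidates = [v for v, f_out in out_flows.items()
                      if abs(f_in - f_out) <= tol]
        if len(candidates) != 1:
            return None  # zero or several matching out-edges: ambiguous
        pairs.append((u, candidates[0]))
    # Every outgoing edge must be used exactly once.
    if sorted(v for _, v in pairs) != sorted(out_flows):
        return None
    return pairs

# Generic abundances: the pairing around the shared exon is forced.
generic = match_by_flow({"s1": 0.12, "s2": 0.88}, {"s4": 0.12, "s5": 0.88})

# Equal abundances: both pairings explain the flows equally well.
equal = match_by_flow({"s1": 0.5, "s2": 0.5}, {"s4": 0.5, "s5": 0.5})
```

Here `generic` is `[("s1", "s4"), ("s2", "s5")]`, while `equal` is `None`, which mirrors the "?" in the equal-abundance panel.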

  26. Unresolvable intra-transcript repeats with generic abundances. [Figure: transcripts built from exons s1 to s5 with abundances a, b, c, and an alternative solution with abundances a-c, b+c, c that explains the same reads.] Yields a lower bound for Lcritical for a given transcriptome.

  27. Algorithm: reduction to sparsest flow • Create a splice graph where each node is an exon. • Read copy counts give edge flows. • Transcripts are extracted by solving a sparsest flow problem. [Figure: splice graph on exons s1 to s5 with edge flows 0.12 and 0.88 meeting at the shared exon s3.]
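To make the splice-graph picture concrete, here is a minimal sketch of the forward model: given transcripts as exon paths with abundances, it computes the flow on each exon-to-exon edge. In practice the flows would be estimated from read counts; here they are derived from a known toy transcriptome. The exon paths, the abundances 0.12 and 0.88, and the names `edge_flows` and `is_conserved` are assumptions for illustration.

```python
from collections import defaultdict
import math

def edge_flows(transcripts):
    """Sum transcript abundances over each consecutive exon pair (edge)."""
    flows = defaultdict(float)
    for path, abundance in transcripts:
        for u, v in zip(path, path[1:]):
            flows[(u, v)] += abundance
    return dict(flows)

def is_conserved(flows, node):
    """Check that the flow into an internal node equals the flow out of it."""
    inflow = sum(f for (u, v), f in flows.items() if v == node)
    outflow = sum(f for (u, v), f in flows.items() if u == node)
    return math.isclose(inflow, outflow)

# Hypothetical two-transcript example sharing the exon s3.
example = [(("s1", "s3", "s4"), 0.12), (("s2", "s3", "s5"), 0.88)]
flows = edge_flows(example)
```

The resulting `flows` dictionary has four edges, two carrying 0.12 and two carrying 0.88, and the shared exon s3 satisfies flow conservation. Recovering the transcripts means decomposing these edge flows back into paths.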

  28. Sparsest Flow Decomposition • Problem is NP-hard. [Vatinlen et al ’08, Hartman et al ’12] • Closer look at hard instances: most paths have the same flow. • Equivalent to: most transcripts have the same abundance (!) • This is not characteristic of the biological problem. • Our result: • Assume that abundances are generic. • Propose a provably correct algorithm that reconstructs when L > Lsuff. • Algorithm is linear time under this condition.
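For orientation, the sketch below decomposes edge flows on a splice graph into weighted paths using a generic greedy heuristic: repeatedly peel off the source-to-sink path with the largest bottleneck flow. This is a standard baseline, not the provably correct algorithm described in the talk, and it does not guarantee the sparsest decomposition. It assumes the graph is acyclic and the flows are conserved; the name `greedy_decompose` and the example numbers are hypothetical.

```python
from collections import defaultdict
from math import inf

def greedy_decompose(flows, tol=1e-9):
    """Greedy widest-path decomposition of conserved flows on a DAG.

    flows maps (u, v) edges to non-negative flow. Returns a list of
    (path, weight) pairs whose superposition reproduces the input flows.
    """
    residual = dict(flows)
    paths = []
    while True:
        live = {e: f for e, f in residual.items() if f > tol}
        if not live:
            break
        succ = defaultdict(list)
        heads = set()
        for u, v in live:
            succ[u].append(v)
            heads.add(v)
        sources = [u for u in succ if u not in heads]
        best = {}

        def widest(u):
            # Best (bottleneck, path) from u to any sink, memoized per round.
            if u not in succ:
                return (inf, (u,))
            if u not in best:
                options = []
                for v in succ[u]:
                    width, tail = widest(v)
                    options.append((min(live[(u, v)], width), (u,) + tail))
                best[u] = max(options)
            return best[u]

        width, path = max(widest(s) for s in sources)
        paths.append((path, width))
        for e in zip(path, path[1:]):
            residual[e] -= width  # the bottleneck edge drops to exactly zero
    return paths

# Hypothetical flows: two transcripts sharing the exon s3.
toy_flows = {("s1", "s3"): 0.12, ("s3", "s4"): 0.12,
             ("s2", "s3"): 0.88, ("s3", "s5"): 0.88}
decomposition = greedy_decompose(toy_flows)
```

On this toy input the heuristic returns the path s2, s3, s5 with weight 0.88 followed by s1, s3, s4 with weight 0.12. On harder instances, in particular when many paths carry the same flow, a greedy choice like this can return more paths than necessary or an incorrect pairing.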

  29. Informational limits: summary. Lcritical of a transcriptome. [Figure: read length axis L starting at 0, with one region labeled "No algo. can reconstruct" and another labeled "Proposed algo. can reconstruct in linear time", separated at Lcritical.] On many reference transcriptomes, these two bounds match, establishing Lcritical!

  30. From theory to software

  31. ShannonRNA: simulated reads. Chr 15 Gencode reference transcriptome, 1700 transcripts. L = 100, 1M reads, 1% error rate. [Plots: sensitivity (fraction of transcripts recovered) and specificity (false positive rate) versus coverage depth of transcripts.]

  32. Performance on real reads • RNA sample from Human Embryonic Stem Cell. • Simultaneously sequenced using long PacBio reads and short Illumina reads. • Long reads are fewer in number. • Read length = 50, 20 million reads. • Long read assembly as a proxy for ground truth. • [Au et al, PNAS 2013]

  33. ShannonRNA: real reads. No. of transcripts: 800. ShannonRNA: 527. Trinity: 476. Running time: Trinity 3 hrs, ShannonRNA 5 hrs. [Plot: sensitivity (fraction of transcripts recovered) versus coverage depth of transcripts.]

  34. Zooming In Abundance of Transcripts Reconstructed (Segregated by number of Isoforms)

  35. Summary • An approach to RNA assembly design based on principles of information theory. • Driven by and tested on transcriptomics data. • Goal is to build robust, scalable software with performance guarantees.

  36. Acknowledgements: Sreeram Kannan (Berkeley), Lior Pachter (Berkeley), Joseph Hui (Berkeley), Kayvon Mazooji (Berkeley)
