
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads



Presentation Transcript


  1. An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads. Serghei Mangul, Department of Computer Science, Georgia State University. Joint work with Adrian Caciula, Sahar Al Seesi, Dumitru Brinza, Abdul Rouf Banday, Rahul N. Kanadia, Ion Mandoiu and Alex Zelikovsky

  2. Advances in Next Generation Sequencing • Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length • Roche/454 FLX Titanium: 400-600 million reads/run, 400bp avg. length • SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50bp read length • Ion Proton Sequencer • Source: http://www.economist.com/node/16349358

  3. RNA-Seq • Make cDNA & shatter into fragments • Sequence fragment ends • Map reads • Downstream analyses: Gene Expression, Isoform Expression, Transcriptome Reconstruction [Figure: a gene with exons A B C D E and isoforms built from subsets of those exons]

  4. Transcriptome Reconstruction • Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.

  5. Transcriptome Reconstruction Types • Genome-independent reconstruction (de novo) • de Bruijn k-mer graph • Genome-guided reconstruction (ab initio) • Spliced read mapping • Exon identification • Splice graph • Annotation-guided reconstruction • Use existing annotation (known transcripts) • Focus on discovering novel transcripts

  6. Previous approaches • Genome-independent reconstruction • Trinity (2011), Velvet (2008), TransABySS (2008) • Genome-guided reconstruction • Scripture (2010) • Reports “all” transcripts • Cufflinks (2010), IsoLasso (2011), SLIDE (2012) • Minimize the set of transcripts explaining the reads • Annotation-guided reconstruction • RABT (2011), DRUT (2011)

  7. Challenges and Solutions • Challenge: read length is currently much shorter than transcript length • Solutions: paired-end reads, fragment length distribution

  8. Gene exons: 1 2 3 4 5 6 7 • t1 : 1 2 3 4 5 6 7 • t2 : 1 3 4 5 6 7 • t3 : 1 2 3 4 5 7 • t4 : 1 3 4 5 7 • Exons 2 and 6 are “distant” exons: how to phase them?

  9. TRIP: Transcriptome Reconstruction using Integer Programming • Map the RNA-Seq reads to the genome • Construct the splice graph G(V,E) • V: exons • E: splicing events • Generate candidate transcripts • depth-first search (DFS) • Filter candidate transcripts • fragment length distribution (FLD) • integer programming
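The candidate-generation step on this slide can be illustrated with a minimal sketch. It assumes the splice graph is a directed acyclic graph given as an edge list, and that candidate transcripts are all paths from a start exon to an end exon; the function name `enumerate_candidates` and the toy graph are hypothetical and not taken from the TRIP software.

```python
from collections import defaultdict


def enumerate_candidates(edges, sources, sinks):
    """Enumerate every source-to-sink path of a splice graph (a DAG) by DFS.

    edges   -- iterable of (u, v) pairs: exon u is spliced to exon v
    sources -- exons where a transcript may start
    sinks   -- exons where a transcript may end
    """
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
    sinks = set(sinks)
    paths = []

    def dfs(node, path):
        path.append(node)
        if node in sinks:
            paths.append(tuple(path))  # record one candidate transcript
        for nxt in sorted(graph[node]):
            dfs(nxt, path)
        path.pop()  # backtrack

    for s in sorted(sources):
        dfs(s, [])
    return paths


# Hypothetical toy graph: exon 2 or exon 3 may be skipped.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
candidates = enumerate_candidates(edges, sources=[1], sinks=[4])
print(candidates)
```

The number of paths can grow exponentially with the number of alternative splicing events, which is why the slide follows enumeration with a filtering step.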

  10. Gene representation • Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events • Gene - set of non-overlapping pseudo-exons • Example: Tr1: e1 e5 • Tr2: e1 e3 e5 • Tr3: e2 e4 e6 • Pseudo-exons: pse1 pse2 pse3 pse4 pse5 pse6 pse7 [Figure: the boundaries of each pseudo-exon, labeled Spse1..Spse7 and Epse1..Epse7, marked along the gene]
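A minimal sketch of the pseudo-exon definition above, assuming each transcript is a list of exons given as half-open genomic intervals `(start, end)`. Every exon boundary is treated as an event, and a segment between two consecutive events is kept only if some exon covers it. The function name and coordinates are hypothetical.

```python
def pseudo_exons(transcripts):
    """Split a gene into non-overlapping pseudo-exons.

    transcripts -- list of transcripts, each a list of (start, end) exons
                   with half-open coordinates.
    Returns the sorted list of (start, end) pseudo-exons.
    """
    # Every exon start or end is a transcriptional or splicing event.
    boundaries = sorted({b for exons in transcripts for (s, e) in exons for b in (s, e)})
    segments = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep the segment only if it lies inside at least one exon
        # (otherwise it is purely intronic).
        if any(s <= left and right <= e for exons in transcripts for (s, e) in exons):
            segments.append((left, right))
    return segments


# Hypothetical coordinates: two transcripts with alternative exon boundaries.
tr1 = [(0, 100), (200, 300)]
tr2 = [(50, 100), (200, 250)]
print(pseudo_exons([tr1, tr2]))
```

With pseudo-exons as vertices, every annotated or candidate transcript becomes a sequence of pseudo-exons, which is the representation the splice graph uses.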

  11. Splice Graph [Figure: splice graph over pseudo-exons 1-9 laid out along the genome, with TSS and TES marked]

  12. How to select? • Select the smallest set of candidate transcripts that yields a good statistical fit between • the fragment length distribution empirically determined during library preparation • fragment lengths implied by mapped read pairs • Example (mean: 500; std. dev.: 50): on the candidate with exons 1 2 3, each of length 200, a read pair implies a fragment length of 500; on the candidate with exons 1 3, the same read pair implies a fragment length of 300
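The example on this slide can be reproduced with a short sketch. It assumes exons are half-open genomic intervals and that a fragment is described by the genomic positions of its two ends; the implied fragment length on a candidate transcript is then the number of exonic bases between those ends. Function names and coordinates are hypothetical, chosen so the lengths match the slide (500 and 300).

```python
def implied_fragment_length(exons, start, end):
    """Fragment length implied by a read pair on one candidate transcript.

    exons      -- list of (start, end) half-open genomic intervals
    start, end -- genomic span of the fragment, half-open
    Returns None if either fragment end falls outside the transcript's exons.
    """
    starts_in_exon = any(s <= start < e for s, e in exons)
    ends_in_exon = any(s < end <= e for s, e in exons)
    if not (starts_in_exon and ends_in_exon):
        return None
    # Sum the exonic bases between the two ends; introns are spliced out.
    return sum(max(0, min(e, end) - max(s, start)) for s, e in exons)


def deviation_in_sd(length, mean, sd):
    """How many standard deviations the implied length is from the mean."""
    return abs(length - mean) / sd


# Hypothetical coordinates: three exons of length 200 each.
e1, e2, e3 = (0, 200), (1000, 1200), (2000, 2200)
with_exon2 = implied_fragment_length([e1, e2, e3], 50, 2150)
without_exon2 = implied_fragment_length([e1, e3], 50, 2150)
print(with_exon2, deviation_in_sd(with_exon2, 500, 50))
print(without_exon2, deviation_in_sd(without_exon2, 500, 50))
```

Under these assumptions the read pair fits the candidate that includes exon 2 (implied length equal to the mean) and sits four standard deviations away on the candidate that skips it.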

  13. Simplified IP Formulation • Objective • Constraints • for each pe read at least one transcript is selected • Notation • T(p) - set of candidate transcripts on which paired-end read p can be mapped • y(t) - 1 if candidate transcript t is selected, 0 otherwise • x(p) - 1 if pe read p is selected to be mapped, 0 otherwise
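The formulas for this slide did not survive in the transcript, so the following is only a sketch under stated assumptions: the objective is taken to be the number of selected transcripts (in line with "select the smallest set" on slide 12), and every read is required to be explained, i.e. x(p) = 1 for all p. It uses exhaustive search instead of an IP solver, which is practical only for tiny instances. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def smallest_explaining_set(candidates, compatible):
    """Smallest set of transcripts such that every read has a selected transcript.

    candidates -- list of candidate transcript ids
    compatible -- dict mapping each read p to the set T(p) of candidates
                  on which p can be mapped
    Returns a set of transcript ids, or None if some read cannot be explained.
    """
    # Try subsets in order of increasing size, so the first hit is minimum.
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)  # y(t) = 1 exactly for t in chosen
            if all(chosen & t_p for t_p in compatible.values()):
                return chosen
    return None


# Hypothetical toy instance with four candidates and three read pairs.
candidates = ["t1", "t2", "t3", "t4"]
compatible = {
    "p1": {"t1", "t2"},
    "p2": {"t2", "t3"},
    "p3": {"t3", "t4"},
}
selected = smallest_explaining_set(candidates, compatible)
print(selected)
```

Several minimum-size solutions can exist (here `{"t2", "t3"}` is as small as the returned set); the search simply returns the first one in enumeration order.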

  14. IP Formulation • Objective • Constraints • for each pe read from every category of std. dev., at least one transcript is selected • the number of pe reads mapped within each category of std. dev. is restricted • each pe read is mapped within at most one category of std. dev. • every splice junction is covered
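The objective and constraint formulas are missing from the transcript. One way to write the four verbal constraints above, offered as a sketch rather than the authors' exact formulation, is shown below. The symbols beyond y(t), x(p) and T(p) are assumptions: k indexes the std. dev. categories (K in total), T_k(p) is the set of candidates on which read p maps with a fragment length in category k, x_k(p) indicates that p is mapped in category k, N is the number of pe reads, n_k is the expected fraction of reads in category k, and ε is a tolerance.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{t \in T} y(t) \\
\text{subject to}\quad
& \sum_{t \in T_k(p)} y(t) \ge x_k(p) && \forall p,\ k = 1,\dots,K \\
& N (n_k - \epsilon) \le \sum_{p} x_k(p) \le N (n_k + \epsilon) && k = 1,\dots,K \\
& \sum_{k=1}^{K} x_k(p) \le 1 && \forall p \\
& \sum_{t \ni j} y(t) \ge 1 && \forall \text{ splice junctions } j \\
& y(t),\ x_k(p) \in \{0, 1\}
\end{align*}
```

Read top to bottom, the constraints correspond to the four bullets on the slide: selection per read and category, the per-category read counts, at most one category per read, and splice junction coverage.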

  15. Comparison on Simulated Data • 100x coverage, 2x100bp pe reads, 500 mean fragment length, 10% sd

  16. Influence of Sequencing Parameters • TRIP-L: uses individual fragment length estimates • 100x coverage, 2x100bp pe reads, 500 mean fragment length, 10%-100% sd

  17. Results on Real RNA-Seq Data • CD1 mouse retina RNA samples • specific gene that has 33 annotated transcripts in Ensembl • 6906 alignments for 22346 read pairs with read length of 68 • TRIP: 5 out of 10 transcripts, confirmed using qPCR • Cufflinks: 3 out of 10 transcripts

  18. Conclusions • We introduced a novel method for transcriptome reconstruction from paired-end RNA-Seq reads. • Our method: • exploits the distribution of fragment lengths • can use additional experimental data • TSS/TES (TRIP with TSS/TES) • individual fragment length estimates (TRIP-L)
