An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads

Serghei Mangul, Department of Computer Science, Georgia State University. Joint work with Adrian Caciula, Sahar Al Seesi, Dumitru Brinza, Abdul Rouf Banday, Rahul N. Kanadia, Ion Mandoiu, and Alex Zelikovsky

Advances in Next Generation Sequencing • Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length • Roche/454 FLX Titanium: 400-600 million reads/run, 400bp avg. length • SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50bp read length • Ion Proton Sequencer • http://www.economist.com/node/16349358

RNA-Seq • Make cDNA & shatter into fragments • Sequence fragment ends • Map reads • Downstream analyses: Gene Expression, Isoform Expression, Transcriptome Reconstruction [Figure: reads mapped to a gene with exons A, B, C, D, E; example isoforms A B C and A C D E]

Transcriptome Reconstruction • Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.

Transcriptome Reconstruction Types • Genome-independent reconstruction (de novo) • de Bruijn k-mer graph • Genome-guided reconstruction (ab initio) • Spliced read mapping • Exon identification • Splice graph • Annotation-guided reconstruction • Use existing annotation (known transcripts) • Focus on discovering novel transcripts

Previous approaches • Genome-independent reconstruction • Trinity (2011), Velvet (2008), Trans-ABySS (2008) • Genome-guided reconstruction • Scripture (2010) • Reports “all” transcripts • Cufflinks (2010), IsoLasso (2011), SLIDE (2012) • Minimize the set of transcripts explaining the reads • Annotation-guided reconstruction • RABT (2011), DRUT (2011)

Challenges and Solutions • Read length is currently much shorter than transcript length • Paired-end reads • Fragment length distribution

[Figure: gene with exons 1 2 3 4 5 6 7 and four transcripts] • t1: 1 2 3 4 5 6 7 • t2: 1 3 4 5 6 7 • t3: 1 2 3 4 5 7 • t4: 1 3 4 5 7 • Exons 2 and 6 are “distant” exons: how to phase them?

TRIP: Transcriptome Reconstruction using Integer Programming • Map the RNA-Seq reads to the genome • Construct splice graph G(V,E) • V: exons • E: splicing events • Enumerate candidate transcripts • depth-first search (DFS) • Filter candidate transcripts • fragment length distribution (FLD) • integer programming
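The candidate-enumeration step can be illustrated with a minimal sketch, assuming the splice graph is acyclic and stored as an adjacency list. The function name `enumerate_candidates` and the toy graph (a 7-exon gene in which exons 2 and 6 may be skipped) are hypothetical and chosen only for illustration; this is not the TRIP implementation.

```python
from typing import Dict, List


def enumerate_candidates(graph: Dict[int, List[int]],
                         sources: List[int],
                         sinks: List[int]) -> List[List[int]]:
    """Enumerate every source-to-sink path of an acyclic splice graph by DFS."""
    sink_set = set(sinks)
    paths: List[List[int]] = []

    def dfs(node: int, path: List[int]) -> None:
        path.append(node)
        if node in sink_set:
            # A path ending at a sink is one candidate transcript.
            paths.append(list(path))
        for nxt in graph.get(node, []):
            dfs(nxt, path)
        path.pop()  # backtrack

    for s in sources:
        dfs(s, [])
    return paths


# Hypothetical toy graph: vertices are exons, edges are splicing events.
toy_graph = {1: [2, 3], 2: [3], 3: [4], 4: [5], 5: [6, 7], 6: [7]}
candidates = enumerate_candidates(toy_graph, sources=[1], sinks=[7])
for c in candidates:
    print(c)
```

The number of paths can grow exponentially with the number of alternative splicing events, which is one reason a filtering step follows enumeration.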

Gene representation • Pseudo-exons: regions of a gene between consecutive transcriptional or splicing events • Gene: set of non-overlapping pseudo-exons [Figure: transcripts Tr1: e1 e5, Tr2: e1 e3 e5, Tr3: e2 e4 e6 partitioned into pseudo-exons pse1-pse7, each with a start (Spse) and an end (Epse)]
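One way to derive pseudo-exons from a set of transcripts is sketched below, assuming each transcript is given as a list of half-open exon intervals `(start, end)` in genomic coordinates. The function name `pseudo_exons` and the coordinates are hypothetical; the sketch only illustrates the definition above.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def pseudo_exons(transcripts: List[List[Interval]]) -> List[Interval]:
    """Split the exonic part of a gene at every exon boundary.

    Each returned interval lies between two consecutive boundaries
    (transcription or splicing events) and is covered by at least one exon.
    """
    exons = [exon for tr in transcripts for exon in tr]
    boundaries = sorted({b for exon in exons for b in exon})
    result = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep the segment only if some exon fully contains it (drops introns).
        if any(s <= left and right <= e for s, e in exons):
            result.append((left, right))
    return result


# Hypothetical coordinates: two transcripts whose first exons overlap.
tr_a = [(0, 100), (300, 400)]
tr_b = [(50, 150), (300, 400)]
pses = pseudo_exons([tr_a, tr_b])
print(pses)
```

Because the segments are cut at every boundary, the resulting pseudo-exons never overlap, which matches the "set of non-overlapping pseudo-exons" view of a gene.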

Splice Graph [Figure: splice graph over pseudo-exons 1-9 laid out along the genome, with TSS and TES marked]

How to select? • Select the smallest set of candidate transcripts that yields a good statistical fit between • the fragment length distribution empirically determined during library preparation • fragment lengths implied by mapped read pairs [Figure: the same read pair implies a fragment length of 500 on a transcript with exons 1, 2, 3 (200 bases each) and 300 on a transcript with exons 1, 3; fragment length distribution: mean 500, std. dev. 50]
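The comparison between implied fragment lengths and the fragment length distribution can be sketched as follows. The exon lengths (200), mean (500) and standard deviation (50) come from the slide; the read-pair positions (offset 50 in exon 1, offset 150 in exon 3) and the function names are assumptions chosen so that the numbers reproduce the 500 and 300 shown.

```python
from typing import Dict, List, Tuple


def implied_fragment_length(transcript: List[int],
                            exon_lengths: Dict[int, int],
                            start: Tuple[int, int],
                            end: Tuple[int, int]) -> int:
    """Fragment length implied by a read pair on one candidate transcript.

    `start` and `end` are (exon id, offset within exon) of the fragment ends.
    """
    def to_transcript_coord(exon_id: int, offset: int) -> int:
        pos = 0
        for e in transcript:
            if e == exon_id:
                return pos + offset
            pos += exon_lengths[e]
        raise ValueError(f"exon {exon_id} is not part of this transcript")

    return to_transcript_coord(*end) - to_transcript_coord(*start)


def z_score(length: float, mean: float = 500.0, sd: float = 50.0) -> float:
    """Deviation from the mean fragment length, in standard deviations."""
    return (length - mean) / sd


exon_lengths = {1: 200, 2: 200, 3: 200}
start, end = (1, 50), (3, 150)  # assumed positions of the fragment ends

len_123 = implied_fragment_length([1, 2, 3], exon_lengths, start, end)
len_13 = implied_fragment_length([1, 3], exon_lengths, start, end)
print(len_123, z_score(len_123))
print(len_13, z_score(len_13))
```

Under these assumptions the pair fits the transcript with exons 1, 2, 3 exactly at the mean, while on the transcript with exons 1, 3 it lies four standard deviations below it, so the pair supports the former.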

Simplified IP Formulation • Objective: minimize the number of selected candidate transcripts • Constraints: for each selected PE read p, at least one transcript from T(p) is selected • T(p): set of candidate transcripts on which paired-end read p can be mapped • y(t): 1 if candidate transcript t is selected, 0 otherwise • x(p): 1 if the PE read p is selected to be mapped, 0 otherwise
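For intuition, the simplified selection problem can be solved by exhaustive search on a tiny instance. This is a sketch under two stated assumptions: every read must be explained (x(p) = 1 for all p), and the instance is small enough to try all subsets. A real instance would be handed to an integer programming solver; the function name and toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Set


def smallest_explaining_set(candidates: List[str],
                            T: Dict[str, List[str]]) -> Optional[Set[str]]:
    """Smallest set of transcripts such that every read p has at least one
    selected transcript in T[p]. Brute force, for tiny instances only."""
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)  # transcripts t with y(t) = 1
            if all(chosen & set(T[p]) for p in T):
                return chosen
    return None  # some read cannot be mapped to any candidate


# Hypothetical instance: T[p] lists the transcripts compatible with read p.
candidates = ["t1", "t2", "t3", "t4"]
T = {"p1": ["t1", "t2"], "p2": ["t2", "t3"], "p3": ["t4"]}
selected = smallest_explaining_set(candidates, T)
print(sorted(selected))
```

Subsets are tried in order of increasing size, so the first feasible subset found is a minimum one.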

IP Formulation • Objective • Constraints • for each PE read from every category of std. dev., at least one transcript is selected • the number of PE reads mapped within each std. dev. category is restricted • each PE read is mapped within no more than one category of std. dev. • every splice junction must be covered
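The std. dev. categories can be pictured as bins on the distance between an implied fragment length and the mean. The sketch below assumes category k holds lengths between k-1 and k standard deviations from the mean, with a last catch-all category; the number of categories (4) and the function names are assumptions, since the slide does not state them.

```python
import math
from typing import Dict, List


def sd_category(length: float, mean: float, sd: float,
                n_categories: int = 4) -> int:
    """Category k means the length is within (k-1, k] std. dev. of the mean;
    anything farther falls into the last category."""
    deviation = abs(length - mean) / sd
    k = max(1, math.ceil(deviation))
    return min(k, n_categories)


def group_by_category(implied_lengths: Dict[str, float],
                      mean: float, sd: float) -> Dict[int, List[str]]:
    """For one read pair, group candidate transcripts by std. dev. category."""
    groups: Dict[int, List[str]] = {}
    for transcript, length in implied_lengths.items():
        groups.setdefault(sd_category(length, mean, sd), []).append(transcript)
    return groups


# Hypothetical read pair with implied lengths on two candidate transcripts.
groups = group_by_category({"t_123": 500, "t_13": 300}, mean=500, sd=50)
print(groups)
```

Grouping per read in this way gives the per-category transcript sets that the constraints above refer to.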

Comparison on Simulated Data • 100x coverage, 2x100bp PE reads, mean fragment length 500, 10% sd

Influence of Sequencing Parameters • TRIP-L: individual fragment length estimates • 100x coverage, 2x100bp PE reads, mean fragment length 500, 10%-100% sd

Results on Real RNA-Seq Data • CD1 mouse retina RNA samples • specific gene that has 33 annotated transcripts in Ensembl • TRIP: 5 out of 10 transcripts, confirmed using qPCR • Cufflinks: 3 out of 10 transcripts • 6906 alignments for 22346 read pairs with read length of 68

Conclusions • We introduced a novel method for transcriptome reconstruction from paired-end RNA-Seq reads. • Our method: • exploits the distribution of fragment lengths • can use additional experimental data • TSS/TES (TRIP with TSS/TES) • individual fragment length estimates (TRIP-L)
