
Scaffolding Large Genomes Using Integer Linear Programming

James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*. *University of Connecticut; Georgia State University.


Presentation Transcript


  1. Scaffolding Large Genomes Using Integer Linear Programming. James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*. *University of Connecticut; Georgia State University

  2. De-novo Assembly Paradigm [Figure: pipeline stages: The Genome, Sequencing, The Reads, Assembly, The Contigs, Scaffolding, The Scaffolds]

  3. Why Scaffolding? [Figure: without a scaffold, only gene XYZ; with a scaffold, 5’ UTR, gene XYZ, 3’ UTR] • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation!

  4. Why Scaffolding? Biologist: "There are holes in my genes!" [Figure: 5’ UTR, gene XYZ, 3’ UTR, before and after Sanger sequencing] • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation!

  5. Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation!

  6. Read Pairs. Paired Read Construction [Figure: reads R1 and R2, 2kb, same strand and orientation]. Informative Reads • Align each read against the contigs • Only accept uniquely mapped reads • Use the non-unique reads later • Both reads in a pair must map to different contigs
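The filtering rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the input format (one `(pair_id, mate, contig)` tuple per alignment hit) and the function name `informative_pairs` are assumptions made for the example.

```python
from collections import Counter

def informative_pairs(alignments):
    """Keep read pairs whose two mates map uniquely to different contigs.

    alignments: iterable of (pair_id, mate, contig) tuples, one per alignment
    hit, where mate is 1 or 2 (hypothetical input format).
    Returns {pair_id: (contig_of_R1, contig_of_R2)}.
    """
    alignments = list(alignments)
    # Count how many hits each individual read has.
    hits = Counter((pair_id, mate) for pair_id, mate, _ in alignments)

    placed = {}
    for pair_id, mate, contig in alignments:
        if hits[(pair_id, mate)] == 1:  # only accept uniquely mapped reads
            placed.setdefault(pair_id, {})[mate] = contig

    return {
        pair_id: (mates[1], mates[2])
        for pair_id, mates in placed.items()
        # both mates placed, and on different contigs
        if len(mates) == 2 and mates[1] != mates[2]
    }
```

Reads rejected here for being non-unique are simply dropped; the slide notes they are used later, which this sketch does not attempt.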

  7. Linkage Information [Figure: possible states A, B, C, D of a read pair R1, R2 (5’ to 3’) across contig i and contig j] • Two contigs are adjacent if: • A read pair spans the contigs • State (A, B, C, D) • Depends on orientation of the read • Order of contigs is arbitrary • Each read pair can be “consistent” with one of the four states

  8. The Scaffolding Problem • Given • Contigs • Paired reads • Find • Orientation • Ordering • Relative distance • Goal • Recreate true scaffolds • Possible Objectives • Un-weighted • Max number of consistent read pairs • Weighted • Each state is weighted: • Overlap with repeat • Deviation from expected distance • …

  9. Graph Representation. Using the input we can define a scaffolding graph $G = (V, E)$: $V$, set of contigs; $E$, set of read-pair links between contigs. This is an undirected multi-graph. Assume it is connected.
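As a rough sketch of this representation, the snippet below stores the scaffolding graph as an adjacency list that keeps parallel edges (one per linking read pair) and checks the connectivity assumption by listing connected components. The names `build_multigraph` and `connected_components` and the input format are assumptions for illustration only.

```python
from collections import defaultdict, deque

def build_multigraph(links):
    """links: list of (contig_i, contig_j) tuples, one per linking read pair.

    Returns adjacency: contig -> list of (neighbor, link_index).
    Parallel edges are kept, so this is a multigraph.
    """
    adj = defaultdict(list)
    for idx, (u, v) in enumerate(links):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    return adj

def connected_components(adj):
    """Breadth-first search over the multigraph; returns sorted components."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v, _ in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(component))
    return components
```

If more than one component comes back, each can be scaffolded independently, which is consistent with the slide's assumption that the graph under consideration is connected.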

  10. Integer Linear Program Formulation • Variables • Contig orientation • Pairwise contig consistency • Contig pair state • Objective • Maximize weight of consistent pairs

  11. Constraints • Pairwise orientation • Mutual exclusivity • Forbid 2- and 3-cycles explicitly
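To make the objective concrete, here is a toy exhaustive search over contig orientations that maximizes the total weight of consistent links. It is deliberately not an ILP and not the authors' formulation: it ignores ordering, the four link states, and the cycle constraints, and it assumes each link has been reduced to "these two contigs should have the same orientation" or "opposite orientation". All names are hypothetical, and the search is exponential in the number of contigs.

```python
from itertools import product

def best_orientation(contigs, links):
    """Brute-force the orientation assignment with maximum consistent weight.

    contigs: list of contig names.
    links: list of (i, j, same_orientation, weight) tuples (assumed format).
    Returns (best_score, {contig: 0 or 1}).
    """
    best_score, best_assign = -1.0, None
    for bits in product((0, 1), repeat=len(contigs)):
        orient = dict(zip(contigs, bits))
        # A link is consistent when the assignment agrees with its claim.
        score = sum(
            weight
            for i, j, same, weight in links
            if (orient[i] == orient[j]) == same
        )
        if score > best_score:
            best_score, best_assign = score, orient
    return best_score, best_assign
```

For three contigs with links a=b (weight 3), b≠c (weight 2), and a=c (weight 1), the three claims cannot all hold, so the search drops the lightest one. An ILP solver reaches the same kind of trade-off without enumerating every assignment.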

  12. Graph Decomposition: Articulation Points [Figure: graph decomposed at an articulation point; labels: solve, articulation point]
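Articulation points can be found with the standard depth-first-search low-link method. The sketch below is a generic textbook version, not code from the talk; it assumes the multigraph has been collapsed to a simple graph (`adj` maps each contig to a set of neighbors) and uses recursion, so very large graphs would need an iterative rewrite.

```python
def articulation_points(adj):
    """Return the set of articulation points of an undirected simple graph.

    adj: dict mapping node -> set of neighbor nodes (assumed input format).
    """
    disc, low, result = {}, {}, set()
    counter = [0]  # discovery-time counter shared with the nested function

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Non-root u separates v's subtree if it cannot climb above u.
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
        # The DFS root is an articulation point iff it has 2+ DFS children.
        if parent is None and children > 1:
            result.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return result
```

Splitting the graph at these vertices yields the biconnected components that the later slides refer to.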

  13. Graph Decomposition: 2-cuts [Figure: graph decomposed at a 2-cut; labels: 2-cut, orientation signs + + - - / + - + -]

  14. Non-Serial Dynamic Programming • SPQR-tree to schedule decomposition • Traverse tree using DFS • NSDP utilizes solutions of previous stage in current stage

  15. Largest Connected Component

  16. Largest Biconnected Component

  17. Largest Triconnected Component

  18. Post Processing ILP Solution [Figure: ILP solution on contigs A to F drawn as a bipartite graph; labels: outgoing, incoming, A B C D E F, B D F A E C] • ILP solution may have cycles • Not a total ordering for each connected component • Bipartite matching • Objectives: • Max weight • Max cardinality • Max cardinality / Max weight
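One way to picture the bipartite matching step: give every contig an "outgoing" copy and an "incoming" copy, and match outgoing copies to incoming copies along the adjacencies chosen by the ILP, so each contig ends up with at most one successor and one predecessor. The sketch below covers only the max-cardinality objective, using the classic augmenting-path method; it is an illustration under that assumption, not the authors' implementation, and the function name is hypothetical. A matching on its own can still contain cycles, so a real pipeline would need an extra step to break them.

```python
def max_cardinality_matching(edges):
    """Match each contig to at most one successor (augmenting-path method).

    edges: list of directed (u, v) adjacencies (assumed input format).
    Returns {u: v} giving the chosen successor of each matched contig.
    """
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)

    pred = {}  # incoming copy v -> outgoing copy u currently matched to it

    def augment(u, visited):
        for v in out.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can move on.
            if v not in pred or augment(pred[v], visited):
                pred[v] = u
                return True
        return False

    for u in out:
        augment(u, set())
    return {u: v for v, u in pred.items()}
```

The max-weight objective listed on the slide would need a weighted assignment algorithm instead, which this sketch does not cover.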

  19. Testing Framework Venter Genome • 4x Assembly

  20. Testing Metrics • Computer Scientists • Finding Scaffold = Binary Classification Test • n contigs, try to predict n-1 adjacencies • TP, FP, TN, FN, Sensitivity, PPV • Biologists (main focus) • N50 (roughly a length-weighted median scaffold size; gaps ignored) • TP50 • Break scaffold at incorrect edges, then find N50
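The metrics on this slide can be written down compactly. The sketch below uses the usual definition of N50 (the largest length L such that scaffolds of length at least L cover half of the total bases) and follows the slide's description of TP50. The data layout (scaffolds as ordered lists of contig names, adjacencies as `frozenset` pairs) and the function names are assumptions for illustration.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L hold half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def sensitivity_ppv(predicted, true):
    """Adjacency-level sensitivity and positive predictive value."""
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)
    sensitivity = tp / len(true) if true else 0.0
    ppv = tp / len(predicted) if predicted else 0.0
    return sensitivity, ppv

def tp50(scaffolds, contig_len, true_adj):
    """Break scaffolds at incorrect adjacencies, then compute N50.

    scaffolds: list of ordered contig-name lists.
    contig_len: {contig: length}; gaps are ignored.
    true_adj: set of frozenset({a, b}) for adjacencies known to be correct.
    """
    pieces = []
    for scaffold in scaffolds:
        if not scaffold:
            continue
        current = contig_len[scaffold[0]]
        for a, b in zip(scaffold, scaffold[1:]):
            if frozenset((a, b)) in true_adj:
                current += contig_len[b]
            else:
                pieces.append(current)  # cut at the incorrect edge
                current = contig_len[b]
        pieces.append(current)
    return n50(pieces)
```

Because TP50 only counts sequence joined by correct adjacencies, it can never exceed the plain N50 of the same scaffolds.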

  21. Results

  22. Conclusions • Success • ILP solves scaffolding problem! • NSDP works. • Improvements • Finalize large test cases (then publish?!) • Practical considerations (read style, multi-libraries, merge ctgs) • Future Work • Where else can I apply NSDP? • Scaffold before assembly?? • Structural Variation??

  23. Questions?
