1 / 23

A hierarchical approach to building contig scaffolds

A hierarchical approach to building contig scaffolds. Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp. 149-159, 2004. Sequencing pipeline. Random sequencing un-related reads 500-700 base-pairs Assembly un-related contigs 5,000-10,000 base-pairs Scaffolding

casey
Download Presentation

A hierarchical approach to building contig scaffolds

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp. 149-159, 2004.

  2. Sequencing pipeline • Random sequencing • un-related reads 500-700 base-pairs • Assembly • un-related contigs 5,000-10,000 base-pairs • Scaffolding • un-related scaffolds 30,000-50,000 base-pairs • Finishing/gap closure • completed genomes millions-billions of base-pairs

  3. II I III IV II III IV I Scaffolding • Given a set of non-overlapping contigsorder and orient them along a chromosome

  4. I II R F I II R F II I R F Clone-mates Insert R F Clone

  5. Scaffolder output Physical gaps Sequencing gaps

  6. Problems with the data • Incorrect sizing of inserts • cut from gel – sizing is subjective • error increases with size • Chimeras (ends belong to different inserts) • biological reasons (esp. for large sized inserts) • sample tracking (human error) • Software must handle a certain error rate.

  7. Theoretical abstraction • Given a set of entities (reads/contigs) and constraints between them (overlaps/mate pairs) provide a linear/circular embedding that preserves most constraints.

  8. Graph representation • Nodes: contigs • Directed edges: constraints on relative placement of contigs – relative order and relative orientation • Embedding: order (coordinate along chromosome) and orientation (strand sampled)

  9. Challenges • Orientation – node coloring problem (forward/reverse) • feasibility – no cycles with odd number of “reversal” edges (blue edges) • optimality – remove minimum number of edges such that a solution exists (NP-hard)

  10. Challenges • Ordering – generate a linear embedding • feasibility – lengths of parallel DAG paths are consistent • optimality – remove minimum number of edges such that DAG is feasible (NP-hard)

  11. The real world • Use of scaffolds • Analysis – longest unambiguous sub-graphs • Finishing – present all “reliable” relationships between contigs • Sources of error • mis-assemblies • sizing errors (increases with library size) • chimeras

  12. II II II II I’ I I I I III III III III Ambiguous scaffold

  13. Hierarchical scaffolding • For each contig pair, consolidate all linking data into a single relationship – 2 correct links required

  14. Hierarchical scaffolding • Use most reliable links to build scaffolds • Repeatedly build super-scaffolds based on less reliable linking data

  15. Rationale Problem complexity problem size (#nodes) error rate Hierarchical step

  16. Linking information • Overlaps • Mate-pair links • Similarity links • Physical markers • Gene synteny reference genome physical map

  17. BAMBOO (BAMBUS) Best effort Attempt Multiple Branches allowed Order, Orient

  18. BE EB min <= dist <= max Inputs • Set of contigs: names and lengths • Groups of contig links: • groups correspond to “quality” of links • link: relative distance between contig origins relative orientation of contigs • Priorities for each group – specify order in which links are considered

  19. Outputs • XML representation of layout: • contig orientations • contig position (x-coordinate of contig origin) • links used to construct layout • Graphical display of the layout • uses GraphViz package from AT&T

  20. 1.0 release • XML input not yet supported • All scaffold placed in the same output file • Only Linux executable released • Hacker friendly

  21. Current release: 2.33 • XML input and more general output module • Collection of input modules from common assembly formats • Better handling of priority data • Repeat masking features • More platforms supported and source code released as open source • http://amos.sourceforge.net

  22. Future enhancements • Option to generate un-ambiguous (non-branching) scaffolds • Better layout algorithms • Specialized drawing tools • Interactive browser • Represent/handle multiple haplotypes

  23. Acknowledgements • Dan Kosack • Steven Salzberg • Martin Shumway • Hean Koo • Luke Tallon • Jessica Vamathevan

More Related