CS 6293 Advanced Topics: Current Bioinformatics


  1. CS 6293 Advanced Topics: Current Bioinformatics Genome Assembly: a brief introduction Slides Adapted from Mihai Pop, Art Delcher, and Steven Salzberg

  2. Homework #2 • #1: questions will be posted online before Monday class • #2: Form groups of 3 • Each group reads two papers on a topic: Short reads alignment or assembly • Present the papers and do some comparison • ~8 minutes presentation • You can choose to go to some really cool details • Or give the main idea of the paper • Other teams (and me) will judge you • Send me names in your group and optionally papers you want to present • List of papers: http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

  3. Genome sequencing • Genome: 3 × 10^9 nucleotides • Each read: ~500 nucleotides, e.g. AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT

  4. Genome sequencing • Genome: 3 × 10^9 nucleotides • A big puzzle: ~60 million pieces, e.g. AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT • Computational Fragment Assembly • Introduced ~1980 • 1995: assemble up to 1,000,000 long DNA pieces • 2000: assemble whole human genome
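
The puzzle size quoted above is consistent with a simple coverage calculation: coverage = (number of reads × read length) / genome length. The sketch below is only an illustration; the function name `coverage` is made up here, and the inputs are the round figures from the slides (~60 million reads of ~500 nucleotides, a 3 × 10^9 nucleotide genome).

```python
def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return n_reads * read_len / genome_len


# Round figures from the slides: ~60 million reads of ~500 nt, 3 x 10^9 nt genome.
print(coverage(60_000_000, 500, 3_000_000_000))  # 10.0
```

With these round numbers, 60 million pieces correspond to roughly 10X coverage.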

  5. Shotgun DNA Sequencing (Technology) • DNA target sample • SHEAR • SIZE SELECT, e.g., 10Kbp ± 8% std.dev. • LIGATE & CLONE into vector • SEQUENCE from primer: End Reads (Mates), 550bp

  6. Whole Genome Shotgun Sequencing + single highly automated process + only three library constructions – assembly is much more difficult • Collect 10X sequence in a 1-to-1 ratio of two types of read pairs, short (2Kbp) and long (10Kbp): ~35 million reads for Human. • Collect another 20X in clone coverage of 50Kbp end sequence pairs (BAC 5’ and BAC 3’ ends): ~1.2 million pairs for Human. • Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.

  7. Sequencing Factory

  8. Celera’s Sequencing Factory (circa 2001) • 300 ABI 3700 DNA Sequencers • 50 Production Staff • 20,000 sq. ft. of wet lab • 20,000 sq. ft. of sequencing space • 800 tons of A/C (160,000 cfm) • $1 million / year for electrical service • $10 million / month for reagents

  9. Human Data (April 2000) • Collected 27.27 million reads = 5.11X coverage • 21.04 million are paired (77%) = 10.52 million pairs • 2Kbp library: 5.045M pairs, 98.6% true*, <6% std.dev. • 10Kbp library: 4.401M pairs, 98.6% true*, <8% std.dev. • 50Kbp library: 1.071M pairs, 90.0% true*, <15% std.dev. • * validated against finished Chrom. 21 sequence • The clones cover the genome 38.7X • Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)
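
The read-pair totals on this slide can be cross-checked with a few lines of arithmetic. The variable names below are illustrative only; every number is taken from the slide.

```python
total_reads_m = 27.27    # millions of reads collected
paired_reads_m = 21.04   # millions of reads that have a mate
pairs_by_library_m = {"2Kbp": 5.045, "10Kbp": 4.401, "50Kbp": 1.071}

# Fraction of reads that belong to a pair (the slide rounds this to 77%).
paired_fraction = paired_reads_m / total_reads_m

# Two ways of counting pairs: summing the libraries, or halving the paired reads.
total_pairs_m = sum(pairs_by_library_m.values())
pairs_from_reads_m = paired_reads_m / 2

print(f"paired fraction: {paired_fraction:.1%}")
print(f"pairs (sum of libraries): {total_pairs_m:.3f} M")
print(f"pairs (paired reads / 2): {pairs_from_reads_m:.2f} M")
```

Both ways of counting agree to within rounding, which matches the 10.52 million pairs quoted on the slide.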

  10. Pairs Give Order & Orientation • Contig: assembly without pairs results in contigs (consensus of 15-30Kbp built from reads) whose order and orientation are not known. • Scaffold: pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized (mean & std.dev. are known).
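
One way to see why corroborating pairs characterize a gap: each mate pair that spans the gap implies a gap size equal to the library's mean insert size minus the parts of the insert that lie inside the two contigs. The sketch below is a simplified illustration, not the method of any particular assembler. It assumes that contig A precedes contig B in the same orientation, that all pairs come from one library, and that the helper names and example coordinates are hypothetical.

```python
from statistics import mean


def gap_estimate(insert_mean, contig_a_len, pos_in_a, pos_in_b):
    """Gap size implied by a single mate pair (assumed layout: A, gap, B).

    pos_in_a: start of the forward read on contig A (0-based).
    pos_in_b: end of the reverse read on contig B (0-based, exclusive).
    """
    return insert_mean - (contig_a_len - pos_in_a) - pos_in_b


def scaffold_gap(insert_mean, contig_a_len, pairs):
    """Average the per-pair estimates over all corroborating pairs."""
    return mean(gap_estimate(insert_mean, contig_a_len, a, b) for a, b in pairs)


# Hypothetical example: 10 Kbp library, contig A is 20 Kbp long, three spanning pairs.
pairs = [(15_000, 3_000), (16_000, 4_200), (14_500, 2_300)]
print(scaffold_gap(10_000, 20_000, pairs))
```

Averaging several pairs narrows the uncertainty of the estimate, which is why groups of corroborating pairs matter more than single ones.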

  11. Anatomy of a WGS Assembly • Chromosome with STS markers • STS-mapped scaffolds • Contigs • Gaps (mean & std. dev. known) • Read pairs (mates) • Consensus • Reads (of several haplotypes) • SNPs • External “reads”

  12. Assembly gaps: physical gaps and sequencing gaps • sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap • physical gap - no information is known about the adjacent contigs, nor about the DNA spanning the gap

  13. Assembly paradigms • Overlap-layout-consensus • greedy (TIGR Assembler, phrap, CAP3...) • graph-based (Celera Assembler, Arachne) • Eulerian path (especially useful for short read sequencing)

  14. TIGR Assembler/phrap: greedy • Build a rough map of fragment overlaps • Pick the highest-scoring overlap • Merge the two fragments • Repeat until no more merges can be done
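
The greedy loop above can be sketched in a few lines. This is a toy illustration, not the TIGR Assembler or phrap implementation: it scores an overlap simply by the length of an exact suffix-prefix match, ignores sequencing errors and reverse complements, and the function names and `min_len` threshold are assumptions made for the example.

```python
def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest exact suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # no more merges can be done
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)] + [merged]


print(greedy_assemble(["ATGGCGT", "GGCGTGC", "CGTGCA"]))  # ['ATGGCGTGCA']
```

Because each merge is chosen locally, a repeat longer than the reads can make the longest overlap the wrong one, which is the weakness of greedy assembly on repetitive genomes.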

  15. Figure: (A) Overlap between two reads; note that agreement within the overlapping region need not be perfect. (B) Correct assembly of a genome with two repeats (boxes) using four reads A-D. (C) Assembly produced by the greedy approach. Source: Pop M, Brief Bioinform 2009;10:354-366. © The Author 2009. Published by Oxford University Press.

  16. Overlap-layout-consensus • Main entity: read • Relationship between reads: overlap • [Figure: overlap graph of reads 1-9, the resulting read layout, and a consensus built from the aligned reads ACCTGA, AGCTGA, ACCAGA]

  17. Paths through graphs and assembly • Hamiltonian circuit: visit each node (city) exactly once, returning to the start • Hamiltonian path: visit each node (city) exactly once • [Figure: genome]

  18. Overlap between two sequences • Example: CAGTACTTGGATGCGCTGACACGTAGC and GGATGCGCGGACACGTAGCCAGGAC share an overlap of 19 bases, with an overhang of 6 bases; % identity = 18/19 = 94.7% • overlap - region of similarity between the two sequences • overhang - unaligned ends of the sequences • The assembler screens merges based on: • length of overlap • % identity in overlap region • maximum overhang size
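
The identity figure on this slide can be reproduced with a short script. This is a minimal sketch that assumes an ungapped overlap at a known offset; real overlappers compute a gapped alignment. The function name `score_overlap` and the dictionary keys are hypothetical.

```python
def score_overlap(a: str, b: str, offset: int) -> dict:
    """Score an ungapped overlap in which b starts at position `offset` of a."""
    overlap = min(len(a) - offset, len(b))
    matches = sum(x == y for x, y in zip(a[offset:offset + overlap], b))
    return {
        "overlap": overlap,                   # bases in the overlapping region
        "identity": matches / overlap,        # fraction of matching bases
        "left_overhang": offset,              # unaligned start of a
        "right_overhang": len(b) - overlap,   # unaligned end of b
    }


# The two sequences from the slide; the second starts 8 bases into the first.
a = "CAGTACTTGGATGCGCTGACACGTAGC"
b = "GGATGCGCGGACACGTAGCCAGGAC"
result = score_overlap(a, b, offset=8)
print(result["overlap"], f"{result['identity']:.1%}", result["right_overhang"])
```

An assembler would then accept or reject the merge by comparing these three quantities against its thresholds.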

  19. All pairs alignment • Needed by the assembler • Try all pairs: must consider ~n^2 pairs • Smarter solution: only n × coverage (e.g. 8) pairs are possible • Build a table of k-mers contained in sequences (single pass through the genome) • Generate the pairs from k-mer table (single pass through k-mer table)
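
The k-mer table idea can be sketched as follows. This is an illustrative simplification: it ignores reverse complements and does not filter out highly frequent k-mers, and the function name `candidate_pairs` is made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one k-mer."""
    # Pass 1: table mapping each k-mer to the set of reads that contain it.
    table = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(read_id)
    # Pass 2: every two reads listed under the same k-mer form a candidate pair.
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ATGGCGT", "GGCGTGC", "CGTGCA", "TTTTAAAA"]
print(sorted(candidate_pairs(reads, k=4)))  # [(0, 1), (1, 2)]
```

Only the candidate pairs are passed on to the expensive alignment step, so reads that share nothing are never compared.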

  20. BWT-based overlap detection • Efficient construction of an assembly string graph using the FM-index, Jared T. Simpson and Richard Durbin, Bioinformatics, 26 (12): i367-i373 (2010) • Read it yourself for more details • [Figure: BWT for multiple sequences]

  21. Overlap graph • Edge types between reads A and B: regular dovetail, prefix dovetail, suffix dovetail • Edges are annotated with deltas of overlaps

  22. The Unitig Reduction • 1. Remove “Transitively Inferable” Overlaps • [Figure: reads A, B, C; the A-C overlap is implied by the A-B and B-C overlaps]
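
Removing transitively inferable overlaps can be sketched on a plain directed graph. This is an assumption-laden toy: it treats an edge A→C as inferable whenever A→B and B→C both exist, whereas a real assembler would also check that the overlap lengths are consistent. The function name and graph encoding are hypothetical.

```python
def remove_transitive_edges(graph):
    """graph: dict mapping each read to the set of reads it overlaps (successors)."""
    reduced = {u: set(vs) for u, vs in graph.items()}
    for u, succs in graph.items():
        for v in succs:
            for w in graph.get(v, ()):
                if w in succs:  # u->w is implied by u->v and v->w
                    reduced[u].discard(w)
    return reduced


graph = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(remove_transitive_edges(graph))
```

In the example, the A→C edge is dropped and only A→B and B→C remain.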

  23. The Unitig Reduction • 2. Collapse “Unique Connector” Overlaps • [Figure: reads A and B joined by a unique overlap edge are collapsed into one node; edge labels 412, 352, 45]
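
Collapsing unique connectors can be sketched as chain compression: an edge u→v is collapsed when it is the only edge leaving u and the only edge entering v. The code below is an illustrative sketch under that assumed definition, not the Celera Assembler's unitigger; the function and helper names are hypothetical.

```python
def collapse_unique_connectors(graph):
    """graph: dict node -> set of successors. Returns a list of chains of nodes."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    preds = {n: set() for n in nodes}
    for u, vs in graph.items():
        for v in vs:
            preds[v].add(u)

    def unique_next(u):
        succs = graph.get(u, set())
        if len(succs) == 1:
            (v,) = succs
            if len(preds[v]) == 1:
                return v
        return None

    def unique_prev(v):
        if len(preds[v]) == 1:
            (u,) = preds[v]
            if len(graph.get(u, set())) == 1:
                return u
        return None

    chains, seen = [], set()
    starts = [n for n in sorted(nodes) if unique_prev(n) is None]
    for n in starts + sorted(nodes):  # the second pass picks up pure cycles
        if n in seen:
            continue
        chain = [n]
        seen.add(n)
        while (nxt := unique_next(chain[-1])) is not None and nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
        chains.append(chain)
    return chains


graph = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}}
print(collapse_unique_connectors(graph))  # [['A', 'B', 'C'], ['D'], ['E']]
```

Each returned chain corresponds to a stretch of reads with no branching, which is the intuition behind a unitig.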

  24. Celera Assembly Pipeline • Stages: Trim & Screen; Overlapper; Unitiger; Scaffolder; Repeat Rez I, II • Overlapper: find all overlaps ≥ 40bp allowing 6% mismatch • An overlap between A and B is either TRUE or REPEAT-INDUCED

  25. Celera Assembly Pipeline • Unitiger: compute all overlap-consistent sub-assemblies: Unitigs (Uniquely Assembled Contigs)

  26. Celera Assembly Pipeline • Scaffolder: scaffold U-unitigs with confirmed pairs (mated reads)

  27. Celera Assembly Pipeline • Repeat Rez I, II: fill repeat gaps with doubly anchored positive unitigs (Unitig>0)

  28. Handling repeats • Repeat detection • pre-assembly: find fragments that belong to repeats • statistically (most existing assemblers) • repeat database (RepeatMasker) • during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001) • post-assembly: find repetitive regions and potential mis-assemblies • REPuter, RepeatMasker • "unhappy" mate-pairs (too close, too far, mis-oriented) • Repeat resolution • find DNA fragments belonging to the repeat • determine correct tiling across the repeat
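
The three kinds of "unhappy" mate-pairs can be made concrete with a small classifier. This is a hedged sketch: the expected layout (first read on the forward strand, mate on the reverse strand downstream), the cutoff of three standard deviations, and all names are assumptions chosen for illustration.

```python
def classify_mate_pair(fwd_start, rev_end, fwd_strand, rev_strand,
                       lib_mean, lib_sd, n_sd=3.0):
    """Classify one mate pair whose reads are placed on the same contig or scaffold."""
    # Assumed layout: first read on '+', mate on '-', mate located downstream.
    if fwd_strand != "+" or rev_strand != "-" or rev_end <= fwd_start:
        return "mis-oriented"
    span = rev_end - fwd_start
    if span < lib_mean - n_sd * lib_sd:
        return "too close"
    if span > lib_mean + n_sd * lib_sd:
        return "too far"
    return "happy"


# Hypothetical placements for a 10 Kbp library with 8% std.dev. (800 bp).
print(classify_mate_pair(1_000, 11_200, "+", "-", 10_000, 800))  # happy
print(classify_mate_pair(1_000, 4_000, "+", "-", 10_000, 800))   # too close
print(classify_mate_pair(1_000, 20_000, "+", "-", 10_000, 800))  # too far
print(classify_mate_pair(1_000, 11_200, "+", "+", 10_000, 800))  # mis-oriented
```

Clusters of unhappy pairs in one region are the signal of a potential mis-assembly; a single unhappy pair may just be a chimeric clone.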

  29. Statistical repeat detection • Significant deviations from average coverage flagged as repeats • frequent k-mers are ignored • “arrival” rate of reads in contigs compared with theoretical value • Problem 1: assumption of uniform distribution of fragments leads to false positives • non-random libraries • poor clonability regions • Problem 2: repeats with low copy number are missed, which leads to false negatives
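
The arrival-rate comparison can be illustrated with a log-odds score. If read starts follow a Poisson process, a unique region of length ρ is expected to contain λ = ρ · n / G reads (n reads in total, genome length G), while a collapsed two-copy repeat is expected to contain 2λ. For k observed reads, the log-odds of "unique" versus "two-copy" is then ln[P(k | λ) / P(k | 2λ)] = λ − k ln 2. The sketch below implements this derivation only; it is not claimed to be the exact statistic of any assembler, and the function name and example numbers are assumptions.

```python
import math


def arrival_log_odds(unitig_len, reads_in_unitig, total_reads, genome_len):
    """Log-odds that a contig is single-copy rather than a collapsed two-copy repeat.

    Assumes read starts follow a Poisson process with rate total_reads / genome_len.
    """
    expected = unitig_len * total_reads / genome_len  # expected reads if unique
    return expected - reads_in_unitig * math.log(2)


# Hypothetical 10 Kbp contig, 27.27 million reads, 3 x 10^9 nt genome.
print(arrival_log_odds(10_000, 91, 27_270_000, 3_000_000_000))   # positive: looks unique
print(arrival_log_odds(10_000, 182, 27_270_000, 3_000_000_000))  # negative: looks repetitive
```

The uniform-arrival assumption is exactly what Problem 1 on the slide warns about: cloning bias changes the local rate and shifts the score.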

  30. Mis-assembled repeats • excision • collapsed tandem • rearrangement

  31. Eulerian path-based assembly • Break each read into k-mers (typically k >= 19) • Construct a de Bruijn graph using the k-mers from all reads • Each k-mer is a node • v1 has a directed edge to v2 if the last k-1 characters of v1 equal the first k-1 characters of v2, i.e. v1 can be obtained by removing the last char from v2 and adding a new char at the beginning of v2. E.g. v1 = acgtctgact, v2 = cgtctgactg • Find an Eulerian path in the graph • visits each edge exactly once
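
The construction above can be sketched end to end on a toy example. This is a minimal illustration under simplifying assumptions: error-free reads, a single strand, edges added only between k-mers that are adjacent in some read, and a graph that actually has an Eulerian path. The function names are hypothetical, and the path search is the standard Hierholzer algorithm rather than any specific assembler's code.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Nodes are k-mers; an edge joins two k-mers that are consecutive in a read."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    return edges


def eulerian_path(edges):
    """Hierholzer's algorithm: a path that uses every edge exactly once."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in sorted(edges):
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(out_edges))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Spell the sequence: first k-mer plus the last char of each later k-mer."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
print(spell(eulerian_path(de_bruijn_edges(reads, k=3))))  # ATGGCGTGCA
```

Note that no pairwise read overlaps are computed at any point: the overlaps are implicit in the shared k-mers.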

  32. [Figure: steps of de Bruijn graph assembly] 1. Sequencing 2. Constructing a de Bruijn graph 3. Simplification 4. Error removal

  33. Eulerian path-based assembly • No need to compute pairwise overlaps: important for NGS data • Eulerian paths are much easier to find than Hamiltonian paths • Catch: multiple Eulerian paths may exist • Loss of information • Repeats appear as cycles in the graph • Less likely to cause mis-assembly • More suitable for short-read assembly • Newbler • VELVET • EDENA • ABySS • See Flicek & Birney, Nat Methods, 2009

  34. References • Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6 - S12 (2009) • Genome assembly reborn: recent computational challenges, Mihai Pop, Briefings in Bioinformatics, 10(4): 354-366 (2009)
