1 / 35

Genome Assembly: a brief introduction

Genome Assembly: a brief introduction. Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg. Shotgun DNA Sequencing (Technology). DNA target sample. SHEAR. SIZE SELECT. e.g., 10Kbp ± 8% std.dev. End Reads (Mates). 550bp. LIGATE & CLONE. Primer. SEQUENCE. Vector.

fox
Download Presentation

Genome Assembly: a brief introduction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Genome Assembly: a brief introduction Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg

  2. Shotgun DNA Sequencing (Technology) DNA target sample SHEAR SIZE SELECT e.g., 10Kbp ± 8% std.dev. End Reads (Mates) 550bp LIGATE & CLONE Primer SEQUENCE Vector

  3. Whole Genome Shotgun Sequencing + single highly automated process + only three library constructions – assembly is much more difficult • Collect 10x sequence in a 1-to-1 ratio of two types of read pairs: ~ 35million reads for Human. Short Long 10Kbp 2Kbp • Collect another 20X in clone coverage of 50Kbp end sequence pairs: ~ 1.2million pairs for Human. • Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously. BAC 3’ BAC 5’

  4. Sequencing Factory

  5. Celera’s Sequencing Factory(circa 2001) • 300 ABI 3700 DNA Sequencers • 50 Production Staff • 20,000 sq. ft. of wet lab • 20,000 sq. ft. of sequencing space • 800 tons of A/C (160,000 cfm) • $1 million / year for electrical service • $10 million / month for reagents

  6. Human Data (April 2000) • Collected 27.27 Million reads = 5.11X coverage • 21.04 Million are paired (77%) = 10.52 Million pairs • 2Kbp 5.045M 98.6% true * <6% std.dev. • 10Kbp 4.401M 98.6% true * <8% std.dev. • 50Kbp 1.071M 90.0% true * <15% std.dev. * validated against finished Chrom. 21 sequence • The clones cover the genome 38.7X times • Data is from 5 individuals (roughly 3X, 4 others at .5X)

  7. Pairs Give Order & Orientation Contig Assembly without pairs results in contigs whose order and orientation are not known. Consensus (15- 30Kbp) Reads ? 2-pair Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized. Mean & Std.Dev. is known Scaffold

  8. Anatomy of a WGS Assembly STS Chromosome STS-mapped Scaffolds Contig Gap (mean & std. dev. Known) Read pair (mates) Consensus Reads (of several haplotypes) SNPs External “Reads”

  9. Assembly gaps Physical gaps Sequencing gaps sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

  10. Shotgun sequencing statistics

  11. Typical contig coverage Imagine raindrops on a sidewalk

  12. Lander-Waterman statistics L = read length T = minimum detectable overlap G = genome size N = number of reads c = coverage (NL / G) σ = 1 – T/L E(#islands) = Ne-cσ E(island size) = L((ecσ – 1) / c + 1 – σ) contig = island with 2 or more reads

  13. Example Genome size: 1 Mbp Read Length: 600 Detectable overlap: 40

  14. Experimental data Caveat: numbers based on artificially chopping up the genome of Wolbachia pipientis dMel

  15. Assembly paradigms • Overlap-layout-consensus • greedy (TIGR Assembler, phrap, CAP3...) • graph-based (Celera Assembler, Arachne) • Eulerian path (especially useful for short read sequencing)

  16. TIGR Assembler/phrap Greedy • Build a rough map of fragment overlaps • Pick the largest scoring overlap • Merge the two fragments • Repeat until no more merges can be done

  17. Overlap-layout-consensus Main entity: read Relationship between reads: overlap 1 4 7 2 5 8 3 6 9 2 3 4 5 6 7 8 9 1 ACCTGA ACCTGA AGCTGA ACCAGA 1 2 3 2 3 1 1 2 3 3 1 1 2 3 1 3 2 2

  18. Paths through graphs and assembly • Hamiltonian circuit: visit each node (city) exactly once, returning to the start Genome

  19. Implementation details

  20. Overlap between two sequences overlap (19 bases) overhang (6 bases) …AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT… overhang % identity = 18/19 % = 94.7% • overlap - region of similarity between regions • overhang - un-aligned ends of the sequences • The assembler screens merges based on: • length of overlap • % identity in overlap region • maximum overhang size.

  21. All pairs alignment • Needed by the assembler • Try all pairs – must consider ~ n2 pairs • Smarter solution: only n x coverage (e.g. 8) pairs are possible • Build a table of k-mers contained in sequences (single pass through the genome) • Generate the pairs from k-mer table (single pass through k-mer table) k-mer

  22. Assembly Pipeline A B implies TRUE A B OR A B REPEAT-INDUCED Trim & Screen Find all overlaps  40bp allowing 6% mismatch. Overlapper Unitiger Scaffolder Repeat Rez I, II

  23. Assembly Pipeline Trim & Screen Compute all overlap consistent sub-assemblies: Unitigs (Uniquely Assembled Contig) Overlapper Unitiger Scaffolder Repeat Rez I, II

  24. OVERLAP GRAPH A A B B B A B A A B A B Edge Types: Regular Dovetail Prefix Dovetail Suffix Dovetail E.G.: Edges are annotated with deltas of overlaps

  25. The Unitig Reduction A C A B C B 1. Remove “Transitively Inferrable” Overlaps:

  26. The Unitig Reduction A 412 352 A B B 45 2. Collapse “Unique Connector” Overlaps:

  27. Identifying Unique DNA Stretches Repetitive DNA unitig Unique DNA unitig Arrival Intervals Discriminator Statistic is log-odds ratio of probability unitig is unique DNA versus 2-copy DNA. +10 -10 0 Dist. For Unique Dist. For Repetitive Definitely Repetitive Don’t Know Definitely Unique

  28. Assembly Pipeline Mated reads Scaffold U-unitigs with confirmed pairs Trim & Screen Overlapper Unitiger Scaffolder Repeat Rez I, II

  29. Assembly Pipeline Trim & Screen Fill repeat gaps with doubly anchored positive unitigs Overlapper Unitig>0 Unitiger Scaffolder Repeat Rez I, II

  30. REPEATS

  31. Handling repeats • Repeat detection • pre-assembly: find fragments that belong to repeats • statistically (most existing assemblers) • repeat database (RepeatMasker) • during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001) • post-assembly: find repetitive regions and potential mis-assemblies. • Reputer, RepeatMasker • "unhappy" mate-pairs (too close, too far, mis-oriented) • Repeat resolution • find DNA fragments belonging to the repeat • determine correct tiling across the repeat

  32. Statistical repeat detection Significant deviations from average coverage flagged as repeats. - frequent k-mers are ignored - “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp) Problem 1: assumption of uniform distribution of fragments - leads to false positives non-random libraries poor clonability regions Problem 2: repeats with low copy number are missed - leads to false negatives

  33. Mis-assembled repeats excision collapsed tandem rearrangement

More Related