Genome sequence assembly

Genome sequence assembly Assembly concepts and methods Mihai Pop Center for Bioinformatics and Computational Biology University of Maryland

Building a library • Break DNA into random fragments (8-10x coverage) Actual situation

Building a library • Break DNA into random fragments (8-10x coverage) • Sequence the ends of the fragments • Amplify the fragments in a vector • Sequence 800-1000 (500-700) bases at each end of the fragment

Assembling the fragments

I II R F I II R F II I R F Forward-reverse constraints • The sequenced ends are facing towards each other • The distance between the two fragments is known (within certain experimental error) Insert R F Clone

Building Scaffolds • Break DNA into random fragments (8-10x coverage) • Sequence the ends of the fragments • Assemble the sequenced ends • Build scaffolds

Assembly gaps Physical gaps Sequencing gaps sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

Unifying view of assembly Assembly Scaffolding

Shotgun sequencing statistics

Typical contig coverage Imagine raindrops on a sidewalk

Lander-Waterman statistics L = read length T = minimum detectable overlap G = genome size N = number of reads c = coverage (NL / G) σ = 1 – T/L E(#islands) = Ne-cσ E(island size) = L((ecσ – 1) / c + 1 – σ) contig = island with 2 or more reads

Example Genome size: 1 Mbp Read Length: 600 Detectable overlap: 40

Experimental data Caveat: numbers based on artificially chopping up the genome of Wolbachia pipientis dMel

Read coverage vs. Clone coverage 4 kbp 1 kbp Read coverage = 8X Clone (insert) coverage = 16 2X coverage in BAC-ends implies 100x coverage by BACs (1 BAC clone = approx. 100kbp)

Assembly paradigms • Overlap-layout-consensus • greedy (TIGR Assembler, phrap, CAP3...) • graph-based (Celera Assembler, Arachne) • Eulerian path (especially useful for short read sequencing)

TIGR Assembler/phrap Greedy • Build a rough map of fragment overlaps • Pick the largest scoring overlap • Merge the two fragments • Repeat until no more merges can be done

Overlap-layout-consensus Main entity: read Relationship between reads: overlap 1 4 7 2 5 8 3 6 9 2 3 4 5 6 7 8 9 1 ACCTGA ACCTGA AGCTGA ACCAGA 1 2 3 2 3 1 1 2 3 3 1 1 2 3 1 3 2 2

Paths through graphs and assembly • Hamiltonian circuit: visit each node (city) exactly once, returning to the start Genome

Implementation details

Overlap between two sequences overlap (19 bases) overhang (6 bases) …AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT… overhang % identity = 18/19 % = 94.7% • overlap - region of similarity between regions • overhang - un-aligned ends of the sequences • The assembler screens merges based on: • length of overlap • % identity in overlap region • maximum overhang size.

All pairs alignment • Needed by the assembler • Try all pairs – must consider ~ n2 pairs • Smarter solution: only n x coverage (e.g. 8) pairs are possible • Build a table of k-mers contained in sequences (single pass through the genome) • Generate the pairs from k-mer table (single pass through k-mer table) k-mer

REPEATS

RptA RptB 3 6 9 12 2 5 8 11 1 4 7 10 13 6 4 8 10 2 12 1 13 3 11 5 9 7

4 6,10 8 2 12 1 13 3 7 11 5,9 Non-repetitive overlap graph

Handling repeats • Repeat detection • pre-assembly: find fragments that belong to repeats • statistically (most existing assemblers) • repeat database (RepeatMasker) • during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001) • post-assembly: find repetitive regions and potential mis-assemblies. • Reputer, RepeatMasker • "unhappy" mate-pairs (too close, too far, mis-oriented) • Repeat resolution • find DNA fragments belonging to the repeat • determine correct tiling across the repeat

Statistical repeat detection Significant deviations from average coverage flagged as repeats. - frequent k-mers are ignored - “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp) Problem 1: assumption of uniform distribution of fragments - leads to false positives non-random libraries poor clonability regions Problem 2: repeats with low copy number are missed - leads to false negatives

Mis-assembled repeats excision collapsed tandem rearrangement

SASA repeat (4776 AA, 14Kb)from Streptococcus pneumoniae MTETVEDKVSHSITGLDILKGIVAAGAVISGTVATQTKVFTNESAVLEKTVEKTDALATNDTVVLGTISTSNSASSTSLSASESASTSASESASTSASTSASTSASESASTSASTSISASSTVVGSQTAAATEATAKKVEEDRKKPASDYVASVTNVNLQSYAKRRKRSVDSIEQLLASIKNAAVFSGNTIVNGAPAINASLNIAKSETKVYTGEGVDSVYRVPIYYKLKVTNDGSKLTFTYTVTYVNPKTNDLGNISSMRPGYSIYNSGTSTQTMLTLGSDLGKPSGVKNYITDKNGRQVLSYNTSTMTTQGSGYTWGNGAQMNGFFAKKGYGLTSSWTVPITGTDTSFTFTPYAARTDRIGINYFNGGGKVVESSTTSQSLSQSKSLSVSASQSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASGSASTSTSASASTSASASASTSASASASISASESASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASVSASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASVSASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASASTSASASASTSASASASTSASASASISASESASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASVSASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASASTSASASASTSASASASTSASASASISASESASTSASASASASTSASASASTSASASASTSASASASISASESASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASGSASTSTSASASTSASASASTSASASASISASESASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASVSASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASVSASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASVSASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSVSNSANHSNSQVGNTSGSTGKSQKELPNTGTESSIGSVLLGVLAAVTGIGLVAKRRKRDEEE

Genome sequence assembly