Genome Assembly and Finishing Alla Lapidus, Ph.D. Associate Professor Fox Chase Cancer Center

Genome Assembly and Finishing Alla Lapidus, Ph.D. Associate Professor Fox Chase Cancer Center

A typical Microbial (and not only) project Sequencing Draft assembly Goals: Completely restore genome Produce high quality consensus FINISHING Annotation Public release

Sequencing Technology at a Glance

Sanger only 4x of 3kb plasmids + 4x of 8kb plasmids + 1x of fosmids ~ $50k for 5MB genome draft Evolution of Microbial Drafts • Hybrid Sanger/pyrosequence/Illumina • 4x 8kb Sanger + 15 x coverage 454 shotgun + 20x Illumina (quality improvement) • ~ $35k for 5MB genome draft 454 + Solexa -20x coverage 454 standard + 4x coverage 454 paired end (PE) + 50x coverage Illumina shotgun (quality improvement; gaps) - ~ $10k per 5MB genome Solexa only - low cost; too fragmented; good assembler is needed! Solexa +PacBio - low cost; better sachffolding

Process Overview

Library Preparation - Sanger DNA fragmentation Random fragment DNA

Library Preparation - new

Sanger reads only (phrap, PGA, Arachne) ---------40kb-------- • Hybrid Sanger/pyrosequence/Solexa (no special assemblers; use Newbler, PGA, Arachne) --8kb-- --8kb-- --8kb-- --8kb-- 454 contig --8kb-- --8kb-- 454 shreds --3kb-- --8kb-- --8kb-- --3kb-- Shotgun reads PE reads Assembly (assembler) • 454/Solexa (Newbler, PCAP, Velvet, ALLPATH etc) –

PCR product pri1 pri2 16 PCR - sequence Draft assembly - what we get Assembly: set of contigs 10 16 21 Ordered sets of contigs (scaffolds) 10 21 PE Clone walk (Sanger lib) New technologies: no clones to walk off even if you can scaffold contigs (bPCR – new approach of gap closing)

PCR – sequence (un captured gaps) PCR product Template: gDNA Primer walking Clone walk (captured gaps) Clone A

Why do we have gaps What are gaps ? - Genome areas not covered by random shotgun • Sequencing coverage may not span all regions of the genome, thus producing gaps in the assembly – colony picking • Assembly results of the shotgun reads may produce misassembled regions due to repetitive sequences (new and old tech) • A biased base content (this can result in failure to be cloned, poor stability in the chosen host-vector system, or inability of the polymerase to reliably copy the sequence): • ~ AT-rich DNA clones poorly in bacteria (cloning bias; • promoters like structures {Sanger} )=> uncaptured gaps • ~GC rich DNA is difficult to PCR and to sequence and often • requires the use of special chemistry => captured gaps • ~ high AT and GC content caused by problematic PCR (new tech)

Actual genome Assembling repeats

High GC sequencing problems: • The presence of small hairpins (inverted repeat sequences) in the • DNA that re anneal ether during sequencing or electrophoresis • resulting in failed sequencing reactions or unreadable electrophoresis • results. (This can be aided by adding modifiers to the reaction, • sequencing smaller clones and running gels at higher temperatures in • the presence of stronger denaturants).

Why more than one platform? 454 - high quality reliable skeletons of genomes (454 std + 454 PE): correctly assembled contigs; problems with repeats (unassembled or assembled in contigs outside of main scaffolds); homopolymer related frame shifts Illumina data is used to help improve the overall consensus quality, correct frameshifts and to close secondary structure related gaps; not ready for de-novo assembly of complex genomes (too many gaps!) Sanger – finishing reads; fosmids – larger repeats and templates for primer walk – less cost effective but very useful in many cases

454 (pyrosequence) and low GC genomes Thermotoga lettingae TMO Draft assembly +454 - 2 total contigs; 1 contigs >2kb - 454 – no cloning Sanger based draft assembly: - 55 total contigs; 41 contigs >2kb - 38GC% - biased Sanger libraries <166bp> - average length of gaps

454 and High GC projects Xylanimonas cellulosilytica DSM 15894 (3.8 MB; 72.1% GC) PGA assembly - 9x of 8kb +454 PGA assembly - 9x of 8kb

NextGen high Quality Drafts at JGI (multiple sequencing platforms) Solexa Unassembled 454 reads Solexa contig 454/Sanger contig Fosmid ends* and 454 PE 1.Pyrosequence and Sanger to obtain main ordered and oriented part of the assembly – Newbler assembler 2. GapResolution (in house tool) to close some (up to 40%) gaps using unassembled 454 data – PGA or Newbler assemblers 3. Solexa reads to detect and correct errors in consensus – in house created tool (the Polisher) and close gaps (Velvet) * Fosmids ends not used for microbes

Solving gaps: gapResopution tool Step 2Assemble reads in contigs adjacent to the gap and reads obtained from contigs outside the scaffold. Sometimes use assembler other than Newbler for sub-assemblies (PGA) Step 1For each gap, identify read pairs from contigs found on different scaffolds Contig Gap Consensus from sub-assembly Contig Gap (due to repeat) Read pairs that are found in contigs outside of this scaffold

Solving gaps: gapResopution tool (II) Gap Contig Design sequencing reactions to close gap Step 3If gap is not closed, tool designs designs primers for sequencing reactions Step 4Iterate as necessary (in sub-assemblies) http://www.jgi.doe.gov/degilbert@lbl.gov

Velvet assembly Blast Velvet contigs against Newbler ends Use proper Velvet contigs to close gaps Solexa for gaps Velvet contig 454 Contig Gap Illumina reads Velvet contigs close gaps caused by hairpins and secondary structures

Low quality areas – areas of potential frameshifts Assemblies contain low quality regions (red tags)

Homopoymer related frameshifts Frameshift 1 (AAAAA, should be AAAA) homopolymers (n>=3) Frameshift 2 (CCCC, should be CCC) Modified from N. Ivanova (JGI)

Polisher: software for consensus quality improvement Step 2:Analyze and correct consensus errors Step 1:Align Illumina data to 454-only or Sanger/454 hybrid assembly Contig C T A A T Illumina reads A A G A Unsupported Corrections a. Illumina coverage < 10X b. Illumina coverage >= 10X and <70% of Illumina bases agree with the reference base Illumina coverage >= 10X and at least 70% llumina bases disagrees with the reference base Step 3:Design sequencing reactions for low quality and unsupported Illumina areas Sanger/454 low quality Unsupported Illumina region

Finished consensus 454 contig Sanger reads Errors corrected by Solexa Frame shift detected (454 contig) CCTCTTTGATGGAAATGATA**TCTTCGAGCATCGCCTC**GGGTTTTCCATACAGAGAACCTTTGATGATGAACCGGTTGAAGATCTGCGGGTCAAA CCTCTTTGATGGAAATAATA**TATTCGAGCATC TTAGTGGAAATGATA**TCTTCGAGCATCGCCTC CGAGCNTCGCCTC**GGGCTTTCCCT CGAGCATCGCCTC**GGGTTCTCCATACACAGA GCATCGCCTC**GGGTTTTCAATACAGAGAACCT CAGCGCCTC**GGGTTTTCCATACAGAGAACCTT ATCGCCTC**GGGTTTTCCAGACAGAGAACCTTT GGTTC**GGGTTTTCCATACAGAGAACCTTTGAT GTTTTCCATACAGAGAACATTTGATGATGAAC GTTGTCCATACAGAGAACTTTTGATGATGAAC TATANCATACAGAGAACCTTTGATGATGAACC ATTTCCAGACAGAGAACCNTTGATGATGAACC CAAACAGAGAACCTTTGAGGATGAACCGGTTG ACAGGGAACCTTAGATGATGAACCGGTTGAAG ACAGAGAACCTTAGATGATGAACCGGTTGAAG ACCGTTGATGATGAACCGGTTGAAGATCTGCG GATGGTGAACGGGTTGAAGATCTGCGGGTCAA GGTTTGAAGATCTGCGGGTCAAACCAGTCCTC GGTGGAAGATCTGCGGGTAAAACCAGTCCTCT GGT.GNAGAGCTGCGGGTCAAACCAGTCCTCTG TGAAGATCTGCGGTTCAAACCAGTCCTCTCCC GATCGGCGTGTCAAACCAGTCCTCTGCCTCGT TCTGCGGGTCAAACCAGTACTCTGCCTCGTTC

The process of taking a rough draft assembly composed of shotgun sequencing reads, identifying and resolving miss assemblies, sequence gaps and regions of low quality to produce a highly accurate finished DNA sequence. So, what is Finishing? Final quality: Final error rate should be less than 1 per 50 Kb. No gaps, no misassembled areas, no characters other than ACGT

Genome projectsArchaea + Bacteria only 298 Complete Genomes 137 Complete Genomes http://www.genomesonline.org/

Typically size of metagenomic sequencing project is very large Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies Chimerical contigs produced by co-assembly of sequencing reads originating from different species. Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly. No assemblers developed for metagenomic data sets Metagenomic assembly and Finishing The whole-genome shotgun sequencing approach was used for a number of microbial community projects, however useful quality control and assembly of these data require reassessing methods developed to handle relatively uniform sequences derived from isolate microbes.

QC: Annotation of poor quality sequence To avoid this: -make sure you use high quality sequence -choose proper assembler A Bioinformatician's Guide to Metagenomics . Microbiol Mol Biol Rev. 2008 December; 72(4): 557–578.

Assembly mistakes A Bioinformatician's Guide to Metagenomics. Microbiol Mol Biol Rev. 2008 December; 72(4): 557–578.

Use Trimmer (Lucy etc) to treat reads PRIOR to assembly None of the existing assemblers designed for metagenomic data but assemblers like PGA work better with paired reads information and produce better assemblies. We currently test Newbler assembler for second generation sequencing: 454 only and 454/Solexa co-assembly Recommendations for metagenomic assembly

CAP reads + Non-CAP reads Metagenomic finishing: approach Candidatus Accumulibacter phosphatis(CAP) Binning:Which DNA fragment derived from which phylotype? (BLAST; GC%; read depth) Lucy/PGA Complete genome of Candidatus Accumulibacter phosphatis ~ 45%

Few more details: read quality

Merged assemblies ( k=31andk=51) with minimus(Cloneview used for visualization) • Green k=31 • Purple k=51 Illumina only data

Stats for 31, 51 and merged 31-51 assemblies

Thank you!

Genome Assembly and Finishing Alla Lapidus, Ph.D. Associate Professor Fox Chase Cancer Center

Genome Assembly and Finishing Alla Lapidus, Ph.D. Associate Professor Fox Chase Cancer Center

Presentation Transcript

Microbial Genome Assembly and Finishing Alla Lapidus, Ph.D. Microbial genomics DOE Joint Genome Institute, Walnut Cree

GIST Research at Fox Chase Cancer Center

Genome Assembly

Priyantha Jayawickrama, Ph.D. Associate Professor

Genome Assembly and Finishing Alla Lapidus, Ph.D. Associate Professor Fox Chase Cancer Center

Genome Assembly

Maryann Davis, Ph.D. Research Associate Professor

Genome Assembly

Associate Professor Ruth Banomyong (Ph.D.)

Genome Assembly

Jaime Lester, Ph.D. Associate Professor

Comparative Assembly for Cancer Human Genome

FOX CHASE Cancer Center

Genome Assembly

Susan Ceryak, Ph.D. Associate Research Professor

Rakesh Kumar Dutta, Ph.D. Associate Professor

Genome closure and finishing

Microbial Genome Assembly and Finishing Alla Lapidus, Ph.D. Microbial genomics

Microbial Genome Assembly and Finishing Alla Lapidus, Ph.D. Microbial genomics

Robert B. Trelease, Ph.D. Associate Professor and Associate Director,

Adrian V. Lee, Ph.D. Associate Professor Breast Center

Robert B. Trelease, Ph.D. Associate Professor and Associate Director,