Genome Assembly and Finishing Alla Lapidus, Ph.D. Associate Professor Fox Chase Cancer Center. A typical Microbial (and not only) project. Sequencing. Draft assembly . Goals: Completely restore genome Produce high quality consensus. FINISHING . Annotation. Public release.
Alla Lapidus, Ph.D.
Fox Chase Cancer Center
Completely restore genome
Produce high quality consensus
4x of 3kb plasmids + 4x of 8kb plasmids + 1x of fosmids
~ $50k for 5MB genome draft
Evolution of Microbial Drafts
454 + Solexa
- 20x coverage 454 standard + 4x coverage 454 paired end (PE) + 50x coverage Illumina shotgun (quality improvement; gaps)
- ~ $10k per 5MB genome
Solexa only - low cost; too fragmented; good assembler is needed!
Solexa + PacBio - low cost; better scaffolding
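The coverage figures above translate into sequencing volume through the usual relation coverage = (total sequenced bases) / (genome size). The following is a minimal sketch of that arithmetic for the 454 + Solexa mix. The function names and the 400 bp read length used in the example are illustrative assumptions and do not come from the slides.

```python
import math

def bases_needed(genome_size_bp: int, coverage: int) -> int:
    """Total sequenced bases required for a given average coverage."""
    return genome_size_bp * coverage

def reads_needed(genome_size_bp: int, coverage: int, read_length_bp: int) -> int:
    """Number of reads of a fixed length needed for a given average coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)

# Coverage mix taken from the slide; genome size is the 5 Mb example.
GENOME_SIZE_BP = 5_000_000
COVERAGE_MIX = {"454 standard": 20, "454 paired end": 4, "Illumina shotgun": 50}

total_bases = sum(bases_needed(GENOME_SIZE_BP, c) for c in COVERAGE_MIX.values())

# Assumed (hypothetical) read length of 400 bp for the 454 standard library.
reads_454_std = reads_needed(GENOME_SIZE_BP, COVERAGE_MIX["454 standard"], 400)
```

Under these assumptions the three libraries together amount to 74x coverage of the 5 Mb genome.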
Random fragment DNA
PE reads
Assembly (assembler)
What are gaps?
- Genome areas not covered by random shotgun
454 - high quality reliable skeletons of genomes (454 std + 454 PE): correctly assembled contigs; problems with repeats (unassembled or assembled in contigs outside of main scaffolds); homopolymer related frame shifts
Illumina data is used to help improve the overall consensus quality, correct frameshifts and to close secondary structure related gaps; not ready for de-novo assembly of complex genomes (too many gaps!)
Sanger – finishing reads; fosmids – larger repeats and templates for primer walk – less cost effective but very useful in many cases
Thermotoga lettingae TMO
Draft assembly +454
- 2 total contigs; 1 contig >2kb
- 454 – no cloning
Sanger based draft assembly:
- 55 total contigs; 41 contigs >2kb
- 38% GC - biased Sanger libraries
<166bp> - average length of gaps
Xylanimonas cellulosilytica DSM 15894 (3.8 MB; 72.1% GC)
PGA assembly - 9x of 8kb +454
PGA assembly - 9x of 8kb
Unassembled 454 reads
Fosmid ends* and 454 PE
1. Pyrosequence and Sanger to obtain main ordered and oriented part of the assembly: PGA or Newbler assemblers
2. GapResolution (in-house tool) to close some (up to 40%) gaps using unassembled 454 data
3. Solexa reads to detect and correct errors in consensus: in-house tool (the Polisher), and to close gaps (Velvet)
* Fosmids ends not used for microbes
Step 1: For each gap, identify read pairs from contigs found on different
Step 2: Assemble reads in contigs adjacent to the gap and reads obtained from contigs outside the scaffold. Sometimes use an assembler other than Newbler for sub-assemblies (PGA)
Step 3: If the gap is not closed, the tool designs primers for sequencing reactions
Step 4: Iterate as necessary (in sub-assemblies)
Gap (due to repeat)
Read pairs that are found in contigs outside of this scaffold
Design sequencing reactions to close gap
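The read-pair selection in Step 1 can be sketched as follows. This is not the GapResolution code, which the slides do not show. It is a minimal illustration that assumes a simplified data model: a scaffold is an ordered list of contig names, and each read pair is reduced to the two contigs its mates were assembled into. All names are hypothetical.

```python
def candidate_pairs_per_gap(scaffold, read_pairs):
    """For each gap between neighbouring contigs of a scaffold, list read pairs
    with one mate in a flanking contig and the other mate in a contig that is
    not part of this scaffold.

    scaffold   -- ordered list of contig names
    read_pairs -- dict: pair id -> (contig of mate 1, contig of mate 2)
    Returns a dict: (left contig, right contig) -> list of (pair id, outside contig)
    """
    in_scaffold = set(scaffold)
    result = {}
    # Each gap is defined by two consecutive contigs in the scaffold.
    for left, right in zip(scaffold, scaffold[1:]):
        flanks = {left, right}
        hits = []
        for pair_id, (c1, c2) in read_pairs.items():
            if c1 in flanks and c2 not in in_scaffold:
                hits.append((pair_id, c2))
            elif c2 in flanks and c1 not in in_scaffold:
                hits.append((pair_id, c1))
        result[(left, right)] = hits
    return result

example = candidate_pairs_per_gap(
    ["c1", "c2", "c3"],
    {"p1": ("c1", "c9"), "p2": ("c2", "c3"), "p3": ("c8", "c3")},
)
```

The outside contigs collected for a gap are the ones whose reads would go into the sub-assembly described in Step 2. A real tool would also need mate orientation and insert-size checks, which are omitted here.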
Assemblies contain low quality regions (red tags)
Frameshift 1 (AAAAA, should be AAAA)
Frameshift 2 (CCCC, should be CCC)
Modified from N. Ivanova (JGI)
Step 1: Align Illumina data to 454-only or Sanger/454 hybrid assembly
Step 2: Analyze and correct consensus errors
a. Illumina coverage < 10X
b. Illumina coverage >= 10X and <70% of Illumina bases agree with the reference base
Illumina coverage >= 10X and at least 70% of Illumina bases disagree with the reference base
Step 3: Design sequencing reactions for low quality and unsupported Illumina areas
Sanger/454 low quality
Unsupported Illumina region
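The coverage and agreement thresholds above lend themselves to a small per-position classifier. The sketch below is not the Polisher itself. It only applies the 10X and 70% thresholds from the slide to one consensus position. The category labels, the function name and the order in which overlapping conditions are checked are assumptions made for illustration.

```python
def classify_position(ref_base, illumina_bases, min_cov=10, min_frac=0.70):
    """Classify one consensus position from the Illumina bases aligned to it.

    ref_base       -- base in the 454-only or Sanger/454 hybrid consensus
    illumina_bases -- string of Illumina base calls covering this position
    Labels (hypothetical): 'low_coverage', 'supported', 'candidate_error',
    'unsupported'.
    """
    cov = len(illumina_bases)
    if cov < min_cov:
        return "low_coverage"           # case a: coverage below 10X
    agree = illumina_bases.count(ref_base)
    disagree = cov - agree
    if disagree / cov >= min_frac:
        return "candidate_error"        # at least 70% disagree with the reference
    if agree / cov >= min_frac:
        return "supported"              # at least 70% agree with the reference
    return "unsupported"                # case b: enough coverage, no clear majority

labels = [
    classify_position("A", "AAAAA"),
    classify_position("A", "A" * 9 + "C"),
    classify_position("A", "C" * 8 + "A" * 2),
    classify_position("A", "A" * 5 + "C" * 5),
]
```

Positions labelled `low_coverage` or `unsupported` correspond to the areas for which Step 3 designs sequencing reactions. A fuller version would also check that the disagreeing reads agree on a single alternative base before correcting the consensus.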
Sanger reads
Errors corrected by Solexa
Frame shift detected (454 contig)
So, what is Finishing?
shotgun sequencing reads, identifying and resolving misassemblies, sequence gaps and regions of low quality to produce a highly accurate finished DNA sequence.
Final error rate should be less than 1 per 50 Kb.
No gaps, no misassembled areas, no characters other than ACGT
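Two of the finishing criteria above are easy to check mechanically: the error-rate target and the absence of non-ACGT characters. The sketch below assumes the number of consensus errors has already been estimated by some other means. The function names are hypothetical.

```python
def non_acgt_positions(seq):
    """Return 0-based positions of characters other than A, C, G, T."""
    return [i for i, base in enumerate(seq.upper()) if base not in "ACGT"]

def meets_error_target(n_errors, seq_len_bp, bp_per_error=50_000):
    """True if the error rate is strictly below 1 error per bp_per_error bases.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return n_errors * bp_per_error < seq_len_bp

ambiguous = non_acgt_positions("ACGTNNacgt-")
ok_99 = meets_error_target(99, 5_000_000)
ok_100 = meets_error_target(100, 5_000_000)
```

For a 5 Mb genome the target works out to fewer than 100 errors in the finished sequence, so 99 errors passes and 100 does not.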
Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members
Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies
Chimeric contigs produced by co-assembly of sequencing reads originating from different species.
Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly.
No assemblers developed for metagenomic data sets
Metagenomic assembly and Finishing
The whole-genome shotgun sequencing approach was used for a number of microbial community projects; however, useful quality control and assembly of these data require reassessing methods developed to handle relatively uniform sequences derived from isolate microbes.
To avoid this:
- make sure you use high quality sequence
- choose proper assembler
A Bioinformatician's Guide to Metagenomics. Microbiol Mol Biol Rev. 2008 December; 72(4): 557–578.
None of the existing assemblers is designed for metagenomic data, but assemblers like PGA work better with paired-read information and produce better assemblies.
We are currently testing the Newbler assembler for second-generation sequencing: 454 only and 454/Solexa co-assembly
Recommendations for metagenomic assembly
Illumina only data