Genome Assembly Final Results

Genome Assembly Final Results. Jeri Dilts, Suzanna Kim, Hema Nagrajan, Deepak Purushotham, Ambily Sivadas, Amit Rupani, Leo Wu. 02-22-2012. Outline: pipeline for evaluation; quantitative evaluation; qualitative evaluation; choosing the BEST assembly; final results; demo.


Presentation Transcript


  1. Genome Assembly Final Results • Jeri Dilts, Suzanna Kim, Hema Nagrajan, Deepak Purushotham, Ambily Sivadas, Amit Rupani, Leo Wu • 02-22-2012

  2. Outline • Pipeline for evaluation • Quantitative evaluation • Qualitative evaluation • Choosing the BEST assembly • Final results • Demo

  3. Pipeline for evaluation

  4. Strategy – Key alterations
     • Prinseq preprocessing: unnecessary, since the assemblers have built-in capabilities
        • Use Prinseq for data statistics
     • Error correction: does not fit our methods
        • Coral is based on overlap-layout-consensus and works best with de Bruijn graph assemblers
        • Echo has never been tested on 454 data
     • Final assemblers: Newbler, Mira, Celera, AmosCMP
     • Discarded assemblers: Abyss, Velvet, and Pcap454
     • MAIA hybrid assembly: needs a phylogenetically close reference genome

  5. Outline • Pipeline for evaluation • Quantitative evaluation • Qualitative evaluation • Choosing the BEST assembly • Final results • Demo

  6. Quantitative Evaluation
     • Metrics
        • No. of contigs -> the fewer, the better
        • N50 -> the higher, the better
        • Assembly size -> the closer to the estimated genome size, the better
     • Quantitative assembly score = (N50 * Assembly size) / No. of contigs
        • The higher the score, the better!
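
The score on this slide can be reproduced from a plain list of contig lengths. What follows is a minimal Python sketch, not the evaluation script used for these results. The function names `n50` and `quantitative_score` are hypothetical, and N50 is computed under the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly.

```python
def n50(contig_lengths):
    """Return N50 for a list of contig lengths (common definition, see lead-in)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # only reached for an empty list


def quantitative_score(contig_lengths):
    """(N50 * assembly size) / number of contigs, as written on the slide."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    assembly_size = sum(contig_lengths)
    return n50(contig_lengths) * assembly_size / len(contig_lengths)


# Two hypothetical assemblies of the same total size (300 bp):
fragmented = [100, 80, 60, 40, 20]
contiguous = [200, 100]
print(n50(fragmented), quantitative_score(fragmented))
print(n50(contiguous), quantitative_score(contiguous))
```

With equal assembly size, the assembly with fewer and longer contigs receives the higher score, which is the behaviour the three metrics above ask for.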

  7. M19107 - Evaluation

  8. Outline • Pipeline for evaluation • Quantitative evaluation • Qualitative evaluation • Choosing the BEST assembly • Final results • Demo

  9. Qualitative Evaluation • Strategy • Align the assembly contigs to the original reference genome and compute differences • Challenges • No Original reference genome for our data set • Approach • Create simulated 454 read datasets from a completely sequenced genome • Tools used • FlowSim • 454Sim • Art-454

  10. FlowSim
      • A simulation pipeline based on real data
      • Lets you model each step of the pyrosequencing process
      • Utilities:
         • Clonesim: To simulate the shearing step
            Usage: clonesim -c count -l dist input.fasta
         • Gelfilter: To select a certain range of clone lengths
            Usage: gelfilter min max
         • Kitsim: To attach A and B adaptors
            Usage: kitsim -k key -a adapter input.fasta -o output.fasta
         • Mutator: To introduce random substitutions and indels in the sequences
            Usage: mutator -i indel_rate -s subst_rate input.fasta -o output.fasta
         • Duplicator: To generate artificial duplicates of many clones
            Usage: duplicator dup_prob
         • Flowsim: To simulate the actual pyrosequencing process
            Usage: flowsim -G generation input.fasta -o output.sff
      • Example: clonesim -c 400000 -l "Normal 350 95" input.fasta | gelfilter 25 600 | kitsim | mutator | duplicator 0.03 | flowsim -G Titanium -o output.sff
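
When the same FlowSim run has to be repeated with different clone counts or duplication rates, it can help to build the pipeline string from parameters. The sketch below is one assumed way to script that in Python. It only assembles the command line from the example on this slide and does not run any tool. The function name and parameter names are hypothetical, and only flags that appear on the slide are used.

```python
import shlex


def flowsim_pipeline(fasta, sff, clones=400000, length_dist="Normal 350 95",
                     min_len=25, max_len=600, dup_prob=0.03,
                     generation="Titanium"):
    """Build the shell pipeline from the slide's example as a single string."""
    stages = [
        ["clonesim", "-c", str(clones), "-l", length_dist, fasta],  # shearing
        ["gelfilter", str(min_len), str(max_len)],                  # size selection
        ["kitsim"],                                                 # adaptors
        ["mutator"],                                                # substitutions/indels
        ["duplicator", str(dup_prob)],                              # artificial duplicates
        ["flowsim", "-G", generation, "-o", sff],                   # pyrosequencing
    ]
    # shlex.join quotes arguments containing spaces, e.g. the length distribution.
    return " | ".join(shlex.join(stage) for stage in stages)


print(flowsim_pipeline("input.fasta", "output.sff"))
```

The printed string can be pasted into a shell or written to a job script, which keeps the parameters of each simulated dataset on record.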

  11. 454Sim
      • 454 simulation at higher speed and accuracy
      • USP: configurable statistical models
      • Supports GS FLX, Titanium and GS 20
      • Utilities:
         • fragsim: To simulate shearing
            Usage: fragsim -c 1000000 -l 1000 genome.fasta > genome.fragments.fasta
         • 454sim: To simulate the sequencing step
            Usage: 454sim -o genome.sff genome.fragments.fasta
      • Example: fragsim -c 250000 -l 1000 genome.fasta | 454sim -g FLX -o genome.sff

  12. ART-454
      • Supports Illumina (Solexa), 454 and SOLiD read simulation
      • Used for the 1000 Genomes Project
      • Usage:
         • Art_454 Input.fasta Output_prefix Fold_coverage (single-end reads)
         • Art_454 Input.fasta Output_prefix Fold_coverage Mean_Frag_Len Std_Deviation (paired-end reads)

  13. Running pipeline on simulated reads • Reference: Haemophilus influenzae F3047 (NC_014922) • Ran 454Sim, FlowSim and Art-454 to generate reads • Ran de novo assemblers: Newbler, Mira3 and Celera (CABOG) • Merged assemblies using Minimus2 • Evaluate assembly accuracy (How?)

  14. Assembly Accuracy • Challenges • Alignment of contigs to the reference genome • Approach • Local alignment (BLAST, bwa, bowtie) • Whole genome alignment (Mauve, MUMmer) • Align the assembly to the reference genome • Compute nucleotide differences, gaps and rearranged segments

  15. Mauve • Uses positional homology genome alignment • Each site in the assembly maps to at most one site on the reference • Optimized contiguity • E.g. progressiveMauve • Ordering of contigs: Mauve Contig Mover algorithm • Compare to identify differences

  16. Mauve Genome Aligner

  17. After Ordering of Contigs

  18. Mauve Assembly Metrics • Basecalling accuracy • Count and location of bases called wrongly • Direction of miscalling, e.g. A->G • Count and location of bases predicted to exist, but uncalled • Genome content accuracy • Count and location of bases missing from the assembly • Count and location of extra bases in the assembly • Size distribution of the missing and extra fragments • Genome structure accuracy • Estimate of misassembly count

  19. Example
      • Reference genome: AGGCTAGCGCGCGATTAGGATC
      • Assembly: AGTAGCGGGCCGATTAAGANC
      • Alignment
           AGGCTAGCGCG-CGATTAGGATC
           AG--TAGCGGGCCGATTAAGANC
      • Miscalls: 2 (C->G and G->A)
      • Uncalled bases: 1 (N)
      • Extra bases: 1 (insertion of C)
      • Missing bases: 2 (deletion of GC)
      • Missing segments: 1
      • Extra segments: 1
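
The counts in this example can be checked by walking the two aligned strings column by column. The Python sketch below is an illustration of the bookkeeping only; it is not Mauve's implementation, and the function name `alignment_metrics` and the returned keys are hypothetical. It assumes a pairwise alignment with `-` as the gap character and `N` as an uncalled base.

```python
from collections import Counter


def alignment_metrics(ref_aln, asm_aln, gap="-"):
    """Count miscalls, uncalled, extra and missing bases in a pairwise alignment."""
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned sequences must have equal length")
    directions = Counter()
    m = {"uncalled": 0, "extra_bases": 0, "missing_bases": 0,
         "extra_segments": 0, "missing_segments": 0}
    previous = None
    for r, a in zip(ref_aln.upper(), asm_aln.upper()):
        if a == gap and r != gap:
            kind = "missing"            # reference base absent from the assembly
            m["missing_bases"] += 1
            if previous != "missing":   # a new run of gaps starts a new segment
                m["missing_segments"] += 1
        elif r == gap and a != gap:
            kind = "extra"              # assembly base absent from the reference
            m["extra_bases"] += 1
            if previous != "extra":
                m["extra_segments"] += 1
        else:
            kind = "aligned"
            if a == "N":
                m["uncalled"] += 1
            elif a != r:
                directions[f"{r}->{a}"] += 1
        previous = kind
    m["miscalls"] = sum(directions.values())
    m["miscall_directions"] = dict(directions)
    return m


result = alignment_metrics("AGGCTAGCGCG-CGATTAGGATC",
                           "AG--TAGCGGGCCGATTAAGANC")
print(result)
```

Run on the alignment above, the function reports 2 miscalls (C->G and G->A), 1 uncalled base, 1 extra base, 2 missing bases, and one segment of each kind, matching the slide.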

  20. Scoring simulated reads with Mauve • Reference: Haemophilus influenzae F3047 (NC_014922) • Ran 454Sim, FlowSim and Art-454 to generate reads • Ran de novo assemblers: Newbler, Mira3 and Celera (CABOG) • Merged assemblies using Minimus2 • Ran Mauve to align the assemblies back to the reference genome • Computed assembly metrics

  21. Miscalled Bases

  22. Uncalled bases

  23. Total missing bases

  24. Total extra segments

  25. Outline • Pipeline for evaluation • Quantitative evaluation • Qualitative evaluation • Choosing the BEST assembly • Final results • Demo

  26. Choosing the BEST assembly • Quantitative metrics • N50 • Contig count • Assembly size • Qualitative metrics • Miscalled bases • Uncalled • Missing bases • Extra bases

  27. Assembly Scores
      • Quantitative score = (N50 * Assembly size) / No. of contigs
      • Qualitative score (% accuracy) = 1 - (Miscalls + Uncalled + Missing + Extra + Gaps in Ref + Gaps in Assembly) / Reference size
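
The qualitative score is simple arithmetic once the error counts are available. The following is a minimal Python sketch under stated assumptions: the function name `qualitative_score` and its parameter names are hypothetical, the six counts are treated as independent totals supplied by the caller, and the result is returned as a fraction (multiply by 100 for a percentage).

```python
def qualitative_score(miscalls, uncalled, missing, extra,
                      gaps_in_ref, gaps_in_assembly, reference_size):
    """1 - (sum of error counts) / reference size, as written on the slide."""
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    errors = (miscalls + uncalled + missing + extra
              + gaps_in_ref + gaps_in_assembly)
    return 1 - errors / reference_size


# Hypothetical counts for a 1,000,000 bp reference, for illustration only.
score = qualitative_score(miscalls=120, uncalled=30, missing=500, extra=250,
                          gaps_in_ref=60, gaps_in_assembly=40,
                          reference_size=1_000_000)
print(f"{100 * score:.2f}% accuracy")
```

Because the denominator is the reference size, the score is only comparable between assemblies that were aligned to the same reference.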

  28. Metrics Summary – Art 454 ASSEMBLY SCORE QUALITY SCORE

  29. Assembly spec. vs Accuracy plot – 454Sim

  30. Assembly spec. vs Accuracy plot - Art-454

  31. Assembly spec. vs Accuracy plot – FlowSim

  32. Assembly spec. vs Accuracy plot – M21709

  33. Inference • Striking a balance is critical • We chose • Newbler + MIRA for H. haemolyticus • Newbler + AMOScmp for H. influenzae • Universally applicable pipeline • Assembling specific genomes/strains • Choose the combination that strikes the best balance for your genome • NEWBLER + (CELERA/MIRA) • Adopt the most consistent tool/pipeline (conservative approach) • NEWBLER

  34. Outline • Pipeline for evaluation • Quantitative evaluation • Qualitative evaluation • Choosing the BEST assembly • Final results • Demo

  35. Final Results

  36. Key take-aways • Understand your data • Platform, long/short reads, coverage, paired/non-paired, quality of basecalling, etc. • Evaluate the need for error correction • Choose a set of "best" assemblers • De novo/reference assembly, DBG/OLC algorithm • Merge assemblies • Ordering and scaffolding • Finishing • Evaluate your assembly at every step to ensure that you are on the right track!

  37. Coming next >>> Demo
