
Improving the Accuracy of Genome Assemblies

Improving the Accuracy of Genome Assemblies. July 17th 2012. Roy Ronen*,1, Christina Boucher*,1, Hamidreza Chitsaz 2 and Pavel Pevzner 1. Affiliations: 1. University of California, San Diego; 2. Wayne State University, Michigan. * Contributed equally to this work.



Presentation Transcript


  1. Improving the Accuracy of Genome Assemblies. July 17th 2012. Roy Ronen*,1, Christina Boucher*,1, Hamidreza Chitsaz 2 and Pavel Pevzner 1. Affiliations: 1. University of California, San Diego; 2. Wayne State University, Michigan. * Contributed equally to this work.

  2. ≈ $ thousands, ≈ several weeks, ≈ two people versus ≈ $ billions, ≈ several years, ≈ hundreds of people

  3. High Throughput Sequencing Assemblies

  4. Draft Genome from HTS: Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → Analysis, Analysis, Analysis

  5. HTS assemblies (contigs) still contain an abundance of errors: • 20-30 substitution errors per 100 kbp with SOAPdenovo. • 5-20 substitution errors per 100 kbp with Velvet. • Small (<50 bp) INDEL errors. • Misassemblies, large INDELs, etc. (Pipeline: Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → Analysis, Analysis, Analysis)

  6. Errors in the assembled contigs will profoundly affect any downstream analysis. (Pipeline: Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → Analysis, Analysis, Analysis)

  7. Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs; Reads + Contigs → SEQuel → Refined Contigs → Analysis, Analysis, Analysis

  8. De Bruijn Graph for Fragment Assembly

  9. De Bruijn Graph GCC CCA CAT CCT GCC ATT TTT CCT CTA TAT CTT CCA CTA TTA CAT TTT ATT CCT ATT TTA TAT CTT (Pevzner, Tang, Waterman 2001)

  10. De Bruijn Graph CCA CCA GCC ATT CCT GCC CAT TTT CCT CTA TAT CTT TTA CCT ATT CTT TTT ATT TTA CTA CAT TAT (Pevzner, Tang, Waterman 2001)

  11. De Bruijn Graph CCA CAT TAT CTA TTT GCC CTT ATT CAT GCC TTA ATT TAT CTA CTT ATT TTT TTA CCT CCT CCT (Pevzner, Tang, Waterman 2001)

  12. De Bruijn Graph GCC CCA CAT CAT ATT TAT CTT CTA TTT TTA ATT TTA CTT ATT CTA TAT TTT GCC CCT (Pevzner, Tang, Waterman 2001)

  13. De Bruijn Graph CCA CAT TTT CTT TAT CTA CAT GCC ATT ATT TAT TTT TTA CTT TTA ATT CTA CCT (Pevzner, Tang, Waterman 2001)

  14. De Bruijn Graph

  15. Challenges

  16. GCC CCT AGG GGA CTA GAC TAG CAC ACT TGG GGC CTT GCA TTG GCCTAGGAC CACTTGGCA GCCTAGGAC CACTTGGCA GCCTAGGAC CACTTGGCA ..............GCCTAGGAC.............CACTTGGCA..............

  17. Sequencing errors cause bulges in the de Bruijn graph GCC CCT AGG GGA CTA GAC TAG CAC ACT TGG GGC CTT GCA TTG TGGA TTGA CTTG CCTT GCCTTGGAC CACTTGGCA GCCTAGGAC CACTTGGCA GCCTAGGAC CACTTGGCA ..............GCCTAGGAC.............CACTTGGCA..............

  18. Sequencing errors cause bulges in the de Bruijn graph 2 2 AGG CTA 2 2 TAG 3 3 GCC CCT GGA GAC 1 1 4 4 TGG CTT TTG 3 GGC GCA 3 CAC ACT 3 3 GCCTTGGAC CACTTGGCA GCCTAGGAC CACTTGGCA GCCTAGGAC CACTTGGCA ..............GCCTAGGAC.............CACTTGGCA..............

  19. Sequencing errors cause bulges in the de Bruijn graph 3 3 GCC CCT GGA GAC 1 1 4 4 TGG CTT TTG 3 GGC GCA 3 CAC ACT 3 3 ......GCCTTGGAC...... ......CACTTGGCA...... GCCTTGGAC CACTTGGCA GCCTAGGAC CACTTGGCA GCCTAGGAC CACTTGGCA ..............GCCTAGGAC.............CACTTGGCA..............
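The bulge on slides 17 to 19 can be reproduced with a small sketch. It assumes the reads shown on the slides (one erroneous copy GCCTTGGAC, two copies of GCCTAGGAC, three copies of CACTTGGCA), 3-mer nodes, and 4-mer edges; the function name `edge_multiplicities` is hypothetical and this is not the authors' code.

```python
from collections import Counter

def edge_multiplicities(reads, k=3):
    """Count every (k+1)-mer; each one is an edge joining two k-mer nodes."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k):
            edges[read[i:i + k + 1]] += 1
    return edges

# Reads as listed on the slides (the first one carries the substitution error).
reads = ["GCCTTGGAC"] + ["GCCTAGGAC"] * 2 + ["CACTTGGCA"] * 3
edges = edge_multiplicities(reads)

# Edges seen only once are the ones introduced by the erroneous read.
weak = sorted(e for e, n in edges.items() if n == 1)
print(weak)
```

The two single-coverage edges leave the correct path at CCT and rejoin it at GGA, passing through the nodes CTT, TTG and TGG that belong to the other genomic region, which is how one substitution can connect two unrelated parts of the graph.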

  20. The SEQuel Algorithm

  21. Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs; Reads + Contigs → SEQuel → Refined Contigs → Analysis, Analysis, Analysis

  22. The SEQuel Algorithm. Permissively aligned read-pair: a read-pair for which at least one read aligned uniquely.
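The definition on slide 22 can be written as a one-line predicate. This sketch assumes that each mate is summarized by the number of locations it aligned to, which is an illustrative input format and not something the slides specify; the function name is hypothetical.

```python
def is_permissively_aligned(hits_mate1: int, hits_mate2: int) -> bool:
    """True when at least one mate of the pair has exactly one alignment."""
    return hits_mate1 == 1 or hits_mate2 == 1

# Hypothetical pairs: (alignment count of mate 1, alignment count of mate 2).
pairs = [(1, 1), (1, 7), (0, 1), (3, 4), (0, 0)]
kept = [p for p in pairs if is_permissively_aligned(*p)]
print(kept)
```

Under this reading, a pair is kept even when one mate is unaligned or falls in a repeat, as long as its partner anchors it uniquely.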

  23. Positional De Bruijn Graph

  24. Positional De Bruijn Graph Positional k-mer: a pair (k-mer, position), e.g. (GCCA, 111). GCC,975 GCC,111 CCT,976 CCT,112 CTA,977 TTT,114 CAT,113 TAT,978 CCA,112 CTT,113 ATT,114 TTA,115 TAT,978 CTA,977 CCT,976 TTT,114 ATT,979 CTT,113 TTA,115 ATT,114 CAT,113 CCA,112

  25. Positional De Bruijn Graph TAT,978 CTA,977 CCT,976 GCC,975 CTT,113 CCT,112 TTT,114 GCC,111 TAT,978 CTA,977 TTA,115 CCT,976 ATT,979 TTT,114 CCA,112 ATT,114 CTT,113 ATT,979 TTA,115 CCA,112 CCA,112 CAT,113 CAT,113 ATT,114 ATT,114

  26. Positional De Bruijn Graph (figure; labels: 4, 4, 4, 4)
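A minimal sketch of the positional k-mer idea from slides 24 to 26, using the two sequences and start positions that appear on the slides (GCCATTA at 111, GCCTATT at 975). It assumes each read comes with the position where it aligned, and it glues nodes only when both k-mer and position match exactly; the actual tool may tolerate small position offsets. The function name is hypothetical.

```python
from collections import defaultdict

def positional_de_bruijn(aligned_reads, k=3):
    """Build a graph whose nodes are (k-mer, position) pairs.

    aligned_reads: iterable of (start_position, sequence).
    """
    graph = defaultdict(set)
    for start, seq in aligned_reads:
        nodes = [(seq[i:i + k], start + i) for i in range(len(seq) - k + 1)]
        for a, b in zip(nodes, nodes[1:]):
            graph[a].add(b)
    return graph

aligned_reads = [(111, "GCCATTA"), (975, "GCCTATT")]
graph = positional_de_bruijn(aligned_reads)

# GCC occurs in both regions but yields two separate nodes.
print(graph[("GCC", 111)])
print(graph[("GCC", 975)])
```

In an ordinary de Bruijn graph the single GCC node would branch to both CCA and CCT; keeping the position in the node label separates the two copies.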

  27. The SEQuel Algorithm. Partial contig #1: GCCATTA. Partial contig #2: GCCTATT. Original contig: GTATTCCGAGGACCACTGGATTATGA

  28. The SEQuel Algorithm. Contig: GTATTCCGAGGACCACTGGATTATGA

  29. The SEQuel Algorithm. Contig: GTATTCCGAGGACCAC---TGGATTATGA. Partial contigs: GCGGGCCGAGGA, CAAATGGATTACGA

  30. The SEQuel Algorithm. Contig: GTATTCCGAGGACCAC---TGGATTATGA. Partial contigs: GCGGGCCGAGGA, CAAATGGATTACGA

  31. The SEQuel Algorithm. Contig: GCGGGCCGAGGACCAC---TGGATTATGA. Partial contigs: GCGGGCCGAGGA, CAAATGGATTACGA

  32. The SEQuel Algorithm. Contig: GCGGGCCGAGGACCAC---TGGATTATGA. Partial contigs: GCGGGCCGAGGA, CAAATGGATTACGA

  33. The SEQuel Algorithm. Contig: GCGGGCCGAGGACCACAAATGGATTACGA. Partial contigs: GCGGGCCGAGGA, CAAATGGATTACGA

  34. The SEQuel Algorithm. Contig: GCGGGCCGAGGACCACAAATGGATTACGA. Repeat for all contigs.
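The walkthrough on slides 28 to 34 can be mimicked with a short sketch. It assumes the alignment of each partial contig to the gapped original contig is already known and given as a column offset (0 and 15 here, read off the slide layout); how SEQuel computes that alignment is not stated on the slides. The function name `refine` is hypothetical.

```python
def refine(gapped_contig, placements):
    """Overwrite aligned columns with partial contigs, then drop gap symbols.

    placements: list of (column_offset, partial_contig) pairs.
    """
    columns = list(gapped_contig)
    for offset, partial in placements:
        columns[offset:offset + len(partial)] = partial
    return "".join(c for c in columns if c != "-")

gapped = "GTATTCCGAGGACCAC---TGGATTATGA"
placements = [(0, "GCGGGCCGAGGA"), (15, "CAAATGGATTACGA")]
refined = refine(gapped, placements)
print(refined)
```

Columns not covered by any partial contig (here the three bases after the first partial contig) are carried over from the original contig unchanged.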

  35. Results • Standard and Single-Cell E. coli. • 100 bp paired-end, Illumina (GAII) reads. • Mean coverage ≈ 600x. • Assemblies compared to reference with & without SEQuel.

  36. Standard E. coli

  37. Standard E. coli

  38. Single Cell Sequencing (figure panels: Single Cell, Standard) (Chitsaz et al., 2011)

  39. Single Cell E. coli

  40. Single Cell E. coli

  41. Summary • Removed 35% to 96% of small-scale assembly errors. • Introduced positional de Bruijn graph for contig refinement. • Demonstrated utility in hard (single-cell) assembly. • SEQuel can be used in combination with any assembler. • Freely available at: http://bix.ucsd.edu/SEQuel

  42. Acknowledgments 3P41RR024851-02S1 CCF-1115206
