
Institute of Biomedical Sciences University of São Paulo


Presentation Transcript


  1. Institute of Biomedical Sciences, University of São Paulo. DNA Assembly and Mapping. Arthur Gruber, Coccilab – ICB/USP

  2. Next-generation sequencing platforms • Mid-2000s: next-generation sequencers (NGS) were developed • 2004 – 454 (Roche, formerly 454 Life Sciences) • 2006 – Illumina (formerly Solexa) • 2008 – SOLiD (Life Technologies, formerly Applied Biosystems) • 2011 – Ion Torrent / Proton (Life Technologies) • 2011 – PacBio RS (Pacific Biosciences) • Massively parallel, shotgun-type sequencing (random fragments) • Generate millions of sequences in a single run at a low cost per base

  3. Data generation vs. cost. Figure: cost per Mb of sequence compared with Moore's Law. Source: Sboner et al. (2011). Genome Biol. 12 (8): 125

  4. Evolution of sequencing costs An estimate of the evolution of sequencing costs over the last 10 years. Costs are given for sequencing a megabase using a logarithmic scale. This curve is adapted from [15]. Time of introduction of new technologies is indicated. Source: Delseny et al. (2010). Plant Science 179 (5): 407–422

  5. NGS – Lower cost and greater data generation Source: Sboner et al. (2011). Genome Biol. 12 (8): 125

  6. Next-generation sequencing platforms Source: Glenn (2011). Mol Ecol Resources 11: 759–769

  7. Next-generation sequencing platforms Source: Glenn (2011). Mol Ecol Resources 11: 759–769

  8. Next-generation sequencing platforms Source: Glenn (2011). Mol Ecol Resources 11: 759–769

  9. Different types of sequencing methods A flow chart of the different types of sequencing methods Source: Delseny et al. (2010). Plant Science 179 (5): 407–422

  10. 454 Workflow Source: Mardis. (2008). Annu. Rev. Genomics Hum. Genet. 9: 387–402

  11. Illumina Workflow Source: Mardis. (2008). Annu. Rev. Genomics Hum. Genet. 9: 387–402

  12. SOLiD Workflow Source: Mardis. (2008). Annu. Rev. Genomics Hum. Genet. 9: 387–402

  13. NGS platforms – applications Source: Homer et al. (2009). Brief Bioinform 11 (2): 181-197.

  14. NGS platforms – applications (table columns: Tool, Website, Category, Platform) Source: Homer et al. (2009). Brief Bioinform 11 (2): 181-197.

  15. Sequence assembly • Current sequencing platforms can only generate sequence reads of dozens of bp (so-called short reads) or some hundreds of bp (Sanger, 454, Ion Torrent, PacBio) • Computational tools are necessary to assemble the sequence reads into a larger sequence segment/genome • Sequence assemblers use two different approaches to assemble reads: • Overlap-layout-consensus • de Bruijn graphs Schatz et al. (2010) - Assembly of large genomes using second-generation sequencing

  16. K-mer graph A pair-wise overlap represented by a K-mer graph. (a) Two reads have an error-free overlap of 4 bases. (b) One K-mer graph, with K=4, represents both reads. The pair-wise alignment is a by-product of the graph construction. (c) The simple path through the graph implies a contig whose consensus sequence is easily reconstructed from the path. Source: Miller et al. (2010). Genomics 95: 315-327
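The caption of slide 16 can be illustrated with a minimal Python sketch. It assumes two made-up reads (`AACCGG` and `CCGGTT`) with an error-free 4-base overlap and K=4; the function names `kmer_graph` and `simple_path_contig` are hypothetical and do not come from any assembler. It only handles the unbranched case described in panel (c).

```python
from collections import defaultdict

def kmer_graph(reads, k):
    """Build a K-mer graph: nodes are k-mers, edges join k-mers adjacent in a read."""
    edges = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            edges[a].add(b)
        for km in kmers:
            edges.setdefault(km, set())  # make sure terminal k-mers are nodes too
    return edges

def simple_path_contig(edges):
    """Spell the consensus of a graph that consists of one unbranched path."""
    indegree = defaultdict(int)
    for targets in edges.values():
        for b in targets:
            indegree[b] += 1
    starts = [n for n in edges if indegree[n] == 0]
    if len(starts) != 1 or any(len(t) > 1 for t in edges.values()):
        raise ValueError("graph is not a single simple path")
    node = starts[0]
    contig = node
    while edges[node]:
        node = next(iter(edges[node]))
        contig += node[-1]  # each step along the path adds one base
    return contig

# Hypothetical reads sharing the 4-base overlap "CCGG"
reads = ["AACCGG", "CCGGTT"]
graph = kmer_graph(reads, k=4)
contig = simple_path_contig(graph)
print(contig)
```

Both reads contribute the shared k-mer `CCGG` to the same node, so the overlap is found as a by-product of building the graph, without any pair-wise alignment step.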

  17. Complexity in K-mer graphs Complexity in K-mer graphs can be diagnosed with read multiplicity information. In these graphs, edges represented in more reads are drawn with thicker arrows. (a) An errant base call toward the end of a read causes a “spur” or short dead-end branch. The same pattern could be induced by coincidence of zero coverage after polymorphism near a repeat. (b) An errant base call near a read middle causes a “bubble” or alternate path. Polymorphisms between donor chromosomes would be expected to induce a bubble with parity of read multiplicity on the divergent paths. (c) Repeat sequences lead to the “frayed rope” pattern of convergent and divergent paths. Source: Miller et al. (2010). Genomics 95: 315-327

  18. de Bruijn graphs • Advantages: • Can deal with large amounts of data, consolidating redundant reads (high coverage) in a very efficient way • Sequencing errors are promptly identified from the topology of the graph and k-mer coverage (Figure: de Bruijn graph showing edge formation and a sequencing error)
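The idea that sequencing errors stand out through k-mer coverage can be sketched as follows. This is a simplified illustration, not the procedure of any particular assembler: the reads are invented, and the names `kmer_counts`, `weak_kmers` and the threshold `min_count` are assumptions made for the example.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (likely error-derived)."""
    return {km for km, c in counts.items() if c < min_count}

# Five identical hypothetical reads plus one read with a miscalled base near its end
reads = ["ACGTTGCA"] * 5 + ["ACGTTGGA"]
counts = kmer_counts(reads, k=4)
suspect = weak_kmers(counts, min_count=2)
print(sorted(suspect))
```

With high coverage, correct k-mers are seen many times, while the k-mers created by the single miscalled base are seen once. In the graph, those low-multiplicity k-mers form the short dead-end branch that slide 17 calls a spur.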

  19. Evaluating assemblies • Size of largest contig • Number of contigs longer than length n • N50 • Given a set of sequences of varying lengths, the N50 length is defined as the largest length N for which at least half of all bases in the sequences are in a sequence of length L ≥ N. In other words, N50 is the contig length such that contigs of equal or greater length account for half the bases of the assembly. Therefore, the number of bases in all sequences shorter than the N50 is approximately equal to the number of bases in all sequences longer than the N50.

  20. Evaluating assemblies • N50 • Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value
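The N50 definition on slides 19 and 20 can be turned into a few lines of Python. This is a minimal sketch; the function name `n50` and the contig lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the assembly
            return length
    raise ValueError("no contigs given")

# Hypothetical assembly of six contigs, 290 bases in total
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))
```

Here the two longest contigs (80 + 70 = 150 bases) already cover more than half of the 290 bases, so the N50 is 70.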

  21. Some definitions • Contig • A sequence contig is a contiguous sequence assembled from overlapping sequence reads, resulting from the reassembly of the small DNA fragments generated by sequencing strategies • Scaffold • Using paired-end sequencing technology, the distance between both sequence ends of a fragment is known. This gives additional information about the orientation of contigs constructed from these reads and allows for their assembly into scaffolds.

  22. Libraries for NGS platforms

  23. Paired-end technology A) Schematic drawing of the paired-end technology. Adaptors and genome fragments are represented respectively by the black and grey lines. B) Strategy for sequencing large DNA fragments: short reads are assembled into contigs. A high coverage is required. In the next steps, paired-ends derived from larger fragments are used to assemble contigs into scaffolds. Source: Delseny et al. (2010). Plant Science 179 (5): 407–422

  24. Contigs and scaffolds

  25. An example of a real file: 454 data

  26. An example of a real file: 454 data

  27. An example of a real file: 454 data

  28. NGS platforms – performances and features Source: Homer et al. (2009). Brief Bioinform 11 (2): 181-197.

  29. Comparison of De Novo Genome Assemblers Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.

  30. Comparison of De Novo Genome Assemblers Accuracy and integrity for 36-mer datasets assembly. The quality of the resulting contigs is shown with the accuracy of the assembled contigs and the genome coverage of the assembled contigs. No data is shown when the RAM is insufficient or the assembly tool is not suitable for the dataset. Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.

  31. Comparison of De Novo Genome Assemblers Accuracy and integrity for 75-mer datasets assembly. The quality of the resulting contigs is shown with the accuracy of the assembled contigs and the genome coverage of the assembled contigs. No data is shown when the RAM is insufficient or the assembly tool is not suitable for the dataset. Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.

  32. Comparison of De Novo Genome Assemblers Statistics for assembled contigs of 36-mer short reads. Indicators that illustrate the size distribution are adopted for analysis. “#” denotes that the RAM of the machine is not enough, and “N/A” means the data is not available. The N50 size and N80 size represent the maximum read length for which all contigs greater than or equal to the threshold covered 50% or 80% of the reference genome. Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.

  33. Comparison of De Novo Genome Assemblers Statistics for assembled contigs of 75-mer short reads. Indicators that illustrate the size distribution are adopted for analysis. “#” denotes that the RAM of the machine is not enough, and “N/A” means the data is not available. The N50 size and N80 size represent the maximum read length for which all contigs greater than or equal to the threshold covered 50% or 80% of the reference genome. Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.

  34. Genomes assembled de novo exclusively from Illumina short sequence reads • Organisms: • Turkey (Meleagris gallopavo) • Giant panda (Ailuropoda melanoleuca) • Bacillus subtilis 168 • Bacillus subtilis natto • Pseudomonas syringae pv. tabaci 11528 • Pseudomonas syringae pv. syringae Psy642 • Pseudomonas syringae pv. tomato T1 • Pseudomonas syringae pv. aesculi • Apple scab (Venturia inaequalis) • Pine (Pinus species) chloroplast Source: Paszkiewicz & Studholme (2010). Brief Bioinform 11 (5): 457-472.

  35. Assembly results using real Illumina single-end and paired-end reads from SRA Source: Bao et al. (2011). Journal of Human Genetics 56: 406–414.

  36. Biases in real short-read sequence data • Depth of coverage by aligned reads over the 6 Mb circular chromosome: coverage is shallower around the 3 Mb region than it is near the origin of replication (position 0) • Expected frequency distribution of alignment depth, assuming random sampling of the genome • Observed frequency distribution of alignment depth, which is broader than the expected distribution, indicating greater variance due to biased sampling. Source: Paszkiewicz & Studholme (2010). Brief Bioinform 11 (5): 457-472.
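The comparison of expected and observed depth distributions can be sketched numerically. The example below assumes that "random sampling of the genome" is modeled with a Poisson distribution, which is a common simplification and is not stated on the slide. The depth values are invented, and the names `poisson_pmf` and `mean_and_variance` are hypothetical.

```python
import math

def poisson_pmf(depth, mean_depth):
    """Probability of observing a given depth under a Poisson model."""
    return math.exp(-mean_depth) * mean_depth ** depth / math.factorial(depth)

def mean_and_variance(depths):
    """Population mean and variance of a list of per-position depths."""
    n = len(depths)
    mean = sum(depths) / n
    var = sum((d - mean) ** 2 for d in depths) / n
    return mean, var

# Hypothetical per-position alignment depths (not real data)
observed = [2, 3, 10, 12, 4, 11, 3, 9, 13, 3]
mean, var = mean_and_variance(observed)

# Under a Poisson model the variance equals the mean, so a ratio
# well above 1 suggests a broader-than-expected distribution.
dispersion = var / mean
print(f"mean depth {mean:.1f}, variance {var:.1f}, variance/mean {dispersion:.2f}")

# Expected fraction of positions at each depth under random sampling
expected = {d: poisson_pmf(d, mean) for d in range(0, 21)}
```

A variance-to-mean ratio close to 1 is what unbiased random sampling would give under this assumption. Larger values correspond to the broader observed distribution described on the slide.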

  37. Limitations of next-generation genome sequence assembly • Limitations: • NGS technologies typically generate shorter sequences with higher error rates from relatively short insert libraries • Assembly of longer repeats and duplications will suffer from this short read length • Assembly methods for short reads are based on de Bruijn graph and Eulerian path approaches, which have difficulty in assembling complex regions of the genome. • DNA contamination or insertion polymorphism? Source: Alkan et al. (2010). Nat Methods 8(1): 61-65.

  38. Limitations of next-generation genome sequence assembly • Limitations: • Repeat content • WGS-based de novo sequence assembly algorithm will collapse identical repeats, resulting in reduced or lost genomic complexity. • Missing and fragmented genes Source: Alkan et al. (2010). Nat Methods 8(1): 61-65.

  39. Data generation and analysis steps of a typical RNA-seq experiment. Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.

  40. Reference-based transcriptome assembly strategy Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.

  41. Overview of the de novo transcriptome assembly strategy Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.

  42. Alternative approaches for combined transcriptome assembly Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.

  43. Software for transcriptome assembly Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.

  44. Splice-aware short-read aligners Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.

  45. Mapping reads onto a reference sequence • Programs: • Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. • Available at http://bowtie-bio.sourceforge.net/index.shtml • SHRiMP is a software package for aligning genomic reads against a target genome. • Available at http://compbio.cs.toronto.edu/shrimp/ • BarraCUDA is an ultrafast short read sequence alignment software using GPUs. • Available at http://www.many-core.group.cam.ac.uk/projects/lam.shtml • Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. • Available at http://bio-bwa.sourceforge.net/

  46. Mapping reads onto a reference sequence • Programs: • BLAT is a bioinformatics tool that performs rapid mRNA/DNA and cross-species protein alignments. • Available at http://www.kentinformatics.com/products.html • BFAST facilitates the fast and accurate mapping of short reads to reference sequences. Some advantages of BFAST include: • Speed: enables billions of short reads to be mapped quickly. • Accuracy: a priori probabilities for mapping reads with a defined set of variants. • An easy way to measurably tune accuracy at the expense of speed. • Available at http://sourceforge.net/apps/mediawiki/bfast/index.php?title=Main_Page

  47. Visualizing reads mapped onto a reference sequence • Programs: • TABLET - lightweight, high-performance graphical viewer for next generation sequence assemblies and alignments. • Available at http://bioinf.scri.ac.uk/tablet/index.shtml • IGV - Integrative Genomics Viewer - a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. • Available at http://www.broadinstitute.org/igv/

  48. TABLET - graphical viewer

  49. Integrative Genomics Viewer (IGV)

  50. Data formats - SOLiD • Color space: • Also known as 2-base (di-base) encoding; it is based on ligation sequencing rather than sequencing by synthesis. • Each base in this sequencing method is read twice, so a true base change (SNP) alters two adjacent color space calls. A sequencing error would therefore have to miscall two adjacent colors to be mistaken for a SNP. • Requires specific software to manipulate the data. Most assemblers are not designed to deal with color space.
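The two-adjacent-colors property of 2-base encoding can be checked with a small Python sketch. It assumes the usual SOLiD convention in which each color (0 to 3) is the XOR of the 2-bit codes of two neighboring bases (A=0, C=1, G=2, T=3). The sequences are invented, the function name `to_color_space` is hypothetical, and the leading primer base that real color-space reads carry is left out for brevity.

```python
# 2-bit codes for the four bases (assumed SOLiD di-base convention)
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_color_space(seq):
    """Encode a base sequence as one color per pair of adjacent bases."""
    bits = [BASE_TO_BITS[b] for b in seq]
    return [a ^ b for a, b in zip(bits, bits[1:])]

reference = "ACGTAC"
variant = "ACGGAC"  # hypothetical SNP: T -> G at the fourth base

ref_colors = to_color_space(reference)
var_colors = to_color_space(variant)

# Positions where the two color strings disagree
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]
print(ref_colors, var_colors, changed)
```

Because every base takes part in two overlapping pairs, the single substituted base changes exactly two neighboring colors. An isolated color difference is therefore more likely a miscall than a real variant.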
