
Overview and Applications of Next-Generation Sequencing Technologies

Overview and Applications of Next-Generation Sequencing Technologies. Stéphane Deschamps. Analytical & Genomic Technologies DuPont Agriculture & Nutrition Pioneer Hi-Bred International. Outline. Next-Generation Sequencing Platforms 454 FLX technology Solexa/Illumina technology


Presentation Transcript


  1. Overview and Applications of Next-Generation Sequencing Technologies Stéphane Deschamps Analytical & Genomic Technologies DuPont Agriculture & Nutrition Pioneer Hi-Bred International

  2. Outline • Next-Generation Sequencing Platforms • 454 FLX technology • Solexa/Illumina technology • Applications of Next-Generation Sequencing Technologies • Overview • Variant detection with Illumina platform • Open-source tools for bioinformatics • Third-Generation Sequencing technologies: what’s next?

  3. Sanger sequencing Successive improvements now allow 96 reads of 800-900 bases each to be sequenced in less than 2 h

  4. Sanger sequencing Sanger sequencing has been, and still is, very useful... ...but it remains slow and expensive

  5. Sequencing Platform Comparisons

  6. Next-Generation Sequencing • Second-generation platforms: • 454/Roche • Solexa/Illumina • SOLiD/ABI • Helicos BioSciences • Dover Systems • Third-generation platforms: • Complete Genomics • BioNanomatrix • VisiGen • Pacific Biosciences • Intelligent Bio-Systems • ZS Genetics • Reveo • LightSpeed Genomics • NABsys • Oxford Nanopore Technologies

  7. 454 FLX Titanium • First next-generation sequencing platform launched (October 2005) • Titanium chemistry for the 454 FLX launched in September 2008 • Sequencing By Synthesis • Pyrosequencing • Chemiluminescent signal • Long read technology (~450 nucleotides) • Possibility of sequencing both ends of DNA fragments (FLX platform) • Generates up to 0.5 Gbp per run • Max cost is ~$10,000/run

  8. 454 FLX Titanium • DNA Library Construction • Emulsion PCR • Sequencing

  9. DNA Library Construction • DNA fragmentation via nebulization • Size-selection • Ligation of adapters A & B • Selection of A/B fragments via biotin selection • Denaturation to select single-stranded A/B fragments • No cloning! • Diagram labels: End repair; ligation products (A/A), (A/B), (B/B); Streptavidin + Denaturation; A/B ss DNA; Emulsion PCR

  10. Emulsion PCR • Add DNA to capture beads (needs titration) • Add PCR reagents to DNA and capture beads • Transfer sample to oil tube or cup • Emulsify DNA capture beads in PCR reagents to form water-in-oil “microreactors” • Emulsification with Qiagen TissueLyser (high-speed shaker) • Clonal amplification in microreactors • Be careful not to break the emulsion! • ~10 million copies per capture bead • Break emulsion and enrich for DNA-positive beads • Use biotinylated oligo to capture enriched beads, then denature www.roche-applied-science.com

  11. Bead deposition into plates • Deposition of enriched beads into PicoTiter plate • Well diameter = 29 µm, allowing for a single bead (20 µm diameter) per well • Chambers are filled with enzyme beads, DNA beads and packing beads. www.roche-applied-science.com

  12. Pyrosequencing • Polymerase adds a nucleotide (sequential flow of dNTPs) • PPi is released • Sulfurylase creates ATP from PPi • Luciferase consumes ATP to oxidize luciferin, producing light www.roche-applied-science.com

  13. Image and signal processing • Raw data is series of images (one image per base per cycle). • Data are extracted, quantified and normalized. • Read data are converted into “flowgrams”.
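The link between sequential nucleotide flows and flowgram signals (slides 12-13) can be illustrated with a minimal, noise-free sketch. The function names and the default flow order `TACG` are assumptions made for illustration, not the Roche software. Real signals are fractional light intensities that must be normalized first, and `read` here stands for the strand being synthesized.

```python
def ideal_flowgram(read, flow_order="TACG"):
    """Return [(nucleotide, signal)] for a noise-free pyrosequencing run.

    Each flow offers a single dNTP; the signal is the number of bases
    incorporated in that flow (the homopolymer length, possibly 0).
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows, pos = [], 0
    while pos < len(read):
        for nt in flow_order:
            run = 0
            # Extend through the whole homopolymer stretch in one flow.
            while pos < len(read) and read[pos] == nt:
                run += 1
                pos += 1
            flows.append((nt, run))
            if pos == len(read):
                break
    return flows


def call_bases(flows):
    """Convert flow signals back into a base sequence by rounding each signal."""
    return "".join(nt * round(signal) for nt, signal in flows)


if __name__ == "__main__":
    flows = ideal_flowgram("TTACGGA")
    print(flows)
    print(call_bases(flows))
```

Because a homopolymer is read as one signal of proportional height, errors in estimating that height show up as insertions or deletions rather than substitutions.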

  14. Post-processing • Output = flowgrams, basecalls, Phred-equivalent scores • Basecall & Flowgrams can be used in the following applications: • De novo assembler – consensus sequences assembled into contigs with quality scores and ACE file (works best with genomic DNA). • Reference mapper – contigs mapped to reference sequence + list of high-confidence mutations • Amplicon variant analyzer – identification of sequence variants in amplicon libraries

  15. Illumina Genome Analyzer • Successor to MPSS (Massively Parallel Signature Sequencing) • Single molecule array (“flow cell”) with millions of amplified clusters • Sequencing By Synthesis • Removable fluorescence • Reversible terminators • Short read technology (16 - 75 nucleotides) • Possibility of sequencing both ends of DNA fragments • Generates up to 20 Gbp per run • Max cost is ~$10,000/run = $500/Gbp!
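The cost-per-gigabase figure follows directly from the run cost and yield quoted on the slides. The sketch below repeats that arithmetic for both platforms; the helper name `cost_per_gbp` is hypothetical, and the inputs are the slides' maximum cost and maximum yield per run.

```python
def cost_per_gbp(run_cost_usd, yield_gbp):
    """Cost in USD per gigabase, given the cost and yield of one run."""
    if yield_gbp <= 0:
        raise ValueError("yield must be positive")
    return run_cost_usd / yield_gbp


# (max run cost in USD, max yield in Gbp), as quoted on the slides
platforms = {
    "454 FLX Titanium": (10_000, 0.5),
    "Illumina Genome Analyzer": (10_000, 20),
}

if __name__ == "__main__":
    for name, (cost, gbp) in platforms.items():
        print(f"{name}: ${cost_per_gbp(cost, gbp):,.0f}/Gbp")
```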

  16. Illumina Genome Analyzer • Sample Prep: Prepare DNA fragments + Ligate adapters • Cluster Station: Cluster Synthesis • Genome Analyzer: Sequencing • Analysis Pipeline

  17. Cluster Station

  18. Genome Analyzer • Fluidics and Electronics • Flow Cell & Detection • Laser Optics

  19. Cluster Generation or RNA - anneal

  20. Cluster Generation - extension • DNA Clusters • ~1,000 copies of DNA in each cluster • 1-2 microns in diameter

  21. Reversible Terminator Chemistry

  22. Sequencing by Synthesis (SBS) • Cycle 1: Add sequencing reagents • First base incorporated • Remove unincorporated bases • Detect signal • Deblock (removal of fluorescent dye and protecting group) • Cycle 2 - n: Add sequencing reagents and repeat

  23. Sequencing by Synthesis (SBS)

  24. Illumina Analysis Pipeline • Data Analysis Workflow: Images (.tif) → Image Analysis → Base calling → Illumina Sequence Analysis: alignment (ELAND), filtering (chastity) → Alignments, Assemblies, Normalization, Annotations & Post-processing Evaluations • Data transfer and Storage • Outputs: Cluster Intensities • Cluster Noise • Cluster Sequence • Cluster Probabilities (Scores) • Corrected Cluster Intensities (cross-talk correction, phasing correction) • Image volume: 1 image per dye × 4 dyes/cycle × 75 cycles × 50 tiles/column × 2 columns/lane × 8 lanes/flowcell = 240,000 images per flowcell • 240,000 images × 8 MB per image = 1.92 TB of image data; × 2 for a PE run = ~3.8 TB of image data • Image analysis module is Firecrest • Base calling module is Bustard • Sequence analysis module is Gerald
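The image-volume estimate on this slide is a straight multiplication, reproduced below as a small sketch. The function name `image_data_tb` is hypothetical and decimal units (1 TB = 1,000,000 MB) are assumed; the default values are the ones quoted on the slide.

```python
def image_data_tb(dyes=4, cycles=75, tiles_per_column=50, columns_per_lane=2,
                  lanes=8, mb_per_image=8, paired_end=False):
    """Return (number of images, terabytes of image data) for one flowcell."""
    images = dyes * cycles * tiles_per_column * columns_per_lane * lanes
    if paired_end:
        images *= 2  # a paired-end run images every tile for a second read
    terabytes = images * mb_per_image / 1_000_000  # decimal MB -> TB
    return images, terabytes


if __name__ == "__main__":
    print(image_data_tb())                 # single-read run
    print(image_data_tb(paired_end=True))  # paired-end run
```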

  25. Other platforms

  26. Data Storage & Quality • Images? • ~Phred 20 (Phred score 20 = 1% error rate) • Quality vs. Read Length? • Trimming? • Lower sequence quality than Sanger sequencing but offset by deeper coverage
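The statement that Phred 20 corresponds to a 1% error rate comes from the standard definition Q = -10 · log10(P), where P is the probability that a base call is wrong. A minimal sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for a given base-call error probability (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def expected_errors(quals):
    """Expected number of miscalled bases in a read with these Phred scores."""
    return sum(phred_to_error(q) for q in quals)


if __name__ == "__main__":
    for q in (10, 15, 20, 30):
        print(q, phred_to_error(q))
```

Under this definition a 36-base read with every base at Phred 20 carries about 0.36 expected errors, which is one way to see why deeper coverage is needed to offset the lower per-base quality.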

  27. Single short read uniqueness • Illumina 35 base reads aligned to A. thaliana genome • ~4 million reads
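Short-read uniqueness can be explored on a toy sequence by asking what fraction of read-length windows occur exactly once in a reference. This is only a sketch under simplifying assumptions (exact matches, forward strand only, hypothetical function name); it is not the alignment procedure behind the slide.

```python
from collections import Counter


def unique_read_fraction(genome, read_len):
    """Fraction of start positions whose read_len-mer occurs once in genome."""
    n_windows = len(genome) - read_len + 1
    if n_windows <= 0:
        raise ValueError("read length exceeds genome length")
    counts = Counter(genome[i:i + read_len] for i in range(n_windows))
    # A k-mer seen exactly once contributes exactly one unique start position.
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / n_windows


if __name__ == "__main__":
    toy = "ACGTACGTTT"
    for k in (4, 8):
        print(k, unique_read_fraction(toy, k))
```

In the toy sequence the repeated `ACGT` makes some 4-base windows ambiguous, while every 8-base window is unique, which mirrors the general point that longer reads place more uniquely.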

  28. Applications of Next-Generation Sequencing

  29. Gene Expression Profiling • Tag count & Alignments • Digital Gene Expression Tag Profiling • Short cDNA fragments mapping to 3’ ends of transcripts • SAGE-like approach (1 short tag/transcript) • 20 base tag output (RE site + 16 bases) aligned to a reference genome • Identify, quantify and annotate expressed genes • Transcriptome Profiling (RNA-Seq) • cDNA fragments generated via random priming • 36-75 base output aligned to a reference genome • Assemble entire transcript sequence • Identify, quantify and annotate expressed genes • Identify SNPs, alleles and alternative splice variants
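Tag profiling reduces each transcript to one short tag and then counts tags. The sketch below assumes the tag is the 3'-most restriction site plus the 16 bases downstream (20 bases with a 4-base site such as NlaIII's CATG) and normalizes counts to tags per million. The function names and the normalization are illustrative choices, not the Illumina pipeline.

```python
from collections import Counter


def extract_tag(transcript, site="CATG", downstream=16):
    """Return the 3'-most site plus `downstream` bases, or None if absent."""
    start = transcript.rfind(site)
    end = start + len(site) + downstream
    if start == -1 or end > len(transcript):
        return None  # no site, or too little sequence after it
    return transcript[start:end]


def tags_per_million(tag_reads):
    """Count identical tags and scale the counts to tags per million reads."""
    counts = Counter(tag_reads)
    total = sum(counts.values())
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}


if __name__ == "__main__":
    transcript = "AAACATG" + "ACGT" * 4 + "AAAA"
    print(extract_tag(transcript))
    print(tags_per_million(["tagA", "tagA", "tagB", "tagC"]))
```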

  30. Tag Profiling – Sample Prep (Illumina) • Total RNA (5 µg) • mRNA isolation • 1st and 2nd Strand cDNA Synthesis (biotinylated oligo-dT) • Restriction Enzyme Digestion (DpnII or NlaIII) • GEX Adaptor 1 Ligation (MmeI site) • MmeI digestion • GEX Adaptor 2 Ligation • PCR Amplification (PCR Primer 1, PCR Primer 2) • Cluster Generation • Tag read from the sequencing primer

  31. Transcriptome Profiling – Sample Prep (Illumina) • Tissue • Total RNA isolation (10 µg) • mRNA isolation • Fragmentation (random) • 1st and 2nd Strand cDNA Synthesis (N6 primer) • Adaptor Ligations • PCR Amplification (PCR Primer 1, PCR Primer 2) • Cluster Generation • Sequencing (sequencing primer 1, sequencing primer 2)

  32. Novel Transcript Discovery • Small RNA Identification and Profiling • Small RNA size is suitable to discovery with next-generation sequencing • Deep assessment of alternative splicing isoforms • Deep coverage allows discovery of rare isoforms Mortazavi et al. (2008), Nat. Methods

  33. De novo Sequencing • Whole Genome Sequencing • Small genomes that are not too complex (microbial) • The longer the reads, the better – 454 chemistry most suitable • Paired-End sequencing • Whole Transcriptome Sequencing • Targeted Sequencing • Pooled PCR products • Raindance Technologies (~4,000 amplicons in one tube) • Padlock probes • Pooled BAC clones • Sequence Capture (Solid phase, Liquid phase) • Agilent, Febit & Nimblegen • Metagenomics & Microbial diversity

  34. Gene Regulation • ChIP-Seq (chromatin immunoprecipitation sequencing) • Capture regions of the genome bound by proteins (transcription factors, histones) • Sequences need to be aligned to a reference sequence • Requires a complex algorithm to determine differential levels of coverage throughout the genome • Methyl-Seq (methylation status) – Bisulfite Sequencing • Sequences aligned and compared to reference sequence • DNAseI Hypersensitivity Site Sequencing Mikkelsen et al. (2007), Nature

  35. Variant & Structural Variation • Coverage & Alignment • Paired-End Sequencing • Whole Genome Resequencing • Small genomes that are not too complex (repeats, duplications...) • The longer the reads, the better • Targeted Resequencing • Complex genomes (crops) • Reduced representation libraries (methyl-sensitive enzymes) • Transcriptome • Sequence Capture (Microarrays) • Agilent, Febit & Nimblegen • CGH arrays

  36. Challenges in variant discovery • Base quality & filtering (scoring threshold) • Sequencing errors vs. SNPs • To differentiate true polymorphisms from sequencing errors • Coverage of a given SNP region and redundancy of reads (coverage vs. number of samples) • Availability of a reference sequence (genome) • To separate unique vs. duplicated sequences • Duplication in one line but not another • Polymorphism rate in one line vs. another = need to set conditions for alignment • Paired-end sequencing can help unique read placement • Complex genomes = need to reduce complexity prior to sequencing • High repeat content (ex: ~80% in maize, ~70% in soy, 90% in sunflower…) • Gene duplications and genome plasticity (polyploidy, partial or whole genome duplications...)

  37. Reduced-representation libraries • DNA methylation in plants occurs at 5-methyl cytosine within CpG dinucleotides and CpNpG trinucleotides • Transposons and other repeats comprise the largest fraction of methylated DNA. Studies in Arabidopsis have shown that CG sites in the 3’ end of the transcribed regions of more than one third of all genes also are methylated (Zhang, Science, 320, 489, 2008). • Methylation is critically important in silencing transposons and regulating plant development (methylation in promoters appears to reduce transcription) • Diagram labels: PstI sites; transposons; PstI digestion; Recover digested fraction (gel, column)

  38. Library Construction • Genomic DNA • Digestion with one methyl-sensitive restriction enzyme (RE) and fractionation • Ligation of biotinylated RE-specific adapters 1 • Digestion with 4-bp cutter (DpnII) • Ligation of DpnII-specific adapter • Binding to streptavidin column and digestion with RE • Ligation of RE-specific adapters 2 • PCR enrichment, gel purification, size selection (150-500 bp fragments), cluster synthesis and sequencing (36 cycles) Deschamps et al. The Plant Genome (in press)

  39. SNP detection flowchart • Filtering and Condensing: • Basecalling, cropping last 4 bases & initial base-quality filter (for individual tags) • Condensing & optional consensus base-quality filter (for unitag sequences) • Creating HQ unitag datasets (removing singlets) • Comparing two genotypes: • Comparing HQ unitag datasets from genotype “A” and genotype “B” using Vmatch • Filtering, to accept clusters with only two members (A, B) with exactly one mismatch • Recovering matched HQ unitag sequences and SNP sites from Vmatch alignments • Mapping to genome: • Mapping SNP-containing HQ unitags to reference sequence (genome), using a k-mer table (k = length of trimmed tags), to find copy numbers and locations • Capturing single-copy HQ unitags with up to a single-base mismatch to the reference sequence at the exact location of the putative SNP site for one or both genotypes
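The first two stages of this flowchart (filtering and condensing, then comparing two genotypes) can be sketched in a few lines. This is a toy illustration, not the authors' pipeline: a brute-force pairwise comparison stands in for Vmatch, the function names are hypothetical, and the thresholds (crop 4 bases, quality of at least 15, read count of at least 2) are taken from the slides. The mapping-to-genome stage is omitted.

```python
from collections import Counter


def hq_unitags(reads, crop=4, min_q=15, min_count=2):
    """Crop, quality-filter and condense reads into high-quality unitags.

    `reads` is an iterable of (sequence, per-base quality list) pairs.
    """
    counts = Counter()
    for seq, quals in reads:
        keep = len(seq) - crop
        if keep <= 0:
            continue
        if min(quals[:keep]) >= min_q:   # every retained base must pass
            counts[seq[:keep]] += 1
    return {tag for tag, n in counts.items() if n >= min_count}  # drop singlets


def one_mismatch_site(a, b):
    """Position of the mismatch if a and b differ at exactly one base."""
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def candidate_snps(tags_a, tags_b):
    """Return (tag_a, tag_b, position) for two-member, one-mismatch clusters."""
    hits = []
    for a in sorted(tags_a):
        if a in tags_b:
            continue
        partners = [b for b in tags_b if one_mismatch_site(a, b) is not None]
        if len(partners) != 1 or partners[0] in tags_a:
            continue
        b = partners[0]
        # b must in turn have a as its only one-mismatch partner in A.
        if [x for x in tags_a if one_mismatch_site(x, b) is not None] == [a]:
            hits.append((a, b, one_mismatch_site(a, b)))
    return hits


if __name__ == "__main__":
    good = [30] * 12
    reads_a = [("ACGTACGTAAAA", good)] * 2 + [("TTTTTTTTAAAA", good)]
    reads_b = [("ACGAACGTCCCC", good)] * 2
    print(candidate_snps(hq_unitags(reads_a), hq_unitags(reads_b)))
```

The pairwise comparison is quadratic in the number of unitags, so it only suits small examples; an indexed matcher is needed at the scale of a full flow cell.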

  40. Example: one flow cell in soybean (Williams82 vs. Pintado) • [Figure: Frequency vs. Depth, log-scaled axes] • † Filtered total reads defined as having a quality value for individual base greater than or equal to 15 • ‡ HQ unitag reads defined as having a quality value for each base greater than or equal to 15, and with an individual read count greater than or equal to 2. • § Best match to reference sequence of HQ unitag reads aligning uniquely or multiple times to the reference sequence

  41. Results & Validation • *SNPs confirmed/not confirmed via Sanger sequencing of PCR products for both genotypes

  42. Distribution of HQ unitags & SNPs related to annotated gene density (soybean) • Gene Density (excluding TEs) in 500 Kb window • Coverage by HQ unitags in 70 Kb window • SNP Density in 70 Kb window

  43. Distribution of HQ unitags & SNPs related to distance to annotated genes (excluding TEs) in soybean • Intron, CDS and UTR coordinates determined from GFF annotation files

  44. Bioinformatic tools • Alignment and Polymorphism Detection • SOAP – Short Oligonucleotide Alignment Program • Ruiqiang Li, Beijing Genomics Institute • http://soap.genomics.org.cn • MAQ – Mapping and Assembly with Quality • Heng Li, Sanger Centre • http://maq.sourceforge.net/maq-man.shtml • Bowtie - An ultrafast memory-efficient short read aligner • Ben Langmead and Cole Trapnell, University of Maryland • http://bowtie-bio.sourceforge.net/ • ssahaSNP – Tool to detect homozygous SNPs and indels • Adam Spargo and Zemin Ning, Sanger Centre • http://www.sanger.ac.uk/Software/analysis/ssahaSNP

  45. Bioinformatic tools • Genomic Assembly • Velvet – De novo assembly of short reads • Daniel Zerbino and Ewan Birney, EMBL-EBI • http://www.ebi.ac.uk/~zerbino/velvet/ • SSAKE – Assembly of short reads • Rene Warren, et al, British Columbia Cancer Agency • http://bioinformatics.oxfordjournals.org/cgi/content/full/23/4/500 • Euler – Genomic Assembly • Pavel Pevzner and Mark Chaisson, University of California, San Diego • http://nbcr.sdsc.edu/euler/ www.illumina.com

  46. Bioinformatic tools • ChIP Sequencing • ChIP-Seq Peak Finder • Barbara Wold, Caltech and Rick Myers, Stanford University • http://woldlab.caltech.edu/html/software/ • Digital Gene Expression • Comparative Count Display • Alex Lash, NIH • ftp://ftp.ncbi.nlm.nih.gov/pub/sage/obsolete/bin/ccd/ • SAGE DGED Tool • Cancer Genome Anatomy Project • http://cgap.nci.nih.gov/SAGE/SDGED_Wizard?METHOD=SS10,LS10&ORG=Hs www.illumina.com

  47. Bioinformatic tools - Illumina • Overview • Obtain Bustard reads and align against Genome with Eland • Aggregate and SNP call data with CASAVA • GenomeStudio™ wizard import of data • Examine coverage and quality in stacked alignment graphs for a selected region/chromosome • Export table of SNPs and consensus sequence

  48. Bioinformatic tools - Illumina

  49. Third-Generation Sequencing technologies: what’s next?

  50. Next-Generation Sequencing • Second-generation platforms: • 454/Roche • Solexa/Illumina • SOLiD/ABI • Helicos BioSciences • Dover Systems • Third-generation platforms: • Complete Genomics • BioNanomatrix • VisiGen • Pacific Biosciences • Intelligent Bio-Systems • ZS Genetics • Reveo • LightSpeed Genomics • NABsys • Oxford Nanopore Technologies
