Overview and Applications of Next-Generation Sequencing Technologies Stéphane Deschamps Analytical & Genomic Technologies DuPont Agriculture & Nutrition Pioneer Hi-Bred International
Outline • Next-Generation Sequencing Platforms • 454 FLX technology • Solexa/Illumina technology • Applications of Next-Generation Sequencing Technologies • Overview • Variant detection with Illumina platform • Open-source tools for bioinformatics • Third-Generation Sequencing technologies: what’s next?
Sanger sequencing • Successive improvements now allow 96 reads of 800-900 bases each to be sequenced in less than 2 h
Sanger sequencing Sanger sequencing has been, and still is, very useful... ...but it remains slow and expensive
Next-Generation Sequencing • Second-generation platforms: • 454/Roche • Solexa/Illumina • SOLiD/ABI • Helicos BioSciences • Dover Systems • Third-generation platforms: • Complete Genomics • BioNanomatrix • VisiGen • Pacific Biosciences • Intelligent Bio-Systems • ZS Genetics • Reveo • LightSpeed Genomics • NABsys • Oxford Nanopore Technologies
454 FLX Titanium • First next-generation sequencing platform launched (October 2005) • Titanium chemistry for the 454 FLX launched in September 2008 • Sequencing By Synthesis • Pyrosequencing • Chemiluminescent signal • Long read technology (~450 nucleotides) • Possibility of sequencing both ends of DNA fragments (FLX platform) • Generates up to 0.5 Gbp per run • Max cost is ~$10,000/run
454 FLX Titanium • DNA Library Construction • Emulsion PCR • Sequencing
DNA Library Construction • DNA fragmentation via nebulization • Size-selection • Ligation of adapters A & B • Selection of A/B fragments via biotin selection • Denaturation to select single-stranded A/B fragments • No cloning! • Figure labels: end repair; A/A, A/B and B/B fragments; streptavidin capture + denaturation; single-stranded A/B DNA; emulsion PCR
Emulsion PCR • Add DNA to capture beads (needs titration) • Add PCR reagents to DNA and capture beads • Transfer sample to oil tube or cup • Emulsify DNA capture beads in PCR reagents to form water-in-oil “microreactors” • Emulsify with a Qiagen TissueLyser (high-speed shaker) • Clonal amplification in microreactors • Careful not to break the emulsion! • ~10 million copies per capture bead • Break emulsion and enrich for DNA-positive beads • Use biotinylated oligo to capture enriched beads, then denature www.roche-applied-science.com
Bead deposition into plates • Deposition of enriched beads into PicoTiter plate • Well diameter = 29 µm, allowing for a single bead (20 µm diameter) per well • Chambers are filled with enzyme beads, DNA beads and packing beads. www.roche-applied-science.com
Pyrosequencing • Polymerase adds a nucleotide (sequential flow of dNTPs) • PPi is released • Sulfurylase creates ATP from PPi • Luciferase consumes ATP and uses luciferin to make light www.roche-applied-science.com
Image and signal processing • Raw data are a series of images (one image per base per cycle). • Data are extracted, quantified and normalized. • Read data are converted into “flowgrams”.
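The idea behind a flowgram can be sketched in a few lines: each nucleotide flow yields a normalized signal roughly proportional to the number of bases incorporated, so rounding the signal gives the homopolymer length for that flow. This is only an illustration, not the 454 basecaller; the flow order `TACG`, the function name and the example signals are assumptions, not taken from the slides.

```python
# Assumed flow order; the slides do not state it.
FLOW_ORDER = "TACG"


def flowgram_to_bases(signals, flow_order=FLOW_ORDER):
    """Convert normalized per-flow signals into a base sequence.

    Each signal is rounded to the nearest integer, which is taken as the
    number of identical bases incorporated during that flow.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up and clamp at zero (noise can dip below zero).
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical signals for two rounds of the four flows (T, A, C, G).
example_signals = [1.02, 0.05, 2.10, 0.97, 0.03, 1.04, 0.00, 2.95]
sequence = flowgram_to_bases(example_signals)
print(sequence)
```

Signals near half-integers (for example 2.5) are where such a naive rounding rule becomes unreliable, which is one reason homopolymer runs are harder to call with this chemistry.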
Post-processing • Output = flowgrams, basecalls, Phred-equivalent scores • Basecalls & flowgrams can be used in the following applications: • De novo assembler – consensus sequences assembled into contigs with quality scores and ACE file (works best with genomic DNA). • Reference mapper – contigs mapped to reference sequence + list of high-confidence mutations • Amplicon variant analyzer – identification of sequence variants in amplicon libraries
Illumina Genome Analyzer • Successor to MPSS (Massively Parallel Signature Sequencing) • Single molecule array (“flow cell”) with millions of amplified clusters • Sequencing By Synthesis • Removable fluorescence • Reversible terminators • Short read technology (16 - 75 nucleotides) • Possibility of sequencing both ends of DNA fragments • Generates up to 20 Gbp per run • Max cost is ~$10,000/run = $500/Gbp!
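The cost-per-gigabase comparison implied by the two platform slides is simple arithmetic, sketched below using only the figures quoted in the deck (maximum run cost and maximum yield per run). The function and dictionary names are illustrative.

```python
def cost_per_gbp(run_cost_usd, yield_gbp):
    """Return the sequencing cost in USD per gigabase pair."""
    return run_cost_usd / yield_gbp


# (max cost per run in USD, max yield per run in Gbp), as quoted on the slides.
platforms = {
    "454 FLX Titanium": (10_000, 0.5),
    "Illumina Genome Analyzer": (10_000, 20),
}

for name, (run_cost, run_yield) in platforms.items():
    print(f"{name}: ${cost_per_gbp(run_cost, run_yield):,.0f} per Gbp")
```

At the quoted maxima the two platforms cost the same per run, so the 40-fold difference in yield translates directly into a 40-fold difference in cost per base.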
Illumina Genome Analyzer • Sample Prep: prepare DNA fragments + ligate adapters • Cluster Station: cluster synthesis • Genome Analyzer: sequencing • Analysis Pipeline
Genome Analyzer Fluidics and Electronics Flow Cell & Detection Laser Optics
Cluster Generation - anneal
Cluster Generation - extension • DNA Clusters • ~1,000 copies of DNA in each cluster • 1-2 microns in diameter
Sequencing by Synthesis (SBS) • Cycle 1: Add sequencing reagents • First base incorporated • Remove unincorporated bases • Detect signal • Deblock (removal of fluorescent dye and protecting group) • Cycle 2 - n: Add sequencing reagents and repeat
Illumina Analysis Pipeline • Data Analysis Workflow: Images (.tif), Image Analysis, Base calling, Illumina Sequence Analysis: alignment (ELAND), filtering (chastity) • Pipeline outputs: Cluster Intensities • Cluster Noise • Cluster Sequence • Cluster Probabilities (Scores) • Corrected Cluster Intensities (cross-talk correction, phasing correction) • Data transfer and storage: 1 image per dye x 4 dyes/cycle x 75 cycles x 50 tiles/column x 2 columns/lane x 8 lanes/flowcell = 240,000 images per flowcell • x 8 MB per image = 1.92 TB of image data • x 2 for a paired-end (PE) run = 3.8 TB of image data • Downstream: Alignments, Assemblies, Normalization, Annotations & Post-processing Evaluations • Image analysis module is Firecrest • Base calling module is Bustard • Sequence analysis module is Gerald
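The image-storage estimate on this slide can be reproduced with a short calculation. The sketch below uses the per-run figures from the slide and assumes decimal units (1 TB = 1,000,000 MB); the constant and function names are illustrative.

```python
# Per-run figures quoted on the slide.
DYES_PER_CYCLE = 4        # one image per dye
CYCLES = 75
TILES_PER_COLUMN = 50
COLUMNS_PER_LANE = 2
LANES_PER_FLOWCELL = 8
MB_PER_IMAGE = 8
MB_PER_TB = 1_000_000     # assumption: decimal terabytes


def images_per_flowcell(cycles=CYCLES):
    """Number of images acquired for one single-read run."""
    return (DYES_PER_CYCLE * cycles * TILES_PER_COLUMN
            * COLUMNS_PER_LANE * LANES_PER_FLOWCELL)


def image_data_tb(cycles=CYCLES, paired_end=False):
    """Raw image volume in TB; a paired-end run doubles the image count."""
    n_images = images_per_flowcell(cycles) * (2 if paired_end else 1)
    return n_images * MB_PER_IMAGE / MB_PER_TB


print(images_per_flowcell())            # images per flowcell
print(image_data_tb())                  # TB, single-read run
print(image_data_tb(paired_end=True))   # TB, paired-end run
```

The paired-end figure comes out at 3.84 TB, which the slide rounds to 3.8 TB.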
Data Storage & Quality • Store images? • Quality: ~Phred 20 (Phred score 20 = 1% error rate) • Quality vs. read length? Trimming? • Lower sequence quality than Sanger sequencing, but offset by deeper coverage
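The relation between a Phred score and an error rate is Q = -10 log10(p), which is where "Phred 20 = 1% error" comes from. The sketch below shows the conversion in both directions, plus a deliberately simple tail-trimming rule. The trimming rule and its threshold are assumptions for illustration, not the method used by the Illumina pipeline.

```python
import math


def phred_to_error_probability(q):
    """Error probability implied by Phred score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def trim_low_quality_tail(qualities, threshold=20):
    """Return the read length kept after dropping trailing bases whose
    quality is below `threshold` (a naive, illustrative rule)."""
    end = len(qualities)
    while end > 0 and qualities[end - 1] < threshold:
        end -= 1
    return end


print(phred_to_error_probability(20))
print(trim_low_quality_tail([30, 30, 25, 18, 10]))
```

Because quality usually drops toward the end of a read, even a simple rule like this removes most low-confidence bases, at the price of shorter reads.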
Single short read uniqueness • Illumina 35-base reads aligned to the A. thaliana genome • ~4 million reads
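Whether a short read can be placed uniquely depends on how many positions in the genome carry a sequence of that length that occurs only once. A toy version of that measurement is sketched below. It considers the forward strand only and exact matches only, both simplifying assumptions, and the example sequence is made up.

```python
from collections import Counter


def unique_kmer_fraction(genome, k):
    """Fraction of start positions whose k-mer occurs exactly once
    in `genome` (forward strand, exact matches only)."""
    if len(genome) < k or k <= 0:
        return 0.0
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


toy_genome = "ACGTACGT"  # hypothetical sequence containing a repeat
for k in (4, 5):
    print(k, unique_kmer_fraction(toy_genome, k))
```

Even in this toy case, lengthening the read by one base is enough to resolve the repeat, which is the same effect that makes longer or paired-end reads easier to place in repetitive genomes.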
Gene Expression Profiling • Tag count & Alignments • Digital Gene Expression Tag Profiling • Short cDNA fragments mapping to 3’ ends of transcripts • SAGE-like approach (1 short tag/transcript) • 20 base tag output (RE site + 16 bases) aligned to a reference genome • Identify, quantify and annotate expressed genes • Transcriptome Profiling (RNA-Seq) • cDNA fragments generated via random priming • 36-75 base output aligned to a reference genome • Assemble entire transcript sequence • Identify, quantify and annotate expressed genes • Identify SNPs, alleles and alternative splice variants
Tag Profiling – Sample Prep (Illumina) • Total RNA (5 µg) • mRNA isolation • 1st and 2nd strand cDNA synthesis (biotinylated oligo-dT) • Restriction enzyme digestion (DpnII or NlaIII; CATG site shown) • GEX Adaptor 1 ligation (carries the MmeI site) • MmeI digestion • GEX Adaptor 2 ligation • PCR amplification (PCR Primer 1, PCR Primer 2) • Cluster generation • Sequencing of the tag from the sequencing primer
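The tag produced by this protocol can be mimicked in silico: take the restriction site closest to the 3' end of a transcript and keep the site plus the 16 bases that follow it, giving the 20-base tag mentioned in the gene expression slide. The sketch below assumes the CATG site shown in the figure, ignores the poly(A) tail, and uses made-up transcript sequences and names.

```python
from collections import Counter


def extract_tag(transcript, site="CATG", tag_length=20):
    """Return the tag starting at the 3'-most restriction site
    (site + following bases, `tag_length` in total), or None if the
    site is absent or too close to the end of the sequence."""
    pos = transcript.rfind(site)
    if pos == -1 or pos + tag_length > len(transcript):
        return None
    return transcript[pos:pos + tag_length]


# Hypothetical transcripts: two copies of one gene, one of another.
transcripts = [
    "AACATGCCCCCATGACGTACGTACGTACGTAAAAAAAA",
    "AACATGCCCCCATGACGTACGTACGTACGTAAAAAAAA",
    "GGGGCATGTTTTGGGGTTTTGGGGAAAAAAAA",
]

tag_counts = Counter(
    tag for tag in (extract_tag(t) for t in transcripts) if tag is not None
)
for tag, count in tag_counts.items():
    print(tag, count)
```

Counting identical tags in this way is what makes the approach "digital": expression is read out as a tag count per transcript rather than as a hybridization intensity.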
Transcriptome Profiling – Sample Prep (Illumina) • Tissue • Total RNA isolation (10 µg) • mRNA isolation • Fragmentation (random) • 1st and 2nd strand cDNA synthesis (N6 primer) • Adaptor ligations • PCR amplification (PCR Primer 1, PCR Primer 2) • Cluster generation • Sequencing from sequencing primer 1 and sequencing primer 2
Novel Transcript Discovery • Small RNA Identification and Profiling • Small RNA size is well suited to discovery by next-generation sequencing • Deep assessment of alternative splicing isoforms • Deep coverage allows discovery of rare isoforms Mortazavi et al. (2008), Nat. Methods
De novo Sequencing • Whole Genome Sequencing • Small genomes that are not too complex (microbial) • The longer the reads, the better – 454 chemistry most suitable • Paired-End sequencing • Whole Transcriptome Sequencing • Targeted Sequencing • Pooled PCR products • Raindance Technologies (~4,000 amplicons in one tube) • Padlock probes • Pooled BAC clones • Sequence Capture (Solid phase, Liquid phase) • Agilent, Febit & Nimblegen • Metagenomics & Microbial diversity
Gene Regulation • ChIP-Seq (chromatin immunoprecipitation sequencing) • Capture regions of the genome bound by proteins (transcription factors, histones) • Sequences need to be aligned to a reference sequence • Requires a complex algorithm to determine differential levels of coverage throughout the genome • Methyl-Seq (methylation status) – Bisulfite Sequencing • Sequences aligned and compared to reference sequence • DNAseI Hypersensitivity Site Sequencing Mikkelsen et al. (2007), Nature
Variant & Structural Variation • Coverage & Alignment • Paired-End Sequencing • Whole Genome Resequencing • Small genomes that are not too complex (repeats, duplications...) • The longer the reads, the better • Targeted Resequencing • Complex genomes (crops) • Reduced representation libraries (methyl-sensitive enzymes) • Transcriptome • Sequence Capture (Microarrays) • Agilent, Febit & Nimblegen • CGH arrays
Challenges in variant discovery • Base quality & filtering (scoring threshold) • Sequencing errors vs. SNPs • To differentiate true polymorphisms from sequencing errors • Coverage of a given SNP region and redundancy of reads (coverage vs. number of samples) • Availability of a reference sequence (genome) • To separate unique vs. duplicated sequences • Duplication in one line but not another • Polymorphism rate in one line vs. another = need to set conditions for alignment • Paired-end sequencing can help unique read placement • Complex genomes = need to reduce complexity prior to sequencing • High repeat content (ex: ~80% in maize, ~70% in soy, 90% in sunflower…) • Gene duplications and genome plasticity (polyploidy, partial or whole genome duplications...)
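The trade-off between sequencing errors and true SNPs can be made concrete with a binomial calculation: given a per-base error rate, how likely is it that at least k of the n reads covering a position are wrong purely by chance? The sketch below assumes errors are independent between reads, which is an idealization (real errors are often systematic), and the function name is illustrative.

```python
from math import comb


def prob_at_least_k_errors(n, k, error_rate):
    """Probability that at least k of n independent reads carry an
    error at a given position (binomial upper tail)."""
    return sum(
        comb(n, i) * error_rate ** i * (1 - error_rate) ** (n - i)
        for i in range(k, n + 1)
    )


# Per-base error rate of 1%, i.e. roughly Phred 20.
for depth, min_reads in [(1, 1), (2, 2), (10, 2), (10, 3)]:
    p = prob_at_least_k_errors(depth, min_reads, 0.01)
    print(f"depth={depth:2d}, >= {min_reads} erroneous reads: {p:.2e}")
```

Requiring a variant to be seen in two or more reads sharply reduces the chance that it is an artifact, but the benefit shrinks as depth grows, which is why coverage and read-count thresholds have to be set together.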
Reduced-representation libraries • DNA methylation in plants occurs at cytosines (as 5-methylcytosine) within CpG dinucleotides and CpNpG trinucleotides • Transposons and other repeats comprise the largest fraction of methylated DNA. Studies in Arabidopsis have shown that CG sites in the 3’ end of the transcribed regions of more than one third of all genes are also methylated (Zhang, Science, 320, 489, 2008). • Methylation is critically important in silencing transposons and regulating plant development (methylation in promoters appears to reduce transcription) • Figure labels: PstI sites, transposons, PstI digestion, recover digested fraction (gel, column)
Library Construction • Genomic DNA • Digestion with one methyl-sensitive restriction enzyme (RE) and fractionation • Ligation of biotinylated RE-specific adapters 1 • Digestion with 4-bp cutter (DpnII, GATC) • Ligation of DpnII-specific adapter • Binding to streptavidin column and digestion with RE • Ligation of RE-specific adapters 2 • PCR enrichment, gel purification, size selection (150-500 bp fragments), cluster synthesis and sequencing (36 cycles) Deschamps et al. The Plant Genome (in press)
SNP detection flowchart • Filtering and Condensing: • Basecalling, cropping last 4 bases & initial base-quality filter (for individual tags) • Condensing & optional consensus base-quality filter (for unitag sequences) • Creating HQ unitag datasets (removing singlets) • Comparing two genotypes: • Comparing HQ unitag datasets from genotype “A” and genotype “B” using Vmatch • Filtering, to accept clusters with only two members (A, B) with exactly one mismatch • Recovering matched HQ unitag sequences and SNP sites from Vmatch alignments • Mapping to genome: • Mapping SNP-containing HQ unitags to reference sequence (genome), using a k-mer table (k = length of trimmed tags), and finding copy numbers and locations • Capturing single-copy HQ unitags with up to a single-base mismatch to the reference sequence at the exact location of the putative SNP site for one or both genotypes
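The first two stages of this flowchart (condensing reads into HQ unitags, then pairing unitags from two genotypes that differ at exactly one base) can be sketched in plain Python. This is a toy stand-in, not the authors' implementation: it replaces Vmatch with a simple masked-sequence index, skips base-quality filtering and genome mapping, and all function names and reads are hypothetical.

```python
from collections import Counter, defaultdict


def build_hq_unitags(reads, min_count=2):
    """Condense identical reads into unitags and drop singlets."""
    counts = Counter(reads)
    return {seq for seq, n in counts.items() if n >= min_count}


def one_mismatch_pairs(unitags_a, unitags_b):
    """Return sorted (a, b, position) triples where a and b have the same
    length and differ at exactly one position."""
    # Index genotype B by (position, sequence with that position removed).
    index = defaultdict(set)
    for b in unitags_b:
        for i in range(len(b)):
            index[(i, b[:i] + b[i + 1:])].add(b)
    pairs = set()
    for a in unitags_a:
        for i in range(len(a)):
            for b in index.get((i, a[:i] + a[i + 1:]), ()):
                if b != a:  # identical flanks, different base at i
                    pairs.add((a, b, i))
    return sorted(pairs)


def two_member_clusters(pairs):
    """Keep pairs whose unitags each have exactly one partner, a simple
    proxy for 'clusters with only two members (A, B)'."""
    a_hits = Counter(a for a, _, _ in pairs)
    b_hits = Counter(b for _, b, _ in pairs)
    return [p for p in pairs if a_hits[p[0]] == 1 and b_hits[p[1]] == 1]


# Hypothetical trimmed reads from two genotypes.
reads_a = ["ACGTACGT"] * 3 + ["TTTTCCCC"] * 2 + ["GGGGAAAA"]
reads_b = ["ACGTTCGT"] * 2 + ["TTTTCCCC"] * 4

hq_a = build_hq_unitags(reads_a)
hq_b = build_hq_unitags(reads_b)
candidate_snps = two_member_clusters(one_mismatch_pairs(hq_a, hq_b))
print(candidate_snps)
```

Dropping singlets before the comparison is what keeps isolated sequencing errors from being paired up as SNPs, and the masked-sequence index avoids an all-against-all comparison of unitags.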
Example: one flow cell in soybean (Williams82 vs. Pintado) • Figure: Frequency vs. Depth (both axes logarithmic, 1 to 100,000) • † Filtered total reads defined as having a quality value for each individual base greater than or equal to 15 • ‡ HQ unitag reads defined as having a quality value for each base greater than or equal to 15, and with an individual read count greater than or equal to 2 • § Best match to reference sequence of HQ unitag reads aligning uniquely or multiple times to the reference sequence
Results & Validation • * SNPs confirmed/not confirmed via Sanger sequencing of PCR products for both genotypes
Distribution of HQ unitags & SNPs related to annotated gene density (soybean) Gene Density (excluding TEs) in 500Kb window Coverage by HQ unitags in 70Kb window SNP Density in 70Kb window
Distribution of HQ unitags & SNPs related to distance to annotated genes (excluding TEs) in soybean Intron, CDS and UTR coordinates determined from GFF annotation files
Bioinformatic tools • Alignment and Polymorphism Detection • SOAP – Short Oligonucleotide Alignment Program • Ruiqiang Li, Beijing Genomics Institute • http://soap.genomics.org.cn • MAQ – Mapping and Assembly with Quality • Heng Li, Sanger Centre • http://maq.sourceforge.net/maq-man.shtml • Bowtie - An ultrafast memory-efficient short read aligner • Ben Langmead and Cole Trapnell, University of Maryland • http://bowtie-bio.sourceforge.net/ • ssahaSNP – Tool to detect homozygous SNPs and indels • Adam Spargo and Zemin Ning, Sanger Centre • http://www.sanger.ac.uk/Software/analysis/ssahaSNP
Bioinformatic tools • Genomic Assembly • Velvet – De novo assembly of short reads • Daniel Zerbino and Ewan Birney, EMBL-EBI • http://www.ebi.ac.uk/~zerbino/velvet/ • SSAKE – Assembly of short reads • Rene Warren, et al, British Columbia Cancer Agency • http://bioinformatics.oxfordjournals.org/cgi/content/full/23/4/500 • Euler – Genomic Assembly • Pavel Pevzner and Mark Chaisson, University of California, San Diego • http://nbcr.sdsc.edu/euler/ www.illumina.com
Bioinformatic tools • ChIP Sequencing • ChIP-Seq Peak Finder • Barbara Wold, Caltech and Rick Myers, Stanford University • http://woldlab.caltech.edu/html/software/ • Digital Gene Expression • Comparative Count Display • Alex Lash, NIH • ftp://ftp.ncbi.nlm.nih.gov/pub/sage/obsolete/bin/ccd/ • SAGE DGED Tool • Cancer Genome Anatomy Project • http://cgap.nci.nih.gov/SAGE/SDGED_Wizard?METHOD=SS10,LS10&ORG=Hs www.illumina.com
Bioinformatic tools - Illumina • Overview • Obtain Bustard reads and align against Genome with Eland • Aggregate and SNP call data with CASAVA • GenomeStudio™ wizard import of data • Examine coverage and quality in stacked alignment graphs for a selected region/chromosome • Export table of SNPs and consensus sequence
Next-Generation Sequencing • Second-generation platforms: • 454/Roche • Solexa/Illumina • SOLiD/ABI • Helicos BioSciences • Dover Systems • Third-generation platforms: • Complete Genomics • BioNanomatrix • VisiGen • Pacific Biosciences • Intelligent Bio-Systems • ZS Genetics • Reveo • LightSpeed Genomics • NABsys • Oxford Nanopore Technologies