Special Topics in Genomics

Special Topics in Genomics Next-generation Sequencing

Work flow of conventional versus second-generation sequencing (a) With high-throughput shotgun Sanger sequencing, genomic DNA is fragmented, then cloned to a plasmid vector and used to transform E. coli. For each sequencing reaction, a single bacterial colony is picked and plasmid DNA isolated. Each cycle sequencing reaction takes place within a microliter-scale volume, generating a ladder of ddNTP-terminated, dye-labeled products, which are subjected to high-resolution electrophoretic separation within one of 96 or 384 capillaries in one run of a sequencing instrument. As fluorescently labeled fragments of discrete sizes pass a detector, the four-channel emission spectrum is used to generate a sequencing trace. (b) In shotgun sequencing with cyclic-array methods, common adaptors are ligated to fragmented genomic DNA, which is then subjected to one of several protocols that results in an array of millions of spatially immobilized PCR colonies or 'polonies'15. Each polony consists of many copies of a single shotgun library fragment. As all polonies are tethered to a planar array, a single microliter-scale reagent volume (e.g., for primer hybridization and then for enzymatic extension reactions) can be applied to manipulate all array features in parallel. Similarly, imaging-based detection of fluorescent labels incorporated with each extension can be used to acquire sequencing data on all features in parallel. Successive iterations of enzymatic interrogation and imaging are used to build up a contiguous sequencing read for each array feature. Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Available next-generation sequencing platforms • Illumina/Solexa • ABI SOLiD • Roche 454 • Polonator • HeliScope • …

Example: Illumina/Solexa 1. Prepare genomic DNA 2. Attach DNA to surface 3. Bridge amplification 4. Fragement become double stranded 5. Denature the double stranded molecules 6. Complete amplification

Illumina/Solexa 7. Determine first base 8. Image first base 9. Determine second base 10. Image second base 11. Sequence reads over multiple cycles 12. Align data. >50 milliion clusters/flow cell, each 1000 copies of the same template, 1 billion bases per run, 1% of the cost of capillary-based method. (From: http://www.illumina.com/downloads/SS_DNAsequencing.pdf)

Clonal amplification of sequencing features in the second-generation sequencing (a) The 454, the Polonator and SOLiD platforms rely on emulsion PCR20 to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor-flanked shotgun library (shown as gold and turquoise adaptors flanking unique inserts) is PCR amplified (that is, multi-template PCR, not multiplex PCR, as only a single primer pair is used, corresponding to the gold and turquoise adaptors) in the context of a water-in-oil emulsion. One of the PCR primers is tethered to the surface (5'-attached) of micron-scale beads that are also included in the reaction. A low template concentration results in most bead-containing compartments having either zero or one template molecule present. In productive emulsion compartments (where both a bead and template molecule is present), PCR amplicons are captured to the surface of the bead. After breaking the emulsion, beads bearing amplification products can be selectively enriched. Each clonally amplified bead will bear on its surface PCR products corresponding to amplification of a single molecule from the template library. (b) The Solexa technology relies on bridge PCR21, 22 (aka 'cluster PCR') to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor-flanked shotgun library is PCR amplified, but both primers densely coat the surface of a solid substrate, attached at their 5' ends by a flexible linker. As a consequence, amplification products originating from any given member of the template library remain locally tethered near the point of origin. At the conclusion of the PCR, each clonal cluster contains 1,000 copies of a single member of the template library. Accurate measurement of the concentration of the template library is critical to maximize the cluster density while simultaneously avoiding overcrowding. Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Strategies for cyclic array sequencing • With the 454 platform, clonally amplified 28-m beads generated by emulsion PCR serve as sequencing features and are randomly deposited to a microfabricated array of picoliter-scale wells. With pyrosequencing, each cycle consists of the introduction of a single nucleotide species, followed by addition of substrate (luciferin, adenosine 5'-phosphosulphate) to drive light production at wells where polymerase-driven incorporation of that nucleotide took place. This is followed by an apyrase wash to remove unincorporated nucleotide. • (b) With the Solexa technology, a dense array of clonally amplified sequencing features is generated directly on a surface by bridge PCR. Each sequencing cycle includes the simultaneous addition of a mixture of four modified deoxynucleotide species, each bearing one of four fluorescent labels and a reversibly terminating moiety at the 3' hydroxyl position. A modified DNA polymerase drives synchronous extension of primed sequencing features. This is followed by imaging in four channels and then cleavage of both the fluorescent labels and the terminating moiety. • (c) With the SOLiD and the Polonator platforms, clonally amplified 1-m beads are used to generate a disordered, dense array of sequencing features. Sequencing is performed with a ligase, rather than a polymerase. With SOLiD, each sequencing cycle introduces a partially degenerate population of fluorescently labeled octamers. The population is structured such that the label correlates with the identity of the central 2 bp in the octamer (the correlation with 2 bp, rather than 1 bp, is the basis of two-base encoding). After ligation and imaging in four channels, the labeled portion of the octamer (that is, 'zzz') is cleaved via a modified linkage between bases 5 and 6, leaving a free end for another cycle of ligation. Several such cycles will iteratively interrogate an evenly spaced, discontiguous set of bases. The system is then reset (by denaturation of the extended primer), and the process is repeated with a different offset (e.g., a primer set back from the original position by one or several bases) such that a different set of discontiguous bases is interrogated on the next round of serial ligations. • (d) With the HeliScope platform, single nucleic acid molecules are sequenced directly, that is, there is no clonal amplification step required. Poly-A–tailed template molecules are captured by hybridization to surface-tethered poly-T oligomers to yield a disordered array of primed single-molecule sequencing templates. Templates are labeled with Cy3, such that imaging can identify the subset of array coordinates where a sequencing read is expected. Each cycle consists of the polymerase-driven incorporation of a single species of fluorescently labeled nucleotide at a subset of templates, followed by fluorescence imaging of the full array and chemical cleavage of the label. Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Conventional sequencing • Can sequence up to 1,000 bp, and per-base 'raw' accuracies as high as 99.999%. In the context of high-throughput shotgun genomic sequencing, Sanger sequencing costs on the order of $0.50 per kilobase. Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Second-generation DNA sequencing technologies Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Applications of next-generation sequencing Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Base calling Schematic representation of main Illumina noise factors. (a–d) A DNA cluster comprises identical DNA templates (colored boxes) that are attached to the flow cell. Nascent strands (black boxes) and DNA polymerase (black ovals) are depicted. (a) In the ideal situation, after several cycles the signal (green arrows) is strong, coherent and corresponds to the interrogated position. (b) Phasing noise introduces lagging (blue arrows) and leading (red arrow) nascent strands, which transmit a mixture of signals. (c) Fading is attributed to loss of material that reduces the signal intensity (c). (d) Changes in the fluorophore cross-talk cause misinterpretation of the received signal (teal arrows; d). For simplicity, the noise factors are presented separately from each other. Erlich et al. Nature Methods 5: 679-682 (2008)

Base calling: Alta-Cyclic The training process (green arrows) starts with creation of the training set, beginning with sequences generated by the standard Illumina pipeline, by linking intensity reads and a corresponding genome sequence (the 'correct' sequence). Then, two grid searches are used to optimize the parameters to call the bases. After optimization, a final SVM array is created, each of which corresponds to a cycle. In the base-calling stage (blue arrows), the intensity files of the desired library undergo deconvolution to correct for phasing noise using the optimized values and are sent for classification with the SVM array. The output is processed, and sequences and quality scores are reported. Erlich et al. Nature Methods 5: 679-682 (2008)

Alta-Cyclic performance (a) Analysis of the HepG2 RNA library using Alta-Cyclic. The absolute number of additional fully correct reads (in addition to those generated by the Illumina base caller) is indicated by the red line; the fold change of the improvement is indicated by the blue bars. (b) A comparison of fully correct reads for the Tetrahymena micronuclear library by the Illumina base caller and Alta-Cyclic. (c) The average error rate in calls of the artificial SNP locations in the phi X library as a function of the cycle in which they were called. The dashed line represents 1% error rate (Q20). The plot on the right shows the last 18 cycles in a different scale. (d) A comparison of fully correct reads for the phi X library with 1% artificial SNPs. (e) Phi X sequences generated by Alta-Cyclic or Illumina were exhaustively aligned to the reference genome (allowing up to 53 mismatches out of 78). The distribution of alignment scores is shown beginning with an identical number of raw reads for input into each base caller. Erlich et al. Nature Methods 5: 679-682 (2008)

ChIP-Seq Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

ChIP-Seq Analysis Alignment Peak Detection Annotation Visualization Sequence Analysis Motif Analysis

Alignment • ELAND • Bowtie • SOAP • SeqMap • …

Peak detection • FindPeaks • CHiPSeq • BS-Seq • SISSRs • QuEST • MACS • CisGenome • …

Two common designs • One sample experiment contains only a ChIP’d sample • Two sample experiment contains a ChIP’d sample and a negative control sample

One sample analysis A simple way is the sliding window method ki Poisson background model is commonly used to estimate error rate ki ~ Poisson(λ0) Or people use Monte Carlo simulations Both are based on the assumption that read sampling rate is a constant across the genome. Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

The constant rate assumption does not hold! Negative binomial model fits the data better! ki | λi ~ Poisson(λi) λi ~ Gamma(α, β) Marginally, ki ~ NegBinom(α, β) Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

FDR estimation based on Poisson and negative binomial model Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Read direction provides extra information Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

CisGenome procedure Alignment Exploration FDR computation Negative binomial model Peak Detection Use read direction to refine peak boundary and filter low quality peaks Post Processing Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Two sample analysis Reason: read sample rates at the same genomic locus are correlated across different samples. Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

CisGenome two sample analysis Alignment Exploration k1i k2i FDR computation ni =k1i + k2i k1i | ni ~ Binom(ni , p0) Peak Detection Post Processing Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

A comparative study of ChIP-chip and ChIP-seq • NRSF ChIP-chip 2 ChIP + 2 Mock IP in Jurkat cells, profiled using Affymetrix Human Tiling 2.0R arrays. • NRSF ChIP-seq ChIP + Negative Control in Jurkat cells, sequenced with the next generation sequencer made by Illumina/Solexa. Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Intersection Before post-processing After post-processing Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Signal correlation Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Visual comparison Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Comparison of peak detection results Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Are array specific peaks noise or signal? Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Effects of read number in ChIP-seq Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

RNA-Seq • After two rounds of poly(A) selection, RNA is fragmented to an average length of 200 nt by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming. The cDNA is then converted into a molecular library for Illumina/Solexa 1G sequencing, and the resulting 25-bp reads are mapped onto the genome. Normalized transcript prevalence is calculated with an algorithm from the ERANGE package. • (b) Primary data from mouse muscle RNAs that map uniquely in the genome to a 1-kb region of the Myf6 locus, including reads that span introns. The RNA-Seq graph above the gene model summarizes the quantity of reads, so that each point represents the number of reads covering each nucleotide, per million mapped reads (normalized scale of 0–5.5 reads). • (c) Detection and quantification of differential expression. Mouse poly(A)-selected RNAs from brain, liver and skeletal muscle for a 20-kb region of chromosome 10 containing Myf6 and its paralog Myf5, which are muscle specific. In muscle, Myf6 is highly expressed in mature muscle, whereas Myf5 is expressed at very low levels from a small number of cells. The specificity of RNA-Seq is high: Myf6 expression is known to be highly muscle specific, and only 4 reads out of 71 million total liver and brain mapped reads were assigned to the Myf6 gene model. Mortazavi et al. Nature Methods,5:621-628, 2008

Reproducibility, linearity and sensitivity (a) Comparison of two brain technical replicate RNA-Seq determinations for all mouse gene models (from the UCSC genome database), measured in reads per kilobase of exon per million mapped sequence reads (RPKM), which is a normalized measure of exonic read density; R2 = 0.96. (b) Distribution of uniquely mappable reads onto gene parts in the liver sample. Although 93% of the reads fall onto exons or the RNAFAR-enriched regions (see Fig. 3 and text), another 4% of the reads falls onto introns and 3% in intergenic regions. (c) Six in vitro–synthesized reference transcripts of lengths 0.3–10 kb were added to the liver RNA sample (1.2 104 to 1.2 109 transcripts per sample; R2 > 0.99). (d) Robustness of RPKM measurement as a function of RPKM expression level and depth of sequencing. Subsets of the entire liver dataset (with 41 million mapped unique + splice + multireads) were used to calculate the expression level of genes in four different expression classes to their final expression level. Although the measured expression level of the 211 most highly expressed genes (black and cyan) was effectively unchanged after 8 million mappable reads, the measured expression levels of the other two classes (purple and red) converged more slowly. The fraction of genes for which the measured expression level was within 5% of the final value is reported. 3 RPKM corresponds to approximately one transcript per cell in liver. The corresponding number of spliced reads in each subset is shown on the top x axis. Mortazavi et al. Nature Methods,5:621-628, 2008

ERANGE (a) The main steps in the computational pipeline are outlined at left, with different aspects of read assignment and weighting diagrammed at right and the corresponding number of gene model reads treated in muscle shown in parentheses. In each step, the sequence read or reads being assigned by the algorithm are shown as a black rectangle, and their assignment to one or more gene models is indicated in color. Sequence reads falling outside known or predicted regions are shown in gray. RNAFAR regions (clusters of reads that do not belong to any gene model in our reference set) are shown as dotted lines. They can either be assigned to neighboring gene models, if they are within a specified threshold radius (purple), or assigned their own predicted transcript model (green). Multireads (shown as parallelograms) are assigned fractionally to their different possible locations based on the expression levels of their respective gene models as described in the text. (b) Comparison of mouse liver expanded RPKM values to publicly available Affymetrix microarray intensities from GEO (GSE6850) for genes called as present by Rosetta Resolver. Expanded RPKMs include unique reads, spliced reads and RNAFAR candidate exon aggregation, but not multireads. Genes with >30% contribution of multireads to their final RPKM (Supplementary Fig. 4) are marked in red. (c) Comparison of Affymetrix intensity values with final RPKMs, which includes multireads. Note that the multiread-affected genes that are below the regression line in b straddle the regression line in c. Mortazavi et al. Nature Methods,5:621-628, 2008

Summary The next-generation sequencing is • Rapidly evolving • Democratizing the extent to which individual investigators can pursue projects at a scale previously accessible only to major genome centers • New statistical tools need to be developed

Special Topics in Genomics