The Past, Present, and Future of DNA Sequencing

Craig A. Praul, Co-Director, Genomics Core Facility, Huck Institutes of the Life Sciences, Penn State University

A very short history of DNA sequencing

I started from the conviction that, if different DNA species exhibited different biological activities, there should also exist chemically demonstrable differences between deoxyribonucleic acids. Edwin Chargaff

Milestones
• First isolation of DNA: 1869 (Friedrich Miescher)
• Composition of nucleic acids; tetranucleotide theory: 1909 – 1940 (Phoebus Levene)
• G = C and A = T; however, the G/C and A/T content varies between organisms: 1950 (Edwin Chargaff)
• G/C content measured by annealing: 1968 (Mandel and Marmur)
• Maxam-Gilbert and Sanger sequencing: 1977
• Next-generation sequencing: 2005

Genomes Sequenced
• Virus: 3222 (bacteriophage phi X 174, 5386 nt, 1977)
• Bacteria: 2289 (Haemophilus influenzae, 1.8 x 10^6 nt, 1995)
• Eukarya: 168 (S. cerevisiae, 1.2 x 10^7 nt, 1996; H. sapiens, 3 x 10^9 nt, 2001)
• Archaea: 152 (Methanococcus jannaschii, 1.7 x 10^6 nt, 1996)

Next-Generation Sequencing Liu et al. Journal of Biomedicine and Biotechnology Volume 2012 (2012), Article ID 251364, 11 pages doi:10.1155/2012/251364

Changes in instrument capacity* E. R. Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796

Sequencing Cost Source - NHGRI : http://www.genome.gov/sequencingcosts/

Central Dogma of Molecular Biology
James Watson version (1965): DNA → RNA → Protein
So once we have the genomic DNA sequence of a species, we have all of the information there is? Really?

No, not really.

Illumina HiSeq and MiSeq
• Massively parallel
  • HiSeq: 150 or 180 million reads per lane
  • MiSeq: 15 million reads per run
• Intermediate read length
  • HiSeq: 100 nt or 150 nt
  • MiSeq: 250 nt
• High total output per run (Gb = gigabases)
  • HiSeq: 90 Gb or 288 Gb
  • MiSeq: 8 Gb
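The output figures above follow roughly from reads x read length. The sketch below reproduces them under assumptions that are not stated on the slide: paired-end runs, 8 lanes for the 180-million-read HiSeq mode, 2 lanes for the 150-million-read mode, and a single lane for the MiSeq. The function name `run_output_nt` is hypothetical.

```python
def run_output_nt(reads_per_lane, lanes, read_length_nt, paired_end=True):
    """Approximate total bases produced by one run.

    Lane counts and the paired-end setting are assumptions, not slide values.
    """
    reads_per_cluster = 2 if paired_end else 1
    return reads_per_lane * lanes * read_length_nt * reads_per_cluster


# Assumed configurations (lane counts are illustrative guesses).
hiseq_high_output = run_output_nt(180_000_000, lanes=8, read_length_nt=100)
hiseq_rapid = run_output_nt(150_000_000, lanes=2, read_length_nt=150)
miseq = run_output_nt(15_000_000, lanes=1, read_length_nt=250)

for name, nt in [("HiSeq, 8 lanes", hiseq_high_output),
                 ("HiSeq, 2 lanes", hiseq_rapid),
                 ("MiSeq", miseq)]:
    print(f"{name}: {nt / 1e9:.1f} gigabases")
```

Under these assumptions the HiSeq modes come out at 288 and 90 gigabases, and the MiSeq at 7.5 gigabases, which the slide rounds to 8.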

Sequencing Types
• Single read
• Paired-end read
• Mate-pair read

Library Types
• Many different library preps: DNA, mate-pair, mRNA, miRNA, ChIP
• Fragmentation
  • DNA: 300 – 500 nt
  • RNA: 150 – 200 nt
• Attachment of appropriate adapters
  • Complex: flow cell binding, forward and reverse sequencing, barcode (BC)
  • Custom: avoid if possible
• Removal of dimers/small inserts
• Amplification (or not)

Applications
• De novo sequencing (genomes, transcriptomes)
• Resequencing (genomes, exomes, custom sequence capture)
• RNA-seq (mRNA, miRNA, degradome)
• ChIP-seq
• Methyl-seq
• RIP-seq
• Amplicon

de Novo Experimental Design
• Estimate of genome size
• Coverage (30x – 100x)
• Sequencing type (paired-end or mate-pair)
• Example: 100 Mb genome, 100 x 100 nt paired-end reads
  • (100 Mb) x (30x coverage) = 3 Gb
  • 3 Gb / (200 nt for each pair of paired-end reads) = 15 million read pairs
• Replicates
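The arithmetic in the example can be sketched as a small helper. The function name `read_pairs_needed` and its parameters are hypothetical, chosen only for illustration; the numbers are the slide's own (100 million nt genome, 30x coverage, 100 nt per read, two reads per pair).

```python
import math


def read_pairs_needed(genome_size_nt, coverage, read_length_nt, reads_per_fragment=2):
    """Number of fragments (read pairs by default) needed for a target coverage."""
    total_nt = genome_size_nt * coverage                    # bases to sequence
    nt_per_fragment = read_length_nt * reads_per_fragment   # 200 nt per pair here
    return math.ceil(total_nt / nt_per_fragment)


# Slide example: 100 million nt genome, 30x coverage, 100 x 100 nt paired-end reads.
pairs = read_pairs_needed(100_000_000, coverage=30, read_length_nt=100)
print(f"{pairs:,} read pairs")
```

Raising `coverage` to 100, the upper end of the range listed above, scales the requirement proportionally to 50 million read pairs.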

Resequencing : Sequence Capture

RNA-seq Experimental Design
• Estimate of transcriptome size (1-5% of genome?)
• Coverage (30x?)
• mRNA or rRNA-depleted RNA
• Relative abundance of the transcripts you are interested in
• Sequencing type (single read or paired-end)
• Simple transcriptome vs. complex transcriptome
• Splice variants
• Example: 3 Gb genome, 100 nt single reads
  • (3 Gb genome) x (5% transcriptome) = 150 Mb transcriptome
  • (150 Mb transcriptome) x (30x coverage) = 4.5 Gb total sequence
  • 4.5 Gb / (100 nt for each read) = 45 million reads
• Replicates: Yes!!!!
  • Biological, not technical
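The RNA-seq estimate can be worked through the same way. This is a minimal sketch: the function name `rnaseq_reads_needed` and its parameters are hypothetical. It takes the transcribed fraction as a whole-number percentage, and 5% of a 3 billion nt genome is 150 million nt, which is the figure consistent with the 4.5 billion nt total in the example.

```python
import math


def rnaseq_reads_needed(genome_size_nt, transcribed_percent, coverage, read_length_nt):
    """Return (transcriptome size, total bases, single reads) for a target coverage."""
    transcriptome_nt = genome_size_nt * transcribed_percent // 100
    total_nt = transcriptome_nt * coverage
    reads = math.ceil(total_nt / read_length_nt)
    return transcriptome_nt, total_nt, reads


# Slide example: 3 billion nt genome, 5% transcribed, 30x coverage, 100 nt single reads.
transcriptome_nt, total_nt, reads = rnaseq_reads_needed(3_000_000_000, 5, 30, 100)
print(f"transcriptome: {transcriptome_nt:,} nt")
print(f"total sequence: {total_nt:,} nt")
print(f"reads per sample: {reads:,}")
```

The result is per sample, so the total for an experiment is this figure multiplied by the number of biological replicates and conditions.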

ChIP-Seq http://www.nature.com/nmeth/journal/v4/n8/images/nmeth0807-613-F1.gif

RIP-seq Source : http://openi.nlm.nih.gov/imgs/rescaled512/3269675_ijms-13-00097f6.png

Methyl-seq
20 different types of base modifications in DNA are known, and there are perhaps 200 modifications of RNA.

Experimental Space: Next-Gen Platform
• PacBio: 0.075 x 10^6 reads/sample, 1000 – 3000 nt
  • Whole transcript
• Roche 454 FLX+: 0.5 – 1 x 10^6 reads/sample, 800 – 1000 nt
  • Small to medium genome de novo sequencing
  • Long amplicon
  • Transcriptome
• PGM: 1 – 2 x 10^6 reads per sample, 400 nt
  • Small genome de novo
  • Medium amplicon
• MiSeq: 1 – 2 x 10^6 reads per sample, 50 – 250 nt
  • Small genome de novo
  • Small amplicon
• HiSeq: 10 – 100 x 10^6 reads per sample, 50 – 150 nt
  • Counting applications: RNA-seq, ChIP-seq, RIP-seq, Methyl-seq
  • Large genome de novo and resequencing

Experimental Space: The Relevancy of “Classic” Techniques
Differential Gene Expression
• Northern blotting (1977): 1 probe – 20 samples
• Dot blots (1987): 100s of probes – 1 sample
• RT-PCR (1992): 100s of probes – 10 to 100 samples
• Microarrays (1995): 100,000s of probes – 1 sample
• Next-gen sequencing (2005): 10 – 100 x 10^6 reads – 1 sample

The Future
• More reads
• Longer reads
• Faster sequencing
• Cheaper sequencing
• New applications
