NCSU Summer Institute of Statistical Genetics, Raleigh 2004: Genome Science

Session II: Genome Sequencing

Genome and EST sequencing
• Sequencing Technologies
• Informatics Tools
• Sequencing project approaches
• EST sequencing projects
• Genome sequencing projects
• What we have learned
The Summer Institute 2004

Some Terms
• Complementary – nucleotide sequences that will form specific hybrids
• Hybridize – to form a duplex
• Label – a molecular tag that facilitates detection
• Oligonucleotide – a short single-stranded piece of nucleic acid
• Anneal – to incubate nucleic acid species together under conditions that promote specific hybridization

Why study genomes
• Molecular biology and biochemistry need a point of entry
• Genetics is reliant on phenotype
• Hypothesis driven versus data production - parallels with early Naturalists and modern day physics
• Identify similarities and differences amongst diverse life forms

Data mining vs. Data Dredging

Gene structural features
• Hybridization of complementary strands
• Specificity of base pairing
• Almost any DNA is clonable
• You can have the same sequence - but different genes
[Figure labels: sequence read, cDNA, genomic DNA, exon, polyA tail]

Sequencing Technologies
• Basic principles
  • Dideoxy chain termination
  • Electrophoretic separation
  • Visualization
• Innovations
  • Fluorescent tags
  • Thermocycling
  • Capillary electrophoresis
• Novel methodologies
  • Sequencing by hybridization
  • Mass spectrometry
  • Nanopore sequencing
• Other things of note

Primer extension
5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
                            3’-AAATCTAGCTAAGCT-5’
• The extended molecule is the reverse complement of the target
• The extended molecule can be tagged for visualization
• Extension occurs via a 3’ hydroxyl group
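The extension step can be sketched in a few lines of Python. This is an illustration only: it assumes both strands are written 5'→3', that the primer anneals at exactly one site, and that extension runs to the end of the template. The function names are hypothetical and not from the slides.

```python
# Illustrative sketch; names are hypothetical, not from the slides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def extend_primer(template, primer):
    """Anneal the primer to the template and extend it from its 3' end.

    Both arguments are written 5'->3'. The new strand grows by copying the
    template toward the template's 5' end.
    """
    site = reverse_complement(primer)      # where the primer base-pairs
    start = template.find(site)
    if start < 0:
        raise ValueError("primer does not anneal to this template")
    return primer + reverse_complement(template[:start])


# Template from the slide; the primer 3'-AAATCTAGCTAAGCT-5' rewritten 5'->3'.
TEMPLATE = "ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA"
PRIMER = "TCGAATCGATCTAAA"

product = extend_primer(TEMPLATE, PRIMER)
print(product)
```

Because the primer sits at the 3' end of this template, the full-length product is the reverse complement of the whole target, as the first bullet states.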

Dideoxy chain termination
Dideoxynucleotides (ddNTPs) terminate extension because they lack a 3’-OH
Mixing ddATP with dATP creates a pool of extension products in which termination occurs at each available A
The termination products can be separated by size and visualized by labeling either the ddNTP or the primer
5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
                        ATCGGTCAAATCTAGCTAAGCT-5’
                    ATCGATCGGTCAAATCTAGCTAAGCT-5’
               ATGGCATCGATCGGTCAAATCTAGCTAAGCT-5’
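A small sketch can enumerate the pool of A-terminated products for the template and primer shown on the slide. It assumes every possible termination site is represented in the pool and ignores reaction kinetics; `termination_products` is a hypothetical helper name.

```python
# Illustrative sketch; helper name and structure are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def termination_products(template, primer, dd_base):
    """List every extension product that ends in the given dideoxy base.

    All sequences are written 5'->3'. Each product is the primer plus the
    newly synthesized bases up to and including one incorporated ddNTP.
    """
    site = primer.translate(COMPLEMENT)[::-1]
    start = template.find(site)
    if start < 0:
        raise ValueError("primer does not anneal to this template")
    # Bases the polymerase would add, in order of synthesis.
    extension = template[:start].translate(COMPLEMENT)[::-1]
    return [
        primer + extension[: i + 1]
        for i, base in enumerate(extension)
        if base == dd_base
    ]


TEMPLATE = "ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA"
PRIMER = "TCGAATCGATCTAAA"  # 3'-AAATCTAGCTAAGCT-5' rewritten 5'->3'

products = termination_products(TEMPLATE, PRIMER, "A")
for p in products:
    # Print 3'->5' so the output lines up with the slide's orientation.
    print(len(p), p[::-1] + "-5'")
```

Each product ends at a position where the template carries a T, so the set of product lengths marks where A occurs in the new strand.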

Sanger sequencing
[Gel figure: lanes labeled T, C, G, A]
• ddNTP/dNTP mixtures are made up for each of the four nucleotides - adenine, cytosine, guanine, thymine
• Proportion of dideoxy to deoxy NTP determines the frequency of termination
• Products from the four reactions are separated by size and DNA sequence is inferred
• Invert gel to read the sequence 5’ to 3’
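How sequence is inferred from the four lanes can be sketched as follows. The sketch assumes an idealized gel in which every band is resolved and each fragment length appears in exactly one lane; both function names are hypothetical.

```python
# Illustrative sketch of reading an idealized four-lane Sanger gel.


def lanes_from_extension(extension, primer_len):
    """Simulate the four reactions: map each ddNTP to its fragment lengths."""
    lanes = {base: [] for base in "ACGT"}
    for i, base in enumerate(extension):
        lanes[base].append(primer_len + i + 1)
    return lanes


def read_gel(lanes):
    """Infer the new strand 5'->3' by reading bands from shortest to longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = lanes_from_extension("CTGGCTAGCTA", primer_len=15)
print(lanes)
print(read_gel(lanes))
```

Sorting by fragment length plays the role of electrophoretic separation: the shortest fragment identifies the first base added after the primer.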

Fluorescent sequencing
• Each ddNTP is labeled with a different fluor - now all four products can be run in the same gel lane
• Fluorescence is detected using a laser scanner to produce a false color image
• Electropherograms (chromatograms) are produced that display peak intensity for each fluor
• Can also differentially label the primer to achieve the same end

Cycle sequencing with PCR
• Sanger sequencing can require large amounts of template
• Polymerase chain reaction exponentially amplifies specific DNAs
• Use of ddNTPs allows the combination of amplification and dideoxy terminator sequencing
[Cycle sequencing animation]

High-throughput sequencing
• Dideoxy terminator sequencing is robust and flexible
• Microtiter format
• PCR based cycle sequencing requires less template
• Fluorescent sequencing increased gel capacity 4X
• Supporting robotics upstream of sequencing process
• Computational tools
• Capillary sequencers

Capillary gels
Slab gels make life in the sequencing lab difficult for many reasons:
• Pouring the gel is time consuming and prone to error
• The microtiter plate format used for sequencing reactions has a different spacing than the gel loading comb, which makes loading cumbersome
• Assembly and disassembly of the sequencing apparatus is messy and time consuming
• Manual lane tracking is time consuming and prone to error
• Gels never run perfectly - lanes can sometimes run together, making lane tracking difficult
Capillary gels help because:
• Each sequencing reaction is run in a separate capillary - there is no lane tracking to worry over
• Matrix for the capillary gel is robotically assembled, injected and QC’d
• Robotic loading of samples is compatible with walk-away capability

Informatics essentials
• Basecallers convert trace data to sequence
• Assemblers form contiguous sequences from small chunks
• Viewers/Editors allow the scientist to interactively work with data
• Databases store sequencing data - from electropherograms to annotation
• Analysis tools compare the sequence against databases of sequences and use algorithms to make educated guesses about the structure and function of a given sequence

Basecalling
• Is the spacing of the peaks what is expected?
• Is there a peak in the electropherogram?
• What fluor is responsible for this peak?
• Since noise ensures the presence of more than one peak, which peak is the correct peak?
• What is the probability that the base that is assigned is the correct base?
• Phred score - Phred 20 (1 error in 100 bases) is a typical quality standard
• TraceTuner - algorithm is similar to Phred but reportedly more accurate with ABI3700 traces, plus accelerated execution
• Others are available
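The relation between a Phred score and error probability is Q = -10·log10(P), which is why Phred 20 corresponds to 1 error in 100 bases. A minimal sketch of the conversion follows; the function names are illustrative and are not part of Phred or TraceTuner.

```python
import math

# Illustrative helpers; these names are not part of Phred or TraceTuner.


def error_probability(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def phred_quality(p_error):
    """Phred quality corresponding to an error probability."""
    return -10 * math.log10(p_error)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given per-base qualities."""
    return sum(error_probability(q) for q in qualities)


for q in (10, 20, 30, 40):
    print(f"Phred {q}: about 1 error in {round(1 / error_probability(q))} bases")
```

Summing per-base error probabilities gives a quick estimate of how many errors to expect in a read, for example when deciding where to trim low-quality ends.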

Assembly
• Production of a single contiguous sequence from multiple sequence reads
• The best assembly programs (including Phrap) use probability scores directly from the output of basecallers such as Phred
• Phrap was designed for genome sequencing projects - EST assemblies make different assumptions
• Final assembly products include contigs and singletons
• Accuracy of the contig consensus sequence is based on error models propagated from basecalling software
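To make the idea of contigs and singletons concrete, here is a toy greedy overlap assembler. It is not Phrap's algorithm: it assumes error-free reads from a single strand, ignores quality scores and repeats, and uses exact suffix-prefix matches only. All names are hypothetical.

```python
# Toy greedy assembler; not Phrap's method. Names are hypothetical.


def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the sequences left when no overlap of at least min_len remains:
    merged contigs plus any reads that never joined (singletons).
    """
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return seqs
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)


GENOME = "ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA"
READS = [GENOME[24:], GENOME[:18], "GGGGGAAAAAGGGGG", GENOME[10:30]]

result = greedy_assemble(READS)
print(result)
```

Three of the reads overlap and collapse into one contig; the fourth shares no overlap with the others and is reported as a singleton.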

Viewers/editors
• Consed
[Break]

Storage and analysis of sequence
• The amount of sequence information deposited in databases is increasing at a very rapid rate
• Tools to manage sequence data are imperfect and in development
• Development of controlled vocabularies and gene ontologies will facilitate database integration
• Analytical tools and algorithm development are growth industries
http://www.ebi.ac.uk/Databases/index.html
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
http://www.ncbi.nlm.nih.gov/

Impact of database structure
• Flat file databases are great for speed but are not built for integration
• Lack of controlled vocabularies impedes efficient and reliable searching and inhibits integration
• GenBank uses a controlled index and vocabulary - sort of
• Example of searching for genomic sequence, EST sequence and complete cDNAs
• Relational databases are great for integration but can be slow and changing the schema takes an act of Congress
• Flat file databases with robust re-indexing routines have the advantage of speed and the ability to integrate different data types

Sequencing by hybridization
Target: TGCATGCATGCA
Oligo array (spot: probe):
 1: ACGTAC    2: TACGTA    3: CGTACG    4: TTCCGG    5: AATTCC    6: GGGCCC
 7: CGTACG    8: CACAGA    9: GTACGT   10: CGCGGA   11: AGCAGC   12: ACGTAC
Hybridizing fragments (fragment…spot): TGCATG…1, GCATGC…3, CATGCA…9, ATGCAT…2, TGCATG…12, GCATGC…3, CATGCA…9
• Determine constituent sequences by hybridizing to oligos of known sequence
• Assemble sequence fragments into contiguous sequence
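Both steps on this slide can be sketched in Python. The sketch assumes the array probes are written 3'→5', so a probe is the base-by-base complement of the fragment it captures, and it rebuilds a sequence only when every k-mer has a single possible successor. The helper names are hypothetical, and the second target is borrowed from the primer extension example purely for illustration.

```python
# Illustrative sketch; helper names and the 3'->5' probe convention are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Probe array as listed on the slide (spot number -> oligo).
PROBES = {
    1: "ACGTAC", 2: "TACGTA", 3: "CGTACG", 4: "TTCCGG",
    5: "AATTCC", 6: "GGGCCC", 7: "CGTACG", 8: "CACAGA",
    9: "GTACGT", 10: "CGCGGA", 11: "AGCAGC", 12: "ACGTAC",
}


def spots_for(fragment, probes=PROBES):
    """Spots whose probe is complementary to the fragment."""
    wanted = fragment.translate(COMPLEMENT)
    return sorted(spot for spot, oligo in probes.items() if oligo == wanted)


def spectrum(seq, k):
    """The set of k-mers a target would reveal by hybridization."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstruct(kmers):
    """Rebuild a sequence from its k-mer set, or None if ambiguous."""
    kmers = set(kmers)
    k = len(next(iter(kmers)))
    suffixes = {m[1:] for m in kmers}
    starts = [m for m in kmers if m[:-1] not in suffixes]
    if len(starts) != 1:
        return None  # no unique starting k-mer, e.g. a repetitive target
    seq, used = starts[0], {starts[0]}
    while True:
        nxt = [m for m in kmers - used if m[:-1] == seq[-(k - 1):]]
        if len(nxt) != 1:
            break
        seq += nxt[0][-1]
        used.add(nxt[0])
    return seq if used == kmers else None


print(spots_for("TGCATG"))
print(reconstruct(spectrum("ACGTGATCCCAGTACCGTAG", 6)))
print(reconstruct(spectrum("TGCATGCATGCA", 6)))
```

The slide's own target is a tandem repeat, so its hexamer set is circular and the reconstruction returns `None`. This illustrates why repeats are a known difficulty for sequencing by hybridization.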

Sequencing by mass spectrometry
• Obtain mass spectra from a reference panel of oligos
• Fragment the unknown and obtain mass spectra
• Deconvolute the data
[Figure labels: oligos TGCATG, GCATGC, CATGCA, ATGCAT]

454

US Genomics

Nanopore sequencing
• Two solution filled compartments separated by a membrane with a channel
• Ions flow through the channel in response to an applied voltage
• DNA is negatively charged and will be drawn through the channel
• Channel size allows DNA molecules to be drawn into and through the channel one at a time
• Current is reduced when the channel is occupied by DNA
• Length of current drop is proportional to length of DNA
• Extent of current drop is indicative of physicochemical properties of DNA - thus, one can infer sequence from the trace
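The two measurements named in the last bullets, the duration and the depth of each current drop, can be extracted from a trace with a simple threshold. This sketch assumes an idealized, noise-free trace sampled at regular intervals; the 80% threshold and the function name are arbitrary illustrative choices.

```python
# Illustrative sketch; threshold and names are assumptions, not a real base-caller.


def blockade_events(trace, open_current, threshold=0.8):
    """Find stretches where the current falls below threshold * open_current.

    Returns a list of (start_index, duration_in_samples, mean_drop) tuples.
    """
    events = []
    start = None
    for i, current in enumerate(trace):
        blocked = current < threshold * open_current
        if blocked and start is None:
            start = i
        elif not blocked and start is not None:
            segment = trace[start:i]
            events.append((start, i - start, open_current - sum(segment) / len(segment)))
            start = None
    if start is not None:  # the trace ended while the channel was occupied
        segment = trace[start:]
        events.append((start, len(segment), open_current - sum(segment) / len(segment)))
    return events


trace = [100, 100, 60, 60, 60, 100, 100, 55, 55, 55, 55, 55, 55, 100]
events = blockade_events(trace, open_current=100)
for start, duration, drop in events:
    print(f"event at sample {start}: duration {duration}, mean drop {drop}")
```

Under the slide's model, the second event would correspond to a molecule about twice as long as the first, and the differing depths would reflect different physicochemical properties.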

Sequencing project approaches
• EST projects
• Map-based: assembly based on physical ordering of clones
• Shotgun: assembly based on computational ordering of sequences
• Combination strategies: minimal scaffolding from physical maps, fill in the blanks by shotgun and directed sequencing

EST sequencing projects
• Only the expressed genome is sequenced, thereby avoiding the “junk”
• Relatively inexpensive and fast - accessible to small laboratories
• May fail to capture many genes because the appropriate biological condition leading to expression is not captured
• May overestimate gene number due to non-overlapping sequences from the same gene

Project is the operative word

Libraries of overlapping clones
• Library clones can be ordered by the presence of restriction sites, known sequences, etc.
• Assembly of contiguous sequences is straightforward because the clones form an ordered array

Map-based sequencing
• Produce large insert libraries in BACs, cosmids, etc. to “cover” the genome multiple times
• Determine a minimal tiling path of clones by restriction mapping, hybridization of end-based probes or end sequencing
• Ordered sets of clones are subcloned into pools of small clones
• Smaller clones can be ordered or sequenced by shotgun methods
• Fewer sequencing runs = lower costs
• Obtaining an ordered array of clones can be time consuming
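Once clones have been placed on a physical map, choosing a minimal tiling path is an interval-covering problem. The sketch below assumes each clone already has known start and end coordinates on the map, which is the hard part in practice, and uses the standard greedy choice of the clone that reaches furthest. Clone names and coordinates are invented for illustration.

```python
# Illustrative sketch; clone names and coordinates are invented.


def minimal_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones that cover [region_start, region_end).

    clones is a list of (name, start, end) tuples in map coordinates.
    Raises ValueError if the library leaves part of the region uncovered.
    """
    path = []
    covered = region_start
    while covered < region_end:
        # Clones that overlap the covered part and extend beyond it.
        candidates = [c for c in clones if c[1] <= covered < c[2]]
        if not candidates:
            raise ValueError(f"no clone covers map position {covered}")
        best = max(candidates, key=lambda c: c[2])
        path.append(best[0])
        covered = best[2]
    return path


LIBRARY = [
    ("A", 0, 100),
    ("B", 50, 120),
    ("C", 90, 200),
    ("D", 150, 260),
    ("E", 190, 300),
]

print(minimal_tiling_path(LIBRARY, 0, 300))
```

Only the clones on the path need to be subcloned and sequenced, which is where the "fewer sequencing runs" saving comes from.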

Shotgun sequencing
• Produce sequences from random clones irrespective of their physical order along the chromosomes
• Clones can be small insert or large insert because alignment takes into account only the sequence - not properties of the physical clones
• Assemble sequences to produce contigs
• Identify gaps in contiguous sequence and undersequenced areas
• Perform directed sequencing to fill in the gaps
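Identifying gaps and undersequenced areas amounts to computing read depth along the assembly and flagging regions below a chosen minimum. This sketch assumes reads have already been placed as half-open (start, end) intervals; the function names and the example coordinates are illustrative.

```python
# Illustrative sketch; read placements are invented for the example.


def depth_profile(reads, genome_len):
    """Per-position read depth for reads given as half-open (start, end) intervals."""
    depth = [0] * genome_len
    for start, end in reads:
        for pos in range(max(start, 0), min(end, genome_len)):
            depth[pos] += 1
    return depth


def low_coverage_regions(depth, min_depth):
    """Half-open intervals where depth falls below min_depth."""
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d < min_depth and start is None:
            start = pos
        elif d >= min_depth and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


reads = [(0, 10), (5, 15), (20, 30)]
depth = depth_profile(reads, genome_len=30)

print("gaps:", low_coverage_regions(depth, min_depth=1))
print("undersequenced:", low_coverage_regions(depth, min_depth=2))
```

Regions with zero depth are true gaps; regions below the chosen depth are candidates for directed sequencing.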

Shotgun sequencing issues
• Assembly is computationally intensive
• Repetitive sequences have to be masked so that they do not confound the preliminary alignment
• First pass alignment based upon non-masked sequences to produce contiguous sequence fragments
• Alignments must account for potential polymorphisms
• Repetitive sequences still need to be aligned - their treatment is however distinct from non-repetitive sequences
• Resolution of conflicts in the assembly is challenging
• When is a genome truly finished?
• The press release is only the beginning of the process

Complementary strategies
• Pure shotgun approaches are likely to leave significant gaps
• Directed sequencing of specific regions is necessary to fill in the gaps
• Pure map-based strategies are cumbersome and time consuming and do not take advantage of efficiencies of scale found in modern industrial sequencing
• A complementary approach combines data from both approaches
• There are adherents to working from the bottom-up and working from the top-down

Genome sequencing projects

What is a gene
• ESTs and cDNAs identify those parts of the genome that are actually transcribed
• Transcripts have structural features including starts, stops and open reading frames
• Computers can be trained to “sniff” for relevant features in the sequence
• Genefinding algorithms construct probability models based on presence of one or more gene-like features
• Coordination with genetic features gives a comfort level because it is empirical
• Computational methods that rely on similarity to “known” genes in databases can be perilous - a sort of regressive uncertainty
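The simplest feature a program can "sniff" for is an open reading frame: a start codon followed in frame by a stop codon. The sketch below scans the forward strand only and uses the standard ATG start and TAA/TAG/TGA stops. Real genefinders combine many such signals in probabilistic models, so this is only a first-pass illustration with hypothetical names.

```python
# Illustrative ORF scan on the forward strand; not a full genefinder.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return half-open (start, end) spans of ORFs, stop codon included.

    An ORF here begins at the first ATG in a frame and ends at the next
    in-frame stop codon; min_codons counts codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


seq = "CCATGAAACCCGGGTAGCC"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

Scanning the reverse complement as well would cover the other strand. Short ORFs arise by chance, which is one reason genefinding algorithms weigh several features rather than relying on one.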

BLAST Example
[Slide links: Sequence, BLAST Example]

How to make a human

The Human Genome Project
http://www.nature.com/genomics/human/index.html
http://www.sciencemag.org/content/vol291/issue5507/

Genome information challenges
• Data integration from sequence, mutant analysis, mapping, expression analysis, metabolic profiling, and other data types will be the primary challenge to biological science in the 21st century
• Informatics tools are in their infancy
• The literature is growing at a rate surpassing sequence data
• Importance of statistics cannot be overstated
• Gene annotation is regressive
• Danger of balkanization of data?
• Is natural language processing the holy grail?
Links: Ensembl, FlyBase, Entrez Genomes, ExPASy, SachDB, KEGG, TAIR
