Genomics and Personalized Care in Health Systems Lecture 10. High Throughput Technologies

Leming Zhou, PhD, School of Health and Rehabilitation Sciences, Department of Health Information Management

Outline • Polymerase Chain Reaction (PCR) • Genome Sequencing • Microarray • Pathway Analysis

Polymerase Chain Reaction (PCR)

Polymerase Chain Reaction (PCR) • A technique that generates a large number of copies of a particular DNA sequence from an extremely small sample • Procedure: • Choose the particular sequence to amplify (the target sequence) • Mix the sample with primers, nucleotides for building new DNA strands, and DNA polymerase • Apply repeated cycles of heating and cooling to the mixture • The number of copies of the target in the mixture grows exponentially with the number of cycles • Primer selection is critical. The primers should be at least 15-20 bases long to ensure specificity. • If you are unsure of the exact sequence, you can use a mixture of primers that vary at the third codon position
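The exponential growth described above can be sketched in a few lines of Python. This is an illustrative model only: the function name `pcr_copies` and the `efficiency` parameter are assumptions made for this example, and real reactions plateau once primers and nucleotides run out.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Expected number of target copies after `cycles` PCR cycles.

    `efficiency` is the fraction of templates copied in each cycle
    (1.0 means ideal doubling). It is a hypothetical parameter used
    for illustration, not a value from the lecture.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# With ideal doubling, one template yields 2**n copies after n cycles.
for n in (10, 20, 30):
    print(n, pcr_copies(1, n))
```

With ideal doubling, 30 cycles turn a single template into roughly a billion copies, which is why an extremely small sample is enough.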

PCR • Double-stranded DNA target with primers • Primers are complementary to opposite ends of the target sequence

PCR

PCR Applications • Making a lot of protein • Use RT-PCR (“reverse transcriptase” PCR) to create DNA with introns removed, then insert it into bacteria to clone the gene, e.g. to make proteins for X-ray crystallography • Medical diagnosis • Detect HIV viral RNA long before AIDS symptoms arise • Rapid tuberculosis test • Forensics • Detect trace amounts of DNA at a crime scene • …

Genome Sequencing

DNA Sequencing • The process of determining the order of the nucleotide bases along a DNA strand • In 1977 two separate methods for sequencing DNA were developed: • Chain termination method (Sanger et al.) • Chemical degradation method (Maxam and Gilbert) • Both methods were equally popular to begin with but, for many reasons, the chain termination method later became the more commonly used • The chain termination method is based on the principle that single-stranded DNA molecules that differ in length by just a single nucleotide can be separated from one another using polyacrylamide gel electrophoresis

Chain Termination Method • Idea: If we know the distance of each type of base from a known origin, then it is possible to deduce the sequence of the DNA. • For example, if we knew that there was an: • A at positions 2, 3, 11, 13, ... • G at positions 1, 12, ... • C at positions 6, 7, 8, 10, 15, ... • T at positions 4, 5, 9, 14, ... • then we can reconstruct the sequence • Obtaining this information is conceptually simple. The idea is to cause termination of a growing DNA chain at a known base (A, G, C or T) and at a known location in the DNA • In practice, chain termination is caused by including a small amount of a single dideoxynucleotide base in the mixture of all four normal bases (e.g. dATP, dTTP, dCTP, dGTP and ddATP). The small amount of ddATP causes chain termination wherever it is incorporated into the DNA.
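The reconstruction step in the example above can be sketched in Python. The function name `reconstruct_sequence` is made up for this illustration; the positions are the ones listed on the slide (1-based, first 15 bases only).

```python
def reconstruct_sequence(positions_by_base):
    """Rebuild a sequence from the 1-based positions observed for each base."""
    length = max(p for positions in positions_by_base.values() for p in positions)
    sequence = ["?"] * length  # "?" marks a position no base has claimed yet
    for base, positions in positions_by_base.items():
        for p in positions:
            if sequence[p - 1] != "?":
                raise ValueError(f"position {p} is claimed by two bases")
            sequence[p - 1] = base
    return "".join(sequence)


# Positions taken from the slide's example.
observed = {
    "A": [2, 3, 11, 13],
    "G": [1, 12],
    "C": [6, 7, 8, 10, 15],
    "T": [4, 5, 9, 14],
}
print(reconstruct_sequence(observed))  # GAATTCCCTCAGATC
```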

Automatic DNA sequencing

Whole Genome Shotgun Sequencing

Metrics for Evaluating Sequencing Methods • Throughput • Number of high quality bases per unit time • Difficulty of sample preparation • Number of independent samples run in parallel - multiplexing • Yield • Number of useful reads per sample • Read length • Cost • Per run and per base; Equipment; Reagents; Infrastructure; Labor; Analysis • The goal of all new sequencing technologies is to increase throughput and yield while reducing cost

Sanger Sequencing • Radiolabeled dideoxyNTPs • 800 bp reads • Low throughput (several kb/gel)

Next Generation Sequencing • Increasing sequencing production • Massive parallelization • Reduction in per-base cost • Eliminate need for huge infrastructure • Millions of reads (>1Gb sequence per run) • Technologies • 454 • SOLiD • Illumina • … • Challenges • Read length • Quality • Data analysis

454 • Throughput & Yield • 1 million 400 bp reads per 10-hour run • >8 samples/run (more with barcoding) • Cost • Machine: $500k; reagents ~$8k/run • Issues • High indel rate in homopolymers • Longer reads, but fewer of them, than other systems

Other Short Read Technologies • Illumina • Sequencing by synthesis • 100 million 36-75 bp reads/run • $6500 in reagent cost/run • 3-6 day run time • SOLiD • Sequencing by ligation • ~400 million 35-50 bp reads/run • ~$5000 in reagent cost/run • 3-6 day run time • Helicos • Sequencing by synthesis • No amplification • 750 million reads/run • $18k run cost • 8 day run time
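To compare platforms on the throughput and cost metrics, the per-run figures above can be converted into total bases and reagent cost per megabase. The sketch below uses the read counts and reagent costs quoted on the slide; the helper names are assumptions for this example, and the upper end of each read-length range is used, so the results are best-case estimates.

```python
def bases_per_run(reads, read_length_bp):
    """Total bases produced in one run."""
    return reads * read_length_bp


def cost_per_megabase(reagent_cost_usd, reads, read_length_bp):
    """Reagent cost in US dollars per million bases sequenced."""
    megabases = bases_per_run(reads, read_length_bp) / 1_000_000
    return reagent_cost_usd / megabases


# (reads per run, longest read length in bp, reagent cost per run in USD)
platforms = {
    "Illumina": (100_000_000, 75, 6500),
    "SOLiD": (400_000_000, 50, 5000),
}
for name, (reads, length, cost) in platforms.items():
    print(name, bases_per_run(reads, length), cost_per_megabase(cost, reads, length))
```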

Third-Generation Sequencing • Extremely high-throughput sequencing at very low cost • Pacific Biosciences • Sequence in real time with fluorescent NTPs • Rate limited by processivity of polymerase • Very long reads (>10 kb) • Not well parallelized (few reads) • Nanopore sequencing • Sequencing by exonuclease cleavage of native DNA • Bases are read as they pass through a modified nanopore • base-specific change in current

Genome Sequencing Videos • Wash U Genome Center • http://www.nslc.wustl.edu/elgin/genomics/gscmaterials.html • Sanger Technology Tour Videos • Next Generation Technology Tour Videos http://gep.wustl.edu/curriculum/course_materials_WU/introduction_to_genomics/nextgen_video_tour • Other videos • PCR: http://www.youtube.com/watch?v=eEcy9k_KsDI • Sanger: http://www.youtube.com/watch?v=aPN8LP4YxPo • SOLiD: http://www.youtube.com/watch?v=nlvyF8bFDwM • Solexa: http://www.youtube.com/watch?v=77r5p8IBwJk • Helicos: http://www.youtube.com/watch?v=TboL7wODBj4

DNA Sequence Assembly

Outline • Basic concepts in sequence assembly • whole-genome shotgun methods • Sources of error in assemblies • Repeats • Polymorphism • Sequencing errors • Alignment and assembly of next-generation sequencing data • Tiling reads onto reference vs. de novo assemblies

Whole Genome Assembly • Multiple copies of the genome are broken into pieces • Both ends of every piece are read. • The length (and orientation) of each piece forms a constraint. • Reads: 500-1000 bp • Quality array for each position. • Reconstruct the genome from the reads and constraints. • Issues: both ends of a read are usually low quality, chimeric reads, repetitive regions.

DNA Sequencing Data Set • Millions of reads, some of them low quality • Millions of constraints, such as paired ends and quality values • After repeats are removed, two reads are merged if their overlap is large enough • A contig is an ordered and oriented list of overlapping reads. • A scaffold is an ordered and oriented list of contigs.

Scaffolds

Sequence Assembly: Basic Approach • Generate reads • Find overlapping reads • Assemble reads into contigs • Join contigs into scaffolds using mate pairs • Join scaffolds into “finished” sequence
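The "find overlapping reads" and "assemble reads into contigs" steps can be illustrated with a toy greedy assembler in Python. This is a simplified sketch, not how any production assembler works: it assumes error-free reads from a single strand, ignores repeats, quality values and mate pairs, and the function names and the `min_overlap` threshold are choices made for this example.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining pair overlaps enough
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


reads = ["GAATTCCC", "TCCCTCAG", "TCAGATC"]
print(greedy_assemble(reads))  # ['GAATTCCCTCAGATC']
```

Reads that never overlap by at least `min_overlap` bases remain as separate contigs, which is the situation that mate-pair scaffolding is meant to resolve.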

Alignment and Assembly with Short Reads • Map to reference genome • Many tools • De novo assembly • Much harder • Reference-guided assembly (MOSAIK) • “True” de novo assembly (Velvet)

Many DNA Assembly Systems • PHRAP • CAP • Euler • Celera Assembler • Arachne • LSA

Microarray Technique

Microarrays • Used to study gene expression levels in cells. • Cells can differ dramatically in the amounts of various proteins that they synthesize; e.g. due to different cell types or different external/internal conditions. • In fact, in higher level organisms only a fraction of the genes in a cell are expressed at a given time, and that subset depends on the cell type. • Via microarrays it is possible to study the expression levels of tens of thousands of genes simultaneously.

Microarray Technology • A microarray is a glass slide with spots of DNA on it; each spot is a probe (or target). Thousands of probes can fit on a single slide. The slides can be spotted by robots. • The DNA is single-stranded cDNA and may consist of an entire gene or part of one • If the microarray is exposed to a solution containing mRNA, then the mRNA molecules will bind to those probes to which they are complementary • Which genes you can study with a microarray depends on the collection of probes on it • There are a number of commercial manufacturers, e.g. Affymetrix

Microarray Probes Single-stranded cDNA sequences

Microarray Experiments • Start with two cell types, e.g. “healthy” and “diseased”. • Isolate mRNA from each cell type, generate cDNA with fluorescent dyes attached, e.g. green for healthy and red for diseased. • Mix the cDNA samples and incubate with the microarray. • After incubation the cDNA in the samples has had a chance to bind (hybridize) with the probes on the chip. • The chip is read by a scanner that uses lasers to excite the fluorescent tags; the intensity levels of the dyes are recorded for each probe gene and stored in a computer.

The Colors of a Microarray • Green: control DNA, where either DNA or cDNA derived from normal tissue is hybridized to the target DNA • Red: sample DNA, where either DNA or cDNA derived from diseased tissue is hybridized to the target DNA • Yellow: a combination of control and sample DNA, where both hybridized equally to the target DNA • Black: areas where neither the control nor the sample DNA hybridized to the target DNA • The location and intensity of a color can tell us whether the gene, or mutation, is present in the control DNA, the sample DNA, or both • It may also provide an estimate of the expression level of the gene(s) in the sample and control DNA
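The color interpretation above can be sketched as a simple rule on the red and green intensities of a spot. The `background` and `fold` thresholds below are arbitrary illustrative values, not settings from the lecture or from any scanner software, and the function name `spot_call` is made up for this example.

```python
import math


def spot_call(red, green, background=100.0, fold=2.0):
    """Classify one spot from its red (sample) and green (control) intensity.

    `background` and `fold` are hypothetical thresholds chosen for illustration.
    """
    if red < background and green < background:
        return "black"  # neither sample nor control hybridized
    # log2 ratio: positive means more sample signal, negative more control signal
    log_ratio = math.log2(max(red, 1.0) / max(green, 1.0))
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return "red"
    if log_ratio <= -threshold:
        return "green"
    return "yellow"  # roughly equal hybridization


for red, green in [(5000, 500), (400, 4000), (3000, 2800), (20, 30)]:
    print(red, green, spot_call(red, green))
```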

Microarray Data Representation • Microarray data is often arranged in an n x m matrix M with rows for n genes and columns for m biological samples in which gene expression has been monitored. • m_ij is the expression level of gene i in sample j. • A row e_i is the gene expression pattern of gene i over all the samples. • A column s_j gives the expression levels of all genes in sample j and is called the sample expression pattern
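As a small illustration of this layout, the matrix can be held as a list of rows in Python. The gene names, sample names and expression values below are invented for the example, and the helper names are assumptions.

```python
genes = ["gene_1", "gene_2", "gene_3"]   # n = 3 rows (hypothetical names)
samples = ["sample_1", "sample_2"]       # m = 2 columns (hypothetical names)

# M[i][j] is the expression level of gene i in sample j (made-up values).
M = [
    [2.1, 0.4],
    [1.0, 1.1],
    [0.2, 3.5],
]


def gene_pattern(matrix, i):
    """Row e_i: expression of gene i across all samples."""
    return list(matrix[i])


def sample_pattern(matrix, j):
    """Column s_j: expression of all genes in sample j."""
    return [row[j] for row in matrix]


print(gene_pattern(M, 0))
print(sample_pattern(M, 1))
```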

Microarray Data Analysis • Gene chips allow the simultaneous monitoring of the expression levels of thousands of genes. Many statistical and computational methods are used to analyze these data • Statistical hypothesis tests for differential expression analysis • Principal component analysis and other methods for visualizing high-dimensional microarray data • Cluster analysis for grouping together genes or samples with similar expression patterns • Different clustering algorithms may be used, e.g. hierarchical with different metrics, or k-means, k-medians. • Hidden Markov models, neural networks and other classifiers for predictively classifying sample expression patterns as one of several types (diseased vs. normal)

What Do We Use Microarray Data For? • Genes with unusual expression levels in a sample • Genes whose expression levels vary across samples • This can be used to compare normal and diseased tissues, or diseased tissue before and after treatment. • Samples that have similar expression patterns • This can also be used to compare normal and diseased tissues, or diseased tissue before and after treatment. • Tissues that might be diseased • We can take the gene expression pattern of a sample and compare it to library expression patterns that indicate diseased or non-diseased tissue.

Statistical Methods Can Help • Data Pre-processing • Normalization: rescaling data from different microarrays so that they can be compared • Centering: subtracting the mean and dividing by the standard deviation • Data Visualization • Principal component analysis and multidimensional scaling are two useful techniques for reducing multidimensional data to two or three dimensions, which allows us to visualize it. • Cluster Analysis • By associating genes with similar expression patterns, we might be able to draw conclusions about their function. • Statistical Inference • This is the formulation and statistical testing of a null hypothesis against an alternative hypothesis. • Classifiers for the Data • We can construct classes from data, such as diseased vs. non-diseased tissue. We can build a model that fits known data for the different classes. This can then be used to classify previously unclassified data.
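The centering step in the pre-processing bullet can be sketched as a per-gene z-score: subtract the gene's mean and divide by its standard deviation. Using the sample standard deviation and returning zeros for a constant gene are choices made for this example, and the function names are assumptions.

```python
import statistics


def standardize(values):
    """Center one expression pattern and scale it to unit standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    if sd == 0.0:
        # A constant pattern carries no variation; return zeros (a design choice).
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def standardize_genes(matrix):
    """Apply `standardize` to every row (gene) of an expression matrix."""
    return [standardize(row) for row in matrix]


print(standardize([1.0, 2.0, 3.0]))
```

After this step every gene has mean 0 and standard deviation 1, so genes with very different absolute expression levels can be compared on the shape of their patterns.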

Measuring Dissimilarity of Expression Data • We might want to compare two or more gene or sample expression patterns • This might be used to differentiate between diseased and normal cells or to find out the genetic similarity of tissues • To do this we need a distance metric or a dissimilarity measure

Example Distance Metric • Euclidean distance: the most common distance measure • It should not be used if either • not all components of the vectors being compared have equal weight, or • there is missing data • Preprocessing the data can often alleviate these problems • We can also use the normalized Euclidean distance
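A minimal Python sketch of these two measures follows. The slide does not define the normalized Euclidean distance, so the version here is an assumption: each squared difference is divided by the variance of that component, which is one common definition.

```python
import math


def euclidean(x, y):
    """Ordinary Euclidean distance between two expression patterns."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def normalized_euclidean(x, y, variances):
    """Euclidean distance with each component scaled by its variance.

    `variances` holds one positive variance per component; this definition
    is an assumption made for the example.
    """
    if not (len(x) == len(y) == len(variances)):
        raise ValueError("patterns and variances must have the same length")
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```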

Cluster Analysis of Microarray Data • Hierarchical clustering • Start with each data point in its own singleton cluster • Find the two clusters that are closest together and combine them into a new cluster • Compute the distance from every other cluster to the new cluster using some form of averaging • Find the two closest clusters again and repeat • K-means clustering • Partitions the data into k clusters and finds the mean of each cluster • Usually the number of clusters k is fixed in advance, so something must be known about the data to choose it • There might be a range of possible k values; to decide which is best, optimize a quantity that measures cluster tightness, i.e. minimize the distances between points within a cluster
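The hierarchical procedure can be sketched in plain Python. This version uses average linkage with Euclidean distance, which is one possible reading of "some form of averaging"; the function names and the `num_clusters` stopping rule are assumptions made for the example, and the brute-force search is only practical for small data sets.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(c1, c2, points):
    """Mean distance over all pairs of points drawn from two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clustering(points, num_clusters=1):
    """Agglomerative clustering; returns final clusters and the merge history.

    Clusters are lists of indices into `points`.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > num_clusters:
        # Find the two closest clusters under average linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
clusters, merges = hierarchical_clustering(points, num_clusters=2)
print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3]]
```

Running with `num_clusters=1` records the full merge history, which is the information a dendrogram displays.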

Challenges in Microarray Analysis • Different platforms • Illumina, Affymetrix, Agilent, … • Many file types, many data formats • Need to learn the platform-dependent methods and software required • Analysis • How to get started? • Which methods? Which software? • Many freely available tools, some commercial • How to interpret results

Public Databases • Many sources for public data: labs, consortia, government • Publications require that data files, including raw files, be made public • GEO • http://www.ncbi.nlm.nih.gov/geo/ • ArrayExpress • http://www.ebi.ac.uk/arrayexpress/#ae-main[0]

Data Analysis • Class discovery • Class comparison • Class prediction • Biological annotation • Pathway analysis

Hierarchical Clustering • Eisen Cluster and Treeview • http://rana.lbl.gov/EisenSoftware.htm • Import data • Filter • Filter or not to filter, %P calls, SD etc • Adjust data • Log transform, center, normalize • Clustering • Cluster array or genes • Computationally intensive • Choose distance metric • .cdt file created • Open with Treeview

Cluster from Microarray Data

Experimental Design • Sample size • How many samples in test and control • Replicates • Technical vs. biological • Biological replicates are more important for more heterogeneous samples • Need replicates for statistical analysis • All experimental steps from sample acquisition to hybridization • Microarray experiments are very expensive, so plan experiments carefully

Video on YouTube • DNA Microarray • http://www.youtube.com/watch?v=VNsThMNjKhM&

Pathway Analysis

KEGG • Kyoto Encyclopedia of Genes and Genomes (KEGG) http://www.genome.jp/kegg/pathway.html

Biological Pathways http://www.sabiosciences.com/
