Genomics and Personalized Care in Health Systems Lecture 10. High Throughput Technologies

Genomics and Personalized Care in Health Systems Lecture 10. High Throughput Technologies. Leming Zhou, PhD School of Health and Rehabilitation Sciences Department of Health Information Management. Outline. Polymerase Chain Reaction (PCR) Genome Sequencing Microarray Pathway Analysis.


Presentation Transcript


  1. Genomics and Personalized Care in Health Systems Lecture 10. High Throughput Technologies Leming Zhou, PhD School of Health and Rehabilitation Sciences Department of Health Information Management

  2. Outline • Polymerase Chain Reaction (PCR) • Genome Sequencing • Microarray • Pathway Analysis

  3. Polymerase Chain Reaction (PCR)

  4. Polymerase Chain Reaction (PCR) • A technique that allows us to generate a large number of copies of a particular DNA sequence from an extremely small sample • Procedure: • Determine one particular sequence – the target sequence • Mix the sample with primers and nucleotides to build new DNA strands • Apply cycles of heating, cooling, and reheating to the mixture • The number of copies of the target in the mixture grows exponentially with the number of cycles • Primer selection is critical. The primers should be at least 15-20 bases long to ensure specificity. • If you are unsure of the exact sequence, you can use a mixture of primers (varying at the third codon position)
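
The exponential growth described on this slide can be illustrated with a small calculation. This is a minimal sketch, assuming idealized PCR in which every cycle multiplies the copy number by (1 + efficiency); the function name `pcr_copies` and the `efficiency` parameter are illustrative and not part of the lecture.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected number of target copies after a number of PCR cycles.

    Assumption: each cycle multiplies the copy number by (1 + efficiency),
    so efficiency = 1.0 means perfect doubling in every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # A single template molecule under ideal doubling.
    for n in (10, 20, 30):
        print(n, "cycles:", pcr_copies(1, n), "copies")
```

Under ideal doubling, 30 cycles turn one template molecule into about a billion copies, which is why an extremely small sample is enough.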

  5. PCR Double-stranded DNA target primers Primers are complementary to opposite ends of target seq.

  6. PCR

  7. PCR Applications • Making a lot of protein • Use RT-PCR, “reverse transcriptase” PCR, to create DNA with introns removed and then insert it into bacteria to clone the gene, e.g. to make proteins for X-ray crystallography. • Medical diagnosis • Detect HIV nucleic acid long before AIDS symptoms arise • Rapid tuberculosis test • Forensics • Detect trace amounts of DNA at a crime scene • …

  8. Genome Sequencing

  9. DNA Sequencing • The process of determining the order of the nucleotide bases along a DNA strand • In 1977 two separate methods for sequencing DNA were developed: • Chain termination method (Sanger et al.) • Chemical degradation method (Maxam and Gilbert) • Both methods were equally popular to begin with, but, for many reasons, the chain termination method later became the more commonly used one • The chain termination method is based on the principle that single-stranded DNA molecules that differ in length by just a single nucleotide can be separated from one another using polyacrylamide gel electrophoresis

  10. Chain Termination Method • Idea: If we know the distance of each type of base from a known origin, then it is possible to deduce the sequence of the DNA. • For example, if we knew that there was an: • A at positions 2, 3, 11, 13 ... • G at positions 1, 12, ... • C at positions 6, 7, 8, 10, 15 ... • T at positions 4, 5, 9, 14 ... • then we can reconstruct the sequence • Obtaining this information is conceptually simple. The idea is to cause a termination of a growing DNA chain at a known base (A, G, C or T) and at a known location in the DNA • In practice, chain termination is caused by the inclusion of a small amount of a single dideoxynucleotide base in the mixture of all four normal bases (e.g. dATP, dTTP, dCTP, dGTP and ddATP). The small amount of ddATP would cause chain termination whenever it is incorporated into the DNA.
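
The reconstruction step in the example above can be sketched in a few lines. This is only an illustration of the idea, using the positions listed on the slide (the first 15 bases); the function name `reconstruct_sequence` is hypothetical.

```python
def reconstruct_sequence(positions_by_base):
    """Rebuild a sequence from the 1-based positions known for each base.

    Positions with no base assigned are reported as 'N'.
    """
    length = max(p for positions in positions_by_base.values() for p in positions)
    sequence = ["N"] * length
    for base, positions in positions_by_base.items():
        for p in positions:
            if sequence[p - 1] != "N":
                raise ValueError(f"position {p} assigned to two bases")
            sequence[p - 1] = base
    return "".join(sequence)


# Positions taken from the slide's example.
slide_example = {
    "A": [2, 3, 11, 13],
    "G": [1, 12],
    "C": [6, 7, 8, 10, 15],
    "T": [4, 5, 9, 14],
}

if __name__ == "__main__":
    print(reconstruct_sequence(slide_example))  # GAATTCCCTCAGATC
```

In the laboratory the positions come from fragment lengths on the gel: each of the four termination reactions yields the lengths at which one base occurs.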

  11. Automatic DNA sequencing

  12. Whole Genome Shotgun Sequencing

  13. Metrics for Evaluating Sequencing Methods • Throughput • Number of high quality bases per unit time • Difficulty of sample preparation • Number of independent samples run in parallel - multiplexing • Yield • Number of useful reads per sample • Read length • Cost • Per run and per base; Equipment; Reagents; Infrastructure; Labor; Analysis • The goal of all new sequencing technologies is to increase throughput and yield while reducing cost

  14. Sanger Sequencing • Radiolabeled dideoxyNTPs • 800 bp reads • Low throughput (several kb/gel)

  15. Next Generation Sequencing • Increasing sequencing production • Massive parallelization • Reduction in per-base cost • Eliminate need for huge infrastructure • Millions of reads (>1Gb sequence per run) • Technologies • 454 • SOLiD • Illumina • … • Challenges • Read length • Quality • Data analysis

  16. 454 • Throughput & Yield • 1 million 400 bp reads/10 hour run • >8 samples/run (more with barcoding) • Cost • Machine: $500k; reagents ~$8k/run • Issues • High indel rate in homopolymers • Longer reads, but fewer of them than other systems
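
The throughput and yield metrics defined earlier can be computed directly from the figures on this slide. A minimal sketch, assuming every read has the nominal length and ignoring quality filtering; the function names are illustrative.

```python
def run_yield_bases(reads: int, read_length_bp: int) -> int:
    """Total bases produced by one run, assuming uniform read length."""
    return reads * read_length_bp


def throughput_bases_per_hour(reads: int, read_length_bp: int, run_hours: float) -> float:
    """Bases per hour of run time (sample preparation time is not counted)."""
    return run_yield_bases(reads, read_length_bp) / run_hours


if __name__ == "__main__":
    # 454 figures from the slide: 1 million reads of 400 bp in a 10 hour run.
    print(run_yield_bases(1_000_000, 400))                # 400000000
    print(throughput_bases_per_hour(1_000_000, 400, 10))  # 40000000.0
```

So one 454 run yields about 400 Mb, or roughly 40 Mb per hour of instrument time.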

  17. Other Short Read Technologies • Illumina • Sequencing by synthesis • 100 million 36-75 bp reads/run • $6500 in reagent cost/run • 3-6 day run time • SOLiD • Sequencing by ligation • ~400 million 35-50 bp reads/run • ~$5000 in reagent cost/run • 3-6 day run time • Helicos • Sequencing by synthesis • No amplification • 750 million reads/run • $18k run cost • 8 day run time

  18. Third-Generation Sequencing • Extremely high-throughput sequencing at very low cost • Pacific Biosciences • Sequence in real time with fluorescent NTPs • Rate limited by processivity of polymerase • Very long reads (>10 kb) • Not well parallelized (few reads) • Nanopore sequencing • Sequencing by exonuclease cleavage of native DNA • Bases are read as they pass through a modified nanopore • base-specific change in current

  19. Genome Sequencing Videos • Wash U Genome Center • http://www.nslc.wustl.edu/elgin/genomics/gscmaterials.html • Sanger Technology Tour Videos • Next Generation Technology Tour Videos http://gep.wustl.edu/curriculum/course_materials_WU/introduction_to_genomics/nextgen_video_tour • Other videos • PCR: http://www.youtube.com/watch?v=eEcy9k_KsDI • Sanger: http://www.youtube.com/watch?v=aPN8LP4YxPo • SOLiD: http://www.youtube.com/watch?v=nlvyF8bFDwM • Solexa: http://www.youtube.com/watch?v=77r5p8IBwJk • Helicos: http://www.youtube.com/watch?v=TboL7wODBj4

  20. DNA Sequence Assembly

  21. Outline • Basic concepts in sequence assembly • whole-genome shotgun methods • Sources of error in assemblies • Repeats • Polymorphism • Sequencing errors • Alignment and assembly of next-generation sequencing data • Tiling reads onto reference vs. de novo assemblies

  22. Whole Genome Assembly • Multiple copies of the genome are broken into pieces • Both ends of every piece are read. • Length (and orientation) of each piece form constraints. • Reads: 500-1000 bp • Quality array for each position. • Reconstruct genome from reads and constraints. • Issues: both ends of a read usually low quality, chimeric reads, repetitive regions.

  23. DNA Sequencing Data Set • Millions of reads, some of which are low quality • Millions of constraints, such as paired ends and quality values • After removing repeats, if two reads overlap by a sufficient length, merge them • A contig is an ordered and oriented list of overlapping reads. • A scaffold is an ordered and oriented list of contigs.

  24. Scaffolds

  25. Sequence Assembly: Basic Approach Generate reads Find overlapping reads Assemble reads into contigs Join contigs into scaffolds using mate pairs Join scaffolds into “finished” sequence
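
The "find overlapping reads" and "assemble reads into contigs" steps can be illustrated with a toy greedy merger. This is a simplified sketch, not how any of the assemblers named in this lecture work: it assumes error-free reads from one strand and ignores repeats, quality values and mate-pair constraints. The function names and the `min_len` parameter are illustrative.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    reads = ["GAATTCCC", "TCCCTCAG", "TCAGATC"]
    print(greedy_assemble(reads))  # ['GAATTCCCTCAGATC']
```

Sequences that share no sufficient overlap remain as separate contigs, which is where mate-pair information is needed to order them into scaffolds.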

  26. Alignment and Assembly with Short Reads • Map to reference genome • Many tools • De novo assembly • Much harder • Reference-guided assembly (MOSAIK) • “True” de novo assembly (Velvet)

  27. Many DNA Assembly Systems • PHRAP • CAP • Euler • Celera Assembler • Arachne • LSA

  28. Microarray Technique

  29. Microarrays • Used to study gene expression levels in cells. • Cells can differ dramatically in the amounts of various proteins that they synthesize; e.g. due to different cell types or different external/internal conditions. • In fact, in higher level organisms only a fraction of the genes in a cell are expressed at a given time, and that subset depends on the cell type. • Via microarrays it is possible to study the expression levels of tens of thousands of genes simultaneously.

  30. Microarray Technology • A microarray is a glass slide with spots of DNA on it; each spot is a probe (or target). Thousands of probes can fit on a single slide. The slides can be spotted by robots. • The DNA is single-stranded cDNA and may consist of an entire gene or part of one • If the microarray is exposed to a solution containing mRNA, then the mRNA molecules will bind to those probes to which they are complementary • The genes you can study with a microarray depend on the collection of probes on it. • There are a number of commercial manufacturers, e.g. Affymetrix

  31. Microarray Probes Single-stranded cDNA sequences

  32. Microarray Experiments • Start with two cell types, e.g. “healthy” and “diseased”. • Isolate mRNA from each cell type, generate cDNA with fluorescent dyes attached, e.g. green for healthy and red for diseased. • Mix the cDNA samples and incubate with the microarray. • After incubation the cDNA in the samples has had a chance to bind (hybridize) with the probes on the chip. • The chip is read by a scanner that uses lasers to excite the fluorescent tags; the intensity levels of the dyes are recorded for each probe gene and stored in a computer.

  33. The Colors of a Microarray • Green: control DNA, where either DNA or cDNA derived from normal tissue is hybridized to the target DNA • Red: sample DNA, where either DNA or cDNA derived from diseased tissue is hybridized to the target DNA • Yellow: a combination of control and sample DNA, where both hybridized equally to the target DNA • Black: areas where neither the control nor the sample DNA hybridized to the target DNA • The location and intensity of a color can tell us whether the gene, or mutation, is present in the control DNA, the sample DNA, or both • It may also provide an estimate of the expression level of the gene(s) in the sample and control DNA
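
The color interpretation above can be expressed as a rule on the two measured intensities. This is a minimal sketch under stated assumptions: the background cutoff of 100 intensity units and the two-fold threshold are illustrative defaults chosen for this example, not values from the lecture, and the function name `spot_color` is hypothetical.

```python
import math


def spot_color(green: float, red: float,
               background: float = 100.0, min_fold: float = 2.0) -> str:
    """Classify one spot from its green (control) and red (sample) intensities.

    Assumptions: intensities at or below `background` count as no
    hybridization, and a difference of at least `min_fold` counts as unequal.
    """
    if green <= background and red <= background:
        return "black"  # neither control nor sample hybridized
    # log2 ratio of sample to control; max() guards against division by zero
    log_ratio = math.log2(max(red, 1.0) / max(green, 1.0))
    threshold = math.log2(min_fold)
    if log_ratio >= threshold:
        return "red"    # stronger signal from the diseased sample
    if log_ratio <= -threshold:
        return "green"  # stronger signal from the normal control
    return "yellow"     # both hybridized about equally


if __name__ == "__main__":
    for g, r in [(50, 50), (5000, 5000), (200, 4000), (4000, 200)]:
        print(g, r, spot_color(g, r))
```

Working with the log ratio treats a two-fold increase and a two-fold decrease symmetrically, which is why it is a common choice for two-color data.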

  34. Microarray Data Representation • Microarray data is often arranged in an n x m matrix M with rows for n genes and columns for m biological samples in which gene expression has been monitored. • m_ij is the expression level of gene i in sample j. • A row e_i is the gene expression pattern of gene i over all the samples. • A column s_j is the expression level of all genes in a sample j and is called the sample expression pattern
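
The matrix layout described above might look as follows in code. The gene names, sample names and expression values are made up purely for illustration, and the helper names are hypothetical.

```python
# Rows are genes, columns are samples (values are invented for illustration).
genes = ["geneA", "geneB", "geneC"]
samples = ["sample1", "sample2"]
M = [
    [2.1, 8.4],  # geneA
    [5.0, 5.2],  # geneB
    [9.3, 1.7],  # geneC
]


def gene_expression_pattern(matrix, i):
    """Row i: expression of gene i across all samples."""
    return list(matrix[i])


def sample_expression_pattern(matrix, j):
    """Column j: expression of all genes in sample j."""
    return [row[j] for row in matrix]


if __name__ == "__main__":
    print(genes[1], gene_expression_pattern(M, 1))        # geneB [5.0, 5.2]
    print(samples[0], sample_expression_pattern(M, 0))    # sample1 [2.1, 5.0, 9.3]
```

Note that the code uses 0-based indices, while the slide's i and j are conventionally 1-based.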

  35. Microarray Data Analysis • Gene chips allow the simultaneous monitoring of the expression level of thousands of genes. Many statistical and computational methods are used to analyze this data • Statistical hypothesis tests for differential expression analysis • Principal component analysis and other methods for visualizing high-dimensional microarray data • Cluster analysis for grouping together genes or samples with similar expression patterns • Different clustering algorithms may be used, e.g. hierarchical with different metrics, or k-means, k-medians. • Hidden Markov models, neural networks and other classifiers for predictively classifying sample expression patterns as one of several types (diseased vs. normal)
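
As one example of a hypothesis test for differential expression, the sketch below computes a Welch t statistic for a single gene measured in two groups of samples. The slide does not name a specific test, so the choice of Welch's statistic is an assumption made for this example; it returns only the test statistic, not a p-value, and the function name `welch_t` is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch t statistic comparing the mean expression of two groups.

    Each group needs at least two values, and the groups must not both
    have zero variance.
    """
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


if __name__ == "__main__":
    # Invented expression values for one gene in diseased and normal samples.
    diseased = [2.0, 3.0, 4.0]
    normal = [1.0, 2.0, 3.0]
    print(welch_t(diseased, normal))
```

With thousands of genes tested at once, the resulting p-values would also need a multiple-testing correction before any gene is called differentially expressed.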

  36. For What Do We Use Microarray Data • Genes with unusual expression levels in a sample • Genes whose expression levels vary across samples • This can be used to compare normal and diseased tissues or diseased tissue before and after treatment. • Samples that have similar expression patterns • This can also be used to compare normal and diseased tissues or diseased tissue before and after treatment. • Tissues that might be diseased • We can take the gene expression pattern of a sample and compare it to library expression patterns that indicate diseased or not diseased tissue.

  37. Statistical Methods Can Help • Data Pre-processing • Normalization: rescaling data from different microarrays so that they can be compared • Center: subtracting the mean and dividing by the standard deviation. • Data Visualization • Principal component analysis and multidimensional scaling are two useful techniques for reducing multidimensional data to two or three dimensions. This allows us to visualize it. • Cluster Analysis • By associating genes with similar expression patterns, we might be able to draw conclusions about their functional expression. • Statistical Inference • This is the formulation and statistical testing of a hypothesis and alternative hypothesis. • Classifiers for the Data • We can construct classes from data, such as diseased vs. non-diseased tissue. We can build a model that fits known data for the different classes. This can then be used to classify previously unclassified data.
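
The centering step under data pre-processing can be sketched as follows. A common form subtracts the mean and divides by the standard deviation (a z-score), and that is the form assumed here; the function name `standardize` is illustrative.

```python
from statistics import mean, stdev


def standardize(values):
    """Center values at 0 and scale them to unit sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("all values are identical; cannot scale")
    return [(v - m) / s for v in values]


if __name__ == "__main__":
    # Invented expression values for one gene across three arrays.
    print(standardize([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```

After this step, expression patterns measured on different scales can be compared directly.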

  38. Measuring Dissimilarity of Expression Data • We might want to compare two or more gene or sample expression patterns • This might be used to differentiate between diseased and normal cells or to find out the genetic similarity of tissues. • To do this we need a distance metric or a dissimilarity measure.

  39. Example Distance Metric • Euclidean distance: this is the most common distance measure. • It should not be used if either: • Not all components of the vectors being compared have equal weight. • There is missing data. • Preprocessing the data can often alleviate these problems. • We can also use the normalized Euclidean distance
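
Both distances mentioned on this slide can be written compactly. The slide does not define the normalized Euclidean distance, so the version below, which divides each squared difference by the variance of that component, is an assumed definition; the function names are illustrative.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two expression patterns of equal length."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def normalized_euclidean(x, y, variances):
    """Euclidean distance with each component scaled by its variance.

    Assumed definition: sqrt(sum((x_k - y_k)^2 / var_k)).
    """
    if not len(x) == len(y) == len(variances):
        raise ValueError("inputs must have the same length")
    return sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))                      # 5.0
    print(normalized_euclidean([0, 0], [3, 4], [9, 16]))  # sqrt(2)
```

Scaling by the variance keeps a component with a large numeric range from dominating the distance, which addresses the unequal-weight problem noted above.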

  40. Cluster Analysis of Microarray Data • Hierarchical clustering: Assume each data point is in a singleton cluster. Find the two clusters that are closest together. Combine these to form a new cluster. Compute the distance from all clusters to the new cluster using some form of averaging. Find the two closest clusters and repeat. • K-means clustering: partitions the data into k clusters and finds the mean of each cluster. Usually, the number of clusters k is fixed in advance. To choose k, something must be known about the data. There might be a range of possible k values. To decide which is best, optimize a quantity that maximizes cluster tightness, i.e. minimizes distances between points in a cluster.
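
The hierarchical procedure described above can be sketched directly. This is a minimal illustration that assumes Euclidean distance and average linkage as the "form of averaging"; it stops when a requested number of clusters remains instead of building the full tree, and the function names are illustrative.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def average_linkage(c1, c2, points):
    """Mean distance between all pairs of points drawn from two clusters."""
    total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    Clusters are returned as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


if __name__ == "__main__":
    # Four invented expression patterns forming two obvious groups.
    patterns = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(agglomerate(patterns, 2))
```

Recomputing all pairwise cluster distances on every pass is simple but slow, which matches the remark elsewhere in the lecture that clustering is computationally intensive.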

  41. Challenges in Microarray Analysis • Different platforms • Illumina, Affymetrix, Agilent… • Many file types, many data formats • Need to learn the platform-dependent methods and software required • Analysis • How to get started? • Which methods? Which software? • Many freely available tools, some commercial • How to interpret results

  42. Public Databases • Many sources for public data – labs, consortia, government • Publications require that data files, including raw files, be made public • GEO • http://www.ncbi.nlm.nih.gov/geo/ • ArrayExpress • http://www.ebi.ac.uk/arrayexpress/#ae-main[0]

  43. Data Analysis • Class discovery • Class comparison • Class prediction • Biological annotation • Pathway analysis

  44. Hierarchical Clustering • Eisen Cluster and Treeview • http://rana.lbl.gov/EisenSoftware.htm • Import data • Filter • Filter or not to filter, %P calls, SD etc • Adjust data • Log transform, center, normalize • Clustering • Cluster array or genes • Computationally intensive • Choose distance metric • .cdt file created • Open with Treeview

  45. Cluster from Microarray Data

  46. Experimental Design • Sample size • How many samples in test and control • Replicates • Technical vs. biological • Biological replicates are more important for more heterogeneous samples • Need replicates for statistical analysis • All experimental steps from sample acquisition to hybridization • Microarray experiments are very expensive, so plan experiments carefully

  47. Video on YouTube • DNA Microarray • http://www.youtube.com/watch?v=VNsThMNjKhM&

  48. Pathway Analysis

  49. KEGG • Kyoto Encyclopedia of Genes and Genomes (KEGG) http://www.genome.jp/kegg/pathway.html

  50. Biological Pathways http://www.sabiosciences.com/
