Carlo Colantuoni carlo@illuminatobiotech

Summer Inst. Of Epidemiology and Biostatistics, 2010: Gene Expression Data Analysis 1:30pm – 5:00pm in Room W2015. Carlo Colantuoni carlo@illuminatobiotech.com. http://www.illuminatobiotech.com/GEA2010/GEA2010.htm. Class Outline. Basic Biology & Gene Expression Analysis Technology

Presentation Transcript


  1. Summer Inst. Of Epidemiology and Biostatistics, 2010: Gene Expression Data Analysis, 1:30pm – 5:00pm in Room W2015. Carlo Colantuoni, carlo@illuminatobiotech.com. http://www.illuminatobiotech.com/GEA2010/GEA2010.htm

  2. Class Outline • Basic Biology & Gene Expression Analysis Technology • Data Preprocessing, Normalization, & QC • Measures of Differential Expression • Multiple Comparison Problem • Clustering and Classification • The R Statistical Language and Bioconductor • GRADES – independent project with Affymetrix data. http://www.illuminatobiotech.com/GEA2010/GEA2010.htm

  3. Class Outline - Detailed
     • Basic Biology & Gene Expression Analysis Technology
       • The Biology of Our Genome & Transcriptome
       • Genome and Transcriptome Structure & Databases
       • Gene Expression & Microarray Technology
     • Data Preprocessing, Normalization, & QC
       • Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
       • Background correction (PM-MM, RMA, GCRMA)
       • Global Mean Normalization
       • Loess Normalization
       • Quantile Normalization (RMA & GCRMA)
       • Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
       • Quality Control: PCA and MDS for dimension reduction
       • SVA: Surrogate Variable Analysis
     • Measures of Differential Expression
       • Basic Statistical Concepts
       • T-tests and Associated Problems
       • Significance Analysis of Microarrays (SAM) [ & Empirical Bayes]
       • Complex ANOVAs (limma package in R)
     • Multiple Comparison Problem
       • Bonferroni
       • False Discovery Rate Analysis (FDR)
     • Differential Expression of Functional Gene Groups
       • Functional Annotation of the Genome
       • Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
       • Gene Set Enrichment Analysis (GSEA)
       • Parametric Analysis of Gene Set Enrichment (PAGE)
       • geneSetTest
     • Notes on Experimental Design
     • Clustering and Classification
       • Hierarchical clustering
       • K-means
       • Classification
       • LDA (PAM), kNN, Random Forests
       • Cross-Validation
     • Additional Topics
       • eQTL (expression + SNPs)
       • Next-Gen Sequencing data: RNAseq, ChIPseq
       • Epigenetics?
     • The R Statistical Language: http://www.r-project.org/
     • Bioconductor: http://www.bioconductor.org/docs/install/
     • Affymetrix data processing example

  4. Questions for you: • Student’s training and experience: • Statistics or Biology • MS or MD or PhD • Student’s goals • Student’s data? • R Statistical Language? • Other programming experience? • Extra topics: Student’s interests

  5. DAY #1: Genome Biology • The Transcriptome • Microarray Technology

  6. The Human Genome • 2 copies of the entire genome in each cell: • 3.3 billion “bases” (3.3 Gb) • ~30K genes • millions of variants • We each get 1 copy from MOM & 1 from DAD. Each parent passes on a “mixed copy” (from their parents). • Each copy of the genome is contained in 23 chromosomes: 22 + X or Y (2 copies = 46 / cell). • All in DNA! [Figure labels: DAD, MOM, YOU]

  7. DNA • A deoxyribonucleic acid (DNA) molecule is a double-stranded polymer composed of four basic molecular units called nucleotides. • Each nucleotide contains a phosphate group, a deoxyribose sugar, and one of four nitrogenous bases: adenine (A), guanine (G), cytosine (C), and thymine (T). • The two chains are held together by hydrogen bonds. • Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T. • Directionality & Complementarity: Reverse complements hybridize.
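
The base-pairing rule on this slide (G with C, A with T, and strands running in opposite directions) can be illustrated with a minimal Python sketch. The function names `reverse_complement` and `can_hybridize` are hypothetical helpers introduced only for illustration, and the sketch assumes perfect, full-length matching, which is a simplification of real hybridization.

```python
# Watson-Crick pairing rule from the slide: G pairs with C, A pairs with T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    # Reverse first (strands are antiparallel), then complement each base.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def can_hybridize(strand_a: str, strand_b: str) -> bool:
    """True if strand_b is the exact reverse complement of strand_a.

    Assumption: perfect full-length matching only; real hybridization
    tolerates some mismatches and depends on temperature and salt.
    """
    return strand_b.upper() == reverse_complement(strand_a)


probe = "ATGCGT"
print(reverse_complement(probe))        # ACGCAT
print(can_hybridize(probe, "ACGCAT"))   # True
print(can_hybridize(probe, probe))      # False
```

The same idea underlies probe design later in the course: a probe is chosen so that it is the reverse complement of the target sequence it is meant to detect.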

  8. How do these molecular interactions influence directionality and complementarity? G-C pairs are “stickier” than A-T pairs (3 vs. 2 H-bonds). A + G = purines (2 rings). T + C + U = pyrimidines (1 ring) (T in DNA, U in RNA).
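
As a rough illustration of why G-C pairs are “stickier”, the sketch below counts the hydrogen bonds a sequence would form with its perfect complement, using the 3 vs. 2 figures from the slide. The helper names `hydrogen_bonds` and `gc_fraction` are hypothetical, and bond count is only a crude proxy for duplex stability, not a melting-temperature model.

```python
# Hydrogen bonds per base pair, as stated on the slide:
# G-C pairs form 3 bonds, A-T (or A-U) pairs form 2.
H_BONDS = {"A": 2, "T": 2, "U": 2, "G": 3, "C": 3}


def hydrogen_bonds(seq: str) -> int:
    """Total H-bonds if seq were paired with its perfect complement."""
    return sum(H_BONDS[base] for base in seq.upper())


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


# Two duplexes of equal length but different composition.
print(hydrogen_bonds("GGCC"), gc_fraction("GGCC"))  # 12 1.0
print(hydrogen_bonds("AATT"), gc_fraction("AATT"))  # 8 0.0
```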

  9. Another View of DNA Where does an individual gene lie in this schematic?

  10. Another View of DNA

  11. Another View of DNA

  12. Central Dogma of Modern Cellular & Molecular Biology:

  13. Transcription From DNA to mRNA: Transcription occurs at Genes (T in DNA => U in RNA)
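
The T => U substitution can be sketched in a few lines of Python. This is a simplified illustration that ignores promoters, introns, and processing; the function names are hypothetical. It shows the two common ways of writing the same result: reading the coding (sense) strand directly, or building the RNA as the reverse complement of the template strand.

```python
def transcribe_coding_strand(dna: str) -> str:
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return dna.upper().replace("T", "U")


# RNA bases that pair with each DNA base on the template strand.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_template_strand(template: str) -> str:
    """Build mRNA (5' to 3') from the template strand written 5' to 3'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in reversed(template.upper()))


coding = "ATGGCC"
template = "GGCCAT"  # reverse complement of the coding strand
print(transcribe_coding_strand(coding))      # AUGGCC
print(transcribe_template_strand(template))  # AUGGCC
```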

  14. Transcript Processing

  15. Translation From RNA to Protein: In the exons of protein coding genes (and their mRNA intermediates), each codon (3 bases) encodes 1 amino acid in the protein.
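
A minimal sketch of codon-by-codon translation, assuming the standard genetic code and a simplified rule of starting at the first AUG and stopping at the first in-frame stop codon. The function name `translate` and the example sequence are hypothetical; real translation initiation is more involved.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order
# (first base varies slowest). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    # Step through the sequence three bases (one codon) at a time.
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# 5' UTR "GG", then AUG GCC AAA UAA, then 3' UTR "GG".
print(translate("GGAUGGCCAAAUAAGG"))  # MAK
```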

  16. Perspective: Biological Setup Every cell in the human body contains the entire human genome: 3.3 Gb in which ~30K genes exist. The investigation of gene expression is meaningful because different cells, in different environments, doing different jobs express different genes. Cellular “Plans”: DNA - RNA - PROTEIN

  17. Cellular Biology, Gene Expression, and Microarray Analysis A protein-coding gene is a segment of chromosomal DNA that directs the synthesis of a protein via an mRNA intermediate. DNA → RNA → Protein. How do we design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously?

  18. Laboratory Methods: The Genome and The Transcriptome Easy to sequence some genomic DNA. Easy to sequence some expressed mRNAs. NOT EASY to catalogue all genomic DNA, all expressed mRNAs, and to map out the exact relations between all these sequences.

  19. Molecular Cell Biology: Components of the Central Dogma [Diagram: Genomic DNA (3.3 Gb) → Transcription → mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA) → Translation → Protein]

  20. Gene: Protein coding unit of genomic DNA with an mRNA intermediate. DNA Probe Sequence is a Necessity. [Diagram: Genomic DNA (3.3 Gb, ~30K genes) → Transcription → mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA)]

  21. Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns. From Genomic DNA to mRNA Transcripts [Diagram labels: EXONS, INTRONS, ~30K, >30K] • Alternative splicing • Alternative start & stop sites in same RNA molecule • RNA editing & SNPs • Transcript coverage • Homology to other transcripts • Hybridization dynamics • 3’ bias

  22. Designing DNA Probes From Genomic DNA Sequence Sequence & assemble the entire human genome. Search for genes predicted to produce mRNA transcripts. Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns. Completeness? Design DNA probes. [ Genomic DNA databases & assembly ]

  23. Designing DNA Probes From mRNA Sequences Sequence ALL expressed mRNA molecules. Completeness? Design DNA probes.

  24. Unsurpassed as source of expressed sequence Sequence Quality! Redundancy! Completeness? Chaos?!?

  25. From Genomic DNA to mRNA Transcripts ~30K >30K >>30K

  26. Transcript-Based Gene-Centered Information

  27. From Genomic DNA to mRNA Transcripts

  28. From Genomic DNA to mRNA Transcripts

  29. DAY #1: Genome Biology • The Transcriptome • Microarray Technology

  30. RNA Expression Measurement: Northern Blot [Diagram: Seq DB → design + construction of labeled “probe”; SAMPLE 1, SAMPLE 2 → RNA extraction → RNA 1, RNA 2 (“target”) → electrophoretic separation → electrophoretic transfer to membrane → hybridization of labeled probe]

  31. RNA Expression Measurement: Northern Blot & Microarrays • Northern blots seek to interrogate the expression of ONE gene in a SINGLE hybridization reaction. • Microarrays seek to interrogate the expression of MANY genes simultaneously in a MULTIPLEX hybridization reaction. • SEQUENCE knowledge is REQUIRED for BOTH! • Target: unknown (sample) • Probe: known (synthetic) [Diagram labels: Northern: Probe, Target; Microarray: Probes, Target]

  32. Hybridization on a Northern Blot: 1 Labeled Probe + MANY Unlabeled Targets (on MEMBRANE) → 1 Hybrid. Target: unknown. Probe: known. Edwin Southern et al., Nature Genetics Suppl., 1999

  33. Hybridization on a Microarray: MANY Labeled Targets + MANY Unlabeled Probes (on Solid Support) → MANY Hybrids. Target: unknown. Probe: known. Edwin Southern et al., Nature Genetics Suppl., 1999

  34. Essentials of Microarray Experimental Design: • Probe sequence selection & design • Probe deposition on solid support • Target Labeling • Target Hybridization • Signal detection Target Probes Microarray

  35. cDNA Microarray Fabrication • Bacterial clones in 96-well plates • Printing onto standard glass microscope slides or nylon • cDNA Microarray

  36. cDNA Microarray Experimentation Sample Standard RNA Cy5 Cy3 cDNA Hybridized Microarray Scan

  37. cDNA Microarray Scanning Cy5 Cy3 Cy5 Channel Data Cy3 Channel Data Quantification Merged Image

  38. cDNA Microarray Quantification

  39. cDNA Microarray Quantification

  40. cDNA Microarray Quantification

  41. cDNA Microarray Quantification [Plot: Log Intensity vs. Log Intensity]

  42. cDNA Microarray Quantification [Plot: Log Ratio vs. Log Intensity]
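
The ratio-versus-intensity view for a two-channel array can be sketched as follows. The slide's own formula did not survive extraction, so this assumes the conventional definitions: the log ratio is M = log2(Cy5 / Cy3) and the log intensity is A = (log2 Cy5 + log2 Cy3) / 2. The function name and the spot intensities are hypothetical illustration values, not data from the course.

```python
import math


def log_ratio_and_intensity(cy5: float, cy3: float) -> tuple[float, float]:
    """Return (M, A) for one spot.

    Assumed conventional definitions (not confirmed by the slide):
      M = log2(Cy5 / Cy3)              -> log ratio
      A = (log2 Cy5 + log2 Cy3) / 2    -> average log intensity
    Both intensities must be positive.
    """
    m = math.log2(cy5 / cy3)
    a = 0.5 * (math.log2(cy5) + math.log2(cy3))
    return m, a


# Hypothetical spot intensities as (Cy5, Cy3) pairs.
spots = [(1024.0, 256.0), (512.0, 512.0), (64.0, 256.0)]
for cy5, cy3 in spots:
    m, a = log_ratio_and_intensity(cy5, cy3)
    print(f"Cy5={cy5:7.1f} Cy3={cy3:7.1f}  M={m:+.1f}  A={a:.1f}")
```

On the log scale, a 4-fold increase and a 4-fold decrease sit symmetrically at M = +2 and M = -2, which is one reason the log transformation is used before plotting.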

  43. Essentials of Microarray Experimental Design: • Probe sequence selection / design • Probe deposition on solid support • Target Labeling • Target Hybridization • Signal detection Target Probes Microarray

  44. Agilent (HP) Microarrays 44,000 oligonucleotides (60 nt each) synthesized in situ using inkjet printing and solid phase phosphoramidite chemistry. 2-channel fluorescence on glass slides.

  45. NIA Microarray • 10K Full Length cDNAs • Spotted on Nylon • P33 • One-Channel
