The Bioinformatics of Microarrays



  1. The Bioinformatics of Microarrays Microarray Outreach Team Fall 2005

  2. Outline • Biology, Statistics, Data mining common term definitions • Transcriptome caveats and limitations • Experimental Design • Scan to intensity measures • Low level analysis • Data mining – how to interpret > 6000 measures • Databases • Software • Techniques • Issues in comparing to prior high-throughput (HT) studies and across platforms

  3. Bioinformatics, Computational Biology, Data Mining • Bioinformatics is an interdisciplinary field concerned with the information-processing problems of computational biology and with a unified treatment of the data mining methods used to solve them. • Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g. • Genomes (viruses, bacteria, fungi, plants, insects,…) • Proteins and Proteomes • Biological Sequences • Molecular Function and Structure • Data Mining is the search for knowledge in data. Related terms: • Knowledge mining from databases • Knowledge extraction • Data/pattern analysis • Data dredging • Knowledge Discovery in Databases (KDD)

  4. Basic Terms in Biology Example: • The human body contains ~100 trillion cells • Inside each cell is a nucleus • Inside the nucleus are two complete copies of the human genome (except in egg and sperm cells, which carry one copy, and mature red blood cells, which have no nucleus) • Each copy of the genome includes 30,000-80,000 genes on the same 23 chromosomes • Gene – A functional hereditary unit that occupies a fixed location on a chromosome, has a specific influence on phenotype, and is capable of mutation. • Chromosome – A linear, DNA-containing body in the cell nucleus responsible for determination and transmission of hereditary characteristics

  5. Basic Terms in Data Mining • Data Mining: A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data. • Knowledge Discovery Process: The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations. • A pattern is a conservative statement about a probability distribution. • Webster: A pattern is (a) a natural or chance configuration, (b) a reliable sample of traits, acts, tendencies, or other observable characteristics of a person, group, or institution

  6. Problems in Bioinformatics Domain • Data production at the levels of molecules, cells, organs, organisms, populations • Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, … • Prediction of Molecular Function and Structure • Computational biology: synthesis (simulations) and analysis (machine learning)

  7. Subcellular localization provides a simple goal for genome-scale functional prediction: determine how many of the ~6000 yeast proteins go into each compartment

  8. Subcellular localization, a standardized aspect of function. Compartments shown: Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria

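  9. "Traditionally" subcellular localization is "predicted" by sequence patterns. Compartments shown: Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria. Sequence patterns shown: NLS (nucleus), TM-helix (membrane), HDEL (ER), signal sequence (extracellular/secreted), import signal (mitochondria)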

  10. Subcellular localization is associated with the level of gene expression. Figure: expression level in copies/cell for each compartment (Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria)

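  11. Combine expression information and sequence patterns to predict localization. Figure: expression level in copies/cell for each compartment (Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria), overlaid with the sequence patterns NLS (nucleus), TM-helix (membrane), HDEL (ER), signal sequence (extracellular/secreted), import signal (mitochondria)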

  12. Major Objective: Discover a comprehensive theory of life’s organization at the molecular level • The major actors of molecular biology: the nucleic acids, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) • The central dogma of molecular biology? • Proteins are very complicated molecules built from 20 different amino acids.

  13. Dynamic Nature of Yeast Genome • eORF = essential • kORF = known • hORF = homology identified • shORF = short • tORF = transposon identified • qORF = questionable • dORF = disabled • The first published sequence claimed 6274 genes, a number that has been revised many times. Why?

  14. The Affy (Affymetrix) detection oligonucleotide sequences are frozen at the time of synthesis. How does this impact downstream data analysis?

  15. Microarray Data Process • Experimental Design • Image Analysis – raw data • Normalization – “clean” data • Data Filtering – informative data • Model building • Data Mining (clustering, pattern recognition, etc.) • Validation

  16. Experimental Design A good microarray design has 4 elements: • A clearly defined biological question or hypothesis • Treatment, perturbation and observation of biological materials that minimize systematic bias • A simple and statistically sound arrangement that minimizes cost and gains maximal information • Compliance with MIAME. Two goals to keep in mind: • The goal of statistics is to find signals in a sea of noise • The goal of experimental design is to reduce the noise so signals can be found with as small a sample size as possible

  17. Observational Study vs. Designed Experiment • Observational study: the investigator is a passive observer who measures variables of interest but does not attempt to influence the responses • Designed experiment: the investigator intervenes in the natural course of events • What type is our DMSO experiment?

  18. Experimental Replicates • Why? • In any experimental system there is a certain amount of noise, so even 2 identical processes yield slightly different results • Sources? • To understand how much variation there is, it is necessary to repeat an experiment a number of independent times • Replicates allow us to use statistical tests to ascertain whether the differences we see are real
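The point about replicates and statistical tests can be illustrated with a small sketch. It assumes a two-condition comparison for a single gene and uses Welch's t statistic, which is one common choice and not necessarily the test used in the course experiment. The intensity values and the helper name `welch_t` are hypothetical.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Sample variances (n - 1 denominator); each group needs >= 2 replicates.
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    n_a, n_b = len(group_a), len(group_b)
    se_a, se_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical log2 intensities for one gene, three replicates per condition.
control = [8.1, 8.3, 7.9]
treated = [9.4, 9.0, 9.3]

t_stat, dof = welch_t(treated, control)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

With a single array per condition the variances cannot be estimated at all, which is why the test is only possible with replicates. Turning the statistic into a p-value requires a t distribution, which is left out here to keep the sketch within the standard library.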

  19. Technical vs. Biological Replicates • As we progress from the starting material to the scanned image, we move from a system dominated by biological effects to one dominated by chemistry and physics noise • Within the Affy platform the dominant variation is usually biological, so the best strategy is to produce replicates as high up the experimental tree as possible

  20. From probe level signals to gene abundance estimates

  21. From probe level signals to gene abundance estimates The job of the expression summary algorithm is to take a set of Perfect Match (PM) and Mis-Match (MM) probes, and use these to generate a single value representing the estimated amount of transcript in solution, as measured by that probeset. To do this, .DAT files containing array images are first processed to produce a .CEL file, which contains measured intensities for each probe on the array. It is the .CEL files that are analysed by the expression calling algorithm.

  22. PM and MM Probes • The purpose of each MM probe is to provide a direct measure of background and stray signal (perhaps due to cross-hybridisation) for its perfect-match partner. In most situations the signal from each probe pair is simply the difference PM - MM. • For some probe pairs, however, the MM signal is greater than the PM value; we have an apparently impossible measure of background.
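A minimal sketch of the PM - MM calculation described on this slide, including the problem case where MM exceeds PM. The intensities and the function name `probe_pair_signals` are hypothetical, and the sketch only flags the problem pairs; it does not reproduce how any particular algorithm corrects them.

```python
def probe_pair_signals(pm, mm):
    """Return PM - MM for each probe pair and the indices where MM > PM."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM lists must have the same length")
    diffs = [p - m for p, m in zip(pm, mm)]
    # A negative difference means the "background" estimate exceeds the signal.
    problem_pairs = [i for i, d in enumerate(diffs) if d < 0]
    return diffs, problem_pairs


# Hypothetical intensities for one probeset with five probe pairs.
pm = [1200.0, 950.0, 400.0, 1800.0, 310.0]
mm = [300.0, 280.0, 520.0, 450.0, 305.0]

diffs, problem_pairs = probe_pair_signals(pm, mm)
print(diffs)          # per-pair PM - MM
print(problem_pairs)  # indices of pairs with MM > PM
```

In this made-up probeset the third pair has MM greater than PM, so a plain difference would give a negative signal for it.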

  23. Signal Intensity • Following these calculations, the MAS5 algorithm now has a measure of the signal for each probe in a probeset. • Other algorithms, e.g. RMA, GCRMA and dChip, have been developed by academic teams to improve the precision and accuracy of this calculation • In our experiment we will use RMA and GCRMA

  24. Low level data analysis / pre-processing • Sources of systematic differences between individual arrays: • Varying biological or cellular composition among sample types • Differences in sample preparation, labeling or hybridization • Non-specific cross-hybridization of target to probes • Steps: • Raw data quality control • Scaling • Normalization and filtering • Tools: GeneSpring, R language, Bioconductor • Performed by: GMC scientists (Scott, Anjie) + entire UVM outreach team
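To make the normalization step concrete, here is a sketch of quantile normalization, the between-array normalization used by RMA. It is an illustration under assumptions, not the GeneSpring or Bioconductor implementation: the function name `quantile_normalize` and the intensities are hypothetical, and ties are broken by probe order instead of being averaged.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length intensity lists, one list per array.

    Simplification: tied values are ranked by probe order, not averaged.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")
    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(col) / len(arrays) for col in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities: three arrays, four probes each.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for row in normalized:
    print(row)
```

After normalization every array has the same set of values, so array-wide differences in brightness are removed while the rank order of probes within each array is kept.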

  25. Data processing is completed. Now what?

  26. Overview of Microarray Problem. Diagram labels: Biology Application Domain; Experiment Design and Hypothesis; Microarray Experiment; Image Analysis; Data Analysis; Data Mining; Data Warehouse; Validation; Artificial Intelligence (AI); Knowledge discovery in databases (KDD); Statistics

  27. Back to Biology • Do the changes you see in gene expression make sense BIOLOGICALLY? • How do we know? • If they don’t make sense, can you hypothesize as to why those genes might be changing? • Leads to many, many more experiments

  28. The Gene Ontologies A Common Language for Annotation of Genes from Yeast, Flies and Mice …and Plants and Worms …and Humans …and anything else!

  29. Gene Ontology Objectives • GO represents concepts used to classify specific parts of our biological knowledge: • Biological Process • Molecular Function • Cellular Component • GO develops a common language applicable to any organism • GO terms can be used to annotate gene products from any species, allowing comparison of information across species

  30. Srinija Srinivasan, Chief Ontologist, Yahoo! The ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, like are Movies part of Art or Entertainment? (Yahoo! lists them under the latter.) -Wired Magazine, May 1996

  31. The 3 Gene Ontologies • Molecular Function = elemental activity/task • the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity • Biological Process = biological goal or objective • broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions • Cellular Component = location or complex • subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

  32. Example: Gene Product = hammer. Function (what) and Process (why): • Drive nail (into wood): Carpentry • Drive stake (into soil): Gardening • Smash roach: Pest Control • Clown’s juggling object: Entertainment

  33. Biological Examples. Figure panels: Biological Process, Molecular Function, Cellular Component

  34. Terms, Definitions, IDs • term: MAPKKK cascade (mating sensu Saccharomyces) • goid: GO:0007244 • definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces. • definition_reference: PMID:9561267 • comment: This term was made obsolete because it is a gene product specific term. To update annotations, use the biological process term 'signal transduction during conjugation with cellular fusion ; GO:0000750'. • definition: MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces
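The record on this slide is a set of "field: value" pairs, which maps naturally onto a small data structure. The sketch below reads such a record and checks whether the term is obsolete. The field names come from the slide; the one-field-per-line layout, the `GoTerm` class and the `parse_term` function are assumptions made for illustration and are not an official GO file format or API.

```python
from dataclasses import dataclass


@dataclass
class GoTerm:
    name: str
    go_id: str
    definition: str
    comment: str = ""

    @property
    def is_obsolete(self):
        # On the slide, obsolete terms carry an "OBSOLETE." prefix.
        return self.definition.startswith("OBSOLETE")


def parse_term(record):
    """Parse 'field: value' lines (an assumed layout) into a GoTerm."""
    fields = {}
    for line in record.strip().splitlines():
        # Split on the first colon only, since values like GO IDs contain colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return GoTerm(
        name=fields["term"],
        go_id=fields["goid"],
        definition=fields.get("definition", ""),
        comment=fields.get("comment", ""),
    )


record = """
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
definition_reference: PMID:9561267
comment: This term was made obsolete because it is a gene product specific term.
"""

term = parse_term(record)
print(term.go_id, term.is_obsolete)
```

Checking the obsolete flag before using a term matters in practice, because annotations made against an obsolete ID need to be moved to the replacement term named in the comment.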

  35. SGD