Patrick Glenisson

Integrating Scientific Literature With Large Scale Gene Expression Analysis. Patrick Glenisson. December 21, 2004. Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data.


Presentation Transcript


  1. Integrating Scientific Literature With Large Scale Gene Expression Analysis Patrick Glenisson December 21, 2004

  2. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  3. Overview M-score • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Cluster analysis Overview

  4. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Literature analysis Overview

  5. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion TXTGate Overview

  6. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Integrated clustering & Overview

  7. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  8. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  9. DNA, genes, proteins and cells Genes and Microarrays

  10. DNA, genes, proteins and cells protein Genes and Microarrays

  11. Genes are expressed and regulated Genes and Microarrays

  12. Microarrays measure gene expression [Figure: laser excitation of the array yields a gene expression measurement matrix of genes (G1, G2, G3, ..) by conditions (C1, C2, C3, ..), accompanied by gene annotations and sample annotations] Genes and Microarrays

  13. Representing expression information Conditions in which expression occurs • Gene expression experiments are complex: • Too verbose to include in a scientific publication • Too important to compromise on reproducibility • Too valuable for post-genome research to have them scattered around on various websites • What level of detail is necessary for reproducibility / data mining? • Hence, a standard for reporting on microarray (MA) experiments • As a guideline for databases hosting expression compendia Genes and Microarrays

  14. Storing gene expression data Genes and Microarrays

  15. MIAME standard • Minimum Information About a Microarray Experiment • Internationally proposed standard • Published in December 2001 by the international MGED consortium • Prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data • Some hurdles: • Significant overhead in filling out the questionnaire • Scooping of leads (!) • Proprietary information about probe sequences • Query-enabled vs. comparable (cf. Affymetrix vs. cDNA arrays) Genes and Microarrays

  16. An impression of MIAME’s content Genes and Microarrays

  17. Dissemination of gene expression data publications repositories Genes and Microarrays

  18. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  19. Questions asked with microarrays • Fundamental • Functional roles of genes (and transcriptional regulation) • Genetic network reconstruction • Clinical • Correlation of genes with a given disease • Diagnosis of disease stage in patients • Pharmacological • Toxicological drug response assessment Gene expression data analysis

  20. Microarray data analysis • Fundamental • Functional roles of genes (and transcriptional regulation) • Genetic network reconstruction • Clinical • Correlation of genes with a given disease • Diagnosis of disease stage in patients • Pharmacological • Toxicological drug response assessment Gene expression data analysis

  21. Clustering [Figure: expression data (genes by conditions C1, C2, C3) → gene-by-gene distance matrix → clustering (hierarchical clustering, k-means)] Gene expression data analysis
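
The pipeline on this slide (expression data, then a gene-by-gene distance matrix, then clustering) can be sketched in a few lines. This is a minimal illustration, not the analysis used in the talk: the choice of Pearson correlation distance and average linkage, the function names, and the toy expression values are all assumptions made for the example.

```python
from math import sqrt

def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two expression profiles.

    Assumes neither profile is constant (non-zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def distance_matrix(profiles):
    """Gene-by-gene distance matrix from a genes-by-conditions table."""
    n = len(profiles)
    return [[pearson_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

def average_linkage(dist, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average distance over all cross-cluster gene pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        p, q = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Toy data (hypothetical): rows are genes, columns are conditions C1..C4.
expression = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0, rising
    [2.0, 4.1, 6.0, 8.2],  # gene 1, rising
    [4.0, 3.0, 2.0, 1.0],  # gene 2, falling
    [8.0, 6.1, 4.0, 2.1],  # gene 3, falling
]
clusters = average_linkage(distance_matrix(expression), k=2)
```

Correlation distance groups genes by the shape of their profile rather than by absolute expression level, which is why genes 0 and 1 end up together despite their different magnitudes.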

  22. Cluster validation Optimal number of clusters? Define `optimal’? • Data-centered statistical scores • Coherence vs separation of clusters (e.g. silhouette) • Stability of a cluster solution when leaving out data Gene expression data analysis
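
As an illustration of a data-centered score that trades off coherence against separation, the silhouette width named on this slide can be computed as below. This is a sketch under assumptions: Euclidean distance, at least two clusters, and made-up 2-D points standing in for expression profiles.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette(points, labels):
    """Mean silhouette width; values near 1 mean tight, well separated clusters.

    Assumes at least two distinct labels.
    """
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # coherence: mean distance within own cluster
        b = min(                 # separation: mean distance to nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (hypothetical): two obvious groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the true groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
```

Scanning this score over a range of cluster numbers and picking the maximum is one common way to make `optimal’ concrete.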

  23. Cluster validation – stability method Genes and Microarrays

  24. Cluster validation Optimal number of clusters? Define `optimal’? • Data-centered statistical scores • Knowledge-based scores • Enrichment of GO annotations in clusters • Literature-based scoring Gene expression data analysis

  25. Cluster validation Optimal number of clusters? Define `optimal’? • Data-centered statistical scores • Knowledge-based scores • Motif-based • DNA patterns in regulatory regions of gene groups [Figure: gene with regulatory DNA patterns (motifs)] Gene expression data analysis

  26. DNA patterns in expression clusters • ‘Significant’ occurrences of known motifs in gene clusters • Cluster-by-motif matrix (motif enrichment matrix): clusters 1, 2, 3, .. by motifs A, B, C, .., with entries -log(p-value) → M-score Gene expression data analysis
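
A cluster-by-motif enrichment matrix of this kind could be built as follows. The slide does not say which significance test was used, so the hypergeometric tail below is an assumption, as are the function names, the base-10 logarithm, and the toy gene sets. The sketch stops at the matrix and does not attempt the M-score itself, whose formula is not given here.

```python
from math import comb, log10

def hypergeom_tail(N, K, n, x):
    """P(X >= x) when n genes are drawn from N, of which K carry the motif."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(x, upper + 1)) / comb(N, n)

def enrichment_matrix(clusters, motif_genes, universe_size):
    """Rows are clusters, columns are motifs, entries are -log10(p-value)."""
    matrix = []
    for cluster in clusters:
        row = []
        for genes in motif_genes:
            hits = len(cluster & genes)  # cluster members carrying the motif
            p = hypergeom_tail(universe_size, len(genes), len(cluster), hits)
            row.append(-log10(p))
        matrix.append(row)
    return matrix

# Toy data (hypothetical): 20 genes, two clusters of five, two motifs.
clusters = [set(range(0, 5)), set(range(5, 10))]
motif_genes = [{0, 1, 2, 3, 10}, {5, 12, 17}]
matrix = enrichment_matrix(clusters, motif_genes, universe_size=20)
```

Here the first motif hits four of the five genes in the first cluster and therefore gets a large entry there, while a cluster with no hits gets an entry of zero.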

  27. Cluster-by-motif matrix • M-score for the entire clustering solution • One-shot estimate of the `biological relevance’ [Figure: cluster-by-motif matrix] Gene expression data analysis

  28. M-score • A motif is less interesting when it (significantly) occurs in many clusters • A cluster that contains a large portion of (significant) motifs is less likely to be biologically relevant. • A `too large' number of clusters is less likely to reflect the true biological diversity underlying the experiment. Gene expression data analysis

  29. M-score validation [Plot: M-score vs. k] • Optimal k in yeast cell cycle expression data • Original studies by Tavazoie et al. used k=30 • Overestimation confirmed by analyses of • De Smet et al. (AQBC) • Gibbons et al. (GO-based scoring) • A simplification of reality • No absolute quantification of biological relevance • Useful tool when experimenting with • Multiple clustering methods • Multiple parameterizations • To economize on biological validations Gene expression data analysis

  30. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  31. Problem setting • Given a set of documents, • compute a representation, called index • to retrieve, summarize, classify or cluster them  <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0> Text Mining: principles

  32. Problem setting • Given a set of genes (and their literature), • compute a representation, called gene index • to retrieve, summarize, classify or cluster them  <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0> Text Mining: principles

  33. Vector space model [Figure: a gene represented as a vector over vocabulary terms T1, T2, T3] • Document processing • Remove punctuation & grammatical structure (`bag of words’) • Define a vocabulary • Identify multi-word terms (e.g., tumor suppressor) (phrases) • Eliminate low-content words (e.g., and, thus, gene, ...) (stopwords) • Map words with the same meaning (synonyms) • Strip plurals, conjugations, ... (stemming) • Define weighting scheme and/or transformations (tf-idf, SVD, ..) • Compute index of textual resources: Text Mining: principles
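
The document-processing steps listed on this slide can be sketched as a small bag-of-words pipeline with tf-idf weighting. Everything specific here is an assumption for illustration: the stopword list, the crude plural stripping that stands in for real stemming, the plain `tf * log(n / df)` weighting, and the three made-up gene texts. Phrase detection and synonym mapping are left out.

```python
import re
from collections import Counter
from math import log, sqrt

STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a"}  # illustrative

def tokenize(text):
    """Lowercase, drop punctuation and stopwords, strip a trailing plural 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w
            for w in words if w not in STOPWORDS]

def tfidf_index(docs):
    """Map each gene to a sparse {term: tf * idf} vector."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    df = Counter(term for c in counts.values() for term in c)  # document frequency
    n = len(docs)
    return {name: {t: tf * log(n / df[t]) for t, tf in c.items()}
            for name, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Toy gene-associated texts (hypothetical).
docs = {
    "geneA": "Tumor suppressor regulates cell cycle arrest.",
    "geneB": "Cell cycle arrest and tumor growth.",
    "geneC": "Heparin binding growth factor activity.",
}
index = tfidf_index(docs)
sim_ab = cosine(index["geneA"], index["geneB"])
sim_ac = cosine(index["geneA"], index["geneC"])
```

Genes A and B share cell-cycle vocabulary and come out closer in text space than genes A and C, which share no terms.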

  34. Validity of gene index Genes that are functionally related should be close in text space: • Text-based coherence score • Modeled with respect to a background distribution • through random and permuted gene groups Text Mining: principles
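
One way to read this slide is as an empirical test: score a gene group by how close its members are in text space, then compare against the scores of random gene groups of the same size. The sketch below follows that reading only; the coherence definition (mean pairwise cosine similarity), the function names, and the toy index are assumptions, not the score defined in the talk.

```python
import random
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two dense, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def coherence(group, index):
    """Mean pairwise cosine similarity of the genes' text vectors."""
    pairs = list(combinations(group, 2))
    return sum(cosine(index[g], index[h]) for g, h in pairs) / len(pairs)

def empirical_p(group, index, n_random=1000, seed=0):
    """Fraction of random same-size gene groups at least as coherent as `group`."""
    rng = random.Random(seed)
    observed = coherence(group, index)
    genes = sorted(index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    return (hits + 1) / (n_random + 1)  # add-one correction avoids p = 0

# Toy gene index (hypothetical): g1..g3 share a textual profile.
index = {
    "g1": [1.0, 0.1, 0.0], "g2": [0.9, 0.2, 0.0], "g3": [1.0, 0.0, 0.1],
    "g4": [0.0, 1.0, 0.0], "g5": [0.0, 0.0, 1.0], "g6": [0.1, 1.0, 0.0],
    "g7": [0.0, 0.2, 1.0], "g8": [0.5, 0.5, 0.5],
}
score = coherence(["g1", "g2", "g3"], index)
p_value = empirical_p(["g1", "g2", "g3"], index)
```

A small p-value indicates that the group is more coherent in text space than would be expected for an arbitrary set of genes.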

  35. Validity of gene index Genes that are functionally related should be close in text space: Text Mining: principles

  36. Validity of gene index Genes that are functionally related should be close in text space: Text Mining: principles

  37. Validity of gene index • “Simple word vector representations are competitive also in terms of classification task with respect to more elaborate approaches ..” • ..despite unaddressed issues such as • phrases • homonyms • neglected grammatical structure (A. Seewald: Ranking for BioMinT: Investigating performance, local search and homonymy recognition. www.biomint.org) Genes and Microarrays

  38. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion TXTGate Overview

  39. Motivation 1 • GeneRIF • 12133521: VEGF is associated with the development and prognosis of colorectal cancer. • 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression. • 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex • GO • cell proliferation • heparin binding • growth factor activity “Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database” (M. Gerstein, 2001) TXTGate - a platform to profile groups of genes

  40. Motivation 2 • Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems. • A number of structured vocabularies have already arisen: • Gene Ontology (GO) • MeSH • eVOC • Standards are systematically being adopted to store biological concepts or annotations: • HUGO • GOA@EBI TXTGate - a platform to profile groups of genes

  41. Motivation 3 (Figure courtesy: S. Van Vooren) TXTGate - a platform to profile groups of genes

  42. Development of a text mining platform • A platform that offers multiple ‘views’ on the vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications • Incorporates term-based indices .. • .. and uses them as a starting point • to explore the text through the eyes of different domain vocabularies • to link out to other resources by query building, or • to sub-cluster genes based on text. Genes and Microarrays

  43. Genes and Microarrays

  44. Genes and Microarrays

  45. Genes and Microarrays

  46. Illustration: sub-clustering Eisen et al. (1998) Genes and Microarrays

  47. Illustration: profiling Chaussabel et al. (2003) Genes and Microarrays

  48. TXTGate: towards closing the KD loop Distance matrix & Clustering Other vocabulary Profile TXTGate - a platform to profile groups of genes

  49. TXTGate – a case study • Gene modules over various expression data sets • Reported two sub-modules of the TCA cycle • Two ‘new’ genes, ACN9 & CAT8, in module 2 • Visualize with BioLayout / LGL TXTGate - a platform to profile groups of genes

  50. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview
