
PhD defense Patrick Glenisson

Integrating Scientific Literature With Large Scale Gene Expression Analysis. PhD defense Patrick Glenisson. Promotor Prof. Bart De Moor. June 11th 2004. Overview. Genes & microarrays. Gene expression data analysis. Text mining in biology: principles. Text mining in practice: TXTGate.


Presentation Transcript


  1. Integrating Scientific Literature With Large Scale Gene Expression Analysis PhD defense Patrick Glenisson Promotor Prof. Bart De Moor June 11th 2004

  2. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  3. Overview M-score • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Cluster analysis Overview

  4. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Literature analysis Overview

  5. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion TXTGate Overview

  6. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Integrated clustering Overview

  7. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  8. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  9. DNA, genes, proteins and cells Genes and Microarrays

  10. DNA, genes, proteins and cells protein Genes and Microarrays

  11. Genes are expressed and regulated Genes and Microarrays

  12. Microarrays measure gene expression [Figure: laser excitation of the array produces gene expression measurements, arranged as a matrix of genes (G1, G2, G3, ..) by conditions (C1, C2, C3, ..), with gene annotations and sample annotations attached] Genes and Microarrays

  13. Representing expression information Conditions in which expression occurs • Gene expression experiments are complex: • Too verbose to include in a scientific publication • Too important to compromise on reproducibility • Too valuable for post-genome research to be scattered across various websites • Hence the need for a standard for reporting on microarray (MA) experiments • As a guideline for databases hosting expression compendia Genes and Microarrays

  14. MIAME standard • Minimum Information About a MicroArray Experiment • Internationally proposed standard • Published in Dec 2001 by the international MGED consortium • Some prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data • Some hurdles: • Significant overhead in filling out the questionnaire • Scooping of leads (!) • Proprietary information about probe sequences Genes and Microarrays

  15. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  16. Questions asked with microarrays • Fundamental • Functional roles of genes (and transcriptional regulation) • Genetic network reconstruction • Clinical • Correlation of genes with a given disease • Diagnosis of disease stage in patients • Pharmacological • Toxicological drug response assessment Gene expression data analysis

  17. Microarray data analysis • Fundamental • Functional roles of genes (and transcriptional regulation) • Genetic network reconstruction • Clinical • Correlation of genes with a given disease • Diagnosis of disease stage in patients • Pharmacological • Toxicological drug response assessment Gene expression data analysis

  18. Clustering [Figure: expression data (genes by conditions C1, C2, C3) are converted into a gene-by-gene distance matrix, which feeds the clustering step (hierarchical clustering or k-means)] Gene expression data analysis
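
To make the clustering step on slide 18 concrete, here is a minimal k-means sketch in plain Python on a toy gene-by-condition matrix. The data, the function name `kmeans`, and the choice to start the centroids at the first k profiles are assumptions made for this illustration; they are not the setup used in the thesis.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(profiles, k, n_iter=20):
    """Lloyd's k-means; centroids start at the first k profiles (toy choice)."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each gene joins its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in profiles]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(profiles, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy expression matrix: rows are genes G1..G4, columns are conditions C1..C3.
expression = [
    [1.0, 1.1, 0.9],   # G1
    [5.0, 5.2, 4.8],   # G2
    [0.9, 1.0, 1.2],   # G3
    [5.1, 4.9, 5.0],   # G4
]
labels = kmeans(expression, k=2)
```

In this toy case G1 and G3 end up in one cluster and G2 and G4 in the other. The number of clusters k is an input here, which is the open question the cluster validation slides address.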

  19. Cluster validation Optimal number of clusters? Define 'optimal'? • Data-centered statistical scores • Coherence vs. separation of clusters • Stability of a cluster solution when leaving out data [Figure: clusters in the space of conditions C1, C2, C3] Gene expression data analysis

  20. Cluster validation Optimal number of clusters? Define 'optimal'? • Data-centered statistical scores • Knowledge-based scores • Enrichment of GO annotations in clusters • Literature-based scoring Gene expression data analysis

  21. Cluster validation Optimal number of clusters? Define 'optimal'? • Data-centered statistical scores • Knowledge-based scores • Motif-based • DNA patterns in regulatory regions of gene groups [Figure: gene with regulatory DNA patterns (motifs)] Gene expression data analysis

  22. DNA patterns in expression clusters Significant occurrences of known motifs in a cluster [Figure: gene clusters (1, 2, 3, ..) are scored against motifs (A, B, C, ..), giving a cluster-by-motif matrix (motif enrichment matrix) with entries -log(p-value), which is summarized into the M-score] Gene expression data analysis
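
A cluster-by-motif matrix of -log(p-value) entries, as on slide 22, can be sketched as follows. The slide does not say which significance test is used, so the hypergeometric tail test below is an assumption, as are the function names and the toy gene sets. The M-score itself is not computed here because its formula is not given in the transcript.

```python
import math

def hypergeom_tail(x, n_genes, n_with_motif, cluster_size):
    """P(X >= x) when drawing cluster_size genes out of n_genes,
    of which n_with_motif carry the motif (assumed test, see lead-in)."""
    total = math.comb(n_genes, cluster_size)
    upper = min(n_with_motif, cluster_size)
    favourable = sum(
        math.comb(n_with_motif, i)
        * math.comb(n_genes - n_with_motif, cluster_size - i)
        for i in range(x, upper + 1)
    )
    return favourable / total

def enrichment_matrix(clusters, motif_hits, n_genes):
    """Rows: motifs, columns: clusters, entries: -log10(p-value)."""
    matrix = {}
    for motif, genes_with_motif in motif_hits.items():
        row = []
        for cluster in clusters:
            x = len(cluster & genes_with_motif)
            p = hypergeom_tail(x, n_genes, len(genes_with_motif), len(cluster))
            row.append(-math.log10(p))
        matrix[motif] = row
    return matrix

# Hypothetical toy data: 10 genes (0..9), three clusters, two motifs.
clusters = [{0, 1, 2}, {3, 4, 5, 6}, {7, 8, 9}]
motif_hits = {"A": {0, 1, 2}, "B": {3, 7}}
matrix = enrichment_matrix(clusters, motif_hits, n_genes=10)
```

Motif A occurs only in the genes of the first cluster, so its row has one large entry and two zeros. That is the pattern slide 24 describes as interesting: a motif concentrated in few clusters.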

  23. Cluster-by-motif matrix M-score for the entire clustering solution: a one-shot estimate of the 'biological relevance' [Figure: motif-by-cluster matrix] Gene expression data analysis

  24. M-score • A motif is less interesting when it (significantly) occurs in many clusters • A cluster that contains a large portion of (significant) motifs is less likely to be biologically relevant. • A `too large' number of clusters is less likely to reflect the true biological diversity underlying the experiment. Gene expression data analysis

  25. M-score validation [Plot: M-score versus k] • Optimal k in yeast cell cycle expression data • Original studies by Tavazoie et al. used k=30 • Overestimation confirmed by analyses of • De Smet et al. (AQBC) • Gibbons et al. (GO-based scoring) • A simplification of reality • No absolute quantification of biological relevance • Useful tool when experimenting with • Multiple clustering methods • Multiple parameterizations • To economize on biological validations Gene expression data analysis

  26. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  27. Problem setting • Given a set of documents, • compute a representation, called index • to retrieve, summarize, classify or cluster them  <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0> Text Mining: principles

  28. Problem setting • Given a set of genes (and their literature), • compute a representation, called gene index • to retrieve, summarize, classify or cluster them  <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0> Text Mining: principles

  29. Vector space model [Figure: a gene as a vector in the space spanned by vocabulary terms T1, T2, T3] • Document processing • Remove punctuation & grammatical structure ('bag of words') • Define a vocabulary • Identify multi-word terms (e.g., tumor suppressor) (phrases) • Eliminate words with low content (e.g., and, thus, gene, ...) (stopwords) • Map words with the same meaning (synonyms) • Strip plurals, conjugations, ... (stemming) • Define weighting scheme and/or transformations (tf-idf, SVD, ...) • Compute index of textual resources: Text Mining: principles
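
The document-processing steps on slide 29 can be sketched as a small tf-idf indexer. This is a simplified illustration: it covers only punctuation removal, stopword filtering, and tf-idf weighting, and leaves out phrases, synonyms, stemming, and SVD. The stopword list, the function names, and the three toy texts are assumptions for the example, not the vocabulary or weighting variant used in the thesis.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a"}

def tokenize(text):
    """Lowercase, replace punctuation by spaces, drop stopwords (bag of words)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [w for w in cleaned.split() if w not in STOPWORDS]

def tfidf_index(docs):
    """Map each document id to a sparse {term: tf-idf weight} vector."""
    bags = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(bags)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for bag in bags.values() for term in bag)
    return {
        doc_id: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in bag.items()}
        for doc_id, bag in bags.items()
    }

# Hypothetical one-sentence "literature" for three genes.
docs = {
    "geneA": "Tumor suppressor and cell proliferation.",
    "geneB": "Cell proliferation in yeast.",
    "geneC": "Heparin binding.",
}
index = tfidf_index(docs)
```

Terms that occur in every document get weight zero under this weighting, and terms specific to one gene get the highest weight, which is what makes the resulting vectors usable as a gene index.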

  30. Validity of gene index Genes that are functionally related should be close in text space: Text-based coherence score • Modeled with respect to a background distribution of • through random and permuted gene groups Text Mining: principles

  31. Validity of gene index Genes that are functionally related should be close in text space: Text Mining: principles

  32. Validity of gene index Genes that are functionally related should be close in text space: Text Mining: principles
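
One way to read the text-based coherence score of slides 30 to 32 is as the mean pairwise cosine similarity of a gene group, compared against randomly drawn groups of the same size. The sketch below follows that reading. The exact score and background model of the thesis are not given in the transcript, so the definitions, function names, and toy index here are assumptions.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def coherence(group, index):
    """Mean pairwise cosine similarity of the genes in `group`."""
    pairs = list(combinations(group, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)

def empirical_p_value(group, index, n_random=1000, seed=0):
    """Fraction of random same-size gene groups at least as coherent."""
    rng = random.Random(seed)
    observed = coherence(group, index)
    genes = sorted(index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    return (hits + 1) / (n_random + 1)

# Hypothetical gene index: g1 and g2 share vocabulary, the rest do not.
index = {
    "g1": {"tca": 1.0, "cycle": 1.0},
    "g2": {"tca": 1.0, "cycle": 0.5},
    "g3": {"kinase": 1.0},
    "g4": {"heparin": 1.0},
    "g5": {"ribosome": 1.0},
    "g6": {"histone": 1.0},
}
related = coherence(["g1", "g2"], index)
unrelated = coherence(["g3", "g4"], index)
p_value = empirical_p_value(["g1", "g2"], index)
```

A group of functionally related genes should score well above the random background, so a small empirical p-value supports the validity of the gene index.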

  33. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion TXTGate Overview

  34. Motivation 1 • GO: • cell proliferation • heparin binding • growth factor activity • GeneRIF: • 12133521: VEGF is associated with the development and prognosis of colorectal cancer. • 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression. • 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex • "Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database" (M. Gerstein, 2001) TXTGate - a platform to profile groups of genes

  35. Motivation 2 • Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems. • A number of structured vocabularies have already arisen: • Gene Ontology (GO) • MeSH • eVOC • Standards are systematically being adopted to store biological concepts or annotations: • HUGO • GOA@EBI TXTGate - a platform to profile groups of genes

  36. Motivation 3 (Figure courtesy: S. Van Vooren) TXTGate - a platform to profile groups of genes

  37. TXTGate Distance matrix & Clustering Other vocabulary Profile TXTGate - a platform to profile groups of genes

  38. TXTGate – a case study Two 'new' genes ACN9 & CAT8 in module 2 • Gene modules over various expression data sets • Reported two sub modules of TCA cycle TXTGate - a platform to profile groups of genes

  39. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  40. Problem setting "How can we analyze data in an integrated fashion to extract more information than from expression data alone?" Fusion of text and expression data

  41. Integration of text and data • In each information space • Appropriate preprocessing • Choice of distance measures Fusion of text and expression data

  42. Integration of text and data • Combine data: • confidence attributed to either of the two data types • in case of distance, we can see it as a scaling constant between the norms of the data- and text representations. Fusion of text and expression data

  43. Integration of text and data • However, the distributions of distances differ between the two data types, which introduces a bias (scaling problem) • Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure) [Figures: expression distance histogram; text distance histogram] Fusion of text and expression data
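
The scaling problem on slide 43 can be illustrated with a small sketch: each distance is first turned into an empirical p-value within its own data type, which removes the difference in scale, and the two p-values per gene pair are then combined. The combination rule used here is Fisher's method, a common omnibus procedure from meta-analysis that assumes independent p-values. Whether the thesis uses this exact variant is not stated in the transcript, so treat the rule, the function names, and the toy distances as assumptions.

```python
import math

def empirical_p(distances):
    """Map each distance to the fraction of distances that are <= it."""
    values = list(distances.values())
    n = len(values)
    return {pair: sum(1 for x in values if x <= d) / n
            for pair, d in distances.items()}

def fisher_combine(p1, p2):
    """Fisher's omnibus p-value for two independent p-values.
    The statistic -2 * (ln p1 + ln p2) is chi-square with 4 degrees of
    freedom, whose upper tail has the closed form exp(-s/2) * (1 + s/2)."""
    stat = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-stat / 2.0) * (1.0 + stat / 2.0)

def integrate(expr_dist, text_dist):
    """Combined dissimilarity per gene pair; small means close in both spaces."""
    p_expr = empirical_p(expr_dist)
    p_text = empirical_p(text_dist)
    return {pair: fisher_combine(p_expr[pair], p_text[pair]) for pair in expr_dist}

# Hypothetical distances on different scales for three gene pairs.
expr_dist = {("g1", "g2"): 0.2, ("g1", "g3"): 1.5, ("g2", "g3"): 1.4}
text_dist = {("g1", "g2"): 0.05, ("g1", "g3"): 0.90, ("g2", "g3"): 0.95}
combined = integrate(expr_dist, text_dist)
```

Because both inputs are rank-based p-values, neither data type dominates merely through the range of its raw distances. The combined values can then be passed to a clustering method as a dissimilarity matrix.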

  44. Overview meta-clustering Clustering M-score Fusion of text and expression data

  45. Integration improves M-score Optimal k? Various cutoffs k of the cluster tree [Plot: M-score of integrated clustering versus M-score of expression data only] Fusion of text and expression data

  46. A look inside the integration Fusion of text and expression data

  47. A look inside the integration Text Profile Expression Profile Strong reinforcement Fusion of text and expression data

  48. Overview • Genes & microarrays • Gene expression data analysis • Text mining in biology: principles • Text mining in practice: TXTGate • Combining text and gene expression data • Conclusion Overview

  49. Contributions • Representation of a gene expression experiment • MIAME • Laboratory Information Management System v. at the VIB MicroArray Facility • Gene expression analysis • Iterative clustering to determine optimal k • M-score • Text-based gene representation • To represent functional information about genes • To score gene groups based on literature • To cluster genes based on literature • TXTGate text mining application • To profile, in a flexible and interactive manner, gene groups from different 'views' • Integration of text and expression data in clustering Conclusion

  50. Future work • Semantically-oriented text mining representations • Algorithm-based: • Improved phrases (word co-locations) • Latent Semantic Indexing • Concept clustering, bi-clustering • Knowledge-based: • Gene Ontology → distance in a taxonomy • Basic natural language processing + statistics = Shallow Parsing • Advanced ways of integrating data • Combine link information with term information • Ways to determine Conclusion
