Integrating Scientific Literature with Large-Scale Gene Expression Analysis: A PhD Defense Overview

Integrating Scientific Literature with Large-Scale Gene Expression Analysis
PhD defense: Patrick Glenisson
Promotor: Prof. Bart De Moor
June 11th, 2004

Overview
• Genes & microarrays
• Gene expression data analysis
• Text mining in biology: principles
• Text mining in practice: TXTGate
• Combining text and gene expression data
• Conclusion

Genes and Microarrays

DNA, genes, proteins and cells

Genes are expressed and regulated

Microarrays measure gene expression
[Figure: laser excitation of the array yields the gene expression measurement, a matrix of genes (G1, G2, G3, ..) by conditions (C1, C2, C3, ..), accompanied by gene annotations and sample annotations]

Representing expression information
Conditions in which expression occurs
• Gene expression experiments are complex:
  • Too verbose to include in a scientific publication
  • Too important to compromise on reproducibility
  • Too valuable for post-genome research to be scattered across various websites
• Hence, a standard for reporting on microarray (MA) experiments
  • As a guideline for databases hosting expression compendia

MIAME standard
• Minimum Information About a MicroArray Experiment
• Internationally proposed standard
• Published in December 2001 by the international consortium MGED
• Some prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data
• Some hurdles:
  • Significant overhead in filling out the questionnaire
  • Scooping of leads (!)
  • Proprietary information about probe sequences

Gene expression data analysis

Questions asked with microarrays
• Fundamental
  • Functional roles of genes (and transcriptional regulation)
  • Genetic network reconstruction
• Clinical
  • Correlation of genes with a given disease
  • Diagnosis of disease stage in patients
• Pharmacological
  • Toxicological drug response assessment

Clustering
[Figure: expression data (genes by conditions C1, C2, C3), the gene-by-gene distance matrix, and clustering by hierarchical clustering or k-means]
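The pipeline on this slide (expression profiles, then a distance matrix, then hierarchical clustering) can be sketched as follows. This is a minimal illustration and not the implementation used in the thesis: the Pearson-correlation distance, the average-linkage rule, the function names and the toy profiles are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt

def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two expression profiles.

    Assumes neither profile is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def distance_matrix(profiles):
    """Gene-by-gene distances, keyed by the unordered gene pair."""
    return {frozenset((g, h)): pearson_distance(profiles[g], profiles[h])
            for g, h in combinations(profiles, 2)}

def average_linkage(profiles, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    dist = distance_matrix(profiles)
    clusters = [[g] for g in profiles]

    def linkage(a, b):
        # Average distance over all cross-cluster gene pairs.
        return sum(dist[frozenset((g, h))] for g in a for h in b) / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Hypothetical toy data: four genes measured in four conditions.
profiles = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.1, 6.2, 7.9],
    "G3": [4.0, 3.0, 2.0, 1.0],
    "G4": [8.0, 6.1, 3.9, 2.0],
}
clusters = average_linkage(profiles, k=2)
```

Cutting the tree at a different k only changes the stopping condition of the loop, which is why the choice of k is treated as a separate validation question in the slides that follow.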

Cluster validation
Optimal number of clusters? How to define 'optimal'?
• Data-centered statistical scores
  • Coherence vs. separation of clusters
  • Stability of a cluster solution when leaving out data
• Knowledge-based scores
  • Enrichment of GO annotations in clusters
  • Literature-based scoring
• Motif-based
  • DNA patterns (motifs) in the regulatory regions of gene groups
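As one example of a data-centered score that weighs coherence against separation, the sketch below computes the mean silhouette width. The slides do not name a specific score, so the silhouette is a substitute chosen for illustration; the function name and the toy points are assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+

def silhouette(points, labels):
    """Mean silhouette width of a clustering; needs at least two clusters.

    a = mean distance to the other members of the own cluster (coherence)
    b = smallest mean distance to the members of another cluster (separation)
    """
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette(points, [0, 0, 1, 1])  # matches the visible structure
bad = silhouette(points, [0, 1, 0, 1])   # mixes the two groups
```

Evaluating such a score over a range of k gives one data-centered answer to the 'optimal k' question; the knowledge-based scores on this slide address the same question from annotations instead.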

DNA patterns in expression clusters
• Significant occurrences of known motifs in gene clusters
[Figure: cluster-by-motif matrix (motif enrichment matrix) with clusters 1, 2, 3, .. against motifs A, B, C, ..; entries are -log(p-value); the matrix feeds the M-score]
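A cluster-by-motif matrix of -log(p-value) entries could be built along the following lines. The slides do not state which test yields the p-values, so the one-sided hypergeometric test used here is an assumption, as are the function names and the toy gene sets. The aggregation of this matrix into the M-score is not given in the slides and is not attempted.

```python
from math import comb, log10

def enrichment_p(population, with_motif, cluster_size, hits):
    """P(X >= hits) for a hypergeometric draw of cluster_size genes."""
    total = comb(population, cluster_size)
    upper = min(with_motif, cluster_size)
    tail = sum(comb(with_motif, x) * comb(population - with_motif, cluster_size - x)
               for x in range(hits, upper + 1))
    return tail / total

def motif_enrichment_matrix(clusters, motif_genes, all_genes):
    """Return {cluster: {motif: -log10(p)}} for every cluster/motif pair."""
    n = len(all_genes)
    matrix = {}
    for cluster, members in clusters.items():
        matrix[cluster] = {}
        for motif, carriers in motif_genes.items():
            hits = len(members & carriers)
            p = enrichment_p(n, len(carriers), len(members), hits)
            matrix[cluster][motif] = -log10(p)
    return matrix

# Hypothetical toy data: ten genes, two clusters, two motifs.
all_genes = {f"g{i}" for i in range(1, 11)}
clusters = {
    "c1": {"g1", "g2", "g3"},
    "c2": all_genes - {"g1", "g2", "g3"},
}
motif_genes = {
    "A": {"g1", "g2", "g3"},        # concentrated in cluster c1
    "B": {"g1", "g5", "g7", "g9"},  # spread over both clusters
}
matrix = motif_enrichment_matrix(clusters, motif_genes, all_genes)
```

In this toy example motif A occurs only in cluster c1, so its entry for c1 is large while its entry for c2 is zero.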

Cluster-by-motif matrix
• M-score for the entire clustering solution: a one-shot estimate of its 'biological relevance'
[Figure: matrix with axes motif and cluster]

M-score
• A motif is less interesting when it occurs (significantly) in many clusters.
• A cluster that contains a large portion of the (significant) motifs is less likely to be biologically relevant.
• A 'too large' number of clusters is less likely to reflect the true biological diversity underlying the experiment.

M-score validation
[Figure: M-score as a function of k]
• Optimal k in yeast cell cycle expression data
  • Original studies by Tavazoie et al. used k=30
  • Overestimation confirmed by analyses of
    • De Smet et al. (AQBC)
    • Gibbons et al. (GO-based scoring)
• A simplification of reality
  • No absolute quantification of biological relevance
• Useful tool when experimenting with
  • Multiple clustering methods
  • Multiple parameterizations
• To economize on biological validations

Text Mining: principles

Problem setting
• Given a set of documents,
• compute a representation, called the index,
• to retrieve, summarize, classify or cluster them.
[Figure: documents mapped to index vectors <1 0 0 1 0 1>, <1 1 0 0 0 1>, <0 0 0 1 1 0>]

Problem setting
• Given a set of genes (and their literature),
• compute a representation, called the gene index,
• to retrieve, summarize, classify or cluster them.
[Figure: genes mapped to index vectors <1 0 0 1 0 1>, <1 1 0 0 0 1>, <0 0 0 1 1 0>]

Vector space model
[Figure: a gene as a vector in the space spanned by the vocabulary terms T1, T2, T3]
• Document processing
  • Remove punctuation and grammatical structure ('bag of words')
• Define a vocabulary
  • Identify multi-word terms, e.g. tumor suppressor (phrases)
  • Eliminate low-content words, e.g. and, thus, gene, ... (stopwords)
  • Map words with the same meaning (synonyms)
  • Strip plurals, conjugations, ... (stemming)
• Define a weighting scheme and/or transformations (tf-idf, SVD, ..)
• Compute the index of textual resources
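The steps on this slide can be illustrated with a small bag-of-words index. This is a sketch under simplifying assumptions and not the indexing used in the thesis: each gene's literature is treated as one text, phrases, synonyms and stemming are skipped, the tf-idf variant is the plain tf * log(N/df) form, and the stopword list, gene names and texts are hypothetical.

```python
import re
from collections import Counter
from math import log, sqrt

# Hypothetical, very short stopword list (low-content words).
STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a"}

def tokenize(text):
    """Lowercase, drop punctuation and stopwords: a 'bag of words'."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def build_gene_index(gene_texts):
    """Return {gene: {term: tf-idf weight}}, one sparse vector per gene."""
    counts = {g: Counter(tokenize(t)) for g, t in gene_texts.items()}
    n = len(counts)
    df = Counter(term for c in counts.values() for term in c)  # document frequency
    return {g: {t: tf * log(n / df[t]) for t, tf in c.items()}
            for g, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical toy literature for three genes.
gene_texts = {
    "geneA": "Growth factor activity and angiogenesis in tumor tissue",
    "geneB": "Regulation of angiogenesis by a growth factor receptor",
    "geneC": "Enzyme of the citric acid cycle in yeast mitochondria",
}
index = build_gene_index(gene_texts)
sim_ab = cosine(index["geneA"], index["geneB"])
sim_ac = cosine(index["geneA"], index["geneC"])
```

Genes whose literature shares weighted terms end up close in this space (geneA and geneB), while genes with no shared vocabulary have similarity zero (geneA and geneC).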

Validity of gene index
• Genes that are functionally related should be close in text space
• Text-based coherence score
  • Modeled with respect to a background distribution, estimated through random and permuted gene groups
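The idea of scoring a gene group against a background of random groups can be sketched as below. The exact coherence statistic of the thesis is not given in the slides; here it is assumed to be the mean pairwise cosine similarity, and the empirical p-value, the function names and the toy index are likewise assumptions.

```python
import random
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(group, index):
    """Mean pairwise similarity of the genes in a group (assumed score)."""
    pairs = list(combinations(sorted(group), 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)

def empirical_p(group, index, n_random=200, seed=0):
    """Fraction of random same-size groups at least as coherent as `group`."""
    rng = random.Random(seed)
    observed = coherence(group, index)
    genes = sorted(index)
    hits = sum(coherence(rng.sample(genes, len(group)), index) >= observed
               for _ in range(n_random))
    return (hits + 1) / (n_random + 1)  # add-one keeps the estimate above zero

# Hypothetical toy gene index: g1..g3 share vocabulary, g4..g6 do not.
index = {
    "g1": {"cycle": 1.0, "tca": 1.0},
    "g2": {"cycle": 1.0, "tca": 0.5},
    "g3": {"cycle": 0.8, "tca": 1.0},
    "g4": {"kinase": 1.0},
    "g5": {"membrane": 1.0},
    "g6": {"ribosome": 1.0, "cycle": 0.1},
}
related = coherence(["g1", "g2", "g3"], index)
mixed = coherence(["g1", "g4", "g5"], index)
p_value = empirical_p(["g1", "g2", "g3"], index)
```

A functionally related group should score well above most random groups, which shows up as a small empirical p-value.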

TXTGate - a platform to profile groups of genes

Motivation 1
GO:
• cell proliferation
• heparin binding
• growth factor activity
GeneRIF:
• 12133521: VEGF is associated with the development and prognosis of colorectal cancer.
• 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
• 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex
"Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database" (M. Gerstein, 2001)

Motivation 2
• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
• A number of structured vocabularies have already arisen:
  • Gene Ontology (GO)
  • MeSH
  • eVOC
• Standards are systematically being adopted to store biological concepts or annotations:
  • HUGO
  • GOA@EBI

Motivation 3
(Figure courtesy: S. Van Vooren)

TXTGate
[Figure: TXTGate workflow, with labels: profile, other vocabulary, distance matrix & clustering]

TXTGate – a case study
• Gene modules over various expression data sets
• Reported two submodules of the TCA cycle
• Two 'new' genes, ACN9 & CAT8, in module 2

Fusion of text and expression data

Problem setting
"How can we analyze data in an integrated fashion to extract more information than from expression data alone?"

Integration of text and data
• In each information space:
  • Appropriate preprocessing
  • Choice of distance measures
• Combine data:
  • Confidence attributed to either of the two data types
  • In the case of distances, this can be seen as a scaling constant between the norms of the data and text representations
• However, the distributions of the distances introduce a bias: a scaling problem
• Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure)
[Figure: histogram of expression distances and histogram of text distances]
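One way to sidestep the scaling problem is to turn each distance into an empirical p-value within its own space and then combine the two p-values. The sketch below assumes Fisher's inverse chi-square method as the omnibus procedure; the slides only say 'omnibus procedure', so this choice, the function names and the toy distances are assumptions. For two p-values the chi-square tail with 4 degrees of freedom has the closed form p1*p2*(1 - ln(p1*p2)).

```python
from bisect import bisect_right
from math import log

def to_empirical_p(distances):
    """Map each distance to the fraction of distances in its space that are <= it."""
    ordered = sorted(distances.values())
    n = len(ordered)
    return {pair: bisect_right(ordered, d) / n for pair, d in distances.items()}

def fisher_combine(p1, p2):
    """Fisher's method for two p-values.

    The statistic -2*ln(p1*p2) follows a chi-square distribution with
    4 degrees of freedom, whose upper tail is p1*p2*(1 - ln(p1*p2))."""
    prod = p1 * p2
    return prod * (1.0 - log(prod))

def integrate(expr_dist, text_dist):
    """Combined dissimilarity per gene pair; small values mean 'close in both spaces'."""
    pe, pt = to_empirical_p(expr_dist), to_empirical_p(text_dist)
    return {pair: fisher_combine(pe[pair], pt[pair]) for pair in expr_dist}

# Hypothetical toy distances on deliberately different scales.
expr_dist = {("a", "b"): 0.1, ("a", "c"): 1.5, ("b", "c"): 1.8}  # range 0..2
text_dist = {("a", "b"): 0.2, ("a", "c"): 0.9, ("b", "c"): 0.7}  # range 0..1
combined = integrate(expr_dist, text_dist)
closest = min(combined, key=combined.get)
```

Because both inputs are reduced to ranks within their own distribution, neither data type dominates merely through the scale of its distances; the combined values can then be fed to the same clustering step as an ordinary distance matrix.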

Overview of meta-clustering
[Figure: meta-clustering workflow, with labels: clustering, M-score]

Integration improves M-score
[Figure: M-score at various cutoffs k of the cluster tree, for integrated clustering and for expression data only; optimal k?]

A look inside the integration
[Figure: text profile and expression profile; strong reinforcement]

Conclusion

Contributions
• Representation of a gene expression experiment
  • MIAME
  • Laboratory Information Management System v. at the VIB MicroArray Facility
• Gene expression analysis
  • Iterative clustering to determine optimal k
  • M-score
• Text-based gene representation
  • To represent functional information about genes
  • To score gene groups based on literature
  • To cluster genes based on literature
• TXTGate text mining application
  • To profile, in a flexible and interactive manner, gene groups from different 'views'
• Integration of text and expression data in clustering

Future work
• Semantically oriented text mining representations
  • Algorithm-based:
    • Improved phrases (word co-locations)
    • Latent Semantic Indexing
    • Concept clustering, bi-clustering
  • Knowledge-based:
    • Gene Ontology: distance in a taxonomy
    • Basic natural language processing + statistics = shallow parsing
• Advanced ways of integrating data
  • Combine link information with term information
  • Ways to determine
