1 / 38

Gene Anal ytics: Discovery and Contextualization of Enriched Gene Groups

Gene Anal ytics: Discovery and Contextualization of Enriched Gene Groups. N. Lavra č , I. Mozeti č, V. Podpečan, P. Kralj Novak ( Jo ž ef Stefan Institute, Ljubljana ) H. Motaln, M. Petek, K. Gruden (National Institute of Biology, Ljubljana). Talk outline.

davida
Download Presentation

Gene Anal ytics: Discovery and Contextualization of Enriched Gene Groups

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Gene Analytics: Discovery and Contextualization of Enriched Gene Groups N. Lavrač, I. Mozetič, V. Podpečan, P. Kralj Novak (Jožef Stefan Institute, Ljubljana) H. Motaln, M. Petek, K. Gruden (National Institute of Biology, Ljubljana)

  2. Talk outline • Relational data mining and subgroup discovery • Semantic data mining: Using ontologies in SEGS • BISON clallenge • Experimental use case: Glioma cancer treatment • BISON methodology: combining SEGS+Biomine • Gene analytics services and future work BISON Bled WK, Aug. 2009

  3. Data Mining knowledge discovery from data Data Mining model, patterns, … data • Given:transaction data table, relational database, text • documents, Web pages • Find: aclassification model, a set of interesting patterns BISON Bled WK, Aug. 2009

  4. Subgroup discovery task definition (Kloesgen, Wrobel 1997) • Given: a population of individuals and a property of interest (e.g. AML class, in the task of finding genes differentially expressed in AML leukemia as opposed to ALL leukemia) • Find: `most interesting’ descriptions of population subgroups • are as large as possible (high target class coverage) • have most unusual distribution of the target property (high TP/FP ratio, high significance) BISON Bled WK, Aug. 2009

  5. Sample microarray analysis tasks • Two-class diagnosis problem of distinguishing between acute lymphoblastic leucemia (ALL, 27 samples) and acute myeloid leukemia (AML, 11 samples), with 34 samples in the test set. Every sample is described with gene expression values for 7129 genes. • Multi-class cancer diagnosis problem with 14 different cancer types, in total 144 samples in the training set and 54 samples in the test set. Every sample is described with gene expression values for 16063 genes. • SD results in simple IF-THEN rules, interpretable by biologists IF(KIAA0128_gene DIFF-EXPRESSED) AND(prostaglandin_d2_synthase_geneNOT-DIFF-EXP) THEN Leukemia BISON Bled WK, Aug. 2009

  6. Relational Data Mining (Inductive Logic Programming) Relational Data Mining knowledge discovery from data model, patterns, … • Given: a relational database, a set of tables. sets of logical facts, a graph, … • Find: aclassification model, a set of interesting patterns BISON Bled WK, Aug. 2009

  7. Learning from multiple tables Complex relational problems: structured data: representation of molecules and their properties in protein engineering, biochemistry, ... Semantic relational data mining Using domain ontologies as background knowledge for relational data mining Relational Data Mining (ILP) BISON Bled WK, Aug. 2009

  8. Gene Ontology (GO) • GO is a database of terms for genes: • Function - What does the gene product do? • Process - Why does it perform these activities? • Component - Where does it act? • Known genes are annotated to GO terms(www.ncbi.nlm.nih.gov) • Terms are connected as a directed acyclic graph (is_a, part_of) • Levels represent specificity of the terms 12093 biological process 1812 cellular components 7459 molecular functions BISON Bled WK, Aug. 2009

  9. Ontology encoded as relational background knowledge Prolog facts: predicate(geneID, CONSTANT). interaction(geneID, geneID). component(2532,'GO:0016020'). component(2532,'GO:0005886'). component(2534,'GO:0008372'). function(2534,'GO:0030554'). function(2534,'GO:0005524'). process(2534,'GO:0007243'). interaction(2534,5155). interaction(2534,4803). Basic, plus generalized background knowledge using GO zinc ion binding -> metal ion binding, ion binding, binding BISON Bled WK, Aug. 2009

  10. Multi-Relational representation GENE-GENEINTERACTION GENE(main table,class labels) GENE-FUNCTION GENE-PROCESS GENE-COMPONENT FUNCTION PROCESS COMPONENT is_a part_of is_a part_of is_a part_of BISON Bled WK, Aug. 2009

  11. Propositionalization in RDM Propositionalization through first-order feature construction (KARDIO 1994, LINUS 1991, RSD 2006) Novelty of SEGS (2008): Feature construction from ontology information, features with support > min_support f(7,A):-function(A,'GO:0046872'). f(8,A):-function(A,'GO:0004871'). f(11,A):-process(A,'GO:0007165'). f(14,A):-process(A,'GO:0044267'). f(15,A):-process(A,'GO:0050874'). f(20,A):-function(A,'GO:0004871'),process(A,'GO:0050874'). f(26,A):-component(A,'GO:0016021'). f(29,A):- function(A,'GO:0046872'), component(A,'GO:0016020'). f(122,A):-interaction(A,B),function(B,'GO:0004872'). f(223,A):-interaction(A,B),function(B,'GO:0004871'), process(B,'GO:0009613'). f(224,A):-interaction(A,B),function(B,'GO:0016787'), component(B,'GO:0043231'). existential BISON Bled WK, Aug. 2009

  12. Gene set enrichmentanalysis with SEGS • A gene set is enriched if the genes that are members of that gene set are statistically significantly differentially expressed compared to the rest of the genes. • New gene set enrichment method: SEGS - Searching for Enriched Gene Sets (JSI) (Trajkovski et al. JBI 2008) • SEGS approach: Using GO, KEGG and ENTREZ ontologies as background knowledge for semantic subgroup discovery BISON Bled WK, Aug. 2009

  13. Ontologies • Gene Ontology (GO): standardized biological terms used to annotate gene products • Molecular Function • Biological Process • Cellular Component • Kyoto Encyclopedia of Genes and Genomes (KEGG): manually drawn pathway maps representing the knowledge on the molecular interaction and reaction networks • ENTREZ: gene annotations with GO and KO terms and gene-gene interaction data BISON Bled WK, Aug. 2009

  14. Gene i Sample j … Identifying differentially expressed genes in data preprocessing To identify genes that display a large difference in gene expression between groups (class A and class B)and are homogeneous within groups, statistical tests (e.g. t-test) and p-values (e.g. permutation test) are computed. Two sample t–statistic is used to test the equality of group means mAandmB. BISON Bled WK, Aug. 2009 14/28

  15. Ranking of differentially expressed genes The genes can be ordered in a ranked list L, according to their differential expression between the classes. The challenge is to extract meaning from this list, to describe them. The terms of the Gene Ontology were used as a vocabulary for the description of the genes. BISON Bled WK, Aug. 2009

  16. Gene expression data: Positive and negative examples for data mining fact(class, geneID, weight). fact(‘diffexp',64499, 5.434). fact(‘diffexp',2534, 4.423). fact(‘diffexp',5199, 4.234). fact(‘diffexp',1052, 2.990). fact(‘diffexp',6036, 2.500). … … fact(‘random',7443, 1.0). fact('random',9221, 1.0). fact('random',23395,1.0). fact('random',9657, 1.0). fact('random',19679, 1.0). … … BISON Bled WK, Aug. 2009

  17. Ontology encoded as relational background knowledge + gene expression data fact(class, geneID, weight). fact(‘diffexp',64499, 5.434). fact(‘diffexp',2534, 4.423). fact(‘diffexp',5199, 4.234). fact(‘diffexp',1052, 2.990). fact(‘diffexp',6036, 2.500). … … fact(‘random',7443, 1.0). fact('random',9221, 1.0). fact('random',23395,1.0). fact('random',9657, 1.0). fact('random',19679, 1.0). … … Prolog facts: predicate(geneID, CONSTANT). interaction(geneID, geneID). component(2532,'GO:0016020'). component(2532,'GO:0005886'). component(2534,'GO:0008372'). function(2534,'GO:0030554'). function(2534,'GO:0005524'). process(2534,'GO:0007243'). interaction(2534,5155). interaction(2534,4803). Basic, plus generalized background knowledge using GO zinc ion binding -> metal ion binding, ion binding, binding BISON Bled WK, Aug. 2009

  18. Ontology encoded as relational features + gene expression data fact(class, geneID, weight). fact(‘diffexp',64499, 5.434). fact(‘diffexp',2534, 4.423). fact(‘diffexp',5199, 4.234). fact(‘diffexp',1052, 2.990). fact(‘diffexp',6036, 2.500). … … fact(‘random',7443, 1.0). fact('random',9221, 1.0). fact('random',23395,1.0). fact('random',9657, 1.0). fact('random',19679, 1.0). … … f(7,A):-function(A,'GO:0046872'). f(8,A):-function(A,'GO:0004871'). f(11,A):-process(A,'GO:0007165'). f(14,A):-process(A,'GO:0044267'). f(15,A):-process(A,'GO:0050874'). f(20,A):-function(A,'GO:0004871'),process(A,'GO:0050874'). f(26,A):-component(A,'GO:0016021'). f(29,A):- function(A,'GO:0046872'), component(A,'GO:0016020'). f(122,A):-interaction(A,B),function(B,'GO:0004872'). f(223,A):-interaction(A,B),function(B,'GO:0004871'), process(B,'GO:0009613'). f(224,A):-interaction(A,B),function(B,'GO:0016787'), component(B,'GO:0043231'). BISON Bled WK, Aug. 2009

  19. Propositionalization BISON Bled WK, Aug. 2009

  20. Propositional subgroup discovery f2 and f3 [4,0] BISON Bled WK, Aug. 2009

  21. Summary: SEGS Method and Results • SEGS method: • Through semantic subgroup discovery SEGS generates candidate gene set descriptions as conjunctions of first-order features, combining individual GO, KEGG and ENTREZ terms • SEGS combines Fisher, GSEA and PAGE enrichment tests to select most interesting groups of differentially expressed genes • SEGS results: • Descriptions of subgroups of genes that are differentially expressed (e.g., belong to class DIFF-EXP of top 300 most differentially expressed genes) in contrast with RANDOM genes (randomly selected genes with low differential expression). • Sample subgroup description: diffexp(A) ;- interaction(A,B) & function(B,'GO:0004871') & process(B,'GO:0009613') BISON Bled WK, Aug. 2009

  22. SEGS implementationQuery: Results: BISON Bled WK, Aug. 2009

  23. BISON project • The challenge: Support humans to find new, interesting linksaccross domains, named bisociations • across different contexts • across different types of data and knowledge sources • Open problems: • Fusion of heterogeneous data/knowledge sources into a joint representation format - a large information network named BisoNet (consisting of nodes and relatioships between nodes) • Finding unexpected, previously unknown links between BisoNet nodes belonging to different contexts BISON Bled WK, Aug. 2009

  24. Heterogeneous data sources(BISON, M. Berthold, 2008) BISON Bled WK, Aug. 2009

  25. Bridging concepts(BISON, M. Berthold, 2008) BISON Bled WK, Aug. 2009

  26. Use Case: Glioma Cancer(investigated at NIB) Glioma a type of brain cancer different types Glioblastoma: life expectany less than 1 year Glioma treatment No efficient treatment available Testing new hypotheses for treatment: using stem cells for drug transport to the brain ? New insights in stem cell behavior and brain cancer mechanisms ? BISON Bled WK, Aug. 2009

  27. Glioma treatment Biological questions: Are stems cells efficient/effective for drug transport ? What are the risks associated? ad. Risks: Evaluation of BM-hMSC stem cells stability Biological experiments: 4 stem cell lines RNA was isolatedand sent for transcriptome analysis hMSC growth curves reveal “fast” & “slow” growing clones Slow: hMSC-1, hMSC-3 5-6 passages/6-weeks Fast: hMSC-2, hMSC-4 8 passages/6-weeks Risk of malignant transformation: Two lines (hMSC-1 and hMSC2) transformed into cancer cells Microarray analysis is performed to find groups of differentially expressed genes in several experiments: slow vs. fast growing cell lines normal vs. cancerous cell lines BISON Bled WK, Aug. 2009

  28. BISON Bled WK, Aug. 2009

  29. SEGS+Biomine Methodology Microarray: Gene sets: Contextualization, Exploratory link discovery e.g. - slow-vs-fast cell growth BISON Bled WK, Aug. 2009

  30. Biomine • In Biomine (UH) (DILS 2006), data from numerous public databases are merged into a large graph: currently consisting of 1,968,951 vertices and 7,008,607 edges. • Vertices correspond to entities and concepts • Edges represent known, annotated relationships between vertices. • A link (a relation between two entities) is manifested as a path or a subgraph connecting the corresponding vertices. • A bisociative link is a path traversing nodes belonging to different domains/contexts • In Biomine, a method for link discovery between entities in queries was developed for graph exploration. BISON Bled WK, Aug. 2009

  31. Biomine Information fusion • Biomine graph integrates numerous databases BISON Bled WK, Aug. 2009

  32. SEGS+Biomine Methodology • Biomine information fusion into a BisoNet information network • Interesting node discoveryand contextualisation with SEGS • Information fusion of GO, KEGG, ENTREZ • Identify conjunctions of concepts from different domains (ontologies) • Interesting cross=context link discovery with Biomine • create bisociative links as paths in the Biomine subgraph connecting the concepts proposed by SEGS • BisoNet Exploration/Explanation • explore BisoNet paths ranked according to weigths/probabilities (as currently implemented in Biomine) BISON Bled WK, Aug. 2009

  33. Biomine: Bisociative link discoveryQuery: Result: BISON Bled WK, Aug. 2009

  34. SEGS+Biomine Information fusion SEGS merges GO, KEGG and ENTREZ, BisoNet is used for concept visualization BISON Bled WK, Aug. 2009

  35. SEGS+Biomine Creative knowledge discovery Identify interesting concepts (BisoNet nodes) from different contexts (different databases) BISON Bled WK, Aug. 2009

  36. SEGS+Biomine Creative link discovery Create bisociative cross-context links/paths linking BisoNet concepts from different contexts BISON Bled WK, Aug. 2009

  37. SEGS+Biomine Exploration and explanation Explore and interpret most interesting cross-context BisoNet links/paths between concepts BISON Bled WK, Aug. 2009

  38. Summary • SEGS discovers interesting descriptions of differentially expressed gene groups as conjunctions of concepts from different contexts • Biomine finds cross-context links (paths) between concepts discovered by SEGS • The SEGS+Biomine approach has the potential for creative knowledge and bisociative link discovery • Preliminary results in stem cell microarray data analysis (EMBC 2009)indicate that the SEGS+Biomine methodology may lead to new insights – in vitro experiments are being planned at NIB to verify and validate the preliminary insights BISON Bled WK, Aug. 2009

More Related